I wanted to get better at using AI to automate tasks. I already use ChatGPT for everyday chores, but I wanted to try it on something more challenging: receipts. They look deceptively simple, but each one squeezes a lot of information into a tiny space, making them surprisingly difficult to decode.
Receipts are a difficult problem to solve. To turn a crumpled piece of paper into structured data, you have to read three layers of information: the pixels in the image, the words printed on the paper, and the meaning behind those words.
Since these layers arrive in messy, mixed formats, receipts are an ideal test case for learning how to use AI to automate real-world tasks. The rest of this post describes how I disentangle those three layers and stitch the answers back together.
Every receipt's journey begins as raw pixels, whether a crisp scan or a hastily snapped photo. But hidden within those pixels is structured information waiting to be extracted.
Text recognition finds words, but not context. Multiple receipts become one confusing jumble. The critical step: determining which words belong together.
This is where the difference between scanned receipts and photographed receipts becomes crucial. Scans are the easy case: flat, well-lit, and aligned. But photos? They're captured at odd angles, under harsh store lighting, with shadows and curves that confuse even the best text recognition. Each type needs its own approach to group related words and separate one receipt from another.
Scans are the easier of the two: the receipt lies flat on the scanner bed, so the paper is effectively parallel to the image sensor.
Grouping the scanned words is straightforward. Since the scanner captures everything flat and aligned, we can use simple rules: words that line up horizontally likely belong to the same line item, and everything on the page belongs to the same receipt. It's like reading a well-organized spreadsheet.
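Here's a rough sketch of that grouping rule in code. It assumes each OCR word comes back with a bounding box; the field names and tolerance are illustrative, not the project's actual schema.

```python
# Sketch: group OCR words from a flat scan into lines.
# Each word is assumed to look like {"text": "LATTE", "x": 120, "y": 342, "height": 18}.

def group_into_lines(words, tolerance=0.6):
    """Put words whose vertical positions are close together on the same line."""
    lines = []
    for word in sorted(words, key=lambda w: w["y"]):
        placed = False
        for line in lines:
            anchor = line[0]  # compare against the first word already in the line
            if abs(word["y"] - anchor["y"]) <= tolerance * anchor["height"]:
                line.append(word)
                placed = True
                break
        if not placed:
            lines.append([word])
    # Within each line, read the words left to right.
    return [sorted(line, key=lambda w: w["x"]) for line in lines]
```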
Photos are trickier. When you snap a picture of a receipt on a counter, the camera captures it from an angle. Words at the top might appear smaller than words at the bottom. Lines that should be straight look curved. And if there are multiple receipts in frame? Now you need to figure out which words belong to which piece of paper.
To solve this puzzle, I look for clusters of words that seem to move together—like finding constellations in a sky full of stars. Words that are close together and follow similar patterns likely belong to the same receipt. Once identified, I can digitally "flatten" each receipt, making it as clean as a scan.
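As a sketch of that idea, the snippet below clusters word positions with DBSCAN to separate receipts, then uses a perspective warp to flatten one. It assumes OpenCV and scikit-learn, leaves out the corner detection itself, and the eps value is a guess that would need tuning on real photos.

```python
# Sketch: split a photo's words into per-receipt clusters, then flatten one receipt.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_words(centers, eps=60):
    """Words whose (x, y) centers sit close together likely share a receipt."""
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(np.array(centers))
    return labels  # one cluster id per word; -1 marks stray noise

def flatten_receipt(image, corners):
    """Warp the four corners of a detected receipt onto a flat rectangle."""
    corners = np.float32(corners)  # top-left, top-right, bottom-right, bottom-left
    width, height = 600, 1200      # arbitrary output size for a tall, narrow receipt
    target = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    matrix = cv2.getPerspectiveTransform(corners, target)
    return cv2.warpPerspective(image, matrix, (width, height))
```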
Now that we can group words correctly, we face a new challenge: processing thousands of receipts efficiently. My solution? My laptop handles the tricky word orientation while the cloud handles the visual processing (finding and flattening receipts).
This hybrid approach has processed hundreds of receipts, transforming messy photos and scans into organized, searchable data:
To explain the text on a receipt, I needed to capture how the words semantically relate to one another. But before I can do that, we have to understand what an embedding is.
I had never used embeddings before. They are not exactly new, but they have become widely available in the last few years. What embeddings offer is the ability to discover connections between things at previously impossible scales.
If someone asks you to embed something, what do you need? You start with a textual representation of the thing you're embedding.
What do you get back? A fixed-length list of numbers, called a vector.
Each input might be different, but the vector that comes back always has the same shape. Here's the magic: because the structure is the same every time, we have a way to mathematically compare two pieces of text. But what do the numbers mean?
There are many services that offer to generate embeddings, but I ended up going with OpenAI.
Was it expensive? No. I experimented a lot, and I developed a way to batch embedding requests to cut costs further. This part was essentially free, too.
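Batching can be as simple as sending many strings in a single request, though that's just one way to do it; the model name below is an assumption, not necessarily what this project uses.

```python
# Sketch: embed a batch of receipt words in one call with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(texts, model="text-embedding-3-small"):
    """Send many strings at once and get back one vector per string."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

vectors = embed_batch(["latte", "grande", "total", "tax"])
print(len(vectors), len(vectors[0]))  # 4 vectors, each with the same length
```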
So back to the question: what do the numbers mean? Let's think about coordinates on a map. Suppose I give you three points:
| Point | X | Y |
|---|---|---|
| A | 3 | 2 |
| B | 1 | 1 |
| C | -2 | -2 |
There are 2 dimensions to this map: X and Y. Each point lives at the intersection of an X and Y coordinate.
Is A closer to B or C?
A is closer to B. The distance from A to B is √((3−1)² + (2−1)²) = √5 ≈ 2.2, while the distance from A to C is √((3+2)² + (2+2)²) = √41 ≈ 6.4.
Here's the mental leap: embeddings are like points on a map, except each number in the embedding is a coordinate on a far more complicated map. When OpenAI sends back a list of numbers, it's telling you where that text lives semantically on that map. And when we ask what the distance between two embeddings is, we're really asking how semantically close or far apart two pieces of text are.
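Here's a tiny sketch of that comparison, using cosine similarity on toy two-dimensional vectors standing in for the real, much longer embeddings. The example values are made up.

```python
# Sketch: compare two "embeddings" the way we compared map points.
# Cosine similarity is near 1.0 for similar meanings and drops toward 0 (or below) otherwise.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

latte = [0.9, 0.4]    # toy stand-ins for real embedding vectors
coffee = [0.8, 0.5]
drywall = [-0.7, 0.6]

print(cosine_similarity(latte, coffee))   # high: semantically close
print(cosine_similarity(latte, drywall))  # low: semantically far apart
```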
This concept of positioning items in multi-dimensional space like this, where related items are clustered near each other, goes by the name of latent space.
Latent space is a powerful concept. It allows us to discover connections between things at previously impossible scales.
Writing and reading these lists of numbers gets complicated fast. After some research, I found Pinecone, a vector database that lets me store and retrieve embeddings.
Pinecone's real strength shows when you attach meaningful information to each embedding. The embedding by itself can tell you which words are similar, but adding context, such as store name, location, or even category, lets you find the semantically similar words you're actually looking for.
Imagine looking for the word "latte" across 10,000 receipts. Without context, you'll get results from latte-flavored cereal at grocery stores, expensive drinks at coffee shops, and even brown-colored paint from hardware stores.
With context, you can filter out the results that don't make sense.
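Here's roughly what storing and filtering looks like with Pinecone's Python client. The index name, metadata fields, and placeholder vector are illustrative, not the project's real schema.

```python
# Sketch: store a word's embedding with metadata, then search with a metadata filter.
from pinecone import Pinecone

pc = Pinecone(api_key="...")       # the real key would come from the environment
index = pc.Index("receipt-words")  # assumes an index of this name already exists

latte_embedding = [0.01] * 1536    # placeholder; in practice this comes from the embedding step

# Attach context to the vector when storing it.
index.upsert(vectors=[{
    "id": "receipt-123-word-7",
    "values": latte_embedding,
    "metadata": {"merchant": "Starbucks", "category": "coffee_shop", "text": "latte"},
}])

# Only consider words from the same merchant when searching.
results = index.query(
    vector=latte_embedding,
    top_k=5,
    filter={"merchant": {"$eq": "Starbucks"}},
    include_metadata=True,
)
```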
I used OpenAI's Agents SDK with the Google Places API to get the context needed for rich, semantic search.
Pinecone doesn't just help me find similar words; it lets me act on them. After every receipt is embedded, an OpenAI agent retrieves the "nearest neighbors" of each unlabeled word, filtered by the receipt's merchant-specific metadata.
For the token "latte" on a Starbucks receipt, an agent pulls semantically similar words from other Starbucks receipts and asks:
"Given these examples and surrounding words, what label would best describe latte?"
The agent then validates the proposed label against custom rules, similar examples where the word is correctly given that label, and examples where the word is labeled incorrectly.
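As a simplified sketch of that step, the code below builds a prompt from labeled neighbors and asks the model for a label. It uses a plain chat completion rather than the Agents SDK, and the model name and prompt wording are my own illustrations, not the project's actual setup.

```python
# Sketch: propose a label for a word given labeled neighbors from the vector database.
from openai import OpenAI

client = OpenAI()

def propose_label(word, neighbors):
    """neighbors: list of dicts like {"text": "americano", "label": "line_item"}."""
    examples = "\n".join(f'- "{n["text"]}" was labeled {n["label"]}' for n in neighbors)
    prompt = (
        f"Here are similar words from receipts at the same merchant:\n{examples}\n\n"
        f'Given these examples, what label best describes "{word}"? '
        "Answer with a single label."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```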
This process is repeated to continuously improve the accuracy of the labels. As part of this loop, I use RAGAS to evaluate how faithful and relevant the model's responses are to the retrieved context.
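For reference, a RAGAS evaluation can look roughly like this. I'm assuming the question/answer/contexts dataset convention and metric names from RAGAS's docs; exact imports and column names can differ between versions, and the sample data here is made up.

```python
# Sketch: score a labeling answer for faithfulness and relevancy with RAGAS.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

samples = Dataset.from_dict({
    "question": ['What label best describes "latte" on this Starbucks receipt?'],
    "answer": ["line_item"],
    "contexts": [['"americano" was labeled line_item', '"grande" was labeled modifier']],
})

scores = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(scores)
```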
This project was a great learning experience. The best way for me to learn is by actually doing. Experimenting with different tools and techniques allowed me to reflect on what worked and what didn't.
I used GitHub and Pulumi to manage the code and the cloud infrastructure.
I got my build time down to about ten seconds, which lets me make a change locally and see the result within seconds. GitHub tracks my changes and handles deployment to production.
I'm using React to build the frontend. It's a great way to get started, but I wanted to keep expanding my tech stack, so I ended up porting the app to Next.js.
Moving to Next.js was surprisingly easy. A combination of Cursor and OpenAI's Codex let me migrate from one framework to the other with minimal effort.
I'm currently training a custom model to improve the step that turns an image into structured data. Querying a database of similar words works well at scale, but there's room here for a simple model that can run on my laptop. I've been playing with a few models on Hugging Face.
Artificial intelligence is advancing quickly and changing how we write software. However, no matter how smart computers become, we still need solid engineering practices, clever problem-solving, and expert knowledge. I'm excited to see a future where people and AI work together to make programming faster, smarter, and easier for everyone.