I wanted to know exactly how much I spent on milk. This should be easy, but receipts are the worst documents ever designed. I taught my laptop to read them for me so I don't have to.
I tried scanning, taking photos, and OCR. None of them worked well. I pointed an old OCR engine at a CVS receipt. Here's what I got:
```
CVS/pharmacy
1tem Qy Pr1ce
M1LK 2% 1 $4.4g
```

I had to get creative. The first principles approach is: a receipt is a piece of paper with words on it.
After determining what a receipt is, I needed to pull the piece of paper with words on it out of the image.
This flattened receipt is now as clean as it can be.
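If you want the flavor of how that flattening works, the standard OpenCV trick is: edge-detect, find the biggest four-cornered contour, and warp it into an upright rectangle. This is a sketch, not my exact pipeline; the thresholds and output size are placeholders.

```python
import cv2
import numpy as np

def order_corners(pts: np.ndarray) -> np.ndarray:
    """Order 4 points as top-left, top-right, bottom-right, bottom-left."""
    s = pts.sum(axis=1)               # x + y: smallest at top-left, largest at bottom-right
    d = np.diff(pts, axis=1).ravel()  # y - x: smallest at top-right, largest at bottom-left
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]], dtype="float32")

def flatten_receipt(image_path: str, out_w: int = 500, out_h: int = 1200) -> np.ndarray:
    """Find the biggest four-sided contour and warp it flat."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 75, 200)

    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in sorted(contours, key=cv2.contourArea, reverse=True):
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4:  # something receipt-shaped
            src = order_corners(approx.reshape(4, 2).astype("float32"))
            dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]], dtype="float32")
            M = cv2.getPerspectiveTransform(src, dst)
            return cv2.warpPerspective(img, M, (out_w, out_h))
    return img  # no quad found: give back the original photo
```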
Every store formats their receipts differently. Fast food has your order number, big and at the top. Target doesn't even have the word "Target" at the top??! I needed a way to figure out which store each receipt came from.
First idea: Google Maps. Feed it an address, a phone number, a website, whatever the receipt offers, and get the store name back. This worked! It also cost money. After processing 200 receipts I checked my bill: $8. I had 300 left. I needed a better way to group receipts by store without handing Google more money.
So I added Chroma, a vector database. Great, another database. But Chroma let me do something clever: instead of asking Google "what store is at 123 Main St," I could ask my own receipts "have I seen this address before?"
Turns out receipts from the same store look like receipts from the same store. Same address, same phone number, same weird formatting. Now I only hit Google for stores I've genuinely never seen. My bill dropped to cents.
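Here's the shape of that lookup. A sketch, not the exact code: the collection name, distance threshold, and metadata fields are placeholders, but `query` and `add` are the real Chroma calls.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")   # local, persistent store
stores = client.get_or_create_collection("stores")    # one entry per known store

def identify_store(header_text: str) -> str:
    """Return a known store name if this header looks familiar, else '' (meaning: ask Google)."""
    if stores.count() > 0:
        hit = stores.query(query_texts=[header_text], n_results=1)
        # Smaller distance = more similar; 0.3 is a made-up cutoff.
        if hit["distances"][0] and hit["distances"][0][0] < 0.3:
            return hit["metadatas"][0][0]["store_name"]
    return ""

def remember_store(header_text: str, receipt_id: str, store_name: str) -> None:
    """Cache the paid Google answer so the next receipt from this store is free."""
    stores.add(documents=[header_text], ids=[receipt_id],
               metadatas=[{"store_name": store_name}])
```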
Every receipt has the same kinds of words on it, but every store formats them differently. I needed a shared vocabulary.
The idea: show AI a receipt, ask it to tag each word with a label, then compare its answers to other receipts from the same store to see if the patterns still hold.
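The first version of that tagging step was just an LLM call. Roughly like this; the label set and prompt are illustrative, not the real schema:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative label vocabulary, not the project's actual schema.
LABELS = ["MERCHANT_NAME", "ADDRESS", "PHONE", "ITEM_NAME",
          "QUANTITY", "UNIT_PRICE", "LINE_TOTAL", "SUBTOTAL", "TAX", "TOTAL"]

def tag_words(words: list[str]) -> dict[str, str]:
    """Ask the model for one label per word; returns {word: label}."""
    prompt = (
        "Label every word from this receipt with exactly one of these tags: "
        f"{', '.join(LABELS)}, or O for anything else. "
        "Answer as a JSON object mapping word to tag.\n\n"
        f"Words: {json.dumps(words)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works for the sketch
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```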
This works, kind of. AI isn't consistent. It would call the price of milk the subtotal. It confused "DAIRY" with "MILK." I can't trust something that doesn't know what milk is. So I corrected the results by asking AI to verify again.
And again. And again.
Each pass got a little better. The red shrinks, the green grows. But asking roughly four different AIs to verify the results five or more times was slow and expensive. I needed a better way.
I needed to stop paying OpenAI to argue with me about what milk is. The answer, apparently, was to train my own model.
I don't know how to train a model...
Turns out training a model involves staring at metrics I didn't fully understand. Precision and recall, apparently, are two different flavors of "accurate." High precision means "when it says 'MILK', it's probably right." High recall means "it finds most of the MILKs." You can't crank both to 100%: make the model pickier and it misses MILKs, make it looser and it cries MILK at everything.
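If you want the two numbers concretely, here's all they are for a single label (made-up lists, real math):

```python
def precision_recall(predicted: list[str], actual: list[str], label: str = "MILK"):
    """Precision: of the words I called MILK, how many really were MILK?
    Recall: of the real MILKs, how many did I find?"""
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))  # true positives
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))  # false alarms
    fn = sum(p != label and a == label for p, a in zip(predicted, actual))  # missed MILKs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Calling everything MILK gets perfect recall and terrible precision;
# calling almost nothing MILK does the opposite. Hence the tug-of-war.
```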
After a lot of trial and error (tweaking parameters, retraining, staring at graphs), I mostly understood what was going on. I later found out this is called hyperparameter tuning.
The custom model does most of the work. Then I ship the results to AWS where a single AI pass confirms or corrects the labels. One call instead of 15+.
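A sketch of what that single confirm-or-correct pass can look like, written here against Bedrock's Converse API; the model ID and prompt are placeholders:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def confirm_labels(words: list[str], labels: list[str]) -> list[str]:
    """One verification call: hand the custom model's guesses to an LLM, ask for corrections only."""
    prompt = (
        "Here are receipt words paired with proposed labels. "
        "Return the corrected labels as a JSON array, same length and order:\n"
        + json.dumps([list(pair) for pair in zip(words, labels)])
    )
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])
```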
Same results. Fraction of the time. Fraction of the cost. I can finally afford to find out about the milk.
The question still stands: "How much did I spend on milk?"
When I ask, the system transforms my question into something it can actually search for:
It digs through the corpus, finds every mention of milk, and adds them up.
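Sketched against the same Chroma setup as before; the collection, metadata field, and similarity cutoff are all placeholders:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
# Hypothetical collection: one document per labeled line item, price in the metadata.
items = client.get_or_create_collection("line_items")

def spend_on(thing: str, max_hits: int = 500) -> float:
    """Turn 'how much did I spend on milk?' into a similarity search, then add up the hits."""
    n = min(max_hits, max(items.count(), 1))
    hits = items.query(query_texts=[thing], n_results=n)
    total = 0.0
    for meta, dist in zip(hits["metadatas"][0], hits["distances"][0]):
        if dist < 0.35:  # made-up similarity cutoff
            total += float(meta["line_total"])
    return round(total, 2)

print(spend_on("milk"))  # the number I was afraid of
```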
Ok, I over-engineered a milk tracker. But here's the thing: now I have a system I can actually break.
What if I ask it a weird question? What if the receipt is formatted in a way I've never seen? What if the AI hallucinates a grocery store that doesn't exist? I need to find out.
LangChain lets me wire up the whole pipeline: question in, answer out. But more importantly, it lets me throw hundreds of fake questions at the system to see what breaks.
Some work. Some don't. That's the point.
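The harness itself is not fancy. Assuming the LangChain pipeline is a Runnable (anything with `.invoke`), it's basically this, with made-up questions and a crude pass/fail check:

```python
# "pipeline" is whatever Runnable LangChain hands you; the questions and the
# pass/fail check below are made up for the sketch.
FAKE_QUESTIONS = [
    "How much did I spend on milk?",
    "How much did I spend on oat milk at a store that doesn't sell oat milk?",
    "What did I buy on February 30th?",
    "milk???",
]

def smoke_test(pipeline) -> None:
    for q in FAKE_QUESTIONS:
        try:
            answer = pipeline.invoke(q)
            status = "ok" if answer else "empty answer"
        except Exception as err:  # a crash is also an answer, just a bad one
            status = f"crashed: {err}"
        print(f"{q!r:<75} -> {status}")
```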
LangSmith records what happened: which questions worked, which failed, and why. I can use AI to annotate bad answers, use AI to evaluate why it went wrong, and plan a new experiment.
Change something, ask the questions, check the results, repeat. It's less "artificial intelligence" and more "arguing with a very fast intern who keeps misreading receipts".
Anyway, $800+ on milk last year. I might have a problem.
If you're still here, you might want to know what's actually under the hood.
I didn't write most of this code by hand. I orchestrated it. Cursor and Claude did the typing; I did the thinking. I know I'm cooking when I spend more time reviewing changes than writing code.
GitHub Actions let me experiment fast: push a change, watch it break, fix it, repeat. Cheap iteration without breaking production.
I've used Terraform before, but this time I tried Pulumi. It lets me define AWS infrastructure in Python, which means I can hack it with tools I already know.
This project has a lot of Docker containers. My Pulumi hack bundles and ships them to AWS CodeBuild, which builds and deploys them without melting my laptop.
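For flavor, the Pulumi side is plain Python. A trimmed-down sketch with made-up resource names (real `pulumi_aws` calls, not my full stack):

```python
import pulumi
import pulumi_aws as aws

# Where CodeBuild pushes the container images.
repo = aws.ecr.Repository("receipt-containers")

# The receipts table, with its change data capture stream turned on.
receipts = aws.dynamodb.Table(
    "receipts",
    billing_mode="PAY_PER_REQUEST",
    hash_key="pk",
    range_key="sk",
    attributes=[
        aws.dynamodb.TableAttributeArgs(name="pk", type="S"),
        aws.dynamodb.TableAttributeArgs(name="sk", type="S"),
    ],
    stream_enabled=True,
    stream_view_type="NEW_AND_OLD_IMAGES",
)

pulumi.export("repo_url", repo.repository_url)
pulumi.export("stream_arn", receipts.stream_arn)
```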
The other thing I leaned into: event-driven architecture. DynamoDB has a change data capture stream—whenever something changes, I can react to it.
That's how I keep DynamoDB and Chroma in sync. A change hits Dynamo, a Lambda picks it up, Chroma gets updated. No polling, no cron jobs.
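The sync Lambda is small. A sketch: the Chroma endpoint and the item field names are placeholders, but the record shape is DynamoDB Streams' standard format.

```python
import chromadb

# Placeholder endpoint for a Chroma server reachable from the Lambda.
chroma = chromadb.HttpClient(host="chroma.internal", port=8000)
receipts = chroma.get_or_create_collection("receipts")

def handler(event, context):
    """DynamoDB stream event in, Chroma upsert/delete out."""
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["pk"]["S"]
        if record["eventName"] == "REMOVE":
            receipts.delete(ids=[doc_id])
        else:  # INSERT or MODIFY
            item = record["dynamodb"]["NewImage"]
            receipts.upsert(ids=[doc_id], documents=[item["text"]["S"]])
```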
Same pattern for my laptop talking to AWS. SQS queues everywhere. Things fail, things retry, nothing gets lost.
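The queue plumbing is equally boring, which is the point. A boto3 sketch with a placeholder queue URL and payload:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/receipt-uploads"  # placeholder

def process(payload: dict) -> None:
    """Stand-in for the real pipeline step."""
    print("processing", payload)

def enqueue(receipt_key: str) -> None:
    """Laptop side: drop a message and walk away."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"s3_key": receipt_key}))

def drain() -> None:
    """Worker side: anything we don't delete reappears after the visibility timeout, i.e. a retry."""
    resp = sqs.receive_message(QueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20) if False else \
           sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        process(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```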
I'll probably keep building this way. It's nice when the system does the work.
The code is on GitHub if you want to see how the sausage gets made. Or the milk, I guess.
You can also drag and drop images anywhere on this page.