Albert Heijn Receipt Parser
Albert Heijn is where I do most of my grocery shopping. Through their loyalty program, every receipt, from apples to toilet paper, gets logged in the app. That gave me access to over 600 digital receipts spanning March 2023 to May 2025.
But they don’t make it easy to use that data. The receipts are stored as unsearchable PDFs: effectively image files with no underlying structure, no export-to-table button, and no way to actually work with the data. If you want to analyze your own purchasing history, you’re on your own.
That’s what this project was for. I built a pipeline to unlock all that hidden structure.
What the Pipeline Does
The parser takes a raw PDF receipt and returns a clean table of product-level data. Each row corresponds to one purchase, with fields like product name, price, quantity, timestamp, discount flag, and source. It supports multi-person tagging too. I used it to separate my receipts from my roommate’s.
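One row of that output can be sketched as a small record type. The field names and types here are illustrative, inferred from the description above rather than taken from the actual code:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ReceiptItem:
    """One parsed purchase line. Field names are illustrative, not the real schema."""
    product: str            # product name as printed on the receipt
    price: float            # price in euros
    quantity: float         # count, or weight in kg for produce
    timestamp: datetime     # purchase date/time from the receipt
    has_bonus: bool         # True if a Bonus discount applied
    source: str             # which PDF the row came from
    person: Optional[str] = None  # optional tag, e.g. me vs. roommate
```

A downstream analysis can then treat a list of `ReceiptItem` rows as a plain table, e.g. by loading them into a DataFrame.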
Under the hood, it combines OCR, regex logic, and some assumptions about Albert Heijn’s receipt format. The key challenge wasn’t the OCR itself but reliably detecting messy edge cases: Bonus discounts, for example, are printed at the end of lines, and timestamps show up in inconsistent formats. I ended up tuning the OCR engine to treat each receipt as a single block of text, which made it easier to extract structured patterns.
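A minimal sketch of the line-level extraction, assuming Tesseract as the OCR engine and a trailing "B" as the Bonus marker (the exact marker, layout, and configuration are my assumptions, not confirmed details of the real pipeline):

```python
import re

# Assumed OCR tuning: Tesseract's page-segmentation mode 6 treats the
# whole image as one uniform block of text, keeping receipt lines intact.
TESSERACT_CONFIG = "--psm 6 -l nld"

# Illustrative pattern for one item line: product name, comma-decimal
# price, and an optional "B" Bonus marker at the end of the line.
ITEM_RE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+,\d{2})\s*(?P<bonus>B)?$")

def parse_line(line: str):
    """Return a dict for an item line, or None if the line doesn't match."""
    m = ITEM_RE.match(line.strip())
    if not m:
        return None
    return {
        "product": m.group("name"),
        "price": float(m.group("price").replace(",", ".")),  # 1,99 -> 1.99
        "bonus": m.group("bonus") is not None,
    }
```

Running `parse_line` over every OCR’d line and discarding the `None`s yields the raw item rows; header, total, and payment lines simply fail the pattern.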
But all that is beside the point. The goal was never to build a perfect parser. It was to get reliable, analyzable data out of a fixed, messy format. And to do it without relying on manual entry or external APIs.
Why I Built It
This parser was a prerequisite. I built it so I could build something else: a full personal inflation index, modeled after official CPI logic but applied to my actual groceries. You can read about that here.
That project needed high-resolution input. Real prices, real dates, real purchases. And none of that would’ve been available without this custom pipeline.
It wasn’t glamorous work. But it’s the kind of infrastructure you need if you want to do serious individual-level analysis without access to loyalty card datasets or retailer APIs.
What It Shows
- You don’t need a clean source to get a clean dataset. Just some engineering and persistence.
- Focused tools outperform general-purpose ones when the data source is known.
- Most personal data is locked away in formats built for compliance, not comprehension. But it doesn’t have to stay that way.