So this story takes place during the Amazon ML Challenge, a 72-hour marathon that started like any typical machine learning competition. We were pumped, confident that we could handle the task at hand. After all, we had solved many ML problems before. When we saw the problem statement involving 300,000 images, our first thought was, “This is CNN time!”
However, we soon realized that extracting text from these images was crucial. We chose PaddleOCR, as it was delivering the best results. But there was a major roadblock—our humble GPU, an RTX 3050 with just 4 GB VRAM (nicknamed “@jetengine”). Extracting text from a single image took about 1 second, and with 300,000 images, we were staring at 83.3 hours of processing time. The challenge was only 72 hours long!
We decided to leverage the GPU for OCR, but PaddleOCR’s documentation was largely in Chinese. The English sections were less than helpful. Still, we pushed forward, wrote a Python script to utilize the GPU, and handled Out of Memory (OOM) errors by catching exceptions and skipping problematic images. After hours of trial and error—around 4 a.m.—we finally extracted the text from all the images and celebrated with some much-needed coffee.
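For the curious, here is a minimal sketch of what that extraction loop looked like, assuming PaddleOCR 2.x (where use_gpu is a constructor flag). The folder layout and CSV output are illustrative, not our exact script:

```python
import csv
import glob

from paddleocr import PaddleOCR

# lang="en" loads the English detection + recognition models.
ocr = PaddleOCR(use_angle_cls=True, lang="en", use_gpu=True, show_log=False)

with open("extracted_text.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "text"])
    for path in sorted(glob.glob("images/*.jpg")):
        try:
            result = ocr.ocr(path, cls=True)
            detections = result[0] or []  # each item: [box, (text, confidence)]
            text = " ".join(det[1][0] for det in detections)
        except Exception:
            # On a 4 GB card, OOM (and other per-image failures) surface as
            # exceptions; skip the image instead of crashing the whole run.
            text = ""
        writer.writerow([path, text])
```

Catching a broad Exception is ugly, but at 4 a.m. with 300,000 images to get through, losing a handful of images beat losing the whole run.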
With the text extracted, our next move was simple: map the text to the target variable. We built a base model using regex and submitted it. To our amazement, we were ranked 9th at the start! Another round of celebrations followed, this time with pastries.
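The baseline itself was nothing fancy: scan the OCR output for a number followed by a known unit. The pattern below is an illustrative reconstruction; the real unit list and output format came from the competition data, which I am not reproducing exactly here:

```python
import re

# Match a number followed by a unit of measure, e.g. "500 gram" or "12.5kg".
# Longer unit names come before their abbreviations so they match first.
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kilogram|gram|kg|g|millilitre|ml|litre|volt|watt)\b",
    re.IGNORECASE,
)

def predict(ocr_text: str) -> str:
    """Return the first number-unit pair found in the OCR text, else ''."""
    m = PATTERN.search(ocr_text)
    return f"{m.group(1)} {m.group(2).lower()}" if m else ""

print(predict("Net weight: 500 gram, best before 12 months"))  # "500 gram"
```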
But our early success was short-lived. No matter what we threw at the problem, from TF-IDF features to BERT and other NLP models, nothing improved our accuracy. We were stuck, wondering how the top teams were pulling off such high scores.
With just 12 hours left, we decided to try Named Entity Recognition (NER). We manually annotated around 400 images each, hoping that this last-ditch effort would pay off. Unfortunately, the results were no better. This was the point where I hit a low, realizing that despite everything we knew and tried, it wasn’t enough.
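For a sense of what that annotation work produced, here is a sketch in spaCy's training format. The labels are hypothetical stand-ins for the real targets, and I am not claiming spaCy was our exact tooling; the point is the shape of the data, character-offset spans over the OCR text:

```python
import spacy
from spacy.training import Example

# Each example marks the character offsets of the value inside the OCR text.
TRAIN_DATA = [
    ("Net weight 500 gram", {"entities": [(11, 19, "ITEM_WEIGHT")]}),
    ("Operating voltage: 220 volt", {"entities": [(19, 27, "VOLTAGE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):  # a handful of passes over the annotated examples
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)
```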
The next morning, I messaged the leaderboard toppers to figure out what we had missed. That's when we learned they had used Vision-Language (VL) models, something we hadn't even considered. The news cut both ways: relief that it wasn't a skill issue, but frustration that an information gap had cost us. It took two days to shake off the disappointment.
Despite everything, I ranked 193 out of 75,000+ participants, and looking back, those 72 hours were among the most intense and rewarding I’ve ever experienced. This challenge showed me that no matter how much you know, there’s always more to learn. That’s the beauty of it.