- Completed the final capstone project in collaboration with Refiberd, working alongside an exceptional team: Isidora Rollan, Erin Jones, Mustafa Hameed, and Prashant Sharma.
- Prototyped an image-to-text captioning system for Refiberd, projected to cut the workforce required for label collection by 50% and save roughly 1,500 work hours per year.
- Conceptualized, tested, and finalized the ML model, then fine-tuned it and prepared it for deployment.
- Optimized the state-of-the-art multimodal encoder-decoder Donut model, reaching a normalized Levenshtein distance of 0.05 (see the metric sketch at the end of this section).
- Built a custom dataset by sourcing raw images of vendor tags and converting them into the Hugging Face Apache Arrow format to make model training more efficient (see the dataset-packaging sketch at the end of this section).
- Train split: 469 images
- Validation split: 112 images
- Test split: 66 images
- Wrote a custom PyTorch training loop that logs key metrics such as train loss, validation accuracy, and test accuracy, and synchronizes model states and checkpoints with the Hugging Face Hub (see the training-loop sketch at the end of this section).
- Explored and tuned hyperparameters, experimenting with several optimizers, including SGD, SGD with momentum, and the adaptive optimizers Adam and AdamW, to improve model performance (the candidate optimizers appear in the training-loop sketch below).
- Ran training on NVIDIA L4 GPUs, managing GPU memory and resources to keep the training loops efficient.
- Achieved an exact match on 47 of the 66 test images, underscoring the model's robustness to out-of-distribution data.
- Reached a cross-entropy loss of just 0.0005 on the training set for the decoder's next-token prediction task, indicating a very close fit to the training data.
- Currently building attention visualizations for the decoder using the model's output attention states and BertViz, to gain deeper insight into the model's attention mechanisms (a sketch follows at the end of this section).
Fine-Tuning an OCR-Free Document Understanding Transformer for Image-to-Text Captioning
[ Presentation, Demo, Github, Website, Report ]
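A minimal sketch of how the vendor-tag images and their text labels can be packaged into the Hugging Face Apache Arrow format. The directory layout, the `labels.json` file, and the output paths are illustrative assumptions, not the project's actual files.

```python
import json
from pathlib import Path
from datasets import Dataset, DatasetDict, Image

def build_split(split_dir: str) -> Dataset:
    """Pair each image with its ground-truth caption from a labels.json file
    and cast the column to the datasets Image feature (stored as Arrow)."""
    root = Path(split_dir)
    labels = json.loads((root / "labels.json").read_text())  # {"img_001.jpg": "100% cotton ...", ...}
    paths = sorted(str(p) for p in root.glob("*.jpg"))
    ds = Dataset.from_dict({
        "image": paths,
        "text": [labels[Path(p).name] for p in paths],
    })
    return ds.cast_column("image", Image())  # decoded lazily, backed by Arrow files

# Assemble the 469 / 112 / 66 image splits used in the project.
dataset = DatasetDict({
    "train": build_split("data/train"),
    "validation": build_split("data/validation"),
    "test": build_split("data/test"),
})
dataset.save_to_disk("vendor_tags_arrow")               # Arrow files on local disk
# dataset.push_to_hub("your-org/vendor-tag-captions")   # optional: mirror to the Hub (placeholder repo id)
```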
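A sketch of the kind of custom training loop described above: cross-entropy loss on the decoder's next-token prediction, per-epoch metric logging, and checkpoint sync to the Hugging Face Hub. The `train_loader`/`val_loader` DataLoaders, batch format, Hub repo id, epoch count, and learning rates are illustrative assumptions.

```python
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base").to(device)

# Optimizers tried in the hyperparameter sweep; AdamW is wired in below as one example.
optimizer_options = {
    "sgd":          lambda p: torch.optim.SGD(p, lr=1e-3),
    "sgd_momentum": lambda p: torch.optim.SGD(p, lr=1e-3, momentum=0.9),
    "adam":         lambda p: torch.optim.Adam(p, lr=3e-5),
    "adamw":        lambda p: torch.optim.AdamW(p, lr=3e-5, weight_decay=0.01),
}
optimizer = optimizer_options["adamw"](model.parameters())

def run_epoch(loader, train: bool) -> float:
    """One pass over a DataLoader yielding dicts with 'pixel_values' and 'labels'
    (label token ids, with padding positions set to -100 so they are ignored)."""
    model.train(train)
    total = 0.0
    for batch in loader:
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device)
        with torch.set_grad_enabled(train):
            loss = model(pixel_values=pixel_values, labels=labels).loss  # next-token cross entropy
        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total += loss.item()
    return total / len(loader)

for epoch in range(10):
    train_loss = run_epoch(train_loader, train=True)    # train_loader: assumed DataLoader
    val_loss = run_epoch(val_loader, train=False)       # val_loader: assumed DataLoader
    print(f"epoch {epoch}: train loss {train_loss:.4f} | val loss {val_loss:.4f}")
    # Mirror the current model weights and processor state to the Hub.
    model.push_to_hub("your-org/donut-vendor-tags")      # repo id is a placeholder
    processor.push_to_hub("your-org/donut-vendor-tags")
```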
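A minimal sketch of the normalized Levenshtein metric used to score predicted captions against their labels (0.0 means an exact match); the example strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution
        previous = current
    return previous[-1]

def normalized_levenshtein(pred: str, target: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))

# Example: average the metric over (prediction, label) pairs.
pairs = [("100% cotton", "100% cotton"),
         ("60% cotton 40% poly", "60% cotton 40% polyester")]
score = sum(normalized_levenshtein(p, t) for p, t in pairs) / len(pairs)
print(f"mean normalized Levenshtein distance: {score:.3f}")
```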
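A sketch of one way to inspect the decoder's attention with BertViz: run a forward pass with attention outputs enabled and pass the decoder self-attention tensors to `head_view`. The checkpoint name, image path, and example caption are placeholders; in the project this would run on the fine-tuned checkpoint inside a notebook.

```python
import torch
from PIL import Image
from bertviz import head_view
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

# Placeholder inputs: a vendor-tag image and a caption to feed through the decoder.
image = Image.open("example_tag.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_inputs = processor.tokenizer("60% cotton 40% polyester", return_tensors="pt")

with torch.no_grad():
    outputs = model(
        pixel_values=pixel_values,
        decoder_input_ids=decoder_inputs.input_ids,
        output_attentions=True,
    )

# outputs.decoder_attentions: one (batch, heads, seq, seq) tensor per decoder layer.
tokens = processor.tokenizer.convert_ids_to_tokens(decoder_inputs.input_ids[0])
head_view(outputs.decoder_attentions, tokens)  # renders an interactive view in a Jupyter notebook
```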