[ Preprint, Poster, GitHub ]

  • 🗄️📊 Dataset: MPII, 1463 images with an 80/20 training/validation split (see the split sketch after this list)
  • 🤖 Model: BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation); loading sketch below
  • 🤗 Source Code
  • 📉 Training Loss: language-modelling loss from the text decoder (loss sketch below)
  • ๐Ÿ“ Evaluation Metric: Mean Absolute Error (MAE) [PyTorch]
  • ⚖️ Validation Thresholds: 1, 5, and 25 pixels
  • ⚙️ Hyperparameters (training-step sketch below):
    • Batch size: 4
    • Learning rate: 2e-5
    • Optimizer: AdamW
  • 🎯 Validation Accuracy: 92.5% at the 25-pixel threshold
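
The sketches below illustrate each step under stated assumptions; they are minimal illustrations, not the project's actual code. First, the 80/20 split. This assumes a hypothetical `MPIIImages` dataset class and a fixed seed, neither of which is taken from the repo:

```python
import torch
from torch.utils.data import Dataset, random_split

class MPIIImages(Dataset):
    """Placeholder for the repo's MPII dataset class (name assumed)."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

full_dataset = MPIIImages(list(range(1463)))   # stand-in for the 1463 images

# 80/20 split: 1170 training / 293 validation images.
n_train = int(0.8 * len(full_dataset))         # 1170
n_val = len(full_dataset) - n_train            # 293
generator = torch.Generator().manual_seed(42)  # seed value is an assumption
train_set, val_set = random_split(full_dataset, [n_train, n_val], generator=generator)
```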
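
Loading BLIP through the Hugging Face `transformers` API. The specific checkpoint name is an assumption; the project may fine-tune a different BLIP variant:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration

# Checkpoint is an assumption; any BLIP captioning checkpoint loads the same way.
checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)
```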
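
The training signal is the text decoder's language-modelling (cross-entropy) loss. In `transformers`, passing `labels` to `BlipForConditionalGeneration` makes the forward pass return that loss directly; the target string format used here is purely illustrative:

```python
from PIL import Image

# A blank image stands in for an MPII sample.
image = Image.new("RGB", (384, 384), color="white")
target_text = "x=120 y=87"  # illustrative target format, not the repo's

inputs = processor(images=image, text=target_text, return_tensors="pt")
# With labels supplied, the text decoder returns its LM (cross-entropy) loss.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # scalar language-modelling loss
```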
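
A training step wiring the listed hyperparameters together (batch size 4, AdamW at 2e-5), continuing the sketches above. It assumes each batch is already a dict of processor tensors; the epoch count and any learning-rate schedule are not stated in the summary and are omitted:

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader

# batch_size=4 as listed above; the collate step that turns raw MPII samples
# into processor tensors is repo-specific and assumed to happen in the Dataset.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in train_loader:
    outputs = model(
        pixel_values=batch["pixel_values"],
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],   # LM loss from the text decoder, as above
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```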
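
Finally, evaluation: MAE in pixels plus accuracy at the 1/5/25-pixel thresholds (the reported 92.5% is at 25 pixels). Whether the threshold applies to Euclidean distance or per-axis error is not stated; Euclidean distance is assumed here, and the repo-specific step that parses coordinates out of the decoder's generated text is omitted:

```python
import torch

def pixel_metrics(pred_xy: torch.Tensor, true_xy: torch.Tensor, thresholds=(1, 5, 25)):
    """pred_xy / true_xy: (N, 2) predicted and ground-truth pixel coordinates."""
    # Mean Absolute Error over both coordinates, in pixels.
    mae = (pred_xy - true_xy).abs().mean().item()
    # Per-sample Euclidean error (assumption: thresholds apply to this distance).
    dist = (pred_xy - true_xy).float().norm(dim=1)
    accuracy = {t: (dist <= t).float().mean().item() for t in thresholds}
    return mae, accuracy

# Toy usage with two predictions:
pred = torch.tensor([[120.0, 87.0], [40.0, 200.0]])
true = torch.tensor([[118.0, 90.0], [60.0, 210.0]])
mae, acc = pixel_metrics(pred, true)
print(mae, acc)  # acc[25] is the fraction of samples within 25 pixels
```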