
Gaze-based Transformer for Improving Action Recognition in Egocentric Videos


Description: Current state-of-the-art work on action recognition in egocentric videos uses hand-object interaction to create better class tokens in video transformers. However, gaze provides richer information and introduces eye-hand coordination into the model, which can potentially improve action recognition performance further. In this project, we will use a transformer to extract gaze features that are correlated with hand and context information, and use these gaze features to improve action recognition in egocentric videos.
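As a rough illustration of the intended fusion, the sketch below shows how gaze features from a Gazeformer-style encoder could be combined with hand-object class tokens via cross-attention before classification. This is a minimal PyTorch sketch under our own assumptions, not the project's final architecture; the module name, tensor shapes, and class count are hypothetical.

```python
import torch
import torch.nn as nn

class GazeHandFusion(nn.Module):
    """Hypothetical fusion block: hand-object class tokens attend to gaze tokens."""

    def __init__(self, dim=768, num_heads=8, num_classes=100):
        super().__init__()
        # Cross-attention: queries come from hand-object tokens, keys/values from gaze tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, hand_obj_tokens, gaze_tokens):
        # hand_obj_tokens: (B, T, dim) class tokens from a video transformer (EgoViT-style)
        # gaze_tokens:     (B, G, dim) gaze features from a Gazeformer-style encoder
        fused, _ = self.cross_attn(hand_obj_tokens, gaze_tokens, gaze_tokens)
        fused = self.norm(hand_obj_tokens + fused)   # residual connection over the hand-object stream
        return self.classifier(fused.mean(dim=1))    # pool over time, predict the action class

# Toy usage with random features
model = GazeHandFusion()
hand_obj = torch.randn(2, 8, 768)   # batch of 2 clips, 8 frame-level class tokens each
gaze = torch.randn(2, 8, 768)       # matching gaze features per frame
logits = model(hand_obj, gaze)      # (2, num_classes)
```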

Goal: Modify and finetune Gazeformer, and develop a method that uses both hand-object interaction and eye-hand coordination to improve the accuracy of action recognition.

Supervisor: Lei Shi

Distribution: 70% implementation, 20% analysis, 10% literature review

Requirements: Strong programming skills, experience in deep learning, familiarity with PyTorch, knowledge of transformers.

Literature:

Mondal, Sounak, et al. "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Li, Yin, Miao Liu, and James M. Rehg. "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Pan, Chenbin, et al. "EgoViT: Pyramid Video Transformer for Egocentric Action Recognition." arXiv preprint arXiv:2303.08920 (2023).