Learning-based Region Selection for End-to-End Gaze Estimation

Xucong Zhang, Yusuke Sugano, Andreas Bulling, Otmar Hilliges

Proc. British Machine Vision Conference (BMVC), pp. 1-13, 2020.

Method Overview. The RSN takes a face image as input and outputs an index into the location pool. This location index is then mapped to a pixel location in the face image, around which a region is cropped. The gaze net takes the cropped region image and the corresponding region grid as inputs. Features are extracted from the crop and from the region grid and then concatenated. The combined feature vector is fed into three fully-connected layers to estimate a 2D gaze direction g.
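The data flow in the overview can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the lattice-shaped location pool, the binary encoding of the region grid, the crop size, the layer widths, and all function names are hypothetical. The RSN is represented only by the `location_index` it would output, flattened arrays stand in for learned feature extractors, and the weights are random rather than trained.

```python
import numpy as np

def build_location_pool(height, width, steps=5):
    """Hypothetical location pool: a steps x steps lattice of (row, col) centres."""
    rows = np.linspace(0, height - 1, steps).round().astype(int)
    cols = np.linspace(0, width - 1, steps).round().astype(int)
    return [(int(r), int(c)) for r in rows for c in cols]

def crop_region(face, center, size):
    """Crop a size x size window around center, shifted to stay inside the image."""
    h, w = face.shape[:2]
    top = int(np.clip(center[0] - size // 2, 0, h - size))
    left = int(np.clip(center[1] - size // 2, 0, w - size))
    return face[top:top + size, left:left + size], (top, left)

def region_grid(shape, top_left, size, cells=8):
    """Assumed encoding: a coarse binary grid marking where the crop sits in the face."""
    h, w = shape
    top, left = top_left
    grid = np.zeros((cells, cells), dtype=np.float32)
    r0, r1 = top * cells // h, (top + size - 1) * cells // h
    c0, c1 = left * cells // w, (left + size - 1) * cells // w
    grid[r0:r1 + 1, c0:c1 + 1] = 1.0
    return grid

def fully_connected(x, layers):
    """Apply fully-connected layers with ReLU in between and a linear output."""
    for i, (weight, bias) in enumerate(layers):
        x = x @ weight + bias
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def estimate_gaze(face, location_index, rng, size=32):
    """Map a location index (standing in for the RSN output) to a 2D gaze vector."""
    pool = build_location_pool(*face.shape[:2])
    crop, top_left = crop_region(face, pool[location_index], size)
    grid = region_grid(face.shape[:2], top_left, size)
    # Stand-in features: flattened inputs instead of learned feature extractors.
    features = np.concatenate([crop.ravel(), grid.ravel()])
    # Three fully-connected layers with random (untrained) weights.
    dims = [features.size, 64, 32, 2]
    layers = [(rng.standard_normal((i, o)) * 0.01, np.zeros(o))
              for i, o in zip(dims[:-1], dims[1:])]
    return fully_connected(features, layers)

face = np.random.default_rng(0).random((96, 96))  # synthetic grayscale "face"
gaze = estimate_gaze(face, location_index=12, rng=np.random.default_rng(1))
print(gaze.shape)
```

Because the weights are random, the output values carry no meaning; the sketch only shows how an index is turned into a crop plus a location encoding before the final regression.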


Traditionally, appearance-based gaze estimation methods use statically defined face regions, such as eye patches, as input to the gaze estimator. They therefore suffer under difficult lighting conditions and extreme head poses, for which these regions are often not the most informative with respect to the gaze estimation task. We posit that facial regions should be selected dynamically based on the image content and propose a novel gaze estimation method that combines the tasks of region proposal and gaze estimation into a single end-to-end trainable framework. We introduce a novel loss that allows for unsupervised training of a region proposal network alongside the (supervised) training of the final gaze estimator. We show that our method can learn meaningful region selection strategies and outperforms fixed-region approaches. We further show that our method performs particularly well for challenging cases, i.e., those with difficult lighting conditions such as directional lights, extreme head angles, or self-occlusion. Finally, we show that the proposed method achieves better results than the current state-of-the-art method in both within-dataset and cross-dataset evaluations.



@inproceedings{zhang20_bmvc,
  author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas and Hilliges, Otmar},
  title = {Learning-based Region Selection for End-to-End Gaze Estimation},
  booktitle = {Proc. British Machine Vision Conference (BMVC)},
  year = {2020},
  pages = {1-13}
}