
Next Question Prediction in Goal-oriented Visual Question Answering by a Theory of Mind model


Description: Visual Question Answering (VQA) is a task in which an image, a dialog about the image, and a question are given, and the goal is to answer the question based on this information. In this work, the task is instead to predict the next question in the dialog rather than to answer the last one, which can be interpreted as intention prediction in the VQA setting. A computational Theory of Mind (ToM) model, ToMNet+, will be used to encode the hidden mental state of the questioner during question answering and to infer the next question.
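
To make the intended setup concrete, below is a minimal, hypothetical sketch of a ToMNet-style next-question predictor in PyTorch: a recurrent character net summarises the question-answer history, a small CNN encodes the image (the backbone the project proposes to replace with a GNN), and a scoring head ranks candidate next questions. All module names and dimensions are illustrative assumptions, not ToMNet+'s actual architecture.

import torch
import torch.nn as nn

class NextQuestionPredictor(nn.Module):
    # Illustrative ToMNet-style model: a GRU character net summarises the
    # dialog history, a small CNN encodes the image, and a dot-product
    # head scores candidate next questions. Sizes are placeholders.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.char_net = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.cnn = nn.Sequential(                      # CNN backbone (to be replaced by a GNN)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.question_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.ctx_proj = nn.Linear(hidden_dim + 64, hidden_dim)

    def forward(self, image, history, candidates):
        # history: (B, T) token ids of the dialog so far
        # candidates: (B, K, T) token ids of K candidate next questions
        _, h = self.char_net(self.embed(history))             # (1, B, H) mental-state summary
        img = self.cnn(image)                                 # (B, 64) image features
        ctx = self.ctx_proj(torch.cat([h.squeeze(0), img], dim=-1))  # (B, H) joint context
        B, K, T = candidates.shape
        _, q = self.question_enc(self.embed(candidates.reshape(B * K, T)))
        q = q.squeeze(0).reshape(B, K, -1)                    # (B, K, H) candidate encodings
        return torch.einsum("bh,bkh->bk", ctx, q)             # (B, K) next-question logits

Training such a model would then amount to minimising cross-entropy over the candidate scores against the question the human actually asked next.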

In this project, you will work on a synthetic goal-oriented VQA dataset. You will need to 1) replace the CNN backbone in ToMNet+ with Graph Neural Network (GNN) backbones and compare the results, 2) compare different approaches to constructing the graphs that the GNN operates on, and 3) generate synthetic datasets with different statistics and analyse how they affect performance. A sketch of points 1) and 2) follows below.
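
As a starting point for 1) and 2), the following is a minimal sketch using PyTorch Geometric (DGL would work analogously). It assumes that per-object node features, e.g. pooled detector features for each bounding box, are already available; the two edge-construction helpers are examples of graph-formation strategies one could compare, and all names and dimensions are illustrative.

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

def fully_connected_edges(num_nodes):
    # Graph variant A: every object is connected to every other object.
    idx = torch.arange(num_nodes)
    src, dst = torch.meshgrid(idx, idx, indexing="ij")
    mask = src != dst                                    # drop self-loops
    return torch.stack([src[mask], dst[mask]], dim=0)    # (2, E)

def knn_spatial_edges(centers, k=4):
    # Graph variant B: each object is connected to its k nearest
    # neighbours in image space (centers: (N, 2) box centres).
    dists = torch.cdist(centers, centers)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]    # skip self (distance 0)
    src = torch.arange(centers.size(0)).repeat_interleave(k)
    return torch.stack([src, knn.reshape(-1)], dim=0)        # (2, N * k)

class GNNBackbone(torch.nn.Module):
    # Drop-in replacement for the CNN image encoder: two GCN layers over
    # object nodes, mean-pooled into a single image embedding.
    def __init__(self, in_dim, hidden_dim=64, out_dim=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, data):
        x = self.conv1(data.x, data.edge_index).relu()
        x = self.conv2(x, data.edge_index)
        return global_mean_pool(x, getattr(data, "batch", None))  # (num_graphs, out_dim)

# Example: 5 detected objects with 16-d features and 2-d box centres.
feats, centers = torch.randn(5, 16), torch.rand(5, 2)
graph = Data(x=feats, edge_index=knn_spatial_edges(centers, k=2))
image_embedding = GNNBackbone(in_dim=16)(graph)               # shape (1, 64)

A fully connected graph is the simplest baseline, while spatial k-NN (or an explicit scene graph from the synthetic generator) injects spatial structure into the backbone; comparing such constructions is exactly point 2).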

Supervisor: Lei Shi

Distribution: Implementation: 70%, Evaluation: 30%

Requirements: Strong programming skills in Python and PyTorch. Good knowledge of deep learning. Preferable: interest in or knowledge of graph neural networks. Preferable: knowledge of a graph neural network library, e.g. DGL or PyTorch Geometric.

Literature:

Zhao, R. and V. Tresp. 2018. Efficient dialog policy learning via positive memory retention. 2018 IEEE Spoken Language Technology Workshop (SLT), p. 823-830.

De Vries, H., F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. 5503-5512.

Chuang, Y.-S., H.-Y. Hung, E. Gamborino, J. O. S. Goh, T.-R. Huang, Y.-L. Chang, S.-L. Yeh, and L.-C. Fu. 2020. Using machine theory of mind to learn agent social network structures from observed interactive behaviors with targets. 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), p. 1013-1019.