KnowIT VQA: Answering Knowledge-Based Questions about Videos

KnowIT VQA Paper


KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning with knowledge-based questions, which can only be answered with the experience gained from watching the series.


Our AAAI 2020 paper:






If you find our paper useful, please cite us:

@InProceedings{garcia2020knowit,
   author    = {Noa Garcia and Mayu Otani and Chenhui Chu and Yuta Nakashima},
   title     = {KnowIT VQA: Answering Knowledge-Based Questions about Videos},
   booktitle = {Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
   year      = {2020},
}

Dataset Download


▪ KnowIT VQA Annotations: [Download]

The download contains three tab-separated csv files: knowit_data_train.csv, knowit_data_val.csv, and knowit_data_test.csv.

▪ Annotations Format:

Each row in the csv files corresponds to a sample.
Each sample contains the following fields:

Field       Type  Description
scene       str   Video clip id as sXXeYY_sceneZZZ_AAAA_BBBB
                    • XX is the season number.
                    • YY is the episode number.
                    • ZZZ is the scene number.
                    • AAAA is the first frame number of the scene (extracted at 1fps).
                    • BBBB is the last frame number of the scene (extracted at 1fps).
question    str   Question.
answer1     str   First candidate answer.
answer2     str   Second candidate answer.
answer3     str   Third candidate answer.
answer4     str   Fourth candidate answer.
idxCorrect  int   Index of the correct answer (1-4).
reason      str   Knowledge, i.e. the information required to answer the question.
kg_type     str   Whether the knowledge type is episode-specific or recurrent.
subtitle    str   Subtitles of the video clip.
QType       str   Question type (only in the test set).
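As a starting point for working with the annotations, here is a minimal Python sketch that reads one of the files and splits a scene id into its parts. It rests on a few assumptions that the description above does not confirm: that each file starts with a header row naming the fields, and that the files are UTF-8 encoded. The function names (parse_scene_id, load_samples, correct_answer) are hypothetical helpers, not part of any released code.

```python
import csv
import re
from typing import Dict, List

# Matches ids shaped like sXXeYY_sceneZZZ_AAAA_BBBB.
# The number of digits in each part is deliberately not enforced.
SCENE_RE = re.compile(r"s(\d+)e(\d+)_scene(\d+)_(\d+)_(\d+)")


def parse_scene_id(scene: str) -> Dict[str, int]:
    """Split a clip id into season, episode, scene and frame range."""
    match = SCENE_RE.fullmatch(scene.strip())
    if match is None:
        raise ValueError(f"Unexpected scene id: {scene!r}")
    season, episode, scene_num, first, last = (int(g) for g in match.groups())
    return {
        "season": season,
        "episode": episode,
        "scene": scene_num,
        "first_frame": first,  # frames are extracted at 1fps
        "last_frame": last,
    }


def load_samples(path: str) -> List[Dict[str, str]]:
    """Read one annotation file (assumption: tab-separated with a header row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def correct_answer(sample: Dict[str, str]) -> str:
    """Return the text of the correct candidate; idxCorrect is 1-based (1-4)."""
    return sample[f"answer{int(sample['idxCorrect'])}"]
```

For example, load_samples("knowit_data_train.csv") would return one dictionary per sample, and parse_scene_id(sample["scene"]) would give the season, episode, scene and frame range of its clip. If the files turn out to have no header row, the field names can be passed to csv.DictReader through its fieldnames argument instead.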

ROCK: Retrieval Over Collected Knowledge


ROCK is a model for Knowledge-Based Visual Question Answering in Videos. It uses external knowledge to answer questions about video clips.

ROCK assumes that the knowledge in a given universe is available as language instances. It retrieves those instances and fuses them with video representations to predict the answer.
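To make the retrieval idea concrete, the toy Python sketch below ranks a handful of knowledge sentences against a question. It is not ROCK's retrieval module: plain token overlap (Jaccard similarity) is swapped in purely for illustration, the fusion with video representations is not covered, and the function names and example sentences are made up for this sketch.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(query: str, knowledge: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k knowledge instances with the highest Jaccard overlap."""
    query_tokens = tokenize(query)
    scored = []
    for instance in knowledge:
        instance_tokens = tokenize(instance)
        union = query_tokens | instance_tokens
        score = len(query_tokens & instance_tokens) / len(union) if union else 0.0
        scored.append((score, instance))
    # The sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical knowledge instances, written only for this example.
knowledge_base = [
    "Sheldon has a designated spot on the couch.",
    "Penny works at a restaurant.",
    "Leonard and Sheldon are roommates.",
]
top = retrieve("Where does Sheldon always sit on the couch?", knowledge_base)
```

In this toy setup the sentence about the couch ranks first because it shares the most tokens with the question. A real system would replace the overlap score with a learned similarity.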





KnowIT VQA Leaderboard


Rank  Date         Model          Accuracy
1     Sep 5, 2019  ROCK-concepts  0.654
2     Sep 5, 2019  ROCK-image     0.654
3     Sep 5, 2019  ROCK-facial    0.654
4     Sep 5, 2019  ROCK-caption   0.635
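For readers who want to compare their own results with the table, the short Python sketch below computes accuracy under the assumption that it is the fraction of questions whose predicted answer index equals idxCorrect. The page does not spell out the evaluation protocol, so treat this as an illustration; the function name accuracy is a hypothetical helper.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of samples where the predicted index (1-4) matches idxCorrect."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if len(correct) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(int(p) == int(c) for p, c in zip(predicted, correct))
    return hits / len(correct)
```

Both sequences are expected to use the same 1-based indexing as the idxCorrect field.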

Contact us!


If you have any questions or suggestions, please contact Noa Garcia:

noagarcia@ids.osaka-u.ac.jp