Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Dongliang He; Xiang Zhao; Jizhou Huang; Fu Li; Xiao Liu; Shilei Wen

doi:10.1609/aaai.v33i01.33018393

Authors

Dongliang He, Baidu, Inc.
Xiang Zhao, Baidu, Inc.
Jizhou Huang, Baidu, Inc.
Fu Li, Baidu, Inc.
Xiao Liu, Baidu, Inc.
Shilei Wen, Baidu Research

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies either slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a pre-segmented video, and both strategies inevitably suffer from the cost of exhaustively enumerating candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem, learning an agent that progressively regulates the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement-learning-based framework improved by multi-task learning, which shows steady performance gains from considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
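To make the idea of progressively regulating temporal boundaries concrete, the following is a minimal sketch and not the framework proposed in the paper. Everything in it is an assumption made for illustration: the action names, the centered initial window, the step size of one tenth of the video, the temporal IoU score, and above all the greedy oracle that peeks at the ground-truth segment in place of a learned policy. Only the budget of at most 10 steps echoes the abstract.

```python
from typing import Dict, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical action set: (start direction, end direction), in units of one step.
ACTIONS: Dict[str, Tuple[int, int]] = {
    "start_earlier": (-1, 0),
    "start_later": (1, 0),
    "end_earlier": (0, -1),
    "end_later": (0, 1),
    "shift_earlier": (-1, -1),
    "shift_later": (1, 1),
}

def apply_action(seg: Segment, action: str, step: float, duration: float) -> Segment:
    """Move the boundaries, clamp to the video, reject moves that collapse the window."""
    ds, de = ACTIONS[action]
    start = min(max(seg[0] + ds * step, 0.0), duration)
    end = min(max(seg[1] + de * step, 0.0), duration)
    return (start, end) if end > start else seg

def greedy_oracle_episode(target: Segment, duration: float,
                          max_steps: int = 10) -> Tuple[Segment, int]:
    """Adjust a window step by step; stop when no action improves the IoU.

    The oracle uses the ground-truth target directly. A trained agent would
    instead choose actions from video and sentence features.
    """
    seg: Segment = (0.25 * duration, 0.75 * duration)  # assumed initial window
    step = duration / 10.0                             # assumed step size
    steps_taken = 0
    for _ in range(max_steps):
        current = temporal_iou(seg, target)
        best = max(ACTIONS, key=lambda a: temporal_iou(
            apply_action(seg, a, step, duration), target))
        candidate = apply_action(seg, best, step, duration)
        if temporal_iou(candidate, target) <= current:
            break  # plays the role of a STOP action
        seg = candidate
        steps_taken += 1
    return seg, steps_taken

if __name__ == "__main__":
    final, n = greedy_oracle_episode(target=(10.0, 30.0), duration=100.0)
    print(final, n, temporal_iou(final, (10.0, 30.0)))
```

The point of the sketch is the contrast with exhaustive enumeration: instead of scoring every candidate window, the loop examines a small, bounded number of windows and moves toward the target one adjustment at a time.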

Published

2019-07-17

How to Cite

He, D., Zhao, X., Huang, J., Li, F., Liu, X., & Wen, S. (2019). Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 8393-8400. https://doi.org/10.1609/aaai.v33i01.33018393

Issue

Vol. 33 No. 01: AAAI-19, IAAI-19, EAAI-20

Section

AAAI Technical Track: Vision
