Video Moment Retrieval via Natural Language Queries. (arXiv:2009.02406v2 [cs.CV] UPDATED)
In this paper, we propose a novel method for video moment retrieval (VMR)
that achieves state-of-the-art (SOTA) performance on R@1 metrics and
surpasses the existing SOTA on the high-IoU metric (R@1, IoU=0.7).
First, we propose to use a multi-head self-attention mechanism, together with a
cross-attention scheme, to capture video/query interaction and long-range query
dependencies from video context. These attention-based components model
frame-to-query and query-to-frame interactions at arbitrary
positions, and the multi-head setting helps the model capture
complicated dependencies. Our model has a simple architecture, which enables
faster training and inference while maintaining .
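
To make the frame-to-query and query-to-frame interactions concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the toy feature vectors, the function names, and the omission of learned projections are all hypothetical. A multi-head variant would apply the same operation to several learned projections and concatenate the results.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each row of `queries` attends over every row of `keys`/`values`.

    No learned projections are applied here (a simplifying assumption).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy features: 3 video frames and 2 query words, dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]

# Frame-to-query: every frame gathers information from all query words.
frame_to_query = cross_attention(frames, words, words)
# Query-to-frame: every query word gathers information from all frames.
query_to_frame = cross_attention(words, frames, frames)
```

Because every position attends to every other position in one step, the cost of relating two elements does not depend on how far apart they are in the sequence.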
Second, we propose a multi-task training objective consisting of a
moment segmentation task, start/end distribution prediction, and a start/end
location regression task. We have verified that start/end prediction is noisy
due to annotator disagreement, and that joint training with the moment
segmentation task provides richer information, since frames inside the target
clip are also used as positive training examples.
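
The three-part objective above can be sketched as a weighted sum of per-task losses. This is a minimal illustration under assumptions: the specific loss forms (binary cross-entropy for segmentation, cross-entropy for the start/end distributions, normalized L1 for regression), the equal weights, and all function names are hypothetical choices, not details confirmed by the abstract.

```python
import math

def _clip(p, eps=1e-12):
    """Keep probabilities away from 0 and 1 so log() is defined."""
    return min(max(p, eps), 1.0 - eps)

def segmentation_loss(inside_probs, start, end):
    """Mean binary cross-entropy; frames in [start, end] are positives."""
    total = 0.0
    for t, p in enumerate(inside_probs):
        p = _clip(p)
        label = 1.0 if start <= t <= end else 0.0
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(inside_probs)

def boundary_loss(start_probs, end_probs, start, end):
    """Cross-entropy of the annotated start/end frames under the
    predicted start and end distributions."""
    return -(math.log(_clip(start_probs[start])) + math.log(_clip(end_probs[end])))

def regression_loss(pred_start, pred_end, start, end, num_frames):
    """L1 error of the predicted boundaries, normalized by video length."""
    return (abs(pred_start - start) + abs(pred_end - end)) / num_frames

def multitask_loss(inside_probs, start_probs, end_probs,
                   pred_start, pred_end, start, end,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (weights are assumptions)."""
    w_seg, w_bnd, w_reg = weights
    n = len(inside_probs)
    return (w_seg * segmentation_loss(inside_probs, start, end)
            + w_bnd * boundary_loss(start_probs, end_probs, start, end)
            + w_reg * regression_loss(pred_start, pred_end, start, end, n))

# Hypothetical 5-frame video whose target moment spans frames 1..3.
start_probs = [0.1, 0.6, 0.1, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.1, 0.6, 0.1]
good = multitask_loss([0.1, 0.9, 0.9, 0.9, 0.1], start_probs, end_probs,
                      1.2, 2.9, start=1, end=3)
bad = multitask_loss([0.9, 0.1, 0.1, 0.1, 0.9], start_probs, end_probs,
                     1.2, 2.9, start=1, end=3)
```

Note that the segmentation term draws a supervision signal from every frame, whereas the boundary term depends only on the two annotated boundary frames, which is where annotator disagreement is concentrated.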
Third, we propose to use an early fusion approach, which achieves better
performance at the cost of inference time. However, inference time is not
a problem for our model, since its simple architecture enables efficient
training and inference.