# Awesome-Temporal-Video-Grounding

A list of Temporal Video Grounding (TVG) papers. The task is also commonly referred to as:
- Temporal Sentence Grounding (TSG)
- Video Moment Retrieval (VMR)
- Temporal Activity Localization via Language Query (TALL)

TVG was introduced in 2017 as the task of localizing the moments in a video that are semantically relevant to a given natural language query. Recent studies have begun investigating how to strengthen the grounding capability of large language models (LLMs), enabling them to temporally align visual content with natural language inputs.

Content
- 1 Survey
- 2 Datasets
- 3 LLM for TVG
- 4 Traditional TVG
## 1 Survey

- [TPAMI'23] Temporal Sentence Grounding in Videos: A Survey and Future Directions. Aixin Sun's team, NTU
- [ACM Comput. Surv.'23] A Survey on Video Moment Localization. Liqiang Nie's team, Harbin Institute of Technology
## 2 Datasets

Commonly used benchmarks and the visual features each is typically paired with (an evaluation sketch follows the list):
- Charades-STA: VGG, C3D, I3D, CLIP+SlowFast
- TACoS: C3D, I3D
- ActivityNet Captions: C3D
- QVHighlights: CLIP+SlowFast
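Results on these benchmarks are conventionally reported as Rank@K at temporal-IoU thresholds (e.g., R@1, IoU=0.5/0.7): the fraction of queries whose top-ranked prediction overlaps the ground truth above the threshold. A minimal sketch of the metric; the function names are ours:

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments, each given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, iou_threshold=0.5):
    """Fraction of queries whose top-1 predicted moment reaches the threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, ground_truths))
    return hits / len(ground_truths)

# One query: predicted vs. ground-truth moment, in seconds.
print(round(temporal_iou((10.2, 24.8), (12.0, 25.0)), 2))  # 0.86
```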
## 3 LLM for TVG

### 2023

- [ACL] Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization. [code]
- [ICCVW] LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling.
- [NeurIPS] Self-Chained Image-Language Model for Video Localization and Question Answering. [code]
- [arXiv] Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos.
- [arXiv] LLM4VG: Large Language Models Evaluation for Video Grounding.
### 2024

- [ACL] GroundingGPT: Language Enhanced Multi-modal Grounding Model. [code]
- [CVPR] VTimeLLM: Empower LLM to grasp video moments. [code]
- [CVPR] TimeChat: A time-sensitive multimodal large language model for long video understanding. [code]
- [ECCV] Training-free video temporal grounding using large-scale pre-trained models. [code]
- [EMNLP] Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge. [code]
- [NeurIPS] SlowFocus: Enhancing fine-grained temporal understanding in video LLM. [code]
- [arXiv] The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval. [code]
- [arXiv] LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval.
- [arXiv] HawkEye: Training Video-Text LLMs for Grounding Text in Videos. [code]
- [arXiv] Video LLMs for temporal reasoning in long videos.
### 2025

- [TMM] ETC: Temporal boundary expand then clarify for weakly supervised video grounding with multimodal large language model.
- [AAAI] VTG-LLM: Integrating timestamp knowledge into video LLMs for enhanced video temporal grounding. [code]
- [AAAI] Zero-shot video moment retrieval via off-the-shelf multimodal large language models.
- [ICLR] TRACE: Temporal grounding video LLM via causal event modeling. [code]
- [ICLR] TimeSuite: Improving MLLMs for long video understanding via grounded tuning. [code]
- [CVPR] SVLTA: Benchmarking vision-language temporal alignment via synthetic video situation. [code]
- [CVPR] ReVisionLLM: Recursive vision-language model for temporal grounding in hour-long videos. [code]
- [CVPR] Number it: Temporal grounding videos like flipping manga. [code]
- [COLING] Mitigating the discrepancy between video and text temporal sequences: A time-perception enhanced video grounding method for LLM.
- [arXiv] Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization. [code]
- [arXiv] TimeRefine: Temporal grounding with time refining video LLM. [code]
- [arXiv] TimeZero: Temporal video grounding with reasoning-guided LVLM. [code]
- [arXiv] Time-R1: Post-training large vision language model for temporal video grounding. [code]
- [arXiv] MomentSeeker: A comprehensive benchmark and a strong baseline for moment retrieval within long videos.
- [arXiv] VideoExpert: Augmented LLM for temporal-sensitive video understanding.
- [arXiv] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models.
- [arXiv] VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning. [code]
- [arXiv] Invert4TVG: A temporal video grounding framework with inversion tasks for enhanced action understanding.
## 4 Traditional TVG

### 2017

The TSG task was first proposed this year.

**Proposal-based**
- [ICCV'17] TALL: Temporal Activity Localization via Language Query. Jiyang Gao, USC [code] (see the sliding-window sketch below)
- [ICCV'17] Localizing Moments in Video with Natural Language. Lisa Anne Hendricks, UC Berkeley [code]
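TALL scores sliding-window candidates against the sentence and regresses boundary offsets for the best-scoring ones. A minimal sketch of the candidate-generation side; the window lengths and overlap here are illustrative, not the paper's exact settings:

```python
def sliding_window_proposals(duration, window_sizes=(10.0, 20.0, 40.0), overlap=0.8):
    """Enumerate candidate moments (start_sec, end_sec) over a video.

    Each window length slides across the video with the given overlap ratio;
    a cross-modal model then scores every (candidate clip, sentence) pair and
    regresses fine-grained offsets for the winning boundaries.
    """
    proposals = []
    for size in window_sizes:
        stride = size * (1.0 - overlap)
        start = 0.0
        while start + size <= duration:
            proposals.append((start, start + size))
            start += stride
    return proposals

print(len(sliding_window_proposals(120.0)))  # candidate count for a 2-minute video
```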
### 2018

**Proposal-based**
- [EMNLP'18] Temporally Grounding Natural Sentence in Video. Tat-Seng Chua's team, NUS
- [IJCAI'18] Multi-modal Circulant Fusion for Video-to-Language and Backward. Yahong Han's team, Tianjin University
- [ACM MM'18] Cross-modal Moment Localization in Videos. Liqiang Nie's team, Shandong University [code]
- [SIGIR'18] Attentive Moment Retrieval in Videos. Liqiang Nie's team, Shandong University [code]

**Proposal-free**
- [AAAI'19] Localizing Natural Language in Videos. Tencent AI Lab

**Reconstruction-based**
- [NeurIPS'18] Weakly Supervised Dense Event Captioning in Videos. Wenwu Zhu's team, Tsinghua [code]
  - First proposed weakly supervised dense event captioning, whose training involves the TSG problem.
### 2019

**Proposal-based**
- [AAAI'19] Semantic Proposal for Activity Localization in Videos via Sentence Query. Yu-Gang Jiang's team, Fudan
- [CVPR'19] MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. Da Zhang, UCSB
- [ACM MM'19] Exploiting Temporal Relationships in Video Moment Localization with Natural Language. Jiebo Luo's team, University of Rochester [code]
- [NeurIPS'19] Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos. Wenwu Zhu's team, Tsinghua [code]
- [SIGIR'19] Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos. Zhou Zhao's team, Zhejiang University [code]
- [WACV'19] MAC: Mining Activity Concepts for Language-based Temporal Localization. USC [code]

**Proposal-free**
- [AAAI'19] Multilevel Language and Vision Integration for Text-to-Clip Retrieval. Huijuan Xu, Boston University [code]
- [AAAI'19] To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression. Wenwu Zhu's team, Tsinghua [code]
- [EMNLP'19] DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization. Jun Xiao's team, Zhejiang University

**RL-based**
- [AAAI'19] Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Baidu
- [CVPR'19] Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. Liang Wang's team, Chinese Academy of Sciences

**MIL-based**
- [CVPR'19] Weakly Supervised Video Moment Retrieval From Text Queries. Amit K. Roy-Chowdhury's team, UC Riverside [code]
  - Formally proposed the weakly supervised temporal sentence grounding task.
- [EMNLP'19] WSLLN: Weakly Supervised Natural Language Localization Networks. Salesforce
### 2020

**Proposal-based**
- [AAAI'20] Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. Jiebo Luo's team, University of Rochester [code]
  - First proposed the 2D-map formulation; most later proposal-based papers build on it (see the sketch below).
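In the 2D-map formulation, cell (i, j) of an N×N map represents the candidate moment that starts at clip i and ends at clip j (valid only for j ≥ i); pooled moment features are then scored jointly, so adjacent candidates give each other context. A minimal numpy sketch of building such a map, with mean pooling standing in for the paper's pooling choices:

```python
import numpy as np

def build_2d_moment_map(clip_feats):
    """clip_feats: (N, D) per-clip features.

    Returns an (N, N, D) map whose cell (i, j) holds the mean-pooled feature
    of the moment spanning clips i..j; cells with j < i stay zero (invalid).
    """
    n, d = clip_feats.shape
    cum = np.cumsum(clip_feats, axis=0)  # prefix sums over clips
    moment_map = np.zeros((n, n, d), dtype=clip_feats.dtype)
    for i in range(n):
        for j in range(i, n):
            total = cum[j] - (cum[i - 1] if i > 0 else 0.0)
            moment_map[i, j] = total / (j - i + 1)
    return moment_map

feats = np.random.randn(16, 32).astype(np.float32)
print(build_2d_moment_map(feats).shape)  # (16, 16, 32)
```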
**Proposal-free**
- [ACL'20] Span-based Localizing Network for Natural Language Video Localization. Aixin Sun's team, NTU [code] (see the span-decoding sketch below)
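Span-based methods treat the video as the "passage" of extractive QA: two heads output per-clip start and end distributions, and the prediction is the highest-scoring valid span. A minimal decoding sketch over precomputed logits; the O(N²) scan is for clarity and the names are ours:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_span(start_logits, end_logits, max_len=None):
    """Return (s, e) with e >= s maximizing start_prob[s] * end_prob[e]."""
    start_p, end_p = softmax(start_logits), softmax(end_logits)
    n = len(start_p)
    best, best_score = (0, 0), -1.0
    for s in range(n):
        last = n if max_len is None else min(n, s + max_len)  # cap span length
        for e in range(s, last):
            score = start_p[s] * end_p[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best  # clip indices; multiply by the clip duration to get seconds

rng = np.random.default_rng(0)
print(decode_span(rng.normal(size=100), rng.normal(size=100), max_len=30))
```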
**Reconstruction-based**
- [AAAI'20] Weakly-Supervised Video Moment Retrieval via Semantic Completion Network. Zhou Zhao's team, Zhejiang University [code]
  - First used masked reconstruction for the weakly supervised TSG task (see the sketch below).
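The masked-reconstruction idea: generate proposals, mask important words in the query, and reconstruct them from each proposal's visual feature; only the correct moment carries the needed semantics, so proposals are ranked by reconstruction quality. A toy sketch of that ranking logic, where a random projection stands in for the learned language decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_score(proposal_feat, masked_word_embs, W):
    """Higher = the proposal reconstructs the masked query words better.

    W projects visual space into word-embedding space; in the real model it
    is a trained decoder, here a random matrix for illustration.
    """
    pred = proposal_feat @ W                         # predicted word embeddings
    return -((pred - masked_word_embs) ** 2).mean()  # negative reconstruction loss

d_vis, d_word, n_masked = 32, 16, 2
W = rng.normal(size=(d_vis, d_word))
masked_words = rng.normal(size=(n_masked, d_word))
proposals = rng.normal(size=(5, d_vis))              # 5 candidate moments
scores = [reconstruction_score(p, masked_words, W) for p in proposals]
print(int(np.argmax(scores)))                        # best-reconstructing proposal
```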
### 2021

**Proposal-based**
- [SIGIR'21] Deconfounded Video Moment Retrieval with Causal Intervention. Tat-Seng Chua's team, NUS [code]
  - Introduces causal inference into TSG to remove the bias induced by moment-location priors in videos.
- [CVPR'21] Interventional Video Grounding with Dual Contrastive Learning. Guoshun Nan, BUPT
  - Contrastive learning + causal intervention.
- [CVPR'21] Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval. Da Cao's team, Hunan University
- [ICCV'21] Fast Video Moment Retrieval. Changsheng Xu's team, Chinese Academy of Sciences

**Proposal-free**
- [TPAMI'21] Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework. Aixin Sun's team, NTU
  - An extended version of VSLNet (ACL'20).
- [TMM'21] Frame-Wise Cross-Modal Matching for Video Moment Retrieval. Zhiyong Cheng's team, Qilu University of Technology [code]

**DETR-based**
- [NeurIPS'21] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries. Jie Lei, UNC [code]
  - Jointly tackles moment retrieval (MR) and highlight detection (HD); the first work to bring DETR into VMR (see the sketch below).
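The DETR-style formulation replaces hand-crafted proposals with a small set of learnable moment queries: the transformer decoder emits one normalized (center, width) span plus a confidence per query, trained with Hungarian matching. A minimal sketch of turning such decoder outputs into ranked moments; the names are ours:

```python
import numpy as np

def decode_moments(span_preds, scores, duration):
    """span_preds: (Q, 2) normalized (center, width), one per moment query.

    scores: (Q,) foreground confidence. Returns (start_sec, end_sec) spans,
    highest-confidence first.
    """
    centers, widths = span_preds[:, 0], span_preds[:, 1]
    starts = np.clip(centers - widths / 2, 0.0, 1.0) * duration
    ends = np.clip(centers + widths / 2, 0.0, 1.0) * duration
    order = np.argsort(-scores)
    return [(float(starts[q]), float(ends[q])) for q in order]

rng = np.random.default_rng(1)
spans = rng.uniform(0, 1, size=(10, 2))  # outputs of 10 moment queries
conf = rng.uniform(0, 1, size=10)
print(decode_moments(spans, conf, duration=150.0)[0])  # top-ranked moment
```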
**Unsupervised**

The unsupervised setting was first proposed in the works below.
- [ICCV'21] Zero-shot Natural Language Video Localization. Jonghyun Choi's team, Seoul National University [code]
- [TCSVT'21] Learning Video Moment Retrieval Without a Single Annotated Video. Changsheng Xu's team, Chinese Academy of Sciences
### 2022

**Proposal-based**
- [SIGIR'22] You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos. Xi Zhou's team, SJTU [code]
- [TCSVT'22] Efficient Video Grounding With Which-Where Reading Comprehension. Xi Zhou's team, SJTU

**Proposal-free**
- [TIP'22] HiSA: Hierarchically Semantic Associating for Video Temporal Grounding. Cheng Deng's team, Xidian University [code]

**DETR-based**
- [CVPR'22] UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection. Tencent ARC Lab [code]

**Reconstruction-based**
- [AAAI'22] Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining. Yang Liu's team, Peking University [code]
- [CVPR'22] Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning. Yang Liu's team, Peking University [code]
  - Mines negative samples to better distinguish easily confused scenes within the same video (see the sketch below).
  - Subsequent weakly supervised methods largely build on CPL as their baseline.
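CPL parameterizes each proposal as a Gaussian mask over frames (a predicted center and width) and contrasts the positive proposal against hard negatives mined from the same video. A minimal sketch of the Gaussian masking and masked pooling; the width-to-sigma mapping is an illustrative choice:

```python
import numpy as np

def gaussian_mask(center, width, n_frames):
    """Soft temporal mask; center and width are normalized to [0, 1]."""
    t = np.linspace(0.0, 1.0, n_frames)
    sigma = width / 2.0  # illustrative mapping from width to spread
    return np.exp(-0.5 * ((t - center) / (sigma + 1e-6)) ** 2)

def masked_pool(frame_feats, mask):
    """Weighted mean of frame features under a proposal's mask."""
    w = mask / (mask.sum() + 1e-6)
    return (frame_feats * w[:, None]).sum(axis=0)

frames = np.random.randn(100, 64)
pos = masked_pool(frames, gaussian_mask(0.4, 0.15, 100))  # positive proposal
neg = masked_pool(frames, gaussian_mask(0.8, 0.15, 100))  # mined negative
print(pos.shape, neg.shape)  # both feed a contrastive loss against the query
```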
**Point-supervised**

The single-frame (glance) supervision setting was first proposed in the works below.
- [TMM'22] Point-Supervised Video Temporal Grounding. Cheng Deng's team, Xidian University
- [SIGIR'22] Video Moment Retrieval from Text Queries via Single Frame Annotation. Yu-Gang Jiang's team, Fudan [code]
### 2023

**Proposal-based**
- [AAAI'23] Phrase-Level Temporal Relationship Mining for Temporal Sentence Localization. Yang Liu's team, Peking University [code]
- [ICCV'23] G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory. Yuexian Zou's team, Peking University

**Proposal-free**

**DETR-based**
- [ACL'23] MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction. Aixin Sun's team, NTU [code]
- [CVPR'23] Query-Dependent Video Representation for Moment Retrieval and Highlight Detection. Jae-Pil Heo's team, Sungkyunkwan University [code]
- [ICCV'23] Knowing Where to Focus: Event-aware Transformer for Video Grounding. Kwanghoon Sohn's team, Yonsei University [code]
- [NeurIPS'23] MomentDiff: Generative Video Moment Retrieval from Random to Real. Hongtao Xie's team, USTC [code]
  - Uses diffusion-style denoising to generate the predicted moment (see the sketch below).
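MomentDiff treats the target span as the clean signal of a diffusion process: training noises the ground-truth span, and inference starts from random spans and denoises them step by step, conditioned on the video and the query. A schematic sketch of the forward-noising step under standard DDPM assumptions; the schedule values are illustrative:

```python
import numpy as np

def noise_span(span_0, t, alphas_cumprod, rng):
    """Forward diffusion q(x_t | x_0) on a normalized (center, width) span."""
    a_bar = alphas_cumprod[t]
    eps = rng.normal(size=span_0.shape)
    return np.sqrt(a_bar) * span_0 + np.sqrt(1.0 - a_bar) * eps, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

x0 = np.array([0.45, 0.20])  # ground-truth (center, width), normalized
x_t, eps = noise_span(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
# A video- and query-conditioned denoiser is trained to recover x0 (or eps)
# from (x_t, t); at inference it iteratively refines random spans.
print(x_t)
```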
**Bias**
- [AAAI'23] Curriculum Multi-Negative Augmentation for Debiased Video Grounding. Wenwu Zhu's team, Tsinghua

**Reconstruction-based**
- [CVPR'23] Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training. Yoichi Sato's team, University of Tokyo
- [CVPR'23] Iterative Proposal Refinement for Weakly-Supervised Video Grounding. Yuexian Zou's team, Peking University
- [ICCV'23] SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval. Chang D. Yoo's team, KAIST
- [ICCV'23] D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation. Tencent YouTu Lab [code]
- [ACL'23] Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization. Yang Liu's team, Peking University [code]
### 2024

**Proposal-based**
- [ACM MM'24] Maskable Retentive Network for Video Moment Retrieval. Meng Wang's team, Hefei University of Technology [code]
- [AAAI'24] Exploiting Auxiliary Caption for Video Grounding. Yuexian Zou's team, Peking University

**Proposal-free**

**DETR-based**
- [AAAI'24] Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval. Hongtao Xie's team, USTC [code]
  - Targets the modality-imbalance problem.
- [AAAI'24] TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection. Wei Xie's team, Central China Normal University [code]
- [CVPR'24] Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection. Ping Wei's team, Xi'an Jiaotong University [code]
- [CVPR'24] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection. Xiu Li's team, Tsinghua [code]
- [ACM MM'24] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval. Xiaoyong Wei's team, Hong Kong Baptist University [code]

**Bias**
- [AAAI'24] Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video. Weigang Zhang's team, Harbin Institute of Technology [code]

**Reconstruction-based**
- [AAAI'24] Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding. Jin Young Choi's team, Seoul National University [code]
- [AAAI'24] Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. Alex C. Kot's team, NTU
- [PR'24] Triadic temporal-semantic alignment for weakly-supervised video moment retrieval. Fengyu Zhou's team, Shandong University
- [ACL'24] Exploiting Intrinsic Multilateral Logical Rules for Weakly Supervised Natural Language Video Localization. Cheng Deng's team, Xidian University