📹 长视频理解 · 记忆机制 / Long Video Understanding · Memory

长视频理解与记忆机制综述
2019–2026

Long Video Understanding
with Memory Survey 2019–2026

从稀疏采样到流式记忆,从外部存储到百万级 token 上下文窗口 —— 系统梳理长视频理解技术的演进脉络与前沿趋势。

From sparse sampling to streaming memory, from external storage to million-token context windows — a systematic survey of long video understanding and memory techniques.

50+
核心论文
Key Papers
5
技术范式
Paradigms
12+
主流基准
Benchmarks
2026
持续更新
Updated
🌐
方向概述
Direction Overview
Long Video Understanding with Memory — What & Why
长视频理解与记忆 — 是什么 & 为什么

🎯 核心问题

🎯 Core Problem

视频时长爆炸式增长(电影/直播/监控),而现有多模态大模型受限于上下文窗口(<128K tokens),无法端到端处理超长视频。

Video durations are exploding (films, live streams, surveillance), but current multimodal LLMs are constrained by context windows (<128K tokens), making end-to-end processing of ultra-long video infeasible.

🧠 记忆的意义

🧠 Role of Memory

记忆机制让模型能以固定计算代价访问任意时刻的历史信息,是解决长时依赖推理("两小时前发生了什么")的关键。

Memory mechanisms allow models to access historical information at any moment with fixed compute cost, crucial for long-range temporal reasoning ("what happened two hours ago").

🔗 与 LLM 的融合

🔗 LLM Integration

2023年以来,主流方向从纯视觉模型转向 Video LMM/MLLM:将视频帧编码为 token 注入 LLM,关键挑战是如何在 token 数量 vs. 信息保真度之间取得平衡。

Since 2023, the mainstream direction has shifted from pure vision models to Video LMMs/MLLMs: video frames are encoded as tokens and injected into an LLM. The key challenge is balancing token count against information fidelity.

🚀 应用场景

🚀 Applications

电影问答、体育分析、医疗视频诊断、会议记录、监控异常检测、具身 Agent 长程规划、视频检索与摘要。

Movie QA, sports analytics, medical video diagnosis, meeting transcription, surveillance anomaly detection, embodied agent long-horizon planning, video retrieval and summarization.

⚠️
核心挑战
Core Challenges
💾

计算与内存瓶颈

Compute & Memory Bottleneck

视频帧数与 token 数呈线性增长,Attention 复杂度 O(n²),直接处理 1 小时视频需百万级 token。

Frame and token counts grow linearly with duration, while attention complexity is O(n²); directly processing one hour of video requires on the order of a million tokens.
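The scale above can be made concrete with a back-of-envelope calculation (a sketch; the 1 fps sampling rate and 256 tokens per frame are illustrative assumptions, not tied to any specific model):

```python
# Back-of-envelope token count for 1 hour of video.
# Assumptions (illustrative): sample at 1 fps, each frame encoded
# to 256 visual tokens (e.g., a ViT patch grid).
fps_sampled = 1
duration_s = 3600          # 1 hour
tokens_per_frame = 256

n_frames = fps_sampled * duration_s
n_tokens = n_frames * tokens_per_frame
print(n_frames, n_tokens)   # 3600 921600  -> ~1M tokens

# Self-attention cost grows quadratically with sequence length:
attn_pairs = n_tokens ** 2  # query-key pairs per attention layer
print(f"{attn_pairs:.1e}")  # 8.5e+11
```

Even at this very sparse 1 fps, the sequence already approaches a million tokens, which is why most methods compress or forget before the LLM ever sees the frames.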

长时依赖推理

Long-Range Temporal Reasoning

问答需要跨越数小时的跨帧推理(如"主角的动机是什么"),稀疏采样丢失关键帧,密集采样超出窗口。

QA requires cross-hour temporal reasoning (e.g., "what motivated the character"). Sparse sampling drops key frames; dense sampling exceeds the window.

🔍

时序定位精度

Temporal Grounding Accuracy

在小时级视频中精准定位特定事件的时间戳(秒级),需要细粒度的时序感知能力。

Precisely localizing specific event timestamps (second-level precision) in hour-long videos requires fine-grained temporal perception.

🗜️

信息压缩 vs. 保真

Compression vs. Fidelity

激进压缩(如平均池化)会抹去局部细节;过于保守(保留所有 token)则计算不可行。寻找最优压缩策略是核心。

Aggressive compression (e.g., average pooling) erases local details; conservative approaches (keep all tokens) are computationally infeasible. Finding optimal compression is the core challenge.
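A toy illustration of the fidelity loss from aggressive temporal pooling (purely synthetic numbers, not from any real model):

```python
# Temporal average pooling compresses N frame features into one,
# but a transient key event (a spike in one feature dimension)
# gets diluted in proportion to the pooling window.

frames = [
    [0.0, 0.0],  # ordinary frame
    [0.0, 0.0],
    [0.0, 5.0],  # brief key event in feature dim 1
    [0.0, 0.0],
]

def avg_pool(feats):
    """Average features over the time axis (aggressive compression)."""
    n = len(feats)
    return [sum(f[d] for f in feats) / n for d in range(len(feats[0]))]

pooled = avg_pool(frames)
print(pooled)  # [0.0, 1.25] -- the 5.0 spike is diluted 4x
```

With a realistic window of hundreds of frames, the same spike would be diluted hundreds of times, which is the core argument for selective rather than uniform compression.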

🌊

流式 / 在线处理

Streaming / Online Processing

直播与监控场景要求模型实时处理视频流,不能等待全部帧到齐,因果约束下的记忆设计需要特殊考量。

Live streaming and surveillance require real-time processing without waiting for all frames. Causal memory design under streaming constraints needs special treatment.

🤔

多模态对齐

Multimodal Alignment

视觉 token 与文本 token 的语义对齐在长视频中更难维持,随时序增长,历史帧的视觉信息与当前查询的相关性逐渐退化。

Maintaining semantic alignment between visual and text tokens is harder in long videos; historical frame relevance to current queries degrades over time.

📅
发展时间线
Development Timeline
2019 → 2026
2019 · 奠基期 / Foundation Era

VideoBERT (Sun et al., Google)

首次将 BERT 自监督预训练迁移至视频-文本联合学习;将视频离散化为视觉 token,与文字 token 交织训练。奠定了视频-语言预训练范式,长视频研究的远祖。

First application of BERT-style self-supervised pretraining to joint video-text learning; discretized video into visual tokens interleaved with text. Laid the foundation for video-language pretraining.

Video-Language Pretraining · Token
2021 · 长视频建模启动 / Long Video Modeling Takes Off

Longformer / BigBird for Video (multiple works)

稀疏注意力机制(滑动窗口 + 全局 token)被引入视频理解,将 O(n²) 复杂度降为 O(n),首次让长序列视频 Transformer 变得可行。

Sparse attention (sliding window + global tokens) introduced to video understanding, reducing O(n²) to O(n) complexity. Made long-sequence video Transformers feasible for the first time.

Sparse Attention · Efficiency
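The sliding-window + global-token pattern can be sketched as a boolean attention mask; window size and the global position are illustrative choices, not values from either paper:

```python
# Longformer-style sparse attention mask: each token attends to a local
# window plus a few designated global tokens, so the number of allowed
# query-key pairs is O(n * (window + globals)) instead of O(n^2).

def sparse_mask(n, window, global_idx):
    """mask[i][j] = True if query i may attend to key j."""
    g = set(global_idx)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window            # sliding window
            mask[i][j] = local or i in g or j in g  # globals see / are seen by all
    return mask

m = sparse_mask(n=8, window=1, global_idx=[0])
nnz = sum(sum(row) for row in m)
print(nnz)  # 34 allowed pairs, far fewer than the dense 8*8 = 64
```

At video scale (n in the tens of thousands of frame tokens), this gap between O(n·w) and O(n²) is what first made long-sequence video Transformers trainable.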
2022 · 视频-LLM 元年 / Year One of Video-LLMs

Frozen (Tsimpoukelli et al., DeepMind) / FrozenBiLM (Yang et al.)

冻结 LLM 权重,仅通过可学习视觉适配层将视频帧注入语言模型,验证了 LLM 作为视频推理引擎的潜力,奠定"冻结 LLM + 视觉桥接"范式。

Froze LLM weights, injecting video frames via learnable visual adapters. Validated LLMs as video reasoning engines; established the "frozen LLM + visual bridge" paradigm.

Video-LLM · Adapter
2023-Q1 · 对话式视频理解爆发 / Conversational Video Understanding Boom

VideoChat (OpenGVLab) / Video-ChatGPT (MBZUAI)

首批基于对话的视频多模态模型。VideoChat 引入事件记忆(Event Memory),将视频语义压缩为文字摘要输入 LLM;Video-ChatGPT 完成大规模视频指令调优数据集构建。

First conversational video multimodal models. VideoChat introduced event memory (compressing video semantics into text summaries for LLM). Video-ChatGPT built large-scale video instruction tuning datasets.

Instruction Tuning · Event Memory
2023-Q3 · 长视频稀疏记忆先驱 / Sparse Memory Pioneers for Long Video

MovieChat (CVPR 2024) ⭐ 长视频记忆命名起点 / Naming Origin of Long-Video Memory

以 Atkinson-Shiffrin 认知记忆模型(短时记忆 + 长时记忆)为框架,设计 Dense Token → Sparse Memory 的两阶段压缩,首次支持 10K+ 帧长视频理解,配套发布 MovieChat-1K 基准。

Framed on the Atkinson-Shiffrin memory model (short-term + long-term memory), it designed a Dense Token → Sparse Memory two-stage compression, the first to support 10K+ frame long video understanding, and released the MovieChat-1K benchmark.

Sparse Memory · Cognitive Memory · 10K Frames
2023-Q4 · 时序理解深化 / Deepening Temporal Understanding

TimeChat (ICLR 2024) & LLaMA-VID

TimeChat 引入时间感知帧编码器(Timestamp-aware),每帧附加绝对时间戳 token,使模型具备时序定位能力。LLaMA-VID 将每帧压缩至 2 个 token(上下文 + 内容),使百分钟级视频成为可能。

TimeChat introduced timestamp-aware frame encoding, appending absolute timestamp tokens to each frame for temporal grounding. LLaMA-VID compressed each frame to 2 tokens (context + content), enabling hour-long video processing.

Temporal Grounding · Token Compression
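The 2-tokens-per-frame idea can be sketched as follows; this is a simplified stand-in (toy vectors, plain dot-product attention in place of the model's learned projections), not LLaMA-VID's actual implementation:

```python
import math

# Represent each frame with just 2 tokens: a "context" token
# (text-query-conditioned pooling over patch features) and a
# "content" token (plain average over the same patches).

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def frame_to_2_tokens(patches, query):
    d = len(patches[0])
    # content token: uniform average over patch features
    content = [sum(p[i] for p in patches) / len(patches) for i in range(d)]
    # context token: attention-pool patches with the text query
    weights = softmax([sum(q * p[i] for i, q in enumerate(query)) for p in patches])
    context = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(d)]
    return context, content

patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]                    # query "attends" to the first patch
context, content = frame_to_2_tokens(patches, query)
print(content)   # [0.25, 0.75]
print(context)   # pulled toward [1, 0] relative to the plain average
```

The content token keeps a coarse summary of every frame, while the context token lets the question decide which part of the frame survives compression.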
2024-Q1 · 记忆增强 Agent 涌现 / Memory-Augmented Agents Emerge

VideoAgent (ECCV 2024) · VideoAgent-Long (EMNLP 2024)

VideoAgent 构建结构化记忆(时序事件描述 + 对象追踪状态),迭代召回相关片段回答问题,开创"基于外部记忆的视频 Agent"范式。VideoAgent-Long 专注小时级长视频,用 GPT-4V 逐步构建语义记忆库。

VideoAgent built structured memory (temporal event descriptions + object tracking states), iteratively retrieving relevant clips to answer questions. Pioneered the "external-memory video agent" paradigm. VideoAgent-Long targeted hour-long videos using GPT-4V to progressively build a semantic memory store.

Memory Agent · Structured Memory · Iterative Retrieval
2024-Q2 · 大厂长上下文窗口竞赛 / Industry Long-Context Race

Gemini 1.5 Pro (Google) — 1M Token Context ⭐

Gemini 1.5 Pro 以 100 万 token 超长上下文窗口实现端到端长视频理解,无需外部记忆模块即可处理 1 小时以上视频。引发"暴力扩展上下文 vs. 精细记忆设计"的学术争论,在 Video-MME 基准上全面领先 GPT-4o。

Gemini 1.5 Pro's 1M-token context window enabled end-to-end long video understanding without external memory modules, processing videos over an hour long. It sparked the debate over "brute-force context scaling vs. fine-grained memory design" and outperformed GPT-4o on the Video-MME benchmark.

Long Context · 1M Tokens · Industry
2024-Q2 · 记忆增强 LMM 双雄 / Two Leading Memory-Augmented LMMs

MA-LMM (CVPR 2024) & MovieChat+

MA-LMM 设计可插拔长期记忆库(Long-term Memory Bank),在线压缩历史视觉 token,支持视频描述/预测/问答多任务,是记忆增强视频 LMM 的标准基线。MovieChat+ 改进为问题感知稀疏记忆(Question-aware),提升检索相关性。

MA-LMM designed a plug-in long-term memory bank that compresses historical visual tokens online, supporting video captioning, prediction, and QA; it serves as the standard baseline for memory-augmented video LMMs. MovieChat+ upgraded to question-aware sparse memory for better retrieval relevance.

Memory Bank · Online Compression
2024-Q3 · 层次事件记忆 / Hierarchical Event Memory

HEM-LLM: Hierarchical Event-Based Memory (2024)

设计层次化事件记忆机制,将视频切分为多粒度事件(片段→事件→故事弧),每层独立压缩后跨层级检索,缓解平坦记忆结构的信息稀释问题。

Designed hierarchical event-based memory, segmenting videos at multiple granularities (clips → events → story arcs), compressing each level independently and enabling cross-level retrieval to alleviate flat-memory information dilution.

Hierarchical Memory · Event Segmentation
2024-Q4 · 流式记忆 & 大规模训练 / Streaming Memory & Large-Scale Training

LLaVA-Video (ByteDance / NeurIPS 2024) & VideoStreaming

LLaVA-Video 通过大规模合成视频指令数据(1.6M 视频-文本对)显著提升长视频理解,成为开源视频 LMM 的新基线。VideoStreaming 提出固定 token 预算下的流式压缩,常数计算量处理任意长度视频。

LLaVA-Video significantly improved long video understanding via large-scale synthetic instruction data (1.6M video-text pairs), becoming the new open-source Video LMM baseline. VideoStreaming proposed streaming compression under a fixed token budget, processing arbitrary-length video at constant compute.

Streaming · Synthetic Data · Open Source
2024-Q4 · 可学习记忆 / Learnable Memory

ReWind (CVPR 2025) — Instructed Learnable Memory

两阶段框架:阶段一维护动态记忆(按问题指令更新),阶段二在记忆 + 关键帧上进行推理。引入"Instructed Learnable Memory"——记忆内容由用户查询驱动动态调整,而非固定压缩规则。

Two-stage framework: Stage 1 maintains dynamic memory (updated by query instruction); Stage 2 reasons over memory + key frames. Introduced "Instructed Learnable Memory" — memory content is dynamically adjusted by user queries rather than fixed compression rules.

Learnable Memory · Query-Driven · CVPR 2025
2024-Q4 · 循环记忆桥接 / Recurrent Memory Bridges

VideoLLaMB (ICCV 2025) — Recurrent Memory Bridges

引入循环记忆桥接(Recurrent Memory Bridges)和时序记忆 token(Temporal Memory Tokens),在保持流式处理的同时维持跨片段的语义连贯性,以极低的 token 开销支持长程推理。

Introduced Recurrent Memory Bridges and Temporal Memory Tokens, maintaining semantic coherence across segments during streaming processing, supporting long-range reasoning with very low token overhead.

Recurrent Memory · Streaming · ICCV 2025
2024-Q4 · 超长视频自适应记忆 / Adaptive Memory for Ultra-Long Video

VideoMem (2024) — Adaptive Memory for Ultra-Long Video

专攻超长视频(小时级以上),引入自适应记忆更新策略,根据语义重要性动态决定哪些帧应进入长期记忆,哪些可丢弃,有效解决超长时序的信息保留问题。

Specialized for ultra-long video (hour+), introduced adaptive memory update strategy that dynamically decides which frames enter long-term memory based on semantic importance, effectively solving ultra-long temporal information retention.

Adaptive Memory · Ultra-Long
2025-Q1 · 🆕 RAG-Video & 图记忆 / RAG for Video & Graph Memory

Vgent (NeurIPS 2025, KAUST × Meta AI) & Graph-based Memory

Vgent 将检索增强生成(RAG)引入视频 Agent,构建基于图结构的视频知识库,通过多跳检索回答复杂跨段问题。图节点=场景/对象,边=时序/因果关系,突破了线性时序记忆的局限。

Vgent brought Retrieval-Augmented Generation (RAG) to video agents, constructing a graph-structured video knowledge base with multi-hop retrieval for complex cross-segment questions. Graph nodes = scenes/objects; edges = temporal/causal relations. Overcame limitations of linear temporal memory.

RAG · Graph Memory · Multi-hop
2025-Q1 · 🆕 固定记忆窗口加速 / Fixed-Memory-Window Acceleration

Long-VMNet (2025) — Fixed Memory for Acceleration

提出固定记忆窗口(Fixed Memory Window)加速长视频理解,相比动态记忆方案显著降低延迟,在多个基准上以极小精度损失换取大幅推理加速。

Proposed Fixed Memory Window to accelerate long video understanding, significantly reducing latency compared to dynamic memory approaches while achieving large speedups with minimal accuracy loss on multiple benchmarks.

Fixed Memory · Efficiency
2025-Q2 · 🆕 事件中心情景记忆 / Event-Centric Episodic Memory

Video-EM (2025) — Event-Centric Episodic Memory

以"事件"为基本单元构建情景记忆(Episodic Memory),克服传统基于帧/片段的碎片化检索问题,强化叙事主线的完整性,在长视频问答中实现更连贯的时序证据链。

Built episodic memory with "events" as the fundamental unit, overcoming fragmented retrieval in frame/clip-based approaches, reinforcing narrative completeness and achieving more coherent temporal evidence chains in long video QA.

Episodic Memory · Event-Centric
2025-Q3 · 🆕 深度回溯记忆 / Deep Memory Backtracking

VideoLucy (2025) — Deep Memory Backtracking

受人类回忆"由粗到精"过程启发,设计深度记忆回溯框架:先粗粒度定位大致时段,再细粒度回溯精确片段,避免全量扫描的计算浪费,同时保证关键细节不丢失。

Inspired by human coarse-to-fine recall, designed a deep memory backtracking framework: coarse-grained temporal localization followed by fine-grained segment backtracking, avoiding full-scan compute waste while retaining critical details.

Deep Backtracking · Coarse-to-Fine
2025-Q4 · 🆕 流式 Token 压缩 / Streaming Token Compression

StreamingTOM (2025) — Streaming Token Compression

无需训练的两阶段插件式框架,同时解决 LLM 前(视觉 token 冗余)和 LLM 后(KV Cache 膨胀)两个瓶颈;因果时序缩减 + 语义 token 合并;预测延迟稳定、精度与 SOTA 持平。

Training-free, plug-and-play two-stage framework addressing both pre-LLM (visual token redundancy) and post-LLM (KV Cache inflation) bottlenecks. Causal temporal reduction + semantic token merging; predictable latency with state-of-the-art accuracy.

Token Compression · Training-Free · Streaming
🔬
五大技术范式
Five Technical Paradigms
Long Video Understanding with Memory — Taxonomy

📸 稀疏采样范式 (Sparse Sampling)

📸 Sparse Sampling Paradigm

核心思路:从长视频中均匀或自适应地抽取 N 帧(通常 8–64),将所有帧 token 拼接送入模型。优点是实现简单;缺点是面对小时级视频时采样率极低,高密度事件段信息丢失严重。

Core idea: uniformly or adaptively sample N frames (typically 8–64) from the long video, concatenating all frame tokens as model input. Simple to implement; suffers from extreme undersampling on hour-long video, causing severe information loss in dense event segments.
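A minimal sketch of uniform sampling, showing how sparse it becomes at hour scale (the 30 fps source rate and N = 32 are illustrative assumptions):

```python
# Uniform sparse sampling: pick N frame indices evenly spread over the
# whole video. At hour scale this is extremely sparse.

def uniform_indices(total_frames, n):
    """Center-of-bin uniform sampling of n frame indices."""
    return [int((i + 0.5) * total_frames / n) for i in range(n)]

idx = uniform_indices(total_frames=3600 * 30, n=32)  # 1 h at 30 fps
print(idx[:3])                    # [1687, 5062, 8437]
gap_s = (idx[1] - idx[0]) / 30
print(gap_s)                      # 112.5 s of unseen video between samples
```

Any event shorter than ~2 minutes can fall entirely between two sampled frames, which is exactly the failure mode described above for dense event segments.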

2022

CLIP-ViP

视频-文本对比学习,均匀采样16帧,奠定基础采样基线。

Video-text contrastive learning with 16-frame uniform sampling; established the basic sampling baseline.

Sampling · CLIP
2023

Video-LLaMA

将 BLIP-2 的图像理解扩展至视频,帧级 Q-Former 提取视觉特征后注入 LLaMA。

Extended BLIP-2 image understanding to video with a frame-level Q-Former injecting features into LLaMA.

Video-LLM · Q-Former
2023

LLaMA-VID

极端压缩:每帧仅保留 2 token(上下文+内容),首次使百分钟级视频处理成为可能。

Extreme compression: 2 tokens per frame (context + content). First to make hour-length video processing feasible.

arxiv 2311.17043
2-token/frame · Compression
2024

Video Token Merging (NeurIPS 2024)

探索多种视频 token 合并策略(均匀/区域集中/时序),在长视频分类上以低算力匹配高算力模型。

Explored video token merging strategies (uniform/region-concentrated/temporal) for long video classification, matching high-compute models at low cost.

NeurIPS 2024
Token Merging
2024

LLaVA-Video (ByteDance)

1.6M 合成视频指令数据 + 高质量采样策略,成为开源视频 LMM 的新 SOTA 基线。

1.6M synthetic video instruction data + high-quality sampling; new open-source Video LMM SOTA baseline.

arxiv 2411.10442
Instruction Tuning · SOTA
2025

DYTO (ICCV 2025)

动态 token 合并:层次帧选择 + 二部图 token 合并,零样本自适应压缩,平衡关键帧保留与冗余消除。

Dynamic token merging: hierarchical frame selection + bipartite token merging; zero-shot adaptive compression balancing key frame retention and redundancy removal.

ICCV 2025
Dynamic Merging · Zero-Shot

🧠 记忆增强范式 (Memory-Augmented)

🧠 Memory-Augmented Paradigm

核心思路:将"已处理"的视频帧信息压缩存入外部/内部记忆库,推理时按需检索或直接读取。分为短时缓冲(近 K 帧精细 token)+ 长期压缩存储(历史帧蒸馏后的稀疏表示)。受人类认知记忆模型启发。

Core idea: compress processed frame information into external/internal memory banks; retrieve or read on demand during inference. Consists of short-term buffer (recent K frames' fine-grained tokens) + long-term compressed storage (distilled sparse representations). Inspired by human cognitive memory models.
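The merge-on-overflow behavior of such memory banks can be sketched as follows, loosely in the spirit of MovieChat/MA-LMM (toy 2-D features; real systems operate on high-dimensional token embeddings, and the merge rule here is a simplified stand-in):

```python
import math

# Fixed-capacity long-term memory: when the bank is full, merge the two
# most similar adjacent entries (cosine similarity), so memory size stays
# constant regardless of video length.

def cos(a, b):
    na, nb = math.hypot(*a), math.hypot(*b)
    return (a[0] * b[0] + a[1] * b[1]) / (na * nb)

class MemoryBank:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mem = []

    def add(self, feat):
        self.mem.append(feat)
        if len(self.mem) > self.capacity:
            # find the most redundant adjacent pair and average it
            i = max(range(len(self.mem) - 1),
                    key=lambda k: cos(self.mem[k], self.mem[k + 1]))
            merged = [(x + y) / 2 for x, y in zip(self.mem[i], self.mem[i + 1])]
            self.mem[i:i + 2] = [merged]

bank = MemoryBank(capacity=3)
for f in [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.9]]:
    bank.add(f)
print(len(bank.mem))  # 3 -- the near-duplicate first two frames were merged
```

Redundant consecutive frames collapse into one slot while distinctive ones survive, which is the basic mechanism behind "constant compute for arbitrary length".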

2023

MovieChat (CVPR 2024)

Atkinson-Shiffrin 模型:短时 token + 长期稀疏记忆,10K 帧突破。

Atkinson-Shiffrin model: short-term tokens + long-term sparse memory; 10K frame breakthrough.

CVPR 2024
Short+Long Memory
2024

MA-LMM (CVPR 2024)

即插即用记忆库,在线 token 压缩,支持多任务;标准基线。

Plug-in memory bank with online token compression supporting multi-task; standard baseline.

CVPR 2024
Memory Bank · Multi-task
2024

MovieChat+

问题感知稀疏记忆,检索相关性大幅提升。

Question-aware sparse memory with significantly improved retrieval relevance.

arxiv 2404.17176
Query-Aware
2024

HEM-LLM

层次化事件记忆,多粒度分层压缩,跨级检索。

Hierarchical event memory with multi-granularity layered compression and cross-level retrieval.

arxiv 2409.06299
Hierarchical · Event
2024

ReWind (CVPR 2025)

指令驱动可学习记忆,两阶段(建记忆→推理)框架,时序保真。

Instruction-driven learnable memory in a two-stage (build memory → reason) framework with temporal fidelity.

CVPR 2025
Learnable · Instruction-Driven
2024

VideoLLaMB (ICCV 2025)

循环记忆桥接 + 时序记忆 token,流式高效,长程连贯。

Recurrent memory bridges + temporal memory tokens; streaming-efficient with long-range coherence.

ICCV 2025
Recurrent · Streaming
2024

VideoMem

超长视频自适应记忆,语义重要性动态决策,小时级以上视频。

Adaptive memory for ultra-long video; semantic importance-based dynamic decision-making for hour+ video.

arxiv 2512.04540
Adaptive · Ultra-Long
2025

Video-EM

事件中心情景记忆,叙事完整性强化,跨段推理连贯。

Event-centric episodic memory reinforcing narrative completeness and cross-segment reasoning coherence.

arxiv 2508.09486
Episodic · Narrative
2025

VideoLucy

深度记忆回溯,由粗到精,仿人类回忆过程。

Deep memory backtracking; coarse-to-fine, mimicking human recollection process.

arxiv 2510.12422
Backtracking · Coarse-to-Fine

📏 长上下文扩展范式 (Long Context Scaling)

📏 Long Context Scaling Paradigm

核心思路:通过扩展模型原生上下文窗口(RoPE 外推/YaRN/位置插值等),让模型直接接受百万级 token 的视频序列,无需外部记忆模块。代表:Gemini 1.5(1M)、GPT-4o(128K)。优点是架构简洁,端到端推理;缺点是推理计算量随上下文平方增长,成本极高。

Core idea: extend native context window (RoPE extrapolation/YaRN/position interpolation) to accept million-token video sequences directly, without external memory. Examples: Gemini 1.5 (1M), GPT-4o (128K). Advantages: simple architecture, end-to-end; drawbacks: inference cost scales quadratically with context.
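Linear position interpolation, one of the RoPE-extension tricks mentioned above, can be sketched in a few lines (the dimensions and context lengths are illustrative, not from any particular model):

```python
# To run a model trained with context L_train at L_target > L_train,
# rescale positions by L_train / L_target before computing rotary angles,
# squeezing new positions back into the trained angular range.

def rope_angle(pos, dim_pair, d_model, base=10000.0, scale=1.0):
    """Rotary angle for position `pos` at frequency pair `dim_pair`."""
    inv_freq = base ** (-2 * dim_pair / d_model)
    return (pos * scale) * inv_freq

L_train, L_target = 4096, 32768
scale = L_train / L_target            # 1/8

# Position 32768 with interpolation gets the same angle as 4096 without:
a_interp = rope_angle(32768, dim_pair=3, d_model=64, scale=scale)
a_train = rope_angle(4096, dim_pair=3, d_model=64)
print(a_interp == a_train)  # True
```

Note this only fixes position encoding; the O(n²) attention cost over the longer sequence remains, which is the drawback stated above.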

2024

Gemini 1.5 Pro (Google)

1M token 上下文,原生处理整部电影;Video-MME 基准全面领先 GPT-4o。

1M token context, natively processes entire movies. Outperformed GPT-4o on Video-MME benchmark across all categories.

2024 Technical Report
1M Context · Industry
2024

GPT-4o (OpenAI)

128K 上下文,多模态原生支持,视频理解能力强但上下文仍有限。

128K context, native multimodal support; strong video understanding but context still limited.

128K Context · Industry
2024

LongVA

长上下文 LLM(224K)用于高分辨率长视频,无需专项记忆模块。

Long-context LLM (224K) for high-resolution long video, requiring no dedicated memory module.

224K Context · High-Res
2025

Long-VMNet

固定记忆窗口加速方案,用小计算量近似长上下文效果。

Fixed memory window acceleration approximating long-context performance at reduced compute.

arxiv 2503.13707
Fixed Window · Acceleration

🔎 检索增强范式 (RAG for Video)

🔎 Retrieval-Augmented Generation for Video

核心思路:将长视频预处理为索引(文本描述/视觉向量),问答时动态检索相关片段再交给 LLM 推理,避免全量处理。优点是可扩展性强、支持超长视频;缺点是检索错误会级联影响推理质量。

Core idea: preprocess long video into an index (text descriptions/visual vectors), dynamically retrieve relevant clips during QA then hand to LLM for reasoning. Highly scalable, supports ultra-long video; retrieval errors can cascade and degrade reasoning quality.
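A minimal sketch of the index-then-retrieve flow at the caption level (the clip captions and word-overlap scoring below are toy stand-ins for real embedding models and ANN search):

```python
# RAG for video, text-level variant: pre-caption each clip, index the
# captions, and retrieve the top-k clips for a question before handing
# only those clips to an LLM for reasoning.

def tokenize(s):
    return set(s.lower().split())

index = {  # clip time range -> precomputed caption
    "00:00-05:00": "a man unlocks a red car in a parking garage",
    "05:00-10:00": "two people argue in a kitchen",
    "10:00-15:00": "the red car drives away at night",
}

def retrieve(question, k=2):
    q = tokenize(question)
    scored = sorted(index.items(),
                    key=lambda kv: len(q & tokenize(kv[1])),
                    reverse=True)
    return [clip for clip, _ in scored[:k]]

hits = retrieve("where did the red car go")
print(hits)  # ['10:00-15:00', '00:00-05:00'] -- both red-car clips
```

The scalability comes from never re-processing the full video per question; the cascade risk comes from the same place — a clip the retriever misses is invisible to the LLM.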

2024

VideoAgent (ECCV 2024)

结构化事件记忆 + 对象状态追踪,迭代检索-推理循环。

Structured event memory + object tracking states; iterative retrieve-reason loop.

ECCV 2024
Structured Memory · Iterative
2024

VideoAgent-Long

GPT-4V 辅助构建语义记忆库,专攻小时级长视频问答。

GPT-4V-assisted semantic memory store construction for hour-long video QA.

arxiv 2403.10517
Hour-Long · GPT-4V
2024

EgoVideo-RAG

专为自我中心视频设计的 RAG,动作/对象检索融合,EgoSchema 基准提升。

RAG designed for egocentric video, fusing action/object retrieval; improved EgoSchema benchmark results.

Egocentric · Fusion Retrieval
2025

Vgent (NeurIPS 2025)

图结构知识库 + 多跳检索,突破线性时序记忆局限,复杂推理大幅提升。

Graph-structured knowledge base + multi-hop retrieval; breaks linear temporal memory limitations with major complex reasoning gains.

Graph RAG · Multi-hop

🤖 Agent 式理解范式 (Agentic Video Understanding)

🤖 Agentic Video Understanding Paradigm

核心思路:将视频理解包装为多步 Agent 循环(计划→工具调用→记忆更新→回答),支持自适应采样、多工具协作(OCR/ASR/对象检测/字幕生成)。最适合需要多步推理的复杂问答。

Core idea: wrap video understanding as a multi-step agent loop (plan → tool call → memory update → answer), supporting adaptive sampling and multi-tool collaboration (OCR/ASR/object detection/captioning). Best for complex QA requiring multi-step reasoning.
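The loop can be sketched as a skeleton (tool implementations are stubs; the tool names, keyword-based "planner", and stopping rule are placeholders for an LLM-driven controller):

```python
# Agentic video QA skeleton: plan -> tool call -> memory update -> answer.

def caption_tool(t):          # stub: caption the clip around second t
    return f"caption@{t}s"

def ocr_tool(t):              # stub: read on-screen text around second t
    return f"ocr@{t}s"

TOOLS = {"caption": caption_tool, "ocr": ocr_tool}

def agent_answer(question, max_steps=3):
    memory = []                       # accumulated observations
    for step in range(max_steps):
        # plan: a real agent lets an LLM pick the tool and timestamp
        tool = "ocr" if "text" in question else "caption"
        obs = TOOLS[tool](t=step * 60)
        memory.append(obs)            # memory update
        if len(memory) >= 2:          # stand-in for an LLM stop decision
            break
    return f"answer based on {memory}"

print(agent_answer("what text is on the sign?"))
```

The value of the agentic framing is that sampling becomes adaptive: the loop spends tool calls only where the question demands them instead of processing the whole video uniformly.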

2024

VideoAgent (ECCV 2024)

记忆增强 Agent,结构化记忆库,LLM 驱动推理循环。

Memory-augmented agent with structured memory store; LLM-driven reasoning loop.

ECCV 2024
Memory Agent
2025

LVAgent (ICCV 2025)

多轮动态协作 MLLM,多 Agent 互补理解,长视频复杂推理新 SOTA。

Multi-round dynamic collaboration of MLLMs; complementary multi-agent understanding; new SOTA for complex long video reasoning.

ICCV 2025
Multi-Agent · Dynamic
2025

MARC (RL Token Compression)

强化学习驱动 token 压缩 Agent,动态决策哪些 token 保留,性能与效率双优。

RL-driven token compression agent that dynamically decides which tokens to retain; achieves both performance and efficiency.

RL Agent · Token Compression
2025

Vgent (NeurIPS 2025)

图增强视频 Agent,知识图谱 + RAG,复杂因果推理。

Graph-augmented video agent; knowledge graph + RAG for complex causal reasoning.

Graph Agent · Knowledge Graph
📊
主流基准测试
Major Benchmarks
| Benchmark | 视频时长 / Duration | 规模 / Scale | 任务类型 / Task Type | 年份 / Year | 特点 / Highlights |
| --- | --- | --- | --- | --- | --- |
| EgoSchema | 3 min avg | 5,000 QA | Video QA | 2023 | 自我中心视频,多项选择,长时推理 / Egocentric, multiple-choice, long-range reasoning |
| Video-MME | 11 s – 1 h | 900 videos, 2,700 QA | Comprehensive | 2024 | 首个全面视频 MLLM 评估基准,短/中/长三档,Gemini 1.5 Pro 登顶 / First comprehensive video MLLM benchmark; short/medium/long tiers; Gemini 1.5 Pro topped the leaderboard |
| LongVideoBench | 15 s – 60 min | 6,678 QA (17 categories) | Interleaved | 2024 | 视频-语言交织输入,17 类问题,最全面的长视频 QA 基准之一 / Video-language interleaved input; 17 QA categories; among the most comprehensive long video benchmarks |
| LVBench | 1 h+ | 超长视频集 / Ultra-long video set | Extreme Long | 2024 | 专为超长视频设计,平均时长 1 小时以上,当前模型表现仍差 / Designed for extreme-length video (avg 1 h+); current models still struggle |
| MovieChat-1K | 电影级 / Movie-length | 1K videos, 2K grounding | Grounding | 2024 | 配套 MovieChat 发布,含时序定位标注,首个 10K 帧级评测集 / Released with MovieChat; temporal grounding annotations; first 10K-frame-scale evaluation set |
| ActivityNet-QA | 3 min avg | 58,000 QA | Open-Ended QA | 2019 | 开放式问答,覆盖日常活动,长视频理解早期经典基准 / Open-ended QA on daily activities; classic early long-video benchmark |
| How2QA / How2R | 90 s avg | 22K QA | Retrieval + QA | 2020 | 多语言教育视频,检索+问答双任务,测试跨段推理 / Multilingual educational videos; dual retrieval + QA tasks; tests cross-segment reasoning |
| Ego4D-NLQ | 小时级 / Hour-long | 15K 查询 / queries | Temporal NLQ | 2022 | 自我中心视频自然语言查询时序定位,Meta + UC Berkeley 联合构建 / Natural-language temporal localization in egocentric video; built by Meta + UC Berkeley |
| MLVU | 3–120 min | 2,593 QA | Multi-task | 2024 | 多维度长视频评测(叙事/推理/识别),含 Needle-in-a-Haystack 专项 / Multi-dimensional evaluation (narrative/reasoning/recognition); includes a Needle-in-a-Haystack test |
⚖️
主要方法能力对比
Method Capability Comparison
方法 / Method · 记忆类型 / Memory Type · 小时级 / Hour+ · 流式 / Streaming · 问题感知 / Query-Aware · 时序定位 / Temporal Grounding · 多步推理 / Multi-step · 无需训练 / Training-Free
MovieChat (CVPR 2024) · 短+长期稀疏 / Short+Long Sparse
MA-LMM (CVPR 2024) · 在线记忆库 / Online Memory Bank
ReWind (CVPR 2025) · 指令驱动可学习 / Instruction-Learnable
VideoLLaMB (ICCV 2025) · 循环记忆桥接 / Recurrent Memory Bridge
VideoAgent (ECCV 2024) · 结构化检索记忆 / Structured Retrieval
Gemini 1.5 Pro · 1M 上下文窗口 / 1M Context Window
StreamingTOM (2025) · 流式 Token 压缩 / Streaming Token Compression
VideoMem (2024) · 自适应记忆 / Adaptive Memory
Vgent (NeurIPS 2025) · 图结构记忆 + RAG / Graph Memory + RAG
LVAgent (ICCV 2025) · 多 Agent 协作 / Multi-Agent Collaboration

✓ 支持 / Supported  ·  △ 部分支持 / Partial  ·  ✗ 不支持 / Not Supported