从稀疏采样到流式记忆,从外部存储到百万 token 上下文窗口 —— 系统梳理长视频理解技术的演进脉络与前沿趋势。
From sparse sampling to streaming memory, from external storage to million-token context windows — a systematic survey of long video understanding and memory techniques.
视频时长爆炸式增长(电影/直播/监控),而多数现有多模态大模型受限于上下文窗口(通常 ≤128K tokens),难以端到端处理超长视频。
Video durations are exploding (films, live streams, surveillance), but most current multimodal LLMs are constrained by their context windows (typically ≤128K tokens), making end-to-end processing of ultra-long video infeasible.
记忆机制让模型能以固定计算代价访问任意时刻的历史信息,是解决长时依赖推理("两小时前发生了什么")的关键。
Memory mechanisms let a model access information from any earlier point in the video at a bounded compute cost, which is crucial for long-range temporal reasoning ("what happened two hours ago").
2023年以来,主流方向从纯视觉模型转向 Video LMM/MLLM:将视频帧编码为 token 注入 LLM,关键挑战是如何在 token 数量 vs. 信息保真度之间取得平衡。
Since 2023, the mainstream has shifted from pure vision models to Video LMMs/MLLMs: video frames are encoded as tokens and injected into an LLM. The key challenge is balancing token count against information fidelity.
电影问答、体育分析、医疗视频诊断、会议记录、监控异常检测、具身 Agent 长程规划、视频检索与摘要。
Movie QA, sports analytics, medical video diagnosis, meeting transcription, surveillance anomaly detection, embodied agent long-horizon planning, video retrieval and summarization.
视频帧数与 token 数呈线性增长,Attention 复杂度 O(n²),直接处理 1 小时视频需百万级 token。
Frame and token counts grow linearly with duration while attention complexity is O(n²); processing one hour of video directly requires on the order of a million tokens.
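A back-of-the-envelope sketch of the scale involved (the 1 fps sampling rate and 256 tokens per frame are illustrative assumptions, not figures from any particular model):

```python
# Rough token-count estimate for naively feeding every sampled frame to an LLM.
# Assumptions (illustrative): 1 frame sampled per second, 256 visual tokens per
# frame, as is typical for a ViT-style encoder without further compression.
def estimate_tokens(duration_s: float, fps_sampled: float = 1.0,
                    tokens_per_frame: int = 256) -> int:
    return int(duration_s * fps_sampled * tokens_per_frame)

print(f"{estimate_tokens(3600):,} tokens for 1 hour")   # 921,600 tokens
# Since self-attention cost grows with the square of sequence length,
# doubling the video length roughly quadruples the attention FLOPs.
```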
问答需要跨越数小时的跨帧推理(如"主角的动机是什么"),稀疏采样丢失关键帧,密集采样超出窗口。
QA requires cross-hour temporal reasoning (e.g., "what motivated the character"). Sparse sampling drops key frames; dense sampling exceeds the window.
在小时级视频中精准定位特定事件的时间戳(秒级),需要细粒度的时序感知能力。
Precisely localizing specific event timestamps (second-level precision) in hour-long videos requires fine-grained temporal perception.
激进压缩(如平均池化)会抹去局部细节;过于保守(保留所有 token)则计算不可行。寻找最优压缩策略是核心。
Aggressive compression (e.g., average pooling) erases local details; conservative approaches (keep all tokens) are computationally infeasible. Finding optimal compression is the core challenge.
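A toy illustration of how aggressive temporal pooling erases a brief but salient event (hypothetical feature values, purely for intuition):

```python
import numpy as np

# 100 frames of 4-dim toy features; frame 57 carries a brief salient event.
frames = np.zeros((100, 4))
frames[57] = 10.0

pooled = frames.mean(axis=0)   # aggressive compression: one vector for the whole clip
print(pooled)                  # ~[0.1 0.1 0.1 0.1]: the event is nearly erased
# Keeping every token would preserve the event, but the token count then
# grows linearly with duration, which is exactly the infeasible case above.
```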
直播与监控场景要求模型实时处理视频流,不能等待全部帧到齐,因果约束下的记忆设计需要特殊考量。
Live streaming and surveillance require real-time processing without waiting for all frames. Causal memory design under streaming constraints needs special treatment.
视觉 token 与文本 token 的语义对齐在长视频中更难维持,随时序增长,历史帧的视觉信息与当前查询的相关性逐渐退化。
Maintaining semantic alignment between visual and text tokens is harder in long videos; as the sequence grows, the relevance of historical frames' visual information to the current query gradually degrades.
首次将 BERT 自监督预训练迁移至视频-文本联合学习;将视频离散化为视觉 token,与文字 token 交织训练。奠定了视频-语言预训练范式,长视频研究的远祖。
First application of BERT-style self-supervised pretraining to joint video-text learning; discretized video into visual tokens interleaved with text tokens. Laid the foundation of the video-language pretraining paradigm and is a distant ancestor of long-video research.
稀疏注意力机制(滑动窗口 + 全局 token)被引入视频理解,将 O(n²) 复杂度降为 O(n),首次让长序列视频 Transformer 变得可行。
Sparse attention (sliding window + global tokens) introduced to video understanding, reducing O(n²) to O(n) complexity. Made long-sequence video Transformers feasible for the first time.
冻结 LLM 权重,仅通过可学习视觉适配层将视频帧注入语言模型,验证了 LLM 作为视频推理引擎的潜力,奠定"冻结 LLM + 视觉桥接"范式。
Froze LLM weights, injecting video frames via learnable visual adapters. Validated LLMs as video reasoning engines; established the "frozen LLM + visual bridge" paradigm.
首批基于对话的视频多模态模型。VideoChat 引入事件记忆(Event Memory),将视频语义压缩为文字摘要输入 LLM;Video-ChatGPT 完成大规模视频指令调优数据集构建。
First conversational video multimodal models. VideoChat introduced event memory (compressing video semantics into text summaries for LLM). Video-ChatGPT built large-scale video instruction tuning datasets.
以 Atkinson-Shiffrin 认知记忆模型(短时记忆 + 长时记忆)为框架,设计 Dense Token → Sparse Memory 的两阶段压缩,首次支持 10K+ 帧长视频理解,配套发布 MovieChat-1K 基准。
Built on the Atkinson-Shiffrin cognitive memory model (short-term + long-term memory), with a dense-token → sparse-memory two-stage compression design. First to support long video understanding at 10K+ frames. Released the MovieChat-1K benchmark.
TimeChat 引入时间感知帧编码器(Timestamp-aware),每帧附加绝对时间戳 token,使模型具备时序定位能力。LLaMA-VID 将每帧压缩至 2 个 token(上下文 + 内容),使百分钟视频成为可能。
TimeChat introduced timestamp-aware frame encoding, appending an absolute timestamp token to each frame for temporal grounding. LLaMA-VID compressed each frame to 2 tokens (context + content), making hundred-minute-scale videos tractable.
VideoAgent 构建结构化记忆(时序事件描述 + 对象追踪状态),迭代召回相关片段回答问题,开创"基于外部记忆的视频 Agent"范式。VideoAgent-Long 专注小时级长视频,用 GPT-4V 逐步构建语义记忆库。
VideoAgent built structured memory (temporal event descriptions + object tracking states), iteratively retrieving relevant clips to answer questions. Pioneered the "external-memory video agent" paradigm. VideoAgent-Long targeted hour-long videos using GPT-4V to progressively build a semantic memory store.
Gemini 1.5 Pro 以 100 万 token 超长上下文窗口实现端到端长视频理解,无需外部记忆模块即可处理 1 小时以上视频。引发"暴力扩展上下文 vs. 精细记忆设计"的学术争论,在 Video-MME 基准上全面领先 GPT-4o。
Gemini 1.5 Pro's 1M-token context window enabled end-to-end long video understanding without external memory, processing 1+ hour videos. Sparked debate on "brute-force context scaling vs. fine-grained memory design." Outperformed GPT-4o on Video-MME benchmark.
MA-LMM 设计可插拔长期记忆库(Long-term Memory Bank),在线压缩历史视觉 token,支持视频描述/预测/问答多任务,是记忆增强视频 LMM 的标准基线。MovieChat+ 改进为问题感知稀疏记忆(Question-aware),提升检索相关性。
MA-LMM designed a plug-in long-term memory bank that compresses historical visual tokens online, supporting video captioning, prediction, and QA; it is a standard baseline for memory-augmented Video LMMs. MovieChat+ upgraded to question-aware sparse memory, improving retrieval relevance.
设计层次化事件记忆机制,将视频切分为多粒度事件(片段→事件→故事弧),每层独立压缩后跨层级检索,缓解平坦记忆结构的信息稀释问题。
Designed hierarchical event-based memory, segmenting videos at multiple granularities (clips → events → story arcs), compressing each level independently and enabling cross-level retrieval to alleviate flat-memory information dilution.
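A generic sketch of the multi-granularity idea (the grouping size, mean-pooling summary, and top-down search below are illustrative assumptions, not HEM-LLM's actual design):

```python
import numpy as np

def build_hierarchy(clip_feats: np.ndarray, group: int = 4) -> list[np.ndarray]:
    """Each level pools `group` consecutive units of the level below."""
    levels = [clip_feats]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        pad = (-len(cur)) % group
        if pad:
            cur = np.concatenate([cur, np.repeat(cur[-1:], pad, axis=0)])
        levels.append(cur.reshape(-1, group, cur.shape[-1]).mean(axis=1))
    return levels                      # levels[0]=clips, [1]=events, [2]=story arcs, ...

def coarse_to_fine(levels: list[np.ndarray], query: np.ndarray, group: int = 4) -> int:
    """Descend from the coarsest informative level, narrowing to one clip."""
    idx = 0
    for lvl in reversed(levels[:-1]):  # skip the single-node top, walk down the tree
        lo = idx * group
        cand = lvl[lo:lo + group]
        idx = lo + int(np.argmax(cand @ query))
    return idx                         # index of the retrieved clip

clips = np.random.default_rng(0).normal(size=(64, 16))
levels = build_hierarchy(clips)
print([lvl.shape for lvl in levels], "retrieved clip:", coarse_to_fine(levels, clips[42]))
```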
LLaVA-Video 通过大规模合成视频指令数据(1.6M 视频-文本对)显著提升长视频理解,成为开源视频 LMM 的新基线。VideoStreaming 提出固定 token 预算下的流式压缩,常数计算量处理任意长度视频。
LLaVA-Video significantly improved long video understanding via large-scale synthetic instruction data (1.6M video-text pairs), becoming the new open-source Video LMM baseline. VideoStreaming proposed streaming compression with fixed token budget, processing arbitrary-length video with constant compute.
两阶段框架:阶段一维护动态记忆(按问题指令更新),阶段二在记忆 + 关键帧上进行推理。引入"Instructed Learnable Memory"——记忆内容由用户查询驱动动态调整,而非固定压缩规则。
Two-stage framework: Stage 1 maintains dynamic memory (updated by query instruction); Stage 2 reasons over memory + key frames. Introduced "Instructed Learnable Memory" — memory content is dynamically adjusted by user queries rather than fixed compression rules.
引入循环记忆桥接(Recurrent Memory Bridges)和时序记忆 token(Temporal Memory Tokens),在保持流式处理的同时维持跨片段的语义连贯性,以极低的 token 开销支持长程推理。
Introduced Recurrent Memory Bridges and Temporal Memory Tokens, maintaining semantic coherence across segments while streaming and supporting long-range reasoning at very low token overhead.
专攻超长视频(小时级以上),引入自适应记忆更新策略,根据语义重要性动态决定哪些帧应进入长期记忆,哪些可丢弃,有效解决超长时序的信息保留问题。
Specialized for ultra-long (hour-plus) video; introduced an adaptive memory update strategy that dynamically decides, based on semantic importance, which frames enter long-term memory and which are discarded, effectively addressing information retention over ultra-long time spans.
Vgent 将检索增强生成(RAG)引入视频 Agent,构建基于图结构的视频知识库,通过多跳检索回答复杂跨段问题。图节点=场景/对象,边=时序/因果关系,突破了线性时序记忆的局限。
Vgent brought Retrieval-Augmented Generation (RAG) to video agents, constructing a graph-structured video knowledge base with multi-hop retrieval for complex cross-segment questions. Graph nodes = scenes/objects; edges = temporal/causal relations. Overcame limitations of linear temporal memory.
提出固定记忆窗口(Fixed Memory Window)加速长视频理解,相比动态记忆方案显著降低延迟,在多个基准上以极小精度损失换取大幅推理加速。
Proposed a Fixed Memory Window to accelerate long video understanding, delivering significantly lower latency than dynamic memory schemes and trading minimal accuracy loss for large inference speedups on multiple benchmarks.
以"事件"为基本单元构建情景记忆(Episodic Memory),克服传统基于帧/片段的碎片化检索问题,强化叙事主线的完整性,在长视频问答中实现更连贯的时序证据链。
Built episodic memory with "events" as the fundamental unit, overcoming the fragmented retrieval of frame/clip-based approaches, preserving the integrity of the narrative arc, and producing more coherent temporal evidence chains in long video QA.
受人类回忆"由粗到精"过程启发,设计深度记忆回溯框架:先粗粒度定位大致时段,再细粒度回溯精确片段,避免全量扫描的计算浪费,同时保证关键细节不丢失。
Inspired by human coarse-to-fine recall, designed a deep memory backtracking framework: coarse-grained temporal localization followed by fine-grained segment backtracking, avoiding full-scan compute waste while retaining critical details.
无需训练的两阶段插件式框架,同时解决 LLM 前(视觉 token 冗余)和 LLM 后(KV Cache 膨胀)两个瓶颈;因果时序缩减 + 语义 token 合并;预测延迟稳定、精度与 SOTA 持平。
Training-free, plug-and-play two-stage framework addressing both the pre-LLM bottleneck (visual token redundancy) and the post-LLM bottleneck (KV cache inflation): causal temporal reduction plus semantic token merging, with stable, predictable latency and accuracy on par with SOTA.
核心思路:从长视频中均匀或自适应地抽取 N 帧(通常 8–64),将所有帧 token 拼接送入模型。优点是实现简单;缺点是面对小时级视频时采样率极低,高密度事件段信息丢失严重。
Core idea: uniformly or adaptively sample N frames (typically 8–64) from the long video and concatenate all frame tokens as model input. Simple to implement, but hour-long videos are severely undersampled, losing information in event-dense segments.
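A minimal sketch of the sampling step (index selection only; frame decoding and visual encoding with e.g. decord/ffmpeg and a ViT are omitted):

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int = 32) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across the whole video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# ~1 hour of video at 25 fps: 90,000 frames, of which only 32 are kept.
idx = uniform_sample_indices(num_frames=90_000, num_samples=32)
print(idx[:4], "...", idx[-1])
# Adjacent samples are ~116 s apart, so any event shorter than ~2 minutes
# can fall entirely between two sampled frames.
```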
视频-文本对比学习,均匀采样16帧,奠定基础采样基线。
Video-text contrastive learning with 16-frame uniform sampling; established the basic sampling baseline.
将 BLIP-2 的图像理解扩展至视频,帧级 Q-Former 提取视觉特征后注入 LLaMA。
Extended BLIP-2 image understanding to video with a frame-level Q-Former injecting features into LLaMA.
极端压缩:每帧仅保留 2 token(上下文+内容),首次使百分钟级视频处理成为可能。
Extreme compression: 2 tokens per frame (context + content). First to make hundred-minute-scale video processing feasible.
探索多种视频 token 合并策略(均匀/区域集中/时序),在长视频分类上以低算力匹配高算力模型。
Explored video token merging strategies (uniform/region-concentrated/temporal) for long video classification, matching high-compute models at low cost.
1.6M 合成视频指令数据 + 高质量采样策略,成为开源视频 LMM 的新 SOTA 基线。
1.6M synthetic video instruction data + high-quality sampling; new open-source Video LMM SOTA baseline.
动态 token 合并:层次帧选择 + 二部图 token 合并,零样本自适应压缩,平衡关键帧保留与冗余消除。
Dynamic token merging: hierarchical frame selection + bipartite token merging; zero-shot adaptive compression balancing key frame retention and redundancy removal.
核心思路:将"已处理"的视频帧信息压缩存入外部/内部记忆库,推理时按需检索或直接读取。分为短时缓冲(近 K 帧精细 token)+ 长期压缩存储(历史帧蒸馏后的稀疏表示)。受人类认知记忆模型启发。
Core idea: compress already-processed frame information into external/internal memory banks and retrieve or read it on demand at inference time. Typically combines a short-term buffer (fine-grained tokens of the most recent K frames) with long-term compressed storage (sparse, distilled representations of older frames). Inspired by human cognitive memory models.
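A minimal sketch of the short-term buffer plus long-term consolidation pattern (the capacities, mean-pooled summaries, and greedy merge rule are assumptions for illustration; no specific paper's algorithm is implied):

```python
from collections import deque
import numpy as np

class VideoMemory:
    """Short-term buffer of recent frames + compressed long-term store."""

    def __init__(self, short_capacity: int = 16, long_capacity: int = 64):
        self.short = deque(maxlen=short_capacity)   # recent frames, full token sets
        self.long: list[np.ndarray] = []            # one pooled vector per old frame
        self.long_capacity = long_capacity

    def add_frame(self, frame_tokens: np.ndarray) -> None:
        """frame_tokens: (num_tokens, dim) array for one new frame."""
        if len(self.short) == self.short.maxlen:
            # The oldest frame is about to fall out of the buffer:
            # keep only a pooled summary of it in long-term memory.
            self._consolidate(self.short[0].mean(axis=0))
        self.short.append(frame_tokens)

    def _consolidate(self, summary: np.ndarray) -> None:
        self.long.append(summary)
        if len(self.long) > self.long_capacity:
            # Greedily merge the two most similar adjacent summaries.
            sims = [float(a @ b) for a, b in zip(self.long[:-1], self.long[1:])]
            i = int(np.argmax(sims))
            self.long[i] = (self.long[i] + self.long[i + 1]) / 2
            del self.long[i + 1]

    def read(self) -> np.ndarray:
        """Tokens handed to the LLM: compressed history + detailed recent frames."""
        recent = [tok for frame in self.short for tok in frame]
        return np.stack(self.long + recent)

mem = VideoMemory()
rng = np.random.default_rng(0)
for _ in range(500):                       # stream 500 frames of 32 tokens x 64 dims
    mem.add_frame(rng.normal(size=(32, 64)))
print(mem.read().shape)                    # bounded: (64 + 16*32, 64) = (576, 64)
```

The point of the sketch is that the token budget handed to the LLM stays bounded no matter how many frames have streamed in; query-aware variants replace the fixed merge rule with a policy conditioned on the user's question.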
Atkinson-Shiffrin 模型:短时 token + 长期稀疏记忆,10K 帧突破。
Atkinson-Shiffrin model: short-term tokens + long-term sparse memory; 10K frame breakthrough.
即插即用记忆库,在线 token 压缩,支持多任务;标准基线。
Plug-in memory bank with online token compression supporting multi-task; standard baseline.
问题感知稀疏记忆,检索相关性大幅提升。
Question-aware sparse memory with significantly improved retrieval relevance.
层次化事件记忆,多粒度分层压缩,跨级检索。
Hierarchical event memory with multi-granularity layered compression and cross-level retrieval.
指令驱动可学习记忆,两阶段(建记忆→推理)框架,时序保真。
Instruction-driven learnable memory in a two-stage (build memory → reason) framework with temporal fidelity.
循环记忆桥接 + 时序记忆 token,流式高效,长程连贯。
Recurrent memory bridges + temporal memory tokens; streaming-efficient with long-range coherence.
超长视频自适应记忆,语义重要性动态决策,小时级以上视频。
Adaptive memory for ultra-long video; semantic importance-based dynamic decision-making for hour+ video.
事件中心情景记忆,叙事完整性强化,跨段推理连贯。
Event-centric episodic memory reinforcing narrative completeness and cross-segment reasoning coherence.
深度记忆回溯,由粗到精,仿人类回忆过程。
Deep memory backtracking; coarse-to-fine, mimicking human recollection process.
核心思路:通过扩展模型原生上下文窗口(RoPE 外推/YaRN/位置插值等),让模型直接接受百万级 token 的视频序列,无需外部记忆模块。代表:Gemini 1.5(1M)、GPT-4o(128K)。优点是架构简洁,端到端推理;缺点是推理计算量随上下文平方增长,成本极高。
Core idea: extend the model's native context window (RoPE extrapolation, YaRN, position interpolation, etc.) so it can accept million-token video sequences directly, without an external memory module. Examples: Gemini 1.5 (1M), GPT-4o (128K). Advantage: simple, end-to-end architecture; drawback: inference compute grows quadratically with context length, making it extremely expensive.
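A minimal numeric sketch of position interpolation for RoPE, one of the window-extension tricks mentioned above (the dimensions and lengths are illustrative, and real systems combine this with other scaling schemes such as YaRN):

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 8, base: float = 10_000.0) -> np.ndarray:
    """Rotation angles used by rotary position embeddings: shape (seq_len, dim/2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

trained_len, target_len = 8_192, 1_000_000
positions = np.arange(target_len, dtype=np.float64)

# Position interpolation: squeeze the longer sequence's positions back into the
# range seen during training, so the rotary angles stay in-distribution.
scaled = positions * (trained_len / target_len)
angles = rope_angles(scaled)
print(angles.shape, float(angles[-1, 0]))   # last position maps to ~8191.99 on the slowest frequency
# Caveat: even with an extended window, attention compute still grows roughly
# quadratically with the number of visual tokens actually fed to the model.
```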
1M token 上下文,原生处理整部电影;Video-MME 基准全面领先 GPT-4o。
1M token context, natively processes entire movies. Outperformed GPT-4o on Video-MME benchmark across all categories.
128K 上下文,多模态原生支持,视频理解能力强但上下文仍有限。
128K context, native multimodal support; strong video understanding but context still limited.
长上下文 LLM(224K)用于高分辨率长视频,无需专项记忆模块。
Long-context LLM (224K) for high-resolution long video, requiring no dedicated memory module.
固定记忆窗口加速方案,用小计算量近似长上下文效果。
Fixed memory window acceleration approximating long-context performance at reduced compute.
核心思路:将长视频预处理为索引(文本描述/视觉向量),问答时动态检索相关片段再交给 LLM 推理,避免全量处理。优点是可扩展性强、支持超长视频;缺点是检索错误会级联影响推理质量。
Core idea: preprocess the long video into an index (text descriptions/visual vectors), dynamically retrieve relevant clips at question time, and hand them to an LLM for reasoning, avoiding full-video processing. Highly scalable and supports ultra-long video; however, retrieval errors cascade and degrade reasoning quality.
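A minimal sketch of the index-then-retrieve pattern (the hash-based embed function and the clip captions are stand-ins; a real system would use a text/vision encoder and pass the retrieved clips to an MLLM):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would use a learned encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Offline: caption (or encode) each 10-second clip once and build the index.
clips = [f"clip_{i:04d}: caption for seconds {i*10}-{i*10+10}" for i in range(360)]
index = np.stack([embed(c) for c in clips])

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Online: score every clip against the question and return the best few."""
    scores = index @ embed(question)            # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:top_k]
    return [clips[i] for i in best]             # these clips then go into the LLM prompt

print(retrieve("when does the chase scene start?"))
```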
结构化事件记忆 + 对象状态追踪,迭代检索-推理循环。
Structured event memory + object tracking states; iterative retrieve-reason loop.
GPT-4V 辅助构建语义记忆库,专攻小时级长视频问答。
GPT-4V-assisted semantic memory store construction for hour-long video QA.
专为自我中心视频设计的 RAG,动作/对象检索融合,EgoSchema 基准提升。
RAG designed for egocentric video, fusing action/object retrieval; improved EgoSchema benchmark results.
图结构知识库 + 多跳检索,突破线性时序记忆局限,复杂推理大幅提升。
Graph-structured knowledge base + multi-hop retrieval; breaks linear temporal memory limitations with major complex reasoning gains.
核心思路:将视频理解包装为多步 Agent 循环(计划→工具调用→记忆更新→回答),支持自适应采样、多工具协作(OCR/ASR/对象检测/字幕生成)。最适合需要多步推理的复杂问答。
Core idea: wrap video understanding as a multi-step agent loop (plan → tool call → memory update → answer), supporting adaptive sampling and multi-tool collaboration (OCR/ASR/object detection/captioning). Best for complex QA requiring multi-step reasoning.
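A bare-bones sketch of the plan → tool call → memory update → answer loop (the planner heuristic and tool stubs are invented for illustration; no particular agent framework's API is implied):

```python
def plan(question: str, memory: list[str]) -> str:
    """Toy planner: use OCR for text questions, gather a few observations, then answer."""
    if any(w in question.lower() for w in ("text", "sign", "subtitle")) and len(memory) < 2:
        return "ocr"
    return "answer" if len(memory) >= 3 else "caption"

TOOLS = {
    "caption": lambda q: "caption of a segment relevant to: " + q,   # stand-in for an MLLM captioner
    "ocr":     lambda q: "on-screen text extracted for: " + q,       # stand-in for an OCR tool
}

def run_agent(question: str, max_steps: int = 6) -> str:
    memory: list[str] = []                      # shared scratchpad updated each step
    for _ in range(max_steps):
        action = plan(question, memory)         # 1) plan
        if action == "answer":                  # 4) answer once enough evidence is gathered
            return f"answer synthesized from {len(memory)} observations"
        memory.append(TOOLS[action](question))  # 2) tool call + 3) memory update
    return "answer (step budget exhausted)"

print(run_agent("what does the sign at the end say?"))
```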
记忆增强 Agent,结构化记忆库,LLM 驱动推理循环。
Memory-augmented agent with structured memory store; LLM-driven reasoning loop.
多轮动态协作 MLLM,多 Agent 互补理解,长视频复杂推理新 SOTA。
Multi-round dynamic collaboration of MLLMs; complementary multi-agent understanding; new SOTA for complex long video reasoning.
强化学习驱动 token 压缩 Agent,动态决策哪些 token 保留,性能与效率双优。
RL-driven token compression agent that dynamically decides which tokens to retain; achieves both performance and efficiency.
图增强视频 Agent,知识图谱 + RAG,复杂因果推理。
Graph-augmented video agent; knowledge graph + RAG for complex causal reasoning.
| Benchmark | 视频时长 / Duration | 规模 / Scale | 任务类型 / Task Type | 发布时间 / Year | 特点 | Highlights |
|---|---|---|---|---|---|---|
| EgoSchema | 3 min avg | 5,000 QA | Video QA | 2023 | 自我中心视频,多项选择,长时推理 | Egocentric, multiple-choice, long-range reasoning |
| Video-MME | 11s–1h | 900 videos, 2,700 QA | Comprehensive | 2024 | 首个全面视频 MLLM 评估基准,短/中/长三档,Gemini 1.5 Pro 登顶 | First comprehensive video MLLM benchmark; short/medium/long tiers; Gemini 1.5 Pro topped the leaderboard |
| LongVideoBench | 15s–60 min | 6,678 QA (17 categories) | Interleaved | 2024 | 视频-语言交织输入,17类问题,最全面长视频 QA 基准之一 | Video-language interleaved input; 17 QA categories; one of the most comprehensive long video QA benchmarks |
| LVBench | 1h+ | 超长视频集 / Ultra-long video set | Extreme Long | 2024 | 专为超长视频设计,平均时长 1 小时以上,当前模型在此表现仍差 | Designed for extreme long video (avg 1h+); current models still struggle significantly |
| MovieChat-1K | 电影级 / Movie-length | 1K videos, 2K grounding | Grounding | 2024 | 配套 MovieChat 发布,包含时序定位标注,首个10K帧级评测集 | Released with MovieChat; includes temporal grounding annotations; first 10K-frame-scale evaluation set |
| ActivityNet-QA | 3 min avg | 58,000 QA | Open-Ended QA | 2019 | 开放式问答,覆盖日常活动,长视频理解早期经典基准 | Open-ended QA on daily activities; classic early long video benchmark |
| How2QA / How2R | 90s avg | 22K QA | Retrieval + QA | 2020 | 多语言教育视频,检索+问答双任务,测试跨段推理 | Multilingual educational videos; dual retrieval + QA tasks; tests cross-segment reasoning |
| Ego4D-NLQ | 小时级 / Hour-long | 15K 查询 / 15K queries | Temporal NLQ | 2022 | 自我中心视频自然语言查询时序定位,Meta+UC Berkeley 联合构建 | Natural-language-query temporal localization in egocentric video; jointly built by Meta + UC Berkeley |
| MLVU | 3–120 min | 2,593 QA | Multi-task | 2024 | 多维度长视频评测(叙事/推理/识别),Needle-in-a-Haystack 专项 | Multi-dimensional long video evaluation (narrative/reasoning/recognition); includes a Needle-in-a-Haystack task |
| 方法 / Method | 记忆类型 | Memory Type | 小时级 / Hour+ | 流式 / Streaming | 问题感知 / Query-Aware | 时序定位 / Temporal Grounding | 多步推理 / Multi-step | 无需训练 / Training-Free |
|---|---|---|---|---|---|---|---|---|
| MovieChat (CVPR 2024) | 短+长期稀疏 | Short+Long Sparse | ✓ | ✗ | ✗ | △ | ✗ | ✗ |
| MA-LMM (CVPR 2024) | 在线记忆库 | Online Memory Bank | ✓ | △ | ✗ | △ | ✗ | ✗ |
| ReWind (CVPR 2025) | 指令驱动可学习 | Instruction-Learnable | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| VideoLLaMB (ICCV 2025) | 循环记忆桥接 | Recurrent Memory Bridge | ✓ | ✓ | ✗ | △ | ✗ | ✗ |
| VideoAgent (ECCV 2024) | 结构化检索记忆 | Structured Retrieval | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| Gemini 1.5 Pro | 1M 上下文窗口 | 1M Context Window | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| StreamingTOM (2025) | 流式 Token 压缩 | Streaming Token Compression | ✓ | ✓ | ✗ | △ | ✗ | ✓ |
| VideoMem (2024) | 自适应记忆 | Adaptive Memory | ✓ | △ | ✓ | ✓ | ✗ | ✗ |
| Vgent (NeurIPS 2025) | 图结构记忆 + RAG | Graph Memory + RAG | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| LVAgent (ICCV 2025) | 多 Agent 协作 | Multi-Agent Collaboration | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
✓ 支持 / Supported · △ 部分支持 / Partial · ✗ 不支持 / Not Supported
早期方案(MovieChat/MA-LMM)采用固定压缩规则(FIFO、平均池化);2024年后转向问题感知、指令驱动的动态记忆(ReWind、MovieChat+);趋势终点是完全端到端可学习的记忆读写机制,记忆内容由模型自主决策。
Early approaches (MovieChat/MA-LMM) used fixed compression rules (FIFO, average pooling); after 2024 the field shifted to query-aware, instruction-driven dynamic memory (ReWind, MovieChat+); the end goal is a fully end-to-end learnable memory read/write mechanism, with memory content decided autonomously by the model.
随着视频 LMM 进入工业落地,推理成本成为瓶颈。从简单均匀采样→基于注意力的 token 合并→训练无关的流式压缩(StreamingTOM)→RL 驱动的动态压缩(MARC)。目标:在固定 token 预算下最大化信息密度。
As Video LMMs move into production, inference cost has become the bottleneck. The evolution: simple uniform sampling → attention-based token merging → training-free streaming compression (StreamingTOM) → RL-driven dynamic compression (MARC). Goal: maximize information density under a fixed token budget.
Gemini 1.5 Pro 的 1M 上下文引发"是否需要专用记忆模块"的争论。学界认为:长上下文适合短时高密度推理,结构化记忆适合超长视频(数小时)的高效检索与低成本部署。两者将长期共存,分别占据不同应用场景。
Gemini 1.5 Pro's 1M-token context sparked debate over whether dedicated memory modules are still needed. The prevailing view: long context suits short, high-density reasoning, while structured memory suits ultra-long (multi-hour) video where efficient retrieval and low-cost deployment matter. The two will coexist across different use cases.
线性时序记忆难以表达复杂的因果、空间和实体关系。HEM-LLM 引入层次化事件记忆;Vgent 引入图结构记忆,支持多跳推理。预计未来视频知识图谱(Video KG)将成为长视频理解的基础设施。
Linear temporal memory struggles to capture complex causal, spatial, and entity relations. HEM-LLM introduced hierarchical event memory; Vgent introduced graph-structured memory supporting multi-hop reasoning. Video knowledge graphs (Video KGs) are expected to become core infrastructure for long video understanding.
单一 MLLM 的能力上限难以覆盖长视频所有任务(OCR/ASR/对象追踪/时序推理)。LVAgent、VideoAgent 等工作转向多 Agent 分工协作,每个 Agent 专精子任务,通过共享记忆库和协调机制融合输出,显著提升复杂长视频问答性能。
A single MLLM cannot cover all long-video tasks (OCR/ASR/object tracking/temporal reasoning). Works such as LVAgent and VideoAgent shift to multi-agent division of labor: each agent specializes in a subtask, and outputs are fused through a shared memory store and a coordination mechanism, significantly improving complex long video QA.
早期工作均为离线处理(等待全部帧)。直播/监控场景需要实时感知与记忆更新。VideoStreaming、StreamingTOM、VideoLLaMB 等引入因果约束下的流式记忆更新,在实时推理效率上取得突破。体现了长视频理解从"学术玩具"走向"工业落地"的核心挑战。
Early work was entirely offline (waiting for all frames before processing). Live streaming and surveillance require real-time perception and memory updates. VideoStreaming, StreamingTOM, and VideoLLaMB introduced streaming memory updates under causal constraints, achieving breakthroughs in real-time inference efficiency. This reflects the core challenge of moving long video understanding from "academic toy" to "industrial deployment."
LVBench 显示当前最优模型在 1 小时以上视频上准确率仍不足 50%,核心瓶颈是超长时序依赖建模与关键帧定位。
LVBench shows current best models achieve <50% accuracy on 1h+ videos. Core bottleneck: ultra-long temporal dependency modeling and key frame localization.
"模型记住了什么"目前几乎无法解释。可解释性记忆(类似注意力可视化)是提升用户信任和模型 Debug 的关键研究方向。
"What did the model remember" is currently nearly uninterpretable. Interpretable memory (analogous to attention visualization) is key for user trust and model debugging.
绝大多数基准与模型以英文为主。多语言(特别是低资源语言)长视频理解数据极度匮乏,是重要的未探索空白。
Most benchmarks and models are English-centric. Multilingual (especially low-resource) long video understanding data is severely lacking — an important unexplored gap.
具身 Agent 需要长时序视频记忆来完成多步任务("一小时前我把工具放在哪里")。长视频记忆与 VLA/WM 的深度融合是下一个重大方向。
Embodied agents need long-horizon video memory for multi-step tasks ("where did I put the tool an hour ago"). Deep integration of long video memory with VLA (vision-language-action) models and world models (WM) is the next major direction.
监控/直播场景需要端到端延迟 <200ms 的流式推理,而目前最好的模型仍需秒级甚至分钟级推理时间,工业落地差距巨大。
Surveillance/live-streaming requires <200ms end-to-end latency for streaming inference, but current best models still need seconds to minutes — a huge gap for industrial deployment.
目前各基准测试场景割裂(EgoSchema/Video-MME/LVBench),缺少统一的跨任务、跨时长、跨场景的长视频理解综合评测体系。
Current benchmarks are fragmented (EgoSchema/Video-MME/LVBench). A unified cross-task, cross-duration, cross-domain long video understanding evaluation framework is urgently needed.