从稀疏采样到流式记忆,从外部存储到百万 token 上下文窗口 —— 系统梳理长视频理解技术的演进脉络与前沿趋势。
From sparse sampling to streaming memory, from external storage to million-token context windows — a systematic survey of long video understanding and memory techniques.
视频时长爆炸式增长(电影/直播/监控),而多数现有多模态大模型受限于上下文窗口(通常 ≤128K tokens),难以端到端处理超长视频。
Video durations are exploding (films, live streams, surveillance), but most current multimodal LLMs are constrained by their context windows (typically ≤128K tokens), making end-to-end processing of ultra-long video infeasible.
记忆机制让模型能以固定计算代价访问任意时刻的历史信息,是解决长时依赖推理("两小时前发生了什么")的关键。
Memory mechanisms let a model access information from any earlier point in the video at a bounded compute cost, which is crucial for long-range temporal reasoning ("what happened two hours ago").
2023年以来,主流方向从纯视觉模型转向 Video LMM/MLLM:将视频帧编码为 token 注入 LLM,关键挑战是如何在 token 数量 vs. 信息保真度之间取得平衡。
Since 2023, the mainstream has shifted from pure vision models to Video LMMs/MLLMs: video frames are encoded as tokens and injected into an LLM. The key challenge is balancing token count against information fidelity.
电影问答、体育分析、医疗视频诊断、会议记录、监控异常检测、具身 Agent 长程规划、视频检索与摘要。
Movie QA, sports analytics, medical video diagnosis, meeting transcription, surveillance anomaly detection, embodied agent long-horizon planning, video retrieval and summarization.
视频帧数与 token 数呈线性增长,Attention 复杂度 O(n²),直接处理 1 小时视频需百万级 token。
Frame and token counts grow linearly with duration while attention complexity is O(n²); processing one hour of video directly requires on the order of a million tokens.
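A back-of-the-envelope sketch of the scale involved (the 1 fps sampling rate and 256 tokens per frame are illustrative assumptions, not figures from any particular model):

```python
# Rough token-count estimate for naively feeding every sampled frame to an LLM.
# Assumptions (illustrative): 1 frame sampled per second, 256 visual tokens per
# frame, as is typical for a ViT-style encoder without further compression.
def estimate_tokens(duration_s: float, fps_sampled: float = 1.0,
                    tokens_per_frame: int = 256) -> int:
    return int(duration_s * fps_sampled * tokens_per_frame)

print(f"{estimate_tokens(3600):,} tokens for 1 hour")   # 921,600 tokens
# Since self-attention cost grows with the square of sequence length,
# doubling the video length roughly quadruples the attention FLOPs.
```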
问答需要跨越数小时的跨帧推理(如"主角的动机是什么"),稀疏采样丢失关键帧,密集采样超出窗口。
QA requires cross-hour temporal reasoning (e.g., "what motivated the character"). Sparse sampling drops key frames; dense sampling exceeds the window.
在小时级视频中精准定位特定事件的时间戳(秒级),需要细粒度的时序感知能力。
Precisely localizing specific event timestamps (second-level precision) in hour-long videos requires fine-grained temporal perception.
激进压缩(如平均池化)会抹去局部细节;过于保守(保留所有 token)则计算不可行。寻找最优压缩策略是核心。
Aggressive compression (e.g., average pooling) erases local details; conservative approaches (keep all tokens) are computationally infeasible. Finding optimal compression is the core challenge.
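A toy illustration of how aggressive temporal pooling erases a brief but salient event (hypothetical feature values, purely for intuition):

```python
import numpy as np

# 100 frames of 4-dim toy features; frame 57 carries a brief salient event.
frames = np.zeros((100, 4))
frames[57] = 10.0

pooled = frames.mean(axis=0)   # aggressive compression: one vector for the whole clip
print(pooled)                  # ~[0.1 0.1 0.1 0.1]: the event is nearly erased
# Keeping every token would preserve the event, but the token count then
# grows linearly with duration, which is exactly the infeasible case above.
```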
直播与监控场景要求模型实时处理视频流,不能等待全部帧到齐,因果约束下的记忆设计需要特殊考量。
Live streaming and surveillance require real-time processing without waiting for all frames. Causal memory design under streaming constraints needs special treatment.
视觉 token 与文本 token 的语义对齐在长视频中更难维持,随时序增长,历史帧的视觉信息与当前查询的相关性逐渐退化。
Maintaining semantic alignment between visual and text tokens is harder in long videos; as the sequence grows, the relevance of historical frames' visual information to the current query gradually degrades.
首次将 BERT 自监督预训练迁移至视频-文本联合学习;将视频离散化为视觉 token,与文字 token 交织训练。奠定了视频-语言预训练范式,长视频研究的远祖。
First application of BERT-style self-supervised pretraining to joint video-text learning; discretized video into visual tokens interleaved with text tokens. Laid the foundation of the video-language pretraining paradigm and is a distant ancestor of long-video research.
稀疏注意力机制(滑动窗口 + 全局 token)被引入视频理解,将 O(n²) 复杂度降为 O(n),首次让长序列视频 Transformer 变得可行。
Sparse attention (sliding window + global tokens) introduced to video understanding, reducing O(n²) to O(n) complexity. Made long-sequence video Transformers feasible for the first time.
冻结 LLM 权重,仅通过可学习视觉适配层将视频帧注入语言模型,验证了 LLM 作为视频推理引擎的潜力,奠定"冻结 LLM + 视觉桥接"范式。
Froze LLM weights, injecting video frames via learnable visual adapters. Validated LLMs as video reasoning engines; established the "frozen LLM + visual bridge" paradigm.
首批基于对话的视频多模态模型。VideoChat 引入事件记忆(Event Memory),将视频语义压缩为文字摘要输入 LLM;Video-ChatGPT 完成大规模视频指令调优数据集构建。
First conversational video multimodal models. VideoChat introduced event memory (compressing video semantics into text summaries for LLM). Video-ChatGPT built large-scale video instruction tuning datasets.
以 Atkinson-Shiffrin 认知记忆模型(短时记忆 + 长时记忆)为框架,设计 Dense Token → Sparse Memory 的两阶段压缩,首次支持 10K+ 帧长视频理解,配套发布 MovieChat-1K 基准。
Built on the Atkinson-Shiffrin cognitive memory model (short-term + long-term memory), with a dense-token → sparse-memory two-stage compression design. First to support long video understanding at 10K+ frames. Released the MovieChat-1K benchmark.
TimeChat 引入时间感知帧编码器(Timestamp-aware),每帧附加绝对时间戳 token,使模型具备时序定位能力。LLaMA-VID 将每帧压缩至 2 个 token(上下文 + 内容),使百分钟视频成为可能。
TimeChat introduced timestamp-aware frame encoding, appending an absolute timestamp token to each frame for temporal grounding. LLaMA-VID compressed each frame to 2 tokens (context + content), making hundred-minute-scale videos tractable.
VideoAgent 构建结构化记忆(时序事件描述 + 对象追踪状态),迭代召回相关片段回答问题,开创"基于外部记忆的视频 Agent"范式。VideoAgent-Long 专注小时级长视频,用 GPT-4V 逐步构建语义记忆库。
VideoAgent built structured memory (temporal event descriptions + object tracking states), iteratively retrieving relevant clips to answer questions. Pioneered the "external-memory video agent" paradigm. VideoAgent-Long targeted hour-long videos using GPT-4V to progressively build a semantic memory store.
Gemini 1.5 Pro 以 100 万 token 超长上下文窗口实现端到端长视频理解,无需外部记忆模块即可处理 1 小时以上视频。引发"暴力扩展上下文 vs. 精细记忆设计"的学术争论,在 Video-MME 基准上全面领先 GPT-4o。
Gemini 1.5 Pro's 1M-token context window enabled end-to-end long video understanding without external memory, processing 1+ hour videos. Sparked debate on "brute-force context scaling vs. fine-grained memory design." Outperformed GPT-4o on Video-MME benchmark.
MA-LMM 设计可插拔长期记忆库(Long-term Memory Bank),在线压缩历史视觉 token,支持视频描述/预测/问答多任务,是记忆增强视频 LMM 的标准基线。MovieChat+ 改进为问题感知稀疏记忆(Question-aware),提升检索相关性。
MA-LMM designed a plug-in long-term memory bank that compresses historical visual tokens online, supporting video captioning, prediction, and QA; it is a standard baseline for memory-augmented Video LMMs. MovieChat+ upgraded to question-aware sparse memory, improving retrieval relevance.
设计层次化事件记忆机制,将视频切分为多粒度事件(片段→事件→故事弧),每层独立压缩后跨层级检索,缓解平坦记忆结构的信息稀释问题。
Designed hierarchical event-based memory, segmenting videos at multiple granularities (clips → events → story arcs), compressing each level independently and enabling cross-level retrieval to alleviate flat-memory information dilution.
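A generic sketch of the multi-granularity idea (the grouping size, mean-pooling summary, and top-down search below are illustrative assumptions, not HEM-LLM's actual design):

```python
import numpy as np

def build_hierarchy(clip_feats: np.ndarray, group: int = 4) -> list[np.ndarray]:
    """Each level pools `group` consecutive units of the level below."""
    levels = [clip_feats]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        pad = (-len(cur)) % group
        if pad:
            cur = np.concatenate([cur, np.repeat(cur[-1:], pad, axis=0)])
        levels.append(cur.reshape(-1, group, cur.shape[-1]).mean(axis=1))
    return levels                      # levels[0]=clips, [1]=events, [2]=story arcs, ...

def coarse_to_fine(levels: list[np.ndarray], query: np.ndarray, group: int = 4) -> int:
    """Descend from the coarsest informative level, narrowing to one clip."""
    idx = 0
    for lvl in reversed(levels[:-1]):  # skip the single-node top, walk down the tree
        lo = idx * group
        cand = lvl[lo:lo + group]
        idx = lo + int(np.argmax(cand @ query))
    return idx                         # index of the retrieved clip

clips = np.random.default_rng(0).normal(size=(64, 16))
levels = build_hierarchy(clips)
print([lvl.shape for lvl in levels], "retrieved clip:", coarse_to_fine(levels, clips[42]))
```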
LLaVA-Video 通过大规模合成视频指令数据(1.6M 视频-文本对)显著提升长视频理解,成为开源视频 LMM 的新基线。VideoStreaming 提出固定 token 预算下的流式压缩,常数计算量处理任意长度视频。
LLaVA-Video significantly improved long video understanding via large-scale synthetic instruction data (1.6M video-text pairs), becoming the new open-source Video LMM baseline. VideoStreaming proposed streaming compression with fixed token budget, processing arbitrary-length video with constant compute.
两阶段框架:阶段一维护动态记忆(按问题指令更新),阶段二在记忆 + 关键帧上进行推理。引入"Instructed Learnable Memory"——记忆内容由用户查询驱动动态调整,而非固定压缩规则。
Two-stage framework: Stage 1 maintains dynamic memory (updated by query instruction); Stage 2 reasons over memory + key frames. Introduced "Instructed Learnable Memory" — memory content is dynamically adjusted by user queries rather than fixed compression rules.
引入循环记忆桥接(Recurrent Memory Bridges)和时序记忆 token(Temporal Memory Tokens),在保持流式处理的同时维持跨片段的语义连贯性,以极低的 token 开销支持长程推理。
Introduced Recurrent Memory Bridges and Temporal Memory Tokens, maintaining semantic coherence across segments while streaming and supporting long-range reasoning at very low token overhead.
专攻超长视频(小时级以上),引入自适应记忆更新策略,根据语义重要性动态决定哪些帧应进入长期记忆,哪些可丢弃,有效解决超长时序的信息保留问题。
Specialized for ultra-long (hour-plus) video; introduced an adaptive memory update strategy that dynamically decides, based on semantic importance, which frames enter long-term memory and which are discarded, effectively addressing information retention over ultra-long time spans.
Vgent 将检索增强生成(RAG)引入视频 Agent,构建基于图结构的视频知识库,通过多跳检索回答复杂跨段问题。图节点=场景/对象,边=时序/因果关系,突破了线性时序记忆的局限。
Vgent brought Retrieval-Augmented Generation (RAG) to video agents, constructing a graph-structured video knowledge base with multi-hop retrieval for complex cross-segment questions. Graph nodes = scenes/objects; edges = temporal/causal relations. Overcame limitations of linear temporal memory.
提出固定记忆窗口(Fixed Memory Window)加速长视频理解,相比动态记忆方案显著降低延迟,在多个基准上以极小精度损失换取大幅推理加速。
Proposed a Fixed Memory Window to accelerate long video understanding, delivering significantly lower latency than dynamic memory schemes and trading minimal accuracy loss for large inference speedups on multiple benchmarks.
以"事件"为基本单元构建情景记忆(Episodic Memory),克服传统基于帧/片段的碎片化检索问题,强化叙事主线的完整性,在长视频问答中实现更连贯的时序证据链。
Built episodic memory with "events" as the fundamental unit, overcoming the fragmented retrieval of frame/clip-based approaches, preserving the integrity of the narrative arc, and producing more coherent temporal evidence chains in long video QA.
受人类回忆"由粗到精"过程启发,设计深度记忆回溯框架:先粗粒度定位大致时段,再细粒度回溯精确片段,避免全量扫描的计算浪费,同时保证关键细节不丢失。
Inspired by human coarse-to-fine recall, designed a deep memory backtracking framework: coarse-grained temporal localization followed by fine-grained segment backtracking, avoiding full-scan compute waste while retaining critical details.
无需训练的两阶段插件式框架,同时解决 LLM 前(视觉 token 冗余)和 LLM 后(KV Cache 膨胀)两个瓶颈;因果时序缩减 + 语义 token 合并;预测延迟稳定、精度与 SOTA 持平。
Training-free, plug-and-play two-stage framework addressing both the pre-LLM bottleneck (visual token redundancy) and the post-LLM bottleneck (KV cache inflation): causal temporal reduction plus semantic token merging, with stable, predictable latency and accuracy on par with SOTA.
核心思路:从长视频中均匀或自适应地抽取 N 帧(通常 8–64),将所有帧 token 拼接送入模型。优点是实现简单;缺点是面对小时级视频时采样率极低,高密度事件段信息丢失严重。
Core idea: uniformly or adaptively sample N frames (typically 8–64) from the long video and concatenate all frame tokens as model input. Simple to implement, but hour-long videos are severely undersampled, losing information in event-dense segments.
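A minimal sketch of the sampling step (index selection only; frame decoding and visual encoding with e.g. decord/ffmpeg and a ViT are omitted):

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int = 32) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across the whole video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# ~1 hour of video at 25 fps: 90,000 frames, of which only 32 are kept.
idx = uniform_sample_indices(num_frames=90_000, num_samples=32)
print(idx[:4], "...", idx[-1])
# Adjacent samples are ~116 s apart, so any event shorter than ~2 minutes
# can fall entirely between two sampled frames.
```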
视频-文本对比学习,均匀采样16帧,奠定基础采样基线。
Video-text contrastive learning with 16-frame uniform sampling; established the basic sampling baseline.
将 BLIP-2 的图像理解扩展至视频,帧级 Q-Former 提取视觉特征后注入 LLaMA。
Extended BLIP-2 image understanding to video with a frame-level Q-Former injecting features into LLaMA.
极端压缩:每帧仅保留 2 token(上下文+内容),首次使百分钟级视频处理成为可能。
Extreme compression: 2 tokens per frame (context + content). First to make hundred-minute-scale video processing feasible.
探索多种视频 token 合并策略(均匀/区域集中/时序),在长视频分类上以低算力匹配高算力模型。
Explored video token merging strategies (uniform/region-concentrated/temporal) for long video classification, matching high-compute models at low cost.
1.6M 合成视频指令数据 + 高质量采样策略,成为开源视频 LMM 的新 SOTA 基线。
1.6M synthetic video instruction data + high-quality sampling; new open-source Video LMM SOTA baseline.
动态 token 合并:层次帧选择 + 二部图 token 合并,零样本自适应压缩,平衡关键帧保留与冗余消除。
Dynamic token merging: hierarchical frame selection + bipartite token merging; zero-shot adaptive compression balancing key frame retention and redundancy removal.
核心思路:将"已处理"的视频帧信息压缩存入外部/内部记忆库,推理时按需检索或直接读取。分为短时缓冲(近 K 帧精细 token)+ 长期压缩存储(历史帧蒸馏后的稀疏表示)。受人类认知记忆模型启发。
Core idea: compress already-processed frame information into external/internal memory banks and retrieve or read it on demand at inference time. Typically combines a short-term buffer (fine-grained tokens of the most recent K frames) with long-term compressed storage (sparse, distilled representations of older frames). Inspired by human cognitive memory models.
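A minimal sketch of the short-term buffer plus long-term consolidation pattern (the capacities, mean-pooled summaries, and greedy merge rule are assumptions for illustration; no specific paper's algorithm is implied):

```python
from collections import deque
import numpy as np

class VideoMemory:
    """Short-term buffer of recent frames + compressed long-term store."""

    def __init__(self, short_capacity: int = 16, long_capacity: int = 64):
        self.short = deque(maxlen=short_capacity)   # recent frames, full token sets
        self.long: list[np.ndarray] = []            # one pooled vector per old frame
        self.long_capacity = long_capacity

    def add_frame(self, frame_tokens: np.ndarray) -> None:
        """frame_tokens: (num_tokens, dim) array for one new frame."""
        if len(self.short) == self.short.maxlen:
            # The oldest frame is about to fall out of the buffer:
            # keep only a pooled summary of it in long-term memory.
            self._consolidate(self.short[0].mean(axis=0))
        self.short.append(frame_tokens)

    def _consolidate(self, summary: np.ndarray) -> None:
        self.long.append(summary)
        if len(self.long) > self.long_capacity:
            # Greedily merge the two most similar adjacent summaries.
            sims = [float(a @ b) for a, b in zip(self.long[:-1], self.long[1:])]
            i = int(np.argmax(sims))
            self.long[i] = (self.long[i] + self.long[i + 1]) / 2
            del self.long[i + 1]

    def read(self) -> np.ndarray:
        """Tokens handed to the LLM: compressed history + detailed recent frames."""
        recent = [tok for frame in self.short for tok in frame]
        return np.stack(self.long + recent)

mem = VideoMemory()
rng = np.random.default_rng(0)
for _ in range(500):                       # stream 500 frames of 32 tokens x 64 dims
    mem.add_frame(rng.normal(size=(32, 64)))
print(mem.read().shape)                    # bounded: (64 + 16*32, 64) = (576, 64)
```

The point of the sketch is that the token budget handed to the LLM stays bounded no matter how many frames have streamed in; query-aware variants replace the fixed merge rule with a policy conditioned on the user's question.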
Atkinson-Shiffrin 模型:短时 token + 长期稀疏记忆,10K 帧突破。
Atkinson-Shiffrin model: short-term tokens + long-term sparse memory; 10K frame breakthrough.
即插即用记忆库,在线 token 压缩,支持多任务;标准基线。
Plug-in memory bank with online token compression supporting multi-task; standard baseline.
问题感知稀疏记忆,检索相关性大幅提升。
Question-aware sparse memory with significantly improved retrieval relevance.
层次化事件记忆,多粒度分层压缩,跨级检索。
Hierarchical event memory with multi-granularity layered compression and cross-level retrieval.
指令驱动可学习记忆,两阶段(建记忆→推理)框架,时序保真。
Instruction-driven learnable memory in a two-stage (build memory → reason) framework with temporal fidelity.
循环记忆桥接 + 时序记忆 token,流式高效,长程连贯。
Recurrent memory bridges + temporal memory tokens; streaming-efficient with long-range coherence.
超长视频自适应记忆,语义重要性动态决策,小时级以上视频。
Adaptive memory for ultra-long video; semantic importance-based dynamic decision-making for hour+ video.
事件中心情景记忆,叙事完整性强化,跨段推理连贯。
Event-centric episodic memory reinforcing narrative completeness and cross-segment reasoning coherence.
深度记忆回溯,由粗到精,仿人类回忆过程。
Deep memory backtracking; coarse-to-fine, mimicking human recollection process.
核心思路:通过扩展模型原生上下文窗口(RoPE 外推/YaRN/位置插值等),让模型直接接受百万级 token 的视频序列,无需外部记忆模块。代表:Gemini 1.5(1M)、GPT-4o(128K)。优点是架构简洁,端到端推理;缺点是推理计算量随上下文平方增长,成本极高。
Core idea: extend the model's native context window (RoPE extrapolation, YaRN, position interpolation, etc.) so it can accept million-token video sequences directly, without an external memory module. Examples: Gemini 1.5 (1M), GPT-4o (128K). Advantage: simple, end-to-end architecture; drawback: inference compute grows quadratically with context length, making it extremely expensive.
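A minimal numeric sketch of position interpolation for RoPE, one of the window-extension tricks mentioned above (the dimensions and lengths are illustrative, and real systems combine this with other scaling schemes such as YaRN):

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 8, base: float = 10_000.0) -> np.ndarray:
    """Rotation angles used by rotary position embeddings: shape (seq_len, dim/2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

trained_len, target_len = 8_192, 1_000_000
positions = np.arange(target_len, dtype=np.float64)

# Position interpolation: squeeze the longer sequence's positions back into the
# range seen during training, so the rotary angles stay in-distribution.
scaled = positions * (trained_len / target_len)
angles = rope_angles(scaled)
print(angles.shape, float(angles[-1, 0]))   # last position maps to ~8191.99 on the slowest frequency
# Caveat: even with an extended window, attention compute still grows roughly
# quadratically with the number of visual tokens actually fed to the model.
```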
1M token 上下文,原生处理整部电影;Video-MME 基准全面领先 GPT-4o。
1M token context, natively processes entire movies. Outperformed GPT-4o on Video-MME benchmark across all categories.
128K 上下文,多模态原生支持,视频理解能力强但上下文仍有限。
128K context, native multimodal support; strong video understanding but context still limited.
长上下文 LLM(224K)用于高分辨率长视频,无需专项记忆模块。
Long-context LLM (224K) for high-resolution long video, requiring no dedicated memory module.
固定记忆窗口加速方案,用小计算量近似长上下文效果。
Fixed memory window acceleration approximating long-context performance at reduced compute.
核心思路:将长视频预处理为索引(文本描述/视觉向量),问答时动态检索相关片段再交给 LLM 推理,避免全量处理。优点是可扩展性强、支持超长视频;缺点是检索错误会级联影响推理质量。
Core idea: preprocess the long video into an index (text descriptions/visual vectors), dynamically retrieve relevant clips at question time, and hand them to an LLM for reasoning, avoiding full-video processing. Highly scalable and supports ultra-long video; however, retrieval errors cascade and degrade reasoning quality.
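A minimal sketch of the index-then-retrieve pattern (the hash-based embed function and the clip captions are stand-ins; a real system would use a text/vision encoder and pass the retrieved clips to an MLLM):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would use a learned encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Offline: caption (or encode) each 10-second clip once and build the index.
clips = [f"clip_{i:04d}: caption for seconds {i*10}-{i*10+10}" for i in range(360)]
index = np.stack([embed(c) for c in clips])

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Online: score every clip against the question and return the best few."""
    scores = index @ embed(question)            # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:top_k]
    return [clips[i] for i in best]             # these clips then go into the LLM prompt

print(retrieve("when does the chase scene start?"))
```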
结构化事件记忆 + 对象状态追踪,迭代检索-推理循环。
Structured event memory + object tracking states; iterative retrieve-reason loop.
GPT-4V 辅助构建语义记忆库,专攻小时级长视频问答。
GPT-4V-assisted semantic memory store construction for hour-long video QA.
专为自我中心视频设计的 RAG,动作/对象检索融合,EgoSchema 基准提升。
RAG designed for egocentric video, fusing action/object retrieval; improved EgoSchema benchmark results.
图结构知识库 + 多跳检索,突破线性时序记忆局限,复杂推理大幅提升。
Graph-structured knowledge base + multi-hop retrieval; breaks linear temporal memory limitations with major complex reasoning gains.
核心思路:将视频理解包装为多步 Agent 循环(计划→工具调用→记忆更新→回答),支持自适应采样、多工具协作(OCR/ASR/对象检测/字幕生成)。最适合需要多步推理的复杂问答。
Core idea: wrap video understanding as a multi-step agent loop (plan → tool call → memory update → answer), supporting adaptive sampling and multi-tool collaboration (OCR/ASR/object detection/captioning). Best for complex QA requiring multi-step reasoning.
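A bare-bones sketch of the plan → tool call → memory update → answer loop (the planner heuristic and tool stubs are invented for illustration; no particular agent framework's API is implied):

```python
def plan(question: str, memory: list[str]) -> str:
    """Toy planner: use OCR for text questions, gather a few observations, then answer."""
    if any(w in question.lower() for w in ("text", "sign", "subtitle")) and len(memory) < 2:
        return "ocr"
    return "answer" if len(memory) >= 3 else "caption"

TOOLS = {
    "caption": lambda q: "caption of a segment relevant to: " + q,   # stand-in for an MLLM captioner
    "ocr":     lambda q: "on-screen text extracted for: " + q,       # stand-in for an OCR tool
}

def run_agent(question: str, max_steps: int = 6) -> str:
    memory: list[str] = []                      # shared scratchpad updated each step
    for _ in range(max_steps):
        action = plan(question, memory)         # 1) plan
        if action == "answer":                  # 4) answer once enough evidence is gathered
            return f"answer synthesized from {len(memory)} observations"
        memory.append(TOOLS[action](question))  # 2) tool call + 3) memory update
    return "answer (step budget exhausted)"

print(run_agent("what does the sign at the end say?"))
```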
记忆增强 Agent,结构化记忆库,LLM 驱动推理循环。
Memory-augmented agent with structured memory store; LLM-driven reasoning loop.
多轮动态协作 MLLM,多 Agent 互补理解,长视频复杂推理新 SOTA。
Multi-round dynamic collaboration of MLLMs; complementary multi-agent understanding; new SOTA for complex long video reasoning.
强化学习驱动 token 压缩 Agent,动态决策哪些 token 保留,性能与效率双优。
RL-driven token compression agent that dynamically decides which tokens to retain; achieves both performance and efficiency.
图增强视频 Agent,知识图谱 + RAG,复杂因果推理。
Graph-augmented video agent; knowledge graph + RAG for complex causal reasoning.
| Benchmark | 视频时长 / Duration | 规模 / Scale | 任务类型 / Task Type | 发布时间 / Year | 特点 | Highlights |
|---|---|---|---|---|---|---|
| EgoSchema | 3 min avg | 5,000 QA | Video QA | 2023 | 自我中心视频,多项选择,长时推理 | Egocentric, multiple-choice, long-range reasoning |
| Video-MME | 11s–1h | 900 videos, 2,700 QA | Comprehensive | 2024 | 首个全面视频 MLLM 评估基准,短/中/长三档,Gemini 1.5 Pro 登顶 | First comprehensive video MLLM benchmark; short/medium/long tiers; Gemini 1.5 Pro topped the leaderboard |
| LongVideoBench | 15s–60 min | 6,678 QA (17 categories) | Interleaved | 2024 | 视频-语言交织输入,17类问题,最全面长视频 QA 基准之一 | Video-language interleaved input; 17 QA categories; one of the most comprehensive long video QA benchmarks |
| LVBench | 1h+ | 超长视频集 / Ultra-long video set | Extreme Long | 2024 | 专为超长视频设计,平均时长 1 小时以上,当前模型在此表现仍差 | Designed for extreme long video (avg 1h+); current models still struggle significantly |
| MovieChat-1K | 电影级 / Movie-length | 1K videos, 2K grounding | Grounding | 2024 | 配套 MovieChat 发布,包含时序定位标注,首个10K帧级评测集 | Released with MovieChat; includes temporal grounding annotations; first 10K-frame-scale evaluation set |
| ActivityNet-QA | 3 min avg | 58,000 QA | Open-Ended QA | 2019 | 开放式问答,覆盖日常活动,长视频理解早期经典基准 | Open-ended QA on daily activities; classic early long video benchmark |
| How2QA / How2R | 90s avg | 22K QA | Retrieval + QA | 2020 | 多语言教育视频,检索+问答双任务,测试跨段推理 | Multilingual educational videos; dual retrieval + QA tasks; tests cross-segment reasoning |
| Ego4D-NLQ | 小时级 / Hour-long | 15K 查询 / 15K queries | Temporal NLQ | 2022 | 自我中心视频自然语言查询时序定位,Meta+UC Berkeley 联合构建 | Natural-language-query temporal localization in egocentric video; jointly built by Meta + UC Berkeley |
| MLVU | 3–120 min | 2,593 QA | Multi-task | 2024 | 多维度长视频评测(叙事/推理/识别),Needle-in-a-Haystack 专项 | Multi-dimensional long video evaluation (narrative/reasoning/recognition); includes a Needle-in-a-Haystack task |
| 方法 / Method | 记忆类型 | Memory Type | 小时级 / Hour+ | 流式 / Streaming | 问题感知 / Query-Aware | 时序定位 / Temporal Grounding | 多步推理 / Multi-step | 无需训练 / Training-Free |
|---|---|---|---|---|---|---|---|---|
| MovieChat (CVPR 2024) | 短+长期稀疏 | Short+Long Sparse | ✓ | ✗ | ✗ | △ | ✗ | ✗ |
| MA-LMM (CVPR 2024) | 在线记忆库 | Online Memory Bank | ✓ | △ | ✗ | △ | ✗ | ✗ |
| ReWind (CVPR 2025) | 指令驱动可学习 | Instruction-Learnable | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| VideoLLaMB (ICCV 2025) | 循环记忆桥接 | Recurrent Memory Bridge | ✓ | ✓ | ✗ | △ | ✗ | ✗ |
| VideoAgent (ECCV 2024) | 结构化检索记忆 | Structured Retrieval | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| Gemini 1.5 Pro | 1M 上下文窗口 | 1M Context Window | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| StreamingTOM (2025) | 流式 Token 压缩 | Streaming Token Compression | ✓ | ✓ | ✗ | △ | ✗ | ✓ |
| VideoMem (2024) | 自适应记忆 | Adaptive Memory | ✓ | △ | ✓ | ✓ | ✗ | ✗ |
| Vgent (NeurIPS 2025) | 图结构记忆 + RAG | Graph Memory + RAG | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| LVAgent (ICCV 2025) | 多 Agent 协作 | Multi-Agent Collaboration | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
✓ 支持 / Supported · △ 部分支持 / Partial · ✗ 不支持 / Not Supported
早期方案(MovieChat/MA-LMM)采用固定压缩规则(FIFO、平均池化);2024年后转向问题感知、指令驱动的动态记忆(ReWind、MovieChat+);趋势终点是完全端到端可学习的记忆读写机制,记忆内容由模型自主决策。
Early approaches (MovieChat/MA-LMM) used fixed compression rules (FIFO, average pooling); after 2024 the field shifted to query-aware, instruction-driven dynamic memory (ReWind, MovieChat+); the end goal is a fully end-to-end learnable memory read/write mechanism, with memory content decided autonomously by the model.
随着视频 LMM 进入工业落地,推理成本成为瓶颈。从简单均匀采样→基于注意力的 token 合并→训练无关的流式压缩(StreamingTOM)→RL 驱动的动态压缩(MARC)。目标:在固定 token 预算下最大化信息密度。
As Video LMMs move into production, inference cost has become the bottleneck. The evolution: simple uniform sampling → attention-based token merging → training-free streaming compression (StreamingTOM) → RL-driven dynamic compression (MARC). Goal: maximize information density under a fixed token budget.
Gemini 1.5 Pro 的 1M 上下文引发"是否需要专用记忆模块"的争论。学界认为:长上下文适合短时高密度推理,结构化记忆适合超长视频(数小时)的高效检索与低成本部署。两者将长期共存,分别占据不同应用场景。
Gemini 1.5 Pro's 1M-token context sparked debate over whether dedicated memory modules are still needed. The prevailing view: long context suits short, high-density reasoning, while structured memory suits ultra-long (multi-hour) video where efficient retrieval and low-cost deployment matter. The two will coexist across different use cases.
线性时序记忆难以表达复杂的因果、空间和实体关系。HEM-LLM 引入层次化事件记忆;Vgent 引入图结构记忆,支持多跳推理。预计未来视频知识图谱(Video KG)将成为长视频理解的基础设施。
Linear temporal memory struggles to capture complex causal, spatial, and entity relations. HEM-LLM introduced hierarchical event memory; Vgent introduced graph-structured memory supporting multi-hop reasoning. Video knowledge graphs (Video KGs) are expected to become core infrastructure for long video understanding.
单一 MLLM 的能力上限难以覆盖长视频所有任务(OCR/ASR/对象追踪/时序推理)。LVAgent、VideoAgent 等工作转向多 Agent 分工协作,每个 Agent 专精子任务,通过共享记忆库和协调机制融合输出,显著提升复杂长视频问答性能。
A single MLLM cannot cover all long-video tasks (OCR/ASR/object tracking/temporal reasoning). Works such as LVAgent and VideoAgent shift to multi-agent division of labor: each agent specializes in a subtask, and outputs are fused through a shared memory store and a coordination mechanism, significantly improving complex long video QA.
早期工作均为离线处理(等待全部帧)。直播/监控场景需要实时感知与记忆更新。VideoStreaming、StreamingTOM、VideoLLaMB 等引入因果约束下的流式记忆更新,在实时推理效率上取得突破。体现了长视频理解从"学术玩具"走向"工业落地"的核心挑战。
Early work was entirely offline (waiting for all frames before processing). Live streaming and surveillance require real-time perception and memory updates. VideoStreaming, StreamingTOM, and VideoLLaMB introduced streaming memory updates under causal constraints, achieving breakthroughs in real-time inference efficiency. This reflects the core challenge of moving long video understanding from "academic toy" to "industrial deployment."
LVBench 显示当前最优模型在 1 小时以上视频上准确率仍不足 50%,核心瓶颈是超长时序依赖建模与关键帧定位。
LVBench shows current best models achieve <50% accuracy on 1h+ videos. Core bottleneck: ultra-long temporal dependency modeling and key frame localization.
"模型记住了什么"目前几乎无法解释。可解释性记忆(类似注意力可视化)是提升用户信任和模型 Debug 的关键研究方向。
"What did the model remember" is currently nearly uninterpretable. Interpretable memory (analogous to attention visualization) is key for user trust and model debugging.
绝大多数基准与模型以英文为主。多语言(特别是低资源语言)长视频理解数据极度匮乏,是重要的未探索空白。
Most benchmarks and models are English-centric. Multilingual (especially low-resource) long video understanding data is severely lacking — an important unexplored gap.
具身 Agent 需要长时序视频记忆来完成多步任务("一小时前我把工具放在哪里")。长视频记忆与 VLA/WM 的深度融合是下一个重大方向。
Embodied agents need long-horizon video memory for multi-step tasks ("where did I put the tool an hour ago"). Deep integration of long video memory with VLA (vision-language-action) models and world models (WM) is the next major direction.
监控/直播场景需要端到端延迟 <200ms 的流式推理,而目前最好的模型仍需秒级甚至分钟级推理时间,工业落地差距巨大。
Surveillance/live-streaming requires <200ms end-to-end latency for streaming inference, but current best models still need seconds to minutes — a huge gap for industrial deployment.
目前各基准测试场景割裂(EgoSchema/Video-MME/LVBench),缺少统一的跨任务、跨时长、跨场景的长视频理解综合评测体系。
Current benchmarks are fragmented (EgoSchema/Video-MME/LVBench). A unified cross-task, cross-duration, cross-domain long video understanding evaluation framework is urgently needed.