Semantic Generative Tuning for Unified Multimodal Models

The first systematic investigation into generative post-training for UMMs — bridging visual understanding and generation through high-level semantic proxies.

Songsong Yu¹,², Yuxin Chen², Ying Shan², Yanwei Li¹,✉

¹Shanghai Jiao Tong University ²Tencent ARC Lab

+6.02% CV-Bench gain (BAGEL baseline)

190k SGT training samples (SAM)

2 UMM architectures validated

Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement.

This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity.

Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity, achieving a 6.02% gain on CV-Bench over BAGEL and a 90.0% score on GenEval.

Motivation

Why Existing UMMs Fall Short

We compare three paradigms for training unified multimodal models and pinpoint the critical gap.

Figure 1: Comparison of alignment strategies for UMMs. (a) Traditional UMMs optimize understanding and generation separately, resulting in low synergy. (b) Recent pixel-level attempts over-focus on high-frequency details. (c) Our proposed SGT achieves semantic-level alignment, enabling true synergy between understanding and generation.

Traditional UMMs

Understanding and generation are optimized independently with divergent supervisory signals, resulting in misaligned representations and no synergy between the two capabilities.

Misaligned

Pixel-Level Alignment

Recent methods use pixel-space reconstruction as a proxy. While yielding some gains, this over-emphasizes textures and distracts from semantic reasoning.

Suboptimal

Semantic Generative Tuning

SGT leverages image segmentation as a generative proxy — high-level semantic structure that naturally bridges understanding and generation through shared semantic space.

Optimal ✓

Method

Semantic Generative Tuning (SGT)

SGT formulates image segmentation as a generative post-training objective within UMMs.

The framework is architecture-agnostic and validated across two fundamentally different UMM designs.

Figure 2: Overview of the SGT generative tuning paradigm. An RGB image and a textual instruction are processed by vision and text encoders. Because empirical evaluations demonstrate that high-level visual generation targets yield the most significant gains, SGT explicitly adopts image segmentation as its generative objective.

Step 01

Hierarchical Taxonomy

Systematically evaluate low/mid/high-level visual tasks (edge → depth → segmentation) as generative proxies.

Step 02

Empirical Discovery

Three key observations identify segmentation as the best proxy: it outperforms pixel reconstruction on most understanding benchmarks.

Step 03

SGT Training

Train on 190k segmentation samples from the SAM dataset, mixed with LLaVA-OneVision SFT data at the optimal 2:1 (Seg:VQA) batch ratio.

Step 04

Synergized UMM

Improved feature linear separability + optimized visual-textual attention → consistent gains on both understanding and generation.

BAGEL (7B + 7B)

Highly native design. Mixture of Transformers with native interleaved training across understanding and generation.
Shared attention. Understanding and generation streams interact through shared attention, enabling deep cross-modal fusion.
Larger scale. Dual-7B parameter capacity provides strong representational power for SGT integration.

Large-scale UMM

OmniGen2 (3B + 4B)

Frozen VLM. The pretrained VLM is kept frozen; only the diffusion head is trained.
Feature sharing. Hidden states are shared as semantic guidance, bridging understanding and generation in series.
Lightweight parameters. Compact 3B + 4B configuration delivers efficient training and inference.

Efficient UMM

Key Findings

Three Empirical Observations

Our hierarchical task study across BAGEL and OmniGen2 reveals consistent patterns guiding SGT's design.

Fig. 3a: Understanding capability gains across proxy task levels on BAGEL and OmniGen2. Segmentation (high-level) consistently yields the largest gains.

Fig. 3b: Generation capability gains. All proxy tasks improve position-aware generation.

Observation 01

High-level semantic tasks dominate

Image segmentation outperforms mid-level (depth estimation) and low-level (edge detection) tasks on nearly all understanding benchmarks. High-level supervision aligns with perception demands, while texture-focused tasks cause overfitting to irrelevant details.

High-level wins

Observation 02

Visual supervision enhances perception, not reasoning

Generative tuning fortifies visual perception (vision-centric tasks, spatial reasoning, hallucination resistance) while chart/math reasoning remains static. Visual supervision enhances representation quality but does not impart logical priors.

Perception only

Observation 03

Spatial fidelity improves universally

Regardless of semantic granularity, all proxy tasks consistently improve generative spatial fidelity — especially for position-aware tasks. Reconstructing visual structure forces accurate spatial layouts, naturally boosting positional prompt adherence.

Universal gain

Experiments

Comparison with State-of-the-Art

SGT-tuned models consistently outperform their baselines and compare favorably with competitive UMMs across understanding and generation benchmarks.

Model	Params	Visual Understanding						Visual Generation
Model	Params	MMVP↑	VSR↑	Hallu.↑	MMStar↑	RWQA↑	MathV.↑	GenEval↑	GEdit↑
Small-scale Models (≤ 4B)
Show-o 512	1.3B	50.00	54.26	46.06	–	38.17	–	68.0	✗
Harmon	1.5B	60.00	60.88	46.69	38.00	48.00	33.70	73.0	✗
UniLIP	2B	73.00	65.55	60.57	–	64.18	–	90.0	–
UniMRG	3.6B	74.67	73.90	64.56	–	66.01	–	55.8	✗
OpenUni	2B	71.67	66.69	60.88	–	65.23	–	51.0	–
OmniGen2	3B+4B	65.00	77.52	62.35	55.07	64.41	63.50	76.0	6.63
✦ SGT-Gen2 (Ours)	3B+4B	68.33	78.85	64.25	57.07	65.10	64.00	79.1	6.83
Large-scale Models (≥ 7B)
Chameleon	7B	50.00	–	31.13	28.93	39.00	21.90	39.0	✗
Janus-Pro	7B	63.00	71.03	60.15	46.80	41.83	42.60	80.0	✗
UniWorld-v1	7B+12B	77.67	83.34	68.35	63.90	67.58	68.20	84.0	4.85
BAGEL	7B+7B	83.00	80.45	68.34	67.46	71.26	73.10	88.0	6.64
✦ SGT-BAGEL (Ours)	7B+7B	83.33	81.54	70.24	68.33	72.42	73.90	90.0	6.94

Ablation Study

Effect of Proxy Task Choice

SGT yields the largest gains on most understanding benchmarks while achieving competitive generation performance.

Method	CV-Bench↑	MMVP↑	VSR↑	SIBench↑	POPE↑	Hallusion↑	GenEval↑	GEdit↑
Base: BAGEL
BAGEL (Base)	73.21	83.00	80.45	48.95	85.69	68.34	78.21	6.52
+ SFT only	74.61	82.67	80.69	49.34	86.77	67.92	77.18	6.49
+ SFT + Edge	74.56	83.67	80.83	49.51	86.48	68.66	79.96	6.72
+ SFT + Reconstruction	75.23	83.33	80.83	50.59	87.98	68.03	80.82	6.75
✦ + SFT + SGT (Ours)	79.23	83.33	81.54	50.18	88.32	70.24	80.95	6.94
Base: OmniGen2
OmniGen2 (Base)	65.94	65.00	77.52	43.29	85.97	62.35	76.58	6.63
+ SFT only	65.99	66.00	77.61	44.37	86.25	64.35	74.54	6.32
+ SFT + Edge	66.67	65.33	77.99	45.51	86.10	63.72	77.45	6.79
+ SFT + Reconstruction	66.71	66.33	78.18	45.41	85.92	65.19	77.53	6.81
✦ + SFT + SGT (Ours)	66.91	68.33	78.85	45.37	87.29	64.25	78.86	6.83
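
The gains attributed to SGT can be recomputed directly from the ablation table. The short Python sketch below copies the base and SGT rows for both backbones and reports the absolute difference per benchmark. The variable and function names are illustrative; the only data used are the table values themselves.

```python
# Scores copied from the ablation table above (base row vs. "+ SFT + SGT" row).
BENCHMARKS = ["CV-Bench", "MMVP", "VSR", "SIBench", "POPE", "Hallusion", "GenEval", "GEdit"]

SCORES = {
    "BAGEL": {
        "base": [73.21, 83.00, 80.45, 48.95, 85.69, 68.34, 78.21, 6.52],
        "sgt": [79.23, 83.33, 81.54, 50.18, 88.32, 70.24, 80.95, 6.94],
    },
    "OmniGen2": {
        "base": [65.94, 65.00, 77.52, 43.29, 85.97, 62.35, 76.58, 6.63],
        "sgt": [66.91, 68.33, 78.85, 45.37, 87.29, 64.25, 78.86, 6.83],
    },
}


def gains(model: str) -> dict:
    """Absolute score difference (SGT minus base) per benchmark, rounded to 2 decimals."""
    rows = SCORES[model]
    return {
        name: round(sgt - base, 2)
        for name, base, sgt in zip(BENCHMARKS, rows["base"], rows["sgt"])
    }


if __name__ == "__main__":
    for model in SCORES:
        print(model, gains(model))
```

For BAGEL, the CV-Bench entry reproduces the 6.02 figure cited in the abstract, which is an absolute difference in score (79.23 vs. 73.21).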

Mechanistic Analysis

Why Does SGT Work?

We probe feature distributions and attention dynamics to uncover the mechanisms behind SGT's improvements.

+6.1% CV-Bench gain during training with a 2:1 Seg:VQA ratio (accelerated convergence vs. the SFT-only baseline)

2:1 optimal Seg:VQA intra-batch ratio (validated on both BAGEL and OmniGen2)

+3.3% BAGEL aggregate gain when scaling from 2k to 100k segmentation samples (monotonic improvement with data scale)

Layers 25–27: deep layers where attention reallocation is most pronounced (segmentation shifts attention most effectively to visual tokens)

Fig. 7 (left): Visually confusable categories — Grand Piano vs. Upright Piano — used to evaluate feature discriminability.

Fig. 7 (right): t-SNE visualization. BAGEL baseline yields entangled features; SGT learns highly discriminative embeddings with clear class separation.
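
One simple way to put a number on "class separation" is to fit a linear probe on the features and measure held-out accuracy. The sketch below is an assumption-laden toy rather than the authors' protocol: it uses a nearest-class-mean probe (whose decision boundary between two classes is a hyperplane) on synthetic Gaussian features, with the two class labels borrowed from the piano example purely for illustration. All function names are hypothetical.

```python
import random


def class_means(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def nearest_mean_accuracy(train, test):
    """Fit class means on `train`, then score `test`; both are (features, labels) pairs."""
    means = class_means(*train)

    def predict(x):
        # Assign the class whose mean is closest in squared Euclidean distance.
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))

    feats, labels = test
    correct = sum(predict(x) == y for x, y in zip(feats, labels))
    return correct / len(labels)


def synthetic_features(gap, n=200, dim=8, seed=0):
    """Toy data: two unit-variance Gaussian classes whose means differ by `gap` on every axis."""
    rng = random.Random(seed)
    feats, labels = [], []
    for label, centre in (("grand_piano", 0.0), ("upright_piano", gap)):
        for _ in range(n):
            feats.append([rng.gauss(centre, 1.0) for _ in range(dim)])
            labels.append(label)
    return feats, labels


if __name__ == "__main__":
    for name, gap in (("entangled", 0.1), ("separated", 3.0)):
        acc = nearest_mean_accuracy(
            synthetic_features(gap, seed=0), synthetic_features(gap, seed=1)
        )
        print(f"{name}: probe accuracy = {acc:.3f}")
```

With real model features in place of the synthetic ones, a higher probe accuracy would indicate more linearly separable embeddings, which is the property the t-SNE plot illustrates qualitatively.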

Fig. 8a: Vision-language attention allocation change (%) across layers. Segmentation reallocates attention most strongly at deep layers (L25–L27), reflecting structural semantic influence.

Fig. 8b: Attention allocation to object/color/position/other categories. SGT increases visual token attention while reducing distractions.
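
To make the notion of "attention allocation" concrete, the following sketch computes the share of attention mass that lands on visual tokens in a single attention map, and the change in that share between two models. It is a minimal illustration under stated assumptions: the function names are hypothetical, the input is a plain list-of-lists of softmax weights, and the change is computed as a relative percentage, which may differ from the exact definition behind Fig. 8a.

```python
def visual_attention_share(attn, is_visual):
    """Fraction of total attention mass placed on visual key tokens.

    attn: list of rows, one per query token; each row holds non-negative
          weights over the key tokens (e.g. a softmax output).
    is_visual: list of booleans marking which key positions are image tokens.
    """
    if any(len(row) != len(is_visual) for row in attn):
        raise ValueError("every attention row must match the length of is_visual")
    visual_mass = sum(w for row in attn for w, v in zip(row, is_visual) if v)
    total_mass = sum(sum(row) for row in attn)
    return visual_mass / total_mass


def allocation_change_percent(before, after):
    """Relative change (%) in visual attention share (assumed definition)."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Toy 2-query x 3-key attention map; the first two keys are image tokens.
    toy_attn = [[0.5, 0.25, 0.25], [0.2, 0.2, 0.6]]
    share = visual_attention_share(toy_attn, [True, True, False])
    print(f"visual share: {share:.3f}")
```

Applied per layer to a baseline and a tuned model, this kind of statistic would yield a layer-wise curve comparable in spirit to the one plotted in Fig. 8a.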

Training Data

Data Recipe & Scalability

SGT is scalable — performance improves with more segmentation data, and a 2:1 Seg:VQA batch ratio is optimal.

Fig. 5a: Ablation on segmentation-to-VQA ratio. Both BAGEL and OmniGen2 achieve optimal performance at a 2:1 ratio (segmentation:VQA).

Fig. 5b: Data scalability. Performance improves consistently as segmentation data scales from 2k to 100k samples (BAGEL: +3.3%, OmniGen2: +2.0%).
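
The 2:1 intra-batch ratio can be enforced at the data-loading stage. The sketch below shows one possible way to build batches with a fixed segmentation:VQA composition. The helper name, the batch size, and the cycling policy for the smaller pool are all assumptions made for illustration and are not taken from the authors' code.

```python
import random
from itertools import cycle


def mixed_batches(seg_samples, vqa_samples, batch_size=6, seg_to_vqa=(2, 1), seed=0):
    """Yield batches whose segmentation:VQA composition follows `seg_to_vqa`.

    Hypothetical helper: each pool is shuffled once and then cycled, so the
    generator is infinite and the caller decides how many batches to draw.
    """
    seg_part, vqa_part = seg_to_vqa
    unit = seg_part + vqa_part
    if batch_size % unit != 0:
        raise ValueError("batch_size must be a multiple of the ratio sum")
    seg_pool, vqa_pool = list(seg_samples), list(vqa_samples)
    if not seg_pool or not vqa_pool:
        raise ValueError("both sample pools must be non-empty")

    n_seg = batch_size * seg_part // unit  # segmentation items per batch
    n_vqa = batch_size - n_seg             # VQA items per batch

    rng = random.Random(seed)
    rng.shuffle(seg_pool)
    rng.shuffle(vqa_pool)
    seg_iter, vqa_iter = cycle(seg_pool), cycle(vqa_pool)

    while True:
        batch = [("seg", next(seg_iter)) for _ in range(n_seg)]
        batch += [("vqa", next(vqa_iter)) for _ in range(n_vqa)]
        rng.shuffle(batch)  # avoid a fixed task order inside the batch
        yield batch


if __name__ == "__main__":
    loader = mixed_batches(range(10), range(5), batch_size=6)
    print(next(loader))
```

Fixing the ratio per batch, rather than relying on the pool sizes, keeps the task mix constant at every optimization step regardless of how many samples each source contributes.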

Training Data Breakdown (Total ≈ 691k)

Source	Samples
SGT Segmentation (SAM)	190k
General VQA	180k
Doc / Chart / Screen	103k
Math / Reasoning	101k
Language	72k
General OCR	45k
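
As a quick consistency check, the per-source counts can be summed and converted to percentage shares. The snippet below uses only the numbers listed in the breakdown; the dictionary and function names are illustrative.

```python
# Sample counts in thousands, copied from the training data breakdown.
DATA_MIX_K = {
    "SGT Segmentation (SAM)": 190,
    "General VQA": 180,
    "Doc / Chart / Screen": 103,
    "Math / Reasoning": 101,
    "Language": 72,
    "General OCR": 45,
}


def total_k(mix):
    """Total number of samples, in thousands."""
    return sum(mix.values())


def shares(mix):
    """Percentage share of each source, rounded to one decimal place."""
    total = total_k(mix)
    return {name: round(100.0 * count / total, 1) for name, count in mix.items()}


if __name__ == "__main__":
    print("total (k):", total_k(DATA_MIX_K))
    for name, pct in shares(DATA_MIX_K).items():
        print(f"{name}: {pct}%")
```

The counts sum to the stated total of roughly 691k. Note that the share of segmentation data in the overall pool is a separate quantity from the 2:1 Seg:VQA ratio, which is defined per batch.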

Citation

BibTeX

If you find SGT useful in your research, please consider citing our work.

@article{yu2026sgt,
  title     = {Semantic Generative Tuning for Unified Multimodal Models},
  author    = {Yu, Songsong and Chen, Yuxin and Shan, Ying and Li, Yanwei},
  journal   = {arXiv preprint arXiv:2605.18714},
  year      = {2026},
}