🖌️

Semantic Generative Tuning
for Unified Multimodal Models

The first systematic investigation into generative post-training for UMMs — bridging visual understanding and generation through high-level semantic proxies.

Songsong Yu¹,², Yuxin Chen², Ying Shan², Yanwei Li¹,✉

¹Shanghai Jiao Tong University  ²Tencent ARC Lab

6.02% CV-Bench gain (BAGEL baseline)
190k SGT training samples (SAM)
2 UMM architectures validated
Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement.

This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity.

Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity, achieving a 6.02% gain on CV-Bench over BAGEL and a 90.0% score on GenEval.

Why Existing UMMs Fall Short

We compare three paradigms for training unified multimodal models and pinpoint the critical gap.

Figure 1: Comparison of alignment strategies for UMMs. (a) Traditional UMMs optimize understanding and generation separately, resulting in low synergy. (b) Recent pixel-level attempts over-focus on high-frequency details. (c) Our proposed SGT achieves semantic-level alignment, enabling true synergy between understanding and generation.

Traditional UMMs

Understanding and generation are optimized independently with divergent supervisory signals, resulting in misaligned representations and no synergy between the two capabilities.

Misaligned

Pixel-Level Alignment

Recent methods use pixel-space reconstruction as a proxy. While yielding some gains, this over-emphasizes textures and distracts from semantic reasoning.

Suboptimal

Semantic Generative Tuning

SGT leverages image segmentation as a generative proxy — high-level semantic structure that naturally bridges understanding and generation through shared semantic space.

Optimal ✓
Semantic Generative Tuning (SGT)

SGT formulates image segmentation as a generative post-training objective within UMMs.

The framework is architecture-agnostic and validated across two fundamentally different UMM designs.

Figure 2: Overview of the SGT generative tuning paradigm. An RGB image and a textual instruction are processed by vision and text encoders. Because empirical evaluations demonstrate that high-level visual generation targets yield the most significant gains, SGT explicitly adopts image segmentation as its generative objective.
Step 01
Hierarchical Taxonomy
Systematically evaluate low/mid/high-level visual tasks (edge → depth → segmentation) as generative proxies.
Step 02
Empirical Discovery
Three key observations confirm segmentation as the optimal proxy: it outperforms pixel reconstruction on most understanding benchmarks.
Step 03
SGT Training
Train on 190k segmentation samples from the SAM dataset, mixed with LLaVA-OneVision SFT data at the optimal 2:1 (Seg:VQA) batch ratio.
Step 04
Synergized UMM
Improved feature linear separability + optimized visual-textual attention → consistent gains on both understanding and generation.
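
The 2:1 Seg:VQA batch ratio from Step 03 can be illustrated with a small sampler. This is a minimal sketch rather than the authors' data loader: `mixed_batches`, its parameters, and the toy sample tuples are hypothetical names introduced here, and drawing every batch independently is an assumption.

```python
import random


def mixed_batches(seg_samples, vqa_samples, batch_size,
                  seg_parts=2, vqa_parts=1, seed=0):
    """Yield batches that mix segmentation and VQA samples at a fixed ratio.

    Hypothetical helper: the 2:1 default mirrors the Seg:VQA ratio reported
    above; every other detail is an illustrative assumption.
    """
    unit = seg_parts + vqa_parts
    if batch_size % unit != 0:
        raise ValueError("batch_size must be a multiple of seg_parts + vqa_parts")
    n_seg = batch_size * seg_parts // unit
    n_vqa = batch_size - n_seg
    rng = random.Random(seed)
    while True:
        # Draw each batch independently, then shuffle so the tasks interleave.
        batch = rng.sample(seg_samples, n_seg) + rng.sample(vqa_samples, n_vqa)
        rng.shuffle(batch)
        yield batch


# Toy stand-ins for real training examples.
seg_pool = [("seg", i) for i in range(100)]
vqa_pool = [("vqa", i) for i in range(100)]
first_batch = next(mixed_batches(seg_pool, vqa_pool, batch_size=6))
seg_count = sum(1 for task, _ in first_batch if task == "seg")
print(seg_count, len(first_batch) - seg_count)  # 4 2
```

With a batch size of 6, each batch holds 4 segmentation and 2 VQA samples. A real pipeline would more likely implement the same idea as a dataset-level sampler.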

BAGEL (7B + 7B)

  • Highly native design. Mixture of Transformers with native interleaved training across understanding and generation.
  • Shared attention. Understanding and generation streams interact through shared attention, enabling deep cross-modal fusion.
  • Larger scale. Dual-7B parameter capacity provides strong representational power for SGT integration.
Large-scale UMM

OmniGen2 (3B + 4B)

  • Frozen VLM. The pretrained VLM is kept frozen; only the diffusion head is trained.
  • Feature sharing. Hidden states are shared as semantic guidance, bridging understanding and generation in series.
  • Lightweight parameters. Compact 3B + 4B configuration delivers efficient training and inference.
Efficient UMM
Three Empirical Observations

Our hierarchical task study across BAGEL and OmniGen2 reveals consistent patterns guiding SGT's design.

Fig. 3a: Understanding capability gains across proxy task levels on BAGEL and OmniGen2. Segmentation (high-level) consistently yields the largest gains.
Fig. 3b: Generation capability gains. All proxy tasks improve position-aware generation.
Observation 01

High-level semantic tasks dominate

Image segmentation consistently outperforms mid-level (depth estimation) and low-level (edge detection) tasks on all understanding benchmarks. High-level supervision aligns with perception demands, while texture-focused tasks cause overfitting to irrelevant details.

High-level wins
Observation 02

Visual supervision enhances perception, not reasoning

Generative tuning fortifies visual perception (vision-centric tasks, spatial reasoning, hallucination resistance) while chart/math reasoning remains static. Visual supervision enhances representation quality but does not impart logical priors.

Perception only
Observation 03

Spatial fidelity improves universally

Regardless of semantic granularity, all proxy tasks consistently improve generative spatial fidelity — especially for position-aware tasks. Reconstructing visual structure forces accurate spatial layouts, naturally boosting positional prompt adherence.

Universal gain
Comparison with State-of-the-Art

SGT-tuned models consistently outperform their baselines and surpass competitive UMMs across understanding and generation benchmarks.

Columns: Model, Params; Visual Understanding (MMVP↑, VSR↑, Hallu.↑, MMStar↑, RWQA↑, MathV.↑); Visual Generation (GenEval↑, GEdit↑)
Small-scale Models (≤ 4B)
Show-o 512  1.3B  50.00  54.26  46.06  38.17  68.0
Harmon  1.5B  60.00  60.88  46.69  38.00  48.00  33.70  73.0
UniLIP  2B  73.00  65.55  60.57  64.18  90.0
UniMRG  3.6B  74.67  73.90  64.56  66.01  55.8
OpenUni  2B  71.67  66.69  60.88  65.23  51.0
OmniGen2  3B+4B  65.00  77.52  62.35  55.07  64.41  63.50  76.0  6.63
✦ SGT-Gen2 (Ours)  3B+4B  68.33  78.85  64.25  57.07  65.10  64.00  79.1  6.83
Large-scale Models (≥ 7B)
Chameleon  7B  50.00  31.13  28.93  39.00  21.90  39.0
Janus-Pro  7B  63.00  71.03  60.15  46.80  41.83  42.60  80.0
UniWorld-v1  7B+12B  77.67  83.34  68.35  63.90  67.58  68.20  84.0  4.85
BAGEL  7B+7B  83.00  80.45  68.34  67.46  71.26  73.10  88.0  6.64
✦ SGT-BAGEL (Ours)  7B+7B  83.33  81.54  70.24  68.33  72.42  73.90  90.0  6.94
Effect of Proxy Task Choice

SGT yields the largest gains on most understanding benchmarks while achieving competitive generation performance.

Method | CV-Bench↑ | MMVP↑ | VSR↑ | SIBench↑ | POPE↑ | Hallusion↑ | GenEval↑ | GEdit↑
Base: BAGEL
BAGEL (Base) | 73.21 | 83.00 | 80.45 | 48.95 | 85.69 | 68.34 | 78.21 | 6.52
+ SFT only | 74.61 | 82.67 | 80.69 | 49.34 | 86.77 | 67.92 | 77.18 | 6.49
+ SFT + Edge | 74.56 | 83.67 | 80.83 | 49.51 | 86.48 | 68.66 | 79.96 | 6.72
+ SFT + Reconstruction | 75.23 | 83.33 | 80.83 | 50.59 | 87.98 | 68.03 | 80.82 | 6.75
✦ + SFT + SGT (Ours) | 79.23 | 83.33 | 81.54 | 50.18 | 88.32 | 70.24 | 80.95 | 6.94
Base: OmniGen2
OmniGen2 (Base) | 65.94 | 65.00 | 77.52 | 43.29 | 85.97 | 62.35 | 76.58 | 6.63
+ SFT only | 65.99 | 66.00 | 77.61 | 44.37 | 86.25 | 64.35 | 74.54 | 6.32
+ SFT + Edge | 66.67 | 65.33 | 77.99 | 45.51 | 86.10 | 63.72 | 77.45 | 6.79
+ SFT + Reconstruction | 66.71 | 66.33 | 78.18 | 45.41 | 85.92 | 65.19 | 77.53 | 6.81
✦ + SFT + SGT (Ours) | 66.91 | 68.33 | 78.85 | 45.37 | 87.29 | 64.25 | 78.86 | 6.83
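
The per-benchmark gains of SGT over each base model follow directly from the rows above. The snippet below is a small sketch that recomputes them; the scores are copied from the table, while the variable and function names (`BENCHMARKS`, `gains`, and so on) are introduced here for illustration only.

```python
BENCHMARKS = ["CV-Bench", "MMVP", "VSR", "SIBench",
              "POPE", "Hallusion", "GenEval", "GEdit"]

# Scores copied from the proxy-task table (base row vs. "+ SFT + SGT" row).
BAGEL_BASE = [73.21, 83.00, 80.45, 48.95, 85.69, 68.34, 78.21, 6.52]
BAGEL_SGT = [79.23, 83.33, 81.54, 50.18, 88.32, 70.24, 80.95, 6.94]
OMNIGEN2_BASE = [65.94, 65.00, 77.52, 43.29, 85.97, 62.35, 76.58, 6.63]
OMNIGEN2_SGT = [66.91, 68.33, 78.85, 45.37, 87.29, 64.25, 78.86, 6.83]


def gains(base, tuned):
    """Return the absolute score difference (tuned - base) per benchmark."""
    return {name: round(t - b, 2)
            for name, b, t in zip(BENCHMARKS, base, tuned)}


bagel_gains = gains(BAGEL_BASE, BAGEL_SGT)
omnigen2_gains = gains(OMNIGEN2_BASE, OMNIGEN2_SGT)
print(bagel_gains["CV-Bench"])     # 6.02
print(omnigen2_gains["CV-Bench"])  # 0.97
```

The BAGEL CV-Bench difference of 6.02 points matches the headline gain quoted in the abstract.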
Why Does SGT Work?

We probe feature distributions and attention dynamics to uncover the mechanisms behind SGT's improvements.

+6.1%: CV-Bench gain during training with the 2:1 Seg:VQA ratio; convergence is accelerated vs. the SFT-only baseline.
2:1: Optimal Seg:VQA intra-batch ratio; validated on both BAGEL and OmniGen2.
+3.3%: BAGEL aggregate gain when scaling from 2k to 100k segmentation samples; improvement is monotonic with data scale.
L25–L27: Deep layers where attention reallocation is most pronounced; segmentation shifts attention most effectively to visual tokens.
Fig. 7 (left): Visually confusable categories — Grand Piano vs. Upright Piano — used to evaluate feature discriminability.
Fig. 7 (right): t-SNE visualization. BAGEL baseline yields entangled features; SGT learns highly discriminative embeddings with clear class separation.
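
One simple way to put a number on "entangled" versus "clearly separated" features is a ratio of between-class to within-class scatter. The sketch below is an illustrative stand-in and not the probing protocol used for the figure: the `separability` score and the 2-D toy points are assumptions introduced here, and real features would be high-dimensional model embeddings.

```python
def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def separability(class_a, class_b):
    """Squared centroid gap divided by the summed within-class scatter.

    Higher values mean the two classes sit further apart relative to
    their own spread (a rough proxy for how easily they can be separated).
    """
    ca, cb = centroid(class_a), centroid(class_b)
    within = (sum(sq_dist(p, ca) for p in class_a) / len(class_a)
              + sum(sq_dist(p, cb) for p in class_b) / len(class_b))
    return sq_dist(ca, cb) / within


# Synthetic toy features (NOT data from the paper).
grand = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
upright_entangled = [[0.5, 0.5], [1.5, 1.5], [0.5, 1.5], [1.5, 0.5]]
upright_separated = [[5.0, 5.0], [6.0, 6.0], [5.0, 6.0], [6.0, 5.0]]

print(separability(grand, upright_entangled))  # 0.5
print(separability(grand, upright_separated))  # 50.0
```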
Fig. 8a: Vision-language attention allocation change (%) across layers. Segmentation reallocates attention most strongly at deep layers (L25–L27), reflecting structural semantic influence.
Fig. 8b: Attention allocation to object/color/position/other categories. SGT increases visual token attention while reducing distractions.
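
The attention-allocation analysis can be pictured as measuring how much of a query's attention mass lands on visual tokens, then comparing that share before and after tuning. The following is a minimal sketch under stated assumptions: the function names, the toy attention rows, and the choice of a relative percentage change are all introduced here, and the exact metric used in the figures may differ.

```python
def visual_attention_share(attn_row, is_visual):
    """Fraction of one query's attention mass that falls on visual tokens.

    attn_row:  attention weights over all key tokens (e.g. a softmax row).
    is_visual: booleans marking which key tokens are image tokens.
    """
    total = sum(attn_row)
    visual = sum(w for w, v in zip(attn_row, is_visual) if v)
    return visual / total


def allocation_change_pct(base_share, tuned_share):
    """Relative change (%) in visual attention share after tuning (assumed metric)."""
    return 100.0 * (tuned_share - base_share) / base_share


# Toy example: the first two key tokens are visual, the last two are text.
mask = [True, True, False, False]
base_share = visual_attention_share([0.10, 0.20, 0.30, 0.40], mask)
tuned_share = visual_attention_share([0.20, 0.25, 0.25, 0.30], mask)
print(round(allocation_change_pct(base_share, tuned_share), 1))  # 50.0
```

In practice the share would be averaged over heads and queries for each layer, which is how a per-layer curve like the one in Fig. 8a could be assembled.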
Data Recipe & Scalability

SGT is scalable — performance improves with more segmentation data, and a 2:1 Seg:VQA batch ratio is optimal.

Fig. 5a: Ablation on segmentation-to-VQA ratio. Both BAGEL and OmniGen2 achieve optimal performance at a 2:1 ratio (segmentation:VQA).
Fig. 5b: Data scalability. Performance improves consistently as segmentation data scales from 2k to 100k samples (BAGEL: +3.3%, OmniGen2: +2.0%).
Training Data Breakdown (Total ≈ 691k)
SGT Segmentation (SAM): 190k
General VQA: 180k
Doc / Chart / Screen: 103k
Math / Reasoning: 101k
Language: 72k
General OCR: 45k
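
As a quick arithmetic check on the breakdown, the snippet below sums the six categories and reports each one's share of the mix. The counts are taken from the list above; the dictionary name and layout are illustrative choices made here.

```python
# Sample counts in thousands, copied from the training data breakdown.
BREAKDOWN_K = {
    "SGT Segmentation (SAM)": 190,
    "General VQA": 180,
    "Doc / Chart / Screen": 103,
    "Math / Reasoning": 101,
    "Language": 72,
    "General OCR": 45,
}

total_k = sum(BREAKDOWN_K.values())
shares_pct = {name: round(100 * count / total_k, 1)
              for name, count in BREAKDOWN_K.items()}

print(total_k)                               # 691
print(shares_pct["SGT Segmentation (SAM)"])  # 27.5
```

Segmentation therefore makes up roughly 27.5% of the roughly 691k training samples by count.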
BibTeX

If you find SGT useful in your research, please consider citing our work.

@article{yu2026sgt,
  title     = {Semantic Generative Tuning for Unified Multimodal Models},
  author    = {Yu, Songsong and Chen, Yuxin and Shan, Ying and Li, Yanwei},
  journal   = {arXiv preprint arXiv:2605.18714},
  year      = {2026},
}