CVPR 2026

ConsistCompose

Unified Multimodal Layout Control for Image Composition

Xuanke Shi   Boxuan Li   Xiaoyang Han   Zhongang Cai   Lei Yang   Dahua Lin   Quan Wang 

SenseTime Research 

Demonstrations

Capabilities

Layout Control: Text Prompt Input
Prompt: A cat <bbox>[0.109, 0.297, 0.607, 0.87]</bbox> has extended paw <bbox>[0.497, 0.566, 0.564, 0.642]</bbox> and balancing paw <bbox>[0.108, 0.786, 0.187, 0.862]</bbox>. Dark floor <bbox>[0.0, 0.593, 1.0, 0.999]</bbox> with pebbles, a large door <bbox>[0.137, 0.001, 0.869, 0.684]</bbox> forms the backdrop.
Layout coordinates for each instance (cat, paw 1, paw 2, floor, door) are embedded directly in the text tokens, yielding layout-grounded generation.
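The demonstration prompts embed each instance's normalized [x1, y1, x2, y2] box as literal `<bbox>...</bbox>` text. Below is a minimal sketch of recovering the (subject, box) pairs from such a prompt; the regex and helper names are illustrative, not from the paper:

```python
import re

# Hypothetical helper: extract (subject phrase, box) pairs from a prompt that
# embeds normalized [x1, y1, x2, y2] coordinates in <bbox>...</bbox> tags.
BBOX_RE = re.compile(r"(\w[\w ]*?)\s*<bbox>\[([^\]]+)\]</bbox>")

def parse_layout(prompt: str):
    """Return a list of (subject, [x1, y1, x2, y2]) pairs."""
    pairs = []
    for subject, coords in BBOX_RE.findall(prompt):
        box = [float(c) for c in coords.split(",")]
        pairs.append((subject.strip(), box))
    return pairs

prompt = ("A cat <bbox>[0.109, 0.297, 0.607, 0.87]</bbox> has extended paw "
          "<bbox>[0.497, 0.566, 0.564, 0.642]</bbox>")
pairs = parse_layout(prompt)
```

The captured phrase is simply the text run preceding each tag; a production parser would align phrases with tokenizer offsets instead.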
Rearrangement: Image + Text Prompt Input
Prompt: The mug <bbox>[0.382, 0.388, 0.651, 0.666]</bbox> from image1 sits on the right-middle of a rustic wooden table. A steamed dumpling <bbox>[0.128, 0.605, 0.356, 0.823]</bbox> from image1 rests on a plate. A red crab <bbox>[0.421, 0.611, 0.817, 0.802]</bbox> from image1 crawls on the table.
Instances from the input image (mug, dumpling, red crab) are spatially rearranged according to the specified boxes.
Multi-instance: Image + Text Prompt Input (two references)
Prompt: The piglet <bbox>[0.32, 0.269, 0.669, 0.82]</bbox> from image 1 nudges the brown briefcase <bbox>[0.353, 0.532, 0.714, 0.905]</bbox> from image 2 with its snout under the soft golden glow of a morning sun, standing on a cobblestone path.
Identity-consistent composition of the piglet (image 1) and the briefcase (image 2).
Abstract

Overview

Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding, i.e., aligning language with image regions. The generative counterpart of grounding, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored.

We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text inputs within a single generative interface. We further construct ConsistCompose3M, a 3.4M-sample multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs).

Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy — achieving a 7.2% gain in layout IoU and a 13.7% AP improvement — while preserving identity fidelity and competitive general multimodal understanding.
🎯
LELG Paradigm
Linguistic-embedded layout-grounded generation encodes spatial constraints as textual tokens, natively integrating layout control without task-specific branches.
🔗
Unified Framework
A single model jointly supports layout-grounded T2I synthesis, multi-reference identity-consistent composition, and general multimodal understanding.
📦
ConsistCompose3M
3.4M-sample multimodal dataset providing layout and identity supervision at scale for unified layout-aware multimodal training.
Methodology

Framework

ConsistCompose Framework Overview
Figure 1. Overview of the ConsistCompose framework. ConsistCompose adopts Bagel, with its MoT architecture, as the backbone, featuring two Transformer experts for understanding and generation. The proposed LELG paradigm enables spatially controllable multi-instance generation by embedding explicit layout semantics into the linguistic stream.
Instance-Coordinate Binding

Layout constraints are encoded as succinct coordinate expressions in text, allowing the model to bind each subject identity and its designated spatial position through the shared token space governing both understanding and generation.
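As a concrete illustration of this binding, a prompt can be assembled by serializing each subject's normalized box immediately after its phrase, so identity and position occupy the same token stream. The helper names and formatting choices below are assumptions for illustration, not the paper's exact serialization:

```python
# Illustrative sketch of instance-coordinate binding: each subject phrase is
# immediately followed by its normalized [x1, y1, x2, y2] box, serialized as
# plain text tokens inside <bbox>...</bbox> tags.
def bind(subject: str, box, precision: int = 3) -> str:
    coords = ", ".join(f"{c:.{precision}g}" for c in box)
    return f"{subject} <bbox>[{coords}]</bbox>"

def compose_prompt(instances, background: str = "") -> str:
    """instances: list of (subject phrase, box) pairs."""
    parts = [bind(s, b) for s, b in instances]
    return ", ".join(parts) + (f". {background}" if background else ".")

p = compose_prompt([("a piglet", [0.32, 0.269, 0.669, 0.82]),
                    ("a brown briefcase", [0.353, 0.532, 0.714, 0.905])])
```

Because the coordinates are ordinary text, no extra conditioning branch or layout encoder is needed on top of the language interface.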

Coordinate-Aware CFG

A coordinate-aware classifier-free guidance mechanism further enhances spatial fidelity during sampling, without altering the backbone architecture or introducing task-specific layout-centric branches.
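The page does not spell out the guidance formula, so the following is a hedged sketch of one plausible coordinate-aware CFG: a third forward pass conditioned on the prompt with its `<bbox>` tokens stripped isolates the layout signal, which then receives its own guidance weight on top of ordinary text guidance. The weights and the layout-dropped branch are assumptions, not the paper's exact formulation:

```python
# Hedged sketch of coordinate-aware classifier-free guidance (assumed form).
def coordinate_aware_cfg(eps_uncond, eps_text, eps_full,
                         s_text=5.0, s_layout=2.0):
    # eps_full  : noise prediction with the full prompt (text + coordinates)
    # eps_text  : prediction with <bbox> coordinates stripped from the prompt
    # eps_uncond: unconditional prediction
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)    # ordinary text guidance
            + s_layout * (eps_full - eps_text))   # extra push toward the layout

# Toy check with scalar "predictions" in place of noise tensors.
e = coordinate_aware_cfg(0.0, 1.0, 1.5)
```

Setting `s_layout = 0` recovers standard two-branch CFG, so the sampler degrades gracefully when no boxes are given.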

Experiments

Quantitative Results

We evaluate on COCO-Position for layout controllability and MS-Bench for identity-consistent multi-reference generation. ConsistCompose establishes state-of-the-art performance on both benchmarks.

Inst. = Instance Success Ratio (%) ↑; Img. = Image Success Ratio (%) ↑; mIoU / AP / AP50 / AP75 = Position Accuracy (%) ↑.

| Method | Inst. L2 | Inst. L3 | Inst. L4 | Inst. L5 | Inst. L6 | Inst. Avg | Img. L2 | Img. L3 | Img. L4 | Img. L5 | Img. L6 | Img. Avg | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLIGEN | 89.1 | 86.3 | 82.0 | 79.6 | 81.6 | 82.6 | 78.8 | 63.8 | 48.1 | 35.0 | 35.0 | 52.1 | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 94.1 | 94.4 | 89.5 | 84.6 | 83.8 | 87.8 | 89.4 | 84.4 | 67.5 | 46.9 | 39.4 | 65.5 | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 94.1 | 92.1 | 87.3 | 84.1 | 83.4 | 86.8 | 89.4 | 78.1 | 62.5 | 48.1 | 38.8 | 63.4 | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 81.9 | 76.3 | 73.4 | 73.5 | 71.2 | 74.0 | 69.4 | 48.1 | 36.9 | 31.9 | 26.3 | 42.5 | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 85.3 | 84.2 | 83.8 | 80.9 | 81.2 | 82.5 | 72.5 | 63.1 | 51.3 | 33.1 | 31.3 | 50.3 | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours | 95.6 | 94.2 | 92.7 | 90.6 | 92.4 | 92.6 | 91.9 | 83.1 | 73.1 | 63.7 | 68.8 | 76.1 | 85.3 | 70.9 | 89.1 | 76.9 |
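For reference, the mIoU column measures overlap between each requested box and the detected box of the corresponding generated instance. A minimal IoU implementation over normalized [x1, y1, x2, y2] boxes (a sketch, not the benchmark's evaluation code):

```python
# Intersection-over-Union for two axis-aligned boxes in [x1, y1, x2, y2] form.
def iou(a, b) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

v = iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.75, 0.75])
```

AP, AP50, and AP75 then follow the usual detection-style averaging of precision over IoU thresholds, matching instances to their requested boxes.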
Citation

BibTeX

If you find our work useful, please cite:

@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and
          Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}