CVPR 2026

ConsistCompose

Unified Multimodal Layout Control for Image Composition

Xuanke Shi   Boxuan Li   Xiaoyang Han   Zhongang Cai   Lei Yang   Dahua Lin   Quan Wang 

SenseTime Research 

Demonstrations

Capabilities

Layout Control: Text Prompt Input
Prompt: A cat <bbox>[0.109, 0.297, 0.607, 0.87]</bbox> has extended paw <bbox>[0.497, 0.566, 0.564, 0.642]</bbox> and balancing paw <bbox>[0.108, 0.786, 0.187, 0.862]</bbox>. Dark floor <bbox>[0.0, 0.593, 1.0, 0.999]</bbox> with pebbles, a large door <bbox>[0.137, 0.001, 0.869, 0.684]</bbox> forms the backdrop.
Layout coordinates for each instance (cat, paw 1, paw 2, floor, door) are embedded directly in the text tokens, yielding layout-grounded generation.
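The demonstration prompts embed each instance's normalized [x1, y1, x2, y2] box as literal `<bbox>...</bbox>` text. Below is a minimal sketch of recovering the (subject, box) pairs from such a prompt; the regex and helper names are illustrative, not from the paper:

```python
import re

# Hypothetical helper: extract (subject phrase, box) pairs from a prompt that
# embeds normalized [x1, y1, x2, y2] coordinates in <bbox>...</bbox> tags.
BBOX_RE = re.compile(r"(\w[\w ]*?)\s*<bbox>\[([^\]]+)\]</bbox>")

def parse_layout(prompt: str):
    """Return a list of (subject, [x1, y1, x2, y2]) pairs."""
    pairs = []
    for subject, coords in BBOX_RE.findall(prompt):
        box = [float(c) for c in coords.split(",")]
        pairs.append((subject.strip(), box))
    return pairs

prompt = ("A cat <bbox>[0.109, 0.297, 0.607, 0.87]</bbox> has extended paw "
          "<bbox>[0.497, 0.566, 0.564, 0.642]</bbox>")
pairs = parse_layout(prompt)
```

The captured phrase is simply the text run preceding each tag; a production parser would align phrases with tokenizer offsets instead.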
Rearrangement: Image + Text Prompt Input
Prompt: The mug <bbox>[0.382, 0.388, 0.651, 0.666]</bbox> from image1 sits on the right-middle of a rustic wooden table. A steamed dumpling <bbox>[0.128, 0.605, 0.356, 0.823]</bbox> from image1 rests on a plate. A red crab <bbox>[0.421, 0.611, 0.817, 0.802]</bbox> from image1 crawls on the table.
Instances from the input image (mug, dumpling, red crab) are spatially rearranged according to the specified boxes.
Multi-instance: Image + Text Prompt Input (two references)
Prompt: The piglet <bbox>[0.32, 0.269, 0.669, 0.82]</bbox> from image 1 nudges the brown briefcase <bbox>[0.353, 0.532, 0.714, 0.905]</bbox> from image 2 with its snout under the soft golden glow of a morning sun, standing on a cobblestone path.
Identity-consistent composition of the piglet (image 1) and the briefcase (image 2).
Abstract

Overview

Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding, i.e., aligning language with image regions. The generative counterpart of grounding, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored.

We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text inputs within a single generative interface. We further construct ConsistCompose3M, a 3.4M-sample multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs).

Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy — achieving a 7.2% gain in layout IoU and a 13.7% AP improvement — while preserving identity fidelity and competitive general multimodal understanding.
🎯
LELG Paradigm
Linguistic-embedded layout-grounded generation encodes spatial constraints as textual tokens, natively integrating layout control without task-specific branches.
🔗
Unified Framework
A single model jointly supports layout-grounded T2I synthesis, multi-reference identity-consistent composition, and general multimodal understanding.
📦
ConsistCompose3M
3.4M-sample multimodal dataset providing layout and identity supervision at scale for unified layout-aware multimodal training.
Methodology

Framework

ConsistCompose Framework Overview
Figure 1. Overview of the ConsistCompose framework. ConsistCompose adopts Bagel, with its MoT architecture, as the backbone, featuring two Transformer experts for understanding and generation. The proposed LELG paradigm enables spatially controllable multi-instance generation by embedding explicit layout semantics into the linguistic stream.
Instance-Coordinate Binding

Layout constraints are encoded as succinct coordinate expressions in text, allowing the model to bind each subject identity and its designated spatial position through the shared token space governing both understanding and generation.
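As a concrete illustration of this binding, a prompt can be assembled by serializing each subject's normalized box immediately after its phrase, so identity and position occupy the same token stream. The helper names and formatting choices below are assumptions for illustration, not the paper's exact serialization:

```python
# Illustrative sketch of instance-coordinate binding: each subject phrase is
# immediately followed by its normalized [x1, y1, x2, y2] box, serialized as
# plain text tokens inside <bbox>...</bbox> tags.
def bind(subject: str, box, precision: int = 3) -> str:
    coords = ", ".join(f"{c:.{precision}g}" for c in box)
    return f"{subject} <bbox>[{coords}]</bbox>"

def compose_prompt(instances, background: str = "") -> str:
    """instances: list of (subject phrase, box) pairs."""
    parts = [bind(s, b) for s, b in instances]
    return ", ".join(parts) + (f". {background}" if background else ".")

p = compose_prompt([("a piglet", [0.32, 0.269, 0.669, 0.82]),
                    ("a brown briefcase", [0.353, 0.532, 0.714, 0.905])])
```

Because the coordinates are ordinary text, no extra conditioning branch or layout encoder is needed on top of the language interface.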

Coordinate-Aware CFG

A coordinate-aware classifier-free guidance mechanism further enhances spatial fidelity during sampling, without altering the backbone architecture or introducing task-specific layout-centric branches.
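The page does not spell out the guidance formula, so the following is a hedged sketch of one plausible coordinate-aware CFG: a third forward pass conditioned on the prompt with its `<bbox>` tokens stripped isolates the layout signal, which then receives its own guidance weight on top of ordinary text guidance. The weights and the layout-dropped branch are assumptions, not the paper's exact formulation:

```python
# Hedged sketch of coordinate-aware classifier-free guidance (assumed form).
def coordinate_aware_cfg(eps_uncond, eps_text, eps_full,
                         s_text=5.0, s_layout=2.0):
    # eps_full  : noise prediction with the full prompt (text + coordinates)
    # eps_text  : prediction with <bbox> coordinates stripped from the prompt
    # eps_uncond: unconditional prediction
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)    # ordinary text guidance
            + s_layout * (eps_full - eps_text))   # extra push toward the layout

# Toy check with scalar "predictions" in place of noise tensors.
e = coordinate_aware_cfg(0.0, 1.0, 1.5)
```

Setting `s_layout = 0` recovers standard two-branch CFG, so the sampler degrades gracefully when no boxes are given.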

Experiments

Quantitative Results

We evaluate on COCO-Position for layout controllability and MS-Bench for identity-consistent multi-reference generation. ConsistCompose establishes state-of-the-art performance on both benchmarks.

Inst. = Instance Success Ratio (%) ↑; Img. = Image Success Ratio (%) ↑; mIoU / AP / AP50 / AP75 = Position Accuracy (%) ↑.

| Method | Inst. L2 | Inst. L3 | Inst. L4 | Inst. L5 | Inst. L6 | Inst. Avg | Img. L2 | Img. L3 | Img. L4 | Img. L5 | Img. L6 | Img. Avg | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLIGEN | 89.1 | 86.3 | 82.0 | 79.6 | 81.6 | 82.6 | 78.8 | 63.8 | 48.1 | 35.0 | 35.0 | 52.1 | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 94.1 | 94.4 | 89.5 | 84.6 | 83.8 | 87.8 | 89.4 | 84.4 | 67.5 | 46.9 | 39.4 | 65.5 | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 94.1 | 92.1 | 87.3 | 84.1 | 83.4 | 86.8 | 89.4 | 78.1 | 62.5 | 48.1 | 38.8 | 63.4 | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 81.9 | 76.3 | 73.4 | 73.5 | 71.2 | 74.0 | 69.4 | 48.1 | 36.9 | 31.9 | 26.3 | 42.5 | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 85.3 | 84.2 | 83.8 | 80.9 | 81.2 | 82.5 | 72.5 | 63.1 | 51.3 | 33.1 | 31.3 | 50.3 | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours | 95.6 | 94.2 | 92.7 | 90.6 | 92.4 | 92.6 | 91.9 | 83.1 | 73.1 | 63.7 | 68.8 | 76.1 | 85.3 | 70.9 | 89.1 | 76.9 |
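For reference, the mIoU column measures overlap between each requested box and the detected box of the corresponding generated instance. A minimal IoU implementation over normalized [x1, y1, x2, y2] boxes (a sketch, not the benchmark's evaluation code):

```python
# Intersection-over-Union for two axis-aligned boxes in [x1, y1, x2, y2] form.
def iou(a, b) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

v = iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.75, 0.75])
```

AP, AP50, and AP75 then follow the usual detection-style averaging of precision over IoU thresholds, matching instances to their requested boxes.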
Citation

BibTeX

If you find our work useful, please cite:

@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and
          Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}