VLV

Vision-Language-Vision Auto-Encoder:
Scalable Knowledge Distillation from Diffusion Models

Building state-of-the-art Vision-Language Models with dramatically reduced costs

Tiezheng Zhang1, Yitong Li3, Yu-cheng Chou1, Jieneng Chen1, Alan L. Yuille1,
Chen Wei2, Junfei Xiao1*

1Johns Hopkins University, 2Rice University, 3Tsinghua University
*Project Lead

Links: arXiv | Code | Model (Hugging Face) | Dataset (Hugging Face)

About VLV

We introduce the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and, in a second stage, a Large Language Model (LLM). We establish an information bottleneck that regularizes the language representation space by keeping the pretrained T2I diffusion decoder frozen.

Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. By fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash.

Key advantages:

Knowledge Distillation: Our method effectively distills knowledge from text-conditioned diffusion models using continuous embeddings without requiring paired image-text datasets.

Scalable Framework: We introduce a novel framework that leverages a vision encoder, a frozen T2I diffusion decoder, and a Large Language Model to create an efficient information bottleneck.

Cost Efficient: By primarily utilizing single-modal images and maximizing pretrained model utility, we keep total training expenditure under $1,000 USD—three orders of magnitude cheaper than comparable methods.

VLV Framework

VLV Auto-Encoder Architecture

The overall architecture is an auto-encoder with three key components (a minimal training sketch follows the list):

VLV Encoder: A visual backbone augmented with a lightweight multi-modal adapter maps an input image into a continuous caption embedding that carries compact semantic information.

Diffusion Decoder: A frozen text-to-image diffusion model reconstructs the image.

Caption Decoder: A pretrained large language model translates the same embedding into comprehensive captions.
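
To make the training recipe concrete, here is a minimal PyTorch-style sketch of the Stage-1 objective, assuming a diffusers-style frozen U-Net and noise scheduler as the T2I decoder; the class name VLVEncoder, the query-pooling adapter, and all dimensions are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VLVEncoder(nn.Module):
    """Vision backbone + lightweight adapter -> continuous caption embeddings."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 1024,
                 embed_dim: int = 1024, num_tokens: int = 77):
        super().__init__()
        self.backbone = backbone                       # pretrained vision encoder
        # Learnable queries pool patch features into a fixed-length sequence
        # (illustrative adapter design; the paper's adapter may differ).
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.02)
        self.adapter = nn.Linear(feat_dim, embed_dim)  # must match the decoder's cross-attention width

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                  # (B, N, feat_dim) patch features
        attn = torch.softmax(self.queries @ feats.transpose(1, 2), dim=-1)  # (B, T, N)
        tokens = attn @ feats                          # (B, T, feat_dim)
        return self.adapter(tokens)                    # (B, T, embed_dim) caption embedding


def stage1_loss(encoder: VLVEncoder, frozen_unet, scheduler,
                images: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
    """Stage 1: the frozen T2I diffusion decoder provides the reconstruction
    signal, so gradients flow through it only into the caption embeddings."""
    cond = encoder(images)                             # continuous caption embeddings
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.size(0),), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = frozen_unet(noisy, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred, noise)                     # standard denoising objective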

Dataset Construction

We construct our training dataset through a carefully designed two-stage data collection pipeline:

Stage-1 Data (40M images): We curate a 40M-image subset from LAION-2B-en-aesthetic, a subset of LAION-5B. For training stability, we apply strict filtering criteria: images must have a shorter side greater than 512 pixels, an aspect ratio between 0.5 and 2.0, and a watermark probability below 0.5.
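
A minimal sketch of these filtering rules, assuming per-image metadata fields for width, height, and watermark probability (the field names are illustrative, not the exact LAION-2B-en-aesthetic schema):

def keep_image(width: int, height: int, watermark_prob: float) -> bool:
    """Return True if an image passes the Stage-1 filtering criteria."""
    shorter_side = min(width, height)
    aspect_ratio = width / height
    return (
        shorter_side > 512              # shorter side greater than 512 pixels
        and 0.5 <= aspect_ratio <= 2.0  # aspect ratio between 0.5 and 2.0
        and watermark_prob < 0.5        # watermark probability below 0.5
    )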

Stage-2 Data (6M image-text pairs): We query Gemini 2.0 Flash to generate high-quality captions for 6M images from our Stage-1 dataset, producing aligned image-text pairs for fine-tuning the language decoder. Most captions contain 170-280 tokens (mean: 226.82 tokens).
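
A sketch of how such captions could be collected with the google-genai SDK; the model id matches the one named above, but the prompt text and request settings are hypothetical placeholders rather than the authors' exact query.

from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment


def caption_image(path: str) -> str:
    """Query Gemini 2.0 Flash for a detailed caption of one image."""
    with open(path, "rb") as f:
        image_bytes = f.read()
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "Describe this image in detail, covering objects, attributes, "
            "spatial layout, and fine-grained details.",  # placeholder prompt
        ],
    )
    return response.text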

Efficient Data Usage: Despite training on only 0.4% as many images as the 10B-image WebLI dataset used by De-Diffusion (40M vs. 10B), our method learns strong language-oriented semantics through the vision-language-vision auto-encoding pipeline.

Quality Control: Our filtering process ensures high-quality images with appropriate aspect ratios and minimal watermarks, while the caption generation process creates comprehensive, descriptive captions that capture spatial layout and fine-grained details.

Data Pipeline

Text-Conditioned Reconstruction Results

We assess caption quality by feeding each decoded caption to Stable Diffusion 3.5 Medium and computing the Fréchet Inception Distance (FID) between the synthesized and original images on 30K samples from the MS-COCO 2014 validation split.

Our captions are compared against state-of-the-art vision-language models: Florence-2, Qwen2.5-VL, Gemini 2.0 Flash, and GPT-4o. Image synthesis employs a rectified flow-matching sampler with 40 inference steps and classifier-free guidance scales from 1.0 to 4.0.
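
A sketch of this evaluation loop using the diffusers and torchmetrics libraries; the checkpoint id and preprocessing are assumptions here, and the numbers in the table below come from the paper's own pipeline.

import torch
from diffusers import StableDiffusion3Pipeline
from torchmetrics.image.fid import FrechetInceptionDistance

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")
fid = FrechetInceptionDistance(feature=2048, normalize=True).to("cuda")


def add_pair(caption: str, real_image: torch.Tensor, guidance: float = 2.0):
    """real_image: (1, 3, H, W) float tensor in [0, 1]."""
    fake = pipe(caption, num_inference_steps=40, guidance_scale=guidance,
                output_type="pt").images               # (1, 3, H, W) in [0, 1]
    fid.update(real_image.to("cuda"), real=True)
    fid.update(fake.to(torch.float32).to("cuda"), real=False)

# After looping add_pair over the 30K caption/image pairs:
# print(fid.compute())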

Key Results:

Matches GPT-4o Performance: Our captions achieve an FID essentially indistinguishable from GPT-4o's (difference <0.5) at guidance scale 2.0: VLV (6.64) vs GPT-4o (6.20)

Outperforms Open-Source Models: Markedly better than Florence-2 (7.51) and Qwen2.5-VL (6.98), demonstrating superior caption quality among open-source alternatives

Competitive with Commercial Models: Only the closed-source Gemini 2.0 Flash (5.87) attains a marginally better score, while we remain highly competitive

Model              | Guidance 1.0 | Guidance 2.0 | Guidance 3.0 | Guidance 4.0
Original (MS-COCO) |        16.62 |         9.90 |        12.69 |        14.49
Florence-2 Large   |        10.61 |         7.51 |         9.95 |        11.35
Qwen2.5-VL-7B      |        12.61 |         6.98 |         9.19 |        10.59
VLV (Ours)         |        11.47 |         6.64 |         8.56 |         9.90
Gemini 2.0 Flash   |        12.82 |         5.87 |         7.57 |         8.77
GPT-4o             |        12.16 |         6.20 |         7.96 |         9.25

FID scores (↓ lower is better) on Stable Diffusion 3.5 Medium generations conditioned on each model's captions. VLV achieves the best scores among open-source captioners; the closed-source Gemini 2.0 Flash and GPT-4o are listed for reference.

Human Evaluation Results

We also benchmark caption fidelity using human raters and VLM judges under a three-criterion rubric: coverage, no hallucination, and spatial-layout consistency on a 0-6 scale.

VLV matches GPT-4o to within 0.05 points on the 0–6 scale and surpasses Qwen-2.5-VL-7B by 0.15 on average, confirming that our caption embeddings yield human-level captions while remaining competitive with the strongest commercial VLMs.

Rater            | Qwen2.5-VL-7B | GPT-4o | VLV (Ours)
Human 1          |          4.88 |   5.32 |       5.25
Human 2          |          5.02 |   5.17 |       5.04
Human 3          |          5.10 |   5.12 |       5.22
Gemini 2.0 Flash |          5.07 |   5.25 |       5.18
Average          |          5.02 |   5.22 |       5.17

Caption fidelity scores (0–6 scale, higher is better) from three human annotators and one automated judge (Gemini 2.0 Flash). GPT-4o obtains the best average score, with VLV (Ours) a close second.

Reconstruction with Language Semantics

For each original input image (top), we feed its caption embedding directly to the frozen diffusion decoder and obtain a reconstruction (middle) that preserves high-level semantics and fine-grained appearance cues.

The same embedding is then decoded by the LLM; prompting Midjourney with that caption yields an image of high fidelity, demonstrating that a single embedding suffices for both visual and textual regeneration.
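
A sketch of this second decoding path, assuming a transformers-style causal LM and a learned projection layer (projector, llm, and tokenizer are placeholders): the continuous embedding is projected into the LLM's input space and used as a soft prefix for autoregressive generation.

import torch.nn as nn
from torch import Tensor


def decode_caption(caption_emb: Tensor, projector: nn.Linear,
                   llm, tokenizer, max_new_tokens: int = 300) -> str:
    """caption_emb: (1, T, D) continuous embedding from the VLV encoder."""
    prefix = projector(caption_emb)           # map to the LLM hidden size
    out = llm.generate(inputs_embeds=prefix,  # soft-prefix conditioning
                       max_new_tokens=max_new_tokens,
                       do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)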

Key Insights:

Dual-Purpose Embeddings: Our caption embeddings serve as a universal representation that can be decoded into both high-quality image reconstructions and comprehensive textual descriptions

Semantic Preservation: The reconstructions maintain crucial visual details including object positioning, spatial relationships, and fine-grained appearance characteristics

Cross-Modal Consistency: The same compact embedding generates visually consistent results whether used for image reconstruction or text-to-image generation with state-of-the-art models like Midjourney

Reconstruction Examples

Emergent Compositionality

The VLV auto-encoder exhibits notable emergent properties. The learned embeddings encapsulate crucial semantic details, including object 3D pose and orientation, ensuring robust spatial consistency. By concatenating caption embeddings from different images, VLV can disentangle foreground objects from backgrounds and compose novel, coherent images.
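
A minimal sketch of the composition step, where encoder and decode_with_diffusion stand in for the trained VLV encoder and the frozen T2I sampling loop (both are placeholders):

import torch


def compose(encoder, decode_with_diffusion, image_a, image_b):
    """Concatenate two images' caption embeddings and decode the joint sequence."""
    emb_a = encoder(image_a)                     # (1, T, D) caption embedding of image A
    emb_b = encoder(image_b)                     # (1, T, D) caption embedding of image B
    combined = torch.cat([emb_a, emb_b], dim=1)  # (1, 2T, D) joint conditioning
    return decode_with_diffusion(combined)       # composed image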

Composition examples: each combined image in the gallery is generated from the caption embeddings of two source images.

BibTeX


@article{zhang2025vision,
  title   = {Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author  = {Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan and Wei, Chen and Xiao, Junfei},
  journal = {arXiv preprint arXiv:2507.07104},
  year    = {2025}
}