Recent studies have shown that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches typically either introduce an additional, complex representation-training framework or rely on a large-scale pre-trained representation foundation model to provide representation guidance during the original generative training.
In this work, we argue that the unique discriminative process inherent to diffusion transformers makes it possible to offer such guidance without needing external components. We thus introduce SRA, a simple, straightforward method that introduces representation guidance in a self-distillation manner.
Experimental results show that SRA accelerates training and improves generation performance for both DiTs and SiTs.
We find that the diffusion transformer exhibits a roughly coarse-to-fine discriminative process when only generative training is performed.
In short, SRA aligns the output latent representation of the diffusion transformer in earlier layers (at higher noise levels) to that in later layers (at lower noise levels), progressively enhancing overall representation learning during a purely generative training process.
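For intuition, below is a minimal sketch of this self-distillation idea, assuming a PyTorch-style setup; the model, layer indices, noise schedule, projection head, and loss weight (`TinyDiT`, `student_layer`, `teacher_layer`, `lam`, etc.) are hypothetical placeholders, not the authors' exact implementation.

```python
# Hedged sketch of SRA-style self-representation alignment (hypothetical names).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    """Stand-in for a diffusion transformer that exposes per-block features."""
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        ])
        self.out = nn.Linear(dim, dim)

    def forward(self, x, return_features=False):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)
        pred = self.out(x)  # e.g. noise / velocity prediction
        return (pred, feats) if return_features else pred

def sra_step(student, teacher, proj, x0, t_hi, t_lo,
             student_layer=3, teacher_layer=6, lam=0.5):
    """One training step: generative loss + self-representation alignment loss."""
    noise = torch.randn_like(x0)
    # Student sees the noisier input; the EMA teacher sees the less noisy one.
    x_hi = (1 - t_hi) * x0 + t_hi * noise   # higher noise level
    x_lo = (1 - t_lo) * x0 + t_lo * noise   # lower noise level

    pred, s_feats = student(x_hi, return_features=True)
    diff_loss = F.mse_loss(pred, noise)      # placeholder generative objective

    with torch.no_grad():
        _, t_feats = teacher(x_lo, return_features=True)

    # Align an earlier student feature to a later (deeper) teacher feature.
    align_loss = -F.cosine_similarity(
        proj(s_feats[student_layer]), t_feats[teacher_layer], dim=-1
    ).mean()
    return diff_loss + lam * align_loss

# Usage: the teacher is a frozen EMA copy of the student, updated each step.
student = TinyDiT()
teacher = copy.deepcopy(student).requires_grad_(False)
proj = nn.Linear(256, 256)                   # lightweight alignment head
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

x0 = torch.randn(4, 16, 256)                 # a batch of latent tokens
loss = sra_step(student, teacher, proj, x0, t_hi=0.8, t_lo=0.3)
loss.backward(); opt.step(); opt.zero_grad()
with torch.no_grad():                        # EMA update of the teacher
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(0.999).add_(p_s, alpha=0.001)
```

No external representation model or auxiliary training framework appears here: the guidance signal comes from the model's own deeper, lower-noise features.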
SRA yields consistent benefits for different baselines across model sizes.
SRA achieves comparable or superior performance to other methods that rely on either an auxiliary representation training paradigm or a representation foundation model.
SRA genuinely enhances the representation capacity of the baseline model, and its generative capability is indeed strongly correlated with the representation guidance.
@article{jiang2025sra,
title={No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves},
author={Jiang, Dengyang and Wang, Mengmeng and Li, Liuzhuozheng and Zhang, Lei and Wang, Haoyu and Wei, Wei and Zhang, Yanning and Dai, Guang and Wang, Jingdong},
journal={arXiv preprint arXiv:2505.02831},
year={2025}
}