🎭SRA

No Other Representation Component Is Needed:
Diffusion Transformers Can Provide Representation Guidance by Themselves

1 Northwestern Polytechnical University 2 SGIT AI Lab, State Grid Corporation of China
3 Zhejiang University of Technology 4 Baidu Inc.
Project lead * Corresponding author
Preprint

🚈Overview


Recent studies have shown that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches typically either introduce an additional, complex representation-training framework or rely on a large-scale pre-trained representation foundation model to provide representational guidance during the original generative training.

In this work, we argue that the unique discriminative process inherent to diffusion transformers makes it possible to offer such guidance without any external components. We thus introduce SRA (Self-Representation Alignment), a simple and straightforward method that introduces representation guidance in a self-distillation manner.

Experimental results show that SRA accelerates training and improves generation performance for both DiTs and SiTs.

🔍Observations


We find that a diffusion transformer undergoes a roughly coarse-to-fine discriminative process even when only generative training is performed: its internal representations become progressively more discriminative at deeper layers and lower noise levels.
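One way to surface such a trend is layer-wise linear probing: freeze the network, extract features at a chosen block and noise level, and fit a linear classifier on top. Below is a minimal, hypothetical PyTorch sketch; `TinyDiT`, `add_noise`, and `probe_accuracy` are illustrative stand-ins, not the paper's actual model or evaluation protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    """Toy stand-in for a diffusion transformer (illustrative only)."""
    def __init__(self, dim=64, depth=8, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth)
        )

    def features(self, x, up_to):
        # Latent tokens after the first `up_to` transformer blocks.
        for blk in self.blocks[:up_to]:
            x = blk(x)
        return x

def add_noise(x0, t):
    # Simple interpolation noising; real schedules (DDPM, flow matching) differ.
    return (1 - t) * x0 + t * torch.randn_like(x0)

def probe_accuracy(dit, x0, labels, layer, t, num_classes=10, steps=200):
    # Fit a linear head on frozen, pooled features; higher accuracy means the
    # representation at (layer, t) is more linearly discriminative.
    with torch.no_grad():
        feats = dit.features(add_noise(x0, t), up_to=layer).mean(dim=1)
    head = nn.Linear(feats.shape[-1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = F.cross_entropy(head(feats), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return (head(feats).argmax(-1) == labels).float().mean().item()

dit = TinyDiT()
x0 = torch.randn(256, 16, 64)        # toy clean latent tokens
y = torch.randint(0, 10, (256,))     # toy class labels
print(probe_accuracy(dit, x0, y, layer=4, t=0.5))
```

Sweeping `layer` and `t` with such a probe is one way to visualize where discriminative information concentrates as depth grows and noise shrinks.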

🔑Approach


In short, SRA aligns the output latent representation of the diffusion transformer at an earlier layer under higher noise to that at a later layer under lower noise, progressively enhancing overall representation learning within the generative training process alone.
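A minimal PyTorch sketch of this alignment follows, reusing the hypothetical `TinyDiT` and `add_noise` helpers from the probing sketch above. The self-distillation teacher is assumed here to be an EMA copy of the model itself; the actual layer pairing, noise sampling, projection head, and loss weighting used by SRA may differ.

```python
import copy
import torch
import torch.nn.functional as F

# TinyDiT and add_noise are the toy helpers from the probing sketch above.
student = TinyDiT()
teacher = copy.deepcopy(student)          # assumed EMA teacher (self-distillation)
for p in teacher.parameters():
    p.requires_grad_(False)

x0 = torch.randn(4, 16, 64)               # (batch, tokens, dim) clean latents
t_student, t_teacher = 0.8, 0.3           # student sees the *more* noised input

# Earlier-layer student features (higher noise) are pulled toward
# later-layer teacher features (lower noise).
z_s = student.features(add_noise(x0, t_student), up_to=3)
with torch.no_grad():
    z_t = teacher.features(add_noise(x0, t_teacher), up_to=7)

align_loss = F.smooth_l1_loss(z_s, z_t)   # alignment term (loss choice assumed)
align_loss.backward()

# After each optimizer step, the teacher tracks the student via EMA.
with torch.no_grad():
    m = 0.999
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)
```

In practice this alignment term would be added, with some weight, to the standard generative (diffusion or flow-matching) loss rather than optimized on its own.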

🍀Results


SRA yields consistent benefits over different baselines across model sizes.


SRA shows comparable or superior performance against other methods that leverage either an auxiliary representation training paradigm or an external representation foundation model.


SRA genuinely enhances the representation capacity of the baseline model, and the gains in generative capability are strongly correlated with this representation guidance.

🌺Citation

@article{jiang2025sra,
  title={No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves},
  author={Jiang, Dengyang and Wang, Mengmeng and Li, Liuzhuozheng and Zhang, Lei and Wang, Haoyu and Wei, Wei and Zhang, Yanning and Dai, Guang and Wang, Jingdong},
  journal={arXiv preprint arXiv:2505.02831},
  year={2025}
}