Masked Diffusion Models Are Fast and Privacy-Aware Learners

Published in arXiv preprint, 2023

Diffusion models have emerged as the de-facto models for image generation, yet their heavy training overhead hinders broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may incur unnecessary training costs and therefore warrants in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution; the pre-trained model can then be fine-tuned efficiently for various generation tasks. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90%) of input images to approximately represent the primer distribution, and introduce a masked denoising score matching objective that trains the model to denoise the visible areas. In the subsequent fine-tuning stage, we efficiently train the diffusion model without masking. With this two-stage training framework, we achieve significant training acceleration and a new FID record of 6.27 on CelebA-HQ 256 × 256 for ViT-based diffusion models. The generalizability of a pre-trained model further helps build models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at https://github.com/jiachenlei/maskdm.
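
To make the masked pre-training idea concrete, below is a minimal PyTorch sketch: most patches of each image are hidden and the denoising score matching loss is restricted to the visible regions. The `model` interface, patch size, noise schedule, and masking-by-zeroing are illustrative assumptions only and do not reflect the authors' actual implementation (see the linked repository for that).

```python
import torch

def masked_dsm_loss(model, x0, mask_ratio=0.9, patch_size=16, num_steps=1000):
    """Sketch of a masked denoising score matching objective.

    `model(xt, t)` is assumed to predict the added noise (epsilon),
    as in standard DDPM-style training; names are hypothetical.
    """
    b, c, h, w = x0.shape
    ph, pw = h // patch_size, w // patch_size

    # Randomly keep only a small fraction of patches per image (e.g., 10%).
    keep = (torch.rand(b, ph * pw, device=x0.device) > mask_ratio).float()
    patch_mask = keep.view(b, 1, ph, pw)
    pixel_mask = patch_mask.repeat_interleave(patch_size, dim=2) \
                           .repeat_interleave(patch_size, dim=3)

    # Standard forward noising on the full image (illustrative cosine-like schedule).
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2).pow(2).view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    # Feed only the visible content and supervise only visible pixels,
    # approximating "denoise visible areas" by masking both input and loss.
    pred = model(xt * pixel_mask, t)
    loss = ((pred - noise) ** 2 * pixel_mask).sum() / (pixel_mask.sum() * c).clamp(min=1)
    return loss
```

In the fine-tuning stage, the same loss would simply be computed with `mask_ratio=0` (i.e., an all-ones mask), matching the paper's description of training without masking.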

Recommended citation: Lei, J., Wang, Q., Cheng, P., Ba, Z., Qin, Z., Wang, Z., Liu, Z., Ren, K. (2023). "Masked Diffusion Models Are Fast and Privacy-Aware Learners." *arXiv preprint arXiv:2306.11363*.
Download Paper | Download Bibtex