[2108.01073] Image Synthesis and Editing with Stochastic Differential Equations

Significance

Hijacking the diffusion model for image generation and manipulation

Keypoints

  • Propose a diffusion based image synthesis and editing method
  • Demonstrate performance of the proposed method by experiments

Review

Background

Application of diffusion models is one of the hottest topics in generative modeling (see previous posts for diffusion models on image synthesis 1, 2). Key applications of generative models include image synthesis, which synthesizes a realistic image from a known random distribution (e.g. Gaussian noise), and semantic image manipulation, which edits an image in a semantically natural way. GANs have been widely studied as the state-of-the-art generative model for both image synthesis and image manipulation, since their perceptual quality has been better than that of other methods. However, recent improvements in diffusion models suggest that they are capable of generating better-quality images than GANs. This work further shows that diffusion models are also easily applicable to image manipulation tasks, overcoming drawbacks of GAN-based image manipulation methods, including (i) the need for latent-space search optimization, (ii) the difficulty of defining an appropriate loss function, and (iii) the need for ground-truth paired data.

Keypoints

Propose a diffusion based image synthesis and editing method

210803-1 Schematic illustration of the proposed method

The proposed method is surprisingly simple. Given a diffusion model trained for image synthesis, the authors hijack the reverse process, exploiting the fact that in its early steps the images being denoised are still of very low quality (dominated by noise). The authors adopt the Stochastic Differential Equation (SDE) formulation of diffusion models and demonstrate good results.

210803-2 Pseudocode for image synthesis with SDEdit

210803-3 Pseudocode for image manipulation with SDEdit

As mentioned earlier, the proposed method (SDEdit) (i) does not require optimization in the latent space, since the reverse process is performed directly on the corrupted (i.e. user-marked with strokes) sample, (ii) does not require defining a loss between the stroke image and its realistic counterpart, and (iii) does not require paired data, since it exploits a diffusion model already trained for image synthesis.
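The procedure in the pseudocode figures above boils down to two steps: perturb the guide image with noise up to an intermediate time t0, then run the reverse process from t0 back to 0. A minimal toy sketch follows; note this is not the authors' implementation — `toy_denoise_step` and all parameter names are assumptions, and a real SDEdit run would call a pretrained score network for the reverse step:

```python
import numpy as np

def toy_denoise_step(x, sigma_hi, sigma_lo, rng):
    # Toy stand-in for one learned reverse-diffusion step. Here the "data
    # distribution" is pretended to be N(0, I), so the score of the perturbed
    # distribution N(0, (1 + sigma^2) I) is -x / (1 + sigma^2).
    score = -x / (1.0 + sigma_hi ** 2)
    # Deterministic (probability-flow-style) update from sigma_hi to sigma_lo.
    return x + 0.5 * (sigma_hi ** 2 - sigma_lo ** 2) * score

def sdedit(guide, denoise_step, t0=0.5, n_steps=100, sigma_max=1.0, seed=0):
    """Sketch of the SDEdit idea: add noise to the guide image (e.g. a stroke
    painting) up to an intermediate time t0 in (0, 1], then run the reverse
    process from t0 back to 0. `denoise_step` is a hypothetical placeholder
    for one reverse step of a pretrained diffusion model."""
    rng = np.random.default_rng(seed)
    sigma0 = sigma_max * t0                         # noise level at the hijack point
    x = guide + sigma0 * rng.standard_normal(guide.shape)
    sigmas = np.linspace(sigma0, 0.0, n_steps + 1)  # decreasing noise schedule
    for sigma_hi, sigma_lo in zip(sigmas[:-1], sigmas[1:]):
        x = denoise_step(x, sigma_hi, sigma_lo, rng)
    return x

guide = np.full((8, 8), 2.0)  # crude stand-in for a user stroke painting
edited = sdedit(guide, toy_denoise_step, t0=0.5)
```

The choice of t0 trades off faithfulness against realism: a smaller t0 adds less noise, so the output stays closer to the user's guide, while a larger t0 gives the reverse process more freedom to land on a realistic image.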

Demonstrate performance of the proposed method by experiments

The performance of the proposed method is mostly compared to GAN-based methods qualitatively. For quantitative results, a Mechanical Turk human evaluation was performed to measure preference for the proposed method over other baseline methods.

Stroke-based image synthesis

Stroke-based image synthesis is compared to other methods on the LSUN and CelebA datasets.

210803-4 Comparative result of stroke-based image synthesis on LSUN bedroom

210803-5 Qualitative result of proposed SDEdit on other datasets

It can be seen that the proposed method generates higher-quality images based on user strokes.

210803-9 Faithfulness result of stroke-based image synthesis (human preference vs. SDEdit)

In the human evaluation, SDEdit is more often preferred over GAN baselines for stroke-based image synthesis.

Stroke-based image manipulation

For the image manipulation task, SDEdit was compared to baselines on the LSUN, CelebA, and FFHQ datasets.

210803-6 Comparative result of stroke-based image manipulation on LSUN, CelebA datasets

210803-7 Comparative result of stroke-based image manipulation on FFHQ dataset

It is also qualitatively suggested that the edited images have better perceptual quality with the proposed SDEdit.

210803-8 Faithfulness result of stroke-based image editing (human preference vs. SDEdit)

Human evaluation also suggests a preference for SDEdit over GAN-based image editing results.
