# [2111.05826] Palette: Image-to-Image Diffusion Models

### Significance

Diffusion models beat GANs on image-to-image translation

### Keypoints

• Propose a method for applying denoising diffusion models for image-to-image translation tasks
• Introduce a unified evaluation protocol for inpainting, uncropping, and JPEG decompression
• Demonstrate performance and general applicability of the proposed method by experiments

### Review

#### Background

Diffusion models are a class of generative models that have attracted considerable interest owing to their state-of-the-art generative performance. Although there have been attempts to apply diffusion models to conditional generation problems, the results had not been strong enough to rival GANs on image-to-image translation tasks such as colorization, inpainting, uncropping, and JPEG decompression. The authors study and confirm the versatility and general applicability of diffusion models on these tasks, and show that they can outperform GAN-based methods in terms of image quality and diversity.

#### Keypoints

##### Propose a method for applying denoising diffusion models for image-to-image translation tasks

The authors propose Palette, a conditional diffusion model whose network is essentially a 256 $\times$ 256 class-conditional U-Net adapted to take the input image as conditioning. The training objective of Palette is formulated as: $$\mathbb{E}_{(\mathbf{x},\mathbf{y})} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \mathbb{E}_{\gamma} \left\| f_{\theta} (\mathbf{x}, \sqrt{\gamma}\,\mathbf{y} + \sqrt{1-\gamma}\,\epsilon , \gamma) - \epsilon \right\|^{p}_{p},$$ where $\mathbf{x}$ and $\mathbf{y}$ are the input and target images, $f_{\theta}$ is a trainable neural network that predicts the noise $\epsilon$ added to the target, and $\gamma \in [0,1]$ is the noise level indicator. The authors compare the L1 and L2 norms, i.e. $p \in \{1, 2\}$, for their effect on the final image characteristics, and find that the L1 objective yields more conservative results with lower diversity, while the L2 objective yields images with better diversity.
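The objective above can be illustrated with a minimal NumPy sketch of a single Monte Carlo estimate of the loss. This is not the authors' implementation: the function name `palette_loss`, the `(B, H, W, C)` array layout, and the uniform sampling of $\gamma$ are assumptions made here for illustration, and `f_theta` stands for any callable that returns a noise prediction.

```python
import numpy as np

def palette_loss(f_theta, x, y, rng, p=2):
    """One Monte Carlo sample of the denoising objective for a batch.

    x: conditioning (input) images, y: target images, both shaped (B, H, W, C).
    f_theta: callable (x, y_noisy, gamma) -> predicted noise, same shape as y.
    """
    batch = y.shape[0]
    # Assumption: gamma is drawn uniformly from [0, 1); a real model would
    # follow its own noise schedule.
    gamma = rng.uniform(0.0, 1.0, size=(batch, 1, 1, 1))
    eps = rng.standard_normal(y.shape)
    # Noisy target: sqrt(gamma) * y + sqrt(1 - gamma) * eps
    y_noisy = np.sqrt(gamma) * y + np.sqrt(1.0 - gamma) * eps
    residual = f_theta(x, y_noisy, gamma) - eps
    # ||.||_p^p summed over pixels, then averaged over the batch.
    per_example = np.sum(np.abs(residual) ** p, axis=(1, 2, 3))
    return float(per_example.mean())

# Usage with a placeholder "network" that always predicts zero noise.
rng = np.random.default_rng(0)
x = rng.random((2, 4, 4, 3))
y = rng.random((2, 4, 4, 3))
zero_predictor = lambda x_, y_noisy, gamma: np.zeros_like(y_noisy)
loss_l2 = palette_loss(zero_predictor, x, y, rng, p=2)
loss_l1 = palette_loss(zero_predictor, x, y, rng, p=1)
```

Switching `p` between 1 and 2 is the only change needed to compare the two norms discussed above.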

##### Introduce a unified evaluation protocol for inpainting, uncropping, and JPEG decompression

To evaluate image-to-image translation models consistently, the authors propose a unified evaluation protocol for inpainting, uncropping, and JPEG decompression. Models are evaluated on the ctest10k split of ImageNet and on the proposed places10k subset, using four automated metrics: Inception Score (IS), Fréchet Inception Distance (FID), Classification Accuracy (CA) of a pretrained ResNet-50, and Perceptual Distance (PD), measured as a distance in the Inception-v1 feature space. Sample diversity is evaluated by visual inspection and by histograms of pairwise SSIM scores between multiple model outputs for the same input. Human evaluation is additionally reported as the fool rate in 2-alternative forced choice (2AFC) trials.
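As a rough illustration of the pairwise-SSIM diversity measure, the sketch below scores every pair of outputs generated for one input. It assumes a simplified SSIM computed from whole-image statistics rather than the usual sliding-window version, and the helper names `global_ssim` and `pairwise_ssim` are hypothetical; lower scores across pairs suggest more diverse samples.

```python
from itertools import combinations

import numpy as np

def global_ssim(a, b, data_range=1.0):
    """Simplified SSIM using whole-image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

def pairwise_ssim(samples):
    """SSIM for every unordered pair of samples generated from one input."""
    return [global_ssim(a, b) for a, b in combinations(samples, 2)]

# Usage: four stand-in "model outputs" (random images in [0, 1)).
rng = np.random.default_rng(0)
samples = [rng.random((16, 16)) for _ in range(4)]
scores = pairwise_ssim(samples)
# Bin the scores as one would for a diversity histogram.
counts, bin_edges = np.histogram(scores, bins=10, range=(-1.0, 1.0))
```

For reported numbers, a windowed SSIM implementation from an established image-processing library would be the more faithful choice.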

##### Demonstrate performance and general applicability of the proposed method by experiments

The authors evaluate the performance and general applicability of the proposed model on colorization, inpainting, uncropping, and JPEG decompression. They further test a single model trained across multiple tasks (named Multi-Task Palette) to show general applicability to image-to-image translation. The results suggest that Palette outperforms previous methods on these four tasks and achieves state-of-the-art results both qualitatively and quantitatively.

###### Colorization

Qualitative results of the proposed method in colorization

Quantitative results of the proposed method in colorization

###### Inpainting

Qualitative results of the proposed method in inpainting

Quantitative results of the proposed method in inpainting

###### Uncropping

Qualitative results of the proposed method in uncropping

Quantitative results of the proposed method in uncropping

###### JPEG decompression

Qualitative results of the proposed method in JPEG decompression

Quantitative results of the proposed method in JPEG decompression

Another important property of Palette is the diversity of its output samples. GAN-based methods, by contrast, often suffer from mode collapse, which leads to less diverse output images.

Exemplar image outputs from Palette, suggesting diversity

Further results on Multi-Task Palette and on the role of self-attention in the model architecture can be found in the original paper.