# [2103.17249] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

### Significance

Make your cat cute by typing a text prompt into this StyleGAN-CLIP hybrid

### Keypoints

• Propose three methods that bridge the gap between the latent space of StyleGAN and CLIP to realize text-driven image manipulation
• The proposed methods are compared with one another and with baseline methods

### Review

#### Background

StyleGAN is a generative model capable of producing high-quality (1024$\times$1024) images by applying AdaIN-based style modulation to a ProGAN backbone. A GitHub repository was among the first to investigate the high-level disentanglement of the StyleGAN latent space, whose codes are injected into each level of the generator via AdaIN. This finding was elaborated by works such as Image2StyleGAN, InterFaceGAN, and StyleFlow, confirming its potential for semantic image editing. In parallel, the recent introduction of powerful multimodal (more specifically, text-image) models such as DALL-E and CLIP suggests semantic image editing guided by natural language. The key challenge is to map a text-specified manipulation in the CLIP embedding space to a corresponding manipulation direction in the StyleGAN latent space. This work proposes three ways to bridge that gap. Your cat is now cute!

#### Keypoints

##### Propose three methods that bridge the gap between the latent space of StyleGAN and CLIP to realize text-driven image manipulation
###### Method 1: Latent Optimization

This approach follows early StyleGAN encoding methods that find the embedding of an image by optimization in the latent space. The difference is that it incorporates the cosine distance between the CLIP embeddings of the text prompt $t$ and of the image generated from the latent vector $w$ into the optimization objective: $$\underset{w \in \mathcal{W}+}{\arg\min}\;D_{\text{CLIP}}(G(w),t) + \lambda_{L2} \|w-w_{s}\|_{2} + \lambda_{\text{ID}}\mathcal{L}_{\text{ID}}(w),$$ where $G$ is a pre-trained StyleGAN generator and $w_{s}$ is the latent code of the source image. The ID loss $\mathcal{L}_{\text{ID}}(w)$ is one minus the cosine similarity between the embeddings of the source image $G(w_{s})$ and the edited image $G(w)$ under the pre-trained ArcFace network $R$: $$\mathcal{L}_{\text{ID}}(w) = 1 - \langle R(G(w_{s})), R(G(w)) \rangle.$$ Since ArcFace is trained for face recognition, keeping this cosine similarity high preserves the identity of the face being edited. (ArcFace is also well known for its loss function, which is suited to metric learning tasks such as text-image retrieval.) Latent optimization is easy to implement and conceptually straightforward, but it is slow, taking several minutes per image.

Examples of Latent Optimization output
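The objective above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the pretrained StyleGAN generator $G$, the CLIP image encoder, and the ArcFace network $R$ are replaced by hypothetical random linear maps so the loss computation is self-contained; the $\lambda$ values are placeholders, not the paper's settings.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained networks in the paper:
# the StyleGAN generator G, the CLIP image encoder, and ArcFace R.
rng = np.random.default_rng(0)
D = 16                              # toy latent dimension (the real W+ is 18x512)
W_g = rng.standard_normal((D, D))   # "generator": latent -> image features
W_c = rng.standard_normal((D, D))   # "CLIP image encoder"
W_r = rng.standard_normal((D, D))   # "ArcFace" face-embedding network

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def G(w):        return W_g @ w     # generated image (stand-in)
def clip_img(x): return W_c @ x     # CLIP embedding of an image (stand-in)
def R(x):        return W_r @ x     # ArcFace embedding of a face (stand-in)

def styleclip_loss(w, w_s, t_emb, lam_l2=0.008, lam_id=0.005):
    """Latent-optimization objective: D_CLIP + L2 prior + identity loss."""
    d_clip = 1.0 - cos(clip_img(G(w)), t_emb)   # D_CLIP as cosine distance
    l2     = np.linalg.norm(w - w_s)            # ||w - w_s||_2, stay near source
    l_id   = 1.0 - cos(R(G(w_s)), R(G(w)))      # L_ID, identity preservation
    return d_clip + lam_l2 * l2 + lam_id * l_id

w_s   = rng.standard_normal(D)      # source latent code
t_emb = rng.standard_normal(D)      # CLIP embedding of the text prompt

# At w = w_s the L2 and ID terms vanish, leaving only the CLIP distance.
base = styleclip_loss(w_s, w_s, t_emb)
```

In the actual method, this scalar would be minimized over $w$ with a gradient-based optimizer (backpropagating through the frozen networks), which is what makes the per-image runtime long.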