[2103.14021] Orthogonal Projection Loss

Significance

Simple method with simple theoretical background to improve model performance supported by extensive experiments

Keypoints

  • Propose a loss function that enforces inter-class orthogonality in the feature space
  • Confirm performance improvement by extensive experiments

Review

Background

Cross-entropy loss is a common choice of option when training a deep neural network for classification tasks. Variants of the cross-entropy loss, such as the angular margin based losses, have been studied to enhance its performance by improving discriminability of the feature space. This work also aims to improve the cross-entropy loss by enforcing inter-class separation and intra-class clustering without introducing any additional learnable parameters.

Keypoints

Propose a loss function that enforces inter-class orthogonality in the feature space

Training with cross-entropy loss implicitly achieves inter-class orthogonality of the features if the ground-truth labels are encoded as one-hot vectors (which are orthogonal). The key idea of the authors is to enforce this inter-class orthogonality more explicitly by adding a simple and straightforward loss function to the cross-entropy loss. The orthogonal projection loss $\mathcal{L}_{\texttt{OPL}}$ is defined as

\begin{align} s &= \sum_{i,j \in B , y_{i} = y_{j}} \langle \mathbf{f}_{i}, \mathbf{f} _{j} \rangle \\ d &= \sum_{i,k \in B , y_{i} \neq y_{k}} \langle \mathbf{f}_{i}, \mathbf{f} _{k} \rangle \\ \mathcal{L}_{\texttt{OPL}} &= (1-s) + |d|\label{opl} \end{align}

where $y$ is the class label and $\mathbf{f}$ is the intermediate feature of the neural network with subscript $i$, $j$, $k$ are indices within the minibatch $B$. It should be noted that the bracket notation $\langle \cdot , \cdot \rangle$ denotes cosine similarity in this paper, not the inner product between the two vectors. The orthogonal projection loss drags the cosine similarity of the same class samples $s$ towards 1 (i.e., linear dependence between the two vectors) and the cosine similarity of the different class samples $d$ towards 0 (i.e., orthogonality between the two vectors). 210326-1 Cross-entropy loss implicitly achieves orthogonality. Adding orthogonal projection loss can further achieve this explicitly. This inter-class orthogonality and intra-class clustering is expected to improve the performance of the trained neural network.

In practice, the importance of $d$ with respect to $s$ is controlled by introducing a hyperparameter $\gamma$, reformulating the equation \eqref{opl} to \begin{align} \mathcal{L} _ {\texttt{OPL}} = (1 - s) + \gamma * |d|. \end{align}

Confirm performance improvement by extensive experiments

Improvement in performance is confirmed by extensive experiments, including image classification task, robustness against label noise, robustness against adversarial attacks and generalization to domain shifts. 210326-2 210326-3 Image classification task using CIFAR-100 and ImageNet dataset 210326-4 Label noise robustness 210326-5 Adversarial robustness 210326-6 Generalization to domain shifts

The authors further observe that transferability is enhanced in the case of of orthogonal features. 210326-7 Transferability in few-shot learning

Related

Share

#language-model #responsible-ai #privacy-preserving #scaling #mixture-of-experts #image-generation #diffusion-model #generative-adversarial-network #speech-model #multi-modal #contrastive-learning #self-supervised #image-representation #image-processing #object-detection #pseudo-labeling #scene-text-detection #neural-architecture-search #notice #data-sampling #long-tail #graph-neural-network #graph-representation #zero-shot #metric-learning #federated-learning #weight-matrix #low-rank #vision-transformer #computer-vision #normalizing-flow #invertible-neural-network #super-resolution #image-manipulation #thread-summarization #natural-language-processing #domain-adaptation #knowledge-distillation #scene-text #model-compression #semantic-segmentation #instance-segmentation #video-understanding #code-generation #graph-generation #image-translation #data-augmentation #model-pruning #signal-processing #text-generation #text-classification #music-representation #transfer-learning #link-prediction #counterfactual-learning #medical-imaging #acceleration #transformer #style-transfer #novel-view-synthesis #point-cloud #spiking-neural-network #optimization #multi-layer-perceptron #adversarial-training #visual-search #image-retrieval #negative-sampling #action-localization #weakly-supervised #data-compression #hypergraph #adversarial-attack #submodularity #active-learning #deblurring #object-tracking #pyramid-structure #loss-function #gradient-descent #generalization #bug-fix #orthogonality #explainability #saliency-mapping #information-theory #question-answering #knowledge-graph #robustness #limited-data #recommender-system #anomaly-detection #gaussian-discriminant-analysis #molecular-graph #video-processing