Applying Optimal Transport Theory to Emotion Analysis in NLP

4 minute read

Published:

Optimal Transport (OT) theory, with roots in 18th-century mathematics, has found remarkable applications in modern machine learning, particularly in generative modeling and distribution alignment. However, its potential in Natural Language Processing, especially for structured classification tasks like emotion analysis, has been underexplored. In this post, I discuss how Optimal Transport theory can be leveraged to improve emotion classification by capturing semantic relationships between emotion labels.

What is Optimal Transport?

Optimal Transport, originally formulated by Monge and later refined by Kantorovich, addresses the problem of finding the most cost-effective way to move “mass” from one distribution to another. In modern ML, OT provides a principled way to measure distances between probability distributions that respect the underlying geometry of the space.

The Wasserstein distance (also known as Earth Mover’s Distance) is perhaps the most well-known OT metric. It measures the minimum cost of transforming one distribution into another, where the cost depends on the distance between points in the metric space.

Why Optimal Transport for Emotion Analysis?

Emotion labels are not independent—they exist in a structured space where some emotions are more similar than others. Traditional classification losses (like cross-entropy) treat all misclassifications equally, ignoring these semantic relationships. Optimal Transport offers a natural framework to incorporate this structure.

Consider the emotion “joy” and “happiness”. These are semantically very close, and confusing one for the other should be penalized less than confusing “joy” with “anger”. By defining a cost matrix based on the hierarchical or semantic structure of emotions, OT can capture these nuanced relationships.

Earth Mover’s Distance for Emotions

The Earth Mover’s Distance (EMD) variant of Optimal Transport is particularly suitable for emotion analysis. Here’s how it works:

  1. Cost Matrix Construction: Define a cost matrix where each entry represents the semantic distance between two emotion labels. This can be based on:
    • Hierarchical relationships (parent-child distance in an emotion taxonomy)
    • Semantic embeddings (cosine distance between emotion word embeddings)
    • Psychological models (distance in emotion circumplex or dimensional models)
  2. Distribution Representation: Represent predictions and ground truth as probability distributions over emotion labels.

  3. Distance Calculation: Compute the EMD between the predicted and true distributions using the cost matrix.

Integration into Training

To use EMD as a training objective, we need to make it differentiable. The Sinkhorn algorithm provides an efficient, differentiable approximation to the exact EMD calculation, making it suitable for gradient-based optimization.

The key insight is that by minimizing the EMD between predictions and ground truth during training, the model learns to make mistakes that are semantically less severe—predictions that are “close” to the correct emotion even if not exactly correct.

Evaluation Advantages

Beyond training, EMD serves as an excellent evaluation metric for emotion analysis. Unlike accuracy or F1-score, EMD:

  • Captures severity of errors: A prediction of “excitement” when the true label is “joy” incurs less EMD than predicting “sadness”
  • Provides interpretable scores: The EMD value directly reflects the average semantic distance of errors
  • Handles multi-label scenarios: Naturally extends to cases where multiple emotions can be present

Implementation Considerations

Implementing OT-based losses for emotion analysis requires careful consideration:

  1. Cost matrix design: The choice of cost matrix significantly affects performance. Hierarchical structures, learned embeddings, or domain knowledge can inform this design.

  2. Computational efficiency: Exact EMD computation can be expensive. Sinkhorn iterations provide a good balance between accuracy and efficiency.

  3. Hyperparameter tuning: The regularization parameter in Sinkhorn needs tuning, as it balances computational efficiency and accuracy of the OT approximation.

Practical Results

In our experiments with the GoEmotions dataset, incorporating EMD into both the loss function and evaluation metric led to:

  • Better semantic coherence in predictions
  • More interpretable error patterns
  • Improved fine-grained emotion distinction

Models trained with EMD-based losses showed a better understanding of emotion relationships, making more “reasonable” mistakes when errors occurred.

Conclusion

Optimal Transport theory provides a powerful framework for emotion analysis that naturally incorporates semantic and hierarchical relationships between emotions. By moving beyond independent label assumptions and leveraging OT’s ability to respect structured spaces, we can build more sophisticated and human-aligned emotion understanding systems.

The application of OT to NLP tasks is still emerging, and emotion analysis represents just one promising direction. As we continue to recognize that many NLP tasks involve structured, non-independent labels, OT’s role in the field is likely to grow significantly.


This work is part of my research on hierarchical emotion analysis. For implementation details and experimental results, see our published work on Emotional Earth Mover’s Distance.