We propose a framework for multimodal sentiment analysis and emotion recognition that uses convolutional neural network-based feature extraction from the text and visual modalities. We obtain a performance improvement of 10% over the state of the art by combining visual, text and audio features. We also discuss some major issues frequently ignored in multimodal sentiment analysis research: the role of speaker-independent models, the importance of each modality, and generalizability. The paper thus serves as a new benchmark for further research in multimodal sentiment analysis and also demonstrates the different facets of analysis to be considered when performing such tasks.
In this paper, we propose a CNN-based framework for feature extraction from the visual and text modalities and a method for fusing them for multimodal sentiment analysis and emotion recognition.
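This section does not spell out the fusion method, so the following is only a minimal sketch of one common option, feature-level fusion by concatenation, and should not be read as the exact procedure used here. The function name `fuse_features`, the modality keys, and the toy vectors are all hypothetical; in practice each vector would come from the corresponding feature extractor.

```python
from typing import Dict, List, Sequence


def fuse_features(modalities: Dict[str, Sequence[float]],
                  order: Sequence[str]) -> List[float]:
    """Concatenate per-utterance feature vectors in a fixed modality order.

    Assumption: fusion is plain concatenation of already-extracted features.
    """
    fused: List[float] = []
    for name in order:
        if name not in modalities:
            raise KeyError(f"missing modality: {name}")
        fused.extend(modalities[name])
    return fused


# Hypothetical toy feature vectors for a single utterance.
utterance = {
    "text": [0.1, 0.2, 0.3],
    "audio": [0.5, 0.6],
    "visual": [0.9],
}

# Trimodal fusion: text + audio + visual.
fused = fuse_features(utterance, order=("text", "audio", "visual"))

# Bimodal fusion: text + visual only.
bimodal = fuse_features(utterance, order=("text", "visual"))
```

Keeping the modality order fixed matters because a downstream classifier expects each feature dimension to carry the same meaning for every utterance.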
- Textual Features
- Audio Features
- Visual Features
- Multimodal Sentiment Analysis Datasets
  - MOUD dataset
  - MOSI dataset
- Multimodal Emotion Recognition Dataset
  - IEMOCAP dataset
The above examples demonstrate the effectiveness and robustness of our model in capturing the overall video semantics of the utterances for emotion and sentiment detection. They also show how bimodal and multimodal models, given input from multiple media, overcome the limitations of unimodal networks.
We also explored the misclassified validation utterances and found some interesting trends. A video consists of a group of utterances that have contextual dependencies among them. Our model treats each utterance in isolation, so it failed to classify utterances whose emotional polarity was highly dependent on the context described in an earlier or later part of the video. Modeling such interdependence was outside the scope of this paper, and we therefore list it as future work.
We have presented a framework for multimodal sentiment analysis and multimodal emotion recognition that outperforms the state of the art in both tasks by a significant margin. We have also discussed some major aspects of the multimodal sentiment analysis problem, such as the performance of speaker-independent models and the cross-dataset performance of the models.
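In a speaker-independent evaluation, no speaker in the test set also appears in the training set. The sketch below illustrates such a split under stated assumptions: each utterance record carries a hypothetical `speaker` field, and the names `speaker_independent_split` and `test_speakers` are illustrative rather than taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Utterance = Dict[str, str]


def speaker_independent_split(
    utterances: List[Utterance], test_speakers: Set[str]
) -> Tuple[List[Utterance], List[Utterance]]:
    """Split utterances so that held-out speakers appear only in the test set."""
    train: List[Utterance] = []
    test: List[Utterance] = []
    for utt in utterances:
        if utt["speaker"] in test_speakers:
            test.append(utt)
        else:
            train.append(utt)
    return train, test


# Hypothetical toy data: five utterances from three speakers.
utterances = [
    {"id": "u1", "speaker": "s1"},
    {"id": "u2", "speaker": "s1"},
    {"id": "u3", "speaker": "s2"},
    {"id": "u4", "speaker": "s3"},
    {"id": "u5", "speaker": "s3"},
]

# Hold out every utterance spoken by s3.
train, test = speaker_independent_split(utterances, test_speakers={"s3"})
```

Splitting by speaker rather than by utterance prevents a model from scoring well merely by recognizing the voice or face of a speaker it saw during training.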
Our future work will focus on extracting semantics from the visual features, on the relatedness of cross-modal features, and on their fusion. We will also include contextual dependency learning in our model to overcome the limitations mentioned in Section 5.