It is difficult to assess how well a given model generalizes across different tasks and datasets.
In this paper, we address this problem by comparing several models on six benchmarks that span different domains and levels of granularity (binary, 3-class, 4-class, and 5-class).
We show that Bi-LSTMs perform well across datasets and that both LSTMs and Bi-LSTMs are particularly good at fine-grained sentiment tasks (i.e., with more than two classes).
Incorporating sentiment information into word embeddings during training gives good results for datasets that are lexically similar to the training data.
With our experiments, we contribute to a better understanding of how different model architectures perform on different datasets.
In addition, we report new state-of-the-art results on the SenTube datasets.




  • The Stanford Sentiment Treebank (SST-fine) is a dataset of movie reviews.

In order to compare with Faruqui et al. (2015), we also adapt the dataset to the task of binary sentiment analysis: strong negative and negative are mapped to one label, strong positive and positive are mapped to another, and the neutral examples are dropped. This leads to a slightly different split of 6920/872/1821 (we refer to this dataset as SST-binary).

  • The OpeNER dataset is a dataset of hotel reviews in which each review is annotated for opinions.
  • The SenTube datasets are texts that are taken from YouTube comments regarding automobiles (SenTube-A) and tablets (SenTube-T).
  • The SemEval 2013 Twitter dataset (SemEval) is a dataset that contains tweets collected for the 2013 SemEval shared task B.
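The SST-binary label mapping described above can be sketched as follows. This is an illustrative reconstruction, not the actual preprocessing code; the integer label IDs (0 = strong negative through 4 = strong positive) are assumptions about the SST label encoding.

```python
# Collapse 5-class SST labels to binary: merge the two negative and the
# two positive classes, and drop neutral examples (label 2).
# Label IDs are assumed: 0=strong neg, 1=neg, 2=neutral, 3=pos, 4=strong pos.

def to_binary(examples):
    """Map (text, 5-class label) pairs to binary labels, dropping neutral."""
    mapping = {0: 0, 1: 0, 3: 1, 4: 1}  # negatives -> 0, positives -> 1
    return [(text, mapping[label]) for text, label in examples if label != 2]

data = [("awful", 0), ("bad", 1), ("okay", 2), ("good", 3), ("great", 4)]
print(to_binary(data))
# [('awful', 0), ('bad', 0), ('good', 1), ('great', 1)]
```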




First, we train an L2-regularized logistic regression on a bag-of-words representation (BOW) of the training examples, where each example is represented as a vector of size n, with n = |V | and V the vocabulary. This is a standard baseline for text classification.
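The BOW baseline can be sketched as below, assuming scikit-learn; the toy corpus and labels stand in for the actual datasets, and the regularization strength is left at the library default rather than the tuned value.

```python
# Sketch of the BOW baseline: bag-of-words counts of size |V|,
# fed to an L2-regularized logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["a great movie", "a terrible movie", "great fun", "terrible acting"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()                  # builds the vocabulary V
X = vectorizer.fit_transform(train_texts)       # each row has size |V|

clf = LogisticRegression(penalty="l2", C=1.0)   # L2-regularized
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["a great film"])))
```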

Our second baseline is an L2-regularized logistic regression classifier trained on the average of the word vectors in the training example (AVE). We train word embeddings using the skip-gram with negative sampling algorithm [Mikolov et al., 2013] on a 2016 Wikipedia dump, using 50-, 100-, 200-, and 600-dimensional vectors, a window size of 10, 5 negative samples, and we set the subsampling parameter to 10^-4. Additionally, we use the publicly available 300-dimensional GoogleNews vectors in order to compare to previous work.
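A minimal sketch of the AVE baseline, assuming scikit-learn and NumPy; the tiny embedding dictionary below is a toy stand-in for the trained skip-gram vectors.

```python
# AVE baseline: represent each document as the average of its word
# vectors, then train an L2-regularized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

emb = {  # toy 3-dimensional "embeddings" (stand-in for skip-gram vectors)
    "good":  np.array([1.0, 0.2, 0.0]),
    "bad":   np.array([-1.0, 0.1, 0.0]),
    "movie": np.array([0.0, 0.5, 1.0]),
}

def average_vector(text, dim=3):
    """Average the embeddings of all in-vocabulary words in the text."""
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.vstack([average_vector(t) for t in ["good movie", "bad movie"]])
y = [1, 0]
clf = LogisticRegression(penalty="l2").fit(X, y)
print(clf.predict(average_vector("good").reshape(1, -1)))
```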


We apply the approach of Faruqui et al. (2015), using their released code in combination with the PPDB-XL lexicon, as this gave the best results for sentiment analysis in their experiments. We train for 10 iterations. Following the authors' setup, for testing we train an L2-regularized logistic regression classifier on the average word embeddings for a phrase (RETROFIT).
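The core retrofitting update can be illustrated as below. This is a hedged sketch of the iterative rule, not the released code: each vector is pulled toward both its original position and its lexicon neighbors. The weighting (original vector weight 1, neighbor weights 1/degree) is the common choice in the retrofitting literature and should be treated as an assumption here.

```python
# Illustrative retrofitting sketch: iteratively move each word vector
# toward the average of its original position and its lexicon neighbors.
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """vectors: {word: np.array}; lexicon: {word: [neighbor words]}."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for w, neighbors in lexicon.items():
            nbrs = [n for n in neighbors if n in new]
            if w not in new or not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # assumed neighbor weight: 1/degree
            # numerator: original vector (weight 1) + weighted neighbors;
            # denominator: 1 + sum of neighbor weights = 2
            new[w] = (vectors[w] + beta * sum(new[n] for n in nbrs)) / 2.0
    return new

vecs = {"happy": np.array([1.0, 0.0]), "glad": np.array([0.0, 1.0])}
out = retrofit(vecs, {"happy": ["glad"]})
print(out["happy"])  # pulled halfway toward "glad"
```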

Joint Training

For the joint method, we use the 50-dimensional sentiment embeddings provided by Tang et al. (2014). Additionally, we create 100-, 200-, and 300-dimensional embeddings using their publicly available code. We use the same hyperparameters as Tang et al. (2014): five million positive and negative tweets crawled using hashtags as proxies for sentiment, a 20-dimensional hidden layer, and a window size of three. Following the authors' setup, we concatenate the maximum, minimum, and average vectors of the word embeddings for each phrase. We then train a linear SVM on these representations (JOINT).
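The JOINT phrase representation can be sketched as follows, assuming scikit-learn and NumPy; the embedding dictionary is a toy stand-in for the sentiment embeddings of Tang et al. (2014), and the SVM regularization is an untuned default.

```python
# JOINT representation: concatenate the element-wise maximum, minimum,
# and average of a phrase's word embeddings, then train a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

emb = {  # toy 2-dimensional "sentiment embeddings"
    "love": np.array([0.9, 0.1]),
    "hate": np.array([-0.9, 0.2]),
    "it":   np.array([0.0, 0.3]),
}

def phrase_features(text):
    """Concatenate max, min, and mean over the phrase's word vectors."""
    vecs = np.vstack([emb[w] for w in text.split() if w in emb])
    return np.concatenate([vecs.max(0), vecs.min(0), vecs.mean(0)])

X = np.vstack([phrase_features(t) for t in ["love it", "hate it"]])
clf = LinearSVC(C=1.0).fit(X, [1, 0])
print(clf.predict(phrase_features("love").reshape(1, -1)))
```

Note that the concatenation triples the feature dimension relative to a plain average, letting the classifier see both the strongest and weakest sentiment signals in the phrase.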

Supervised Training