How to Solve Sentiment Analysis Challenges with Deep Learning Models?
-
This article focuses on sentiment analysis, discussing its challenges and how deep learning models can address them.
Meltwater has been providing sentiment analysis through machine learning for over 10 years. The first models were deployed in English and German in 2009. Today, Meltwater's in-house models support 16 languages. This blog post explores how deep learning and feedback loops are used to deliver sentiment analysis at scale to over 30,000 global clients.
Sentiment analysis is a field within natural language processing (NLP) that involves identifying and categorizing subjective opinions from text [1]. Its scope ranges from detecting emotions (e.g., anger, happiness, fear) to identifying sarcasm and intent (e.g., complaints, feedback, opinions). In its simplest form, sentiment analysis assigns attributes (e.g., positive, negative, neutral) to a piece of text.
Let’s look at a few examples:
"Acme is by far the worst company I’ve ever encountered."
This sentence clearly expresses a negative opinion. The sentiment is carried by the phrase "the worst company" (the sentiment phrase) and directed at "Acme" (the sentiment target).
"Tomorrow, Acme and NewCo will release their latest revenue data."
Here, we have a factual statement about "Acme" and "NewCo." The statement is neutral.
"Supported by record sales figures and a soaring stock market over the past year, NewCo became the first pension plan to accumulate $1 trillion in assets on its platform."
This time, phrases like "supported" and "record sales" are used in a positive semantic context, referring to "NewCo."
As noted above, Meltwater has been providing machine-learning-based sentiment analysis since 2009. Today, its in-house models support 16 languages: Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Italian, Japanese, Korean, Norwegian, Portuguese, Spanish, and Swedish.
Most of our clients analyze sentiment trends through media monitoring dashboards (Figure 1) or reports. Larger clients can access our data in enriched document form via the Fairhair.ai data platform.
Figure 1: Meltwater Media Intelligence Dashboard.
A key feature of the product is the ability for users to override the sentiment values assigned by the algorithm. Overridden sentiment attributes are indexed as different "versions" of the same document in Meltwater’s Elasticsearch cluster, providing clients with a personalized view of sentiment when building dashboards and reports (Figure 2).
Figure 2: "Sentiment Attribute" override dropdown in Meltwater’s Media Intelligence content stream.
Each month, our clients override sentiment values in approximately 200,000 documents—6,500 documents per day! So, why is sentiment so difficult to get right?
Several nuances of human language make sentiment analysis particularly challenging. Here are a few examples:
Handling Negation:
"How is your company doing? Not bad! I’m not very satisfied with the latest financial situation…"
Here, we have three sentences: the first is neutral, the second is positive but contains "bad," which is typically used in a negative context, and the third is negative but includes "very satisfied."
Sarcasm:
"It’s raining again today… fun times!"
Despite the phrase "fun times," the text may be sarcastic and express negative sentiment.
Comparative Sentiment:
"I love the new Acme phones; they’re much better than NewCo’s."
Here, expressions like "love" and "much better" carry positive sentiment, but the evaluation is negative for "NewCo."
Context-Dependent Perspective:
"The Acme Police Department arrested eight individuals today for alleged assault and robbery. The gang has been terrorizing the community for months."
Beyond word meanings, all of the above require an understanding of context: the last example, for instance, is negative for the arrested gang but arguably positive from the perspective of the police department and the community it serves.
A practical issue to address is the trade-off between accuracy and speed. Meltwater performs sentiment analysis on approximately 450 million documents daily, ranging from tweets (averaging around 30 characters) to news and blog posts (up to 600,000–700,000 characters). Each document must be processed within 20 milliseconds. Speed is critical!
Traditional machine learning methods like naïve Bayes, logistic regression, and support vector machines (SVM) are widely used for large-scale sentiment analysis due to their scalability. Deep learning (DL) methods have been shown to achieve higher accuracy in various NLP tasks, including sentiment analysis, but they are typically slower and more expensive to train and operate [2].
To date, Meltwater has been using a multinomial naïve Bayes sentiment classifier. The classifier takes a piece of text and converts it into a vector of feature values (f1, f2,…, fn).
The classifier then calculates the most likely sentiment polarity Sj—positive, negative, or neutral—given the observed feature values in the text. This is typically expressed as a conditional probability statement:
p(Sj | f1, f2,…, fn)
The sentiment polarity with the highest probability is obtained by applying Bayes' theorem together with the "naïve" assumption that features are conditionally independent given the polarity, and finding the Sj that maximizes:
log(p(Sj)) + Σi log(p(fi | Sj))
Let’s apply this theory to our sentiment problem. The value p(Sj) is the probability of finding a document with a specific polarity "by default." These probabilities can be estimated by labeling a large corpus of documents as positive, negative, or neutral and then calculating the probability of finding a document with a given sentiment polarity. Ideally, we should use all documents ever created, but this is impractical.
For example, given a small labeled corpus (say, a document D1 = "My phone is not bad" marked positive, alongside a handful of negative and neutral documents), p(Sj) is simply the fraction of documents carrying each label.
We use a simple bag-of-words model to derive our features, including unigrams, bigrams, and trigrams. For example, D1 is converted to:
(My, phone, is, not, bad, My phone, phone is, is not, not bad, My phone is, phone is not, is not bad)
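As a concrete illustration, here is a minimal Python sketch of this unigram/bigram/trigram extraction; the function name and the simple whitespace tokenization are choices made for the example, not Meltwater's production code.

```python
def ngram_features(text, max_n=3):
    """Return all unigrams, bigrams, and trigrams of a whitespace-tokenized text."""
    tokens = text.split()
    features = []
    for n in range(1, max_n + 1):              # 1 = unigrams, 2 = bigrams, 3 = trigrams
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

print(ngram_features("My phone is not bad"))
# ['My', 'phone', 'is', 'not', 'bad', 'My phone', 'phone is', 'is not',
#  'not bad', 'My phone is', 'phone is not', 'is not bad']
```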
The value p(fi | Sj) is the probability of seeing a feature in documents labeled as Sj in the corpus.
We can calculate this using Kolmogorov's definition of conditional probability: p(fi | Sj) = p(fi ∩ Sj) / p(Sj). In practice, for a feature such as "bad," this is the fraction of documents labeled Sj that contain "bad."
Given a document (e.g., "My tablet is good"), the classifier calculates a "score" for each sentiment polarity based on the text’s features. For "positive," we get:
log(p(POS | my, tablet, is, good, my tablet, tablet is, is good, my tablet is, tablet is good))
Which is:
log(p(POS)) + log(p(my | POS)) + … + log(p(tablet is good | POS)) = −13.6949
Repeating the same computation for "negative" and "neutral" yields lower scores, so the classifier concludes that "positive" is the most likely sentiment polarity.
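Putting the pieces together, a toy version of the whole scorer could look like the sketch below. The three-document corpus, the unigram-only features, and the add-one smoothing are simplifications made for the example; the production classifier also uses bigrams and trigrams.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus, for illustration only.
corpus = [("My phone is not bad", "positive"),
          ("The battery is bad", "negative"),
          ("Acme releases revenue data tomorrow", "neutral")]

def features(text):
    return text.lower().split()                 # unigrams only, for brevity

priors = Counter(label for _, label in corpus)  # document counts per polarity
counts = defaultdict(Counter)                   # polarity -> feature counts
for text, label in corpus:
    counts[label].update(features(text))
vocab = {f for c in counts.values() for f in c}

def score(text, label, alpha=1.0):
    """log(p(Sj)) + sum_i log(p(fi | Sj)), with add-one smoothing to avoid log(0)."""
    total = sum(counts[label].values())
    s = math.log(priors[label] / sum(priors.values()))
    for f in features(text):
        s += math.log((counts[label][f] + alpha) / (total + alpha * len(vocab)))
    return s

text = "My tablet is good"
print({label: round(score(text, label), 4) for label in priors})
print(max(priors, key=lambda label: score(text, label)))   # -> 'positive'
```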
The naïve Bayes classifier is fast because the required computations are simple, but its accuracy is limited [3]: it struggles with exactly the nuances discussed earlier, such as negation, sarcasm, and context-dependent sentiment.
Meltwater’s NLP team was tasked with improving sentiment analysis for all supported languages. Since training new models is complex and expensive, the team first explored quick ways to enhance sentiment analysis using our existing tech stack.
The first change we made was to how we train the Bayes model. Instead of training and classifying at the document level, we now do so at the sentence level, which gives the classifier shorter, more homogeneous pieces of text to work with. We then aggregate the sentence-level results into a document-level sentiment by stacking classifiers that "pick" meaningful sentences and produce an overall sentiment for the entire document (a simplified toy version of such an aggregation is sketched below).
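The exact stacking logic is beyond the scope of this post, but the toy rule below (ignore neutral sentences, then take a majority vote among the opinionated ones) conveys the general idea; it is a simplified stand-in, not our actual stacked classifier.

```python
from collections import Counter

def aggregate_document_sentiment(sentence_polarities):
    """Toy aggregation: drop neutral sentences and take the majority polarity
    among the remaining ones; if every sentence is neutral, so is the document."""
    opinionated = [p for p in sentence_polarities if p != "neutral"]
    if not opinionated:
        return "neutral"
    return Counter(opinionated).most_common(1)[0][0]

print(aggregate_document_sentiment(["neutral", "positive", "neutral", "positive", "negative"]))
# 'positive'
```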
Figure 3: Recorded sentiment attribute overrides in Q2 2018 (left) and Q2 2019 (right)—all languages.
These simple changes have had a significant impact, reducing the number of sentiment attribute overrides made by customers each month. Specifically, across the 16 supported languages, the average reduction in sentiment attribute overrides for news documents was 58%.
The analysis involved approximately 450 million news documents and 4.2 million overrides generated by 7,193 customers. Figure 3 shows a comparison between the number of overrides performed in Q2 2018 (document-level prediction) and Q2 2019 (sentence-level prediction + aggregation).
Meanwhile, Meltwater's NLP team has been working to improve our technology stack to analyze sentiment in two major languages (English and Chinese), covering about 40% of the daily content processed by Meltwater.
We experimented with various techniques, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM), aiming to find a good compromise between accuracy, speed, and cost.
Due to a favorable trade-off between accuracy, speed, and operational costs, we decided to adopt a CNN-based solution. CNNs are primarily used in computer vision but have proven to perform exceptionally well in NLP. Our solution is implemented in Python using TensorFlow, NumPy (with MKL optimization), GenSim, and EKPhrasis to support hashtags, emojis, and emoticons.
A simplified architecture is shown in Figure 4. It includes an embedding (input) layer, followed by a single convolutional layer, then a max-pooling layer and a softmax layer [5].
Figure 4: Simplified Model Architecture (Source: Zhang, Y. and Wallace, B. (2015). A Sensitivity Analysis of Convolutional Neural Networks for Sentence Classification (and a Practitioner's Guide))
Our input is the text to be classified. As with the Bayesian approach, we need to represent the text based on its features. We embed the text as a matrix.
For example, the text "I like this movie very much!" is represented as a matrix with 7 rows, one for each token (the exclamation mark counts as a token). The number of columns depends on the features we want to represent. Unlike the Bayesian case, we no longer design features ourselves. Instead, we now use pre-trained third-party word embeddings.
Word embeddings capture semantic similarity at scale. These embeddings are publicly available and were generated by neural networks trained by third-party machine learning experts. For English, we use Stanford's GloVe embeddings, trained on 840 billion tokens from Common Crawl, with 300-dimensional vectors. We also experimented with BERT and ELMo, but the accuracy/cost trade-off still favored GloVe.
For Chinese, we use embeddings from Tencent AI, trained on over 8 million phrases, with 200-dimensional vectors. The vectors are fine-tuned using transfer learning on our own training dataset so that the embeddings reflect Meltwater's PR/marketing requirements.
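For illustration, turning a sentence into its input matrix from pre-trained embeddings could look like the sketch below; the `glove` lookup dictionary and the zero-vector fallback for out-of-vocabulary tokens are simplifying assumptions for the example.

```python
import numpy as np

def embed_sentence(tokens, glove, dim=300):
    """Stack one pre-trained embedding vector per token into the CNN input matrix."""
    rows = [glove.get(token.lower(), np.zeros(dim)) for token in tokens]  # OOV -> zero vector
    return np.stack(rows)                                                 # shape: (len(tokens), dim)

tokens = ["I", "like", "this", "movie", "very", "much", "!"]
glove = {t.lower(): np.random.rand(300) for t in tokens}   # stand-in for the real GloVe vectors
print(embed_sentence(tokens, glove).shape)                  # (7, 300)
```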
The core of the CNN is the convolutional layer, where artificial neurons are trained to extract salient features from the embeddings. In our case, the convolutional layer consists of 100 neurons for English and 50 for Chinese.
The advantage is that we no longer need to design features; the network learns the features we need. The drawback is that we may no longer know what these features are.
The idea of pooling is to capture the most important local features in the feature map to reduce dimensionality and speed up the network.
The pooled vectors are combined into a single vector and passed to a fully connected softmax layer, which performs the actual polarity classification.
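A minimal TensorFlow/Keras sketch of this pipeline (embedding, convolution, max pooling, softmax) is shown below. The vocabulary size, sequence length, and single filter width are placeholder assumptions; the real models load the pre-trained GloVe or Tencent vectors into the embedding layer and use 100 filters for English and 50 for Chinese.

```python
import tensorflow as tf

MAX_LEN, VOCAB, EMB_DIM, NUM_FILTERS = 100, 50_000, 300, 100   # placeholder hyperparameters

inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB, EMB_DIM, trainable=False)(inputs)        # pre-trained vectors go here
x = tf.keras.layers.Conv1D(NUM_FILTERS, kernel_size=3, activation="relu")(x)  # convolutional layer
x = tf.keras.layers.GlobalMaxPooling1D()(x)                                   # max pooling
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)                   # positive / negative / neutral

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```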
For English, in addition to GloVe embeddings, we have 23,000 internally labeled news sentences and 60,000 social sentences, including the Twitter dataset provided by SemEval-2017 Task 4. For Chinese, in addition to Tencent AI embeddings, the dataset contains about 38,000 sentences from news, social media, and comments.
The dataset was annotated via crowdsourcing using Amazon's SageMaker Ground Truth. Before training, the dataset was stratified and randomized using the 80/20 rule—80% for training and 20% for validation.
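Such a stratified 80/20 split is easy to reproduce with scikit-learn; the toy data below simply stands in for the annotated sentences.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the annotated sentences and their labels.
sentences = ["great phone"] * 10 + ["terrible service"] * 10 + ["earnings due tomorrow"] * 10
labels = ["positive"] * 10 + ["negative"] * 10 + ["neutral"] * 10

train_x, val_x, train_y, val_y = train_test_split(
    sentences, labels,
    test_size=0.20,       # 20% held out for validation
    stratify=labels,      # keep the polarity distribution identical in both splits
    random_state=42,      # reproducible shuffle
)
print(len(train_x), len(val_x))   # 24 6
```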
Compared to the Bayesian approach, this simple architecture has already enabled the model to achieve significantly better performance at the sentence level (Table 1). Gains were 7% for English social text, 18% for Chinese (social and news combined), and 26% for English news.
After aggregation at the document level, compared to the Bayesian approach, we found a further reduction of 48.06% in document-level sentiment attribute overrides for English and 29.24% for Chinese.
Table 1: Sentiment Accuracy of CNN vs. Naive Bayes (English and Chinese).
How accurate is sentiment analysis? The F1 score (the harmonic mean of precision and recall) essentially measures how well the model's output agrees with human annotations. Research shows that human annotators themselves agree with each other only about 80% of the time.
In other words, even assuming a 100% accurate model, humans would still disagree with the model 20% of the time [6]. In practice, this means our CNN model performs almost as well as humans when classifying individual sentences.
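For reference, the metric itself is straightforward to compute; the labels below are invented purely to show the calculation and are not drawn from our evaluation data.

```python
from sklearn.metrics import f1_score

human = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
model = ["positive", "negative", "neutral", "positive", "positive", "neutral"]

# Macro-averaged F1: precision/recall harmonic mean, averaged over the three polarities.
print(round(f1_score(human, model, average="macro"), 3))
```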
Until now, sentiment override results were never fed back into the sentiment model. The NLP team has now designed a feedback loop to collect cases where customers disagree with the CNN classifier, allowing us to improve the model over time.
The overridden documents are then sent to Fairhair.ai Studio (Figure 5), where annotators relabel them at each level (entity, sentence, section (e.g., headline, lead, body), and document).
Figure 5: Fairhair.ai Studio: Meltwater's Annotation Tool
Each document is annotated multiple times by different annotators and reviewed by a senior annotator. End customers sometimes participate in this process. When our annotators are not proficient in a specific language, the labeling is outsourced to a third-party crowdsourcing tool.
Meltwater is a heavy user of Amazon SageMaker Ground Truth (Figure 6). When using crowdsourcing, we increase the number of annotators required, as they may not be as accurate as our internally trained annotators.
Figure 6: AWS SageMaker GT Helps Meltwater Label 2,690 Chinese Documents 5 Times
After annotation, new data points are reviewed by our research scientists. The review ensures that overrides do not deliberately skew our model and do not simply encode the bias of a specific client, which would call for a dedicated model instead.
If the data is correct, it is added to the test set—we do not want to overfit the model by adding it to the training set. The model needs to generalize the correct answer from other data points.
We collect data of similar nature, carrying the necessary knowledge to correctly classify overridden documents. For example, if misclassifications occur in documents discussing financial products, we collect financial documents from the Elasticsearch cluster.
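As an illustration of that collection step, a query along the lines of the sketch below could pull candidate financial documents from the cluster, assuming the official elasticsearch Python client; the endpoint, index name, and field name are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")    # placeholder endpoint

# Hypothetical query: fetch documents discussing financial products so they
# can be annotated and used to strengthen the model on that domain.
response = es.search(
    index="news",                                           # hypothetical index name
    query={"match": {"content": "financial products"}},     # hypothetical field name
    size=100,
)
candidates = [hit["_source"] for hit in response["hits"]["hits"]]
```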
We changed how the Bayesian sentiment model is trained and applied across all languages, reducing document-level sentiment attribute overrides for news documents by an average of 58%.
We now support sentence-level and entity-level sentiment for all 16 languages. For us, an entity can be a named entity like "Ford" or a key phrase like "customer service."
We deployed deep learning sentiment models for English and Chinese. Their sentence-level accuracy is 83% for English and 76% for Chinese. They further reduced document-level override rates for news documents by 48.06% (English) and 29% (Chinese).
The new models account for hashtags (e.g., #love), emojis, and emoticons.
We have a feedback loop to continuously improve our sentiment models.
[1] Bing Liu. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, 2015.
[2] Daniel Justus, John Brennan, Stephen Bonner, Andrew Stephen McGough. Predicting the Computational Cost of Deep Learning Models. IEEE Intl Conf. on Big Data. 2018.
[3] Irina Rish. An empirical study of the naive Bayes classifier. IJCAI Work. on Empirical Methods in AI. 2001.
[4] Alexandru Niculescu-Mizil, Rich Caruana. Predicting good probabilities with supervised learning. International Conference on Machine Learning. 2005.
[5] Ye Zhang, Byron Wallace. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. International Joint Conference on Natural Language Processing. 2015.
[6] Kevin Roebuck. Sentiment Analysis: High-impact Strategies – What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. 2012.