How Sprout Built a Hybrid Model

As anyone who’s ever been in a relationship will tell you, human emotions are complicated. This is especially true for marketers trying to understand the qualitative benefits of their product or service: the value that goes beyond basic functionality. It’s not difficult to understand what your product does, but do you know how it makes your consumers feel?

You would if you used social listening sentiment analysis to distill your target audience’s unfiltered social media musings into actionable strategic insights. Taking all of the social data available across Twitter and classifying it as positive, negative or neutral is a major undertaking, and no two methods are created equal. That’s why Sprout Social built a hybrid sentiment analysis system that combines the two primary approaches: Rule Lists and Machine Learning.

Rule Lists

One of the simplest ways to tackle sentiment analysis is by using human-created rules or dictionaries. With this approach, the system relies on a list of words or phrases that directly map to a specific sentiment. For example, any Tweet that contains the phrase “high five” might be labeled as positive, while a Tweet containing “horrible” would be negative. Systems like this are highly customizable, and can be extended to include thousands of word and phrase rules.

On the downside, rule systems struggle with Tweets that match conflicting rules, such as “The movie wasn’t as horrible as I anticipated.” Here, “horrible” might be labeled negative, while “anticipated” would be positive. The conflicting rules label the Tweet as neutral, while some human readers would interpret it as slightly positive and others, slightly negative.
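To make the conflict concrete, here is a minimal sketch of a rule list in Python. The `RULES` table and the `rule_sentiment` function are hypothetical names made up for illustration; only the three example terms come from the text above, and a real rule list would be far larger.

```python
import re

# Hypothetical rule list; the three terms are the examples from the text.
RULES = {
    "high five": "positive",
    "horrible": "negative",
    "anticipated": "positive",
}

def rule_sentiment(text):
    """Label text from phrase rules; no match or conflicting matches give neutral."""
    lowered = text.lower()
    matched = set()
    for phrase, label in RULES.items():
        # Word boundaries stop a rule from firing inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            matched.add(label)
    if matched == {"positive"}:
        return "positive"
    if matched == {"negative"}:
        return "negative"
    return "neutral"

print(rule_sentiment("The movie wasn't as horrible as I anticipated."))
```

Because the example sentence triggers both a negative and a positive rule, this sketch falls back to neutral, which is the behavior described above.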

An additional limitation of rule-based systems is the reliance on human effort and understanding. Language evolves rapidly (especially on Twitter), and a rule-based system requires someone to provide a steady stream of new terms and phrases. Updating a sentiment system isn’t always a top priority, so a system can quickly grow outdated. Even with vigilant monitoring, it can be difficult to identify changing language trends and determine when new rules need to be added.

Machine Learning

More advanced sentiment analysis systems use Machine Learning (ML) techniques, which are often discussed under the labels Artificial Intelligence or Natural Language Processing. Machine Learning is a family of techniques that use statistics and probability to identify complex patterns that can be used to label items.

Unlike rule-based systems, ML systems are flexible enough to detect similarities that aren’t immediately apparent to a human. By looking at many, many examples, the system learns patterns that are typically associated with positive, negative, or neutral sentiments.

For example, an ML sentiment analysis system might find that Tweets that contain the word “rain” and end with one exclamation point are negative, while Tweets with “rain” and two exclamation points are positive. A human might not notice this pattern or understand why it occurs, but an ML system can use it to make very accurate predictions.
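The “rain” example can be reproduced with a very small classifier. The sketch below is a toy Naive Bayes model written in plain Python; the class name, the feature names and the six training Tweets are all invented for illustration, and a production system would use far more data and a more capable model.

```python
import math
import re
from collections import Counter, defaultdict

def features(text):
    """Lowercased word tokens plus one feature for trailing exclamation points."""
    tokens = re.findall(r"[a-z']+", text.lower())
    trailing = len(text) - len(text.rstrip("!"))
    return tokens + ["__excl_%d__" % min(trailing, 2)]

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features(text):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training Tweets that mimic the pattern described in the text.
train = [
    ("rain again!", "negative"),
    ("stuck in the rain!", "negative"),
    ("more rain today!", "negative"),
    ("finally some rain!!", "positive"),
    ("love this rain!!", "positive"),
    ("rain for the garden!!", "positive"),
]
model = TinyNaiveBayes().fit([t for t, _ in train], [l for _, l in train])
```

With this toy data, `model.predict("rain tomorrow!")` comes out negative and `model.predict("rain tomorrow!!")` comes out positive: the model has picked up the punctuation pattern without anyone writing a rule for it.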

While Machine Learning systems can produce great results, they do have a few shortcomings. When there’s a lot of variety in the language, it can be tough for an ML system to sift through the noise to pick out patterns. When strong patterns do exist, they can overshadow less common patterns, and cause the ML system to ignore subtle cues.

Sprout’s Approach

To build our sentiment analysis system, we designed a hybrid system that combines the best of both rule-based and machine learning approaches. We analyzed tens of thousands of Tweets to identify places where ML models struggle, and introduced rule-based strategies to help overcome those shortcomings.

By supplementing statistical models with human understanding, we’ve built a robust system that performs well in a wide variety of settings.
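One possible way to combine the two approaches is to let rules step in only when the statistical model is unsure. The sketch below shows that idea under stated assumptions: it is not Sprout’s actual implementation, and `hybrid_sentiment`, `toy_model`, `toy_rule` and the 0.6 confidence threshold are hypothetical.

```python
def hybrid_sentiment(text, model_scores, rule_override, min_confidence=0.6):
    """Use the model's top label unless it is unsure and a rule applies.

    model_scores: function mapping text to {"positive": p, "negative": p, "neutral": p}
    rule_override: function mapping text to a label, or None when no rule fires
    """
    scores = model_scores(text)
    label = max(scores, key=scores.get)
    if scores[label] >= min_confidence:
        return label  # the model is confident, so trust it
    override = rule_override(text)
    return override if override is not None else label

# Stand-ins for a trained model and a rule list (assumptions for the demo).
def toy_model(text):
    if "love" in text.lower():
        return {"positive": 0.9, "negative": 0.05, "neutral": 0.05}
    return {"positive": 0.4, "negative": 0.35, "neutral": 0.25}

def toy_rule(text):
    return "negative" if "horrible" in text.lower() else None

print(hybrid_sentiment("Horrible service today", toy_model, toy_rule))
```

Here the stand-in model is unsure about the example, so the rule decides; when the model is confident, its label is kept even if a rule would have fired.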

All About Accuracy

On the surface, sentiment analysis seems pretty straightforward—just decide whether a Tweet is positive, negative or neutral. Human language and emotions are complicated, though, and detecting sentiment within a Tweet reflects this complexity.

Consider these Tweets. Are they positive, negative, or neutral?

even bad espresso is good
— alex (@alex) October 9, 2017

Dude just asked for 6 shots of espresso at Starbucks … SIX. Freaking SIX!! 😐
— Simone Eli (@SimoneEli_TV) October 31, 2017

You might feel confident in your answers, but chances are good that not everyone would agree with you. Research has shown that people only agree on the sentiment of Tweets 60-80% of the time.

You might be skeptical. We were, too.

To test it out, two members of our Data Science team labeled the exact same set of 1,000 Tweets as positive, negative or neutral. We figured “we work with Tweets every day; we’ll probably have near-perfect agreement between the two of us.”

We calculated the results and then double- and triple-checked them. The research was spot-on: we only agreed on 73% of the Tweets.
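Agreement of this kind is simply the share of Tweets that received the same label from both people, so 73% of 1,000 Tweets is roughly 730 matching labels. A minimal sketch of the calculation follows; the function name and the five example labels are made up and are not the data from our experiment.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Made-up labels for five Tweets.
annotator_1 = ["positive", "neutral", "negative", "neutral", "positive"]
annotator_2 = ["positive", "neutral", "neutral", "neutral", "negative"]
agreement = percent_agreement(annotator_1, annotator_2)  # 3 of 5 labels match
```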

Challenges in Sentiment Analysis

Research (along with our little experiment) shows that sentiment analysis isn’t straightforward. Why is it so tricky? Let’s walk through a few of the biggest challenges.

Context

Tweets are a tiny snapshot in time. While some stand alone, Tweets are often part of an ongoing conversation or reference information that only makes sense if you know the author. Without those clues, it can be hard to interpret an author’s feelings.

😂 I do this with spoons for coffee too.
— Renée Barrow (@RmBarrow) October 14, 2017

Sarcasm

Sarcasm detection is another flavor of the context challenge. Without additional information, sentiment analysis systems often confuse the literal meaning of words with how they’re intended. Sarcasm is an active area of academic research, so we may see systems in the near future that understand snark.

Comparisons

Sentiment also gets tricky when Tweets make comparisons. If I’m conducting market research on vegetables and someone Tweets, “Carrots are better than squash,” is this Tweet positive or negative? It depends on your perspective. Similarly, someone might tweet, “Company A is better than Company B.” If I work for Company A, this Tweet is positive, but if I’m with Company B, it’s negative.

Emojis

Emojis are a language all their own. While emojis like 😡 express a pretty obvious sentiment, others are less universal. While building our sentiment analysis system, we looked closely at how people use emojis, finding that even common emojis can cause confusion. 😭 is used almost equally to mean “so happy I’m crying” and “so sad I’m crying.” If humans can’t agree on the meaning of an emoji, neither can a sentiment analysis system.

Defining Neutral

Even “neutral” sentiment isn’t always straightforward. Consider a news headline about a tragic event. While we’d all agree that the event is terrible, most news headlines are meant to be factual, informative statements. Sentiment analysis systems are designed to identify the emotion of the content’s author, not the reader’s response. While it may seem strange to see terrible news labeled “neutral,” it reflects the author’s intent of communicating factual information.

Sentiment analysis systems also vary in how neutral is defined. Some consider neutral to be a catch-all category for any Tweet where the system can’t decide between positive and negative. In those systems, “neutral” is synonymous with “I’m not sure.” In reality, though, there are many Tweets that don’t express emotion, such as the example below.

A ‘Venti’ typically has two shots of espresso, but this customer asked for 14 https://t.co/jzOi93RRd9
— TAXI (@designtaxi) October 30, 2017

Our system explicitly classifies non-emotional Tweets as neutral, rather than using neutral as a default label for ambiguous Tweets.
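The difference between the two definitions of neutral can be sketched in a few lines. Both functions below are hypothetical illustrations with made-up scores and an arbitrary 0.6 threshold; they are not the scoring logic of any particular product.

```python
def neutral_as_fallback(pos_score, neg_score, threshold=0.6):
    """Neutral means "not sure": it is returned when neither side is convincing."""
    if pos_score >= threshold and pos_score > neg_score:
        return "positive"
    if neg_score >= threshold and neg_score > pos_score:
        return "negative"
    return "neutral"

def neutral_as_class(scores):
    """Neutral is its own category and has to win on its own score."""
    return max(scores, key=scores.get)

# An ambiguous Tweet: mild evidence both ways, little evidence of neutrality.
fallback_label = neutral_as_fallback(pos_score=0.45, neg_score=0.40)
explicit_label = neutral_as_class({"positive": 0.45, "negative": 0.40, "neutral": 0.15})
```

With these made-up numbers, the fallback design reports neutral while the explicit design reports positive, because the neutral category has to earn its label rather than collect the leftovers.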

Evaluating Sentiment Analysis

With so many challenges in sentiment analysis, it pays to do your homework before investing in a new tool. Vendors try to help cut through the complexities by focusing on statistics about the accuracy of their product. Accuracy isn’t always an apples-to-apples comparison, though. If you plan to use accuracy as a measuring stick, here are a few things you should ask.

Is the reported accuracy greater than 80%?
Since humans only agree with each other 60-80% of the time, there’s no way to create a test dataset that everyone will agree contains the “correct” sentiment labels. When it comes to sentiment, “correct” is subjective. In other words, there isn’t a gold standard to use in testing accuracy.

The upper limit of a sentiment analysis system’s accuracy will always be human-level agreement: about 80%. If a vendor claims more than 80% accuracy, it’s a good idea to be skeptical. Current research suggests that even 80% accuracy is unlikely; top experts in the field typically achieve accuracies in the mid-to-upper 60s.

How many sentiment categories are being predicted?
Some vendors evaluate accuracy only on Tweets that have been identified by human evaluators as definitively positive or negative, excluding all neutral Tweets. It’s much easier for a system’s accuracy to appear very high when working with strongly emotional Tweets and only two possible outcomes (positive or negative).

In the wild, however, most Tweets are neutral or ambiguous. When a system is evaluated against only positive and negative, it’s impossible to know how well the system copes with neutral Tweets—the majority of what you’ll actually see.
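A short sketch shows how much the choice of test Tweets can move the number. The labels below are invented for illustration, and `accuracy` is a hypothetical helper rather than any vendor’s evaluation code.

```python
def accuracy(gold, predicted, include_neutral=True):
    """Fraction of correct predictions, optionally skipping gold-neutral Tweets."""
    pairs = [
        (g, p) for g, p in zip(gold, predicted)
        if include_neutral or g != "neutral"
    ]
    if not pairs:
        raise ValueError("no Tweets left to evaluate")
    return sum(g == p for g, p in pairs) / len(pairs)

# Invented human labels and system predictions for eight Tweets.
gold = ["positive", "negative", "neutral", "neutral",
        "neutral", "positive", "neutral", "negative"]
pred = ["positive", "negative", "positive", "neutral",
        "negative", "positive", "neutral", "negative"]

three_class = accuracy(gold, pred)                         # all eight Tweets
two_class = accuracy(gold, pred, include_neutral=False)    # neutral Tweets dropped
```

In this toy example the same predictions score 75% on all eight Tweets and 100% once the neutral Tweets are removed, even though the system mislabeled half of the neutral ones.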

What types of Tweets are included in their test set?
A sentiment analysis system should be built and tested on Tweets that are representative of real-world conditions. Some sentiment analysis systems are created using domain-specific Tweets that have been filtered and cleaned to make them as easy as possible for a system to understand.

For example, a vendor may have found a pre-existing dataset that only includes strongly emotional Tweets about the airline industry, with any spam or off-topic Tweets excluded. This would cause accuracy to be high, but only when used on very similar Tweets. If you’re working in a different domain, or receive any off-topic or spam Tweets, you’ll see much lower accuracy.

How large was the test dataset?
Sentiment analysis systems should be evaluated on several thousand Tweets to measure the system’s performance in many different scenarios. You won’t get a true measure of a system’s accuracy when a system is only tested on a few hundred Tweets.
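One way to see why test set size matters is to estimate the margin of error around a measured accuracy. The sketch below uses the standard normal approximation for a proportion; the 70% accuracy figure and the two sample sizes are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

def accuracy_margin(accuracy, n_tweets, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_tweets."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_tweets)

small = accuracy_margin(0.70, 300)    # a few hundred test Tweets
large = accuracy_margin(0.70, 5000)   # several thousand test Tweets
print(round(small, 3), round(large, 3))
```

Under these assumptions, a 70% accuracy measured on 300 Tweets is uncertain by about five percentage points in either direction, while the same figure measured on 5,000 Tweets is uncertain by a little over one point.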

Here at Sprout, we built our model on a collection of 50,000 Tweets sampled at random from Twitter. Because our Tweets are not domain-specific, our sentiment analysis system performs well on a wide range of domains.

Additionally, we make separate predictions for positive, negative and neutral categories; we don’t just apply neutral when other predictions fail. Our accuracy was tested on 10,000 Tweets, none of which were used to build the system.

See Sprout’s Sentiment Analysis Live With Listeners

All the research in the world is no substitute for evaluating a system first-hand. Give our new sentiment analysis system a test drive within our newest social listening toolset, Listeners, and see how it works for you. Ultimately, the best social listening tool is the one that meets your needs and helps you get greater value from social. Let us help you get started today.
