Automation Key to High-Volume Analysis

This is a guest post contributed by Lexalytics.

Machine-based sentiment analysis is key to processing large amounts of data
by Mike Marshall

There has been an ongoing debate in the text analytics world about the accuracy of automated vs. human sentiment analysis. At the very least, customers and vendors agree on the value of sentiment analysis applied to unstructured text, but how and when to use automated vs. human methods remains up for discussion.

Language is a complex thing; we would never argue otherwise. Sarcasm, irony and sometimes even simple misspellings can confuse even the most sophisticated automated sentiment systems, leading them to assign the wrong sentiment to a piece of text. This is the meat and drink of those who believe all sentiment analysis should be done by humans, and who will gleefully point to any document where the machine got the sentiment wrong. What they are less likely to point out is that humans get it wrong as well and, unlike an automated system, are affected by external factors, such as going out for a few drinks the night before or having a disagreement with a co-worker. In fact, anecdotal evidence suggests that two human analysts will agree with each other only about 80 percent of the time, so at least one of them must be wrong, and this is very close to the accuracy that the best machine systems achieve when applied to a large corpus of documents.

The fact that a machine-based sentiment system can process a large document corpus relatively quickly (three or more documents per second per processor) is a big part of its appeal. It's also important to realize that for many applications, including online search-based web sites, it's the overall accuracy of a sentiment engine across hundreds or thousands of documents that really matters. Another example can be found in financial services, where sentiment for an individual stock is measured to help predict that stock's trading range; the sentiment of an individual story is unimportant, and how that stock is trending across all the news is what actually matters. In these cases, a machine-based system is better than humans because it can scale up to very large volumes of information.
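The point about corpus-level accuracy can be sketched in a few lines: per-story scores from an engine are noisy, but averaging across many stories recovers the trend. The scores below are hypothetical engine outputs on a [-1, 1] scale, not real data:

```python
# Hypothetical per-story sentiment scores for one stock ticker, as an
# engine might emit them: individually noisy, in the range [-1, 1].
story_scores = [0.4, -0.1, 0.6, 0.2, -0.3, 0.5, 0.1, 0.3]

# Corpus-level aggregation: the mean across all stories is the signal
# that matters for trend prediction, not any single story's score.
trend = sum(story_scores) / len(story_scores)
print(f"Aggregate sentiment: {trend:+.2f}")  # Aggregate sentiment: +0.21
```

A real pipeline would weight by recency or source and use a rolling window, but the principle is the same: individual errors wash out at volume, which is exactly where machine throughput pays off.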

So, what's the moral of the story? As in most things, it's somewhere in the middle: the future is probably a combination of human- and machine-based sentiment. Humans bring something very important to the table that machines are never going to have, and that is domain-specific knowledge and experience. For instance, in the world of pharmaceuticals, the sentence "the drug killed the tumor dead" is actually a pretty positive thing. A system that lets human analysts import that knowledge into the machine, and then lets the machine take over the drudgery of actually processing the documents, is surely the best of both worlds.
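The pharmaceutical example above can be sketched as a tiny lexicon-based scorer with analyst-supplied overrides. Everything here is hypothetical (the lexicon values, the override table, the `score` function are invented for illustration, not Lexalytics' actual method):

```python
# A toy lexicon-based sentiment scorer. By default, "killed" and "dead"
# read as negative; a human analyst injects the domain knowledge that
# "killed the tumor" is positive in a pharma context. All values invented.
BASE_LEXICON = {"killed": -0.8, "dead": -0.6, "effective": 0.7}
PHARMA_OVERRIDES = {"killed the tumor": 0.9}

def score(text, lexicon, overrides):
    text = text.lower()
    total = 0.0
    # Analyst-supplied phrase overrides take precedence: score the phrase,
    # then remove it so its words aren't double-counted below.
    for phrase, value in overrides.items():
        if phrase in text:
            total += value
            text = text.replace(phrase, "")
    # Fall back to word-level lexicon lookup for the remaining text.
    for word in text.split():
        total += lexicon.get(word, 0.0)
    return total

sentence = "The drug killed the tumor dead"
print(f"{score(sentence, BASE_LEXICON, {}):+.2f}")               # -1.40 (machine alone)
print(f"{score(sentence, BASE_LEXICON, PHARMA_OVERRIDES):+.2f}") # +0.30 (with human knowledge)
```

The division of labor is the point: the human contributes a handful of domain rules once, and the machine applies them at three-plus documents per second.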

Mike Marshall is CTO of Lexalytics.
