How Sentiment Analysis is Used to Analyze Movie Reviews

Sentiment analysis is one of the most widely used applications of natural language processing and text analytics, with a variety of sites, publications, and courses dedicated to the subject. It works best on subjective material in which people express their thoughts, feelings, and moods. In the real world, sentiment analysis is commonly used to assess corporate surveys, feedback data, social media posts, and reviews of movies, places, and so on.

Terminologies

1. Text Corpus

A text corpus consists of several text documents, each of which can be as simple as a single sentence or as complex as a multi-paragraph document.

2. Sentiment Analysis

Sentiment analysis is also known as opinion analysis or opinion mining. The basic idea is to use techniques from text analytics, NLP (Natural Language Processing), machine learning, and linguistics to extract the necessary information from unstructured text.

3. Sentiment Polarity

Sentiment polarity is the numeric score assigned to the positive or negative character of a text document, based on subjective factors such as particular words and phrases that express feelings and emotions. Neutral sentiment generally has a polarity of 0, since it does not express a particular sentiment; positive sentiment has a polarity greater than 0 and negative sentiment a polarity less than 0.
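
For a quick illustration of this sign convention, the snippet below uses the TextBlob library; it is not part of the pipeline built in this article and is shown purely as an example.

# Illustrative only: TextBlob assigns a polarity score in [-1, 1] to a piece of text.
from textblob import TextBlob

print(TextBlob("This movie was absolutely wonderful").sentiment.polarity)            # expected > 0 (positive)
print(TextBlob("The plot was dull and the acting was terrible").sentiment.polarity)  # expected < 0 (negative)
print(TextBlob("The film was released in 2019").sentiment.polarity)                  # expected ~ 0 (neutral)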

Here, we will walk through the step-by-step process of building a model that performs sentiment analysis on a large movie review dataset. The data was gathered from the Internet Movie Database (IMDb).

We concentrate on analyzing a large corpus of movie reviews and extracting their sentiment. We will use both an unsupervised lexicon-based model and a typical supervised machine learning model to conduct our analysis.

The main goal is to predict the sentiment expressed in movie reviews from the Internet Movie Database (IMDb).

Dataset

This data consists of 50,000 movie reviews that were pre-labeled with “positive” and “negative” sentiment classes based on the content of the reviews. Our task will be to predict the sentiment of 15,000 labeled movie reviews and to use the other 35,000 reviews to train our supervised models.

Setting up Dependencies

We will use various Python tools and frameworks dedicated to text analytics, Natural Language Processing, and machine learning. We must first install Pandas, NumPy, SciPy, scikit-learn, seaborn, matplotlib, and NLP libraries before initiating the project.

After installing nltk with pip or conda, run the following code from a Python or IPython terminal.


import nltk
nltk.download('all', halt_on_error=False)

To install spaCy and its English model dependencies, run the following commands as administrator in a Unix shell or Windows command line.


$ conda config --add channels conda-forge
$ conda install spacy
$ python -m spacy download en_core_web_sm
Text Pre-Processing and Normalization

An important step before moving on to feature engineering and modeling is cleaning, pre-processing, and normalizing the text to bring textual elements such as phrases and words to a standard format.

This enables standardization across the document corpus, which aids the development of meaningful features and reduces the noise introduced by elements such as irrelevant symbols, special characters, XML and HTML tags, and so on. The text_normalizer.py utility performs the following tasks.

Important Libraries

import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata

# load the spaCy English model (NER is not needed for normalization)
nlp = spacy.load('en_core_web_sm', disable=['ner'])
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
# keep negations, since they carry sentiment
stopword_list.remove('no')
stopword_list.remove('not')
Cleaning Text

Unnecessary material, such as HTML tags, frequently appears in our text and adds little value when analyzing sentiment. As a result, we must ensure it is removed before extracting features. This is done using the strip_html_tags(…) function.


def strip_html_tags(text):
    # parse the document and return only the visible text
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
Deleting Accented Characters

Since our dataset consists of English-language reviews, we need to make sure that characters in other formats, especially accented characters, are converted and standardized into ASCII characters.


def remove_accented_chars(text):
    # decompose accented characters and drop the non-ASCII parts
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
Expanding Contractions

Contractions are shortened versions of words or phrases in the English language, created by dropping letters and sounds from the original words. For instance, don’t expands to do not and I’d to I would.


def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
                               if contraction_mapping.get(match) \
                               else contraction_mapping.get(match.lower())
        # preserve the casing of the first character
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
Removing Special Characters

This can be accomplished with simple regexes. Depending on whether you want numbers in your normalized corpus, you can keep or remove them. The remove_special_characters(…) function helps us remove special characters.


def remove_special_characters(text):
    # keep only letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
Removing Stop Words

Stop words are words that carry little or no significance, especially when building meaningful features from text; words such as a, an, and the fall into this category. The remove_stopwords(…) function removes stop words and retains the words with the most significance and context in a corpus.


def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
Lemmatization

Lemmatization is similar to stemming in that it removes affixes to reach a word’s base form. Because it keeps lexicographically correct words, we use only lemmatization in our normalization pipeline. This is where the lemmatize_text(…) function comes in handy.


def lemmatize_text(text):
    text = nlp(text)
    # keep pronouns as-is, replace every other token with its lemma
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text
Building a Text Normalizer

We now tie all of these components together in the normalize_corpus(…) function, which takes a document corpus as input and returns the corresponding corpus of cleaned and normalized text documents.


def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True,
                     text_lemmatization=True, special_char_removal=True,
                     stopword_removal=True):

    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # insert spaces between special characters to isolate them
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters
        if special_char_removal:
            doc = remove_special_characters(doc)
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)

        normalized_corpus.append(doc)

    return normalized_corpus
Load Dataset and Normalize the Data

Load the IMDb dataset after importing the appropriate libraries. The dataset is then divided into training and testing sets, and the normalize_corpus(…) function is used to standardize the text.
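
A minimal sketch of this step is shown below. The file name movie_reviews.csv and its review/sentiment columns are illustrative assumptions about how the downloaded IMDb data is stored locally.

import pandas as pd
import numpy as np

# hypothetical file name; adjust to wherever the IMDb reviews are saved locally
dataset = pd.read_csv('movie_reviews.csv')

reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])

# 35,000 reviews for training, the remaining 15,000 for testing
train_reviews, train_sentiments = reviews[:35000], sentiments[:35000]
test_reviews, test_sentiments = reviews[35000:], sentiments[35000:]

# normalize both splits with the pipeline built above
norm_train_reviews = normalize_corpus(train_reviews)
norm_test_reviews = normalize_corpus(test_reviews)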

1st Approach: Sentiment Analysis with Unsupervised Lexicon-Based Models

Unsupervised sentiment analysis models depend on well-curated knowledge bases, taxonomies, lexicons, and datasets that include precise information about words and phrases, such as emotion, mood, polarity, and objectivity. A lexicon model relies on a lexicon, also known as a dictionary or vocabulary of words, built specifically for sentiment classification.

These lexicons often include a set of words associated with positive and negative emotion, polarity (the magnitude of a negative or positive score), mood, parts-of-speech (POS) tags, subjectivity classifiers (strong, weak, neutral), and modality, among other things. Lexicon models commonly used for sentiment analysis include the AFINN, SentiWordNet, and VADER lexicons.

Model Training, Prediction, and Performance Evaluation

1. Sentiment Analysis with AFINN


The AFINN lexicon is one of the most basic and widely used lexicons for sentiment analysis. It contains more than 3,300 words, each with an assigned polarity score, and the afinn library provides a Python implementation of it.
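
A hedged sketch of scoring the normalized test reviews with the afinn package follows; mapping scores to labels with a threshold of 0 is a common convention rather than something the library enforces.

from afinn import Afinn

afn = Afinn(emoticons=True)

# score each normalized test review: score > 0 -> positive, otherwise negative
sentiment_scores = [afn.score(review) for review in norm_test_reviews]
predicted_sentiments = ['positive' if score > 0 else 'negative' for score in sentiment_scores]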

2. Sentiment Analysis with SentiWordNet


SentiWordNet is a sentiment lexicon derived from the WordNet database, in which every synset is associated with numeric scores indicating positive, negative, and objective sentiment.
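
As a brief illustration, NLTK exposes SentiWordNet through nltk.corpus.sentiwordnet; the word looked up below is arbitrary.

from nltk.corpus import sentiwordnet as swn

# take the first synset for the adjective 'awesome'
awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive polarity score:', awesome.pos_score())
print('Negative polarity score:', awesome.neg_score())
print('Objective score:', awesome.obj_score())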

3. Sentiment Analysis with VADER


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotions.
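
A small sketch of scoring a single review with NLTK's VADER implementation; the 0.05 threshold on the compound score follows VADER's commonly cited convention.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The movie was great, but the ending felt rushed.")
print(scores)  # dict with 'neg', 'neu', 'pos' and an aggregated 'compound' score

label = 'positive' if scores['compound'] >= 0.05 else 'negative'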

Which Unsupervised Model Works Best?

Among the three lexicon models, the AFINN model performs best, with an accuracy of 71.18%. The performance of the other two models on this data is close to that of the AFINN model.

2nd Approach: Sentiment Classification with Supervised Learning

Another option is to use supervised machine learning to build a model that learns from the textual data and predicts the sentiment of text-based reviews. More specifically, classification models will be used to solve this problem.

Feature Engineering

Text must be converted into numeric feature vectors before it can be used by machine learning models. The following are the most popular representations:

1. BoW: Bag of words

The Bag of Words (BoW) model is the most basic numerical representation of text. A document is represented as a bag-of-words vector, i.e. a vector of word counts that ignores grammar and word order.
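
A short sketch with scikit-learn's CountVectorizer; the min_df, max_df, and ngram_range values are illustrative choices, not prescribed settings.

from sklearn.feature_extraction.text import CountVectorizer

# build BoW features on the normalized training reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1, 2))
cv_train_features = cv.fit_transform(norm_train_reviews)
# transform the test reviews with the same vocabulary
cv_test_features = cv.transform(norm_test_reviews)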

2. TF-IDF: Term Frequency-Inverse Document Frequency

The term frequency-inverse document frequency (TF-IDF) statistic is a numerical measure of how important a word is to a document within a collection or corpus.
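
A corresponding sketch with scikit-learn's TfidfVectorizer, using the same illustrative settings.

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1, 2), sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)
tv_test_features = tv.transform(norm_test_reviews)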

Model Training, Prediction, and Performance Assessment

1. Logistic Regression

Logistic regression is used to estimate the probability of a binary outcome, which makes it a standard choice for classification problems such as this one.

Logistic Regression with BoW features

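
A hedged sketch of this step, reusing the BoW features and training labels created earlier; the hyperparameter values are illustrative.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)

# train on BoW features and predict on the held-out test reviews
lr.fit(cv_train_features, train_sentiments)
lr_bow_predictions = lr.predict(cv_test_features)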

Logistic Regression using TF-IDF features

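
And the same classifier settings applied to the TF-IDF features.

lr_tfidf = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
lr_tfidf.fit(tv_train_features, train_sentiments)
lr_tfidf_predictions = lr_tfidf.predict(tv_test_features)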

2. Stochastic Gradient Descent (SGD)

SGD (Stochastic Gradient Descent) is a simple but effective optimization approach for finding the parameter values that minimize an objective function. In this context, it is used for the discriminative learning of linear classifiers such as linear SVM and logistic regression.

SVM using BoW features
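
A sketch using scikit-learn's SGDClassifier with a hinge loss, which corresponds to a linear SVM; the hyperparameters are illustrative.

from sklearn.linear_model import SGDClassifier

svm = SGDClassifier(loss='hinge', max_iter=500, random_state=42)
svm.fit(cv_train_features, train_sentiments)
svm_bow_predictions = svm.predict(cv_test_features)
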
SVM using TF-IDF features
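
And the same linear SVM trained on the TF-IDF features.

svm_tfidf = SGDClassifier(loss='hinge', max_iter=500, random_state=42)
svm_tfidf.fit(tv_train_features, train_sentiments)
svm_tfidf_predictions = svm_tfidf.predict(tv_test_features)
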
Which Model Performs Best?

We found that the logistic regression model trained on Bag of Words features performs best, with an accuracy of 90.65%. The performance of the other models is very close to this.
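
These accuracy figures can be computed with standard scikit-learn metrics; a minimal sketch for one model (the BoW logistic regression predictions from above) looks like this.

from sklearn.metrics import accuracy_score, classification_report

print('Accuracy:', accuracy_score(test_sentiments, lr_bow_predictions))
print(classification_report(test_sentiments, lr_bow_predictions))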

Conclusion

Among the unsupervised lexicon-based models, the most effective lexicon, AFINN, achieves an accuracy of about 71%, while the traditional supervised ML models reach accuracies of around 89-90%. With an accuracy of 90.65% and an F1 score of about 0.9, the logistic regression model trained on bag-of-words features outperformed all other models. Comparing the top models from both supervised and unsupervised learning, we can conclude that typical supervised models outperform lexicon-based models on this task.

For any queries, contact X-Byte Enterprise Crawling today!