Classifying Amazon Reviews Based on Customer Ratings Using NLP

The main aim of this article is to analyze and classify Amazon reviews based on their user ratings.

Consumers benefit from reviews because they provide feedback on a product. These reviews usually come with a numerical score, such as a star rating. Of course, the text itself often carries more information than the star count, and in some cases the given rating does not reflect the actual experience with the product – the core of the review lies in the text.

The goal is to create a classifier that can read the text of a review and assign the most appropriate star rating based on its meaning.

Background

Though Amazon’s product ratings are compiled from all customer reviews, each individual evaluation is simply a number from one to five stars. Our predictions are therefore reduced to five discrete classes, giving us a multi-class supervised classifier with the review text as the key predictor.

Predicting a star rating from a piece of writing will involve a variety of NLP techniques, such as word embedding, topic modeling, and dimensionality reduction. After this, we’ll create a finalized data frame and compare several machine learning approaches to choose the optimal strategy (i.e., the most accurate estimator) for our classifier.

Dataset

The Amazon dataset contains customer reviews for all listed electronic products from May 1996 to July 2014: a total of 1,689,188 reviews from 192,403 customers covering 63,001 unique products. The data dictionary is as follows:

  • asin: unique ID of the product being reviewed, string.
  • helpful: the number of helpful votes the review received, together with the total number of customers who voted on it.
  • overall: the reviewer’s star rating of the product.
  • reviewText: the review text itself, string.
  • reviewerID: unique ID of the reviewer, string.
  • reviewerName: name of the reviewer, string.
  • summary: headline summary of the review, string.
  • unixReviewTime: Unix time when the review was posted, string.

Pipeline

Data preprocessing – tokenization – phrase modeling – building the vocabulary – count-based feature engineering – word embedding for feature engineering – PCA – exploratory data analysis – machine learning are all steps in the NLP analysis process.

NLP Processing

The reviewText column will be used to derive the model’s final data frame, with the overall column serving as the ground-truth label.


HTML Entities

Some of the data predates the global UTF-8 standard, and HTML processing converts several special characters, such as the apostrophe, into numeric entities that begin with &# and end with ;. Tokens matching the RegEx pattern &#[0-9]+; are therefore dropped. Code example:

import html

# Decode HTML entities (e.g., &#39; -> ') in an example review
sample_review = df["reviewText"].iloc[0]
decoded_review = html.unescape(sample_review)
print(decoded_review)

# Drop numeric entities matching &#[0-9]+; from every review
pattern = r"\&\#[0-9]+\;"
df["preprocessed"] = df["reviewText"].str.replace(pat=pattern, repl="", regex=True)
print(df["preprocessed"].iloc[1689185])

Lemmatization

To preserve consistency in word usage, each term is reduced to its base form (lemma), taking part-of-speech context into account. The NLTK library’s WordNetLemmatizer is used.
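
A minimal sketch of this step, assuming NLTK and its WordNet data are installed; the example words and POS tag are illustrative:

from nltk.stem import WordNetLemmatizer

# Reduce words to their base forms; the POS tag guides the lemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # -> run
print(lemmatizer.lemmatize("batteries"))          # -> battery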

Accents

Each review is converted from UTF-8 to ASCII encoding. Because accents are stripped from characters, words like naïve become naive.
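
A minimal sketch of the accent removal, assuming the standard-library unicodedata approach:

import unicodedata

# Decompose accented characters (NFKD) and drop anything outside ASCII
def remove_accents(text):
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(remove_accents("naïve"))  # -> naive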

Punctuation

The reviews are tidied up further by removing punctuation. All RegEx pattern matches are replaced with whitespace, leaving only spaces and alphanumeric characters.
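
A minimal sketch of this step; the exact pattern is not shown in the original, so a catch-all for non-alphanumeric, non-whitespace characters is assumed:

# Replace every character that is not alphanumeric or whitespace with a space
df["preprocessed"] = df["preprocessed"].str.replace(r"[^a-zA-Z0-9\s]", " ", regex=True)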

Lowercasing

Each letter is converted to lowercase.

Stop Words

Stop words are the most frequently used terms, such as pronouns, articles, and prepositions. These words are removed because they do little to distinguish one text from another.
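
A minimal sketch, assuming NLTK’s English stop-word list has been downloaded:

from nltk.corpus import stopwords

# Remove NLTK's English stop words from each review
stop_words = set(stopwords.words("english"))
df["preprocessed"] = df["preprocessed"].apply(
    lambda text: " ".join(word for word in text.split() if word not in stop_words)
)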

Single Whitespaces

We use pattern matching once more to guarantee that no more than a single whitespace character separates the words in each review.
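
A minimal sketch of this step:

# Collapse runs of whitespace into single spaces and trim the ends
df["preprocessed"] = df["preprocessed"].str.replace(r"\s+", " ", regex=True).str.strip()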

Tokenization

Our corpus, which is simply the collection of all our documents, is built from the entries in the preprocessed column. Each review is then converted into an ordered list of words. Tokenization is the process of breaking down a document into individual words, or tokens. A sample review is tokenized below:
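
A minimal sketch, assuming simple whitespace splitting of the preprocessed column:

# Split each preprocessed review into a list of tokens
tokenized = [review.split() for review in df["preprocessed"]]
print(tokenized[0][:10])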


Phrase Modeling

Because word order is important in most NLP models, it’s often useful to combine words that together convey a single meaning, such as smart TV, into one token.

The minimum number of times two words must occur next to each other to be considered for a phrase is set to 300 (min_count). The threshold then sets how high the pair’s co-occurrence score, relative to the frequency of each individual token in the corpus, must be before the pair is merged. The higher the threshold, the more strongly two words must be associated for a phrase to be formed.

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Merge word pairs that co-occur at least 300 times into bigram tokens
bi_gram = Phrases(tokenized, min_count=300, threshold=50)

# Build trigrams on top of the bigram-transformed corpus
tri_gram = Phrases(bi_gram[tokenized], min_count=300, threshold=5)

Forming the Vocabulary

The vocabulary consists of all unique tokens across the tokenized reviews, stored as key-value pairs. Every token is assigned a lookup ID.

from gensim.corpora.dictionary import Dictionary

# Map every unique token in the corpus to an integer lookup ID
vocabulary = Dictionary(tokenized)
vocabulary_keys = list(vocabulary.token2id)[0:10]
for key in vocabulary_keys:
    print(f"ID: {vocabulary.token2id[key]}, Token: {key}")

Count-Based Feature Engineering

The documents must then be mapped into a numerical representation before a machine learning model can work with them; in other words, the input text must be transformed into containers of numerical values.

Bag of Words Model – Counting token frequencies is the traditional method of describing text as a set of features. Each row of the data frame corresponds to a unique token in the corpus, whereas each column corresponds to a document; the cell holds the number of times that token appears in the document. The following is the BoW model for the sample review:

# Convert each tokenized review into (token ID, frequency) pairs
bow = [vocabulary.doc2bow(doc) for doc in tokenized]

for idx, freq in bow[0]:
    print(f"Word: {vocabulary.get(idx)}, Frequency: {freq}")

TF-IDF Model – The Term Frequency-Inverse Document Frequency (TF-IDF) method assigns continuous weights to tokens instead of raw counts. Words that appear across many documents carry little saliency and are given a lower weighting, while words that are distinctive to a text are weighted more heavily since they help distinguish it from the others. Our bow variable is used to calculate the weighting.

from gensim.models.tfidfmodel import TfidfModel

# Weight each token by TF-IDF instead of raw frequency
tfidf = TfidfModel(bow)

for idx, weight in tfidf[bow[0]]:
    print(f"Word: {vocabulary.get(idx)}, Weight: {weight:.3f}")

Word Embedding for Feature Engineering

The disadvantage of count-based approaches is that semantics are lost when word order and sentence structure are ignored. The Word2Vec approach quantifies how often a word appears in the vicinity of a group of other words, thereby embedding meaning in vectors.

A context window spanning context_size tokens slides one token at a time over each document. At every step the center word is described by its neighboring words, and the likelihood that the token appears with the others is represented in feature_size dimensions. Every token in the dataset is embedded in the Word2Vec model because the minimum word count is set to 1.

import numpy as np
from gensim.models import word2vec

np.set_printoptions(suppress=True)
feature_size = 100
context_size = 20
min_word = 1
# Train a 100-dimensional Word2Vec model on the tokenized reviews (gensim 3.x API)
word_vec = word2vec.Word2Vec(tokenized, size=feature_size, window=context_size,
                             min_count=min_word, iter=50, seed=42)

Final Data Frame

The purpose is to create a data frame whose observations relate to the product reviews. The word_vec model is used to collect all of the corpus’s original tokens, which allows us to create word_vec_df, a data frame that uses the embedding dimensions as features for each word.
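
A minimal sketch of how word_vec_df could be assembled, assuming the gensim 3.x model trained above; the column names are illustrative:

import pandas as pd

# One row per vocabulary token, one column per embedding dimension
tokens = list(word_vec.wv.vocab.keys())
word_vec_df = pd.DataFrame(
    [word_vec.wv[token] for token in tokens],
    index=tokens,
    columns=[f"dim_{i}" for i in range(feature_size)],
)
print(word_vec_df.shape)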


Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction approach we can use to reduce our model_df from 100 dimensions to just two. This lets us see whether the five overall rating classes have a clear decision boundary. The more datapoints from the same class are grouped together, the simpler and more successful our machine learning model is likely to be.
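
A minimal sketch, assuming the last column of model_df holds the overall rating and the remaining columns hold the 100 features:

from sklearn.decomposition import PCA
import pandas as pd

# Project the 100 feature columns onto two principal components for plotting
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(model_df.iloc[:, :-1])
pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])
pca_df["overall"] = model_df.iloc[:, -1].values
print(pca.explained_variance_ratio_)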

Exploratory Data Analysis – Word Algebra

Since Word2Vec converts words into quantified vectors, we can add or subtract them: adding mixes the meanings of the operands, while subtracting removes the meaning of one token from the context of another. The following are some examples of vector algebra and their similarity scores:
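
A minimal sketch of such queries using the gensim 3.x API; the words chosen are illustrative:

# Vector algebra on the trained embeddings: add and subtract word meanings
print(word_vec.wv.most_similar(positive=["screen", "big"], topn=3))
print(word_vec.wv.most_similar(positive=["camera"], negative=["quality"], topn=3))
print(word_vec.wv.similarity("cheap", "expensive"))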

To train and evaluate the classifiers, the final data frame is split into stratified training and test sets:

from sklearn.model_selection import train_test_split

# Features are every column except the last; the label is the overall rating
X = model_df.iloc[:, :-1]
y = model_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.5, random_state=42)

On the training data, our tuned Random Forest model received a very high score: as the scoring below shows, it classified almost every training review correctly.

These results, however, may be misleading because they are based on the same data the model was trained on, which almost certainly points to overfitting. We therefore need a fairer way to score the model, without yet dipping into our reserved testing set.
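
For context, here is a minimal sketch of fitting the Random Forest that the scoring snippet below assumes; the hyperparameters are illustrative placeholders, since the tuned values are not listed here:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical tuned estimator; n_estimators and random_state are placeholder values
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)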

from sklearn import metrics

# Score the fitted model on the training split
y_pred = forest.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
f1_score = metrics.f1_score(y_train, y_pred, average="micro")
print(f"Training Set Accuracy: {accuracy*100:.3f}%")
print(f"Training Set F1 Score: {f1_score:.3f}")

Long Short-Term Memory

The LSTM (Long Short-Term Memory) architecture is a Recurrent Neural Network (RNN)-based architecture used in natural language processing and time series prediction. It is capable of learning order dependence in sequence prediction problems, a requirement in a variety of complicated problem domains such as machine translation and speech recognition. The key to LSTMs’ success is that they were among the first methods to overcome the technical hurdles of recurrent neural networks and fulfill their promise.

In plain recurrent networks, gradients tend to shrink or grow as they are propagated back through time, and each step compounds the change until the gradients vanish or explode; the LSTM’s gating mechanism is designed to mitigate this.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

test_reviewText = review_data.reviewText
test_Ratings = review_data.overall
# TF-IDF features, ignoring terms that appear in more than 80% of the reviews
text_vectorizer = TfidfVectorizer(max_df=.8)
text_vectorizer.fit(test_reviewText)

def rate(r):
    # One-hot encode the 1-5 star ratings
    ary2 = []
    for rating in r:
        tv = [0, 0, 0, 0, 0]
        tv[rating - 1] = 1
        ary2.append(tv)
    return np.array(ary2)

X = text_vectorizer.transform(test_reviewText).toarray()
y = rate(test_Ratings.values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

# Note: this baseline uses dense layers on TF-IDF features rather than an LSTM layer
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1]))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=32, verbose=1)
model.evaluate(X_test, y_test)[1]
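
Since the snippet above trains dense layers on TF-IDF features, the following is a hedged sketch of an actual LSTM-based variant; it assumes integer-encoded, padded token sequences instead of TF-IDF vectors, and the vocabulary size, sequence length, and layer sizes are illustrative:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Turn the raw review text into padded integer sequences
max_words, max_len = 20000, 200
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(test_reviewText)
sequences = pad_sequences(tokenizer.texts_to_sequences(test_reviewText), maxlen=max_len)

# Embedding + LSTM + softmax over the five star ratings
# y is the one-hot rating matrix from the snippet above
lstm_model = Sequential()
lstm_model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))
lstm_model.add(LSTM(64))
lstm_model.add(Dense(5, activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
lstm_model.fit(sequences, y, validation_split=0.2, epochs=5, batch_size=32)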

Word Cloud


We can create a word cloud using the actual labels of the reviews by selecting the fifty most prominent words in each rating class. The stop words found in the NLTK library are excluded.
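
A minimal sketch for a single rating class, assuming the wordcloud package and the df data frame with its overall and preprocessed columns; the output filename is illustrative:

from wordcloud import WordCloud
from nltk.corpus import stopwords

# Build a 50-word cloud from all five-star reviews, excluding NLTK's English stop words
five_star_text = " ".join(df.loc[df["overall"] == 5, "preprocessed"])
cloud = WordCloud(max_words=50, stopwords=set(stopwords.words("english")))
cloud.generate(five_star_text)
cloud.to_file("five_star_wordcloud.png")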

Some of the words are quite indicative of the rating, such as “trouble” and “issue” in one-star reviews, and “quality” and “highly recommend” in five-star reviews.

Conclusion

The study explored a wide range of Natural Language Processing techniques. Topic modelling, where comparable texts were grouped together by topic, and dependency trees, where parts-of-speech tags and sentence structure were identified, are just two of the areas studied.

The pre-processing procedures were arguably just as important as the Word2Vec step in our final model. Every document had to be decoded from UTF-8, encoded to ASCII, and converted to lowercase before being tokenized. Accents, stop words, punctuation, and extra whitespace were removed from the texts. To shrink the vocabulary as much as possible, words were reduced to their root forms, and phrase modelling was used to merge tokens that frequently appeared together into single tokens.

Our model extracts and measures context in addition to word usage and frequency. Every token in every review is interpreted through the words around it and is embedded in a fixed number of dimensions. The vectors represent a word’s interactions with all of the other words it has appeared alongside.

We end up with a multi-class model, with one class for each of the five star ratings. This is a discrete approach, in which each class is distinct from the others. When the model misinterprets a 5-star rating as a 1-star review, it has simply misclassified – it is unconcerned with how far apart 1 and 5 are. This differs from a continuous approach, in which misclassifying a 5-star rating as a 1-star review would be penalized more heavily. The distinction between each type of review is therefore crucial to our model: it is more concerned with “What distinguishes a 5-star review from a 4-star review?” than with “Is this review more approving than critical?”
