Nltk-sentiment-analysis

Sentiment Analysis with Scikit-Learn

We will use Python's Scikit-Learn machine learning library to train a text classification model.

Following are the steps required to create a text classification model in Python:

  1. Importing Libraries
  2. Importing the Dataset
  3. Text Preprocessing
  4. Converting Text to Numbers
  5. Training and Testing Sets
  6. Training Text Classification Model and Predicting Sentiment
  7. Evaluating the Model
  8. Saving and Loading the Model

Importing Libraries

Execute the following script to import the required libraries:

import numpy as np  
import re  
import nltk  
from sklearn.datasets import load_files  
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle  
from nltk.corpus import stopwords  

Importing the Dataset

Execute the following script to load the dataset with the load_files function:

movie_data = load_files(r"D:\txt_sentoken")

X, y = movie_data.data, movie_data.target

In the script above, the load_files function loads the data from both the "neg" and "pos" folders into the X variable, while the target categories are stored in y. Here X is a list of 2000 string elements, where each element corresponds to a single user review. Similarly, y is a numpy array of size 2000. If you print y on the screen, you will see an array of 1s and 0s. This is because, for each category, the load_files function adds a number to the target numpy array. We have two categories, "neg" and "pos", therefore 1s and 0s have been added to the target array.
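You can verify this mapping directly. load_files returns a Bunch object whose target_names attribute lists the folder names in the order they were encoded, and np.bincount shows how many reviews fall into each class:

print(movie_data.target_names)  # ['neg', 'pos']: 0 encodes "neg", 1 encodes "pos"
print(np.bincount(y))           # reviews per class, e.g. [1000 1000]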


Text Preprocessing

Execute the following script to preprocess the data:

documents = []

nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for sen in range(0, len(X)):  
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()

    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

In the script above we use regular expressions from Python's re library to perform the different preprocessing tasks. We start by removing all non-word characters, such as special characters and numbers.

Next, we remove all single characters. For instance, when we remove the punctuation mark from "David's" and replace it with a space, we get "David" and the single character "s", which has no meaning. To remove such single characters we use the \s+[a-zA-Z]\s+ regular expression, which replaces any single character that has spaces on both sides with a single space.

Next, we use the ^[a-zA-Z]\s+ regular expression to replace a single character at the beginning of the document with a single space. Replacing single characters with spaces can leave behind runs of multiple spaces, which is not ideal.

We therefore use the regular expression \s+ to replace one or more spaces with a single space. When a dataset is read in bytes format, the letter "b" is prepended to every string. The regex ^b\s+ removes this "b" from the start of a string. The next step is to convert the text to lowercase so that words which are the same but differ in case are treated identically.

The final preprocessing step is lemmatization. In lemmatization, we reduce each word to its dictionary root form; for instance, "cats" is converted into "cat". This avoids creating separate features for words that are semantically the same but syntactically different: we don't want two different features named "cats" and "cat".
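As a quick check of the lemmatizer's behavior (assuming the imports and downloads above have run):

print(lemmatizer.lemmatize('cats'))   # prints "cat"
print(lemmatizer.lemmatize('rocks'))  # prints "rock"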


Converting Text to Numbers

Machines, unlike humans, cannot understand raw text; they only see numbers. In particular, statistical techniques such as machine learning can only deal with numbers, so we need to convert our text into numerical features.

You can convert text documents directly into TF-IDF feature values (without first converting the documents to bag-of-words features) using the following script:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer(max_features=2000, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(documents).toarray()
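The vectorizer limits the vocabulary to the 2,000 most frequent terms and filters English stop words. fit_transform returns a sparse matrix, and toarray() converts it to the dense array that GaussianNB requires later on. As a quick sanity check (get_feature_names_out requires scikit-learn 1.0 or newer; older versions use get_feature_names):

print(X.shape)  # (number of reviews, number of features), e.g. (2000, 2000)
print(tfidfconverter.get_feature_names_out()[:10])  # first few vocabulary terms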


Training and Testing Sets

Like any other supervised machine learning problem, we need to divide our data into training and testing sets. To do so, we will use the train_test_split utility from the sklearn.model_selection library. Execute the following script:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The above script splits the data into an 80% training set and a 20% test set.
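You can confirm the split sizes directly:

print(X_train.shape, X_test.shape)  # e.g. (1600, 2000) (400, 2000)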

Training Text Classification Model and Predicting Sentiment

We have divided our data into training and testing sets. Now it is time to see the real action. We will use Naive Bayes to train our model. To train the model with the Naive Bayes algorithm, we will use the GaussianNB class from the sklearn.naive_bayes library. The fit method of this class trains the algorithm; we need to pass it the training data and the training targets. Take a look at the following script:

gnb = GaussianNB()
gnb.fit(X_train, y_train)

Finally, to predict the sentiment for the documents in our test set, we can use the predict method of the GaussianNB class as shown below:

y_pred = gnb.predict(X_test)

Evaluating the Model

To evaluate the performance of a classification model such as the one we just trained, we can use the accuracy score. To calculate it, run the following script:

print(accuracy_score(y_test, y_pred))
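Accuracy alone can hide per-class behavior. The classification_report and confusion_matrix functions imported earlier give a more detailed breakdown:

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))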

Saving and Loading the Model

We can save our model as a pickle object in Python. To do so, execute the following script:

with open('text_classifier', 'wb') as picklefile:
    pickle.dump(gnb, picklefile)

Once you execute the above script, you will see the text_classifier file in your working directory. To load the model back, we can use the following code:

with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

We loaded our trained model and stored it in the model variable. Let's predict the sentiment for the test set using the loaded model and check that we get the same results. Execute the following script:

y_pred2 = model.predict(X_test)
print("the accuracy level after load ", accuracy_score(y_test, y_pred2))

The complete source code is available at this link: RandomForestClassifier.py

Sentiment Analysis with NLTK Naive Bayes Classification Using Bigrams

We will use Python's NLTK library to train a text classification model.

Following are the steps required to create a text classification model in Python:

  1. Importing Libraries
  2. Importing the movie_reviews Dataset
  3. Training and Testing Sets
  4. Training the Text Classification Model and Evaluating the Model

Importing Libraries

Execute the following script to import the required libraries:

import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util as util
from nltk.collocations import BigramCollocationFinder as BCF
from nltk.metrics import BigramAssocMeasures
import itertools

Importing the movie_reviews Dataset

Execute the following script to import the movie_reviews dataset:

    pid = movie_reviews.fileids('pos')
    nid = movie_reviews.fileids('neg')

The next code segment builds the labeled bigram feature sets. It calls a helper function, features, which is not defined in the original snippet.

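A minimal sketch of such a helper, consistent with the BCF, BigramAssocMeasures, and itertools imports above (the chi-squared score function and the cutoff of 200 bigrams are assumptions, not taken from the source):

    def features(words):
        # Score candidate bigrams and keep the top 200 by chi-squared
        # association (both choices are assumed, not given in the source).
        bigram_finder = BCF.from_words(words)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 200)
        # Use the individual words plus the selected bigrams as boolean features.
        return dict((feature, True) for feature in itertools.chain(words, bigrams))

With the helper in place, the labeled feature sets are built as follows: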
    prev = [(features(movie_reviews.words(fileids = id)), 'positive') for id in pid]
    nrev = [(features(movie_reviews.words(fileids = id)), 'negative') for id in nid]

Training and Testing Sets

The following script splits the labeled reviews into training and testing sets. The ncutoff and pcutoff indices are not defined in the original snippet; a common choice, assumed here, is to use three quarters of each class for training:

     ncutoff = int(len(nrev) * 3 / 4)  # assumed 3:1 train/test split
     pcutoff = int(len(prev) * 3 / 4)

     train_set = nrev[:ncutoff] + prev[:pcutoff]
     test_set = nrev[ncutoff:] + prev[pcutoff:]

Training Text Classification Model and Evaluating The Model

The following script trains the text classification model and evaluates it on the held-out test set:

    classifier = NaiveBayesClassifier.train(train_set)

    # Accuracy
    print("Accuracy is : ", util.accuracy(classifier, test_set) * 100)


The complete source code is available at this link: movie_review_using_bigram

NLTK Naive Bayes Classification Without Bigrams

Following are the steps required to create a text classification model without bigrams in Python:

  1. Importing Libraries
  2. Loading the positive and negative reviews and creating the features
  3. Preparing the train and test datasets
  4. Training the Text Classification Model and Evaluating the Model

Importing Libraries

Execute the following script to import the required libraries:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

Load the Positive and Negative Reviews and Create the Features

Execute the following script to load the positive and negative reviews, create the features for each review, and append them to the labeled lists. The script relies on a helper function, create_word_features, which is not defined in the original snippet; a minimal sketch is shown first.
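A plausible implementation of the helper, assuming the intent suggested by the stopwords import above (each non-stop-word becomes a boolean feature; this is an assumption, not taken from the source):

      def create_word_features(words):
          # Drop English stop words and mark each remaining word as present.
          stops = set(stopwords.words('english'))
          return dict((word, True) for word in words if word not in stops)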

      pos_reviews = []
      neg_reviews = []

      for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_reviews.append((create_word_features(words), "positive"))

      for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_reviews.append((create_word_features(words), "negative"))

Prepare the Train and Test Datasets

Execute the following script to prepare the train and test datasets. The movie_reviews corpus contains 1,000 reviews per class, so this yields 1,500 training examples and 500 test examples:

    train_set = neg_reviews[:750] + pos_reviews[:750]
    test_set = neg_reviews[750:] + pos_reviews[750:]

Training Text Classification Model and Evaluating The Model

Execute the following script to train the text classification model and evaluate it:

    classifier = NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print(accuracy * 100) 

The complete source code is available at this link: Senti_model