Naïve Bayes

Naïve Bayes is a simple but powerful classifier based on a probabilistic model derived from Bayes' theorem. In essence, it determines the probability that an instance belongs to a class from the probabilities of each of its feature values.
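
Formally (a standard statement of the model, not specific to this notebook), under the "naïve" assumption that the features are conditionally independent given the class, the posterior probability of class c given features x_1, ..., x_n is

P(c | x_1, ..., x_n) ∝ P(c) · P(x_1 | c) · P(x_2 | c) · ... · P(x_n | c)

and the predicted class is the one that maximizes this product.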

One of the most successful applications of Naïve Bayes has been within the field of Natural Language Processing (NLP). NLP is closely related to machine learning, since many of its problems can be formulated as classification tasks.

In this section, we will use Naïve Bayes for text classification: we will take a set of text documents with their corresponding categories and train a Naïve Bayes model to predict the categories of new, unseen instances.

Importing our pylab environment

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Import the newsgroups dataset and explore its structure and data

In [3]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print (news.keys())
print (type(news.data), type(news.target), type(news.target_names))
print (news.target_names)
print (len(news.data))
print (len(news.target))
dict_keys(['target_names', 'data', 'description', 'target', 'DESCR', 'filenames'])
<class 'list'> <class 'numpy.ndarray'> <class 'list'>
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846
18846
In [4]:
print (news.data[0])
print (news.target[0], news.target_names[news.target[0]])
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10 rec.sport.hockey

Preprocessing the data

We have to partition our data into training and testing sets. The loaded data is already in random order, so we only have to split it into, for example, 75 percent for training and the remaining 25 percent for testing.

In [6]:
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]

The following function will perform and evaluate k-fold cross-validation:

In [8]:
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator with K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default, the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print (scores)
    print (("Mean score: {0:.3f} (+/-{1:.3f})").format(
        np.mean(scores), sem(scores)))

Our machine learning algorithms can work only on numeric data. Currently we have only one feature, the text content of the message, so we need some function that transforms a text into a meaningful set of numeric features.

The sklearn.feature_extraction.text module has some useful utilities to build numeric feature vectors from text documents.

You will find three different classes that can transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. The difference between them resides in the calculations they perform to obtain the numeric features. CountVectorizer basically creates a dictionary of words from the text corpus. Then, each instance is converted to a vector of numeric features where each element will be the count of the number of times a particular word appears in the document.

HashingVectorizer, instead of constructing and maintaining the dictionary in memory, implements a hashing function that maps tokens to feature indexes, and then computes the counts as in CountVectorizer.

TfidfVectorizer works like the CountVectorizer, but with a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a statistic for measuring the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document, compared with their frequency in the whole corpus of documents. You can see this as a way to normalize the results and avoid words that are too frequent, and thus not useful to characterize the instances.
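
To make the difference concrete, here is a small sketch on a toy corpus (the corpus and variable names are made up for illustration; get_feature_names() is the method name in the scikit-learn version used in this notebook, later renamed get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs make good pets',
]

# CountVectorizer: builds the vocabulary and counts each term per document
count_vect = CountVectorizer()
counts = count_vect.fit_transform(toy_corpus)
print(count_vect.get_feature_names())
print(counts.toarray())

# TfidfVectorizer: same vocabulary, but counts are re-weighted by TF-IDF and
# each row is normalized, down-weighting terms that appear in many documents
tfidf_vect = TfidfVectorizer()
print(tfidf_vect.fit_transform(toy_corpus).toarray().round(2))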

Training a Naïve Bayes classifier

We will create a Naïve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the sklearn.naive_bayes module.

In order to compose the classifier with the vectorizer, scikit-learn has a very useful class called Pipeline (available in the sklearn.pipeline module) that eases the construction of a compound classifier, chaining one or more vectorizers with a final classifier.

Evaluate three models with the same Naïve Bayes classifier, but with different vectorizers
In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
clf_2 = Pipeline([
    ('vect', HashingVectorizer(non_negative=True)),
    ('clf', MultinomialNB()),
])
clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
Perform a five-fold cross-validation by using each one of the classifiers
In [10]:
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)
[ 0.85782493  0.85725657  0.84664367  0.85911382  0.8458477 ]
Mean score: 0.853 (+/-0.003)
[ 0.75543767  0.77659857  0.77049615  0.78508888  0.76200584]
Mean score: 0.770 (+/-0.005)
[ 0.84482759  0.85990979  0.84558238  0.85990979  0.84213319]
Mean score: 0.850 (+/-0.004)

CountVectorizer and TfidfVectorizer performed similarly, and both much better than HashingVectorizer.

Let's continue with TfidfVectorizer; we could try to improve the results by parsing the text documents into tokens with a different regular expression.

The default regular expression, r"(?u)\b\w\w+\b", considers alphanumeric characters and the underscore. Perhaps also considering the hyphen and the dot could improve the tokenization, letting us treat strings such as Wi-Fi and site.com as single tokens. The new regular expression could be r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b".
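
To see the effect of the two patterns on concrete strings, here is a quick sketch using the re module directly (the sample sentence is made up; note that the vectorizer lowercases text before tokenizing, so the patterns are applied to lowercase strings):

import re

text = "the wi-fi network at site.com is down"

default_pattern = r"(?u)\b\w\w+\b"                          # scikit-learn's default token_pattern
custom_pattern = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"   # the proposed alternative

print(re.findall(default_pattern, text))
# ['the', 'wi', 'fi', 'network', 'at', 'site', 'com', 'is', 'down']
print(re.findall(custom_pattern, text))
# ['the', 'wi-fi', 'network', 'site.com', 'down']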

In [27]:
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
In [26]:
evaluate_cross_validation(clf_4, news.data, news.target, 5)
[ 0.86100796  0.8718493   0.86203237  0.87291059  0.8588485 ]
Mean score: 0.865 (+/-0.003)

We have a slight improvement, from 0.850 to 0.865. Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as words that are too frequent, or words we do not a priori expect to provide information about the particular topic.
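
As an aside (not part of the original pipeline), scikit-learn also ships a built-in English stop word list that can be requested by name, which is handy when no custom list is available:

# Built-in English stop word list; results will differ slightly from the
# custom stopwords_en.txt list used below
vect = TfidfVectorizer(stop_words='english')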

We will define a function to load the stop words from a text file

In [28]:
def get_stop_words():
    result = set()
    # read one stop word per line from the text file
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result
In [30]:
stop_words = get_stop_words()

Create a new classifier with this new parameter

In [31]:
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
In [32]:
evaluate_cross_validation(clf_5, news.data, news.target, 5)
[ 0.88116711  0.89519767  0.88325816  0.89227912  0.88113558]
Mean score: 0.887 (+/-0.003)

Let us look at the MultinomialNB parameters.

We can try to improve the results by adjusting alpha, the additive (Laplace/Lidstone) smoothing parameter of the MultinomialNB classifier, whose default value is 1.0.

In [33]:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
In [34]:
evaluate_cross_validation(clf_7, news.data, news.target, 5)
[ 0.9204244   0.91960732  0.91828071  0.92677103  0.91854603]
Mean score: 0.921 (+/-0.002)
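
If we wanted to choose alpha less by hand, one option (a sketch reusing the helpers defined above, rather than something from the original notebook) is to loop over a few candidate values and compare the cross-validated scores; GridSearchCV would be the more systematic tool.

# Compare a few smoothing values using the same 5-fold cross-validation
for alpha in [1.0, 0.1, 0.01, 0.001]:
    clf = Pipeline([
        ('vect', TfidfVectorizer(
                    stop_words=stop_words,
                    token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
        )),
        ('clf', MultinomialNB(alpha=alpha)),
    ])
    print('alpha =', alpha)
    evaluate_cross_validation(clf, news.data, news.target, 5)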

Evaluating the performance

If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.

In [35]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):

    clf.fit(X_train, y_train)

    print ("Accuracy on training set:")
    print (clf.score(X_train, y_train))
    print ("Accuracy on testing set:")
    print (clf.score(X_test, y_test))

    y_pred = clf.predict(X_test)

    print ("Classification Report:")
    print (metrics.classification_report(y_test, y_pred))
    print ("Confusion Matrix:")
    print (metrics.confusion_matrix(y_test, y_pred))
In [36]:
train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)
Accuracy on training set:
0.996957690675
Accuracy on testing set:
0.917869269949
Classification Report:
             precision    recall  f1-score   support

          0       0.95      0.88      0.91       216
          1       0.85      0.85      0.85       246
          2       0.91      0.84      0.87       274
          3       0.81      0.86      0.83       235
          4       0.88      0.90      0.89       231
          5       0.89      0.91      0.90       225
          6       0.88      0.80      0.84       248
          7       0.92      0.93      0.93       275
          8       0.96      0.98      0.97       226
          9       0.97      0.94      0.96       250
         10       0.97      1.00      0.98       257
         11       0.97      0.97      0.97       261
         12       0.90      0.91      0.91       216
         13       0.94      0.95      0.95       257
         14       0.94      0.97      0.95       246
         15       0.90      0.96      0.93       234
         16       0.91      0.97      0.94       218
         17       0.97      0.99      0.98       236
         18       0.95      0.91      0.93       213
         19       0.86      0.78      0.82       148

avg / total       0.92      0.92      0.92      4712

Confusion Matrix:
[[190   0   0   0   1   0   0   0   0   1   0   0   0   1   0   9   2   0
    0  12]
 [  0 208   5   3   3  13   4   0   0   0   0   1   3   2   3   0   0   1
    0   0]
 [  0  11 230  22   1   5   1   0   1   0   0   0   0   0   1   0   1   0
    1   0]
 [  0   6   6 202   9   3   4   0   0   0   0   0   4   0   1   0   0   0
    0   0]
 [  0   2   3   4 208   1   5   0   0   0   2   0   5   0   1   0   0   0
    0   0]
 [  0   9   2   2   1 205   0   1   1   0   0   0   0   2   1   0   0   1
    0   0]
 [  0   2   3  10   6   0 199  14   1   2   0   1   5   2   2   0   0   1
    0   0]
 [  0   1   1   1   1   0   6 257   4   1   0   0   0   1   0   0   2   0
    0   0]
 [  0   0   0   0   0   1   1   2 221   0   0   0   0   1   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   1   0   2 236   5   0   1   3   0   1   1   0
    0   0]
 [  0   0   0   1   0   0   0   0   0   0 256   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   1   0   1   0   0   0 254   0   1   0   0   3   0
    1   0]
 [  0   1   0   1   5   1   3   1   0   2   1   1 197   1   2   0   0   0
    0   0]
 [  0   1   0   1   1   0   0   0   0   0   0   2   2 245   3   0   1   0
    0   1]
 [  0   2   0   0   1   0   0   1   0   0   0   0   0   1 238   0   1   0
    1   1]
 [  1   0   1   2   0   0   0   1   0   0   0   1   1   0   1 225   0   1
    0   0]
 [  0   0   1   0   0   0   1   0   1   0   0   1   0   0   0   0 212   0
    2   0]
 [  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 234
    1   0]
 [  0   0   0   0   0   0   1   0   0   0   0   2   1   1   0   1   7   3
  193   4]
 [  9   0   0   0   0   1   0   0   0   1   0   0   0   0   0  13   4   1
    4 115]]

We obtain an accuracy of around 0.92 on the testing set.

If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:

In [37]:
print (len(clf_7.named_steps['vect'].get_feature_names()))
145767

This shows that the dictionary is composed of 145767 tokens.
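
To get a feel for what those tokens actually look like, we could print a small slice of the sorted vocabulary (a sketch; the slice indices are arbitrary):

feature_names = clf_7.named_steps['vect'].get_feature_names()
print(feature_names[50000:50010])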
