Naïve Bayes
Naïve Bayes is a simple but powerful classifier based on a probabilistic model derived from Bayes’ theorem. It estimates the probability that an instance belongs to a class from the probabilities of each of its feature values.
One of the most successful applications of Naïve Bayes has been within the field of Natural Language Processing (NLP). NLP is closely tied to machine learning, since many of its problems can be formulated as classification tasks.
In this section, we will use Naïve Bayes for text classification; we will have a set of text documents with their corresponding categories, and we will train a Naïve Bayes algorithm to learn to predict the categories of new unseen instances.
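To make the idea of combining per-feature probabilities concrete, here is a minimal sketch of the Naïve Bayes calculation on a toy problem. The two classes, the priors, and the word probabilities are invented for illustration; they are not taken from the newsgroup data.

```python
import math

# Hypothetical class priors and per-class word probabilities (illustrative values only)
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports": {"game": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"game": 0.1, "vote": 0.7, "team": 0.2},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Work in log space: log P(c) + sum of log P(w | c)
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
        for c in priors
    }
    # Normalize so the class probabilities sum to 1
    total = sum(math.exp(s) for s in log_scores.values())
    return {c: math.exp(s) / total for c, s in log_scores.items()}

result = posterior(["game", "team"])
print(result)  # "sports" gets a probability of about 0.9
```

The "naïve" part is the product over words: each word is treated as independent of the others given the class.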
Importing our pylab environment
%pylab inline
Import the newsgroup Dataset, explore its structure and data
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.keys())
print(type(news.data), type(news.target), type(news.target_names))
print(news.target_names)
print(len(news.data))
print(len(news.target))
print(news.data[0])
print(news.target[0], news.target_names[news.target[0]])
Preprocessing the data
We have to partition our data into training and testing sets. The loaded data is already in random order, so we only have to split it into, for example, 75 percent for training and the remaining 25 percent for testing:
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]
This function will serve to perform and evaluate a cross validation:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator that shuffles the data
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))
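As a rough illustration of what the cross-validation function relies on, the following standard-library sketch splits sample indices into shuffled folds and computes the standard error of the mean reported after "+/-". The helper names k_fold_indices and standard_error are hypothetical, and this is not scikit-learn's or SciPy's actual implementation.

```python
import math
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def standard_error(scores):
    """Standard error of the mean, using the sample standard deviation (ddof=1)."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return math.sqrt(variance / len(scores))

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(sorted(test_idx), len(train_idx))
```

Every sample lands in exactly one test fold, so each fold's score is computed on data the model did not see while being fitted for that fold.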
Our machine learning algorithms can work only on numeric data. Currently we have only one feature, the text content of the message; we need some function that transforms a text into a meaningful set of numeric features.
The sklearn.feature_extraction.text module has some useful utilities to build numeric feature vectors from text documents.
You will find three different classes that can transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. The difference between them resides in the calculations they perform to obtain the numeric features. CountVectorizer basically creates a dictionary of words from the text corpus. Then, each instance is converted to a vector of numeric features where each element will be the count of the number of times a particular word appears in the document.
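The dictionary-and-counts idea can be sketched in a few lines of plain Python. This is a simplified illustration on an invented two-document corpus, not scikit-learn's implementation; the names tokenize and count_vector are hypothetical.

```python
import re
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Build the dictionary: one column index per distinct token, in sorted order
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def count_vector(text):
    """Map a text to a list of per-token counts, one entry per dictionary word."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

vectors = [count_vector(d) for d in docs]
print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```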
HashingVectorizer, instead of constructing and maintaining the dictionary in memory, implements a hashing function that maps tokens into feature indexes, and then computes the count as in CountVectorizer.
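The hashing approach can be sketched as follows. Two assumptions are made for illustration: zlib.crc32 stands in for the vectorizer's real hash function because it gives the same result on every run, and the number of buckets (N_FEATURES) is deliberately tiny.

```python
import re
import zlib

N_FEATURES = 16  # hypothetical, deliberately tiny; real vectorizers use far more buckets

def hashed_count_vector(text, n_features=N_FEATURES):
    """Count tokens into a fixed-size vector without storing any dictionary."""
    vector = [0] * n_features
    for tok in re.findall(r"\b\w\w+\b", text.lower()):
        # crc32 is a stand-in hash, chosen because it is deterministic across runs
        index = zlib.crc32(tok.encode("utf-8")) % n_features
        vector[index] += 1
    return vector

vec = hashed_count_vector("the cat sat on the mat")
print(vec)
```

The trade-off is that two different tokens can collide in the same bucket, and there is no way to map a feature index back to the token that produced it.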
TfidfVectorizer works like the CountVectorizer, but with a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a statistic for measuring the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document, compared with their frequency in the whole corpus of documents. You can see this as a way to normalize the results and avoid words that are too frequent, and thus not useful to characterize the instances.
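A minimal sketch of the TF-IDF intuition, using the textbook weighting (term count times log(N / document frequency)) on an invented three-document corpus. scikit-learn's TfidfVectorizer uses a smoothed variant and normalizes each vector by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

docs = [
    ["the", "game", "was", "great"],
    ["the", "vote", "was", "close"],
    ["the", "team", "won", "the", "game"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook weighting: term count times log(N / df)."""
    counts = Counter(doc)
    return {term: n * math.log(n_docs / df[term]) for term, n in counts.items()}

weights = tfidf(docs[2])
print(weights)
```

In the third document, "the" appears twice but occurs in every document, so its weight is zero, while "team", which appears in only one document, gets the highest weight.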
Training a Naïve Bayes classifier
We will create a Naïve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the sklearn.naive_bayes module.
To compose the classifier with the vectorizer, scikit-learn has a very useful class called Pipeline (available in the sklearn.pipeline module) that eases the construction of a compound classifier consisting of one or more transformers, such as vectorizers, followed by a final classifier.
Evaluate three models with the same Naive Bayes classifier, but with different vectorizers
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
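clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
clf_2 = Pipeline([
    # alternate_sign=False keeps the hashed features non-negative, as MultinomialNB requires
    # (older scikit-learn releases used non_negative=True for this)
    ('vect', HashingVectorizer(alternate_sign=False)),
    ('clf', MultinomialNB()),
])
clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])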
Perform a five-fold cross-validation by using each one of the classifiers
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)
CountVectorizer and TfidfVectorizer performed similarly, and both did much better than HashingVectorizer.
Let’s continue with TfidfVectorizer; we could try to improve the results by parsing the text documents into tokens with a different regular expression.
The default regular expression, r"\b\w\w+\b", considers alphanumeric characters and the underscore. Also allowing the hyphen and the dot might improve the tokenization by keeping tokens such as Wi-Fi and site.com in one piece. The new regular expression could be r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b".
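The effect of the two patterns can be checked directly with the re module, outside of any vectorizer. The sample sentence is invented for illustration and is written in lowercase because the custom pattern only lists a-z.

```python
import re

DEFAULT_PATTERN = r"\b\w\w+\b"
CUSTOM_PATTERN = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

text = "connect to wi-fi at site.com"

default_tokens = re.findall(DEFAULT_PATTERN, text)
custom_tokens = re.findall(CUSTOM_PATTERN, text)

print(default_tokens)  # ['connect', 'to', 'wi', 'fi', 'at', 'site', 'com']
print(custom_tokens)   # ['connect', 'wi-fi', 'site.com']
```

Note a side effect: the custom pattern needs at least three characters per token, so two-letter words such as "to" and "at" are dropped.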
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
evaluate_cross_validation(clf_4, news.data, news.target, 5)
We have a slight improvement from 0.86 to 0.87. Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as too frequent words, or words we do not a priori expect to provide information about the particular topic.
We will define a function to load the stop words from a text file:
def get_stop_words():
    result = set()
    # read one stop word per line; the with block closes the file afterwards
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result
stop_words = get_stop_words()
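To see what a stop-word list does to a tokenized document, here is a small standalone sketch. The stop-word set and the token list are hypothetical; the real list would come from a file such as stopwords_en.txt.

```python
# Hypothetical stop-word set, for illustration only
example_stop_words = {"the", "a", "of", "and", "is"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "price", "of", "a", "graphics", "card", "is", "high"]
filtered = remove_stop_words(tokens, example_stop_words)
print(filtered)  # ['price', 'graphics', 'card', 'high']
```

The words that remain are the ones more likely to say something about the topic of the message.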
Create a new classifier with this new parameter
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
        # recent scikit-learn versions expect a list here rather than a set
        stop_words=list(stop_words),
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
evaluate_cross_validation(clf_5, news.data, news.target, 5)
Let us look at the MultinomialNB parameters.
We can try to improve the results by adjusting the alpha parameter, which controls the amount of smoothing applied by the MultinomialNB classifier:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(
        # recent scikit-learn versions expect a list here rather than a set
        stop_words=list(stop_words),
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(clf_7, news.data, news.target, 5)
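The alpha parameter of MultinomialNB is an additive smoothing term (its default is 1.0). The sketch below shows the general idea on invented counts: alpha is added to every word count, so a word never seen in a class still gets a small nonzero probability. The function smoothed_word_probs is a hypothetical helper, not the library's code.

```python
from collections import Counter

def smoothed_word_probs(class_tokens, vocabulary, alpha):
    """Estimate P(word | class) with additive smoothing."""
    counts = Counter(class_tokens)
    denom = len(class_tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocabulary = ["game", "team", "vote"]
sports_tokens = ["game", "game", "team"]  # "vote" never appears in this class

p_alpha_1 = smoothed_word_probs(sports_tokens, vocabulary, alpha=1.0)
p_alpha_small = smoothed_word_probs(sports_tokens, vocabulary, alpha=0.01)
print(p_alpha_1)      # "vote" gets 1/6 of the probability mass
print(p_alpha_small)  # "vote" gets far less, so observed counts dominate
```

A smaller alpha trusts the observed counts more, which can help when the vocabulary is large and most words are absent from most classes. Whether it helps on a given dataset is an empirical question, which is why the value is checked with cross-validation.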
Evaluating the performance
If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.
from sklearn import metrics
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)
We obtain an accuracy of around 0.91.
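For readers unfamiliar with the reported metrics, this standalone sketch computes accuracy and a confusion matrix by hand on invented labels for a three-class problem. It follows the same convention as the confusion matrix printed above (rows are true classes, columns are predicted classes), but it is only an illustration, not the library's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, 3)
print(acc)  # 4 of 6 correct
print(cm)   # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy figure hides.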
If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:
print(len(clf_7.named_steps['vect'].get_feature_names_out()))
This shows that the dictionary is composed of 145767 tokens.