This CountVectorizer scikit-learn example is from PyCon Dublin 2016. For further information please visit this link. The dataset is the SMS Spam Collection from the UCI Machine Learning Repository.

In [2]:

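# Read the file as UTF-8 so characters such as the pound sign decode correctly
with open('smsspamcollection/SMSSpamCollection', encoding='utf-8') as f:
    messages = [line.rstrip() for line in f]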

In [3]:

print (len(messages))

In [5]:

for num,message in enumerate(messages[:10]):
    print(num,message)
    print ('\n')

0 ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


1 ham	Ok lar... Joking wif u oni...


2 spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


3 ham	U dun say so early hor... U c already then say...


4 ham	Nah I don't think he goes to usf, he lives around here though


5 spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv


6 ham	Even my brother is not like to speak with me. They treat me like aids patent.


7 ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune


8 spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


9 spam	Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
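
Each raw line above holds a label and the message text separated by a single tab. A minimal sketch of splitting such lines by hand, using two sample lines in the same layout (the helper name `split_line` is hypothetical, not part of the notebook):

```python
# Two raw lines in the same "label<TAB>text" layout as the output above.
raw_lines = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tWINNER!! As a valued network customer you have been selected",
]

def split_line(line):
    """Split one raw line into (label, text) at the first tab only."""
    label, text = line.split("\t", 1)
    return label, text

parsed = [split_line(line) for line in raw_lines]
for label, text in parsed:
    print(label, "|", text)
```

Splitting at the first tab only keeps any later tab characters inside the message text.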

In [6]:

import pandas

In [7]:

messages = pandas.read_csv('smsspamcollection/SMSSpamCollection',
                           sep='\t',names=['labels','message'])

In [9]:

messages.head()

Out[9]:

	labels	message
0	ham	Go until jurong point, crazy.. Available only …
1	ham	Ok lar… Joking wif u oni…
2	spam	Free entry in 2 a wkly comp to win FA Cup fina…
3	ham	U dun say so early hor… U c already then say…
4	ham	Nah I don’t think he goes to usf, he lives aro…

In [10]:

messages.describe()

Out[10]:

	labels	message
count	5572	5572
unique	2	5169
top	ham	Sorry, I’ll call later
freq	4825	30

In [11]:

messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
labels     5572 non-null object
message    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB

In [12]:

messages.groupby('labels').describe()

Out[12]:

		message
labels
ham	count	4825
	unique	4516
	top	Sorry, I’ll call later
	freq	30
spam	count	747
	unique	653
	top	Please call our customer service representativ…
	freq	4

In [13]:

messages['length'] = messages['message'].apply(len)
messages.head()

Out[13]:

	labels	message	length
0	ham	Go until jurong point, crazy.. Available only …	111
1	ham	Ok lar… Joking wif u oni…	29
2	spam	Free entry in 2 a wkly comp to win FA Cup fina…	155
3	ham	U dun say so early hor… U c already then say…	49
4	ham	Nah I don’t think he goes to usf, he lives aro…	61

In [14]:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [16]:

messages['length'].plot(bins=50,kind = 'hist')

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x22eca4dd240>

In [17]:

messages['length'].describe()

Out[17]:

count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

In [20]:

messages[messages['length'] == 910]['message'].iloc[0]

Out[20]:

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

In [22]:

messages.hist(column='length',by ='labels',bins=50,figsize = (10,4))

Out[22]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000022ECA458320>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x0000022ECAFEB978>], dtype=object)

In [23]:

import string

In [24]:

mess = 'Sample message ! Notice: it has punctuation'

In [25]:

string.punctuation

Out[25]:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:

nopunc = [char for char in mess if char not in string.punctuation]

In [27]:

nopunc = ''.join(nopunc)

In [28]:

nopunc

Out[28]:

'Sample message  Notice it has punctuation'
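
The same punctuation stripping can also be done in one pass with `str.translate`. This is an alternative sketch, not the approach used in the notebook; the names `remove_punct`, `nopunc_fast` and `nopunc_slow` are made up for the comparison:

```python
import string

mess = 'Sample message ! Notice: it has punctuation'

# Build a table that maps every punctuation character to None (deletion).
remove_punct = str.maketrans('', '', string.punctuation)
nopunc_fast = mess.translate(remove_punct)

# Same result as the character-by-character comprehension used above.
nopunc_slow = ''.join(ch for ch in mess if ch not in string.punctuation)
print(nopunc_fast == nopunc_slow)
```

Note that the double space left behind where "!" was removed is preserved in both versions; `split()` later collapses it.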

In [30]:

from nltk.corpus import stopwords

In [31]:

stopwords.words('english')[0:10]

Out[31]:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [32]:

nopunc.split()

Out[32]:

['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']

In [33]:

clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

In [34]:

clean_mess

Out[34]:

['Sample', 'message', 'Notice', 'punctuation']

In [35]:

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
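
`text_process` rebuilds the stop word list for every word it checks, which can be slow on thousands of messages. A possible variant is sketched below: it looks words up in a set built once. The tiny `STOP_WORDS` set is a stand-in (an assumption, so the sketch runs without NLTK data); in the notebook it would be `set(stopwords.words('english'))`. The name `text_process_fast` is hypothetical.

```python
import string

# Hypothetical, deliberately tiny stop list so the sketch runs without NLTK data.
# In the notebook this would be set(stopwords.words('english')).
STOP_WORDS = frozenset({'i', 'me', 'my', 'it', 'has', 'so', 'then', 'the', 'a'})
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def text_process_fast(mess, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stop words (case-insensitive lookup)."""
    nopunc = mess.translate(PUNCT_TABLE)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

print(text_process_fast('Sample message ! Notice: it has punctuation'))
```

Set membership is a constant-time check, so the cost per word no longer grows with the length of the stop word list.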

In [36]:

messages.head()

Out[36]:

	labels	message	length
0	ham	Go until jurong point, crazy.. Available only …	111
1	ham	Ok lar… Joking wif u oni…	29
2	spam	Free entry in 2 a wkly comp to win FA Cup fina…	155
3	ham	U dun say so early hor… U c already then say…	49
4	ham	Nah I don’t think he goes to usf, he lives aro…	61

In [37]:

messages['message'].head(5).apply(text_process)

Out[37]:

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: message, dtype: object

In [40]:

from sklearn.feature_extraction.text import CountVectorizer

In [44]:

bow_transformer = CountVectorizer(analyzer=text_process)

In [45]:

bow_transformer.fit(messages['message'])

Out[45]:

CountVectorizer(analyzer=<function text_process at 0x0000022ECBC7FE18>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [46]:

message4 = messages['message'][3]

In [47]:

print (message4)

U dun say so early hor... U c already then say...

In [48]:

bow4 = bow_transformer.transform([message4])

In [49]:

print (bow4)

  (0, 4068)	2
  (0, 4629)	1
  (0, 5261)	1
  (0, 6204)	1
  (0, 6222)	1
  (0, 7186)	1
  (0, 9554)	2

In [50]:

print (bow_transformer.get_feature_names()[4073])

UIN

In [51]:

print (bow_transformer.get_feature_names()[4068])

U

In [52]:

print (bow_transformer.get_feature_names()[9554])

say

In [53]:

messages_bow = bow_transformer.transform(messages['message'])

In [54]:

print ('Shape of Sparse Matrix: ', messages_bow.shape)
print ('Amount of Non-Zero occurrences: ', messages_bow.nnz)
# nnz / total cells is the share of non-zero entries, so a small value means a very sparse matrix
print ('sparsity: %.2f%%' % (100.0 * messages_bow.nnz /
                             (messages_bow.shape[0] * messages_bow.shape[1])))

Shape of Sparse Matrix:  (5572, 11425)
Amount of Non-Zero occurrences:  50548
sparsity: 0.08%
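
The percentage printed above is the share of cells that are non-zero. The arithmetic can be checked directly from the reported shape and count (the variable names are illustrative only):

```python
rows, cols = 5572, 11425      # shape reported above
nnz = 50548                   # non-zero entries reported above

total_cells = rows * cols
density_pct = 100.0 * nnz / total_cells

print('cells: %d' % total_cells)
print('non-zero share: %.2f%%' % density_pct)
print('zero share: %.2f%%' % (100.0 - density_pct))
```

Fewer than one cell in a thousand holds a count, which is why the result is stored as a sparse matrix rather than a dense array.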

In [55]:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bow)

In [56]:

tfidf4 = tfidf_transformer.transform(bow4)

In [57]:

print (tfidf4)

  (0, 9554)	0.538562626293
  (0, 7186)	0.438936565338
  (0, 6222)	0.318721689295
  (0, 6204)	0.299537997237
  (0, 5261)	0.297299574059
  (0, 4629)	0.266198019061
  (0, 4068)	0.408325899334

In [58]:

print (tfidf_transformer.idf_[bow_transformer.vocabulary_['u']])
print (tfidf_transformer.idf_[bow_transformer.vocabulary_['university']])

3.28005242674
8.5270764989
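
With its default settings (`smooth_idf=True`), `TfidfTransformer` computes idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula by hand. The document frequencies 569 and 2 are not taken from the notebook; they are inferred values that reproduce the two printed numbers, so treat them as an assumption.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form used when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5572
# Document frequencies below are inferred to match the printed idf values (assumption).
print(smooth_idf(n_docs, 569))   # close to the value printed for 'u'
print(smooth_idf(n_docs, 2))     # close to the value printed for 'university'
```

A term that appears in many messages gets a low idf, while a rare term gets a high one, which is why 'university' is weighted far more heavily than 'u'.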

In [59]:

messages_tfidf = tfidf_transformer.transform(messages_bow)

In [60]:

print (messages_tfidf.shape)

(5572, 11425)

In [61]:

from sklearn.naive_bayes import MultinomialNB

In [62]:

spam_detect_model = MultinomialNB().fit(messages_tfidf,messages['labels'])

In [64]:

print ('Predicted: ',spam_detect_model.predict(tfidf4)[0] )
print ('Expected: ',messages['labels'][3])

Predicted:  ham
Expected:  ham

In [65]:

all_predictions = spam_detect_model.predict(messages_tfidf)
print (all_predictions)

['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']

In [67]:

from sklearn.metrics import classification_report
print (classification_report(messages['labels'], all_predictions))

             precision    recall  f1-score   support

        ham       0.98      1.00      0.99      4825
       spam       1.00      0.85      0.92       747

avg / total       0.98      0.98      0.98      5572

In [69]:

from sklearn.model_selection import train_test_split

msg_train, msg_test, label_train, label_test = train_test_split(
    messages['message'], messages['labels'], test_size=0.2)

print (len(msg_train), len(msg_test), len(msg_train) + len(msg_test))

4457 1115 5572
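
Because spam is the minority class, a plain random split can leave the test set with a different spam ratio on each run. One option, not used in the notebook, is to pass `stratify` and `random_state` to `train_test_split`. The sketch below uses made-up toy data rather than the SMS corpus:

```python
from sklearn.model_selection import train_test_split

# Toy data with a similar kind of imbalance (hypothetical, not the SMS corpus).
texts = ['msg %d' % i for i in range(100)]
labels = ['spam' if i % 10 == 0 else 'ham' for i in range(100)]

# stratify keeps the spam/ham ratio the same in both splits;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

print(len(X_train), len(X_test))
print(y_test.count('spam'), 'spam messages in the test split')
```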

In [70]:

from sklearn.pipeline import Pipeline

In [71]:

pipeline = Pipeline([('bow',CountVectorizer(analyzer =text_process)),
                    ('tfidf',TfidfTransformer()),
                    ('classifier',MultinomialNB())])

In [72]:

pipeline.fit(msg_train,label_train)

Out[72]:

Pipeline(steps=[('bow', CountVectorizer(analyzer=<function text_process at 0x0000022ECBC7FE18>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
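
A pipeline accepts raw text and runs the vectorizer, the TF-IDF step and the classifier in sequence, both when fitting and when predicting. A self-contained sketch on a hypothetical four-message corpus, using the default word analyzer instead of `text_process` so that it runs without NLTK:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the default word analyzer replaces text_process.
train_texts = ['free prize call now', 'win cash prize free',
               'see you at lunch', 'call me after lunch ok']
train_labels = ['spam', 'spam', 'ham', 'ham']

toy_pipeline = Pipeline([('bow', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('classifier', MultinomialNB())])
toy_pipeline.fit(train_texts, train_labels)

# The pipeline takes raw strings; vectorizing and weighting happen inside predict.
preds = toy_pipeline.predict(['free cash prize', 'lunch at noon'])
print(preds)
```

Fitting the vectorizer inside the pipeline also means its vocabulary is learned from the training messages only, never from the test messages.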

In [73]:

predictions = pipeline.predict(msg_test)

In [74]:

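# classification_report expects (y_true, y_pred). The arguments are reversed here,
# so in the report below the precision and recall columns are swapped.
print (classification_report(predictions,label_test))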

             precision    recall  f1-score   support

        ham       1.00      0.96      0.98      1010
       spam       0.71      1.00      0.83       105

avg / total       0.97      0.96      0.97      1115
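
`classification_report` takes the true labels first and the predictions second. Passing them the other way round swaps precision and recall for each class. The hand-rolled sketch below shows the effect on made-up toy labels; `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, computed from paired lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    return tp / predicted, tp / actual

# Hypothetical toy labels: 4 true spam messages, of which the model catches 3.
y_true = ['spam', 'spam', 'spam', 'spam', 'ham', 'ham']
y_pred = ['spam', 'spam', 'spam', 'ham',  'ham', 'ham']

print(precision_recall(y_true, y_pred, 'spam'))   # correct argument order
print(precision_recall(y_pred, y_true, 'spam'))   # swapped argument order
```

In the correct order the toy model has perfect precision but misses a quarter of the spam; with the arguments swapped the two numbers trade places.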


5 Responses

Ashwin Perti says:

September 8, 2017 at 12:55 pm

Error

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

Richard says:

October 24, 2018 at 3:26 am

You passed the arguments in the wrong order in the last step, "classification report".
It should be: classification_report(y_true=label_test, y_pred=predictions)
so the recall for spam is only 0.71.

Rohit Jagannath says:

November 3, 2018 at 9:21 pm

Error

'utf-8' codec can't decode byte 0xe5 in position 135: invalid continuation byte

after running the line:
messages = pd.read_csv('spam.csv', sep='\t', names=['labels','message'])

This has to be corrected by specifying the proper encoding, as below:
messages = pd.read_csv('spam.csv', names=['labels','message'], encoding='latin1')

Ali Farghaly says:

December 16, 2018 at 9:02 pm

bow_transformer.fit(messages['message']) produces the following error in Python 3.7, expecting the index to be an integer, not a string:
Traceback (most recent call last):
File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\NLP_Programs\clean.py", line 39, in
bow_transformer.fit(posts['post'])
File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 164, in pandas._libs.index.IndexEngine.get_loc
KeyError: 'post'

Ram says:

December 3, 2019 at 9:14 am

It was running perfectly when I used it and the results were up to the mark!
