This scikit-learn CountVectorizer example is from PyCon Dublin 2016. For further information please visit this link. The dataset is the SMS Spam Collection from UCI.

In [2]:

messages = [line.rstrip() for line in open('smsspamcollection/SMSSpamCollection')]
In [3]:
print (len(messages))
5574
In [5]:
for num,message in enumerate(messages[:10]):
    print(num,message)
    print ('\n')
0 ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


1 ham	Ok lar... Joking wif u oni...


2 spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


3 ham	U dun say so early hor... U c already then say...


4 ham	Nah I don't think he goes to usf, he lives around here though


5 spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv


6 ham	Even my brother is not like to speak with me. They treat me like aids patent.


7 ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune


8 spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


9 spam	Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030


In [6]:
import pandas
In [7]:
messages = pandas.read_csv('smsspamcollection/SMSSpamCollection',
                           sep='\t',names=['labels','message'])
In [9]:
messages.head()
Out[9]:
   labels  message
0  ham     Go until jurong point, crazy.. Available only ...
1  ham     Ok lar... Joking wif u oni...
2  spam    Free entry in 2 a wkly comp to win FA Cup fina...
3  ham     U dun say so early hor... U c already then say...
4  ham     Nah I don't think he goes to usf, he lives aro...
In [10]:
messages.describe()
Out[10]:
        labels  message
count   5572    5572
unique  2       5169
top     ham     Sorry, I'll call later
freq    4825    30
In [11]:
messages.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
labels     5572 non-null object
message    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB
In [12]:
messages.groupby('labels').describe()
Out[12]:
                message
labels
ham     count   4825
        unique  4516
        top     Sorry, I'll call later
        freq    30
spam    count   747
        unique  653
        top     Please call our customer service representativ...
        freq    4
In [13]:
messages['length'] = messages['message'].apply(len)
messages.head()
Out[13]:
   labels  message                                            length
0  ham     Go until jurong point, crazy.. Available only ...  111
1  ham     Ok lar... Joking wif u oni...                      29
2  spam    Free entry in 2 a wkly comp to win FA Cup fina...  155
3  ham     U dun say so early hor... U c already then say...  49
4  ham     Nah I don't think he goes to usf, he lives aro...  61
In [14]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [16]:
messages['length'].plot(bins=50,kind = 'hist')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x22eca4dd240>
In [17]:
messages['length'].describe()
Out[17]:
count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64
In [20]:
messages[messages['length'] == 910]['message'].iloc[0]
Out[20]:
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."
In [22]:
messages.hist(column='length',by ='labels',bins=50,figsize = (10,4))
Out[22]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000022ECA458320>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x0000022ECAFEB978>], dtype=object)
In [23]:
import string
In [24]:
mess = 'Sample message ! Notice: it has punctuation'
In [25]:
string.punctuation
Out[25]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [26]:
nopunc = [char for char in mess if char not in string.punctuation]
In [27]:
nopunc = ''.join(nopunc)
In [28]:
nopunc
Out[28]:
'Sample message  Notice it has punctuation'
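The same punctuation stripping can also be written with str.translate, which avoids building an intermediate list of characters. A minimal sketch, reusing the sample string from the cells above; the helper name strip_punctuation is hypothetical and not part of the original notebook:

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with all characters from string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

mess = 'Sample message ! Notice: it has punctuation'
nopunc = strip_punctuation(mess)
print(repr(nopunc))
```

Whitespace is left alone, so the double space where the "!" used to be survives, just as in Out[28].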
In [30]:
from nltk.corpus import stopwords
In [31]:
stopwords.words('english')[0:10]
Out[31]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
In [32]:
nopunc.split()
Out[32]:
['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']
In [33]:
clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
In [34]:
clean_mess
Out[34]:
['Sample', 'message', 'Notice', 'punctuation']
In [35]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
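In text_process, the expression stopwords.words('english') is evaluated again for every word and the resulting list is searched linearly. One possible variant, sketched here on the assumption that the stop-word list stays fixed, builds a set once and reuses it. To keep the sketch runnable without the NLTK data, STOPWORDS below is a tiny hypothetical stand-in; in the notebook it would be set(stopwords.words('english')). The name text_process_fast is likewise hypothetical.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')),
# so that the sketch runs without downloading the NLTK corpus.
STOPWORDS = {'i', 'me', 'my', 'it', 'has', 'so', 'then', 'in', 'to', 'a'}

def text_process_fast(mess, stop_words=STOPWORDS):
    """Remove punctuation, then drop stop words; return a list of tokens."""
    # Drop punctuation characters and rebuild the string
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    # Set membership is a constant-time lookup, unlike a list scan
    return [word for word in nopunc.split() if word.lower() not in stop_words]

print(text_process_fast('Sample message ! Notice: it has punctuation'))
```

Because the function keeps the same signature for its first argument, it could be passed to CountVectorizer(analyzer=...) in the same way as text_process.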
In [36]:
messages.head()
Out[36]:
   labels  message                                            length
0  ham     Go until jurong point, crazy.. Available only ...  111
1  ham     Ok lar... Joking wif u oni...                      29
2  spam    Free entry in 2 a wkly comp to win FA Cup fina...  155
3  ham     U dun say so early hor... U c already then say...  49
4  ham     Nah I don't think he goes to usf, he lives aro...  61
In [37]:
messages['message'].head(5).apply(text_process)
Out[37]:
0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: message, dtype: object
In [40]:
from sklearn.feature_extraction.text import CountVectorizer
In [44]:
bow_transformer = CountVectorizer(analyzer=text_process)
In [45]:
bow_transformer.fit(messages['message'])
Out[45]:
CountVectorizer(analyzer=<function text_process at 0x0000022ECBC7FE18>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
In [46]:
message4 = messages['message'][3]
In [47]:
print (message4)
U dun say so early hor... U c already then say...
In [48]:
bow4 = bow_transformer.transform([message4])
In [49]:
print (bow4)
  (0, 4068)	2
  (0, 4629)	1
  (0, 5261)	1
  (0, 6204)	1
  (0, 6222)	1
  (0, 7186)	1
  (0, 9554)	2
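Each row printed above is a (document, vocabulary index) pair followed by a count. The idea can be sketched in plain Python with collections.Counter. This is an illustration of the bag-of-words representation, not scikit-learn's implementation, and the indices differ from those above because the vocabulary here is built from this one message only. The token list is the cleaned form of message 3 shown in Out[37]; sorting the vocabulary is an assumption that matches the index order seen above ('U' has the smallest index, 'say' the largest).

```python
from collections import Counter

# Cleaned tokens of message 3, as shown in Out[37]
tokens = ['U', 'dun', 'say', 'early', 'hor', 'U', 'c', 'already', 'say']

# Each distinct token gets a column index (sorted order; uppercase sorts first)
vocabulary = {term: index for index, term in enumerate(sorted(set(tokens)))}

# Count occurrences and express them as (column index, count) pairs
counts = Counter(tokens)
bow = sorted((vocabulary[term], count) for term, count in counts.items())

for index, count in bow:
    print((0, index), count)
```

As in the notebook output, there are seven distinct terms, and the two terms that appear twice ('U' and 'say') are the ones with a count of 2.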
In [50]:
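# In scikit-learn 1.0 and later, use get_feature_names_out(); get_feature_names() was removed in 1.2
print (bow_transformer.get_feature_names()[4073])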
UIN
In [51]:
print (bow_transformer.get_feature_names()[4068])
U
In [52]:
print (bow_transformer.get_feature_names()[9554])
say
In [53]:
messages_bow = bow_transformer.transform(messages['message'])
In [54]:
print ('Shape of Sparse Matrix: ', messages_bow.shape)
print ('Amount of Non-Zero occurrences: ', messages_bow.nnz)
# Note: this ratio is the share of non-zero entries (the density), not the share of zeros
print ('sparsity: %.2f%%' % (100.0 * messages_bow.nnz /
                             (messages_bow.shape[0] * messages_bow.shape[1])))
Shape of Sparse Matrix:  (5572, 11425)
Amount of Non-Zero occurrences:  50548
sparsity: 0.08%
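The percentage printed above can be reproduced from the shape and the non-zero count alone. A small sketch of that arithmetic in plain Python, which also prints the complementary figure (the share of cells that are zero), the quantity more commonly called sparsity:

```python
# Values taken from the printed output above
n_rows, n_cols = 5572, 11425
nnz = 50548

total_cells = n_rows * n_cols
density_pct = 100.0 * nnz / total_cells   # share of non-zero cells
sparsity_pct = 100.0 - density_pct        # share of zero cells

print('non-zero cells: %.2f%%' % density_pct)
print('zero cells: %.2f%%' % sparsity_pct)
```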
In [55]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bow)
In [56]:
tfidf4 = tfidf_transformer.transform(bow4)
In [57]:
print (tfidf4)
  (0, 9554)	0.538562626293
  (0, 7186)	0.438936565338
  (0, 6222)	0.318721689295
  (0, 6204)	0.299537997237
  (0, 5261)	0.297299574059
  (0, 4629)	0.266198019061
  (0, 4068)	0.408325899334
In [58]:
print (tfidf_transformer.idf_[bow_transformer.vocabulary_['u']])
print (tfidf_transformer.idf_[bow_transformer.vocabulary_['university']])
3.28005242674
8.5270764989
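With its default settings (smooth_idf=True), TfidfTransformer computes the inverse document frequency of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula by hand. The document frequencies 569 and 2 are not stated in the post; they are assumptions, back-calculated from the two values printed above.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 5572  # number of messages in the corpus

# Assumed document frequencies (not given in the post, inferred from the output)
idf_u = smooth_idf(n_documents, 569)
idf_university = smooth_idf(n_documents, 2)

print(idf_u)
print(idf_university)
```

A frequent token such as 'u' gets a low weight, while a rare one such as 'university' gets a high weight, which is the point of the IDF term.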
In [59]:
messages_tfidf = tfidf_transformer.transform(messages_bow)
In [60]:
print (messages_tfidf.shape)
(5572, 11425)
In [61]:
from sklearn.naive_bayes import MultinomialNB
In [62]:
spam_detect_model = MultinomialNB().fit(messages_tfidf,messages['labels'])
In [64]:
print ('Predicted: ',spam_detect_model.predict(tfidf4)[0] )
print ('Expected: ',messages['labels'][3])
Predicted:  ham
Expected:  ham
In [65]:
all_predictions = spam_detect_model.predict(messages_tfidf)
print (all_predictions)
['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']
In [67]:
from sklearn.metrics import classification_report
print (classification_report(messages['labels'], all_predictions))
             precision    recall  f1-score   support

        ham       0.98      1.00      0.99      4825
       spam       1.00      0.85      0.92       747

avg / total       0.98      0.98      0.98      5572

In [69]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split

msg_train, msg_test, label_train, label_test = \
train_test_split(messages['message'], messages['labels'], test_size=0.2)

print (len(msg_train), len(msg_test), len(msg_train) + len(msg_test))
4457 1115 5572
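The sizes printed above follow from test_size=0.2: the test share is rounded up and the remainder goes to the training set. A quick check of that arithmetic in plain Python, covering only the size calculation and not the shuffling:

```python
import math

n_samples = 5572
test_size = 0.2

# 0.2 * 5572 = 1114.4, which is rounded up for the test set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test, n_train + n_test)
```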
In [70]:
from sklearn.pipeline import Pipeline
In [71]:
pipeline = Pipeline([('bow',CountVectorizer(analyzer =text_process)),
                    ('tfidf',TfidfTransformer()),
                    ('classifier',MultinomialNB())])
In [72]:
pipeline.fit(msg_train,label_train)
Out[72]:
Pipeline(steps=[('bow', CountVectorizer(analyzer=<function text_process at 0x0000022ECBC7FE18>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
In [73]:
predictions = pipeline.predict(msg_test)
In [74]:
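# Note: classification_report expects (y_true, y_pred); the arguments here are swapped,
# so the precision and recall columns in the output below are exchanged.
print (classification_report(predictions,label_test))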
             precision    recall  f1-score   support

        ham       1.00      0.96      0.98      1010
       spam       0.71      1.00      0.83       105

avg / total       0.97      0.96      0.97      1115
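classification_report takes the true labels first and the predictions second, whereas the call above passes predictions first. Precision and recall mirror each other with respect to those two arguments, so swapping them exchanges the two columns. A small pure-Python sketch on made-up labels; the helper precision_recall and the toy data are hypothetical and for illustration only:

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: one spam message is missed by the classifier
y_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham']
y_pred = ['spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']

correct_order = precision_recall(y_true, y_pred, 'spam')
swapped_order = precision_recall(y_pred, y_true, 'spam')
print('correct order (precision, recall):', correct_order)
print('swapped order (precision, recall):', swapped_order)
```

In the correct order the missed spam message lowers recall; in the swapped order the same mistake shows up as lowered precision instead.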

5 Responses

  1. You passed the arguments in the wrong order in the last step, "classification report".
    It should be: classification_report(y_true=label_test, y_pred=predictions)
    so the recall rate for spam is only 0.71

  2. Error

    'utf-8' codec can't decode byte 0xe5 in position 135: invalid continuation byte

    after running the line:
    messages = pd.read_csv('spam.csv', sep='\t', names=['labels','message'])

    This has to be corrected by giving proper encoding as below:
    messages = pd.read_csv('spam.csv', names=['labels','message'], encoding='latin1')

  3. bow_transformer.fit(messages['message']) produces the following error in Python 3.7, expecting the index to be an integer, not a string
    Traceback (most recent call last):
    File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\NLP_Programs\clean.py", line 39, in
    bow_transformer.fit(posts['post'])
    File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
    result = self.index.get_value(self, key)
    File "C:\Users\NLP\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
    tz=getattr(series.dtype, 'tz', None))
    File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
    File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
    File "pandas\_libs\index.pyx", line 164, in pandas._libs.index.IndexEngine.get_loc
    KeyError: 'post'