Data Science – A Data Analyst

The AWS Machine Learning Framework: A comprehensive guide to bringing models to production

Data scientists develop a functioning model using a machine learning framework and some data. The resulting model has a low error rate and a tuned set of hyperparameters. But when companies decide to deploy this model to a production environment, they realize that model building is only the first step and that a greater challenge remains. Converting functioning […]

CountVectorizer sklearn example

This CountVectorizer sklearn example is from PyCon Dublin 2016. For further information please visit this link. The dataset is from UCI.

In [2]: messages = [line.rstrip() for line in open('smsspamcollection/SMSSpamCollection')]

In [3]: print(len(messages))
5574

In [5]: for num, message in enumerate(messages[:10]):
            print(num, message)
            print('\n')
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e […]
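To make the idea of count vectorization concrete, here is a minimal pure-Python sketch of the bag-of-words counting that a count vectorizer performs. It is a stand-in for illustration, not scikit-learn's `CountVectorizer`: the helper names `build_vocabulary` and `count_vectorize` and the two toy messages are assumptions, and tokens are split on whitespace only, which is simpler than scikit-learn's default tokenization.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vectorize(docs, vocab):
    """Return one row of token counts per document, with columns ordered as in vocab."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


# Toy messages invented for this sketch (not taken from the UCI dataset).
docs = ["free prize call now", "call me now now"]
vocab = build_vocabulary(docs)
matrix = count_vectorize(docs, vocab)

print(list(vocab))
for row in matrix:
    print(row)
```

Each row is a document and each column is a vocabulary term, so a repeated word such as "now" in the second message shows up as a count of 2 in its column.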

What Makes a Really Good Diamond?

The aim of this blog is to assess the quality and characteristics of diamonds and gain insights into what makes a really good diamond. The data set is from ggplot2. The exploratory data analysis is done in Python and the notebooks are available on my GitHub. This blog addresses a few important questions such as: […]

Coding FP-growth algorithm in Python 3

FP-growth algorithm

Have you ever gone to a search engine, typed in a word or part of a word, and the search engine automatically completed the search term for you? Perhaps it recommended something you didn’t even know existed, and you searched for that instead. This requires a way to find frequent itemsets efficiently. FP-growth […]

AdaBoost (Python 3)

AdaBoost

The AdaBoost (adaptive boosting) algorithm was proposed in 1995 by Yoav Freund and Robert Schapire as a general method for generating a strong classifier out of a set of weak classifiers. AdaBoost works even when the classifiers come from a continuum of potential classifiers (such as neural networks, linear discriminants, etc.).

AdaBoost Pros: […]
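The way AdaBoost combines weak classifiers rests on reweighting the training samples after each round. The following is a minimal sketch of one round of the standard discrete AdaBoost update, assuming labels and predictions in {-1, +1} and a weighted error strictly between 0 and 1. The function name `adaboost_round` and the four-sample data are made up for illustration.

```python
import math


def adaboost_round(weights, labels, predictions):
    """Run one discrete AdaBoost round.

    Returns the weak classifier's vote weight (alpha) and the renormalized
    sample weights. Assumes labels/predictions are -1 or +1 and that the
    weighted error lies strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase the weight of mistakes, decrease the weight of correct samples.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical toy data: the weak classifier gets the second sample wrong.
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]

alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha)
print(new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.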

Apriori Algorithm (Python 3.0)

Apriori Algorithm

The Apriori principle says that if an itemset is frequent, then all of its subsets are frequent. This means that if {0,1} is frequent, then {0} and {1} have to be frequent. The rule turned around says that if an itemset is infrequent, then its supersets are also infrequent. We first need to […]
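The pruning rule described above can be sketched as a level-wise search: a candidate of size k is only counted if every one of its (k-1)-subsets was already found frequent. This is a minimal illustration rather than the post's implementation. The function name `frequent_itemsets`, the toy transactions, and the 0.5 support threshold are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        previous = set(current)
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate with an
        # infrequent (k-1)-subset before counting its support.
        current = [c for c in candidates
                   if all(frozenset(sub) in previous
                          for sub in combinations(c, k - 1))
                   and support(c) >= min_support]
    return frequent


# Hypothetical toy transactions.
transactions = [{0, 1, 2}, {0, 1}, {0, 2}, {1, 3}]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In this toy data {1,2} is infrequent, so the candidate {0,1,2} is discarded without its support ever being counted.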

Principal Component Analysis in scikit-learn

Principal Component Analysis (PCA) is an orthogonal linear transformation that turns a set of possibly correlated variables into a new set of variables that are as uncorrelated as possible. The new variables lie in a new coordinate system such that the greatest variance is obtained by projecting the data onto the first coordinate, the second […]
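For two variables, the transformation described above can be written out by hand, which makes the "uncorrelated" and "greatest variance first" properties easy to check. The sketch below is a pure-Python, 2D-only illustration that uses the closed-form rotation angle diagonalizing a 2x2 covariance matrix. It is not scikit-learn's `PCA`, and the function name `pca_2d` and the sample points are assumptions.

```python
import math


def pca_2d(points):
    """Return the two principal directions and the projected (centered) data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Rotation angle that diagonalizes the 2x2 covariance matrix;
    # the direction at this angle has the largest variance.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))

    # Coordinates of each centered point in the new (pc1, pc2) system.
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    return pc1, pc2, scores


# Hypothetical, strongly correlated sample points.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
pc1, pc2, scores = pca_2d(points)

var1 = sum(a * a for a, _ in scores) / (len(scores) - 1)
var2 = sum(b * b for _, b in scores) / (len(scores) - 1)
cov12 = sum(a * b for a, b in scores) / (len(scores) - 1)
print(pc1, var1, var2, cov12)
```

The projected coordinates have (numerically) zero covariance, and the first coordinate carries the larger share of the variance.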

Naïve Bayes in scikit-learn

Naïve Bayes is a simple but powerful classifier based on a probabilistic model derived from Bayes’ theorem. Basically, it determines the probability that an instance belongs to a class based on each of the feature value probabilities. One of the most successful applications of Naïve Bayes has been within the field of Natural Language […]
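As a rough illustration of how per-feature probabilities combine into a class probability, the sketch below multiplies a class prior by the likelihood of each observed feature (in log space) and then normalizes across classes. The function name `naive_bayes_posterior` and all probabilities are invented for this example. In practice they would be estimated from training data.

```python
import math


def naive_bayes_posterior(priors, likelihoods, features):
    """Return P(class | features) under the naive independence assumption."""
    log_scores = {}
    for cls, prior in priors.items():
        # log P(class) + sum of log P(feature | class) over observed features.
        log_scores[cls] = math.log(prior) + sum(
            math.log(likelihoods[cls][f]) for f in features
        )
    # Normalize with the log-sum-exp trick to avoid underflow.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {cls: math.exp(s - top) / total for cls, s in log_scores.items()}


# Made-up probabilities for a two-class text example.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "call": 0.2},
    "ham": {"free": 0.1, "call": 0.3},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free", "call"])
print(posterior)
```

Here the unnormalized scores are 0.4 × 0.5 × 0.2 = 0.04 for spam and 0.6 × 0.1 × 0.3 = 0.018 for ham, so the message is assigned to spam with probability 0.04 / 0.058.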

Decision Trees in scikit-learn

Decision trees are simple yet powerful supervised learning methods that construct a tree model used to make predictions. The main advantage of this model is that a human being can easily understand and reproduce the sequence of decisions (especially if the number of attributes is small) taken to predict the […]

Regression in scikit-learn

We will compare several regression methods by using the same dataset. We will try to predict the price of a house as a function of its attributes.

In [6]: import numpy as np
        import matplotlib.pyplot as plt
        %pylab inline
Populating the interactive namespace from numpy and matplotlib

Import the Boston House Pricing Dataset

In [9]: from sklearn.datasets […]

Tag: Data Science