A Pythonic data science project: Part I
A complete predictive modeling project in Python
Part I: Preprocessing and exploratory analysis
One of the amazing things about data science is the ability to tackle complex problems, involving hidden phenomena that interact with each other in parallel, using only the data they produce.
As an example, we will use data extracted from images of forged and genuine banknotes. Telling the two categories apart is usually thought to require deep domain expertise, which limits the ability to check more than a few banknotes at a time. An automated and trustworthy test would be of interest to many businesses, governments and organizations.
Starting from the data provided by H. Dörksen and Volker Lohweg, from the University of Applied Sciences Ostwestfalen-Lippe, Germany, on the UCI Machine Learning Repository, we will follow the key steps of a data science project to build a performant yet scalable classifier.
The dataset was built by applying a wavelet transform on images of banknotes to extract 4 features:

Variance, skewness, kurtosis of the wavelet transform (respectively second, third and fourth moment of the distribution).

Entropy of the image, which can be interpreted as the amount of information or randomness (which is represented by how different adjacent pixels are).
You can find further information on wavelets on Wikipedia or ask on Quora. An explanation of entropy as meant in the image processing context can be found here.
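To make the four features more concrete, here is a minimal sketch of how such summary statistics can be computed with numpy and scipy. It is not the pipeline used to build the dataset: there is no wavelet transform, the input is random noise, and the function name summarize, the bins parameter and the histogram-based entropy are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def summarize(values, bins=16):
    """Return variance, skewness, kurtosis and histogram entropy (in bits)."""
    values = np.asarray(values, dtype=float).ravel()
    # Entropy is estimated from a histogram of the values (an assumption;
    # the dataset authors may have used a different definition).
    counts, _ = np.histogram(values, bins=bins)
    return {
        "variance": float(np.var(values)),
        "skewness": float(stats.skew(values)),
        # scipy returns the excess kurtosis, which is 0 for a Gaussian
        "kurtosis": float(stats.kurtosis(values)),
        "entropy": float(stats.entropy(counts, base=2)),
    }


# Random noise standing in for transformed image coefficients (hypothetical data)
rng = np.random.RandomState(0)
fake_coefficients = rng.normal(size=(32, 32))
print(summarize(fake_coefficients))
```

With values spread evenly over the bins, the entropy reaches its maximum (2 bits for 4 bins), which matches the reading of entropy as an amount of randomness.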
To get a better understanding of the way the algorithms work, the full model will be built from scratch, or almost (without using a machine learning library like scikit-learn in Python or caret in R).
Basic statistical notions (variance, linear regression) and some basic Python knowledge are recommended to follow the three articles.
Programming choices and libraries
Language and environment
We use Python, which is a great compromise between practicality (with handy data formats and manipulation) and scalability (much easier to deploy for large-scale, automated computation than R, Octave or Matlab). More precisely, we use Python 3.5.1 with the Anaconda distribution 2.4.0. I personally use the Spyder environment, but feel free to keep your favorite tools.
Libraries
 collections (built-in) for occurrence counting
 numpy 1.10.1, providing the key data format and mathematical manipulation techniques
 scipy 0.16.0, imported here for the distance matrix computation, and its stats submodule for Quantile-Quantile plots
 pandas 0.17.1 for advanced data formats, high-level manipulation and visualization
 pyplot from matplotlib 1.5.0 for basic visualization
 ggplot 0.6.8, which I think is a much improved way to visualize data
 urllib3 to download the data directly from the repository (no manual download)
So our first lines of code (once you have placed your data in the proper repository) should look like this:
import numpy as np
import pandas as pd
import ggplot
from matplotlib import pyplot as plt
import scipy.stats as stats
import scipy.spatial.distance
from collections import Counter
import urllib3
Source files
The source files will be available on the corresponding GitHub repository. These include:
 preprocess.py to load the data and libraries
 exploratory.py for preliminary visualization
 feature_eng.py where the data will be transformed to boost the model performance
 model_GLM.py where we define key functions and build our model
 model.py where we will visualize characteristics of the model
Dataset overview and exploratory analysis
Understanding the intuitive phenomena in the data and testing its underlying structure are the objectives of this first (usually long) phase of a data science project, especially if you were not involved in the data collection process.
Data parsing
Instead of manually downloading the data and placing it in our project repository, we will download it using the urllib3 library.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
# download the raw file once and store it next to the scripts
http = urllib3.PoolManager()
r = http.request('GET', url)
with open("data_banknote_authentication.txt", 'wb') as f:
    f.write(r.data)
r.release_conn()
# the file has no header row, so the column names are given explicitly
data0 = pd.read_csv("data_banknote_authentication.txt",
                    names=["vari", "skew", "kurtosis", "entropy", "class"])
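The role of the names argument can be checked without any download. The sketch below parses two made-up rows (not taken from the dataset) from an in-memory buffer; because names is given and the text has no header row, pandas treats the first line as data.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the file: four features, then the class label.
raw = io.StringIO(
    "1.5,-2.0,0.3,-0.7,0\n"
    "-3.1,4.2,-1.0,0.1,1\n"
)
columns = ["vari", "skew", "kurtosis", "entropy", "class"]

# Passing names without header makes pandas read the first line as data.
sample = pd.read_csv(raw, names=columns)
print(sample.dtypes)
```

The four features are parsed as floats and the class as an integer, which is what the later steps rely on.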
Key statistics and overview
Since the data were loaded using pandas, key methods of the DataFrame object can be used to find some key information in the data.
data0.describe()
              vari         skew     kurtosis      entropy        class
count  1372.000000  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657     0.444606
std       2.842763     5.869047     4.310030     2.101013     0.497103
min      -7.042100   -13.773100    -5.286100    -8.548200     0.000000
25%      -1.773000    -1.708200    -1.574975    -2.413450     0.000000
50%       0.496180     2.319650     0.616630    -0.586650     0.000000
75%       2.821475     6.814625     3.179250     0.394810     1.000000
max       6.824800    12.951600    17.927400     2.449500     1.000000
Negative values appear in the variance and the entropy, although both quantities are non-negative in theory, so we can deduce that some preprocessing operations were already performed on the data.
We are trying to detect forged banknotes using the extracted features. The dataset contains 1372 observations, including 610 forged banknotes, so about 44%. The two classes are therefore reasonably balanced, which might be relevant for some algorithms. Indeed, in Bayesian reasoning, the more frequent a category of the characteristic of interest (here, whether the banknote is genuine or not), the higher the prior probability of that outcome.
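As a quick check of these proportions, the class counts can be turned into empirical priors with the Counter class imported earlier. The sketch below rebuilds a stand-in label list from the figures quoted above (1372 observations, 610 of them in class 1) instead of loading the file; with the real data, labels would simply be data0["class"].

```python
from collections import Counter

# Stand-in for data0["class"], rebuilt from the counts quoted in the text
labels = [0] * 762 + [1] * 610

counts = Counter(labels)
total = sum(counts.values())

# Empirical prior probability of each class
priors = {label: n / total for label, n in counts.items()}
print(counts, priors)
```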
Kernel Density Estimation for each variable by class
KDEs are powerful tools to understand how one-dimensional data are distributed. The estimate can also be split by class to find differences between the distributions. Using ggplot, with the class mapped to the color aesthetic, the plots can be generated and saved as follows:
for v in data0.columns[:4]:
    ggplot.ggsave(
        ggplot.ggplot(ggplot.aes(x=v, color='class'), data=data0) +
        ggplot.geom_density() +
        ggplot.geom_point(ggplot.aes(y=0), alpha=0.2) +
        ggplot.labs(title='KDE ' + v, x=v, y="KDE"),
        'KDE_' + v + '.png', width=18, height=12)
Using this first simple visualization technique, we can deduce that the variance may be much more effective than the kurtosis at separating the two banknote categories.
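The visual impression of how well a variable separates the classes can be given a rough number: the area shared by the two class-wise density estimates. The sketch below uses scipy's gaussian_kde on synthetic samples, not on the banknote data, and the overlap measure is an illustrative choice rather than something used later in this series.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Synthetic stand-ins for one feature in each class (not the banknote data)
class0 = rng.normal(loc=2.0, scale=1.0, size=300)
class1 = rng.normal(loc=-2.0, scale=1.0, size=300)

# Evaluate both kernel density estimates on a common grid
grid = np.linspace(-7.0, 7.0, 400)
kde0 = stats.gaussian_kde(class0)(grid)
kde1 = stats.gaussian_kde(class1)(grid)

# Area shared by the two curves: close to 0 when the classes are well
# separated, close to 1 when the two distributions are nearly identical.
dx = grid[1] - grid[0]
overlap = float(np.sum(np.minimum(kde0, kde1)) * dx)
print(round(overlap, 3))
```

A variable whose two class densities barely overlap is a good candidate for separating the classes on its own.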
Visualizing variable combinations with scatter plots
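We generate a list of colors using a list comprehension:
col = ['r' if i == 1 else 'b' for i in data0["class"]]
# written for the pandas 0.17.1 used here; recent versions expose this as
# pd.plotting.scatter_matrix and replace .ix with .iloc
pd.tools.plotting.scatter_matrix(data0.ix[:, :4], figsize=(6, 3),
                                 color=col, diagonal='kde')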
A scatter plot is the most straightforward way to spot intuitive and obvious patterns in the data. It is especially effective when the number of variables and classes is limited, as in our data set. It allows us to see class-dependent, non-linear relationships between variables.
This is much more informative than a single statistic such as the correlation coefficient, which would not have found the skewness and entropy to be related. Given these rather strong relationships between variables, we now know that techniques assuming independent features might not perform well here.
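To see why a correlation coefficient can miss such relationships, consider a toy case where one variable is fully determined by the other, but not linearly. The data below are synthetic and only illustrate the principle.

```python
import numpy as np

# x is symmetric around 0 and y depends on x only through its square
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation measures linear association only
pearson = np.corrcoef(x, y)[0, 1]
print(abs(pearson) < 1e-8)
```

The coefficient is zero up to rounding error even though y is a deterministic function of x, whereas a scatter plot would show the parabola immediately.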
Testing a distribution with Quantile-Quantile plots
# Subsetting the data by class
d0 = data0[data0["class"] == 0]
d1 = data0[data0["class"] == 1]
# For each variable
for v in data0.columns[:4]:
    # set the figure size
    plt.figure(figsize=(9, 4))
    # define two subplots
    ax1 = plt.subplot(121)
    # compute the quantile-quantile plot with normal distribution
    stats.probplot(d0[v], dist='norm', plot=plt)
    # add title
    plt.title("Normal QQ-plot " + v + " - Class 0")
    ax2 = plt.subplot(122)
    stats.probplot(d1[v], dist='norm', plot=plt)
    plt.title("Normal QQ-plot " + v + " - Class 1")
    # the output size is set by figsize above; savefig takes no width/height arguments
    plt.savefig("qqplot_" + v + ".png")
    plt.show()
Even though some variables are quite far from normally distributed, the normality hypothesis would still be acceptable for some model-based learning algorithms that rely on properties of Gaussian variables.
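The visual reading of a Quantile-Quantile plot can be complemented by a number: with its default settings, probplot also returns the correlation coefficient r of the least-squares line fitted through the plotted points, which is close to 1 when the sample is close to normal. The sketch below compares two synthetic samples (not the banknote data); no threshold on r is implied.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
gaussian_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

# With the default fit=True, probplot returns
# ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_gaussian) = stats.probplot(gaussian_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(r_gaussian, r_skewed)
```

The skewed sample gives a visibly lower r, which mirrors the curved shape its points would take on the plot.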
Nonparametric distribution with boxplots
Boxplots represent the data using the 25th, 50th and 75th percentiles, which can be more robust than the mean and variance. The pandas library offers a quick method to draw boxplots for each class and variable, highlighting the differences in the spread of the data.
data0.groupby("class").boxplot(figsize=(9,5))
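The robustness of percentiles compared with the mean can be seen on a toy example. The values below are made up: a single extreme observation is appended to a small sample, and the quartiles are computed with numpy's default linear interpolation.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 1000.0)   # one extreme observation

# Quartiles, i.e. the three levels drawn by the box of a boxplot
q1, median, q3 = np.percentile(with_outlier, [25, 50, 75])

print(values.mean(), with_outlier.mean())  # the mean is pulled far away
print(np.median(values), median)           # the median barely moves
```

The outlier multiplies the mean by more than fifty, while the median only moves from 3.0 to 3.5.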
The differences in spread highlighted by the boxplots will be useful in the next part, when the data are transformed to enhance the performance and robustness of the predictive models.
So see you in the next part for feature engineering!