A Tutorial on Python

  • Author: Ajay Ohri
  • Date: 28 Jan 2016

Python is a very widely used programming language. Created by Guido van Rossum in 1989, it has grown into one of the most popular languages in the world. In data science, Python has increasingly made strides thanks to the pandas package as well as the efforts of the PyData community. Companies like Continuum Analytics, Enthought and Civis Analytics are both creating tools for and actually using Python for data science. Organizations like DataKind, Codecademy and Dataquest offer free online education in Python. Unlike the R language, Python has two major versions in current use, Python 2 and Python 3, but just like R it is free and open source.

Python's core design goals remain crisp, readable lines of code, significant whitespace and indentation to delimit blocks, and a sparse grammar. People interested in knowing more about Python can go to its home page at https://www.python.org/.


Data science lies at the intersection of programming, statistics and business analysis. It is the use of programming tools with statistical techniques to analyze data in a systematic and scientific way. Accordingly, this tutorial focuses on the statistical and programming parts of data science. Data scientists would also be interested in the PyData community at http://pydata.org/. Why use Python for data science? Python has surprising capabilities in data analysis and data visualization thanks to the new generation of packages being created.

Here is a brief tutorial in Pythonic data science. Some prerequisites are given below.

Installations:

1) Download and install Anaconda from https://www.continuum.io/downloads (alternatives are Canopy Express from https://store.enthought.com/ or just the core implementation from https://www.python.org/downloads/).
2) Download and install the Jupyter Notebook interface: http://jupyter.readthedocs.org/en/latest/install.html
3) You can use pip or easy_install to install packages. There are more than 72,000 Python packages available at https://pypi.python.org/pypi. You can browse Python packages by topic at https://pypi.python.org/pypi?%3Aaction=browse

Packages for Data Science - Some important packages for data scientists to use in Python are:

1) Pandas - http://pandas.pydata.org/. Pandas gives users the familiar dataframe format, in which rows are observations and columns are variables, along with a wide variety of useful data analysis features.
2) Scikit-learn - http://scikit-learn.org/. Scikit-learn is a widely used machine learning package for data mining and modeling.
3) Statsmodels - http://statsmodels.sourceforge.net/. Statsmodels makes statistical tests and models available in Python.
4) Seaborn - http://stanford.edu/~mwaskom/software/seaborn/. Seaborn brings statistical data visualization to Python.
5) Pandasql - https://pypi.python.org/pypi/pandasql. This package allows SQL syntax on dataframes and is thus similar to the sqldf package in R.
6) ggplot - http://ggplot.yhathq.com/. This is an implementation of the Grammar of Graphics in Python. You can practically reuse the same ggplot2 code from R with this package.
7) SQLAlchemy - http://www.sqlalchemy.org/. This tool allows you to connect to and query databases. A quick sketch for verifying these installations follows this list.
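
To verify that these packages installed correctly, one quick check (a minimal sketch; each of these packages exposes a __version__ string) is to import them and print their versions:

import pandas, sklearn, statsmodels, seaborn
for pkg in (pandas, sklearn, statsmodels, seaborn):
    print(pkg.__name__, pkg.__version__) #prints the import name and installed version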

Tutorial Overview

1) You can write markdown within a Jupyter notebook by changing the cell type from Code to Markdown. You can also install and work with R using the IRkernel. This makes the code more readable and makes it very easy to switch between kernels.
2) Install packages from within the Jupyter notebook by prefixing the shell command with a ! sign.
3) Import (or load) packages using the following syntax: import package, import package as pkg, or from package import function. This is similar to the library() function in R.
4) Read in data using read_csv or similar input functions from pandas (http://pandas.pydata.org/pandas-docs/stable/io.html).
5) Inspect data using the info and head methods.
6) Slice data using the query method, the index, or the column name.
7) Summarize data using the describe, groupby and value_counts functions.
8) Use dir on an object to find out everything that can be done with it.
9) Visualize using various plots from the seaborn and ggplot packages.
10) Build a regression model using statsmodels with the familiar formula method (dependent_var ~ independent_var1 + independent_var2 + ...).
11) Learn about additional tools useful for data scientists.

Detailed Tutorial

1) Install packages from within the Jupyter Notebook. Use the --upgrade flag to upgrade existing packages.

In [1]: ! sudo pip install pandas --upgrade

2) Load the package. You can load a Python package in the following ways: import PACKAGE, import PACKAGE as PK, or from PACKAGE import FUN. You can then invoke a function using PACKAGE.FUN, PK.FUN and FUN respectively.

In [2]:import pandas as pd
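
For instance, the three import styles look like this in practice (pd is just the conventional alias for pandas):

import pandas               #invoke as pandas.read_csv(...)
import pandas as pd         #invoke as pd.read_csv(...)
from pandas import read_csv #invoke as read_csv(...)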

3) Import data. We use read_csv from pandas to import a csv file. Note that Jupyter automatically applies syntax highlighting so that code, functions and comments are easily readable. If the file is stored locally, we can use the os library to navigate to it.

In [3]: import os
os.getcwd() #current working directory
Out[3]:'/home/ajay/Dropbox/PYTHON BOOK WILEY/FINAL'
In [4]:os.chdir('/home/ajay/Desktop/test') #change current working directory
In [5]:os.listdir(os.getcwd()) #list files in directory
Out[5]:['adult.data.txt']
In [6]:adult=pd.read_csv("adult.data.txt",header=None) #read data

'''Let's get some information on the object. This is a multi-line comment using three single quote marks.'''

4) Let's use a dataset from R for familiarity. We will use the diamonds dataset bundled with R's ggplot2 package, available from https://vincentarelbundock.github.io/Rdatasets/datasets.html.

In [12]:diamonds=pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv")

5) We can use len to find the number of observations, and type to find an object's class. Using info we can get all of this information on the object at once.

In [7]:diamonds.info()

Int64Index: 53940 entries, 0 to 53939

Data columns (total 11 columns):

Unnamed: 0 53940 non-null int64
carat 53940 non-null float64
cut 53940 non-null object
color 53940 non-null object
clarity 53940 non-null object
depth 53940 non-null float64
table 53940 non-null float64
price 53940 non-null int64
x 53940 non-null float64
y 53940 non-null float64
z 53940 non-null float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.3+ MB

6) To find out everything we can do with an object, we can just use the dir command on it, i.e. dir(diamonds). We can use head to inspect the first few rows, .ix to select rows by index number, and double square brackets with column names in quotes to select by column name (see the note after the examples below on .ix in newer pandas). Note that we can chain multiple commands in Python very easily.

In [8]:diamonds2=diamonds.drop('Unnamed: 0', 1) #Dropping a particular variable
diamonds2.head()

Out[8]:

carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

In [9]:diamonds.ix[20:28] #refers to the 21st to 29th row since index starts from 0.

Out[9]:

carat cut color clarity depth table price x y z
20 0.30 Good I SI2 63.3 56 351 4.26 4.30 2.71
21 0.23 Very Good E VS2 63.8 55 352 3.85 3.92 2.48
22 0.23 Very Good H VS1 61.0 57 353 3.94 3.96 2.41
23 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62
24 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59
25 0.23 Very Good G VVS2 60.4 58 354 3.97 4.01 2.41
26 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
27 0.30 Very Good J VS2 62.2 57 357 4.28 4.30 2.67
28 0.23 Very Good D VS2 60.5 61 357 3.96 3.97 2.40

  

In [10]:diamonds.ix[20:25].cut
Out[10]:
20 Good
21 Very Good
22 Very Good
23 Very Good
24 Very Good
25 Very Good
Name: cut, dtype: object
In [11]:diamonds[['color','cut','price']].head() #Note the double square brackets [[]]
Out[11]:

color cut price
0 E Ideal 326
1 E Premium 326
2 E Good 327
3 I Premium 334
4 J Good 335
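
A note for readers on newer pandas versions: the .ix indexer used above was later deprecated and removed; the equivalent selections are written with .loc for labels and .iloc for positions. A minimal sketch:

diamonds.loc[20:28]        #label-based selection, inclusive of label 28
diamonds.iloc[20:29]       #position-based selection, exclusive of position 29
diamonds.loc[20:25, 'cut'] #rows labelled 20 to 25 of the cut column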

7) Conditional selection - We can use the query method for conditional selection of data.
In [12]:diamonds.query('carat >3 and color =="J"')
Out[12]:

carat cut color clarity depth table price x y z
21758 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
25999 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24
26467 3.01 Ideal J SI2 61.7 58 16037 9.25 9.20 5.69
26744 3.01 Ideal J I1 65.4 60 16538 8.99 8.93 5.86
27415 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
27630 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
27679 3.51 Premium J VS2 62.5 59 18701 9.66 9.63 6.03
27684 3.01 Premium J SI2 60.7 59 18710 9.35 9.22 5.64
27685 3.01 Premium J SI2 59.7 58 18710 9.41 9.32 5.59
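
The same conditional selection can also be written with a boolean mask, the other common pandas idiom; a minimal equivalent sketch:

diamonds[(diamonds.carat > 3) & (diamonds.color == "J")] #note the parentheses and & rather than 'and'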


8) Data summary is done in pandas with describe for numerical variables and value_counts for categorical variables. Numerical correlations are given by the corr method, and unique values by the unique method.

In [13]:diamonds.price.describe()
Out[13]:

count 53940.000000
mean 3932.799722
std 3989.439738
min 326.000000
25% 950.000000
50% 2401.000000
75% 5324.250000
max 18823.000000
Name: price, dtype: float64

In [14]:diamonds.corr() #Numerical Correlations
Out[14]:

carat depth table price x y z
carat 1.000000 0.028224 0.181618 0.921591 0.975094 0.951722 0.953387
depth 0.028224 1.000000 -0.295779 -0.010647 -0.025289 -0.029341 0.094924
table 0.181618 -0.295779 1.000000 0.127134 0.195344 0.183760 0.150929
price 0.921591 -0.010647 0.127134 1.000000 0.884435 0.865421 0.861249
x 0.975094 -0.025289 0.195344 0.884435 1.000000 0.974701 0.970772
y 0.951722 -0.029341 0.183760 0.865421 0.974701 1.000000 0.952006
z 0.953387 0.094924 0.150929 0.861249 0.970772 0.952006 1.000000

In [15]:diamonds['cut'].unique()
Out[15]:array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

In [16]:pd.value_counts(diamonds.cut)
Out[16]:

Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: cut, dtype: int64

Note - To run a command on a particular column instead of the entire data frame, just use dot notation with the column name (i.e. diamonds.price instead of diamonds). This is analogous to R's $ notation.

9) Group-by summaries are done with the groupby method, and cross tabulation with crosstab.

In [17]:cutgroup=diamonds.groupby(diamonds.cut)
In [18]:type(cutgroup)
Out[18]:
pandas.core.groupby.DataFrameGroupBy
In [19]:cutgroup.price.median()
Out[19]:
cut
Fair 3282.0
Good 3050.5
Ideal 1810.0
Premium 3185.0
Very Good 2648.0
Name: price, dtype: float64

In [20]:pd.crosstab(diamonds.cut,diamonds.color,margins=True)
Out[20]:

color D E F G H I J All
cut
Fair 163 224 312 314 303 175 119 1610
Good 662 933 909 871 702 522 307 4906
Ideal 2834 3903 3826 4884 3115 2093 896 21551
Premium 1603 2337 2331 2924 2360 1428 808 13791
Very Good 1513 2400 2164 2299 1824 1204 678 12082
All 6775 9797 9542 11292 8304 5422 2808 53940



Note - We can use dropna to remove missing values in Python, i.e. diamonds = diamonds.dropna(how='any').
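
A minimal sketch of the common missing-value operations (illustrative only, since this diamonds data has no missing values):

diamonds.isnull().sum()               #count missing values per column
diamonds = diamonds.dropna(how='any') #drop rows containing any missing value
#diamonds = diamonds.fillna(0)        #or fill missing values with a constant instead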


10) We can also pivot data like a pivot table using the pivot method.
In [21]:e=diamonds.groupby(['cut', "color"]).price.median().reset_index()
e.pivot(index='cut', columns='color', values='price')
Out[21]:

color D E F G H I J
cut
Fair 3730.0 2956.0 3035 3057.0 3816.0 3246.0 3302
Good 2728.5 2420.0 2647 3340.0 3468.5 3639.0 3733
Ideal 1576.0 1437.0 1775 1857.5 2659.0 2659.0 4096
Premium 2009.0 1928.0 2841 2745.0 4640.0 4640.0 5063
Very Good 2310.0 1989.5 2471 2437.0 3888.0 3888.0 4113
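
The same table can also be produced in one step with pandas' pivot_table function, which aggregates and pivots together; a minimal sketch:

pd.pivot_table(diamonds, values='price', index='cut', columns='color', aggfunc='median')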

11) Using SQL - Python has the pandasql package thanks to the team at Yhat (who also made the Rodeo IDE). It is similar to the sqldf package in R in that it allows the user to write SQL queries against a data frame object. Note that you need to ensure table names are consistent with SQLite table name conventions (so it makes sense to drop or rename any column whose name contains special characters).
In [22]:from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
In [23]:pysqldf("SELECT * FROM diamonds2 LIMIT 5;")
Out[23]:

carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75


In [24]:pysqldf("SELECT * FROM diamonds2 WHERE carat >4 ;")
Out[24]:

carat cut color clarity depth table price x y z
0 4.01 Premium I I1 61.0 61 15223 10.14 10.10 6.17
1 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24
2 4.13 Fair H I1 64.8 61 17329 10.00 9.85 6.43
3 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
4 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
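
pandasql also handles grouped aggregation. Since the SQLite dialect it uses has AVG and COUNT but no MEDIAN function, a sketch of a group-by query would look like:

pysqldf("SELECT cut, AVG(price) AS avg_price, COUNT(*) AS n FROM diamonds2 GROUP BY cut ORDER BY avg_price DESC;")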

12) For data visualization I am going to first use the excellent seaborn package from http://stanford.edu/~mwaskom/software/seaborn/index.html. Histograms, boxplots, scatterplots and jointplots are very easily plotted using seaborn.

In [25]:import seaborn as sns #seaborn must be imported before use
sns.distplot(diamonds.price, bins=20, kde=True, rug=False);

In [25]:ax = sns.boxplot(x="color", y="price", data=diamonds)

In [26]:sns.jointplot('price','carat',data=diamonds2)
Out[26]:

In [27]:sns.factorplot(x="color", y="price",
col="cut", data=diamonds, kind="box", size=4, aspect=.5);

13) For data visualization, I can also use the ggplot package created by Yhat (who also created pandasql and Rodeo, an RStudio-style editor for Python). It implements the grammar of graphics as created by Wilkinson and popularized by Hadley Wickham.
In [28]:from ggplot import * #ggplot's plotting functions are designed for a star import
p = ggplot(aes(x='price', y='carat', color="clarity"), data=diamonds)
p + geom_point()


Out[28]:


14) For regression models, a widely used data science technique in business, I can also use the statsmodels package.
In [80]:import statsmodels.formula.api as sm
In [81]:boston=pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv")
In [82]:boston =boston.drop('Unnamed: 0', 1)
In [83]:boston.head()
Out[83]:


crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 2 222 18.7 394.63 2.94 33.4
4 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 2 222 18.7 396.90 5.33 36.2

In [87]:result = sm.ols(formula="medv ~ crim + zn + nox + ptratio + black + rm", data=boston).fit()
result.summary()
Out[87]:

Dep. Variable: medv    R-squared: 0.631
Model: OLS    Adj. R-squared: 0.626
Method: Least Squares    F-statistic: 142.0
Date: Fri, 22 Jan 2016    Prob (F-statistic): 1.49e-104
Time: 13:22:42    Log-Likelihood: -1588.2
No. Observations: 506    AIC: 3190.
Df Residuals: 499    BIC: 3220.
Df Model: 6
Covariance Type: nonrobust

coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.3594 4.863 -0.074 0.941 -9.915 9.196
crim -0.0991 0.034 -2.890 0.004 -0.167 -0.032
zn -0.0064 0.014 -0.470 0.638 -0.033 0.020
nox -10.8653 2.865 -3.793 0.000 -16.494 -5.237
ptratio -1.0519 0.135 -7.796 0.000 -1.317 -0.787
black 0.0137 0.003 4.453 0.000 0.008 0.020
rm 6.9796 0.396 17.612 0.000 6.201 7.758

Omnibus: 298.859    Durbin-Watson: 0.808
Prob(Omnibus): 0.000    Jarque-Bera (JB): 3305.426
Skew: 2.385    Prob(JB): 0.00
Kurtosis: 14.577    Cond. No. 7.66e+03

In [88]:result.params
Out[88]:
Intercept -0.359432
crim -0.099122
zn -0.006364
nox -10.865295
ptratio -1.051937
black 0.013737
rm 6.979587
dtype: float64
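
The fitted model can then be used for prediction; with the formula interface, predict accepts a data frame with the named columns. A minimal sketch on the first few rows:

predictions = result.predict(boston.head()) #predicted medv for the first five rows
print(predictions)
print(boston.medv.head()) #compare against the observed values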

15) One more thing: for data mining we have the wonderful scikit-learn package. For example, see decision trees at http://scikit-learn.org/stable/modules/tree.html.
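
As a hedged illustration (the parameters and predictor choice here are mine, not from the scikit-learn docs), a small regression tree on the Boston data loaded above could look like:

from sklearn.tree import DecisionTreeRegressor

X = boston[['crim', 'rm', 'ptratio', 'lstat']] #a few illustrative predictors
y = boston['medv']
tree = DecisionTreeRegressor(max_depth=3)      #keep the tree shallow and readable
tree.fit(X, y)
print(tree.score(X, y)) #in-sample R-squared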

16) For using both R and Python together, you can use http://beakernotebook.com/, as it allows you to select a kernel for each code block (not just for the whole notebook, as Jupyter does) and makes passing objects between languages very easy.
Of course, I could not do justice to all the wonderful things that Python and PyData have to offer to data scientists. A more elaborate 50-page version of this tutorial is available at http://www.slideshare.net/ajayohri/a-data-science-tutorial-in-python. For data scientists working with huge amounts of data, Python is an increasingly credible option to try out in production systems. I hope you find this tutorial useful.
