Pandas refreshers

Pandas is mostly used just to read data, fix missing values, mask and change it to numpy array.

It is also used to quickly visualize data from files

In [10]:
import pandas as pd
import numpy as np
pd.__version__
Out[10]:
'0.23.4'

Read CSV

Returns a padas.DataFrame object

In [3]:
pd.read_csv('files/data_csv.csv')
Out[3]:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

Read JSON

In [36]:
#if the json is in format
#{column1 : [values_list], column2 : [values_list]}
pd.read_json('files/data_json.json')
Out[36]:
color fruit size
0 Red Apple Large
1 yellow orange medium
In [37]:
#Or in format [{col1:val,col2:val}, {col1:val,col2:val}]
pd.read_json('files/data_json_records.json',orient='records')
Out[37]:
color fruit size
0 Red Apple Large
1 Yellow Orange Medium

Slice

In [38]:
df = pd.read_csv('files/data_csv.csv')
type(df)
Out[38]:
pandas.core.frame.DataFrame
In [40]:
df["Salary"]
Out[40]:
0    72000.0
1    48000.0
2    54000.0
3    61000.0
4        NaN
5    58000.0
6    52000.0
7    79000.0
8    83000.0
9    67000.0
Name: Salary, dtype: float64
In [42]:
df[["Country","Salary"]]
Out[42]:
Country Salary
0 France 72000.0
1 Spain 48000.0
2 Germany 54000.0
3 Spain 61000.0
4 Germany NaN
5 France 58000.0
6 Spain 52000.0
7 France 79000.0
8 Germany 83000.0
9 France 67000.0

View statistic

In [43]:
df.describe()
Out[43]:
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000

Mask

Works just like numpy

In [44]:
#Masking just like numpy
df[df['Salary']>60000]
Out[44]:
Country Age Salary Purchased
0 France 44.0 72000.0 No
3 Spain 38.0 61000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

Get numpy array from DataFrame

In [47]:
#Get the numpy array
df.values
Out[47]:
array([['France', 44.0, 72000.0, 'No'],
       ['Spain', 27.0, 48000.0, 'Yes'],
       ['Germany', 30.0, 54000.0, 'No'],
       ['Spain', 38.0, 61000.0, 'No'],
       ['Germany', 40.0, nan, 'Yes'],
       ['France', 35.0, 58000.0, 'Yes'],
       ['Spain', nan, 52000.0, 'No'],
       ['France', 48.0, 79000.0, 'Yes'],
       ['Germany', 50.0, 83000.0, 'No'],
       ['France', 37.0, 67000.0, 'Yes']], dtype=object)

Sample data

In [3]:
import numpy as np
In [11]:
x = np.linspace(0,10,100000).reshape(-1,1) #otherwise we cant concat along axis 1 (use vstack instead?)
y = (np.linspace(0,10,100000) + np.random.normal(0,0.01,100000)).reshape(-1,1)
In [13]:
df = pd.DataFrame(np.concatenate((x,y),axis=1),columns=["X","Y"])
In [14]:
df.head() #shows the first five elements
Out[14]:
X Y
0 0.0000 0.015176
1 0.0001 0.006938
2 0.0002 0.008640
3 0.0003 0.001290
4 0.0004 0.016184
In [15]:
df.sample(n=10) #samples 10 random elements
Out[15]:
X Y
2029 0.202902 0.190949
73792 7.379274 7.377391
69128 6.912869 6.927882
50606 5.060651 5.055054
47980 4.798048 4.802660
79788 7.978880 7.993227
43684 4.368444 4.379953
73852 7.385274 7.383774
73325 7.332573 7.336620
2365 0.236502 0.242109
In [18]:
#More on data visualization refresher. It will be difficult to plot million points. so we plot a sample
%matplotlib inline
df.sample(n=25).plot(x='X',y='Y',kind='scatter')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f01ac17da20>

Apply function along column or rows

In [3]:
diabetes = pd.read_csv("files/pima-indians-diabetes.csv")
diabetes.head()
Out[3]:
Number_pregnant Glucose_concentration Blood_pressure Triceps Insulin BMI Pedigree Age Class Group
0 6 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 50 1 B
1 1 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 31 0 C
2 8 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 32 1 B
3 1 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 21 0 B
4 0 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 33 1 C

Lets just normalize 'Number_pregnant' and 'Age' columns

In [3]:
diabetes.columns
Out[3]:
Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')
In [4]:
cols_to_normalize = ['Number_pregnant','Age']
In [6]:
diabetes[cols_to_normalize].head()
Out[6]:
Number_pregnant Age
0 6 50
1 1 31
2 8 32
3 1 21
4 0 33
In [7]:
#When axis=0(default), applies function to each column
#When axis=1, applies function to each row
fn = lambda x: (x-x.min())/(x.max()-x.min())
diabetes[cols_to_normalize] = diabetes[cols_to_normalize].apply(fn,axis=0)
In [9]:
diabetes.head()
Out[9]:
Number_pregnant Glucose_concentration Blood_pressure Triceps Insulin BMI Pedigree Age Class Group
0 0.352941 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 0.483333 1 B
1 0.058824 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 0.166667 0 C
2 0.470588 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 0.183333 1 B
3 0.058824 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 0.000000 0 B
4 0.000000 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 0.200000 1 C