Pandas refreshers¶

Pandas is mostly used just to read data, fix missing values, mask and change it to numpy array.

It is also used to quickly visualize data from files

import pandas as pd
import numpy as np
pd.__version__

'0.23.4'

Read CSV¶

Returns a padas.DataFrame object

pd.read_csv('files/data_csv.csv')

Read JSON¶

#if the json is in format
#{column1 : [values_list], column2 : [values_list]}
pd.read_json('files/data_json.json')

#Or in format [{col1:val,col2:val}, {col1:val,col2:val}]
pd.read_json('files/data_json_records.json',orient='records')

Slice¶

df = pd.read_csv('files/data_csv.csv')
type(df)

pandas.core.frame.DataFrame

df["Salary"]

0    72000.0
1    48000.0
2    54000.0
3    61000.0
4        NaN
5    58000.0
6    52000.0
7    79000.0
8    83000.0
9    67000.0
Name: Salary, dtype: float64

df[["Country","Salary"]]

View statistic¶

df.describe()

Mask¶

Works just like numpy

#Masking just like numpy
df[df['Salary']>60000]

Get numpy array from DataFrame¶

#Get the numpy array
df.values

array([['France', 44.0, 72000.0, 'No'],
       ['Spain', 27.0, 48000.0, 'Yes'],
       ['Germany', 30.0, 54000.0, 'No'],
       ['Spain', 38.0, 61000.0, 'No'],
       ['Germany', 40.0, nan, 'Yes'],
       ['France', 35.0, 58000.0, 'Yes'],
       ['Spain', nan, 52000.0, 'No'],
       ['France', 48.0, 79000.0, 'Yes'],
       ['Germany', 50.0, 83000.0, 'No'],
       ['France', 37.0, 67000.0, 'Yes']], dtype=object)

Sample data¶

import numpy as np

x = np.linspace(0,10,100000).reshape(-1,1) #otherwise we cant concat along axis 1 (use vstack instead?)
y = (np.linspace(0,10,100000) + np.random.normal(0,0.01,100000)).reshape(-1,1)

df = pd.DataFrame(np.concatenate((x,y),axis=1),columns=["X","Y"])

df.head() #shows the first five elements

df.sample(n=10) #samples 10 random elements

#More on data visualization refresher. It will be difficult to plot million points. so we plot a sample
%matplotlib inline
df.sample(n=25).plot(x='X',y='Y',kind='scatter')

<matplotlib.axes._subplots.AxesSubplot at 0x7f01ac17da20>

Apply function along column or rows¶

diabetes = pd.read_csv("files/pima-indians-diabetes.csv")
diabetes.head()

Lets just normalize 'Number_pregnant' and 'Age' columns

diabetes.columns

Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')

cols_to_normalize = ['Number_pregnant','Age']

diabetes[cols_to_normalize].head()

#When axis=0(default), applies function to each column
#When axis=1, applies function to each row
fn = lambda x: (x-x.min())/(x.max()-x.min())
diabetes[cols_to_normalize] = diabetes[cols_to_normalize].apply(fn,axis=0)

diabetes.head()

	color	fruit	size
0	Red	Apple	Large
1	yellow	orange	medium

	color	fruit	size
0	Red	Apple	Large
1	Yellow	Orange	Medium

	Age	Salary
count	9.000000	9.000000
mean	38.777778	63777.777778
std	7.693793	12265.579662
min	27.000000	48000.000000
25%	35.000000	54000.000000
50%	38.000000	61000.000000
75%	44.000000	72000.000000
max	50.000000	83000.000000

	X	Y
0	0.0000	0.015176
1	0.0001	0.006938
2	0.0002	0.008640
3	0.0003	0.001290
4	0.0004	0.016184

	X	Y
2029	0.202902	0.190949
73792	7.379274	7.377391
69128	6.912869	6.927882
50606	5.060651	5.055054
47980	4.798048	4.802660
79788	7.978880	7.993227
43684	4.368444	4.379953
73852	7.385274	7.383774
73325	7.332573	7.336620
2365	0.236502	0.242109

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes

	Number_pregnant	Glucose_concentration	Blood_pressure	Triceps	Insulin	BMI	Pedigree	Age	Class	Group
0	6	0.743719	0.590164	0.353535	0.000000	0.500745	0.234415	50	1	B
1	1	0.427136	0.540984	0.292929	0.000000	0.396423	0.116567	31	0	C
2	8	0.919598	0.524590	0.000000	0.000000	0.347243	0.253629	32	1	B
3	1	0.447236	0.540984	0.232323	0.111111	0.418778	0.038002	21	0	B
4	0	0.688442	0.327869	0.353535	0.198582	0.642325	0.943638	33	1	C

	Number_pregnant	Glucose_concentration	Blood_pressure	Triceps	Insulin	BMI	Pedigree	Age	Class	Group
0	0.352941	0.743719	0.590164	0.353535	0.000000	0.500745	0.234415	0.483333	1	B
1	0.058824	0.427136	0.540984	0.292929	0.000000	0.396423	0.116567	0.166667	0	C
2	0.470588	0.919598	0.524590	0.000000	0.000000	0.347243	0.253629	0.183333	1	B
3	0.058824	0.447236	0.540984	0.232323	0.111111	0.418778	0.038002	0.000000	0	B
4	0.000000	0.688442	0.327869	0.353535	0.198582	0.642325	0.943638	0.200000	1	C