Imports, and Data Fetch

imports

#imports
import pandas as pd
import numpy as np

read configuration file

#configuration
from read_config import Config
config = Config ()

data

NOTE: We make use of the following dataframes

df_titanic read from the titanic dataset
df_raw - an in-code sample dataframe to demonstrate csv file save, and some dataframe statistics
df_books - an in-code sample dataframe - starting from sub-section 5. of section ‘DataFrame Playground’. It is used to show most of the features of the dataframe.

titanic

config.set_dataset_id ("titanic")
df_titanic = config.get_train_df ()
df_titanic.head (2)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C

DataFrame Playground

1. Read Data from CSV file into a DataFrame

import os
import pandas as pd

config.set_dataset_id ("titanic") #read titanic data
df_titanic = config.get_train_df ()

2. Create DataFrame from Dictionary

pd.DataFrame (dict, columns = , index = )

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}

df_raw = pd.DataFrame (raw_data, columns = ['first_name', 'last_name',\
                                           'age', 'preTestScore',\
                                            'postTestScore'])

3. Save DataFrame to CSV file

df.to_csv (csv_filename)

df_raw.to_csv ("raw.csv")

4. DataFrame Statistics

Information

df.info ()

df_raw.info ()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   first_name     5 non-null      object
 1   last_name      5 non-null      object
 2   age            5 non-null      int64 
 3   preTestScore   5 non-null      object
 4   postTestScore  5 non-null      object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes

Information that indicates how many nulls in each column

E.g. ‘Age’ value is present only for 714 out of 891 rows. 177 rows do not have value. This count can be achieved by using the function df.isnull().sum() as shown further down.

df_titanic.info ()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Column Names

NOTE: columns is an attribute that returns an index containing the column names. This attribute can also have a name.

df_titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Shape

df.shape

df_raw.shape

(5, 5)

Display first few rows

df.head (n)

df_raw.head (2)

	first_name	last_name	age	preTestScore	postTestScore
0	Jason	Miller	42	4	25,000
1	Molly	Jacobson	52	24	94,000

Find ‘unique’ values of a column

df[col].unique ()

df_titanic ['Embarked'].unique ()

array(['S', 'C', 'Q', nan], dtype=object)

Subtotals, grouped by values of a column

value_counts ()

returns a Series with group by values as index values.

df_titanic ['Survived'].value_counts ()

0    549
1    342
Name: Survived, dtype: int64

NOTE: ‘Survived’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Survived’.

print(df_titanic ['Survived'].value_counts ().index)

Int64Index([0, 1], dtype='int64')

Extract rows based on a filter

filter = df [col_filter] == value
df [filter]

df_titanic [df_titanic ['Survived'] == 1].head (2)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

Subtotals on extracted rows, grouped by values of a column

filter = df [col_filter] == value
df [filter][col_groupby].value_counts ()

print ('Survived:')
print (df_titanic [df_titanic \
                  ['Survived'] == 1]['Sex'].value_counts ())
print ('\nDidn''t Survive:')
print (df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ())

Survived:
female    233
male      109
Name: Sex, dtype: int64

Didnt Survive:
male      468
female     81
Name: Sex, dtype: int64

value_counts () returns a Series with group by values as index values.

NOTE: ‘Survived’ is used as a filter, and ‘Sex’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Sex’.

type(df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ())

pandas.core.series.Series

Series has an index

df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ().index

Index(['male', 'female'], dtype='object')

Null count subtotals

visualizing null values inside the dataframe.
df.isnull ()

df_titanic.isnull ()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False
3	False	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	True	False
...	...	...	...	...	...	...	...	...	...	...	...	...
886	False	False	False	False	False	False	False	False	False	False	True	False
887	False	False	False	False	False	False	False	False	False	False	False	False
888	False	False	False	False	False	True	False	False	False	False	True	False
889	False	False	False	False	False	False	False	False	False	False	False	False
890	False	False	False	False	False	False	False	False	False	False	True	False

891 rows × 12 columns

summary of null values df.isnull ().sum ()

df_titanic.isnull ().sum ()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

5. Create DataFrame (revisited)

Data creation

numpy for NaN

np.NaN

import numpy as np

Create some lists

author = ['Strang', 'Blitzstein', 'Witten', 'Bishop', 'Bengio', 'Sutton']
title = ['Introduction to Linear Algebra', 'Introduction to Probability', 'ML Beginner', 'ML Advanced', \
         'Deep Learning', 'Reinforcement Learning - An Introduction']
edition = [1, np.NaN, 2, np.NaN, 1, 2]
cost = [10, 20, 15, 40, 30, 25]
topic = ['Maths', 'Maths', 'Machine Learning', 'Machine Learning', \
         'Machine Learning', 'Machine Learning']
sub_topic = ['LA', 'Prob', 'ML', 'ML', 'DL', 'RL']

Create some Series

pd.Series (list)

s_author = pd.Series (author)
s_title = pd.Series (title)
s_edition = pd.Series (edition)
s_cost = pd.Series (cost)

Series has an index. This series has a range index

Series.index

s_author.index

RangeIndex(start=0, stop=6, step=1)

DataFrame from multiple lists

Incorrect way

pd.DataFrame ([author, title])

	0	1	2	3	4	5
0	Strang	Blitzstein	Witten	Bishop	Bengio	Sutton
1	Introduction to Linear Algebra	Introduction to Probability	ML Beginner	ML Advanced	Deep Learning	Reinforcement Learning - An Introduction

The reason authors and titles appear on different rows is that a list is treated as a column, and we provided a list of lists. So, the outer list of 2 elements (author_list and title_list) created the 2 rows, and, the inner list of authors and titles expanded column-wise.

list is treated as a column

pd.DataFrame (['a', 'b'])

	0
0	a
1	b

Correct way

Zip and create a list of tuples. The elements of the list form the rows, and the elements of the enclosing tuple form the columns.
pd.DataFrame (list (zip (lst_1, lst_2)) , columns = , index = )

pd.DataFrame (list (zip (author, title)), columns = ['Author', 'Title'])

	Author	Title
0	Strang	Introduction to Linear Algebra
1	Blitzstein	Introduction to Probability
2	Witten	ML Beginner
3	Bishop	ML Advanced
4	Bengio	Deep Learning
5	Sutton	Reinforcement Learning - An Introduction

Zip and create a tuple of tuples. The elements of the outer tuple form the rows, and the elements of the enclosing tuple form the columns.

pd.DataFrame (tuple (zip (author, title)), columns = ['Author', 'Title'])

	Author	Title
0	Strang	Introduction to Linear Algebra
1	Blitzstein	Introduction to Probability
2	Witten	ML Beginner
3	Bishop	ML Advanced
4	Bengio	Deep Learning
5	Sutton	Reinforcement Learning - An Introduction

DataFrame from a dictionary, using ‘lists’ as values of keys

frame = {'Author': author, 'Title': title}
df_temp = pd.DataFrame (frame)
df_temp.head ()

	Author	Title
0	Strang	Introduction to Linear Algebra
1	Blitzstein	Introduction to Probability
2	Witten	ML Beginner
3	Bishop	ML Advanced
4	Bengio	Deep Learning

DataFrame from a dictionary, using ‘Series’ as values of keys

frame = {'Author': s_author, 'Title': s_title}
df_books = pd.DataFrame (frame)
df_books.head ()

	Author	Title
0	Strang	Introduction to Linear Algebra
1	Blitzstein	Introduction to Probability
2	Witten	ML Beginner
3	Bishop	ML Advanced
4	Bengio	Deep Learning

Add a ‘Series’ externally

NOTE:

A series can be added externally only if the index of the series matches that of the data frame

df_books ['Edition'] = s_edition
df_books.head ()

	Author	Title	Edition
0	Strang	Introduction to Linear Algebra	1.0
1	Blitzstein	Introduction to Probability	NaN
2	Witten	ML Beginner	2.0
3	Bishop	ML Advanced	NaN
4	Bengio	Deep Learning	1.0

6. Index

Provide ‘index’

Create a custom index

idx = ['a', 'b', 'c', 'd', 'e', 'f']

Pass nameless indices as a parameter

NOTE:

‘index’ is the row index
‘columns’ is the column index

df = pd.DataFrame (tuple (zip (author, title)), \
              columns = ['Author', 'Title'], \
             index = idx)
df

	Author	Title
a	Strang	Introduction to Linear Algebra
b	Blitzstein	Introduction to Probability
c	Witten	ML Beginner
d	Bishop	ML Advanced
e	Bengio	Deep Learning
f	Sutton	Reinforcement Learning - An Introduction

Provide names for indices

use rename_axis () to rename index names

A single parameter implies row index.
rename_axis (row_index_name)

df2 = df.rename_axis ('Sr. No.')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2

Row Index Name:  Sr. No.
Column Index Name:  None

	Author	Title
Sr. No.
a	Strang	Introduction to Linear Algebra
b	Blitzstein	Introduction to Probability
c	Witten	ML Beginner
d	Bishop	ML Advanced
e	Bengio	Deep Learning
f	Sutton	Reinforcement Learning - An Introduction

Use axis to indicate row (= 0) or column (= 1) axis
rename_axis (col_axis_name, axis = 1)

df2 = df.rename_axis ('Attributes->', axis = 1)
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2

Row Index Name:  None
Column Index Name:  Attributes->

Attributes->	Author	Title
a	Strang	Introduction to Linear Algebra
b	Blitzstein	Introduction to Probability
c	Witten	ML Beginner
d	Bishop	ML Advanced
e	Bengio	Deep Learning
f	Sutton	Reinforcement Learning - An Introduction

‘axis’ values 0 and 1 also carry names “rows” and “columns” respectively.

df2 = df.rename_axis ('Attributes->', axis = "columns")
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2

Row Index Name:  None
Column Index Name:  Attributes->

Attributes->	Author	Title
a	Strang	Introduction to Linear Algebra
b	Blitzstein	Introduction to Probability
c	Witten	ML Beginner
d	Bishop	ML Advanced
e	Bengio	Deep Learning
f	Sutton	Reinforcement Learning - An Introduction

Rename both row and column axes denoted by ‘index’ for row index, and ‘columns’ for column attribute that returns an index.
rename_axis (index = row_axis_name, columns = col_axis_name)

df2 = df.rename_axis (index = 'Sr. No.', columns = 'Attributes->')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2

Row Index Name:  Sr. No.
Column Index Name:  Attributes->

Attributes->	Author	Title
Sr. No.
a	Strang	Introduction to Linear Algebra
b	Blitzstein	Introduction to Probability
c	Witten	ML Beginner
d	Bishop	ML Advanced
e	Bengio	Deep Learning
f	Sutton	Reinforcement Learning - An Introduction

Hierarchical Indices

Create the multi-index

pd.MultiIndex

.from_product ()

list of list of indices
list of names of indices

NOTE: The list of indices are of different size: There are 3 topics, and 2 sub-topics. So, we have to use from_product () to perform the cross product of the two indices. This creates a hierarchy. If we just wanted two row indices without them being hierarchical, the indices list should be of the same size, and we need to use from_arrays () instead.

i_topic = ['Mathematics', 'Machine Learning', 'DL/RL']
i_subtopic = [1, 2]
i_names = ['Topic', 'Sub-Topic']

mux = pd.MultiIndex.from_product ([i_topic, i_subtopic], names = i_names)
mux

MultiIndex([(     'Mathematics', 1),
            (     'Mathematics', 2),
            ('Machine Learning', 1),
            ('Machine Learning', 2),
            (           'DL/RL', 1),
            (           'DL/RL', 2)],
           names=['Topic', 'Sub-Topic'])

Create the dataframe with multi-index

pd.DataFrame (list, columns = , index = pd.MultiIndex ().from_#)

pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)

		Author	Title
Topic	Sub-Topic
Mathematics	1	Strang	Introduction to Linear Algebra
Mathematics	2	Blitzstein	Introduction to Probability
Machine Learning	1	Witten	ML Beginner
Machine Learning	2	Bishop	ML Advanced
DL/RL	1	Bengio	Deep Learning
DL/RL	2	Sutton	Reinforcement Learning - An Introduction

Create a multi-index using from_arrays ()

mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
mux

MultiIndex([(           'Maths',   'LA'),
            (           'Maths', 'Prob'),
            ('Machine Learning',   'ML'),
            ('Machine Learning',   'ML'),
            ('Machine Learning',   'DL'),
            ('Machine Learning',   'RL')],
           names=['Topic', 'Sub-Topic'])

Create a data frame with multi-index (created using from_arrays ())

pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)

		Author	Title
Topic	Sub-Topic
Maths	LA	Strang	Introduction to Linear Algebra
Maths	Prob	Blitzstein	Introduction to Probability
Machine Learning	ML	Witten	ML Beginner
	ML	Bishop	ML Advanced
	DL	Bengio	Deep Learning
	RL	Sutton	Reinforcement Learning - An Introduction

Navigating using indices

Create a dataframe with multi-index

mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)
df

		Author	Title
Topic	Sub-Topic
Maths	LA	Strang	Introduction to Linear Algebra
Maths	Prob	Blitzstein	Introduction to Probability
Machine Learning	ML	Witten	ML Beginner
	ML	Bishop	ML Advanced
	DL	Bengio	Deep Learning
	RL	Sutton	Reinforcement Learning - An Introduction

Filter based on index values

df.loc [index_val]
df.loc [(index1_val, index2_val)]

NOTE: A data frame is returned.

df.sort_index ()

to prevent “PerformanceWarning: indexing past lexsort depth may impact performance.”

#To prevent 
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df.sort_index (inplace = True)
print ('Books in Mathematics:')
print (df.loc ['Maths'], '\n\n')
print ('Books in Machine Learning:')
print (df.loc ['Machine Learning'], '\n\n')
print ('Books in Deep Learning:')
print (df.loc [('Machine Learning', 'DL')])

Books in Mathematics:
               Author                           Title
Sub-Topic                                            
LA             Strang  Introduction to Linear Algebra
Prob       Blitzstein     Introduction to Probability 


Books in Machine Learning:
           Author                                     Title
Sub-Topic                                                  
DL         Bengio                             Deep Learning
ML         Witten                               ML Beginner
ML         Bishop                               ML Advanced
RL         Sutton  Reinforcement Learning - An Introduction 


Books in Deep Learning:
                            Author          Title
Topic            Sub-Topic                       
Machine Learning DL         Bengio  Deep Learning

Set existing columns as indexes

Create a dataframe WITHOUT passing any index

Instead, pass index as regular columns - topic, and sub_topic

#mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title, topic, sub_topic)), \
             columns = ['Author', 'Title', 'Topic', 'Sub-Topic'], \
             #index = mux
                  )
df

	Author	Title	Topic	Sub-Topic
0	Strang	Introduction to Linear Algebra	Maths	LA
1	Blitzstein	Introduction to Probability	Maths	Prob
2	Witten	ML Beginner	Machine Learning	ML
3	Bishop	ML Advanced	Machine Learning	ML
4	Bengio	Deep Learning	Machine Learning	DL
5	Sutton	Reinforcement Learning - An Introduction	Machine Learning	RL

Create index from columns

df.set_index ([cols], inplace = True)

df.set_index (['Topic', 'Sub-Topic'], inplace = True)
df

		Author	Title
Topic	Sub-Topic
Maths	LA	Strang	Introduction to Linear Algebra
Maths	Prob	Blitzstein	Introduction to Probability
Machine Learning	ML	Witten	ML Beginner
	ML	Bishop	ML Advanced
	DL	Bengio	Deep Learning
	RL	Sutton	Reinforcement Learning - An Introduction

Instantiate an index and set it

df.set_index (pd.RangeIndex (start, stop, step), inplace = True

df.set_index (\
              pd.RangeIndex (start = 5, stop = 11, step = 1),\
              inplace = True)
df

	Author	Title
5	Strang	Introduction to Linear Algebra
6	Blitzstein	Introduction to Probability
7	Witten	ML Beginner
8	Bishop	ML Advanced
9	Bengio	Deep Learning
10	Sutton	Reinforcement Learning - An Introduction

df.groupby (col)

1. DataFrameGroupBy

df.groupby (col)
df.groupby ([cols])

returns an instance of DataFrameGroupBy

df = df_titanic
df.groupby (['Survived', 'Sex'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001CAEA053A30>

2. Summary Statistics

Perform a summary statistic on DataFrameGroupBy

mean () - for numerical columns
count () - for all columns
size ()
- for numerical columns

NOTE:

stats are obviously not performed on groupby columns, since those column values are used to form groups
count returns individual column count of rows having some value

count of columns like ‘Age’ and ‘Cabin’ indicate missing values
sum returns count of rows. some column having missing value does not impact, unless all columns have missing values.

print ('Information of the data frame:')
print (df.info (), '\n\n')
print ('Mean Statistics:')
print (df.groupby (['Survived', 'Sex']).mean (), '\n\n')
print ('Count Statistics:')
print (df.groupby (['Survived', 'Sex']).count (), '\n\n')
print ('Size Statistics:')
print (df.groupby (['Survived', 'Sex']).size ())

Information of the data frame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None 


Mean Statistics:
                 PassengerId    Pclass        Age     SibSp     Parch  \
Survived Sex                                                            
0        female   434.851852  2.851852  25.046875  1.209877  1.037037   
         male     449.121795  2.476496  31.618056  0.440171  0.207265   
1        female   429.699571  1.918455  28.847716  0.515021  0.515021   
         male     475.724771  2.018349  27.276022  0.385321  0.357798   

                      Fare  
Survived Sex                
0        female  23.024385  
         male    21.960993  
1        female  51.938573  
         male    40.821484   


Count Statistics:
                 PassengerId  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  \
Survived Sex                                                                  
0        female           81      81    81   64     81     81      81    81   
         male            468     468   468  360    468    468     468   468   
1        female          233     233   233  197    233    233     233   233   
         male            109     109   109   93    109    109     109   109   

                 Cabin  Embarked  
Survived Sex                      
0        female      6        81  
         male       62       468  
1        female     91       231  
         male       45       109   


Size Statistics:
Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
dtype: int64

> Series (with indexes) returned by groupby.summary_stat

NOTE:

df.groupby (col).size () returns a series, with an index
df.groupby ([cols]).size () returns a series with a multi-index

» groupby (col)

series

df.groupby ('Survived').size ()

Survived
0    549
1    342
dtype: int64

type = Series

type(df.groupby ('Survived').size ())

pandas.core.series.Series

index

of the series

df.groupby ('Survived').size ().index

Int64Index([0, 1], dtype='int64', name='Survived')

» groupby ([cols])

NOTE:

didn’t survive 549 = 81 + 468
survived 342 = 233 + 109

series

df.groupby (['Survived', 'Sex']).size ()

Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
dtype: int64

type = series

type (df.groupby (['Survived', 'Sex']).size ())

pandas.core.series.Series

multi-index

of the series

df.groupby (['Survived', 'Sex']).size ().index

MultiIndex([(0, 'female'),
            (0,   'male'),
            (1, 'female'),
            (1,   'male')],
           names=['Survived', 'Sex'])

3. Unstack the multi-index series into a data frame

You may wish to remove one of the features from the hierarchical index and form different columns with respect to that feature. You can do so using the unstack method.

NOTE: unstack () requires the feature to have unique values. This is because the values of the feature form column names. An index created using groupby () ensures that the index has unique values.

df.groupby (['Survived', 'Sex']).size ().unstack ()

Sex	female	male
Survived
0	81	468
1	233	109

NOTE:

didn’t survive 549 = 81 + 468
survived 342 = 233 + 109

The index column to unstack, is passed as parameter, can be given by name, or by position, 0 is the default.

df.groupby (['Survived', 'Sex']).size ().unstack ('Sex')

Sex	female	male
Survived
0	81	468
1	233	109

by default, the leaf of the hierarchical (multi) index is unstacked, leaving the rest of the index tree as the index of the returned dataframe

df.groupby (['Survived', 'Sex']).size ().unstack ().index

Int64Index([0, 1], dtype='int64', name='Survived')

the unstacked leaf of the index forms the columns of the returned dataframe

df.groupby (['Survived', 'Sex']).size ().unstack ().columns

Index(['female', 'male'], dtype='object', name='Sex')

mention the index column to unstack

NOTE

this time we have unstacked 'Survived' instead of 'Sex'

df.groupby (['Survived', 'Sex']).size ().unstack ('Survived')

Survived	0	1
Sex
female	81	233
male	468	109

7. Filter

Data creation

frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
df_books.head ()

	Author	Title	Edition
0	Strang	Introduction to Linear Algebra	1.0
1	Blitzstein	Introduction to Probability	NaN
2	Witten	ML Beginner	2.0
3	Bishop	ML Advanced	NaN
4	Bengio	Deep Learning	1.0

An example ‘filter’

df_books ['Edition'] == 1

   True
  False
  False
  False
   True
  False
Name: Edition, dtype: bool

Chaining format

df_books.Edition.eq (2)

  False
  False
   True
  False
  False
   True
Name: Edition, dtype: bool

Apply filter to a dataframe

Filter rows whose column value equals a specific value

filter = df_books ['Edition'] == 2
df_books [filter]

	Author	Title	Edition
2	Witten	ML Beginner	2.0
5	Sutton	Reinforcement Learning - An Introduction	2.0

Filter rows whose column value does NOT equal a specific value - !=

filter = df_books.Edition != 2
df_books [filter]

	Author	Title	Edition
0	Strang	Introduction to Linear Algebra	1.0
1	Blitzstein	Introduction to Probability	NaN
3	Bishop	ML Advanced	NaN
4	Bengio	Deep Learning	1.0

Filter rows whose column value is not NaN - df.col.notnull ()

filter = df_books.Edition.notnull ()
df_books [filter]

	Author	Title	Edition
0	Strang	Introduction to Linear Algebra	1.0
2	Witten	ML Beginner	2.0
4	Bengio	Deep Learning	1.0
5	Sutton	Reinforcement Learning - An Introduction	2.0

Filter rows with column values among a set of values

df.colname.isin

authors_maths = ['Apostol', 'Strang', 'Blitzstein']
filter = df_books.Author.isin (authors_maths)
df_books [filter]

	Author	Title	Edition
0	Strang	Introduction to Linear Algebra	1.0
1	Blitzstein	Introduction to Probability	NaN

Filter rows with column values NOT among a set of values

~df

.colname.isin

filter = ~df_books.Author.isin (authors_maths)
df_books [filter]

	Author	Title	Edition
2	Witten	ML Beginner	2.0
3	Bishop	ML Advanced	NaN
4	Bengio	Deep Learning	1.0
5	Sutton	Reinforcement Learning - An Introduction	2.0

Filter rows based on multiple conditions

condition1 & condition2

filter = ~df_books.Author.isin (authors_maths)\
        & df_books.Edition.notnull ()
df_books [filter]

	Author	Title	Edition
2	Witten	ML Beginner	2.0
4	Bengio	Deep Learning	1.0
5	Sutton	Reinforcement Learning - An Introduction	2.0

Filter rows based on a series

s_title_learning = df_books ['Title'\
                            ].apply (lambda x: x.endswith ('Learning'))
df_books [s_title_learning]

	Author	Title	Edition
4	Bengio	Deep Learning	1.0

8. Selection

using .loc

selection using label-based location
selection using filter

df.loc [(index1_label, ...), [selected_cols]]

df.loc [[index_label1, index_label2, ...], [selected_cols]]

df.loc [filter, [selected_cols]]

NOTE: For the first parameter of .loc (), we pass index values. These values are the labels of the rows. Hence, .loc is label-based.

data creation - add index

frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
                                 names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0
	ML	Witten	ML Beginner	2.0
	ML	Bishop	ML Advanced	NaN
	RL	Sutton	Reinforcement Learning - An Introduction	2.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0
Maths	Prob	Blitzstein	Introduction to Probability	NaN

df.loc [index] - returns rows belonging to the index label

df_books.loc ['Machine Learning']

	Author	Title	Edition
Sub-Topic
DL	Bengio	Deep Learning	1.0
ML	Witten	ML Beginner	2.0
ML	Bishop	ML Advanced	NaN
RL	Sutton	Reinforcement Learning - An Introduction	2.0

df.loc [(index1, index2)] - returns rows belonging to the hierarchical index labels

df_books.loc [('Machine Learning', 'ML')]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	ML	Witten	ML Beginner	2.0
Machine Learning	ML	Bishop	ML Advanced	NaN

df.loc [[idx1_labl1, idx1_lab2]] - returns rows belonging to the multiple labels of the index root

df_books.loc [['Machine Learning', 'Maths']]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0
	ML	Witten	ML Beginner	2.0
	ML	Bishop	ML Advanced	NaN
	RL	Sutton	Reinforcement Learning - An Introduction	2.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0
Maths	Prob	Blitzstein	Introduction to Probability	NaN

df.loc [([idx1_lab1, idx1_lab2], [idx2_lab1]), [selected_cols]] - returns rows belonging to the cross-product labels of the nodes of the hierarchical index

df_books.loc [(['Machine Learning', 'Maths'],\
               ['LA', 'DL', 'RL']), ['Title']]

		Title
Topic	Sub-Topic
Machine Learning	DL	Deep Learning
Machine Learning	RL	Reinforcement Learning - An Introduction
Maths	LA	Introduction to Linear Algebra

df.loc [(index1, index2), col_name] - returns rows belonging to the label, and their single column

returns a series

df_books.loc [('Machine Learning', 'ML'), 'Title']

Topic             Sub-Topic
Machine Learning  ML           ML Beginner
                  ML           ML Advanced
Name: Title, dtype: object

df.loc [(index1, index2), [col_name] - returns rows belonging to the label, and their one or more columns

returns a data frame

df_books.loc [('Machine Learning', 'ML'), ['Title']]

		Title
Topic	Sub-Topic
Machine Learning	ML	ML Beginner
Machine Learning	ML	ML Advanced

df.loc [(index1, index2), [cols]] - returns rows belonging to the label, and their one or more columns

df_books.loc [('Machine Learning', 'ML'), ['Title', 'Edition']]

		Title	Edition
Topic	Sub-Topic
Machine Learning	ML	ML Beginner	2.0
Machine Learning	ML	ML Advanced	NaN

df.loc [filter] - returns rows satisfying the filter

filter = df_books ['Author'] == 'Sutton'
df_books.loc [filter]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	RL	Sutton	Reinforcement Learning - An Introduction	2.0

NOTE:

df [filter]
and df.loc [filter] both return the same result.
But, df.loc provides selection of columns

df_books [filter]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	RL	Sutton	Reinforcement Learning - An Introduction	2.0

df_books.loc [filter, ['Edition', 'Author']]

		Edition	Author
Topic	Sub-Topic
Machine Learning	RL	2.0	Sutton

df.loc [series] - return rows based on a series

Create a series by applying a lambda on a column

s_title_learning = df_books ['Title'\
                            ].apply (lambda x: x.endswith ('Learning'))
df_books.loc [s_title_learning, ['Edition', 'Title']]

		Edition	Title
Topic	Sub-Topic
Machine Learning	DL	1.0	Deep Learning

using .iloc - integer-location-based location

df.iloc [row_range, col_range]

NOTE: For both the parameters of .iloc (), we pass location indices. Hence, .iloc is integer-location-based.

NOTE:

a single row or column is returned as a series

have returned a dataframe by enclosing the index inside a []
multiple rows or columns are returned as a dataframe

df.iloc [int_location] returns a row as series

df_books.iloc [0]

Author            Bengio
Title      Deep Learning
Edition                1
Name: (Machine Learning, DL), dtype: object

df.iloc [[int_location]] returns a row as data frame

df_books.iloc [[0]]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0

df.iloc [row_range] - returns multiple rows from the range

df_books.iloc [2:5]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	ML	Bishop	ML Advanced	NaN
Machine Learning	RL	Sutton	Reinforcement Learning - An Introduction	2.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0

df.iloc [row_range, col_range] - returns multiple rows and columns from the range

df_books.iloc [2:5, 1:3]

		Title	Edition
Topic	Sub-Topic
Machine Learning	ML	ML Advanced	NaN
Machine Learning	RL	Reinforcement Learning - An Introduction	2.0
Maths	LA	Introduction to Linear Algebra	1.0

df.iloc [row_range, [-1]] - returns multiple rows from the range, and the last column

df_books.iloc [2:5, [-1]]

		Edition
Topic	Sub-Topic
Machine Learning	ML	NaN
Machine Learning	RL	2.0
Maths	LA	1.0

get the first row

df_books.iloc [[0]]

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0

get the last row

df_books.iloc [[-1]]

		Author	Title	Edition
Topic	Sub-Topic
Maths	Prob	Blitzstein	Introduction to Probability	NaN

9. Update

Data Creation

frame = {'Author': s_author, 'Title': s_title,\
         'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
                                 names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books

		Author	Title	Edition
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0
	ML	Witten	ML Beginner	2.0
	ML	Bishop	ML Advanced	NaN
	RL	Sutton	Reinforcement Learning - An Introduction	2.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0
Maths	Prob	Blitzstein	Introduction to Probability	NaN

Add a ‘cost column

NOTE:

A series can be added externally only if the index of the series matches that of the data frame
This can be achieved by pd.Series (lst, index = )

cost_indexed = [30, np.NaN, 40, 25, np.NaN, 20]
df_books ['Cost'] = pd.Series (cost_indexed, index = df_books.index)
df_books

		Author	Title	Edition	Cost
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Learning	1.0	30.0
	ML	Witten	ML Beginner	2.0	NaN
	ML	Bishop	ML Advanced	NaN	40.0
	RL	Sutton	Reinforcement Learning - An Introduction	2.0	25.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0	NaN
Maths	Prob	Blitzstein	Introduction to Probability	NaN	20.0

Using .loc

df_books.loc [('Machine Learning', ['DL']), ['Title']]\
= 'Deep Reinforcement Learning'
df_books

		Author	Title	Edition	Cost
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Reinforcement Learning	1.0	30.0
	ML	Witten	ML Beginner	2.0	NaN
	ML	Bishop	ML Advanced	NaN	40.0
	RL	Sutton	Reinforcement Learning - An Introduction	2.0	25.0
Maths	LA	Strang	Introduction to Linear Algebra	1.0	NaN
Maths	Prob	Blitzstein	Introduction to Probability	NaN	20.0

fill missing values

fill with a default value

Note:

df[col].fillna () updates a column (not inplace) and returns it as a series. Pandas Philosophy: single row/column returned as series.
The returned series is assigned to a dataframe column in the LHS

df_books ['Edition'] = df_books ['Edition'].fillna ('0.0')
df_books

		Author	Title	Edition	Cost
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Reinforcement Learning	1	30.0
	ML	Witten	ML Beginner	2	NaN
	ML	Bishop	ML Advanced	0.0	40.0
	RL	Sutton	Reinforcement Learning - An Introduction	2	25.0
Maths	LA	Strang	Introduction to Linear Algebra	1	NaN
Maths	Prob	Blitzstein	Introduction to Probability	0.0	20.0

fill with a stats value

groupby based on index

level =
[l1, l2, ..]

df_books.groupby (level = [0])['Cost'].sum ()

Topic
Machine Learning    95.0
Maths               20.0
Name: Cost, dtype: float64

median value
of the group

df_books.groupby (level = [0])['Cost'].transform ('median')

Topic             Sub-Topic
Machine Learning  DL           30.0
                  ML           30.0
                  ML           30.0
                  RL           30.0
Maths             LA           20.0
                  Prob         20.0
Name: Cost, dtype: float64

missing values filled with median value of the group NOTE:

values updated for ‘ML Beginner’ and ‘Introduction to Linear Algebra’

#series of median cost grouped by index level 0 ('Topic')
s_cost_median_by_group = df_books.groupby (level = [0])\
['Cost'].transform ('median')
#update the empty cells of 'Cost'
df_books ['Cost'].fillna (s_cost_median_by_group, inplace = True)
df_books

		Author	Title	Edition	Cost
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Reinforcement Learning	1	30.0
	ML	Witten	ML Beginner	2	30.0
	ML	Bishop	ML Advanced	0.0	40.0
	RL	Sutton	Reinforcement Learning - An Introduction	2	25.0
Maths	LA	Strang	Introduction to Linear Algebra	1	20.0
Maths	Prob	Blitzstein	Introduction to Probability	0.0	20.0

Using columns

df_books ['Description'] = df_books ['Title'] + ' - by - '\
+ df_books ['Author']
df_books

		Author	Title	Edition	Cost	Description
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Reinforcement Learning	1	30.0	Deep Reinforcement Learning - by - Bengio
	ML	Witten	ML Beginner	2	30.0	ML Beginner - by - Witten
	ML	Bishop	ML Advanced	0.0	40.0	ML Advanced - by - Bishop
	RL	Sutton	Reinforcement Learning - An Introduction	2	25.0	Reinforcement Learning - An Introduction - by ...
Maths	LA	Strang	Introduction to Linear Algebra	1	20.0	Introduction to Linear Algebra - by - Strang
Maths	Prob	Blitzstein	Introduction to Probability	0.0	20.0	Introduction to Probability - by - Blitzstein

binning using pd.cut ()

binning - numeric to categorical

cost_boundaries = [0, 11, 21, 31, 41, np.inf]
cost_range = ['Free', 'Cheap', 'Affordable', \
              'Costly', 'Exorbitant']
df_books ['Cost Range'] = pd.cut \
(df_books ['Cost'], cost_boundaries, labels = cost_range)
df_books

		Author	Title	Edition	Cost	Description	Cost Range
Topic	Sub-Topic
Machine Learning	DL	Bengio	Deep Reinforcement Learning	1	30.0	Deep Reinforcement Learning - by - Bengio	Affordable
	ML	Witten	ML Beginner	2	30.0	ML Beginner - by - Witten	Affordable
	ML	Bishop	ML Advanced	0.0	40.0	ML Advanced - by - Bishop	Costly
	RL	Sutton	Reinforcement Learning - An Introduction	2	25.0	Reinforcement Learning - An Introduction - by ...	Affordable
Maths	LA	Strang	Introduction to Linear Algebra	1	20.0	Introduction to Linear Algebra - by - Strang	Cheap
Maths	Prob	Blitzstein	Introduction to Probability	0.0	20.0	Introduction to Probability - by - Blitzstein	Cheap