Imports, and Data Fetch

imports

#imports
import pandas as pd
import numpy as np

read configuration file

#configuration
from read_config import Config
config = Config ()

data

NOTE: We make use of the following dataframes

  • df_titanic read from the titanic dataset
  • df_raw - an in-code sample dataframe to demonstrate csv file save, and some dataframe statistics
  • df_books - an in-code sample dataframe - starting from sub-section 5. of section ‘DataFrame Playground’. It is used to show most of the features of the dataframe.

titanic

config.set_dataset_id ("titanic")
df_titanic = config.get_train_df ()
df_titanic.head (2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

DataFrame Playground

1. Read Data from CSV file into a DataFrame

import os
import pandas as pd

config.set_dataset_id ("titanic") #read titanic data
df_titanic = config.get_train_df ()

2. Create DataFrame from Dictionary

pd.DataFrame (dict, columns = , index = )
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}

df_raw = pd.DataFrame (raw_data, columns = ['first_name', 'last_name',\
                                           'age', 'preTestScore',\
                                            'postTestScore'])

3. Save DataFrame to CSV file

df.to_csv (csv_filename)
df_raw.to_csv ("raw.csv")

4. DataFrame Statistics

Information

df.info ()
df_raw.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   first_name     5 non-null      object
 1   last_name      5 non-null      object
 2   age            5 non-null      int64 
 3   preTestScore   5 non-null      object
 4   postTestScore  5 non-null      object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes

Information that indicates how many nulls in each column

E.g. ‘Age’ value is present only for 714 out of 891 rows. 177 rows do not have value. This count can be achieved by using the function df.isnull().sum() as shown further down.

df_titanic.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Column Names

NOTE: columns is an attribute that returns an index containing the column names. This attribute can also have a name.

df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Shape

df.shape
df_raw.shape
(5, 5)

Display first few rows

df.head (n)
df_raw.head (2)
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25,000
1 Molly Jacobson 52 24 94,000

Find ‘unique’ values of a column

df[col].unique ()
df_titanic ['Embarked'].unique ()
array(['S', 'C', 'Q', nan], dtype=object)

Subtotals, grouped by values of a column

value_counts ()

returns a Series with group by values as index values.

df_titanic ['Survived'].value_counts ()
0    549
1    342
Name: Survived, dtype: int64

NOTE: ‘Survived’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Survived’.

print(df_titanic ['Survived'].value_counts ().index)
Int64Index([0, 1], dtype='int64')

Extract rows based on a filter

filter = df [col_filter] == value
df [filter]

df_titanic [df_titanic ['Survived'] == 1].head (2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

Subtotals on extracted rows, grouped by values of a column

filter = df [col_filter] == value
df [filter][col_groupby].value_counts ()

print ('Survived:')
print (df_titanic [df_titanic \
                  ['Survived'] == 1]['Sex'].value_counts ())
print ('\nDidn''t Survive:')
print (df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ())
Survived:
female    233
male      109
Name: Sex, dtype: int64

Didnt Survive:
male      468
female     81
Name: Sex, dtype: int64

value_counts () returns a Series with group by values as index values.

NOTE: ‘Survived’ is used as a filter, and ‘Sex’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Sex’.

type(df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ())
pandas.core.series.Series

Series has an index

df_titanic [df_titanic \
                  ['Survived'] == 0]['Sex'].value_counts ().index
Index(['male', 'female'], dtype='object')

Null count subtotals

visualizing null values inside the dataframe.
df.isnull ()

df_titanic.isnull ()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
3 False False False False False False False False False False False False
4 False False False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False False True False
887 False False False False False False False False False False False False
888 False False False False False True False False False False True False
889 False False False False False False False False False False False False
890 False False False False False False False False False False True False

891 rows × 12 columns

summary of null values df.isnull ().sum ()

df_titanic.isnull ().sum ()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

5. Create DataFrame (revisited)

Data creation

numpy for NaN

np.NaN
import numpy as np

Create some lists

author = ['Strang', 'Blitzstein', 'Witten', 'Bishop', 'Bengio', 'Sutton']
title = ['Introduction to Linear Algebra', 'Introduction to Probability', 'ML Beginner', 'ML Advanced', \
         'Deep Learning', 'Reinforcement Learning - An Introduction']
edition = [1, np.NaN, 2, np.NaN, 1, 2]
cost = [10, 20, 15, 40, 30, 25]
topic = ['Maths', 'Maths', 'Machine Learning', 'Machine Learning', \
         'Machine Learning', 'Machine Learning']
sub_topic = ['LA', 'Prob', 'ML', 'ML', 'DL', 'RL']

Create some Series

pd.Series (list)
s_author = pd.Series (author)
s_title = pd.Series (title)
s_edition = pd.Series (edition)
s_cost = pd.Series (cost)

Series has an index. This series has a range index

Series.index
s_author.index
RangeIndex(start=0, stop=6, step=1)

DataFrame from multiple lists

Incorrect way

pd.DataFrame ([author, title])
0 1 2 3 4 5
0 Strang Blitzstein Witten Bishop Bengio Sutton
1 Introduction to Linear Algebra Introduction to Probability ML Beginner ML Advanced Deep Learning Reinforcement Learning - An Introduction

The reason authors and titles appear on different rows is that a list is treated as a column, and we provided a list of lists. So, the outer list of 2 elements (author_list and title_list) created the 2 rows, and, the inner list of authors and titles expanded column-wise.

list is treated as a column

pd.DataFrame (['a', 'b'])
0
0 a
1 b

Correct way

Zip and create a list of tuples. The elements of the list form the rows, and the elements of the enclosing tuple form the columns.
pd.DataFrame (list (zip (lst_1, lst_2)) , columns = , index = )

pd.DataFrame (list (zip (author, title)), columns = ['Author', 'Title'])
Author Title
0 Strang Introduction to Linear Algebra
1 Blitzstein Introduction to Probability
2 Witten ML Beginner
3 Bishop ML Advanced
4 Bengio Deep Learning
5 Sutton Reinforcement Learning - An Introduction

Zip and create a tuple of tuples. The elements of the outer tuple form the rows, and the elements of the enclosing tuple form the columns.

pd.DataFrame (tuple (zip (author, title)), columns = ['Author', 'Title'])
Author Title
0 Strang Introduction to Linear Algebra
1 Blitzstein Introduction to Probability
2 Witten ML Beginner
3 Bishop ML Advanced
4 Bengio Deep Learning
5 Sutton Reinforcement Learning - An Introduction

DataFrame from a dictionary, using ‘lists’ as values of keys

frame = {'Author': author, 'Title': title}
df_temp = pd.DataFrame (frame)
df_temp.head ()
Author Title
0 Strang Introduction to Linear Algebra
1 Blitzstein Introduction to Probability
2 Witten ML Beginner
3 Bishop ML Advanced
4 Bengio Deep Learning

DataFrame from a dictionary, using ‘Series’ as values of keys

frame = {'Author': s_author, 'Title': s_title}
df_books = pd.DataFrame (frame)
df_books.head ()
Author Title
0 Strang Introduction to Linear Algebra
1 Blitzstein Introduction to Probability
2 Witten ML Beginner
3 Bishop ML Advanced
4 Bengio Deep Learning

Add a ‘Series’ externally

NOTE:

  • A series can be added externally only if the index of the series matches that of the data frame
df_books ['Edition'] = s_edition
df_books.head ()
Author Title Edition
0 Strang Introduction to Linear Algebra 1.0
1 Blitzstein Introduction to Probability NaN
2 Witten ML Beginner 2.0
3 Bishop ML Advanced NaN
4 Bengio Deep Learning 1.0

6. Index

Provide ‘index’

Create a custom index

idx = ['a', 'b', 'c', 'd', 'e', 'f']

Pass nameless indices as a parameter

NOTE:

  • index’ is the row index
  • columns’ is the column index
df = pd.DataFrame (tuple (zip (author, title)), \
              columns = ['Author', 'Title'], \
             index = idx)
df
Author Title
a Strang Introduction to Linear Algebra
b Blitzstein Introduction to Probability
c Witten ML Beginner
d Bishop ML Advanced
e Bengio Deep Learning
f Sutton Reinforcement Learning - An Introduction

Provide names for indices

use rename_axis () to rename index names

A single parameter implies row index.
rename_axis (row_index_name)

df2 = df.rename_axis ('Sr. No.')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name:  Sr. No.
Column Index Name:  None
Author Title
Sr. No.
a Strang Introduction to Linear Algebra
b Blitzstein Introduction to Probability
c Witten ML Beginner
d Bishop ML Advanced
e Bengio Deep Learning
f Sutton Reinforcement Learning - An Introduction

Use axis to indicate row (= 0) or column (= 1) axis
rename_axis (col_axis_name, axis = 1)

df2 = df.rename_axis ('Attributes->', axis = 1)
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name:  None
Column Index Name:  Attributes->
Attributes-> Author Title
a Strang Introduction to Linear Algebra
b Blitzstein Introduction to Probability
c Witten ML Beginner
d Bishop ML Advanced
e Bengio Deep Learning
f Sutton Reinforcement Learning - An Introduction

‘axis’ values 0 and 1 also carry names “rows” and “columns” respectively.

df2 = df.rename_axis ('Attributes->', axis = "columns")
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name:  None
Column Index Name:  Attributes->
Attributes-> Author Title
a Strang Introduction to Linear Algebra
b Blitzstein Introduction to Probability
c Witten ML Beginner
d Bishop ML Advanced
e Bengio Deep Learning
f Sutton Reinforcement Learning - An Introduction

Rename both row and column axes denoted by ‘index’ for row index, and ‘columns’ for column attribute that returns an index.
rename_axis (index = row_axis_name, columns = col_axis_name)

df2 = df.rename_axis (index = 'Sr. No.', columns = 'Attributes->')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name:  Sr. No.
Column Index Name:  Attributes->
Attributes-> Author Title
Sr. No.
a Strang Introduction to Linear Algebra
b Blitzstein Introduction to Probability
c Witten ML Beginner
d Bishop ML Advanced
e Bengio Deep Learning
f Sutton Reinforcement Learning - An Introduction

Hierarchical Indices

Create the multi-index

pd.MultiIndex

.from_product ()

  • list of list of indices
  • list of names of indices

NOTE: The list of indices are of different size: There are 3 topics, and 2 sub-topics. So, we have to use from_product () to perform the cross product of the two indices. This creates a hierarchy. If we just wanted two row indices without them being hierarchical, the indices list should be of the same size, and we need to use from_arrays () instead.

i_topic = ['Mathematics', 'Machine Learning', 'DL/RL']
i_subtopic = [1, 2]
i_names = ['Topic', 'Sub-Topic']

mux = pd.MultiIndex.from_product ([i_topic, i_subtopic], names = i_names)
mux
MultiIndex([(     'Mathematics', 1),
            (     'Mathematics', 2),
            ('Machine Learning', 1),
            ('Machine Learning', 2),
            (           'DL/RL', 1),
            (           'DL/RL', 2)],
           names=['Topic', 'Sub-Topic'])

Create the dataframe with multi-index

pd.DataFrame (list, columns = , index = pd.MultiIndex ().from_#)

pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)
Author Title
Topic Sub-Topic
Mathematics 1 Strang Introduction to Linear Algebra
2 Blitzstein Introduction to Probability
Machine Learning 1 Witten ML Beginner
2 Bishop ML Advanced
DL/RL 1 Bengio Deep Learning
2 Sutton Reinforcement Learning - An Introduction

Create a multi-index using from_arrays ()

mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
mux
MultiIndex([(           'Maths',   'LA'),
            (           'Maths', 'Prob'),
            ('Machine Learning',   'ML'),
            ('Machine Learning',   'ML'),
            ('Machine Learning',   'DL'),
            ('Machine Learning',   'RL')],
           names=['Topic', 'Sub-Topic'])

Create a data frame with multi-index (created using from_arrays ())

pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)
Author Title
Topic Sub-Topic
Maths LA Strang Introduction to Linear Algebra
Prob Blitzstein Introduction to Probability
Machine Learning ML Witten ML Beginner
ML Bishop ML Advanced
DL Bengio Deep Learning
RL Sutton Reinforcement Learning - An Introduction

Create a dataframe with multi-index

mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title)), \
             columns = ['Author', 'Title'], \
             index = mux)
df
Author Title
Topic Sub-Topic
Maths LA Strang Introduction to Linear Algebra
Prob Blitzstein Introduction to Probability
Machine Learning ML Witten ML Beginner
ML Bishop ML Advanced
DL Bengio Deep Learning
RL Sutton Reinforcement Learning - An Introduction

Filter based on index values

df.loc [index_val]
df.loc [(index1_val, index2_val)]

NOTE: A data frame is returned.

df.sort_index ()

  • to prevent “PerformanceWarning: indexing past lexsort depth may impact performance.”
#To prevent 
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df.sort_index (inplace = True)
print ('Books in Mathematics:')
print (df.loc ['Maths'], '\n\n')
print ('Books in Machine Learning:')
print (df.loc ['Machine Learning'], '\n\n')
print ('Books in Deep Learning:')
print (df.loc [('Machine Learning', 'DL')])
Books in Mathematics:
               Author                           Title
Sub-Topic                                            
LA             Strang  Introduction to Linear Algebra
Prob       Blitzstein     Introduction to Probability 


Books in Machine Learning:
           Author                                     Title
Sub-Topic                                                  
DL         Bengio                             Deep Learning
ML         Witten                               ML Beginner
ML         Bishop                               ML Advanced
RL         Sutton  Reinforcement Learning - An Introduction 


Books in Deep Learning:
                            Author          Title
Topic            Sub-Topic                       
Machine Learning DL         Bengio  Deep Learning

Set existing columns as indexes

Create a dataframe WITHOUT passing any index

Instead, pass index as regular columns - topic, and sub_topic

#mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title, topic, sub_topic)), \
             columns = ['Author', 'Title', 'Topic', 'Sub-Topic'], \
             #index = mux
                  )
df
Author Title Topic Sub-Topic
0 Strang Introduction to Linear Algebra Maths LA
1 Blitzstein Introduction to Probability Maths Prob
2 Witten ML Beginner Machine Learning ML
3 Bishop ML Advanced Machine Learning ML
4 Bengio Deep Learning Machine Learning DL
5 Sutton Reinforcement Learning - An Introduction Machine Learning RL

Create index from columns

df.set_index ([cols], inplace = True)

df.set_index (['Topic', 'Sub-Topic'], inplace = True)
df
Author Title
Topic Sub-Topic
Maths LA Strang Introduction to Linear Algebra
Prob Blitzstein Introduction to Probability
Machine Learning ML Witten ML Beginner
ML Bishop ML Advanced
DL Bengio Deep Learning
RL Sutton Reinforcement Learning - An Introduction

Instantiate an index and set it

df.set_index (pd.RangeIndex (start, stop, step), inplace = True

df.set_index (\
              pd.RangeIndex (start = 5, stop = 11, step = 1),\
              inplace = True)
df
Author Title
5 Strang Introduction to Linear Algebra
6 Blitzstein Introduction to Probability
7 Witten ML Beginner
8 Bishop ML Advanced
9 Bengio Deep Learning
10 Sutton Reinforcement Learning - An Introduction

df.groupby (col)

1. DataFrameGroupBy

df.groupby (col)
df.groupby ([cols])

  • returns an instance of DataFrameGroupBy
df = df_titanic
df.groupby (['Survived', 'Sex'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001CAEA053A30>

2. Summary Statistics

Perform a summary statistic on DataFrameGroupBy

  • mean () - for numerical columns
  • count () - for all columns
  • size ()
    • for numerical columns

NOTE:

  • stats are obviously not performed on groupby columns, since those column values are used to form groups
  • count returns individual column count of rows having some value

    count of columns like ‘Age’ and ‘Cabin’ indicate missing values

  • sum returns count of rows. some column having missing value does not impact, unless all columns have missing values.
print ('Information of the data frame:')
print (df.info (), '\n\n')
print ('Mean Statistics:')
print (df.groupby (['Survived', 'Sex']).mean (), '\n\n')
print ('Count Statistics:')
print (df.groupby (['Survived', 'Sex']).count (), '\n\n')
print ('Size Statistics:')
print (df.groupby (['Survived', 'Sex']).size ())
Information of the data frame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None 


Mean Statistics:
                 PassengerId    Pclass        Age     SibSp     Parch  \
Survived Sex                                                            
0        female   434.851852  2.851852  25.046875  1.209877  1.037037   
         male     449.121795  2.476496  31.618056  0.440171  0.207265   
1        female   429.699571  1.918455  28.847716  0.515021  0.515021   
         male     475.724771  2.018349  27.276022  0.385321  0.357798   

                      Fare  
Survived Sex                
0        female  23.024385  
         male    21.960993  
1        female  51.938573  
         male    40.821484   


Count Statistics:
                 PassengerId  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  \
Survived Sex                                                                  
0        female           81      81    81   64     81     81      81    81   
         male            468     468   468  360    468    468     468   468   
1        female          233     233   233  197    233    233     233   233   
         male            109     109   109   93    109    109     109   109   

                 Cabin  Embarked  
Survived Sex                      
0        female      6        81  
         male       62       468  
1        female     91       231  
         male       45       109   


Size Statistics:
Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
dtype: int64

> Series (with indexes) returned by groupby.summary_stat

NOTE:

  • df.groupby (col).size () returns a series, with an index
  • df.groupby ([cols]).size () returns a series with a multi-index

» groupby (col)

series

df.groupby ('Survived').size ()
Survived
0    549
1    342
dtype: int64

type = Series

type(df.groupby ('Survived').size ())
pandas.core.series.Series
index

of the series

df.groupby ('Survived').size ().index
Int64Index([0, 1], dtype='int64', name='Survived')

» groupby ([cols])

NOTE:

  • didn’t survive 549 = 81 + 468
  • survived 342 = 233 + 109

series

df.groupby (['Survived', 'Sex']).size ()
Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
dtype: int64

type = series

type (df.groupby (['Survived', 'Sex']).size ())
pandas.core.series.Series
multi-index

of the series

df.groupby (['Survived', 'Sex']).size ().index
MultiIndex([(0, 'female'),
            (0,   'male'),
            (1, 'female'),
            (1,   'male')],
           names=['Survived', 'Sex'])

3. Unstack the multi-index series into a data frame

You may wish to remove one of the features from the hierarchical index and form different columns with respect to that feature. You can do so using the unstack method.

NOTE: unstack () requires the feature to have unique values. This is because the values of the feature form column names. An index created using groupby () ensures that the index has unique values.

df.groupby (['Survived', 'Sex']).size ().unstack ()
Sex female male
Survived
0 81 468
1 233 109

NOTE:

  • didn’t survive 549 = 81 + 468
  • survived 342 = 233 + 109

The index column to unstack, is passed as parameter, can be given by name, or by position, 0 is the default.

df.groupby (['Survived', 'Sex']).size ().unstack ('Sex')
Sex female male
Survived
0 81 468
1 233 109

by default, the leaf of the hierarchical (multi) index is unstacked, leaving the rest of the index tree as the index of the returned dataframe

df.groupby (['Survived', 'Sex']).size ().unstack ().index
Int64Index([0, 1], dtype='int64', name='Survived')

the unstacked leaf of the index forms the columns of the returned dataframe

df.groupby (['Survived', 'Sex']).size ().unstack ().columns
Index(['female', 'male'], dtype='object', name='Sex')

mention the index column to unstack

NOTE

  • this time we have unstacked 'Survived' instead of 'Sex'
df.groupby (['Survived', 'Sex']).size ().unstack ('Survived')
Survived 0 1
Sex
female 81 233
male 468 109

7. Filter

Data creation

frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
df_books.head ()
Author Title Edition
0 Strang Introduction to Linear Algebra 1.0
1 Blitzstein Introduction to Probability NaN
2 Witten ML Beginner 2.0
3 Bishop ML Advanced NaN
4 Bengio Deep Learning 1.0

An example ‘filter’

df_books ['Edition'] == 1
0     True
1    False
2    False
3    False
4     True
5    False
Name: Edition, dtype: bool

Chaining format

df_books.Edition.eq (2)
0    False
1    False
2     True
3    False
4    False
5     True
Name: Edition, dtype: bool

Apply filter to a dataframe

Filter rows whose column value equals a specific value

filter = df_books ['Edition'] == 2
df_books [filter]
Author Title Edition
2 Witten ML Beginner 2.0
5 Sutton Reinforcement Learning - An Introduction 2.0
Filter rows whose column value does NOT equal a specific value - !=
filter = df_books.Edition != 2
df_books [filter]
Author Title Edition
0 Strang Introduction to Linear Algebra 1.0
1 Blitzstein Introduction to Probability NaN
3 Bishop ML Advanced NaN
4 Bengio Deep Learning 1.0

Filter rows whose column value is not NaN - df.col.notnull ()

filter = df_books.Edition.notnull ()
df_books [filter]
Author Title Edition
0 Strang Introduction to Linear Algebra 1.0
2 Witten ML Beginner 2.0
4 Bengio Deep Learning 1.0
5 Sutton Reinforcement Learning - An Introduction 2.0

Filter rows with column values among a set of values

df.colname.isin

authors_maths = ['Apostol', 'Strang', 'Blitzstein']
filter = df_books.Author.isin (authors_maths)
df_books [filter]
Author Title Edition
0 Strang Introduction to Linear Algebra 1.0
1 Blitzstein Introduction to Probability NaN
Filter rows with column values NOT among a set of values
~df

.colname.isin

filter = ~df_books.Author.isin (authors_maths)
df_books [filter]
Author Title Edition
2 Witten ML Beginner 2.0
3 Bishop ML Advanced NaN
4 Bengio Deep Learning 1.0
5 Sutton Reinforcement Learning - An Introduction 2.0

Filter rows based on multiple conditions

condition1 & condition2

filter = ~df_books.Author.isin (authors_maths)\
        & df_books.Edition.notnull ()
df_books [filter]
Author Title Edition
2 Witten ML Beginner 2.0
4 Bengio Deep Learning 1.0
5 Sutton Reinforcement Learning - An Introduction 2.0

Filter rows based on a series

s_title_learning = df_books ['Title'\
                            ].apply (lambda x: x.endswith ('Learning'))
df_books [s_title_learning]
Author Title Edition
4 Bengio Deep Learning 1.0

8. Selection

using .loc

  • selection using label-based location
  • selection using filter

df.loc [(index1_label, ...), [selected_cols]]

df.loc [[index_label1, index_label2, ...], [selected_cols]]

df.loc [filter, [selected_cols]]

NOTE: For the first parameter of .loc (), we pass index values. These values are the labels of the rows. Hence, .loc is label-based.

data creation - add index

frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
                                 names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books
Author Title Edition
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0
ML Witten ML Beginner 2.0
ML Bishop ML Advanced NaN
RL Sutton Reinforcement Learning - An Introduction 2.0
Maths LA Strang Introduction to Linear Algebra 1.0
Prob Blitzstein Introduction to Probability NaN

df.loc [index] - returns rows belonging to the index label

df_books.loc ['Machine Learning']
Author Title Edition
Sub-Topic
DL Bengio Deep Learning 1.0
ML Witten ML Beginner 2.0
ML Bishop ML Advanced NaN
RL Sutton Reinforcement Learning - An Introduction 2.0

df.loc [(index1, index2)] - returns rows belonging to the hierarchical index labels

df_books.loc [('Machine Learning', 'ML')]
Author Title Edition
Topic Sub-Topic
Machine Learning ML Witten ML Beginner 2.0
ML Bishop ML Advanced NaN

df.loc [[idx1_labl1, idx1_lab2]] - returns rows belonging to the multiple labels of the index root

df_books.loc [['Machine Learning', 'Maths']]
Author Title Edition
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0
ML Witten ML Beginner 2.0
ML Bishop ML Advanced NaN
RL Sutton Reinforcement Learning - An Introduction 2.0
Maths LA Strang Introduction to Linear Algebra 1.0
Prob Blitzstein Introduction to Probability NaN

df.loc [([idx1_lab1, idx1_lab2], [idx2_lab1]), [selected_cols]] - returns rows belonging to the cross-product labels of the nodes of the hierarchical index

df_books.loc [(['Machine Learning', 'Maths'],\
               ['LA', 'DL', 'RL']), ['Title']]
Title
Topic Sub-Topic
Machine Learning DL Deep Learning
RL Reinforcement Learning - An Introduction
Maths LA Introduction to Linear Algebra

df.loc [(index1, index2), col_name] - returns rows belonging to the label, and their single column

returns a series

df_books.loc [('Machine Learning', 'ML'), 'Title']
Topic             Sub-Topic
Machine Learning  ML           ML Beginner
                  ML           ML Advanced
Name: Title, dtype: object

df.loc [(index1, index2), [col_name] - returns rows belonging to the label, and their one or more columns

returns a data frame

df_books.loc [('Machine Learning', 'ML'), ['Title']]
Title
Topic Sub-Topic
Machine Learning ML ML Beginner
ML ML Advanced

df.loc [(index1, index2), [cols]] - returns rows belonging to the label, and their one or more columns

df_books.loc [('Machine Learning', 'ML'), ['Title', 'Edition']]
Title Edition
Topic Sub-Topic
Machine Learning ML ML Beginner 2.0
ML ML Advanced NaN

df.loc [filter] - returns rows satisfying the filter

filter = df_books ['Author'] == 'Sutton'
df_books.loc [filter]
Author Title Edition
Topic Sub-Topic
Machine Learning RL Sutton Reinforcement Learning - An Introduction 2.0

NOTE:

  • df [filter]

    and df.loc [filter] both return the same result.

  • But, df.loc provides selection of columns
df_books [filter]
Author Title Edition
Topic Sub-Topic
Machine Learning RL Sutton Reinforcement Learning - An Introduction 2.0
df_books.loc [filter, ['Edition', 'Author']]
Edition Author
Topic Sub-Topic
Machine Learning RL 2.0 Sutton

df.loc [series] - return rows based on a series

Create a series by applying a lambda on a column

s_title_learning = df_books ['Title'\
                            ].apply (lambda x: x.endswith ('Learning'))
df_books.loc [s_title_learning, ['Edition', 'Title']]
Edition Title
Topic Sub-Topic
Machine Learning DL 1.0 Deep Learning

using .iloc - integer-location-based location

df.iloc [row_range, col_range]

NOTE: For both the parameters of .iloc (), we pass location indices. Hence, .iloc is integer-location-based.

NOTE:

  • a single row or column is returned as a series

    have returned a dataframe by enclosing the index inside a []

  • multiple rows or columns are returned as a dataframe

df.iloc [int_location] returns a row as series

df_books.iloc [0]
Author            Bengio
Title      Deep Learning
Edition                1
Name: (Machine Learning, DL), dtype: object

df.iloc [[int_location]] returns a row as data frame

df_books.iloc [[0]]
Author Title Edition
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0

df.iloc [row_range] - returns multiple rows from the range

df_books.iloc [2:5]
Author Title Edition
Topic Sub-Topic
Machine Learning ML Bishop ML Advanced NaN
RL Sutton Reinforcement Learning - An Introduction 2.0
Maths LA Strang Introduction to Linear Algebra 1.0

df.iloc [row_range, col_range] - returns multiple rows and columns from the range

df_books.iloc [2:5, 1:3]
Title Edition
Topic Sub-Topic
Machine Learning ML ML Advanced NaN
RL Reinforcement Learning - An Introduction 2.0
Maths LA Introduction to Linear Algebra 1.0

df.iloc [row_range, [-1]] - returns multiple rows from the range, and the last column

df_books.iloc [2:5, [-1]]
Edition
Topic Sub-Topic
Machine Learning ML NaN
RL 2.0
Maths LA 1.0

get the first row

df_books.iloc [[0]]
Author Title Edition
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0

get the last row

df_books.iloc [[-1]]
Author Title Edition
Topic Sub-Topic
Maths Prob Blitzstein Introduction to Probability NaN

9. Update

Data Creation

frame = {'Author': s_author, 'Title': s_title,\
         'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
                                 names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books
Author Title Edition
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0
ML Witten ML Beginner 2.0
ML Bishop ML Advanced NaN
RL Sutton Reinforcement Learning - An Introduction 2.0
Maths LA Strang Introduction to Linear Algebra 1.0
Prob Blitzstein Introduction to Probability NaN

Add a ‘cost column

NOTE:

  • A series can be added externally only if the index of the series matches that of the data frame
  • This can be achieved by pd.Series (lst, index = )
cost_indexed = [30, np.NaN, 40, 25, np.NaN, 20]
df_books ['Cost'] = pd.Series (cost_indexed, index = df_books.index)
df_books
Author Title Edition Cost
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning 1.0 30.0
ML Witten ML Beginner 2.0 NaN
ML Bishop ML Advanced NaN 40.0
RL Sutton Reinforcement Learning - An Introduction 2.0 25.0
Maths LA Strang Introduction to Linear Algebra 1.0 NaN
Prob Blitzstein Introduction to Probability NaN 20.0

Using .loc

df_books.loc [('Machine Learning', ['DL']), ['Title']]\
= 'Deep Reinforcement Learning'
df_books
Author Title Edition Cost
Topic Sub-Topic
Machine Learning DL Bengio Deep Reinforcement Learning 1.0 30.0
ML Witten ML Beginner 2.0 NaN
ML Bishop ML Advanced NaN 40.0
RL Sutton Reinforcement Learning - An Introduction 2.0 25.0
Maths LA Strang Introduction to Linear Algebra 1.0 NaN
Prob Blitzstein Introduction to Probability NaN 20.0

fill missing values

fill with a default value

Note:

  • df[col].fillna () updates a column (not inplace) and returns it as a series. Pandas Philosophy: single row/column returned as series.
  • The returned series is assigned to a dataframe column in the LHS
df_books ['Edition'] = df_books ['Edition'].fillna ('0.0')
df_books
Author Title Edition Cost
Topic Sub-Topic
Machine Learning DL Bengio Deep Reinforcement Learning 1 30.0
ML Witten ML Beginner 2 NaN
ML Bishop ML Advanced 0.0 40.0
RL Sutton Reinforcement Learning - An Introduction 2 25.0
Maths LA Strang Introduction to Linear Algebra 1 NaN
Prob Blitzstein Introduction to Probability 0.0 20.0

fill with a stats value

groupby based on index
  • level =

    [l1, l2, ..]

df_books.groupby (level = [0])['Cost'].sum ()
Topic
Machine Learning    95.0
Maths               20.0
Name: Cost, dtype: float64
median value

of the group

df_books.groupby (level = [0])['Cost'].transform ('median')
Topic             Sub-Topic
Machine Learning  DL           30.0
                  ML           30.0
                  ML           30.0
                  RL           30.0
Maths             LA           20.0
                  Prob         20.0
Name: Cost, dtype: float64

missing values filled with median value of the group NOTE:

  • values updated for ‘ML Beginner’ and ‘Introduction to Linear Algebra
#series of median cost grouped by index level 0 ('Topic')
s_cost_median_by_group = df_books.groupby (level = [0])\
['Cost'].transform ('median')
#update the empty cells of 'Cost'
df_books ['Cost'].fillna (s_cost_median_by_group, inplace = True)
df_books
Author Title Edition Cost
Topic Sub-Topic
Machine Learning DL Bengio Deep Reinforcement Learning 1 30.0
ML Witten ML Beginner 2 30.0
ML Bishop ML Advanced 0.0 40.0
RL Sutton Reinforcement Learning - An Introduction 2 25.0
Maths LA Strang Introduction to Linear Algebra 1 20.0
Prob Blitzstein Introduction to Probability 0.0 20.0

Using columns

df_books ['Description'] = df_books ['Title'] + ' - by - '\
+ df_books ['Author']
df_books
Author Title Edition Cost Description
Topic Sub-Topic
Machine Learning DL Bengio Deep Reinforcement Learning 1 30.0 Deep Reinforcement Learning - by - Bengio
ML Witten ML Beginner 2 30.0 ML Beginner - by - Witten
ML Bishop ML Advanced 0.0 40.0 ML Advanced - by - Bishop
RL Sutton Reinforcement Learning - An Introduction 2 25.0 Reinforcement Learning - An Introduction - by ...
Maths LA Strang Introduction to Linear Algebra 1 20.0 Introduction to Linear Algebra - by - Strang
Prob Blitzstein Introduction to Probability 0.0 20.0 Introduction to Probability - by - Blitzstein

binning using pd.cut ()

binning - numeric to categorical

cost_boundaries = [0, 11, 21, 31, 41, np.inf]
cost_range = ['Free', 'Cheap', 'Affordable', \
              'Costly', 'Exorbitant']
df_books ['Cost Range'] = pd.cut \
(df_books ['Cost'], cost_boundaries, labels = cost_range)
df_books
Author Title Edition Cost Description Cost Range
Topic Sub-Topic
Machine Learning DL Bengio Deep Reinforcement Learning 1 30.0 Deep Reinforcement Learning - by - Bengio Affordable
ML Witten ML Beginner 2 30.0 ML Beginner - by - Witten Affordable
ML Bishop ML Advanced 0.0 40.0 ML Advanced - by - Bishop Costly
RL Sutton Reinforcement Learning - An Introduction 2 25.0 Reinforcement Learning - An Introduction - by ... Affordable
Maths LA Strang Introduction to Linear Algebra 1 20.0 Introduction to Linear Algebra - by - Strang Cheap
Prob Blitzstein Introduction to Probability 0.0 20.0 Introduction to Probability - by - Blitzstein Cheap