pandas Playground
Imports, and Data Fetch
imports
#imports
import pandas as pd
import numpy as np
read configuration file
#configuration
from read_config import Config
config = Config ()
data
NOTE: We make use of the following dataframes
- df_titanic read from the titanic dataset
- df_raw - an in-code sample dataframe to demonstrate csv file save, and some dataframe statistics
- df_books - an in-code sample dataframe - starting from sub-section 5. of section ‘DataFrame Playground’. It is used to show most of the features of the dataframe.
titanic
config.set_dataset_id ("titanic")
df_titanic = config.get_train_df ()
df_titanic.head (2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
DataFrame Playground
1. Read Data from CSV file into a DataFrame
import os
import pandas as pd
config.set_dataset_id ("titanic") #read titanic data
df_titanic = config.get_train_df ()
2. Create DataFrame from Dictionary
pd.DataFrame (dict, columns = , index = )raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, ".", "."],
'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df_raw = pd.DataFrame (raw_data, columns = ['first_name', 'last_name',\
'age', 'preTestScore',\
'postTestScore'])
3. Save DataFrame to CSV file
df.to_csv (csv_filename)df_raw.to_csv ("raw.csv")
4. DataFrame Statistics
Information
df.info ()df_raw.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 first_name 5 non-null object
1 last_name 5 non-null object
2 age 5 non-null int64
3 preTestScore 5 non-null object
4 postTestScore 5 non-null object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes
Information that indicates how many nulls in each column
E.g. ‘Age’ value is present only for 714 out of 891 rows. 177 rows do not have value. This count can be achieved by using the function df.isnull().sum() as shown further down.
df_titanic.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Column Names
NOTE: columns is an attribute that returns an index containing the column names. This attribute can also have a name.
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Shape
df.shapedf_raw.shape
(5, 5)
Display first few rows
df.head (n)df_raw.head (2)
first_name | last_name | age | preTestScore | postTestScore | |
---|---|---|---|---|---|
0 | Jason | Miller | 42 | 4 | 25,000 |
1 | Molly | Jacobson | 52 | 24 | 94,000 |
Find ‘unique’ values of a column
df[col].unique ()df_titanic ['Embarked'].unique ()
array(['S', 'C', 'Q', nan], dtype=object)
Subtotals, grouped by values of a column
value_counts ()returns a Series with group by values as index values.
df_titanic ['Survived'].value_counts ()
0 549
1 342
Name: Survived, dtype: int64
NOTE: ‘Survived’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Survived’.
print(df_titanic ['Survived'].value_counts ().index)
Int64Index([0, 1], dtype='int64')
Extract rows based on a filter
filter = df [col_filter] == value
df [filter]
df_titanic [df_titanic ['Survived'] == 1].head (2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Subtotals on extracted rows, grouped by values of a column
filter = df [col_filter] == value
df [filter][col_groupby].value_counts ()
print ('Survived:')
print (df_titanic [df_titanic \
['Survived'] == 1]['Sex'].value_counts ())
print ('\nDidn''t Survive:')
print (df_titanic [df_titanic \
['Survived'] == 0]['Sex'].value_counts ())
Survived:
female 233
male 109
Name: Sex, dtype: int64
Didnt Survive:
male 468
female 81
Name: Sex, dtype: int64
value_counts () returns a Series with group by values as index values.
NOTE: ‘Survived’ is used as a filter, and ‘Sex’ is used as the field, whose values will be grouped. Hence, the index is based on the unique values in ‘Sex’.
type(df_titanic [df_titanic \
['Survived'] == 0]['Sex'].value_counts ())
pandas.core.series.Series
Series has an index
df_titanic [df_titanic \
['Survived'] == 0]['Sex'].value_counts ().index
Index(['male', 'female'], dtype='object')
Null count subtotals
visualizing null values inside the dataframe.
df.isnull ()
df_titanic.isnull ()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | True | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | False | False | False | False | False | False | False | False | False | False | True | False |
887 | False | False | False | False | False | False | False | False | False | False | False | False |
888 | False | False | False | False | False | True | False | False | False | False | True | False |
889 | False | False | False | False | False | False | False | False | False | False | False | False |
890 | False | False | False | False | False | False | False | False | False | False | True | False |
891 rows × 12 columns
summary of null values df.isnull ().sum ()
df_titanic.isnull ().sum ()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
5. Create DataFrame (revisited)
Data creation
numpy for NaN
np.NaNimport numpy as np
Create some lists
author = ['Strang', 'Blitzstein', 'Witten', 'Bishop', 'Bengio', 'Sutton']
title = ['Introduction to Linear Algebra', 'Introduction to Probability', 'ML Beginner', 'ML Advanced', \
'Deep Learning', 'Reinforcement Learning - An Introduction']
edition = [1, np.NaN, 2, np.NaN, 1, 2]
cost = [10, 20, 15, 40, 30, 25]
topic = ['Maths', 'Maths', 'Machine Learning', 'Machine Learning', \
'Machine Learning', 'Machine Learning']
sub_topic = ['LA', 'Prob', 'ML', 'ML', 'DL', 'RL']
Create some Series
pd.Series (list)s_author = pd.Series (author)
s_title = pd.Series (title)
s_edition = pd.Series (edition)
s_cost = pd.Series (cost)
Series has an index. This series has a range index
s_author.index
RangeIndex(start=0, stop=6, step=1)
DataFrame from multiple lists
Incorrect way
pd.DataFrame ([author, title])
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | Strang | Blitzstein | Witten | Bishop | Bengio | Sutton |
1 | Introduction to Linear Algebra | Introduction to Probability | ML Beginner | ML Advanced | Deep Learning | Reinforcement Learning - An Introduction |
The reason authors and titles appear on different rows is that a list is treated as a column, and we provided a list of lists. So, the outer list of 2 elements (author_list and title_list) created the 2 rows, and, the inner list of authors and titles expanded column-wise.
list is treated as a column
pd.DataFrame (['a', 'b'])
0 | |
---|---|
0 | a |
1 | b |
Correct way
Zip and create a list of tuples. The elements of the list form the rows, and the elements of the enclosing tuple form the columns.
pd.DataFrame (list (zip (lst_1, lst_2)) , columns = , index = )
pd.DataFrame (list (zip (author, title)), columns = ['Author', 'Title'])
Author | Title | |
---|---|---|
0 | Strang | Introduction to Linear Algebra |
1 | Blitzstein | Introduction to Probability |
2 | Witten | ML Beginner |
3 | Bishop | ML Advanced |
4 | Bengio | Deep Learning |
5 | Sutton | Reinforcement Learning - An Introduction |
Zip and create a tuple of tuples. The elements of the outer tuple form the rows, and the elements of the enclosing tuple form the columns.
pd.DataFrame (tuple (zip (author, title)), columns = ['Author', 'Title'])
Author | Title | |
---|---|---|
0 | Strang | Introduction to Linear Algebra |
1 | Blitzstein | Introduction to Probability |
2 | Witten | ML Beginner |
3 | Bishop | ML Advanced |
4 | Bengio | Deep Learning |
5 | Sutton | Reinforcement Learning - An Introduction |
DataFrame from a dictionary, using ‘lists’ as values of keys
frame = {'Author': author, 'Title': title}
df_temp = pd.DataFrame (frame)
df_temp.head ()
Author | Title | |
---|---|---|
0 | Strang | Introduction to Linear Algebra |
1 | Blitzstein | Introduction to Probability |
2 | Witten | ML Beginner |
3 | Bishop | ML Advanced |
4 | Bengio | Deep Learning |
DataFrame from a dictionary, using ‘Series’ as values of keys
frame = {'Author': s_author, 'Title': s_title}
df_books = pd.DataFrame (frame)
df_books.head ()
Author | Title | |
---|---|---|
0 | Strang | Introduction to Linear Algebra |
1 | Blitzstein | Introduction to Probability |
2 | Witten | ML Beginner |
3 | Bishop | ML Advanced |
4 | Bengio | Deep Learning |
Add a ‘Series’ externally
NOTE:
- A series can be added externally only if the index of the series matches that of the data frame
df_books ['Edition'] = s_edition
df_books.head ()
Author | Title | Edition | |
---|---|---|---|
0 | Strang | Introduction to Linear Algebra | 1.0 |
1 | Blitzstein | Introduction to Probability | NaN |
2 | Witten | ML Beginner | 2.0 |
3 | Bishop | ML Advanced | NaN |
4 | Bengio | Deep Learning | 1.0 |
6. Index
Provide ‘index’
Create a custom index
idx = ['a', 'b', 'c', 'd', 'e', 'f']
Pass nameless indices as a parameter
NOTE:
- ‘index’ is the row index
- ‘columns’ is the column index
df = pd.DataFrame (tuple (zip (author, title)), \
columns = ['Author', 'Title'], \
index = idx)
df
Author | Title | |
---|---|---|
a | Strang | Introduction to Linear Algebra |
b | Blitzstein | Introduction to Probability |
c | Witten | ML Beginner |
d | Bishop | ML Advanced |
e | Bengio | Deep Learning |
f | Sutton | Reinforcement Learning - An Introduction |
Provide names for indices
use rename_axis () to rename index names
A single parameter implies row index.
rename_axis (row_index_name)
df2 = df.rename_axis ('Sr. No.')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name: Sr. No.
Column Index Name: None
Author | Title | |
---|---|---|
Sr. No. | ||
a | Strang | Introduction to Linear Algebra |
b | Blitzstein | Introduction to Probability |
c | Witten | ML Beginner |
d | Bishop | ML Advanced |
e | Bengio | Deep Learning |
f | Sutton | Reinforcement Learning - An Introduction |
Use axis to indicate row (= 0) or column (= 1) axis
rename_axis (col_axis_name, axis = 1)
df2 = df.rename_axis ('Attributes->', axis = 1)
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name: None
Column Index Name: Attributes->
Attributes-> | Author | Title |
---|---|---|
a | Strang | Introduction to Linear Algebra |
b | Blitzstein | Introduction to Probability |
c | Witten | ML Beginner |
d | Bishop | ML Advanced |
e | Bengio | Deep Learning |
f | Sutton | Reinforcement Learning - An Introduction |
‘axis’ values 0 and 1 also carry names “rows” and “columns” respectively.
df2 = df.rename_axis ('Attributes->', axis = "columns")
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name: None
Column Index Name: Attributes->
Attributes-> | Author | Title |
---|---|---|
a | Strang | Introduction to Linear Algebra |
b | Blitzstein | Introduction to Probability |
c | Witten | ML Beginner |
d | Bishop | ML Advanced |
e | Bengio | Deep Learning |
f | Sutton | Reinforcement Learning - An Introduction |
Rename both row and column axes denoted by ‘index’ for row index, and ‘columns’ for column attribute that returns an index.
rename_axis (index = row_axis_name, columns = col_axis_name)
df2 = df.rename_axis (index = 'Sr. No.', columns = 'Attributes->')
print ('Row Index Name: ', df2.index.name)
print ('Column Index Name: ', df2.columns.name)
df2
Row Index Name: Sr. No.
Column Index Name: Attributes->
Attributes-> | Author | Title |
---|---|---|
Sr. No. | ||
a | Strang | Introduction to Linear Algebra |
b | Blitzstein | Introduction to Probability |
c | Witten | ML Beginner |
d | Bishop | ML Advanced |
e | Bengio | Deep Learning |
f | Sutton | Reinforcement Learning - An Introduction |
Hierarchical Indices
Create the multi-index
pd.MultiIndex.from_product ()
- list of list of indices
- list of names of indices
NOTE: The list of indices are of different size: There are 3 topics, and 2 sub-topics. So, we have to use from_product () to perform the cross product of the two indices. This creates a hierarchy. If we just wanted two row indices without them being hierarchical, the indices list should be of the same size, and we need to use from_arrays () instead.
i_topic = ['Mathematics', 'Machine Learning', 'DL/RL']
i_subtopic = [1, 2]
i_names = ['Topic', 'Sub-Topic']
mux = pd.MultiIndex.from_product ([i_topic, i_subtopic], names = i_names)
mux
MultiIndex([( 'Mathematics', 1),
( 'Mathematics', 2),
('Machine Learning', 1),
('Machine Learning', 2),
( 'DL/RL', 1),
( 'DL/RL', 2)],
names=['Topic', 'Sub-Topic'])
Create the dataframe with multi-index
pd.DataFrame (list, columns = , index = pd.MultiIndex ().from_#)
pd.DataFrame (list (zip (author, title)), \
columns = ['Author', 'Title'], \
index = mux)
Author | Title | ||
---|---|---|---|
Topic | Sub-Topic | ||
Mathematics | 1 | Strang | Introduction to Linear Algebra |
2 | Blitzstein | Introduction to Probability | |
Machine Learning | 1 | Witten | ML Beginner |
2 | Bishop | ML Advanced | |
DL/RL | 1 | Bengio | Deep Learning |
2 | Sutton | Reinforcement Learning - An Introduction |
Create a multi-index using from_arrays ()
mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
mux
MultiIndex([( 'Maths', 'LA'),
( 'Maths', 'Prob'),
('Machine Learning', 'ML'),
('Machine Learning', 'ML'),
('Machine Learning', 'DL'),
('Machine Learning', 'RL')],
names=['Topic', 'Sub-Topic'])
Create a data frame with multi-index (created using from_arrays ())
pd.DataFrame (list (zip (author, title)), \
columns = ['Author', 'Title'], \
index = mux)
Author | Title | ||
---|---|---|---|
Topic | Sub-Topic | ||
Maths | LA | Strang | Introduction to Linear Algebra |
Prob | Blitzstein | Introduction to Probability | |
Machine Learning | ML | Witten | ML Beginner |
ML | Bishop | ML Advanced | |
DL | Bengio | Deep Learning | |
RL | Sutton | Reinforcement Learning - An Introduction |
Navigating using indices
Create a dataframe with multi-index
mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title)), \
columns = ['Author', 'Title'], \
index = mux)
df
Author | Title | ||
---|---|---|---|
Topic | Sub-Topic | ||
Maths | LA | Strang | Introduction to Linear Algebra |
Prob | Blitzstein | Introduction to Probability | |
Machine Learning | ML | Witten | ML Beginner |
ML | Bishop | ML Advanced | |
DL | Bengio | Deep Learning | |
RL | Sutton | Reinforcement Learning - An Introduction |
Filter based on index values
df.loc [index_val]
df.loc [(index1_val, index2_val)]
NOTE: A data frame is returned.
df.sort_index ()
- to prevent “PerformanceWarning: indexing past lexsort depth may impact performance.”
#To prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df.sort_index (inplace = True)
print ('Books in Mathematics:')
print (df.loc ['Maths'], '\n\n')
print ('Books in Machine Learning:')
print (df.loc ['Machine Learning'], '\n\n')
print ('Books in Deep Learning:')
print (df.loc [('Machine Learning', 'DL')])
Books in Mathematics:
Author Title
Sub-Topic
LA Strang Introduction to Linear Algebra
Prob Blitzstein Introduction to Probability
Books in Machine Learning:
Author Title
Sub-Topic
DL Bengio Deep Learning
ML Witten ML Beginner
ML Bishop ML Advanced
RL Sutton Reinforcement Learning - An Introduction
Books in Deep Learning:
Author Title
Topic Sub-Topic
Machine Learning DL Bengio Deep Learning
Set existing columns as indexes
Create a dataframe WITHOUT passing any index
Instead, pass index as regular columns - topic, and sub_topic
#mux = pd.MultiIndex.from_arrays ([topic, sub_topic], names = i_names)
df = pd.DataFrame (list (zip (author, title, topic, sub_topic)), \
columns = ['Author', 'Title', 'Topic', 'Sub-Topic'], \
#index = mux
)
df
Author | Title | Topic | Sub-Topic | |
---|---|---|---|---|
0 | Strang | Introduction to Linear Algebra | Maths | LA |
1 | Blitzstein | Introduction to Probability | Maths | Prob |
2 | Witten | ML Beginner | Machine Learning | ML |
3 | Bishop | ML Advanced | Machine Learning | ML |
4 | Bengio | Deep Learning | Machine Learning | DL |
5 | Sutton | Reinforcement Learning - An Introduction | Machine Learning | RL |
Create index from columns
df.set_index ([cols], inplace = True)
df.set_index (['Topic', 'Sub-Topic'], inplace = True)
df
Author | Title | ||
---|---|---|---|
Topic | Sub-Topic | ||
Maths | LA | Strang | Introduction to Linear Algebra |
Prob | Blitzstein | Introduction to Probability | |
Machine Learning | ML | Witten | ML Beginner |
ML | Bishop | ML Advanced | |
DL | Bengio | Deep Learning | |
RL | Sutton | Reinforcement Learning - An Introduction |
Instantiate an index and set it
df.set_index (pd.RangeIndex (start, stop, step), inplace = True
df.set_index (\
pd.RangeIndex (start = 5, stop = 11, step = 1),\
inplace = True)
df
Author | Title | |
---|---|---|
5 | Strang | Introduction to Linear Algebra |
6 | Blitzstein | Introduction to Probability |
7 | Witten | ML Beginner |
8 | Bishop | ML Advanced |
9 | Bengio | Deep Learning |
10 | Sutton | Reinforcement Learning - An Introduction |
df.groupby (col)
1. DataFrameGroupBy
df.groupby (col)
df.groupby ([cols])
- returns an instance of DataFrameGroupBy
df = df_titanic
df.groupby (['Survived', 'Sex'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001CAEA053A30>
2. Summary Statistics
Perform a summary statistic on DataFrameGroupBy
- mean () - for numerical columns
- count () - for all columns
-
size ()
- for numerical columns
NOTE:
- stats are obviously not performed on groupby columns, since those column values are used to form groups
- count returns individual column count of rows having some value
count of columns like ‘Age’ and ‘Cabin’ indicate missing values
- sum returns count of rows. some column having missing value does not impact, unless all columns have missing values.
print ('Information of the data frame:')
print (df.info (), '\n\n')
print ('Mean Statistics:')
print (df.groupby (['Survived', 'Sex']).mean (), '\n\n')
print ('Count Statistics:')
print (df.groupby (['Survived', 'Sex']).count (), '\n\n')
print ('Size Statistics:')
print (df.groupby (['Survived', 'Sex']).size ())
Information of the data frame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
Mean Statistics:
PassengerId Pclass Age SibSp Parch \
Survived Sex
0 female 434.851852 2.851852 25.046875 1.209877 1.037037
male 449.121795 2.476496 31.618056 0.440171 0.207265
1 female 429.699571 1.918455 28.847716 0.515021 0.515021
male 475.724771 2.018349 27.276022 0.385321 0.357798
Fare
Survived Sex
0 female 23.024385
male 21.960993
1 female 51.938573
male 40.821484
Count Statistics:
PassengerId Pclass Name Age SibSp Parch Ticket Fare \
Survived Sex
0 female 81 81 81 64 81 81 81 81
male 468 468 468 360 468 468 468 468
1 female 233 233 233 197 233 233 233 233
male 109 109 109 93 109 109 109 109
Cabin Embarked
Survived Sex
0 female 6 81
male 62 468
1 female 91 231
male 45 109
Size Statistics:
Survived Sex
0 female 81
male 468
1 female 233
male 109
dtype: int64
> Series (with indexes) returned by groupby.summary_stat
NOTE:
- df.groupby (col).size () returns a series, with an index
- df.groupby ([cols]).size () returns a series with a multi-index
» groupby (col)
series
df.groupby ('Survived').size ()
Survived
0 549
1 342
dtype: int64
type = Series
type(df.groupby ('Survived').size ())
pandas.core.series.Series
of the series
df.groupby ('Survived').size ().index
Int64Index([0, 1], dtype='int64', name='Survived')
» groupby ([cols])
NOTE:
- didn’t survive 549 = 81 + 468
- survived 342 = 233 + 109
series
df.groupby (['Survived', 'Sex']).size ()
Survived Sex
0 female 81
male 468
1 female 233
male 109
dtype: int64
type = series
type (df.groupby (['Survived', 'Sex']).size ())
pandas.core.series.Series
of the series
df.groupby (['Survived', 'Sex']).size ().index
MultiIndex([(0, 'female'),
(0, 'male'),
(1, 'female'),
(1, 'male')],
names=['Survived', 'Sex'])
3. Unstack the multi-index series into a data frame
You may wish to remove one of the features from the hierarchical index and form different columns with respect to that feature. You can do so using the unstack method.
NOTE: unstack () requires the feature to have unique values. This is because the values of the feature form column names. An index created using groupby () ensures that the index has unique values.
df.groupby (['Survived', 'Sex']).size ().unstack ()
Sex | female | male |
---|---|---|
Survived | ||
0 | 81 | 468 |
1 | 233 | 109 |
NOTE:
- didn’t survive 549 = 81 + 468
- survived 342 = 233 + 109
The index column to unstack, is passed as parameter, can be given by name, or by position, 0 is the default.
df.groupby (['Survived', 'Sex']).size ().unstack ('Sex')
Sex | female | male |
---|---|---|
Survived | ||
0 | 81 | 468 |
1 | 233 | 109 |
by default, the leaf of the hierarchical (multi) index is unstacked, leaving the rest of the index tree as the index of the returned dataframe
df.groupby (['Survived', 'Sex']).size ().unstack ().index
Int64Index([0, 1], dtype='int64', name='Survived')
the unstacked leaf of the index forms the columns of the returned dataframe
df.groupby (['Survived', 'Sex']).size ().unstack ().columns
Index(['female', 'male'], dtype='object', name='Sex')
mention the index column to unstack
NOTE
- this time we have unstacked 'Survived' instead of 'Sex'
df.groupby (['Survived', 'Sex']).size ().unstack ('Survived')
Survived | 0 | 1 |
---|---|---|
Sex | ||
female | 81 | 233 |
male | 468 | 109 |
7. Filter
Data creation
frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
df_books.head ()
Author | Title | Edition | |
---|---|---|---|
0 | Strang | Introduction to Linear Algebra | 1.0 |
1 | Blitzstein | Introduction to Probability | NaN |
2 | Witten | ML Beginner | 2.0 |
3 | Bishop | ML Advanced | NaN |
4 | Bengio | Deep Learning | 1.0 |
An example ‘filter’
df_books ['Edition'] == 1
0 True
1 False
2 False
3 False
4 True
5 False
Name: Edition, dtype: bool
Chaining format
df_books.Edition.eq (2)
0 False
1 False
2 True
3 False
4 False
5 True
Name: Edition, dtype: bool
Apply filter to a dataframe
Filter rows whose column value equals a specific value
filter = df_books ['Edition'] == 2
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
2 | Witten | ML Beginner | 2.0 |
5 | Sutton | Reinforcement Learning - An Introduction | 2.0 |
Filter rows whose column value does NOT equal a specific value - !=
filter = df_books.Edition != 2
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
0 | Strang | Introduction to Linear Algebra | 1.0 |
1 | Blitzstein | Introduction to Probability | NaN |
3 | Bishop | ML Advanced | NaN |
4 | Bengio | Deep Learning | 1.0 |
Filter rows whose column value is not NaN - df.col.notnull ()
filter = df_books.Edition.notnull ()
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
0 | Strang | Introduction to Linear Algebra | 1.0 |
2 | Witten | ML Beginner | 2.0 |
4 | Bengio | Deep Learning | 1.0 |
5 | Sutton | Reinforcement Learning - An Introduction | 2.0 |
Filter rows with column values among a set of values
df.colname.isin
authors_maths = ['Apostol', 'Strang', 'Blitzstein']
filter = df_books.Author.isin (authors_maths)
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
0 | Strang | Introduction to Linear Algebra | 1.0 |
1 | Blitzstein | Introduction to Probability | NaN |
Filter rows with column values NOT among a set of values
~df.colname.isin
filter = ~df_books.Author.isin (authors_maths)
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
2 | Witten | ML Beginner | 2.0 |
3 | Bishop | ML Advanced | NaN |
4 | Bengio | Deep Learning | 1.0 |
5 | Sutton | Reinforcement Learning - An Introduction | 2.0 |
Filter rows based on multiple conditions
condition1 & condition2
filter = ~df_books.Author.isin (authors_maths)\
& df_books.Edition.notnull ()
df_books [filter]
Author | Title | Edition | |
---|---|---|---|
2 | Witten | ML Beginner | 2.0 |
4 | Bengio | Deep Learning | 1.0 |
5 | Sutton | Reinforcement Learning - An Introduction | 2.0 |
Filter rows based on a series
s_title_learning = df_books ['Title'\
].apply (lambda x: x.endswith ('Learning'))
df_books [s_title_learning]
Author | Title | Edition | |
---|---|---|---|
4 | Bengio | Deep Learning | 1.0 |
8. Selection
using .loc
- selection using label-based location
- selection using filter
df.loc [(index1_label, ...), [selected_cols]]
df.loc [[index_label1, index_label2, ...], [selected_cols]]
df.loc [filter, [selected_cols]]
NOTE: For the first parameter of .loc (), we pass index values. These values are the labels of the rows. Hence, .loc is label-based.
data creation - add index
frame = {'Author': s_author, 'Title': s_title, 'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | DL | Bengio | Deep Learning | 1.0 |
ML | Witten | ML Beginner | 2.0 | |
ML | Bishop | ML Advanced | NaN | |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 |
Prob | Blitzstein | Introduction to Probability | NaN |
df.loc [index] - returns rows belonging to the index label
df_books.loc ['Machine Learning']
Author | Title | Edition | |
---|---|---|---|
Sub-Topic | |||
DL | Bengio | Deep Learning | 1.0 |
ML | Witten | ML Beginner | 2.0 |
ML | Bishop | ML Advanced | NaN |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 |
df.loc [(index1, index2)] - returns rows belonging to the hierarchical index labels
df_books.loc [('Machine Learning', 'ML')]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | ML | Witten | ML Beginner | 2.0 |
ML | Bishop | ML Advanced | NaN |
df.loc [[idx1_labl1, idx1_lab2]] - returns rows belonging to the multiple labels of the index root
df_books.loc [['Machine Learning', 'Maths']]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | DL | Bengio | Deep Learning | 1.0 |
ML | Witten | ML Beginner | 2.0 | |
ML | Bishop | ML Advanced | NaN | |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 |
Prob | Blitzstein | Introduction to Probability | NaN |
df.loc [([idx1_lab1, idx1_lab2], [idx2_lab1]), [selected_cols]] - returns rows belonging to the cross-product labels of the nodes of the hierarchical index
df_books.loc [(['Machine Learning', 'Maths'],\
['LA', 'DL', 'RL']), ['Title']]
Title | ||
---|---|---|
Topic | Sub-Topic | |
Machine Learning | DL | Deep Learning |
RL | Reinforcement Learning - An Introduction | |
Maths | LA | Introduction to Linear Algebra |
df.loc [(index1, index2), col_name] - returns rows belonging to the label, and their single column
returns a series
df_books.loc [('Machine Learning', 'ML'), 'Title']
Topic Sub-Topic
Machine Learning ML ML Beginner
ML ML Advanced
Name: Title, dtype: object
df.loc [(index1, index2), [col_name] - returns rows belonging to the label, and their one or more columns
returns a data frame
df_books.loc [('Machine Learning', 'ML'), ['Title']]
Title | ||
---|---|---|
Topic | Sub-Topic | |
Machine Learning | ML | ML Beginner |
ML | ML Advanced |
df.loc [(index1, index2), [cols]] - returns rows belonging to the label, and their one or more columns
df_books.loc [('Machine Learning', 'ML'), ['Title', 'Edition']]
Title | Edition | ||
---|---|---|---|
Topic | Sub-Topic | ||
Machine Learning | ML | ML Beginner | 2.0 |
ML | ML Advanced | NaN |
df.loc [filter] - returns rows satisfying the filter
filter = df_books ['Author'] == 'Sutton'
df_books.loc [filter]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | RL | Sutton | Reinforcement Learning - An Introduction | 2.0 |
NOTE:
-
df [filter]
and df.loc [filter] both return the same result.
- But, df.loc provides selection of columns
df_books [filter]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | RL | Sutton | Reinforcement Learning - An Introduction | 2.0 |
df_books.loc [filter, ['Edition', 'Author']]
Edition | Author | ||
---|---|---|---|
Topic | Sub-Topic | ||
Machine Learning | RL | 2.0 | Sutton |
df.loc [series] - return rows based on a series
Create a series by applying a lambda on a column
s_title_learning = df_books ['Title'\
].apply (lambda x: x.endswith ('Learning'))
df_books.loc [s_title_learning, ['Edition', 'Title']]
Edition | Title | ||
---|---|---|---|
Topic | Sub-Topic | ||
Machine Learning | DL | 1.0 | Deep Learning |
using .iloc - integer-location-based location
df.iloc [row_range, col_range]
NOTE: For both the parameters of .iloc (), we pass location indices. Hence, .iloc is integer-location-based.
NOTE:
- a single row or column is returned as a series
have returned a dataframe by enclosing the index inside a []
- multiple rows or columns are returned as a dataframe
df.iloc [int_location] returns a row as series
df_books.iloc [0]
Author Bengio
Title Deep Learning
Edition 1
Name: (Machine Learning, DL), dtype: object
df.iloc [[int_location]] returns a row as data frame
df_books.iloc [[0]]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | DL | Bengio | Deep Learning | 1.0 |
df.iloc [row_range] - returns multiple rows from the range
df_books.iloc [2:5]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | ML | Bishop | ML Advanced | NaN |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 |
df.iloc [row_range, col_range] - returns multiple rows and columns from the range
df_books.iloc [2:5, 1:3]
Title | Edition | ||
---|---|---|---|
Topic | Sub-Topic | ||
Machine Learning | ML | ML Advanced | NaN |
RL | Reinforcement Learning - An Introduction | 2.0 | |
Maths | LA | Introduction to Linear Algebra | 1.0 |
df.iloc [row_range, [-1]] - returns multiple rows from the range, and the last column
df_books.iloc [2:5, [-1]]
Edition | ||
---|---|---|
Topic | Sub-Topic | |
Machine Learning | ML | NaN |
RL | 2.0 | |
Maths | LA | 1.0 |
get the first row
df_books.iloc [[0]]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | DL | Bengio | Deep Learning | 1.0 |
get the last row
df_books.iloc [[-1]]
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Maths | Prob | Blitzstein | Introduction to Probability | NaN |
9. Update
Data Creation
frame = {'Author': s_author, 'Title': s_title,\
'Edition': s_edition}
df_books = pd.DataFrame (frame)
mux = pd.MultiIndex.from_arrays ([topic, sub_topic],\
names = ['Topic', 'Sub-Topic'])
df_books = df_books.set_index (mux)
#to prevent
#"PerformanceWarning: indexing past lexsort depth may impact performance."
df_books.sort_index (inplace = True)
df_books
Author | Title | Edition | ||
---|---|---|---|---|
Topic | Sub-Topic | |||
Machine Learning | DL | Bengio | Deep Learning | 1.0 |
ML | Witten | ML Beginner | 2.0 | |
ML | Bishop | ML Advanced | NaN | |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 |
Prob | Blitzstein | Introduction to Probability | NaN |
Add a ‘cost column
NOTE:
- A series can be added externally only if the index of the series matches that of the data frame
- This can be achieved by pd.Series (lst, index = )
cost_indexed = [30, np.NaN, 40, 25, np.NaN, 20]
df_books ['Cost'] = pd.Series (cost_indexed, index = df_books.index)
df_books
Author | Title | Edition | Cost | ||
---|---|---|---|---|---|
Topic | Sub-Topic | ||||
Machine Learning | DL | Bengio | Deep Learning | 1.0 | 30.0 |
ML | Witten | ML Beginner | 2.0 | NaN | |
ML | Bishop | ML Advanced | NaN | 40.0 | |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | 25.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 | NaN |
Prob | Blitzstein | Introduction to Probability | NaN | 20.0 |
Using .loc
df_books.loc [('Machine Learning', ['DL']), ['Title']]\
= 'Deep Reinforcement Learning'
df_books
Author | Title | Edition | Cost | ||
---|---|---|---|---|---|
Topic | Sub-Topic | ||||
Machine Learning | DL | Bengio | Deep Reinforcement Learning | 1.0 | 30.0 |
ML | Witten | ML Beginner | 2.0 | NaN | |
ML | Bishop | ML Advanced | NaN | 40.0 | |
RL | Sutton | Reinforcement Learning - An Introduction | 2.0 | 25.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1.0 | NaN |
Prob | Blitzstein | Introduction to Probability | NaN | 20.0 |
fill missing values
fill with a default value
Note:
- df[col].fillna () updates a column (not inplace) and returns it as a series. Pandas Philosophy: single row/column returned as series.
- The returned series is assigned to a dataframe column in the LHS
df_books ['Edition'] = df_books ['Edition'].fillna ('0.0')
df_books
Author | Title | Edition | Cost | ||
---|---|---|---|---|---|
Topic | Sub-Topic | ||||
Machine Learning | DL | Bengio | Deep Reinforcement Learning | 1 | 30.0 |
ML | Witten | ML Beginner | 2 | NaN | |
ML | Bishop | ML Advanced | 0.0 | 40.0 | |
RL | Sutton | Reinforcement Learning - An Introduction | 2 | 25.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1 | NaN |
Prob | Blitzstein | Introduction to Probability | 0.0 | 20.0 |
fill with a stats value
groupby based on index
- level =
[l1, l2, ..]
df_books.groupby (level = [0])['Cost'].sum ()
Topic
Machine Learning 95.0
Maths 20.0
Name: Cost, dtype: float64
median valueof the group
df_books.groupby (level = [0])['Cost'].transform ('median')
Topic Sub-Topic
Machine Learning DL 30.0
ML 30.0
ML 30.0
RL 30.0
Maths LA 20.0
Prob 20.0
Name: Cost, dtype: float64
missing values filled with median value of the group NOTE:
- values updated for ‘ML Beginner’ and ‘Introduction to Linear Algebra’
#series of median cost grouped by index level 0 ('Topic')
s_cost_median_by_group = df_books.groupby (level = [0])\
['Cost'].transform ('median')
#update the empty cells of 'Cost'
df_books ['Cost'].fillna (s_cost_median_by_group, inplace = True)
df_books
Author | Title | Edition | Cost | ||
---|---|---|---|---|---|
Topic | Sub-Topic | ||||
Machine Learning | DL | Bengio | Deep Reinforcement Learning | 1 | 30.0 |
ML | Witten | ML Beginner | 2 | 30.0 | |
ML | Bishop | ML Advanced | 0.0 | 40.0 | |
RL | Sutton | Reinforcement Learning - An Introduction | 2 | 25.0 | |
Maths | LA | Strang | Introduction to Linear Algebra | 1 | 20.0 |
Prob | Blitzstein | Introduction to Probability | 0.0 | 20.0 |
Using columns
df_books ['Description'] = df_books ['Title'] + ' - by - '\
+ df_books ['Author']
df_books
Author | Title | Edition | Cost | Description | ||
---|---|---|---|---|---|---|
Topic | Sub-Topic | |||||
Machine Learning | DL | Bengio | Deep Reinforcement Learning | 1 | 30.0 | Deep Reinforcement Learning - by - Bengio |
ML | Witten | ML Beginner | 2 | 30.0 | ML Beginner - by - Witten | |
ML | Bishop | ML Advanced | 0.0 | 40.0 | ML Advanced - by - Bishop | |
RL | Sutton | Reinforcement Learning - An Introduction | 2 | 25.0 | Reinforcement Learning - An Introduction - by ... | |
Maths | LA | Strang | Introduction to Linear Algebra | 1 | 20.0 | Introduction to Linear Algebra - by - Strang |
Prob | Blitzstein | Introduction to Probability | 0.0 | 20.0 | Introduction to Probability - by - Blitzstein |
binning using pd.cut ()
binning - numeric to categorical
cost_boundaries = [0, 11, 21, 31, 41, np.inf]
cost_range = ['Free', 'Cheap', 'Affordable', \
'Costly', 'Exorbitant']
df_books ['Cost Range'] = pd.cut \
(df_books ['Cost'], cost_boundaries, labels = cost_range)
df_books
Author | Title | Edition | Cost | Description | Cost Range | ||
---|---|---|---|---|---|---|---|
Topic | Sub-Topic | ||||||
Machine Learning | DL | Bengio | Deep Reinforcement Learning | 1 | 30.0 | Deep Reinforcement Learning - by - Bengio | Affordable |
ML | Witten | ML Beginner | 2 | 30.0 | ML Beginner - by - Witten | Affordable | |
ML | Bishop | ML Advanced | 0.0 | 40.0 | ML Advanced - by - Bishop | Costly | |
RL | Sutton | Reinforcement Learning - An Introduction | 2 | 25.0 | Reinforcement Learning - An Introduction - by ... | Affordable | |
Maths | LA | Strang | Introduction to Linear Algebra | 1 | 20.0 | Introduction to Linear Algebra - by - Strang | Cheap |
Prob | Blitzstein | Introduction to Probability | 0.0 | 20.0 | Introduction to Probability - by - Blitzstein | Cheap |