My Data Scientist Journey — Pandas
After learning the basics of NumPy last week, I decided to start learning Pandas!
The chapter covering Pandas was very in-depth so I won't include everything here, just the topics I felt were important.
Pandas Objects
Pandas objects are just enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than integer indices. The three fundamental Pandas data structures are the Series, DataFrame, and Index.
Pandas Series Object
A Pandas Series object is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([0.25,0.5,0.75,1.0])
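For example (a minimal sketch; the snippets in this post assume import pandas as pd and import numpy as np throughout):
import pandas as pd
import numpy as np
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data.values # the underlying NumPy array
data.index # RangeIndex(start=0, stop=4, step=1)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) # explicit labels
data['b'] # 0.5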
The Pandas DataFrame Object
These can be thought of as a specialization of a Python dictionary. The DataFrame object can be created from Python dictionaries:
data = pd.DataFrame({dictionary})
data.index #returns index labels
data.columns #returns an Index object holding column labels
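As a quick sketch (the state figures below are just made-up example values):
population = pd.Series({'California': 39500000, 'Texas': 29000000})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})
states.index # Index(['California', 'Texas'], dtype='object')
states.columns # Index(['population', 'area'], dtype='object')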
DataFrame as a specialized dictionary
A DataFrame maps a column name to a Series of column data.
data['column'] # returns the Series of data in that column
The Pandas Index Object
The Index object can be thought of as an immutable array. You can construct one using, for example:
ind = pd.Index([2, 3, 5, 7, 11])
Index as immutable array
You can use standard Python indexing notation to retrieve values or slices:
ind[1]
ind[::2]
These objects are immutable, meaning they cannot be modified in place.
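For example (the exact error message can vary between versions):
ind = pd.Index([2, 3, 5, 7, 11])
ind[1] = 0 # raises TypeError: Index does not support mutable operations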
Data Indexing and Selection
The indexing methods are very similar to NumPy arrays.
Data Selection in Series
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a Python dictionary.
Indexers: loc, iloc and ix
The loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
data.iloc[1:3]
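The distinction matters most when the explicit index itself uses integers; a small sketch (note that the hybrid ix indexer has since been deprecated and removed from Pandas, so loc and iloc are the ones to rely on):
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.loc[1] # 'a', the explicit label 1
data.iloc[1] # 'b', the implicit position 1
data.iloc[1:3] # positions 1 and 2, i.e. 'b' and 'c'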
Operating on Data in Pandas
Mapping between Python operators and Pandas methods:
Python operator → Pandas method(s)
+ → .add()
- → .sub(), .subtract()
* → .mul(), .multiply()
/ → .truediv(), .div(), .divide()
// → .floordiv()
% → .mod()
** → .pow()
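The method forms are handy because they accept extra arguments such as fill_value; a sketch:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B # NaN wherever the indices don't overlap
A.add(B, fill_value=0) # treats the missing entries as 0 instead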
Handling Missing Data
Lots of real-world data isn't as neat as what's presented in tutorials or books; it often includes mislabeled or missing data.
Trade-Offs in Missing Data Conventions
Different languages use different labels and ways to handle missing data.
Missing Data in Pandas
Pandas handles missing data using two sentinel values that already exist in Python: the special floating-point NaN value and the Python None object.
NaN: Missing numerical data
NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Because it is a native float, NumPy can perform fast, compiled operations on arrays containing NaN, whereas arrays containing None fall back to the object dtype and operate at the slower Python level. Any operation performed on NaN results in another NaN.
NumPy has some special aggregations that ignore NaN:
np.nansum()
np.nanmin()
np.nanmax()
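A quick sketch of the difference:
vals = np.array([1, np.nan, 3, 4])
vals.sum() # nan, because NaN propagates through ordinary aggregates
np.nansum(vals) # 8.0, the NaN-aware version ignores the missing value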
Operating on Null values
There are several useful methods for detecting, removing, and replacing null values in Pandas data structures:
isnull() #generates a boolean mask indicating missing values
notnull() #opposite of isnull()
dropna() #return a filtered version of the data
fillna() # return a copy of the data with missing values filled or imputed
Dropping null values
These methods cannot drop single NA values from a DataFrame; only full rows or columns can be dropped. By default, dropna() drops any row containing a null value; passing axis='columns' (or axis=1) drops columns instead, and the how and thresh parameters give finer control.
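For example:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df.dropna() # keeps only the fully populated middle row
df.dropna(axis='columns') # keeps only the last column, the one with no NA
df.dropna(how='all') # drops a row only if every value in it is NA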
Filling null values
Rather than dropping NA values, sometimes you'd rather replace them. The replacement can be a single valid value, or some sort of imputation or interpolation from the good values.
data.fillna(0) # fills NaN with 0's
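Besides a constant, you can propagate neighboring values forward or backward (a sketch; older Pandas spelled these as fillna(method='ffill') and fillna(method='bfill')):
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data.fillna(0) # replace missing values with 0
data.ffill() # forward-fill: carry the previous value forward
data.bfill() # back-fill: carry the next value backward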
Hierarchical Indexing
Often it is useful to go beyond one-dimensional and two-dimensional data. A common pattern to address higher dimensional data is hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
Pandas MultiIndex
You can create multiple levels of index with:
index = pd.MultiIndex.from_tuples(index) # index is a predefined list of tuples
You can create these multiple index levels from an ordinary DataFrame using the stack() method:
df.stack()
You can turn multiply indexed data back into a DataFrame using:
data.unstack()
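A minimal sketch (the population figures are made-up example values, in millions):
index = pd.MultiIndex.from_tuples([('California', 2010), ('California', 2020),
                                   ('Texas', 2010), ('Texas', 2020)])
pop = pd.Series([37.3, 39.5, 25.1, 29.1], index=index)
df = pop.unstack() # DataFrame: states as rows, years as columns
df.stack() # back to a multiply indexed Series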
Data Aggregations of Multi-Indices
Pandas has built-in aggregation methods like mean(), sum(), and max(). With hierarchical indexed data you can pass a level parameter that controls which subset of the data the aggregate is computed on.
data_mean = data.mean(level='year') # 'year' names an index level
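Continuing the population sketch above (recent Pandas versions removed the level keyword from the aggregation methods, so the groupby spelling is the forward-compatible one):
pop.index = pop.index.set_names(['state', 'year'])
pop.groupby(level='year').mean() # mean population per year, across states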
Combining Datasets: Concat and Append
Concatenation of NumPy Arrays
With np.concatenate() you combine the contents of two or more arrays into a single array. It takes an axis keyword that lets you specify the axis along which the result will be concatenated.
Simple concatenation with pd.concat
Pandas has a similar function, pd.concat(), that offers more options than np.concatenate():
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
By default the concatenation takes place row-wise within the DataFrame (axis=0). Like np.concatenate, you can specify the axis along which the concatenation will take place:
pd.concat(objs, axis='columns') # or equivalently axis=1
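A sketch:
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']})
pd.concat([df1, df2]) # stacks the rows (axis=0, the default)
df3 = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']})
pd.concat([df1, df3], axis='columns') # appends the columns side by side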
Catching undesired, repeating indices
To verify that the indices in the result of pd.concat() do not overlap you can use the verify_integrity flag. When this is set to True, the concatenation will raise an exception if there are duplicate indices.
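For example:
x = pd.DataFrame({'A': [0, 1]}, index=[0, 1])
y = pd.DataFrame({'A': [2, 3]}, index=[0, 1]) # same indices as x
pd.concat([x, y], verify_integrity=True) # raises ValueError: indexes have overlapping values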
Ignoring the Index
If the index does not matter you can use the option ignore_index = True. This will create a new integer index for the resulting concatenation.
Adding MultiIndex Keys
Another alternative is to use the keys option to specify a label for the data sources. This will result in a hierarchically indexed DataFrame containing the data.
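Continuing the sketch above:
pd.concat([x, y], keys=['x', 'y']) # the outer index level records which source each row came from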
Concatenation with joins
You can concatenate data even when the inputs do not have shared column names; Pandas fills in the missing entries with NA values. You can change this by specifying one of several options for the join and join_axes parameters of the concatenate function. By default the join is a union of the input columns (join='outer'), but we can change this to keep only the shared column names with join='inner'.
pd.concat(objs, join='inner')
You can directly specify the index of the remaining columns using the join_axes argument, which takes a list of Index objects (note that join_axes has since been removed from newer versions of Pandas, where reindexing the result achieves the same thing):
pd.concat([df1, df2], join_axes=[df1.columns])
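A sketch of the two join modes:
df5 = pd.DataFrame({'A': ['A1'], 'B': ['B1'], 'C': ['C1']})
df6 = pd.DataFrame({'B': ['B2'], 'C': ['C2'], 'D': ['D2']})
pd.concat([df5, df6]) # outer join: columns A and D get filled with NaN
pd.concat([df5, df6], join='inner') # inner join: only the shared columns B and C survive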
The append() method
Series and DataFrame objects have an append method that can concatenate multiple objects with fewer lines of code.
df1.append(df2)
The append method does not modify the original object; instead it creates a new object with the combined data. This is not very efficient because it involves creating a new index and data buffer on every call (in fact, append() has since been deprecated and removed from Pandas). The better way to do multiple append operations is to build a list of DataFrames and pass them all at once to the concat() function.
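The recommended pattern looks like this:
dfs = [df1, df2, df3] # build up a list of DataFrames...
result = pd.concat(dfs, ignore_index=True) # ...then concatenate them once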
Combining Datasets: Merge and Join
Pandas has a function, pd.merge(), that can join datasets in several different ways.
One-to-one joins
A one-to-one join is very similar to the column-wise concatenation.
df1 = pd.DataFrame({dictionary})
df2 = pd.DataFrame({dictionary})
df3 = pd.merge(df1,df2)
The result is a new DataFrame that combines the information from the two inputs. The order of entries in each column is not necessarily maintained.
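A minimal sketch, using a made-up employee table:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2) # joins on the shared 'employee' column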
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. The resulting DataFrame will preserve those duplicate entries as appropriate.
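Continuing the sketch, 'group' repeats in df3, so the supervisor value is duplicated as needed:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4) # Guido appears twice, once for Jake and once for Lisa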
Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.
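For example:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux']})
pd.merge(df1, df5) # every employee pairs with every skill row for their group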
Specification of the Merge Key
The on keyword
You can explicitly specify the name of the key column using the ‘on’ keyword, which takes a column name or a list of column names:
pd.merge(df1,df2, on='column')
This only works if both the left and right DataFrames have the specified column name.
The left_on and right_on keywords
Often you might want to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. For this we can use the left_on and right_on keywords to specify the two merging columns:
pd.merge(df1,df2,left_on='employee',right_on='name')
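Since both key columns are kept in the result, you usually drop the redundant one; continuing the employee sketch (salary values made up):
df_salary = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'salary': [70000, 80000, 120000, 90000]})
pd.merge(df1, df_salary, left_on='employee', right_on='name').drop('name', axis=1)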
The left_index and right_index keywords
Instead of merging on columns we can also merge on an index. This can be done using the left_index and right_index flags.
pd.merge(df1,df2,left_index=True,right_index=True)
You can also use the join() method, which performs a merge that defaults to joining on indices:
df1.join(df2)
You can mix merging indices and columns by combining left_index with right_on or vice versa:
pd.merge(df1,df2,left_index=True,right_on='name')
Working with time series
Date and time data come in a few flavors with Pandas:
- Time stamps reference particular moments in time
- Time intervals and periods reference a length of time between a particular beginning and end point
- Time deltas or durations reference an exact length of time
Dates and times in Python
Typed arrays of time: Numpy’s datetime64
The datetime64 dtype encodes dates as 64-bit integers and allows arrays of dates to be represented very compactly.
date = np.array('2015-07-04',dtype=np.datetime64)
You can perform vectorized operations on this array once it is formatted.
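For example, continuing from the array above:
date + np.arange(12) # twelve consecutive days, computed in one vectorized step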
Dates and times in Pandas: best of both worlds
Pandas builds upon these tools to create a Timestamp object. From a group of these Timestamp objects, Pandas can construct a DatetimeIndex object that can be used to index data in a Series or DataFrame.
pd.to_datetime(arg, # int, float, string, datetime, list, tuple, ...
               dayfirst=False,
               yearfirst=False,
               utc=None,
               box=True, # removed in newer Pandas versions
               format=None, # strftime pattern to parse with
               exact=True,
               unit=None)
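In practice a single call handles most inputs; a sketch:
pd.to_datetime('4th of July, 2015') # Timestamp('2015-07-04 00:00:00')
pd.to_datetime(['2015-07-04', '2015-07-06', '2015-07-07']) # a DatetimeIndex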
Pandas time series: Indexing by time
You can index data by Timestamps:
index = pd.DatetimeIndex(args)
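A sketch of indexing by time:
index = pd.DatetimeIndex(['2014-07-04', '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2], index=index)
data['2015-07-04'] # select by exact date
data['2015'] # partial-string indexing: everything from 2015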
Pandas time series data structures
- For timestamps, Pandas provides the Timestamp type.
- For time periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on numpy.datetime64.
- For time deltas or durations, Pandas provides the Timedelta type.
The most fundamental of these are the Timestamp and DatetimeIndex objects. While you can invoke these constructors directly, it's more common to use the pd.to_datetime() function.
Regular sequences: pd.date_range()
You can create regular sequences of dates with the following functions:
pd.date_range() # for time stamps
pd.timedelta_range() # for durations
pd.period_range() # for periods
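A few sketches (newer Pandas prefers the lowercase 'h' frequency alias over the older 'H'):
pd.date_range('2015-07-03', periods=8) # eight consecutive days
pd.date_range('2015-07-03', periods=8, freq='h') # eight hourly timestamps
pd.period_range('2015-07', periods=8, freq='M') # eight monthly periods
pd.timedelta_range(0, periods=6, freq='h') # durations spaced an hour apart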
End notes
The Pandas chapter was huge and I didn't want to fit everything here. The best way for me to retain and learn everything properly is to build a project using Pandas.