My Data Scientist Journey — Pandas
After learning the basics of NumPy last week, I decided to start learning Pandas!
The chapter covering Pandas was very in-depth so I won't include everything here, just the topics I felt were important.
Pandas Objects
Pandas objects are just enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than integer indices. The three fundamental Pandas data structures are the Series, DataFrame, and Index.
Pandas Series Object
A Pandas Series object is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([0.25,0.5,0.75,1.0])
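For example (a minimal sketch; the snippets in this post assume import pandas as pd and import numpy as np throughout):
import pandas as pd
import numpy as np
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data.values # the underlying NumPy array
data.index # RangeIndex(start=0, stop=4, step=1)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) # explicit labels
data['b'] # 0.5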
The Pandas DataFrame Object
These can be thought of as a specialization of a Python dictionary. The DataFrame object can be created from Python dictionaries:
data = pd.DataFrame({dictionary})
data.index #returns index labels
data.columns #returns an Index object holding column labels
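As a quick sketch (the state figures below are just made-up example values):
population = pd.Series({'California': 39500000, 'Texas': 29000000})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})
states.index # Index(['California', 'Texas'], dtype='object')
states.columns # Index(['population', 'area'], dtype='object')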
DataFrame as a specialized dictionary
A DataFrame maps a column name to a Series of column data.
data['column'] # returns the Series of data in that column
The Pandas Index Object
The Index object can be thought of as an immutable array. You can construct one using, for example:
ind = pd.Index([2, 3, 5, 7, 11])
Index as immutable array
You can use standard Python indexing notation to retrieve values or slices:
ind[1]
ind[::2]
These objects are immutable, meaning they cannot be modified in place.
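For example (the exact error message can vary between versions):
ind = pd.Index([2, 3, 5, 7, 11])
ind[1] = 0 # raises TypeError: Index does not support mutable operations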
Data Indexing and Selection
The indexing methods are very similar to NumPy arrays.
Data Selection in Series
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a Python dictionary.
Indexers: loc, iloc and ix
The loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
data.iloc[1:3]
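The distinction matters most when the explicit index itself uses integers; a small sketch (note that the hybrid ix indexer has since been deprecated and removed from Pandas, so loc and iloc are the ones to rely on):
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.loc[1] # 'a', the explicit label 1
data.iloc[1] # 'b', the implicit position 1
data.iloc[1:3] # positions 1 and 2, i.e. 'b' and 'c'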
Operating on Data in Pandas
Mapping between Python operators and Pandas methods:
Python operator → Pandas method(s)
+ → .add()
- → .sub(), .subtract()
* → .mul(), .multiply()
/ → .truediv(), .div(), .divide()
// → .floordiv()
% → .mod()
** → .pow()
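The method forms are handy because they accept extra arguments such as fill_value; a sketch:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B # NaN wherever the indices don't overlap
A.add(B, fill_value=0) # treats the missing entries as 0 instead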
Handling Missing Data
Lots of real-world data isn't as neat as what's presented in tutorials or books; it often includes mislabeled or missing data.
Trade-Offs in Missing Data Conventions
Different languages use different labels and ways to handle missing data.
Missing Data in Pandas
Pandas handles missing data using two sentinel values that already exist in Python: the special floating-point NaN value and the Python None object.
NaN: Missing numerical data
NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Because it is a native float, NumPy can perform fast, compiled operations on arrays containing NaN, whereas arrays containing None fall back to the object dtype and operate at the slower Python level. Any operation performed on NaN results in another NaN.
NumPy has some special aggregations that ignore NaN:
np.nansum()
np.nanmin()
np.nanmax()
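A quick sketch of the difference:
vals = np.array([1, np.nan, 3, 4])
vals.sum() # nan, because NaN propagates through ordinary aggregates
np.nansum(vals) # 8.0, the NaN-aware version ignores the missing value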
Operating on Null values
There are several useful methods for detecting, removing, and replacing null values in Pandas data structures:
isnull() #generates a boolean mask indicating missing values
notnull() #opposite of isnull()
dropna() #return a filtered version of the data
fillna() # return a copy of the data with missing values filled or imputed
Dropping null values
These methods cannot drop single NA values from a DataFrame; only full rows or columns can be dropped. By default, dropna() drops any row containing a null value; passing axis='columns' (or axis=1) drops columns instead, and the how and thresh parameters give finer control.
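For example:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df.dropna() # keeps only the fully populated middle row
df.dropna(axis='columns') # keeps only the last column, the one with no NA
df.dropna(how='all') # drops a row only if every value in it is NA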
Filling null values
Rather than dropping NA values, sometimes you'd rather replace them. The replacement can be a single valid value, or some sort of imputation or interpolation from the good values.
data.fillna(0) # fills NaN with 0's
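Besides a constant, you can propagate neighboring values forward or backward (a sketch; older Pandas spelled these as fillna(method='ffill') and fillna(method='bfill')):
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data.fillna(0) # replace missing values with 0
data.ffill() # forward-fill: carry the previous value forward
data.bfill() # back-fill: carry the next value backward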
Hierarchical Indexing
Often it is useful to go beyond one-dimensional and two-dimensional data. A common pattern to address higher dimensional data is hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
Pandas MultiIndex
You can create multiple levels of index with:
index = pd.MultiIndex.from_tuples(index) # index is a predefined list of tuples
You can create these multiple index levels from an ordinary DataFrame using the stack() method:
df.stack()
You can turn multiply indexed data back into a DataFrame using:
data.unstack()
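A minimal sketch (the population figures are made-up example values, in millions):
index = pd.MultiIndex.from_tuples([('California', 2010), ('California', 2020),
                                   ('Texas', 2010), ('Texas', 2020)])
pop = pd.Series([37.3, 39.5, 25.1, 29.1], index=index)
df = pop.unstack() # DataFrame: states as rows, years as columns
df.stack() # back to a multiply indexed Series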
Data Aggregations of Multi-Indices
Pandas has built-in aggregation methods like mean(), sum(), and max(). With hierarchical indexed data you can pass a level parameter that controls which subset of the data the aggregate is computed on.
data_mean = data.mean(level='year') # 'year' names an index level
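Continuing the population sketch above (recent Pandas versions removed the level keyword from the aggregation methods, so the groupby spelling is the forward-compatible one):
pop.index = pop.index.set_names(['state', 'year'])
pop.groupby(level='year').mean() # mean population per year, across states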
Combining Datasets: Concat and Append
Concatenation of NumPy Arrays
With np.concatenate() you combine the contents of two or more arrays into a single array. It takes an axis keyword that lets you specify the axis along which the result will be concatenated.
Simple concatenation with pd.concat
Pandas has a similar function, pd.concat(), that offers more options than np.concatenate():
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
By default the concatenation takes place row-wise within the DataFrame (axis=0). Like np.concatenate, you can specify the axis along which the concatenation will take place:
pd.concat(objs, axis='columns') # or equivalently axis=1
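A sketch:
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']})
pd.concat([df1, df2]) # stacks the rows (axis=0, the default)
df3 = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']})
pd.concat([df1, df3], axis='columns') # appends the columns side by side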
Catching undesired, repeating indices
To verify that the indices in the result of pd.concat() do not overlap you can use the verify_integrity flag. When this is set to True, the concatenation will raise an exception if there are duplicate indices.
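For example:
x = pd.DataFrame({'A': [0, 1]}, index=[0, 1])
y = pd.DataFrame({'A': [2, 3]}, index=[0, 1]) # same indices as x
pd.concat([x, y], verify_integrity=True) # raises ValueError: indexes have overlapping values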
Ignoring the Index
If the index does not matter you can use the option ignore_index = True. This will create a new integer index for the resulting concatenation.
Adding MultiIndex Keys
Another alternative is to use the keys option to specify a label for the data sources. This will result in a hierarchically indexed DataFrame containing the data.
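Continuing the sketch above:
pd.concat([x, y], keys=['x', 'y']) # the outer index level records which source each row came from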
Concatenation with joins
You can concatenate data even when the inputs do not have shared column names; Pandas fills in the missing entries with NA values. You can change this by specifying one of several options for the join and join_axes parameters of the concatenate function. By default the join is a union of the input columns (join='outer'), but we can change this to keep only the shared column names with join='inner'.
pd.concat(objs, join='inner')
You can directly specify the index of the remaining columns using the join_axes argument, which takes a list of Index objects (note that join_axes has since been removed from newer versions of Pandas, where reindexing the result achieves the same thing):
pd.concat([df1, df2], join_axes=[df1.columns])
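A sketch of the two join modes:
df5 = pd.DataFrame({'A': ['A1'], 'B': ['B1'], 'C': ['C1']})
df6 = pd.DataFrame({'B': ['B2'], 'C': ['C2'], 'D': ['D2']})
pd.concat([df5, df6]) # outer join: columns A and D get filled with NaN
pd.concat([df5, df6], join='inner') # inner join: only the shared columns B and C survive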
The append() method
Series and DataFrame objects have an append method that can concatenate multiple objects with fewer lines of code.
df1.append(df2)
The append method does not modify the original object; instead it creates a new object with the combined data. This is not very efficient because it involves creating a new index and data buffer on every call (in fact, append() has since been deprecated and removed from Pandas). The better way to do multiple append operations is to build a list of DataFrames and pass them all at once to the concat() function.
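The recommended pattern looks like this:
dfs = [df1, df2, df3] # build up a list of DataFrames...
result = pd.concat(dfs, ignore_index=True) # ...then concatenate them once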
Combining Datasets: Merge and Join
Pandas has a function, pd.merge(), that can join datasets in several different ways.
One-to-one joins
A one-to-one join is very similar to the column-wise concatenation.
df1 = pd.DataFrame({dictionary})
df2 = pd.DataFrame({dictionary})
df3 = pd.merge(df1,df2)
The result is a new DataFrame that combines the information from the two inputs. The order of entries in each column is not necessarily maintained.
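A minimal sketch, using a made-up employee table:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2) # joins on the shared 'employee' column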
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. The resulting DataFrame will preserve those duplicate entries as appropriate.
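Continuing the sketch, 'group' repeats in df3, so the supervisor value is duplicated as needed:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4) # Guido appears twice, once for Jake and once for Lisa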
Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.
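For example:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux']})
pd.merge(df1, df5) # every employee pairs with every skill row for their group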
Specification of the Merge Key
The on keyword
You can explicitly specify the name of the key column using the ‘on’ keyword, which takes a column name or a list of column names:
pd.merge(df1,df2, on='column')
This only works if both the left and right DataFrames have the specified column name.
The left_on and right_on keywords
Often you might want to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. For this we can use the left_on and right_on keywords to specify the two merging columns:
pd.merge(df1,df2,left_on='employee',right_on='name')
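Since both key columns are kept in the result, you usually drop the redundant one; continuing the employee sketch (salary values made up):
df_salary = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'salary': [70000, 80000, 120000, 90000]})
pd.merge(df1, df_salary, left_on='employee', right_on='name').drop('name', axis=1)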
The left_index and right_index keywords
Instead of merging on columns we can also merge on an index. This can be done using the left_index and right_index flags.
pd.merge(df1,df2,left_index=True,right_index=True)
You can also use the join() method, which performs a merge that defaults to joining on indices:
df1.join(df2)
You can mix merging indices and columns by combining left_index with right_on or vice versa:
pd.merge(df1,df2,left_index=True,right_on='name')
Working with time series
Date and time data come in a few flavors with Pandas:
- Time stamps reference particular moments in time
- Time intervals and periods reference a length of time between a particular beginning and end point
- Time deltas or durations reference an exact length of time
Dates and times in Python
Typed arrays of time: Numpy’s datetime64
The datetime64 dtype encodes dates as 64-bit integers and allows arrays of dates to be represented very compactly.
date = np.array('2015-07-04',dtype=np.datetime64)
You can perform vectorized operations on this array once it is formatted.
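For example, continuing from the array above:
date + np.arange(12) # twelve consecutive days, computed in one vectorized step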
Dates and times in Pandas: best of both worlds
Pandas builds upon these tools to create a Timestamp object. From a group of these Timestamp objects, Pandas can construct a DatetimeIndex object that can be used to index data in a Series or DataFrame.
pd.to_datetime(arg, # int, float, string, datetime, list, tuple, ...
               dayfirst=False,
               yearfirst=False,
               utc=None,
               box=True, # removed in newer Pandas versions
               format=None, # strftime pattern to parse with
               exact=True,
               unit=None)
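In practice a single call handles most inputs; a sketch:
pd.to_datetime('4th of July, 2015') # Timestamp('2015-07-04 00:00:00')
pd.to_datetime(['2015-07-04', '2015-07-06', '2015-07-07']) # a DatetimeIndex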
Pandas time series: Indexing by time
You can index data by Timestamps:
index = pd.DatetimeIndex(args)
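A sketch of indexing by time:
index = pd.DatetimeIndex(['2014-07-04', '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2], index=index)
data['2015-07-04'] # select by exact date
data['2015'] # partial-string indexing: everything from 2015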
Pandas time series data structures
- For timestamps, Pandas provides the Timestamp type.
- For time periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on numpy.datetime64.
- For time deltas or durations, Pandas provides the Timedelta type.
The most fundamental of these are the Timestamp and DatetimeIndex objects. While you can invoke these constructors directly, it's more common to use the pd.to_datetime() function.
Regular sequences: pd.date_range()
You can create regular sequences of dates with the following functions:
pd.date_range() # for time stamps
pd.timedelta_range() # for durations
pd.period_range() # for periods
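A few sketches (newer Pandas prefers the lowercase 'h' frequency alias over the older 'H'):
pd.date_range('2015-07-03', periods=8) # eight consecutive days
pd.date_range('2015-07-03', periods=8, freq='h') # eight hourly timestamps
pd.period_range('2015-07', periods=8, freq='M') # eight monthly periods
pd.timedelta_range(0, periods=6, freq='h') # durations spaced an hour apart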
End notes
The Pandas chapter was huge and I didn't want to fit everything here. The best way for me to retain and learn everything properly is to build a project using Pandas.