DataFrame Insight Cheatsheet

My DataFrame Cheatsheet

I love working with DataFrames -very readable data tables that we are blessed with the python pandas library to help us break down complex and annoying data from a variety of formats. That said, working with them properly is a vital step for any data analysis (and so you will actually have time to watch Netflix) so here are some quick operations and recipes to consider:

Method Intent
df.info() Get summary of DataFrame
df.dtypes Examine data types
df.describe() List stat properties for each column: count, mean, std, min, percentiles, max
df.query('col1 > 0 & col1 < 5 | col2 > 0') Convenient alternative to slicing connectors: df[df['col1'] > 0 ...rest] : use numexpr syntax instead of python syntax
df.assign(new_col = lambda x : func(col1)) Creates a new column by evaluating function (using lambda here) on every row of df
df[['attr1', 'attr1']] Take subset of DataFrame
df.rename(<dict of cols to replace>) Rename column names of DataFrame
df.values.tolist() DataFrame to list of lists
df1.join(df2, how=<'inner' if intersection>) Join two DataFrames
df['col'].diff() Computer percent changes of time series col
df.corr(method='pearson') Review all pairwise correlations of df cols
df[col1].corr(df[col2]) Pandas correlation method of Series
df.skew() Review Gaussian distribution
df.hist() View histograms for faster observations
df[<col name>].value_counts() Get unique value counts for column
df.drop(columns=<list of cols>, inplace=True/False) Delete unnecessary columns
df[<col name>].value_counts(normalize=True) Express value counts as proportions of the total
df[df['<col name>'].str.contains('<substring>')] Get all rows that contain specific word
df.groupby(['<col1>', '<col2>']).mean() Get means by column
sub_df = df[df['col_name'] == '<some value>'] Create subset DataFrame based on some value
df.sort_values(by='<col_name>', ascending=False) Sort by descending
df['change_to_bool'] = df.change_to_bool.astype(bool) Change data type to bool
df.isnull().sum() Count number of missing values in each column
df.dropna(subset=['specific_col'], inplace=True) Drop all rows that are missing specific attribute
df.set_index('col_name', inplace=True) Set column as index
df.sample(frac=1.0, replace=True, random_state=42) Sample with replacement from a DataFrame