Pandas in action “how to automate your data analysis process”

What is all this Pandas about?

Our first steps with Pandas

As I mentioned previously, the intention of this post is to give you the tools to manage data like a data scientist without being a hardcore programmer. If you are looking for in-depth content on Pandas, I recommend Pandas in Action.

import pandas as pd
df = pd.read_csv('./sample_data/california_housing_train.csv')
a = 5
my_name = 'Pikachu'
print(my_name)
import pandas as pd
df = pd.read_csv('./sample_data/california_housing_train.csv')
df 

Exploring a dataset

With Pandas and our mighty friend Python, you can run some easy instructions to explore and transform data as you wish.

df.head(10)
df.tail(10)
df.sort_values(by='median_house_value')
df.sort_values(by='median_house_value').head(10)

Index

The index is the range of ascending numbers that you can see on the left side of the DataFrame. You can think of them as the row number that you have in Microsoft Excel. You can alter those indexes and add labels as identifiers. You can use columns as the index of a DataFrame, but if not defined Pandas generates a numeric index starting from 0 by default as shown in the examples above. For this particular dataset we don’t have a good candidate for a single row label, but let’s imagine that we want to use longitude and latitude attributes as identifiers. You can accomplish this task by using the set_index method and storing the result into a new variable that we’ll call df_with_index. Why we are doing this? because you need to keep in mind that the original Dataframe won’t change unless you override its value, and it’s a good practice to create “new instances” of your Dataframes as you move forward with your data analysis.

df_with_index = df.set_index(['longitude','latitude'])
df_with_index.loc[-114.57]
Printing data by its row label
df.iloc[0]
Finding data by its row number

Pivoting with data

A pivot table is a data summary tool often found in spreadsheet programs. Aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along with the columns.

df_pivot_table = df.pivot_table(['median_house_value','median_income'], 
index=['longitude','latitude'])
Printing Pivot Table

Lets the magic begins

Pandas is definitely the best tool for data analysis in Python, but, let’s be honest, the transition between Microsoft Excel to Python is challenging for those with little coding experience.

Mito Sheet installation

Regardless of the IDE of your choice, you must install Mito Sheet and we will use Jupyter notebook rather than Colab. Don’t worry, Jupyter is as user-friendly as Colab and offers many benefits compared to other applications out there.

python -m pip install mitoinstaller
python -m mitoinstaller install
Running Mitosheet
Jupyter Notebook first Run
Running MitoSheet for the first time
Loading MitoSheet
Data Loaded
Importing Data with MitoSheet

Data Transformation

As part of any data analysis, you might want to filter and sort data. With Mitosheet, you can get this done with a few clicks. Let’s filter the dataset by the ReportingCountry column. I will choose ES (Spain) for the purpose of this example.

Filtering Data
Pivot Tables

Testing Mitosheet Code in Colab

Picture this. Your manager is asking you to share the analysis with your colleagues, but, they use another IDE for data analysis. Let’s test the results in Colab.

Uploading a file into Colab
# Imported data.csvimport pandas as pd
data = pd.read_csv(r'./sample_data/data.csv')

# Filtered ReportingCountry in data
data = data[data['ReportingCountry'].str.contains('ES', na=False)]

# Pivoted data into df2
unused_columns = data.columns.difference(set(['ReportingCountry']).union(set(['TargetGroup'])).union(set({'SecondDose'})))
tmp_df = data.drop(unused_columns, axis=1)pivot_table = tmp_df.pivot_table( index=['ReportingCountry'], columns=['TargetGroup'], values=['SecondDose'], aggfunc={'SecondDose': ['sum']})
# Printing table
pivot_table

Summary

MitoSheet allows you to easily explore and transform data without coding by providing a user-friendly interface while reducing the time required to program and debug Python code and focusing on the data analysis itself.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alejandro Casas

Alejandro Casas

69 Followers

Sr. Manager at Oracle | Data Science and Cybersecurity | Technology & Startups | CISSP | CISM | CRISC | CDPSE