My Beginner Pandas Documentation
Focusing on data cleaning, Series, labels, indexing, and DataFrames using lists and dictionaries
Embarking on my Pandas learning journey, I'm here to document and practice my newfound skills.
About Pandas
The source code for Pandas is located in a public GitHub repository:
https://github.com/pandas-dev/pandas
Guided Project with Pandas in Python
This is my first guided project on cleaning a dataset with Pandas in Python. The link is available on my GitHub.
This project covers: Importing Data, Handling Missing Values, Cleaning Data and Manipulating Data.
Data-Science-Files/Data Cleaning in Pandas Project 1.ipynb at main · Akina-Aoki/Data-Science-Files (github.com)
Pandas Installation
In VS Code, I installed pandas from the command prompt (CMD).
I also ran the installer in my Anaconda environment. Then I created a new Python 3 notebook in JupyterLab and executed this code to import Pandas.
Pandas is usually imported under the "pd" alias. The Pandas package can be referred to as pd instead of pandas.
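The import statement itself is not shown above, so here is a minimal sketch of what it usually looks like. It assumes pandas is already installed (for example with pip install pandas); printing the version is only an optional check that the import worked.

```python
import pandas as pd  # "pd" is the conventional alias for pandas

# Optional check: print the installed pandas version
print(pd.__version__)
```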
Loading Data with Pandas
CSV Files
Let's say I have a CSV file I want to load using the Pandas built-in function, read_csv.
Writing the call as pd.read_csv is shorter because it uses the standard abbreviation, pd.
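A small sketch of loading a CSV file with read_csv. The file name groceries.csv and its contents are made up for illustration; the example writes the file to a temporary folder first so it can run on its own.

```python
import os
import tempfile

import pandas as pd

# Write a tiny sample CSV so the example is self-contained
# (the file name and its contents are hypothetical)
csv_path = os.path.join(tempfile.mkdtemp(), "groceries.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("days,items,price\n")
    f.write("Monday,pizza kit,130\n")
    f.write("Tuesday,tuna and salmon,1000\n")

# Load the file into a DataFrame using the pd alias
df = pd.read_csv(csv_path)
print(df.head())  # head() shows the first rows of the DataFrame
```

With a real file, only the path passed to pd.read_csv needs to change.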
The process for loading an Excel file is similar: I pass the path of the Excel file to the read_excel function.
Data Frames
pd.DataFrame function
A DataFrame has rows and columns. It is often made from a dictionary, where the keys become the column labels and the values (lists) become the data in each column. A dictionary can be converted to a DataFrame using the pd.DataFrame function.
import pandas as pd
grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
"items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
"store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
}
grocery_frame = pd.DataFrame(grocery_table)
print(grocery_frame)
Adding new columns in a Data Frame
A new column is added by including another key in the dictionary (here, "location"). Then, to get a new DataFrame with just one column, put the column header in double brackets after the DataFrame name: variable = dataframe[["column name"]]
The same goes for getting multiple columns: list the column headers inside the double brackets to make a new DataFrame with just those columns.
import pandas as pd
grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
"items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
"store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
"location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"]
}
grocery_frame = pd.DataFrame(grocery_table)
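x = grocery_frame[["location"]] # select the "location" column as a new one-column DataFrame
print(grocery_frame) # the full table, including the new "location" column
print(x) # only the "location" column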
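To make the double-bracket idea concrete, here is a short sketch using a shortened version of the grocery table. The variable names one_column, two_columns and as_series are just illustrative. It also shows that single brackets return a Series rather than a DataFrame.

```python
import pandas as pd

# Shortened version of the grocery table, for illustration only
grocery_table = {
    "days": ["Monday", "Tuesday", "Wednesday"],
    "items": ["pizza kit", "tuna and salmon", "vegetables"],
    "store": ["ICA", "LIDL", "Hemköp"],
    "location": ["Nacka", "Hässleby", "TC"],
}
grocery_frame = pd.DataFrame(grocery_table)

# Double brackets with one header: a DataFrame with a single column
one_column = grocery_frame[["location"]]

# Double brackets with several headers: a DataFrame with those columns
two_columns = grocery_frame[["items", "store"]]

# Single brackets: a Series, not a DataFrame
as_series = grocery_frame["location"]

print(two_columns)
print(type(one_column), type(as_series))
```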
Working with and saving data from a Data Frame
unique() function
The unique() function in pandas is used to extract unique elements from a pandas Series or a DataFrame column.
import pandas as pd
grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
"items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
"store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
"location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
"price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table) # name of the data frame
x = grocery_frame[["location"]]
print(grocery_frame["location"].unique()) # returns all unique elements
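As a further small sketch, the same idea applied to the store names, which contain repeats. The values are returned in order of first appearance, and len() gives the number of distinct stores.

```python
import pandas as pd

# The "store" column from the grocery table, as a standalone Series
stores = pd.Series(["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"])

# unique() drops the repeated "ICA" and "Hemköp" entries
unique_stores = stores.unique()

print(unique_stores)
print(len(unique_stores))  # number of distinct stores
```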
to_csv() method
I created a DataFrame called grocery_frame from the dictionary grocery_table. Then, I printed a boolean Series indicating whether the price of each item in the "price" column is greater than or equal to 1000. I filtered grocery_frame to keep only the items with prices greater than or equal to 1000 and assigned the result to a new DataFrame called df1. Finally, I used the to_csv() method to save the contents of df1 to a CSV file named "items_over_1000.csv".
import pandas as pd
grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
"items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
"store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
"location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
"price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table) # name of the data frame
print(grocery_frame["price"] >= 1000) # boolean Series: True where the price is 1000 or more
df1 = grocery_frame[grocery_frame["price"] >= 1000] # keep only the rows where the condition is True
df1.to_csv("items_over_1000.csv") # save the filtered rows to a CSV file
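A short sketch of the same filter-and-save step, followed by reading the file back to check what was written. It uses a shortened table and a temporary folder (both are assumptions for illustration). Passing index=False is an optional choice that leaves the row numbers out of the file.

```python
import os
import tempfile

import pandas as pd

# Shortened grocery table, for illustration only
grocery_frame = pd.DataFrame({
    "items": ["pizza kit", "tuna and salmon", "wagyu"],
    "price": [130, 1000, 4000],
})

# Keep only the rows with a price of 1000 or more
df1 = grocery_frame[grocery_frame["price"] >= 1000]

# Save to a temporary folder; index=False skips the row-number column
out_path = os.path.join(tempfile.mkdtemp(), "items_over_1000.csv")
df1.to_csv(out_path, index=False)

# Read the file back to confirm its contents
reloaded = pd.read_csv(out_path)
print(reloaded)
```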
Selecting data in a Data Frame
loc and iloc indexers
loc is a label-based way of selecting data, which means that we pass the name (label) of the row or column that we want to select. It is used with square brackets rather than called like a function. A range passed to loc includes its last element.
Simple syntax:
dataframe.loc[row_label, column_label]
iloc is an integer-position-based way of selecting data, which means that we pass an integer position to select a specific row or column. It is also used with square brackets. A range passed to iloc does not include its last element.
Simple syntax:
dataframe.iloc[row_index, column_index]
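A minimal sketch of the two selectors picking the same cell, assuming a small table whose row labels are day names (the table is made up for illustration).

```python
import pandas as pd

# Small table with day names as row labels
grocery_frame = pd.DataFrame(
    {"items": ["pizza kit", "vegetables", "wagyu"], "price": [130, 250, 4000]},
    index=["Monday", "Wednesday", "Friday"],
)

# Label-based: row labelled "Wednesday", column labelled "price"
by_label = grocery_frame.loc["Wednesday", "price"]

# Position-based: second row (position 1), second column (position 1)
by_position = grocery_frame.iloc[1, 1]

print(by_label, by_position)  # both refer to the same cell
```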
Slicing
Slicing uses the []
operator to select a set of rows and/or columns from a DataFrame.
To slice out a set of rows, you use this syntax: data[start:stop]
NOTE: Remember how to count indexing. Labels must be found in the DataFrame or you will get a KeyError.
Indexing by labels (i.e. using loc) differs from indexing by integers (i.e. using iloc). With loc, both the start bound and the stop bound are inclusive. When using loc, integers can be used, but the integers refer to the index label and not the position. For example, using loc to select 1:4 will give a different result than using iloc to select rows 1:4.
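The 1:4 difference can be sketched with the prices from the grocery table. This assumes the default integer index (0, 1, 2, ...), so the labels and the positions happen to be the same numbers.

```python
import pandas as pd

# Prices with the default integer index 0..6
prices = pd.DataFrame({"price": [130, 1000, 250, 70, 4000, 120, 40]})

# loc: labels 1 through 4, stop bound included (4 rows)
label_slice = prices.loc[1:4]

# iloc: positions 1 up to but not including 4 (3 rows)
position_slice = prices.iloc[1:4]

# Plain [] slicing with integers also works by position (3 rows)
plain_slice = prices[1:4]

print(len(label_slice), len(position_slice), len(plain_slice))
```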
Pandas Series, Labels and Indexes
A Series is like a column in a table. It is a one-dimensional array that can hold any data type.
Labels can be set with the index argument.
Lists
import pandas as pd
budget_list = [33, 45, 25, 10, 90, 23, 4]
sample_a = pd.Series(budget_list, index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
# Using string formatting to add the dollar sign
sample_a_with_dollar = sample_a.map("${:.2f}".format)
print(sample_a_with_dollar)
Dictionaries
The first code block below raises an AttributeError because dictionaries in Python do not have a .map() method. The .map() method is specific to Pandas objects such as Series and DataFrames.
If I have a dictionary and I want to apply formatting to its values, I can achieve this by iterating over the dictionary and formatting each value individually, as in the second code block.
import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}
# Trying to add the dollar sign with .map() on a plain dictionary
# This line raises AttributeError: a dict has no .map() method
budget_with_dollar = budget_dict.map("${:.2f}".format)
print(budget_with_dollar)
import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}
# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}
print(budget_with_dollar)
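Another option, sketched here as an alternative to the dictionary comprehension, is to convert the dictionary to a Series first. The keys become the labels, and .map() is then available. The name budget_series is just illustrative.

```python
import pandas as pd

budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10,
               "Friday": 90, "Saturday": 23, "Sunday": 4}

# Convert the dictionary to a Series: keys become the index labels
budget_series = pd.Series(budget_dict)

# A Series has .map(), so the formatting can be applied to every value
budget_with_dollar = budget_series.map("${:.2f}".format)
print(budget_with_dollar)
```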
Indexing
When creating a Series from a dictionary, the index argument can be used to keep only specific items.
import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}
# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}
budget_specific_days = pd.Series(budget_with_dollar, index = ["Saturday", "Sunday"])
print(budget_specific_days)
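One detail worth a small sketch: if a label passed to index is not a key in the dictionary, pandas keeps the label and fills its value with NaN instead of raising an error. The label "Holiday" is a made-up example of a missing key.

```python
import pandas as pd

budget_dict = {"Friday": 90, "Saturday": 23, "Sunday": 4}

# Only the listed keys are kept, in the order given
weekend = pd.Series(budget_dict, index=["Saturday", "Sunday"])

# "Holiday" is not a key in the dictionary, so its value becomes NaN
with_missing = pd.Series(budget_dict, index=["Sunday", "Holiday"])

print(weekend)
print(with_missing)
```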
Data Frames using Lists
DataFrames are two-dimensional tables. A DataFrame represents a whole table, while a Series is like a single column.
Syntax: variable = pd.DataFrame(data)
Example: Upgrade my budget table by adding another series for the food bought using lists.
# Data Frame using lists
import pandas as pd
# create lists for each category
expenses = [33, 45, 25, 10, 90, 23, 4]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
items = ["pizza kit", "tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"]
# Create pandas Series (column) for items
items_series = pd.Series(items, index = days)
# Create pandas Series(column) for expenses
expenses_series = pd.Series(expenses, index = days)
# Combine the two Series into a DataFrame (each Series becomes a column)
budget_df = pd.DataFrame({"Budget" : expenses_series, "Items" : items_series})
print(budget_df)
This code creates a DataFrame named budget_df with two columns: Budget and Items. The Budget column contains the budget data and the Items column contains the corresponding food items for each day of the week.
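A shorter variant, sketched here as an alternative, passes the lists straight to pd.DataFrame and sets the day names through the index argument, skipping the intermediate Series. The result can then be queried by day with loc.

```python
import pandas as pd

expenses = [33, 45, 25, 10, 90, 23, 4]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
items = ["pizza kit", "tuna and salmon", "vegetables", "pre-made dinner",
         "wagyu", "greek salad", "oatmeal"]

# Lists become columns; the days become the row labels
budget_df = pd.DataFrame({"Budget": expenses, "Items": items}, index=days)

# Look up a single day by its label
print(budget_df.loc["Friday"])
```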
Data Frames using Dictionaries
Here, I am creating a base guide on how I understand DataFrames using dictionaries.
Data dictionary: Create a dictionary where the keys represent the column names, and the values are lists or arrays containing the data for each column. Each list should have the same length.
Creating a DataFrame: Use the pd.DataFrame() function, passing your dictionary as an argument. This function converts the dictionary into a DataFrame object.
# DataFrames using Dictionaries
# start with this simple code as a guide
import pandas as pd
data = {
"numbers" : [1, 2, 3], #int
"letters" : ["a", "b" ,"c"] # str
}
# load data into a DataFrame object
sample = pd.DataFrame(data)
print(sample) # print the DataFrame, not the original dictionary
Example: Upgrade my budget table by creating a DataFrame using dictionaries. I just need to provide a dictionary where the keys are the column names and the values hold the data for each column.
import pandas as pd
# create dictionary day:expenses pair
expenses_dict = {
"Monday":33,
"Tuesday":45,
"Wednesday":25,
"Thursday":10,
"Friday":90,
"Saturday":23,
"Sunday":4
}
# create dictionary day:items pair
items_dict = {
"Monday":"pizza kit",
"Tuesday":"tuna and salmon",
"Wednesday":"vegetables",
"Thursday":"pre-made dinner",
"Friday":"wagyu",
"Saturday":"greek salad",
"Sunday":"oatmeal"
}
# create pandas series (columns) from dictionaries
expenses_series = pd.Series(expenses_dict)
items_series = pd.Series(items_dict)
# create DataFrame by combining the two Series (each becomes a column)
budget_df = pd.DataFrame({"Budget": expenses_series, "Items":items_series})
print(budget_df)
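As a possible shortcut, the two day-keyed dictionaries can also be passed to pd.DataFrame directly, without creating the Series first. The outer keys become the column names and the inner keys (the days) become the row labels. This sketch uses shortened dictionaries for illustration.

```python
import pandas as pd

# Shortened day:value dictionaries, for illustration only
expenses_dict = {"Friday": 90, "Saturday": 23, "Sunday": 4}
items_dict = {"Friday": "wagyu", "Saturday": "greek salad", "Sunday": "oatmeal"}

# Dictionary of dictionaries: outer keys are columns, inner keys are row labels
budget_df = pd.DataFrame({"Budget": expenses_dict, "Items": items_dict})
print(budget_df)
```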