My Beginner Pandas Documentation

Photo by Lukas W. on Unsplash

Focusing on data cleaning, Series, labels, indexing, and DataFrames using lists and dictionaries

Embarking on my Pandas learning journey, I'm here to document and practice my newfound skills.

About Pandas
Pandas is a Python library for working with data sets. It lets you analyze large data sets and draw conclusions based on statistical theory. Pandas can clean messy data sets and make them readable and relevant. Its main uses are analyzing, cleaning, exploring, and manipulating data, which makes it an important tool for preparing relevant data in data science.

The Pandas source code is in a public GitHub repository:

https://github.com/pandas-dev/pandas

Guided Project with Pandas in Python

My first guided project was on cleaning a dataset with Pandas in Python. A link to the notebook is available on my GitHub.

This project covers: Importing Data, Handling Missing Values, Cleaning Data and Manipulating Data.

Data-Science-Files/Data Cleaning in Pandas Project 1.ipynb at main · Akina-Aoki/Data-Science-Files (github.com)


Pandas Installation

In my VS Code setup, I installed pandas from the command prompt (CMD).

I also installed it in my Anaconda environment. Then I created a new Python 3 notebook in JupyterLab and ran the code that imports Pandas.

Pandas is usually imported under the "pd" alias, so the package can be referred to as pd instead of pandas.
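A minimal sketch of what such an import cell typically looks like, assuming pandas is already installed in the active environment. Printing the version is only an optional check added here:

```python
# Import pandas under its conventional alias.
import pandas as pd

# Printing the version is a quick way to confirm the installation worked.
print(pd.__version__)
```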

Loading Data with Pandas

CSV Files

Let's say I have a CSV file I want to load using the Pandas built-in function, read_csv.

With the standard pd abbreviation the statement is shorter: pd.read_csv instead of pandas.read_csv.
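As a rough sketch of loading a CSV file, the example below first writes a tiny file so it can run anywhere. The file name groceries.csv and its contents are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical file: write a tiny CSV to a temporary folder so the example runs anywhere.
csv_path = os.path.join(tempfile.mkdtemp(), "groceries.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("days,items,price\nMonday,pizza kit,130\nTuesday,tuna and salmon,1000\n")

# read_csv parses the file and returns a DataFrame; the first row becomes the column labels.
df = pd.read_csv(csv_path)
print(df.head())
```

With a real file, only the path passed to pd.read_csv changes.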

The process for loading an Excel file is similar: I pass the path of the Excel file to the read_excel function, which reads it into a DataFrame.


Data Frames

pd.DataFrame Function

A DataFrame has rows and columns. It is often made from a dictionary, where the keys become the column labels and the values (lists) become the data in each column. The dictionary is converted to a DataFrame with the pd.DataFrame function.


import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                }
grocery_frame = pd.DataFrame(grocery_table)
print(grocery_frame)

Adding and selecting columns in a Data Frame

To get a new DataFrame with just one column, put the column header in double brackets after the DataFrame name: variable = dataframe[["column name"]]

The same goes for multiple columns: list the column headers inside the double brackets to make a new DataFrame with those columns.
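A short sketch of the difference between the bracket styles, using a shortened version of the grocery table (trimmed to three rows here for brevity):

```python
import pandas as pd

grocery_frame = pd.DataFrame({
    "days": ["Monday", "Tuesday", "Wednesday"],
    "items": ["pizza kit", "tuna and salmon", "vegetables"],
    "store": ["ICA", "LIDL", "Hemköp"],
})

# Double brackets with one label -> a DataFrame with a single column.
one_column = grocery_frame[["store"]]

# Double brackets with several labels -> a DataFrame with those columns, in the order listed.
two_columns = grocery_frame[["items", "store"]]

# Single brackets, by contrast, return a Series rather than a DataFrame.
store_series = grocery_frame["store"]

print(type(one_column).__name__, type(store_series).__name__)
print(two_columns)
```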


import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"]
}
grocery_frame = pd.DataFrame(grocery_table)
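x = grocery_frame[["location"]]    # select the "location" column as a one-column DataFrame
print(grocery_frame)               # the full table, including the new "location" column
print(x)                           # only the "location" column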


Working with and saving data from a Data Frame

unique() function

The unique() method returns the unique values of a pandas Series, such as a single DataFrame column.

import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
                "price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table)    # name of the data frame
print(grocery_frame["location"].unique())      # array of the unique values in "location"

to_csv() method

I created a DataFrame called grocery_frame from the dictionary grocery_table. Then, I printed a Boolean Series indicating whether the price of each item in the "price" column is greater than or equal to 1000. I filtered grocery_frame to keep the items with prices greater than or equal to 1000 and assigned the result to a new DataFrame called df1. Finally, I used the to_csv() method to save the contents of df1 to a CSV file named "items_over_1000.csv".

import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
                "price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table)    # name of the data frame
print(grocery_frame["price"] >= 1000)                 # Boolean Series: True where the price is 1000 or more

df1 = grocery_frame[grocery_frame["price"] >= 1000]   # keep only the rows where the condition is True
df1.to_csv("items_over_1000.csv")                     # save the filtered rows to a CSV file
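By default, to_csv also writes the row index as an unnamed first column. The sketch below, which assumes a smaller version of the table and a temporary output folder (both chosen here for illustration), shows how index=False leaves it out:

```python
import os
import tempfile

import pandas as pd

grocery_frame = pd.DataFrame({
    "items": ["pizza kit", "tuna and salmon", "wagyu"],
    "price": [130, 1000, 4000],
})
df1 = grocery_frame[grocery_frame["price"] >= 1000]

# Hypothetical output location: a temporary folder, so nothing is written to the project directory.
out_path = os.path.join(tempfile.mkdtemp(), "items_over_1000.csv")

# index=False leaves out the row labels (1 and 2 here), so only the data columns are saved.
df1.to_csv(out_path, index=False)

# Read the file back to check what was saved.
saved = pd.read_csv(out_path)
print(saved)
```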

Selecting data in a Data Frame

loc[] and iloc[] indexers

loc is a label-based selection method, which means that we pass the name (label) of the row or column that we want to select, using square brackets. When a range of labels is passed, the last element of the range is included.

Simple syntax:

loc[row_label, column_label]

iloc is an integer-position-based selection method, which means that we pass integer positions, counted from 0, to select a specific row or column. When a range is passed, the last element of the range is not included.

Simple syntax:

iloc[row_index, column_index]
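A small sketch of both forms selecting the same cell. The table below is a made-up three-row example that uses the days as row labels, so labels and positions are easy to tell apart:

```python
import pandas as pd

# Use the days as row labels so labels and positions are easy to tell apart.
prices = pd.DataFrame(
    {"items": ["pizza kit", "tuna and salmon", "vegetables"],
     "price": [130, 1000, 250]},
    index=["Monday", "Tuesday", "Wednesday"],
)

# loc: select by row label and column label.
by_label = prices.loc["Tuesday", "price"]

# iloc: select by integer position (row 1, column 1), counting from 0.
by_position = prices.iloc[1, 1]

print(by_label, by_position)
```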

Slicing

Slicing uses the [] operator to select a set of rows and/or columns from a DataFrame.

To slice out a set of rows, you use this syntax: data[start:stop]

NOTE: Remember that positions are counted from 0. Labels must exist in the DataFrame, or you will get a KeyError.

Indexing by labels (i.e. using loc) differs from indexing by integers (i.e. using iloc). With loc, both the start bound and the stop bound are inclusive. Integers can be used with loc, but they refer to the index label and not the position.

For example, on a DataFrame with the default integer index, selecting 1:4 with loc returns the rows labeled 1 through 4, while selecting 1:4 with iloc returns the rows at positions 1 through 3.
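The three ways of slicing rows can be compared side by side. This is a sketch on a one-column frame with the default integer index, and the label 99 is just an arbitrary label that does not exist:

```python
import pandas as pd

# Default index: the labels are the integers 0..6, the same as the positions.
frame = pd.DataFrame({"price": [130, 1000, 250, 70, 4000, 120, 40]})

plain_slice = frame[1:4]       # positions 1, 2, 3 (stop excluded)
iloc_slice = frame.iloc[1:4]   # positions 1, 2, 3 (stop excluded)
loc_slice = frame.loc[1:4]     # labels 1, 2, 3, 4 (stop included)

print(len(plain_slice), len(iloc_slice), len(loc_slice))

# A label that is not in the index raises a KeyError.
try:
    frame.loc[99]
except KeyError:
    print("label 99 is not in the index")
```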


Pandas Series, Labels and Indexes

A Series is like a column in a table. It is a one-dimensional array that can hold any data type.

Labels can be set with the index argument.

Lists

import pandas as pd

budget_list = [33, 45, 25, 10, 90, 23, 4]

sample_a = pd.Series(budget_list, index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

# Using string formatting to add the dollar sign
sample_a_with_dollar = sample_a.map("${:.2f}".format)

print(sample_a_with_dollar)

Dictionaries

Calling .map() directly on a dictionary, as in the first snippet below, raises an AttributeError because Python dictionaries do not have a .map() method. The .map() method belongs to pandas objects such as Series.

If I have a dictionary and I want to format its values, I can iterate over the dictionary and format each value individually, as in the second snippet.

import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}

# Trying to use .map() on a dictionary to add the dollar sign
# This line raises AttributeError: 'dict' object has no attribute 'map'
budget_with_dollar = budget_dict.map("${:.2f}".format)

print(budget_with_dollar)

import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}

# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}

print(budget_with_dollar)

Indexing

When a Series is created from a dictionary, the index argument can be used to keep only specific items.

import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}

# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}

budget_specific_days = pd.Series(budget_with_dollar, index = ["Saturday", "Sunday"])

print(budget_specific_days)
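One detail worth knowing: the labels passed to index do not have to exist in the dictionary. In the sketch below, "Holiday" is a made-up label that is not a key, so pandas fills its value with NaN instead of raising an error:

```python
import pandas as pd

budget_dict = {"Friday": 90, "Saturday": 23, "Sunday": 4}

# Only the keys listed in index are kept, in the order given.
# "Holiday" is a hypothetical label that is not in the dictionary, so its value is NaN.
weekend = pd.Series(budget_dict, index=["Sunday", "Saturday", "Holiday"])

print(weekend)
```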

Data Frames using Lists

Data Frames are two-dimensional tables. A Data Frame represents a whole table, while a Series is like a single column.

Syntax: variable = pd.DataFrame(data)

Example: Upgrade my budget table by adding another Series for the food bought, using lists.

# Data Frame using lists

import pandas as pd

# create lists for each category 
expenses = [33, 45, 25, 10, 90, 23, 4]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
items = ["pizza kit", "tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"]

# Create pandas Series (column) for items
items_series = pd.Series(items, index = days)

# Create pandas Series(column) for expenses
expenses_series = pd.Series(expenses, index = days)

# Combine the two Series as the columns of a DataFrame
budget_df = pd.DataFrame({"Budget" : expenses_series, "Items" : items_series})

print(budget_df)

This code creates a DataFrame named budget_df with two columns: Budget and Items. The Budget column contains the budget data and the Items column contains the corresponding food items for each day of the week.

Data Frames using Dictionaries

Here, I am creating a base guide on how I understand DataFrames using dictionaries.

Data dictionary: Create a dictionary where the keys represent the column names, and the values are lists or arrays containing the data for each column. Each list should have the same length.

Creating a DataFrame: Use the pd.DataFrame() function, passing your dictionary as an argument. This function converts the dictionary into a DataFrame object.

# DataFrames using Dictionaries
# start with this simple code as a guide

import pandas as pd

data = {
    "numbers": [1, 2, 3],         # int
    "letters": ["a", "b", "c"]    # str
}

# load data into a DataFrame object
sample = pd.DataFrame(data)
print(sample)                     # print the DataFrame, not the original dictionary
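The requirement that each list has the same length can be checked with a small sketch. The uneven dictionary below is a made-up example in which "letters" has one value fewer than "numbers":

```python
import pandas as pd

# Hypothetical data: "letters" has one value fewer than "numbers".
uneven = {
    "numbers": [1, 2, 3],
    "letters": ["a", "b"],
}

try:
    pd.DataFrame(uneven)
    built = True
except ValueError as err:
    # pandas refuses to build a table from columns of different lengths.
    built = False
    print("Could not build the DataFrame:", err)
```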

Example: Upgrade my budget table by creating a DataFrame using dictionaries. I just need to provide a dictionary where the keys are the column names and the values hold the data for each column.

import pandas as pd

# create dictionary day:expenses pair
expenses_dict = {
    "Monday":33,
    "Tuesday":45,
    "Wednesday":25,
    "Thursday":10,
    "Friday":90,
    "Saturday":23,
    "Sunday":4
}


# create dictionary day:items pair
items_dict = {
    "Monday":"pizza kit",
    "Tuesday":"tuna and salmon",
    "Wednesday":"vegetables",
    "Thursday":"pre-made dinner",
    "Friday":"wagyu",
    "Saturday":"greek salad",
    "Sunday":"oatmeal"
}

# create pandas series (columns) from dictionaries
expenses_series = pd.Series(expenses_dict)
items_series = pd.Series(items_dict)

# create DataFrame by combining the 2 series as columns
budget_df = pd.DataFrame({"Budget": expenses_series, "Items":items_series})

print(budget_df)