Big Mac: Data Acquisition
Part of my Python Data Analysis Learning Log
Welcome to my beginner-friendly guide on exploring the BigMac dataset using the powerful Pandas library in Python. Here's a simple breakdown of what you'll discover:
Contents
Data Acquisition: acquire datasets, mainly in CSV format, from Kaggle.
Syntax Error Handling: address common issues like Unicode errors.
Features of the DataFrame: features and structure of the Pandas DataFrame, showcasing its versatility in handling tabular data.
Adding Headers: manipulate headers within the DataFrame.
Identifying NaN Values: outline methods to identify missing values within the dataset.
Saving and Reading Data: save and read datasets in CSV and other formats.
Exploring the Dataset: explore the dataset, including checking data types, obtaining summary statistics, and using the info() and describe() methods to gain insights into the dataset's structure and content.
Data Acquisition
The dataset used in this project is in CSV format, and the cleaned DataFrame is stored on my computer. The Pandas library is a very popular tool that can read a variety of datasets into a DataFrame. Jupyter Notebook platforms come with the Pandas library built in, so all that needs to be done is to import Pandas, without installing it.
Kaggle's BigMac Dataset
I used Kaggle's Big Mac dataset, which shows McDonald's Big Mac prices for countries around the world from 2000 to 2022.
https://www.kaggle.com/datasets/vittoriogiatti/bigmacprice/data
pip install requests
If you're running the code within a Jupyter Notebook and you haven't already installed the requests library, you will need to install it by running pip install requests in your terminal or command prompt.
After executing this command, you should be able to import and use the requests library in your notebook. Then you can proceed with the rest of the code to download the dataset.
This code downloads the dataset from the provided URL and saves it with the specified filename in the current directory. You can then use pandas.read_csv() to read the CSV file into a DataFrame as usual.
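Here is a minimal sketch of what that download could look like with requests. The URL below is only a placeholder; Kaggle downloads generally require signing in or using the Kaggle API, so you may need a direct file link instead.
# Sketch: download a CSV file with requests (placeholder URL)
import requests
url = "https://example.com/BigmacPrice.csv"  # hypothetical direct link to the CSV
filename = "BigmacPrice.csv"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(filename, "wb") as f:
    f.write(response.content)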
Extracting the Dataset from Kaggle
I extracted the data online by downloading the dataset from Kaggle and saving it as a CSV file on my computer. Then, copy the file path of the downloaded CSV and convert it as follows:
What to do when encountering SyntaxError: (unicode error)
This error occurs because, in a regular Python string, a backslash followed by certain characters (for example, the \U in C:\Users) is interpreted as an escape sequence.
Path copied from the Excel (CSV) file:
file_path = "C:\Users-----\Documents\---\All Python Files\Kaggle Data Sets\BigmacPrice.csv"
Convert this into either:
raw string literal
file_path = r'C:\Users-----\Documents\---\All Python Files\Kaggle Data Sets\BigmacPrice.csv'
or
escaping backslashes
file_path = 'C:\\Users\\-----\\Documents\\---\\All Python Files\\Kaggle Data Sets\\BigmacPrice.csv'
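As a side note, one sketch of an alternative that avoids escape-sequence issues altogether is the pathlib module, which builds the path from its parts (the folder names below are illustrative, not the real redacted path):
# Sketch: build the path with pathlib instead of a raw string
from pathlib import Path
file_path = Path.home() / "Documents" / "Kaggle Data Sets" / "BigmacPrice.csv"
print(file_path)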
# Import necessary libraries
import pandas as pd
# Define the file path to your CSV file
# Replace 'path_to_your_csv_file.csv' with the actual file path
file_path = r'C:\Users\---\Documents\---\All Python Files\Kaggle Data Sets\BigmacPrice.csv'
# Read the CSV file into a DataFrame called "bigmac_df"
bigmac_df = pd.read_csv(file_path)
# Display the first few rows of the DataFrame to check if the data is imported correctly
bigmac_df.head()
# display the whole dataframe
bigmac_df
Here, you can see that the dataframe has 1946 rows and 6 columns.
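One quick way to confirm those dimensions is the DataFrame's shape attribute:
# Number of (rows, columns) in the DataFrame
bigmac_df.shape  # (1946, 6)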
Features of the DataFrame
Add Headers
Take a look at the dataset. Let's say you want to rename the "Name" column to "Country". You may also want to reset the headers; note that Pandas only assigns integer column labels starting from 0 when a file is read without a header row.
First, create a list called "headers" that includes all the column names in order. Then, use df.columns = headers to replace the headers with the list you created.
# Create headers list
headers = ["Date", "Currency_Code", "Country", "Local_Price", "Dollar_Ex", "Dollar_Price"]
# Replace headers and recheck data frame
bigmac_df.columns = headers
bigmac_df.columns
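As an alternative sketch, a single column can also be renamed with rename(); this assumes the original column in the Kaggle file is called "name" and uses a separate copy so the DataFrame above is left unchanged:
# Alternative: rename just one column instead of replacing all headers
# (assumes the original Kaggle column is called "name")
alt_df = pd.read_csv(file_path)
alt_df = alt_df.rename(columns={"name": "Country"})
alt_df.columns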
Checking for NaN Values in the DataFrame
There could be NaN (Not a Number) values in the DataFrame, which Pandas treats as missing values. However, since this is a large dataset with 1946 rows, it is time-consuming to check them one by one.
You can check for NaN values per column or for the entire DataFrame at once. Here are some methods:
x = df.isnull().sum()
This will give the count of NaN values in each column of the DataFrame.
# Check for NaN values per column
nan_per_column = bigmac_df.isnull().sum()
nan_per_column
x = df.isnull().sum().sum()
This will give the total count of NaN values in the entire DataFrame.
# Check for NaN values in the entire DataFrame
nan_total = bigmac_df.isnull().sum().sum()
print("Total NaN values in DataFrame:", nan_total)
This implies that the dataset is complete, and there are no missing values in any of the columns or the dataset.
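If the total were non-zero, one quick way to inspect the affected rows is a boolean mask over the whole DataFrame:
# Sketch: show any rows that contain at least one NaN value
# (with this dataset the result is expected to be empty)
rows_with_nan = bigmac_df[bigmac_df.isnull().any(axis=1)]
rows_with_nan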
dropna() method
However, in cases where missing values are found, it is important to drop NaN (Not a Number) values in Pandas because they represent missing or undefined data. NaN values can affect the accuracy and reliability of your data analysis.
The dropna() method in pandas is used to remove missing values (NaN, null values) from a DataFrame or Series object. It provides flexibility in terms of which axis (rows or columns) to consider for dropping, as well as the threshold for the number of missing values required to trigger dropping.
The axis argument specifies whether to drop rows (axis=0) or columns (axis=1) that contain missing values. By default, it's set to 0, meaning it drops rows.
Setting the argument inplace to True applies the modification to the dataset directly; inplace=True simply writes the result back into the original DataFrame instead of returning a new one.
"""sample code if NaN Values are found in dataset
# drop missing values along the column "x" as follows
df = df1.dropna(subset = ["x"], axis = 0)
df = df1.dropna(subset = ["x"], axis = 0, inplace = True)
df.head()
"""
Save Dataset
Correspondingly, Pandas enables you to save the dataset to CSV. Using the dataframe.to_csv() method, you pass the file path and name as a quoted string inside the parentheses.
If you want to save the DataFrame named bigmac_df as a file named bigmac.csv on your computer, you can use the following code. The index=False part means that the row indices will not be saved along with the data.
bigmac_df.to_csv("bigmac.csv", index=False)
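To double-check the saved file, you could read it straight back in:
# Sketch: read the saved CSV back to verify it was written correctly
check_df = pd.read_csv("bigmac.csv")
check_df.head()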
Read/Save Other Data Formats
You can also read and save other file formats. Functions similar to pd.read_csv() and df.to_csv() exist for the other data formats; a few common read/write pairs are sketched below.
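A minimal sketch of a few of those pairs (the file names are placeholders, and the Excel calls assume an engine such as openpyxl is installed):
import pandas as pd
df = pd.read_csv("bigmac.csv")
# JSON
df.to_json("bigmac.json")
json_df = pd.read_json("bigmac.json")
# Excel (.xlsx)
df.to_excel("bigmac.xlsx", index=False)
excel_df = pd.read_excel("bigmac.xlsx")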
Exploring the Dataset
After successfully extracting and reading the data into a Pandas DataFrame, it is important to explore the dataset. There are several ways to obtain essential insights into the data to help you better understand it. These can be done with the following:
Check the Data Types
Data has a variety of types. The main types stored in Pandas data frames are object, float, int, bool and datetime64.
In order to better learn about each attribute, you should always know the data type of each column.
# Returns a series with the data type of each column.
bigmac_df.dtypes
Get Summary Statistics
bigmac_df.describe(include='all')
This code will display summary statistics for all columns in the DataFrame, including count, unique, top, and frequency for categorical columns, and mean, standard deviation, minimum, maximum, and quartiles for numerical columns.
# Display the summary of the DataFrame to check if the data is imported correctly
bigmac_df.describe(include='all')
describe() method:
Select particular columns/series in the data frame
You can select the columns of a dataframe by indicating the name of each column. For example, you can select the three columns as follows:
dataframe[['column 1', 'column 2', 'column 3']]
where 'column 1', 'column 2' and 'column 3' are the names of the columns. You can then apply the method .describe() to get the statistics of those columns as follows:
dataframe[['column 1', 'column 2', 'column 3']].describe()
bigmac_df[['Local_Price', 'Dollar_Ex', 'Dollar_Price']].describe()
Info
The info() method provides a concise summary of the DataFrame. It prints information about the DataFrame, including the index dtype and columns, non-null counts, and memory usage.
bigmac_df.info()
GitHub Documentation
Disclosure
The content of this learning log is based on my personal reflections and insights gained from completing the IBM Data Analysis in Python course on Coursera. While I have engaged with the course materials and exercises, the views and interpretations presented here are entirely my own and may not necessarily reflect those of the course instructors or Coursera.