Laptop Pricing Dataset: Data Wrangling

Part of my Python Data Analysis Learning Log

Objectives

  • Handle missing data

  • Correct data types

  • Standardize and normalize data attributes

  • Visualize data using binning

  • Convert categorical data to numerical variables

Import Libraries

About %matplotlib inline

A Jupyter Notebook magic command that tells the notebook to display plots directly below the code cell that created them, rather than in a separate window or file.

Use import requests

Importing requests lets the program fetch the dataset over HTTP; without the import, calling requests.get() raises a NameError.

%pip install requests  # in Jupyter, use the %pip magic so the package installs into the running kernel
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests

Read CSV file into Pandas DataFrame

Source: file path provided by the IBM Data Analysis course on Coursera.

# Define the URL from which to download the data
file_url = 'https://coursera.....csv'  # placeholder; the full URL is provided in the course lab
response = requests.get(file_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content to a file named 'laptops.csv'
    with open('laptops.csv', 'wb') as f:
        f.write(response.content)
    print("Download successful. File saved as 'laptops.csv'")
else:
    print("Failed to download the file")

# Use pandas to read the CSV file into a DataFrame
file_path = 'laptops.csv'
df = pd.read_csv(file_path, header=0)

# Display the first few rows of the DataFrame to check if the data is imported correctly
df.head()

header = 0 explanation

file_path = 'laptops.csv'
df = pd.read_csv(file_path, header=0)

When you read a CSV file into a DataFrame with pd.read_csv() in pandas, the header parameter determines which row of the CSV file is treated as the header, i.e., the row that contains the column names.

Use header=0 when you are sure the first row of the CSV file contains the column names; stating it explicitly makes the code clearer.
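To see the difference concretely, here is a minimal standalone sketch (made-up inline data, not the course file) comparing header=0 with header=None:

import pandas as pd
from io import StringIO

# A tiny in-memory CSV whose first row holds the column names
csv_text = "Model,Price\nA,950\nB,1200\n"

# header=0: the first row becomes the column names
df_named = pd.read_csv(StringIO(csv_text), header=0)
print(df_named.columns.tolist())    # ['Model', 'Price']

# header=None: every row is treated as data; pandas assigns integer column names
df_unnamed = pd.read_csv(StringIO(csv_text), header=None)
print(df_unnamed.columns.tolist())  # [0, 1]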

Verify DataFrame Loading

# Verify loading by displaying the dataframe summary using df.info()
# Note: df.info() prints its summary and returns None, so there is no need to wrap it in display()
df.info()

Verify DataFrame Columns

# View the first 5 values of the updated dataframe using df.head()
df.head()

np.round()

Update the Screen_Size_cm column by rounding its values to 2 decimal places. Keep an eye on how the Screen_Size_cm column changes.

df[['Screen_Size_cm']] = np.round(df[['Screen_Size_cm']], 2)
df.head()

# Check the last rows; the final index label hints at how many rows the dataset has
df.tail()
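df.tail() only hints at the size through the last index label. A more direct check (an optional sketch in plain pandas) is:

# (rows, columns) of the DataFrame
print(df.shape)

# row count alone
print(len(df))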

Identifying and Handling Missing Values

Check the dataset for missing values (NaN)

There could be NaN (Not a Number) values in the DataFrame, which pandas may store as objects. However, since the dataset has 238 rows, checking them one by one would be time consuming.

Check for NaN values per column or for the entire DataFrame at once. Here are some methods:

Per Column

df.isnull().sum() gives the count of NaN values in each column of the DataFrame.

nan_per_column = df.isnull().sum()
nan_per_column

Result shows that Screen_Size_cm has 4 NaN values and Weight_kg has 5 NaN values.

Entire DataFrame

df.isnull().sum().sum() gives the total count of NaN values in the entire DataFrame.

# Check for NaN values in the entire DataFrame
nan_total = df.isnull().sum().sum()
print("Total NaN values in DataFrame:", nan_total)

Use for loop

Identify which entries are null in each column. In the missing_data DataFrame, True means the value is missing and False means it is present.

# Create new dataframe
missing_data = df.isnull()
# optional
print(missing_data.head())

# Loop over each column of the new DataFrame
for column in missing_data.columns.values.tolist():
    print(column)                                  # name of the current column
    print(missing_data[column].value_counts())     # counts of True (missing) and False (present)
    print("")                                      # blank line for readability

Replace missing values

Remember that there are two ways to replace a missing value: by frequency (the most common value) or by the mean.

Replace with mean

How to replace the 5 missing values in Weight_kg? We can replace missing values by the average value of the entire variable.

# replace missing values with mean
# astype() converts the values to the desired data type
# axis=0 computes the mean down the column, i.e., across all rows
avg_weight = df['Weight_kg'].astype('float').mean(axis=0)

# replace(old_value, new_value)
# assign the result back instead of using inplace=True on a column slice,
# which is deprecated in recent pandas versions
df['Weight_kg'] = df['Weight_kg'].replace(np.nan, avg_weight)

print(df["Weight_kg"])

# Check again if replacing all NaN values are successful
nan_weight_kg = df['Weight_kg'].isnull().sum()
nan_weight_kg
# Output: 0

Replace with the most frequent value

The Screen_Size_cm column, which also has 4 missing values, is categorical in nature. We can fill in these missing values with the size that appears most frequently among the laptops' screen sizes.

The idxmax() method in pandas returns the index label where the maximum value occurs. Applied to the Series returned by value_counts(), the index labels are the screen sizes and the values are their counts, so idxmax() returns the most frequent screen size.
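As a tiny standalone illustration (toy values, not the dataset) of how value_counts() and idxmax() combine:

import pandas as pd

sizes = pd.Series([39.6, 35.6, 39.6, 33.8, 39.6])

counts = sizes.value_counts()
print(counts)           # 39.6 appears 3 times, so it tops the counts
print(counts.idxmax())  # 39.6 -- the index label (screen size) with the highest count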

# replacing missing data with mode
common_screen_size = df['Screen_Size_cm'].value_counts().idxmax()

# replace(old_value, new_value)
# assign the result back instead of modifying a column slice with inplace=True
df["Screen_Size_cm"] = df["Screen_Size_cm"].replace(np.nan, common_screen_size)

print(df["Screen_Size_cm"])

# Check again that all NaN values were replaced successfully
nan_screen_size = df['Screen_Size_cm'].isnull().sum()
nan_screen_size
# Output: 0

Converting the data types

Both Weight_kg and Screen_Size_cm are already float data types, so there is no need to convert. For practice, however, this is how to convert data types in pandas.

df[["Weight_kg","Screen_Size_cm"]] = df[["Weight_kg","Screen_Size_cm"]].astype("float")
print(df[["Weight_kg", "Screen_Size_cm"]].dtypes)

Data Standardization and Normalization

Data Standardization

It is sometimes unavoidable that the data is written into different formats as it is collected from different places by different people.

This is where data formatting comes in. Data formatting(standardization) is like putting all your information into a common language everyone can understand. It's a step in cleaning up your dataset where you make sure everything looks the same and makes sense. This consistency helps people compare and analyze the data without any confusion.

Looking at the head of the DataFrame again, there is a chance that Screen_Size_cm contains inputs in different units of measurement (e.g., cm or inches). Descriptive statistics can help check for inconsistencies.

Descriptive Statistics

The describe() method gives a summary of the data's main points, like its average, spread, and overall shape, which saves time compared to inspecting each value separately. This is useful for understanding the 'Screen_Size_cm' column in your DataFrame.

# Assuming 'Screen_Size_cm' is the column of interest
screen_size_stats = df['Screen_Size_cm'].describe()

# Print the descriptive statistics
print(screen_size_stats)

Provided results:

  • count: there are 238 non-null values in the column.

  • mean: the average screen size is approximately 37.31 cm.

  • standard deviation (std): approximately 2.96 cm, indicating the variability of screen sizes around the mean.

  • min/max: the smallest screen size is 30.48 cm and the largest is 43.94 cm.

There doesn't appear to be any obvious inconsistencies in the data. However, it is good to further investigate any specific patterns or anomalies that could be present in the dataset.
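One simple follow-up, sketched here with the common 1.5 * IQR rule (my own addition, not part of the course lab), is to flag values far outside the interquartile range:

# Flag potential outliers in Screen_Size_cm using the 1.5 * IQR rule
q1 = df['Screen_Size_cm'].quantile(0.25)
q3 = df['Screen_Size_cm'].quantile(0.75)
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['Screen_Size_cm'] < lower) | (df['Screen_Size_cm'] > upper)]
print(outliers[['Screen_Size_cm']])  # empty output means no values were flagged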

Standardize the measurement from cm to inches

Let's say we decide to standardize the unit of measurement from cm to inches. The conversion can be done as follows:

Screen_Size_inch = Screen_Size_cm / 2.54, since 1 inch = 2.54 cm.

To create a new column 'Screen_Size_inch' without overwriting the existing 'Screen_Size_cm' column, use the following code:

# Data standardization: convert screen size from cm to inch
df['Screen_Size_inch'] = df['Screen_Size_cm'] / 2.54
display(df.head())

Use descriptive statistics to check for outliers in the Weight_kg column as well.


# 'Weight_kg' is the column of interest
weight_stats = df['Weight_kg'].describe()

# Print the descriptive statistics
print(weight_stats)

Provided results show that:

  • Count: There are 238 non-null entries for the weight of laptops in the dataset.

  • Mean: The average weight of laptops is approximately 1.86 kilograms.

  • Standard Deviation: The standard deviation, which measures the dispersion of weights around the mean, is approximately 0.49 kilograms.

  • Minimum: The minimum weight recorded in the dataset is approximately 0.81 kilograms.

There doesn't appear to be any obvious inconsistency in the data, but let's create a new column for the weight in pounds.

# Convert kilograms to pounds and create a new column 'Weight_pounds'
df['Weight_pounds'] = df['Weight_kg'] * 2.205
display(df.head())

Data Normalization

Data normalization is a way to make sure all the numbers in a dataset are in a similar range. It helps us compare different pieces of data more easily. We adjust the numbers so they're not too big or too small. This makes it simpler to understand and analyze the information.

Here, the CPU_frequency column can be normalized and rounded to 2 decimal places.

# Normalize CPU frequency and round to 2 decimal places
df['CPU_frequency'] = (df['CPU_frequency'] / df['CPU_frequency'].max()).round(2)
display(df.head())
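The code above uses simple max scaling (divide every value by the column maximum). For comparison, here is a sketch of two other common normalizations in plain pandas; note that df['CPU_frequency'] was already overwritten above, so this is for illustration only:

col = df['CPU_frequency']

# Min-max scaling: rescales values into [0, 1]
min_max_scaled = (col - col.min()) / (col.max() - col.min())

# Z-score standardization: mean 0, standard deviation 1
z_scored = (col - col.mean()) / col.std()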

Before vs. after normalization: the raw CPU_frequency values are rescaled to the range 0 to 1 (screenshots in the original notebook).

Data Binning

Binning means putting similar things together in groups. In this case, create 3 bins for the attribute Price. These bins would be named "Low", "Medium" and "High". The new attribute will be named Price-binned.

bins = np.linspace(min(df["Price"]), max(df["Price"]), 4)
group_names = ['Low', 'Medium', 'High']
df['Price-binned'] = pd.cut(df['Price'], bins, labels=group_names, include_lowest=True )
display(df.head())

Binning Code Breakdown

1. Creating Bins:

Divide the prices of items into different groups, or bins, to analyze them more easily.

Create a set of numbers that evenly cover the range of prices.

Use np.linspace(min(df["Price"]), max(df["Price"]), 4) to create these numbers.

Here, min(df["Price"]) gives the smallest price in the dataset, and max(df["Price"]) gives the largest. The 4 tells np.linspace to return 4 evenly spaced numbers, which serve as the edges of 3 equal-width bins (see the toy example after this breakdown).

2. Assigning Labels:

Once the bins are created, each group needs a name. Define the names for the groups as 'Low', 'Medium', and 'High'.

3. Performing Binning:

Take the actual prices in the Price column and assign them to the appropriate group.

pd.cut(df['Price'], bins, labels=group_names, include_lowest=True) does this.

It checks each price and decides which bin it belongs to based on the bins defined earlier.

If the price is within a certain range, it gets assigned the corresponding label ('Low', 'Medium', or 'High').

4. Creating a New Column:

Create a new column in the DataFrame called Price-binned. This column will hold the labels ('Low', 'Medium', 'High') assigned to each price group.
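To make the edge arithmetic concrete, here is a toy example (made-up prices, not the dataset) showing the 4 edges and the 3 bins they define:

import numpy as np
import pandas as pd

prices = pd.Series([100, 250, 400, 550, 700, 850, 1000])

edges = np.linspace(prices.min(), prices.max(), 4)
print(edges)  # [ 100.  400.  700. 1000.] -- 4 edges, 3 equal-width bins

labels = ['Low', 'Medium', 'High']
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)
print(binned.value_counts(sort=False))  # Low: 3, Medium: 2, High: 2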

value_counts()

Count the occurrences of each category in the Price-binned column.

# Print the count of each price category
print(df['Price-binned'].value_counts())

Visualize Binned Data

Plot the Price bins into a bar graph.

plt.bar() This function creates a bar plot using the specified data.

group_names This provides the x-axis labels for the bar plot: the price groups ('Low', 'Medium', 'High').

df["Price-binned"].value_counts() This calculates the frequency of each price group in the Price-binned column. Because value_counts() sorts by frequency by default, the counts are reindexed by group_names below so they line up with the x-axis labels.

# Create the bar plot
# reindex the counts so their order matches group_names on the x-axis
counts = df["Price-binned"].value_counts().reindex(group_names)
plt.bar(group_names, counts, color=['lightblue', 'violet', 'salmon'])

# Add labels and title
plt.xlabel("Price Group")
plt.ylabel("Count")
plt.title("Distribution of Price Groups")

# Add grid lines
plt.grid(axis='y', linestyle='--', alpha=0.8)

# Show plot
plt.show()

Indicator Variables (Dummy Variables)

Converting Categorical Variables to Numerical Variables

The Screen attribute has categories like IPS Panel and Full HD. These categories are not numerical and cannot be used directly in mathematical computations, but categorical data can be represented in a numerical format, which is helpful for statistical models. Each category becomes a separate binary (0 or 1) variable.

Keep the original Screen values by copying them into a new column named Original_Screen, in addition to creating the dummy variables.

# Indicator Variable: Screen
dummy_variable_1 = pd.get_dummies(df["Screen"])
dummy_variable_1.rename(columns={'IPS Panel':'Screen-IPS_panel', 'Full HD':'Screen-Full_HD'}, inplace=True)

# Add the indicator variables to the DataFrame
df = pd.concat([df, dummy_variable_1], axis=1)

# Keep the original "Screen" column as well by creating a new column
df['Original_Screen'] = df['Screen']

Code Breakdown

Use the pd.get_dummies() function from pandas to create the dummy variables.

rename(columns={'IPS Panel':'Screen-IPS_panel', 'Full HD':'Screen-Full_HD'}) specifies the renaming operation. The rename() method renames:

  • 'IPS Panel' to 'Screen-IPS_panel'

  • 'Full HD' to 'Screen-Full_HD'

inplace=True modifies the DataFrame in place instead of returning a copy.

The concat() function concatenates two DataFrames along the specified axis: axis=1 concatenates along columns, while axis=0 concatenates along rows.
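A quick standalone look (toy values) at what pd.get_dummies() produces; note that recent pandas versions return boolean columns, so casting to int gives explicit 0/1 values:

import pandas as pd

screen = pd.Series(['IPS Panel', 'Full HD', 'Full HD', 'IPS Panel'])

dummies = pd.get_dummies(screen).astype(int)  # cast bool -> 0/1
print(dummies)
#    Full HD  IPS Panel
# 0        0          1
# 1        1          0
# 2        1          0
# 3        0          1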


Github Documentation

https://github.com/Akina-Aoki/Data-Science-Files/blob/823b5b5950ed0baec4318f284e9c044c6e2d76be/Data%20Wrangling%20Laptop%20Pricing%20Dataset.ipynb

Disclosure

The content of this learning log is based on my personal reflections and insights gained from completing the IBM Data Analysis in Python course on Coursera. While I have engaged with the course materials and exercises, the views and interpretations presented here are entirely my own and may not necessarily reflect those of the course instructors or Coursera.