42.85. Data engineering#
This assignment focuses on techniques for cleaning and transforming data to handle the challenges of missing, inaccurate, or incomplete data. Please refer to Machine Learning productionization - Data engineering to learn more.
Fill in the ____ pieces of the implementation below in order to pass the assertions.
42.85.1. Exploring dataset#
Learning goal: By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
In order to explore this functionality, we will import a modified version of the iconic Iris dataset from the Python scikit-learn library.
import pandas as pd
from sklearn.datasets import load_iris
import math
iris_df = pd.read_csv('../../assets/data/modefined_sklearn_iris_dataset.csv', index_col=0)
iris_df
To start off, print a summary of the DataFrame.
iris_df.____
Next, let’s check the actual content of the DataFrame.
# displaying the first 5 rows of our iris_df
iris_df____
# in the first five rows, which one's sepal length is 5.0 cm?
assert iris_df.iloc[____, 0] == 5.0
Conversely, we can check the last few rows of the DataFrame.
# displaying the last 5 rows of our `iris_df`
iris_df.____
# in the last five rows, which one's sepal width is 2.5 cm?
assert iris_df.iloc[____, 1] == 2.5
Takeaway: Even just by looking at a DataFrame's metadata or its first and last few values, you can get an immediate idea of the size, shape, and content of the data you are dealing with.
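Before moving on, here is a minimal sketch of these inspection methods on a small made-up DataFrame (the names below are illustrative only, not part of the exercise):

import pandas as pd

# a tiny illustrative DataFrame, unrelated to the exercise
toy_df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5.0, 6.0, 7.0, 8.0]})

toy_df.info()          # summary: column dtypes, non-null counts, memory usage
print(toy_df.head(2))  # peek at the first 2 rows
print(toy_df.tail(2))  # peek at the last 2 rows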
42.85.2. Dealing with missing data#
Missing data can cause inaccuracies as well as weak or biased results. Sometimes these issues can be resolved by a “reload” of the data, by filling in the missing values with code, or simply by removing the affected values and their corresponding data. There are numerous reasons why data may be missing, and the actions taken to resolve missing values can depend on how and why the data went missing in the first place.
Learning goal: By the end of this subsection, you should know how to replace or remove null values from DataFrames.
In pandas, the `isnull()` and `notnull()` methods are your primary means of detecting null data. Both return Boolean masks over your data. Missing entries show up as `NaN` (Not a Number) values, which we can check with `math.isnan`:
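As a quick illustration on a toy Series (separate from the exercise below):

import pandas as pd
import numpy as np

toy_series = pd.Series([1.0, np.nan, 3.0])
print(toy_series.isnull())   # True where the value is missing
print(toy_series.notnull())  # the complementary mask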
iris_isnull_df = iris_df.isnull()
print(iris_isnull_df)
# find one cell with a missing value
assert iris_isnull_df.iloc[____, ____] == True
assert math.isnan(iris_df.iloc[____, ____]) == True
# get all the rows with missing data
iris_with_missing_value_df = iris_df____
assert iris_with_missing_value_df.shape[0] == 16
Dropping null values: Beyond identifying missing values, pandas provides `dropna`, a convenient method to remove null values from Series and DataFrames. (Particularly on large datasets, it is often more advisable to simply remove missing [NA] values from your analysis than to deal with them in other ways.)
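For reference, here is a minimal sketch of `dropna` and a few of its keyword arguments on a toy DataFrame (illustrative only; the exercise cells below work on `iris_df`):

import pandas as pd
import numpy as np

toy_df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [4.0, 5.0, np.nan]})
print(toy_df.dropna())          # drop every row that contains any missing value
print(toy_df.dropna(axis=1))    # drop every column that contains any missing value
print(toy_df.dropna(thresh=2))  # keep only rows with at least 2 non-missing values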
# remove all the rows with missing values
iris_with_dropna_on_row_df = iris_df.____
assert iris_with_dropna_on_row_df.shape[0] == 134
# remove all the columns with missing values
iris_with_dropna_on_column_df = iris_df.____
assert iris_with_dropna_on_column_df.columns.shape[0] == 0
# remove all the rows with 2 missing values
iris_with_dropna_2_values_on_rows_df = iris_df.____
assert iris_with_dropna_2_values_on_rows_df.shape[0] == 144
# remove all the rows with 1 missing value
iris_with_dropna_1_values_on_rows_df = iris_df.____
assert iris_with_dropna_1_values_on_rows_df.shape[0] == 147
Filling null values: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` masks to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the Series or DataFrame with the missing values replaced with one of your choosing.
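Here is a minimal sketch of the common filling strategies on a toy Series (illustrative only):

import pandas as pd
import numpy as np

toy_series = pd.Series([1.0, np.nan, 3.0, np.nan])
print(toy_series.fillna(0))  # replace every missing value with a constant
print(toy_series.ffill())    # forward-fill: propagate the last valid value
print(toy_series.bfill())    # back-fill: use the next valid value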
# fill all the missing values with -1
iris_with_fillna_df = iris_df.____
# get all the rows with missing data
iris_with_missing_value_after_fillna_df = iris_with_fillna_df____
assert iris_with_missing_value_after_fillna_df.shape[0] == 0
assert iris_with_fillna_df.iloc[____, 3] == -1
# forward-fill null values, which is to use the last valid value to fill a null:
iris_with_fillna_forward_df = iris_df.____
# get all the rows with missing data
iris_with_missing_value_after_fillna_forward_df = iris_with_fillna_forward_df____
assert iris_with_missing_value_after_fillna_forward_df.shape[0] == 0
assert float(iris_with_fillna_forward_df.iloc[3, 3]) == 0.2
# back-fill null values, which is to use the next valid value to fill a null:
iris_with_fillna_back_df = iris_df.____
# get all the rows with missing data
iris_with_missing_value_after_fillna_back_df = iris_with_fillna_back_df____
assert iris_with_missing_value_after_fillna_back_df.shape[0] == 0
assert float(iris_with_fillna_back_df.iloc[3, 3]) == 0.1
42.85.3. Removing duplicate data#
Data that occurs more than once can produce inaccurate results and usually should be removed. Duplication is a common occurrence when joining two or more datasets together. However, there are instances where duplicates in joined datasets contain pieces of information that may need to be preserved.
Learning goal: By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.
Identifying duplicates: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let’s see this in action on `iris_df`.
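For instance, on a toy DataFrame (illustrative only):

import pandas as pd

toy_df = pd.DataFrame({"letters": ["A", "B", "A"], "numbers": [1, 2, 1]})
print(toy_df.duplicated())  # the third row repeats the first, so it is marked True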
iris_isduplicated_df = iris_df.____
print(iris_isduplicated_df)
# find one row with a duplicated value
assert iris_isduplicated_df.iloc[____, ____] == True
Dropping duplicates: `drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are False:
# remove all the rows with duplicated values
iris_with_drop_duplicates_on_df = iris_df.drop_duplicates()
assert iris_with_drop_duplicates_on_df.shape[0] == 143
Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your DataFrame:
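For instance, deduplicating on a single column of a toy DataFrame (illustrative only):

import pandas as pd

toy_df = pd.DataFrame({"letters": ["A", "B", "A"], "numbers": [1, 2, 3]})
# only the 'letters' column is compared, so the third row is dropped
print(toy_df.drop_duplicates(subset=["letters"]))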
# remove all the rows with duplicated values on column 'petal width (cm)'
iris_with_drop_duplicates_on_column_df = iris_df.____
assert iris_with_drop_duplicates_on_column_df.shape[0] == 27
42.85.4. Handle inconsistent data#
Depending on the source, data can have inconsistencies in how it’s presented. This can cause problems when searching for and representing a value: it is seen within the dataset but is not properly represented in visualizations or query results. Common formatting problems involve resolving whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards on how dates and numbers are presented can differ by country.
Learning goal: By the end of this subsection, you should know how to handle inconsistent data formats in the DataFrame.
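As a small illustration of one common formatting fix (a toy Series with stray whitespace; the names below are made up for this sketch):

import pandas as pd

toy_series = pd.Series([" 4.2", "4.50 ", "4.2 "])
# strip the whitespace, then convert the strings to a proper numeric dtype
cleaned = toy_series.str.strip().astype(float)
print(cleaned.unique())  # [4.2, 4.5]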
Let’s clean up the 4th column, `petal width (cm)`, to make sure there are no data entry inconsistencies in it. First, we will use `unique`, a convenient pandas method that returns the distinct values of a column based on a hash table:
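For instance, on a toy Series (illustrative only):

import pandas as pd
import numpy as np

toy_series = pd.Series([0.2, 0.2, 1.0, np.nan])
print(toy_series.unique())  # each distinct value appears once: [0.2, 1.0, nan]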
column_to_format = ____
column_to_format_unique = column_to_format.____
print(column_to_format_unique)
# check the number of unique values in the column
assert column_to_format_unique.shape[0] == 27
Setting aside the `nan` value, you may find the numeric values are in different precisions. More specifically, `1.` and `1.5012` are not in the same precision as the other numbers. We want to append a trailing `0` to numbers like `1.`, and round numbers like `1.5012` to `1.5`.
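Here is a sketch of both steps on a toy Series (illustrative; `round` plus an `apply` with a format string is one possible approach, not necessarily the expected answer):

import pandas as pd

toy_series = pd.Series([1.0, 1.5012, 0.2])
rounded = toy_series.round(1)  # round to one decimal of precision: 1.0, 1.5, 0.2
formatted = rounded.apply(lambda v: f"{v:.1f}")  # strings '1.0', '1.5', '0.2'
print(formatted.unique())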
# first, let's apply `round` to the values to make the precision all .1f
formatted_column = column_to_format.____
print(formatted_column.unique())
assert formatted_column.unique().shape[0] == 23
# now, let's add a trailing 0 if needed, to make numbers like 1. become 1.0.
# You may need to filter out the nan values while processing.
formatted_column = formatted_column.____
print(formatted_column.unique())
assert formatted_column.unique().shape[0] == 23
42.85.5. At last#
Let’s apply all the methods above to clean up the data.
# remove all rows with missing values
no_missing_data_df = iris_df.____
# remove all rows with duplicated values
no_missing_dup_data_df = no_missing_data_df.____
# apply the precision .1f to all the numbers
cleaned_df = no_missing_dup_data_df.____
assert no_missing_data_df.shape[0] == 134
assert no_missing_dup_data_df.shape[0] == 129
assert cleaned_df[cleaned_df.columns[3]].unique().shape[0] == 22
Also, you could refer to the following for more about how to handle data quality:
- missing data - pandas - Working with missing data
- duplicate data - pandas - Duplicate Labels
- outlier
42.85.6. Acknowledgments#
Thanks to Microsoft for creating the open-source course Data Science for Beginners, which contributes some of the content in this chapter.