Udacity Data Engineering Capstone Project

Project Summary

The project follows the follow steps:

Step 1: Scope the Project and Gather Data
Step 2: Explore and Assess the Data
Step 3: Define the Data Model
Step 4: Run ETL to Model the Data
Step 5: Complete Project Write Up

Step 1: Scope the Project and Gather Data

Scope

The project is one provided by Udacity to showcase the learnings of the student throughout the program. There are four datasets as follows to complete the project.

i94 Immigration Sample Data : sample data which is from the US National Tourism and Trade Office. This data comes from the US National Tourism and Trade Office. This table is used for the fact table in this project.
World Temperature Data world_temperature. This dataset contains temperature data of various cities from 1700’s – 2013. This dataset came from Kaggle. This table is not used because the data is available until 2013
U.S. City Demographic Data us-cities-demographics. This dataset contains population details of all US Cities and census-designated places includes gender & race information. This data came from OpenSoft. The table is grouped by state to get aggregated statistics.
Airport Codes is a simple table of airport codes and corresponding cities. The rows where IATA codes are available in the table are selected for this project.

The project builds a data lake using Pyspark that can help to support the analytics department of the US immigration department to query the information by extracting data from all the sources. The conceptual data model is a Factless fact based transactional star schema with dimensions tables. Some examples of the information which can be queries from the data model include the numbers of visitors by nationality, visitor’s main country of residence, their demographics and flight information. Python is the main language used to complete the project. The libraries used to perform ETL are Pandas, Pyarrow and Pyspark. The environment used is workspace by Udacity. Immigration data was transformed from sas format to parquet format using Pyspark. These parquest files were ingested using Pyarrow and explored using Pandas to gain an understanding of the data and before building a conceptual data model. Pyspark was then used to build the ETL pipeline. The data sources provided have been cleaned, transformed to create new features and then save the data tables are parquet file. The two notebooks with all the code and output are as follows:

1. exploringUsingPandas.ipynb

2. exploringUsingPyspark.ipynb

Describe and Gather Data

Immigration Data

“Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” (https://en.wikipedia.org/wiki/Form_I-94) .It lists the traveler’s immigration category, port of entry, data of entry into the United States, status expiration date and had a unique 11-digit identifying number assigned to it. Its purpose was to record the traveler’s lawful admission to the United States ( https://i94.cbp.dhs.gov/I94/(

This is the main dataset and there is a file for each month of the year of 2016 available in the directory ../../data/18-83510-I94-Data-2016/ . It is in SAS binary database storage format sas7bdat. This project uses the parquet files available in the workspace and the folder called sap_data. The data is for the month of the month of April of 2016 which has more than three million records (3.096.313). The fact table is derived from this table.

World Temperature Data

Data is from a newer compilation put together by the Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. The original dataset from Kaggle includes several files (https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data). But for this project, only the GlobalLandTemperaturesByCity was analyzed. The dataset provides a long period of the world’s temperature (from year 1743 to 2013). However, since the immigration dataset only has data in the year of 2016, the vast majority of the data here seems not to be suitable.

Airports Data

“Airport data includes IATA airport code.An IATA airport code, also known as an IATA location identifier, IATA station code or simply a location identifier, is a three-letter geocode designating many airports and metropolitan areas around the world, defined by the International Air Transport Association (IATA). IATA code is used in passenger reservation, ticketing and baggage-handling systems (https://en.wikipedia.org/wiki/IATA_airport_code)”. It was downloaded from a public domain source (http://ourairports.com/data/)

U.S. City Demographic Data

This dataset contains information about the demographics of all US cities and census-designated places with a population greater or equal to 65,000. This data comes from the US Census Bureau’s 2015 American Community Survey. This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. The US City Demographics is the source of the STATE dimension in the data model and grouped by State.

Gather Data

from pyspark.sql import SparkSession
spark = SparkSession.builder.
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")
.enableHiveSupport().getOrCreate()
df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

df_spark.show(2)

+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|       admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|  6.0|2016.0|   4.0| 692.0| 692.0|    XXX|20573.0|   null|   null|   null|  37.0|    2.0|  1.0|    null|    null| null|      T|   null|      U|   null| 1979.0|10282016|  null|  null|   null|1.897628485E9| null|      B2|
|  7.0|2016.0|   4.0| 254.0| 276.0|    ATL|20551.0|    1.0|     AL|   null|  25.0|    3.0|  1.0|20130811|     SEO| null|      G|   null|      Y|   null| 1991.0|     D/S|     M|  null|   null| 3.73679633E9|00296|      F1|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
only showing top 2 rows

#write to parquet
df_spark.write.parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

Step 2: Explore and Assess the Data

Explore the Data

exploringUsingPandas.ipynb shows the workings to assess and explore the data.

The main finding and cleaning steps necessary are as follows:

The dataset is for 30 days in the month of April and year 2016.
Most of the people used air as mode of travel. Some people do not report their mode of transport.
Males immigrated more than females
i94 has missing values. These rows need to be dropped.
there are no duplicate gender and address for each cicid.
Immigration was to 243 different cities for multiple states.
Immigration was from 229 different cities.
Departure date is less than the arrival date. Therefore, these visitors are still in the country.
airline & fltno is also missing in some rows and hence the mode of transport was different.
i94 form supports O, M, F gender.Null values are considered invalid
Some arrival & departure records don’t have a matching flag (matflag ).
There is a minimum age as -3. The data selected is more than age of zero years.
The dates are stored in SAS date format, which is a value that represents the number of days between January 1, 1960, and a specified date. We need to convert the dates in the dataframe to a string date format in the pattern YYYY-MM-DD.
insnum can be dropped as it is for US residents or citizens
Count, dtadfile, admnum, i94res, dtaddto, occup, visapost can be dropped as these do not provide any extra information or have high missing values.
Demographic dataset doesnot have many missing values but has data for only 48 states.
Most of the iata_code are missing. Almost 50% of local codes are also missing
Select only where IATA codes are available from US airports and type of airport is either large, medium and small.
Extract iso regions and dropped continent
Rename the columns of the dataset to more meaning full names
Convert the data types of the columns
Removed city and race from Demographics
Grouped the data to provide an aggregated statistics per US state
Drop duplicates

Step 3: Define the Data Model

3.1 Conceptual Data Model

Map out the conceptual data model and explain why you chose that model

For this project, Star schema is deployed in a relational database management system as dimensional structures. Star schemas characteristically consist of fact tables linked to associated dimension tables via primary/ foreign key relationships.

3.2 Mapping Out Data Pipelines

The project involved four key decisions during the design of a dimensional model:

Select the business process.

The business process for the immigration department is to allow valid visitors into the country. The process generate events and capture performance metrics that translate into facts in a fact table.

Declare the grain.

The grain establishes exactly what a single fact table row represents. In the project the records are recorded as the event of a visitor entring the USA occurs. It is done before choosing the fact and dimension table and becomes a binding contract on the design. This ensures uniformity on all dimensional designs and critical to BI application performance and ease of use.

Identify the dimensions.

Dimensions table provides the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. In this project, a dimension is single valued when associated with a given fact row. Every dimension table has a single primary key column. This primary key is embedded as a foreign key in the associated fact table where the dimension row’s descriptive context is exactly correct for that fact table row.

Dimension tables are wide, flat denormalized tables with many low cardinality text attributes. It is designed with one column serving as a unique primary key. This primary key is not the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. These surrogate keys are simple integers, assigned in sequence. The tables also denormalize the many-to-one fixed depth hierarchies into separate attributes on a flattened dimension row. Dimension denormalization supports dimensional modeling’s twin objectives of simplicity and speed.

Identify the facts

The fact table focuses on the results of a single business process. A single fact table row has a one-to-one relationship to a measurement event as described by the fact table’s grain. Thus a fact table design is entirely based on a physical activity and is not influenced by the demands of a particular report. Within a fact table, only facts consistent with the declared grain are allowed. In this project, the information about the visitor is the fact. The fact table is transactional with each row corresponding to a measurement event at a point in space and time. It is also Factless Fact Tables as the event merely records a set of dimensional entities coming together at a moment in time. Factless fact tables can also be used to analyze what didn’t happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen. Each row corresponds to an event. The fact table contains foreign keys for each of its associated dimensions, as well as date stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.

(http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf)

Step 4: Run Pipelines to Model the Data

4.1 Create the data model

Build the data pipelines to create the data model.

4.2 Data Quality Checks

Explain the data quality checks you’ll perform to ensure the pipeline ran as expected. These could include:

Integrity constraints on the relational database (e.g., unique key, data type, etc.)
Unit tests for the scripts to ensure they are doing the right thing
Source/Count checks to ensure completeness

exploringUsingPyspark.ipynb are the workings for task 4.1 and 4.2

4.3 Data dictionary

dataDictionary.md has the data model

Step 5: Complete Project Write Up

Description of how to approach the problem differently under the following scenarios:
- The data was increased by 100x.1. Use of Redshift (https://aws.amazon.com/redshift/). It allows querying petabytes of structured and semi-structured data across the data warehouse2. Use of Cassandra (http://cassandra.apache.org/). It offers robust support for clusters spanning multiple datacenters with asynchronous masterless replication allowing low latency operations for all clients.
- The data populates a dashboard that must be updated on a daily basis by 7am every day.1. For small datasets, a cron job will be sufficient2. Use of Airflow (https://airflow.apache.org/docs/stable/macros.html)
- The database needed to be accessed by 100+ people.1. Use of Redshift with auto-scaling capabilities and good read performance2. Use of Cassandra with pre-defined indexes to optimize read queries3. Use of Elastic Map Reduce (https://aws.amazon.com/emr/). It allows provisioning one, hundreds, or thousands of compute instances to process data at any scale.

Tagged Data Engineering, Data Warehousing, Education