Project Summary

The project follows these steps:

Step 1: Scope the Project and Gather Data

Scope

The project is provided by Udacity to showcase what the student has learned throughout the program. Four datasets, described below, are used to complete the project.

The project builds a data lake using PySpark to support the analytics department of the US immigration department, allowing it to query information extracted from all of the sources. The conceptual data model is a transactional, factless fact table in a star schema with dimension tables. Examples of the information that can be queried from the data model include the number of visitors by nationality, visitors' main country of residence, their demographics, and flight information.

Python is the main language used to complete the project, and the libraries used to perform the ETL are Pandas, PyArrow and PySpark. The environment used is the workspace provided by Udacity. The immigration data was transformed from SAS format to parquet format using PySpark. These parquet files were ingested using PyArrow and explored using Pandas to gain an understanding of the data before building a conceptual data model. PySpark was then used to build the ETL pipeline: the data sources were cleaned, transformed to create new features, and the resulting data tables were saved as parquet files. The two notebooks with all the code and output are as follows:

1. exploringUsingPandas.ipynb

2. exploringUsingPyspark.ipynb
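As an illustration of this exploration workflow, a minimal sketch of reading the parquet output back with PyArrow and inspecting it with Pandas might look like the following (it assumes the sas_data parquet folder written in the Gather Data step below already exists):

import pyarrow.parquet as pq

# read the parquet folder produced by the Spark job into an Arrow table
table = pq.read_table("sas_data")

# convert to a Pandas DataFrame for exploration
df = table.to_pandas()

# quick look at the shape, the column types and the most sparsely populated columns
print(df.shape)
print(df.dtypes)
print(df.isnull().sum().sort_values(ascending=False).head(10))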

Describe and Gather Data

Immigration Data

“Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” (https://en.wikipedia.org/wiki/Form_I-94). It lists the traveler's immigration category, port of entry, date of entry into the United States, and status expiration date, and has a unique 11-digit identifying number assigned to it. Its purpose is to record the traveler's lawful admission to the United States (https://i94.cbp.dhs.gov/I94/).

This is the main dataset. There is a file for each month of 2016 available in the directory ../../data/18-83510-I94-Data-2016/, stored in the SAS binary database storage format sas7bdat. This project uses the parquet files available in the workspace in the folder called sas_data. The data is for the month of April 2016, which has more than three million records (3,096,313). The fact table is derived from this table.

World Temperature Data

The data is from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example, by country). They publish the source data and the code for the transformations they applied. The original dataset from Kaggle includes several files (https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data), but for this project only GlobalLandTemperaturesByCity was analyzed. The dataset covers a long period of the world's temperatures (from 1743 to 2013). However, since the immigration dataset only covers 2016, the vast majority of the temperature data is not directly usable.
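A minimal sketch of loading and trimming the temperature data with PySpark, assuming the Kaggle file is available in the workspace as GlobalLandTemperaturesByCity.csv (the path and the filter on United States records are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# load the city-level temperature data (header row, inferred column types)
df_temp = spark.read.csv("GlobalLandTemperaturesByCity.csv", header=True, inferSchema=True)

# keep only United States records and drop rows without a temperature reading
df_temp_us = (df_temp
              .filter(F.col("Country") == "United States")
              .filter(F.col("AverageTemperature").isNotNull()))

df_temp_us.show(5)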

Airports Data

“Airport data includes the IATA airport code. An IATA airport code, also known as an IATA location identifier, IATA station code or simply a location identifier, is a three-letter geocode designating many airports and metropolitan areas around the world, defined by the International Air Transport Association (IATA). IATA codes are used in passenger reservation, ticketing and baggage-handling systems” (https://en.wikipedia.org/wiki/IATA_airport_code). It was downloaded from a public domain source (http://ourairports.com/data/).

U.S. City Demographic Data

This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The data comes from the US Census Bureau's 2015 American Community Survey. This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. The US city demographics data is the source of the STATE dimension in the data model and is grouped by state.
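A minimal sketch of building that state-level grouping, assuming the demographics file is the semicolon-delimited us-cities-demographics.csv provided in the workspace (the file name, delimiter and column names are assumptions based on the published dataset):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# the source file is semicolon-delimited
df_demo = spark.read.csv("us-cities-demographics.csv", sep=";", header=True, inferSchema=True)

# roll the city-level rows up to the state grain used by the STATE dimension
df_state = (df_demo
            .groupBy("State Code", "State")
            .agg(F.sum("Total Population").alias("total_population"),
                 F.sum("Foreign-born").alias("foreign_born")))

df_state.show(5)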

Gather Data

from pyspark.sql import SparkSession

# build a Spark session with the spark-sas7bdat package so the SAS binary files can be read
spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()
# load one month of I-94 immigration data
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
df_spark.show(2)
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|       admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|  6.0|2016.0|   4.0| 692.0| 692.0|    XXX|20573.0|   null|   null|   null|  37.0|    2.0|  1.0|    null|    null| null|      T|   null|      U|   null| 1979.0|10282016|  null|  null|   null|1.897628485E9| null|      B2|
|  7.0|2016.0|   4.0| 254.0| 276.0|    ATL|20551.0|    1.0|     AL|   null|  25.0|    3.0|  1.0|20130811|     SEO| null|      G|   null|      Y|   null| 1991.0|     D/S|     M|  null|   null| 3.73679633E9|00296|      F1|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
only showing top 2 rows
# write the immigration data to parquet and read it back
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

Step 2: Explore and Assess the Data

Explore the Data

exploringUsingPandas.ipynb contains the workings used to assess and explore the data.

The main findings, and the cleaning steps they motivated, are documented in that notebook.
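As a generic illustration of that assessment (not the exact steps from the notebook), missing values and duplicates in the immigration data can be profiled with PySpark along these lines:

from pyspark.sql import functions as F

# count the nulls in each column to spot sparsely populated fields
null_counts = df_spark.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_spark.columns])
null_counts.show()

# drop exact duplicate rows and rows missing the record identifier
df_clean = df_spark.dropDuplicates().dropna(subset=["cicid"])
print(df_clean.count())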

Step 3: Define the Data Model

3.1 Conceptual Data Model

The conceptual data model is mapped out below, together with the reasoning for choosing it.

For this project, a star schema is used. Star schemas are dimensional structures typically deployed in a relational database management system; they characteristically consist of fact tables linked to associated dimension tables via primary/foreign key relationships.
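As an example of how such a schema is queried, a question like “number of visitors by nationality” reduces to a join between the fact table and a dimension table on the foreign key. The toy table and column names below are illustrative, not the exact names from the data model:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-ins for the fact table and a country dimension (illustrative names)
fact = spark.createDataFrame([(1, 101), (2, 101), (3, 102)], ["visit_id", "country_key"])
dim_country = spark.createDataFrame([(101, "France"), (102, "Brazil")],
                                    ["country_key", "country_name"])

# join the fact to the dimension on the foreign key and aggregate
(fact.join(dim_country, "country_key")
     .groupBy("country_name")
     .count()
     .orderBy("count", ascending=False)
     .show())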

3.2 Mapping Out Data Pipelines

The project involved four key decisions during the design of a dimensional model:

  1. Select the business process.

The business process for the immigration department is admitting valid visitors into the country. This process generates events and captures performance metrics that translate into facts in a fact table.

  2. Declare the grain.

The grain establishes exactly what a single fact table row represents. In this project, a record is created as the event of a visitor entering the USA occurs. The grain is declared before choosing the facts and dimensions and becomes a binding contract on the design. This ensures uniformity across all dimensional designs and is critical to BI application performance and ease of use.

  3. Identify the dimensions.

Dimension tables provide the “who, what, where, when, why, and how” context surrounding a business process event. They contain the descriptive attributes used by BI applications for filtering and grouping the facts. In this project, each dimension is single valued when associated with a given fact row. Every dimension table has a single primary key column; this primary key is embedded as a foreign key in the associated fact table, where the dimension row's descriptive context is exactly correct for that fact table row.

Dimension tables are wide, flat denormalized tables with many low-cardinality text attributes. Each is designed with one column serving as a unique primary key. This primary key is not the operational system's natural key, because there will be multiple dimension rows for that natural key when changes are tracked over time. These surrogate keys are simple integers, assigned in sequence. The tables also denormalize the many-to-one fixed-depth hierarchies into separate attributes on a flattened dimension row. Dimension denormalization supports dimensional modeling's twin objectives of simplicity and speed.
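A minimal sketch of assigning such sequential surrogate keys to a dimension built from the immigration data (the dimension and key names are illustrative; a single global window is simple but pushes all rows through one partition):

from pyspark.sql import Window, functions as F

# one row per port of entry, with a sequential integer surrogate key
w = Window.orderBy("i94port")
dim_port = (df_spark
            .select("i94port")
            .dropDuplicates()
            .withColumn("port_key", F.row_number().over(w)))

dim_port.show(5)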

  4. Identify the facts.

The fact table focuses on the results of a single business process. A single fact table row has a one-to-one relationship to a measurement event as described by the fact table's grain. Thus a fact table design is entirely based on a physical activity and is not influenced by the demands of a particular report. Within a fact table, only facts consistent with the declared grain are allowed. In this project, the information about the visitor is the fact. The fact table is transactional, with each row corresponding to a measurement event at a point in space and time. It is also a factless fact table, as each event merely records a set of dimensional entities coming together at a moment in time.

Factless fact tables can also be used to analyze what didn't happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen, and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen. Each row corresponds to an event. The fact table contains foreign keys for each of its associated dimensions, as well as date stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.

(http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf)
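The “subtraction” described above maps naturally onto a left anti join in Spark. A minimal sketch with toy coverage and activity tables (the real tables would come from the data model):

# coverage: all port/date combinations that could have seen an arrival
coverage = spark.createDataFrame(
    [("ATL", "2016-04-01"), ("ATL", "2016-04-02"), ("XXX", "2016-04-01")],
    ["i94port", "arrival_date"])

# activity: the port/date combinations that actually did
activity = spark.createDataFrame([("ATL", "2016-04-01")], ["i94port", "arrival_date"])

# events that did not happen = coverage minus activity
did_not_happen = coverage.join(activity, on=["i94port", "arrival_date"], how="left_anti")
did_not_happen.show()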

Step 4: Run Pipelines to Model the Data

4.1 Create the data model

The data pipelines are built to create the data model described above.

4.2 Data Quality Checks

Data quality checks are performed to ensure the pipeline ran as expected.

exploringUsingPyspark.ipynb contains the workings for tasks 4.1 and 4.2.
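A generic sketch of the kind of checks implemented there (the exact checks live in the notebook; df_spark and cicid are used here only as convenient examples):

def check_not_empty(df, name):
    # the pipeline should never produce an empty table
    count = df.count()
    if count == 0:
        raise ValueError(f"Data quality check failed: {name} is empty")
    print(f"{name}: {count} rows")

def check_unique_key(df, key, name):
    # a primary key column must be unique within its table
    if df.count() != df.select(key).distinct().count():
        raise ValueError(f"Data quality check failed: {key} is not unique in {name}")
    print(f"{name}: {key} is unique")

check_not_empty(df_spark, "immigration")
check_unique_key(df_spark, "cicid", "immigration")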

4.3 Data dictionary

dataDictionary.md contains the data dictionary describing the data model.

Step 5: Complete Project Write Up