Udacity Data Engineering Capstone Project

Project Summary. The project follows these steps: Step 1: Scope the Project and Gather Data; Step 2: Explore and Assess the Data; Step 3: Define the Data Model; Step 4: Run ETL to Model the Data; Step 5: Complete Project Write Up. Step 1: Scope the Project and Gather Data. Scope: The project is […]

Ethical, Privacy and Data Protection Issues

Data is often reduced to what can fit into a mathematical model. Yet, taken out of context, data may lose its meaning. Ethics, privacy, and data protection issues are often treated as an afterthought or a regulatory hurdle to be cleared. Ethical issues include non-objective analysis, incomplete reporting, misleading reporting, and lack of consideration. Moral agency is the […]

A Comparison between Cassandra and MySQL

Introduction. Cassandra is a distributed, continuously available, and scalable NoSQL database with no single point of failure. It manages large amounts of data across many data centres and cloud servers, and it offers both operational simplicity and the capacity to scale linearly. MySQL, by contrast, is the world’s most popular, cost-effective, high-performance relational database (Kumar, 2016). It comes with a […]

Web Server Log Analysis with Spark

This lab will demonstrate how easy it is to perform web server log analysis with Apache Spark. Server log analysis is an ideal use case for Spark: server logs are a very large, common data source and contain a rich set of information. Spark allows you to store your logs in […]

Building a word count application in Spark

These are my solutions for Apache Spark. This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. […]

Spark Tutorial: Learning Apache Spark

Spark Tutorial: Learning Apache Spark includes my solution for the edX course. This tutorial will teach you how to use Apache Spark, a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer. However, many datasets today are too large to be stored on a […]