DATA ENGINEERING: ETL

NikhilNimje
4 min readNov 16, 2022

--

A beginner’s Project

INTRODUCTION:

This project allowed me to gain experience and knowledge in data engineering. In this project, data was collected and prepared for analysis for a scooter company.

Data engineering involves three steps:

1. Extract

2. Transform

3. Load

For this project, I worked for an e-scooter company and prepared and collected the data for the 5 cities’ populations, geographic coordinates, weather information and flight details etc. Mainly, 3 tables were formed,

  1. Cities Table: Includes Population, City name, Country name etc.
  2. Weather Table: Includes all the vital weather information
  3. Flights Table: Includes all the crucial information about the arrival flights in these cities.

Following is a detailed overview of how I achieved the objective involving these 3 steps.

  1. EXTRACTION:

For the extraction part, Web Scraping and APIs were used to extract publicly available data.

1.1 Web Scraping:

  • In this project, I collected data about the Cities, their population, geographic coordinates, etc. The best website for this was WIKIPEDIA.
  • I inspected the site on the browser, which gave the HTML code behind all the information displayed on the website.
  • To retrieve this data in python, the beautifulsoup library was used.
  • The following YouTube resource was beneficial for me in getting an insight into how to scrape data with Python and Beautiful Soup.

https://www.youtube.com/watch?v=XQgXKtPSzUI

1.2 Data Collection from APIs:

Data can also be accessed via publicly available APIs. For this project, I collected weather and flight information for the above cities.

  • For extraction of the weather data and flight data, the request library and its JSON response were taken into account.

Example of a JSON Response:

This completes the step of extraction.

2. TRANSFORM:

The transform step is to convert the extracted raw data into meaningful, structured data.

This process involves selecting a portion of the data from the data extracted above and structuring and formatting this data to make it ready for analysis. It also involves cleaning the data.

  • For web scraping, I worked with HTML code, and the data was explored and then transformed using the Beautifulsoup library.
  • Weather data and flight data were transformed using the request library and its JSON response.
  • Here, we can look at the weather’s JSON response, which contains all weather information for 5 days, and it also gives the weather information every 3 hours.
  • From this JSON response, I only took out the data, which is essential for the data analysis, such as Rain, Temperature, etc.
  • To achieve this, I explored and navigated through the data and selected only the data that I wanted to take over.

The Final transformed file looks something like this.

This gives us the following output:

This finishes step 2 i.e. Transformation.

3. LOAD:

Now, the last step is loading the data.

  • To test/ set up a MySQL database locally, the database was created with all the tables and all tables had all the columns and most importantly, relationships between these tables.

Then the connection of Python with MySQL was performed using SQLAlchemy and pandas. DataFrame.to_sql() method.

  • Data loading was tested on local MySQL, and when it looked fine, I moved on to the cloud.
  • To load data into the MySQL database hosted by AWS, the following steps were performed:

1. Sign up for an AWS Account

2. Set up an Amazon RDS MySQL Instance

3. Connect to your Amazon RDS MySQL Instance

  • For moving all the scripts, I used serverless computing, AWS Lambda.

To perform these actions, a lambda function was created, and then the Lambda function was connected to the RDS instance.

This how I completed Step 3.

In this way, I completed my first-ever Data Engineering project In short, which included 3 basic steps: extract, transform and load. Finally, I would like to thank everyone who helped me finish this project. Thank you for reading.

--

--

NikhilNimje
NikhilNimje

Written by NikhilNimje

I am a DATA SCIENCE, MACHINE LEARNING, and ARTIFICIAL INTELLIGENCE enthusiast. My interests include Statistics, Machine Learning, and Programming.

No responses yet