Extract, Transform, Load Process
The Extract, Transform, Load (ETL) process is diagrammed in the two images on this page. The screenshots display the flow, or data pipeline, of a typical project. Large businesses often adopt ETL software to assist. Some ETL tools include:



Databricks
Alteryx
Matillion
Pipelines can vary in size, complexity, style, and structure, but they all share the same goal: landing cleansed data somewhere, usually a data warehouse. A data warehouse is used for storing, organizing, and analyzing large volumes of data from multiple sources. It is specifically designed to support business intelligence (BI) activities like reporting, data analysis, and decision-making. Some of the more well-known data warehouses are:



Snowflake
Google BigQuery
Amazon Redshift
The topmost image takes data from offline Excel files, cleanses it, then writes the data into a data warehouse. The blue sections indicate the extract piece of the process, which in this case is a connection to an Excel file. The pink indicates the transform step, where data is being altered in some way. The green indicates the load step, where data is being stored somewhere. There are many different transform steps; some of the most common are filters, joins, summarizations, and added columns.
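The extract-transform-load flow described above can be sketched in miniature. This is an illustrative example using only the Python standard library: a CSV string stands in for the Excel source, and an in-memory SQLite database stands in for the warehouse; all table and column names are made up for the sketch.

```python
import csv
import io
import sqlite3

# Extract: read rows from the source (a CSV string standing in for Excel).
source = io.StringIO("YEAR,SALES\n2024,100\n2025,150\n2025,200\n")
rows = list(csv.DictReader(source))

# Transform: keep only 2025 rows and cast the text fields to real types.
cleansed = [(int(r["YEAR"]), float(r["SALES"]))
            for r in rows if r["YEAR"] == "2025"]

# Load: write the cleansed data into the "warehouse".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleansed)
```

Real pipelines swap each stage for a connector (Excel reader, warehouse writer), but the three-stage shape stays the same.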

Filtering splits the data in two. As an example, if we only wanted data for the year 2025, we would code “YEAR” = ‘2025’. There are always two outputs: one dataset containing only 2025 and another containing everything but 2025.
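The two-output behavior of a filter can be sketched in plain Python. The sample rows and field names here are illustrative, not from the screenshots.

```python
# Sample rows standing in for the dataset being filtered.
rows = [
    {"YEAR": 2024, "SALES": 100},
    {"YEAR": 2025, "SALES": 150},
    {"YEAR": 2025, "SALES": 200},
]

# "True" output: rows where YEAR equals 2025.
matched = [r for r in rows if r["YEAR"] == 2025]
# "False" output: everything but 2025.
unmatched = [r for r in rows if r["YEAR"] != 2025]
```

Every input row lands in exactly one of the two outputs, which is what makes a filter a split rather than a drop.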

A join will split the data three ways. Rows that match the other dataset come through the middle output of the icon. The other two nodes carry the rows that exist in only one dataset but not the other.
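The three outputs of a join can be sketched with two small datasets keyed on a shared column. The "ID", "BRAND", and "PRICE" names below are hypothetical.

```python
left = [{"ID": 1, "BRAND": "Acme"}, {"ID": 2, "BRAND": "Zenith"}]
right = [{"ID": 2, "PRICE": 9.99}, {"ID": 3, "PRICE": 4.50}]

left_ids = {r["ID"] for r in left}
right_ids = {r["ID"] for r in right}

# Middle output: rows whose key exists in both datasets (the inner join).
joined = [{**l, **r} for l in left for r in right if l["ID"] == r["ID"]]

# Side outputs: rows unique to one dataset or the other.
left_only = [r for r in left if r["ID"] not in right_ids]
right_only = [r for r in right if r["ID"] not in left_ids]
```

Here ID 2 matches and flows through the middle; IDs 1 and 3 fall out the sides because each exists in only one dataset.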

Summarizing (AKA a group-by) aggregates data up from a finer granularity; ie the task is to analyze price by brand, but the data is by item. You can aggregate and take the average price by brand using a summarize step.
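The brand-level average described above can be sketched as a group-by in plain Python. The item rows are invented for the example.

```python
from collections import defaultdict

# Item-level data; the goal is average price by brand.
items = [
    {"BRAND": "Acme", "PRICE": 10.0},
    {"BRAND": "Acme", "PRICE": 14.0},
    {"BRAND": "Zenith", "PRICE": 8.0},
]

# Group prices by brand, then take the mean of each group.
prices_by_brand = defaultdict(list)
for row in items:
    prices_by_brand[row["BRAND"]].append(row["PRICE"])

avg_price_by_brand = {brand: sum(p) / len(p)
                      for brand, p in prices_by_brand.items()}
# avg_price_by_brand -> {"Acme": 12.0, "Zenith": 8.0}
```

Note the output has one row per brand rather than one per item: the granularity of the data changes, which is the defining trait of a summarize step.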

Adding columns is just what it sounds like: you create an entirely new column in the dataset. Here you can write any acceptable formula; ie making a “Price” column by dividing “Sales” by “Units”.
There are more than 100 transformation methods.
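The "Price" formula mentioned above can be sketched as a derived column. The sample values are illustrative.

```python
rows = [{"SALES": 100.0, "UNITS": 4}, {"SALES": 90.0, "UNITS": 3}]

# Add-column step: derive PRICE from the existing SALES and UNITS fields.
for row in rows:
    row["PRICE"] = row["SALES"] / row["UNITS"]
# rows[0]["PRICE"] -> 25.0
```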

Alteryx
The bottommost image takes data that is already in the data warehouse or database (in-DB), cleanses it, then writes the data back into that same data warehouse. The benefit is that the computing power to perform the ETL job comes from the data warehouse itself. The data warehouse normally runs on a server or some type of cloud-based infrastructure and will complete the same job far faster than your own personal computer.
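The in-DB idea can be sketched by pushing a transform down to the database engine as SQL instead of pulling rows into the client first. An in-memory SQLite database stands in for the warehouse here, and all table and column names are invented for the sketch.

```python
import sqlite3

# "Warehouse" with item-level data already loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (brand TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Acme", "A1", 10.0), ("Acme", "A2", 14.0), ("Zenith", "Z1", 8.0)],
)

# In-DB transform: the summarize runs as a GROUP BY inside the engine,
# and the result lands in a new table in the same database.
conn.execute(
    "CREATE TABLE avg_price AS "
    "SELECT brand, AVG(price) AS avg_price FROM sales GROUP BY brand"
)
result = dict(conn.execute("SELECT brand, avg_price FROM avg_price"))
# result -> {"Acme": 12.0, "Zenith": 8.0}
```

Because the rows never leave the database, the only thing the client sends and receives is SQL, which is why in-DB jobs scale with the warehouse's hardware rather than your laptop's.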