Since the early days of Hadoop MapReduce as an ETL (extract, transform, and load) framework, the world has moved on rapidly on the technology front. Spark is now the de facto standard for performing ETL on big data. Here are five key differences between Apache Spark and Hadoop MapReduce:
- Apache Spark is potentially 100 times faster than Hadoop MapReduce.
- Apache Spark utilizes RAM and isn’t tied to Hadoop’s two-stage paradigm (see the sketch after this list).
- Apache Spark works well for smaller data sets that can all fit into a server's RAM.
- Hadoop is more cost-effective for processing massive data sets.
- Apache Spark is now more popular than Hadoop MapReduce.
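Spark's in-memory advantage is easy to see in code. Below is a minimal, hypothetical PySpark sketch (the S3 path and column names are mine, for illustration only): the parsed data is cached in executor RAM and reused across two actions, where MapReduce would write to and re-read from disk between stages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
# Hypothetical input path and columns, for illustration only
events = spark.read.csv("s3://my-bucket/events.csv", header=True)
events.cache()                                   # keep the parsed data in RAM
print(events.count())                            # first action materializes the cache
print(events.filter(events["status"] == "error").count())  # second action is served from RAM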
With the rise of cloud computing adoption by customers across all verticals, Spark-based, cloud-native ETL products have surfaced to meet market demand.
Public Cloud Big Data ETL
Public cloud providers like Azure, AWS, and Google offer cloud ETL products. ETL is mostly either code-driven (SQL, Hadoop MapReduce, Pig, Spark, Hive, etc.) or visual, like Microsoft SSIS, IBM DataStage, Informatica PowerCenter, and Talend. For cloud-scale ETL, what are the native visual ETL products in the market? I looked at the Azure, AWS, and Google offerings. Azure Data Factory [1] and AWS Glue [2] are visual ETL products from two competing cloud service providers. Both are PaaS products focused on ETL/ELT, both are serverless offerings, and both use Spark as the underlying tech stack. Because Google products like Data Fusion/Dataflow/Dataproc [3] lack a comparable visual focus, this blog will concentrate on the visual development aspects of the AWS and Azure offerings. The purpose of the blog is to show the subtle differences between AWS Glue and Azure Data Factory when it comes to complex transformations, keeping visual, code-free usage in mind.
AWS Glue vs Azure Data Factory(ADF)
Azure Data Factory is a cloud-based data integration service for creating ETL and ELT pipelines [1]. It allows users to create data processing workflows in the cloud, either through a graphical interface or by writing JSON structures, for orchestrating and automating data movement and data transformation. Amazon Web Services (AWS) has a host of tools for working with data in the cloud. Glue focuses on ETL [2]. It's one of two AWS tools for moving data from sources to analytics destinations; the other is AWS Data Pipeline, which is more focused on data transfer.
As noted above, I want to compare Glue and ADF on the most basic need of data engineers: transformation. To test it out, I set up a Glue catalog over an S3 source. To achieve that, I had to complete the IAM setup for proper access and authorization. Before jumping into the AWS Glue tutorial, I read through the documentation to set up the required IAM roles for AWS Glue; a sketch of that setup follows. With ADF, you do not have to deal with any of that, because the permissions are addressed much earlier.
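For context, here is a hedged boto3 sketch of roughly what that IAM setup involves. The role name is hypothetical; the trust policy and the AWSGlueServiceRole managed policy are the standard pieces the documentation walks you through.
import json
import boto3

iam = boto3.client("iam")

# Allow the Glue service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="sks-glue-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# Attach AWS's managed policy for Glue; S3 access policies would be attached similarly
iam.attach_role_policy(
    RoleName="sks-glue-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)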
Even for a simple use case with minimal transformation, there were many more clicks and much more page navigation in AWS Glue. When you run an AWS Glue job for the first time, Glue builds the underlying serverless environment, and that can take a significant amount of time. Compare that to ADF: if your ADF pipeline does not need any complex transformations, the underlying Azure Databricks cluster is not provisioned and you can run your pipeline with a few clicks. Azure Data Factory supports both pre- and post-load transformations. Users apply transformations either by mapping them in a GUI or in code using Power Query Online. Azure Data Factory supports a wide range of transformation functions [5]. AWS Glue provides 16+ built-in preload transformations that let ETL jobs modify data to match the target schema. Glue generates Python code for ETL jobs that developers can modify to create more complex transformations, or they can use code written outside of Glue.
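As a taste of those built-in transforms, here is a short sketch using ApplyMapping from the Glue PySpark library to rename and cast columns toward a target schema. The frame variable and the column names/types are illustrative, loosely matching the movie data used later in this blog.
from awsglue.transforms import ApplyMapping

# dynamic_frame_read would be a DynamicFrame, like the one read from the
# catalog in the job script below; the mappings here are illustrative
mapped_frame = ApplyMapping.apply(
    frame=dynamic_frame_read,
    mappings=[
        ("movie_title", "string", "title", "string"),
        ("year", "string", "year", "int"),
        ("rating", "string", "rating", "double"),
    ],
)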
Here is the job script that corresponded to my ETL task [6]. With Glue Studio, the source and destination sections can be authored visually, but for the transformation you still have to rely on Python/Scala code. Look at the transformation section of my job code below; later we will see how the same thing is done in Azure Data Factory.
.......
#Read movie data into a Glue DynamicFrame
#(f refers to pyspark.sql.functions, imported in the elided setup above)
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert the dynamic frame to a data frame to use standard PySpark functions
data_frame = dynamic_frame_read.toDF()
#########################################
### TRANSFORM (MODIFY DATA)
#########################################
#Create a decade column from year
decade_col = f.floor(data_frame["year"]/10)*10
data_frame = data_frame.withColumn("decade", decade_col)
#Group by decade: count movies, get average rating
data_frame_aggregated = data_frame.groupby("decade").agg(
    f.count(f.col("movie_title")).alias('movie_count'),
    f.mean(f.col("rating")).alias('rating_mean'),
)
#Sort by the number of movies per decade
data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("movie_count"))
#Print the result table
#Note: show() is an action. Actions force the execution of the data frame plan.
#With big data the slowdown would be significant without caching.
data_frame_aggregated.show(10)
#########################################
### LOAD (WRITE DATA)
#########################################
#Create just 1 partition, because there is so little data
data_frame_aggregated = data_frame_aggregated.repartition(1)
#Convert back to a dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")
#Write data back to S3
glue_context.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3",
    connection_options = {
        "path": s3_write_path,
        #Here you could create S3 prefixes according to values in specified columns
        #"partitionKeys": ["decade"]
    },
    format = "csv"
)
............
Setup and Evaluation
Learning the Glue console is one thing, but the actual logic lies in the Spark scripts, and tuning the code significantly impacts execution performance. With AWS Glue, developers will need a lot more scripting in Python or Scala and will need to know Spark [2]. Azure Data Factory not only automates backend cluster management, but also provides citizen developers with a more intuitive, visual, drag-and-drop, low-code alternative. Glue has the concept of a crawler. A crawler sniffs metadata from the data source, such as file format, column names, column data types, and row count. The metadata makes it easy for others to find the needed datasets, and the Glue catalog enables easy access to the data sources from the data transformation scripts. The crawler catalogs all files in the specified S3 bucket and prefix; all the files should have the same schema. I will share some pictures of my own Glue run later in the blog.
In Glue crawler terminology, the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON, and Parquet. It is possible to create custom classifiers where the schema is defined in grok patterns, which are close relatives of regular expressions. My sample file is in CSV format (moviedata.csv) and was recognized automatically by my Glue crawler, called skscrawler. I had to grant the proper permissions for my VPC and create an endpoint for the crawler process to connect to my source, which makes crawler setup complex.
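To make the moving parts concrete, here is a hedged boto3 sketch of the classifier-plus-crawler setup. The bucket path, role name, and grok pattern are illustrative; the crawler and database names match my run.
import boto3

glue = boto3.client("glue")

# A custom grok classifier is only needed when the built-in CSV/JSON/Parquet
# classifiers do not recognize the format; this pattern is illustrative
glue.create_classifier(
    GrokClassifier={
        "Name": "movie-classifier",
        "Classification": "moviedata",
        "GrokPattern": "%{DATA:movie_title},%{INT:year},%{NUMBER:rating}",
    }
)

# Crawl the S3 prefix and write the discovered metadata into the mygluedb catalog
glue.create_crawler(
    Name="skscrawler",
    Role="sks-glue-role",                                         # hypothetical IAM role
    DatabaseName="mygluedb",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/movies/"}]},  # hypothetical path
    Classifiers=["movie-classifier"],
)
glue.start_crawler(Name="skscrawler")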
With AWS Glue, you are limited to using other AWS services as data sources, though one could argue that you could use JDBC to connect to other databases, as sketched below. In contrast, ADF can connect to many more data sources, in structured and unstructured formats, including SaaS platforms, web services, AWS services, and more.
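For completeness, the JDBC workaround looks roughly like this (connection details are placeholders, and for a database outside AWS you would also need the VPC networking in place):
# Read from an external PostgreSQL database into a Glue DynamicFrame over JDBC
jdbc_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://db.example.com:5432/movies",  # placeholder
        "dbtable": "public.ratings",
        "user": "etl_user",
        "password": "********",
    },
)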
The proof of concept used Azure Blob Storage as the data source (source1) and transformed the data using ADF's native drag-and-drop activities (Derived Column and Aggregate), which contained expression functions to mimic the Python functions in the script above.
The transformed data was sent to another Azure Blob folder (sink1) that contained only the three computed columns: decade, movie_count, and average_rating. Getting started with ADF is straightforward [1][4]. From the Azure Portal page, you can start a trial for free. The documentation and tutorials are very helpful, and there are many pipeline templates for reference.
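For reference, the expressions carried by those two activities were roughly as follows, written in the ADF Data Flow expression language (column names match the movie data; the exact expressions in my flow may differ slightly):
Derived Column (new column "decade"):
    floor(toInteger(year) / 10) * 10

Aggregate (grouped by "decade"):
    movie_count    = count(movie_title)
    average_rating = avg(rating)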
Let us see what we found in AWS Glue. The picture below shows the result of my crawler run to extract metadata for the moviedata.csv file. To run it, I had to set up permissions via IAM roles and access policies, an S3 connection, and a classifier. The table created in the mygluedb catalog is moviedata_csv.
The successful AWS Glue job and script output are shown below. If you click the Script tab, you will find the Python script used to transform the data source.
Getting started with Glue jobs can take some time with all the menus and options. The focus of this blog was a single transformation script, but Glue also provides tools to manage larger groups of jobs. You can schedule jobs with triggers, or orchestrate relationships between triggers, jobs, and crawlers with workflows, through a series of clicks and setups, while ADF offers all of that within its main canvas.
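For example, a scheduled Glue trigger is wired up roughly like this boto3 sketch (the trigger and job names are hypothetical):
import boto3

glue = boto3.client("glue")
glue.create_trigger(
    Name="nightly-movie-etl",                        # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",                    # run at 02:00 UTC daily
    Actions=[{"JobName": "movie-aggregation-job"}],  # hypothetical job name
    StartOnCreation=True,
)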
In the Glue script example above, we first read the data into Glue's DynamicFrame and then converted it to a native PySpark DataFrame. This method makes it possible to take advantage of the Glue catalog while still using native PySpark functions. However, Glue's performance appears to be extremely poor when converting from DynamicFrame to DataFrame, especially when you have one large file instead of multiple smaller ones. If execution time and data reading become the bottleneck, you may want to use the native PySpark read function to fetch the data from S3, as sketched below. With ADF, you do not have to worry about which data frames are used and when; you can control partitioning, concurrency, etc. if you want, but ADF works with default settings and manages the life cycle of the Spark cluster for you, similar to on-demand cloud computing. It follows a pay-as-you-go model.
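A native read, bypassing the catalog and the DynamicFrame conversion, would look roughly like this (the S3 path is a placeholder):
# Read the CSV straight into a PySpark DataFrame using the Glue job's Spark session
spark = glue_context.spark_session
data_frame = spark.read.csv("s3://my-bucket/movies/moviedata.csv",  # placeholder path
                            header=True, inferSchema=True)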
Now, let us consider an SCD (slowly changing dimension) Type 2 transformation, which is routine in the ETL space. If we go the PySpark route, the core script to do a full merge, delete, and insert while keeping history (i.e., start and end dates for dimensional records) might look like the example [7] below:
.....
# lit, when and date_sub come from pyspark.sql.functions (imported in the elided preamble)
df_source_new = df_source.withColumn('src_start_date', lit(current_date)) \
    .withColumn('src_end_date', lit(high_date))
# FULL merge: join on the key column and also on the high date column so we
# only join to the latest records
df_merge = df_target.join(
    df_source_new,
    (df_source_new.src_id == df_target.id) &
    (df_source_new.src_end_date == df_target.end_date),
    how='fullouter')
# Derive a new column to indicate the action
df_merge = df_merge.withColumn(
    'action',
    when(df_merge.attr != df_merge.src_attr, 'UPSERT')
    .when(df_merge.src_id.isNull() & df_merge.is_current, 'DELETE')
    .when(df_merge.id.isNull(), 'INSERT')
    .otherwise('NOACTION'))
# Generate the new data frames based on the action code
column_names = ['id', 'attr', 'is_current',
                'is_deleted', 'start_date', 'end_date']
# Records that need no action
df_merge_p1 = df_merge.filter(df_merge.action == 'NOACTION').select(column_names)
# Records that need insert only
df_merge_p2 = df_merge.filter(df_merge.action == 'INSERT').select(
    df_merge.src_id.alias('id'),
    df_merge.src_attr.alias('attr'),
    lit(True).alias('is_current'),
    lit(False).alias('is_deleted'),
    df_merge.src_start_date.alias('start_date'),
    df_merge.src_end_date.alias('end_date'))
# Records that need to be deleted
df_merge_p3 = df_merge.filter(df_merge.action == 'DELETE').select(column_names) \
    .withColumn('is_current', lit(False)) \
    .withColumn('is_deleted', lit(True))
# Records that need to be expired and then re-inserted (UPSERT): first the new
# current row...
df_merge_p4_1 = df_merge.filter(df_merge.action == 'UPSERT').select(
    df_merge.src_id.alias('id'),
    df_merge.src_attr.alias('attr'),
    lit(True).alias('is_current'),
    lit(False).alias('is_deleted'),
    df_merge.src_start_date.alias('start_date'),
    df_merge.src_end_date.alias('end_date'))
# ...then the expired historical row, closed out the day before the new row starts
df_merge_p4_2 = df_merge.filter(df_merge.action == 'UPSERT') \
    .withColumn('end_date', date_sub(df_merge.src_start_date, 1)) \
    .withColumn('is_current', lit(False)) \
    .withColumn('is_deleted', lit(False)) \
    .select(column_names)
....
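The snippet above is truncated; in the full example [7], the action-specific frames are then stitched back into the final dimension table, roughly as follows:
# Union the partial frames back into the complete, history-preserving dimension
df_merge_final = df_merge_p1.unionAll(df_merge_p2) \
    .unionAll(df_merge_p3) \
    .unionAll(df_merge_p4_1) \
    .unionAll(df_merge_p4_2)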
You can achieve similar results using an ADF Mapping Data Flow with a few clicks, tapping its Lookup, Filter, Derived Column, Join, Select, Sink, and other transformations [5]. A complete example, written by Mark Kromer, is available at mssqldude.wordpress.com/2019/04/15/adf-slowly-changing-dimension-type-2-with-mapping-data-flows-complete/ and the JSON code for the pipeline is at github.com/kromerm/marksadfrepo/blob/gaversion/dataflow/DimEmployeeLoader2.json
Conclusion
While AWS Glue and its tools can serve the AWS ecosystem well, its catalog approach and the coding required for complex transformations make Azure Data Factory a painless alternative. In Azure Data Factory, you define your data sets using Linked Services, which are connection strings for all practical purposes. Then you drag source, transformation, and sink activities onto the canvas and build a pipeline quickly, without messy PySpark/Scala code. On the CI/CD front, Glue needs AWS CodePipeline to make the code Git-ready, whereas CI/CD is built into the ADF product.
Most of all, Azure Data Factory has a rich set of 100 connectors, and you do not need a Glue catalog to run your ETL job. ADF is an all-in-one way to author, manage, and orchestrate your big data ETL jobs at cloud scale, with simple, Excel-macro-like knowledge required for expressions. It empowers data scientists to become data engineers, reducing DevOps cycles and generating enormous value for any data science/engineering project.
References:
- [1] ADF Introduction - docs.microsoft.com/en-us/azure/data-factory/introduction
- [2] AWS Glue - aws.amazon.com/glue
- [3] Google Cloud ETL - cloud.google.com/learn/what-is-etl
- [4] Mapping Data Flow Intro - azure.microsoft.com/en-us/blog/azure-data-factory-mapping-data-flows-are-now-generally-available/
- [5] Mapping Data Flow Transformations - docs.microsoft.com/en-us/azure/data-factory/data-flow-transformation-overview
- [6] AWS Glue Job Example - data.solita.fi/aws-glue-tutorial-with-spark-and-python-for-data-developers/
- [7] PySpark SCD2 Example - kontext.tech/column/spark/286/implement-scd-type-2-full-merge-via-spark-data-frames