AWS GLUE

AWS Glue - batch-oriented ETL jobs, minimum 5-minute scheduling intervals, no support for NoSQL stores; it is not suitable for heterogeneous processing (use AWS Data Pipeline for that). Capacity is allocated in configurable DPUs (Data Processing Units).

AWS Glue is a fully managed (serverless), pay-as-you-go ETL service built on a scale-out Apache Spark environment. It discovers and profiles data via the Glue Data Catalog, generates ETL code to transform the data into the target schema, runs the job to load data into the destination, and lets you configure, orchestrate and monitor complex data flows.
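
A minimal sketch of what such a generated ETL job script looks like (PySpark with the awsglue library); the database, table and S3 path names below are placeholders, not resources from any real catalog:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that a Glue crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",          # placeholder catalog database
    table_name="raw_orders"       # placeholder catalog table
)

# Map the source columns onto the target schema.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the transformed data to the destination as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/orders/"},  # placeholder bucket
    format="parquet",
)

job.commit()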

The AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR. In terms of managing metadata, whatever you can do with a Hive database you can also do with the AWS Glue Data Catalog.
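
For example, the metadata operations you would normally express as Hive DDL can be issued against the Glue Data Catalog through its API; a rough boto3 sketch, with placeholder database, table and S3 location names:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Equivalent of CREATE DATABASE in Hive.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# Equivalent of CREATE EXTERNAL TABLE in Hive.
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "raw_orders",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://my-source-bucket/orders/",  # placeholder location
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
            },
        },
    },
)

# Equivalent of SHOW TABLES in Hive.
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"])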

AWS Glue = Data Catalog + Flexible Scheduler
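
The scheduler half can also be driven through the API; a minimal boto3 sketch of a cron-style trigger that starts a (hypothetical) Glue job nightly:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="nightly-orders-etl",                # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",             # every day at 02:00 UTC
    Actions=[{"JobName": "orders-etl-job"}],  # placeholder Glue job name
    StartOnCreation=True,
)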

For supported data sources, see the AWS Glue FAQ.

AWS Glue can also be used for complex ETL of streaming data. If the focus is on delivery of streaming data, use Kinesis Data Firehose; if the focus is real-time analytics and more general stream processing, use Kinesis Data Analytics; for ETL-focused scenarios, use Glue.

AWS Data Pipeline launches resources in your account, giving you direct access to the EC2 instances or EMR cluster, and supports a heterogeneous set of jobs that run on a variety of engines such as Hive, Pig, etc.

AWS Batch - for full control of, and visibility into, the compute resources running your batch jobs.
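
A minimal boto3 sketch of submitting a Batch job; the job name, queue and job definition are placeholders assumed to already exist in the account:

import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="monthly-risk-computation",    # placeholder job name
    jobQueue="high-memory-queue",          # placeholder job queue
    jobDefinition="risk-model:3",          # placeholder job definition:revision
    containerOverrides={
        "command": ["python", "compute.py", "--month", "2023-01"],
    },
)
print(response["jobId"])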

What is the difference between AWS Batch and AWS Data Pipeline? In AWS Batch, the focus is on complex computation over a large amount of data. In Data Pipeline, the focus is on data movement and one or more steps of data transformation (a data-driven workflow).

AWS DMS - for on-premises to AWS database migration/replication.
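
A rough boto3 sketch of creating a DMS replication task for an initial load plus ongoing replication; all ARNs and identifiers are placeholders, and the source/target endpoints and replication instance are assumed to have been created beforehand:

import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Select every table in the (placeholder) "sales" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="onprem-to-aws",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",  # initial copy plus ongoing change replication
    TableMappings=json.dumps(table_mappings),
)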
