1/7/2024

Airflow EMR

Not surprisingly, this workflow begins in a very similar way to the previous one. However, this time we are using Amazon EMR, and if we look at the available Apache Airflow operators we can see that there is an Amazon EMR operator which will make our life easy. We can take a look at the documentation for this operator at the Apache Airflow website, under Amazon EMR Operators. As part of our workflow, we want to create an Amazon EMR cluster, add some steps to run some of the Presto and Apache Hive queries, and then terminate the cluster, so we need to add those operators (EmrCreateJobFlowOperator, EmrAddStepsOperator, EmrTerminateJobFlowOperator and EmrStepSensor) to our DAG:

from airflow import DAG, settings, secrets
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.utils.trigger_rule import TriggerRule

The next part of our workflow is the same, except this time we have added some more variables.
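To make the create/add-steps/terminate flow concrete, here is a minimal sketch of the kind of configuration those EMR operators consume: a job-flow definition for EmrCreateJobFlowOperator and step definitions for EmrAddStepsOperator. All bucket names, script paths, cluster sizing, and step names below are hypothetical placeholders, not values from the repository; the `hive-script` invocation via command-runner.jar is one common way to run a Hive script on EMR, shown here as an assumption rather than the post's exact approach.

```python
# Hypothetical EMR cluster and step configuration for a workflow like the one
# described above. Bucket names, paths, and sizing are illustrative placeholders.

JOB_FLOW_OVERRIDES = {
    "Name": "movielens-demo-cluster",
    "ReleaseLabel": "emr-5.32.0",
    "Applications": [{"Name": "Hive"}, {"Name": "Presto"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

def hive_script_step(name, script_s3_uri):
    """Build an EMR step that runs a Hive script already uploaded to S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", script_s3_uri],
        },
    }

STEPS = [
    hive_script_step("create-tables", "s3://my-demo-bucket/scripts/create_tables.hql"),
    hive_script_step("export-genre", "s3://my-demo-bucket/scripts/export_genre.hql"),
]
```

With real values filled in, a dictionary like JOB_FLOW_OVERRIDES would be passed to EmrCreateJobFlowOperator's job_flow_overrides parameter, and a list like STEPS to EmrAddStepsOperator's steps parameter, with EmrStepSensor watching each step and EmrTerminateJobFlowOperator running last.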
Make sure you recap the setup from Part One. In this post, Part Two, we will do the same thing but automate the example ELT workflow using Amazon EMR. All the code so you can reproduce this yourself can be found in the GitHub repository here.

To recap: we are using the Movielens dataset, we have loaded it into our data lake on Amazon S3, and we have been asked to a) create a new table with a subset of the information we care about, in this instance a particular genre of films, and b) create a new file with the same subset of information available in the data lake.

As part of the set of manual steps we are trying to automate, we are using Amazon EMR (again, as for the previous post, if you want to see those manual steps, refer to the documentation in the GitHub repository) together with some Apache Hive and Presto SQL scripts to create tables and export files. As we are automating this, a lot of the stuff we simply absorb as part of the manual work (for example, I already have a database called XX, so I do not need to re-create that) needs to be built explicitly into the workflow:

- Create our Apache Hive and Presto SQL scripts and upload those to a location on Amazon S3.
- Check to see if a database exists and create it if it does not exist.
- Create tables to import the movie and ratings data (using the scripts we uploaded).
- Create a new table that just contains the information we are looking for (in this example, films of a particular genre).
- Export the new table as a csv file (again using the scripts we already uploaded).
- Move the exported csv file to a new location in the data lake.
- Clean up and shut down any resources so we can minimise the cost of running this operation.
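As a rough illustration of the kind of Hive SQL scripts such a workflow would upload to S3, here is a minimal sketch that builds the create-database, create-table, and genre-export statements. The database name, table schema, and S3 locations are hypothetical placeholders, not the repository's actual scripts, and INSERT OVERWRITE DIRECTORY is just one way Hive can export a delimited file.

```python
# Hypothetical Hive SQL for the workflow steps above. Names, columns, and
# S3 locations are illustrative placeholders, not the repo's actual scripts.

CREATE_DB = "CREATE DATABASE IF NOT EXISTS movielens;"

CREATE_MOVIES = """
CREATE EXTERNAL TABLE IF NOT EXISTS movielens.movies (
    movie_id INT,
    title STRING,
    genres STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-demo-bucket/movielens/movies/';
"""

def genre_subset_query(genre, output_path):
    """Build the export statement for one genre; Hive writes the result
    as comma-delimited files under output_path."""
    return (
        f"INSERT OVERWRITE DIRECTORY '{output_path}' "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        "SELECT movie_id, title FROM movielens.movies "
        f"WHERE genres LIKE '%{genre}%';"
    )
```

Generating the export statement from a genre parameter keeps the DAG reusable: the same workflow can be triggered for a different genre without editing the uploaded scripts by hand.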
In Part One, we automated an example ELT workflow on Amazon Athena using Apache Airflow.

I wanted to write a post about how I built my own Apache Spark environment on AWS using Amazon EMR, Amazon EKS, and the AWS Cloud Development Kit (CDK). With Amazon EMR on EKS, you can now customize and package your own Apache Spark dependencies, and I use that functionality for this post. This stack also creates an EMR Studio environment that can be used to build and deploy data notebooks. Disclaimer: I work for AWS on the EMR team and built this stack for my various demos; it is not intended for production use-cases.

Build your own Air Quality Monitor with OpenAQ and EMR on EKS

Fire season is closely approaching, and as somebody who spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis. OpenAQ maintains a publicly accessible dataset of various air quality metrics that's updated every half hour.

Building and Testing a new Apache Airflow Plugin

Recently, I had the opportunity to add a new EMR on EKS plugin to Apache Airflow. While I've been a consumer of Airflow over the years, I've never contributed directly to the project. And weighing in at over half a million lines of code, Airflow is a pretty complex project to wade into. So here's a guide on how I made a new operator in the AWS provider package. Before you get started, it's good to have an understanding of the different components of an Airflow task.