Using AWS Data Pipeline for long-running ETL processes
Last week I wrote about using AWS Lambda functions to facilitate event-based processing of long-running ETL jobs. An AWS Lambda function can invoke the Qubole Data Platform's API to start an ETL process.
Today I want to show how the same thing can be accomplished using AWS Data Pipeline. Data Pipeline is an Amazon service for moving data between AWS storage and compute resources, with robust functionality for retries and conditional branching logic. Coupling that logic with the long-running ETL capabilities of the QDS platform is an ideal match.
A common use case is to process data in S3 using Hive or Spark and publish the results to Redshift. By loading the data into Redshift you can provide real-time analytical capabilities to end users via visualization tools such as Tableau, while leaving the costly heavy lifting of the ETL to Hive and Spark running on Hadoop.
The following example implements this data pipeline.
As you can see from the diagram above, our data pipeline runs on a daily schedule. The first thing it does is start up a t1.micro instance, which will be used for processing the workload. As before, this pipeline could also be invoked from an AWS Lambda function.
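For reference, a minimal pipeline definition for this setup might look like the following sketch. The object ids, start time, script location, and the one-hour `terminateAfter` are assumptions for illustration, not values from the diagram:

```json
{
  "objects": [
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startDateTime": "2016-01-01T00:00:00"
    },
    {
      "id": "PipelineInstance",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "1 Hour",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "RunQuboleEtl",
      "type": "ShellCommandActivity",
      "scriptUri": "s3://my-bucket/scripts/run_qubole_etl.sh",
      "runsOn": { "ref": "PipelineInstance" },
      "schedule": { "ref": "DailySchedule" }
    }
  ]
}
```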
Once the instance is running, a shell command is invoked. This shell command is the integration point with the Qubole platform.
#!/bin/sh
# AUTH holds your Qubole API key
AUTH="<your-api-token>"
# Parameters for the command template, if any
REQUEST_BODY='{"parameters": {}}'

curl -s -X POST \
  -H "X-AUTH-TOKEN: $AUTH" \
  -H "Content-Type: application/json" \
  -d "$REQUEST_BODY" \
  https://api.qubole.com/api/v1.2/command_templates/345/run
The shell command executes the ETL query, and when it completes the data pipeline begins exporting the data to Redshift. The caller does not need to manage the state of the QDS cluster: if the cluster is not running it will automatically be started and scaled appropriately. After the process is complete, the EC2 resources used by the pipeline are torn down. The QDS Hadoop cluster will also time out and begin to scale down or shut off completely, depending on its configuration.
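The Redshift export step can be expressed as a RedshiftCopyActivity node in the pipeline. A sketch follows; the object ids it references and the choice of insert mode are assumptions:

```json
{
  "id": "LoadToRedshift",
  "type": "RedshiftCopyActivity",
  "input": { "ref": "S3EtlOutput" },
  "output": { "ref": "RedshiftTargetTable" },
  "insertMode": "TRUNCATE",
  "runsOn": { "ref": "PipelineInstance" }
}
```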
One thing to keep in mind is that, unlike AWS Lambda, Data Pipeline is not a serverless architecture. Generally speaking, your processing logic runs on an EC2 instance, and at a minimum you pay for one hour of that instance's time. In the scenario above, the instance used to control the pipeline must remain up for the entire time the pipeline is executing. This is because of the polling step towards the end of the shell script: we must continue to block until the ETL process is complete before handing off to the next step in the pipeline.
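That polling step might be sketched as follows. The function name and the 30-second interval are assumptions; the statuses are those reported by the Qubole commands API:

```shell
#!/bin/sh
# A sketch of the blocking poll described above. The caller supplies a
# command that prints the current status; against Qubole it might be:
#   curl -s -H "X-AUTH-TOKEN: $AUTH" \
#     https://api.qubole.com/api/v1.2/commands/$COMMAND_ID | jq -r .status
poll_until_done() {
  while true; do
    status=$("$@")                      # run the supplied status command
    case "$status" in
      done) echo "ETL complete"; return 0 ;;
      error|cancelled) echo "ETL failed: $status" >&2; return 1 ;;
      *) sleep 30 ;;                    # still running; block and poll again
    esac
  done
}
```

This is the blocking behavior that keeps the pipeline's EC2 instance alive: the function does not return until the Qubole command reaches a terminal state.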
Because it is not a serverless architecture, this approach has a few cons.
- The EC2 instance used by the data pipeline needs to be running for the entire time the ETL process is running. This can add unnecessary cost, as the only thing happening on the EC2 instance is polling for job completion.
- If the EC2 resources required to launch the command are not immediately available, you must wait for them to be provisioned and pay for a full hour of their existence.
This scenario can be changed slightly to make better use of the compute resources. As in our last example, we can use an AWS Lambda function to execute a Hadoop ETL job when new data arrives in S3.
In order to achieve this we can use a Qubole Workflow component. The workflow component allows you to chain together multiple commands in sequence. In this case we run our Hive ETL followed by an Amazon S3 API call to continue the data pipeline.
Since the Qubole Data Service does the heavy lifting we can remove the Shell Command node from the data pipeline. QDS will activate the pipeline when the ETL process is complete.
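The S3 call in the workflow's final step can be as simple as uploading a marker object that a precondition in the pipeline (such as S3KeyExists) is watching. A minimal sketch, in which the bucket name, key layout, and _SUCCESS marker convention are all assumptions:

```shell
#!/bin/sh
# Hypothetical final workflow step: publish a dated marker object. The data
# pipeline's precondition watches for this key and continues when it appears.
MARKER_KEY="markers/$(date +%Y-%m-%d)/_SUCCESS"
echo "uploading marker: $MARKER_KEY"
# aws s3 cp /dev/null "s3://my-etl-bucket/$MARKER_KEY"  # requires AWS credentials
```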
With this approach the usage of EC2 resources is kept to a minimum: only after the long-running part of the ETL process is complete is an instance started for the pipeline. The AWS Lambda function will also start the QDS cluster if it is not already running and scale it appropriately. Once the process is complete, the cluster will scale down and shut off completely if there is no other workload running.
By combining various web-based technologies we can efficiently process workloads while reducing the number of resources consumed.