
Automate Athena queries with Python



Introduction

In the last few weeks I've had the opportunity to work intensively with Amazon Athena. For anyone who hasn't come across it yet: Athena essentially lets you run SQL queries on data that lives in S3 (under the hood, Athena is a managed Hive/Presto cluster). The billing model is also quite attractive - you only pay for the amount of data scanned, and at around $5/TB the costs are very manageable, especially if you consider how much effort it takes to set up an EMR cluster for rare or one-off queries.

In this blog post, I'll show you how to automate Athena with Python.

Use Athena

Interactive

If you use Athena interactively, i.e. in the AWS console, this is very easy - the GUI shows the available schemas and tables on the left and the SQL editor at the top right. After writing your SQL statement, click the big blue Run query button, lean back, admire the loading bar, and after a few seconds the result appears at the bottom right, where it can also be downloaded as CSV.

Inside the code

Using Athena from scripts or similar is a bit more annoying - at least when using Lambda or trying to keep the architecture serverless.

Sending SQL queries via the Python SDK is relatively easy - if you use the standard boto3 SDK, it looks something like this:
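A minimal sketch with boto3 - the helper function and all names here are illustrative, not the post's original snippet:

```python
def build_query_params(sql: str, database: str, output_bucket: str) -> dict:
    """Assemble the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{output_bucket}/"},
    }


def start_query(sql: str, database: str, output_bucket: str) -> str:
    """Start an Athena query and return its QueryExecutionId."""
    import boto3  # imported lazily so build_query_params stays testable offline

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_query_params(sql, database, output_bucket)
    )
    return response["QueryExecutionId"]
```

The returned `QueryExecutionId` is all you need to check on the query later.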

Starting queries is all well and good, but mostly we are also interested in the result - or at least whether the query was executed successfully.

The problem here is that Athena queries tend to run for seconds or minutes rather than milliseconds - the hard upper limit is 30 minutes. This becomes a problem when we work with Lambda, because the maximum runtime for Lambda functions is currently 15 minutes - and long-running Lambdas aren't great anyway. (If you know in advance that your queries won't take long, you can jump straight to the "Fast queries" section.)

Step functions to the rescue! Yes, I know using another service for this problem is not ideal, but Athena currently has two major limitations:

  1. There is no Lambda trigger when a query has ended
  2. There is also no integration with SNS, SQS or the like when queries are terminated

In short: Athena lacks integrations for finished queries (if I should have overlooked something, please get in touch).

In case you haven't run into Step Functions before: it's a service in which you can define complex workflows that involve, for example, several Lambda functions. These workflows are defined as state machines.

My step functions for long-running queries usually look like this:

A Lambda function starts the query and passes the ID of the execution on to the next steps. Now a kind of loop begins - we first wait five seconds in a wait step, then another Lambda function is triggered, which fetches the current status of the execution. A decision step (choice step) then defines which path is taken under which conditions. If the query is still running or still queued, we go back to the wait step. If there is an error, we transition to an error state, and if the query has completed successfully, we process the result in another Lambda function.

You can find an example project with the code and the state machine on Github. To make my life easier, I wrote a mini-library that abstracts away some of the annoying parts of the AWS API.

Next, I'll show the code for long-running queries, and then we'll look at the simpler case of fast queries.

Automate Athena

Long-running queries

As mentioned earlier, there are three Lambda functions involved - we start with the function that executes the query:

The function first prepares the parameters for the query:

  • the SQL statement itself
  • the name of the schema in which the query is executed
  • the name of the S3 bucket in which the results are to be stored - here I dynamically assemble the name of the default Athena bucket

The actual code for executing the query consists of just two lines: we create a query object and execute it. As a result, we receive the Query Execution Id, which uniquely identifies this execution.

Then we assemble the return object for Lambda - this becomes the input for the status check. In principle, it would be sufficient to pass on the Query Execution Id here, but I like statistics, and some preparations for that are made here as well.
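The handler might look roughly like this, with `AthenaQuery` as a hypothetical stand-in for the mini-library (the real class, method, and field names may differ):

```python
import os
import time


def default_result_bucket(account_id: str, region: str) -> str:
    """Assemble the name of the default Athena result bucket."""
    return f"aws-athena-query-results-{account_id}-{region}"


class AthenaQuery:
    """Hypothetical stand-in for the mini-library's query abstraction."""

    def __init__(self, sql: str, database: str, result_bucket: str):
        self.sql = sql
        self.database = database
        self.result_bucket = result_bucket

    def run(self) -> str:
        import boto3  # imported lazily so the rest stays testable offline

        client = boto3.client("athena")
        response = client.start_query_execution(
            QueryString=self.sql,
            QueryExecutionContext={"Database": self.database},
            ResultConfiguration={
                "OutputLocation": f"s3://{self.result_bucket}/"
            },
        )
        return response["QueryExecutionId"]


def lambda_handler(event, context):
    bucket = default_result_bucket(
        event["account_id"], os.environ.get("AWS_REGION", "eu-central-1")
    )
    query = AthenaQuery(event["query"], event["database"], bucket)
    # The return object becomes the input for the status-check step
    return {
        "query_execution_id": query.run(),
        "started_at": time.time(),  # kept around for the statistics
    }
```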

The next Lambda function fetches the status of the execution and is much simpler. An object is built from the Query Execution Id, and then the status and some statistics of the query are retrieved and stored in the result object.

The choice step in the step function defines how this output is processed, here is an excerpt from the definition of the state machine - the complete definition can be found in the repository:
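Such a choice state might look roughly like this - the state names here are placeholders, not necessarily those from the repository:

```json
"Check Query Status": {
  "Type": "Choice",
  "Choices": [
    {
      "Or": [
        {"Variable": "$.status", "StringEquals": "FAILED"},
        {"Variable": "$.status", "StringEquals": "CANCELLED"}
      ],
      "Next": "Query Failed"
    },
    {
      "Variable": "$.status",
      "StringEquals": "SUCCEEDED",
      "Next": "Process Query Results"
    }
  ],
  "Default": "Wait"
}
```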

This excerpt tells the state machine to switch to the error state if the query has the status FAILED or CANCELLED.

We only get to the next processing step if the query was successful. This Lambda function again builds an object from the Query Execution Id and retrieves the result:
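A minimal sketch using boto3 directly - the row-conversion helper is illustrative, and for large results you would additionally have to follow the pagination token:

```python
def rows_to_dicts(result_set: dict) -> list:
    """Convert an Athena ResultSet into a list of dicts (first row = header)."""
    rows = result_set["Rows"]
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    return [
        dict(zip(header, [col.get("VarCharValue") for col in row["Data"]]))
        for row in rows[1:]
    ]


def lambda_handler(event, context):
    import boto3  # imported lazily so rows_to_dicts stays testable offline

    client = boto3.client("athena")
    response = client.get_query_results(
        QueryExecutionId=event["query_execution_id"]
    )
    return rows_to_dicts(response["ResultSet"])
```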

You can then do whatever you want with the result within the lambda function.

Now let's look at the simpler case, quick queries:

Fast queries

I would recommend this approach for queries with a maximum runtime of about 5 minutes - beyond that, it is probably worth setting up the state machine.

This code is also based on the mini-library:

The same functions as described above are used here - apart from the additional wait step in the middle. By default, the function waits up to 60 seconds for a result; a longer value can be passed in the constructor.
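The waiting behaviour can be sketched as a generic polling loop - this is an illustration under assumed names, not the mini-library's actual implementation:

```python
import time


def wait_for_result(get_status, timeout_seconds: int = 60,
                    poll_interval: float = 1.0) -> str:
    """Poll get_status() until the query reaches a terminal state or times out.

    get_status is any callable returning one of Athena's query states:
    QUEUED, RUNNING, SUCCEEDED, FAILED or CANCELLED.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("Query did not finish within the timeout")
```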

Summary

In this post I showed you how to use a mini-library to automate short and long-running Athena queries with Python.

If you have any questions, comments or criticism, feel free to contact us by email or on Twitter (@Maurice_Brg)

Photo by Hitesh Choudhary on Unsplash