Amazon Athena

Key Takeaways: 

  1. Data leaders all over the world are looking to move their business intelligence workloads to the cloud.
  2. For companies operating in the AWS ecosystem, such as Merit Data and Technology, Athena offers an interactive query service to analyse data in Amazon S3.
  3. Because it is serverless, Athena does not require any infrastructure setup and is cost-effective and operationally efficient.
  4. All you need to do is bring your data to Amazon S3 and start querying using standard SQL.
  5. Last but not least, you pay only for the queries that you run.

What is Amazon Athena? 

Amazon Athena is an interactive query service that uses standard SQL to analyze data directly in Amazon S3 and generate ad-hoc reports within seconds.  

Being serverless, Athena needs no infrastructure setup or management. Instead, it offers querying as a service for processing logs, running interactive queries, and performing ad-hoc analysis. It also scales: Athena can execute several queries in parallel, including complex queries on large datasets.

Charges depend on the volume of data scanned per query. That volume, and therefore the cost, can be lowered significantly by compressing the data, partitioning it, or converting it into a columnar format.
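As a rough illustration of how scanned volume drives the charge, the sketch below estimates the cost of a single query. The price of 5.00 USD per TB, the 10 MB per-query minimum, the binary definition of a terabyte, and the 10% scan ratio for columnar data are all assumptions made for this example and do not come from this article; check the current Athena pricing for your region. The function name is hypothetical.

```python
# Illustrative only: price, minimum, and scan ratio are assumptions, not figures from this article.
MB = 1024 ** 2
TB = 1024 ** 4   # binary terabyte assumed for simplicity


def estimate_query_cost(bytes_scanned, price_per_tb=5.00, min_bytes=10 * MB):
    """Estimate the charge in USD for one query from the bytes it scans."""
    billable = max(bytes_scanned, min_bytes)   # apply the assumed per-query minimum
    return billable / TB * price_per_tb


raw_csv = 1 * TB             # hypothetical uncompressed CSV dataset, scanned in full
columnar = raw_csv * 0.1     # hypothetical: compressed columnar copy, query reads 10% of the bytes

print(f"CSV scan:      ${estimate_query_cost(raw_csv):.2f}")
print(f"Columnar scan: ${estimate_query_cost(columnar):.2f}")
```

Under these assumptions, the same query costs a tenth as much once the data is stored in a form that lets Athena skip most of the bytes.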

Because it is serverless, users do not have to worry about infrastructure, scaling, configuration, updates, or failures as datasets and user numbers grow, and can focus on the data instead. To get started, log into the Athena console, enter DDL statements in the console wizard to define the schema, and use the built-in query editor to start querying. No complex ETL jobs are needed to prepare the data before analysis.
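The same two steps, defining a schema with DDL and then querying, can also be done programmatically. The following is a minimal sketch, not a definitive implementation. The table name, columns, bucket paths, and the run_query helper are hypothetical. The client argument is assumed to behave like the Athena client from the AWS SDK for Python (boto3.client("athena")), which is passed in rather than imported so that the sketch does not depend on credentials or an installed SDK.

```python
import time

# Hypothetical schema and S3 location, used purely for illustration.
CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    request_time string,
    status int,
    bytes_sent bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/web-logs/'
"""


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a statement to Athena and wait until it reaches a final state.

    `client` is assumed to behave like boto3.client("athena").
    Returns the query execution ID and the final state.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)   # still QUEUED or RUNNING, so check again shortly
```

With the SDK installed and credentials configured, a call such as run_query(boto3.client("athena"), CREATE_TABLE_DDL, "default", "s3://example-bucket/athena-results/") would create the table, and a second call with a SELECT statement would query it.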

Amazon Athena Features 

Amazon Athena uses the open-source, distributed SQL query engine Presto, which has been optimised for low-latency, ad hoc analysis of data. As a result, queries can be run using ANSI SQL on large datasets in Amazon S3, with support for large joins, arrays, and window functions. Athena’s JDBC driver allows connecting to Athena from several BI tools, and Athena supports many data formats such as JSON, CSV, ORC, Avro, and Parquet.

Some of its other features include: 

  • High Availability & Durability: Amazon Athena executes queries using compute resources across multiple facilities and automatically reroutes them if a facility is unreachable. With Amazon S3 as the underlying data store, data is both highly available and durable: S3 is designed for 99.999999999% durability and 99.99% availability of objects, and stores data redundantly across multiple facilities and across multiple devices in each facility.
  • Security: Amazon Athena leverages AWS Identity and Access Management (IAM) policies, Amazon S3 bucket policies, and access control lists (ACLs) for permission-based control of access to data. It can also query encrypted data in Amazon S3 and write encrypted results back to it, and supports both server-side and client-side encryption.
  • Integration with Glue: Amazon Athena’s out-of-the-box integration with the AWS Glue Data Catalog makes it possible to create a single metadata repository across different services and to crawl data sources. Crawling discovers data and populates the Data Catalog with new and modified table and partition definitions, while maintaining schema versioning. Query performance can also be optimised by transforming data or converting it into columnar formats using Glue’s fully managed ETL capabilities.
  • Federated Query: Athena Data Source Connectors running on AWS Lambda facilitate federated querying in Athena. The Athena Query Federation SDK also allows building connectors for any data source.
  • Machine Learning: SageMaker Machine Learning models can be invoked from an Athena SQL query to run inference, which simplifies complex tasks such as anomaly detection, sales prediction, and customer cohort analysis.

Advantages of AWS Athena 

Some of the advantages of using AWS Athena for data engineering projects include:  

No Infrastructure Management Needed: Often, the greatest drain on resources is investing in infrastructure and its maintenance. Being serverless, Athena reduces this burden, and you can get going from day one without having to set up clusters, manage capacity, or load data.

Lower Costs: AWS Athena allows you to pay for the queries you run and not for compute instances, thereby lowering your running costs.  

Widely Accessible: Athena queries are run using standard SQL, making it widely accessible to not only developers and engineers but even business analysts and data professionals.  

Flexibility: Amazon Athena integrates with several open-source file formats, thereby providing flexibility.  

Disadvantages of AWS Athena  

AWS Athena is not without its disadvantages, some of which include:  

  • Absence of built-in data optimization capabilities
  • Compute resources are shared among AWS Athena users, which can cause fluctuations in query performance
  • It is primarily a query engine and lacks a full Data Manipulation Language (DML) interface to insert, delete, and update data
  • Data sets stored in Amazon S3 need to be partitioned sensibly for Athena queries to run efficiently; poorly partitioned data can hurt performance
  • Since AWS Athena lacks indexing capabilities, it can prove limiting when joining or consolidating large tables
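To make the partitioning point in the list above concrete, the sketch below simulates locally how a filter on partition columns limits which objects have to be read. It is not how Athena itself plans queries. The object keys, sizes, and helper names are hypothetical, and the key layout assumes Hive-style partition folders of the form column=value.

```python
# Hypothetical S3 object keys (with sizes in bytes) laid out in Hive-style partitions.
OBJECTS = {
    "logs/year=2023/month=12/part-0.parquet": 400_000_000,
    "logs/year=2024/month=01/part-0.parquet": 500_000_000,
    "logs/year=2024/month=02/part-0.parquet": 300_000_000,
}


def partition_values(key):
    """Extract partition columns from a key, e.g. {'year': '2024', 'month': '01'}."""
    return dict(part.split("=", 1) for part in key.split("/") if "=" in part)


def bytes_to_scan(objects, **filters):
    """Sum the sizes of objects whose partition values match every filter."""
    return sum(
        size
        for key, size in objects.items()
        if all(partition_values(key).get(col) == val for col, val in filters.items())
    )


print(bytes_to_scan(OBJECTS))                            # no partition filter: every object is read
print(bytes_to_scan(OBJECTS, year="2024", month="01"))   # only the matching partition is read
```

A query that filters on the partition columns reads a fraction of the data, whereas the same query against an unpartitioned or poorly partitioned layout has to read everything.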

Merit Group’s expertise in cloud BI 

At Merit Group, we work with some of the world’s leading B2B intelligence companies like Wilmington, Dow Jones, Glenigan, and Haymarket. Our data and engineering teams work closely with our clients to build data products and business intelligence tools. Our work directly impacts business growth by helping our clients to identify high-growth opportunities.  

Our specific services include high-volume data collection, data transformation using AI and ML, web watching, BI, and customized application development.  

We’re experts in Cloud BI, helping companies streamline and migrate to a truly next-generation BI stack. 

Our team also brings deep expertise in building real-time data streaming and data processing applications, where our data engineering experience is especially useful. Our data engineering team works with a wide range of data tools, including Airflow, Kafka, Python, PostgreSQL, MongoDB, Apache Spark, Snowflake, Redshift, Athena, Looker, and BigQuery.

If you’d like to learn more about our service offerings or speak to a cloud BI expert, please contact us.
