A Brief Primer on AWS Athena
These days, data analytics is a challenge because of the massive growth in big data. It is hard to store and validate large data sets, especially when the data comes in all shapes and types. There are several ways to address these problems, and one of them is AWS Athena.
A report suggests that 80% of data will be unstructured by 2025 [1]. Also, 95% of businesses [2] have problems managing unstructured data. Such issues can prevent a company from making timely business decisions.
Data analysts need powerful query engines to handle such data. Traditional relational engines like MySQL are not designed to scan big data sources at this scale, so they usually take longer to generate results.
AWS Athena is a powerful query service that makes it easier to work with a high volume of data. It integrates with Amazon S3 [3], Amazon's data storage service. You can point Athena at your data in S3 and use simple SQL to query it.
What is AWS Athena?
AWS Athena is a serverless big data analysis tool that lets you query data stored in Amazon S3. And since it’s serverless, you don’t have to provision or manage any infrastructure. AWS takes care of that. This can also lower your overall expenses, because the data stays in S3 and there are no extra storage costs apart from S3 charges.
AWS Athena Pricing
So, if there is no infrastructure to maintain, what do you pay for? You pay for the amount of data your queries scan: $5 for every TB of data scanned. A query that scans 100 GB, for example, costs about $0.50.
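To make the pricing concrete, here is a minimal Python sketch of the arithmetic. It assumes the $5 per TB rate quoted above, treats a terabyte as 10^12 bytes, and ignores any per-query minimums or rounding that AWS may apply. The function name `estimate_scan_cost` is ours and is not part of any AWS SDK.

```python
# Rough cost estimate for Athena queries, based on the per-TB rate quoted above.
PRICE_PER_TB_USD = 5.0   # rate from the article; check current pricing for your region
BYTES_PER_TB = 10**12    # assumption: decimal terabytes


def estimate_scan_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    examples = [("10 GB", 10 * 10**9), ("250 GB", 250 * 10**9), ("2 TB", 2 * 10**12)]
    for label, size in examples:
        print(f"{label}: ${estimate_scan_cost(size):.2f}")
```

Since the charge depends only on bytes scanned, anything that reduces how much data a query reads reduces the bill in the same proportion.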
Writing Queries with AWS Athena
But what about querying? To keep things simple, Athena uses standard SQL, and it runs that SQL directly on the raw data stored in S3. So, there is no need to load your data into a separate database before you can query it.
Plus, AWS Athena integrates with the AWS Glue Data Catalog. The Data Catalog is part of AWS Glue, a service that simplifies ETL processes. Glue connects to your data sources and infers their schemas. The Data Catalog maintains metadata tables that hold information about your data sources. This information is used both to query the data from Athena and to optimize ETL jobs.
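As an illustration of what such a metadata table holds, the sketch below pulls the storage location and column list out of a table description shaped like the response of boto3's Glue `get_table` call. The sample dictionary is hand-written, and the database, table, and bucket names are made up. In a real setup the dictionary would come from something like `boto3.client("glue").get_table(DatabaseName=..., Name=...)`.

```python
def describe_table(get_table_response):
    """Summarize a Glue get_table-style response as (location, columns)."""
    table = get_table_response["Table"]
    storage = table.get("StorageDescriptor", {})
    # Each column entry carries a name and a type string.
    columns = [(col["Name"], col["Type"]) for col in storage.get("Columns", [])]
    return storage.get("Location"), columns


# Hypothetical sample, written by hand to mimic the response shape.
sample_response = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "sales_db",
        "StorageDescriptor": {
            "Location": "s3://example-bucket/raw/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    }
}

if __name__ == "__main__":
    location, columns = describe_table(sample_response)
    print(location)
    for name, col_type in columns:
        print(f"  {name}: {col_type}")
```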
Limitations of AWS Glue Data Catalog
However, the AWS Glue Data Catalog is limited to some extent. It is geared mainly toward ETL. But a powerful data catalog can also help with analytics and with the discoverability of data sources.
For example, a catalog can hold metadata for all the data sources in a system and categorize those sources by business use case. This way, whenever someone wants to search for data relating to financial projections, finding the right sources will be a cinch.
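The idea can be sketched in a few lines of Python. This is a toy in-memory catalog, not the Glue Data Catalog API. The entries, tag names, and the `find_sources` helper are all hypothetical.

```python
# Toy catalog: each data source carries business tags alongside its location.
CATALOG = [
    {
        "name": "revenue_forecast",
        "location": "s3://example-bucket/finance/forecast/",
        "tags": {"finance", "projections"},
    },
    {
        "name": "web_clickstream",
        "location": "s3://example-bucket/web/clicks/",
        "tags": {"marketing"},
    },
    {
        "name": "budget_plan",
        "location": "s3://example-bucket/finance/budget/",
        "tags": {"finance", "projections", "planning"},
    },
]


def find_sources(catalog, *required_tags):
    """Return the names of sources that carry every one of the required tags."""
    wanted = set(required_tags)
    return [entry["name"] for entry in catalog if wanted <= entry["tags"]]


if __name__ == "__main__":
    # Someone looking for financial projections searches by tag, not by path.
    print(find_sources(CATALOG, "finance", "projections"))
```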
How does AWS Athena work?
It is very easy to work with AWS Athena. The first step is to load your data into S3. Of course, you can skip this step if the data is already there. Otherwise, you can either upload it manually or use Kinesis Firehose to stream it into S3.
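A minimal sketch of the manual route follows. It writes records as newline-delimited JSON, one object per line, which is a layout that line-oriented JSON readers generally expect. The helper takes the S3 client as a parameter so the logic can be tried without AWS credentials. The bucket and key names are placeholders, and `upload_records` is our own helper, not an AWS function; only `put_object` is a real boto3 S3 client method.

```python
import json


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def upload_records(s3_client, bucket, key, records):
    """Write records to s3://bucket/key through the client's put_object call."""
    body = to_json_lines(records).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


# Example use (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_records(
#       boto3.client("s3"),
#       bucket="example-bucket",        # placeholder
#       key="raw/orders/orders.json",   # placeholder
#       records=[{"order_id": 1, "amount": 9.5}],
#   )
```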
Next, you need to tell Athena about your data by defining a table over it. This can be done manually, but you can use AWS Glue instead. A Glue crawler will crawl your data sources, infer their schemas, and record them in the AWS Glue Data Catalog.
This makes it easy to transform raw data into nice, clean tables. Let’s say you have a data source in JSON format. The crawler will read this JSON and create the columns automatically. So you don’t have to define each column by hand to create the schema.
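To give a feel for what schema inference involves, here is a toy Python version. It is not how AWS Glue works internally; it only handles flat records, maps everything unrecognized to `string`, and the type-widening rules are our own simplification. The generated statement uses Athena's `CREATE EXTERNAL TABLE` syntax with the OpenX JSON SerDe, and the database, table, and bucket names are hypothetical.

```python
import json

# Map Python types found in parsed JSON to Hive-style column types.
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_columns(json_lines):
    """Infer {column: type} from newline-delimited JSON records."""
    columns = {}
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        for name, value in json.loads(line).items():
            if value is None:
                continue  # a null tells us nothing about the type
            inferred = TYPE_NAMES.get(type(value), "string")
            previous = columns.get(name)
            if previous is None:
                columns[name] = inferred
            elif previous != inferred:
                # Widen bigint + double to double; otherwise fall back to string.
                numeric = {previous, inferred} == {"bigint", "double"}
                columns[name] = "double" if numeric else "string"
    return columns


def create_table_ddl(database, table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement for JSON data at `location`."""
    cols = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n{cols}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


SAMPLE = (
    '{"order_id": 1, "amount": 10, "customer": "ann"}\n'
    '{"order_id": 2, "amount": 12.5, "customer": "bob", "paid": true}\n'
)

if __name__ == "__main__":
    inferred = infer_columns(SAMPLE)
    print(create_table_ddl("sales_db", "orders", inferred, "s3://example-bucket/raw/orders/"))
```

Note how `amount` appears first as an integer and then as a decimal; the sketch widens it to `double` so both records fit the same column.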
After creating the schema, you can go back to Athena and start querying. You can use simple SQL to analyze your data.
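Queries can also be submitted from code. The sketch below assumes a boto3 Athena client and uses its `start_query_execution`, `get_query_execution`, and `get_query_results` calls. The `run_query` helper, the database name, and the S3 output location are our own placeholders. It reads only the first page of results and does no retry handling, so treat it as a starting point rather than production code.

```python
import time


def run_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit `sql` to Athena, wait for it to finish, and return result rows."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; for SELECT queries the first row is
    # typically the column headers. Only the first page is read here.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Example use (requires boto3 and configured AWS credentials):
#   import boto3
#   rows = run_query(
#       boto3.client("athena"),
#       "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer",
#       database="sales_db",                                  # placeholder
#       output_location="s3://example-bucket/athena-results/",  # placeholder
#   )
```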
What are the Key Features of AWS Athena?
AWS Athena lets you work with raw data. This is useful because big data can come in many formats. Athena can read common formats such as CSV, JSON, Parquet, and ORC directly, so you don’t have to convert your data or configure an ETL pipeline before querying it.
With Amazon S3, you can store a large amount of data from any source, including spreadsheets, media files, emails, and more. S3 is also highly available, which means that AWS keeps copies of your data in multiple data centers. So, if one goes down, the data is still available from another.
There is also parallel execution: Athena runs queries in parallel across a lot of computing power. So, results come back quickly, even on large data sets.
Additionally, Athena uses the Presto query engine, which makes it easy to analyze data from different sources through a single query.
Finally, AWS Athena works with AWS access controls such as IAM policies to keep your data secure, and it integrates with the AWS Glue Data Catalog, which helps maintain metadata about your data sources.
Enhance AWS Athena with Sherloq
You can make AWS Athena better with the Sherloq add-on. The add-on makes it easy to save and share queries. With its data glossary, you can easily view your query history, which shows usage rates, user info, and status info.
You can also extract and communicate insights from your data more efficiently by using Sherloq’s Auto ML recommendations feature. It gives query suggestions based on other teams’ usage. On top of this, you can see snapshots of results from saved queries along with metadata like the schema name, owner, query status, etc.
With big data, you often need to write long queries, which can become burdensome. So, to save time, Sherloq allows you to select part of a query and save it as a snippet for later.
And you don’t have to reconfigure your data stack, because Sherloq works with any existing stack. With just a few clicks, you can share queries over email or through a shareable link.
So start collaborating with your team to code faster and streamline your SQL workflow experience. Get access to Sherloq now!
References:
1. https://venturebeat.com/data-infrastructure/report-80-of-global-datasphere-will-be-unstructured-by-2025/
2. https://techjury.net/blog/big-data-statistics/#gref
3. https://aws.amazon.com/s3/