Amazon Redshift Spectrum

Jelizaveta Malinina
4 min read · Feb 19, 2020

In this post I’d like to talk to you about RS Spectrum and answer some of the questions I had when I first heard about it. What is Spectrum? How does it work? How does it differ from a Redshift cluster? How do you get started with it? Towards the end I’ll also outline an error I ran into and how to resolve it. Without further ado, let’s get started!

What is RS Spectrum?

If I had to put its definition into one sentence, I would say: “RS Spectrum is a feature within the AWS Redshift data warehouse service that allows you to run fast, complex analysis on data stored in Amazon S3 buckets”. In other words, it eliminates the need to move data from cloud storage to a Redshift cluster for data analysis. Isn’t that awesome? Especially if you don’t query this data frequently, it can also save you a lot of money!

If you’re familiar with Amazon Athena you might be thinking that the principle sounds quite similar, and you are right. The main difference is that with Athena you go fully serverless, skipping the warehouse. AWS Athena is usually the preferred option if your query is simple and interactive and doesn’t require large-scale aggregations. RS Spectrum, in contrast, allows you to perform more complex data analytics and aggregations together with other Redshift tables.

How does it work?

Redshift Spectrum lives on dedicated Redshift servers that are independent of your cluster. RS Spectrum queries use much less of the cluster’s processing capacity than ordinary queries, as Spectrum breaks a query into parts and spreads them across as many AWS-managed nodes as required (up to thousands of instances!). You can query the same data stored on S3 from multiple clusters, but remember that queries can only be executed in the same region where the data is saved. The AWS documentation explains perfectly how external tables are created:

You create Redshift Spectrum tables by defining the structure for your files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore. You can create and manage external tables either from Amazon Redshift using data definition language (DDL) commands or using any other tool that connects to the external data catalog. Changes to the external data catalog are immediately available to any of your Amazon Redshift clusters.
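
To make that concrete, here’s a minimal sketch of the DDL route; the schema, table, bucket, and column names below are all made up for illustration (the external schema itself gets created later in this post):

-- Hypothetical external table over CSV files in S3;
-- this only registers metadata, no data is moved
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id  INTEGER,
    store_id INTEGER,
    amount   DECIMAL(10,2),
    sold_at  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/sales/';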

To improve performance, you can also add partitions to your external tables; this saves scan time if you’re going to filter your results by the partition columns, for example.
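
As an illustration, here’s a hypothetical partitioned variant of the table above. Spectrum can then skip every S3 prefix that doesn’t match a filter on the partition column:

-- Partition column must not appear in the regular column list
CREATE EXTERNAL TABLE spectrum_schema.sales_by_day (
    sale_id INTEGER,
    amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales_by_day/';

-- Each partition has to be registered explicitly
ALTER TABLE spectrum_schema.sales_by_day
ADD IF NOT EXISTS PARTITION (sale_date = '2020-02-01')
LOCATION 's3://my-example-bucket/sales_by_day/sale_date=2020-02-01/';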

Once the external table has been defined, you can start working with it just like with any other Amazon Redshift table. You won’t notice a difference unless you know it’s there!
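
For instance, nothing stops you from joining an external table with an ordinary Redshift table. Here’s a sketch reusing the hypothetical sales table from above, joined to a made-up local dimension table:

-- External table (data in S3) joined with a local Redshift table
SELECT d.store_name,
       SUM(s.amount) AS total_sales
FROM spectrum_schema.sales AS s
JOIN public.store_dim AS d
  ON s.store_id = d.store_id
GROUP BY d.store_name;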

Redshift vs Redshift Spectrum:

Now let’s have a look at why you might or might not choose RS Spectrum over a Redshift cluster.

Cost

  • RS Spectrum users can take advantage of inexpensive S3 storage, plus a per-query execution cost
  • But watch out for the frequency of your queries and how much they cost you! If data-scanning costs start to exceed the cost of storing the data on a Redshift cluster, you might want to consider moving your frequently accessed data to the Redshift cluster instead

Speed

  • Redshift Spectrum queries do tend to take longer to execute, mainly because data moves between S3 and Spectrum. However, if time efficiency is not crucial for you, RS Spectrum might be a good option

Operations

  • RS Spectrum doesn’t allow UPDATE operations: external tables are read-only, so if you need to modify rows you’ll have to copy them into a local table first, as sketched below
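
A minimal sketch of that workaround, reusing the hypothetical tables from earlier:

-- External tables are read-only; pull the rows into a local table first
CREATE TABLE local_sales AS
SELECT * FROM spectrum_schema.sales;

-- Updates then work as usual on the local copy
UPDATE local_sales
SET amount = 0
WHERE store_id = 42;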

Getting Started:

So, now that you know how RS Spectrum works and what it’s used for, we can create our own external table. There are several ways to achieve this, but today we’ll have a look at how to create one using Glue Crawlers.

Note: if AWS Glue is not supported in your chosen region, you might need to use the Athena data catalog.

During this lab I’m using the ‘eu-west-1’ region, as it’s the closest to me and supports Glue. For this lab you’ll need:

  • Data stored in S3 that you want to access through Redshift
  • A Glue Crawler for the data
  • An IAM role for the RS Spectrum table
  • A running Redshift cluster in the same region as the S3 bucket

We start by creating and running a Glue Crawler. I assume you’re all familiar with how to do that; if not, have a look at the AWS documentation. I find doing it from the AWS user interface very intuitive and quick. During this stage, take note of the database name that you’re creating with the Crawler; you’ll need it later. In my case I’ve created a database called store_data.

Now let’s create an IAM role. You can find more details here, but all you need to do is give this role AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess. I’ve called my IAM role ‘redshift-spectrum-role’.

Next, we need to open a database management environment (I usually prefer DBeaver or DataGrip). Once connected to your Redshift cluster, run the following command, substituting the required values:

CREATE EXTERNAL SCHEMA rs_store_data FROM DATA CATALOG
DATABASE 'store_data'
IAM_ROLE 'arn:aws:iam::….:role/redshift-spectrum-role'
REGION 'eu-west-1';

Now that you’ve successfully run that command, you can query your data straight away! Tables will be located under the schema you’ve defined, which was ‘rs_store_data’ in my case.
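
If you’d like to double-check what the crawler registered before querying, the svv_external_tables system view lists the external tables visible to your cluster (the sales table in the second query is hypothetical):

-- List the external tables behind the new schema
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'rs_store_data';

-- Then query one of them directly
SELECT * FROM rs_store_data.sales LIMIT 10;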

By now you should have quite a solid understanding of what RS Spectrum is, why you might want to use it, and how to create external tables. When I first started working with it a couple of months ago, I faced a very annoying issue: whenever I tried to query my external table, I kept getting an ‘Access Denied’ error. I checked the IAM role hundreds of times, and it all seemed to be correct. Annoyingly, the error itself wasn’t very useful for debugging, but what I eventually found out was that the bucket I was trying to query was encrypted! Therefore, the IAM role used for RS Spectrum needed access to the KMS keys as well. Keep that in mind, and if you ever face this error, do check your encryption!
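
One last debugging tip: in my experience Spectrum usually writes a more detailed error message to the svl_s3log system view than what your SQL client shows, so it’s worth checking there when a Spectrum query fails:

-- Most recent Spectrum log messages, newest first
SELECT query, segment, node, slice, eventtime, message
FROM svl_s3log
ORDER BY eventtime DESC
LIMIT 20;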

Thank you for reading and I hope you enjoyed it!


Jelizaveta Malinina

I’m an AWS-certified Senior Data Engineer who wants to share knowledge with each one of you.