The S3 Cookbook

Get cooking with Amazon's Simple Storage Service

Scott Patten

This book is for sale at http://leanpub.com/thes3cookbook

This version was published on 2015-01-15

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do.

©2010 - 2015 Scott Patten

Tweet This Book!

Please help Scott Patten by spreading the word about this book on Twitter! The suggested hashtag for this book is #thes3cookbook. Find out what other people are saying about the book by searching for this hashtag on Twitter: https://twitter.com/search?q=#thes3cookbook

Contents

Preface
  Conventions Used in This Book
  Using Code Examples
  Getting the Code
  How to Contact Me
  Why Ruby

Chapter 1. What is S3, and what can I use it for?
  Backups
  Serving Data
  Use Cases
  NASDAQ Market Replay
  Jason Kester

Chapter 2. S3's Architecture
  A Quick, Tongue-in-Cheek Overview
  Amazon S3 and REST
  Buckets
  S3 Objects
  Access Control Policies
  Logging Object Access

S3 Recipes
  Signing up for Amazon S3
  Installing Ruby and the AWS/S3 Gem
  Setting up the S3SH command line tool
  Installing the S3Lib library
  Making a request using s3Lib
  Getting the response with AWS/S3
  Installing The FireFox S3 Organizer
  Working with multiple s3 accounts
  Accessing your buckets through virtual hosting
  Creating a bucket
  Creating a European bucket

  Synchronizing two buckets
  Listing All Of Your Buckets
  Listing only objects with keys starting with some prefix
  Downloading a File From S3
  Understanding access control policies
  Setting a canned access control policy
  Keeping the Current ACL When You Change an Object
  Making sure that all objects in a bucket are publicly readable
  Detecting if a File on S3 is the Same as a Local File

Chapter 4. Authenticating S3 Requests
  Authenticating S3 Requests
  Writing an S3 Authentication Library
  The HTTP Verb
  The Canonicalized Positional Headers
  The Canonicalized Amazon Headers
  Date Stamping Requests
  The Canonicalized Resource
  The Full Signature
  Signing the Request
  Making the Request
  Error Handling

Preface

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Note This icon signifies a tip, suggestion, or general note.

Warning This icon indicates a warning or caution.

Using Code Examples

All of the code in this book is released under an MIT license, and can be used pretty much anywhere and anyhow you please.

Getting the Code

All of the code in the book is available on GitHub¹ at http://github.com/spatten/thes3cookbook². You can check out the code using Git with the following command:

¹http://github.com
²http://github.com/spatten/thes3cookbook

$> git clone git://github.com/spatten/thes3cookbook.git

You can get a .zip or .tar archive of the code by going to http://github.com/spatten/thes3cookbook³ and clicking on the ‘download’ button.

How to Contact Me

I can be reached via e-mail at [email protected]. Please contact me if you have any questions, comments, kudos or criticism on the book. Constructive criticism is definitely appreciated; I want this book to get better through your feedback.

Why Ruby

You might be asking yourself why I wrote the examples in this book in Ruby. Here’s a quick set of reasons:

It’s concise The last thing you want if you’re reading (or writing!) a book with lots of code in it is lots of repetitive, boilerplate code. Ruby keeps this to a minimum.

irb irb is an interactive Ruby shell which you can use to play around with Ruby. In this book, two other programs (s3sh and s3lib) extend irb to allow you to play around with S3 at the command prompt.

It’s available Ruby comes pre-installed on most Unixes, including OS X. If you are on Windows, the Ruby One Click Installer will get you up and running quickly. See “Installing Ruby and the AWS/S3 Gem” for instructions on installing Ruby

AWS/S3 Marcel Molina’s AWS/S3 library is an elegant interface to Amazon S3. It’s used in most of the examples in the book.

RubyGems The RubyGems package library allows you to easily install the additional libraries needed to communicate with S3.

Ruby makes me happy Yukihiro Matsumoto (Matz), the creator of Ruby, often says that "… Ruby is designed to make programmers happy." He also says "…I designed Ruby to minimize my surprise." This works for me: Ruby is a pleasure to program in and (once you get used to it) easy to read.

³http://github.com/spatten/thes3cookbook

This is not to say that Ruby is the best and only language out there. It just happens to be one that is well suited to this book, so I went with it. I've written a quick intro to Ruby in Appendix A, A Short Introduction to Ruby.

Chapter 1. What is S3, and what can I use it for?

The purpose of this section of the book is to show you what S3 can be used for. To that end, I’ve talked to a couple of companies who are doing interesting things with S3. I’ll also talk about some of the more common use cases.

Backups

The first thing you think of when you hear about S3 is backups. It's quite a nice solution for this: you can easily back up any type of file and the storage is pretty cheap. Personally, I use it to back up all of my pictures. I used to back them up onto a rewriteable CD, but that was flaky and time consuming. Then I used various solutions such as synchronizing multiple computers on my home network. This is fine for most uses, but pictures of my kids are irreplaceable, so I prefer an online backup. That way, even if my house burns down, the big quake hits Vancouver or all of my computers are stolen, I know that my pictures are safe. Yes, it's a little paranoid, but that's what backups are all about!

The big win for S3, however, is how easy it is to back up just about anything. There's no GUI or Web interface to work around - it's designed for people like you and me: people who can code. You can use it to back up your SVN repositories, your databases or user generated content on your website. Not only that, you can also share your backed up files with others in creative ways.

Serving Data

The second thing you might want to do with S3 is serve data to your users. This data might be static data for your site ("Using S3 as an asset host"), user generated data ("Serving user generated data from s3"), a Bit Torrent for a large media file ("Seeding a bit torrent") or a file that only authenticated people can access ("Giving access to a bucket or object with a special URL" and "Giving another user access to an object or bucket using S3SH"). When you are serving data, you will probably want to keep track of what is being viewed and by whom. This is discussed in "Determining logging status for a bucket" to "Accessing your logs using S3stat".

Use Cases

There are a whole bunch of people out there using S3 in lots of different ways. I wanted to get in touch with some of them to find out how they were using S3 and why they decided to use S3. So, I semi-randomly sent out two e-mails to some people whose work I had found interesting. Amazingly enough, both were kind enough to take the time to talk to me. Claude Courbois works for the NASDAQ Stock Exchange, and wrote a great AIR application called Market Replay that gets all of its data directly from S3. Jason Kester runs a bunch of web applications, including S3stat, Twiddla and Blogabond, which all use S3 heavily.

NASDAQ Market Replay

NASDAQ Market Replay is an Adobe AIR application that allows professional stock traders and other people interested in trading stocks to see exactly what happened when a trade is made. It gives you data, in ten minute windows, on every transaction made on a single stock. This allows the users of the application to replay the data and figure out why they got the price they got when they bought or sold the stock. You can find more about Market Replay and download a free trial copy at https://data.nasdaq.com/MR.aspx⁴. Claude Courbois of the NASDAQ OMX Group kindly agreed to talk with me about Market Replay and how and why they decided to use S3.

Market Replay By The Numbers

The data for each stock is stored in 10 minute chunks, two files per chunk (one for trades, the other for quotes). There are 40 10 minute chunks per day. There are around 6000 stocks traded on NASDAQ (3000 listed on NASDAQ, and another 3000 from the NYSE and AMEX exchanges). That makes 40 x 2 x 6000 = 480,000 new files every day. These files are all stored in two buckets (one for trades, the other for quotes). Data is never purged: they are planning to keep the data forever. There are about 260 trading days per year, which amounts to around 125 million files per year. Wow. According to Claude, what makes this all possible is that finding a file on S3 is fast, and doesn't strongly depend on the number of files in the bucket. They don't ever need to index all of the files in their buckets as they create the file names based on the stock symbol and the slice of time the data corresponds to.

Market Replay's Architecture

Market Replay is an Adobe AIR application: it is written in Flex and ActionScript, and runs inside the Adobe AIR runtime on the user's computer. When you're building a Flex or AIR app that talks to S3, one of the key things is to figure out how to authenticate to S3 without compiling your secret key into the application. Market Replay gets its data by making requests to a server, which grabs the data from S3 and then sends it back down to the AIR application.

⁴https://data.nasdaq.com/MR.aspx

Claude also looked into using the server to generate authenticated URLs for the AIR app and allowing the application to get the data directly from S3. They may still move to this, but right now it’s not a bottleneck so they are planning on leaving it as is. Files are uploaded to S3 constantly while trading is happening. The raw trading data is massaged into a format that works well for the Market Replay app before being uploaded. Any further data manipulation and visualization is done by the AIR app. The newest data is about 15-20 minutes old. One of the design decisions made during the development of Market Replay was the size of the time slice in the data files. The size needed to be small enough to keep download and data processing time low, keeping the application responsive. Smaller files also mean that the data you are getting is fresher: if the time slice was an hour, then data wouldn’t be fresher than an hour. It had to be large enough that users didn’t need to make multiple requests to view a single trade. Also, S3 charges $0.01 per 10,000 GET requests. If file sizes were too small, this might actually become a factor. In the end, they decided on storing the data in 10 minute chunks.

Why S3?

So, why did NASDAQ choose to use S3 for this application? First and foremost was the pricing structure of S3. Not just the low cost (although that was important), but the predictability of costs. Claude could easily calculate how much his storage costs would be, and how much adding another customer would increase transfer costs. Having solidly predictable costs allowed them to sell the idea of the product within NASDAQ. The low cost of storage on S3 allows NASDAQ to keep their historical data forever. Even with the huge number of files they're putting on S3, Claude still pays the monthly Amazon S3 bill on his corporate credit card.

Why Not S3?

When I asked Claude what problems they've had with S3, he had to think for a bit. They have had no problems with the service itself. One drawback that they have thought about with S3 is that it limits what you can do with the application. As an example, it would be hard to do things like find out the highest price on a given stock in the last 30 days. For what they wanted to do, this wasn't a requirement. They think they could do this by running an EC2 instance that parsed the data nightly and filled up a relational database with the results, increasing the granularity of the data in order to make the amount of storage manageable. If you are building a web application which needs to make requests like this and doesn't need the huge storage of data they require, then perhaps a traditional relational database or even Amazon's Simple DB web service would be more appropriate.

Jason Kester

Jason Kester is a man of many web apps. I wanted to talk to Jason when I read about S3stat (http://s3stat.com⁵), a web application that parses your S3 log information and gives you nice graphical analytics as a result. Then I realized that Jason also has two other web apps that make heavy use of S3. Twiddla (http://twiddla.com⁶) is an online whiteboarding app. Blogabond (http://blogabond.com⁷) is a blog site for world travellers. Jason and I talked about all three in our conversation.

S3stat

S3stat was created when Jason started using S3 and missed his daily web-analytics hit. He set up logging for his buckets, but found it a bit painful. In order to reduce that pain for others, he created S3stat to help others analyze their logs. To find out how to set up S3stat for your own buckets, see “Accessing your logs using S3stat” (which was written by Jason). If you want to go through the pain yourself, then you can check out “Enabling logging on a bucket” and “Parsing logs”, which tell you how to enable logging on a bucket and how to parse the results. S3stat uses EC2 as well as S3. Every day, S3stat starts up an EC2 instance, grabs all of the logs for all of S3stat’s users, and parses them. The results are placed in a bucket owned by the appropriate user. Once the parsing is done, the EC2 instance is shut down for the day. The hard part of the parsing is that the logs created by S3 are not in a standard format. Jason converts the log files into Webalyzer format, so that it’s easy to create graphs of the results. You might need something a little different. See “Parsing log files to find out how many times an object has been accessed” for an example of parsing S3 logs to find out how many times a single object has been accessed.

Twiddla

Twiddla is an online whiteboarding web app. It’s quite useful if you want to talk about an image or layout with a bunch of people who aren’t in the same room. When you use it, you upload files and allow other people in your session to view them. S3 comes into play with the storing and sharing of images. Files are uploaded to S3, and then everyone else in your session gets a time limited authenticated URL (see “Giving access to a bucket or object with a special URL” for more information on generating authenticated URLs). The limited time that the authenticated URLs are available for gives an extra level of security.

⁵http://s3stat.com
⁶http://twiddla.com
⁷http://blogabond.com

Blogabond

Blogabond is a blogging platform for world travellers. People travel around the world, blog about it, mark the places they’ve been on a map and - here’s where S3 comes in - upload pictures. Lots of pictures. The pictures are uploaded to the Blogabond server, resized, and then uploaded to S3. Files need to be resized as they are typically the full size images from a camera. When someone reads your Blogabond posts, the images are served directly from S3.

Why S3?

Why did Jason choose S3 for his web applications? Well, for S3stat, it's kind of obvious. For the rest of them, it was due to a few things. First, Jason trusts Amazon. They do lots of things that would feel invasive coming from other companies (like when you log on to Amazon.com and it tells you what books you should read next, and it's right!), but somehow they manage to do it without feeling creepy. Second, Jason likes the design philosophy of Amazon Web Services: build something useful, and then charge a low price purely based on usage. For example, the lack of a minimum monthly fee or signup costs. Plus, every once in a while you get an e-mail from Amazon saying the prices have gone down. You've gotta like that! Finally, Jason has found the people at Amazon very open to communication. S3stat is a pretty unique application, so Jason had some conversations with a bunch of people inside of Amazon regarding how he was doing things and whether it was okay with them. He found them very open and responsive.

Chapter 2. S3's Architecture

A Quick, Tongue-in-Cheek Overview

Whenever I start explaining S3's architecture to someone, part of me wants to affect a thick, drawling accent and say "Well, there's these buckets, see, and, well, you put stuff in them." That's about all there is to it: no nesting, no directories, nothing but buckets and the objects you put in them. Before you decide to skip the rest of this chapter, remember that the devil is in the details (you might say that programming is quite devilish, as it is all about details). Keep reading to learn all about the little things that are going to trip you up as you delve deeper into S3.

Amazon S3 and REST

What is REST?

Amazon S3 has two APIs: the SOAP API and the REST API. The SOAP API is basically ignored in this book. Why is that? Well, it's partly a personal decision. I love playing with and building REST APIs. SOAP APIs I try my best to stay away from. Second, I firmly believe that REST is both simpler to understand and simpler to implement. Finally, we are all used to using RESTful web services, whether we know it or not. There's this technology called the World Wide Web that is built on it.

So, what is REST? It stands for "Representational State Transfer", and it was first described by Roy Fielding in his PhD thesis (http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm). There's no need to plow through the whole thesis. Here's REST in a nutshell.

REST is a Resource Oriented Architecture (ROA). A Resource Oriented Architecture constrains the number of verbs you are using, but leaves the number of nouns that you are acting on with those verbs unlimited. A RESTful web service restricts the verbs you are using to five: GET, POST, PUT, DELETE and HEAD. Each resource will respond to some or all of the RESTful verbs. Each verb has a specific task, as shown in the following table.

Table 2.1. The RESTful Verbs

Verb     Action                                                   Idempotent?
GET      Responds with information about the resource             Yes
POST     Creates a sub-resource of the resource being POSTed to   No
PUT      Creates or updates the resource being PUT to             Yes
DELETE   Deletes the resource                                     Yes
HEAD     Gets metadata about the resource                         Yes

Idempotency and REST Idempotency and idempotent are two words you see thrown around a lot when people are discussing RESTful architectures. What is it? A request is idempotent if you can make the same request multiple times in a row and get the same response every time. This means that making a GET or a HEAD request should make no significant changes to a resource. (Nobody’s going to complain if you break strict idempotency by doing things like updating the number of times a resource has been accessed every time you GET it.) Making GET and HEAD requests idempotent is sort of the default: you have to work hard to not do this. It’s the PUT and DELETE verbs that cause all the trouble here. A DELETE request is only idempotent if you can delete the same resource multiple times in a row. This means that you must give a SUCCESS response even if the resource has already been deleted. The same thing holds for a PUT request: you must be able to create the same resource multiple times with a PUT request, and get a SUCCESS (and exactly the same created object) every time.

Amazon S3 And REST

Every object and bucket on S3 has a resource. You might think of it as the URL or path to the object or bucket. For example, an object with a key of claire.jpg that is stored in the bucket with a name of spattenphotos will have a resource of /spattenphotos/claire.jpg. The spattenphotos bucket has a resource of /spattenphotos. To create a new object, you make a PUT to the object's resource. The object's value will be the body of the request. Making a PUT to /spattenphotos/scott.jpg will create a new object in the spattenphotos bucket with a key of scott.jpg. If I wanted to delete that object, I would make a DELETE request to /spattenphotos/scott.jpg. To see the contents of the object, I make a GET request to it. To see its meta-data without having to download the whole object, I make a HEAD request. Objects don't respond to POST requests. Buckets act in the same way: I create a new bucket by making a PUT request to its resource. I delete it with a DELETE request. I find its contents with a GET request. Buckets don't respond to either HEAD or POST requests. You might be thinking that this is pretty limited. What if I want to, for example, change the permissions on a bucket? That's where sub-resources come in. Both buckets and objects have an acl sub-resource which contains information about the permissions on an object. The acl sub-resource is created by tacking ?acl on to the end of the bucket or object's resource. So, the spattenphotos bucket has its acl sub-resource at spattenphotos?acl. (ACL stands for Access Control List. See "Understanding access control policies" for more information on them.) You set the permissions on a bucket or object by making a PUT request to its acl sub-resource. You read its permissions by making a GET request to the acl sub-resource. Buckets have another sub-resource, logging, that responds to GET and PUT requests. Objects have a torrent sub-resource that responds only to GET requests.

Getting sort of repetitive, isn't it? Yep, and that's the whole point. There's no need to memorize a whole list of methods for each resource. This makes the coding simpler. Here's a table of all of the resources available in S3, and what the RESTful verbs do to them. Notice that none of the resources in S3 respond to POSTs. That's okay: there's nothing that says they should, and it makes things simpler.

Table 2.2. REST and S3

Bucket (/<bucketname>)
  GET: lists the bucket's objects
  PUT: creates the bucket
  DELETE: deletes the bucket

Object (/<bucketname>/<key>)
  GET: the object's content
  PUT: creates or updates the object
  DELETE: deletes the object
  HEAD: the object's metadata

ACL sub-resource (/<bucketname>?acl or /<bucketname>/<key>?acl)
  GET: reads the ACL
  PUT: sets or updates the ACL

logging sub-resource (/<bucketname>?logging)
  GET: the logging status for the bucket
  PUT: sets logging for the bucket

torrent sub-resource (/<bucketname>/<key>?torrent)
  GET: the BitTorrent file for the object

Keep this table in mind as you read through the rest of this chapter. It should help you to understand what's going on a bit more easily. For more reading about RESTful architecture, I highly recommend Leonard Richardson and Sam Ruby's RESTful Web Services.
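To make the mapping concrete, here's a rough sketch of those verbs in action, written against the S3Lib request helper that the recipes introduce later ("Making a request using s3Lib"). The bucket and file names are made up, and I'm assuming here that sub-resource query strings like ?acl can be passed straight through; the ACL recipes show the supported interface.

#!/usr/bin/env ruby
require 'rubygems'
require 's3lib'

# Create (or replace) an object by PUTting to its resource
S3Lib.request(:put, 'spattenphotos/scott.jpg',
              :body => File.read('scott.jpg'),
              'content-type' => 'image/jpeg')

# GET the object's value, or just its metadata with HEAD
S3Lib.request(:get, 'spattenphotos/scott.jpg')
S3Lib.request(:head, 'spattenphotos/scott.jpg')

# Sub-resources work the same way: GET the bucket's ACL
S3Lib.request(:get, 'spattenphotos?acl')

# And DELETE removes the object
S3Lib.request(:delete, 'spattenphotos/scott.jpg')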

Buckets

A bucket is a container for data objects. Besides the objects it contains, a bucket has a name, owner, Access Control Policy and a location. Buckets cannot contain other buckets, so there is no nesting of buckets. Each AWS account can have up to 100 buckets. There is no limit on the number of objects that can be placed in a bucket.

Figure 2.1. The Amazon S3 Service has many buckets

Figure 2.2. Each bucket has a name, an owner, an Access Control Policy, logging info, a location and many objects.

Bucket Names

When you create a bucket, you give it a name. Each bucket within S3 must have a unique name. If you try to create a bucket with a name that already exists, you will get a BucketAlreadyExists error. A bucket's name cannot be changed after it is created. You can always, of course, make a new bucket and copy the contents of the old bucket into it. A bucket name must conform to a few rules. The following is lifted almost verbatim (with permission) from the Amazon S3 API documentation, currently found at http://docs.amazonwebservices.com/AmazonS3/2006-03-01/BucketRestrictions.html.

• Bucket names can only contain lowercase letters, numbers, periods (.), underscores (_), and dashes (-).
• Bucket names must start with a number or letter.
• Bucket names must be between 3 and 255 characters long.
• Bucket names cannot be in an IP address style (e.g., 192.168.5.4).

As well, you will often want to use your bucket name as part of a URL, so here are a few other recommendations:

• Bucket names should not contain underscores (_).
• Bucket names should be between 3 and 63 characters long.
• Bucket names should not end with a dash.
• Dashes cannot appear next to periods. For example, "my-.bucket.com" and "my.-bucket" are invalid.
• Bucket names should not contain upper-case characters.
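If you want to sanity-check a name before trying to create a bucket with it, something along these lines will do. This is a rough sketch that enforces the recommendations above, not Amazon's exact validation.

# A rough, DNS-friendly bucket name check based on the recommendations
# above. This is a sketch, not Amazon's official validation.
def dns_friendly_bucket_name?(name)
  return false unless name.length.between?(3, 63)
  return false unless name =~ /\A[a-z0-9][a-z0-9.-]*\z/  # lowercase letters, numbers, dots, dashes
  return false if name =~ /\A\d+\.\d+\.\d+\.\d+\z/       # no IP-address style names
  return false if name[-1, 1] == '-'                     # should not end with a dash
  return false if name.include?('.-') || name.include?('-.')
  true
end

dns_friendly_bucket_name?('spattenphotos')  # => true
dns_friendly_bucket_name?('my_new_bucket')  # => false (underscores)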

Bucket URL

A bucket’s URL is given by

http://s3.amazonaws.com/<bucketname>

Bucket Ownership

A bucket is owned by its creator, and that ownership cannot be transferred. You cannot create a bucket anonymously. If you want to give others access to a bucket that you own, then you need to edit the bucket’s access control policy.

Bucket Sub-resources

A bucket also has Access Control Policy and logging sub-resources. The Access Control Policy sub-resource determines who can list or change items in the bucket. By default, only the owner of a bucket can view or change a bucket and its contents. You can change this by changing a bucket's Access Control Policy (See "Access Control Policies"). The logging sub-resource allows you to enable, disable and configure logging of requests made to a bucket. This is explained in "Enabling logging on a bucket".

Creating and Deleting Buckets

You create a bucket by making an HTTP PUT request to the bucket's URL. If you try to create a bucket that already exists and is owned by someone else, you will get a BucketAlreadyExists error. Re-creating a bucket that you own has no effect. You delete a bucket by making an HTTP DELETE request to the bucket's URL. A bucket cannot be deleted unless it is empty.

Listing a Bucket’s Contents

To list a bucket's content, you make an HTTP GET request to the bucket's URL. The response to your GET request will be an XML object telling you all of the bucket's properties and listing all of the objects it contains. For more information, see "Listing All Objects in a Bucket".

A bucket can contain an unlimited number of objects. Listing the complete contents of a bucket can get unwieldy once the number of objects gets large. Also, Amazon will never return more than 1000 objects in a bucket listing. Luckily, there are a number of ways to filter the objects that are returned when you list a bucket's contents. I won't go into detail here. Instead I will point to the recipes that explain how these methods are used.

Bucket filtering methods

max-keys Sending this parameter in the query string will limit the number of objects returned in the bucket listing. By default, max-keys is 1000. It can never be set to more than 1000. max-keys is used in conjunction with marker to paginate results. "Paginating the list of objects in a bucket" explains how to paginate the list of objects in a bucket. "Listing All Objects in a Bucket" explains how to make sure you get all of the objects in a bucket.

prefix If prefix is set, only objects with keys that begin with that prefix will be listed (see "Listing only objects with keys starting with some prefix"). prefix is also used in conjunction with delimiter (see below).

marker If marker is set, only objects with keys that occur alphabetically after marker will be returned. This is used in conjunction with max-keys to paginate results in "Paginating the list of objects in a bucket".

delimiter A delimiter is always used in conjunction with prefix. It is used to help you move around in directory structures within your bucket. Its use is best explained by example, so go check out "Listing objects in folders".
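As a taste of what those recipes cover, here's a rough sketch of passing these parameters through the AWS/S3 library. The bucket name is made up, and treat the exact option spellings as an assumption; the recipes have the details.

require 'rubygems'
require 'aws/s3'
include AWS::S3

AWS::S3::Base.establish_connection!(
  :access_key_id     => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

# List at most 10 objects whose keys start with "photos/".
# :max_keys and :prefix become the max-keys and prefix query-string
# parameters described above.
objects = Bucket.objects('spattenphotos', :max_keys => 10, :prefix => 'photos/')
objects.each { |object| puts object.key }

# Fetch the next "page" by passing the last key seen as the marker.
next_page = Bucket.objects('spattenphotos', :max_keys => 10, :marker => objects.last.key)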

S3 Objects

An S3 Bucket is just a container for S3 Objects. In many ways, an S3 Object is just a container for data. Amazon S3 has no knowledge of the contents of your objects. It just stores them as a bunch of bits. There are a few other attributes to an S3 Object, though. Every S3 Object must belong to a bucket. It also has a key, owner, value, meta-data and an Access Control Policy.

Object Keys

An object’s key is its name. The name cannot be changed after it has been created and must be unique within a bucket. An Object’s key can be any sequence of between 0 and 1024 Unicode UTF-8 characters. This means that you can have a key with a name of length zero (the empty string, "").

Object Values

The value is the data that an object contains. The value is a sequence of bytes; Amazon S3 doesn’t really care what it is. The only limitation is that it must be less than 5 GB in size.

Object Metadata

An object can have two types of metadata: metadata supplied by S3 (system metadata), and metadata supplied by the user (user metadata). The metadata supplied by Amazon S3 are (at the time of writing) the following:

Last-Modified The date and time the object was last modified

ETag An MD5 Hash of the Object's value. You can use this to determine if the Object's value is the same as another chunk of data that you have access to. This is discussed in more detail in "Synchronizing a Directory".

Content-Type The object’s MIME type. If you don’t provide a content type, it defaults to binary/octet-stream (see http://docs.amazonwebservices.com/AmazonS3/2006-03-01/RESTObjectPUT.html).

Content-Length The length of the Object’s value, in bytes. This length does not include the key or the metadata.

You can also add your own, custom metadata to an object. For example, if you are storing a photograph on S3, you might want to store the date the photo was taken and the names of the people in the picture. See “Reading an object’s metadata” and “Adding metadata to an object” to learn how to set and read an Object’s user metadata.
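For a flavour of what those recipes describe: user metadata travels as x-amz-meta-* request headers. Here's a rough sketch using the S3Lib request helper from the recipes; the bucket, file and header names are made up for illustration.

require 'rubygems'
require 's3lib'

# Store a photo along with two pieces of user metadata.
# Custom metadata is sent as x-amz-meta-* request headers.
S3Lib.request(:put, 'spattenphotos/claire.jpg',
              :body => File.read('claire.jpg'),
              'content-type'      => 'image/jpeg',
              'x-amz-meta-taken'  => '2008-06-14',
              'x-amz-meta-people' => 'Claire')

# A HEAD request returns the system and user metadata
# without downloading the object's value.
S3Lib.request(:head, 'spattenphotos/claire.jpg')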

Bit Torrents

Any object that is publicly readable may also be downloaded using a BitTorrent client. “Seeding a bit torrent” explains how this is done.

Access Control Policies

Both buckets and objects have Access Control Policies (ACP). An Access Control Policy defines who can do what to a given Bucket or Object. ACPs are built from a list of grants on that object or bucket. Each grant gives a specific user or group of users (the grantee) a permission on that bucket or object. Grants can only give access. An object or bucket without any grants on it can be neither read nor written. "Understanding access control policies" will get you started by explaining how ACLs work. After that, you might want to read about "Reading a bucket or object's ACL", "Granting public read access to a bucket or object using S3SH" or "Giving another user access to an object or bucket using S3SH".

Logging Object Access

If you want to get some analytics on who is using the objects in one of your buckets, you can turn on logging for that bucket. "Determining logging status for a bucket" explains how to find out if logging is enabled for a bucket, and, if it is, what the logging settings are. "Enabling logging on a bucket" shows you how to turn logging on for a bucket, how to set what bucket the logs are sent to and how to prepare the bucket that is receiving the logs. "Allowing someone else to read logs in one of your buckets" shows you how to give someone else read access to your logs. "Logging multiple buckets to a single bucket" shows you how to collect all of your buckets' logs into a single logging bucket, and how to set a prefix so you can tell which bucket the logs came from. "Parsing logs" and "Parsing log files to find out how many times an object has been accessed" discuss parsing logs and extracting data from them. Finally, if you don't want to do any of this yourself, "Accessing your logs using S3stat" shows you how to use s3stat.com to parse your logs for you.

S3 Recipes

Signing up for Amazon S3

Signing up for any of the Amazon Web Services is a two step process. First, sign up for an Amazon Web Service Account. Second, sign up for that specific service.

Signing up for Amazon Web Services

Note: If you have already signed up for Amazon Elastic Compute Cloud or any other Amazon Web Services, you can skip this step. Go to http://aws.amazon.com⁸. On the right sidebar, there's a link that says 'sign up today'. Click on that.

You can use an already existing Amazon account, or sign up for a new account. If you use an existing account, you won’t have to enter your address or credit card information. Once you have signed up, Amazon will send you an e-mail, which you can safely ignore for now. You will also be taken to a page with a set of links to all of the different Amazon Web Services. Click on the ‘Amazon Simple Storage Service’ link.

Signing up for Amazon Simple Storage Service

If you have just followed the directions above, you will be looking at the correct web page. If not, go to http://s3.amazonaws.com⁹. On the right hand side of the page, you will see a button labeled 'Sign up for this web service'. Click on it, scroll to the bottom, and enter your credit card information. On the next page, enter your billing address (if you are using an already existing Amazon account, you won't have to enter this information). Once you are done, click on 'complete sign up'.

⁸http://aws.amazon.com
⁹http://s3.amazonaws.com

Your access id and secret

You will get a second e-mail from Amazon with directions on getting your account information and your access key. Click on the second link in the e-mail or go to http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key¹⁰. On the right-hand side of the page, you will see your access key and secret access key. You will need to click on 'show' to see your secret access key.


Next Steps

Once you have signed up, you will want to install Ruby and the AWS/S3 gem (“Installing Ruby and the AWS/S3 Gem”) and set up s3sh (“Setting up the S3SH command line tool”) and s3lib (“Installing the S3Lib library”). These tools are used in almost all of the rest of the recipes.

¹⁰http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key

Installing Ruby and the AWS/S3 Gem

The Problem

You want to use the AWS/S3 library to follow the examples in this book. You will need to install Ruby first as well.

The Solution

The AWS/S3 Gem is a Ruby library for talking to Amazon S3 written by Marcel Molina. It wraps the S3 REST interface into an elegant Ruby library. Full documentation for the library can be found at amazon.rubyforge.org¹¹. It also comes with a command line tool to interact with S3, s3sh. We will be using s3sh and the AWS/S3 library in many of the S3 Recipes, so it will be worth your while to install it. There are three steps to this process:

• Install Ruby
• Install RubyGems
• Install the AWS/S3 Gem

Installing Ruby

First, check to make sure that you don’t already have Ruby installed. Try typing

$> ruby -v

at the command prompt. If it's not installed, read the section specific to your operating system for installation directions. If none of those options work for you, then you can download the source code or pre-compiled packages at http://www.ruby-lang.org/en/downloads/¹².

On Windows On Windows, the easiest way to install Ruby and RubyGems is via the ‘One-click Ruby Installer’. Go to http://rubyinstaller.rubyforge.org/wiki/wiki.pl¹³, download the latest version and run the executable. This will install Ruby and RubyGems.

On OS X If you are using OS X 10.5 (Leopard) or greater, Ruby and RubyGems will be installed when you install the XCode developer tools that came with your computer. On earlier versions of OS X, Ruby will be installed but you will have to install RubyGems yourself.

¹¹http://amazon.rubyforge.org
¹²http://www.ruby-lang.org/en/downloads/
¹³http://rubyinstaller.rubyforge.org/wiki/wiki.pl

On Unix You most likely have Ruby installed on your Unix machine. If not, use your package manager to get it. On Redhat, run yum install ruby; on Debian, run apt-get install ruby. If you want to roll your own or are using a more esoteric version of Unix, download the source code or pre-compiled packages at http://www.ruby-lang.org/en/downloads/¹⁴.

Installing RubyGems

RubyGems is the package manager for Ruby. It allows you to easily install, uninstall and upgrade packages of Ruby code. Before trying to install it, check to make sure that you don't already have it installed. Try typing

$> gem

at the command prompt. If it’s not installed, then do the following:

• Download the latest version of RubyGems from RubyForge at http://rubyforge.org/frs/?group_id=126¹⁵
• Uncompress the package you downloaded into a directory
• cd into the directory and then run the setup program:

$> ruby setup.rb

Installing the AWS/S3 gem

Once you have Ruby and RubyGems installed, installing the Amazon Web Services S3 Gem is simple. Just type

$> gem install aws-s3

or

$> sudo gem install aws-s3

at the command prompt. You should see something similar to this:

¹⁴http://www.ruby-lang.org/en/downloads/
¹⁵http://rubyforge.org/frs/?group_id=126

$> sudo gem install aws-s3
Successfully installed aws-s3-0.4.0
1 gem installed
Installing ri documentation for aws-s3-0.4.0...
Installing RDoc documentation for aws-s3-0.4.0...

Setting up the S3SH command line tool

One of the great tools that comes with the AWS/S3 gem is the s3sh command line tool. You need to have Ruby and the AWS/S3 gem installed (“Installing Ruby and the AWS/S3 Gem”) before going any farther with this recipe. Once you have installed the AWS/S3 gem, you should be able to start up s3sh by typing s3sh at the command prompt. After a few seconds, you will see a new prompt that looks like ‘>>’. You can use the Base.connected? command from the AWS/S3 library to see if you are connected to S3.

$> s3sh
>> Base.connected?
=> false
>>

The Base.connected? command is returning false, telling us that you are not connected to S3. To connect to S3, you need to provide your authentication information: your AWS ID and your AWS secret. There are two ways to do this: the hard way and the easy way. Let’s do the hard way first. The hard way isn’t all that hard. You use the Base.establish_connection! command from AWS/S3 library to connect to S3.

>> Base.establish_connection!(:access_key_id => 'your AWS ID',
                              :secret_access_key => 'your AWS secret')
>> Base.connected?
=> true
>>

The hard part is that you'll have to do that every time you start up s3sh. If you're lazy like me, you can avoid this by setting two environment variables. AMAZON_ACCESS_KEY_ID should be set to your AWS ID, and AMAZON_SECRET_ACCESS_KEY should be set to your AWS secret. I'm not going to go into the gory details of how you do this. If you have them set correctly, you will automatically be authenticated with S3 when you start up s3sh.
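(On a Bourne-style shell, a couple of export lines like the following in your ~/.bashrc or ~/.profile are usually enough; the values here are obviously placeholders.)

export AMAZON_ACCESS_KEY_ID=my_aws_id
export AMAZON_SECRET_ACCESS_KEY=my_aws_secret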

$> env | grep AMAZON
AMAZON_ACCESS_KEY_ID=my_aws_id
AMAZON_SECRET_ACCESS_KEY=my_aws_secret
$> s3sh
>> Base.connected?
=> true
>>

Now that you are connected, you can play around a little. Try the following recipes for some inspiration:

• "Creating a bucket"
• "Uploading a file to s3"
• "Downloading a File From S3"

Installing the S3Lib library

The Problem

You want to use S3Lib to follow along with the recipes or to fool around with S3 requests.

The Solution

Install the S3Lib gem with one of the following commands. Use the sudo version if you’re on a Unix or OS X system, the non-sudo version if you’re on Windows or using rvm or rbenv.

$> sudo gem install s3lib

C:\> gem install s3lib

Once you have the gem installed, follow the directions in “Setting up the S3SH command line tool” to set up your environment variables.

Discussion

Test out your setup by opening up an s3lib session and trying the following:

$> s3lib
>> S3Lib.request(:get, '').read
=> "<ListAllMyBucketsResult>
  ...
</ListAllMyBucketsResult>"

If you get a nice XML response showing a list of all of your buckets, everything is working properly. If you get something that looks like this, then you haven't set up the environment variables correctly:

$> s3lib
>> S3Lib.request(:get, '')
S3Lib::S3ResponseError: 403 Forbidden
amazon error type: SignatureDoesNotMatch
        from /Library/Ruby/Gems/1.8/gems/s3-lib-0.1.3/lib/s3_authenticator.rb:39:in `request'
        from (irb):1

Make sure you've followed the directions in "Setting up the S3SH command line tool", then try again.

Making a request using s3Lib

The Problem

You want to make requests to S3 and receive back unprocessed XML results. You might just be experimenting, or you might be using S3Lib as the basis for an S3 library.

The Solution

Make sure you’ve installed S3Lib as described in “Installing the S3Lib library”. Then, require the S3Lib library and use S3Lib.request to make your request. Here’s an example script:

#!/usr/bin/env ruby

require 'rubygems'
require 's3lib'

puts S3Lib.request(:get, '').read

To use S3Lib in an interactive shell, use irb, requiring s3lib when you invoke it:

$> irb -r s3lib
>> puts S3Lib.request(:get, '').read
<ListAllMyBucketsResult>
  ...
</ListAllMyBucketsResult>

Discussion

The S3Lib::request method takes three arguments, two of them required. The first is the HTTP verb that will be used to make the request. It can be :get, :put, :post, :delete or :head. The second is the URL that you will be making the request to. The final argument is the params hash. This is used to add headers or a body to the request. If you want to create an object on S3, you make a PUT request to the object's URL. You will need to use the params hash to add a body (the content of the object you are creating) to the request. You will also need to add a content-type header to the request. Here's a request that creates an object with a key of new.txt in the bucket spatten_test_bucket with a body of 'this is a new text file' and a content type of 'text/plain'.

S3Lib.request(:put, 'spatten_test_bucket/new.txt',
              :body => "this is a new text file",
              'content-type' => 'text/plain')

The response you get back from an S3Lib.request is a Ruby IO object. If you want to see the actual response, use .read on the response. If you want to read it more than once, you’ll need to rewind between reads:

$> irb -r s3lib
>> response = S3Lib.request(:get, '')
=> #<...>
>> puts response.read
<ListAllMyBucketsResult>
  ...
</ListAllMyBucketsResult>
>> puts response.read

>> response.rewind
>> puts response.read
<ListAllMyBucketsResult>
  ...
</ListAllMyBucketsResult>

Getting the response with AWS/S3

The Problem

You have made a request to S3 using the AWS/S3 library, and you want to see the response status and/or the raw XML response.

The Solution

Use Service.response to get both:

$> s3sh
>> Bucket.find('spattentemp')
>> Service.response
=> #<...>
>> Service.response.code
=> 200
>> Service.response.body
=> "<ListBucketResult>
  <Name>spattentemp</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>acl.rb</Key>
    <LastModified>2008-09-12T18:45:27.000Z</LastModified>
    <ETag>"87e54e8253f2be98ec8f65111f16980d"</ETag>
    <Size>4141</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>

  ....

  <Contents>
    <Key>service.rb</Key>
    <LastModified>2008-09-12T18:45:22.000Z</LastModified>
    <ETag>"98b9dce82771bbfec960711235c2d445"</ETag>
    <Size>455</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>"

Discussion

There are a lot of other useful methods that Service.response responds to. Two you might use are Service.response.parsed, which returns a hash obtained from parsing the XML, and Service.response.server_error?, which returns true if the response was an error and false otherwise.

>> Service.response.parsed
=> {"prefix"=>nil, "name"=>"spattentemp", "marker"=>nil, "max_keys"=>1000,
    "is_truncated"=>false}
>> Service.response.server_error?
=> false

Installing The FireFox S3 Organizer

The Problem

You want a GUI for your S3 account, and you’ve heard the S3 FireFox organizer is pretty good.

The Solution

In FireFox, go to http://addons.mozilla.org¹⁶ and search for 'amazon s3 organizer'. Click on the 'Add to FireFox' button for the 'Amazon S3 FireFox Organizer (S3Fox)'. Follow the installation instructions, and then restart FireFox. There will now be a 'S3 Organizer' entry in the Tools menu. Click on that, and you'll see something like this:

Figure 3.1. S3Fox alert box

Click on the ‘Manage Accounts’ button and then enter a name for your account along with your Access Key and Secret Key. After clicking on ‘Close’, you should see a list of your buckets.

Discussion

For a list of tools that work with S3, including some other GUI applications, see this blog post at elastic8.com: http://www.elastic8.com/blog/tools_for_accessing_using_to_backup_your_data_to_and_from_s3.html¹⁷

¹⁶http://addons.mozilla.org
¹⁷http://www.elastic8.com/blog/tools_for_accessing_using_to_backup_your_data_to_and_from_s3.html

Working with multiple s3 accounts

The Problem

If you're like me, you have a number of clients all with different S3 accounts. Using your command line tools to work with their accounts can be annoying as you have to copy and paste their access_key and amazon_secret_key into the correct environment variables every time you change accounts. This recipe provides a quick way of switching accounts.

The Solution

The first thing you need to do is create a file called .s3_keys.yml in your home directory. This is a file in the YAML format (YAML stands for "YAML Ain't Markup Language". The official web-site for YAML is at http://www.yaml.org/¹⁸). Make an entry in the file for each S3 account you have. It should look something like this:

Example 3.4. .s3_keys.yml (see code/working_with_multiple_s3_accounts_recipe/.s3_keys.yml in the book's code repository)
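The original listing lives in the book's code repository; as a rough sketch, such a file could look like the following, with account names matching the examples below and key names that are my assumption rather than a documented format.

# ~/.s3_keys.yml -- one entry per S3 account (a sketch only; the key names
# here are an assumption, not the format shipped with the S3Lib gem)
personal:
  access_key_id: MY_PERSONAL_AWS_ID
  secret_access_key: MY_PERSONAL_AWS_SECRET
client_1:
  access_key_id: CLIENT_1_AWS_ID
  secret_access_key: CLIENT_1_AWS_SECRET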

The s3sh_as program

Now we need a program that will read the .s3_keys.yml file, grab the correct set of keys, set them in the environment and then open up a s3sh shell. Here's something that does the trick:

Example 3.5. s3sh_as (see code/working_with_multiple_s3_accounts_recipe/s3sh_as in the book's code repository)
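The real script is installed with the S3Lib gem, as noted in the Discussion below; here's a minimal sketch of how such a program could work, assuming the YAML layout sketched above.

#!/usr/bin/env ruby
# s3sh_as: open an s3sh session using the keys for the named account.
# A sketch only -- the real version ships with the S3Lib gem.
require 'rubygems'
require 'yaml'

account = ARGV[0] or abort("Usage: s3sh_as <account_name>")
keys = YAML.load_file(File.join(ENV['HOME'], '.s3_keys.yml'))
abort("No keys found for account '#{account}'") unless keys[account]

# s3sh reads its credentials from these environment variables
ENV['AMAZON_ACCESS_KEY_ID']     = keys[account]['access_key_id']
ENV['AMAZON_SECRET_ACCESS_KEY'] = keys[account]['secret_access_key']

exec 's3sh'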

Discussion

To use s3sh_as, put s3sh_as somewhere in your path, and then call it like this:

$> s3sh_as <account_name>

For example, if I wanted to use my personal account, I would type

$> s3sh_as personal

If I wanted to do some work on client_1’s account, I would type

$> s3sh_as client_1

If you don’t want to type the code in yourself, then just install the S3Lib gem.

¹⁸http://www.yaml.org/

sudo gem install s3lib

When you install the S3Lib gem, a version of s3sh_as is automatically installed. The s3lib program is also installed when you install the S3Lib gem. This program, which provides a shell to play around with the S3Lib library, will read the .s3_keys.yml file just like s3sh_as, so you can use it to access multiple accounts as well.

Accessing your buckets through virtual hosting

The Problem

You want to access your buckets as either bucketname.s3.amazonaws.com or as some.other.hostname.com.

The Solution

If you make a request with the hostname as s3.amazonaws.com, the bucket is taken as everything before the first slash in the path of the URI you pass in. The object is everything after the first slash. For example, take a GET request to http://s3.amazonaws.com/somebucket/object/name/goes/here. The host is s3.amazonaws.com and the path is somebucket/object/name/goes/here. Since the host is s3.amazonaws.com, S3 parses the path and finds that the bucket is somebucket and the object key is object/name/goes/here. If the hostname is not s3.amazonaws.com, then S3 will parse the hostname to find the bucket and use the full path as the object key. This is called virtual hosting. There are two ways to use virtual hosting. The first is to use a sub-domain of s3.amazonaws.com. If I make a request to http://somebucket.s3.amazonaws.com, the bucket name will be set to somebucket. If you include a path in the URL, this will be the object key. http://somebucket.s3.amazonaws.com/some.key is the URL for the object with a key of some.key in the bucket somebucket. The second method of doing virtual hosting uses DNS aliases. You name a bucket some domain or sub-domain that you own, and then point the DNS for that domain or subdomain to the proper sub-domain of s3.amazonaws.com. For example, I have a bucket called assets0.plotomatic.com, which has its DNS aliased to assets0.plotomatic.com.s3.amazonaws.com. Any requests to http://assets0.plotomatic.com will automagically be pointed at my bucket on S3.
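For the second method, the DNS side is just a CNAME from your hostname to the bucket's s3.amazonaws.com name. In zone-file notation the record might look something like this (illustrative only; your registrar's interface will differ):

assets0.plotomatic.com.  IN  CNAME  assets0.plotomatic.com.s3.amazonaws.com.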

Discussion

The ability to do virtual hosting is really useful in a lot of cases. It's used for hosting static assets for a website (see "Using S3 as an asset host") or whenever you want to obscure the fact that you are using S3. One other benefit is that it allows you to put things in the root directory of a site you are serving. Things like robots.txt and crossdomain.xml are expected to be in the root, and there's no way to do that without using virtual hosting. There's not room here to explain how to set up DNS aliasing for every Domain Registrar out there. Look for help on setting up DNS aliases or CNAME settings. This blog post from Blogger.com gives instructions for a few common registrars: http://help.blogger.com/bin/answer.py?hl=en-ca&answer=58317¹⁹

¹⁹http://help.blogger.com/bin/answer.py?hl=en-ca&answer=58317

Creating a bucket

The Problem

You want to create a new bucket.

The Solution

To create a bucket, you make a PUT request to the bucket’s name, like this:

PUT /my_new_bucket
Host: s3.amazonaws.com
Content-Length: 0
Date: Wed, 13 Feb 2008 12:00:00 GMT
Authorization: AWS some_id:some_authentication_string

To make the authenticated request using the s3lib library:

#!/usr/bin/env ruby
require 'rubygems'
require 's3lib'

response = S3Lib.request(:put, '/my_new_bucket')

To create a bucket in S3SH, you use the Bucket.create command:

$> s3sh
>> Bucket.create('my_new_bucket')
=> true

Creating buckets virtual hosted style

You can also make the request using a virtual hosted bucket by setting the Host parameter to the virtual hosted bucket's URL:

PUT /
Host: mynewbucket.s3.amazonaws.com
Content-Length: 0
Date: Wed, 13 Feb 2008 12:00:00 GMT
Authorization: AWS some_id:some_authentication_string

There’s no way to do this using s3sh, but there’s no real reason why you need to create a bucket using virtual hosting. Here’s how you make the PUT request to a virtual hosted bucket using the s3lib library:

#!/usr/bin/env ruby
require 'rubygems'
require 's3lib'

response = S3Lib.request(:put, '/',
                         {'host' => 'newbucket.s3.amazonaws.com'})

Remember that hostnames cannot contain underscores ('_'), so you won't be able to create or use a bucket named 'my_new_bucket' using virtual hosting.

Errors

If you try to create a bucket that is already owned by someone else, Amazon will return a 409 Conflict error. In s3sh, an AWS::S3::BucketAlreadyExists error will be raised.

$> s3sh
>> Bucket.create('not_my_bucket')
AWS::S3::BucketAlreadyExists: The requested bucket name is not available.
The bucket namespace is shared by all users of the system. Please select a different name and try again.
        from /opt/local/lib/ruby/gems/1.8/gems/aws-s3-0.4.0/bin/../lib/aws/s3/error.rb:38:in `raise'
        from /opt/local/lib/ruby/gems/1.8/gems/aws-s3-0.4.0/bin/../lib/aws/s3/base.rb:72:in `request'
        from /opt/local/lib/ruby/gems/1.8/gems/aws-s3-0.4.0/bin/../lib/aws/s3/base.rb:83:in `put'
        from /opt/local/lib/ruby/gems/1.8/gems/aws-s3-0.4.0/bin/../lib/aws/s3/bucket.rb:79:in `create'
        from (irb):1

Discussion

Since a bucket is created by a PUT command, the request is idempotent: you can issue the same PUT request multiple times and have the same effect each time. In other words, Bucket.create won't complain if you try to create one of your buckets again.

>> Bucket.create('some_bucket_that_does_not_exist')
=> true
>> Bucket.create('some_bucket_that_does_not_exist') # It exists now, but that's okay
=> true

This is useful if you are not sure that a bucket exists. There’s no need to write something like this

def function_that_requires_a_bucket
  begin
    Bucket.find('some_bucket_that_may_or_may_not_exist')
  rescue AWS::S3::NoSuchBucket
    Bucket.create('some_bucket_that_may_or_may_not_exist')
  end
  ... rest of method ...
end

You can just use Bucket.create

def function_that_requires_a_bucket
  Bucket.create('some_bucket_that_may_or_may_not_exist')
  ... rest of method ...
end

One last thing to note is that Bucket.create returns true if it is successful and raises an error otherwise. Bucket.create does not return the newly created bucket. If you want to create a bucket and then assign it to a variable, you need to use Bucket.find to do the assignment.

def function_that_requires_a_bucket
  Bucket.create('my_bucket')
  my_bucket = Bucket.find('my_bucket')
  ... rest of method ...
end

Creating a European bucket

The Problem

For either throughput or legal reasons, you want to create a bucket that is physically located in Europe.

The Solution

To create a bucket that is located in Europe rather than North America, you add some XML to the body of the PUT request when creating the bucket. The XML looks like this:

<CreateBucketConfiguration>
  <LocationConstraint>EU</LocationConstraint>
</CreateBucketConfiguration>

The following code will create a European bucket named spatteneurobucket:

$> s3lib
>> euro_xml = <<XML
<CreateBucketConfiguration>
  <LocationConstraint>EU</LocationConstraint>
</CreateBucketConfiguration>
XML
>> S3Lib.request(:put, 'spatteneurobucket', :body => euro_xml, 'content-type' => 'text/xml')
=> #<...>

Discussion

There are a few things worth noting here. First, as usual, I had to add the content-type to the PUT request. Second, European buckets must be read using virtual hosting. The GET request using virtual hosting will look like this:

>> S3Lib.request(:get, '', 'host' => 'spatteneurobucket.s3.amazonaws.com').read
=> "<ListBucketResult>
  <Name>spatteneurobucket</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
</ListBucketResult>"

There are no objects in this bucket, so there are no content tags. If I try to do a standard GET request, an error is raised:

>> S3Lib.request(:get, 'spatteneurobucket')
URI::InvalidURIError: bad URI(is not URI?):
        from /opt/local/lib/ruby/1.8/uri/common.rb:436:in `split'
        from /opt/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
        ...
        from (irb):27

The requirement to use virtual hosting also means that there are extra constraints on the bucket name, as discussed in "Bucket Names". Because these constraints are required, Amazon enforces them. If you try to create, for example, a bucket with underscores in its name, Amazon will complain:

>> S3Lib.request(:put, 'spatten_euro_bucket', :body => euro_xml, 'content-type' => 'text/xml')
S3Lib::S3ResponseError: 400 Bad Request
amazon error type: InvalidBucketName
        from /Users/Scott/versioned/s3_and_ec2_cookbook/code/s3_code/library/s3_authenticator.rb:39:in `request'
        from (irb):17

Finally, if you try to create a European bucket multiple times, an error is raised by Amazon: S3 Recipes 39

>> S3Lib.request(:put, 'spatteneurobucket', :body => euro_xml, 'content-type' => 'text/xml').read
S3Lib::S3ResponseError: 409 Conflict
amazon error type: BucketAlreadyOwnedByYou
        from /Users/Scott/versioned/s3_and_ec2_cookbook/code/s3_code/library/s3_authenticator.rb:39:in `request'
        from (irb):22

This is different behavior from standard buckets, where you are able to create a bucket again and again with no problems (or effects, either).

Synchronizing two buckets

The Problem

You have two buckets that you want to keep exactly the same (you are probably using them for hosting assets, as in “Using S3 as an asset host”).

The Solution

Use conditional object copying to copy all files from one bucket to another. The following code goes through every object in the source bucket and copies it to the target bucket if either the object doesn't exist in the target bucket or if the target bucket's version of the object is different than the source bucket's version.

Example 3.7. synchronize_buckets

#!/usr/bin/env ruby
require 'rubygems'
require 'aws/s3'
include AWS::S3


module AWS
  module S3

    class Bucket

      # copies all files from current bucket to the target bucket.
      # target_bucket can be either a bucket instance or a string
      # containing the name of the bucket.
      def synchronize_to(target_bucket)
        objects.each do |object|
          object.copy_to_bucket_if_etags_dont_match(target_bucket)
        end
      end

    end

    class S3Object

      # Copies the current object to the target bucket.
      # target_bucket can be a bucket instance or a string containing
      # the name of the bucket.
      def copy_to_bucket(target_bucket, params = {})
        if target_bucket.is_a?(AWS::S3::Bucket)
          target_bucket = target_bucket.name
        end
        puts "#{key} => #{target_bucket}"
        begin
          S3Object.store(key, nil, target_bucket,
                         params.merge('x-amz-copy-source' => path))
        rescue AWS::S3::PreconditionFailed
        end
      end

      # Copies the current object to the target bucket
      # unless the object already exists in the target bucket
      # and they are identical.
      # target_bucket can be a bucket instance or a string containing
      # the name of the bucket.
      def copy_to_bucket_if_etags_dont_match(target_bucket, params = {})
        unless target_bucket.is_a?(AWS::S3::Bucket)
          target_bucket = AWS::S3::Bucket.find(target_bucket)
        end
        if target_bucket[key]
          params.merge!(
            'x-amz-copy-source-if-none-match' => target_bucket[key].etag)
        end
        copy_to_bucket(target_bucket, params)
      end

    end

  end
end

USAGE = "Usage: synchronize_buckets <source_bucket> <target_bucket>"
(puts USAGE; exit(0)) unless ARGV.length == 2
source_bucket_name, target_bucket_name = ARGV

AWS::S3::Base.establish_connection!(
  :access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

Bucket.create(target_bucket_name)
Bucket.find(source_bucket_name).synchronize_to(target_bucket_name)

You run the script like this:

$> ./synchronize_buckets spatten_s3demo spatten_s3demo_clone
eventbrite_com_errors.jpg => spatten_s3demo_clone
test.txt => spatten_s3demo_clone
vampire.jpg => spatten_s3demo_clone

Discussion

This script just screamed out for the addition of methods to the Bucket and S3Object classes. I borrowed the copy_to_bucket and copy_to_bucket_if_etags_dont_match methods from “Copying an object”, and added the Bucket.synchronize_to method.

If you want to maintain the permissions on the newly created objects, you’ll have to add functionality to copy grants or to add an :access parameter to the params hash passed to S3Object.store (see the sketch below).

This script will never delete objects from the target bucket. I’ll leave it as an exercise for the reader to add this functionality.
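For example, if everything in the target bucket should be publicly readable, one hypothetical way to thread that through is to pass a canned ACL in the params hash for each copy (this assumes you also add a params argument to synchronize_to; :access is the option S3Object.store already understands):

object.copy_to_bucket_if_etags_dont_match(target_bucket, :access => :public_read)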

Listing All Of Your Buckets

The Problem

You want to know the names of all of your buckets.

The Solution

Use the Service::buckets method from the AWS/S3 library. This will return an array of Bucket objects, sorted by creation date. If you want just the names of the buckets, then you can use collect on the array

$> s3sh
>> Service.buckets
=> [#<AWS::S3::Bucket:0x... @attributes={"name"=>"assets0.plotomatic.com", "creation_date"=>Thu Sep 06 16:25:25 UTC 2007}>,
    #<AWS::S3::Bucket:0x... @attributes={"name"=>"assets1.plotomatic.com", "creation_date"=>Thu Sep 06 16:53:18 UTC 2007}>,
    #<AWS::S3::Bucket:0x... @attributes={"name"=>"assets2.plotomatic.com", "creation_date"=>Thu Sep 06 17:18:47 UTC 2007}>,
    ....
    #<AWS::S3::Bucket:0x... @attributes={"name"=>"zunior_bucket", "creation_date"=>Sun Jul 27 18:31:07 UTC 2008}>]
>> Service.buckets.collect {|bucket| bucket.name}
=> ["assets0.plotomatic.com", "assets1.plotomatic.com", "assets2.plotomatic.com", ..., "zunior_bucket"]

Discussion

You get the listing of all of the buckets you own by making an authenticated GET request to the root URL of the Amazon S3 service: http://s3.amazonaws.com²⁰. See “Listing All of Your Buckets” in the API section for more information.

²⁰http://s3.amazonaws.com
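If you want to see the raw listing yourself, you can make that request by hand with S3Lib. This is just a sketch using the S3Lib.request helper from the earlier recipes; an empty request path means the service root, and the response body is the ListAllMyBucketsResult XML:

# An authenticated GET to the service root lists all of your buckets.
puts S3Lib.request(:get, '').read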

Listing only objects with keys starting with some prefix

The Problem

You have a bucket with a large number of files in it, and you only want to list files starting with a given string.

The Solution

Use the prefix parameter when you are requesting the list of objects in the bucket. This will limit the objects to those with keys starting with the given prefix. If you are doing this by hand, then you add the prefix by including it as a query param on the bucket’s URL

/bucket_name?prefix=<prefix>

If you are using the AWS-S3 library, then you set the prefix command like this:

$> s3sh
>> b = Bucket.find('spatten_test_bucket', :prefix => 'test')
>> b.objects.collect {|object| object.key}
=> ["test.mp3", "test1.txt"]

The interface for S3Lib.request is the same: add a :prefix key to the params hash

$> s3lib
>> S3Lib.request(:get, 'spatten_test_bucket', :prefix => 'test').read
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>spatten_test_bucket</Name>
  <Prefix>test</Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>test.mp3</Key>
    <LastModified>2008-08-14T22:24:58.000Z</LastModified>
    <ETag>"80a03d7ed8658fe3869d70d10999e4ff"</ETag>
    <Size>7182955</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>test1.txt</Key>
    <LastModified>2008-04-29T04:47:03.000Z</LastModified>
    <ETag>"fd2f80fc0ef8c6cc6378d260182229be"</ETag>
    <Size>6</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

In both cases, only the two files starting with test are returned in the object list.

Discussion

In the example above, the URL for the GET request that is actually made to S3 is http://s3.amazonaws.com/spatten_test_bucket?prefix=test. If you try doing this directly, you’ll get an error:

>> S3Lib.request(:get, 'spatten_test_bucket?prefix=test')
S3Lib::S3ResponseError: 403 Forbidden
amazon error type: SignatureDoesNotMatch
        from /Library/Ruby/Gems/1.8/gems/s3-lib-0.1.3/lib/s3_authenticator.rb:39:in `request'
        from (irb):6

The ?prefix=test has to be omitted from the URL when it is being used to sign the request. Both the AWS-S3 and the S3Lib libraries opt to add the prefix in after the URL has been calculated rather than allowing you to add it directly and stripping it out during the signature calculation.

Downloading a File From S3

The Problem

You have a file stored on S3. You want it on your hard drive. Stat!

The Solution

To get the value of an object on S3, you make an authenticated GET request to that object’s URL. Using the AWS::S3 library, you can use the S3Object::value class method to get the value of an object. The S3Object::value method takes the key and bucket of the object as its arguments:

S3Object.value(key, bucket)

Once you have read an object’s value, you can write the value to disk. Here’s a script to download an object and write its value to a file (Example 3.18, download_object); a sketch of it follows.
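This is only a sketch of what download_object might look like; the argument order matches the usage below and, as in the other scripts in this chapter, the AWS credentials are assumed to live in the AMAZON_ACCESS_KEY_ID and AMAZON_SECRET_ACCESS_KEY environment variables.

#!/usr/bin/env ruby

require 'rubygems'
require 'aws/s3'
include AWS::S3

# Usage: download_object <bucket> <key> <filename>
# Reads the object with the given key from bucket and writes it to filename.
bucket, key, filename = ARGV

AWS::S3::Base.establish_connection!(
  :access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

File.open(filename, 'w') do |file|
  file.write(S3Object.value(key, bucket))
end

Here it is in action: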

$> ./code/s3_code/download_object spatten_test_bucket viral_marketing.txt ~/viral_marketing.txt
$> more ~/viral_marketing.txt
Feel free to pass this around!

Here’s the same thing, making the GET request by hand using S3Lib (Example 3.19, download_object_by_hand); a sketch of it follows.
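This sketch assumes the S3Lib library picks up your credentials the same way it does in the earlier S3Lib recipes:

#!/usr/bin/env ruby

require 'rubygems'
require 's3lib'

# Usage: download_object_by_hand <bucket> <key> <filename>
bucket, key, filename = ARGV

# An authenticated GET to the object's URL returns its value.
response = S3Lib.request(:get, File.join(bucket, key))
File.open(filename, 'w') { |file| file.write(response.read) }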

Discussion

A more ‘Unixy’ way of doing this would be to have the download_object script output the value of the object to STDOUT. You could then redirect the output to wherever you want. This makes the script simpler, too, so it’s all good. Example 3.20. download_object_unixy

#!/usr/bin/env ruby

require 'rubygems'
require 'aws/s3'
include AWS::S3

# Usage: download_object <bucket> <key>
# Outputs the value of the object with a key of key in the bucket named bucket.
bucket, key = ARGV

AWS::S3::Base.establish_connection!(
  :access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

puts S3Object.value(key, bucket)

Without redirection, it will just output the contents of the object

$> download_object spatten_test_bucket viral_marketing.txt
Feel free to pass this around!
/Users/spatten/book

You can also redirect the output to a file

$> download_object spatten_test_bucket viral_marketing.txt > ~/viral_marketing.txt
$> more ~/viral_marketing.txt
Feel free to pass this around!

The solutions in this recipe will fail for large files, as you’re loading the whole file into memory before doing anything with it. This is solved in the next recipe, “Streaming a File From S3”.

Understanding access control policies

The Problem

You want to understand how giving and removing permissions to read and write your objects and buckets works.

The Solution

You need to learn all about the wonderful world of Access Control Policies (ACPs), Access Control Lists (ACLs) and Grants.

Both buckets and objects have Access Control Policies (ACPs). An Access Control Policy defines who can do what to a given Bucket or Object. ACPs are built from a list of grants on that object or bucket. Each grant gives a specific user or group of users (the grantee) a permission on that bucket or object. Grants can only give access: an object or bucket without any grants on it is neither readable nor writable.

Warning

The nomenclature is a bit confusing here. You’ll see references to both Access Control Policies (ACPs) and Access Control Lists (ACLs). They’re pretty much synonymous. If it helps, you can think of the Access Control Policy as being a more over-arching concept, and the Access Control List as the implementation of that concept. Really, though, they’re interchangeable. To avoid writing ‘bucket or object’ over and over in this recipe, I’m going to use resource to refer to both buckets and objects.

Grants

An Access Control List is made up of one or more Grants. A grant gives a user or group of users a specific permission. It looks like this

<Grant>
  <Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:type="grant_type">
    ... info on the grantee ...
  </Grantee>
  <Permission>permission_type</Permission>
</Grant>

The permission_type, grant_type and the information on the grantee are explained in detail below.

Grant Permissions

A grant can give one of five different permissions to a resource. The permissions are READ, WRITE, READ_ACP, WRITE_ACP and FULL_CONTROL.

Table 3.1. Grant Permission Types

READ
  Bucket: list bucket contents. Object: read an object’s value and metadata.

WRITE
  Bucket: create, over-write or delete an object in the bucket. Object: not supported for Objects.

READ_ACP
  Read the ACL for a bucket or object. The owner of a resource has this permission without needing a grant.

WRITE_ACP
  Write the ACL for a bucket or object. The owner of a resource has this permission without needing a grant.

FULL_CONTROL
  Equivalent to giving READ, WRITE, READ_ACP and WRITE_ACP grants on this resource.

The XML for a permission looks like this:

<Permission>READ</Permission>

Where READ is replaced by whatever permission type you are granting.

Grantees

When you create a grant, you must specify who you are granting the permission to. There are currently six different types of grantees.

Owner

The owner of a resource will always have READ_ACP and WRITE_ACP permissions on that resource. When a resource is created, the owner is given FULL_CONTROL access on the resource using a ‘User by Canonical Representation’ grant (see below). You will never actually create an ‘OWNER’ grant directly; to change the grant of the owner of a resource, create a grant by Canonical Representation.

User by Email

You can grant access to anyone with an Amazon S3 account using their e-mail address. Note that if you create a grant this way, it will be translated to a grant by Canonical Representation by Amazon S3. The Grantee XML for a grant by email will look like this:

<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="AmazonCustomerByEmail">
  <EmailAddress>[email protected]</EmailAddress>
</Grantee>

User by Canonical Representation

You can also grant access to anyone with an Amazon S3 account by using their Canonical Representation. See “Finding the canonical user ID” for information on finding a User’s Canonical ID. The Grantee XML for a grant by canonical user will look like the following example.

<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="CanonicalUser">
  <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
</Grantee>

AWS User Group

This will give access to anyone with an Amazon S3 account. They will have to authenticate their request with standard Amazon S3 authentication. I really can’t think of a use case for this, but it’s here for completeness. The Grantee XML will look like this:

<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="Group">
  <URI>http://acs.amazonaws.com/groups/global/AuthenticatedUsers</URI>
</Grantee>

All Users

This will give anonymous access to anyone. This is the access type I use the most. With it, anyone in the world can read an object that I put on S3. Note that signed requests for a resource with anonymous access will still be rejected unless the user doing the signing has access to the resource.

<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="Group">
  <URI>http://acs.amazonaws.com/groups/global/AllUsers</URI>
</Grantee>

Log Delivery Group

This will give access to the group that writes logs. You will need to give a WRITE and READ_ACP grant to this group on any buckets that you are sending logs to. For more information on logging, see “Enabling logging on a bucket”.

<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="Group">
  <URI>http://acs.amazonaws.com/groups/s3/LogDelivery</URI>
</Grantee>

Discussion

Here are a few things that will trip you up with ACLs.

First, if you update an Object’s value by doing a PUT to it, the ACL will be reset to the default value, giving the owner FULL_CONTROL and no access to anyone else. This is kind of nasty, and if you are writing a library for S3, you might think about changing this behavior to something more expected.

A second thing to watch out for is that if you give someone else WRITE access to one of your Buckets, you will not own the Objects they create in it. This means that you won’t have READ access to those Objects or their ACLs unless it is explicitly given by the creator. WRITE access is defined by the Bucket, so you will be able to delete or over-write any Objects in a Bucket you own. Since you don’t have read access, you won’t be able to do things like find out how big Objects not owned by you are and delete any that are too big. You will, however, have the pleasure of paying for any Objects contained in a Bucket you own.

Finally, you can’t give someone WRITE access to just one object. If you want someone to have WRITE access to an object, you have to give them WRITE access to the bucket it is contained in. This will also give them the ability to create new objects in the bucket and to delete or overwrite existing objects. You definitely never want to give WRITE access on a bucket to the AWS User Group or the All Users group.

Setting a canned access control policy

The Problem

You are creating a new object or bucket, and you want to add one (and only one) of the following permissions to it. All of the canned access control policies (ACPs) give the owner a FULL_CONTROL grant: the owner can read and write the object or bucket and its ACL.

private
  The owner has full access, and no-one else can read or write either the object or bucket or its ACL. This is useful if you want to reset a bucket or object’s ACP to private (see the discussion for a bit more on this).

public-read
  Anyone can read the object or bucket.

public-read-write
  Anyone can read or write to the object or bucket.

authenticated-read
  Anyone who has an Amazon S3 account can make an authenticated request to read the object or bucket.

log-delivery-write
  The LogDelivery group is able to write to the bucket and to read the bucket’s Access Control List (ACL). These permissions are required on buckets that you are sending logs to.

The Solution

When you create the object or bucket, send an x-amz-acl header with one of the canned ACL types as its value.

S3Lib.request(:put, 'spatten_new_bucket', 'x-amz-acl' => 'public-read')

If you are using the AWS-S3 library, then add an :access key to the parameters you send. The canned access type will be a symbol, with all dashes changed to underscores (You can do this with S3Lib as well).

Bucket.create('spatten_new_bucket', :access => :public_read)

Discussion

As I’m writing this, the AWS-S3 library doesn’t support the log-delivery-write canned ACP. That’s usually okay, though, as you can use the Bucket#enable_log_delivery method instead, which sets the log-delivery-write permissions for you while it’s turning on log delivery (see “Enabling logging on a bucket”).

If the canned access control policies don’t do what you need, or if you want to give access to only certain people, then see “Understanding access control policies” and “Giving another user access to an object or bucket using S3SH”.

If you have already created a bucket, you can re-set its access control policy to one of the canned ones by re-creating the bucket with a canned access control policy. You can do the same with objects, but you will need to upload the object’s contents and meta-data while you do it.

Keeping the Current ACL When You Change an Object

The Problem

Suppose you have an object called ‘code/sync_directory.rb’ in the bucket ‘amazon_s3_and_ec2_cookbook’. You want the object to be publicly readable, so you give a READ grant to the AllUsers group. Here’s what the ACL looks like:

$> s3sh
>> acl = S3Object.acl('code/sync_directory.rb', 'amazon_s3_and_ec2_cookbook')
>> acl.grants
=> [#<AWS::S3::ACL::Grant:0x... FULL_CONTROL to ...>,
    #<AWS::S3::ACL::Grant:0x... READ to ...>]

As you can see, I have FULL_CONTROL access and the AllUsers group has READ access. Now, suppose I change the code and upload it again to S3:

>> S3Object.store('code/sync_directory.rb', File.read('code/s3_code/sync_directory.rb'), 'amazon_s3_and_ec2_cookbook')
>> acl = S3Object.acl('code/sync_directory.rb', 'amazon_s3_and_ec2_cookbook')
>> acl.grants
=> [#<AWS::S3::ACL::Grant:0x... FULL_CONTROL to ...>]

Your public read grant has been destroyed! What’s happening is that when you store the object on S3, it actually re-creates the object. See the Discussion for more on this.

The Solution

The solution is to store the object’s ACL before re-uploading it, and then re-upload the ACL to S3 after doing the upload.

>> acl = S3Object.acl('code/sync_directory.rb', 'amazon_s3_and_ec2_cookbook')
>> S3Object.store('code/sync_directory.rb',
     File.read('code/s3_code/sync_directory.rb'), 'amazon_s3_and_ec2_cookbook')
>> S3Object.acl('code/sync_directory.rb',
     'amazon_s3_and_ec2_cookbook', acl)

You can re-download the ACL to check that it was preserved:

>> S3Object.acl('code/sync_directory.rb', 'amazon_s3_and_ec2_cookbook').grants
=> [#<AWS::S3::ACL::Grant:0x... FULL_CONTROL to ...>,
    #<AWS::S3::ACL::Grant:0x... READ to ...>]

Discussion

The re-setting of the ACL on re-upload is definitely not expected behavior, but it actually makes some sense when you think about what’s going on in the background. Creating an object on S3 is done by making a PUT request to the object’s URL. In a RESTful architecture, PUT requests must be idempotent: the result of the request must be the same every time you do it, regardless of what has happened in the past. Another way of putting this is that PUT requests should have no state. If an object’s ACL were preserved across a PUT, then the request to create or update the object would have state.

So, yes, it makes some sense that this happens. It can be pretty annoying, though, when you keep removing access on buckets or objects that you have made readable to someone else. If you find yourself storing and then re-saving an object’s ACL a lot, it might make sense to create a store_with_saved_acl method to take care of the details for you. Here’s an implementation (Example 3.26, store_with_saved_acl.rb); a sketch of it follows.
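This is a minimal sketch of what such a method might look like, using only the S3Object.acl and S3Object.store calls shown above (the version in the book’s code repository may differ in detail):

require 'rubygems'
require 'aws/s3'

module AWS
  module S3
    class S3Object
      # Store an object, then put back the ACL it had before the store.
      def self.store_with_saved_acl(key, data, bucket, options = {})
        saved_acl = acl(key, bucket)        # remember the current ACL
        response = store(key, data, bucket, options)
        acl(key, bucket, saved_acl)         # re-upload the saved ACL
        response
      end
    end
  end
end

Here it is in action: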

$> s3sh
>> require 'code/s3_code/store_with_saved_acl'
=> true
>> S3Object.store_with_saved_acl('code/sync_directory.rb',
     File.read('code/s3_code/sync_directory.rb'),
     'amazon_s3_and_ec2_cookbook')
=> #<...>
>> S3Object.acl('code/sync_directory.rb',
     'amazon_s3_and_ec2_cookbook').grants
=> [#<AWS::S3::ACL::Grant:0x... FULL_CONTROL to ...>,
    #<AWS::S3::ACL::Grant:0x... READ to ...>]

Let’s take this a step further. We don’t really want another method here, we just want to add a parameter to the S3Object::store command that saves the ACL. We want a call like this:

S3Object.store(key, data, bucket, :keep_current_acl => true)

to do the equivalent of S3Object::store_with_saved_acl. This is going to take a little Ruby magic to make it work, but it’ll be worth it.

Example 3.27 (store_with_saved_acl_parametrized.rb) does exactly that; a sketch of it follows below.

The class << self ... end idiom means “run everything in here as if it was a class method”. The def store ... end method declaration inside of this block is equivalent to writing def self.store ... end outside of the class << self ... end block. So, why use the idiom? In this case, it allows you to use alias_method to redefine a class method.

Take a look at the new store method, too. All it does is wrap the original store method within the ACL-saving code. You have to copy the params hash because the call to old_store resets the params hash.
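Here is a sketch of what that parametrized version might look like; the :keep_current_acl option name comes from the call above, and the old_store alias is only illustrative:

require 'rubygems'
require 'aws/s3'

module AWS
  module S3
    class S3Object
      class << self
        # Keep a reference to the original class method...
        alias_method :old_store, :store

        # ...and wrap it so that :keep_current_acl => true saves and
        # restores the object's ACL around the store.
        def store(key, data, bucket, options = {})
          options = options.dup                       # old_store can modify the hash
          keep_acl = options.delete(:keep_current_acl)
          saved_acl = acl(key, bucket) if keep_acl
          response = old_store(key, data, bucket, options)
          acl(key, bucket, saved_acl) if keep_acl
          response
        end
      end
    end
  end
end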

Making sure that all objects in a bucket are publicly readable

The Problem

You have a bucket where all files must be publicly readable, and you want to make very sure that they are.

The Solution

There are two ways you can do this, both perfectly valid. The first is to go through every object in the bucket and check that the ACLs all have a READ grant for the AllUsers group. The second is to actually try reading every file without authentication.

The ACL parsing method isn’t as direct as actually reading the files. However, actually reading the files wouldn’t be a good idea if the files were large. You could, however, just do a HEAD request on each file. Let’s try that out first. The bucket itself might not necessarily have public read permission, so I’m going to get the list of objects with an authenticated request.

Here’s a script using the AWS-S3 library that works (Example 3.28, make_sure_everything_is_publicly_readable); a sketch of it follows. There’s a bit of a conflict between the aws/s3 and the rest-open-uri gems, so if you run this script you’ll get some ugly warnings at the top.
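This sketch is modelled on the S3Lib version shown in the Discussion below; it assumes rest-open-uri’s open accepts a :method => :head option and that your credentials are in the usual environment variables.

#!/usr/bin/env ruby

require 'rubygems'
require 'aws/s3'
require 'rest-open-uri'
include AWS::S3

SERVICE_URL = 'http://s3.amazonaws.com'

# Usage: make_sure_everything_is_publicly_readable <bucket>
bucket = ARGV[0]

# List the objects with an authenticated request, since the bucket
# itself may not be publicly readable.
AWS::S3::Base.establish_connection!(
  :access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

Bucket.find(bucket).objects.each do |object|
  url = File.join(SERVICE_URL, bucket, object.key)
  begin
    open(url, :method => :head)   # unauthenticated HEAD request
  rescue OpenURI::HTTPError       # 403 Forbidden
    puts "#{url} is not accessible!"
  end
end

Here’s the output from that script: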

$> ruby make_sure_everything_is_publicly_readable assets0.plotomatic.com
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:103: warning: already initialized constant Options
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:339: warning: already initialized constant StringMax
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:400: warning: already initialized constant RE_LWS
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:401: warning: already initialized constant RE_TOKEN
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:402: warning: already initialized constant RE_QUOTED_STRING
/Library/Ruby/Gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:403: warning: already initialized constant RE_PARAMETERS
http://s3.amazonaws.com/assets0.plotomatic.com/FILES_TO_UPLOAD is not accessible!
http://s3.amazonaws.com/assets0.plotomatic.com/REVISION is not accessible!

As I mentioned above, we’re getting a bunch of warnings as rest-open-uri re-defines some constants that are defined by the open-uri gem. The script, however, worked properly. There are two files in the bucket that are not publicly accessible (they are created by the upload script, and there’s no need for anyone else to read them). Everything else in the bucket is readable.

Discussion

If you don’t like the warnings, then you can get the list of objects using the S3Lib library, which doesn’t conflict with rest-open-uri. Example 3.29. make_sure_everything_is_publicly_readable_s3lib

#!/usr/bin/env ruby

require 'rubygems'
require 's3lib'

SERVICE_URL = 'http://s3.amazonaws.com'

# Usage: make_sure_everything_is_publicly_readable <bucket>
bucket = ARGV[0]

objects = S3Lib::Bucket.find(bucket).objects
objects.each do |object|
  url = File.join(SERVICE_URL, object.url)
  begin
    open(url, :method => :head)
  rescue OpenURI::HTTPError # 403 Forbidden
    puts "#{url} is not accessible!"
  end
end

Here’s the output:

$> ruby make_sure_everything_is_publicly_readable assets0.plotomatic.com
http://s3.amazonaws.com/assets0.plotomatic.com/FILES_TO_UPLOAD is not accessible!
http://s3.amazonaws.com/assets0.plotomatic.com/REVISION is not accessible!

Ahh, much cleaner. You could also just turn warnings off by running the script with ruby -W0 or by changing the shebang line to

#!/usr/bin/env ruby -W0

…but that’s cheating, isn’t it?

Detecting if a File on S3 is the Same as a Local File

The Problem

You have a rather large file on your local disk that you want to make sure is backed up to S3. A version of the file is already on S3, but you want to make sure that the S3 version is the same as your local version. You want to avoid uploading the file if it’s not necessary.

The Solution

Calculate the MD5 hash sum of your local file, and compare it to the etag of the file on S3. If they’re the same, then the files are equivalent and you don’t have to upload. If they’re not the same, you need to do the upload. (If the file doesn’t exist on S3, then you’ll have to upload it too). Here’s some code that will do the checking for you. Example 3.33. detect_file_differences

#!/usr/bin/env ruby

require 'digest'
require 'rubygems'
require 'aws/s3'
include AWS::S3

# Usage: detect_file_differences <filename> <bucket> [<key>]
# key will default to filename if it is not given.

filename = ARGV[0]
bucket = ARGV[1]
key = ARGV[2] || filename

AWS::S3::Base.establish_connection!(
  :access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

begin
  object = S3Object.find(key, bucket)
rescue AWS::S3::NoSuchKey
  puts "The file does not exist on S3. You need to upload"
  exit(0)
end

md5 = Digest::MD5.hexdigest(File.read(filename))
etag = object.etag

if md5 == etag
  puts "They're the same. No need to upload"
else
  puts "They're different. You need to upload the file to S3."
end

Discussion

For a more feature-rich version of this code that actually does some uploading, see “Synchronizing a Directory” and “Synchronizing Multiple Directories”.

Chapter 4. Authenticating S3 Requests

Authenticating S3 Requests

The section on authenticating S3 Requests in the S3 Developer’s Guide (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/) is pretty intimidating. There are a lot of steps you have to go through in just the right order to get your authentication correct. Luckily there are sample implementations in a number of languages in the Getting Started Guide: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/authenticating-to-s3.html.

The code supplied by Amazon works just fine, but you may want to build your own implementation. You could just read the S3 Developer’s Guide: it’s the canonical resource for this information, and obviously if there’s a conflict between what I’m saying and what the Developer’s Guide says, go with the Developer’s Guide. That being said, I’m going to go through the authentication process in detail to hopefully smooth over some of the spots where I got confused.

The Authentication Process

Every request you make to Amazon S3 must be signed. This is done by adding an Authorization header to the request. An authenticated request to a bucket named mybucket would look like this:

GET /mybucket
Host: s3.amazonaws.com
Date: Wed, 13 Feb 2008 12:00:00 GMT
Authorization: AWS your_aws_id:signature

TODO: Create an annotated example request and canonical string

The authorization header consists of “AWS”, followed by a space, your AWS ID, a colon and then a signature:

Authorization = AWS <your_aws_id>:<signature>

So that looks pretty straight-forward, except for the signature. To generate the signature, you generate a canonical string for your request, and then encode that string using your Secret Key. To encode the canonical string, you take the UTF-8 encoding of that string, encode it with your Secret Key using the HMAC-SHA1 algorithm and then Base64 encode the result. In pseudo-code:

Signature = Base64( HMAC-SHA1( UTF-8-Encoding-Of( canonical_string ) ) )
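In Ruby that pseudo-code maps almost directly onto the standard library. Here is a minimal sketch of the signing step (not the book’s library code; it assumes the canonical string has already been built):

require 'openssl'
require 'base64'

# HMAC-SHA1 the canonical string with your secret key, then Base64 encode it.
def sign(canonical_string, secret_access_key)
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest::SHA1.new,
                                secret_access_key, canonical_string)
  Base64.encode64(digest).chomp
end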

Assuming that you have the libraries to do the various encodings, all you need to do now is create the canonical string. The canonical string is a concatenation of the HTTP verb used to make the request, the canonicalized headers and the canonicalized resource.

canonical_string = "<http_verb>\n
                    <canonicalized_headers>\n
                    <canonicalized_resource>"

The <http_verb>

This is the simplest of the canonical string sub-elements. It is either GET, PUT, DELETE, HEAD or POST. It must be all uppercase.

The <canonicalized_headers>

The canonicalized_headers element is constructed from two sub-elements, the canonicalized_positional_headers and the canonicalized_amazon_headers.

canonicalized_headers = <canonicalized_positional_headers>\n
                        <canonicalized_amazon_headers>

The <canonicalized_positional_headers>

The canonicalized_positional_headers are the values of the MD5 hash, content type and date headers, separated by newlines. In pseudo-code:

Content-MD5 + "\n" +
Content-Type + "\n" +
Date + "\n"

Here’s an example

1 \"91ffa40f1a72a58f0d0b688032195088\"\n 2 text/plain\n 3 Wed, 27 Mar 2008 09:14:27 +0000

If one of the positional headers is not provided in the request, replace its value with an empty string and leave the new line (\n) in. For example, if a request had no MD5 hash or content type headers, it would look like this

\n
\n
Wed, 27 Mar 2008 09:14:27 +0000

The <canonicalized_amazon_headers>

The Amazon headers are all headers that begin with x-amz-, ignoring case. You construct the canonicalized_amazon_headers with the following steps:

• Find all headers that have header names that begin with x-amz-, ignoring case. These are the Amazon headers
• Convert each of the Amazon header’s header names to lower case. (Just the header names, not their values)
• For each Amazon header, combine the header name and header value by joining them with a colon. Remove any leading whitespace from the header value as you do this
• If you have multiple Amazon headers with the same name, then combine the values into one value by joining them with commas, without any white space between them.
• If any of the headers span multiple lines, un-fold them by replacing the newlines with a single space
• Sort the Amazon headers alphabetically by header name
• Join the headers together with new-lines (\n)

Here’s an example. The request:

GET /my_photos/vampire.jpg
Host: s3.amazonaws.com
X-Amz-Meta-Subject: Claire
X-Amz-Meta-Photographer: Nadine Inkster
content-type: image/png
content-length: 10817

And the resulting canonicalized_amazon_headers:

x-amz-meta-photographer:Nadine Inkster\n
x-amz-meta-subject:Claire

Note that the header names have been lower cased, they are in alphabetical order and the spaces between the header names and the header values have been taken out. The Host, content-length and content-type headers are not included in the canonicalized_amazon_headers.

Here’s a more complicated example showing the combination of multiple Amazon headers with the same name and the un-folding of a long header value. The request:

GET /my_photos/birthday.jpg
Host: s3.amazonaws.com
content-type: image/png
content-length: 12413
X-Amz-Meta-Subject:Claire
X-Amz-Meta-Subject:Mika
X-Amz-Meta-Subject:Amber
X-Amz-Meta-Subject:Callum
X-Amz-Meta-Description: Mika, Claire, Amber and Callum \n
  at Mika's birthday party\n

And the resulting canonicalized_amazon_headers:

x-amz-meta-description:Mika, Claire, Amber and Callum at Mika's birthday party\n
x-amz-meta-subject:Claire,Mika,Amber,Callum

Note that the subjects have all been combined into a single header, and the new-line in the description has been replaced with a space.

Joining the <canonicalized_headers>

The canonicalized_headers are constructed by joining the canonicalized_positional_headers and the canonicalized_amazon_headers with a newline (\n).

canonicalized_headers = <canonicalized_positional_headers>\n<canonicalized_amazon_headers>

The <canonicalized_resource>

The canonicalized_resource is given by

/<bucket><uri><sub-resource>

Elements in the canonicalized_resource:

bucket
  The name of the bucket. This must be included even if you are setting the bucket in the Host header using virtual hosting of buckets. If you are requesting the list of buckets you own, just use the empty string.

uri
  This is the HTTP request URI, not including the query string. The URI should be URI encoded.

sub-resource
  If the request is for a sub-resource such as ?acl, ?torrent or ?logging, then append it to the uri, including the ?.

Some examples. First, a simple request:

GET /my_pictures/vampire.jpg
Host: s3.amazonaws.com

canonicalized_resource:

/my_pictures/vampire.jpg

The canonicalized_resource is the same as the URI in the request. Next, a request that uses virtual hosting of buckets:

GET /vampire.jpg
Host: spattenpictures.s3.amazonaws.com

canonicalized_resource:

/spattenpictures/vampire.jpg

Note that the bucket name (spattenpictures) is extracted from the Host header even though it is not in the URI in the request.

Time Stamping Your Requests

All requests to Amazon S3 must be time-stamped. There are two ways of time-stamping your request: using the Date header (which will be in the canonicalized_positional_headers), or using the x-amz-date header (which will be in the canonicalized_amazon_headers). Only one of the two date headers should be present in your request. The date provided must be within 15 minutes of the current Amazon S3 system time. If not, you will receive a RequestTimeTooSkewed error response to your request.
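For example, with the S3Lib request helper you could time-stamp a request yourself with either header (a sketch; as we’ll see below, the library can also add a date header for you if none is passed in):

require 'time'

# Either of these satisfies the time-stamping requirement:
S3Lib.request(:get, 'my_bucket', 'date' => Time.now.httpdate)
S3Lib.request(:get, 'my_bucket', 'x-amz-date' => Time.now.httpdate)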

Writing an S3 Authentication Library

Now that you know how a request is authenticated (in exhaustive detail), let’s implement a library to actually do the authentication.

We need a function that will take an HTTP verb, URL and a hash of headers as inputs and make an authenticated request. It authenticates the request by adding a signature to it. The signature is created by using the inputs to build a canonical string and then signing that string with your Amazon Web Services Secret Key using HMAC-SHA1 and Base64, as described above. Okay, so we’re looking for a function that looks like this:

module S3Lib

  class AuthenticatedRequest

    def make_authenticated_request(verb, request_path, headers = {})
      # ... some code here ...
    end

  end
end

To make a request, you would do something like this

$> irb
>> require 's3_authenticator'
=> true
>> s3 = S3Lib::AuthenticatedRequest.new
=> #<S3Lib::AuthenticatedRequest:0x...>
>> s3.make_authenticated_request(:get, '/',
     {'host' => 's3.amazonaws.com'})

You know what, that’s too much code just to make a simple request. Let’s add a class method to the S3Lib module that instantiates an AuthenticatedRequest object and makes the call to AuthenticatedRequest#make_authenticated_request for us.

module S3Lib
  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    def make_authenticated_request(verb, request_path, headers = {})
      # ... some code here ...
    end

  end
end

Now we can make a request like this:

$> irb
>> require 's3_authenticator'
=> true
>> s3 = S3Lib.request(:get, '/',
     {'host' => 's3.amazonaws.com'})
=> #<...>

Okay, now that we have that out of the way, let’s get started.

Test Driven Development

When you first look at the requirements for authenticating an S3 request, it’s hard to know where to begin. It’s a complex set of requirements that would be much simpler if you broke it up into smaller steps. It is also something that must work correctly if the rest of our library is to function at all.

Luckily for you, Amazon gives a set of example requests and the corresponding canonical strings and signatures in the S3 Developer’s Guide (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/). This sounds like a perfect fit for Test Driven Development (TDD), and that’s what we’re going to do for the rest of this section.

In the previous section, I deliberately started from a top-down view of the specification to make a point: the best way to deal with a complex spec like this is to start from the bottom and work your way up. If you just start coding from the top down, you’ll find yourself just sitting there staring at the code not knowing where to start.

So, let’s do some TDD. The key here will be to make small steps, and use our tests to make sure that all of our steps work for all cases, even the edge cases. The flow of development will look like this:

• Read part of the authentication specification
• Create tests for that portion of the specification
• Make sure your tests fail
• Write the simplest thing that will make your tests pass

The HTTP Verb

Let’s start with something easy: the HTTP Verb section of the canonical string. Remember from the specification that the HTTP verb “… is either GET, PUT, DELETE, HEAD or POST. It must be all uppercase.” Also, it has to be followed by a newline. So, let’s test that. Our test will look something like this:

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator')

class S3AuthenticatorTest < Test::Unit::TestCase

  def test_http_verb_is_uppercase
    @s3_test = S3Lib::AuthenticatedRequest.new
    @s3_test.make_authenticated_request(:get, '/',
      {'host' => 's3.amazonaws.com'})
    assert_match /^GET\n/, @s3_test.canonical_string
  end

end

Let’s run that and see what happens. We know this is going to fail, as we haven’t actually written the canonical_string method yet.

$> ruby test/first_test.rb
Loaded suite test/first_test
Started
E
Finished in 0.000457 seconds.

  1) Error:
test_http_verb_is_uppercase(S3AuthenticatorTest):
NoMethodError: undefined method `canonical_string'
    for #<S3Lib::AuthenticatedRequest:0x...>
    test/first_test.rb:9:in `test_http_verb_is_uppercase'

1 tests, 0 assertions, 0 failures, 1 errors

Good, it fails. Now, let’s write the simplest thing that will make it work. Something like this:

module S3Lib

  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
    end

    def canonical_string
      "#{@verb.to_s.upcase}\n"
    end

  end
end

Now, run the test again.

$> ruby test/first_test.rb
Loaded suite test/first_test
Started
.
Finished in 0.000358 seconds.

1 tests, 1 assertions, 0 failures, 0 errors

All right! We’re rolling now! Let’s go on to something a bit more complicated: the canonicalized headers.

The Canonicalized Positional Headers

The canonicalized headers consist of the canonicalized_positional_headers followed by the canonicalized_amazon_headers. To break this down to something simple, we don’t want to test both of them at once. Let’s start testing with the canonicalized_positional_headers. Once again, I’ll refresh your memory so that you don’t have to flip back to the last section. The canonicalized_positional_headers are the values of the MD5 hash, content type and date headers, separated by newlines. So, let’s write some tests for that spec: Chapter 4. Authenticating S3 Requests 71

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorTest < Test::Unit::TestCase

  def test_http_verb_is_uppercase
    @s3_test = S3Lib::AuthenticatedRequest.new
    @s3_test.make_authenticated_request(:get, '/',
      {'host' => 's3.amazonaws.com'})
    assert_match /^GET\n/, @s3_test.canonical_string
  end

  def test_canonical_string_contains_positional_headers
    @s3_test = S3Lib::AuthenticatedRequest.new
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_match /^GET\n#{@s3_test.canonicalized_positional_headers}/,
      @s3_test.canonical_string
  end

  def test_positional_headers_with_all_headers
    @s3_test = S3Lib::AuthenticatedRequest.new
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_equal "whee\nsome content type\nDecember 25th, 2007\n",
      @s3_test.canonicalized_positional_headers
  end

end

This will, of course, fail until we’ve written the canonicalized_positional_headers method.

Making sure your tests fail

I’ll spare you the details of the test failures from now on, but don’t take that to mean that you shouldn’t test that they fail. Trust me, knowing that your tests actually fail before you begin coding will save you hours of frustration at some point in your life. Making sure that your tests fail ensures that your tests are actually running, and it helps to ensure that they’re testing what you think you are testing. Get into the habit of running them before you begin coding and giving yourself a pat on the back when they fail.

module S3Lib

  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    POSITIONAL_HEADERS = ['content-md5', 'content-type', 'date']

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
      @headers = headers
    end

    def canonical_string
      "#{@verb.to_s.upcase}\n#{canonicalized_headers}"
    end

    def canonicalized_headers
      "#{canonicalized_positional_headers}"
    end

    def canonicalized_positional_headers
      POSITIONAL_HEADERS.collect do |header|
        @headers[header] + "\n"
      end.join
    end

  end
end

If you run the tests, you’ll see that the new tests pass, but we’ve broken the test_http_verb_is_uppercase test.

$> ruby test/first_test.rb
Loaded suite test/first_test
Started
.E.
Finished in 0.00068 seconds.

  1) Error:
test_http_verb_is_uppercase(S3AuthenticatorTest):
NoMethodError: undefined method `+' for nil:NilClass
    ./test/../s3_authenticator_dev.rb:26:in `canonicalized_positional_headers'
    ./test/../s3_authenticator_dev.rb:25:in `collect'
    ./test/../s3_authenticator_dev.rb:25:in `canonicalized_positional_headers'
    ./test/../s3_authenticator_dev.rb:21:in `canonicalized_headers'
    ./test/../s3_authenticator_dev.rb:17:in `canonical_string'
    test/first_test.rb:9:in `test_http_verb_is_uppercase'

3 tests, 2 assertions, 0 failures, 1 errors

That test doesn’t pass in all of the positional headers, so the canonicalized_positional_headers method is failing when it tries to add the non-existent header to a string. This is kind of fortuitous, as the specification for the canonicalized_positional_headers says that a positional header should be replaced by an empty string if it doesn’t exist. Let’s write a test for that spec and hopefully we’ll comply with that specification and fix the currently failing test all in one fell swoop. Here’s the new test:

def test_positional_headers_with_only_date_header
  @s3_test.make_authenticated_request(:get, '',
    {'date' => 'December 25th, 2007'})
  assert_equal "\n\nDecember 25th, 2007\n",
    @s3_test.canonicalized_positional_headers
end

To fix the problem, all we have to do is make sure that a positional header is replaced with an empty string if it doesn’t exist

def test_positional_headers_with_only_date_header
  @s3_test = S3Lib::AuthenticatedRequest.new
  @s3_test.make_authenticated_request(:get, '',
    {'date' => 'December 25th, 2007'})
  assert_equal "\n\nDecember 25th, 2007\n",
    @s3_test.canonicalized_positional_headers
end
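The change to canonicalized_positional_headers itself is simply to substitute an empty string for a missing header, as it appears in the fuller listings later in this chapter:

def canonicalized_positional_headers
  POSITIONAL_HEADERS.collect do |header|
    (@headers[header] || "") + "\n"
  end.join
end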

Our tests all pass now. Phew.

DRYing up our tests

You might have noticed that all of our tests start with the line @s3_test = S3Lib::AuthenticatedRequest.new. Now, I’m a pretty lazy guy, so I want to avoid all of that repetitive typing. Luckily, Ruby’s Unit Testing library is a faithful interpretation of the XUnit test specification. This means that if you define a setup method, it will be run before every test. Similarly, the teardown method, if defined, will be run after every test. Let’s refactor our test library to take advantage of the setup method.

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorTest < Test::Unit::TestCase

  def setup
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  def test_http_verb_is_uppercase
    @s3_test.make_authenticated_request(:get, '/',
      {'host' => 's3.amazonaws.com'})
    assert_match /^GET\n/, @s3_test.canonical_string
  end

  def test_canonical_string_contains_positional_headers
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_match /^GET\n#{@s3_test.canonicalized_positional_headers}/,
      @s3_test.canonical_string
  end

  def test_positional_headers_with_all_headers
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_equal "whee\nsome content type\nDecember 25th, 2007\n",
      @s3_test.canonicalized_positional_headers
  end

  def test_positional_headers_with_only_date_header
    @s3_test.make_authenticated_request(:get, '',
      {'date' => 'December 25th, 2007'})
    assert_equal "\n\nDecember 25th, 2007\n",
      @s3_test.canonicalized_positional_headers
  end

end

Keeping your methods and variables private yet testable

You might have been gritting your teeth as I just blithely used attributes and methods that should be private in the tests above. In reality, you probably don’t want the canonical_string, @headers or any of the other methods I was testing against available in the public interface. I did this because I didn’t want to obscure what we’re really trying to do here: write an S3 authentication library. However, if you’re interested, here’s a nice method for keeping those methods and attributes private in production yet available for testing. The key to this technique is that Ruby’s classes are always open and extendible. So, you can open up the class when you’re testing it and make everything publicly accessible while still keeping everything locked down for production use. Here’s an example of how I did it while developing this class. Here’s the class refactored to keep things private. Notice that only the make_authenticated_request method is publicly available

module S3Lib
  require 'time'

  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    POSITIONAL_HEADERS = ['content-md5', 'content-type', 'date']

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
      @headers = headers
    end

    private

    def canonical_string
      "#{@verb.to_s.upcase}\n#{canonicalized_headers}"
    end

    def canonicalized_headers
      "#{canonicalized_positional_headers}"
    end

    def canonicalized_positional_headers
      POSITIONAL_HEADERS.collect do |header|
        (@headers[header] || "") + "\n"
      end.join
    end

  end
end

Now, in the same file that your tests are in, open up the class and publicize all of the methods we want to test (canonical_string, canonicalized_headers and canonicalized_positional_headers) with public versions (public_canonical_string, public_canonicalized_headers and public_canonicalized_positional_headers). Next, make any instance variables you want access to readable (@headers). Finally, rename the method calls in your tests by pre-pending public_ to them.

# Make private methods and attributes public so that you can test them
module S3Lib
  class AuthenticatedRequest

    attr_reader :headers

    def public_canonicalized_headers
      canonicalized_headers
    end

    def public_canonicalized_positional_headers
      canonicalized_positional_headers
    end

    def public_canonical_string
      canonical_string
    end

  end
end

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev_private')

class S3AuthenticatorTest < Test::Unit::TestCase

  def setup
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  def test_http_verb_is_uppercase
    @s3_test.make_authenticated_request(:get, '/',
      {'host' => 's3.amazonaws.com'})
    assert_match /^GET\n/, @s3_test.public_canonical_string
  end

  def test_canonical_string_contains_positional_headers
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_match /^GET\n#{@s3_test.public_canonicalized_positional_headers}/,
      @s3_test.public_canonical_string
  end

  def test_positional_headers_with_all_headers
    @s3_test.make_authenticated_request(:get, '',
      {'content-type' => 'some content type',
       'date' => 'December 25th, 2007',
       'content-md5' => 'whee'})
    assert_equal "whee\nsome content type\nDecember 25th, 2007\n",
      @s3_test.public_canonicalized_positional_headers
  end

  def test_positional_headers_with_only_date_header
    @s3_test.make_authenticated_request(:get, '',
      {'date' => 'December 25th, 2007'})
    assert_equal "\n\nDecember 25th, 2007\n",
      @s3_test.public_canonicalized_positional_headers
  end

end

This technique of opening the class will also come in handy when we actually write the code to talk to Amazon S3, but want to be able to test without a live internet connection.

Phew. That was a lot of reading and coding, but we have the positional headers working and well tested now. I’ll tone down the verbiage from now on so that we can finish up without killing too many extra forests.

The Canonicalized Amazon Headers

The Amazon headers are all headers that begin with x-amz-, ignoring case. You construct the canonicalized_amazon_headers with the following steps. I’m going to write out each specification along with some tests that express that specification.

• Find all headers that have header names that begin with x-amz-, ignoring case. These are the Amazon headers

def test_amazon_headers_should_remove_non_amazon_headers
  @s3_test.make_authenticated_request(:get, '',
    {'content-type' => 'content',
     'some-other-header' => 'other',
     'x-amz-meta-one' => 'one',
     'x-amz-meta-two' => 'two'})
  headers = @s3_test.public_canonicalized_amazon_headers
  assert_no_match /other/, headers
  assert_no_match /content/, headers
end

def test_amazon_headers_should_keep_amazon_headers
  @s3_test.make_authenticated_request(:get, '',
    {'content-type' => 'content',
     'some-other-header' => 'other',
     'x-amz-meta-one' => 'one',
     'x-amz-meta-two' => 'two'})
  headers = @s3_test.public_canonicalized_amazon_headers
  assert_match /x-amz-meta-one/, headers
  assert_match /x-amz-meta-two/, headers
end

• Convert each of the Amazon header’s header names to lower case. (Just the header names, not their values)

def test_amazon_headers_should_be_lowercase
  @s3_test.make_authenticated_request(:get, '',
    {'content-type' => 'content',
     'some-other-header' => 'other',
     'X-amz-meta-one' => 'one',
     'x-Amz-meta-two' => 'two'})
  headers = @s3_test.public_canonicalized_amazon_headers
  assert_match /x-amz-meta-one/, headers
  assert_match /x-amz-meta-two/, headers
end

• For each Amazon header, combine the header name and header value by joining them with a colon. Remove any leading whitespace from the header value as you do this

def test_leading_spaces_get_stripped_from_header_values
  @s3_test.make_authenticated_request(:get, '',
    {'x-amz-meta-one' => ' one with a leading space',
     'x-Amz-meta-two' => ' two with a leading and trailing space '})
  headers = @s3_test.public_canonicalized_amazon_headers
  assert_match /x-amz-meta-one:one with a leading space/, headers
  assert_match /x-amz-meta-two:two with a leading and trailing space /,
    headers
end

• If you have multiple Amazon headers with the same name, then combine the values into one value by joining them with commas, without any white space between them.

def test_values_as_arrays_should_be_joined_as_commas
  @s3_test.make_authenticated_request(:get, '',
    {'x-amz-mult' => ['a', 'b', 'c']})
  headers = @s3_test.canonicalized_amazon_headers
  assert_match /a,b,c/, headers
end

• If any of the headers span multiple lines, un-fold them by replacing the newlines with a single space

def test_long_amazon_headers_should_get_unfolded
  @s3_test.make_authenticated_request(:get, '',
    {'x-amz-meta-one' => "A really long header\n" +
                         "with multiple lines."})
  headers = @s3_test.canonicalized_amazon_headers
  assert_match /x-amz-meta-one:A really long header with multiple lines./,
    headers
end

• Sort the Amazon headers alphabetically by header name

def test_amazon_headers_should_be_alphabetized
  @s3_test.make_authenticated_request(:get, '',
    {'content-type' => 'content',
     'some-other-header' => 'other',
     'X-amz-meta-one' => 'one',
     'x-Amz-meta-two' => 'two',
     'x-amz-meta-zed' => 'zed',
     'x-amz-meta-alpha' => 'alpha'})
  headers = @s3_test.canonicalized_amazon_headers
  assert_match /alpha.*one.*two.*zed/m, headers # /m on the reg-exp makes .* include newlines
end

• Join the headers together with new-lines (\n)

Here is some code that passes those tests, which hopefully means it achieves the specifications:

class Hash

  def downcase_keys
    res = {}
    each do |key, value|
      key = key.downcase if key.respond_to?(:downcase)
      res[key] = value
    end
    res
  end

  def join_values(separator = ',')
    res = {}
    each do |key, value|
      res[key] = value.respond_to?(:join) ? value.join(separator) : value
    end
    res
  end

end

module S3Lib
  require 'time'

  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    attr_reader :headers
    POSITIONAL_HEADERS = ['content-md5', 'content-type', 'date']

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
      @headers = headers.downcase_keys.join_values
    end

    def canonical_string
      "#{@verb.to_s.upcase}\n#{canonicalized_headers}"
    end

    def canonicalized_headers
      "#{canonicalized_positional_headers}#{canonicalized_amazon_headers}"
    end

    def canonicalized_positional_headers
      POSITIONAL_HEADERS.collect do |header|
        (@headers[header] || "") + "\n"
      end.join
    end

    def canonicalized_amazon_headers

      # select all headers that start with x-amz-
      amazon_headers = @headers.select do |header, value|
        header =~ /^x-amz-/
      end

      # Sort them alphabetically by key
      amazon_headers = amazon_headers.sort do |a, b|
        a[0] <=> b[0]
      end

      # Collect all of the amazon headers like this:
      #   {key}:{value}\n
      # The value has to have any whitespace on the left stripped from it
      # and any new-lines replaced by a single space.
      # Finally, return the headers joined together as a single string.
      amazon_headers.collect do |header, value|
        "#{header}:#{value.lstrip.gsub("\n", " ")}\n"
      end.join
    end

  end
end

What did I just do to Hash?

If you’re not used to Ruby, you might have gotten a little worried when you saw the additions I made to the Hash class. I opened up a base class and added a couple of methods to it. Yikes! You might think that this is crazy and will lead to all kinds of problems, but it is accepted practice in the Ruby world. Personally, I love it: it makes my code more readable and concise, and it has never caused me a problem.

Date Stamping Requests

If you were feeling especially alert, you might have noticed that I side-stepped a specification in the canonicalized headers section. That spec is that all requests to S3 must be time-stamped. A request can be time-stamped in two ways: through the positional date header, or through an Amazon header called x-amz-date. The x-amz-date header over-rides the date header. Also, to make the life of the users of your library easier, let’s make the library provide a date header equal to the current time if none is passed in. Here’s a set of tests that express that spec:

def test_date_should_be_added_if_not_passed_in
  @s3_test.make_authenticated_request(:get, '')
  assert @s3_test.headers.has_key?('date')
end

def test_positional_headers_with_no_headers_should_have_date_defined
  @s3_test.make_authenticated_request(:get, '')
  date = @s3_test.headers['date']
  assert_equal "\n\n#{date}\n", @s3_test.canonicalized_positional_headers
end

def test_xamzdate_should_override_date_header
  @s3_test.make_authenticated_request(:get, '',
                                      {'date' => 'December 15, 2005',
                                       'x-amz-date' => 'Tue, 27 Mar 2007 21:20:26 +0000'})
  headers = @s3_test.public_canonicalized_headers
  assert_match /2007/, headers
  assert_no_match /2005/, headers
end

def test_xamzdate_should_override_capitalized_date_header
  @s3_test.make_authenticated_request(:get, '',
                                      {'Date' => 'December 15, 2005',
                                       'X-amz-date' => 'Tue, 27 Mar 2007 21:20:26 +0000'})
  headers = @s3_test.public_canonicalized_headers
  assert_match /2007/, headers
  assert_no_match /2005/, headers
end

We’ll use the fix_date method to add a date header if it doesn’t exist. Notice that the test accesses the @headers hash. The line that reads attr_reader :headers makes that accessible to our tests. Here’s the code:

module S3Lib
  require 'time'

  def self.request(verb, request_path, headers = {})
    s3requester = AuthenticatedRequest.new()
    s3requester.make_authenticated_request(verb, request_path, headers)
  end

  class AuthenticatedRequest

    attr_reader :headers
    POSITIONAL_HEADERS = ['content-md5', 'content-type', 'date']

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
      @headers = headers
      fix_date
    end

    def fix_date
      @headers['date'] ||= Time.now.httpdate
      @headers.delete('date') if @headers.has_key?('x-amz-date')
    end

    def canonical_string
      "#{@verb.to_s.upcase}\n#{canonicalized_headers}"
    end

    def canonicalized_headers
      "#{canonicalized_positional_headers}"
    end

    def canonicalized_positional_headers
      POSITIONAL_HEADERS.collect do |header|
        (@headers[header] || "") + "\n"
      end.join
    end

  end
end
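To see fix_date in action, here is a quick irb-style sketch (assuming the class above has been loaded; the exact date you get back will of course be whatever Time.now.httpdate returns):

>> req = S3Lib::AuthenticatedRequest.new
>> req.make_authenticated_request(:get, '')
>> req.headers['date']
=> "Thu, 20 Mar 2008 18:17:40 GMT"   # the current time, in httpdate format

>> req = S3Lib::AuthenticatedRequest.new
>> req.make_authenticated_request(:get, '', 'date' => 'December 15, 2005',
                                            'x-amz-date' => 'Tue, 27 Mar 2007 21:20:26 +0000')
>> req.headers.has_key?('date')
=> false   # the date header is dropped because x-amz-date is present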

The Canonicalized Resource

The canonicalized_resource is given by

/<bucket name><URI><sub-resource (if any)>

The canonicalized_resource must start with a forward slash (/), it must include the bucket name (even if the bucket is not in the URI), and then comes the URI and the sub-resource (if any). The bucket name must be lower case. Here are some tests that express this.

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorCanonicalResourceTest < Test::Unit::TestCase

  def setup
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  def test_forward_slash_is_always_added
    @s3_test.make_authenticated_request(:get, '')
    assert_match /^\//, @s3_test.canonicalized_resource
  end

  def test_bucket_name_in_uri_should_get_passed_through
    @s3_test.make_authenticated_request(:get, 'my_bucket')
    assert_match /^\/my_bucket/, @s3_test.canonicalized_resource
  end

  def test_canonicalized_resource_should_include_uri
    @s3_test.make_authenticated_request(:get, 'my_bucket/vampire.jpg')
    assert_match /vampire.jpg$/, @s3_test.canonicalized_resource
  end

  def test_canonicalized_resource_should_include_sub_resource
    @s3_test.make_authenticated_request(:get, 'my_bucket/vampire.jpg?torrent')
    assert_match /vampire.jpg\?torrent$/, @s3_test.canonicalized_resource
  end

  def test_bucket_name_with_virtual_hosting
    @s3_test.make_authenticated_request(:get, '/',
                                        {'host' => 'some_bucket.s3.amazonaws.com'})
    assert_match /some_bucket\//, @s3_test.canonicalized_resource
    assert_no_match /s3.amazonaws.com/, @s3_test.canonicalized_resource
  end

  def test_bucket_name_with_cname_virtual_hosting
    @s3_test.make_authenticated_request(:get, '/',
                                        {'host' => 'some_bucket.example.com'})
    assert_match /^\/some_bucket.example.com/, @s3_test.canonicalized_resource
  end

end

Here is the AuthenticatedRequest library that passes these tests. Note the changes that have been made:

• The HOST constant has been added and set to 's3.amazonaws.com'
• The get_bucket_name method has been created. It is called from the make_authenticated_request method. This method extracts the bucket from the host header and saves it to the @bucket instance variable.
• The canonicalized_resource method creates the canonicalized resource string. It is called in the canonical_string method.

class AuthenticatedRequest

  attr_reader :headers
  POSITIONAL_HEADERS = ['content-md5', 'content-type', 'date']
  HOST = 's3.amazonaws.com'

  def make_authenticated_request(verb, request_path, headers = {})
    @verb = verb
    @request_path = request_path.gsub(/^\//, '') # Strip off the leading '/'

    @headers = headers.downcase_keys.join_values
    fix_date
    get_bucket_name
  end

  def fix_date
    @headers['date'] ||= Time.now.httpdate
    @headers.delete('date') if @headers.has_key?('x-amz-date')
  end

  def canonical_string
    "#{@verb.to_s.upcase}\n#{canonicalized_headers}#{canonicalized_resource}"
  end

  def canonicalized_headers
    "#{canonicalized_positional_headers}#{canonicalized_amazon_headers}"
  end

  def canonicalized_positional_headers
    POSITIONAL_HEADERS.collect do |header|
      (@headers[header] || "") + "\n"
    end.join
  end

  def canonicalized_amazon_headers

    # select all headers that start with x-amz-
    amazon_headers = @headers.select do |header, value|
      header =~ /^x-amz-/
    end

    # Sort them alphabetically by key
    amazon_headers = amazon_headers.sort do |a, b|
      a[0] <=> b[0]
    end

    # Collect all of the amazon headers like this:
    # {key}:{value}\n
    # The value has to have any whitespace on the left stripped from it
    # and any new-lines replaced by a single space.
    # Finally, return the headers joined together as a single string.
    amazon_headers.collect do |header, value|
      "#{header}:#{value.lstrip.gsub("\n", " ")}\n"
    end.join
  end

  def canonicalized_resource
    canonicalized_resource_string = "/"
    canonicalized_resource_string += @bucket
    canonicalized_resource_string += @request_path
    canonicalized_resource_string
  end

  def get_bucket_name
    @bucket = ""
    return unless @headers.has_key?('host')
    @headers['host'] = @headers['host'].downcase
    return if @headers['host'] == 's3.amazonaws.com'
    if @headers['host'] =~ /^([^.]+)(:\d\d\d\d)?\.#{HOST}$/ # Virtual hosting
      @bucket = $1.gsub(/\/$/, '') + '/'
    else
      # CNAME Virtual hosting
      @bucket = @headers['host'].gsub(/(:\d\d\d\d)$/, '').gsub(/\/$/, '') + '/'
    end
  end

end
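As a concrete illustration of the rules above (irb-style, assuming the class is loaded): a virtual-hosted request against johnsmith.s3.amazonaws.com ends up with the bucket folded back into the resource string:

>> req = S3Lib::AuthenticatedRequest.new
>> req.make_authenticated_request(:get, '/photos/puppy.jpg?acl',
                                  'host' => 'johnsmith.s3.amazonaws.com')
>> req.canonicalized_resource
=> "/johnsmith/photos/puppy.jpg?acl"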

The Full Signature

Now that we have all of the parts of the signature coded up, we can use some samples provided by the S3 Developers Guide to test that it works when we bring it all together. The examples are in the section of the document on REST authentication at http://docs.amazonwebservices.com/AmazonS3/2006-03-01/RESTAuthentication.html²¹ Okay, so let’s take a look at the first example and code up a test to use it. Here’s the example:

Request:

GET /photos/puppy.jpg HTTP/1.1
Host: johnsmith.s3.amazonaws.com
Date: Tue, 27 Mar 2007 19:36:42 +0000
Authorization: AWS 0PN5J17HBGZHT7JJ3X82:xXjDGYUmKxnwqr5KXNPGldn5LbA=

canonical_string:

GET\n
\n
\n
Tue, 27 Mar 2007 19:36:42 +0000\n
/johnsmith/photos/puppy.jpg

²¹http://docs.amazonwebservices.com/AmazonS3/2006-03-01/RESTAuthentication.html

There are two interesting features to note:

• The bucket is provided in the Host header, not in the URL, but it still shows up in the canonicalized resource portion of the canonical_string.
• The actual signed Authorization header is provided for the sample request. I’ll talk about this more in the next section on signing the request.

Let’s translate that example to a unit test and see if all of our hard work comes together.

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorTest < Test::Unit::TestCase

  def setup
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  # http://developer.amazonwebservices.com/connect/entry.jspa?externalID=123&categoryID=48
  def test_dg_sample_one
    @s3_test.make_authenticated_request(:get, '/photos/puppy.jpg',
                                        {'Host' => 'johnsmith.s3.amazonaws.com',
                                         'Date' => 'Tue, 27 Mar 2007 19:36:42 +0000'})
    expected_canonical_string = "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n" +
                                "/johnsmith/photos/puppy.jpg"
    assert_equal expected_canonical_string, @s3_test.canonical_string
  end

end

Save that in canonical_string_tests.rb and run it.

$> ruby test/canonical_string_tests.rb
Loaded suite test/canonical_string_tests
Started
.
Finished in 0.000447 seconds.

1 tests, 1 assertions, 0 failures, 0 errors

Phew. Everything works as planned. The next step is to take all of the examples from the Developer’s Guide, translate them to unit tests and make sure they pass.

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorTest < Test::Unit::TestCase

  def setup
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  # http://developer.amazonwebservices.com/connect/entry.jspa?externalID=123&categoryID=48
  def test_dg_sample_one
    @s3_test.make_authenticated_request(:get, '/photos/puppy.jpg',
                                        {'Host' => 'johnsmith.s3.amazonaws.com',
                                         'Date' => 'Tue, 27 Mar 2007 19:36:42 +0000'})
    expected_canonical_string = "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n" +
                                "/johnsmith/photos/puppy.jpg"
    assert_equal expected_canonical_string, @s3_test.canonical_string
  end

  def test_dg_sample_two
    @s3_test.make_authenticated_request(:put, '/photos/puppy.jpg',
                                        {'Content-Type' => 'image/jpeg',
                                         'Content-Length' => '94328',
                                         'Host' => 'johnsmith.s3.amazonaws.com',
                                         'Date' => 'Tue, 27 Mar 2007 21:15:45 +0000'})
    expected_canonical_string = "PUT\n\nimage/jpeg\nTue, 27 Mar 2007 21:15:45 +0000\n" +
                                "/johnsmith/photos/puppy.jpg"
    assert_equal expected_canonical_string, @s3_test.canonical_string
  end

  def test_dg_sample_three
    @s3_test.make_authenticated_request(:get, '',
                                        {'prefix' => 'photos',
                                         'max-keys' => '50',
                                         'marker' => 'puppy',
                                         'host' => 'johnsmith.s3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 19:42:41 +0000'})
    assert_equal "GET\n\n\nTue, 27 Mar 2007 19:42:41 +0000\n/johnsmith/", @s3_test.canonical_string
  end

  def test_dg_sample_four
    @s3_test.make_authenticated_request(:get, '?acl',
                                        {'host' => 'johnsmith.s3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 19:44:46 +0000'})

    assert_equal "GET\n\n\nTue, 27 Mar 2007 19:44:46 +0000\n" +
                 "/johnsmith/?acl", @s3_test.canonical_string
  end

  def test_dg_sample_five
    @s3_test.make_authenticated_request(:delete, '/johnsmith/photos/puppy.jpg',
                                        {'User-Agent' => 'dotnet',
                                         'host' => 's3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 21:20:27 +0000',
                                         'x-amz-date' => 'Tue, 27 Mar 2007 21:20:26 +0000'})
    assert_equal "DELETE\n\n\n\nx-amz-date:Tue, 27 Mar 2007 21:20:26 +0000\n/johnsmith/photos/puppy.jpg", @s3_test.canonical_string
  end

  def test_dg_sample_six
    @s3_test.make_authenticated_request(:put,
                                        '/db-backup.dat.gz',
                                        {'User-Agent' => 'curl/7.15.5',
                                         'host' => 'static.johnsmith.net:8080',
                                         'date' => 'Tue, 27 Mar 2007 21:06:08 +0000',
                                         'x-amz-acl' => 'public-read',
                                         'content-type' => 'application/x-download',
                                         'Content-MD5' => '4gJE4saaMU4BqNR0kLY+lw==',
                                         'X-Amz-Meta-ReviewedBy' => ['[email protected]', '[email protected]'],
                                         'X-Amz-Meta-FileChecksum' => '0x02661779',
                                         'X-Amz-Meta-ChecksumAlgorithm' => 'crc32',
                                         'Content-Disposition' => 'attachment; filename=database.dat',
                                         'Content-Encoding' => 'gzip',
                                         'Content-Length' => '5913339'})
    expected_canonical_string = "PUT\n4gJE4saaMU4BqNR0kLY+lw==\napplication/x-download\n" +
                                "Tue, 27 Mar 2007 21:06:08 +0000\n" +
                                "x-amz-acl:public-read\nx-amz-meta-checksumalgorithm:crc32\n" +
                                "x-amz-meta-filechecksum:0x02661779\n" +
                                "x-amz-meta-reviewedby:[email protected],[email protected]\n" +
                                "/static.johnsmith.net/db-backup.dat.gz"
    assert_equal expected_canonical_string, @s3_test.canonical_string
  end

end

Now, let’s run them:

$> ruby test/canonical_string_tests.rb
Loaded suite test/canonical_string_tests
Started
......
Finished in 0.001213 seconds.

6 tests, 6 assertions, 0 failures, 0 errors

SUCCESS! Ahh, that feels good. Only one step remains before we have a fully working library: we have to take that canonical_string and encode it.

Signing the Request

The whole point of this authentication procedure is to digitally sign your request by creating the canonical_string and using it to create an Authorization header. The Authorization header looks like this:

Authorization = AWS <AWSAccessKeyId>:<Signature>

and the signature like this:

Signature = Base64( HMAC-SHA1( <AWSSecretAccessKey>, UTF-8-Encoding-Of( canonical_string ) ) )

We spent a lot of time figuring out how to make the canonical_string. The next step is much easier: we need to take that canonical_string and feed it through the algorithms to encode it. The method of encoding the canonical_string is highly language dependent, so I’m going to be lazy and point you to the Amazon S3 Getting Started Guide (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/²²). This page in the guide links to sample implementations in Java, C#, Perl, PHP, Ruby and Python: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/PreparingTheSamples.html²³. Here is some Ruby code that does the trick:

²²http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/ ²³http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/PreparingTheSamples.html

require 'base64'
require 'digest/sha1'
require 'openssl'

module S3Lib

  class AuthenticatedRequest

    def make_authenticated_request(verb, request_path, headers = {})
      @verb = verb
      @request_path = request_path.gsub(/^\//, '') # Strip off the leading '/'

      @amazon_id = ENV['AMAZON_ACCESS_KEY_ID']
      @amazon_secret = ENV['AMAZON_SECRET_ACCESS_KEY']

      @headers = headers.downcase_keys.join_values
      fix_date
      get_bucket_name
    end

    # .....

    def authorization_string
      generator = OpenSSL::Digest::Digest.new('sha1')
      encoded_canonical =
        Base64.encode64(OpenSSL::HMAC.digest(generator, @amazon_secret, canonical_string)).strip

      "AWS #{@amazon_id}:#{encoded_canonical}"
    end

  end
end

I’ve added the @amazon_id and @amazon_secret instance variables in the make_authenticated_request method, and added the authorization_string method that does all of the heavy lifting. All of the required libraries are included in the base Ruby distribution, so you should be able to just run this. Let’s write some unit tests to see if that really works. Luckily, we have the examples from the Amazon S3 Developer’s Guide to work with. The examples all use the same set of (fake) authentication

credentials.

Table 4.1. S3 Authentication Credentials used in the examples

Parameter            Value
AWSAccessKeyId       0PN5J17HBGZHT7JJ3X82
AWSSecretAccessKey   uV3F3YluFJax1cknvbcGwgjvx4QpvB+leU8dUj2o

We probably want to set these in the setup section of our tests. Remember that the S3Lib library is getting its credentials from the AMAZON_ACCESS_KEY_ID and AMAZON_SECRET_ACCESS_KEY environment variables, so we can set them using Ruby’s ENV hash:

def setup
  # The id and secret key are non-working credentials
  # from the S3 Developer's Guide
  # http://developer.amazonwebservices.com/connect/entry.jspa
  # ?externalID=123&categoryID=48
  ENV['AMAZON_ACCESS_KEY_ID'] = '0PN6J17HBGXHT7JJ3X82'
  ENV['AMAZON_SECRET_ACCESS_KEY'] = 'uV3F3YluFJax1cknvbcGwgjvx4QpvB+leU8dUj2o'
  @s3_test = S3Lib::AuthenticatedRequest.new
end

We can then re-write the tests from the previous section to include a test for the Authorization header, something like this:

require 'test/unit'
require File.join(File.dirname(__FILE__), '../s3_authenticator_dev')

class S3AuthenticatorTest < Test::Unit::TestCase

  def setup
    # The id and secret key are non-working credentials from the S3 Developer's Guide
    # See http://developer.amazonwebservices.com/connect/entry.jspa
    # ?externalID=123&categoryID=48
    ENV['AMAZON_ACCESS_KEY_ID'] = '0PN6J17HBGXHT7JJ3X82'
    ENV['AMAZON_SECRET_ACCESS_KEY'] = 'uV3F3YluFJax1cknvbcGwgjvx4QpvB+leU8dUj2o'
    @s3_test = S3Lib::AuthenticatedRequest.new
  end

  # See http://developer.amazonwebservices.com/connect/entry.jspa
  # ?externalID=123&categoryID=48
  def test_dg_sample_one
    @s3_test.make_authenticated_request(:get, '/photos/puppy.jpg',
                                        {'Host' => 'johnsmith.s3.amazonaws.com',
                                         'Date' => 'Tue, 27 Mar 2007 19:36:42 +0000'})
    expected_canonical_string = "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n" +
                                "/johnsmith/photos/puppy.jpg"
    assert_equal expected_canonical_string, @s3_test.canonical_string
    assert_equal "AWS 0PN6J17HBGXHT7JJ3X82:xXjDGYUmKxnwqr5KXNPGldn5LbA=",
                 @s3_test.authorization_string
  end

  # See http://developer.amazonwebservices.com/connect/entry.jspa
  # ?externalID=123&categoryID=48
  def test_dg_sample_two
    @s3_test.make_authenticated_request(:put, '/photos/puppy.jpg',
                                        {'Content-Type' => 'image/jpeg',
                                         'Content-Length' => '94328',
                                         'Host' => 'johnsmith.s3.amazonaws.com',
                                         'Date' => 'Tue, 27 Mar 2007 21:15:45 +0000'})
    expected_canonical_string = "PUT\n\nimage/jpeg\nTue, 27 Mar 2007 21:15:45 +0000\n" +
                                "/johnsmith/photos/puppy.jpg"
    assert_equal expected_canonical_string, @s3_test.canonical_string
    assert_equal "AWS 0PN6J17HBGXHT7JJ3X82:hcicpDDvL9SsO6AkvxqmIWkmOuQ=",
                 @s3_test.authorization_string
  end

  def test_dg_sample_three
    @s3_test.make_authenticated_request(:get, '',
                                        {'prefix' => 'photos',
                                         'max-keys' => '50',
                                         'marker' => 'puppy',
                                         'host' => 'johnsmith.s3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 19:42:41 +0000'})
    assert_equal "GET\n\n\nTue, 27 Mar 2007 19:42:41 +0000\n/johnsmith/",
                 @s3_test.canonical_string
    assert_equal 'AWS 0PN6J17HBGXHT7JJ3X82:jsRt/rhG+Vtp88HrYL706QhE4w4=',
                 @s3_test.authorization_string
  end

  def test_dg_sample_four
    @s3_test.make_authenticated_request(:get, '?acl',
                                        {'host' => 'johnsmith.s3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 19:44:46 +0000'})

    assert_equal "GET\n\n\nTue, 27 Mar 2007 19:44:46 +0000\n/johnsmith/?acl",
                 @s3_test.canonical_string
    assert_equal 'AWS 0PN6J17HBGXHT7JJ3X82:thdUi9VAkzhkniLj96JIrOPGi0g=',
                 @s3_test.authorization_string
  end

  def test_dg_sample_five
    @s3_test.make_authenticated_request(:delete,
                                        '/johnsmith/photos/puppy.jpg',
                                        {'User-Agent' => 'dotnet',
                                         'host' => 's3.amazonaws.com',
                                         'date' => 'Tue, 27 Mar 2007 21:20:27 +0000',
                                         'x-amz-date' => 'Tue, 27 Mar 2007 21:20:26 +0000'})
    assert_equal "DELETE\n\n\n\nx-amz-date:Tue, 27 Mar 2007 21:20:26 +0000\n" +
                 "/johnsmith/photos/puppy.jpg",
                 @s3_test.canonical_string
    assert_equal 'AWS 0PN6J17HBGXHT7JJ3X82:k3nL7gH3+PadhTEVn5Ip83xlYzk=',
                 @s3_test.authorization_string
  end

  def test_dg_sample_six
    @s3_test.make_authenticated_request(:put,
                                        '/db-backup.dat.gz',
                                        {'User-Agent' => 'curl/7.15.5',
                                         'host' => 'static.johnsmith.net:8080',
                                         'date' => 'Tue, 27 Mar 2007 21:06:08 +0000',
                                         'x-amz-acl' => 'public-read',
                                         'content-type' => 'application/x-download',
                                         'Content-MD5' => '4gJE4saaMU4BqNR0kLY+lw==',
                                         'X-Amz-Meta-ReviewedBy' => ['[email protected]', '[email protected]'],
                                         'X-Amz-Meta-FileChecksum' => '0x02661779',
                                         'X-Amz-Meta-ChecksumAlgorithm' => 'crc32',
                                         'Content-Disposition' => 'attachment; filename=database.dat',
                                         'Content-Encoding' => 'gzip',
                                         'Content-Length' => '5913339'})
    expected_canonical_string = "PUT\n4gJE4saaMU4BqNR0kLY+lw==\napplication/x-download\n" +
                                "Tue, 27 Mar 2007 21:06:08 +0000\n" +
                                "x-amz-acl:public-read\nx-amz-meta-checksumalgorithm:crc32\n" +
                                "x-amz-meta-filechecksum:0x02661779\n" +
                                "x-amz-meta-reviewedby:[email protected],[email protected]\n" +
                                "/static.johnsmith.net/db-backup.dat.gz"
    assert_equal expected_canonical_string, @s3_test.canonical_string
    assert_equal 'AWS 0PN6J17HBGXHT7JJ3X82:C0FlOtU8Ylb9KDTpZqYkZPX91iI=',
                 @s3_test.authorization_string
  end

end
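If you want to sanity-check the HMAC step outside of the test suite, here is a minimal irb-friendly sketch that signs the first Developer's Guide example by hand; the expected value is the one asserted in test_dg_sample_one (on older Rubies you may need OpenSSL::Digest::Digest.new('sha1') instead of OpenSSL::Digest.new('sha1')):

require 'base64'
require 'openssl'

canonical_string = "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n/johnsmith/photos/puppy.jpg"
secret = 'uV3F3YluFJax1cknvbcGwgjvx4QpvB+leU8dUj2o'

# HMAC-SHA1 the canonical string with the secret key, then Base64-encode the result
signature = Base64.encode64(OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                                 secret, canonical_string)).strip
# => "xXjDGYUmKxnwqr5KXNPGldn5LbA="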

Making the Request

We’ve finally got to the point where we have a signature that we can use to sign the request. The last step will be to actually make the request. We’ll be using the rest-open-uri library to do this, so if you haven’t already installed it, do a

$> sudo gem install rest-open-uri

from the command line (omit the sudo if you’re on Windows) to get it installed. Once you have the rest-open-uri gem installed, making the request is simple. open-uri is in the Ruby Standard Library. It extends the Kernel::open method so that any filename starting with a URL scheme (like “http://”) is opened as a URL. So, to open a URL, you just use something like this:

require 'open-uri'
reddit = open('http://reddit.com')
puts reddit.readlines

rest-open-uri (http://rubyforge.org/projects/rest-open-uri/²⁴) is a library by Leonard Richardson that extends open-uri, adding support for all of the RESTful verbs. You make a PUT request by adding :method => :put to the headers hash when you make a call.

require 'rubygems'
require 'rest-open-uri'

# PUT to http://example.com/some_resource
open('http://example.com/some_resource', :method => :put)
# DELETE http://example.com/deleteable_resource
open('http://example.com/deleteable_resource', :method => :delete)

To make the request, then, we just need to require the rest-open-uri library and then add the following line to the make_authenticated_request method:

req = open(uri, @headers.merge(:method => @verb,
                               'Authorization' => authorization_string))
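Put together, make_authenticated_request now looks roughly like this. This is only a sketch: the way the URI is built from the host header (or the default HOST) is my assumption here, not something spelled out above, so adapt it to however your library constructs its URLs.

require 'rubygems'
require 'rest-open-uri'

def make_authenticated_request(verb, request_path, headers = {})
  @verb = verb
  @request_path = request_path.gsub(/^\//, '') # Strip off the leading '/'

  @amazon_id = ENV['AMAZON_ACCESS_KEY_ID']
  @amazon_secret = ENV['AMAZON_SECRET_ACCESS_KEY']

  @headers = headers.downcase_keys.join_values
  fix_date
  get_bucket_name

  # Assumed: send the request to the host header if one was given
  # (virtual hosting), otherwise to the default S3 endpoint.
  uri = "http://#{@headers['host'] || HOST}/#{@request_path}"

  req = open(uri, @headers.merge(:method => @verb,
                                 'Authorization' => authorization_string))
end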

Okay, let’s try this sucker out! First, make sure that you have actually set your environment variables correctly so that you can authenticate to S3. AMAZON_ACCESS_KEY_ID should be set to your Amazon ID and AMAZON_SECRET_ACCESS_KEY to your Amazon Secret Key. On OS X or Unix, you can see the environment by typing env at the command line

$> env | grep AMAZON
AMAZON_ACCESS_KEY_ID=your_amazon_access_key_which_is_a_bunch_of_numbers
AMAZON_SECRET_ACCESS_KEY=your_secret_amazon_access_key
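If they aren’t set yet, on OS X or Linux with a bash-style shell you can set them with something along these lines (the values are placeholders, of course):

$> export AMAZON_ACCESS_KEY_ID=your_amazon_access_key
$> export AMAZON_SECRET_ACCESS_KEY=your_secret_amazon_access_key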

Okay, now that we’re sure about the authentication, let’s go do some testing

²⁴http://rubyforge.org/projects/rest-open-uri/

$> irb
>> require 's3_authenticator.rb'
=> true
>> S3Lib.request(:get, '/spatten_presentations')
=> #<StringIO:0x...>
>> S3Lib.request(:get, '/spatten_presentations').read
=> "<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>spatten_presentations</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>ploticus_dsl.pdf</Key>
    <LastModified>2007-09-12T16:09:24.000Z</LastModified>
    <ETag>"94ca8590f028f8be0310bd5b2fabafdc"</ETag>
    <Size>509594</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>s3-on-rails.pdf</Key>
    <LastModified>2007-12-05T19:38:32.000Z</LastModified>
    <ETag>"891b100f53155b8570bc5e25b1e10f97"</ETag>
    <Size>184748</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>"

Hey, it works! Notice that the first request we did just returned a StringIO object. That’s what the open command returns. To get at the body of the response, we use the read method on the StringIO object.

Note

If you want to read an IO object more than once, you need to rewind it between reads. Like this:

request.read
request.rewind
request.read

Notice that the listing shows the bucket we requested (spatten_presentations), along with some information about that bucket and a listing of all of the objects in that bucket. We’ll be talking more about the XML and how to parse it in the S3 API Recipes coming up shortly.

Over-riding the open method

We don’t want to have our tests call S3 all of the time. It really slows things down and means we can’t test when we’re away from an internet connection. To fix this, we can just over-ride the open method in the S3Lib::AuthenticatedRequest library. I did it like this:

module S3Lib
  class AuthenticatedRequest

    # Over-ride RestOpenURI#open
    def open(uri, headers)
      {:uri => uri, :headers => headers}
    end

  end
end

As you can see, it just returns a hash showing the parameters passed in, and never makes a call to the internet.
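In a test file you might load this stub after the real library and then assert against the hash it hands back; a rough sketch, assuming make_authenticated_request returns whatever open returns (as in the earlier sketch):

def test_request_is_signed_without_touching_the_network
  result = @s3_test.make_authenticated_request(:get, '/some_bucket')
  assert result[:headers].has_key?('Authorization')
  assert_match /^AWS /, result[:headers]['Authorization']
end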

Error Handling

We now have a working authentication library, and we are almost ready to actually start talking to S3. There’s one more thing we should do, however, that will make our lives much easier and save us tons of time while we’re building the rest of the S3 library. We need to add in some error handling. To illustrate why, let’s try making a new object in a bucket using the current library.

$> irb
>> require 'code/s3_code/library/s3_authenticator'
=> true
>> S3Lib.request(:put, "spatten_sample_bucket")
=> #<StringIO:0x...>
>> S3Lib.request(:put, "spatten_sample_bucket/sample_object", :body => "this is a test")
OpenURI::HTTPError: 403 Forbidden
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:320:in `open_http'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:659:in `buffer_open'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:194:in `open_loop'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:192:in `catch'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:192:in `open_loop'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:162:in `open_uri'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:561:in `open'
from /opt/local/lib/ruby/gems/1.8/gems/rest-open-uri-1.0.0/lib/rest-open-uri.rb:35:in `open'
from ./code/s3_code/library/s3_authenticator.rb:70:in `make_authenticated_request'
from ./code/s3_code/library/s3_authenticator.rb:37:in `request'
from (irb):3
>> puts req.body
NoMethodError: undefined method `body' for nil:NilClass
from (irb):5

What’s going on here? I can make a PUT request to create a bucket, but I can’t make a PUT to create an object. Hmmm. There’s really no way to figure out what’s going on, either, as the failed request just leaves us with nil. Luckily, Amazon returns some information about what the error was. What we need to do is trap the error as it occurs and grab the information from it. Looking at the error more closely, we see that the OpenURI library is raising an OpenURI::HTTPError. Let’s add some code to the S3Lib::request method to trap that error and see what information we can extract from it.

def self.request(verb, request_path, headers = {})
  begin
    s3requester = AuthenticatedRequest.new()
    req = s3requester.make_authenticated_request(verb, request_path, headers)
  rescue OpenURI::HTTPError => e
    puts "Status: #{e.io.status.join(",")}"
    puts "Error From Amazon:\n#{e.io.read}"
    puts "canonical string you signed:\n#{s3requester.canonical_string}"
  end
end

Trying to make the object again gives us a bit more diagnostic feedback (I reformatted the Amazon error response a bit to make it more readable)

$> irb -r 'code/s3_code/library/s3_authenticator'
>> S3Lib.request(:put, "spatten_sample_bucket/sample_object",
                 :body => "this is a test")
Status: 403,Forbidden
Error From Amazon:

SignatureDoesNotMatch
The request signature we calculated does not match the signature
you provided. Check your key and signing method.
7BD4FADF07973DEA
redacted
redacted
195MGYF7J3AC7ZPSHVR2
Baq4uDiuK3jU7Xf3R35sOLYrdFZBASP/e0ncdUdvUX1BJ5HEh58ojC7/WRKXjc/c

PUT

application/x-www-form-urlencoded
Thu, 20 Mar 2008 18:17:40 GMT
/spatten_sample_bucket/sample_object

canonical string you signed:
PUT


Thu, 20 Mar 2008 18:17:40 GMT
/spatten_sample_bucket/sample_object

Ah-ha! Notice that the StringToSign that Amazon is returning has a content-type header of “application/x-www-form-urlencoded”. We didn’t provide a content-type header at all, so we didn’t include it in our canonical_string. It looks like one of the Ruby libraries we’re using was a little too clever and inserted the content-type for us. Let’s try adding our own content-type header. Hopefully that will work.

Warning

Having extra headers added by a library is a pretty common occurrence. If you are having trouble getting your authentication library working, make sure you check that there aren’t any unexpected headers in the string that Amazon is expecting you to sign.

$> irb -r 'code/s3_code/library/s3_authenticator'
>> req = S3Lib.request(:put, "spatten_sample_bucket/sample_object",
                       "content-type" => "text/plain",
                       :body => "this is a test")
=> #<StringIO:0x...>
>> puts req.status
200
OK

That looks promising. No errors raised, and a status of 200 OK. Let’s list the objects in the bucket and make sure everything is okay.

>> req = S3Lib.request(:get, "spatten_sample_bucket")
=> #<StringIO:0x...>
>> puts req.read
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>spatten_sample_bucket</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>sample_object</Key>
    <LastModified>2008-03-20T18:26:01.000Z</LastModified>
    <ETag>"54b0c58c7ce9f2a8b551351102ee0938"</ETag>
    <Size>14</Size>
    <Owner>
      <ID>9d92623ba6dd9d7cc06a7b8bcc46381e7c646f72d769214012f7e91b50c0de0f</ID>
      <DisplayName>scottpatten</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>
=> nil
>> req = S3Lib.request(:get, "spatten_sample_bucket/sample_object")
=> #<StringIO:0x...>
>> puts req.read
this is a test
=> nil

That looks perfect: we have a new object in the bucket called “sample_object”, and getting that object gives us back the expected object contents. We obviously don’t want to leave the error handling as is. Catching all errors and just printing out some information is decidedly sub-optimal. Let’s fix it up by creating a S3ResponseError class and initializing it with some information that will be useful for figuring out what went wrong. We’ll also make sure to add the error type given by Amazon (which was SignatureDoesNotMatch in our example above) so that we can use that to raise a more specific error type in our library.

module S3Lib

  def self.request(verb, request_path, headers = {})
    begin
      s3requester = AuthenticatedRequest.new()
      req = s3requester.make_authenticated_request(verb, request_path, headers)
    rescue OpenURI::HTTPError => e
      raise S3Lib::S3ResponseError.new(e.message, e.io, s3requester)
    end
  end

  class S3ResponseError < StandardError
    attr_reader :response, :amazon_error_type, :status, :s3requester, :io

    def initialize(message, io, s3requester)
      @io = io
      # Get the response and status from the IO object
      @io.rewind
      @response = @io.read
      @io.rewind
      @status = io.status

      # The Amazon Error type will always look like
      # <Code>AmazonErrorType</Code>. Find it with a RegExp.
      @response =~ /<Code>(.*)<\/Code>/
      @amazon_error_type = $1

      # Make the AuthenticatedRequest instance available as well
      @s3requester = s3requester

      # Call the standard Error initializer
      # if you put '%s' in the message it will be
      # replaced by the amazon_error_type
      super(message % @amazon_error_type)
    end
  end
end

Note that the S3Lib::request method rescues any OpenURI::HTTPError errors and re-raises them as S3Lib::S3ResponseError errors, passing in the IO object and the AuthenticatedRequest instance to the error. We can use this new error class to do something like this if we just want to output some info:

#!/usr/bin/env ruby

require File.join(File.dirname(__FILE__), 's3_authenticator')

begin
  req = S3Lib.request(:put, "spatten_sample_bucket/sample_object",
                      :body => "Wheee")
rescue S3Lib::S3ResponseError => e
  puts "Amazon Error Type: #{e.amazon_error_type}"
  puts "HTTP Status: #{e.status.join(',')}"
  puts "Response from Amazon: #{e.response}"
  if e.amazon_error_type == 'SignatureDoesNotMatch'
    puts "canonical string: #{e.s3requester.canonical_string}"
  end
end

In the recipes in the rest of this section, we will be creating new error types and raising them based on the amazon_error_type of the raised S3ResponseError.
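For example, a first cut at that mapping might look something like this (my sketch, not the library’s final code; the class names and the raise_specific_error helper are hypothetical):

module S3Lib

  class BucketNotFoundError < S3ResponseError; end
  class SignatureDoesNotMatchError < S3ResponseError; end

  # Map Amazon's error codes to more specific Ruby error classes.
  ERROR_CLASSES = {
    'NoSuchBucket'          => BucketNotFoundError,
    'SignatureDoesNotMatch' => SignatureDoesNotMatchError
  }

  def self.raise_specific_error(error)
    klass = ERROR_CLASSES[error.amazon_error_type] || S3ResponseError
    raise klass.new(error.message, error.io, error.s3requester)
  end
end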