Amazon S3 provides a good distributed storage solution for accessing the distributed computing power of Amazon EC2. Using Ruby scripts like s3sync and s3cmd at the command line, you can move data to and from EC2 instances in your computing cloud.

Introducing Amazon S3 and Amazon EC2

The distributed computing power of the Amazon Elastic Compute Cloud (Beta) (Amazon EC2™) isn't going to do you much good unless you can get data to and from each of the Amazon EC2 instances in your computing cloud. Amazon's Simple Storage Service (Amazon S3) provides a good distributed storage solution for doing this. Plus, Amazon does not charge for Amazon EC2 instances to read and write data from Amazon S3 buckets.

That's all well and good, but how do we get Amazon EC2 instances to read and write files to Amazon S3? I tried several different approaches while writing this article, and the easiest turned out to be a set of Ruby command-line scripts called s3sync. In this article, I'll show you how to set up an Amazon EC2 image using one of Amazon's Fedora Core 4 images, then show you how to install and use the s3sync code to access files from Amazon S3.

Before you begin, make sure you have the Amazon EC2 command-line utilities installed. Instructions on how to do this are available on Amazon's EC2 site. Amazon also has a complete Getting Started Guide for its EC2 web service, which is what I used to help me as I was writing this article. It clearly describes how to install the tools, generate a key set, and build Amazon EC2 instances.
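
For reference, the EC2 tools are Java-based and expect a few environment variables pointing at your installation and your X.509 key pair. Here is a minimal sketch; the pk-XXXX.pem and cert-XXXX.pem filenames are placeholders for the files you generate, and your JDK location will differ:

% export JAVA_HOME=/usr/java/jdk
% export EC2_HOME=~/ec2-api-tools
% export PATH=$PATH:$EC2_HOME/bin
% export EC2_PRIVATE_KEY=~/.ec2/pk-XXXX.pem
% export EC2_CERT=~/.ec2/cert-XXXX.pem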

Next, you have to determine the operating system image you are going to run on the Amazon EC2 instances. For the purposes of this article we are going to use one of the images provided by Amazon. Let's have a look at what those are by using the ec2-describe-images command-line utility:

% ec2-describe-images -o amazon
IMAGE ami-20b65349 ec2-public-images/fedora-core4-base.manifest.xml amazon available public
IMAGE ami-22b6534b ec2-public-images/fedora-core4-mysql.manifest.xml amazon available public
IMAGE ami-23b6534a ec2-public-images/fedora-core4-apache.manifest.xml amazon available public
IMAGE ami-25b6534c ec2-public-images/fedora-core4-apache-mysql.manifest.xml amazon available public
IMAGE ami-26b6534f ec2-public-images/developer-image.manifest.xml amazon available public
IMAGE ami-2bb65342 ec2-public-images/getting-started.manifest.xml amazon available public
IMAGE ami-bd9d78d4 ec2-public-images/demo-paid-AMI.manifest.xml amazon available public

I'll choose the fedora-core4-apache-mysql operating system image, because that's the kind of thing I would get from a hosting company, and it's sure to be full of useful utilities. I'll run an instance of that image using the following command at the command line:

% ec2-run-instances ami-25b6534c -k gsg-keypair
RESERVATION r-e349af8a 961421114855 default
INSTANCE i-59c02230 ami-25b6534c pending gsg-keypair 0

After the image has booted, the instance is assigned a public hostname. I'll look it up by using the ec2-describe-instances command:

% ec2-describe-instances
RESERVATION r-e349af8a 961421114855 default
INSTANCE i-59c02230 ami-25b6534c ec2-72-44-57-99.z-1.compute-1.amazonaws.com domU-12-31-36-00-3D-83.z-1.compute-1.internal running gsg-keypair 0

Now I have a machine running Fedora Core 4 with a lot of handy stuff installed on it. The next step is to log in to the Amazon EC2 instance I just created, using the hostname provided by ec2-describe-instances.

% ssh -i ~/.ec2/id_rsa-gsg-keypair root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com
The authenticity of host 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com (72.44.57.99)' can't be established.
RSA key fingerprint is f1:4e:d1:14:87:f0:57:71:89:6e:ed:b5:1c:14:84:b5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com,72.44.57.99' (RSA) to the list of known hosts.

         __|  __|_  )  Rev: 2
         _|  (     /
        ___|\___|___|

Welcome to an EC2 Public Image
:-)

Apache2+MySQL4


__ c __ /etc/ec2/release-notes.txt

[root@domU-12-31-36-00-3D-83 ~]#

Now the instance is running, you're logged in, and you're ready to install the s3sync scripts and work through the examples.

Installing S3Sync

S3sync is the Ruby package I will use to add, update, remove, and list files on the Amazon S3 servers. To do that I will first need to ensure that Ruby is installed, then get the s3sync package and set it up.

To check the Ruby version, I use the following command line:

[root@domU-12-31-36-00-3D-83 ~]# ruby -v
ruby 1.8.4 (2005-12-24) [i386-linux]
[root@domU-12-31-36-00-3D-83 ~]#

This tells me the instance is running a recent version of Ruby (1.8.4), which is recent enough to run the s3sync scripts. This should do nicely.

There are two ways that I can get the s3sync code. The first is to go to the s3sync web site (http://s3sync.net/wiki) and download it to my local computer. I would then copy it to the Amazon EC2 instance. To do that, I would use this command:

% scp -i ~/.ec2/id_rsa-gsg-keypair s3sync.tar.gz root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com:/root

The s3sync.tar.gz file would then be located in my home directory on the Amazon EC2 machine.

I can also download the file directly from the Amazon EC2 instance using the following command:

[root@domU-12-31-36-00-3D-83 ~]# wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
--18:31:18-- http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
=> `s3sync.tar.gz'
Resolving s3.amazonaws.com... 72.21.206.171
Connecting to s3.amazonaws.com|72.21.206.171|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26,667 (26K) []

100%[============================================================================>] 26,667 --.--K/s

18:31:19 (3.21 MB/s) - `s3sync.tar.gz' saved [26667/26667]

[root@domU-12-31-36-00-3D-83 ~]#

Either way, once s3sync.tar.gz is on the Amazon EC2 instance, the next thing to do is unpack it:

[root@domU-12-31-36-00-3D-83 ~]# tar -xzvf s3sync.tar.gz 
s3sync/
s3sync/HTTPStreaming.rb
s3sync/README.txt
s3sync/README_s3cmd.txt
s3sync/S3.rb
s3sync/s3cmd.rb
s3sync/s3config.rb
s3sync/s3config.yml.example
s3sync/S3encoder.rb
s3sync/s3sync.rb
s3sync/s3try.rb
s3sync/S3_s3sync_mod.rb
s3sync/thread_generator.rb
[root@domU-12-31-36-00-3D-83 ~]#

Now all of the s3sync files are in a subdirectory, and I can start moving files to and from my Amazon S3 bucket. But before I do that, I have to change into the s3sync directory and set two environment variables:

[root@domU-12-31-36-00-3D-83 s3sync]# AWS_ACCESS_KEY_ID=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_ACCESS_KEY_ID
[root@domU-12-31-36-00-3D-83 s3sync]# AWS_SECRET_ACCESS_KEY=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_SECRET_ACCESS_KEY

Change the xxxx values to your own Access Key ID and Secret Access Key (available from the Amazon Web Services site).
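
These exports last only for the current shell session. If you don't want to retype them each time you log in to the instance, one option is to append them to root's .bash_profile, with xxxx again standing in for your real keys:

# echo "export AWS_ACCESS_KEY_ID=xxxx" >> ~/.bash_profile
# echo "export AWS_SECRET_ACCESS_KEY=xxxx" >> ~/.bash_profile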

If everything is working properly, you should be able to use the s3cmd.rb script to list all available buckets:

[root@domU-12-31-36-00-3D-83 s3sync]# ./s3cmd.rb listbuckets
jherr_video
[root@domU-12-31-36-00-3D-83 s3sync]#

To test this I'm going to create a test bucket. If you aren't familiar with Amazon S3 buckets, a bucket is similar to a disk drive. You can have as many buckets as you like, each with a unique name and each containing its own set of directories and files.

I'll create a bucket for this article using the following command:

# ./s3cmd.rb createbucket art072407
#

Then, I check to see whether it worked by using the listbuckets command again:

# ./s3cmd.rb listbuckets           
art072407
jherr_video
#

Now I can list the contents of the bucket using the list command.

# ./s3cmd.rb list art072407
--------------------
#

The output tells me there is nothing in the bucket. So let's put something in it. Just to test it, I'll put the Readme.txt file that comes with the s3sync code into the bucket.

# ./s3cmd.rb put art072407:Readme.txt Readme.txt 
#

The put command copies the file to the Amazon S3 bucket. The first parameter after the put command is the bucket and the key name. The bucket name comes before the colon, and the key name comes after it. In Amazon S3 terms, files are "keys" because, really, Amazon S3 can store any bit of data. Normally, though, your key will be the same as your file name. The last parameter is the name of the local file to copy.
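
Because a key is just a name, it can contain slashes to mimic a directory path, and s3cmd.rb passes any trailing Name:value arguments along as HTTP headers (you'll see this below with x-amz-acl). As a hypothetical sketch, not something I run in this walkthrough, a command like this would store the file under a path-like key with an explicit content type; the docs/ prefix and the Content-Type header are my own illustration:

# ./s3cmd.rb put art072407:docs/Readme.txt Readme.txt Content-Type:text/plain
#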

I can then use the list command to see that the file is now in the bucket:

# ./s3cmd.rb list art072407
--------------------
Readme.txt
#

One great thing about Amazon S3 is that all uploaded files are available as URLs from a web browser (or any application that can read a URL). The format of the URL is as follows:

http://<bucket name>.s3.amazonaws.com/<key name>

In the case of this example, the URL is:

http://art072407.s3.amazonaws.com/Readme.txt

But if I go to the URL at this point, I'll get a message telling me that access to that resource is denied because, by default, uploaded data is not publicly accessible. To make it publicly accessible, we have to add the x-amz-acl header to the put command:

# ./s3cmd.rb put art072407:Readme.txt Readme.txt x-amz-acl:public-read
#

Now, if I go back to that URL in my web browser, Amazon S3 will happily show me the Readme.txt file.

To remove the file from the bucket, I run the delete command:

# ./s3cmd.rb delete art072407:Readme.txt
#

Or, to delete everything in the bucket, I run the deleteall command:

# ./s3cmd.rb deleteall art072407
#

As noted above, you can use a URL to get to the data if the Amazon S3 key (the file) is designated as public. To get public data, you can use the following command:

# wget http://art072407.s3.amazonaws.com/Readme.txt
...

But what if the data is private? For that, I use the handy get command that comes with s3cmd.rb:

# ./s3cmd.rb get art072407:Readme.txt Out.txt
#

This command takes the Readme.txt file from the Amazon S3 bucket and copies it to the local file Out.txt.

S3Sync

So far I've worked only with reading and writing a single file from the Amazon S3 bucket. What about entire directories of files, with nested subdirectories, and so on? The s3sync package has a solution for that as well: the s3sync.rb command synchronizes whole directory structures with Amazon S3 buckets.

To begin, I'll create a new directory called /root/data and copy the s3sync files into it, just as an example:

# mkdir /root/data
# cp /root/s3sync/* /root/data
#

Now, I'll clear out the Amazon S3 bucket and copy the directory to it using s3sync:

# ./s3cmd.rb deleteall art072407
# ./s3sync.rb -r /root/data/ art072407:/
#

When I list the article bucket now, I can see all the original files:

# ./s3cmd.rb list art072407
--------------------
HTTPStreaming.rb
Readme.txt
Readme_s3cmd.txt
S3.rb
S3_s3sync_mod.rb
S3encoder.rb
s3cmd.rb
...
#

Next, I can remove all of the files from the /root/data directory and re-sync them from Amazon S3 using s3sync. First, to remove them, I use the following command:

# rm /root/data/*
#

Now, to re-sync from the Amazon S3 bucket I run:

# ./s3sync.rb -r art072407: /root/data
# ls -la /root/data/
total 120
drwxr-xr-x 2 root root 4096 Jul 24 11:59 .
drwxr-x--- 5 root root 4096 Jul 24 11:48 ..
-rwxr-xr-x 1 root root 3427 Jul 24 11:59 HTTPStreaming.rb
-rwxr-xr-x 1 root root 12775 Jul 24 11:59 Readme.txt
-rwxr-xr-x 1 root root 4525 Jul 24 11:59 Readme_s3cmd.txt
...
#

Now I can get and put whole directories of data using Amazon S3 from my Amazon EC2 instance.
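
Both directions of the sync can be tuned with command-line options. For example, if memory serves (check the README.txt in your copy of s3sync for the authoritative list), --public-read uploads keys as world-readable, and --delete removes destination items that no longer exist on the source side:

# ./s3sync.rb -r --public-read /root/data/ art072407:/
# ./s3sync.rb -r --delete art072407: /root/data
#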

To finish, I'm going to delete the contents of the bucket, and then delete the bucket itself:

# ./s3cmd.rb deleteall art072407   
# ./s3cmd.rb deletebucket art072407

Finally, I'll terminate the Amazon EC2 instance I used for testing:

% ec2-terminate-instances i-59c02230
%

And there you have it: Amazon Simple Storage Service (Amazon S3) access, direct from one of the standard Fedora Core 4 Amazon images, with just some simple Ruby scripts and a few environment variables.

Conclusion

Amazon S3 provides a powerful mechanism for moving data between Amazon EC2 instances, and for moving data to and from Amazon EC2 instances for distributed processing. Because all of the languages supported by the Amazon Fedora Core 4 images can access the command line, it's easy to invoke these commands from within page code or batch processors to get and put data from the Amazon S3 buckets.
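
As a final sketch of that idea, a minimal batch job, runnable from cron, might look like the following. The bucket name and paths are placeholders, and xxxx again stands in for your real keys:

#!/bin/sh
# nightly-backup.sh -- a hypothetical batch job that pushes the
# web root to an Amazon S3 bucket using s3sync.
export AWS_ACCESS_KEY_ID=xxxx
export AWS_SECRET_ACCESS_KEY=xxxx
/root/s3sync/s3sync.rb -r /var/www/html/ art072407:backup/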

Jack Herrington is the author of several books, including Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written over 50 articles on technical topics, many of which use PHP. Jack is a PHP and AJAX columnist for IBM developerWorks, and the editor of the AJAX Forum on the IBM developerWorks web site.
