Generating Thousands of PDFs on EC2 with Ruby

December 23 2009 by Sean Cribbs

The Problem

For about two months, we’ve been working on a static website that exposes the results of a complicated economics model to non-economists. We decided to make the site static because of the overhead involved in computing the results and the proprietary nature of the model. We would simply pre-generate the output for all valid permutations of the inputs. The visitor could then choose her inputs from a questionnaire, click a button and immediately be shown the results.

The caveat of this decision is that in addition to the numerical outputs, three graphs and a summary (both in HTML and PDF) would need to be generated for each permutation. Since there were 3600 permutations, this would amount to 18000 files in total. Initial local runs of our generation process took about 30 seconds for each permutation, mostly due to embedding the graph images into the PDF. On a single machine, that would take 30 hours of uninterrupted processing! Clearly, this was a job for “the cloud”.
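As a quick check of the numbers above, here is a small sketch that recomputes the file count and the single-machine estimate. The variable names are only illustrative; the figures are the ones quoted in the paragraph.

```ruby
# Back-of-the-envelope figures quoted in the paragraph above.
permutations          = 3600
files_per_permutation = 3 + 2   # three graphs, plus the HTML and PDF summaries
seconds_per_run       = 30      # per-permutation time from the initial local runs

total_files = permutations * files_per_permutation
total_hours = permutations * seconds_per_run / 3600.0

puts "#{total_files} files, about #{total_hours} hours on one machine"
```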

The Tools

Before we get into a discussion of the process of configuring and running the jobs, here’s an overview of the tools we used to tackle the problem.

We initially considered using Amazon’s Elastic MapReduce to run the generation jobs, but it requires Java and Hadoop, and we had already invested a lot of time in our Ruby tool chain. It is also nigh impossible to automatically install Ruby and ImageMagick on an EMR node. Thus, we decided to use vanilla EC2 with the tools shown below.

Prawn

Prawn is the new kid in town for generating PDF in Ruby. Prawn is pretty well-written and easy to start using, and greatly improves on PDF::Writer.

Gruff

Gruff was not the most obvious choice for this project. We liked the flexibility and hackability of Scruffy, but translating its output to PDF was a nightmare and there were some strange inconsistencies in it. In the end, Gruff proved fast, reliable, and simple. The major caveat, as described above, is that embedding images in Prawn is orders of magnitude slower than simply drawing on the canvas.

Haml, Sass, Compass

Haml has been around for 3 years now. Many people cringe at the indentation-sensitive syntax, but it prevents so much frustration that it was a good fit for the project. Naturally, we also used its cousin Sass, and the new-ish CSS/Sass meta-framework Compass. The combination of these three made it really quick to get started with the static site and make design changes as we iterated.

Chef

You may have already heard of the awesome configuration management tool, Chef. Chef allows you to ensure consistent configuration of your servers using a nice Ruby DSL and a huge library of community-developed “cookbooks” that covers many common use-cases. We were given the chance to try out an alpha of their “Chef Platform”, which is essentially a scalable, hosted, multi-tenant version of the server component of Chef and uses the pre-release version of Chef 0.8. With that, “knife”–the new CLI tool for interacting with the Chef server API–and the custom Opscode AMI, we were well-equipped to quickly deploy a bunch of EC2 nodes. We’ll talk more about the details of the Chef recipes below.

AMQP and RabbitMQ

What’s the best way to distribute a bunch of one-time jobs to a slew of independent machines? A message queue, of course! Despite the version packaged with Ubuntu 9.04 being pretty old, we chose RabbitMQ, having used it on another project. AMQP is also well supported in Ruby.

The Process

Preparing

The first step to start our processing job was to get the data up to S3. You could do this any number of ways, but we created a bucket solely for the data and uploaded all 3600 CSV files with a desktop client.

Next, we created the scripts for the workers and the job initiator. We would potentially need to run the process multiple times, so we chose Aman Gupta’s EventMachine-based AMQP client.

Here’s the worker script, which was set up as a daemon using runit:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__),'..','lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'
require 'custom_libraries'

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM'){ AMQP.stop{ EM.stop } }

AMQP.start(:host => ARGV.shift) do
 MQ.prefetch(1)
 MQ.queue('jobs').bind(MQ.direct('jobs')).subscribe do |header, body|
   GenerationJob.new(body).generate
 end
end

Basically, it connects to the RabbitMQ host specified on the command line, subscribes to the job queue, and starts processing messages.
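The pattern at work is that several workers consume from one shared queue, and each job goes to exactly one of them. The sketch below is only an in-process analogy using Ruby's standard Queue and threads, not RabbitMQ or the AMQP client; the file names and the worker count are made up for illustration.

```ruby
require 'thread'

# In-process stand-in for the 'jobs' queue; the real setup uses RabbitMQ.
jobs = Queue.new
20.times { |i| jobs << "dataset-#{i}.csv" }   # hypothetical file names

done = Queue.new       # thread-safe log of [worker id, job] pairs
worker_count = 4       # arbitrary; each thread plays the part of one worker

workers = (1..worker_count).map do |id|
  Thread.new do
    loop do
      begin
        job = jobs.pop(true)   # non-blocking pop raises ThreadError when empty
      rescue ThreadError
        break                  # queue drained, this worker is finished
      end
      done << [id, job]        # a real worker would generate the output here
    end
  end
end
workers.each(&:join)

processed = []
processed << done.pop until done.empty?
puts "#{processed.size} jobs handled by #{processed.map(&:first).uniq.size} worker(s)"
```

However the jobs end up being shared out between the threads, every file name is handled once and only once, which is the property that makes a queue a good fit for one-time jobs.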

The job initiation script is almost as simple:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__),'..','lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'

AWSID = (ENV['AMAZON_ACCESS_KEY_ID'] || 'XXXXXXXXXXXXXXXXXXXX')
AWSKEY = (ENV['AMAZON_SECRET_ACCESS_KEY'] || 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM'){ AMQP.stop{ EM.stop } }

host = ARGV.shift
input_bucket = "custom-data"
output_bucket = "custom-output"
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")
count = 0

AMQP.start(:host => host) do
 exchange = MQ.direct('jobs')

 STDIN.each_line do |file|
   count += 1
   $stdout.print "."; $stdout.flush
   payload = {
     :input => [input_bucket, file.strip],
     :output => [output_bucket, output_prefix],
     :s3id => AWSID,
     :s3key => AWSKEY
   }
   exchange.publish(Marshal.dump(payload))
 end
 AMQP.stop { EM.stop }
end
puts "#{count} data enqueued for generation."

It reads from STDIN the names of the files stored in the S3 bucket and enqueues one message per file. Before running the job, we created a text file that listed each of the 3600 files, one per line, which could then be piped to this script on the command line. Each message carries all the information a worker needs to find the data and to know where to put the output when completed. We scoped the output by the time the job was enqueued, making it easier to discern older runs from newer ones.
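To make the message format concrete, here is a minimal sketch of the payload's round trip with the queue taken out of the picture. It uses a StringIO in place of STDIN, made-up file names, and a fixed timestamp so the result is repeatable; the credentials are left out. How the real GenerationJob class unpacks the message is not shown in this post, so the worker side here is an assumption.

```ruby
require 'stringio'

# Stand-in for STDIN: a manifest with one S3 key per line (hypothetical names).
manifest = StringIO.new("inputs-0001.csv\ninputs-0002.csv\n")

enqueued_at   = Time.utc(2009, 12, 23, 10, 30, 0)   # fixed time for repeatability
output_prefix = enqueued_at.strftime("/%Y%m%d%H%M%S")

# Publisher side: one serialized payload per manifest line.
messages = manifest.each_line.map do |file|
  payload = {
    :input  => ["custom-data", file.strip],
    :output => ["custom-output", output_prefix]
  }
  Marshal.dump(payload)
end

# Worker side (assumed): Marshal.load restores the same hash, symbols and all.
first = Marshal.load(messages.first)
input_bucket, input_key = first[:input]
puts "fetch #{input_bucket}/#{input_key}, write under #{first[:output].last}"
```

Because the prefix is a zero-padded timestamp, the output folders of successive runs sort chronologically when listed by name. Note that Marshal.load should only ever be fed data from a trusted source, which holds here because the same team controls both ends of the queue.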

Configuring the cloud

Now that the meat of the job was ready, we dived into configuring the servers with Chef. We created a Chef repository, added the Opscode cookbooks as a submodule, and uploaded these default cookbooks to the server:

  • apt
  • build-essential
  • erlang
  • imagemagick
  • runit
  • ruby

We created some additional cookbooks to fill out the generic setup:

  • rabbitmq - Installs and configures RabbitMQ
  • gemcutter - Upgrades Rubygems, installs Gemcutter and makes gemcutter.org the default gem source

Lastly we created our custom cookbook, which sets up all the libraries we need, downloads the code, and sets up the worker process as a runit service. Let’s walk through the default recipe in that cookbook:


%w{haml gruff fastercsv activesupport prawn prawn-core prawn-format prawn-layout eventmachine amqp aws-s3}.each do |g|
 gem_package g
end

This simply installs all of the gems that we need to run the job.


# Find the node that has the job queue
q = search(:node, "run_list:role*job_queue*")[0].first

Here we use Chef’s search feature to find the node that has RabbitMQ installed and running so we can pass it to the worker script.


# Create directory to put the code in
directory "/srv"

# Unzip the code if necessary
execute "Unpack code" do
 command "tar xzf generationjobs.tar.gz"
 cwd "/srv"
 action :nothing
end

# Download the code
remote_file "/srv/generationjobs.tar.gz" do
 source "generationjobs.tar.gz"
 notifies :run, resources(:execute => "Unpack code"), :immediate
end

# Create the directory where output goes
directory "/srv/generationjobs/tmp" do
 recursive true
end

In these four resources, we set up the working directory for the worker process, download the project code (stored on the Chef server as a tarball), and unpack it. The interesting thing about this sequence is that we don’t automatically unpack the tarball: the execute resource is declared with action :nothing, so it does nothing on its own. Since the Chef client runs periodically in the background, we don’t want to be unpacking the code every time, but only when it has changed. We use an immediate notification from the remote_file resource to tell the unpacking to run when the tarball is a new version; remote_file won’t download the tarball unless the file checksum has changed.
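The idea of acting only when a checksum changes can be sketched in plain Ruby. This is not how Chef implements remote_file or notifications; the helper name run_if_changed and the in-memory state hash are hypothetical, and the "tarball" is just a small text file in a temporary directory.

```ruby
require 'digest'
require 'tmpdir'

# Runs the block only when the file's SHA-256 differs from the one recorded
# on the previous call; returns true when the block ran.
def run_if_changed(path, state)
  checksum = Digest::SHA256.file(path).hexdigest
  return false if state[path] == checksum
  state[path] = checksum
  yield
  true
end

state   = {}   # hypothetical record of checksums already seen
unpacks = 0    # counts how often the "unpack" step actually ran

Dir.mktmpdir do |dir|
  tarball = File.join(dir, "generationjobs.tar.gz")

  File.write(tarball, "version 1")
  run_if_changed(tarball, state) { unpacks += 1 }   # first run: unpack
  run_if_changed(tarball, state) { unpacks += 1 }   # unchanged: skipped
  File.write(tarball, "version 2")
  run_if_changed(tarball, state) { unpacks += 1 }   # new content: unpack again
end

puts "unpacked #{unpacks} times in 3 runs"
```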


# Create runit service for worker
runit_service "generationworker" do
 options({:worker_bin => "/srv/generationjobs/bin/worker", :queue_host => q})
 only_if { q }
end

The last resource is a pseudo-resource defined in the “runit” cookbook that creates all the pieces of a runit daemon for you; we only had to create the configuration templates for the daemon and put them in our cookbook. The additional options passed to runit_service tell the templates the location of the worker code and the RabbitMQ host. We also take advantage of the only_if option so the service won’t be created if there’s no host with RabbitMQ on it yet.
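The templates themselves are not reproduced in this post. As a rough illustration of how the two options could end up in a runit run script, the sketch below renders a made-up template with Ruby's standard ERB library. It assumes the template reads the options through an @options instance variable; the template text, the TemplateContext class, and the host address are all placeholders rather than the contents of the actual cookbook.

```ruby
require 'erb'

# Hypothetical contents of a run-script template for the worker service.
RUN_TEMPLATE = <<-'TEMPLATE'
#!/bin/sh
exec 2>&1
exec <%= @options[:worker_bin] %> <%= @options[:queue_host] %>
TEMPLATE

# Minimal stand-in for the template context: it exposes the options hash
# as @options so the template can read it.
class TemplateContext
  def initialize(options)
    @options = options
  end

  def render(source)
    ERB.new(source).result(binding)
  end
end

options = {
  :worker_bin => "/srv/generationjobs/bin/worker",
  :queue_host => "10.0.0.5"   # placeholder address for the RabbitMQ node
}
run_script = TemplateContext.new(options).render(RUN_TEMPLATE)
puts run_script
```

The rendered script hands the queue host to the worker as its first argument, which matches how the worker script reads it with ARGV.shift.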

The last step in the Chef configuration was to create two roles, one for the queue and one for the worker. Naturally, the node that has the queue can also act as a worker. Here’s what the role JSON documents look like:


// The queue role
{
 "name": "job_queue",
 "chef_type": "role",
 "json_class": "Chef::Role",
 "default_attributes": {

 },
 "description": "Provides a message queue for sending jobs out to the workers.",
 "recipes": [
   "erlang",
   "rabbitmq"
 ],
 "override_attributes": {

 }
}

// The worker role
{
 "name": "job_worker",
 "chef_type": "role",
 "json_class": "Chef::Role",
 "default_attributes": {

 },
 "description": "Processes the data from a queue into the PDF, PNG and HTML output.",
 "recipes": [
   "apt",
   "build-essential",
   "ruby",
   "gemcutter",
   "imagemagick::rmagick",
   "runit",
   "custom"
 ],
 "override_attributes": {

 }
}

Running the jobs on EC2

Now comes the fun (and easy) part! Armed with an AWS account, an EC2 certificate, and knife, we began firing up nodes to run the job. With Opscode’s preconfigured Chef AMI, you can pass a JSON node configuration in the EC2 instance user data. First we generated the configuration for the job queue node:

$ knife instance_data --run-list="role[job_queue] role[job_worker]" | pbcopy

With the JSON configuration in the clipboard, we could paste it into ElasticFox (or the AWS Management Console) and fire up the first EC2 node. Several minutes later, the node was ready to go. Next, we created a similar configuration, but with only the worker role:

$ knife instance_data --run-list="role[job_worker]" | pbcopy

Then we fired up nine nodes with that configuration and proceeded to initiate the job from the queue node:

$ ssh -i ~/ec2-keys/my-ec2-cert.pem root@ec2-public-hostname
[root@ec2-public-hostname]$ cd /srv/generationjobs
[root@ec2-public-hostname]$ bin/startjobs localhost < manifest.txt

After all the preparation, that’s all there was to it! A little over an hour later, we had generated PNG graphs, PDF, and HTML from all 3600 datasets.
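For a rough sense of what that hour implies, the sketch below compares an even split at the local speed with the observed wall-clock time. It assumes one worker process per node, ten nodes in total (the queue node plus nine workers), an even spread of jobs, and rounds "a little over an hour" down to exactly one hour, so treat the result as an order-of-magnitude figure only.

```ruby
jobs            = 3600
nodes           = 10     # the queue-plus-worker node and nine worker-only nodes
local_seconds   = 30     # per-permutation time from the initial local runs
wall_clock_secs = 3600   # "a little over an hour", rounded down to one hour

ideal_hours_at_local_speed = jobs * local_seconds / nodes / 3600.0
effective_secs_per_job     = wall_clock_secs * nodes / jobs.to_f

puts "even split at local speed: #{ideal_hours_at_local_speed} hours"
puts "implied time per job on EC2: about #{effective_secs_per_job} seconds"
```

Under those assumptions the run was about three times faster than dividing the 30-hour local estimate by ten would suggest; the post does not say where the difference came from.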

Conclusion

It’s no mystery why “cloud computing” is so popular. The ability to quickly and cheaply access computational power, utilize it, and then dispose of it is really appealing, and tools like Chef and EC2 make it really easy to accomplish. What can you cook up?

  • As I said, I've used AMQP before, but also it's lightweight, and it doesn't incur any extra cost. If I wanted verification of the completion of each generation step, I could require 'ack' on the queue.
  • Thanks for the writeup Sean. I am curious why you chose AMQP over Amazon's SQS for the queue. I am about to embark on a similar exercise in cloud computing and any insight you could provide would be helpful.
  • alan
    what is the name of this website? would love to take a look at it
  • The site won't go live for another few weeks. The client is still fine-tuning the model, which means that everything will have to be run again (making our investment in this infrastructure pay off even more!). Once the client launches the site we'll try to get their permission to link it here.
  • @Chris - Thanks for the link. We needed PNGs for the HTML as well, so it seemed natural to embed them in the PDF, despite the quality reduction.

    @Martin - I responded to your query on Hacker News, but we used S3 to collect the generated files. AMQP has an 'ack' option for messages, but we didn't find it necessary - some of the output had errors, but nothing crashed.

    @Colin - I spent the greater part of a day making our code work on EC2, and probably 2-3 hours of that was configuring Chef.
  • Colin
    Great job! Thanks for sharing. How much time went into the preparation?
  • This is too cool. If you are ever looking for a sysadmin who knows how to code, give me a ring :)
  • I think you left out the most interesting parts:
    How did you collect the generated files?
    Did you take any precautions (Does AMQP have some sort of transactional semantics?) in case a worker died while creating a pdf?
  • With regards to generating your plots, you might be interested in alternative plotting libraries that can generate PDF directly for inclusion with Prawn. Perhaps Gnuplot, or Tioga?

    I gave a talk recently which may be of interest to you. Slides and code samples are here:
    http://github.com/chrislo/data_visualisation_ruby