tl;dr: You should compose your web app with IO streams
On dump.ly, one of our most loved features is the download button, which creates a zip file with all the original images in an album.
The original solution was hacked together very quickly (i.e. it was pretty crude). It simply downloaded each file from S3 into a temp folder on the server, zipped up that folder via a shell exec, and then sent the resulting archive down the pipe.
While it worked correctly, there were some problems:
Initial latency
The user has to wait while each file is downloaded to the server and zipped up. This results in bad UX: the user expects an immediate response (the browser download popup), but is instead left waiting 10-30 seconds while the server does its work.
Spiky server load
The S3 downloads and the zipping are both bursty, and both are CPU and IO intensive. This is the worst kind of load: our EC2 instances, which can usually handle hundreds of thousands of requests, can very easily be brought to their knees.
Complexity
While this was supposed to be the quick and simple solution, it very quickly got messy as we needed to ensure all temporary files got deleted in case of errors. We also had to add a work queue to limit the number of concurrent requests to avoid overloading the server.
Repeated work
Each and every request would generate the whole zip file again, even if the download was aborted.
We really needed an improvement, so the first thought was to add caching (the magical solution to everything). With a cache, there would only be a big hit on the first download, so the work to create an archive could be amortized across multiple requests. As users can also select which images they want in a zip, we would have to use a hash of the image ids as the cache key. We would also have to store the cached files in S3, so that all front-end servers could use them, and work out an expiry strategy.
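To make the cache-key idea concrete, here is a minimal sketch. It assumes the image ids are numbers or strings that contain no commas, and `zipCacheKey` is a hypothetical name rather than code from our app.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a cache key from the set of selected image ids.
function zipCacheKey(imageIds) {
  // Sort a copy so the same selection gives the same key in any order.
  const canonical = [...imageIds].map(String).sort().join(',');
  return crypto.createHash('sha1').update(canonical).digest('hex');
}
```

Sorting means two users who pick the same images in a different order would share one cached archive, which is the behaviour a cache like this would want.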
While this seemed like a sane idea, it reminded us of the proverbial ‘putting lipstick on a pig’. Then we thought: why can’t we just generate the zip on the fly without ever touching disk?
Streams to the rescue
Well, we can. Node has built-in support for downloading data in chunks (e.g. files from S3), running chunks of data through deflate, and firing those chunks back at the user. All of these are exposed through the beautiful stream interface, and so can be composed to create a pipeline.
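As a rough illustration of that composition, the sketch below pushes chunks through raw deflate and collects the output. The in-memory source and sink are stand-ins for the S3 download and the HTTP response, and `deflateChunks` is a hypothetical name.

```javascript
const { Readable, Writable, pipeline } = require('stream');
const zlib = require('zlib');

// Hypothetical helper: run chunks through raw deflate and collect the result.
function deflateChunks(chunks, done) {
  const compressed = [];
  pipeline(
    Readable.from(chunks),    // stand-in for a chunked download (e.g. from S3)
    zlib.createDeflateRaw(),  // compresses data as it flows through
    new Writable({            // stand-in for the client's HTTP response
      write(chunk, encoding, callback) {
        compressed.push(chunk);
        callback();
      },
    }),
    (err) => done(err, Buffer.concat(compressed))
  );
}
```

Each stage only handles one chunk at a time, so memory use stays roughly constant no matter how large the whole download is.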
One immediate problem is that Node’s zlib module only compresses raw data. To actually create a zip container we need to write out a bunch of headers, a few checksums, and an envelope for each file. Luckily, GitHub user wellawaretech had created a module, zipstream, which I’ve forked, that wraps all this magic up.
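To give a feel for what that container involves, here is a minimal sketch of an uncompressed ("stored") zip built entirely in memory. This is not zipstream’s code: `buildStoredZip` and `crc32` are hypothetical helpers, the timestamp is fixed, and files over 4 GB are not handled. A streaming writer also cannot know each file’s size and checksum before the data has passed through; the zip format allows them to follow the data in a data descriptor, which this sketch skips.

```javascript
// Table-driven CRC-32, the checksum the zip format stores for every file.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

function crc32(buf) {
  let crc = 0xffffffff;
  for (const byte of buf) crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  return (crc ^ 0xffffffff) >>> 0;
}

const DOS_DATE = 0x21; // 1980-01-01; a fixed date keeps the sketch short

// Hypothetical helper: files is an array of { name: string, data: Buffer }.
function buildStoredZip(files) {
  const body = [];      // local headers and file data, in order
  const directory = []; // central directory records, written after all files
  let offset = 0;

  for (const { name, data } of files) {
    const nameBuf = Buffer.from(name, 'utf8');

    // Per-file envelope: a 30-byte local header, then the name, then the data.
    const local = Buffer.alloc(30);
    local.writeUInt32LE(0x04034b50, 0);      // local file header signature
    local.writeUInt16LE(20, 4);              // version needed to extract
    local.writeUInt16LE(0x0800, 6);          // flag: file name is UTF-8
    local.writeUInt16LE(0, 8);               // method 0 = stored, no deflate
    local.writeUInt16LE(DOS_DATE, 12);       // modification date (time stays 0)
    local.writeUInt32LE(crc32(data), 14);    // checksum of the file data
    local.writeUInt32LE(data.length, 18);    // compressed size
    local.writeUInt32LE(data.length, 22);    // uncompressed size
    local.writeUInt16LE(nameBuf.length, 26); // file name length

    // The central directory record repeats the same fields plus the offset.
    const entry = Buffer.alloc(46);
    entry.writeUInt32LE(0x02014b50, 0);      // central directory signature
    entry.writeUInt16LE(20, 4);              // version made by
    local.copy(entry, 6, 4, 30);             // shared fields, copied verbatim
    entry.writeUInt32LE(offset, 42);         // where this file's header starts

    body.push(local, nameBuf, data);
    directory.push(entry, nameBuf);
    offset += local.length + nameBuf.length + data.length;
  }

  // Footer: tells a reader where the central directory lives.
  const directorySize = directory.reduce((sum, part) => sum + part.length, 0);
  const end = Buffer.alloc(22);
  end.writeUInt32LE(0x06054b50, 0);          // end of central directory signature
  end.writeUInt16LE(files.length, 8);        // entries on this disk
  end.writeUInt16LE(files.length, 10);       // entries in total
  end.writeUInt32LE(directorySize, 12);      // size of the central directory
  end.writeUInt32LE(offset, 16);             // it starts right after the files
  return Buffer.concat([...body, ...directory, end]);
}
```

Storing rather than deflating fits our case, since the images are already compressed.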
Now, when a user clicks the download button, the server:
- Enumerates all the requested images in that album.
- Immediately writes the http headers to the client’s http response, to say it’s a download with a .zip file name.
- zipstream writes the header bytes of the zip container.
- Creates an http request to the first image in S3.
- Pipes that into zipstream (we don’t actually need to run deflate as the images are already compressed).
- Pipes that into the client’s http response.
- Repeats for each image, with zipstream correctly writing envelopes for each file.
- zipstream writes the footer bytes for the zip container.
- Ends the http response.
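The "repeat for each image" part of the steps above can be sketched as follows. This is only the chaining pattern, with the zipstream envelope step left out; `pipeSequentially` is a hypothetical helper, and each source is passed as a function so that a download is opened only when its turn comes.

```javascript
const { Readable, PassThrough } = require('stream');

// Hypothetical helper: pipe each source into dest, one after another,
// and end dest only after the last source has finished.
function pipeSequentially(openSources, dest, done) {
  let index = 0;
  function next() {
    if (index === openSources.length) {
      dest.end();
      return done(null);
    }
    const source = openSources[index++](); // open the next download lazily
    source.once('error', done);
    source.once('end', next);
    // end: false keeps dest open for the sources that follow.
    source.pipe(dest, { end: false });
  }
  next();
}
```

In a real handler, `dest` would be the client’s http response, after a `res.writeHead` call that sets `Content-Type` and a `Content-Disposition: attachment` header, and each source would be the http response for one image in S3.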
This is so much better than before, as:
- The download is now immediate, with only a second or two of latency.
- The pipeline ensures that the whole process only runs as fast as the slowest bottleneck, which is usually the client download speed. It’s auto throttling.
- Everything is in memory, and nothing ever touches disk. Only as much work as needed is ever done, e.g. aborting a 1 GB download at 1 MB only wastes 1 MB worth of CPU processing and IO bandwidth.
- We can run many thousands of downloads concurrently on one server, as at each point in time a download only takes minimal resources: 2 http requests and a few JS stream objects.
- Alternatively, we can run smaller cheaper servers, and get the same experience.
- The code is significantly simpler: no need to manually throttle, no need to clean up temp directories afterwards, no work queues.
- No need for a cache, as we only stream what we need. Again, less code to maintain.
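The auto-throttling point can be seen in isolation with a small experiment. The sketch below pipes a fast source into a deliberately slow sink, aborts early, and reports how many 1 KB chunks the source was actually asked to produce. `runThrottleDemo` and its parameters are hypothetical, and the timers stand in for a slow client and an aborted download.

```javascript
const { Readable, Writable } = require('stream');

// Hypothetical experiment: fast source, slow sink, early abort.
function runThrottleDemo(totalChunks, sinkDelayMs, abortAfterMs, done) {
  let produced = 0;
  const source = new Readable({
    highWaterMark: 1024,
    read() {
      // Called only when the consumer has room, which is what throttles us.
      if (produced >= totalChunks) return this.push(null);
      produced++;
      this.push(Buffer.alloc(1024));
    },
  });
  const slowSink = new Writable({
    highWaterMark: 1024,
    write(chunk, encoding, callback) {
      setTimeout(callback, sinkDelayMs); // stand-in for a slow client connection
    },
  });
  source.pipe(slowSink);
  setTimeout(() => {
    source.destroy(); // stand-in for the user aborting the download
    done(produced);
  }, abortAfterMs);
}
```

Calling `runThrottleDemo(10000, 5, 100, console.log)` should report only a few dozen chunks rather than 10000, because the source is paused whenever the sink’s buffer is full.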
So just overall better engineering. The only downside is that it’s conceptually more complicated, and requires some understanding of underlying components (zip files, http responses, streams). While IO streaming is node’s bread and butter, and this implementation was relatively trivial, this may not be true for other frameworks.
Future improvements
So what next? Well, we can try to make everything streaming: at the moment our upload process waits for the whole image before processing it. It would be really cool if the processing code could run AS the image is coming in, though this is harder to implement as it would require support in the underlying graphics library (graphicsmagick). What about our API servers: can we write out JSON as it gets generated from the DB? Probably, but how do we expose our DB as a composable stream? Let me know in the comments if you’ve done something like this, or can suggest improvements.
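For the JSON question, one possible shape is sketched below. It assumes the DB driver can hand back rows as an async iterable (many cursor APIs can, but that is an assumption here), and `jsonArrayChunks` is a hypothetical name.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: turn an (async) iterable of rows into the text of a
// JSON array, one piece at a time, without holding all rows in memory.
async function* jsonArrayChunks(rows) {
  yield '[';
  let first = true;
  for await (const row of rows) {
    yield (first ? '' : ',') + JSON.stringify(row);
    first = false;
  }
  yield ']';
}

// Wrap the generator so it can be piped like any other stream.
function jsonArrayStream(rows) {
  return Readable.from(jsonArrayChunks(rows));
}
```

With something like this, `jsonArrayStream(cursor).pipe(res)` would start sending the response before the query has finished, and a slow client would slow down how fast rows are pulled.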
CEO, dump.ly