TL;DR: Skip to the results!
The Quest
A few months back, I was developing a little Flash widget as a tangent to my main project, and I needed an asynchronous PNG encoder. I found several, but none of them were quite what I wanted. Some were pretty good, but synchronous only. Most of the asynchronous ones did the final compression step all at once at the end, with a call to ByteArray's deflate() method, which meant they weren't really asynchronous -- there'd be a noticeable pause while the compression took place.
In-Spirit's PNG encoder, which compiled zlib using Alchemy(!), came very close to what I wanted. It maintained a consistent framerate and offered good, configurable compression. There was only one problem: its asynchronous mode was slow. And I mean really slow. Actually, since I was developing on a netbook, it was intolerably slow, even for medium-sized images. The SWC, at over 100KB, was also a little hefty considering the small scope of my widget.
The Journey
So, being easily distracted, I decided to build my own PNG encoder. I was officially working on a tangent of a tangent. I had two goals: speed, and true asynchronous encoding.
I started with the haXe port of the basic as3corelib PNG encoder from Adobe. I spent a while optimizing it for speed: I removed unnecessary casts, unrolled loops, and inlined function calls. I moved as much as I could into domain memory, which is basically a raw, byte-addressable hunk of memory that is really, really fast, since reads and writes are done using Alchemy opcodes (made possible through the awesomeness of haXe). Because only one chunk of memory can be selected as the domain memory at a time, it needs to be partitioned manually into regions for different purposes: one region holds a CRC lookup table, another the raw pixel data to be compressed, another the compression workspace, and so on. Another caveat of there being only one domain memory is that if two different packages use it, and both assume sole ownership, then at least one of them will end up manipulating the wrong data. To make sure this didn't happen with my PNG encoder, everywhere I use domain memory I first save the previous selection before overwriting it, then restore it when I'm done. Because Flash is single-threaded, this works beautifully: my encoder will never conflict with any other code that uses domain memory, even if that code assumes sole ownership.
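Here's a minimal sketch of that save-and-restore pattern in haXe (the class and function names are just illustrative, not the encoder's actual API):

```haxe
import flash.Memory;
import flash.system.ApplicationDomain;
import flash.utils.ByteArray;

class DomainMemoryGuard {
    // Illustrative helper: select `buffer` as domain memory for the duration
    // of `work`, then put back whatever was selected before, so other code
    // that assumes sole ownership of domain memory is left undisturbed.
    public static function withDomainMemory(buffer:ByteArray, work:Void -> Void):Void {
        var previous:ByteArray = ApplicationDomain.currentDomain.domainMemory; // save the old selection
        Memory.select(buffer);       // fast Alchemy-opcode reads/writes now hit our buffer
        work();
        ApplicationDomain.currentDomain.domainMemory = previous;               // restore (may be null)
    }
}
```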
There are two main parts to PNG encoding. First, the bitmap data has to be converted into the right RGBA format (BitmapData.getPixels() and friends unfortunately yield data in ARGB format). During this conversion, a filter can be applied that increases the compressibility of the data (e.g. by using deltas between adjacent pixel values in place of the actual values). The second phase is compressing the pixel data. There's also some bookkeeping related to the PNG format, which stores everything in "chunks" with CRC-32 checksums, plus a header chunk and a footer chunk, but those are minor details.
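To make the first phase concrete, here's a stripped-down sketch of the ARGB-to-RGBA conversion in haXe (no filtering, and no per-scanline filter-type byte, which real PNG output needs; the names are illustrative, not the encoder's actual functions):

```haxe
import flash.display.BitmapData;
import flash.utils.ByteArray;

class PixelFormat {
    // Convert the ARGB bytes from BitmapData.getPixels() into the RGBA order
    // that PNG expects. Simplified illustration only.
    public static function argbToRgba(img:BitmapData):ByteArray {
        var src:ByteArray = img.getPixels(img.rect);  // ARGB, 4 bytes per pixel
        src.position = 0;
        var dst = new ByteArray();
        var count = Std.int(src.length / 4);
        for (i in 0...count) {
            var a = src.readUnsignedByte();
            var r = src.readUnsignedByte();
            var g = src.readUnsignedByte();
            var b = src.readUnsignedByte();
            dst.writeByte(r);
            dst.writeByte(g);
            dst.writeByte(b);
            dst.writeByte(a);   // alpha moves from the front to the back
        }
        return dst;
    }
}
```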
I managed to optimize the first phase, which transforms the raw pixel data into the right format, to the point where it was no longer the bottleneck. The compression phase was taking over 60% of the entire encoding time -- but there was nothing I could do about it, since it was abstracted away behind a single call to deflate(). That also meant I couldn't make it asynchronous. So, of course, I decided to go on a tangent of a tangent of a tangent, and implement zlib and DEFLATE from scratch (as described by RFCs 1950 and 1951). This took a lot longer than I expected, but it was a success! I managed to write a compression algorithm that was competitive with the built-in one in terms of speed and compression ratio (on the GOOD setting), and much faster for highly redundant data (as many images are) on the FAST setting. What's more, since I had complete control over the implementation, I was able to write it in such a way that it would be easy to adapt to a chunk-at-a-time, asynchronous architecture.
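The DEFLATE stream itself is far too long to sketch here, but the zlib wrapper around it (RFC 1950) is tiny: a two-byte header, the DEFLATE data, and an Adler-32 checksum of the uncompressed input. Something like this (a simplified illustration, not the library's actual code):

```haxe
import flash.utils.ByteArray;

class ZlibFraming {
    // RFC 1950 framing around a raw DEFLATE stream; the DEFLATE data itself
    // (the hard part, per RFC 1951) is assumed to already exist.
    public static function wrap(deflated:ByteArray, uncompressed:ByteArray):ByteArray {
        var out = new ByteArray();                    // ByteArray defaults to big-endian
        out.writeByte(0x78);                          // CMF: deflate, 32K window
        out.writeByte(0x9C);                          // FLG: default compression, valid check bits
        out.writeBytes(deflated);                     // the raw DEFLATE stream
        out.writeUnsignedInt(adler32(uncompressed));  // checksum of the *uncompressed* input
        return out;
    }

    // Unoptimized Adler-32 (real implementations defer the modulo).
    static function adler32(data:ByteArray):UInt {
        var s1:UInt = 1, s2:UInt = 0;
        data.position = 0;
        var len = Std.int(data.length);
        for (i in 0...len) {
            s1 = (s1 + data.readUnsignedByte()) % 65521;
            s2 = (s2 + s1) % 65521;
        }
        return (s2 << 16) | s1;
    }
}
```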
Making it Asynchronous
When Flash developers use the word "asynchronous", they typically don't mean it in its usual sense of "multiple things happening at once", since Flash is single-threaded. They use it to refer to an algorithm that spreads its processing across multiple frames so that the UI doesn't appear to lock up (or, in extreme cases, cause the script to time out). The Flash AVM2 has, at its core, what's been described as the "elastic racetrack": basically, a loop which dispatches events and updates the display, going around and around as fast as it can in order to maintain the chosen frame rate as well as possible. If the frame rate is low, or updating the display is very fast, then it might go through several event-dispatching cycles (provided there are pending events) before rendering the next frame.
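In code, "asynchronous" in this sense boils down to something like the following ENTER_FRAME pattern (a bare-bones illustration of the general idea, not PNGEncoder2's actual structure; an adaptive per-frame budget is what's described next):

```haxe
import flash.display.Sprite;
import flash.events.Event;

// Do a bounded slice of work each trip around the racetrack, then yield
// until the next frame so the display can update.
class SlicedJob extends Sprite {
    var done:Bool;

    public function new() {
        super();
        done = false;
        addEventListener(Event.ENTER_FRAME, onFrame);
    }

    function onFrame(e:Event):Void {
        var unitsThisFrame = 0;
        while (!done && unitsThisFrame < 50) { // fixed per-frame budget, for illustration
            done = doOneUnitOfWork();          // hypothetical: returns true when the job is finished
            unitsThisFrame++;
        }
        if (done) removeEventListener(Event.ENTER_FRAME, onFrame);
    }

    function doOneUnitOfWork():Bool {
        // e.g. encode one scanline; placeholder here
        return true;
    }
}
```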
I wanted the asynchronous mode of the encoder to complete as quickly as possible, but without degrading the frame rate intolerably. Another PNG encoder, from BIT-101, handled a fixed number of scanlines (each horizontal row of pixels is called a scanline) per frame. This method has a couple of disadvantages: different Flash programs have different processing loads, and different platforms yield different performance (which varies with background load, too). Both lead to either sub-optimal frame rates or wasted cycles around the racetrack. To avoid these issues, I tried an adaptive approach that continuously monitored both the frame rate and how fast the encoder could process a single scanline of a given image. I could then use this information to estimate the number of scanlines to process during the next update -- I called this the "step" size. Theoretically, all I had to do was increase the step size until the frame rate decreased (to use up any free cycles), and decrease the step size as needed to maintain that frame rate. In practice, that turned out to be tricky and error-prone, partly because of timing inaccuracies, but mostly because the frame rate can vary due to external factors outside the encoder. I ended up just including a target FPS setting (it defaults to 20) and aiming for that. This was simpler, and gave much better performance. Each update, the step size is adjusted to match the target FPS as closely as possible; if the current frame rate is more than 15% worse than the target, a correction is made to attempt to bring the error delta to zero.
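Roughly, the adaptation looks something like this (a simplified sketch of the idea, not PNGEncoder2's actual tuning code; the numbers and names are illustrative):

```haxe
class StepSizer {
    // `step` is how many scanlines to process on the next update.
    public var targetFPS:Float;
    public var step:Float;
    static inline var TOLERANCE = 0.15;     // only correct when more than 15% below target

    public function new(targetFPS:Float = 20) {
        this.targetFPS = targetFPS;
        this.step = 32;                     // arbitrary starting guess
    }

    // actualFPS: measured frame rate; msPerScanline: measured cost of one scanline
    public function update(actualFPS:Float, msPerScanline:Float):Int {
        var perLine = Math.max(msPerScanline, 0.001);   // avoid dividing by zero
        if (actualFPS < targetFPS * (1 - TOLERANCE)) {
            // Too far below target: shrink the step so the work per update
            // fits back inside the target frame budget.
            var budgetMs = 1000 / targetFPS;
            step = Math.max(1, budgetMs / perLine);
        } else {
            // At or above target: cautiously take on more scanlines per update.
            step *= 1.1;
        }
        return Std.int(step);
    }
}
```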
Discoveries
First, images are big data. I was mostly using a teeny 200x200px image for testing. That's 40000 pixels total. And each pixel is 4 bytes (with alpha), giving a total size of 160000 bytes. Anything you do 160000 times (at least, on a netbook processor ;-) ) is magnified hugely. Changes in the order of if conditions would give significant performance speedups or slowdowns. I ended up with a very tight loop iterating over every byte, and everything in there mattered, and everything outside that loop didn't. A regular-sized 1024x768px image is over three million bytes of raw pixel data!
Also, jumps are slow. Anything involving a jump was automatically suspicious; unrolling loops, working around ifs, and inlining function calls gave huge boosts to performance. Nearly every function of the library was inlined. Disclaimer: I was not using a high resolution timer to measure performance, so, of course, my measurements had a fairly large margin of error; however, I only included optimizations that brought down the total time to encode (which was coarse-grained enough not to worry about timing errors). As always, do your own benchmarks, and never trust sweeping generalizations like mine ;-)
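For a flavour of the kind of micro-optimization involved (illustrative code, not lifted from the library): haXe's inline keyword removes call overhead, and reading four bytes of domain memory as one 32-bit word saves three extra memory accesses per iteration.

```haxe
import flash.Memory;

class HotLoop {
    // Unpack one little-endian 32-bit word into its four bytes and sum them.
    static inline function sumWord(w:Int):Int {
        return (w & 0xFF) + ((w >>> 8) & 0xFF) + ((w >>> 16) & 0xFF) + (w >>> 24);
    }

    // Sum `len` bytes of domain memory starting at `addr`, four at a time.
    public static function sumBytes(addr:Int, len:Int):Int {
        var total = 0;
        var p = addr;
        var end = addr + (len & ~3);            // the 4-byte-aligned bulk
        while (p < end) {
            total += sumWord(Memory.getI32(p)); // one 32-bit read instead of four byte reads
            p += 4;
        }
        for (i in 0...(len & 3)) {              // the 0-3 leftover bytes
            total += Memory.getByte(end + i);
        }
        return total;
    }
}
```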
Finally, it turns out that, despite Flash being single-threaded, it's possible for event handlers, particularly timer tick event handlers, to be re-entrant (i.e. the event handler function gets called again before a previous call to the same handler has completed). How is this possible given that there's no multithreading? Well, it turns out that any call to dispatchEvent() (for whatever purpose) might prompt Flash to deal with pending events during that call to dispatchEvent(). For example, imagine a high-frequency timer with an event handler that, when fired, dispatches a progress event. During the progress dispatchEvent() call, Flash notices that another timer tick is due, and invokes the very handler that dispatched the progress event in the first place! This is actually fairly easy to work around (just queue any events you want to dispatch until you're done doing everything else), but it can cause all sorts of nasty, subtle bugs if you're not aware of it.
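One way to apply that workaround looks like this (an illustrative sketch with a made-up event type, not PNGEncoder2's actual code; the important part is that dispatchEvent() only happens after all the handler's state changes are finished, so a re-entrant tick sees consistent state):

```haxe
import flash.events.Event;
import flash.events.EventDispatcher;
import flash.events.TimerEvent;
import flash.utils.Timer;

class ChunkedWorker extends EventDispatcher {
    var timer:Timer;
    var pending:Array<Event>;       // events queued during a tick

    public function new() {
        super();
        pending = [];
        timer = new Timer(10);      // high-frequency tick
        timer.addEventListener(TimerEvent.TIMER, onTick);
        timer.start();
    }

    function onTick(e:TimerEvent):Void {
        doSomeWork();
        // Don't dispatch mid-work: dispatchEvent() can let Flash run another
        // pending tick, re-entering onTick before this call has returned.
        pending.push(new Event("progressMade"));    // hypothetical event type

        // Flush only once all state changes for this tick are complete.
        var toSend = pending;
        pending = [];
        for (ev in toSend) dispatchEvent(ev);
    }

    function doSomeWork():Void { /* e.g. encode a few more scanlines */ }
}
```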
The Results
It took me a while to complete the library and write this blog post, but I've finally finished! I've imaginatively named my new encoder "PNGEncoder2". You can grab it from GitHub. See the README file for the full feature list, installation instructions, and usage examples.
Here's a benchmark (source) comparing PNGEncoder2 with other PNG encoders (both synchronous and asynchronous). Note that the first run might take a little longer than the others because of one-time initializations and such. My encoder uses the Paeth filter to improve the compressibility of the pixel data; for the other PNG encoders that support filters, I've set them to use Paeth as well so the comparison is fair (not all of them support filters, however).
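For the curious, the Paeth predictor itself (from the PNG specification) is tiny: each byte is predicted from its left, upper, and upper-left neighbours, and the filter stores the difference from that prediction. The haXe below is just a reference sketch, not the encoder's internal version:

```haxe
class PaethFilter {
    // PNG's Paeth predictor: a = byte to the left, b = byte above,
    // c = byte above and to the left. Picks whichever neighbour is
    // closest to the initial estimate a + b - c.
    public static inline function predict(a:Int, b:Int, c:Int):Int {
        var p = a + b - c;
        var pa = abs(p - a);
        var pb = abs(p - b);
        var pc = abs(p - c);
        return (pa <= pb && pa <= pc) ? a : (pb <= pc ? b : c);
    }

    static inline function abs(x:Int):Int {
        return x < 0 ? -x : x;
    }

    // A Paeth-filtered scanline stores (raw - predict(a, b, c)) & 0xFF for
    // each byte, which is usually far more compressible than the raw values.
}
```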
No doubt there's still a few bugs; if you find one, I'd appreciate a link to a sample image that exhibits the bug. At one point, there was a particularly nasty bug caused by a typo in one constant out of a column of 32, which only showed up when encoding a particular image of a unicorn. (They are magical!)