Kindle Topaz File Format: Explorations Part I

I’ve recently been playing with the Amazon Kindle, Amazon’s electronic book reader. Despite looking like a throwback to ’80s computing, it’s actually a surprisingly good product. It’s by far the best ebook reader on the market, and transparent wireless purchasing and downloading is a killer feature. If they can improve their product design and invest in future e-Ink improvements they will own this space as it grows. Printed books have some time yet, but the writing is on the wall. Music paved the way, video will follow as internet bandwidth increases, and books will *eventually* go digital.

Beyond reading books on the thing, I’ve been looking at how the Kindle stores its books. Amazon uses two primary file formats for Kindle ebooks: a modified form of Mobipocket and something called Topaz.

Mobipocket files purchased from Amazon have an AZW extension (which presumably stands for Amazon Whispernet - the name of the Kindle wireless download service). Mobipocket files from other sources will have a MOBI or PRC extension. Topaz files will have an AZW1 extension if downloaded directly to the Kindle, and a TPZ extension if downloaded from Your Media Library on Amazon.com.

The Mobipocket format is an extension of the PalmDoc format, which is itself an extension of the Palm Database Format. The current version of the Mobipocket format employed by Amazon is a mess. It simply has too much history as a file format. Design choices that made sense when using the files on an original Palm Pilot don’t make any sense on a device like the Kindle. For more details on the Mobipocket format see the MobileRead Wiki. For some interesting work done on understanding the Kindle’s internals and Mobipocket DRM see Igor Skochinsky’s blogs: Darkreverser’s Blog and Reversing Everything. Some analysis of and commentary on the security of Mobipocket DRM on the Kindle can be found in this article on CodeRyder’s Blog.

The Topaz file format is currently not publicly documented and there is no publicly available tool for creating Topaz files. Everything described here is based on inspection of the Topaz files that I have available to me and some guesswork.

Some preliminaries:

  • Topaz files can be identified by their first four bytes, which are always ‘TPZ0’. The ‘0’ may be a version number, so we may see that change over time.
  • Integers in Topaz files use a variable length encoding scheme known as VLQ coding. If a number is less than 128 it is stored in a single byte. If a number N is between 128 and 16383 it is stored as (N / 128) | 0x80 in the first byte and N % 128 in the second byte, where / represents integer division. So 32 is stored simply as 0x20 while 130 is stored as 0x8102. This extends beyond two byte storage as needed in the obvious way: each byte carries seven bits of the value, most significant group first, and every byte except the last has its high bit set. For details see the Wikipedia article. I’m not sure how many integers are stored in Topaz files, but this seems like a pretty small compression gain and more trouble than it’s worth. My only guess is that the font storage inside the Topaz files makes this worthwhile.
  • Strings in Topaz files are stored essentially as Pascal strings. A Pascal string consists of a prefix length followed by the characters themselves, with no null terminator. The twist for Topaz strings is that the prefix length is stored as the variable length integer described above instead of a single byte. So ‘foo’ is stored as 0x03 followed by ‘foo’ while a string of 130 bytes is stored as 0x8102 followed by the 130 bytes. Also, the strings are always encoded in UTF-8 and the prefix length is the number of bytes in the string, not the number of characters.
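
To make the three points above concrete, here is a minimal Python sketch of the magic check, the variable length integers, and the length-prefixed strings. It is based only on the description in this post, not on any official specification. All of the function names (`is_topaz`, `encode_vlq`, `read_vlq`, `read_string`, `write_string`) are my own inventions for illustration, and the code assumes non-negative integers only, since I have not described how (or whether) negative values are stored.

```python
import io

# Magic bytes as observed in the files I have; the trailing '0' may be a version.
TOPAZ_MAGIC = b"TPZ0"


def is_topaz(data: bytes) -> bool:
    """Return True if the buffer starts with the Topaz magic bytes."""
    return data[:4] == TOPAZ_MAGIC


def encode_vlq(n: int) -> bytes:
    """Encode a non-negative integer as base-128 groups, most significant first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    groups = [n & 0x7F]                   # final byte: high bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)  # earlier bytes: high bit set
        n >>= 7
    return bytes(reversed(groups))


def read_vlq(stream) -> int:
    """Read one variable length integer from a binary stream."""
    value = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a variable length integer")
        byte = raw[0]
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:               # high bit clear marks the last byte
            return value


def read_string(stream) -> str:
    """Read a string: a VLQ byte count followed by that many UTF-8 bytes."""
    length = read_vlq(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a string")
    return data.decode("utf-8")


def write_string(text: str) -> bytes:
    """Encode a string with a VLQ prefix giving its length in bytes."""
    data = text.encode("utf-8")
    return encode_vlq(len(data)) + data


if __name__ == "__main__":
    print(encode_vlq(32).hex())                  # 20
    print(encode_vlq(130).hex())                 # 8102
    print(write_string("foo"))                   # b'\x03foo'
    print(read_string(io.BytesIO(b"\x03foo")))   # foo
```

Note that `write_string` measures the length after encoding to UTF-8, so a string containing non-ASCII characters gets a prefix larger than its character count, which matches the byte-count behaviour described above.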

In the next post I describe the basic structure of a Topaz file.
