Kindle Topaz File Format: Explorations Part II
Posted on September 7, 2008 at 11:50 am by Late Night Coder
This is the second part of my continuing exploration into the Kindle Topaz file format. This will probably not make much sense until you read the first part.
Topaz files contain a set of headers and corresponding blocks. Each header and block has a type. It appears that there are seven types: DICT, DKEY, GLYPHS, IMG, METADATA, OTHER and PAGE. For each header there are one or more corresponding blocks, and the header provides the number, location and sizes of the blocks, and probably some other information I haven’t decoded yet. Topaz files seem to only ever contain a single block for the DICT, DKEY, METADATA and OTHER types. There are typically many blocks for the GLYPHS, IMG and PAGE types. Hardly surprising given that GLYPHS contains font data, IMG contains image data, and PAGE contains book text.
I’m going to use STRING to indicate a Topaz encoded string and VARINT to indicate a Topaz encoded variable length integer. For information on how these two data types are encoded, see the first post in this series.
Topaz files begin with:
4 bytes 'TPZ0' (file identifier)
1 byte Number of headers (always 0x07 it seems)
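As a minimal sketch of the check above, this reads the identifier and the header count from the start of a file. The function name `read_file_start` and the sample bytes are made up for illustration.

```python
import io

MAGIC = b"TPZ0"

def read_file_start(stream):
    """Read the 4-byte file identifier and the 1-byte header count."""
    magic = stream.read(4)
    if magic != MAGIC:
        raise ValueError(f"not a Topaz file: {magic!r}")
    count = stream.read(1)
    if len(count) != 1:
        raise ValueError("file truncated before the header count")
    return count[0]

# Hand-built sample bytes, not taken from a real book.
sample = io.BytesIO(b"TPZ0\x07")
print(read_file_start(sample))  # 7
```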
Following this is a set of headers. There always seem to be seven headers, and they always appear in the same order. In general, each header contains information about how many blocks there are of that type, where they are located in the file, and how long each block is. Each header has the following general structure in the file:
1 byte 0x63 ('c')
STRING Header type identifier string. e.g., 0x04 + 'dict'
DATA Header data (varies based on type)
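Here is a rough sketch of reading the common start of a header. It assumes the STRING length prefix fits in a single byte, which matches the 0x04 + 'dict' example but is only an assumption (the first post is the reference for the real encoding). The helper names are hypothetical.

```python
import io

def read_short_string(stream):
    # ASSUMPTION: one length byte followed by that many bytes.
    # Longer strings may need the full STRING encoding from part one.
    prefix = stream.read(1)
    if not prefix:
        raise ValueError("truncated STRING")
    data = stream.read(prefix[0])
    if len(data) != prefix[0]:
        raise ValueError("truncated STRING")
    return data.decode("ascii")

def read_header_type(stream):
    """Consume the 0x63 marker and return the header type string."""
    if stream.read(1) != b"c":
        raise ValueError("missing 0x63 header marker")
    return read_short_string(stream)

# Hand-built sample: marker, length byte, type name.
print(read_header_type(io.BytesIO(b"c\x04dict")))  # dict
```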
Before I describe the individual headers, let me mention block offset and block length fields. The block offsets contained in the headers are measured from the separator ‘@’ byte described below, not from the beginning of the file. The length of a block is measured from the point where the block’s data starts. This seems to vary per block type but always excludes the block type string, any constant 0x00/0xFF bytes, any block index number and any field count.
DICT Header
6 bytes 'c' 0x04 'dict'
1 byte 0x01 (probably number of DICT blocks - always seems to be one)
VARINT Offset for DICT block
VARINT Unknown
VARINT Length of DICT block
I assume the corresponding DICT block contains information related to the DRM encryption.
DKEY Header
6 bytes 'c' 0x04 'dkey'
1 byte 0x01 (probably number of DKEY blocks - always seems to be one)
VARINT Offset for DKEY block
VARINT Length of DKEY block
1 byte 0x00 (Unknown)
I assume the corresponding DKEY block contains information related to the DRM encryption.
GLYPHS Header
8 bytes 'c' 0x06 'glyphs'
VARINT Number of GLYPHS blocks (G)
Repeat G times:
    VARINT Offset for GLYPHS block
    VARINT Unknown
    VARINT Length of GLYPHS block
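The layout above (a count followed by repeated offset/unknown/length triples) could be read roughly like this. The VARINT decoder is an assumption on my part: 7 payload bits per byte, most significant group first, high bit set on every byte but the last. The first post in this series is the reference for the real encoding, and all function names here are hypothetical.

```python
import io

def read_varint(stream):
    # ASSUMPTION: big-endian groups of 7 bits, high bit marks continuation.
    value = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("truncated VARINT")
        value = (value << 7) | (byte[0] & 0x7F)
        if byte[0] < 0x80:
            return value

def read_string(stream):
    # ASSUMPTION: a length prefix (read as a VARINT) followed by raw bytes.
    length = read_varint(stream)
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated STRING")
    return data

def read_counted_header(stream, expected_type):
    """Read 'c', the type string, a block count, then one triple per block."""
    if stream.read(1) != b"c":
        raise ValueError("missing 0x63 header marker")
    header_type = read_string(stream)
    if header_type != expected_type:
        raise ValueError(f"expected {expected_type!r}, got {header_type!r}")
    entries = []
    for _ in range(read_varint(stream)):
        offset = read_varint(stream)
        unknown = read_varint(stream)
        length = read_varint(stream)
        entries.append({"offset": offset, "unknown": unknown, "length": length})
    return entries

# Hand-built sample with two made-up entries.
sample = io.BytesIO(b"c\x06glyphs\x02" b"\x00\x10\x08" b"\x81\x00\x20\x0c")
for entry in read_counted_header(sample, b"glyphs"):
    print(entry)
```

The same function should work for any header that follows the count-plus-triples shape, by passing a different type string.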
The corresponding GLYPHS blocks contain font data.
IMG Header
5 bytes 'c' 0x03 'img'
VARINT Number of IMG blocks (I)
Repeat I times:
    VARINT Offset for IMG block
    VARINT Unknown
    VARINT Length of IMG block
The corresponding IMG blocks contain images.
METADATA Header
10 bytes 'c' 0x08 'metadata'
1 byte 0x01 (probably number of METADATA blocks - always seems to be one)
VARINT Offset for METADATA block
VARINT Length of METADATA block
OTHER Header
7 bytes 'c' 0x05 'other'
1 byte 0x01 (probably number of OTHER blocks - always seems to be one)
VARINT Offset for OTHER block
VARINT Unknown (may be the uncompressed length - it's always larger than the next field)
VARINT Length of OTHER block
I have no idea what’s stored in the corresponding OTHER block.
PAGE Header
6 bytes 'c' 0x04 'page'
VARINT Number of PAGE blocks (P)
Repeat P times:
    VARINT Offset for PAGE block
    VARINT Unknown (may be the uncompressed length - it's always larger than the next field)
    VARINT Length of PAGE block
1 byte Unknown (Always 0x64)
The corresponding PAGE blocks contain the book text.
Following the headers there seems to be a separator marker. This is the point from which all of the block offsets are measured.
1 byte '@' (0x40)
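Turning a header offset into a file position is then a single addition. This sketch assumes the origin is the position of the '@' byte itself; if offsets are really counted from the byte after it, the result would be off by one. The function name and the toy bytes (which skip the real headers entirely) are made up for illustration.

```python
def absolute_block_start(separator_pos, block_offset):
    """Map an offset from a header to an absolute position in the file."""
    # ASSUMPTION: offsets count from the '@' byte itself, not the byte after it.
    return separator_pos + block_offset

# Toy layout: file start, then '@' at index 5, then a block type string.
data = b"TPZ0\x07" + b"@" + b"\x04dict"
separator_pos = 5  # in a real parser, the position reached after the last header
start = absolute_block_start(separator_pos, 1)
print(data[start:start + 5])  # b'\x04dict'
```

Note that the separator position should come from parsing the headers, not from searching for 0x40, since that byte value can also appear inside header fields.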
I’m still deciphering most of the block types, but the metadata block is easy.
METADATA Block
9 bytes 0x08 'metadata'
1 byte 0x00 (unknown)
VARINT Number of metadata key/value pairs (M)
Repeat M times:
    STRING Key
    STRING Value
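A small sketch of reading this block into a dictionary follows. It leans on two assumptions that are not established here: the VARINT decoder (7 bits per byte, most significant group first, high bit as continuation flag) and UTF-8 as the text encoding of keys and values. The first post in this series is the reference for the real VARINT and STRING encodings. Function names and sample values are invented.

```python
import io

def read_varint(stream):
    # ASSUMPTION: big-endian groups of 7 bits, high bit marks continuation.
    value = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("truncated VARINT")
        value = (value << 7) | (byte[0] & 0x7F)
        if byte[0] < 0x80:
            return value

def read_string(stream):
    # ASSUMPTION: VARINT length prefix, then raw bytes, decoded as UTF-8.
    length = read_varint(stream)
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated STRING")
    return data.decode("utf-8", errors="replace")

def read_metadata_block(stream):
    """Read the type string, the unknown 0x00 byte, then M key/value pairs."""
    if read_string(stream) != "metadata":
        raise ValueError("not a METADATA block")
    if stream.read(1) != b"\x00":
        raise ValueError("expected the 0x00 byte after the type string")
    pairs = {}
    for _ in range(read_varint(stream)):
        key = read_string(stream)
        pairs[key] = read_string(stream)
    return pairs

# Hand-built sample with made-up values.
sample = io.BytesIO(
    b"\x08metadata\x00\x02"
    b"\x04ASIN\x0aB000000000"
    b"\x05Title\x07Example"
)
print(read_metadata_block(sample))
```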
Here’s the list of keys I’ve identified:
| ASIN | Amazon Standard Identification Number |
| Authors | List of authors (semicolon separated) |
| CDEKey | Unknown - Seems to match ASIN |
| CDEType | Unknown - EBOK for book, EBSP for book sample |
| ClippingLimit | Unknown |
| GUID | Unknown - some globally unique identifier |
| MaxMemoryPage | Unknown |
| MaxMemoryUsed | Unknown |
| PublisherLimit | Unknown - 1 for books, 0 for book samples |
| Title | Title of the book |
| UpdateTime | Last modified date in ISO 8601 format - YYYYMMDD T HH:MM |
| createTime | Creation date in ISO 8601 format - YYYYMMDD T HH:MM |
| file_version | Unknown - version of this particular Kindle book? |
| firstTextPage | Unknown |
| fontSize | Unknown |
| glyphContourCount | Unknown |
| glyphCount | Unknown - Number of glyphs in the file? |
| glyphLoad_avg | Unknown - Average time for glyph loading? |
| glyphLoad_max | Unknown - Maximum time for glyph loading? |
| glyphLoad_p90 | Unknown - 90th percentile time for glyph loading? |
| glyphUseCount | Unknown |
| glyphVtxCount | Unknown - A vertex count? |
| oASIN | Unknown |
| startReadingPage | Unknown - The page the Kindle should open on reading? |
There’s a lot of inconsistency in key naming with camel case, leading caps and underscores all present in only twenty or so keys.
Well, you’ve reached the limit of what I’ve figured out so far. Future posts will come as I learn more.