
Kindle Topaz File Format: Explorations Part II

This is the second part of my continuing exploration into the Kindle Topaz file format. This will probably not make much sense until you read the first part.

Topaz files contain a set of headers and corresponding blocks. Each header and block has a type. There appear to be seven types: DICT, DKEY, GLYPHS, IMG, METADATA, OTHER and PAGE. For each header there are one or more corresponding blocks, and the header provides the number, locations and sizes of those blocks, and probably some other information I haven’t decoded yet. Topaz files seem to only ever contain a single block for the DICT, DKEY, METADATA and OTHER types. There are typically many blocks for the GLYPHS, IMG and PAGE types, which is hardly surprising given that GLYPHS contains font data, IMG contains image data, and PAGE contains book text.

I’m going to use STRING to indicate a Topaz encoded string and VARINT to indicate a Topaz encoded variable length integer. For information on how these two data types are encoded, see the first post in this series.

Topaz files begin with:

4 bytes     'TPZ0' (file identifier)
1 byte      Number of headers (always 0x07 it seems)
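
As a minimal sketch, the file start could be checked like this in Python. The function name `read_file_start` is made up for illustration, and the sample bytes are hand-built rather than taken from a real book.

```python
import io

def read_file_start(stream):
    """Read the 4-byte file identifier and the 1-byte header count."""
    magic = stream.read(4)
    if magic != b"TPZ0":
        raise ValueError("not a Topaz file: %r" % magic)
    count_byte = stream.read(1)
    if len(count_byte) != 1:
        raise ValueError("file ends before the header count")
    # Indexing a bytes object yields an int in Python 3.
    return count_byte[0]

# Hand-built sample: identifier followed by a header count of 7.
print(read_file_start(io.BytesIO(b"TPZ0\x07")))  # prints 7
```

The count is returned rather than checked against 7, since "always 0x07" is only an observation so far.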

Following this is a set of headers. There always seem to be seven headers, and they always appear in the same order. In general, each header records how many blocks there are of that type, where they are located in the file, and how long each block is. Each header has the following general structure in the file:

1 byte      0x63 ('c')
STRING      Header type identifier string. e.g., 0x04 + 'dict'
DATA        Header data (varies based on type)
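
A rough Python sketch of reading the start of one header follows. It assumes the length prefix of the type STRING fits in a single byte, which holds for all seven type names seen so far; longer strings would need the full VARINT decoding from the first post. `read_header_type` is a hypothetical helper name.

```python
import io

def read_header_type(stream):
    """Read the 'c' marker and the header type identifier string."""
    marker = stream.read(1)
    if marker != b"c":
        raise ValueError("expected 0x63 ('c'), got %r" % marker)
    length_byte = stream.read(1)
    if len(length_byte) != 1:
        raise ValueError("file ends before the type string")
    # Assumption: the STRING length prefix is a single byte here.
    length = length_byte[0]
    name = stream.read(length)
    if len(name) != length:
        raise ValueError("file ends inside the type string")
    return name.decode("ascii")

# Hand-built sample: 'c', 0x04, 'dict', then the first byte of header data.
print(read_header_type(io.BytesIO(b"c\x04dict\x01")))  # prints dict
```

After this call the stream is positioned at the type-specific header data.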

Before I describe the individual headers, let me mention the block offset and block length fields. The block offsets contained in the headers are measured from the separator ‘@’ byte described below, not from the beginning of the file. The length of a block is measured from the point where the block’s data starts. That starting point seems to vary per block type, but the length always excludes the block type string, any constant 0x00/0xFF bytes, any block index number and any field count.

DICT Header

6 bytes     'c' 0x04 'dict'
1 byte      0x01 (probably number of DICT blocks - always seems to be one)
VARINT      Offset for DICT block
VARINT      Unknown
VARINT      Length of DICT block

I assume the corresponding DICT block contains information related to the DRM encryption.

DKEY Header

6 bytes     'c' 0x04 'dkey'
1 byte      0x01 (probably number of DKEY blocks - always seems to be one)
VARINT      Offset for DKEY block
VARINT      Length of DKEY block
1 byte      0x00 (Unknown)

I assume the corresponding DKEY block contains information related to the DRM encryption.

GLYPHS Header

8 bytes     'c' 0x06 'glyphs'
VARINT      Number of GLYPHS blocks (G)
Repeat G times:
  VARINT    Offset for GLYPHS block
  VARINT    Unknown
  VARINT    Length of GLYPHS block

The corresponding GLYPHS blocks contain font data.
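
Here is a sketch of walking the repeated entries of a GLYPHS header in Python. It assumes VARINT is a big-endian base-128 encoding (7 data bits per byte, high bit set on every byte except the last); that encoding is not restated in this post, so swap in your own decoder if the assumption is wrong. Negative values are not handled. `read_varint` and `read_glyphs_header` are hypothetical names, and the sample bytes are hand-built.

```python
import io

def read_varint(stream):
    """Decode one VARINT (assumed big-endian base-128, high bit = more)."""
    value = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise ValueError("unexpected end of data inside a VARINT")
        byte = raw[0]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:  # high bit clear: this was the last byte
            return value

def read_glyphs_header(stream):
    """Return a list of (offset, unknown, length) tuples, one per block."""
    tag = stream.read(8)
    if tag != b"c\x06glyphs":
        raise ValueError("not a GLYPHS header: %r" % tag)
    count = read_varint(stream)
    entries = []
    for _ in range(count):
        offset = read_varint(stream)
        unknown = read_varint(stream)  # meaning not decoded yet
        length = read_varint(stream)
        entries.append((offset, unknown, length))
    return entries

# Two made-up entries; 0x81 0x00 decodes to 128 under the assumption above.
sample = b"c\x06glyphs\x02" + b"\x00\x10\x08" + b"\x08\x81\x00\x20"
print(read_glyphs_header(io.BytesIO(sample)))  # prints [(0, 16, 8), (8, 128, 32)]
```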

IMG Header

5 bytes     'c' 0x03 'img'
VARINT      Number of IMG blocks (I)
Repeat I times:
  VARINT    Offset for IMG block
  VARINT    Unknown
  VARINT    Length of IMG block

The corresponding IMG blocks contain images.

METADATA Header

10 bytes    'c' 0x08 'metadata'
1 byte      0x01 (probably number of METADATA blocks - always seems to be one)
VARINT      Offset for METADATA block
VARINT      Length of METADATA block

OTHER Header

7 bytes     'c' 0x05 'other'
1 byte      0x01 (probably number of OTHER blocks - always seems to be one)
VARINT      Offset for OTHER block
VARINT      Unknown (may be the uncompressed length - it's always larger than the next field)
VARINT      Length of OTHER block

I have no idea what’s stored in the corresponding OTHER block.

PAGE Header

6 bytes     'c' 0x04 'page'
VARINT      Number of PAGE blocks (P)
Repeat P times:
  VARINT    Offset for PAGE block
  VARINT    Unknown (may be the uncompressed length - it's always larger than the next field)
  VARINT    Length of PAGE block
1 byte      Unknown (Always 0x64)

The corresponding PAGE blocks contain the book text.

Following the headers there seems to be a separator marker. This is the point from which all of the block offsets are measured.

1 byte      '@' (0x40)
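
Turning a header's block offset into a position in the file is then a single addition. The sketch below assumes offsets count from the position of the '@' byte itself; if they turn out to count from the byte after it, add one. `block_start` is a hypothetical helper, and the data is a hand-built stand-in, not a real file.

```python
def block_start(data, separator_pos, block_offset):
    """Turn a header's block offset into an absolute position in the file."""
    # Sanity check that separator_pos really points at the '@' byte.
    if data[separator_pos:separator_pos + 1] != b"@":
        raise ValueError("no '@' separator at position %d" % separator_pos)
    # Assumption: offsets are counted from the '@' byte itself.
    return separator_pos + block_offset

# Stand-in data: 4 filler bytes in place of the headers, '@', then a block.
data = b"...." + b"@" + b"\x08metadata"
start = block_start(data, 4, 1)
print(data[start:start + 9])  # prints b'\x08metadata'
```

The separator position should come from parsing the headers to their end rather than from searching for 0x40, since that byte value can also occur inside the header fields.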

I’m still deciphering most of the block types, but the metadata block is easy.

METADATA Block

9 bytes     0x08 'metadata'
1 byte      0x00 (unknown)
VARINT      Number of metadata key/value pairs (M)
Repeat M times:
  STRING    Key
  STRING    Value
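
The METADATA block layout translates into a short Python sketch. It rests on assumptions not restated in this post: VARINT is taken to be big-endian base-128 (high bit set on every byte except the last), STRING is taken to be a VARINT length followed by that many bytes, and the text is decoded as UTF-8. The function names are hypothetical and the sample block is made up.

```python
import io

def read_varint(stream):
    """Decode one VARINT (assumed big-endian base-128, high bit = more)."""
    value = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise ValueError("unexpected end of data inside a VARINT")
        value = (value << 7) | (raw[0] & 0x7F)
        if raw[0] < 0x80:
            return value

def read_string(stream):
    """Decode one STRING (assumed VARINT length, then the bytes)."""
    length = read_varint(stream)
    raw = stream.read(length)
    if len(raw) != length:
        raise ValueError("unexpected end of data inside a STRING")
    return raw.decode("utf-8", errors="replace")  # encoding is a guess

def read_metadata_block(stream):
    """Return the key/value pairs of a METADATA block as a dict."""
    tag = stream.read(9)
    if tag != b"\x08metadata":
        raise ValueError("not a METADATA block: %r" % tag)
    stream.read(1)  # the 0x00 byte whose purpose is unknown
    count = read_varint(stream)
    pairs = {}
    for _ in range(count):
        key = read_string(stream)
        pairs[key] = read_string(stream)
    return pairs

# Made-up block with two pairs.
sample = (b"\x08metadata\x00\x02"
          b"\x05Title\x04Book"
          b"\x04ASIN\x0aB000000000")
print(read_metadata_block(io.BytesIO(sample)))
# prints {'Title': 'Book', 'ASIN': 'B000000000'}
```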

Here’s the list of keys I’ve identified:

ASIN                Amazon Standard Identification Number
Authors             List of authors (semicolon separated)
CDEKey              Unknown - Seems to match ASIN
CDEType             Unknown - EBOK for book, EBSP for book sample
ClippingLimit       Unknown
GUID                Unknown - some globally unique identifier
MaxMemoryPage       Unknown
MaxMemoryUsed       Unknown
PublisherLimit      Unknown - 1 for books, 0 for book samples
Title               Title of the book
UpdateTime          Last modified date in ISO 8601 format - YYYYMMDD T HH:MM
createTime          Creation date in ISO 8601 format - YYYYMMDD T HH:MM
file_version        Unknown - version of this particular Kindle book?
firstTextPage       Unknown
fontSize            Unknown
glyphContourCount   Unknown
glyphCount          Unknown - Number of glyphs in the file?
glyphLoad_avg       Unknown - Average time for glyph loading?
glyphLoad_max       Unknown - Maximum time for glyph loading?
glyphLoad_p90       Unknown - 90th percentile time for glyph loading?
glyphUseCount       Unknown
glyphVtxCount       Unknown - A vertex count?
oASIN               Unknown
startReadingPage    Unknown - The page the Kindle should open on reading?

There’s a lot of inconsistency in key naming: camel case, leading caps and underscores are all present in only twenty or so keys.

Well, you’ve reached the limit of what I’ve figured out so far. Future posts will come as I learn more.
