taw's blog: ESF - Empire Total War Object Serialization Format

Princess Sophia (R.I.P. babygirl!) by Shandi-lee from flickr (CC-NC-ND)

ESF is the primary data format used by Total War series games starting from Empire, and then Napoleon, Shogun 2, and whatever comes next.

It was reverse engineered from partial documentation by various people including yours truly, but so far there's no complete documentation available anywhere. This post is meant to be such documentation.

Obvious data types

Most data types used in ESF files are straightforward - almost everything is little-endian unless noted otherwise, data sizes are standard powers of two, negative numbers are two's-complement, character set is Unicode, floating point numbers are IEEE 754 etc.

int8, int16, int32, int64 - little-endian signed integers
uint8, uint16, uint32, uint64 - little-endian unsigned integers
float32, float64 - single and double precision IEEE 754 floating point numbers
char8 - ASCII character
char16 - little-endian UTF-16 codepoint
bool8 - boolean (00 false, 01 true, other values never observed)

Strings

There are two kinds of strings used in ESF files - ASCII strings (used mostly for internal identifiers), and Unicode strings (used mostly for displayable text).

Both are prefixed by a character count stored as uint16. For ASCII strings (ca_ascii):

uint16 count, char8[count] characters

For Unicode strings (ca_unicode):

uint16 count, char16[count] codepoints

There are no observed instances of ASCII strings having high bit set - so it's not possible to tell if that's meant as ISO-Latin-1, UTF-8, an error, or something else.

There are no observed instances of Unicode planes other than Basic Multilingual Plane being used.

There are no observed instances of 0 character/codepoint being present in either of them.

Neither string format is ever zero-terminated.
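As a minimal sketch of the two layouts above, here is how they could be read in Python. The function names read_ca_ascii and read_ca_unicode are made up for this example, and char16 data is assumed to decode cleanly as UTF-16 little-endian (which holds as long as only the Basic Multilingual Plane is used).

```python
import struct

def read_ca_ascii(data, pos=0):
    """Read a uint16-prefixed ASCII string; return (text, next_pos)."""
    (count,) = struct.unpack_from("<H", data, pos)
    start = pos + 2
    end = start + count              # one byte per character
    return data[start:end].decode("ascii"), end

def read_ca_unicode(data, pos=0):
    """Read a uint16-prefixed UTF-16LE string; return (text, next_pos)."""
    (count,) = struct.unpack_from("<H", data, pos)
    start = pos + 2
    end = start + 2 * count          # two bytes per char16
    return data[start:end].decode("utf-16-le"), end
```

Both functions return the position just past the string, since there is no terminator to scan for.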

Exotic encodings for numbers

Before I get to the format itself, there are a few exotic encodings used in ESFs.

int24be and uint24be are big-endian (that is, reversed relative to everything else) 24-bit (3-byte) signed and unsigned integers. So the byte sequence 01 00 00 means 65536.
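A small sketch of decoding these in Python. The function names are hypothetical, and the signed variant assumes two's-complement like every other signed type in the format, which I haven't separately verified for int24be.

```python
def read_uint24be(data, pos=0):
    """Decode 3 bytes at pos as a big-endian unsigned integer."""
    return int.from_bytes(data[pos:pos + 3], "big", signed=False)

def read_int24be(data, pos=0):
    """Decode 3 bytes at pos as a big-endian signed integer.

    Assumption: two's-complement, as for the other signed types.
    """
    return int.from_bytes(data[pos:pos + 3], "big", signed=True)
```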

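uintvar is a variable length encoding for unsigned integers, with the following rules:

result = 0;
while(data[i] & 0x80) {
  // high bit set: more bytes follow, take the low 7 bits
  result = (result << 7) | (data[i] & 0x7f);
  i += 1;
}
// last byte has its high bit clear and supplies the lowest 7 bits
result = (result << 7) | data[i];
i += 1;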

Or in diagrams:

0XXXXXXX → 0bXXXXXXX
1YYYYYYY 0XXXXXXX → 0bYYYYYYYXXXXXXX
1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bZZZZZZZYYYYYYYXXXXXXX
1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bWWWWWWWZZZZZZZYYYYYYYXXXXXXX
1VVVVVVV 1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bVVVVVVVWWWWWWWZZZZZZZYYYYYYYXXXXXXX

Or in examples:

0 is encoded as 00
1 is encoded as 01
127 is encoded as 7f
128 is encoded as 81 00
255 is encoded as 81 7f

This format is somewhat analogous to UTF-8. In theory you could encode arbitrarily long numbers this way, but it's never longer than 5 bytes. And for that matter encoded numbers are never longer than 32 bits (even though 5 bytes would fit 35 bits in this encoding). There are many observed cases where encoding different than the shortest possible encoding is used - as far as I know it's typically 5 bytes used when 4 would suffice, but I haven't investigated it any deeper than that. For example - 80 80 80 80 01, 80 80 80 01, 80 80 01, 80 01, and 01 are all valid encodings for number 1.
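Here is a runnable sketch of uintvar in Python, with both directions. The names decode_uintvar and encode_uintvar are mine, and the encoder always emits the shortest form, which as noted above is not what the games always do.

```python
def decode_uintvar(data, pos=0):
    """Decode a uintvar starting at pos; return (value, next_pos)."""
    result = 0
    while True:
        byte = data[pos]
        pos += 1
        result = (result << 7) | (byte & 0x7F)
        if not byte & 0x80:          # high bit clear marks the last byte
            return result, pos

def encode_uintvar(value):
    """Encode a non-negative integer in the shortest uintvar form."""
    groups = [value & 0x7F]          # last byte, high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))   # most significant group first
```

The decoder accepts padded encodings such as 80 80 80 80 01 without any special handling, because leading 80 bytes just shift in zero bits.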

Format variants

There are 4 variants of ESF format, usually referred to by their magic number:

ABCD - used in Empire Total War and Napoleon Total War
ABCE - used in Empire Total War and Napoleon Total War
ABCF - used in Shogun 2 Total War
ABCA - used in Shogun 2 Total War

Header

ESF files start with the following header:

uint32 - magic number (0xABCD, 0xABCE, 0xABCF, 0xABCA)
uint32 - 4 bytes, always zeros - not present in ABCD format
uint32 - 4 bytes, look like Unix timestamp - not present in ABCD format
uint32 - offset where footer starts

The absence of the two middle fields (the zero field and the timestamp) is the only difference between the ABCD and ABCE formats. The header is followed by a single root node (usually with more nodes nested within), then the footer, and possibly zero padding.
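A rough sketch of reading this header in Python. The function name read_header and the dictionary keys are invented for the example; the only logic is "ABCD has two fields, everything else has four".

```python
import struct

KNOWN_MAGICS = (0xABCD, 0xABCE, 0xABCF, 0xABCA)

def read_header(data):
    """Parse the ESF header at the start of data into a dict."""
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic not in KNOWN_MAGICS:
        raise ValueError("not an ESF file: magic %#x" % magic)
    header = {"magic": magic}
    pos = 4
    if magic != 0xABCD:
        # zero field and timestamp-like field, absent in ABCD
        header["zeros"], header["timestamp"] = struct.unpack_from("<II", data, pos)
        pos += 8
    (header["footer_offset"],) = struct.unpack_from("<I", data, pos)
    pos += 4
    header["root_offset"] = pos      # the root node starts right after the header
    return header
```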

Footer

ESF format consists of nodes similar to XML tags. The first thing in the footer is a lookup table for names of these tags:

uint16 number of tag types
ca_ascii name of tag 0
ca_ascii name of tag 1
ca_ascii name of tag 2
...

For example tag table containing nodes "kittens" and "pandas" would be:

02 00 - size of tag name table as uint16
07 00 - size of string "kittens" as uint16
6b 69 74 74 65 6e 73 - ASCII encoding of "kittens"
06 00 - size of string "pandas" as uint16
70 61 6e 64 61 73 - ASCII encoding of "pandas"
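The same example can be checked with a few lines of Python. This is only a sketch: read_tag_table is a name I made up, and pos would normally be the footer offset taken from the header.

```python
import struct

def read_tag_table(data, pos=0):
    """Read the tag name table; return (list_of_names, next_pos)."""
    (count,) = struct.unpack_from("<H", data, pos)
    pos += 2
    names = []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        names.append(data[pos:pos + length].decode("ascii"))
        pos += length
    return names, pos

# The kittens/pandas table from the example above.
example = bytes.fromhex("0200" "0700" "6b697474656e73" "0600" "70616e646173")
names, end = read_tag_table(example)
```

A tag's index is simply its position in the returned list, so names[0] is "kittens" and names[1] is "pandas".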

For formats ABCD and ABCE that's the entire footer. Formats ABCF and ABCA contain two more data structures - lookup tables for Unicode strings and ASCII strings. ABCD/ABCE formats kept all strings other than node names within main data. The main change in ABCF/ABCA formats was moving the strings to the footer, and only using indexes to footer tables in main data, resulting in much smaller files. Footer for ABCF/ABCA continues as:

uint16 size of Unicode string lookup table
ca_unicode string A
uint32 index of string A
ca_unicode string B
uint32 index of string B
...
uint16 size of ASCII string lookup table
ca_ascii string A
uint32 index of string A
ca_ascii string B
uint32 index of string B
...

The main difference between tag name lookup table and these tables is that tag name lookup table simply used positions on the list as indexes, and these tables use explicit indexes (it's like Array vs Hash). There have been no observed cases of multiple entries having the same ID, or any other such irregularities. Following footer in some ABCA format files there are some 00 bytes. These as far as I know do nothing at all and are ignored.

Data nodes - numbers

Presenting His Royal Highness King Milo the First! by Malingering from flickr (CC-NC-ND)

Between header and footer there are data nodes - or to be more precise exactly one data node, which may contain nested nodes - very much like XML has one root element. Node type can be determined by its first byte. Numbers nodes have very simple correspondence between encoding and encoded data:

01 bool8 - boolean
02 int8 - int8 (never observed in practice)
03 int16 - int16
04 int32 - int32
05 int64 - int64 (rarely observed in practice)
06 uint8 - uint8
07 uint16 - uint16
08 uint32 - uint32
09 uint64 - uint64 (rarely observed in practice)
0a float32 - float32
0b float64 - float64 (never observed in practice)

A few further node types are only a bit more complicated:

0c float32 float32 - XY coordinates
0d float32 float32 float32 - XYZ coordinates
10 uint16 - angle from 0 to almost-360 degrees
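Since each of these node types is one type byte followed by fixed-size data, a lookup table of struct formats is enough to read them. The sketch below is my own arrangement, not anything from the games; the angle node is returned as its raw uint16, since the scaling to degrees isn't spelled out here.

```python
import struct

# node type byte -> struct format of the payload (all little-endian)
NUMERIC_NODES = {
    0x01: "<?", 0x02: "<b", 0x03: "<h", 0x04: "<i", 0x05: "<q",
    0x06: "<B", 0x07: "<H", 0x08: "<I", 0x09: "<Q",
    0x0A: "<f", 0x0B: "<d",
    0x0C: "<ff",     # XY coordinates
    0x0D: "<fff",    # XYZ coordinates
    0x10: "<H",      # angle, left as the raw uint16
}

def read_numeric_node(data, pos=0):
    """Read one numeric node at pos; return (value_or_tuple, next_pos)."""
    fmt = NUMERIC_NODES[data[pos]]
    values = struct.unpack_from(fmt, data, pos + 1)
    next_pos = pos + 1 + struct.calcsize(fmt)
    return (values[0] if len(values) == 1 else values), next_pos
```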

Choice of numerical node types is not data dependent - if certain object has uint32 with value 1, it will always be encoded as 08 01 00 00 00 in ABCD/ABCE/ABCF formats. Most numerical data in ESF files is uint32, int32, and float32 - so there's plenty of zeroes there. As an optimization ABCA format introduced further node types not present in ABCD/ABCE/ABCF formats:

12 - boolean true
13 - boolean false
14 - uint32 zero
15 - uint32 one
16 uint8 - uint32
17 uint16 - uint32
18 uint24be - uint32
19 - int32 zero
1a int8 - int32
1b int16 - int32
1c int24be - int32
1d - float32 zero

So uint32 node with value 5 might be encoded as 08 05 00 00 00, 16 05, 17 05 00, or 18 00 00 05. As far as I can tell the most compact possible representation is always used (16 05 in this case), but I haven't strictly verified that.
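As an illustration, here is what a "pick the most compact representation" encoder for uint32 nodes could look like in Python. The function name is hypothetical, and the always-shortest policy is just the unverified observation above turned into code.

```python
def encode_uint32_abca(value):
    """Encode a uint32 node using the shortest ABCA node type that fits."""
    if value == 0:
        return b"\x14"                                # uint32 zero, no payload
    if value == 1:
        return b"\x15"                                # uint32 one, no payload
    if value < 1 << 8:
        return b"\x16" + value.to_bytes(1, "little")  # stored as uint8
    if value < 1 << 16:
        return b"\x17" + value.to_bytes(2, "little")  # stored as uint16
    if value < 1 << 24:
        return b"\x18" + value.to_bytes(3, "big")     # stored as uint24be
    return b"\x08" + value.to_bytes(4, "little")      # plain uint32
```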

String nodes

In formats ABCD/ABCE string nodes are encoded as follows:

0e ca_unicode - unicode string node
0f ca_ascii - ASCII string node

In formats ABCF/ABCA, instead of the string itself they contain only an index into the appropriate lookup table in the footer:

0e uint32 - unicode string node
0f uint32 - ASCII string node

Array nodes

There are array node types corresponding to the basic node types; their code is 40 (hex) plus the code of the basic data type:

41 boolean array
42 int8 array
43 int16 array
etc.

In ABCD/ABCE/ABCF formats arrays are encoded as:

uint8 node-type-code (40..5f)
uint32 offset of first byte after end of array (this is a weird way of encoding size)
element 0
element 1
...

In ABCA format it's:

uint8 node-type-code (40..5f)
uintvar number of bytes in the array
element 0
element 1
...

So for example if we have array of two uint32s - [100, 200] starting at offset 0x6000 in file, it will be encoded like this in ABCD/ABCE/ABCF:

[offset 0x6000] 48 - since 08 is node type code for uint32, 48 is node type code for array of uint32s
[offset 0x6001..0x6004] 0d 60 00 00 - offset of first byte after end of array (0x600d)
[offset 0x6005..0x6008] 64 00 00 00 - 100 encoded as uint32
[offset 0x6009..0x600c] c8 00 00 00 - 200 encoded as uint32
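The offset-based size is easy to get wrong by one, so here is a small Python sketch of the ABCD/ABCE/ABCF encoder for a uint32 array. The function name and the start_offset parameter are mine; the end offset is the array's start plus 1 type byte, 4 offset bytes, and the element data.

```python
import struct

def encode_uint32_array_abcd(values, start_offset):
    """Encode a uint32 array node that will be written at start_offset."""
    body = b"".join(struct.pack("<I", v) for v in values)
    # first byte after the array: type byte + offset field + elements
    end_offset = start_offset + 1 + 4 + len(body)
    return bytes([0x48]) + struct.pack("<I", end_offset) + body

encoded = encode_uint32_array_abcd([100, 200], 0x6000)
```

Because the end offset is absolute, the same array produces different bytes depending on where in the file it lands.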

In ABCA format array encoding doesn't depend on offset, so it's:

48 - code for array of uint32s
08 - size of the array as uintvar
64 00 00 00 - 100 encoded as uint32
c8 00 00 00 - 200 encoded as uint32

However the same way numbers in ABCA format can be encoded with shorter encodings if they fit, entire arrays can as well, if every one of their elements does:

56 - since 16 is code of uint32-encoded-as-uint8
02 - size of the array as uintvar
64 - 100 encoded as uint8
c8 - 200 encoded as uint8

There's no way to mix encodings within the array. Array of [0, 1, 1000] would be encoded as:

57 - since 17 is code of uint32-encoded-as-uint16
06 - size of array as uintvar
00 00 - 0 encoded as uint16
01 00 - 1 encoded as uint16
e8 03 - 1000 encoded as uint16

Now since arrays encode their size as either byte count (ABCA) or file offset (ABCD/ABCE/ABCF) and not as number of elements, they never use 0-size encodings like 52, 53, 54, 55, 59, 5d. In ABCF/ABCA formats there are arrays of ASCII and Unicode strings, and they cause no complications since "string" there is just uint32 index to a separate lookup table. There are no such arrays in ABCD/ABCE formats, where strings are encoded directly in the data using variable number of bytes.
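To tie the ABCA array examples together, here is a sketch of an encoder that picks one element width for the whole array. The function names are hypothetical, and choosing the narrowest width that fits every element is my assumption, extrapolated from the examples above. Empty arrays are rejected, since it's not clear which type code they would use.

```python
def encode_uintvar(value):
    """Shortest uintvar encoding of a non-negative integer."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))

def encode_uint32_array_abca(values):
    """Encode a non-empty uint32 array with one shared element width."""
    if not values:
        raise ValueError("empty arrays are not handled by this sketch")
    biggest = max(values)
    if biggest < 1 << 8:
        code, width, order = 0x56, 1, "little"   # 40 + 16
    elif biggest < 1 << 16:
        code, width, order = 0x57, 2, "little"   # 40 + 17
    elif biggest < 1 << 24:
        code, width, order = 0x58, 3, "big"      # 40 + 18, uint24be
    else:
        code, width, order = 0x48, 4, "little"   # 40 + 08
    body = b"".join(v.to_bytes(width, order) for v in values)
    return bytes([code]) + encode_uintvar(len(body)) + body
```

Note that the zero-payload codes 54 and 55 are never chosen here, matching the observation that 0-size encodings don't appear in arrays.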

Record node

Prince Levi by geckoam from flickr (CC-NC-ND)

The most important node type in ESF is record type, vaguely corresponding to tags in XML, or non-basic object types in Java. The root of ESF files is always a record type, and since records can contain any number of nodes in them, they provide basic structure to ESF files. For ABCD/ABCE/ABCF format record nodes are encoded as follows:

uint8 80 - record node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number - starts with 0, updated every time object format changes
uint32 offset of first byte after end of record
node 0
node 1
...

In ABCA the situation is more complicated, since there are actually three record encodings. The root node - and the root node only - is encoded as:

uint8 80
uint16 tag name
uint8 version
uintvar size of data within record in bytes
node 0
node 1
...

Other nodes can either use traditional encoding, except with different node type code:

uint8 a0
uint16 tag name
uint8 version
uintvar size of data within record in bytes
node 0
node 1
...

Or a new compact encoding. In compact encoding first two bytes are better considered as a bitfield:

100vvvvt tttttttt - That's 3 fixed bits (100), 4 bits for version number (v), and 9 bits for tag name index (t).
uintvar size of data within record in bytes
node 0
node 1
...

So the first byte of a compact record can be anything between 80 and 9f.
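A sketch of unpacking the compact record header in Python. The function names are made up, and I'm assuming the 16 bits are read first byte then second byte, most significant bit first, exactly as drawn in the bitfield above.

```python
def decode_uintvar(data, pos=0):
    """Decode a uintvar starting at pos; return (value, next_pos)."""
    result = 0
    while True:
        byte = data[pos]
        pos += 1
        result = (result << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return result, pos

def decode_compact_record_header(data, pos=0):
    """Return (version, tag_index, size, contents_pos) for a compact record.

    The root record also starts with byte 80 but uses the longer layout,
    so the caller has to treat the root separately before calling this.
    """
    first, second = data[pos], data[pos + 1]
    if first & 0xE0 != 0x80:                     # top three bits must be 100
        raise ValueError("not a compact record: %#x" % first)
    version = (first >> 1) & 0x0F                # vvvv
    tag_index = ((first & 0x01) << 8) | second   # t tttttttt
    size, pos = decode_uintvar(data, pos + 2)
    return version, tag_index, size, pos
```

With 4 and 9 bits available, the compact form can only represent versions up to 15 and tag indexes up to 511; anything beyond that would presumably need the a0 encoding.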

Array of records node

The last type is arrays of records. All records in the array are the same type and version. For ABCD/ABCE/ABCF format it's encoded as follows:

uint8 81 - array of records node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number
uint32 offset of first byte after end of array
uint32 number of elements
uint32 offset of first byte after end of record #0
contents of record 0
uint32 offset of first byte after end of record #1
contents of record 1
...
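Here's a sketch of walking such an array in Python without parsing the records themselves. It only collects the (start, end) byte range of each record's contents; read_record_array_abcd is a hypothetical name, and all offsets are absolute file offsets as described above.

```python
import struct

def read_record_array_abcd(data, pos):
    """Return (tag_index, version, record_spans, array_end) for an 81 node."""
    code, tag_index, version, array_end, count = struct.unpack_from("<BHBII", data, pos)
    if code != 0x81:
        raise ValueError("not an array of records: %#x" % code)
    pos += 12                                    # 1 + 2 + 1 + 4 + 4 header bytes
    spans = []
    for _ in range(count):
        (record_end,) = struct.unpack_from("<I", data, pos)
        spans.append((pos + 4, record_end))      # contents follow the offset field
        pos = record_end                         # next record starts where this one ends
    return tag_index, version, spans, array_end
```

Each span can then be handed to whatever parses the nodes inside a record, with the array's tag name and version applying to all of them.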

ABCA format has two encodings. Traditional encoding:

uint8 e0 - array of records node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number
uintvar size of array contents in bytes
uintvar number of elements
uintvar size of record 0 in bytes
contents of record 0
uintvar size of record 1 in bytes
contents of record 1
...

And compact encoding very similar to one used for records:

110vvvvt tttttttt - That's 3 fixed bits (110), 4 bits for version number (v), and 9 bits for tag name index (t).
uintvar size of array contents in bytes
uintvar number of elements
uintvar size of record 0 in bytes
contents of record 0
uintvar size of record 1 in bytes
contents of record 1
...
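Both ABCA encodings differ only in how tag name and version are packed, so one reader can handle them. The Python sketch below uses names of my own choosing and assumes the compact bitfield is read most significant bit first. It walks the records using the per-record sizes and doesn't rely on the overall array size.

```python
def decode_uintvar(data, pos=0):
    """Decode a uintvar starting at pos; return (value, next_pos)."""
    result = 0
    while True:
        byte = data[pos]
        pos += 1
        result = (result << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return result, pos

def read_record_array_abca(data, pos=0):
    """Return (tag_index, version, record_spans) for an ABCA record array."""
    first = data[pos]
    if first == 0xE0:                            # traditional encoding
        tag_index = int.from_bytes(data[pos + 1:pos + 3], "little")
        version = data[pos + 3]
        pos += 4
    elif first & 0xE0 == 0xC0:                   # compact encoding, c0..df
        version = (first >> 1) & 0x0F
        tag_index = ((first & 0x01) << 8) | data[pos + 1]
        pos += 2
    else:
        raise ValueError("not an array of records: %#x" % first)
    _array_size, pos = decode_uintvar(data, pos) # not needed to walk the records
    count, pos = decode_uintvar(data, pos)
    spans = []
    for _ in range(count):
        size, pos = decode_uintvar(data, pos)
        spans.append((pos, pos + size))          # byte range of record contents
        pos += size
    return tag_index, version, spans
```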