Eressea: text vs. binary data files

I’m posting about this because I like talking about code and nobody is on IRC. If you don’t like code, you may want to skip this. Also, none of this is rocket science, but it was a nice exercise.

Traditionally, the data files in Eressea have been text files. You know the kind where you’re doing a lot of fprintf and fscanf everywhere in the code. There are usually two advantages associated with that: small integers take up less space (2-3 bytes instead of 4), and when something goes terribly wrong, you can edit the data with vi.

The latter had long been a mixed blessing, with edits occasionally doing more harm than good, and since the introduction of a shell and script access to all the game data, it’s not really been necessary. Which left three reasons not to switch: Slight space improvements, the amount of work to change over to something else, and backward compatibility. However, I had some time on my hands recently and decided to tackle all of those issues.

First thing, build a wrapper
I decided to take this in three steps: First, remove all the direct stdio.h usage, then implement binary storage and last, look at space improvements. I now have an intermediate interface between the code responsible for storing data and the actual writing. Instead of fprintf(F, "\"%s\" %d %s ", u->name, u->hp, u->race->name) I now write this:

store->w_str(store, u->name);
store->w_int(store, u->hp);
store->w_tok(store, u->race->name);

This is the analog code for reading:

u->name = store->r_str(store);
u->hp = store->r_int(store);
store->r_tok(store, buffer, sizeof(buffer));

Notice the store object that contains function pointers to the read/write functions, file version info and other state that the reading code might need. Once this was done, it was a matter of minutes to make a new binary reader/writer that just uses fread/fwrite. The resulting file was 120 MB where the original had been 100 MB, just as expected (there are a lot of integers in Eressea). Also, because I’m prefixing all strings with their size, that’s another small overhead over the 0-delimited strings (but I’m saving some on not having to quote them, which is great).

Size matters
Storing integers as 4 bytes made a big impact, so I went ahead and packed them. This can be done with almost the same approach as UTF-8 uses, but you need to pay a little attention to negative integers, and you can save a little storage, so I now use between 1 and 5 bytes to store an integer, but any number in [-63,64] takes up only one byte. Result: The file shrinks down to just around 70 MB.

Conclusion
If you’ve read this far, then you may be interested in the results. the following is for un-optimized code, because the use case I’m interested in is when I run the game in the debugger.

       | Q6600 2.4Ghz | PII/450 Mhz
-------+--------------+-----------
binary | 18.953s      | 33.56s
text   | 50.574s      | 73.6s
-------+--------------+-----------
ratio  | 37%          | 46%

Read time for the data file went down to just 46% of the previous time, and a lot of the remaining time spent is creating the data structures, not disk IO. On my desktop at home (left column), things look even better. Previously, opening the game data in the editor meant time for a coffee break.

I guess I always knew how bad fscanf was, but these results still shocked me a little bit. Where in Knuth’s name does it spend all that time? On the crappy old server that runs the game, this kind of optimization actually mattered, and if I had known how much it matters, I would have done this a lot earlier.

One final word for those of you who don’t remember how crappy a PII/450 is: gzip throughput is about 8 Mbit/s on this box. Which should explain why I don’t do compression.

And as promised above, here’s my int-packing code. Use it as you please, and let me know if you think you can improve it.

size_t
pack_int(int v, char * buffer)
{
  int sign = (v<0);


  if (sign) {
    v = ~v + 1;
    sign = 0x40;
  }
  if (v<0x40) {
    buffer[0] = (char)(v | sign);
    return 1;
  } else if (v> 6) | 0x80);
    buffer[1] = (char)((v & 0x3F) | sign);
    return 2;
  } else if (v>13) & 0x7f) | 0x80);
    buffer[1] = (char)(((v>> 6) & 0x7f) | 0x80);
    buffer[2] = (char)((v & 0x3F) | sign);
    return 3;
  } else if (v>20) & 0x7f) | 0x80);
    buffer[1] = (char)(((v>>13) & 0x7f) | 0x80);
    buffer[2] = (char)(((v>> 6) & 0x7f) | 0x80);
    buffer[3] = (char)((v & 0x3F) | sign);
    return 4;
  }
  buffer[0] = (char)(((v>>27) & 0x7f) | 0x80);
  buffer[1] = (char)(((v>>20) & 0x7f) | 0x80);
  buffer[2] = (char)(((v>>13) & 0x7f) | 0x80);
  buffer[3] = (char)(((v>> 6) & 0x7f) | 0x80);
  buffer[4] = (char)((v & 0x3F) | sign);
  return 5;
}

int
unpack_int(const char * buffer)
{
  int i = 0, v = 0;

  while (buffer[i] & 0x80) {
    v = (v << 7) | (buffer[i++] & 0x7f);
  }
  v = (v << 6) | (buffer[i] & 0x3f);

  if (buffer[i] & 0x40) {
    v = ~v + 1;
  }
  return v;
}

André Johansen

November 2, 2008 at 2:36 pm

I’m now working with a system where all the data (a relatively small amount) is preprocessed completely into files that are loaded into shared memory directly. A bit yucky when it comes to changing data formats, but you save a lot of time at program startup, where it matters.

(BTW, seems you mixed up the timing numbers!)

Enno

November 2, 2008 at 5:26 pm

Fixed, thanks. Memory mapped files are indeed good for startup times, but they would make for bigger files in my case, and with slow disk and the need to read all the data all the time, I think it would slow down total execution time for me.

Eressea: text vs. binary data files

2 thoughts on “Eressea: text vs. binary data files”

Leave a Comment Cancel Reply