Code Kata: Prefix search (Part 3: more fun with tries)

In my last post, I demonstrated that crit-bit trees are really fast. This time, I want to talk a little bit about my specific implementation of these tries, and some interesting applications you can use them for.

I also want to acknowledge some of the prior art that I looked at before deciding to roll my own implementation. Professor Bernstein himself has written an implementation for his portable qhasm assembler. But since it’s part of a larger software product, I didn’t want to disentangle it. Adam Langley has an implementation on github with some outstanding documentation that was very useful, but it only stores zero-terminated strings, not arbitrary data. While I didn’t read his code too closely, to keep my own implementation challenging, I did steal the clever bit-mask trick from it.

More Things To Do With Tries

While they are really good at string lookup, what crit-bit trees are best at is finding strings with a common prefix. I implemented two separate functions for this:

const void * cb_find(critbit_tree * cb, const void * key, size_t keylen);

int cb_foreach(critbit_tree * cb, const void * key, size_t keylen, int (*match_cb)(const void * match, const void * key, size_t keylen, void *), void *data);

The first function is straightforward: it looks for an exact match of the first keylen bytes of key in the given tree. The second function calls the match_cb callback on every match, which is a much easier pattern to implement than trying to return all matches. Where would you store them? Dynamic memory allocation is right out, of course. I compromised a bit by writing the cb_find_prefix function, which takes a buffer supplied by the caller, but beware that paging through the results by calling it repeatedly is quadratic in complexity, so you shouldn’t do that. It’s really just a hack.
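To make the callback pattern concrete, here is a minimal sketch that uses the same callback shape as match_cb. It is not the crit-bit implementation: foreach_prefix is a hypothetical stand-in that scans a plain array of strings, and count_match is an example callback I made up for illustration. Stopping early when the callback returns non-zero is also an assumption of this sketch, not something the real cb_foreach is documented to do here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same callback shape as match_cb in the cb_foreach declaration. */
typedef int (*match_fn)(const void *match, const void *key, size_t keylen, void *data);

/* Hypothetical stand-in for cb_foreach: a linear scan over an array of
 * strings instead of a crit-bit tree. Calls cb for every entry that starts
 * with the first keylen bytes of key. Assumption: a non-zero return value
 * from cb stops the scan and is passed back to the caller. */
static int foreach_prefix(const char *const *entries, size_t count,
                          const void *key, size_t keylen,
                          match_fn cb, void *data)
{
    size_t i;
    assert(entries != NULL && cb != NULL);
    for (i = 0; i != count; ++i) {
        if (strlen(entries[i]) >= keylen && memcmp(entries[i], key, keylen) == 0) {
            int result = cb(entries[i], key, keylen, data);
            if (result != 0) return result;
        }
    }
    return 0;
}

/* Example callback: counts matches in the int that data points to. */
static int count_match(const void *match, const void *key, size_t keylen, void *data)
{
    (void)match; (void)key; (void)keylen;
    ++*(int *)data;
    return 0;
}
```

The caller decides what happens with each match (count it, print it, copy it somewhere), so the search function itself never has to allocate storage for a result set.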

Use As A Fast Key-Value Store

One really nifty thing that I have been using this code for is as a fast key-value store. You could insert a couple of strings like “config_value=42” in the tree and look for the first match with the prefix “config_value=”. There are two wrapper macros in critbit.h to simplify this, and being able to store data other than zero-terminated strings means I am not limited to storing string values. Here’s some example code that shows their use:

int i = 42;
char buffer[20];
const char * key = "herpderp";
void * match;
size_t len;

len = cb_new_kv(key, strlen(key), &i, sizeof(int), buffer);
cb_insert(&cb, buffer, len);
if (cb_find_prefix(&cb, key, strlen(key)+1, &match, 1, 0)) {
  cb_get_kv(match, &i, sizeof(int));
}

What these macros essentially do is this: cb_new_kv creates a key|0|value memory block (written into buffer). Once you insert it, you can search for it by prefix-search for the key plus its zero-terminator, and use cb_get_kv to extract the value from the first match.
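The key|0|value layout can be sketched with two small functions. These are hypothetical equivalents that I am writing out for illustration, named pack_kv and unpack_kv so they are not confused with the real macros in critbit.h, whose exact definitions are not shown in this post. The sketch assumes that keys contain no embedded zero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical equivalent of cb_new_kv: writes the key, a zero byte, then
 * the value into out, and returns the number of bytes written. The caller
 * must provide at least keylen + 1 + vallen bytes. */
static size_t pack_kv(const char *key, size_t keylen,
                      const void *value, size_t vallen, char *out)
{
    assert(key != NULL && value != NULL && out != NULL);
    memcpy(out, key, keylen);
    out[keylen] = '\0';
    memcpy(out + keylen + 1, value, vallen);
    return keylen + 1 + vallen;
}

/* Hypothetical equivalent of cb_get_kv: skips past the zero-terminated key
 * and copies vallen bytes of the value into the caller's variable. */
static void unpack_kv(const void *block, void *value, size_t vallen)
{
    const char *bytes = (const char *)block;
    assert(block != NULL && value != NULL);
    memcpy(value, bytes + strlen(bytes) + 1, vallen);
}
```

Copying the value out with memcpy rather than casting the pointer avoids alignment problems, since the value starts at whatever offset the key length happens to produce.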

Testing

I approached this project with a test-first attitude, writing tests before writing the actual implementation, and fixing bugs until the tests passed. I wrote almost as many lines of tests for this as I wrote actual code, and it still wasn’t enough – one or two small issues slipped through. Lots of pointers and memory-pokery is always tricky, and I think this is one of those cases where it should be undisputed that unit tests are just incredibly useful.

Atlantis TDD (2) Unit Testing

Before I get to write any more implementation, it’s time to write some tests for my interfaces. I like to use CuTest for this, because it’s small, pure C, doesn’t try to be fancy, and there’s only a single file to link to. And now I’m writing tests for every function in the interface and its edge cases:

static void test_get_regions(CuTest * tc)
{
  struct region * results[4];
  void * cur;
  icursor * icur;
  int i, n = 0;

  svc.reset();

  for (i=0;i!=3;++i) {
    svc.regions->create(0, i);
  }
  cur = svc.get_regions(&icur);
  CuAssertPtrNotNull(tc, cur);
  CuAssertPtrNotNull(tc, icur);
  CuAssertIntEquals(tc, 3, icur->get(cur, 4, (void**)results));
  for (i=0;i!=3;++i) {
    int x, y;
    svc.regions->get_xy(results[i], &x, &y);
    CuAssertIntEquals(tc, 0, x);
    n |= (1<<y);
  }
  CuAssertIntEquals(tc, 7, n);

  CuAssertIntEquals(tc, 3, icur->advance(&cur, 4));
  CuAssertIntEquals(tc, 0, icur->advance(&cur, 4));
  if (icur->destroy) icur->destroy(cur);
}

This function tests the cursor returned by the get_regions function that I talked about in my previous post. As tests go, it’s pretty big: after first creating 3 regions, it asserts that the cursor will return those regions, will only return three even if asked for four, and that it’s able to advance until it reaches the end. Notice how, once again, the code is entirely written in terms of the new interface. I tend to write “struct region” instead of “region” for the type here, to remind myself that I don’t have access to the members of my objects, and apart from knowing it’s a struct, I really don’t know anything about it.

Now, I can’t run this test yet, because to do that, I would need an actual implementation of the interface to test. Since I haven’t broken out the Atlantis source yet, I’m going to write my own implementation of the data structures that does the bare minimum, and hook it up to the interfaces. You can see the result of that in the mock directory of my github project. This reference implementation will also allow me to test my game logic later without having to link against Atlantis or Eressea.

Atlantis TDD (1) Service Provider and Interfaces

The first thing I’m going to need is a service provider that serves the interface to the game. I’m a little bit spoiled by testing in PHP, where it’s easy to use an associative array to build an interface on the fly, but I obviously won’t get this here. Here’s what my interface roughly looks like:

typedef struct iunit {
  struct unit * (*create)(void);
  void (*destroy)(struct unit *);
  struct unit * (*get)(int);

  int (*get_uid)(const struct unit *);
  struct region * (*get_region)(const struct unit *);
  void (*set_region)(struct unit *, struct region *);
  ... /* more of the same */
} iunit;

/* similar struct for iregion, iship, ibuilding TBD */

typedef struct igame {
  struct iunit * units;
  struct iregion * regions;

  int max_directions;

  void (*reset)(void);
  void * (*get_regions)(struct icursor **);
  ... /* more of the same */
} igame;

That’s a lot of function pointers! As I said in my last post, the idea is to write code that is independent of implementation, so at no point can I use the actual unit structure that’s in Atlantis, or even access u->no directly. Some other game might have an entirely different way of storing ids, after all. Instead, I will be using svc.units->get_uid(u), which is more to type, but easy to back with an implementation-dependent function. You can see the rest of the interface classes on github, if you are curious.
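Here is a minimal sketch of how a backend might plug into such an interface. Everything in it is an assumption made for illustration: iunit_min is a cut-down version of iunit with only three members, the struct unit layout and the mock_ functions are invented, and none of this is taken from the actual mock directory on github.

```c
#include <assert.h>
#include <stdlib.h>

struct unit; /* opaque to the game logic */

/* Cut-down, hypothetical version of the iunit interface. */
typedef struct iunit_min {
    struct unit *(*create)(void);
    void (*destroy)(struct unit *);
    int (*get_uid)(const struct unit *);
} iunit_min;

/* --- backend: the only part that knows the struct layout --- */
struct unit {
    int no;
};

static int next_no = 1;

static struct unit *mock_create(void)
{
    struct unit *u = malloc(sizeof *u);
    if (u) u->no = next_no++;
    return u;
}

static void mock_destroy(struct unit *u)
{
    free(u);
}

static int mock_get_uid(const struct unit *u)
{
    assert(u != NULL);
    return u->no;
}

static iunit_min mock_units = { mock_create, mock_destroy, mock_get_uid };

/* --- game logic: goes through the interface, never touches u->no --- */
static int uid_of(const iunit_min *units, const struct unit *u)
{
    return units->get_uid(u);
}
```

A different game would supply its own table of functions, and uid_of would keep working without being recompiled against that game's headers.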

I’m pretty pleased with the icursor interface. The global list of the game’s regions in Atlantis is a next-pointer chained linked list. In Eressea, it is an unrolled linked list, and I assume that in A5, it is a std::list. Or maybe it’s a hashtable? I will be iterating over all regions a lot in the game logic, and similar issues will arise for all units in a region, etc. So here’s the icursor interface:

typedef struct icursor {
  void (*destroy)(void * cursor);
  int (*get)(void * cursor, int n, void * results[]);
  int (*advance)(void ** cursor, int n);
} icursor;

A function like svc.get_regions returns a void * typed data pointer (the cursor), and an icursor interface that can be used to iterate the particular type of container it represents. You can either get one or more elements from a cursor, or advance it. When you’re done, release the cursor by calling destroy. Here’s an example:

  icursor *icur;                        /* interface for this container type */
  void *cur = svc.get_regions(&icur);   /* cursor: current position */
  region *r;
  while (icur->get(cur, 1, (void **)&r)) {
    printf("region %d,%d\n", r->x, r->y);
    icur->advance(&cur, 1);
  }
  icur->destroy(cur);
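To show what sits on the other side of the interface, here is a minimal sketch of an icursor for a next-pointer chained list. The struct node container and the list_ functions are hypothetical and written only for this example. The return values follow the behavior that the unit test in the follow-up post expects: get and advance both report how many elements they actually handled, which may be fewer than requested.

```c
#include <assert.h>
#include <stddef.h>

typedef struct icursor {
    void (*destroy)(void *cursor);
    int (*get)(void *cursor, int n, void *results[]);
    int (*advance)(void **cursor, int n);
} icursor;

/* Hypothetical container: a next-pointer chained list. */
struct node {
    struct node *next;
    int value;
};

/* Copies up to n elements, starting at the cursor, into results.
 * Returns the number of elements copied. */
static int list_get(void *cursor, int n, void *results[])
{
    struct node *p = (struct node *)cursor;
    int i = 0;
    for (; p != NULL && i != n; p = p->next) {
        results[i++] = p;
    }
    return i;
}

/* Moves the cursor forward by up to n elements.
 * Returns the number of elements skipped. */
static int list_advance(void **cursor, int n)
{
    struct node *p;
    int i = 0;
    assert(cursor != NULL);
    for (p = (struct node *)*cursor; p != NULL && i != n; p = p->next) {
        ++i;
    }
    *cursor = p;
    return i;
}

/* The cursor is only a pointer into the list, so there is nothing to free. */
static void list_destroy(void *cursor)
{
    (void)cursor;
}

static icursor list_cursor = { list_destroy, list_get, list_advance };
```

For a list, the cursor can simply be the current node pointer. A hashtable or an unrolled list would need a small allocated state object instead, which is where destroy earns its place in the interface.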

That’s the first game logic I’ve written since I started! If you think I’m off to a good start and will be writing that movement code next, think again: It’s time to write tests first!