Using libstree

Using libstree is hopefully really quite simple. As an example, let's walk through one of the test programs found in the test directory of the source distribution: lcshex.c, which calculates the longest common byte sequence of the programs you pass it as command line arguments [1].

Since libstree operates on arbitrary strings, it needs a string model that leaves the implementation of the common string operations up to the user. String families that use a given set of such operations is called a string class (called LST_StringClass). Strings are instances of the LST_String type, and suffix trees are instances of LST_STree type. Each string belongs to a string class. Since it will often be the case that an application uses only a single class of strings (e.g., byte strings), there is a default string class, whose properties you can specify once and then forget about it.

libstree currently uses three string-class-specific operations:

The default operations assume byte strings, so compare function uses memcmp() on the memory area that the string item occupies, uses simple character assignment to copy an item.

Our little test program works on binary files, so we can use the compare and copy function, but for output, we prefer hexadecimal strings. libstree already provides a function that does this, and we use it to override the default.

Let's look at some code. We will need a suffix tree and a few strings; also, we override the output method as describes above.

  LST_STree       *tree;
  LST_StringSet   *set, *result;
  LST_String      *string;

  /* Create a string class with a special print method, otherwise
   * we can use the defaults:
   */
  lst_stringclass_set_defaults(NULL, NULL, lst_string_print_hex);
  

Since most applications will have to handle multiple strings at a time, a container for a number of strings is convenient. libstree provides the type LST_StringSet that can hold a set of strings. Let's allocate such a set:

  /* Create a string set to conveniently hold all our strings: */
  set = lst_stringset_new();
  

Now, we check for each of the program names that the user passed in, whether the file exists and if we can read it. We then allocate a chunk of memory and read the entire program code into it:

  FILE *f;
  u_char *data;
  struct stat st;

  /* stat() the file, open file handle etc, now read content: */ 

  if (fread(data, 1, st.st_size, f) != (size_t) st.st_size)
    {
      printf("Error reading %s -- skipping.\n", argv[i]);
      free(data);
       continue;
    }
  

We're ready to create a string object from this bytesequence and add it to our string set.

  string = lst_string_new(data, 1, st.st_size);
  lst_stringset_add(set, string);
  

Now, let's build a suffix tree for all the strings we just put into our string set:

  /* Create a suffix tree for all strings in the set: */
  tree = lst_stree_new(set);
  

Now find the longest common substring of all those strings.

  /* Find longest common substring(s) */
  result = lst_alg_longest_common_substring(tree, 0, 0);
  

Note that the result is also a string set -- there may be multiple longest strings, of the same length. You can limit the results by requesting only substrings of a given minimum and maximum length, see lst_alg_longest_common_substring(). Finally, let's print each of the strings out to the console. The string set contents can be iterated using lst_stringset_foreach():

  /* Print them out: */
  lst_stringset_foreach(result, string_cb, NULL);
  

That's it. Please browse the other test programs to get a better feeling for the API.

Notes

[1]

What do you mean, "What is this good for!?" :)"