It has been, what, like four years since librsvg got fully rustified,
and now it is time to move another piece of critical infrastructure to
a memory-safe language.
I'm talking about libipuz, the GObject-based C library that
GNOME Crosswords uses underneath. This is a library that
parses the ipuz file format and is able to represent various
kinds of puzzles.
Libipuz is an interesting beast. The ipuz format is JSON with a lot
of hair: it needs to represent the actual grid of characters and their
solutions, the grid's cells' numbers, the puzzle's clues, and all the
styling information that crossword puzzles can have (it's more than
you think!).
{
"version": "http://ipuz.org/v2",
"kind": [ "http://ipuz.org/crossword#1", "https://libipuz.org/barred#1" ],
"title": "Mephisto No 3228",
"styles": {
"L": {"barred": "L" },
"T": {"barred": "T" },
"TL": {"barred": "TL" }
},
"puzzle": [ [ 1, 2, 0, 3, 4, {"cell": 5, "style": "L"}, 6, 0, 7, 8, 0, 9 ],
[ 0, {"cell": 0, "style": "L"}, {"cell": 10, "style": "TL"}, 0, 0, 0, 0, {"cell": 0, "style": "T"}, 0, 0, {"cell": 0, "style": "T"}, 0 ]
# the rest is omitted
],
"clues": {
"Across": [ {"number":1, "clue":"Having kittens means losing heart for home day", "enumeration":"5", "cells":[[0,0],[1,0],[2,0],[3,0],[4,0]] },
{"number":5, "clue":"Mostly allegorical poet on writing companion poem, say", "enumeration":"7", "cells":[[5,0],[6,0],[7,0],[8,0],[9,0],[10,0],[11,0]] },
]
# the rest is omitted
}
}
Libipuz uses json-glib, which works fine to ingest the
JSON into memory, but then it is a complete slog to distill the JSON
nodes into C data structures. You need iterate through each node in
the JSON tree and try to fit its data into yours.
Get me the next node. Is the node an array? Yes? How many elements?
Allocate my own array. Iterate the node's array. What's in this
element? Is it a number? Copy the number to my array. Or is it a
string? Do I support that, or do I throw an error? Oh, don't forget
the code to meticulously free the partially-constructed thing I was
building.
This is not pleasant code to write and test.
Ipuz also has a few mini-languages within the format, which live
inside string properties. Parsing these in C unpleasant at best.
Differences from librsvg
While librsvg has a very small GObject-based API, and a medium-sized
library underneath, libipuz has a large API composed of GObjects,
boxed types, and opaque and public structures. Using libipuz involves
doing a lot of calls to its functions, from loading a crossword to
accessing each of its properties via different functions.
I want to use this rustification as an exercise in porting a
moderately large C API to Rust. Fortunately, libipuz does have a
good test suite that is useful from the beginning of the port.
Also, I want to see what sorts of idioms appear when exposing things
from Rust that are not GObjects. Mutable, opaque structs can just
be passed as a pointer to a heap allocation, i.e. a Box<T>
. I want
to take the opportunity to make more things in libipuz immutable;
currently it has a bunch of reference-counted, mutable objects, which
are fine in single-threaded C, but decidedly not what Rust would
prefer. For librsvg it was very beneficial to be able to notice parts
of objects that remain immutable after construction, and to
distinguish those parts from the mutable ones that change when the
object goes through its lifetime.
Let's begin!
In the ipuz format, crosswords have a character set or charset: it
is the set of letters that appear in the puzzle's solution.
Internally, GNOME Crosswords uses the charset as a histogram of
letter counts for a particular puzzle. This is useful information
for crossword authors.
Crosswords uses the histogram of letter counts in various important
algorithms, for example, the one that builds a database of words
usable in the crosswords editor. That database has a clever
format which allows answering questions like the following
quickly: What words in the database match ?OR??
— WORDS
and
CORES
will match.
IPuzCharset
is one of the first pieces of code I worked on in
Crosswords, and it later got moved to libipuz. Originally it didn't
even keep a histogram of character counts; it was just an ordered set
of characters that could answer the question, "what is the index of
the character ch
within the ordered set?".
I implemented that ordered set with a GTree
, a balanced
binary tree. The keys in the key/value tree were the characters, and
the values were just unused.
Later, the ordered set was turned into an actual histogram with
character counts: keys are still characters, but each value is now a
count of the coresponding character.
Over time, Crosswords started using IPuzCharset
for different
purposes. It is still used while building and accessing the database
of words; but now it is also used to present statistics in the
crosswords editor, and as part of the engine in an acrostics
generator.
In particular, the acrostics generator has been running into some
performance problems with IPuzCharset
. I wanted to take the port to
Rust as an opportunity to change the algorithm and make it faster.
Refactoring into mutable/immutable stages
IPuzCharset
started out with these basic operations:
/* Construction; memory management */
IPuzCharset *ipuz_charset_new (void);
IPuzCharset *ipuz_charset_ref (IPuzCharset *charet);
void ipuz_charset_unref (IPuzCharset *charset);
/* Mutation */
void ipuz_charset_add_text (IPuzCharset *charset,
const char *text);
gboolean ipuz_charset_remove_text (IPuzCharset *charset,
const char *text);
/* Querying */
gint ipuz_charset_get_char_index (const IPuzCharset *charset,
gunichar c);
guint ipuz_charset_get_char_count (const IPuzCharset *charset,
gunichar c);
gsize ipuz_charset_get_n_chars (const IPuzCharset *charset);
gsize ipuz_charset_get_size (const IPuzCharset *charset);
All of those are implemented in terms of the key/value binary tree
that stores a character in each node's key, and a count in the node's
value.
I read the code in Crosswords that uses the ipuz_charset_*()
functions and noticed that in every case, the code first constructs
and populates the charset using ipuz_charset_add_text()
, and then
doesn't modify it anymore — it only does queries afterwards. The only
place that uses ipuz_charset_remove_text()
is the acrostics
generator, but that one doesn't do any queries later: it uses the
remove_text()
operation as part of another algorithm, but only that.
So, I thought of doing this:
-
Split things into a mutable IPuzCharsetBuilder
that has the
add_text
/ remove_text
operations, and also has a build()
operation that consumes the builder and produces an immutable
IPuzCharset
.
-
IPuzCharset
is immutable; it can only be queried.
-
IPuzCharsetBuilder
can work with a hash table, which turns the
"add a character" operation from O(log n) to O(1) amortized.
-
build()
is O(n) on the number of unique characters and is only
done once per charset.
-
Make IPuzCharset
work with a different hash table that also allows
for O(1) operations.
Basics of IPuzCharsetBuilder
IPuzCharsetBuilder
is mutable, and it can live on the Rust side as a
Box<T>
so it can present an opaque pointer to C.
#[derive(Default)]
pub struct CharsetBuilder {
histogram: HashMap<char, u32>,
}
// IPuzCharsetBuilder *ipuz_charset_builder_new (void); */
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_new() -> Box<CharsetBuilder> {
Box::new(CharsetBuilder::default())
}
For extern "C"
, Box<T>
marshals as a pointer. It's nominally what
one would get from malloc()
.
Then, simple functions to create the character counts:
impl CharsetBuilder {
/// Adds `text`'s character counts to the histogram.
fn add_text(&mut self, text: &str) {
for ch in text.chars() {
self.add_character(ch);
}
}
/// Adds a single character to the histogram.
fn add_character(&mut self, ch: char) {
self.histogram
.entry(ch)
.and_modify(|e| *e += 1)
.or_insert(1);
}
}
The C API wrappers:
use std::ffi::CStr;
// void ipuz_charset_builder_add_text (IPuzCharsetBuilder *builder, const char *text);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_text(
builder: &mut CharsetBuilder,
text: *const c_char,
) {
let text = CStr::from_ptr(text).to_str().unwrap();
builder.add_text(text);
}
CStr
is our old friend that takes a char *
and can wrap it as a
Rust &str
after validating it for UTF-8 and finding its length.
Here, the unwrap()
will panic if the passed string is not UTF-8, but
that's what we want; it's the equivalent of an assertion that what was
passed in is indeed UTF-8.
// void ipuz_charset_builder_add_character (IPuzCharsetBuilder *builder, gunichar ch);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_character(builder: &mut CharsetBuilder, ch: u32) {
let ch = char::from_u32(ch).unwrap();
builder.add_character(ch);
}
Somehow, the glib-sys crate doesn't have gunichar
, which is just a
guint32
for a Unicode code point. So, we take in a u32
, and check
that it is in the appropriate range for Unicode code points with
char::from_u32()
. Again, a panic in the unwrap()
means that the
passed number is out of range; equivalent to an assertion.
Converting to an immutable IPuzCharset
pub struct Charset {
/// Histogram of characters and their counts plus derived values.
histogram: HashMap<char, CharsetEntry>,
/// All the characters in the histogram, but in order.
ordered: String,
/// Sum of all the counts of all the characters.
sum_of_counts: usize,
}
/// Data about a character in a `Charset`. The "value" in a key/value pair where the "key" is a character.
#[derive(PartialEq)]
struct CharsetEntry {
/// Index of the character within the `Charset`'s ordered version.
index: u32,
/// How many of this character in the histogram.
count: u32,
}
impl CharsetBuilder {
fn build(self) -> Charset {
// omitted for brevity; consumes `self` and produces a `Charset` by adding
// the counts for the `sum_of_counts` field, and figuring out the sort
// order into the `ordered` field.
}
}
Now, on the C side, IPuzCharset
is meant to also be immutable and
reference-counted. We'll use Arc<T>
for such structures. One
cannot return an Arc<T>
to C code; it must first be converted to a
pointer with Arc::into_raw()
:
// IPuzCharset *ipuz_charset_builder_build (IPuzCharsetBuilder *builder);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_build(
builder: *mut CharsetBuilder,
) -> *const Charset {
let builder = Box::from_raw(builder); // get back the Box from a pointer
let charset = builder.build(); // consume the builder and free it
Arc::into_raw(Arc::new(charset)) // Wrap the charset in Arc and get a pointer
}
Then, implement ref()
and unref()
:
// IPuzCharset *ipuz_charset_ref (IPuzCharset *charet);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_ref(charset: *const Charset) -> *const Charset {
Arc::increment_strong_count(charset);
charset
}
// void ipuz_charset_unref (IPuzCharset *charset);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_unref(charset: *const Charset) {
Arc::decrement_strong_count(charset);
}
The query functions need to take a pointer to what really is the
Arc<Charset>
on the Rust side. They reconstruct the Arc
with
Arc::from_raw()
and wrap it in ManuallyDrop
so that the Arc
doesn't lose a reference count when the function exits:
// gsize ipuz_charset_get_n_chars (const IPuzCharset *charset);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_get_n_chars(charset: *const Charset) -> usize {
let charset = ManuallyDrop::new(Arc::from_raw(charset));
charset.get_n_chars()
}
Tests
The C tests remain intact; these let us test all the #[no_mangle]
wrappers.
The Rust tests can just be for the internals, simliar to this:
#[test]
fn supports_histogram() {
let mut builder = CharsetBuilder::default();
let the_string = "ABBCCCDDDDEEEEEFFFFFFGGGGGGG";
builder.add_text(the_string);
let charset = builder.build();
assert_eq!(charset.get_size(), the_string.len());
assert_eq!(charset.get_char_count('A').unwrap(), 1);
assert_eq!(charset.get_char_count('B').unwrap(), 2);
assert_eq!(charset.get_char_count('C').unwrap(), 3);
assert_eq!(charset.get_char_count('D').unwrap(), 4);
assert_eq!(charset.get_char_count('E').unwrap(), 5);
assert_eq!(charset.get_char_count('F').unwrap(), 6);
assert_eq!(charset.get_char_count('G').unwrap(), 7);
assert!(charset.get_char_count('H').is_none());
}
Integration with the build system
Libipuz uses meson, which is not particularly fond of
cargo
. Still, cargo
can be used from meson with a wrapper script
and a bit of easy hacks. See the merge request for details.
Further work
I've left the original C header file ipuz-charset.h
intact, but
ideally I'd like to automatically generate the headers from Rust with
cbindgen
. Doing it that way lets me check that my
assumptions of the extern "C"
ABI are correct ("does foo: &mut Foo
appear as Foo *foo
on the C side?"), and it's one fewer C-ism to
write by hand. I need to see what to do about inline documentation;
gi-docgen
can consume C header files just fine, but I'm
not yet sure about how to make it work with generated headers from
cbindgen
.
I still need to modify the CI's code coverage scripts to work with the
mixed C/Rust codebase. Fortunately I can copy those incantations from
librsvg.
Is it faster?
Maybe! I haven't benchmarked the acrostics generator yet. Stay tuned!