December 04, 2020

Advent of Rust 4: It’s Hard to Return an Iterator

Welcome again to this stream-of-consciousness log about learning the Rust programming language by doing the programming puzzles in Advent of Code 2020, or as I like to call it, On the Code by Jack Kerouac.1 Let’s get on with it!

Day 4, Part 1

I start with cargo new puzzle4 and copying over the code from yesterday, but this time I’d like to refactor it a bit. I would like to write a function is_part2() that tells you whether the solution for Part 2 of the puzzle was requested or not, and I would like to change read_lines() so that it already does the expect() calls that I am copy-pasting every day, rather than returning a Result of an iterator of Results of strings.

Changing read_lines() really does not work for me. I can’t figure out how to return an iterator from the function! Well, clearly the function is already returning an iterator, but I can’t figure out how to express the iterator that I want to return, as a return type.

What I want is this:

fn read_lines<P>(filename: P) -> ???
where
    P: AsRef<path::Path>,
{
    let file = fs::File::open(filename).expect("Bad file");
    io::BufReader::new(file)
        .lines()
        .map(|s| s.expect("Bad line in file"))
}

where ??? ought to be something like Iterator<String>. But I cannot figure out how to write an iterator type. The iterator types that are returned from standard library functions all seem to be type aliases of some sort. For example, io::Lines is the iterator that lines() returns, and I know it is an iterator of Results of strings, because the documentation says so, but I don’t know how to build my own iterator type.

I google “rust return iterator” and the first result is encouragingly titled “Returning Rust Iterators”. This article suggests asking the compiler by returning the empty type () and letting it suggest what to return instead, so I do that.

Unfortunately, I get a type that I don’t think I can put into my program! The compiler says “expected unit type (), found struct Map<std::io::Lines<BufReader<File>>, [closure@src/main.rs:31:14: 31:46]>” I am guessing that referring to a type by the file in which it’s found and the lines and columns that it spans, is not legal Rust syntax!

So I try poking around with no success. By trying things and following the compiler’s suggestions, I end up with the awkward return type of iter::Map<io::Lines<io::BufReader<fs::File>>, dyn FnMut(io::Result<&String>) -> &String>, but this is still producing errors about lifetimes that I don’t understand. So I give up on this idea.

However, maybe I can write read_lines() so that it at least calls expect() on the io::Lines iterator that it returns:

fn read_lines<P>(filename: P) -> io::Lines<io::BufReader<fs::File>>
where
    P: AsRef<path::Path>,
{
    let file = fs::File::open(filename).expect("Bad file");
    io::BufReader::new(file).lines()
}

To be clear, I don’t really understand the P: AsRef<path::Path> part either. I guess I will try to refactor this again in a few days when I have learned a bit more.

Writing is_part2(), on the other hand, is quite straightforward:

fn is_part2() -> bool {
    let args: Vec<String> = env::args().collect();
    let default = String::from("1");
    let arg = args.get(1).unwrap_or(&default);
    arg == "2"
}

This works, but I actually think that I might be able to make this even nicer by using a match statement. I haven’t encountered the match statement directly yet, but I’ve seen it in several google results over the past few days. I google “rust match syntax”, land here, and come up with this:

fn is_part2() -> bool {
    match env::args().nth(1) {
        Some("2") => true,
        _ => false,
    }
}

This almost works, but I have to replace "2" with String::from("2"), and then the compiler tells me that I have to take it out of the match statement, because function calls are not allowed in pattern matching. So in the end it looks like this:

fn is_part2() -> bool {
    let string2 = String::from("2");
    match env::args().nth(1) {
        Some(string2) => true,
        _ => false,
    }
}

The funny thing is, though, that the compiler warns that string2 is an unused variable in both of the places that it is used! It seems like I haven’t understood this well enough. I google “rust match string” and land on a Stack Overflow question titled “How to match a String against string literals in Rust?”, but that is actually only good for when you know that you already have a String! But I have an Option<String>, so I google “rust match option string”, and find another Stack Overflow post with the topic “How can I pattern match against an Option<String>?”. The helpful answer says this is a known limitation of pattern-matching, but suggests two things to do instead. The second solution looks good to me, so I implement it:

fn is_part2() -> bool {
    match env::args().nth(1) {
        Some(s) if s == "2" => true,
        _ => false,
    }
}

This works, and looks like what I had in mind. Time to start on the puzzle!

Today’s puzzle is processing records that represent “passports”. These records consist of whitespace-separated key:value pairs, and records are separated by blank lines. There are eight possible keys, each consisting of three letters. The cid key is optional, and all the others are required. A record is valid if it contains all the required keys. The answer to the puzzle is the number of valid passports in the input file.

I start out by downloading the input file and put it in the project directory, as usual.

So first I think a bit about how I will process the data into records. My usual approach so far has been to split the file into lines and process each line using a chain of iterators, but maybe it’s better to let go of that approach for today. I could read the whole file into a string and split it on the blank lines in order to get an array where each element is a record, and then process each record. Or I could still process the file line-by-line, and use a for loop while keeping the notion of a “current” record, which I push into a vector when it is completed. I think I like that idea best so far.

The records can be hash sets, where I store the keys. That’s a bit wasteful since really all I need is one bit for each of the 8 possible keys, to tell whether the record has it or not! (I can ignore the values for now, since I only need to look at the keys.) But I decide to be wasteful nonetheless, for two reasons: first, I suspect Part 2 of the puzzle will require me to do something with the values, but if Rust is like other languages, a hash set will be easy enough to refactor into a hash map. Second, stuffing the string keys into a hash set or hash map is simple, and I won’t have to write code that translates key strings into record fields.

To make a first attempt, I google “rust hash set” and read the Rust by Example page. It’s interesting that you don’t have to specify the type of a HashSet, the compiler can figure it out by what you insert into it!

I also have to look in the API documentation for how to split a string on whitespace, but it seems there is conveniently a split_whitespace() method.

Here’s my attempt:

use std::collections::HashSet;
use std::env;
use std::fs;
use std::io::{self, BufRead};
use std::path;

fn main() {
    let mut passports = vec![];
    let mut current = HashSet::new();
    for line in read_lines("input").map(|s| s.expect("Bad line in file")) {
        if line == "" {
            passports.push(current);
            current = HashSet::new();
        }
        current = current.union(get_keys_from_line(&line));
    }

    if is_part2() {
        println!("part 2");
    } else {
        println!("the first passport is {:?}", passports[0]);
    }
}

fn get_keys_from_line(line: &str) -> HashSet<String> {
    let mut new_keys = HashSet::new();
    for pair in line.split_whitespace() {
        new_keys.insert(String::from(&pair[..3]));
    }
    new_keys
}

It looks like I got several & operators right this time, even though I still got one wrong:

error[E0308]: mismatched types
  --> src/main.rs:15:33
   |
15 |         current = current.union(get_keys_from_line(&line));
   |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
   |                                 |
   |                                 expected `&HashSet<_>`, found struct `HashSet`
   |                                 help: consider borrowing here: `&get_keys_from_line(&line)`
   |
   = note: expected reference `&HashSet<_>`
                 found struct `HashSet<String>`

error[E0308]: mismatched types
  --> src/main.rs:15:19
   |
15 |         current = current.union(get_keys_from_line(&line));
   |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `HashSet`, found struct `std::collections::hash_set::Union`
   |
   = note: expected struct `HashSet<_>`
              found struct `std::collections::hash_set::Union<'_, _, RandomState>`

I need a & on the return value of get_keys_from_line(), and it appears I didn’t read the documentation of union() well enough because it returns an iterator, not another hash set, so I need to add collect() after it. I’m momentarily confused since I thought collect() created a vector from an iterator, not a hash set! But I see that the examples in the documentation for union() are doing it like that, so I assume it must be OK. I’m also wondering if there isn’t a more idiomatic way to do what I’m trying to do (I’m thinking of Python’s dict.update()) but I decide to leave it for now.

Now I get this error:

error[E0277]: a value of type `HashSet<String>` cannot be built from an iterator over elements of type `&String`
  --> src/main.rs:15:61
   |
15 |         current = current.union(&get_keys_from_line(&line)).collect();
   |                                                             ^^^^^^^ value of type `HashSet<String>` cannot be built from `std::iter::Iterator<Item=&String>`
   |
   = help: the trait `FromIterator<&String>` is not implemented for `HashSet<String>`

At least I think I know enough to understand this error message now; the union() iterator is giving me the &String values that are owned by the original two hash sets (current and the temporary object returned from get_keys_from_line()). I don’t own them, so I can’t insert them into a hash set of values that I’m supposed to own.

Hmm, this is not what I wanted at all. What I actually wanted was an update() method that would move the elements from the second set into the first set. I start to think this might have been easier after all if I’d used 8 bits to store whether the keys were present… ðŸ˜�

I google “rust hashset update” and land on this encouragingly titled Stack Overflow post: “How can I insert all values of one HashSet into another HashSet?” It comes as a surprise to me that there is an extend() method for this! I guess as the commenter says under that post,

Plus it teaches me that I should not only look at the ‘methods’ section of the doc to find methods, but also into the traits a struct implements.

I guess I’ve learned that too now! Wow, that really is confusing, that a note down at the bottom of the API documentation page saying only impl<T, S> Extend<T> for HashSet<T, S> is actually telling you that there is an extend() method.

Regardless, I change the line to read current.extend(&get_keys_from_line(&line)); and I don’t quite get what I want either:

error[E0716]: temporary value dropped while borrowed
  --> src/main.rs:15:25
   |
12 |             passports.push(current);
   |                            ------- borrow later used here
...
15 |         current.extend(&get_keys_from_line(&line));
   |                         ^^^^^^^^^^^^^^^^^^^^^^^^^ - temporary value is freed at the end of this statement
   |                         |
   |                         creates a temporary which is freed while still in use
   |
   = note: consider using a `let` binding to create a longer lived value

This is another head-scratcher. I don’t see why it should matter that the temporary value is freed, when I thought I was moving all the elements out of it and into current! From reading about ownership yesterday, I thought I understood that ownership of values is always moved into function calls, unless they are borrowed with the & operator.

But while I was browsing the HashSet documentation I do remember coming across the drain() method and wondering if that would be useful. Maybe the problem is that the ownership of the values isn’t getting transferred, and I need to drain() them out of the temporary object so that current owns them. I change the line to current.extend(get_keys_from_line(&line).drain()); and it works!

So, I file away the knowledge that the operation I think of as a.update(b) is written as a.extend(b.drain()) in Rust. I wonder if that is actually the idiomatic way to do it?

Now that I have the data in the form that I wanted, I can write the code to get the answer:

let count = passports.iter().filter(passport_is_valid).count();
println!("{}", count);

// [...]

fn passport_is_valid(passport: &HashSet<String>) -> bool {
    let n_keys = passport.len();
    n_keys == 8 || (n_keys == 7 && !passport.contains("cid"))
}

But this doesn’t work:

error[E0631]: type mismatch in function arguments
  --> src/main.rs:21:45
   |
21 |         let count = passports.iter().filter(passport_is_valid).count();
   |                                             ^^^^^^^^^^^^^^^^^ expected signature of `for<'r> fn(&'r &HashSet<String>) -> _`
...
26 | fn passport_is_valid(passport: &HashSet<String>) -> bool {
   | -------------------------------------------------------- found signature of `for<'r> fn(&'r HashSet<String>) -> _`

error[E0599]: no method named `count` found for struct `Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>` in the current scope
    --> src/main.rs:21:64
     |
21   |           let count = passports.iter().filter(passport_is_valid).count();
     |                                                                  ^^^^^ method not found in `Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>`
     |
     = note: the method `count` exists but the following trait bounds were not satisfied:
             `<for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid} as FnOnce<(&&HashSet<String>,)>>::Output = bool`
             which is required by `Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>: Iterator`
             `for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}: FnMut<(&&HashSet<String>,)>`
             which is required by `Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>: Iterator`
             `Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>: Iterator`
             which is required by `&mut Filter<std::slice::Iter<'_, HashSet<String>>, for<'r> fn(&'r HashSet<String>) -> bool {passport_is_valid}>: Iterator`

Well, I saw an error just like this yesterday, and it was a missing & operator on the type of the function parameter. But I already have a & there! I try putting a second one (since the error message also has two of them) and sure enough, it works. I get an answer, but it’s the wrong answer.

According to the Advent of Code website, my answer is too low. I look over my code again and I see that I forgot to add the last passport to my vector of passports! Maybe that’s the problem? I add passports.push(current); below the loop, run the program again, and yes, I get an answer that’s one more than the previous answer. This time, it’s correct according to the website.

Here’s the full code (I’ll start leaving out the definitions of is_part2() and read_lines() each time unless they change, though):

use std::collections::HashSet;

fn main() {
    let mut passports = vec![];
    let mut current = HashSet::new();
    for line in read_lines("input").map(|s| s.expect("Bad line in file")) {
        if line == "" {
            passports.push(current);
            current = HashSet::new();
        }
        current.extend(get_keys_from_line(&line).drain());
    }
    passports.push(current);

    if is_part2() {
        println!("part 2");
    } else {
        let count = passports.iter().filter(passport_is_valid).count();
        println!("{}", count);
    }
}

fn passport_is_valid(passport: &&HashSet<String>) -> bool {
    let n_keys = passport.len();
    n_keys == 8 || (n_keys == 7 && !passport.contains("cid"))
}

fn get_keys_from_line(line: &str) -> HashSet<String> {
    let mut new_keys = HashSet::new();
    for pair in line.split_whitespace() {
        new_keys.insert(String::from(&pair[..3]));
    }
    new_keys
}

Day 4, Part 2

Well, as I suspected, the second part of the puzzle does require looking at the values. So I will start by refactoring the hash set into a hash map, that saves the values as well as the keys.

This refactor is pretty straightforward, I look up the documentation for HashMap to check if the methods are different (contains() has to change to contains_key()) but other than that, I just have to change HashSet to HashMap throughout. (I also originally overlooked that the correct type is HashMap<String, String>, not HashMap<String>, but the compiler helpfully reminded me.)

use std::collections::HashMap;

fn main() {
    let mut passports = vec![];
    let mut current = HashMap::new();
    for line in read_lines("input").map(|s| s.expect("Bad line in file")) {
        if line == "" {
            passports.push(current);
            current = HashMap::new();
        }
        current.extend(get_pairs_from_line(&line).drain());
    }
    passports.push(current);

    if is_part2() {
        println!("part 2");
    } else {
        let count = passports.iter().filter(passport_is_valid).count();
        println!("{}", count);
    }
}

fn passport_is_valid(passport: &&HashMap<String, String>) -> bool {
    let n_keys = passport.len();
    n_keys == 8 || (n_keys == 7 && !passport.contains_key("cid"))
}

fn get_pairs_from_line(line: &str) -> HashMap<String, String> {
    let mut new_pairs = HashMap::new();
    for pair in line.split_whitespace() {
        new_pairs.insert(String::from(&pair[..3]), String::from(&pair[4..]));
    }
    new_pairs
}

Now I can start changing passport_is_valid() to implement the rules for each field. I’ll make it look something like this:

    (n_keys == 8 || (n_keys == 7 && !passport.contains_key("cid")))
        && valid_birth_year(&passport["byr"])
        && valid_issue_year(&passport["iyr"])
        && valid_expiry_year(&passport["eyr"])
        && valid_height(&passport["hgt"])
        && valid_hair_color(&passport["hcl"])
        && valid_eye_color(&passport["ecl"])
        && valid_passport_id(&passport["pid"])

Then I can write a function for each validation criterion. To start with I write all these functions but make them only contain true, so I can write them one by one but still run the program. (This went pretty smoothly although the compiler did have to remind me to add the & operators before passing each passport data field into each function.)

I wonder for a bit whether I should use scan_fmt! like I did on Day 2, or regular expressions, or parse(). I decide that maybe it’s time for me to use some regular expressions in Rust. It’s new for me, would be useful to learn, and these regular expressions will be simple. But for the fields where I need to check that the numbers are between certain ranges, I’ll also need to use parse().

Let’s start at the top. Birth year must be a 4-digit number between 1920 and 2002 inclusive. (What happens to the 17-year-old or 101-year-old travelers?) I google “rust regex” and find out that you need to install a package for regular expressions, so I add regex = "1" to Cargo.toml and use regex::Regex; to my program. I start reading the documentation for the regex package.

I already know regular expression syntax, which will be very helpful for completing this task. The one thing that I should take note of is that patterns match anywhere in the string, so I need to use ^ and $ to anchor them if I only want them to match the whole string. (That is, matching is like re.search() and not re.match() in Python.)

I come up with this as a first attempt:

fn valid_birth_year(byr: &str) -> bool {
    let re = Regex::new(r"^[12]\d{3}$").unwrap();
    if !re.is_match(byr) {
        return false;
    }
    let year = byr.parse::<u16>();
    year >= 1920 && year <= 2002
}

It seems the only thing I’ve forgotten is to unwrap() the result of parse(), which I had learned on Day 1 but forgot about. So I add that. Then, since issue year and expiry year are validated in a very similar way, just with different maximum and minimum years, I extract that into a valid_year() function:

fn valid_birth_year(byr: &str) -> bool {
    valid_year(byr, 1920, 2002)
}

fn valid_issue_year(iyr: &str) -> bool {
    valid_year(iyr, 2010, 2020)
}

fn valid_expiry_year(eyr: &str) -> bool {
    valid_year(eyr, 2020, 2030)
}

fn valid_year(y: &str, min: u16, max: u16) -> bool {
    let re = Regex::new(r"^[12]\d{3}$").unwrap();
    if !re.is_match(y) {
        return false;
    }
    let year = y.parse::<u16>().unwrap();
    year >= min && year <= max
}

(I initially forgot the -> bool on the new function. I’m so used to either not having a return type on functions (Python, JavaScript) or having the return type come before the name (C, C++) that I forget this a lot!)

Next I write a first attempt at valid_height(), using the match statement that I learned earlier today, and browsing the regex documentation to find that I have to use captures() to get the values of the regular expression’s capture groups:

fn valid_height(hgt: &str) -> bool {
    let re = Regex::new(r"^(\d{2,3})(cm|in)$").unwrap();
    let groups = re.captures(hgt);
    let height = groups[1].parse::<u8>().unwrap();
    match &groups[2] {
        "cm" => height >= 150 && height <= 193,
        "in" => height >= 59 && height <= 76,
    }
}

The compiler tells me that I forgot to unwrap() the result from captures(), and after I do that, it tells me that my match pattern is “non-exhaustive” — I guess this means that I don’t provide any instructions for what to do if the unit is neither cm nor in. I remember from earlier that a default case is written as _ =>, so I add _ => false to the match statement.

Now it runs! But it panics at unwrapping the return value of captures(). I look at the captures() documentation again, and it says that it will return None if the regex doesn’t match. So I decide to do some quick’n’dirty println!("{}", hgt) debugging to see what the string is that doesn’t match. It’s 97, without a unit after it, so indeed it doesn’t match. A string that doesn’t match should make the function return false, not panic.

I’m sure there’s a nice idiom for “return if None, unwrap otherwise” that I’m not yet aware of. I google “rust option return if none” and I learn a bit more about the ? operator, but since I’m not returning a Result from this function that doesn’t help me. However, another thing I learn is that the body of a match statement doesn’t actually have to be the same type as the expression! You can put return false in there and it will return from the function. How nice! So, the captures() call becomes:

let groups = match re.captures(hgt) {
    Some(groups) => groups,
    None => return false,
};

Next comes hair color. This is very similar to the year validation that I’ve done already, only without having to parse the string:

fn valid_hair_color(hcl: &str) -> bool {
    let re = Regex::new(r"^#[0-9a-f]{6}$").unwrap();
    re.is_match(hcl)
}

Eye color doesn’t actually need to use a regular expression:

fn valid_eye_color(ecl: &str) -> bool {
    ecl == "amb"
        || ecl == "blu"
        || ecl == "brn"
        || ecl == "gry"
        || ecl == "grn"
        || ecl == "hzl"
        || ecl == "oth"
}

This works, but I wonder if I could take this opportunity to apply something that I think I remember, from reading about match earlier:

fn valid_eye_color(ecl: &str) -> bool {
    match ecl {
        "amb" | "blu" | "brn" | "gry" | "grn" | "hzl" | "oth" => true,
        _ => false,
    }
}

This also works, and looks much nicer!

Finally, there is the passport ID, which is a nine-digit number:

fn valid_passport_id(pid: &str) -> bool {
    let re = Regex::new(r"^\d{9}$").unwrap();
    re.is_match(pid)
}

I should now be able to get my answer! Before I check whether it’s correct on the Advent of Code website, I would like to make one improvement that was mentioned in the regex documentation. Running the program takes an almost noticeably long time, and it’s probably because I’m rebuilding the same Regex objects once for each passport in the file, several hundred times. The documentation mentions you can use the lazy_static package to build them only once, and so I add that to Cargo.toml and follow the example in the regex documentation.

According to the example, I need to add this to the top of the file:

#[macro_use]
extern crate lazy_static;

And change this:

let re = Regex::new(r"^[12]\d{3}$").unwrap();

to this:

lazy_static! {
    static ref YEAR_REGEX: Regex = Regex::new(r"^[12]\d{3}$").unwrap();
}

I initially forget to add the type : Regex and the compiler’s error message isn’t so helpful:

error: no rules expected the token `=`
  --> src/main.rs:57:23
   |
57 |         static ref re = Regex::new(r"^[12]\d{3}$").unwrap();
   |                       ^ no rules expected this token in macro call

I still don’t quite know how macros work in Rust, or why lazy_static! is a macro and not a function, but my guess is that it’s harder to generate good error messages for macros.

I also try to keep the name re for the regular expressions, but the compiler helpfully tells me that it’s bad style for static variables to have lower case names! So I rename them.

My program seems to run faster now, and still gives me the same answer. So I put that answer into the Advent of Code website, and it’s correct! Hooray!

Here’s the program, quite long this time because of all the validation rules:

use regex::Regex;
use std::collections::HashMap;

#[macro_use]
extern crate lazy_static;

fn main() {
    let mut passports = vec![];
    let mut current = HashMap::new();
    for line in read_lines("input").map(|s| s.expect("Bad line in file")) {
        if line == "" {
            passports.push(current);
            current = HashMap::new();
        }
        current.extend(get_pairs_from_line(&line).drain());
    }
    passports.push(current);

    let count = passports.iter().filter(passport_is_valid).count();
    println!("{}", count);
}

fn passport_is_valid(passport: &&HashMap<String, String>) -> bool {
    let n_keys = passport.len();
    (n_keys == 8 || (n_keys == 7 && !passport.contains_key("cid")))
        && (!is_part2()
            || valid_birth_year(&passport["byr"])
                && valid_issue_year(&passport["iyr"])
                && valid_expiry_year(&passport["eyr"])
                && valid_height(&passport["hgt"])
                && valid_hair_color(&passport["hcl"])
                && valid_eye_color(&passport["ecl"])
                && valid_passport_id(&passport["pid"]))
}

fn valid_birth_year(byr: &str) -> bool {
    valid_year(byr, 1920, 2002)
}

fn valid_issue_year(iyr: &str) -> bool {
    valid_year(iyr, 2010, 2020)
}

fn valid_expiry_year(eyr: &str) -> bool {
    valid_year(eyr, 2020, 2030)
}

fn valid_year(y: &str, min: u16, max: u16) -> bool {
    lazy_static! {
        static ref YEAR_REGEX: Regex = Regex::new(r"^[12]\d{3}$").unwrap();
    }
    if !YEAR_REGEX.is_match(y) {
        return false;
    }
    let year = y.parse::<u16>().unwrap();
    year >= min && year <= max
}

fn valid_height(hgt: &str) -> bool {
    lazy_static! {
        static ref HEIGHT_REGEX: Regex = Regex::new(r"^(\d{2,3})(cm|in)$").unwrap();
    }
    let groups = match HEIGHT_REGEX.captures(hgt) {
        Some(groups) => groups,
        None => return false,
    };
    let height = groups[1].parse::<u8>().unwrap();
    match &groups[2] {
        "cm" => height >= 150 && height <= 193,
        "in" => height >= 59 && height <= 76,
        _ => false,
    }
}

fn valid_hair_color(hcl: &str) -> bool {
    lazy_static! {
        static ref HAIR_REGEX: Regex = Regex::new(r"^#[0-9a-f]{6}$").unwrap();
    }
    HAIR_REGEX.is_match(hcl)
}

fn valid_eye_color(ecl: &str) -> bool {
    match ecl {
        "amb" | "blu" | "brn" | "gry" | "grn" | "hzl" | "oth" => true,
        _ => false,
    }
}

fn valid_passport_id(pid: &str) -> bool {
    lazy_static! {
        static ref ID_REGEX: Regex = Regex::new(r"^\d{9}$").unwrap();
    }
    ID_REGEX.is_match(pid)
}

fn get_pairs_from_line(line: &str) -> HashMap<String, String> {
    let mut new_pairs = HashMap::new();
    for pair in line.split_whitespace() {
        new_pairs.insert(String::from(&pair[..3]), String::from(&pair[4..]));
    }
    new_pairs
}

Afterword

I don’t have that much to reflect on, this time. I did still forget the & operators all over the place, but less often than I did on the previous days, and I had a better understanding of the errors when they occurred.

I’m still a bit mystified about why it’s so difficult to express the type of an iterator in Rust. It seems like it shouldn’t be that hard, so maybe I’m missing something or approaching it the wrong way?

Finally, it seems like I’m reading the API documentation more often, and googling less. I think this is a sign that I have a better idea of where to look and what I’m looking for in the API documentation, but googling is still useful when I need to figure out how to do something that’s not covered by the examples in the API documentation, like a.extend(b.drain()).


[1] I don’t, in fact, like to call it that ↩

Beginning Rust

I have the privilege of some free time this December and I unexpectedly was inspired to do the first few days of the Advent of Code challenge, by a number of inspiring people including Philip Chimento, Daniel Silverstone and Ed Cragg.

The challenge can be completed in any language, but it’s a great excuse to learn something new. I have read a lot about Rust and never used until a few days ago.

Most of my recent experience is with Python and C, and Rust feels like it has many of the best bits of both languages. I didn’t get on well with Haskell, but the things I liked about that language are also there in Rust. It’s done very well at taking the good parts of these languages and leaving out the bad parts. There’s no camelCaseBullshit, in particular.

As a C programmer, it’s a pleasure to see all of C’s invisible traps made explicit and visible in the code. Even integer overflow is a compile time error. As a Python programmer, I’m used to writing long chains of operations on iterables, and Rust allows me to do pretty much the same thing. Easy!

Rust does invent some new, unique bad parts. I wanted to use Ed’s cool Advent of Code helper crate, but somehow installing this tiny library using Cargo took up almost 900MB of disk space. This appears to be a known problem. It makes me sad that I can’t simply use Meson to build my code. I understand that Cargo’s design brings some cool features, but these are big tradeoffs to make. Still, for now I can simply avoid using 3rd party crates which is anyway a good motivation to learn to work with Rust’s standard library.

I also spend a lot of time figuring out compiler errors. Rust’s compiler errors are very good. If you compare them to C++ compiler errors, then there’s really no comparison at all. In fact, they’re so good that my expectations have increased, and paradoxically this makes me more critical! (Sometimes you have to measure success by how many complaints you get). When the compiler tells me ‘you forgot this semicolon’, part of me thinks “Well you know what I meant — you add the semicolon!”. And while some errors clearly tell you what to fix, others are still pretty cryptic. Here’s an example:

error[E0599]: no method named `product` found for struct `Vec<i64>` in the current scope
   --> day3.rs:82:34
    |
82  |       let count: i64 = tree_counts.product();
    |                                    ^^^^^^^ method not found in `Vec<i64>`
    |
    = note: the method `product` exists but the following trait bounds were not satisfied:
            `Vec<i64>: Iterator`
            which is required by `&mut Vec<i64>: Iterator`
            `[i64]: Iterator`
            which is required by `&mut [i64]: Iterator`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.

What’s the problem here? If you know Rust, maybe it’s obvious that my tree_counts variable is a Vec (list), and I need to call .iter() to produce an iterator. If you’re a beginner, this isn’t a huge help. You might be tempted to call rustc --explain E0599, which will tell you that you might, for example, need to implement the .chocolate() method on your Mouth struct. This doesn’t get you any closer to knowing why you can’t iterate across a list, which is something that you’d expect to be iterable.

Like I said, Rust is lightyears ahead of other compilers in terms of helpful error messages. However, if it’s a goal that “Rust is for students”, then there is still lots of work to do to improve them.

I know enough about software development to know that the existence of Rust is nothing short of a miracle. The Rust community are clearly amazing and deserve ongoing congratulations. I’m also impressed with Advent of Code. December is a busy time, which is why I’ve never got involved before, but if you are looking for something to do then I can recommend it!

You can see my Advent of Code repo here. It may, or may not proceed beyond day 4. It’s useful to check your completed code against some kind of ‘model’, and I’ve been using Daniel’s repo for that. Who else has some code to show?

It's templates all the way down - part 3

In Part 1 I've shown you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, I've shown you how to truly build nested images. In this part, I'll talk about the ci-fairy tool that is part of the same repository of ci-templates.

When you're building a CI pipeline, there are some tasks that most projects need in some way or another. The ci-fairy tool is a grab-bag of solutions for these. Some of those solutions are for a pipeline itself, others are for running locally. So let's go through the various commands available.

Using ci-fairy in a pipeline

It's as simple as including the template in your .gitlab-ci.yml file.


include:
- 'https://gitlab.freedesktop.org/freedesktop/ci-templates/-/raw/master/templates/ci-fairy.yml'
Of course, if you want to track a specific sha instead of following master, just sub that sha there. freedesktop.org projects can include ci-fairy like this:

include:
- project: 'freedesktop/ci-templates'
ref: master
file: '/templates/ci-fairy.yml'
Once that's done, you have access to a .fdo.ci-fairy job template that you can extends from. This will download an image from quay.io that is capable of git, python, bash and obviously ci-fairy. This image is a fixed one and referenced by a unique sha so even if where we keep working on ci-fairy upstream you should never see regression, updating requires you to explicitly update the sha of the included ci-fairy template. Obviously, if you're using master like above you'll always get the latest.

Due to how the ci-templates work, it's good to set the FDO_UPSTREAM_REPO variable with the upstream project name. This means ci-fairy will be able to find the equivalent origin/master branch, where that's not available in the merge request. Note, this is not your personal fork but the upstream one, e.g. "freedesktop/ci-templates" if you are working on the ci-templates itself.

Checking commit messages

ci-fairy has a command to check commits for a few basic expectations in commit messages. This currently includes things like enforcing a 80 char subject line length, that there is an empty line after the subject line, that no fixup or squash commits are in the history, etc. If you have complex requirements you need to write your own but for most projects this job ensures that there are no obvious errors in the git commit log:


check-commit:
extends:
- .fdo.ci-fairy
script:
- ci-fairy check-commits --signed-off-by
except:
- master@upstream/project
Since you don't ever want this to fail on an already merged commit, exclude this job the master branch of the upstream project - the MRs should've caught this already anyway.

Checking merge requests

To rebase a contributors merge request, the contributor must tick the checkbox to Allow commits from members who can merge to the target branch. The default value is off which is frustrating (gitlab is working on it though) and causes unnecessary delays in processing merge requests. ci-fairy has command to check for this value on an MR and fail - contributors ideally pay attention to the pipeline and fix this accordingly.


check-merge-request:
extends:
- .fdo.ci-fairy
script:
- ci-fairy check-merge-request --require-allow-collaboration
allow_failure: true
As a tip: run this job towards the end of the pipeline to give collaborators a chance to file an MR before this job fails.

Using ci-fairy locally

The two examples above are the most useful ones for CI pipelines, but ci-fairy also has some useful local commands. For that you'll have to install it, but that's as simple as


$ pip3 install git+http://gitlab.freedesktop.org/freedesktop/ci-templates
A big focus on ci-fairy for local commands is that it should, usually, be able to work without any specific configuration if you run it in the repository itself.

Linting

Just hacked on the CI config?


$ ci-fairy lint
and done, you get the same error back that the online linter for your project would return.

Pipeline checks

Just pushed to the repo?


$ ci-fairy wait-for-pipeline
Pipeline https://gitlab.freedesktop.org/username/project/-/pipelines/238586
status: success | 7/7 | created: 0 | pending: 0 | running: 0 | failed: 0 | success: 7 ....
The command is self-explanatory, I think.

Summary

There are a few other parts to ci-fairy including templating and even minio handling. I recommend looking at e.g. the libinput CI pipeline which uses much of ci-fairy's functionality. And the online documentation for ci-fairy, who knows, there may be something useful in there for you.

The useful contribution of ci-fairy is primarily that it tries to detect the settings for each project automatically, regardless of whether it's run inside a MR pipeline or just as part of a normal pipeline. So the same commands will work without custom configuration on a per-project basis. And for many things it works without API tokens, so the setup costs are just the pip install.

If you have recurring jobs, let us know, we're always looking to add more useful functionality to this little tool.

December 03, 2020

Advent of Rust 3: Once in ‘a Lifetime

Well, this is another long stream-of-consciousness chronicle of my attempt to learn how to program in Rust by doing the daily programming puzzles on Advent of Code 2020. It’s long, but on the plus side, this is the first time ever that I’ve published two blog posts in two days, let alone three in three days. And you know what they say, if I’d had more time, I would’ve written you a shorter letter.

I thought a bit about why it should even be interesting or valuable to write about my mistakes and thought process. Or put more bluntly, isn’t this just a waste of time? Who is this useful for?

Well, for one thing, at my job I’ve been working on Temporal, a standards-track proposal to add a modern API for dates and times to the JavaScript language. Earlier this year I conducted a survey of people who had tried out the proposed Temporal API, and one of the purposes was to try to figure out what was easy to understand and what was hard, for someone coming to Temporal with no prior knowledge. Even though I had been in exactly that position myself only a few months before, I had become so accustomed to using Temporal that I could literally remember nothing of my own experience.

It’s sometimes called the curse of knowledge. I’m sure I will look back in a year, or two years, when I have written lots of Rust code, and not remember any of this either, and I’ll be glad that I wrote it down. But maybe it’ll be valuable in the meantime to someone else!

Day 3, Part 1

Today we get a roguelike puzzle! We are sliding on a toboggan down a horizontally repeating grid of open spaces (.) and trees (#) that is defined in our input file. We slide at a certain angle (3 cells across for every 1 cell down), and the answer to Part 1 is how many trees we crash into by the time we get to the bottom.

By this time I’m familiar with how to start — cargo new puzzle3, download the input file and put it in the project directory, and copy the relevant code from yesterday. This time I’m not only copying the read_lines() function, but also the code that determines from the command line argument whether I want to run the code for Part 1 or Part 2.

Here’s what I start out with:

use std::env;
use std::fs;
use std::io::{self, BufRead};
use std::path;

fn main() {
    let args: Vec<String> = env::args().collect();
    let mut part2 = false;
    let default = String::from("1");
    let arg = args.get(1).unwrap_or(&default);
    if arg == "2" {
        part2 = true;
    }
    let lines: Vec<String> = read_lines("input")
        .expect("Bad file")
        .map(|s| s.expect("Bad line in file"))
        .collect();
    println!("grid is {} × {}", lines[0].len(), lines.len());
}

fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<fs::File>>>
where
    P: AsRef<path::Path>,
{
    let file = fs::File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

I still made a few minor mistakes when adapting this code from yesterday’s code: I forgot to add the Vec<String> type to lines, I forgot to handle errors with .map(|s| s.expect("Bad line in file")), and I guessed wrong that the length of a vector was length() and not len(). Happily, I am getting accustomed enough to this, that I’m able to correct those mistakes fairly quickly.

I will once again use my approach from the last two days, where I first try to get the data into the right data structure. Now that I’m no longer struggling with knowing what data structures even exist in Rust, and how do I even write Rust code, I can afford to think about this before I start writing any code.

It seems like the appropriate data structure is a two-dimensional array of booleans – true if there is a tree in the cell, false if the cell is an open space. I don’t know if Rust has two-dimensional arrays or vectors, so I first start by googling “rust two dimensional array”. I find several results (1, 2, 3, and the array2d package) but none of them really stand out as saying “this is the way to do it.”

What I do understand so far is that I’d use an array if the size was known at compile time, and a Vec of Vecs if I wanted to have variable-length rows or columns. The array2d package from my search results does look interesting because it seems like it gives you a rectangular array, where the length of each row and column is the same for the whole array, but doesn’t necessarily have to be known at compile time. In theory I do know the width and length at compile time, because I can look in the file and count the number of lines and the length of each line! I have a feeling that that would make my code too tied-up with this particular data file, though, and might mean that I would have to refactor the whole thing for Part 2, so I’d prefer not to take that approach.

(I think briefly about using a one-dimensional array and checking through it with a stride of width + 3, but because the grid repeats in the horizontal direction, I think that wouldn’t work.)

Taking the two-dimensional array approach would mean that unlike yesterday, I would not use iterators very much. Hmm, I feel like I’ve just gotten comfortable with iterators in Rust, I wonder what a solution would look like if it were based on iterators?

Maybe that would actually be easier! Thinking about it some more, I don’t actually need to store the entire grid in a 2-D array. I just need to read in each row, figure out my horizontal position on that row, and store a boolean indicating whether I have hit a tree on that row. I believe it can be done just by operating on the string representing each row, with a map(), filter(), and count().

Let’s see if my guess about how to do it is correct. Here’s the first version that I come up with:

fn main() {
    // [...args...]
    let trees_hit = read_lines("input")
        .expect("Bad file")
        .enumerate()
        .filter(hits_tree)
        .count();
    println!("{}", trees_hit);
}

fn hits_tree(row_index: usize, row: String) -> bool {
    let col_index = row_index * 3 % row.len();
    row.as_bytes()[col_index] == '#'
}

Here, I’m assuming that it’s OK to index the string as bytes, because the only characters allowed are # and .. (I learned from yesterday’s exercise what the difference was between indexing a string by byte and iterating through it by character.) I’m taking a guess that '#' (with single quotes instead of double quotes) will give me a single byte rather than a string, like it does in C.

The compiler gives me quite a lot of errors, but I think I know what most of them mean:

error[E0593]: function is expected to take 1 argument, but it takes 2 arguments
  --> src/main.rs:17:17
   |
17 |         .filter(hits_tree)
   |                 ^^^^^^^^^ expected function that takes 1 argument
...
22 | fn hits_tree(row_index: usize, row: String) -> bool {
   | --------------------------------------------------- takes 2 arguments

error[E0599]: no method named `count` found for struct `Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>` in the current scope
    --> src/main.rs:18:10
     |
18   |           .count();
     |            ^^^^^ method not found in `Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>`
     |
     = note: the method `count` exists but the following trait bounds were not satisfied:
             `<fn(usize, String) -> bool {hits_tree} as FnOnce<(&(usize, std::result::Result<String, std::io::Error>),)>>::Output = bool`
             which is required by `Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>: Iterator`
             `fn(usize, String) -> bool {hits_tree}: FnMut<(&(usize, std::result::Result<String, std::io::Error>),)>`
             which is required by `Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>: Iterator`
             `Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>: Iterator`
             which is required by `&mut Filter<Enumerate<std::io::Lines<BufReader<File>>>, fn(usize, String) -> bool {hits_tree}>: Iterator`

error[E0308]: mismatched types
  --> src/main.rs:24:34
   |
24 |     row.as_bytes()[col_index] == '#'
   |                                  ^^^ expected `u8`, found `char`

The first error is probably because I need to make hits_tree() take a tuple of (row_index, row) as its one argument, instead of two arguments. The second error (reminiscent of those long template errors that you get from C++ compilers) I don’t understand, but it seems like it might be a consequence of the first error, and might go away if I solve that one? The third error shows that my guess that '#' is a byte, is wrong. I was close, it is a char, but a byte is type u8.

The third error seems like the easiest to solve, so first I google “rust byte literal” and land here, which makes me think that b'#' might be correct — and it is!

Next I need to make hits_tree() take a tuple as an argument. I wonder if I can destructure this tuple as I would be able to in JavaScript: function hits_tree([row_index, row]) { ... } I google “rust tuple function argument”, and I get this Stack Overflow answer which makes me think that I should be able to do (row_index, row): (usize, String).

But before I run that, I would like to try to get the & operator right this time. After yesterday, when every time I compiled my code, the compiler told me to add a &, I did spend some time reading the “Understanding Ownership” page. Now I feel like I should be better equipped to avoid these mistakes, or at least understand them when I do make them.

So for the first time in this series, I actually go to the API documentation for a function intentionally, instead of googling it and landing there! First I check enumerate() to check that usize is really the correct type for the index (it is,) then I check lines() to check that String is really the correct type for the row (it isn’t.) The type is actually io::Result<String> which reminds me that I need to expect() the value, so I add .map(|s| s.expect("Bad line in file")) before the call to enumerate().

Then there’s the question of whether the row parameter should be a reference or not. I remember reading this:

A more experienced Rustacean would write [s: &str] instead [of s: &String] because it allows us to use the same function on both &String values and &str values.

But on the other hand, I’m not sure that this parameter is actually a reference! It seems to me that since we get a io::Result<String> from read_lines(), the ownership of the string is passed from function to function in the iterator chain, so I would guess (but with only a little more than 50% confidence)1 that String is correct and &str is not. So I make the change of (row_index, row): (usize, String) and run it.

Unfortunately, sitting down and thinking about it is rewarded with an error that I really don’t understand:

error[E0631]: type mismatch in function arguments
  --> src/main.rs:18:17
   |
18 |         .filter(hits_tree)
   |                 ^^^^^^^^^ expected signature of `for<'r> fn(&'r (usize, String)) -> _`
...
23 | fn hits_tree((row_index, row): (usize, String)) -> bool {
   | ------------------------------------------------------- found signature of `fn((usize, String)) -> _`

I know from the reading I’ve done that 'r is a “lifetime”, but the page that I read told me that lifetimes were an advanced feature that would be explained later in the Rust book. Oops, I guess I should have read that chapter too!

I wonder if I can just fake it ’til I make it, by adding for<'r> and &'r to the function signature as the compiler is suggesting. I try that, but it’s a syntax error, “expected item, found keyword for“. I try removing the for<'r>, and I get another error about “unexpected lifetime” which makes me think that I should put the &'r on (usize, String) instead of (row_index, row). (That makes sense, somewhat, because it’s part of the type, not part of the destructured function arguments.) So I try that, and I get another error, but this one tells me exactly what to do:

error[E0261]: use of undeclared lifetime name `'r`
  --> src/main.rs:23:33
   |
23 | fn hits_tree((row_index, row): &'r (usize, String)) -> bool {
   |             -                   ^^ undeclared lifetime
   |             |
   |             help: consider introducing lifetime `'r` here: `<'r>`

Sure enough, when I add <'r>, it works (with some warnings about not using the part2 variable, which I’ll use soon enough) and I get a number out! I put the number into the Advent of Code website and it’s correct!

The lifetime thing is fairly unsatisfying: I have no idea what it does, why it’s necessary, or why it’s even called r!

I’ll try one more thing. Maybe I was wrong about the row needing to be a String. For example, I look back at yesterday’s code and see fn parse_line(line: &str), so maybe in today’s code I also need &str. So I try changing the function signature to fn hits_tree((row_index, row): (usize, &str)) -> bool, without any lifetimes. But I get the old errors back. I do notice one interesting thing in the error message this time:

error[E0631]: type mismatch in function arguments
  --> src/main.rs:18:17
   |
18 |         .filter(hits_tree)
   |                 ^^^^^^^^^ expected signature of `for<'r> fn(&'r (usize, String)) -> _`
...
23 | fn hits_tree((row_index, row): (usize, &str)) -> bool {
   | ----------------------------------------------------- found signature of `for<'r> fn((usize, &'r str)) -> _`

Now, for some reason, it thinks that my function signature already has a lifetime in it, even though I didn’t put one in! I wonder if the lifetime is implicitly when you include a reference. But it definitely wants a String and not a str. So I try changing the type of the argument to (usize, &String) without the lifetime. That doesn’t work either, but interestingly it does continue to think that the type of my function includes a lifetime. It’s telling me that the reference needs to be on the tuple, not on the string, so I try &(usize, String), once more without a lifetime, and that works.

So the message about having to add a lifetime was strange and confusing, but it turns out that I didn’t need to have it after all. Unfortunately I had to discover this by putting & operators in random places until my code compiled, which I had intended to stop doing today ☹

On the plus side, I got the correct answer, so that’s the end of Part 1! Here’s the full code:

use std::env;
use std::fs;
use std::io::{self, BufRead};
use std::path;

fn main() {
    let args: Vec<String> = env::args().collect();
    let mut part2 = false;
    let default = String::from("1");
    let arg = args.get(1).unwrap_or(&default);
    if arg == "2" {
        part2 = true;
    }
    let trees_hit = read_lines("input")
        .expect("Bad file")
        .map(|s| s.expect("Bad line in file"))
        .enumerate()
        .filter(hits_tree)
        .count();
    println!("{}", trees_hit);
}

fn hits_tree((row_index, row): &(usize, String)) -> bool {
    let col_index = row_index * 3 % row.len();
    row.as_bytes()[col_index] == b'#'
}

fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<fs::File>>>
where
    P: AsRef<path::Path>,
{
    let file = fs::File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

Day 3, Part 2

My guess was that Part 2 of the puzzle might be to do the same thing but with the toboggan at a sharper angle such that you would have to skip rows entirely. (For example, 1 right for every 3 down.) I was both right and wrong. It’s actually that you have to check a list of five angles including a sharper one that makes you skip rows, and the answer to the puzzle is the results from those angles all multiplied together. So, my approach of using iterators for the whole thing should actually be refactored, because I don’t want to read the file five times!

I briefly think about using array2d to store the map of the trees after all, but I decide that Vec<String> would be more convenient since I can get it from my existing iterator pipeline by capping it after the expect() call with collect(), and I expect my existing code will continue to work on that with few modifications.

So before I start on Part 2 I decide to refactor it this way:

let landscape: Vec<String> = read_lines("input")
    .expect("Bad file")
    .map(|s| s.expect("Bad line in file"))
    .collect();

let trees_hit = landscape.iter().enumerate().filter(hits_tree).count();

But oh no…

error[E0631]: type mismatch in function arguments
  --> src/main.rs:19:57
   |
19 |     let trees_hit = landscape.iter().enumerate().filter(hits_tree).count();
   |                                                         ^^^^^^^^^ expected signature of `for<'r> fn(&'r (usize, &String)) -> _`
...
23 | fn hits_tree((row_index, row): &(usize, String)) -> bool {
   | -------------------------------------------------------- found signature of `for<'r> fn(&'r (usize, String)) -> _`

OK, I’m back to adding & operators in random places! At least as far as I can tell from the error message, I will need to keep the one on the tuple, and add a second one to &String. And indeed that works.

Next I want to refactor the second part of that into a function count_trees_hit(landscape, right, down) so I can call it several times on the list of angles expressed as (right, down) pairs. I’ll start by omitting the down parameter and just making sure that the existing code works:

fn main() {
    // ...
    let trees_hit = count_trees_hit(&landscape, 3);
    println!("{}", trees_hit);
}

fn count_trees_hit(landscape: &Vec<String>, right: usize) -> usize {
    landscape
        .iter()
        .enumerate()
        .filter(|(row_index, row): &(usize, &String)| hits_tree(row_index, row, right))
        .count()
}

fn hits_tree(row_index: usize, row: &String, right: usize) -> bool {
    let col_index = row_index * right % row.len();
    row.as_bytes()[col_index] == b'#'
}

It looks like this almost works, but I get one error:

error[E0308]: mismatched types
  --> src/main.rs:27:65
   |
27 |         .filter(|(row_index, row): &(usize, &String)| hits_tree(row_index, row, right))
   |                                                                 ^^^^^^^^^
   |                                                                 |
   |                                                                 expected `usize`, found `&usize`
   |                                                                 help: consider dereferencing the borrow: `*row_index`

Since usize is an integer type I’m not sure why it has a reference, but I try changing the type of the row_index parameter to &usize and it works. It also works if I do what the compiler says and add *, but that’s an operator that I haven’t learned about (the page I read mentioned that it existed, but I don’t have a good idea of how it works.)

Next I want to add a down parameter. I think if we move multiple rows downwards for every column that we move to the right, then we can simply skip those rows. So it seems to me that I should be able to add another filter() step that skips the row if its index modulo down is not zero.

This new step doesn’t even need access to the row data so I wonder if I can simplify my code by ignoring the string parameter. I google “rust ignore parameter” and land on “Pattern Syntax” which I skim, and it looks like I should be able to use either .. or _ to ignore that parameter. I’ll try both and see what the compiler says. After a couple of tries, what works is: .filter(|(row_index, ..): &(usize, _)| *row_index % down == 0).

Now I have this code:

    let trees_hit = count_trees_hit(&landscape, 1, 2);
    println!("{}", trees_hit);
}

fn count_trees_hit(landscape: &Vec<String>, right: usize, down: usize) -> usize {
    landscape
        .iter()
        .enumerate()
        .filter(|(row_index, ..): &(usize, _)| *row_index % down == 0)
        .filter(|(row_index, row): &(usize, &String)| hits_tree(*row_index, row, right))
        .count()
}

fn hits_tree(row_index: usize, row: &String, right: usize) -> bool {
    let col_index = row_index * right % row.len();
    println!("{}, {}", col_index, row_index);
    row.as_bytes()[col_index] == b'#'
}

I am a bit suspicious that I might be getting the wrong row_index in the second filter() step, because I am filtering out the already-enumerated rows. So for testing, I try it with right 1 and down 2, and add the println!() statement in hits_tree() to print out the row index and column index.

It turns out my suspicion was correct. The indices that are printed out, are 0, 0, 2, 2, 4, 4, 6, 6 etc., while they should be 0, 0, 1, 2, 2, 4, 3, 6. I think what I need to do to fix this, is un-enumerate the items after the first filter() step and then re-enumerate them. I could write the un-enumerate step with map, but I have a feeling that there might be a built-in iterator method to take only the second element of a tuple, so I look at the API documentation for iterators again. I don’t see anything likely there. I look at the API documentation for Itertools as well, but don’t see anything there either. I try googling a few things (such as “rust iterator drop tuple element”) but that doesn’t give me any likely-looking results, so I decide to just write it myself. I insert .map(unenumerate).enumerate() into the iterator pipeline, after the first filter(), and define my unenumerate() function as follows:

fn unenumerate((.., row): &(_, &String)) -> &String {
    row
}

I get two errors that I haven’t seen before:

error[E0106]: missing lifetime specifier
  --> src/main.rs:34:45
   |
34 | fn unenumerate((.., row): &(_, &String)) -> &String {
   |                           -------------     ^ expected named lifetime parameter
   |
   = help: this function's return type contains a borrowed value, but the signature does not say which one of argument 1's 2 lifetimes it is borrowed from
help: consider introducing a named lifetime parameter
   |
34 | fn unenumerate<'a>((.., row): &'a (_, &String)) -> &'a String {
   |               ^^^^            ^^^^^^^^^^^^^^^^     ^^^

error[E0121]: the type placeholder `_` is not allowed within types on item signatures
  --> src/main.rs:34:29
   |
34 | fn unenumerate((.., row): &(_, &String)) -> &String {
   |                             ^ not allowed in type signatures
   |
help: use type parameters instead
   |
34 | fn unenumerate<T>((.., row): &(T, &String)) -> &String {
   |               ^^^              ^

From what I can understand from these errors, first it’s asking me to add a lifetime parameter, and second it seems like the .. trick is not allowed in named functions, only in inline functions?

The suggestion to use type parameters makes me think of how this might work in C++, so I try the following instead:

fn unenumerate<First, Second>((.., second): &(First, Second)) -> Second {
    second
}

Here, confusingly, the compiler does not want the & on the tuple type, so after a few tries of adding and removing & operators I do get this to compile and produce the debug output that I was expecting.

A bit too late, I realize that I could have done this much quicker with only one filter() step, skipping the rows by returning false in hits_tree(). Although what I have already works, this seems like it’s potentially a big enough improvement that I should try it:

fn count_trees_hit(landscape: &Vec<String>, right: usize, down: usize) -> usize {
    landscape
        .iter()
        .enumerate()
        .filter(|(row_index, row): &(usize, &String)| hits_tree(*row_index, row, right, down))
        .count()
}

fn hits_tree(row_index: usize, row: &String, right: usize, down: usize) -> bool {
    if row_index % down != 0 {
        return false;
    }
    let col_index = row_index / down * right % row.len();
    println!("{}, {}", col_index, row_index);
    row.as_bytes()[col_index] == b'#'
}

Apart from forgetting the braces on the if statement (apparently, you cannot omit them in Rust for a single-statement block as you can in C-like languages) I get the same result as my previous code that used unenumerate(), so I feel good about this.

It’s time to remove the debug println!() and actually write the Part 2 code, that tries several different angles and multiplies the answers together:

if part2 {
    let total = count_trees_hit(&landscape, 1, 1)
        * count_trees_hit(&landscape, 3, 1)
        * count_trees_hit(&landscape, 5, 1)
        * count_trees_hit(&landscape, 7, 1)
        * count_trees_hit(&landscape, 1, 2);
    println!("{}", total);
} else {
    let trees_hit = count_trees_hit(&landscape, 3, 1);
    println!("{}", trees_hit);
}

With this, cargo run 2 gives me an answer, which I put into the Advent of Code website, and it’s correct. And as a nice note to end on, this last change marks the first time in this series that I actually write some Rust code that compiles right away!

The full code, also pushed to GitHub:

use std::env;
use std::fs;
use std::io::{self, BufRead};
use std::path;

fn main() {
    let args: Vec<String> = env::args().collect();
    let mut part2 = false;
    let default = String::from("1");
    let arg = args.get(1).unwrap_or(&default);
    if arg == "2" {
        part2 = true;
    }
    let landscape: Vec<String> = read_lines("input")
        .expect("Bad file")
        .map(|s| s.expect("Bad line in file"))
        .collect();

    if part2 {
        let total = count_trees_hit(&landscape, 1, 1)
            * count_trees_hit(&landscape, 3, 1)
            * count_trees_hit(&landscape, 5, 1)
            * count_trees_hit(&landscape, 7, 1)
            * count_trees_hit(&landscape, 1, 2);
        println!("{}", total);
    } else {
        let trees_hit = count_trees_hit(&landscape, 3, 1);
        println!("{}", trees_hit);
    }
}

fn count_trees_hit(landscape: &Vec<String>, right: usize, down: usize) -> usize {
    landscape
        .iter()
        .enumerate()
        .filter(|(row_index, row): &(usize, &String)| hits_tree(*row_index, row, right, down))
        .count()
}

fn hits_tree(row_index: usize, row: &String, right: usize, down: usize) -> bool {
    if row_index % down != 0 {
        return false;
    }
    let col_index = row_index / down * right % row.len();
    row.as_bytes()[col_index] == b'#'
}

fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<fs::File>>>
where
    P: AsRef<path::Path>,
{
    let file = fs::File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

Afterword

I can tell that I’m starting to get more comfortable in Rust, which feels good. But I’m a bit disappointed that even though I did some extra studying about ownership and the & operator, I was still confused, still couldn’t understand some errors, and basically had to either do exactly what the compiler said without understanding why, or else add or delete & operators arbitrarily until the code compiled.

Maybe what I should do next is read the page in the Rust book on lifetimes, and read about the * operator. Even though the page on ownership said those were advanced topics, both of those came up today in error messages that I didn’t understand.


[1] It occurs to me that — at least for this series — I’m glad I haven’t set up my editor to show Rust compiler errors inline, because otherwise I wouldn’t consciously think about these things ↩

Subcluster allocation for qcow2 images

In previous blog posts I talked about QEMU’s qcow2 file format and how to make it faster. This post gives an overview of how the data is structured inside the image and how that affects performance, and this presentation at KVM Forum 2017 goes further into the topic.

This time I will talk about a new extension to the qcow2 format that seeks to improve its performance and reduce its memory requirements.

Let’s start by describing the problem.

Limitations of qcow2

One of the most important parameters when creating a new qcow2 image is the cluster size. Much like a filesystem’s block size, the qcow2 cluster size indicates the minimum unit of allocation. One difference however is that while filesystems tend to use small blocks (4 KB is a common size in ext4, ntfs or hfs+) the standard qcow2 cluster size is 64 KB. This adds some overhead because QEMU always needs to write complete clusters so it often ends up doing copy-on-write and writing to the qcow2 image more data than what the virtual machine requested. This gets worse if the image has a backing file because then QEMU needs to copy data from there, so a write request not only becomes larger but it also involves additional read requests from the backing file(s).

Because of that qcow2 images with larger cluster sizes tend to:

  • grow faster, wasting more disk space and duplicating data.
  • increase the amount of necessary I/O during cluster allocation,
    reducing the allocation performance.

Unfortunately, reducing the cluster size is in general not an option because it also has an impact on the amount of metadata used internally by qcow2 (reference counts, guest-to-host cluster mapping). Decreasing the cluster size increases the number of clusters and the amount of necessary metadata. This has direct negative impact on I/O performance, which can be mitigated by caching it in RAM, therefore increasing the memory requirements (the aforementioned post covers this in more detail).

Subcluster allocation

The problems described in the previous section are well-known consequences of the design of the qcow2 format and they have been discussed over the years.

I have been working on a way to improve the situation and the work is now finished and available in QEMU 5.2 as a new extension to the qcow2 format called extended L2 entries.

The so-called L2 tables are used to map guest addresses to data clusters. With extended L2 entries we can store more information about the status of each data cluster, and this allows us to have allocation at the subcluster level.

The basic idea is that data clusters are now divided into 32 subclusters of the same size, and each one of them can be allocated separately. This allows combining the benefits of larger cluster sizes (less metadata and RAM requirements) with the benefits of smaller units of allocation (less copy-on-write, smaller images). If the subcluster size matches the block size of the filesystem used inside the virtual machine then we can eliminate the need for copy-on-write entirely.

So with subcluster allocation we get:

  • Sixteen times less metadata per unit of allocation, greatly reducing the amount of necessary L2 cache.
  • Much faster I/O during allocating when the image has a backing file, up to 10-15 times more I/O operations per second for the same cluster size in my tests (see chart below).
  • Smaller images and less duplication of data.

This figure shows the average number of I/O operations per second that I get with 4KB random write requests to an empty 40GB image with a fully populated backing file.

I/O performance comparison between traditional and extended qcow2 images

Things to take into account:

  • The performance improvements described earlier happen during allocation. Writing to already allocated (sub)clusters won’t be any faster.
  • If the image does not have a backing file chances are that the allocation performance is equally fast, with or without extended L2 entries. This depends on the filesystem, so it should be tested before enabling this feature (but note that the other benefits mentioned above still apply).
  • Images with extended L2 entries are sparse, that is, they have holes and because of that their apparent size will be larger than the actual disk usage.
  • It is not recommended to enable this feature in compressed images, as compressed clusters cannot take advantage of any of the benefits.
  • Images with extended L2 entries cannot be read with older versions of QEMU.

How to use this?

Extended L2 entries are available starting from QEMU 5.2. Due to the nature of the changes it is unlikely that this feature will be backported to an earlier version of QEMU.

In order to test this you simply need to create an image with extended_l2=on, and you also probably want to use a larger cluster size (the default is 64 KB, remember that every cluster has 32 subclusters). Here is an example:

$ qemu-img create -f qcow2 -o extended_l2=on,cluster_size=128k img.qcow2 1T

And that’s all you need to do. Once the image is created all allocations will happen at the subcluster level.

More information

This work was presented at the 2020 edition of the KVM Forum. Here is the video recording of the presentation, where I cover all this in more detail:

You can also find the slides here.

Acknowledgments

This work has been possible thanks to Outscale, who have been sponsoring Igalia and my work in QEMU.

Igalia and Outscale

And thanks of course to the rest of the QEMU development team for their feedback and help with this!

What’s new in GnuTLS 3.7.0

On behalf of the GnuTLS team, I am pleased to present GnuTLS 3.7.0, the first cut of the 3.7 series. This is the result of several months of planning and work by 25 contributors and includes feature enhancements and behavior changes, such as removal of deprecated functions and tightening of system requirements. In this entry, I will try to detail some notable features in the release.

API for on-demand CA certificates retrieval

During the TLS authentication phase, the server typically presents a chain of X.509 certificates, from the end-entity certificate to the trusted CA certificate. The AIA extension allows the server to omit certain portion of the certificate chain, by pointing to the location where the client can download the missing certificates. Although GnuTLS provides a means to override the certificate verification logic completely through callbacks, this task is error-prone and thus desired to be supported natively. Sahana Prasad introduced the new set of API that allow applications to safely complement the certificate chain. The API is already being used in glib-networking.

API to support QUIC

QUIC is a new, UDP based transport protocol used as the basis of HTTP/3. Although the protocol internally relies on TLS for security, the networking protocol requires direct access to the TLS handshake state machine, which is normally hidden behind the TLS library interface.

After examining the design decisions in other libraries, we took the approach to provide a small set of API that exposes the necessary part of the state machine: capturing and injecting Handshake/Alert messages, and notifying key installation.

The API can be used as a crypto backend of the ngtcp2 library, and in turn by curl’s experimental HTTP/3 support. Here is a quick screenshot how it works:


$ ./src/curl -V
curl 7.74.0-DEV (x86_64-pc-linux-gnu) libcurl/7.74.0-DEV GnuTLS/3.7.0 zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) ngtcp2/0.1.0-DEV nghttp3/0.1.0-DEV
Release-Date: [unreleased]
Protocols: dict file ftp ftps gopher http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli Debug HTTP3 HTTPS-proxy IDN IPv6 Largefile libz NTLM NTLM_WB PSL SSL TLS-SRP TrackMemory UnixSockets

$ ./src/curl --verbose -s --alt-svc altsvc.cache https://quic.aiortc.org/
* STATE: INIT => CONNECT handle 0x7c8a68; line 1796 (connection #-5000)
* Alt-svc connecting from [h1]quic.aiortc.org:443 to [h3-29]quic.aiortc.org:443
* Added connection 0. The cache now contains 1 members
* STATE: CONNECT => WAITRESOLVE handle 0x7c8a68; line 1842 (connection #0)
[...]
* QUIC handshake is completed
* ngtcp2 established connection!
* Connected to quic.aiortc.org () port 443 (#0)
* Marked for [keep alive]: HTTP/3 default
* STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x7c8a68; line 1987 (connection #0)
* STATE: SENDPROTOCONNECT => DO handle 0x7c8a68; line 2011 (connection #0)
* Using HTTP/3 Stream ID: 0 (easy handle 0x7c8a68)
> GET / HTTP/3
> Host: quic.aiortc.org
> user-agent: curl/7.74.0-DEV
> accept: */*
> alt-used: quic.aiortc.org:443
>
[...]

Hard dependency on Nettle 3.6

GnuTLS relies on the Nettle library for cryptographic algorithm implementations. While we closely work with the Nettle upstream, we have kept the copy of the library in our tree for compatibility reasons. By requiring Nettle 3.6, we were able to remove those compatibility code from the distribution, which makes the maintenance easier.

GOST MAGMA/KUZNYECHIK CTR-ACPKM and CMAC

Thanks to Dmitry Baryshkov, Nettle 3.6 has seen adoption of the GOST hash and digital signature (GOSTDSA) algorithms. Aiming at the inclusion of the future Nettle releases, the GOST block cipher algorithms, Magma and Kuznyechik, have also been made available through the abstract crypto API.

Resurrection of padlock instruction set and support for Zhaoxin CPU

GnuTLS has supported the VIA padlock instruction set for many years now. However, the support has become difficult to test as the VIA CPUs are hard to obtain. Jonas Zhou fixed the current padlock detection code, and added support for the compatible instruction set available in the Zhaoxin CPU family.

3.8.0 and 3.7.1 milestones are now open!

Now that 3.7.0 was released, a couple of milestones have been created for the new development.

I would expect even more new features coming in the those future releases, such as the Linux kernel TLS (KTLS) support, DTLS 1.3, algorithm acceleration based on AF_ALG, and TPM 2.0 support, to name a few. Stay tuned for updates!

Still on Github

Over 4 years ago now, I wrote about moving ostree to Github, and I wanted to add an update here. I still think it was the right move.

Free Software is important to me – but I think Github overall provides a lot more benefit to FOSS than harm from its mostly proprietary nature. Providing a zero-cost mostly reliable featureful platform (also with various zero-cost CI available) is a huge accelerant to all the FOSS projects that use it. And whenever I have to try to contribute a patch via email that has no CI checking I sometimes just want to throw up my hands and move on.

But for the people who don’t agree with me and think Free Software needs free tools – I say awesome. I am very glad you exist, and really there’s about 20% of me that also agrees. That part of me is happy when I come across projects hosted in e.g. Gitlab.com at least. It’s obviously good for there to be some diversity and competition, beyond the fact that Gitlab is at least at the core FOSS. I also hope at some point somehow pagure’s model of storing issues and PR comments in Git takes off too. Or maybe it’ll be something like Radicle.

Anyways, that’s really all there is to say – I continue to use Github for those reasons but I’m happy to see new tooling that might also win in the future. Or just cool developments in existing tools. My goal here is just to have these current thoughts written down so I can link to it in various places.

December 01, 2020

bolt 0.9.1 with fixes for integrated thunderbolt controller

A new release of bolt is out: 0.9.1 - Unstable icy waters. This is a bug-fix release that addresses some issue on integrated Thunderbolt controller.

Intel's Ice Lake is the first architecture where the Thunderbolt controller is part of the CPU die. This is quite a big difference. There is a good article on wikichip called "A Look At The Ice Lake Thunderbolt 3 Integration" for those that are curious about the technical details. What matters for bolt is that there is no DROM, which means that the udev device representing the host switch does not have the usual name and id attributes for the device and vendor. Additionally, the unique_id attribute has a different UUID every boot. This breaks one of the fundamental assumptions for boltd, which used the unique_id of the host to uniquely identify the corresponding Thunderbolt domain. This is important because we store host devices and domains in the store. Now, with the uuid changing this means that 1) we can not match the previously stored domains and hosts to the ones after a reboot and thus will accumulate "stale" domains in the store. Ironically, the fact that the host device also does not have any name and id information means that boltd would refuse to create the BoltDevice for those which meant we did at least not accumulate the stale host devices in the store. It did break the detection of the generation, i.e. if it is Thunderbolt 3 or USB 4.

To address all these issues, a couple of changes were made to boltd: 1) The PCI id of the native host interface (NHI) is used to detect if we are dealing with an integrated Thunderbolt controller and if so, neither the domain nor the host is stored. 2) If the host udev device does not have name or id information, the SMBIOS/DMI info that is exposed via sysfs is used transparently. 3) The bolt store is now versioned and thus can be upgraded. If no version info is detected, it means the store has version 0 and will be upgraded to 1. On that update, the store is scanned and all stored domains that are currently not online are removed, thus removing the stale domains.

The release already is in Fedora rawhide (aka F34, bodhi) and should also hit the stable releases (F33, F32) soon. It is updated on Arch already, thanks Jaro! Testing would be very welcome. I also encourage all other distributions to upgrade to it.

Committed to the integrity of your root filesystem

Quite a while ago I came across the SQLite testing page and was impressed (and since then it’s gotten even better). They’ve clearly invested a lot in it, and I think SQLite’s ubiquity is well deserved.

When I started the ostree project I had this in mind but…testing is hard. We have decent "unit test style" coverage since the start but that’s not very "real world". We’ve gone through a few test frameworks over the years. But to the point of this blog post: I finally had a chance to write some new testing code and I’m happy with how it turned out!

TL;DR: There’s a new "transactionality" test run on every PR that uses a mix of e.g. kill -9 ostree and reboot -ff while updates are running, and verifies that you either have the old or new system safely. (PRs: ostree#2048 and ostree#2127).

But along the way there were some interesting twists.

Test frameworks and rebooting

I mentioned we’d been through a few test frameworks. An important thing to me is that ostree is a distribution-independent project; it’s used by a variety of systems today. Ideally, our tests can be run in multiple frameworks used by different distributions. That works easily for our "unit tests" of course, same as it does for many other projects (make check style tests that are nondestructive and run as non-root).

But our OSTree tests want a "real" system (usually a VM), and further the most interesting tests need to be destructive. More than that, we need to support rebooting the system under test.

I’d known about the Debian autopkgtest specification for a while, and when I was looking at testing I re-evaluated it. There are some things that are very Debian-specific (how tests are defined in the metadata), but in particular I really liked how it supports reboots.

There’s a big tension in test systems like this – is the test logic primarily run on the "system under test", or is it on some external system which manages the target via e.g. ssh? We had lots of problems in our prior test frameworks was dealing with reboots with the latter style. Plus the latter style tends to strongly tie the test code to the test harness.

In the Fedora CoreOS group we use a system called "kola" which came from the original CoreOS project. It knows how to boot systems using Ignition in various clouds along with qemu. I added partial support for the Debian Autopkgtest specification to it (cosa#1528).

Avoiding shell script

A lot of the original ostree tests are in shell script. I keep finding myself writing shell even though I also keep being badly burned by it from time to time.

So another tangent along the way here: For writing new tests I’d resolved to use "not shell script". Python would be an obvious choice but…another large wrinkle here is that in CoreOS we don’t want interpreters in the base OS – they should run as containers (yes, a shell is obviously an interpreter too but…). So going the interpreted test route would drive us towards having our test framework run as a privileged container. I decided not to do this for a few reasons; the biggest is that makes it much harder to test the system as other processes see it.

My preferred language nowadays is Rust, and it generates static-except-libc binaries that we can just copy to the host. Further, fortuitously someone else created Rust bindings to ostree and I’d been wanting an excuse to use that for a while too! However…some things are just too verbose via API, and plus we want to test the CLI too. Invoking subprocesses via Rust std::process::Command is also very verbose. So I ended up creating a sh-inline crate for Rust that makes it ergonomic to include snippets of strict mode bash in the code. This snippet is a good example. I’d like to make this even more ergonomic too, but my proc-macro-fu isn’t there yet.

Actually writing the test

OK so all those prerequisites out of the way, the first thing I did was write the code to do the "try upgrading and while that’s running, kill -9 it". That went reasonably quickly and worked well, so I moved on to the more interesting case of adding reboot -ff (simulating immediate power loss) as another "interrupt strategy". This excercises the whole stack through the kernel, particularly interactions with the filesystem.

However, this required completely rewriting the control flow because here the "test harness" is also being forcibly killed. We don’t want to rely on persisting our state to the disk on the system. I ended up serializing the process state into AUTOPKGTEST_REBOOT_MARK, which gets stored in the harness and passed back when the process starts again. Effectively then the test code becomes a sort of coroutine with the harness.

Found problems

Depending on how you look at it, fortunately or unfortunately: none so far. One motivation for writing this test was to try to reproduce a bug a user filed that showed an error message from the boot loader configuration handling code. I haven’t managed to reproduce that yet. I did manually inject some faults in the code and verify that the test failed of course. And in the past I’ve of course done some manual testing to verify that ostree does what it says on the box for implementing transactional upgrades. But there’s clearly more to explore here.

Next steps

One thing I plan to explore next here is fault injection, probably with strace fault injection. This may also combine well with adding support for the harness to request explicit sleep() calls to widen the window on possible races. Plus so far while I’ve mentioned support for other distributions, this is only testing Fedora CoreOS in its default mode; e.g. we’re only validating xfs and not other Linux filesystems, etc.

Are we testing like SQLite yet?

Definitely not, but I’m happy that I made some progress closer to that goal! It was an interesting project and I’m looking forward to building more of it per above. Outside of OSTree, the goal of this blog was write down some of the "lessons learned" for others working in this space. For example, I hope some people working in the Linux-based OS testing space look at the Debian autopkgtest; it can be hard to come to consensus on test frameworks and standards, but there are at least some good ideas there. Also I think the mix of "Rust with some inline shell script" worked pretty well for these types of tests; particularly if the CLI outputs JSON, deserializing with Serde is great. Though taking the Rust compile time hit for tests is a downside.

But in the end, I can at least now say that every pull request to OSTree runs through a test suite that ensures it survives being forcibly terminated while an update is running. The integrity of your root filesystem is very important to me – it should be robust and image-like, but still a Linux system in the end. If this sounds good to you, I hope you check out one of the distributions that use it!

November 29, 2020

Accurate Conclusions from Bogus Data: Methodological Issues in “Collaboration in the open-source arena: The WebKit case”

Nearly five years ago, when I was in grad school, I stumbled across the paper Collaboration in the open-source arena: The WebKit case when trying to figure out what I would do for a course project in network theory (i.e. graph theory, not computer networking; I’ll use the words “graph” and “network” interchangeably). The paper evaluates collaboration networks, which are graphs where collaborators are represented by nodes and relationships between collaborators are represented by edges. Our professor had used collaboration networks as examples during lecture, so it seemed at least mildly relevant to our class, and I wound up writing a critique on this paper for the class project. In this paper, the authors construct collaboration networks for WebKit by examining the project’s changelog files to define relationships between developers. They perform “community detection” to visually group developers who work closely together into separate clusters in the graphs. Then, the authors use those graphs to arrive at various conclusions about WebKit (e.g. “[e]ven if Samsung and Apple are involved in expensive patent wars in the courts and stopped collaborating on hardware components, their contributions remained strong and central within the WebKit open source project,” regarding the period from 2008 to 2013).

At the time, I contacted the authors to let them know about some serious problems I found with their work. Then I left the paper sitting in a short-term to-do pile on my desk, where it has been sitting since Obama was president, waiting for me to finally write this blog post. Unfortunately, nearly five years later, the authors’ email addresses no longer work, which is not very surprising after so long — since I’m no longer a student, the email I originally used to contact them doesn’t work anymore either — so I was unable to contact them again to let them know that I was finally going to publish this blog post. Anyway, suffice to say that the conclusions of the paper were all correct; however, the networks used to arrive at those conclusions suffered from three different mistakes, each of which was, on its own, serious enough to invalidate the entire work.

So if the analysis of the networks was bogus, how did the authors arrive at correct conclusions anyway? The answer is confirmation bias. The study was performed by visually looking at networks and then coming to non-rigorous conclusions about the networks, and by researching the WebKit community to learn what is going on with the major companies involved in the project. The authors arrived at correct conclusions because they did a good job at the later, then saw what they wanted to see in the graphs.

I don’t want to be too harsh on the authors of this paper, though, because they decided to publish their raw data and methodology on the internet. They even published the python scripts they used to convert WebKit changelogs into collaboration graphs. Had they not done so, there is no way I would have noticed the third (and most important) mistake that I’ll discuss below, and I wouldn’t have been able to confirm my suspicions about the second mistake. You would not be reading this right now, and likely nobody would ever have realized the problems with the paper. The authors of most scientific papers are not nearly so transparent: many researchers today consider their source code and raw data to be either proprietary secrets to be guarded, or simply not important enough to merit publication. The authors of this paper deserve to be commended, not penalized, for their openness. Mistakes are normal in research papers, and open data is by far the best way for us to be able to detect mistakes when they happen.

Collaboration network
A collaboration network from the paper. The paper reports that this network represents collaboration between September 2008 (when Google began contributing to WebKit) and February 2011 (the departure of Nokia from the project). Because the authors posted their data online, I noticed that this was a mistake in the paper: the graph actually represents the period between February 2011 and July 2012. The paper’s analysis of this graph is therefore questionable, but note this was only a minor mistake compared to the three major mistakes that impact this network. Note the suspiciously-high number of unaffiliated (“Other”) contributors in a corporate-dominated project.

The rest of this blog post is a simplified version of my original school paper from 2016. I’ve removed maybe half the original content, including some flowery academic language and unnecessary references to class material (e.g. “community detection was performed using fast modularity maximization to generate an alternate visualization of the network,” good for high scores on class papers, not so good for blog posts). But rewriting everything to be informal takes a long time, and I want to finish this today so it’s not still on my desk five more years from now, so the rest of this blog post is still going to be much more formal than normal. Oh well. Tone shift now!

We (“we” means “I”) examine various methodological issues discovered by analyzing the paper. The first section discusses the effects on the collaboration network of choosing a poor definition of collaboration. The second section discusses a major source of error in detecting the company affiliation of many contributors. The third section describes a serious mistake in the data collection process. Each of these issues is quite severe, and any one alone calls into question the validity of the entire study. It must be noted that such issues are not necessarily unique to this paper, and must be kept in mind for all future studies that utilize collaboration networks.

Mistake #1: Poorly-defined Collaboration

The precise definition used to model collaboration has tremendous impact on the usefulness of the resultant collaboration network. Many collaboration networks are built using definitions of collaboration that are self-evidently useful, where there is little doubt that edges in the network represent real-world collaboration. The paper adopts an approach to building collaboration networks where developers are represented by nodes, and an edge exists between two nodes if the corresponding developers modified the same source code file during the time period under consideration for the construction of the network. However, it is not clear that this definition of collaboration is actually useful. Consider that it is a regular occurrence for developers who do not know each other and may have never communicated to modify the same files. Consider also that modifying a common file does not necessarily reflect any shared interest in a particular portion of the software project. For instance, a file might be modified when making an interface change in another file, or when fixing a build error occurring on a particular platform. Such occurrences are, in fact, extremely common in the WebKit project. Additionally, consider that there exist particular source code files that are unusually central to the project, and must be modified more frequently than other files. It is highly likely that almost all developers will at one point or another make some change in such a file, and therefore be connected via a collaboration edge to all other developers who have ever modified that file. (My original critique shows a screenshot of the revision history of WebPageProxy.cpp, to demonstrate that the developers modifying this file were working on unrelated projects.)

It is true, as assumed by the paper, that particular developers work on different portions of the WebKit source code, and collaborate more with particular other developers. For instance, developers who work for the same company typically, though not always, collaborate most with other developers from that same company. However, the paper’s naive definition of collaboration should ensure that most developers will be considered to have collaborated equally with most other developers, regardless of the actual degree of collaboration. For instance, consider developers A and B who regularly collaborate on a particular source file. Now, developer C, who works on a platform that does not use this file and would not ordinarily need to modify it, makes a change to some cross-platform interface in another file that requires updating this file. Developer C is now considered to have collaborated with developers A and B on this file! Clearly, this is not a desirable result, as developers A and B have collaborated far more on the development of the file. Moreover, consider that an edge exists between two developers in the collaboration network if they have ever both modified any file anywhere in WebKit during the time period under review; then we can expect to form a network that is almost complete (a “full” graph where edges exists between most nodes). It is evident that some method of weighting collaboration between different contributors would be desirable, as the unweighted collaboration network does not seem useful.

One might argue that the networks presented in the paper clearly show developers exist in subcommunities on the peripheries of the network, that the network is clearly not complete, and that therefore this definition of collaboration sufficed, at least to some extent. However, this is only due to another methodological error in the study. Mistake #3, discussed later, explains how the study managed to produce collaboration networks with noticeable subcommunities despite these issues.

We note that the authors chose this same definition of collaboration in their more recent work on OpenStack, so there exist multiple studies using this same flawed definition of collaboration. We speculate that this definition of collaboration is unlikely to be more suitable for OpenStack or for other software projects than it is for WebKit. The software engineering research community must explore alternative models of collaboration when undertaking future studies of software development collaboration networks in order to more accurately reflect collaboration.

Mistake #2: Misdetected Contributor Affiliation

One difficulty when building collaboration networks is the need to correctly match each contributor with the correct company affiliation. Although many free software projects are dominated by unaffiliated contributors, others, like WebKit, are primarily developed by paid contributors. Looking at the number of times a particular email domain appears in WebKit changelog entries made during 2015, most contributors commit using corporate emails, but many developers commit to WebKit using personal email accounts, such as Gmail accounts; additionally, many developers use generic webkit.org email aliases, which were previously available to active WebKit contributors. These developers may or may not be affiliated with companies that contribute to the project. Use of personal email addresses is a source of inaccuracy when constructing collaboration networks, as it results in an undercount of corporate contributions.  We can expect this issue has led to serious inaccuracies in the reported collaboration networks.

This substantial source of error is neither mentioned nor accounted for; all contributors using such email accounts were therefore miscategorized as unaffiliated. However, the authors clearly recognized this issue, as it has been accounted for in their more recent work covering OpenStack by cross-referencing email addresses from git revision history with a database containing corporate affiliations maintained by the OpenStack Foundation. Unfortunately, no such effort was made for the WebKit data set.

The WebKit project was previously dominated by contributors with chromium.org email domains. This domain is equivalent to webkit.org in that it can be used by contributors to the Chromium project regardless of corporate affiliation; however, most contributors with Chromium emails are actually Google employees. The high use of Chromium emails by Google employees appears to have led to a dramatic — by roughly an entire order of magnitude — undercount of Google’s contributors to the WebKit project, as only contributors with google.com emails were considered to be Google employees. The vast majority of Google employees used chromium.org emails, and so were counted as unaffiliated developers. This explains the extraordinarily high number of unaffiliated developers in the networks presented by the paper, despite the fact that WebKit development is, in reality, dominated by corporate contributors.

Mistake #3: Missing Most Changelog Data

The paper incorrectly claims to have gathered its data from both WebKit’s Subversion revision history and from its changelog files. We must draw a distinction between changelog entries and Subversion revision history. Changelog entries are inserted into changelog files that are committed into the Subversion repository; they are completely separate from the Subversion history. Each subproject within the WebKit project has its own set of changelog files used to record changes under the corresponding directory.

In fact, the paper processed only the changelog files. This was actually a good choice, as WebKit’s changelog files are much more accurate than the Subversion history, for two reasons. Firstly, it is easy for a contributor to change the email address entered into a changelog file, e.g. after a change in company affiliation. However, it is difficult to change the email address used to commit to Subversion, as this requires requesting a new Subversion account from the Subversion administrator; accordingly, contributors are more likely to use older email addresses, lacking accurate company affiliation, in Subversion revisions than in changelog files.  Secondly, many Subversion revisions are not directly committed by contributors, but rather are actually committed by the commit queue bot, which runs various tests before committing the revision. Subversion revisions are also, more rarely, committed by a completely different contributor than the patch author. In both cases, the proper contributor’s name will appear in only the changelog file, and not the Subversion data. Some developers are dramatically more likely to use the commit queue than others. Various other reviews of WebKit contribution history that examine data from Subversion history rather than from changelog files are flawed for this reason. Fortunately, by relying on changelog files rather than Subversion metadata, the authors avoid this problem.

Unfortunately, a serious error was made in processing the changelog data. WebKit has many different sets of changelog files, stored in various project subdirectories (JavaScriptCore, WebCore, WebKit, etc.), as well as toplevel changelogs stored in the root directory of the project. Regrettably, the authors were unaware of the changelogs in subdirectories, and based their analysis only on the toplevel changelogs, which contain only changes that occurred in subdirectories that lack their own changelog files. In practice, this inadvertently restricted the scope of the analysis to a very small minority of changes, primarily to build system files, manual tests, and the WebKit website. That is, the reported collaboration networks do not reflect collaboration on any actual source code files. All source code files are contained in subdirectories with their own changelog files, and therefore no source code files were actually considered in the analysis of collaboration on source code changes.

We speculate that the analysis’s focus on build system files likely exaggerates the effects of clustering in the network, as different companies used different build systems and thus were less likely to edit the build systems used by other companies, and that an analysis based on the correct data would display less of a clustering effect. Certainly, there would be dramatically more edges in the already-dense networks, because an edge exists between two developers if there exists any one file in WebKit that both developers have modified. Omitting all of the source code files from the analysis therefore dramatically reduces the likelihood of edges existing between nodes in the network.

Conclusion

We found that the original study was impacted by an unsuitable definition of collaboration used to build the collaboration networks, severe errors in counting contributor affiliation (including the classification of most Google employees as unaffiliated developers), and the omission of almost all the required data from the analysis, including all data on modifications to source code files. The authors constructed and studied essentially meaningless networks. Nevertheless, the authors were able to derive many accurate conclusions about the WebKit project from their inaccurate collaboration networks. Such conclusions illustrate the dangers of seeking to find particular meanings or explanations through visual inspection of collaboration networks. Researchers must work forwards from the collaboration networks to arrive at their conclusions, rather than backwards by attempting to match the networks to conclusions gained from prior knowledge.

Original Report

Wow, OK, you actually read this far? Since this blog post criticizes an academic paper, and since this blog post does not include various tables and examples that support my arguments, I’ve attached my original analysis in full. It is a boring, student-quality grad school project written with the objective of scoring the highest-possible grade in a class rather than for clarity, and you probably don’t want to look at it unless you are investigating the paper in detail. (If you download that, note that I no longer work for Igalia, and the paper was not authorized by Igalia either; I used my company email to disclose my affiliation and maybe impress my professor a little.) Phew, now I can finally remove this from my desk!

Catching up on WebKit GStreamer WebAudio backends maintenance

Over the past few months the WebKit development team has been working on modernizing support for the WebAudio specification. This post highlights some of the changes that were recently merged, focusing on the GStreamer ports.

My fellow WebKit colleague, Chris Dumez, has been very active lately, updating the WebAudio implementation for the mac ports in order to comply with the latest changes of the specification. His contributions have been documented in the Safari Technology Preview release notes for version 113, version 114, version 115 and version 116. This is great for the WebKit project! Since the initial implementation landed around 2011, there wasn’t much activity and over the years our implementation started lagging behind other web engines in terms of features and spec compliance. So, many thanks Chris, I think you’re making a lot of WebAudio web developers very happy these days :)

The flip side of the coin is that some of these changes broke the GStreamer backends, as Chris is focusing mostly on the Apple ports, a few bugs slipped in, noticed by the CI test bots and dutifully gardened by our bots sheriffs. Those backends were upstreamed in 2012 and since then I didn’t devote much time to their maintenance, aside from casual bug-fixing.

One of the WebAudio features recently supported by WebKit is the Audio Worklet interface which allows applications to perform audio processing in a dedicated thread, thus relieving some pressure off the main thread and ensuring a glitch-free WebAudio rendering. I added support for this feature in r268579. Folks eager to test this can try the GTK nightly MiniBrowser with the demos:

$ wget https://trac.webkit.org/export/270226/webkit/trunk/Tools/Scripts/webkit-flatpak-run-nightly
$ chmod +x webkit-flatpak-run-nightly
$ python3 webkit-flatpak-run-nightly --gtk MiniBrowser https://googlechromelabs.github.io/web-audio-samples/audio-worklet/

For many years our AudioFileReader implementation was limited to mono and stereo audio layouts. This limitation was lifted off in r269104 allowing for processing of up to 5.1 surround audio files in the AudioBufferSourceNode.

Our AudioDestination, used for audio playback, was only able to render stereo. It is now able to probe the GStreamer platform audio sink for the maximum number of channels it can handle, since r268727. Support for AudioContext getOutputTimestamp was hooked up in the GStreamer backend in r266109.

The WebAudio spec has a MediaStreamAudioDestinationNode for MediaStreams, allowing to feed audio samples coming from the WebAudio pipeline to outgoing WebRTC streams. Since r269827 the GStreamer ports now support this feature as well! Similarly, incoming WebRTC streams or capture devices can stream their audio samples to a WebAudio pipeline, this has been supported for a couple years already, contributed by my colleague Thibault Saunier.

Our GStreamer FFTFrame implementation was broken for a few weeks, while Chris was landing various improvements for the platform-agnostic and mac-specific implementations. I finally fixed it in r267471.

This is only the tip of the iceberg. A few more patches were merged, including some security-related bug-fixes. As the Web Platform keeps growing, supporting more and more multimedia-related use-cases, we, at the Igalia Multimedia team, are committed to maintain our position as GStreamer experts in the WebKit community.

November 27, 2020

Do not use librsvg 2.40.x

Please do not use librsvg 2.40.x; it cannot render recent Adwaita icon themes correctly.

The librsvg 2.40.x series is the last "C only" version of the library; it was deprecated in 2017.

During the port to Rust, I rewrote the path parser to be spec-compliant, and fixed a few cases that the C version did not handle. One of this cases is for compact Arc data.

The SVG path grammar allows one to remove whitespace between numbers if the next number starts with a sign. For example, 23-45 gets parsed as two numbers 23 -45.

In addition, the arguments of the Arc commands have two flags in the middle of a bunch of numbers. The flags can be 0 or 1, and there may be no whitespace between the flags and the next number. For example, A1.98 1.98 0 0015 13.96 gets parsed as A1.98 1.98 0 0 0 15 13.96 — note the two 0 0 flags before the 15.

Librsvg 2.40.x does not parse this correctly. Adwaita-icon-theme-3.36, and possibly earlier versions, uses minimized SVG files with compressed whitespace, and will not render correctly with the C-only version of librsvg.

This is help-contents-symbolic.svg rendered with librsvg 2.40.21:

icon rendered incorrectly

And this is help-contents-symbolic.svg rendered with librsvg 2.50.2:

icon rendered correctly

This is not the only icon with compact Arc commands; there are many others that will also be mis-rendered in 2.40.x.

I don't know when Adwaita started using SVGs with compressed whitespace; probably it didn't when librsvg 2.40.x was the latest version, or everyone would have noticed mis-rendered icons.

Background: Someone recently filed a bug about memory unsafety in librsvg 2.40.x's path parser, which mysteriously enough only manifests itself in big-endian platforms. I wouldn't be surprised if this had latent bugs on little-endian as well.

Please use at least librsvg 2.48.x; any earlier versions are not supported. Generally I keep an eye on the last two stable release sets (2.48.x and 2.50.x as of this writing), but only commit fixes to the latest stable series (2.50.x currently).

How Apple might completely take over end users' computers

Many people are concerned about Apple's ongoing attempts to take more and more control of end user machines from their users. Some go so far as to say that Apple won't be happy until they have absolute and total control over all programs running on end user devices, presumably so that they can enforce their 30% tax on every piece of software. Whether this is true or not we don't really know.

What we can do instead is a thought experiment. If that was their end goal, how would they achieve it? What steps would they take to obtain this absolute control? Let's speculate.

Web apps

A common plan against tightening app store requirements is to provide a web app instead. You can do a lot of cool things with WebAssembly and its state is continuously improving. Thus it must be blocked. This is trivial: require that web browsers may only run WASM programs that are notarized by Apple. This is an easy change to sell, all it needs is a single tear jerking please-think-of-the-children presentation about the horrible dangers of online predators, Bitcoin miners and the like. Notarization adds security and who wouldn't want to have more of that?

There is stilll the problem that you can run an alternative browser like Chrome or Firefox. This can be solved simply by adding a requirement that third party browsers can only get notarized if they block all non-notarized web apps. On iOS this is of course already handled by mandating that all browsers must use the system's browser engine. At some point this might get brought over to macOS as well. For security.

Javascript

Blocking WASM still leaves the problem of Javascript. There is a lot of it and even Apple can not completely block non-notarized JS from running. Here you have to run the long game. An important step is, surprisingly, to drive the adoption of WebAssembly. There are many ways of doing this, the simplest is to stop adding any new JS functionality and APIs that can instead be done in WebAssembly. This forces app developers to either drop Apple support or switch to WASM. This transition can be boosted by stopping all development and maintenance on the existing JS engine and letting it bitrot. Old web pages will get worse and worse over time and since Apple won't fix their browser, site operators will be forced to upgrade to technologies like WASM that come with mandatory notarization. For security.

Scripting languages

Scripting languages such as Perl and Python can be used to run arbitrary programs so they must go. First they are removed from the core install so people have to download and install them separately. That is only an inconvenience, though. To achieve total control notarization requirements must again be updated. Any program that loads "external code" must add a check that the code it is running is notarized by Apple. At first you will of course allow locally written script files to be run, as long as you first hunt through system security settings to add run permissions to the script file. This must be done with physical human interaction like a mouse or touchpad. It must not be automatable to prevent exploits. The obtained permissions are of course revoked every time the file is edited. For security.

Compilers

There is still a major hole in this scheme: native compilers. It might be tedious, but it is possible to compile even something as big as Firefox from scratch and run the result. Therefore this must be blocked, and notarization is again the key. This can be done by requiring all binaries, even self-built ones, to be notarized. This is again easy to sell, because it blocks a certain class malware injection attacks. Following iOS's lead you have to get a developer certificate from Apple to sign your own code to run on your own machine.

Once the basic scheme is in place you have to tighten security and block the signing from any compiler except the Apple provided system one. This has to be done for security, because existing third party compilers may have bugs and features that could be used to circumvent notarization requirements somehow. Only Apple can get this right as the system provider. There must be one, and only one, way of going from source code to executable binaries and that path must be fully controlled by Apple. For security.

Total separation of development and use

Even with all this you can still compile and run your own code, meaning people will find ways of getting around these requirements and doing what they want to do rather than what Apple permits them to do. This means that even tighter reins are required. The logical end result is to split the macOS platform into two separate entities. The first one is the "end user" system that can only run Apple-notarized apps and nothing else. The second is the "dev platform" that runs only XCode, Safari (in some sort of a restricted mode) and one other program that has to have been fully compiled on the current physical machine. Remote compilation servers are forbidden as they are a security risk. This is roughly how iOS development and things like game console dev kits already work. The precedent is there, waiting for the final logical step to be taken.

This has the side effect that every developer who wants to support Apple platforms now has to buy two different Apple laptops, one for development and one for testing. But let us be absolutely clear about one thing. This is not done to increase sales and thus profits. No. Not under any circumstances! It is for a higher purpose: for security.

November 26, 2020

2020-11-26 Thursday

  • Mail chew, sync with Thais, COOL community meeting. Called out to help an older man with Altzheimers who had fallen and broken his hip with J. poor chap - not all interruptions to the flow are bad.
  • Plugged away at a proposal; call with Bob about Christmas, exchanged contracts on house.

November 25, 2020

Friends of GNOME Update – November 2020

Welcome to the November 2020 Friends of GNOME Update

A photo of a group of people in matching shirts, sitting on a table making spring rolls.
“GNOME Asia 2009 with FOSSASIA, Saigon Ho Chi Minh City” by FOSSASIA is licensed under CC BY 2.0

GNOME on the Road

The Seattle GNU/Linux Conference took place online this year and we were there. Executive Director Neil McGovern gave a presentation titled “Patently Obvious” about our legal case with a patent assertion entity and how the settlement impacts all of FOSS.

Strategic Initiatives Manager M. de Blanc gave a surprise talk that had nothing to do with GNOME, but discussed the Foundation nonetheless.

We also had talks at Linux Application Summit and GNOME.Asia, which you can read more about below.

We (co-) Hosted Great Events!

Linux App Summit (LAS) took place on November 12 – 14. Co-organized with KDE, LAS brought together attendees from over 80 countries. Videos are already online if you would like to catch up or share your favorite sessions with friends.

We had three GNOME Foundation staff speaking at LAS:

While writing this, GNOME.Asia Summit is taking place. Organizationally based in Malaysia, GNOME.Asia is happening until November 26. It features an amazing list of speakers, including talks from Bartłomiej and Melissa; GNOME Foundation Board Members Felipe Borges (giving two talks), Rob McQueen, and Federico Mena Quintero (also giving two talks; and GNOME Foundation members and friends.

Accessibility GTK (and GNOME)

Emmanuele, Core GTK Developer, has spent 2020 focusing on accessibility. He recently worked with Matthias Clasen on a blog post about some of that work.

When discussing computing, “accessibility” refers to the technologies that make things like software and web sites work for people with disabilities or who otherwise need accommodations. This includes a range of permanent and temporary conditions, e.g. blind users and people who have broken an arm and are computing one handed while it heals. Accessibility matters to us at GNOME because we believe everyone should trust and be empowered by their technology, regardless of ability.

Community Education Challenge Phase Two Winners

The Community Engagement Challenge Phase Two is wrapping up. Melissa Wu and Caroline Henriksen have been preparing for the announcement of the Phase Two winners. You can join us on December 2 at 18:00 UTC for a showcase of projects, highlighting how they’ve developed since July, and the grand announcement of who will be moving on to Phase Three.

Check Out A GNOME Working Group

We’ve started up regular social media working group hours that anyone can join. The goal of these meetings will be to discuss and plan out news and social topics for the following week, and if there is time, to work on drafting the content. You can drop in on one to check it out (or one of the other Working Group or Team meetings). Information is on events.gnome.org.

Building the Future of GNOME

We’re running a fundraiser to fund the Foundation’s activities in 2021. We appreciate how much you’ve already supported GNOME! We’re asking if you’d be willing to share our announcement of the fundraiser; one of the weekly updates we’ll be sharing on gnome.org, including this one on GTK4; or your GNOME Story on social media using #GNOMEStories.

Even if you don’t, we recommend checking out Director of Operations Rosanna Yuen’s GNOME Story.

From Planet GNOME

Here are a few posts we particularly liked from Planet GNOME:

Thank you!

Thank you for all the ways you support GNOME—the community, the Foundation, and the project. This has not been an easy year for many of us, and we appreciate that you have given your time and energy into making GNOME a place where people have found connection, fulfillment, and even joy.

2020-11-25 Wednesday

  • Damien arrived to help finsh & fibre-glass the roof with Martin (a new Father) - more insulating, located and fixed the missing silicon around the shower causing the free shower below.
  • Sales team call, poked at code with Kendy & Ash; procurement call, plugged away at some diagrams.

November 23, 2020

fwupd 1.5.2

The last few posts I did about fwupd releases were very popular, so I’ll do the same thing again: I’ve just tagged fwupd 1.5.2 – This release changes a few things:

  • Add a build time flag to indicate if packages are supported – this would be set for “traditional” package builds done by the distro, and unset by things like the Fedora COPR build, the Flatpak or Snap bundles. There are too many people expecting that the daily snap or flatpak packages represent the “official fwupd” and we wanted to make it clear to people using these snapshots that we’ve done basically no QA on the snapshots.
  • A plugin for the Pinebook Pro laptop has been added, although it needs further work from PINE64 before it will work correctly. At the moment there’s no way of getting the touchpad version, or finding out which keyboard layout is installed so we can tag the correct firmware file. It’s nearly there and is still very useful for playing with the hardware on the PB Pro.
  • Components can now set the icon from the metadata from the LVFS, if supported by the fwupd plugin. This allows us to tag “generic” ESRT devices as things like EC devices, or, ahem, batteries.
  • I’ve been asked by a few teams, including the Red Hat Edge team, the CoreOS team and also by Google to switch from libsoup to libcurl for downloading data – as this reduces the image size by over 5MB. Even NetworkManager depends on libcurl now, and this seemed like a sensible thing to do given fwupd is now being used in so many different places.
  • Fall back to FAT32 internal partitions for detecting ESP, as some users were complaining that fwupd did not properly detect their ESP that didn’t have the correct partition GUID set. Although I think fixing the GUID is the right thing to do, the system firmware also falls back, and pragmatically so should we.
  • Fix detection of ColorHug version on older firmware versions, which was slightly embarrassing as ColorHug is one of the devices in the device regression tests, but we were not testing an old enough firmware version to detect this bug.
  • Fix reading BCM57XX vendor and device ids from firmware – firmware for the Talos II machine is already on the LVFS and can replace the non-free firmware there in almost all situations now.
  • For this release we had to improve synaptics-mst reliability when writing data, which was found occasionally when installing firmware onto a common dock model. A 200ms delay is the difference between success and failure, which although not strictly required seemed pragmatic to add.
  • Fix replugging the MSP430 device which was the last device that was failing a specific ODM QA. This allows us to release a ton of dock firmware on the LVFS.
  • Fix a deadlock seen when calling libfwupd from QT programs. This was because we were calling a sync method from threads without a context, which we’ve now added.
  • In 1.5.0 we switched to the async libfwupd by default, and accidentally dropped the logic to only download the remote metadata as required. Most users only need to download the tiny .jcat file every day, and the much larger .xml.gz is only downloaded if the signature has changed in the last 24h. Of course, it’s all hitting the CDN, but it’s not nice to waste bandwidth for no reason.
  • As Snap is bundling libfwupd with gnome-software now, we had to restore recognizing GPG and PKCS7 signature types. This allows a new libfwupd to talk to an old fwupd daemon which is something we’d not expected before.
  • We’re also now setting the SMBIOS chassis type to portable if a DeviceTree battery exists, although I’d much rather see a ChassisType in the DT specification one day. This allows us to support HSI on platforms like the PineBook Pro, although the number of tests is still minimal without more buy-in from ARM.
  • We removed the HSI update and attestation suffixes; we decided they complicated the HSI specification and didn’t really fit in. Most users won’t even care and the spec is explicitly WIP so expect further changes like this in the future.
  • If you’re running 1.5.0 or 1.5.1 you probably want to update to this release now as it fixes a hard-to-debug hang we introduced in 1.5.0. If you’re running 1.4.x you might want to let the libcurl changes settle, although we’ve been using it without issue for more than a week on a ton of hardware here. Expect 1.5.3 in a few weeks time, assuming we’re all still alive by then. :)

    November 21, 2020

    Adding (very) preliminary support for C++ modules in Meson

    One of the most common questions people ask about Meson is why does it not yet have support for building C++ modules. Up until now the answer has been simple: no compiler really supports it yet. However Visual Studio has added sufficient functionality in their latest 2019 developer preview that an implementation in Meson has become feasible. The actual code can be found in this merge request for those brave enough to try it out.

    The basic problem with C++ modules is the same as with Fortran modules: you can no longer build source files in an arbitrary order. Instead you have to scan the contents of files, see what modules each source file generates and consumes and orchestrate the build so that all source files that produce modules are built before any source files that consume them. This requires dynamic dependency generation that has been added to Ninja only fairly recently.

    The original idea was that compiler toolchain vendors would provide scanner binaries because parsing C++ code is highly unreliable due to C preprocessor macro shenanigans. It turns out that a "toolchain provided" dependency scanner can not obtain all necessary data reliably, because it requires higher level knowledge about the project setup. This can only be done reliably by the build system. An alternative would be to pass all this information to the compiler/scanner via compiler flags but that turns out to be a terrible thing to define and maintain stable over changes. It also has the downside that you'd need to spawn a single process for every file, which is fairly slow on Windows. Meson's approach is to write a custom dependency scanner. Yes, it is based on regexes, so it is not 100% reliable but on the other hand you only need to spawn one process per build target (exe, shared lib, static lib) as opposed to one per source file.

    Still, the end result does work for simple projects. It does not handle things like module partitions but those can be added later. Even this simple project and test has brought about several notes and questions:

    • Where should the generated module files be put? In the target private dir? In a global dir? If the latter, what happens if two unrelated parts in the code base specify the same module?
    • Microsoft has not documented the module compiler flags and cl /? does not even list them. Because of this all module files get dumped to the build directory root.
    • Only ixx files are supported. VS does not enforce file name extensions. I would really want to enforce module file name extensions to only one. We can't change change legacy code and force everyone to use a single extension for C++ source, but we totally should do that for new ones. Having to support many file name extensions for the same thing is madness.
    Sadly I don't have any numbers on how much modules improve compilation speed. Feel free to try it out yourself, though. Bug reports and especially fixes are welcome.

    November 20, 2020

    Sugar Learning Tools – Part III

    Catching up

    The last couple of months have been hectic: more Sugar applications with Flatpak, quality of life improvements to Flatseal, tinkered with the idea of a Flathub-based OS for phones, showed case the tools to create modern native applications with JavaScript, shared my views on how Flatpak is reshaping the desktop’s future, teased the GNOME Circle at the Linux App Summit, planted the initial seeds to kickstart the GNOME community in my home country, started contributing to Flathub’s backend and other non-desktop projects (oh, yeah, and then my actual job 😂).

    For this post though, I would like to focus on the progress made to Sugar applications with Flatpak.

    Sugar Learning Tools

    During my GUADEC presentation I mentioned the things I wanted to achieve moving forward with this project. Primarily, improving the documentation and porting guide to ensure that this project can scale and, of course, port more art-oriented applications.

    For improving the documentation, I ported Sugar’s official “hello world” and re-wrote the porting guide based on this application. By using this minimal application as an example, it becomes much easier to highlight the key porting steps and concepts. I also took the opportunity to update the application itself to the latest version of the Sugar toolkit.

    As for porting new applications, I didn’t get to port as many art-oriented applications as I wished. Mostly due to the fact that most of those are still using GTK2 and Python2, so it would require more time I can afford at the moment. Nevertheless, I ported some pretty awesome ones.

    Paint is a playful painting application. It provides all the classic tools a young learner would expect, e.g. brushes, shapes, text, transformations, adding images, etc. Plus, it can enable sounds to make the experience extra funny.

    Pukllanapac is a Rubik’s Cube-like game. The objective is for the learner to exercise their problem solving skills. The application has different patterns for circles, triangles, and hexagons.

    Solar System is an interactive visualization. The goal is encourage the learner to know more about the planets and their moons. In words of the original developer of this application (@cvgarciarea) :

    It is my hope that this tool serves as a practical and interactive way to explore astronomy.

    It’s worth noticing that he was one of the early young Sugar hackers that showed up, when the Uruguayan government introduced OLPC nation wide.

    Swift Feet provides a series of entertaining physical activities. It encourages  the learner to stay active. Includes exercises and dances. This application was originally designed for OLPC Canada in collaboration with ParticipACTION Canada and Ophea organizations.

    Sugar Chess is a friendlier front-end for gnuchess. Its features go beyond and above other popular chess applications, e.g. allows the learner to customize the graphics, replay matches, use different modes, and more.

    Moving forward

    There’re still plenty of applications that need to be ported but, for the time being, I need to find a better way to monitor these twenty five applications that were ported so far.

    I have discussed the idea of a developer Dashboard with Flathub’s maintainers but, for the short term, I might end up hacking something quick for my own needs.

    Another imminent task is to update the BaseApp and all applications to the latest GNOME runtime, but this should be trivial. If anyone is interested in learning about packaging applications with Flatpak this task could be good a opportunity, so feel free to contact me!

     

    Scaling Flathub 100x

    Flatpak relies on OSTree to distribute apps. This means that flatpak repositories, such as Flathub, are really just OSTree repositories. At the core of an OSTree repository is the summary file, which describes the content of the repository.  This is similar to the metadata that “apt-get update” downloads.

    Every time you do an flatpak install it needs the information in the summary file. The file is cached between operations, but any time the repository changes the local copy needs to be updated.

    This can be pretty slow, with Flathub having around 1000 applications (times 4 architectures). In addition, the more applications there are, the more likely it is that one has been updated since the last time which means you need to update.

    This isn’t yet a major problem for Flathub, but its just a matter of time before it is, as apps keep getting added:

    This is particularly problematic if we want to add new architectures, as that multiplies the number of applications.

    So, the last month I’ve been working in OSTree and Flatpak to solve this by changing the flatpak repository format. Today I released Flatpak 1.9.2 which is the first version to support the new format, and Flathub is already serving it (and the old format for older clients).

    The new format is not only more efficient, it is also split by architecture meaning each user only downloads a subset of the total summary. Additionally there is a delta-based incremental update method for updating from a previous version.

    Here are some data for the latest Flathub summary:

    • Original summary: 6,6M  (1.8M compressed)
    • New (x86-64) summary: 2.7M (554k compressed)
    • Delta from previous summary: 20k

    So, if you’re able to use the delta, then it needs 100 times less network bandwidth compared to the original (compressed) summary and will be much faster.

    Also, this means we can finally start looking at supporting other architectures in Flathub, as doing so will not inconvenience users of the major architectures.

    To the future and beyond!

    November 19, 2020

    Glade Not Recommended

    If you are starting out with GTK development, you may have heard of a tool called Glade. Glade is a UI designer application for GTK projects, that allows you to create, modify, and preview UI files before writing code. In that sense, Glade is a very useful tool for GTK apps.

    With that said, I must implore that you do not use Glade.

    Why? Glade was built for it’s own format, before the advent of
    GtkBuilder. It does not know certain properties, does not know
    of certain features, and does not know modern GTK practices.

    Instead of ignoring what it does not know, it aggressively “corrects” it. Glade will re-write your UI files to its whims,
    including but not limited to:

    • Removing property bindings
    • Removing unknown properties
    • Adding child properties to GtkBox children everywhere
      • These are gone in GTK 4, which adds extra work to the porting process
    • Adding explicit GtkViewports that cause a double-border
    • Forcing minimum width and height on GtkListBoxRows
    • Re-arranging objects within the UI file
    • Changing the format of all properties in an edited UI file

    This makes Glade harmful in different ways. Your UI files will be bloated, due to Glade adding placeholders, child properties,
    and extra widgets where they aren’t needed. When you make a
    contribution to an app, you may be asked to re-do it by hand
    because Glade made unnecessary changes. When porting to GTK 4 you
    will need to do more work, as Glade doesn’t know to avoid
    deprecated widgets, properties, and child properties.

    All of these elements combine to make Glade more harmful than helpful. So please, do not use Glade. If you are working with UI
    files write them by hand. An editor with code folding and syntax
    highlighting can be very helpful. If you need to mock something
    up, you can try sketching on a piece of paper or using our
    mockup resources. Whatever you choose, don’t use Glade.

    Acer Aspire Switch 10 E SW3-016's and SW5-012's and S1002's horrible EFI firmware

    Recently I acquired an Acer Aspire Switch 10 E SW3-016, this device was the main reason for writing my blog post about the shim boot loop. The EFI firmware of this is bad in a number of ways:

    1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file.

    2. But it will actually boot EFI/Boot/bootx64.efi ! (wait what? yes really)

    3. It will only boot from an USB disk connected to its micro-USB connector, not from the USB-A connector on the keyboard-dock.

    4. You must first set a BIOS admin password before you can disable secure-boot (which is necessary to boot home-build kernels without doing your own signing)

    5. Last but not least it has one more nasty "feature", it detect if the OS being booted is Windows, Android or unknown and it updates the ACPI DSDT based in this!

    Some more details on the OS detection mis feature. The ACPI "Device (SDHB) node for the MMC controller connected to the SDIO wifi module contains:

            Name (WHID, "80860F14")
            Name (AHID, "INT33BB")


    Depending on what OS the BIOS thinks it is booting it renames one of these 2 to _HID. This is weird given that it will only boot if EFI/Microsoft/Boot/bootmgfw.efi exists, but it still does this. Worse it looks at the actual contents of EFI/Boot/bootx64.efi for this. It seems that that file must be signed, otherwise it goes in OS unknown mode and keeps the 2 above DSDT bits as is, so there is no _HID defined for the wifi's mmc controller and thus no wifi. I hit this issue when I replaced EFI/Boot/bootx64.efi with grubx64.efi to break the bootloop. grubx64.efi is not signed so the DSDT as Linux saw it contained the above AML code. Using the proper workaround for the bootloop from my previous blog post this bit of the DSDT morphes into:

            Name (_HID, "80860F14")
            Name (AHID, "INT33BB")


    And the wifi works.

    The Acer Aspire Switch 10 E SW3-016's firmware also triggers an actual bug / issue in Linux' ACPI implementation, causing the bluetooth to not work. This is discussed in much detail here. I have a patch series fixing this here.

    And the older Acer Aspire Switch 10 SW5-012's and S1002's firmware has some similar issues:

    1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file

    2. These models will actually always boot the EFI/Microsoft/Boot/bootmgfw.efi file, so that is somewhat more sensible.

    3. On the SW5-012 you must first set a BIOS admin password before you can disable secure-boot.

    4. The SW5-012 is missing an ACPI device node for the PWM controller used for controlling the backlight brightness. I guess that the Windows i915 gfx driver just directly pokes the registers (which are in a whole other IP block), rather then relying on a separate PWM driver as Linux does. Unfortunately there is no way to fix this, other then using a DSDT overlay. I have a DSDT overlay for the V1.20 BIOS and only for the v1.20 BIOS available for this here.

    Because of 1. and 2. you need to take the following steps to get Linux to boot on the Acer Aspire Switch 10 SW5-012 or the S1002:

    1. Rename the original bootmgfw.efi (so that you can chainload it in the multi-boot case)

    2. Replace bootmgfw.efi with shimia32.efi

    3. Copy EFI/fedora/grubia32.efi to EFI/Microsoft/Boot

    This assumes that you have the files from a 32 bit Windows install in your ESP already.

    November 18, 2020

    GSoD Weekly Summary 9 and 10

    Week 9 (9th Nov – 22nd Nov, 2020)

    This week was mostly all about finishing a task that I had taken up earlier during the last couple of weeks, which was about GNOME Calculator’s Operational modes. So, this week, I thought that rather than taking on a completely new task I should finish this one first as it was due for quite some time now.

    The idea here was to consolidate all documentation regarding the different operational modes of the calculator into a single section consisting of an overview page along with dedicated pages for each of the operational modes: Basic, Advanced, Financial, Programming and Keyboard modes. 

    I started working on this a few weeks back and spent quite some time on it however, I didn’t manage to finish it yet. There were quite a few reasons behind this including the fact that it required me to do a surprisingly huge amount of research work. Honestly speaking, some of the functionalities available within a few modes like the Financial and Programming modes involved quite a few mathematical concepts that I was quite new to and and none to very little knowledge about. 

    The first thing that I did in this regard was to create a plan of tackling this work as it involved a considerable amount of changes. The plan that I had in my mind was that I would first create a new overview page with an introduction to each of the different operational modes within it and to link it up with the existing pages. Refactor the existing pages on the  programming and financial modes. And finally create a new section on the index page for Operational Modes under which to consolidate all the documents. Once this was done I could next move onto creating new documents about the Basic, Advanced and Keyboard modes.

    Sounds simple right? What could be so difficult about it, I thought! If only I knew!!!! 

    As soon as I started working on this, it soon felt like I had opened up Pandora’s box itself! I couldn’t believe the amount of time and effort that I ended up spending behind just the initial part of the job, i.e. refactoring the existing pages, creating a new section, consolidating the pages under this section and creating an overview page. 

    It wasn’t the technical part of writing or refactoring the docs that was challenging, but rather the job of finding and deciding upon the correct vocabulary for each and every sentence in each of the documents. This became especially evident after I opened the MR for this first set of change. I immediately got a huge number of change suggestions for it. Everything from the page description to the use of screenshots came under the scanner. (psst… the MR is still under active review and I’m still implementing review suggestions! 😅). 

    I must admit that this MR has the longest list of change suggestions of all my MRs till date, and frankly, it’s turning out to be a great learning opportunity for me so far. I am getting to learn a lot about things like language conventions for page descriptions to the overall layout of a help document in general. As an example, one of the things that was quite an important learning for me here was the fact that including screenshots of an application within a help document for that application might not always be quite as productive. Since the user that would anyway be accessing the help docs for a certain mode would have at least seen the UI of that particular mode of the application when they invoke it in the first place itself.

    Overall, it’s been quite a fun week in terms of learning so far, and hopefully I would soon be able to close the loop on this particular set of work on GNOME Calculator! 🤞

    Ciao! See you next week! 😊 👋

    How to fix Linux EFI secure-boot shim bootloop issue

    How to fix the Linux EFI secure-boot shim bootloop issue seen on some systems.

    Quite a few Bay- and Cherry-Trail based systems have bad firmware which completely ignores any efibootmgr set boot options. They basically completely reset the boot order doing some sort of auto-detection at boot. Some of these even will given an error about their eMMC not being bootable unless the ESP has a EFI/Microsoft/Boot/bootmgfw.efi file!

    Many of these end up booting EFI/Boot/bootx64.efi unconditionally every boot. This will cause a boot loop since when Linux is installed EFI/Boot/bootx64.efi is now shim. When shim is started with a path of EFI/Boot/bootx64.efi, shim will add a new efibootmgr entry pointing to EFI/fedora/shimx64.efi and then reset. The goal of this is so that the firmware's F12 bootmenu can be used to easily switch between Windows and Linux (without chainloading which breaks bitlocker). But since these bad EFI implementations ignore efibootmgr stuff, EFI/Boot/bootx64.efi shim will run again after the reset and we have a loop.

    There are 2 ways to fix this loop:

    1. The right way: Stop shim from trying to add a bootentry pointing to EFI/fedora/shimx64.efi:

    rm EFI/Boot/fbx64.efi
    cp EFI/fedora/grubx64.efi EFI/Boot


    The first command will stop shim from trying to add a new efibootmgr entry (it calls fbx64.efi to do that for it) instead it will try to execute grubx64.efi from the from which it was executed, so we must put a grubx64.efi in the EFI/Boot dir, which the second command does. Do not use the livecd EFI/Boot/grubx64.efi file for this as I did at first, that searches for its config and env under EFI/Boot which is not what we want.

    Note that upgrading shim will restore EFI/Boot/fbx64.efi. To avoid this you may want to backup EFI/Boot/bootx64.efi, then do "sudo rpm -e shim-x64" and then restore the backup.

    2. The wrong way: Replace EFI/Boot/bootx64.efi with a copy of EFI/fedora/grubx64.efi

    This is how I used to do this until hitting the scenario which caused me to write this blog post. There are 2 problems with this:

    2a) This requires disabling secure-boot (which I could live with sofar)
    2b) Some firmwares change how they behave, exporting a different DSDT to the OS dependending on if EFI/Boot/bootx64.efi is signed or not (even with secure boot disabled) and their behavior is totally broken when it is not signed. I will post another rant ^W blogpost about this soon. For now lets just say that you should use workaround 1. from above since it simply is a better workaround.

    Note for better readability the above text uses bootx64, shimx64, fbx64 and grubx64 throughout. When using a 32 bit EFI (which is typical on Bay Trail systems) you should replace these with bootia32, shimia32, fbia32 and grubia32. Note 32 bit EFI Bay Trail systems should still use a 64 bit Linux distro, the firmware being 32 bit is a weird Windows related thing.

    Also note that your system may use another key then F12 to show the firmware's bootmenu.

    November 17, 2020

    Try GTK 4 Demos Now!

    You watched Matthias’s talk at Linux App Summit last week and you wish you could try the demos yourself? Excellent!

    During the Q&A section, Cassidy James asked if there was any Flatpak available, an hour after that we published a Nightly version of GTK4 Demo along with the Widget factory and the Icon Browser in GNOME’s Nightly repository.

    To try it all you need to do is the following:

    $ flatpak install https://nightly.gnome.org/repo/appstream/org.gtk.Demo4.flatpakref
    $ flatpak run org.gtk.Demo4

    The Flatpak is built by the GTK CI pipeline for on each commit so it will be updatable as time goes. You can try the most bleeding edge there is, all in its little isolated sandbox without worrying about messing with your system or having to go through the soul crashing process of figuring out how to build software.

    If you are curious how we do CI/CD with Flatpak, take a look at the Merge Request and the relevant wiki page.

    November 16, 2020

    Search Pinboard.in from GNOME Shell

    Pinboard.in is a bookmarking and archival website run by Maciej Ceglowski, also a noted public speaker, Antarctic explorer and political activist. I use Pinboard as a way to close browser tabs, by pretending to myself that I’ll one day revisit the 11,000 interesting links that I’ve bookmarked.

    Hoping to make better use of this expansive set of thought-provoking articles, hilarious videos and expired domain names, I wrote a minimal search provider for GNOME Shell.

    If you’re happy to meson install some Python scripts, then you can use it too! The search provider installs a systemd timer unit which checks for new bookmarks every hour, and downloads them into a Whoosh search index. I chose Whoosh because it’s fun to try new search engines, and my opinion is that it’s fast, powerful, and a little heavy on the disk space usage — my bookmarks take up 7MB in a JSON file but 17MB in the Whoosh index.

    A secondary purpose of this effort is to show how easy it is to make search providers. It took me about 8 hours to make this. Feel free to copy and paste for your own search providers, and use the CLI test tool in my desktop-search repo when testing them.

    What other services would you like to see integrated into GNOME Shell as search providers? Next on my list might be notes from Joplin.

    November 11, 2020

    GSoD Weekly Summary 8

    Week 8 (2nd Nov – 8th Nov, 2020)

    By the end of last week I had curated a list of the issues for GNOME contacts ready on which I need to work on. I also created an issue for that in the user-docs section on GitLab. Creating an issue and then working on it is a good practice as you can then reference all the Merge Requests you work on under the same issue. This allows not only the supervisor, but also everyone who is working on documentation for GNOME to get a clear idea about what you are working on and what is the next task you would be working on. Once you finish the task, you can mark it as “Done� and get started with the next one.

    So, when I finished working on my first issue last week I marked it “Doneâ€� and started working on the next one. I started with “Starting Contacts for the first timeâ€�  where there were few changes like, earlier the edit option was under the edit option but now it is an individual option available under “View moreâ€� – this needed to be changed. Also the icon of “View moreâ€� was missing and I thought it might be a little hard for a person who is completely new to GNOME Contacts to find the “View moreâ€� button and if I add an icon to it, it will be a lot more easier. So, I did that and some more changes as per the recent updates and sent the Merge request. Easy peasy.

    Similarly I made changes in the “Connect with your contactâ€� and  “Edit contact detailsâ€� pages and submitted the merge request. And with all this I managed to finish GNOME contacts. It was very quick to be honest. Another core application finished! 🙂

    Strike

    Personal activity statistics on gitlab.gnome.org

    ✔ GNOME Gitlab achievement unlocked

    ✘ Write something about ‘life-work balance’ here

    November 10, 2020

    Excursions: The Seven Bridges of Königsberg


    Making a little test with a new kind of blog post format where I explore some interesting fact related to geography and some history and fact around it.

    For this first one I was going to take a look at the mathematical problem involving the bridges of Königsberg (present day Kaliningrad).

    I have an old mathematical chart that I got from my father:



     
    I used to study this a lot growing up and I recently was a bit curious as to when it was actually issued. The chart was issued by a company called Original-Odhner (https://sv.wikipedia.org/wiki/Original-Odhner, there is no English article in Wikipedia for it…) and doing some more research I found a mention of this chart in an old issue of a technical magazine called “Teknisk Tidskrift” from 1959 (http://runeberg.org/tektid/1959/0026.html) where the chart is mentioned as being recently published, so that should put it around 1959, or perhaps 1958.
     
    In any case, up near the top left corner there is a description of this topological problem outlined above:
     

      So, the subject of this post is the problem with the seven bridges of Königsberg in what was back in Leonard Euler's days in the 18th century Prussia (nowadays Kaliningrad, today a Russian enclave surrounded by Lithuania and Poland). In those days the two islands of city was connected by seven bridges and a common passtime of the recidents was to try to come up with a way of performing a walk crossing all the bridges, but only cross each bridge of time (https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg)
     
    Euler managed to prove in 1736 that this is a mathematically impossible problem. Part of the solution involved the abstraction of the problem into one of a pure graph. Thus ignoring the choice of paths in areas on land and the islands, between the bridges. Being irrelevant (for the solution of this specific problem). And this is a very important observation. The same principle applies for the algorithms we use today for finding shortest or quickest routes. By treating the way segments as edges in a graph with weights (corresponding to e.g. distance, time required, or other significant parameters ranking preferred characteristics) ignoring things like the actual bends and twists between each intersection where a choice of direction needs to be taken.
     
    To conclude this, we could take a virtual excursion to present-day Kaliningrad.
     

      The central island that the town center in Euler's days, is now a park (it was devastated during WWII)


    Of the original seven bridges of the old puzzel, five remain in their position today (only two seems to be the original ones from 1736). And incidentally one can actually make an equivalent walk today, using the present day bridges.


    Hope you liked this form of post, and maybe somebody has ideas for other places and stories to explore!

    November 08, 2020

    Evolution added support for EteSync-2.0

    Welcome everyone, in the past weeks, EteSync made some major announcements one of them is releasing EteSync-2.0 which brings new features and made major improvements over EteSync-1.0, you can read more about what the new version brings and how to migrate your EteSync account to EteSync-2.0 from their blog post here.

    And now I am happy to announce that the Evolution-EteSync module has been upgraded to use the new EteSync-2.0 protocol which is built under Etebase and the module is now under GNOME repo, you can check it here.

    This guide will show you how to install the module and how to use it with Evolution, most of the things are as before except that now, you don’t need to enter an encryption password plus other major performance boost.

    Contents

    Installing the module

    Pre-built packages

    Arch:

    Fedora/CentOS (COPR):

    • dnf copr enable daftaupe/etesync-rs
    • dnf install libetebase
    • dnf install evolution-etesync

    Install from sources

    Note: You can head to the community chat if you faced any issues during the installation of the module.

    Adding EteSync account

    After installing the module, obviously you’ll need an EteSync account, if you don’t have one, you can create one from EteSync website.

    The steps are very simple.

    1. click on the arrow next to “New” button
    2. choose “Collection Account”
    3. Enter your email/username.
    4. Choose “Look up for an EteSync account”
    5. Enter your password.

    Adding data inside a collection

    In this example I am adding an appointment to a calendar of mine.

    Adding new collection

    EteSync supports three types of collections (Address-book, Calendars and Task lists) and recetly they added Etesync notes (not yet supported in Evolution).
    There are two ways to add new collections in Evolution.

    First is from “New” menu
    Second is choosing category from the bottom-left area (here it is Contacts), right clicking on the account and choose “New address-book/calendar/task list”

    Rename or change a collection color

    Just right click on the collection (address-book, calendar or task list), select properties then you can rename or change color, then click ok.

    Running your own instance (self-host)

    This part is a little advanced. Since EteSync has it’s code open-source so anyone can use the server code to easily self-host his own server but it comes with less benefit, however, if you wish to do so, please follow the instructions here.

    After you have successfully set up your own instance, and verified it works by connecting to it from the browser, you can follow the steps do add an account while hosting your own server.

    In this example I am self hosting the server with my localhost url on port 8000 (http://127.0.0.1:8000)

    Hope these instruction help 😀

    November 06, 2020

    Learning meson

    Last Friday at Red Hat the fourth Day of Learning happened. This time I picked the meson build system. More and more projects have switched to it, like systemd more than 3 years ago, or most of GNOME. Back then I was really impressed by how much faster a systemd build became with meson – but now I actually want to learn it, peek behind the curtain, be able to contribute to projects that use it, and to know if a conversion makes sense.

    November 02, 2020

    New fwupd 1.5.1 release

    Hot on the heels of 1.5.0, I’ve just tagged and uploaded fwupd 1.5.1. Most importantly, if fixes the regression we recently included for an as-yet-unnamed OEM who wants to ship dock firmware. Any day now, I promise.

    Other interesting things we fixed:

    • Delete unused EFI variables when deploying firmware — which frees up a lot of space if you’ve ever enabled the fwupdx64.efi debugging…
    • Fix duplicate probe warning for the Logitech Unifying device — which was really cosmetic, but wasting resources is never nice.
    • Include the amount of NVRAM size in use in the LVFS failure report — which will might let us explain some of the dbx updates failing.
    • Make bcm57xx hotplug more reliable — although uncommon to hotplug PCI devices, using an eGPU enclosure (like I do for the device tests!) this needs to work!
    • Recognize authorized ThunderBolt value of 2 — which we found in the wild recently.
    • Remove the duplicate parent-child data in FwupdDevice and FuDevice — although not strictly a bugfix, duplicating this data made no sense and caused confusion.
    • Use UDisks to find out if swap files and devices are encrypted — which further adds more code depending on UDisks. I’ve added a Recommends: udisks2 in the Fedora package, but see the wiki if you’re running a minimal system.

    As before, Fedora 33 and 32 updates in the usual places.

    October 31, 2020

    Spooky GTG features to try out for Halloween 2020

    Are you an irresistible creature with an insatiable love for the dead… bugs? Well, grab your bug hunter crossbow, because we need you to test some big technological changes in GTG so that we can confidently release version 0.5 sometime soon (way before the year end, ideally).

    Synchronizing your tasks across devices using CalDAV

    A new synch backend/plugin was developed by Mildred during the summer. I think that’s super cool and it will probably be useful to many of you! I will now let this thematic tweet speak for itself (so that you can retweet it too today):

    It would be great to have many adventuring early-adopters try out this feature. See the details (and report issues) in this ticket. This has not yet been merged into the main branch of the Git version of GTG, as I would like to have more people testing it before it is included by default in what will become a stable GTG release. Please indicate your success or failures by voting on this comment.

    Changes related to the local file format data storage

    In an attempt to rule out the possibility of our XML file format being the cause of GTG’s performance problems that occur when you have multiple hundreds of tasks (I have a thousand), Diego ported GTG’s XML storage backend to LXML, which is reportedly 40 times faster.

    • Unfortunately, that was not the silver bullet; the performance issues remained, but at least now we can rule out XML (and the task editor, see below) as suspects, and we can move on to a different part of the problem.
    • We suspect our performance woes have something to do with sorting in liblarch, or perhaps deeper: maybe it’s all GtkTreeView/GtkTreeModel’s fault? Who knows.
    • Do you take pride in optimizing the sh!t out of GTK applications? Do you have experience in duck-taping a jetpack onto a turtle? Join the party! Your help is more than welcome in investigating our performance problems and devising potential fixes. We will owe you beer and candy.

    In addition to the LXML port, another area where Diego brought down the might of his code refactoring skills is the new “task editor” backend and view renderer, which includes a bunch of new features and refinements such as support for subheadings (with the “#” markdown syntax), the use of GTK checkboxes to represent subtasks, tag highlight colors that match the tag’s color, and a bunch of bugfixes. It was also envisioned as a way to address one of the performance issues I theorized about. This is what it looks like:

    New taskview screenshot

    And as if that wasn’t enough, Diego set out to redesign the XML file format from a new specification devised with the help of Brent Saner (proposal episodes 1, 2 and 3). The new file format is not yet merged into master, and we consider it quite experimental. If you’re feeling courageous and are looking for a bug hunting challenge, back up your data (really) before trying it out with the “file_format_2” branch (which you can see in the current list of branches).

    As you can imagine, Diego’s changes are all major, invasive technological changes (particularly the proposed file format change), and they would benefit from extensive testing by everybody before 0.5 happens. I’ve done some pretty exhaustive testing before he merged his first two branches of course, but I’m only human, and it is possible that issues might remain. To make testing easier, a snapshot of the current git version has been packaged as a Flatpak package, available here. As for the third (file format) branch, we’d like to have more people testing it with copies of “real data” before we feel confident about merging it, so you’ll have to build the git version yourself instead of using the flatpak. So please grab GTG’s git version ASAP, with a copy of your data (see the instructions in the README, including the “Where is my user data and config stored?” section), and torture-test it to make sure everything is working properly, and report issues you may find (if any).

    Other ongoing improvements

    There have been countless bug fixes and improvements going on in recent months, here are a few that I can remember from the top of my head:

    • The main window’s previously active tab/mode is now restored when you launch the application.
    • Task windows no longer all reopen on launch when you shutdown the PC without closing GTG first.
    • There is a setting in GTG’s preference now to specify whether you are using a dark theme or not, so that GTG can adapt its colors accordingly. From what I am told, it seems GNOME still doesn’t have a mechanism for applications to know for certain if the user’s system is using a dark theme, or to connect to a signal when this changes…
    • Izidor Matusov made a surprise appearance to nuke the pyxdg dependency from orbit
    • Diego landed the new icon, along with a nighty version of it. I mourn the pixel-perfectness of the oldschool bitmap icon, but there are worse things going on in 2020 ;)
    • The about dialog now properly shows the application icon
    • Parent tasks automatically close their window when you click to open a subtask, which makes that consistent with closing the subtask’s window when you click “Open parent”
    • Neui started by working on the German translation, then quickly got addicted and started fixing a bunch of papercuts in the code! Among other things, Neui removed dbus-python in favor of gdbus, made the plugins’ “about” text translatable, documented the contribution process for translators, and various other things.
    • New Swedish and Hungarian translation updates have been contributed!
    • Francisco Lavin fixed the Hamster plugin! Productivity metric rodents rejoice! Please test this thing.
    • Recurring tasks. Holy shit.

    More great things to come.


    P.s.: I realized today that I totally forgot to mention this in the context of GTG: I have a personal news announcement mailing list where I send out notifications about my new blog posts (with a short hand-crafted summary), as courtesy to my readers. Over the past 12 months, I have sent 5 such emails (so, not exactly a deluge), and they have so far all had something to do with GTG. If you’re afraid of missing some news about GTG, or don’t have time to be following every ticket and every tweet, feel free to subscribe to my mailing list. Details and subscription form can be found here (that page sucks, I know, I need to write something simpler).

    October 29, 2020

    The Art of (Not) Painting Pixels

    Being a compositor and a compositing window manager, the most important aspect Mutter and GNOME Shell is to paint pixels to your monitors with relevant content. A large part of this content is provided by applications themselves, but many elements still need to be rendered on top of them.

    Over the past few years, Mutter’s codebase has slowly but steadily been refactored, cleaned up, reorganized, and modernized.  This includes the internal copies of Clutter and Cogl. With the beginning of the GNOME 40 development cycle, it all converged in a specially large and exciting set of changes which we’ll be talking about in this article.

    Camera

    At its core, Mutter inherits the rendering routines of Clutter and Cogl. Cogl itself is a layer above OpenGL / GLX / EGL / GLES, which is what it ultimately boils down to.

    An interesting aspect of Clutter is that, despite seeming like a 2D toolkit, it actually renders actors in a 3D space. This is what allows effects like moving, scaling, skewing, and rotating actors.

    Clutter uses a traditional perspective projection when rendering. In practice, this mimics the real world where whatever is closer appears bigger, and what’s farther appears smaller. GNOME Shell (the “stage”) itself is rendered in a well-defined position inside this 3D space: the Z-2D plane.

     

    The view cone starting from the camera; the Z-near and Z-far regions (in red); and the Z-2D plane in the middle

    The camera in Clutter is an immutable, implicit camera positioned at the center of the 3D world, and many of the traditional techniques used in rendering engines also apply here, such as the Z-near and Z-far planes (in the illustration above, the planes delimiting the blue region of the view).

    Lights

    An interesting technique that is familiar to game developers is ray casting. In our case, ray casting is used to detect which element on screen is beneath the cursor, a process known as picking.

    Since times immemorial, Clutter used painting for picking. Actors would paint themselves on a frame buffer, each actor with a different color, and Clutter would read a single pixel of the resulting picture. Cogl had several optimizations to try and not send these operations to the GPU through a journaling mechanism, but due to various constraints of how Clutter worked, these optimized code paths weren’t hit very often.

    With the 3.34 release, Clutter moved away from this model of picking, and started using geometric picking. Not touching the GPU while picking, and (most importantly) not downloading GPU contents, was a considerable improvement. However, it still involved quite a bit of vertex projection, a relatively heavy operation that should be avoided when possible.

    With Mutter 40, we are going a step further and implementing ray-based picking.

    The well established technique of raycast throws a virtual “ray of light” into the scene, and does a hit test against a list of rectangles to figure out which ones are touched by the ray.

    One fundamental aspect of ray-based picking is that it can project vertices only when necessary to perform the hit test.  In best-case scenarios, it only projects a single group of vertices. In worst-case scenarios, it projects as much as it used to do. Furthermore, by using the graphene_ray_t helper, we also benefit from vectorized operations. These facts led to a measurable reduction in how much time it takes for each pick operation to be executed, and the relative percentage of the frame times it takes.

    These refactorings of the pick code enabled another improvement: pick culling. Much like culling out actors when painting, we can now cull out actors when picking. Initial profiling sessions show that, together, these improvements massively reduce pick times.

    Action!

    While painting what you see on your screen is a fundamental aspect of GNOME Shell, figuring out what not to paint is almost as important!

    The process of “culling out” elements from the rendering pipeline is an important optimization. When you hover a button inside an application, for example, it is common that the application will only redraw that specific button, and will tell Mutter which region of it needs to be updated on screen. This allows Mutter to only redraw the region of the screen that should be updated on the next frame.

    We’ve written about this feature (regional clipping) before, in a previous article. However, this cycle, we introduced an important improvement on top of it: clip frusta.

    Regions are transformed to 3D objects called frustum

    Despite the intimidating name, the concept is familiar: much like a clipping region is a set of 2D rectangles representing which parts of the screen contents must be updated, the clip frusta is a set of 3D volumes (frustum) representing which slices of the 3D space must be updated.

    These volumes are built by projecting the 2D clip regions in the 2D plane, and extending them all the way to the camera, and also all the way against the camera. They are also delimited by the Z-near and Z-far planes.

     

    By representing clips in 3D space, we can avoid projecting 3D actors in 2D planes over and over, which is yet another significant optimization of the rendering process.

    Thanks to Graphene, we have access to graphene_frustum_t and various routines to operate on it. This allowed us to simultaneously optimize rendering, and remove code!

    Carbon-based Compositor

    You may have noticed that we’ve mentioned Graphene multiple time throughout this article. Thanks to this fancy library, a set of data structures and algorithms is readily available for us to use.

    In addition to frustums and rays, Graphene offers methods and data types for 2-, 3-, and 4-dimensional vectors, boxes, rectangles, triangles, and various onlyothers. Despite using various types from Graphene, only recently Mutter moved away from its own internal implementation of matrices (CoglMatrix) to graphene_matrix_t, which allowed for another 4-digit code cleanup.

    What’s Next

    Now that we’ve dropped many core components of Mutter, Clutter, and Cogl, in favor of their Graphene counterparts, it is possible to think about other improvements, specially in the rendering pipeline itself.

    There already is preliminary work improving the painting routines to use paint nodes more extensively, porting effects to this new API, input processing improvements, and others that directly or indirectly benefit from using Graphene. Stay tuned!

    Huge thanks to Jakub Steiner for providing the assets used in this article. Parts of this article were turned into documentation in Mutter’s Wiki.

    Thu 2020/Oct/29

    In this line of work, we all stumble at least once upon a problem that turns out to be extremely elusive and very tricky to narrow down and solve. If we're lucky, we might have everything at our disposal to diagnose the problem but sometimes that's not the case – and in embedded development it's often not the case. Add to the mix proprietary drivers, lack of debugging symbols, a bug that's very hard to reproduce under a controlled environment, and weeks in partial confinement due to a pandemic and what you have is better described as a very long lucid nightmare. Thankfully, even the worst of nightmares end when morning comes, even if sometimes morning might be several days away. And when the fix to the problem is in an inimaginable place, the story is definitely one worth telling.

    The problem

    It all started with one of Igalia's customers deploying a WPE WebKit-based browser in their embedded devices. Their CI infrastructure had detected a problem caused when the browser was tasked with creating a new webview (in layman terms, you can imagine that to be the same as opening a new tab in your browser). Occasionally, this view would never load, causing ongoing tests to fail. For some reason, the test failure had a reproducibility of ~75% in the CI environment, but during manual testing it would occur with less than a 1% of probability. For reasons that are beyond the scope of this post, the CI infrastructure was not reachable in a way that would allow to have access to running processes in order to diagnose the problem more easily. So with only logs at hand and less than a 1/100 chances of reproducing the bug myself, I set to debug this problem locally.

    Diagnosis

    The first that became evident was that, whenever this bug would occur, the WebKit feature known as web extension (an application-specific loadable module that is used to allow the program to have access to the internals of a web page, as well to enable customizable communication with the process where the page contents are loaded – the web process) wouldn't work. The browser would be forever waiting that the web extension loads, and since that wouldn't happen, the expected page wouldn't load. The first place to look into then is the web process and to try to understand what is preventing the web extension from loading. Enter here, our good friend GDB, with less than spectacular results thanks to stripped libraries.

    #0  0x7500ab9c in poll () from target:/lib/libc.so.6
    #1  0x73c08c0c in ?? () from target:/usr/lib/libEGL.so.1
    #2  0x73c08d2c in ?? () from target:/usr/lib/libEGL.so.1
    #3  0x73c08e0c in ?? () from target:/usr/lib/libEGL.so.1
    #4  0x73bold6a8 in ?? () from target:/usr/lib/libEGL.so.1
    #5  0x75f84208 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #6  0x75fa0b7e in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #7  0x7561eda2 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #8  0x755a176a in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #9  0x753cd842 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #10 0x75451660 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #11 0x75452882 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #12 0x75452fa8 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #13 0x76b1de62 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #14 0x76b5a970 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #15 0x74bee44c in g_main_context_dispatch () from target:/usr/lib/libglib-2.0.so.0
    #16 0x74bee808 in ?? () from target:/usr/lib/libglib-2.0.so.0
    #17 0x74beeba8 in g_main_loop_run () from target:/usr/lib/libglib-2.0.so.0
    #18 0x76b5b11c in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #19 0x75622338 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
    #20 0x74f59b58 in __libc_start_main () from target:/lib/libc.so.6
    #21 0x0045d8d0 in _start ()

    From all threads in the web process, after much tinkering around it slowly became clear that one of the places to look into is that poll() call. I will spare you the details related to what other threads were doing, suffice to say that whenever the browser would hit the bug, there was a similar stacktrace in one thread, going through libEGL to a call to poll() on top of the stack, that would never return. Unfortunately, a stripped EGL driver coming from a proprietary graphics vendor was a bit of a showstopper, as it was the inability to have proper debugging symbols running inside the device (did you know that a non-stripped WebKit library binary with debugging symbols can easily get GDB and your device out of memory?). The best one could do to improve that was to use the gcore feature in GDB, and extract a core from the device for post-mortem analysis. But for some reason, such a stacktrace wouldn't give anything interesting below the poll() call to understand what's being polled here. Did I say this was tricky?

    What polls?

    Because WebKit is a multiprocess web engine, having system calls that signal, read, and write in sockets communicating with other processes is an everyday thing. Not knowing what a poll() call is doing and who is it that it's trying to listen to, not very good. Because the call is happening under the EGL library, one can presume that it's graphics related, but there are still different possibilities, so trying to find out what is this polling is a good idea.

    A trick I learned while debugging this is that, in absence of debugging symbols that would give a straightforward look into variables and parameters, one can examine the CPU registers and try to figure out from them what the parameters to function calls are. Let's do that with poll(). First, its signature.

    int poll(struct pollfd *fds, nfds_t nfds, int timeout);

    Now, let's examine the registers.

    (gdb) f 0
    #0  0x7500ab9c in poll () from target:/lib/libc.so.6
    (gdb) info registers
    r0             0x7ea55e58	2124766808
    r1             0x1	1
    r2             0x64	100
    r3             0x0	0
    r4             0x0	0

    Registers r0, r1, and r2 contain poll()'s three parameters. Because r1 is 1, we know that there is only one file descriptor being polled. fds is a pointer to an array with one element then. Where is that first element? Well, right there, in the memory pointed to directly by r0. What does struct pollfd look like?

    struct pollfd {
      int   fd;         /* file descriptor */
      short events;     /* requested events */
      short revents;    /* returned events */
    };

    What we are interested in here is the contents of fd, the file descriptor that is being polled. Memory alignment is again in our side, we don't need any pointer arithmetic here. We can inspect directly the register r0 and find out what the value of fd is.

    (gdb) print *0x7ea55e58
    $3 = 8

    So we now know that the EGL library is polling the file descriptor with an identifier of 8. But where is this file descriptor coming from? What is on the other end? The /proc file system can be helpful here.

    # pidof WPEWebProcess
    1944 1196
    # ls -lh /proc/1944/fd/8
    lrwx------    1 x x      64 Oct 22 13:59 /proc/1944/fd/8 -> socket:[32166]
    

    So we have a socket. What else can we find out about it? Turns out, not much without the unix_diag kernel module, which was not available in our device. But we are slowly getting closer. Time to call another good friend.

    Where GDB fails, printf() triumphs

    Something I have learned from many years working with a project as large as WebKit, is that debugging symbols can be very difficult to work with. To begin with, it takes ages to build WebKit with them. When cross-compiling, it's even worse. And then, very often the target device doesn't even have enough memory to load the symbols when debugging. So they can be pretty useless. It's then when just using fprintf() and logging useful information can simplify things. Since we know that it's at some point during initialization of the web process that we end up stuck, and we also know that we're polling a file descriptor, let's find some early calls in the code of the web process and add some fprintf() calls with a bit of information, specially in those that might have something to do with EGL. What can we find out now?

    Oct 19 10:13:27.700335 WPEWebProcess[92]: Starting
    Oct 19 10:13:27.720575 WPEWebProcess[92]: Initializing WebProcess platform.
    Oct 19 10:13:27.727850 WPEWebProcess[92]: wpe_loader_init() done.
    Oct 19 10:13:27.729054 WPEWebProcess[92]: Initializing PlatformDisplayLibWPE (hostFD: 8).
    Oct 19 10:13:27.730166 WPEWebProcess[92]: egl backend created.
    Oct 19 10:13:27.741556 WPEWebProcess[92]: got native display.
    Oct 19 10:13:27.742565 WPEWebProcess[92]: initializeEGLDisplay() starting.

    Two interesting findings from the fprintf()-powered logging here: first, it seems that file descriptor 8 is one known to libwpe (the general-purpose library that powers the WPE WebKit port). Second, that the last EGL API call right before the web process hangs on poll() is a call to eglInitialize(). fprintf(), thanks for your service.

    Number 8

    We now know that the file descriptor 8 is coming from WPE and is not internal to the EGL library. libwpe gets this file descriptor from the UI process, as one of the many creation parameters that are passed via IPC to the nascent process in order to initialize it. Turns out that this file descriptor in particular, the so-called host client file descriptor, is the one that the freedesktop backend of libWPE, from here onwards WPEBackend-fdo, creates when a new client is set to connect to its Wayland display. In a nutshell, in presence of a new client, a Wayland display is supposed to create a pair of connected sockets, create a new client on the Display-side, give it one of the file descriptors, and pass the other one to the client process. Because this will be useful later on, let's see how is that currently implemented in WPEBackend-fdo.

        int pair[2];
        if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, pair) < 0)
            return -1;
    
        int clientFd = dup(pair[1]);
        close(pair[1]);
    
        wl_client_create(m_display, pair[0]);
    

    The file descriptor we are tracking down is the client file descriptor, clientFd. So we now know what's going on in this socket: Wayland-specific communication. Let's enable Wayland debugging next, by running all relevant process with WAYLAND_DEBUG=1. We'll get back to that code fragment later on.

    A Heisenbug is a Heisenbug is a Heisenbug

    Turns out that enabling Wayland debugging output for a few processes is enough to alter the state of the system in such a way that the bug does not happen at all when doing manual testing. Thankfully the CI's reproducibility is much higher, so after waiting overnight for the CI to continuously run until it hit the bug, we have logs. What do the logs say?

    WPEWebProcess[41]: initializeEGLDisplay() starting.
      -> wl_display@1.get_registry(new id wl_registry@2)
      -> wl_display@1.sync(new id wl_callback@3)
    

    So the EGL library is trying to fetch the Wayland registry and it's doing a wl_display_sync() call afterwards, which will block until the server responds. That's where the blocking poll() call comes from. So, it turns out, the problem is not necessarily on this end of the Wayland socket, but perhaps on the other side, that is, in the so-called UI process (the main browser process). Why is the Wayland display not replying?

    The loop

    Something that is worth mentioning before we move on is how the WPEBackend-fdo Wayland display integrates with the system. This display is a nested display, with each web view a client, while it is itself a client of the system's Wayland display. This can be a bit confusing if you're not very familiar with how Wayland works, but fortunately there is good documentation about Wayland elsewhere.

    The way that the Wayland display in the UI process of a WPEWebKit browser is integrated with the rest of the program, when it uses WPEBackend-fdo, is through the GLib main event loop. Wayland itself has an event loop implementation for servers, but for a GLib-powered application it can be useful to use GLib's and integrate Wayland's event processing with the different stages of the GLib main loop. That is precisely how WPEBackend-fdo is handling its clients' events. As discussed earlier, when a new client is created a pair of connected sockets are created and one end is given to Wayland to control communication with the client. GSourceFunc functions are used to integrate Wayland with the application main loop. In these functions, we make sure that whenever there are pending messages to be sent to clients, those are sent, and whenever any of the client sockets has pending data to be read, Wayland reads from them, and to dispatch the events that might be necessary in response to the incoming data. And here is where things start getting really strange, because after doing a bit of fprintf()-powered debugging inside the Wayland-GSourceFuncs functions, it became clear that the Wayland events from the clients were never dispatched, because the dispatch() GSourceFunc was not being called, as if there was nothing coming from any Wayland client. But how is that possible, if we already know that the web process client is actually trying to get the Wayland registry?

    To move forward, one needs to understand how the GLib main loop works, in particular, with Unix file descriptor sources. A very brief summary of this is that, during an iteration of the main loop, GLib will poll file descriptors to see if there are any interesting events to be reported back to their respective sources, in which case the sources will decide whether to trigger the dispatch() phase. A simple source might decide in its dispatch() method to directly read or write from/to the file descriptor; a Wayland display source (as in our case), will call wl_event_loop_dispatch() to do this for us. However, if the source doesn't find any interesting events, or if the source decides that it doesn't want to handle them, the dispatch() invocation will not happen. More on the GLib main event loop in its API documentation.

    So it seems that for some reason the dispatch() method is not being called. Does that mean that there are no interesting events to read from? Let's find out.

    System call tracing

    Here we resort to another helpful tool, strace. With strace we can try to figure out what is happening when the main loop polls file descriptors. The strace output is huge (because it takes easily over a hundred attempts to reproduce this), but we know already some of the calls that involve file descriptors from the code we looked at above, when the client is created. So we can use those calls as a starting point in when searching through the several MBs of logs. Fast-forward to the relevant logs.

    socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [128, 130]) = 0
    dup(130)               = 131
    close(130)             = 0
    fcntl64(128, F_DUPFD_CLOEXEC, 0) = 130
    epoll_ctl(34, EPOLL_CTL_ADD, 130, {EPOLLIN, {u32=1639599928, u64=1639599928}}) = 0

    What we see there is, first, WPEBackend-fdo creating a new socket pair (128, 130) and then, when file descriptor 130 is passed to wl_client_create() to create a new client, Wayland adds that file descriptor to its epoll() instance for monitoring clients, which is referred to by file descriptor 34. This way, whenever there are events in file descriptor 130, we will hear about them in file descriptor 34.

    So what we would expect to see next is that, after the web process is spawned, when a Wayland client is created using the passed file descriptor and the EGL driver requests the Wayland registry from the display, there should be a POLLIN event coming in file descriptor 34 and, if the dispatch() call for the source was called, a epoll_wait() call on it, as that is what wl_event_loop_dispatch() would do when called from the source's dispatch() method. But what do we have instead?

    poll([{fd=30, events=POLLIN}, {fd=34, events=POLLIN}, {fd=59, events=POLLIN}, {fd=110, events=POLLIN}, {fd=114, events=POLLIN}, {fd=132, events=POLLIN}], 6, 0) = 1 ([{fd=34, revents=POLLIN}])
    recvmsg(30, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    

    strace can be a bit cryptic, so let's explain those two function calls. The first one is a poll in a series of file descriptors (including 30 and 34) for POLLIN events. The return value of that call tells us that there is a POLLIN event in file descriptor 34 (the Wayland display epoll() instance for clients). But unintuitively, the call right after is trying to read a message from socket 30 instead, which we know doesn't have any pending data at the moment, and consequently returns an error value with an errno of EAGAIN (Resource temporarily unavailable).

    Why is the GLib main loop triggering a read from 30 instead of 34? And who is 30?

    We can answer the latter question first. Breaking on a running UI process instance at the right time shows who is reading from the file descriptor 30:

    #1  0x70ae1394 in wl_os_recvmsg_cloexec (sockfd=30, msg=msg@entry=0x700fea54, flags=flags@entry=64)
    #2  0x70adf644 in wl_connection_read (connection=0x6f70b7e8)
    #3  0x70ade70c in read_events (display=0x6f709c90)
    #4  wl_display_read_events (display=0x6f709c90)
    #5  0x70277d98 in pwl_source_check (source=0x6f71cb80)
    #6  0x743f2140 in g_main_context_check (context=context@entry=0x2111978, max_priority=, fds=fds@entry=0x6165f718, n_fds=n_fds@entry=4)
    #7  0x743f277c in g_main_context_iterate (context=0x2111978, block=block@entry=1, dispatch=dispatch@entry=1, self=)
    #8  0x743f2ba8 in g_main_loop_run (loop=0x20ece40)
    #9  0x00537b38 in ?? ()

    So it's also Wayland, but on a different level. This is the Wayland client source (remember that the browser is also a Wayland client?), which is installed by cog (a thin browser layer on top of WPE WebKit that makes writing browsers easier to do) to process, among others, input events coming from the parent Wayland display. Looking at the cog code, we can see that the wl_display_read_events() call happens only if GLib reports that there is a G_IO_IN (POLLIN) event in its file descriptor, but we already know that this is not the case, as per the strace output. So at this point we know that there are two things here that are not right:

    1. A FD source with a G_IO_IN condition is not being dispatched.
    2. A FD source without a G_IO_IN condition is being dispatched.

    Someone here is not telling the truth, and as a result the main loop is dispatching the wrong sources.

    The loop (part II)

    It is at this point that it would be a good idea to look at what exactly the GLib main loop is doing internally in each of its stages and how it tracks the sources and file descriptors that are polled and that need to be processed. Fortunately, debugging symbols for GLib are very small, so debugging this step by step inside the device is rather easy.

    Let's look at how the main loop decides which sources to dispatch, since for some reason it's dispatching the wrong ones. Dispatching happens in the g_main_dispatch() method. This method goes over a list of pending source dispatches and after a few checks and setting the stage, the dispatch method for the source gets called. How is a source set as having a pending dispatch? This happens in g_main_context_check(), where the main loop checks the results of the polling done in this iteration and runs the check() method for sources that are not ready yet so that they can decide whether they are ready to be dispatched or not. Breaking into the Wayland display source, I know that the check() method is called. How does this method decide to be dispatched or not?

        [](GSource* base) -> gboolean
        {
            auto& source = *reinterpret_cast(base);
            return !!source.pfd.revents;
        },

    In this lambda function we're returning TRUE or FALSE, depending on whether the revents field in the GPollFD structure have been filled during the polling stage of this iteration of the loop. A return value of TRUE indicates the main loop that we want our source to be dispatched. From the strace output, we know that there is a POLLIN (or G_IO_IN) condition, but we also know that the main loop is not dispatching it. So let's look at what's in this GPollFD structure.

    For this, let's go back to g_main_context_check() and inspect the array of GPollFD structures that it received when called. What do we find?

    (gdb) print *fds
    $35 = {fd = 30, events = 1, revents = 0}
    (gdb) print *(fds+1)
    $36 = {fd = 34, events = 1, revents = 1}

    That's the result of the poll() call! So far so good. Now the method is supposed to update the polling records it keeps and it uses when calling each of the sources check() functions. What do these records hold?

    (gdb) print *pollrec->fd
    $45 = {fd = 19, events = 1, revents = 0}
    (gdb) print *(pollrec->next->fd)
    $47 = {fd = 30, events = 25, revents = 1}
    (gdb) print *(pollrec->next->next->fd)
    $49 = {fd = 34, events = 25, revents = 0}

    We're not interested in the first record quite yet, but clearly there's something odd here. The polling records are showing a different value in the revent fields for both 30 and 34. Are these records updated correctly? Let's look at the algorithm that is doing this update, because it will be relevant later on.

      pollrec = context->poll_records;
      i = 0;
      while (pollrec && i < n_fds)
        {
          while (pollrec && pollrec->fd->fd == fds[i].fd)
            {
              if (pollrec->priority <= max_priority)
                {
                  pollrec->fd->revents =
                    fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
                }
              pollrec = pollrec->next;
            }
    
          i++;
        }

    In simple words, what this algorithm is doing is to traverse simultaneously the polling records and the GPollFD array, updating the polling records revents with the results of polling. From reading how the pollrec linked list is built internally, it's possible to see that it's purposely sorted by increasing file descriptor identifier value. So the first item in the list will have the record for the lowest file descriptor identifier, and so on. The GPollFD array is also built in this way, allowing for a nice optimization: if more than one polling record – that is, more than one polling source – needs to poll the same file descriptor, this can be done at once. This is why this otherwise O(n^2) nested loop can actually be reduced to linear time.

    One thing stands out here though: the linked list is only advanced when we find a match. Does this mean that we always have a match between polling records and the file descriptors that have just been polled? To answer that question we need to check how is the array of GPollFD structures filled. This is done in g_main_context_query(), as we hinted before. I'll spare you the details, and just focus on what seems relevant here: when is a poll record not used to fill a GPollFD?

      n_poll = 0;
      lastpollrec = NULL;
      for (pollrec = context->poll_records; pollrec; pollrec = pollrec->next)
        {
          if (pollrec->priority > max_priority)
            continue;
      ...
    

    Interesting! If a polling record belongs to a source whose priority is lower than the maximum priority that the current iteration is going to process, the polling record is skipped. Why is this?

    In simple terms, this happens because each iteration of the main loop finds out the highest priority between the sources that are ready in the prepare() stage, before polling, and then only those file descriptor sources with at least such a a priority are polled. The idea behind this is to make sure that high-priority sources are processed first, and that no file descriptor sources with lower priority are polled in vain, as they shouldn't be dispatched in the current iteration.

    GDB tells me that the maximum priority in this iteration is -60. From an earlier GDB output, we also know that there's a source for a file descriptor 19 with a priority 0.

    (gdb) print *pollrec
    $44 = {fd = 0x7369c8, prev = 0x0, next = 0x6f701560, priority = 0}
    (gdb) print *pollrec->fd
    $45 = {fd = 19, events = 1, revents = 0}

    Since 19 is lower than 30 and 34, we know that this record is before theirs in the linked list (and so it happens, it's the first one in the list too). But we know that, because its priority is 0, it is too low to be added to the file descriptor array to be polled. Let's look at the loop again.

      pollrec = context->poll_records;
      i = 0;
      while (pollrec && i < n_fds)
        {
          while (pollrec && pollrec->fd->fd == fds[i].fd)
            {
              if (pollrec->priority <= max_priority)
                {
                  pollrec->fd->revents =
                    fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
                }
              pollrec = pollrec->next;
            }
    
          i++;
        }

    The first polling record was skipped during the update of the GPollFD array, so the condition pollrec && pollrec->fd->fd == fds[i].fd is never going to be satisfied, because 19 is not in the array. The innermost while() is not entered, and as such the pollrec list pointer never moves forward to the next record. So no polling record is updated here, even if we have updated revent information from the polling results.

    What happens next should be easy to see. The check() method for all polled sources are called with outdated revents. In the case of the source for file descriptor 30, we wrongly tell it there's a G_IO_IN condition, so it asks the main loop to call dispatch it triggering a a wl_connection_read() call in a socket with no incoming data. For the source with file descriptor 34, we tell it that there's no incoming data and its dispatch() method is not invoked, even when on the other side of the socket we have a client waiting for data to come and blocking in the meantime. This explains what we see in the strace output above. If the source with file descriptor 19 continues to be ready and with its priority unchanged, then this situation repeats in every further iteration of the main loop, leading to a hang in the web process that is forever waiting that the UI process reads its socket pipe.

    The bug – explained

    I have been using GLib for a very long time, and I have only fixed a couple of minor bugs in it over the years. Very few actually, which is why it was very difficult for me to come to accept that I had found a bug in one of the most reliable and complex parts of the library. Impostor syndrome is a thing and it really gets in the way.

    But in a nutshell, the bug in the GLib main loop is that the very clever linear update of registers is missing something very important: it should skip to the first polling record matching before attempting to update its revents. Without this, in the presence of a file descriptor source with the lowest file descriptor identifier and also a lower priority than the cutting priority in the current main loop iteration, revents in the polling registers are not updated and therefore the wrong sources can be dispatched. The simplest patch to avoid this, would look as follows.

       i = 0;
       while (pollrec && i < n_fds)
         {
    +      while (pollrec && pollrec->fd->fd != fds[i].fd)
    +        pollrec = pollrec->next;
    +
           while (pollrec && pollrec->fd->fd == fds[i].fd)
             {
               if (pollrec->priority <= max_priority)

    Once we find the first matching record, let's update all consecutive records that also match and need an update, then let's skip to the next record, rinse and repeat. With this two-line patch, the web process was finally unlocked, the EGL display initialized properly, the web extension and the web page were loaded, CI tests starting passing again, and this exhausted developer could finally put his mind to rest.

    A complete patch, including improvements to the code comments around this fascinating part of GLib and also a minimal test case reproducing the bug have already been reviewed by the GLib maintainers and merged to both stable and development branches. I expect that at least some GLib sources will start being called in a different (but correct) order from now on, so keep an eye on your GLib sources. :-)

    Standing on the shoulders of giants

    At this point I should acknowledge that without the support from my colleagues in the WebKit team in Igalia, getting to the bottom of this problem would have probably been much harder and perhaps my sanity would have been at stake. I want to thank Adrián and &Zcaronan for their input on Wayland, debugging techniques, and for allowing me to bounce back and forth ideas and findings as I went deeper into this rabbit hole, helping me to step out of dead-ends, reminding me to use tools out of my everyday box, and ultimately, to be brave enough to doubt GLib's correctness, something that much more often than not I take for granted.

    Thanks also to Philip and Sebastian for their feedback and prompt code review!

    October 28, 2020

    Friends of GNOME Update – October 2020

    Welcome to the October 2020 Friends of GNOME Update!

    A crescent of the Earth from space
    “Earth” by Kevin M. Gill is licensed under CC BY 2.0

    GNOME on the Road

    Executive Director Neil McGovern spoke at Open Source Summit EU. In his keynote, titled “Patently Obvious – The Year the Lawyers Came to FOSS,” Neil spoke about our patent case with Rothschild Patent Imaging.

    Neil was also interviewed in The Registrar. This wide-ranging article covers the patent case, technical development of GNOME, GNOME beyond a desktop environment, and even GNOME on a phone.

    GNOME Around the World

    We’re working with our friends at KDE on the Linux Application Summit (LAS). This event takes place November 12 – 14. It will be online this year. The event will cover all things to do with apps in a Linux environment. Registration is open! LAS is also looking for volunteers, so if you’d like to get involved, please fill out this form.

    Registration for GNOME.Asia is open! The GNOME.Asia Summit 2020 will be taking place online on November 24 – 26. While the conference is centered around the GNOME Project, there will be talks, workshops, and Birds of a Feather sessions for everyone interested in free and open source software. You can register online.

    Engagement Team: Engage!

    The GNOME Engagement Team, organized by Kristi Progri, launched two new projects: Engagement Team Reports and What’s Happening in GNOME.

    “Engagement Team Reports” covers what the Engagement Team has been up to, which includes work from contributors, volunteers, and Foundation staff. “What’s Happening in GNOME” focuses on technical developments in the GNOME ecosystem.

    If you don’t already, consider following the Engagement Blog to keep track of these updates.

    If you’re now inspired to get involved with the Engagement Team, they maintain an active Discourse, and have monthly meetings.

    GTK4

    GTK is an amazing, important, and exciting part of the GNOME ecosystem. It gets people excited in ways that few other parts of our technical development does. Emmanuele Bassi, the Foundation’s Core GTK Developer, has been working hard on getting the newest major release, GTK4, ready to go. We’re really excited about this at the Foundation and across the GNOME community.

    CEC: Community Education Challenge

    Melissa Wu, the head organizer of the Community Engagement Challenge, and Caroline Henriksen have been working hard on keeping up momentum around the Challenge. Between organizing public conversations with the judges and keeping up with the Phase One winners, we have the Phase Two deadline coming up. If you’d like to keep up with the Challenge news, sign up for the mailing list.

    Flathub Search Updates

    Flatpak is one of our favorite ways to install apps on GNOME. The best way to get the apps you want is on Flathub. Bartłomiej Piotrowski and Jan Horbowicz have recently added new search implementation to flathub.org, which will yield better results in your searches.

    LAS (mentioned above) is a great event if you want to learn more about what’s happening with Flatpak.

    Thank you!

    Thank you for being a Friend of GNOME! Caroline and I are working on some stuff for Friends of GNOME that I’m pretty excited about and can’t wait to share with you. In the meantime, we appreciate your continued support and all the ways you help GNOME.

    October 26, 2020

    Librsvg's test suite is now in Rust

    Some important changes are afoot in librsvg.

    Changes to continuous integration

    Some days ago, Dunja Lalic rewrote the continuous integration scripts to be much faster. A complete pipeline used to take about 90 minutes to run, now it takes about 15 minutes on average.

    Graph with pipeline timings, which shrink drastically

    The description of changes is interesting. The idea is to make tests fail as fast as possible, close to the beginning of the pipeline. To speed up the whole pipeline, Dunja did the following:

    • Move the cargo check stage to the beginning. This test means, "does this even have a chance of compiling?".

    • Have the code style and formatting tests, cargo clippy and cargo fmt, run in parallel with the unit tests. These lints can fail, but they are easy to fix after one is finished modifying the main code.

    • Run the unit tests and the smoke tests in debug mode, so they compile quickly.

    • Run the complete integration test suite in release mode. This takes longer to compile, but there are some slow tests that benefit a lot from faster execution.

    • Move the release tests until the end, and only run them once a week — or whenever, by hand. These take a good amount of time to run, because they do a full make distcheck and autotools is slow. Even then, now these tests take 30-40 minutes, instead of the 90 from before.

    • Between each stage of the pipeline, don't cache what doesn't help reduce compilation time. It seems that keeping around a big cache, with the whole build target, between each pipeline stage can be worse than not having one at all.

    Complete pipeline with all the stages

    Test suite in Rust

    Beteen Sven Neumann, Dunja Lalic, and myself we have finally ported the test suite to Rust: all of librsvg's tests are now in Rust, except for the C API tests. We had to do a few things:

    • Review the old tests and remove some obsolete ones.

    • Port each of the test modules to Rust. They are small, but each one has special little things — test for crashes in the XML loading code, test for crashes during rendering, test the library's security limits.

    • Fix the small tests that come as part of the documentation.

    • Untangle the reference tests and port them to Rust.

    • Move little chunks of code around so the unit tests and integration tests can share utilities to compare images, compute file paths for test fixtures, etc.

    The most complicated thing to port was the reference tests. These are the most important ones; each test loads an SVG document, renders it, and compares the result to a reference PNG image. There are some complications in the tests; they have to create a special configuration for Fontconfig and Pango, so as to have reproducible font rendering. The pango-rs bindings do not cover this part of Pango, so we had to do some things by hand.

    Anyway, the tests are now in Rust. One nice thing is that now the tests run automatically in parallel, across all CPU cores, so we save on total testing time.

    What's next: cargo-c and publish to crates.io

    We want to be able to publish librsvg in crates.io as a normal crate; this implies being able to compile, test, and publish entirely from Cargo. The compilation and testing part is done.

    Now, we have to reorganize the code so it can be published to crates.io. Librsvg comes in three parts, rsvg_internals with the implementation of the library, librsvg with the traditional C API, and librsvg_crate with the Rust API. However, to publish the Rust API to crates.io, it would be more convenient to have a single crate instead of one with the internals and one with the API.

    The next step is thus to reorganize the code:

    • Make it possible to implement the C API as a compile-time option on top of the normal Rust code. We want to use cargo-c to compile the traditional shared library librsvg.so, instead of depending on C tools for compiling and linking.

    • Combine rsvg_internals and librsvg_crate in a single crate, to publish them together. Crates.io has a 10 MB limit per crate; now that the test suite lives in a separate tests crate, this shouldn't be a problem.

    • I would like to polish the public error types before publishing the Rust API; right now they expose some implementation details that are of no interest to callers of the library.

    What remains to be ported to Rust?

    Only two things, which amount to less than 900 lines of C code:

    • rsvg-convert - the command-line program that everyone uses to convert SVG to PNG and other formats. Fortunately, Sven Neumann wrote some fantastic tests for rsvg-convert, as it is like an API that we need to keep stable: if we change the command-line options or the program's behavior, we would break everyone's scripts.

    • The gdk-pixbuf module for loading SVG. Alberto Ruiz has started porting it to Rust. The generic part of this code could later serve to wrap other Rust image codecs and plug them to gdk-pixbuf.

    October 22, 2020

    Proposal to add build graph output to GNU Make

    Background

    In 2015 I worked as a consultant at a large company in Lund. My position was with the build team and one of our responsibilities was managing and maintaining the build system for their Android based phones.

    The problem I was tasked with solving was the fact that running 'make' for a product after a successful build resulted in a lot of stuff being rebuilt unnecessarily.

    A stock Android build tree behaved nicely: a second run of 'make' only produced a line about everything being up-to-date. But the company products were taking a good 15 minutes for a rebuild even if nothing had been changed.

    The Android build system works by including all recipes to be built (programs / libraries / etc) using the GNU Make include directive, so that you end up with one giant Makefile that holds all rules for building the platform. Possibly to avoid the problems laid out in the paper Recursive make considered harmful.

    As you can imagine this results in quite a large Makefile that is near impossible to debug. To help us out GNU Make has an option that helps you figure out what is going on with every decision it makes:

    -p, --print-data-base       Print make's internal database.

    This is very powerful and lets you investigate a lot about what Make is doing. It may be a bit too powerful though. It is a lot of information to digest. You can see example output from when I run it against the project from my last blog  post, riscv-asm-hello-morse in a gist here.

    The bugs we found and the fixes we implemented were mostly about targets depending on non-existing files or depending on phony targets. The fixes were almost never the hard part, it was finding the bugs that was hard.

    Lately I have been working a bit with the Yocto build system which uses Bitbake to build recipes. And Bitbake has a -g option to generate a dependency graph that lets you figure out why something was built.

    I wondered if something similar could be made useful for GNU Make. So I attempted to implement it as:

      -g, --output-graph          Output (dot) graph of modified targets for each goal.

    Here I might need to pause to inject that I am not the first person to have this idea:


    And I am sure I am missing more submissions.

    In my defense my approach is a bit different. I only want to include targets that have been updated from Make goals that have changed, which will trim the graph quite a bit. It is not an attempt to show the information from --print-database in a new way.

    Example 1

    If I use my proposal to generate a graph from the riscv project mentioned above:

    $ make -g
    riscv64-unknown-elf-as -march=rv32imac -mabi=ilp32 -g -o0  -c -o hello-morse.o hello-morse.S
    riscv64-unknown-elf-as -march=rv32imac -mabi=ilp32 -g -o0  -c -o wait.o wait.S
    riscv64-unknown-elf-as -march=rv32imac -mabi=ilp32 -g -o0  -c -o led.o led.S
    riscv64-unknown-elf-as -march=rv32imac -mabi=ilp32 -g -o0  -c -o morse.o morse.S
    riscv64-unknown-elf-ld hello-morse.o wait.o led.o morse.o -m elf32lriscv -nostartfiles -nostdlib -Thello-morse.lds -o hello-morse.elf
    riscv64-unknown-elf-objcopy -O ihex hello-morse.elf hello-morse.hex
    make: Writing dependency graph to '/home/jonas/sandbox/riscv/riscv-asm-hello-morse/all.dot'
    The last line here (my bold) is new and added by my patch. If we now look at the graph generated by this build.
    $ cat all.dot
    strict digraph "all" {
      "hello-morse.elf" -> "hello-morse.o" 
      "hello-morse.elf" -> "wait.o" 
      "hello-morse.elf" -> "led.o" 
      "hello-morse.elf" -> "morse.o" 
      "hello-morse.hex" -> "hello-morse.elf" 
      "all" -> "hello-morse.hex" 
      "all"
    }

    And we can generate a PNG from it:

    $ dot -Tpng all.dot -o all.png


    And if I force a rebuild  and re-generate the PNG:

    $ touch morse.S
    $ make -g 



    Example 2

    In my last blog post I mentioned the freedom-e-sdk used to build software for the hifive1-revb board. I noticed that it had the same problem as the Android based build system above, it always rebuilt on a second 'make' run.

    The graph for the second run looks like:



    Which helps us see that the build is forced because Make cannot find the libmetal-pico.a file. And looking at the Makefile we find that, yes, it does not check if the bsp support picolibc before depending on libmetal-pico.a. Fixing that will make rebuilding on the second 'make' run go away.

    Example 3

    If I want to track what happens on rebuild of the Make project itself I run into an issue.
    $ touch src/file.c
    $ make -g
    The resulting all.dot file hardly contains any information about the rebuild:
    strict digraph "all" {
      "all" -> "all-recursive" [label="forced: PHONY prerequisite"]
      "all"
    }
    The problem is that the make process is using make to build sub-targets, this comes from the all-recursive target. In order to solve this we need to tell, in this case automake, that all calls to make should use the -g option.
    $ touch src/file.c
    $ AM_MAKEFLAGS="-g" make -g

    This gives us an additional all-am.dot file from the all-am make goal, which holds the missing information:

    strict digraph "all-am" {
      "make" -> "src/file.o" 
      "all-am" -> "make" 
      "all-am"
    }

    Status and feedback

    I have submitted this proposal to the GNU Make project. The maintainers will decide if this is something that is worthwhile and something they could consider maintaining.

    You can also find the implementation in my Make fork at GitHub, here.

    I would love to hear your thoughts on this. Is this something you would find useful? Please comment here or drop me a line on Twitter (@jonasdn) with your opinions!

    October 21, 2020

    Accessibility in GTK 4

    The big news in last weeks GTK 3.99.3 release is that we have a first non-trivial backend for our new accessibility implementation. Therefore, now is a good time to take a deeper look at accessibility in GTK 4.

    Overview

    Lets start with a quick review of how accessibility works on Linux. The actors in this are applications and assistive technologies (ATs) such as screen readers (for instance, Orca), magnifiers and the like.

    The purpose of ATs generally is to provide users with alternative ways to interact with the application that are tailored to their needs (say, an enlarged view, text read out aloud, or voice commands). To do this, ATs need a lot of detailed information about the applications UI, and this is where the accessibility stack comes into play—it is the connecting layer between the application (or its toolkit) and the ATs.

    Applications and ATs talk to each other on the accessibility bus, which is a separate instance of a D-Bus session bus, using the interfaces described by the AT-SPI project. The UI elements of the application are represented on as objects on the bus that implement somewhat abstracted interfaces, such as Text or Value. Applications emit signals to communicate changes in the UI, and ATs can call methods on the objects to get information or make changes (e.g. change the current value of a Value interface to move the GtkScale that it represents).

    What has changed

    In GTK 2 and 3, this was done in an awkwardly indirect way: GTK widgets have auxiliary accessible objects that are implementations of ATK interfaces (translation 1: GTK ➙ ATK). These are then turned in AT-SPI objects (translation 2: ATK ➙ AT-SPI) that are represented on the accessibility bus by adapter code in at-spi2-atk. On the other end, ATs then use pyatspi to convert the AT-SPI interfaces into Python objects (translation 3: AT-SPI ➙ Python).

    This multi-step process was inefficient, lossy, and hard to maintain; it required implementing the same functionality over at least three components, and it led to discrepancies between the documented AT-SPI methods and properties, and the ones actually sent over the accessibility bus.

    In GTK 4, we are simplifying the application side by cutting out ATK and at-spi2-atk. Widgets now implement a GtkAccessible interface that lets them set a number of roles, states, properties and relations that are more or less directly taken from the WAI-ARIA spec published by the W3C. The AT-SPI backend for GTKs accessibility API then takes these ARIA-inspired attributes (and the knowledge of the widgets themselves) and represents the widgets as objects on the accessibility bus, implementing the relevant AT-SPI interfaces for them.

    This is a much more direct approach, and matches what Qt and web browsers already do.

    Application API

    Here are the highlights of the accessibility API that you are most likely to run into when using GTK 4 in applications:

    Setting an accessible role. A role is a description of the semantics of a widget, and ATs will use it to decide what kind of behavior should be presented to their users. Setting a role is a one-time operation, which means it has to be done at widget creation time, either in class_init, or during instance initialization:

    gtk_widget_class_set_accessible_role (widget_class,  
                                          GTK_ACCESSIBLE_ROLE_BUTTON);

    Updating a widgets acccessible state or properties. This should be done whenever the widget’s accessible representation changes:

    gtk_accessible_update_property (GTK_ACCESSIBLE (widget),
                           GTK_ACCESSIBLE_PROPERTY_VALUE_MIN, minimum,
                           GTK_ACCESSIBLE_PROPERTY_VALUE_NOW, value,
                           GTK_ACCESSIBLE_PROPERTY_VALUE_MAX, maximum,
                           -1);

    The GTK reference documentation has an overview of the accessibility API, which includes guidance for application developers and widget writers.

    What’s next ?

    For GTK 4.0, we are focusing on completing the AT-SPI backend. But with the new API and the backend separation, we have a clear path towards making accessibility backends for other platforms, which is something we want to look into for subsequent GTK 4 releases.

    On Linux, we want to work with other stakeholders on modernizing the AT-SPI interfaces to finally overcome the CORBA legacy that is still visible in some places. A part of that will be moving away from an accessibility bus toward peer-to-peer connections between the application and the ATs; this would enhance the security of the accessibility stack and plug a hole in the sandboxing used by technologies such as Flatpak.

    In the future we want to introduce tools to ensure that application developers will be aware of missing accessibility annotations, such as providing a label attribute, or a labelled-by relation, to icons and images in their UIs; or ensure that every UI element is correctly represented in the accessible tree. We already have a test backend for the GtkAccessible interface which can be used to write unit tests and verify that roles and attributes are updated where necessary.

    October 19, 2020

    libsecret is accepting Outreachy interns as well

    Like other projects in GNOME, libsecret also has an open project for Outreachy internship: Create a portable library for reading/writing libsecret keyring format.

    libsecret is a library that allows applications to store/retrieve user secrets (typically passwords). While it usually works as a client against a separate D-Bus service, it can also use a local file as database. The project is about refactoring the file database so it can easily gain more advanced features like hardware-based security, etc. That might sound intimidating as it touches cryptography, but don’t worry and reach out to us if you are interested 🙂