Constant Time Rank-Select Queries On A Bit Vector
Bit vectors are commonly used to store boolean values compactly. In many succinct data structures, we need to perform two operations efficiently:
- rank(i): returns the number of set bits in the vector up to position i.
- select(i): returns the position of the i-th set bit in the vector.
We are going to discuss some techniques which will allow us to run these operations in constant time O(1) with minimal space overhead.
I am going to omit the subscript from the rank and select definitions and assume rank = rank₁ and select = select₁ (i.e. queries over set bits) everywhere.
Faster rank
Hierarchical Counters
A naive rank(i) implementation scans the bitvector up to position i, counting set bits along the way. To support constant-time rank queries, we have to precompute partial sums with different strides.
We partition the bitvector into nested levels where each level refines the one above it and uses a smaller stride.
For example, we may choose block sizes of 2048 and 512 bits.
At each level we store the number of set bits from the beginning of the parent block. The counter size can be chosen according to the maximum value it needs to represent (u64 for the absolute counts at the top level, u16 for counts relative to a 2048-bit parent block, etc).
To compute rank(i), we combine the precomputed counts from each level and finish with a direct bit count at the finest granularity. With the example block sizes:

rank(i) = L0[i / 2048] + L1[i / 512] + popcount(bits from (i / 512) · 512 up to i)
Since the number of levels is constant and the final popcount operates on a bounded number of bits, the rank query runs in constant time.
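For example, with block sizes of 2048 and 512, the query rank(5000) decomposes as:

rank(5000) = L0[2] + L1[9] + popcount(bits 4608..4999)

where L0[2] counts the set bits in positions [0, 4096) and L1[9] counts the set bits in [4096, 4608), relative to the start of its parent block.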
Interleaving
Interleaving places counters from different hierarchy levels next to the data, improving locality on linear scans so that a rank query often touches only a few cache lines, which usually improves speed.
The downsides are the increased implementation complexity and the need to allocate a new memory region for the combined array: the layout becomes harder to reason about, it adds more index arithmetic, and it introduces slow division if we don't choose the block sizes carefully.
SIMD
SIMD (Single Instruction, Multiple Data) instructions can accelerate rank queries, for example by performing several popcount operations in parallel.
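As a sketch (the function name and the vector width of 4 are my own choices, not from any particular library), Zig's @Vector type lets the compiler lower the loop to SIMD instructions where the target supports them:

```zig
const std = @import("std");

// Sum of popcounts over `words`, four words per step; @popCount on a vector
// yields a vector of per-lane counts, which we widen and horizontally add.
fn popCountVec(words: []const u64) u64 {
    const V = @Vector(4, u64);
    var total: u64 = 0;
    var i: usize = 0;
    while (i + 4 <= words.len) : (i += 4) {
        const v: V = words[i..][0..4].*;
        total += @reduce(.Add, @as(V, @popCount(v)));
    }
    // scalar tail for the remaining words
    while (i < words.len) : (i += 1) total += @popCount(words[i]);
    return total;
}

test "popCountVec" {
    try std.testing.expectEqual(@as(u64, 5), popCountVec(&.{ 0b1011, 0b1, 0b10, 0 }));
}
```

Whether this actually beats a plain scalar loop depends on the target and on how the compiler lowers it, so it's worth benchmarking both variants.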
Example
A simple two-level implementation:
```zig
const std = @import("std");

pub const Rank = struct {
    data: []const u64,
    l0: []u64,
    l1: []u16,
    bit_len: usize,

    const L0_SIZE: usize = 2048;
    const L1_SIZE: usize = 512;

    pub fn init(allocator: std.mem.Allocator, data: []const u64, bit_len: usize) !Rank {
        const l0_count = (bit_len + L0_SIZE - 1) / L0_SIZE;
        const l1_count = (bit_len + L1_SIZE - 1) / L1_SIZE;
        const l0 = try allocator.alloc(u64, l0_count);
        const l1 = try allocator.alloc(u16, l1_count);

        var l0_rank: u64 = 0;
        var l1_rank: u64 = 0;
        for (data, 0..) |it, i| {
            const bit_pos = i * 64;
            if (bit_pos % L0_SIZE == 0) {
                l0[bit_pos / L0_SIZE] = l0_rank;
                l1_rank = l0_rank;
            }
            if (bit_pos % L1_SIZE == 0) {
                l1[bit_pos / L1_SIZE] = @intCast(l0_rank - l1_rank);
            }
            l0_rank += @popCount(it);
        }

        return Rank{
            .data = data,
            .l0 = l0,
            .l1 = l1,
            .bit_len = bit_len,
        };
    }

    pub fn rank(self: Rank, i: usize) !u64 {
        if (i >= self.bit_len) return error.InvalidParameters;

        const l0_offset = self.l0[i / L0_SIZE];
        // offsets in self.l1 are relative to the corresponding l0, so we need to sum them
        const l1_idx = i / L1_SIZE;
        var result = l0_offset + self.l1[l1_idx];

        // count all the set bits in the array up until the i-th bit
        const start_idx = (l1_idx * L1_SIZE) / 64;
        const end_idx = i / 64;
        var j = start_idx;
        while (j < end_idx) : (j += 1) {
            result += @popCount(self.data[j]);
        }

        // count the set bits in the last u64; we need to mask it so it doesn't go beyond i
        const last_offset = i % 64;
        if (last_offset > 0) {
            result += @popCount(self.data[j] >> @intCast(64 - last_offset));
        }

        return result;
    }

    pub fn deinit(self: *Rank, allocator: std.mem.Allocator) void {
        allocator.free(self.l0);
        allocator.free(self.l1);
        self.* = undefined;
    }
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const data = [_]u64{0xAAAAAAAAAAAAAAAA} ** 100;
    var r = try Rank.init(allocator, &data, 100 * 64);
    defer r.deinit(allocator);
    std.debug.print("{}\n{}\n{}\n{}\n", .{ try r.rank(0), try r.rank(2), try r.rank(3), try r.rank(5000) });
}
```
Optimizing Select
The select operation is the inverse of rank: finding the position of the k-th set bit. This task is more challenging since the density of set/unset bits can vary throughout the array. A uniform lookup table that works for dense data might be inefficient for sparse data.
Binary Search
A simple approach is to binary search over the rank structure. To do that, let's define select(k) such that rank(select(k)) = k − 1, i.e. select(k) returns the position of the k-th set bit with a 1-based k. Why subtract 1? This is just my preference. I prefer using a 1-based input since we are talking about the k-th set bit, not an index. It seems that most researchers writing articles on this topic also prefer this approach.
Using this definition of select(k), we need to find the smallest position p such that rank(p + 1) ≥ k. Since the rank function is monotonic, this can be achieved using binary search.
```zig
pub fn select(r: Rank, k: u64) !usize {
    var l: usize = 0;
    // the answer is always a valid position, so we can search in [0, bit_len)
    var h = r.bit_len - 1;
    while (l < h) {
        const q = l + (h - l) / 2;
        if (try r.rank(q + 1) < k) l = q + 1 else h = q;
    }
    return l;
}
```
This is not a constant time solution, but it can be useful in some cases.
Hierarchical index
As with rank, we can precompute select(k) using different strides at each level. However, unlike rank, the answers are not uniformly distributed across the bit array. For example, the array may contain only a few set bits out of millions of positions.
In this case we partition the bitvector into the same kind of nested levels as for rank, but here the strides are defined in terms of set bits, and we store positions rather than prefix sums.
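Concretely, assuming a top-level stride of 2048 set bits and a second-level stride of 512 (the sizes used in the example code below):

L0[j] = position of the (j · 2048)-th set bit
L1[j] = position of the (j · 512)-th set bit, relative to its parent L0 entry

select(k) then starts from L0[k / 2048] + L1[k / 512] and only has to scan for the remaining (k mod 512) set bits.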
Interleaving
Interleaving for select follows the same general idea as for rank - we place the select metadata (positions or relative offsets) next to the data.
For select, interleaving is more appealing than for rank: since the bit density can vary across the bit array, the extra cache misses from loading the level arrays can significantly affect performance.
Density
While rank(i) always processes a fixed position in the bitvector, select(k) depends on how set bits are distributed across the array.
In dense bitvectors (more than 50% set bits), consecutive values of k usually map to nearby positions. Most index blocks contain many set bits, so the computed start(k) is close to the final answer and the remaining scan is short and cache-friendly.
Sparse bitvectors (less than 50% set bits) often have large gaps between set bits. An index entry may point to a wide range of positions, and the final scan may require skipping long runs of zeroes. In this case, stride selection matters more than almost any other optimization.
This problem can be solved by dynamically choosing strides depending on the bit density. Hierarchical indexing can also help here, as it allows us to keep the index reasonably small while still using narrow integer types for offsets.
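One possible heuristic for such a density-based choice (the target span of 2¹⁶ bits is my own assumption, not a standard constant): pick the stride, measured in set bits, so that every index entry covers roughly the same amount of memory:

stride ≈ clamp(2¹⁶ · n_set / total_bits, 64, 2¹⁶)

For a dense array with n_set ≈ total_bits / 2 this gives a stride of about 32768 set bits per entry; for a sparse one with one set bit per thousand positions it shrinks to about 65. In both cases the scan after the index lookup stays bounded to roughly 2¹⁶ bits.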
Scan
Once we have narrowed our search down to some basic block, we will have to start a linear scan. It's better to do this at a scale that matches the CPU cache line.
We iterate through the words, accumulating population counts, until we find a word containing the target bit.
```zig
// `bit_array` is the underlying []u64; `idx` is the word where the scan starts,
// `k` is the 1-based number of set bits still to find
fn scan(k: usize, idx: usize) !u64 {
    var remaining: usize = k;
    for (bit_array[idx..], idx..) |value, i| {
        const pcnt = @popCount(value);
        if (remaining <= pcnt) {
            // nthSetBitPos takes a 0-based index
            const value_pos = try nthSetBitPos(value, @intCast(remaining - 1));
            return i * 64 + value_pos;
        }
        remaining -= pcnt;
    }
    unreachable;
}
```
In the last step, we need to find the n-th set bit within a machine word.
N-th set bit
Unfortunately, there is no widely supported instruction that returns the position of the n-th set bit. On the x86-64 architecture we have the BMI2 instruction set, and using the PDEP instruction from it we can define nthSetBit(x, n) = TZCNT(PDEP(1 << n, x)), but as far as I know we don't have an alternative on ARM.
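To see why this works, note that PDEP(1 << n, x) deposits the single set bit of 1 << n into the position of the (n+1)-th set bit of x (counting from the least significant bit), and TZCNT then returns that position. For example, for x = 0b10110100 and n = 2:

1 << 2 = 0b100
PDEP(0b100, 0b10110100) = 0b00100000   (bit 2 of the source lands in the third set bit of the mask)
TZCNT(0b00100000) = 5

so the third (n = 2 with 0-based n) set bit of x is at position 5.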
I'll write a separate article about a very fast solution described in Sebastiano Vigna's article Broadword Implementation of Rank/Select Queries, it works surprisingly well on ARM.
For simplicity, we will use the following naive implementation:
```zig
pub fn nthSetBitPos(x: u64, n: u6) !u6 {
    var bits = x;
    var remaining: u6 = n;
    for (0..64) |i| {
        if ((bits & 1) != 0) {
            // found the n-th (0-based) set bit
            if (remaining == 0) return @intCast(i);
            remaining -= 1;
        }
        bits >>= 1;
    }
    return error.NotFound;
}
```
Example
```zig
const std = @import("std");

pub fn nthSetBitPos(x: u64, n: u6) !u6 {
    var bits = x;
    var remaining: u6 = n;
    for (0..64) |i| {
        if ((bits & 1) != 0) {
            if (remaining == 0) return @intCast(i);
            remaining -= 1;
        }
        bits >>= 1;
    }
    return error.NotFound;
}

const Tuple = std.meta.Tuple;

pub const Select = struct {
    data: []const u64,
    table: Tuple(&.{ []u64, []u16 }),
    strides: [2]u64,
    bit_len: usize,

    pub fn init(allocator: std.mem.Allocator, data: []const u64, bit_len: usize) !Select {
        if (bit_len == 0 or data.len == 0 or bit_len > data.len * 64) return error.InvalidParameters;

        const n_set = n_set: {
            const len = (bit_len - 1) / 64 + 1;
            var sum: u64 = 0;
            for (data[0 .. len - 1]) |it| sum += @popCount(it);
            break :n_set sum + @popCount(data[len - 1] >> @intCast(len * 64 - bit_len));
        };

        const stride_l0 = 2048;
        const stride_l1 = 512;
        const strides: [2]u64 = .{ stride_l0, stride_l1 };
        const l0 = try allocator.alloc(u64, if (n_set == 0) 2 else ((n_set - 1) / stride_l0 + 1) + 2);
        const l1 = try allocator.alloc(u16, if (n_set == 0) 2 else ((n_set - 1) / stride_l1 + 1) + 2);
        const table = .{ l0, l1 };
        inline for (&table) |l| {
            l[0] = 0;
            l[l.len - 1] = 0;
        }

        // the first entry of the top level points at the first set bit
        var idx: usize = 0;
        while (l0[0] == 0) : (idx += 1) {
            const value = data[idx];
            if (@popCount(value) != 0) {
                l0[0] = idx * 64 + try nthSetBitPos(@bitReverse(value), 0);
                break;
            }
        }
        std.debug.assert(idx < data.len);

        var bit_pos: u64 = 0;
        var cumulative_count: u64 = 0;
        var counts: [2]u64 = .{ 0, 0 };
        for (data[idx..]) |value| {
            const count = @popCount(value);
            inline for (&strides, 0..) |stride, i| {
                if (counts[i] + count < stride) {
                    counts[i] += count;
                } else {
                    std.debug.assert(counts[i] < stride);
                    var curr_value = value;
                    var curr_acc = cumulative_count;
                    var curr_bit_pos = bit_pos;
                    var curr_count = count;
                    while (counts[i] + curr_count >= stride) {
                        const remaining = stride - counts[i];
                        std.debug.assert(remaining != 0);
                        const word_pos: u16 = try nthSetBitPos(@bitReverse(curr_value), @intCast(remaining - 1));
                        const level = (curr_acc + remaining) / stride;
                        if (comptime i == 0) {
                            table[0][level] = curr_bit_pos + word_pos;
                        } else {
                            // lower levels store offsets relative to their parent entry
                            const prev_level = (curr_acc + remaining) / strides[i - 1];
                            const relative_pos = (curr_bit_pos + word_pos) - table[i - 1][prev_level];
                            const threshold = comptime std.math.maxInt(@TypeOf(table[i][0]));
                            if (relative_pos > threshold) return error.Overflow;
                            table[i][level] = @intCast(relative_pos);
                        }
                        curr_value = @shlWithOverflow(curr_value, @as(u6, @intCast(word_pos)))[0];
                        curr_count -= @intCast(remaining);
                        curr_acc += remaining;
                        curr_bit_pos += word_pos + 1;
                        counts[i] = 0;
                    }
                    counts[i] = curr_count;
                }
            }
            cumulative_count += count;
            bit_pos += 64;
        }

        return Select{
            .data = data,
            .table = table,
            .strides = strides,
            .bit_len = bit_len,
        };
    }

    pub fn select(self: Select, k: u64) !u64 {
        var bit_pos: u64 = 0;
        inline for (&self.table, &self.strides) |l, stride| {
            bit_pos += l[k / stride];
        }

        const L = self.strides.len;
        var remaining: u64 = k % self.strides[L - 1];
        if (k >= self.strides[L - 1]) remaining += 1;
        if (remaining == 0) {
            return bit_pos;
        }

        // finish with a linear scan, masking off the bits before bit_pos
        var idx = bit_pos / 64;
        const shift = bit_pos % 64;
        var value: u64 = self.data[idx] & (~@as(u64, 0) >> @intCast(shift));
        while (true) {
            const pcnt = @popCount(value);
            if (remaining <= pcnt) {
                const value_pos = try nthSetBitPos(@bitReverse(value), @intCast(remaining - 1));
                return (idx * 64) + value_pos;
            }
            remaining -= pcnt;
            idx += 1;
            value = self.data[idx];
        }
    }

    pub fn deinit(self: *Select, allocator: std.mem.Allocator) void {
        allocator.free(self.table[0]);
        allocator.free(self.table[1]);
        self.* = undefined;
    }
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const data = [_]u64{0xAAAAAAAAAAAAAAAA} ** 100;
    var r = try Select.init(allocator, &data, 100 * 64);
    defer r.deinit(allocator);
    std.debug.print("{}\n{}\n{}\n{}\n", .{ try r.select(1), try r.select(3), try r.select(10), try r.select(256) });
}
```