libBf: libBf

libBf is a header-only C++11 library of a garden variety of of Bloom filters.This manual is organized as follows:

Bloom Filters
1. Terminology
2. Basic
3. Multisets
4. Aging
  1. Stable
  2. \(A^2\)
Architecture
Usage

Bloom Filters

Whenever you have a set or list, and space is an issue, a Bloom filter may be a useful alternative. --Mitzenmacher

A Bloom filter is a randomized synopsis data structure that allows for set membership queries. Its space-efficient representation comes at the cost of false positives, i.e., elements can erroneously be reported as members of the set. In practice, the huge space savings often outweigh the false positives if kept at a sufficiently low rate.

Bloom filters have received a great deal of attention by the research community. After introducing some terminology, we discuss the basic Bloom filter and then turn to some important extensions. Thereafter, we present the architecture of libBf and show how to integrate it with your own code.

Terminology

Universe \(U\)
\(N\) distinct items
\(k\) independent hash functions \(h_1,\dots,h_k\)
Vector \(V\) of \(m\) cells, i.e., \(m = |V|\)
Set
- \(S = \{x_1,\dots,x_n\}\) where \(x_i\in U\) and \(|S| = n\)
Multiset / Stream
- \(\mathcal{S} = \{x_1,\dots,x_n\}\) where \(x_i\in U\) and \(|\mathcal{S}| = n\)
- \(C_x = \left\{ c_{h_1(x)},\dots,c_{h_k(x)} \right\}\) counters of \(x\)
- \(f_x =\) multiplicity (frequency) of \(x\in\mathcal{S}\)
Bloom filter estimate denoted by "hat:" \(\widehat{S}, \widehat{\mathcal{S}}, \widehat{f}_x, \ldots\)
Probability of a false positive (FP): \(\phi_P = \mathbb{P}\left(x\in \widehat{S} \vert x\notin S\right)\)
Probability of a false negative (FN): \(\phi_N = \mathbb{P}\left(x\notin \widehat{S} \vert x\in S\right)\)
Capacity \(\kappa\), i.e., is the maximum number of items a Bloom filter can hold until a given \(\phi_P\) can no longer be guaranteed
A Bloom filter is full when then number of added items exceeds \(\kappa\)

Basic

Burton Bloom introduced the original Bloom filter in 1970, which we refer to as the basic Bloom filter. It uses a bit vector \(V\) with \(|V| = m = O(n)\) and \(k\) independent hash functions \(h_1, \dots, h_k\) that map items in \(U\) to the range \([m] = \{1,\ldots,m\}\). (Unlike in the implementation, we start at index 1 for the formal treatment.) All bits in \(V\) are initialized to 0. To insert an item \(x\in S\), we set the bits at positions \(h_1(x), \ldots, h_k(x)\) in \(V\) to 1. To test whether an item \(q\in U\) is a member of \(\widehat{S}\), we examine the bits at positions \(h_1(q),\dots,h_k(q)\) in \(V\). If any of these bits is 0 we report \(q\notin \widehat{S}\). Otherwise we report that \(q\in \widehat{S}\), although there remains some probability that \(q\notin S\). This type of error is a false positive (FP) and also known as Bloom error \(E_B\). It occurs because other elements in \(S\) also map to the same positions. The Figure below illustrates how the basic Bloom filter works.

The basic Bloom filter devised by Burton Bloom. To insert an item \(x\), we set the corresponding positions in the bit vector to 1.

To compute the probability of a Bloom error, we start off with an empty bit vector \(V\) and insert an item. This is the same as independently (and uniformly) choosing \(k\) bits and setting them to 1. Thereafter, the probability that a certain bit in \(V\) is still 0 is

\[ \left(1 - \frac{1}{m}\right)^k. \]

Afer \(n\) insertions, the probability that a certain bit is 1 is

\[ 1 - \left(1 - \frac{1}{m}\right)^{kn}. \]

Testing for membership involves hashing an item \(k\) times. Thus the probability of a Bloom error is

\begin{equation} \mathbb{P}(E_B) = \left(1-\left(1-\frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k \end{equation}

For fixed parameters \(m\) and \(n\), the optimal value \(k^*\) that minimizes this probability is

\[ k^* = \arg\min_k \mathbb{P}(E_B) = \left\lfloor\frac{m}{n}\ln 2\right\rfloor \]

For \(k^*\), we have hence \(E_B = (0.619)^{m/n}\). Moreover, for a desired FP probability \(\phi_P\) we can compute the number of required bits by substituting the optimal value of \(k\):

\[ m = -\frac{n\ln p}{(\ln 2)^2}. \]

Multisets

A basic Bloom filter can only represent a set, but neither allows for querying the multiplicities of an item, nor does it support deleting entries. We use the term counting Bloom filter to refer to variants of Bloom filters that represent multisets rather than sets. In libBf, we call a counting Bloom filter a basic Bloom filter with width parameter \(w\).

Counting

In a counting Bloom filter, inserting an item corresponds to incrementing a counter. Some variants also feature a decrement operation to remove item from a set. As soon we allow for deletions we introduce false negative (FN) errors. These occur when removing an item that itself was a FP. The probability of a FN is bounded by \(O(E_B)\).

When we get the count of an item \(x\in\widehat{S}\), we compute its set of counters \(C_x\) and return the minimum value as frequency estimate \(\widehat{s}_x\). This query algorithm is also known as minimum selction (MS).

Each cell in the counting Bloom filter has a fixed bit width \(w\). To insert an item \(x\), we increment the counters \(C_x\). To remove an item \(y\), we decrement its counters \(C_y\).

There are two main issues with counting Bloom filters:

Counter overflows
Choosing \(w\)

The first problem exists when we the counter value reaches \(2^w - 1\) and cannot be incremented any more. The most natural thing to do is simply not continuing to count rather than overflowing and restarting at 0. However, this introduces undercounts, which we also refer to as FNs.

The second problem concerns the choice of the width parameter \(w\). If we choose \(w\) very large, the space gains of a Bloom filter diminish. There will also be a lot of unused space, manifesting as unused zeros. If we choose \(w\) too small, we will reach the maximum counter value to fast. Choosing the right value is a difficult trade-off that also depends on the data. It has been shown that for uniform distributions a value of \(w = 4\) works well.

Bitwise

The bitwise Bloom filter is a combination of \(l\) counting Bloom filters with bit vectors \(V_i\), each of which have \(m_i\) cells, \(k_i\) hash functions, and width \(w_i\) where \(i\in\{0,\dots,l-1\}\). It aims at solving both of the overflow and space problem of the counting Bloom filter.

To add an item \(x\), we first look at the counters in the first level \(V_0\). If there is enough room (i.e., width) available, we simply perform the increment. If the counter overflows, we insert \(x\) into \(V_1\) and remove it from \(V_0\). In this fashion, the counter value is unbounded as we can always add more levels. However, the item has to be hashed \(l\) times with a total of \(\sum_{i=0}^{l-1} k_i\) hash functions.

In order to read the counter of an item, we combine the binary representation of all the levels. Let \(c_i\) be the counter value at level \(i\). Then we compute the counter value as

\[ C = \sum_{i=0}^{l-1} c_i 2^{\sum_{j=0}^i w_i}. \]

The Figure below illustrates a bitwise Bloom filter with \(w_i = 1 \; \forall i\).

The bitwise Bloom filter consists of \(l\) counting Bloom filters, each of which represent \(w_i\) orders of magnitude of the entire counter.

Spectral

The spectral Bloom filter is an optimized version of the counting Bloom filter. It consists of two extra algorithms in addition to MS and introduces a more space-efficient data structure to represent counters.

Let us review the MS algorithm. When querying an item \(q\in U\), MS uses the minimum counter value \(m_q = \min_i C_q\) as frequency estimate, i.e., \(\widehat{f}_q = m_q\). Cohen and Matias claim that \(f_x \le m_x\) and \(\mathbb{P}(\widehat{f}_x \neq m_x) = \mathbb{P}(E_B)\) for all \(x\in S\).
The second spectral algorithm is an optimization for the add operation. When adding an item \(x\) to the Bloom filter, the minimum increase (MI) algorithm only increments the minimum counter value(s) \(\tilde{C}_x = \min_i C_x\). The rationale behind this is that \(m_x\) is always the most accurate count, thus MI results in the fewest possible increment operations. Because not all counters are incremented on inserts, the effect of deletes is significantly worse and the number of FNs becomes unbounded. Thus, the MI algorithm should not be used when removing of items from a Bloom filter. Cohen and Matias claim that \(E_B^{MI} = O(E_B)\) and that if \(x\) is drawn uniformly from \(U\), then \(E_B^{MI} = E_B/k\).
The third algorithm is recurring minimum (RM) and involves two Bloom filters, \(V_1\) and \(V_2\). The key insight behind RM is that items that experience Bloom errors are less likely to have recurring minima counter values. Cohen and Matias found empirically that this applies to approximately 20% of the items. Such items with a unique minimum are maintained in the second Bloom filter to reduce the discrepancy between \(f_x\) and \(\widehat{f}_x\). To query an item \(q\in U\) according to the RM algorithm, we look first into the first Bloom filter and check if \(q\) has a recurring minimum. If so, we return the minimum counter value. Otherwise we look the minimum counter value from the second Bloom filter, unless it is 0. If it is 0 (i.e., does not exist), we return the minimum counter from the first Bloom filter. Since all the items are inserted into the first bloom filter, the RM optimization does at least as well as the MS algorithm, yet has usually better error rates because a second filter holding fewer items is used for items which experience higher error rates.

Aging

A problem all the above Bloom filter variants is that they eventually fill up over time when dealing with a large set or stream of data. This means that at some point the Bloom filter becomes unusable due to its high error rates. There exist various scenarios where one would like to "age out" old items that have been added a long time ago. For example, we might want to estimate only recent items or we have a very limited amount of space available.

Although counting Bloom filters have a delete operation, it is often impossible to retain old items in memory. Thus we do not know their counter positions in the bit vector anymore, otherwise we would simply decrement their count. What we want is a Bloom filter that has sliding window semantics, as illustrated by the Figure below.

In a sliding window scenario, an insert operation for a new item \(x_7\) would ideally delete an old item \(x_0\).

To support a sliding window, we would like to have Bloom filter that acts like a FIFO. In the following, we discuss two different Bloom filter flavors that aim at providing this property.

Stable

The stable Bloom filter is essentially a basic Bloom filter with an underlying bit vector with a fixed cell widht \(w\). However, counters do not represent the multiplicities of the items but rather their age. Thus the interface supports only set membership queries.

To insert an item, we decrement \(d\) cells chosen uniformly at random. Thereafter, we set the counters of all \(k\) cells to their maximum value of \(2^w - 1\).

Deng and Rafiei have shown that the fraction of zeros will eventually become constant. When having reached this stable point, the probability of a Bloom error is

\[ \phi_P = \left( 1 -\left( \frac{1}{1+\frac{1}{d(1/k-1/m)}} \right) \right) \]

\(A^2\)

The \(A^2\) Bloom filter, also known as active-active buffering, also provides a FIFO abstraction. To this end, it uses two single-bit vectors \(V_1\) and \(V_2\) where \(|V_1| = |V_2| = \frac{m}{2}\). Unlike the spectral RM algorithm, one Bloom filter is not a subset of the other, so an item can be in either Bloom filter.

To query for an item, we return true if \(q\) exists in either \(V_1\) or \(V_2\). To insert an item \(x\), we simply return if it already exists in \(V_1\). Otherwise we insert it in \(V_1\) and test whether \(V_1\) has reached its capacity. If it is full, we flush \(V_2\) and swap \(V_1\) and \(V_2\). Thereafter we insert the item in \(V_1\) (the old \(V_2\)).

One advantage of the \(A^2\) Bloom filter is space-efficiency, since one bit vector is always full. Let the subscript \(a\) denote the value of the \(A^2\) Bloom filter. The probability of a Bloom error is

\[ {\phi_P}_a = 1 - \sqrt{1-\phi_P} \]

and the optimal value for \(k_a\) and \(\kappa_a\) are:

\[ k_a^* = \left\lfloor -\log_2\left(1-\sqrt{1-\phi_P}\right) \right\rfloor \qquad \kappa_a^* = \left\lfloor \frac{m}{2k_a^*} \ln2 \right\rfloor \]

Architecture

libBf implements a variety of different Bloom filters. Each Bloom filter type is based on one or more cores, which is a policy class that captures store, hash, and partition properties.

Store: The store policy abstracts an \(O(1)\) random-access array data structure to access counter values.
Hash: The hash policy computes the hash values according to one of three possible schemes. The default scheme is simply to use the hash value of the underlying hasher (which is also configurable). The two other available schemes are double hashing and extended double hashing, implement \(k\) independent hash function using only two pairwise independent hash functions of the form
\[ g_i(x) = h_1(x) + ih_2(x) + f(i) \quad \forall i \in [k] \]
The function \(f\) is an arbitrary mapping into the real number. For \(f(i) \equiv 0\) we have double hashing.
Partition: The partition policy determines how to distribute the \(k\) hash values to the bit vector. The default policy is to not use partitioning and simply compute \(h_i(x) \bmod m\) for all \(i \in [k]\). Alternatively, one could partitioning to map each hash value \(h_i(x)\) into a disjoint sub-vector of size \(m/k\), i.e.,
\[ \left(h_i(x) \bmod \frac{m}{k}\right) + \frac{(i-1)m}{k} \quad \forall i\in[k] \]

The Figure below shows the architecture of libBf.

Each Bloom filter implementation in libBf is uses one or more cores. A core is a policy class that captures the store, hash, and partition properties of the Bloom filter.

Usage

For clarity we assume in the following discussion that all the classes from libBf exist in the current namespace, e.g., via using namespace bf.

Synopsis

Define a core type:


    typedef core<
      fixed_width<uint8_t, std::allocator<uint8_t>
    , double_hashing<default_hasher, 42, 4711>
    , no_partitioning
    > my_core;

Define a Bloom filter type:


    typedef basic<my_core> my_bloom_filter;

Instantiate with a core:


    my_bloom_filter bf({ 1 << 10, 5, 4 });

Use:


    bf.add("foo")
    bf.add("foo")
    bf.add('z')
    bf.add(3.14159)
    std::cout << bf.count("foo") << std::endl;  // returns 2