← Back to all projects

binaryfuse

pythonrustdata-structuresperformance

I built binaryfuse to bring the efficiency of Binary Fuse Filters (based on Lemire et al.) to Python.

If you’re working with large, immutable datasets (like blocklists or static dictionaries), Bloom filters are often the default choice. However, Binary Fuse Filters offer a significant upgrade: they are smaller (~1.13x overhead vs ~1.44x for Bloom) and faster for lookups.

The library is implemented in Rust to ensure memory safety and performance, but I ship abi3 wheels so you can install it on any modern Python version (3.8+) without needing a Rust toolchain.

How it Works

Unlike Bloom filters which set bits in a single array, Binary Fuse Filters use a fused structure of three segments. Keys are mapped to a location in each segment, and their values are XORed to store a fingerprint. To check if a key exists, we compute the same hash locations and verifying if the XOR sum matches the fingerprint.

flowchart LR
    K[Key] --> H1[Hash 1]
    K --> H2[Hash 2]
    K --> H3[Hash 3]
    
    subgraph Segments
        direction TB
        S1[Segment 0]
        S2[Segment 1]
        S3[Segment 2]
    end
    
    H1 --> S1
    H2 --> S2
    H3 --> S3
    
    S1 & S2 & S3 -->|XOR| R[Result]
    R -->|Match Fingerprint?| D{In Set?}

Here’s the logic expressed in simplified Python:

def fingerprint(key):
    # Reduce key to an 8-bit or 16-bit value
    return hash(key) & 0xFF

def contains(key, segments):
    # 1. Map key to three segments
    h1, h2, h3 = hash_to_locations(key)
    
    # 2. Retrieve values from the fused structure
    val1 = segments[0][h1]
    val2 = segments[1][h2]
    val3 = segments[2][h3]
    
    # 3. Combine them
    computed_fingerprint = val1 ^ val2 ^ val3
    
    # 4. Check against key's fingerprint
    return computed_fingerprint == fingerprint(key)

Performance

The main trade-off is immutability: you generally construct the filter once and query it many times. In exchange, you get better space efficiency and cache-friendly lookups.

MetricBinary Fuse 8-bitBloom Filter
Bits per key~9~10-14 (depending on FPR)
False Positive Rate~0.4%~1% (at 10 bits)
Lookup Speed~20ns~50+ns

usage

import binaryfuse

# Create once from a list of unique strings
data = ["apple", "banana", "cherry"]
filter = binaryfuse.BinaryFuse8(data)

# Fast lookups
"apple" in filter  # True
"grape" in filter  # False