Skip to content

cgshep/pymsh

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pymsh

PyPI PyPI - Python Version PyPI - Downloads PyPI - License Test Status

pymsh is a Python library that provides multiset hash (MSH) functions. These functions let you hash collections of items without worrying about the order of those items.

What is a Multiset Hash?

A multiset is like a set, except it can contain multiple instances of the same item. For example:

  • A set might be [apple, banana, cherry] (with no duplicates).
  • A multiset could be [apple, apple, apple, banana, banana, cherry] (duplicates matter).

Multiset hashing (MSH) produces a hash value that reflects both the types of items you have and the quantities of each item, but not their order. If we hash the following using an MSH scheme, then the same hash values will be produced: hash(apple, banana, banana, apple, apple, cherry) will equal hash(apple, apple, apple, banana, banana, cherry)

We can see that the order does not matter as long as the elements and their quantities are the same.

Why Is This Useful?

If you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.

Furthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.

Installation

pip install pymsh

Basic Usage

For most general use cases, we recommend using the additive multiset hash (accessible via the shortcut class Hasher).

  1. Prepare a multiset: You can use our helper list_to_multiset if you have a Python list containing repeated elements.
  2. Hash it: Pass the resulting dictionary (element -> count) into a hasher.

Note: pymsh expects your inputs to be passed as bytes.

Example:

from pymsh import list_to_multiset, Hasher

# Suppose you have this list with repeated elements
fruit_list = [b"apple", b"apple", b"banana", b"cherry", b"banana", b"apple"]

# 1) Convert the list to a multiset:
multiset = list_to_multiset(fruit_list)
# => {b"apple": 3, b"banana": 2, b"cherry": 1}

# 2) Hash your multiset (Hasher is an alias for MSetAddHash)
msh = Hasher().hash(multiset)
print("Multiset hash:", msh)

That’s it! You’ll get a tuple representing the multiset, independent of how you ordered "apple, banana, cherry." The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key. The key can be accessed using msh.key in the above example (along with msh[1] for the nonce).

Advanced Usage

pymsh implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to Incremental vs. One-shot Hashing if you don’t need these details right now.

MSetXORHash (Keyed, Set-collision Resistant)
  • What it does: A keyed hash using XOR operations internally.
  • Best for: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).
  • Supports incremental hashing?: Yes.
  • Uses a secret key: Yes.
  • It is NOT multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.
MSetAddHash (Keyed, Multiset-collision Resistant)
  • What it does: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.
  • Best for: Most general-purpose scenarios. This is the same as the default Hasher class.
  • Supports incremental hashing?: Yes.
  • Uses a secret key: Yes.
MSetMuHash (Keyless, Multiset-collision Resistant)
  • What it does: Uses multiplication in a finite field with a large prime modulus. Slow.
  • Best for: Keyless scenarios. Good when you want collision resistance without managing keys.
  • Supports incremental hashing?: No.
  • Uses a secret key: No.
MSetVAddHash (Keyless, Multiset-collision Resistant)
  • What it does: Uses vector addition space.
  • Best for: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.
  • Supports incremental hashing?: Yes.
  • Requires a Key: No.

Examples

import secrets

from pymsh import (
    MSetXORHash,
    MSetAddHash,
    MSetMuHash,
    MSetVAddHash
)

# Sample secret key for keyed hashes
key = secrets.token_bytes(32)

# A sample multiset: item -> count
multiset = {
    b"apple": 3,
    b"banana": 2,
    b"cherry": 1
}

# 1) XOR Hash (Keyed, set-collision resistant)
xor_hasher = MSetXORHash(key)
xor_result = xor_hasher.hash(multiset)
print("XOR Hash (one-shot):", xor_result)

# 2) Additive Hash (Keyed, multiset-collision resistant)
add_hasher = MSetAddHash(key)
one_shot = add_hasher.hash(multiset)
print("Additive Hash (one-shot):", one_shot)

# Incremental usage of Additive Hash
add_hasher.update(b"apple", 3)
add_hasher.update(b"banana", 2)
add_hasher.update(b"cherry", 1)
incremental_hash = add_hasher.digest()
print("Additive Hash (incremental):", incremental_hash)
assert one_shot == incremental_hash  # They should match

# 3) MuHash (Keyless)
mu_hasher = MSetMuHash()
print("MuHash:", mu_hasher.hash(multiset))

# 4) Vector Add Hash (Keyless)
vadd_hasher = MSetVAddHash()
print("VAdd Hash:", vadd_hasher.hash(multiset))

Incremental vs. One-shot Hashing

One‐shot: Pass a full dictionary (e.g., {item: count}) at once using .hash(multiset).

Incremental: Create an instance, then call .update(item, count) for each new element as needed, and finally call .digest() to get the final hash.

Which Should I Pick?

For most general-purpose tasks, use MSetAddHash (the default Hasher).

If you prefer keyless usage without the incremental feature, then consider MSetMuHash.

However, if you need incremental and keyless, try MSetVAddHash. Here's a comparison table:

Hash Type Security Key Required Incremental Notes
MSetXORHash Set-collision resistance Yes Yes Fast set verification
MSetAddHash Multiset-collision resistance Yes Yes General purpose
MSetMuHash Multiset-collision No No Keyless; not efficient
MSetVAddHash Multiset-collision No Yes Efficient, but longer hashes

References

  1. D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. “Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking,” ASIACRYPT 2003.

Contribute

Feel free to open an issue or pull request if you have questions or suggestions!