Module synctree

This module implements a self-validating hashtree that is used within riak_ensemble as the primary data integrity mechanism.

Description

This module implements a self-validating hashtree that is used within riak_ensemble as the primary data integrity mechanism. This tree is designed similar to the hashtree used in file systems such as ZFS, with parent nodes containing the hash of their children. Thus, during traversal a tree path can be completely verified from the root node to the endpoint.

This tree is designed to be replicated on multiple nodes and provides built-in exchange logic that efficiently determines the differences between two trees -- thus trees can exchange with peers and heal missing/corrupted data.

To perform an exchange, trees must be of the same shape regardless of the data inserted. Thus, the tree implemented by this module has a fixed structure and is therefore more akin to a hash trie. To enable this fixed structure, data inserted into the tree is uniformly mapped to one of a fixed number of segments. These segments are sorted key/value lists. Hashes are computed over each segment, with each hash being stored as a leaf in the actual hash tree. Since there are a fixed number of segments, there is a fixed number of leaf hashes. The remaining levels in the hash tree are then generated on top of these leaf hashes as normal.

This design is therefore similar to hashtree.erl from riak_core.

The main high-levels differences are as follows: 1. The synctree is built entirely on pure key/value (get/put) operations, there is no concept of iteration nor any need for a sorted backend.

2. The synctree is always up-to-date. An insert into the tree immediately updates the appropriate segment and relevant tree path. There is no concept of a delayed, bulk update as used by hashtree.erl

3. All operations on the tree are self-validating. Every traversal through the tree whether for reads, inserts, or exchanges verifies the hashes down all encountered tree paths.

4. The synctree supports pluggable backends. It was originally designed and tested against both orddict and ETS backends, and then later extended with a LevelDB backend for persistent storage as used in riak_ensemble.

Most of the differences from hashtree.erl are not strictly better, but rather are designed to address differences between Riak AAE (which uses hashtree) and riak_ensemble integrity checking (which uses synctree).

Specifically, AAE is designed to be a fast, mostly background process with limited impact on normal Riak operations. While the integrity logic requires an always up-to-date tree that is used to verify every get/put operation as they occur, ensuring 100% validity. In other terms, for AAE the hashtree is not expected to be the truth but rather a best-effort projection of the truth (the K/V backend is the truth). Whereas for the integrity logic, the synctree is the truth -- if the backend differs, it's wrong and we consider the backend data corrupted.

Data Types

action()

action() = {put, term(), term()} | {delete, term()}

bucket()

bucket() = non_neg_integer()

corrupted()

corrupted() = {corrupted, level(), bucket()}

hash()

hash() = binary()

hashes()

hashes() = [{term(), hash()}]

key()

key() = term()

level()

level() = non_neg_integer()

maybe_integer()

maybe_integer() = pos_integer() | default

options()

options() = proplists:proplist()

tree()

tree() = #tree{id = term(), width = pos_integer(), segments = pos_integer(), height = pos_integer(), shift = pos_integer(), shift_max = pos_integer(), top_hash = hash(), buffer = [action()], buffered = non_neg_integer(), mod = module(), modstate = any()}

value()

value() = binary()

Function Index

compare/3
compare/4
compare/5
corrupt/2
direct_exchange/1
exchange_get/3
get/2
height/1
insert/3
local_compare/2
m_batch/2
m_flush/1
new/0
new/1
new/3
new/4
new/5
newdb/1
newdb/2
rehash/1
rehash_upper/1
top_hash/1
verify/1
verify_upper/1

Function Details

compare/3

compare(Height, Local, Remote) -> any()

compare/4

compare(Height, Local, Remote, AccFun) -> any()

compare/5

compare(Height, Local, Remote, AccFun, Opts) -> any()

corrupt/2

corrupt(Key::key(), Tree::tree()) -> tree()

direct_exchange/1

direct_exchange(Tree) -> any()

exchange_get/3

exchange_get(Level::bucket(), Bucket::level(), Tree::tree()) -> hashes() | corrupted()

get/2

get(Key::key(), Tree::tree()) -> value() | notfound | corrupted()

height/1

height(Tree::tree()) -> pos_integer()

insert/3

insert(Key::key(), Value::value(), Tree::tree()) -> tree() | corrupted()

local_compare/2

local_compare(T1, T2) -> any()

m_batch/2

m_batch(Update, Tree) -> any()

m_flush/1

m_flush(Tree) -> any()

new/0

new() -> tree()

new/1

new(Id::term()) -> tree()

new/3

new(Id::term(), Width::maybe_integer(), Segments::maybe_integer()) -> tree()

new/4

new(Id::term(), Width::maybe_integer(), Segments::maybe_integer(), Mod::module()) -> tree()

new/5

new(Id::term(), Width::maybe_integer(), Segments::maybe_integer(), Mod::module(), Opts::options()) -> tree()

newdb/1

newdb(Id::term()) -> tree()

newdb/2

newdb(Id::term(), Opts::options()) -> tree()

rehash/1

rehash(Tree::tree()) -> tree()

rehash_upper/1

rehash_upper(Tree::tree()) -> tree()

top_hash/1

top_hash(Tree::tree()) -> hash() | undefined

verify/1

verify(Tree::tree()) -> boolean()

verify_upper/1

verify_upper(Tree::tree()) -> boolean()


Generated by EDoc