Why data partition and routing?

large dataset ⟶ scale out ⟶ data shard / partition ⟶ 1) routing for data access 2) replica for availability

How to do data partition and routing?

The routing abstract model is essentially just two maps: 1) key-partition map 2) partition-machine map

hash and mod
- (+) simple
- (-) flexibility (tight coupling two maps: adding and removing nodes (partition-machine map) disrupt existing key-partition map)
Virtual buckets: key--(hash)-->vBucket, vBucket--(table lookup)-->servers
- Usercase: Membase a.k.a Couchbase, Riak
- (+) flexibility, decoupling two maps
- (-) centralized lookup table
Consistent hashing and DHT
- [Chord] implementation
- virtual nodes: for load balance in heterogeneous data center
- Usercase: Dynamo, Cassandra
- (+) flexibility, hashing space decouples two maps. two maps use the same hash, but adding and removing nodes ==only impact succeeding nodes==.
- (-) network complexity, hard to maintain

sort by primary key, shard by range of primary key

range-server lookup table (e.g. HBase .META. table) + local tree-based index (e.g. LSM, B+)

(+) search for a range (-) log(n)

Usercase: Yahoo PNUTS, Azure, Bigtable