class: center, middle
Daniel Pezely
Splunk Mobile Intelligence
(formerly BugSense)
Erlang Factory San Francisco
26-27 March 2015
class: center, middle
The Far Side, by Gary Larson
"Now Gorak! This time remember
to punch holes in the lid!"
BugSense: tracking crashes and events from mobile devices
- Founded September 2011 in Athens, Greece
- Acquired September 2013 by Splunk
- Re-branded in 2014 as Splunk Mobile Intelligence (MINT)
The secret sauce for BugSense has always included LDB
LDB is the Lethe Database: (pronounced LEE-thee)
- Functions as a stream processor
- Datastore tracks patterns and counts uniques:
- Most are strictly increasing counters
- Some read-modify-write operations
- Handles small updates from millions of mobile devices daily
- i.e., traffic pattern easily mistaken for DDoS attack
Earlier LDB presentations by Jon Vlachogiannis, co-founder of BugSense: {1, 2}
Relevant early experience to shape a career:
Virtual Reality R&D, circa 1990
- Designed for massive number of participants; i.e., MMOG
- Large-scale distributed systems
- Building for "mutual, contradictory realities"
After modeling reality and designing systems to sustain very high throughput, everything since then seems like a subset!
Other experience:
- Main Street, Wall Street, Silicon Valley, and places in between
- Joined Splunk in January 2013 and BugSense group in 2014
- C and Lisp since 1987, Common Lisp since 2005, Python since 1999
- Started using Erlang as of Erlang Factory SF 2014
Cluster receives many billions of requests per day
Each LDB node at Amazon Web Services (AWS)
- c4.4xlarge (16 core 30GiB Xeon Haswell)
- 30,000 customers potentially share one server (hence 30GiB RAM)
- Dedicated servers use same software, different config
10x Performance Improvement
- Was 200 requests/second
- Now 2,000+ requests/second
Our SDK resides on hundreds of millions of mobile devices, each of which sends regular updates pertaining to:
- Pings
- Crashes
- Events
- e.g., breadcrumbs dropped along visitor journey
- Advanced Events
- customer-defined key/value pairs
- Network Performance
- SDK receives information from iOS, Android network calls
- tracks API Exception and HTTP Response codes
- tracks duration when successful
- Transactions
- context is customer defined, as with transactions in Splunk
- success/fail/cancel
- tracks duration when successful
Customer access is only via Dashboard UI or Splunk App:
i.e., LDB renders to JSON format
Cluster receives billions of requests per day
but use as few servers as reasonably possible
Traffic is steady 24x7 from around the world
- With this volume: queuing == death
- You'll never catch up!
Prioritize for timely response back to each mobile device
Database must be highly available yet cheap & disposable:
- Serve from in-memory datastore on same node exclusively
- Platform must recover in milliseconds (and BEAM does!)
- Load customer data on-demand
Nature of data:
- Compound-keys mapped to strictly-increasing counters
- Probabilistic counters; e.g., HyperLogLog (HLL)
- Unordered sets, implemented as arrays
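As a minimal sketch of that data shape (the compound-key layout and table name below are hypothetical, not the actual LDB schema), a strictly-increasing counter keyed per customer, metric, and day might look like:

```erlang
%% Minimal sketch of the data shape; key layout and table name are
%% hypothetical, not the actual LDB schema.
-module(ldb_shape_sketch).
-export([init/0, bump/3]).

init() ->
    ets:new(ldb_counters, [named_table, set, public]).

%% Compound key mapped to a strictly-increasing counter.
%% Uniques would go through a probabilistic counter (HLL) instead.
bump(ApiKey, Metric, Day) ->
    Key = {ApiKey, Metric, Day},
    case ets:member(ldb_counters, Key) of
        true  -> ok;
        false -> ets:insert_new(ldb_counters, {Key, 0}), ok
    end,
    ets:update_counter(ldb_counters, Key, {2, 1}).
```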
Erlang
- Robustness, time to market, "Let it crash!"
- Returns to stable state within milliseconds
- Top-level code in Erlang
- Long-lived
- `port_command/2` for everything else (see the sketch after this list)
Lisp / Scheme
- Malleable language for queries
- Ideal for early stage start-up exploring problem & solution space
C
- Erlang linked-in module
- Implemented:
- Scheme runtime
- Query language library
- Probabilistic counters
- Hash tables
- Sets
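That split looked roughly like the sketch below: Erlang holds a long-lived port to the linked-in C code and pushes work through `port_command/2`. The driver name and request encoding here are assumptions for illustration only.

```erlang
%% Sketch of Erlang driving a linked-in C driver via a long-lived port.
%% The driver name "ldb_drv" and term_to_binary framing are assumptions.
-module(ldb_port_sketch).
-export([open/1, request/2]).

open(PrivDir) ->
    ok = erl_ddll:load_driver(PrivDir, "ldb_drv"),
    open_port({spawn_driver, "ldb_drv"}, [binary]).

request(Port, Req) ->
    true = port_command(Port, term_to_binary(Req)),
    receive
        {Port, {data, Reply}} -> binary_to_term(Reply)
    after 5000 ->
        {error, timeout}
    end.
```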
One master process per customer API key:
- Spawn Erlang process on-demand as we see new API key
- Owns port to linked-in C module
- No thread priority, no thread pool
- Inbound request to Cowboy spawns per-request worker
- Each process exits when idle
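A rough sketch of that spawn-on-demand pattern follows; the registration scheme and names are illustrative, not the actual LDB code.

```erlang
%% Sketch: one master process per customer API key, spawned on demand.
%% Registration scheme and names are illustrative only.
-module(ldb_registry_sketch).
-export([master_for/1]).

master_for(ApiKey) ->
    Name = master_name(ApiKey),
    case whereis(Name) of
        undefined ->
            Pid = spawn(fun() -> master_loop(ApiKey) end),
            case catch register(Name, Pid) of
                true -> Pid;
                _    -> exit(Pid, kill),      % lost a registration race
                        whereis(Name)
            end;
        Pid ->
            Pid
    end.

master_name(ApiKey) ->
    %% caution: atoms are never garbage-collected (see later slides)
    list_to_atom("ldb_master_" ++ ApiKey).

%% The real master would also own the port to the linked-in C code.
master_loop(ApiKey) ->
    receive
        {request, From, Req} ->
            From ! {reply, handle(ApiKey, Req)},
            master_loop(ApiKey)
    after 300000 ->
        ok                                    % exit when idle
    end.

handle(_ApiKey, Req) ->
    {ok, Req}.                                % placeholder for real work
```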
Datastore:
- Hash-tables implemented in C (local modifications to `uthash`)
- HyperLogLog (HLL) implemented in C
- Each customer gets trio of tables: key/value pairs, sets, HLL
- ETS table for Ops-related metrics; e.g., requests per second
Miscellaneous:
- Lethe Query Language (LQL) was Scheme R4 dialect of Lisp
- Upon expected full memory or unexpected issues: "Let it crash!"
- Recovery after crash is measured in milliseconds
- High traffic customers may be throttled (sub-sampling)
Problem:
- Erlang servers "typically" measure tens of thousands of requests per second
- Older LDB was merely hundreds
Revelations:
Queries are well understood now and may be specified precisely,
so the malleability of Lisp as a query language (LQL) is no longer necessary.
While we love Lisp, we no longer need the overhead of this C implementation.
Premise:
There were known limitations to the Scheme component:
not compiled, not cached, no garbage collection, etc.
However, experience elsewhere suggested primary culprit might have been
misuse of Compare-And-Swap (CAS) within a library...
Experimental branch for measurements:
- Eliminated parsing of query language on inbound requests
- Using only cheap counters (no service tier limits, etc.)
- Forked/rewrote linked-in driver C code for minimal code path
- Linked-in module returned control back to Erlang scheduler
within equivalent of a few "reductions"
Observed:
- Only marginal increase in performance
Initial Conclusions:
- Confirmed C implementation of query language runtime not culprit
- So-called "atomic" operations [*] implicated as primary bottleneck
- Compare-and-Swap (CAS) may be harmful on NUMA/many-cores
Course of Action:
- Extend LDB core for pure-Erlang implementation
[*] There are multiple definitions of "atomic" (more below)
Using CAS operations as a simple replacement for locks is worse than naïve,
because naïve approaches are uninfluenced by locking methodology!
- Consider before using CAS family of operations:
- Beware:
- "Lock-free" is not necessarily wait-free {5}
- CAS is "atomic" (transaction) in database sense: all-or-nothing
- CAS is "atomic" (indivisible) as one line of Assembly:
LOCK CMPXCHG
- (Not atomic in sense of occurring instantaneously or transparently)
More Background:
- CAS previously believed to be answer for "lock-free" mechanisms {6}
- CAS is very tempting for those new to deep concurrency
- CAS overused by various languages & libraries, so confirm source/docs
- e.g., Look for `__sync_add_and_fetch()` in C source
- or better yet: `LOCK CMPXCHG` within disassembly dump on x86, x86-64
- CAS use cases are deeply intertwined with CPU cache lines, etc. {7}
Do these last:
- Each would likely bring less than 2x-5x performance improvement:
- Thread pooling at level of Erlang port/driver:
  `driver_async()`: use non-NULL value for `key` param for thread reuse
- LQL compiler
- LQL caching
- LQL eval with ephemeral memory allocation
- LQL garbage-collection
- If less than 10x, it automatically becomes low priority
Instead:
- Strive for 10x, 100x, 1000x improvements
- Otherwise wastes time compared to "Moore's Law" (e.g., Adapteva Parallella)
- Maybe cheat or punt:
- Upgrade to next generation machine
- Upgrade to larger server(s)
- Introduce a load-balancer
- Terminate SSL via front-end component/hardware
One master Erlang process per customer API key: (same as before)
- Spawn process on-demand as we see new API key
- Owns ETS tables, manages snapshots, etc.
- Inbound RPC and/or Cowboy spawns per-request worker
- Each process exits when idle
Datastore: (complete overhaul)
- Uses ETS (not Mnesia) with `tab2file()` for snapshots
- Differentiates day of retention by database table (previously by row)
- Lots of ETS tables: 3 x 30,000 customers x Day of retention modulo
- Customers migrated out-of-band with respect to Erlang
- Database table naming uses modulo based upon days of retention
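A minimal sketch of the table naming and snapshot scheme follows; the exact name format, the retention constant, and the file path are assumptions.

```erlang
%% Sketch: one ETS table per {customer, kind, day-slot}, where the slot is
%% the day number modulo retention; snapshots via ets:tab2file/2.
%% Name format, retention value, and file path are assumptions.
-module(ldb_tables_sketch).
-export([table_name/3, snapshot/1]).

-define(RETENTION_DAYS, 7).

%% Kind is one of: kv | sets | hll  (the trio of tables per customer)
table_name(ApiKey, Kind, DayNumber) ->
    Slot = DayNumber rem ?RETENTION_DAYS,
    list_to_atom(lists:concat([ApiKey, "_", Kind, "_", Slot])).

snapshot(Tab) ->
    File = atom_to_list(Tab) ++ ".ets",
    ets:tab2file(Tab, File).
```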
Miscellaneous: (much improved!)
- RPC between other systems (no HTTP beyond client connection)
- Supervisor uses `simple_one_for_one` (previously `one_for_one`); sketch after this list
- OTP: uses `gen_server` now, which wasn't in original LDB
- Eliminates need for Linked-in C modules
- Accepts all data (no throttling, no sub-sampling)
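A minimal sketch of the `simple_one_for_one` arrangement, starting per-customer masters on demand under one supervisor (module and child names are illustrative):

```erlang
%% Sketch of a simple_one_for_one supervisor for on-demand customer masters.
%% Module and child names are illustrative, not the actual LDB code.
-module(ldb_master_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_master/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Extra args are appended to the child's start_link call.
start_master(ApiKey) ->
    supervisor:start_child(?MODULE, [ApiKey]).

init([]) ->
    Child = {ldb_master, {ldb_master, start_link, []},
             temporary, 5000, worker, [ldb_master]},
    {ok, {{simple_one_for_one, 10, 60}, [Child]}}.
```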
Went Right
- Functional Programming (i.e., no side-effects)
- When new to any language, test all assumptions before starting feature
- Reached 85% code coverage in tests within 4 months (one person)
- Topped 90% coverage after another month -- asymptotic effort writing tests
- Learned to navigate docs tree (no search), and discovered other functions
Went Wrong
- Initial rewrite used Mnesia, but mailbox killed BEAM after 12 hours
- Mnesia-to-ETS migration hit architectural issue for our use cases:
  Required re-introducing bits of Mnesia; e.g., `wait_for_tables()`
  and synchronization for HLL's read-modify-write workflow
- Code reviews missed some Erlang idiom mistakes:
  With `gen_server`, create wrapper functions around all "public" messages,
  and export those functions as means to enforce our API (sketch below)
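For example, the idiom we should have enforced from the start: export a wrapper function for every "public" message so callers never hand-roll tuples. A sketch (not the actual LDB module):

```erlang
%% Sketch of the gen_server idiom: wrapper functions exported for every
%% "public" message; callers never construct the message tuples themselves.
-module(ldb_master).     %% illustrative name, matching the supervisor sketch
-behaviour(gen_server).
-export([start_link/1, bump/3, counts/2]).
-export([init/1, handle_call/3, handle_cast/2,
         handle_info/2, terminate/2, code_change/3]).

%% ---- public API: the only supported way to talk to this server --------
start_link(ApiKey)     -> gen_server:start_link(?MODULE, ApiKey, []).
bump(Pid, Metric, Day) -> gen_server:cast(Pid, {bump, Metric, Day}).
counts(Pid, Day)       -> gen_server:call(Pid, {counts, Day}).

%% ---- callbacks ---------------------------------------------------------
init(ApiKey) ->
    {ok, #{api_key => ApiKey, counts => #{}}}.

handle_cast({bump, Metric, Day}, State = #{counts := C}) ->
    Key = {Metric, Day},
    N = maps:get(Key, C, 0),
    {noreply, State#{counts := maps:put(Key, N + 1, C)}};
handle_cast(_Msg, State) ->
    {noreply, State}.

handle_call({counts, Day}, _From, State = #{counts := C}) ->
    Reply = [{M, N} || {{M, D}, N} <- maps:to_list(C), D =:= Day],
    {reply, Reply, State};
handle_call(_Req, _From, State) ->
    {reply, {error, unknown_request}, State}.

handle_info(_Info, State)        -> {noreply, State}.
terminate(_Reason, _State)       -> ok.
code_change(_Old, State, _Extra) -> {ok, State}.
```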
Could Be Better
- Yet to do releases in proper Erlang way, but we control our servers
- Yet to do hot code loading, but server bounce is painless
- "
Why, Joe? Why?
" and taking the name of Erlang in vain
could be heard occasionally from Daniel's deskstdlib
has inconsistent/asymmetrical function names- Some errors require
try/catch
, others are results from expressions - Returning
ok
versustrue
versus{ok,Foo}
, etc. - Docs found under
erts
,kernel
,stdlib
may seem arbitrary at first
- Mnesia and ETS use atoms for table names
- Atoms are never garbage-collected within BEAM vm
- LDB creates lots of database tables, thus potential problem
- Compiler directive macros can't be applied at granularity of an expression
  - It's fine for `eunit` type of scenario
  - Can't have debug-only function clauses this way (use Guards instead)
- Constraints of syntax to a Common Lisp guy... (enough said)
- Using `when` with comma versus semicolon seems fragile (toy example after this list)
- Why not Lisp Flavoured Erlang (LFE)???
  - Other in-house code already in Erlang (not LFE)
  - Existing LDB code used Erlang R16
  - LFE was still only `v0.7a` at the time; e.g.,
    colon operator didn't accommodate `module:function` yet
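A toy example of the comma/semicolon point (not LDB code): a comma between guard tests means "and", a semicolon means "or", so transposing them silently changes which clause matches.

```erlang
%% Toy example of guard separators; not LDB code.
%% Comma = "and every test must hold"; semicolon = "or any test may hold".
-module(guard_sketch).
-export([classify/1]).

classify(N) when is_integer(N), N > 0 ->       % "and": both tests must pass
    positive_integer;
classify(N) when is_integer(N); is_float(N) -> % "or": either test may pass
    other_number;
classify(_Other) ->
    not_a_number.
```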
Classic approaches -- reducing contention:
- Eliminate system calls that would cause OS to block BEAM process
- Avoid memory allocation/construction ("non-consing")
- Optimize queue or stack for one writer, one reader
- Use CPU affinity, thread-per-core, or unthreaded co-routines
However, you would be doing things that Erlang does for you!
Consider c10k / c10m tricks such as user-land TCP/IP stack:
- FreeBSD libuinet, 2014 {8, 9} -- shares networking code with OS
- Geoff Cant's Enet, 2010 {10, 11}
- Erlang library from Javier Paris, et al, 2005 {12}
- Maybe port CMU's Foxnet from Standard ML, 1994, 1996 {13}
1. http://www.erlang-factory.com/sfbay2014/jon-vlachogiannis
2. http://highscalability.com/blog/2012/11/26/bigdata-using-erlang-c-and-lisp-to-fight-the-tsunami-of-mobi.html
3. http://www.azulsystems.com/blog/cliff/2011-11-16-a-short-conversation-on-biased-locking
4. http://danluu.com/new-cpu-features/
5. http://preshing.com/20120612/an-introduction-to-lock-free-programming/
6. http://people.csail.mit.edu/edya/publications/OptimisticFIFOQueue-journal.pdf
7. http://people.freebsd.org/~lstewart/articles/cpumemory.pdf
8. https://github.com/pkelsey/libuinet
9. http://www.bsdcan.org/2014/schedule/attachments/260_libuinet_bsdcan2014.pdf
10. https://github.com/archaelus/enet
11. http://www.erlang-factory.com/conference/SFBay2010/speakers/geoffcant
12. http://www.erlang.org/workshop/2005/tcpip_presentation.pdf
13. http://www.cs.cmu.edu/~fox/foxnet.html