Slide 1: Cache & Concurrency considerations for a high performance Cassandra
SriSatish Ambati Performance, Riptano, Cassandra Azul Systems & OpenJDK Twitter: @srisatish srisatish.ambati@gmail.com
Slide 2: Trail ahead
Elements of Cache Performance Metrics, Monitors JVM goes to BigData Land! Examples Lucandra, Twissandra Cassandra Performance with JVM Commentary Runtime Views Non Blocking HashMap Locking: concurrency Garbage Collection
Slide 3: A feather in the CAP
• Eventual Consistency
– Levels – Doesn’t mean data loss (journaled)
• SEDA
– Partitioning, Cluster & Failure detection, Storage engine mod – Event driven & nonblocking io – Pure Java
Slide 4: Count what is countable, measure what is measurable, and what is not measurable, make measurable -Galileo
Slide 5: Elements of Cache Performance Metrics
• Operations:
– Ops/s: Puts/sec, Gets/sec, updates/sec – Latencies, percentiles – Indexing
• # of nodes – scale, elasticity • Replication • • • •
– Synchronous, Asynchronous (fast writes)
Tuneable Consistency Durability/Persistence Size & Number of Objects, Size of Cache # of user clients
Slide 6: Elements of Cache Performance: “Think Locality”
• Hot or Not: The 80/20 rule. – A small set of objects are very popular! – What is the most RT tweet? • Hit or Miss: Hit Ratio – How effective is your cache? – LRU, LFU, FIFO.. Expiration • Long-lived objects lead to better locality. • Spikes happen – Cascading events – Cache Thrash: full table scans
Slide 7: Real World Performance
• Facebook Inbox
– Writes:0.12ms, Reads:15ms @ 50GB data
• Twitter performance
– Twissandra (simulation)
• Cassandra for Search & Portals
– Lucandra, solandra (simulation)
• ycbs/PNUTS benchmarks
– 5ms read/writes @ 5k ops/s (50/50 Update heavy) – 8ms reads/5ms writes @ 5k ops/s (95/5 read heavy)
• Lab environment
– ~5k writes per sec per node, <5ms latencies – ~10k reads per sec per node, <5ms latencies
• Performance has improved in newer versions
Slide 8: yahoo cloud store benchmark
50/50 – Update Heavy
Slide 9: yahoo cloud store benchmark
95/5 – read heavy
Slide 10: JVM in BigData Land!
Limits for scale • Locks : synchronized
– Can’t use all my multi-cores! – java.util.collections also hold locks – Use non-blocking collections! – Hampers object portability – Use avro, thrift!
• (de)Serialization is expensive • Object overhead
– average enterprise collection has 3 elements! – Use byte[ ], primitives where possible!
• Garbage Collection
– Can’t throw memory at the problem! – Mitigate, Monitor, Measure foot print
Slide 11: Tools
• What is the JVM doing:
– dtrace, hprof, introscope, jconsole, visualvm, yourkit, azul zvision
• Invasive JVM observation tools
– bci, jvmti, jvmdi/pi agents, jmx, logging
• What is the OS doing:
– dtrace, oprofile, vtune
• What is the network disk doing:
– Ganglia, iostat, lsof, netstat, nagios
Slide 12: furiously fast writes
client issues write n2
fi node nd ply to ap mory me
n1
• Append only writes – Sequential disk access • No locks in critical path • Key based atomicity
partitioner
commit log
Slide 13: furiously fast writes
• Use separate disks for commitlog
– Don’t forget to size them well – Isolation difficult in the cloud..
• Memtable/SSTable sizes
– Delicately balanced with GC • memtable_throughput_in_mb
Slide 14: Cassandra on EC2 cloud
*Corey Hulen, EC2
Slide 15: Cassandra on EC2 cloud
Slide 17: Compactions
K1 < Serialized data > K2 < Serialized data > K3 < Serialized data > -Sorted --K2 < Serialized data > K10 < Serialized data > K30 < Serialized data > ---Sorted Sorted K4 < Serialized data > K5 < Serialized data > K10 < Serialized data > ----
DELETED
MERGE SORT
Index File Loaded in memory K1 Offset K5 Offset K30 Offset Bloom Filter Sorted
K1 < Serialized data > K2 < Serialized data > K3 < Serialized data > K4 < Serialized data > K5 < Serialized data > K10 < Serialized data > K30 < Serialized data > Data File
Slide 18: Compactions
• • • • • Intense disk io & mem churn Triggers GC for tombstones Minor/Major Compactions Reduce priority for better reads Other Parameters – CompactionManager.
minimumCompactionThreshold=x xxx
Slide 19: Example: compaction in realworld, cloudkick
Slide 20: reads design
Slide 21: reads performance
• BloomFilter used to identify the right file • Maintain column indices to look up columns
– Which can span different SSTables
• Less io than typical b-tree • Cold read: Two seeks
– One for Key lookup, another row lookup
• Key Cache
– Optimized in latest cassandra
• Row Cache
– Improves read performance – GC sensitive for large rows.
• Most (google) applications require single row transactions*
*Sanjay G, BigTable Design, Google.
Slide 22: Client Performance Marshal Arts: Ser/Deserialization
• Clients dominated by Thrift, Avro
– Hector, Pelops
• • • • •
Thrift: upgrade to latest: 0.5, 0.4 No news: java.io.Serializable is S.L..O.…W Use “transient” avro, thrift, proto-buf Common Patterns of Doom:
– Death by a million gets
Slide 23: •
Serialization + Deserialization uBench
http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
Slide 24: Adding Nodes
• New nodes – Add themselves to busiest node – And then Split its Range • Busy Node starts transmit to new node • Bootstrap logic initiated from any node, cli, web • Each node capable of ~40MB/s – Multiple replicas to parallelize bootstrap • UDP for control messages • TCP for request routing
Slide 25: inter-node comm
• Gossip Protocol
– It’s exponential – (epidemic algorithm)
• Failure Detector
– Accrual rate phi
• Anti-Entropy
– Bringing replicas to uptodate
Slide 26: Bloom Filter: in full bloom
• • • • “constant” time size:compact false positives Single lookup
for key in file
• Deletion • Improve – Counting BF – Bloomier filters
Slide 27: Birthdays, Collisions & Hashing functions
• Birthday Paradox For the N=21 people in this room Probability that at least 2 of them share same birthday is ~0.47 • Collisions are real! • An unbalanced HashMap behaves like a list O(n) retrieval • Chaining & Linear probing • Performance Degrades • with 80% table density
•
Slide 28: the devil’s in the details
Slide 29: CFS
• All in the family! • denormalize
Slide 30: Memtable
• In-memory • ColumnFamily specific • throughput determines size before flush • Larger memtables can improve reads
Slide 31: SSTable
• MemTable “flushes” to a SSTable • Immutable after • Read: Multiple SSTable lookups possible • Chief Execs:
– SSTableWriter – SSTableReader
Slide 32: Write: Runtime threads
Slide 33: Writes: runtime mem
Slide 34: Example: Java Overheads
Slide 35: writes: monitors
Slide 36: UUID
• java.util.UUID is slow – static use leads to contention SecureRandom • Uses /dev/urandom for seed initialization • PRNG without file is atleast 20%-40% better. • Use TimeUUIDs where possible – much faster • JUG – java.uuid.generator
• • • http://github.com/cowtowncoder/java-uuid-generator http://jug.safehaus.org/ http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
-Djava.security.egd=file:/dev/urandom
Slide 37: synchronized
• • • • Coarse grained locks io under lock Stop signal on a highway java.util.concurrent does not mean no locks • Non Blocking, Lock free, Wait free collections
Slide 38: Scalable Lock-Free Coding Style
• Big Array to hold Data • Concurrent writes via: CAS & Finite State Machine
– No locks, no volatile – Much faster than locking under heavy load – Directly reach main data array in 1 step
• Resize as needed
– Copy Array to a larger Array on demand – Use State Machine to help copy – “ Mark” old Array words to avoid missing late updates
Slide 39: Non-Blocking HashMap
Azul Vega2 – 768 cpus
1K Table
1200 1200 1000
1M Table
NB-99
1000
800
800
M-ops/sec
600
CHM-99
M-ops/sec
600
400
400
NB-75
200 200
NB
CHM-75
0 0 0 100 200 300 400 500 600 700 800 0 100 200 300 400 500
CHM
600 700 800
Threads
Threads
Slide 40: Cassandra uses High Scale Non-Blocking Hashmap
public class BinaryMemtable implements IFlushable { … private final Map<DecoratedKey,byte[]> columnFamilies = new NonBlockingHashMap<DecoratedKey, byte[]>(); /* Lock and Condition for notifying new clients about Memtable switches */ private final Lock lock = new ReentrantLock(); Condition condition; … } public class Table { … private static final Map<String, Table> instances = new NonBlockingHashMap<String, Table>(); … }
Slide 41: GC-sensitive elements within Cassandra
• Compaction triggers System.gc()
– Tombstones from files
• • • •
“GCInspector” Memtable Threshold, sizes SSTable sizes Low overhead collection choices
Slide 42: Garbage Collection
• Pause Times • Allocation Rate
if stop_the_word_FullGC > ttl_of_node => failed requests; failure accrual & node repair. – New object creation, insertion rate – if residency in heap > 50% – GC overheads dominate. – space, cpu cycles spent GC – Bigger is not better!
• Live Objects (residency) • Overhead
• 64-bit not addressing pause times
Slide 43: Memory Fragmentation
• Fragmentation
– Performance degrades over time – Inducing “Full GC” makes problem go away – Free memory that cannot be used – Use a compacting collector – Promote less often – Use uniform sized objects – Use latest CMS with CR:6631166 – Azul’s Zing JVM & Pauseless GC
• Reduce occurrence
• Solution – unsolved
Slide 44: CASSANDRA-1014
Slide 45: Best Practices: Garbage Collection • GC Logs are cheap even in production
-Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC
• Slightly expensive ones:
-XX:PrintFLSStatistics=2 -XX:CMSStatistics=1 -XX:CMSInitiationStatistics
Slide 46: Sizing: Young Generation
• Should we set –Xms == -Xmx ? • Use –Xmn (fixed eden)
allocations {new Object();}
survivor ratio
eden
survivor spaces promotion Tenuring Threshold allocation by jvm
old generation
Slide 47: Tuning CMS
• Don’t promote too often!
– Frequent promotion causes fragmentation
• Size the generations
– Min GC times are a function of Live Set – Old Gen should host steady state comfortably
• Parallelize on multicores:
– -XX:ParallelCMSThreads=4 – -XX:ParallelGCThreads=4
• Avoid CMS Initiating heuristic
– -XX:+UseCMSInitiationOccupanyOnly
• Use Concurrent for System.gc()
– -XX:+ExplicitGCInvokesConcurrent
Slide 48: Summary
Design & Implementation of Cassandra takes advantages of strengths while avoiding common JVM issues. • Locks: – Avoids locks in critical path – Uses non-blocking collections, TimeUUIDs! – Still Can’t use all my multi-cores..? >> Other bottlenecks to find! • De/Serialization: – Uses avro, thrift! • Object overhead – Uses mostly byte[ ], primitives where possible! • Garbage Collection – Mitigate: Monitor, Measure foot print. – Work in progress by all jvm vendors! Cassandra starts from a great footing from a JVM standpoint and will reap the benefits of the platform!
Slide 49: References • Verner Wogels, Eventually Consistent http://www.allthingsdistributed.com/2008/12/eventually_consistent.htm • Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors" • Avinash Lakshman, http://static.last.fm/johan/nosql20090611/cassandra_nosql.pdf • Eric Brewer, CAP http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • Tony Printzeis, Charlie Hunt, Javaone Talk http://www.scribd.com/ doc/36090475/GC-Tuning-in-the-Java • http://github.com/digitalreasoning/PyStratus/wiki/Documentation • http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf • Cassandra on Cloud, http://www.coreyhulen.org/?p=326 • Cliff Click’s, Non-blocking HashMap http://sourceforge.net/projects/high-scale-lib/ • Brian F. Cooper., Yahoo Cloud Storage Benchmark, http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
Q&A
Slide 50: DataModel: Know your use patterns
Alternative Twitter DataModel: <Keyspace Name="Multiblog"> <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" /> <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/> </Keyspace>