cfox's picture
From cfox rss RSS  subscribe Subscribe

ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM) 

 

 
 
Tags:  false  cloud 
Views:  119
Published:  December 14, 2011
 
0
download

Share plick with friends Share
save to favorite
Report Abuse Report Abuse
 
Related Plicks
Master key-system

Master key-system

From: olafur
Views: 235 Comments: 0

 
BC5. Ezekiel part 1

BC5. Ezekiel part 1

From: anon-548435
Views: 194 Comments: 0

 
See all 
 
More from this user
Introductionto Windows Share Point Services3.0

Introductionto Windows Share Point Services3.0

From: cfox
Views: 166
Comments: 0

The voice of law: experiences of the use of podcasting in teaching law

The voice of law: experiences of the use of podcasting in teaching law

From: cfox
Views: 363
Comments: 0

SECTION A-9 ENTERPRISE RISK MANAGEMENT – INTEGRATED FRAMEWORK ...

SECTION A-9 ENTERPRISE RISK MANAGEMENT – INTEGRATED FRAMEWORK ...

From: cfox
Views: 145
Comments: 0

AMNESTY INTERNATIONAL REPORT 2006

AMNESTY INTERNATIONAL REPORT 2006

From: cfox
Views: 475
Comments: 0

Google Devfest 2009 Argentina - Google and the Social Web

Google Devfest 2009 Argentina - Google and the Social Web

From: cfox
Views: 71
Comments: 0

7_Vittana_Final

7_Vittana_Final

From: cfox
Views: 249
Comments: 0

See all 
 
 
 URL:          AddThis Social Bookmark Button
Embed Thin Player: (fits in most blogs)
Embed Full Player :
 
 

Name

Email (will NOT be shown to other users)

 

 
 
Comments: (watch)
 
 
Notes:
 
Slide 1: Cache & Concurrency considerations for a high performance Cassandra SriSatish Ambati Performance, Riptano, Cassandra Azul Systems & OpenJDK Twitter: @srisatish srisatish.ambati@gmail.com
Slide 2: Trail ahead Elements of Cache Performance Metrics, Monitors JVM goes to BigData Land! Examples Lucandra, Twissandra Cassandra Performance with JVM Commentary Runtime Views Non Blocking HashMap Locking: concurrency Garbage Collection
Slide 3: A feather in the CAP • Eventual Consistency – Levels – Doesn’t mean data loss (journaled) • SEDA – Partitioning, Cluster & Failure detection, Storage engine mod – Event driven & nonblocking io – Pure Java
Slide 4: Count what is countable, measure what is measurable, and what is not measurable, make measurable -Galileo
Slide 5: Elements of Cache Performance Metrics • Operations: – Ops/s: Puts/sec, Gets/sec, updates/sec – Latencies, percentiles – Indexing • # of nodes – scale, elasticity • Replication • • • • – Synchronous, Asynchronous (fast writes) Tuneable Consistency Durability/Persistence Size & Number of Objects, Size of Cache # of user clients
Slide 6: Elements of Cache Performance: “Think Locality” • Hot or Not: The 80/20 rule. – A small set of objects are very popular! – What is the most RT tweet? • Hit or Miss: Hit Ratio – How effective is your cache? – LRU, LFU, FIFO.. Expiration • Long-lived objects lead to better locality. • Spikes happen – Cascading events – Cache Thrash: full table scans
Slide 7: Real World Performance • Facebook Inbox – Writes:0.12ms, Reads:15ms @ 50GB data • Twitter performance – Twissandra (simulation) • Cassandra for Search & Portals – Lucandra, solandra (simulation) • ycbs/PNUTS benchmarks – 5ms read/writes @ 5k ops/s (50/50 Update heavy) – 8ms reads/5ms writes @ 5k ops/s (95/5 read heavy) • Lab environment – ~5k writes per sec per node, <5ms latencies – ~10k reads per sec per node, <5ms latencies • Performance has improved in newer versions
Slide 8: yahoo cloud store benchmark 50/50 – Update Heavy
Slide 9: yahoo cloud store benchmark 95/5 – read heavy
Slide 10: JVM in BigData Land! Limits for scale • Locks : synchronized – Can’t use all my multi-cores! – java.util.collections also hold locks – Use non-blocking collections! – Hampers object portability – Use avro, thrift! • (de)Serialization is expensive • Object overhead – average enterprise collection has 3 elements! – Use byte[ ], primitives where possible! • Garbage Collection – Can’t throw memory at the problem! – Mitigate, Monitor, Measure foot print
Slide 11: Tools • What is the JVM doing: – dtrace, hprof, introscope, jconsole, visualvm, yourkit, azul zvision • Invasive JVM observation tools – bci, jvmti, jvmdi/pi agents, jmx, logging • What is the OS doing: – dtrace, oprofile, vtune • What is the network disk doing: – Ganglia, iostat, lsof, netstat, nagios
Slide 12: furiously fast writes client issues write n2 fi node nd ply to ap mory me n1 • Append only writes – Sequential disk access • No locks in critical path • Key based atomicity partitioner commit log
Slide 13: furiously fast writes • Use separate disks for commitlog – Don’t forget to size them well – Isolation difficult in the cloud.. • Memtable/SSTable sizes – Delicately balanced with GC • memtable_throughput_in_mb
Slide 14: Cassandra on EC2 cloud *Corey Hulen, EC2
Slide 15: Cassandra on EC2 cloud
Slide 17: Compactions K1 < Serialized data > K2 < Serialized data > K3 < Serialized data > -Sorted --K2 < Serialized data > K10 < Serialized data > K30 < Serialized data > ---Sorted Sorted K4 < Serialized data > K5 < Serialized data > K10 < Serialized data > ---- DELETED MERGE SORT Index File Loaded in memory K1 Offset K5 Offset K30 Offset Bloom Filter Sorted K1 < Serialized data > K2 < Serialized data > K3 < Serialized data > K4 < Serialized data > K5 < Serialized data > K10 < Serialized data > K30 < Serialized data > Data File
Slide 18: Compactions • • • • • Intense disk io & mem churn Triggers GC for tombstones Minor/Major Compactions Reduce priority for better reads Other Parameters – CompactionManager. minimumCompactionThreshold=x xxx
Slide 19: Example: compaction in realworld, cloudkick
Slide 20: reads design
Slide 21: reads performance • BloomFilter used to identify the right file • Maintain column indices to look up columns – Which can span different SSTables • Less io than typical b-tree • Cold read: Two seeks – One for Key lookup, another row lookup • Key Cache – Optimized in latest cassandra • Row Cache – Improves read performance – GC sensitive for large rows. • Most (google) applications require single row transactions* *Sanjay G, BigTable Design, Google.
Slide 22: Client Performance Marshal Arts: Ser/Deserialization • Clients dominated by Thrift, Avro – Hector, Pelops • • • • • Thrift: upgrade to latest: 0.5, 0.4 No news: java.io.Serializable is S.L..O.…W Use “transient” avro, thrift, proto-buf Common Patterns of Doom: – Death by a million gets
Slide 23: • Serialization + Deserialization uBench http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
Slide 24: Adding Nodes • New nodes – Add themselves to busiest node – And then Split its Range • Busy Node starts transmit to new node • Bootstrap logic initiated from any node, cli, web • Each node capable of ~40MB/s – Multiple replicas to parallelize bootstrap • UDP for control messages • TCP for request routing
Slide 25: inter-node comm • Gossip Protocol – It’s exponential – (epidemic algorithm) • Failure Detector – Accrual rate phi • Anti-Entropy – Bringing replicas to uptodate
Slide 26: Bloom Filter: in full bloom • • • • “constant” time size:compact false positives Single lookup for key in file • Deletion • Improve – Counting BF – Bloomier filters
Slide 27: Birthdays, Collisions & Hashing functions • Birthday Paradox For the N=21 people in this room Probability that at least 2 of them share same birthday is ~0.47 • Collisions are real! • An unbalanced HashMap behaves like a list O(n) retrieval • Chaining & Linear probing • Performance Degrades • with 80% table density •
Slide 28: the devil’s in the details
Slide 29: CFS • All in the family! • denormalize
Slide 30: Memtable • In-memory • ColumnFamily specific • throughput determines size before flush • Larger memtables can improve reads
Slide 31: SSTable • MemTable “flushes” to a SSTable • Immutable after • Read: Multiple SSTable lookups possible • Chief Execs: – SSTableWriter – SSTableReader
Slide 32: Write: Runtime threads
Slide 33: Writes: runtime mem
Slide 34: Example: Java Overheads
Slide 35: writes: monitors
Slide 36: UUID • java.util.UUID is slow – static use leads to contention SecureRandom • Uses /dev/urandom for seed initialization • PRNG without file is atleast 20%-40% better. • Use TimeUUIDs where possible – much faster • JUG – java.uuid.generator • • • http://github.com/cowtowncoder/java-uuid-generator http://jug.safehaus.org/ http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html -Djava.security.egd=file:/dev/urandom
Slide 37: synchronized • • • • Coarse grained locks io under lock Stop signal on a highway java.util.concurrent does not mean no locks • Non Blocking, Lock free, Wait free collections
Slide 38: Scalable Lock-Free Coding Style • Big Array to hold Data • Concurrent writes via: CAS & Finite State Machine – No locks, no volatile – Much faster than locking under heavy load – Directly reach main data array in 1 step • Resize as needed – Copy Array to a larger Array on demand – Use State Machine to help copy – “ Mark” old Array words to avoid missing late updates
Slide 39: Non-Blocking HashMap Azul Vega2 – 768 cpus 1K Table 1200 1200 1000 1M Table NB-99 1000 800 800 M-ops/sec 600 CHM-99 M-ops/sec 600 400 400 NB-75 200 200 NB CHM-75 0 0 0 100 200 300 400 500 600 700 800 0 100 200 300 400 500 CHM 600 700 800 Threads Threads
Slide 40: Cassandra uses High Scale Non-Blocking Hashmap public class BinaryMemtable implements IFlushable { … private final Map<DecoratedKey,byte[]> columnFamilies = new NonBlockingHashMap<DecoratedKey, byte[]>(); /* Lock and Condition for notifying new clients about Memtable switches */ private final Lock lock = new ReentrantLock(); Condition condition; … } public class Table { … private static final Map<String, Table> instances = new NonBlockingHashMap<String, Table>(); … }
Slide 41: GC-sensitive elements within Cassandra • Compaction triggers System.gc() – Tombstones from files • • • • “GCInspector” Memtable Threshold, sizes SSTable sizes Low overhead collection choices
Slide 42: Garbage Collection • Pause Times • Allocation Rate if stop_the_word_FullGC > ttl_of_node => failed requests; failure accrual & node repair. – New object creation, insertion rate – if residency in heap > 50% – GC overheads dominate. – space, cpu cycles spent GC – Bigger is not better! • Live Objects (residency) • Overhead • 64-bit not addressing pause times
Slide 43: Memory Fragmentation • Fragmentation – Performance degrades over time – Inducing “Full GC” makes problem go away – Free memory that cannot be used – Use a compacting collector – Promote less often – Use uniform sized objects – Use latest CMS with CR:6631166 – Azul’s Zing JVM & Pauseless GC • Reduce occurrence • Solution – unsolved
Slide 44: CASSANDRA-1014
Slide 45: Best Practices: Garbage Collection • GC Logs are cheap even in production -Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC • Slightly expensive ones: -XX:PrintFLSStatistics=2 -XX:CMSStatistics=1 -XX:CMSInitiationStatistics
Slide 46: Sizing: Young Generation • Should we set –Xms == -Xmx ? • Use –Xmn (fixed eden) allocations {new Object();} survivor ratio eden survivor spaces promotion Tenuring Threshold allocation by jvm old generation
Slide 47: Tuning CMS • Don’t promote too often! – Frequent promotion causes fragmentation • Size the generations – Min GC times are a function of Live Set – Old Gen should host steady state comfortably • Parallelize on multicores: – -XX:ParallelCMSThreads=4 – -XX:ParallelGCThreads=4 • Avoid CMS Initiating heuristic – -XX:+UseCMSInitiationOccupanyOnly • Use Concurrent for System.gc() – -XX:+ExplicitGCInvokesConcurrent
Slide 48: Summary Design & Implementation of Cassandra takes advantages of strengths while avoiding common JVM issues. • Locks: – Avoids locks in critical path – Uses non-blocking collections, TimeUUIDs! – Still Can’t use all my multi-cores..? >> Other bottlenecks to find! • De/Serialization: – Uses avro, thrift! • Object overhead – Uses mostly byte[ ], primitives where possible! • Garbage Collection – Mitigate: Monitor, Measure foot print. – Work in progress by all jvm vendors! Cassandra starts from a great footing from a JVM standpoint and will reap the benefits of the platform!
Slide 49: References • Verner Wogels, Eventually Consistent http://www.allthingsdistributed.com/2008/12/eventually_consistent.htm • Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors" • Avinash Lakshman, http://static.last.fm/johan/nosql20090611/cassandra_nosql.pdf • Eric Brewer, CAP http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • Tony Printzeis, Charlie Hunt, Javaone Talk http://www.scribd.com/ doc/36090475/GC-Tuning-in-the-Java • http://github.com/digitalreasoning/PyStratus/wiki/Documentation • http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf • Cassandra on Cloud, http://www.coreyhulen.org/?p=326 • Cliff Click’s, Non-blocking HashMap http://sourceforge.net/projects/high-scale-lib/ • Brian F. Cooper., Yahoo Cloud Storage Benchmark, http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf Q&A
Slide 50: DataModel: Know your use patterns Alternative Twitter DataModel: <Keyspace Name="Multiblog"> <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" /> <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/> </Keyspace>

   
Time on Slide Time on Plick
Slides per Visit Slide Views Views by Location