Slide 1: Inside Flume
Henry Robinson henry@cloudera.com @henryr
Tuesday, 17 August 2010
Slide 2: Who am I?
• Distributed systems guy
• Apache ZooKeeper committer
• I work at Cloudera on Flume, ZooKeeper, Hue, more...
• p.s. Cloudera is hiring!
Copyright 2010 Cloudera Inc. All rights reserved
Slide 3: About Cloudera
• Software, services and support for Hadoop
• Built around an open core
  • All our patches get contributed upstream
  • Flume and Hue are open-source
  • We just started the Whirr project
• We maintain, package and support Cloudera’s Distribution for Hadoop
  • Smoothing off a lot of the rough edges around Hadoop
  • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Pig, Hue, Flume and more
Slide 4: What’s the problem?
• Data collection is currently a priori and ad hoc
  • A priori: decide what you want to collect ahead of time
  • Ad hoc: each kind of data source goes through its own collection path
    • Usually a collection of fragile, custom scripts
Slide 5: What is Flume? (and how can it help?)
• Flume is:
  • A distributed data collection service
  • Scalable
  • Configurable
  • Extensible
  • Manageable
  • Open source
• How can it help?
  • One-stop solution for data collection of all formats
  • Flexible reliability guarantees allow careful performance tuning
  • Enables quick iteration on new collection strategies
Slide 6: The Flume Model
• Built around the concept of flows
• A single flow corresponds to a type of data source
  • Like web server logs
  • Or machine monitoring metrics
• Different flows might have different compression, batching or reliability setups
• Flume multiplexes many flows onto one service instance
• Flows are composed of nodes chained together
  • Each Flume process can run many nodes, so resources are shared
  • Each node receives data at its source, and sends it to its sink
Slide 7: Flume Flows
• Three typical flows, all on the same Flume service
[Diagram: three flows, each carrying events from a data source through the same Flume service]
Flow 1: Web-clicks (Reliable Delivery, Compressed, Batched)
Flow 2: Process monitoring (Best Effort Delivery)
Flow 3: Advert Impressions (Reliable Delivery)
Slide 8: Anatomy of a Flume node
• Data come in through a source...
• ... are optionally processed by one or more decorators...
• ... and then are transmitted out via a sink
• Each of these components is (re-)configurable at runtime
• Each has a very simple API, and a plugin interface that makes customizing Flume very easy
• These simple abstractions are sufficient to build more complex features like acknowledged delivery, filtering, compression
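The source, decorator and sink chain can be pictured with a minimal Python sketch. This is only an illustration of the data path described on this slide, not Flume's actual (Java) plugin API: `run_node`, `add_host` and the dict-based event shape are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, List

# Hypothetical types for illustration; these are not Flume's real interfaces.
Event = dict
Decorator = Callable[[Event], Event]


def run_node(source: Iterable[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> int:
    """Pull each event from the source, apply decorators in order, hand it to the sink."""
    count = 0
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink(event)
        count += 1
    return count


def add_host(event: Event) -> Event:
    # Example decorator: attach a field to the event without touching its body.
    return {**event, "host": "web-01"}


collected: List[Event] = []
delivered = run_node(
    source=[{"body": b"GET /index.html"}, {"body": b"GET /about.html"}],
    decorators=[add_host],
    sink=collected.append,
)
```

Because each stage is just a callable, swapping the sink or adding another decorator does not require changing the other stages, which is the property the slide attributes to these abstractions.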
Slide 9: Agents and Collectors
• Nodes that receive data from an application are called agents
• Flume supports many sources for agents, including:
  • Syslog
  • Tailing a file
  • Unix processes
  • Scribe API
  • Twitter
• Nodes that write data to permanent storage are called collectors
  • Most often they write to HDFS
Slide 10: Flume Nodes
[Diagram: HTTPD → Agent → Processor → Collector → HDFS]
Agent
  Source: Tail Apache HTTPD logs
  Sink: Downstream processor node
Processor
  Source: Upstream agent node
  Decorator: Extract browser name from log string and attach it to event
  Sink: Downstream collector node
Collector
  Source: Upstream processor node
  Sink: hdfs://namenode/weblogs/%{browser}/
• Each role may be played by many different nodes
• Usually require substantially fewer collectors than agents
Slide 11: Flume Events
• All data are transformed into a series of events
• Each event is a pair (body, metadata)
• Body is a string of bytes
• Metadata is a table mapping keys to values
  • Flume can use this to inform processing
  • Or simply write it with the event
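As a rough illustration of the (body, metadata) pair, the sketch below models an event in Python. The `Event` class and the `to_line` helper are assumptions made for this example; Flume's real event type is a Java interface and its on-disk formats differ.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative event: a byte-string body plus a key/value metadata table."""
    body: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


def to_line(event: Event) -> bytes:
    # "Write it with the event": serialise the metadata alongside the body.
    # The key=value,... layout is an arbitrary choice for this sketch.
    meta = ",".join(f"{k}={v}" for k, v in sorted(event.metadata.items()))
    return meta.encode("utf-8") + b"\t" + event.body


event = Event(body=b"GET /index.html 200",
              metadata={"host": "web-01", "browser": "Firefox"})
line = to_line(event)
```

Keeping the body opaque (raw bytes) while routing decisions read only the metadata is what lets the same pipeline carry any data format.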
Slide 12: The Flume Configuration Language
• Node configurations are written in a simple language
  • my-flume-node : src | { decorator => sink }
• For example, a configuration to read HTTP log data from a file and send it to a collector:
  • web-log-agent : tail("/var/log/httpd.log") | agentBESink
• On the collector, receive data and bucket it according to browser:
  • web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• Two lines to set up an entire flow
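To make the bucketing step concrete, here is a small Python sketch of what the collector configuration appears to do: extract a browser name with a regular expression, store it under the "browser" key, then substitute it into the %{browser} placeholder of the output path. The function names, the sample log line and the "unknown" fallback are assumptions for illustration; Flume's own behaviour when the key is missing is not stated in the slides.

```python
import re
from typing import Dict

BROWSER_PATTERN = re.compile(r"(Firefox|Internet Explorer)")
PLACEHOLDER = re.compile(r"%\{(\w+)\}")


def tag_browser(body: str, metadata: Dict[str, str]) -> Dict[str, str]:
    """Mimic the regex decorator: store the first match under the 'browser' key."""
    match = BROWSER_PATTERN.search(body)
    tagged = dict(metadata)
    if match:
        tagged["browser"] = match.group(1)
    return tagged


def expand_path(template: str, metadata: Dict[str, str]) -> str:
    """Replace each %{key} with its metadata value; 'unknown' is an assumed fallback."""
    return PLACEHOLDER.sub(lambda m: metadata.get(m.group(1), "unknown"), template)


# Hypothetical log line used only to exercise the sketch.
log_line = '10.0.0.1 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 Firefox/3.6"'
meta = tag_browser(log_line, {})
path = expand_path("hdfs://namenode/flume-logs/%{browser}", meta)
```

With the sample line above, `path` ends in the `Firefox` bucket, so events from different browsers would land in different directories.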
Slide 13: Keeping Track of Nodes
• The master service monitors all Flume nodes
  • A single port-of-call for checking on the health of your Flume service
• Send commands to the master, and it will forward them to the nodes
• The Flume Shell is a convenient, scriptable command-line tool
• Web-based UIs are also available
Slide 14: Flume as a Distributed System
• Fundamental principle: Keep state out of the data path where possible
  • Replication is costly
  • Consistency is problematic
  • Global knowledge is impractical
  • Follow the end-to-end principle: put smarts at the edges
• Advantages
  • Failures become much cheaper
  • Performance is better
• Disadvantages
  • Have to weaken some delivery guarantees
Slide 15: Scalability and reliability in Flume
• The data path is ‘horizontally scalable’
  • Add more machines, get more performance
  • Typically the bottleneck is write performance at the collector
  • If machines fail, others automatically take their place
• The master only requires a few machines
  • Consistency and replication handled by ZooKeeper + gossip
  • A cluster of five or seven machines can handle thousands of nodes
  • Can add more if you manage to hit the limit
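The idea that other machines take over when one fails can be sketched as an agent trying a list of collectors in order. This is a simplified illustration under assumed names (`send_with_failover`, `CollectorDown`), not a description of how Flume actually selects or monitors collectors.

```python
from typing import Callable, List, Sequence


class CollectorDown(Exception):
    """Raised by a hypothetical collector that cannot accept events."""


def send_with_failover(event: bytes,
                       collectors: Sequence[Callable[[bytes], None]]) -> int:
    """Try each collector in order; return the index of the one that accepted the event."""
    for index, deliver in enumerate(collectors):
        try:
            deliver(event)
            return index
        except CollectorDown:
            continue  # This collector is unavailable, so fall through to the next.
    raise CollectorDown("no collector accepted the event")


received: List[bytes] = []


def failed_collector(event: bytes) -> None:
    raise CollectorDown("collector-a is offline")


def healthy_collector(event: bytes) -> None:
    received.append(event)


used = send_with_failover(b"cpu=0.93", [failed_collector, healthy_collector])
```

Note that the sender keeps no shared state about collector health here; each send simply discovers failures as it goes, which fits the deck's principle of keeping state out of the data path.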
Slide 16: Flume as Open Source
• http://github.com/cloudera/flume
• Already vibrant contributor community
• Flume 0.9.1 is at release candidate 0 right now
• Cloudera provides
  • Packages
  • Standardisation
  • Support