From bamasio

Inside Flume 

 

 
 
Tags: apache, zookeeper, logging, flume, data collector, syslog, scribe, hadoopday, stream, logs, log management, bigdata, cloud, data analysis, hadoop, hdfs, cloudera
Published:  February 06, 2012
 
Notes:
 
Slide 1: Inside Flume
Henry Robinson
henry@cloudera.com
@henryr
Tuesday, 17 August 2010
Slide 2: Who am I?
• Distributed systems guy
• Apache ZooKeeper committer
• I work at Cloudera on Flume, ZooKeeper, Hue, more...
• p.s. Cloudera is hiring!
Copyright 2010 Cloudera Inc. All rights reserved. Tuesday, 17 August 2010
Slide 3: About Cloudera
• Software, services and support for Hadoop
• Built around an open core
  • All our patches get contributed upstream
  • Flume and Hue are open-source
  • We just started the Whirr project
• We maintain, package and support Cloudera’s Distribution for Hadoop
  • Smoothing off a lot of the rough edges around Hadoop
  • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Pig, Hue, Flume and more.
Slide 4: What’s the problem?
• Data collection is currently a priori and ad hoc
  • A priori - decide what you want to collect ahead of time
  • Ad hoc - each kind of data source goes through its own collection path
• Usually a collection of fragile, custom scripts
Slide 5: What is Flume? (and how can it help?)
• Flume is:
  • A distributed data collection service
  • Scalable
  • Configurable
  • Extensible
  • Manageable
  • Open source
• How can it help?
  • One-stop solution for data collection of all formats
  • Flexible reliability guarantees allow careful performance tuning
  • Enables quick iteration on new collection strategies
Slide 6: The Flume Model
• Built around the concept of flows
• A single flow corresponds to a type of data source
  • Like web server logs
  • Or machine monitoring metrics
• Different flows might have different compression, batching or reliability setups
• Flume multiplexes many flows onto one service instance
• Flows are comprised of nodes chained together
  • Each Flume process can run many nodes, so resources are shared
  • Each node receives data at its source, and sends it to its sink
Slide 7: Flume Flows
• Three typical flows, all on the same Flume service
[Diagram: three flows of event data sharing one Flume service]
  Flow 1: Web-clicks - Reliable Delivery, Compressed, Batched
  Flow 2: Process monitoring - Best Effort Delivery
  Flow 3: Advert Impressions - Reliable Delivery
Slide 8: Anatomy of a Flume node
• Data come in through a source...
• ... are optionally processed by one or more decorators...
• ... and then are transmitted out via a sink
• Each of these components is (re-)configurable at runtime
• Each has a very simple API, and a plugin interface that makes customizing Flume very easy
• These simple abstractions are sufficient to build more complex features like acknowledged delivery, filtering, compression
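The source, decorator and sink chain described on this slide can be sketched in a few lines of Python. This is only an illustration of the shape of the idea, not Flume's actual plugin API: the names `run_node`, `add_host` and `upper_body` are hypothetical, and events are modelled as plain dictionaries.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical types for illustration; these are not Flume's real interfaces.
Event = Dict[str, str]
Decorator = Callable[[Event], Event]


def run_node(source: Iterable[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> int:
    """Pull events from the source, pass each one through every decorator
    in order, and hand the result to the sink. Returns the event count."""
    count = 0
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink(event)
        count += 1
    return count


def add_host(event: Event) -> Event:
    # Decorator that attaches an extra key without touching the body.
    return {**event, "host": "web01"}


def upper_body(event: Event) -> Event:
    # Decorator that transforms the body.
    return {**event, "body": event["body"].upper()}


collected: List[Event] = []
n = run_node(
    source=[{"body": "get /index.html"}, {"body": "get /about.html"}],
    decorators=[add_host, upper_body],
    sink=collected.append,
)
```

Because the node only depends on three small callables, swapping a source or adding a decorator does not change the loop, which is roughly why features such as filtering or compression can be layered on as decorators.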
Slide 9: Agents and Collectors
• Nodes that receive data from an application are called agents
• Flume supports many sources for agents, including:
  • Syslog
  • Tailing a file
  • Unix processes
  • Scribe API
  • Twitter
• Nodes that write data to permanent storage are called collectors
  • Most often they write to HDFS
Slide 10: Flume Nodes
• Each role may be played by many different nodes
• Usually require substantially fewer collectors than agents
[Diagram: three nodes chained from HTTPD to HDFS]
  Agent - Source: tail Apache HTTPD logs; Sink: downstream processor node
  Processor - Source: upstream agent node; Decorator: extract browser name from log string and attach it to event; Sink: downstream collector node
  Collector - Source: upstream processor node; Sink: hdfs://namenode/weblogs/%{browser}/
Slide 11: Flume Events
• All data are transformed into a series of events
• Events are a pair (body, metadata)
  • Body is a string of bytes
  • Metadata is a table mapping keys to values
    • Flume can use this to inform processing
    • Or simply write it with the event
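As a rough picture of the (body, metadata) pair, the sketch below models an event as a small Python dataclass. The class and the helpers `choose_bucket` and `write_line` are hypothetical names chosen for illustration; they show the two uses of metadata the slide mentions (informing processing, or being written out with the event) and do not reflect Flume's own event class.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: a (body, metadata) pair."""
    body: bytes                                             # opaque string of bytes
    metadata: Dict[str, str] = field(default_factory=dict)  # key -> value table


def choose_bucket(event: Event) -> str:
    # Metadata informing processing: pick an output bucket from a key.
    return event.metadata.get("browser", "other")


def write_line(event: Event) -> bytes:
    # Metadata written with the event: prefix the body with sorted key=value pairs.
    tags = " ".join(f"{k}={v}" for k, v in sorted(event.metadata.items()))
    return tags.encode() + b" " + event.body


event = Event(body=b"GET /index.html 200", metadata={"host": "web01"})
event.metadata["browser"] = "Firefox"
```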
Slide 12: The Flume Configuration Language
• Node configurations are written in a simple language
  • my-flume-node : src | { decorator => sink }
• For example: a configuration to read HTTP log data from a file and send it to a collector:
  • web-log-agent : tail("/var/log/httpd.log") | agentBESink
• On the collector, receive data and bucket it according to browser:
  • web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• Two lines to set up an entire flow
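To make the collector example more concrete, here is a minimal Python sketch of what the `regex(...)` decorator and the `%{browser}` placeholder appear to do together: extract a value from the event body, store it under a metadata key, then substitute that key into the output path. The function names `regex_decorator` and `expand_path` are hypothetical, and the fallback value `"unknown"` for a missing key is an assumption made for this sketch, not documented Flume behaviour.

```python
import re
from typing import Callable, Dict, Tuple

Event = Tuple[bytes, Dict[str, str]]  # (body, metadata)


def regex_decorator(pattern: str, key: str) -> Callable[[Event], Event]:
    """Build a decorator that stores the first capture group of `pattern`
    under `key` in the event metadata (illustrative only)."""
    compiled = re.compile(pattern.encode())

    def decorate(event: Event) -> Event:
        body, metadata = event
        match = compiled.search(body)
        if match:
            metadata = {**metadata, key: match.group(1).decode()}
        return body, metadata

    return decorate


def expand_path(template: str, metadata: Dict[str, str],
                default: str = "unknown") -> str:
    # Replace each %{key} with its metadata value; `default` is an assumption.
    return re.sub(r"%\{(\w+)\}",
                  lambda m: metadata.get(m.group(1), default),
                  template)


extract = regex_decorator(r"(Firefox|Internet Explorer)", "browser")
body, meta = extract((b"GET / HTTP/1.1 Mozilla/5.0 Firefox/3.6", {}))
path = expand_path("hdfs://namenode/flume-logs/%{browser}", meta)
```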
Slide 13: Keeping Track of Nodes
• The master service monitors all Flume nodes
• A single port-of-call for checking on the health of your Flume service
• Send commands to the master, and it will forward them to the nodes
• The Flume Shell is a convenient, scriptable command-line tool
• Web-based UIs are also available
Slide 14: Flume as a Distributed System
• Fundamental principle: Keep state out of the data path where possible
  • Replication is costly
  • Consistency is problematic
  • Global knowledge is impractical
  • Follow the end-to-end principle - put smarts at the edges
• Advantages
  • Failures become much cheaper
  • Performance is better
• Disadvantages
  • Have to weaken some delivery guarantees
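One way to read "put smarts at the edges" is that only the sending edge remembers which events are still unacknowledged, while everything in between stays stateless. The sketch below illustrates that idea under stated assumptions: `EdgeAgent`, its methods and the `lossy_path` stand-in are all hypothetical, and this is not a description of how Flume implements acknowledged delivery.

```python
from typing import Callable, Dict, List


class EdgeAgent:
    """Keeps unacknowledged events at the edge; the path in between
    holds no state (hypothetical illustration)."""

    def __init__(self, send: Callable[[int, bytes], bool]):
        self.send = send                    # returns True when the far end acknowledges
        self.pending: Dict[int, bytes] = {}
        self.next_id = 0

    def submit(self, body: bytes) -> None:
        event_id = self.next_id
        self.next_id += 1
        self.pending[event_id] = body       # remember the event until it is acknowledged
        self._try(event_id)

    def _try(self, event_id: int) -> None:
        if self.send(event_id, self.pending[event_id]):
            del self.pending[event_id]

    def retry_pending(self) -> None:
        # Retransmit everything that has not been acknowledged yet.
        for event_id in list(self.pending):
            self._try(event_id)


attempts: List[int] = []


def lossy_path(event_id: int, body: bytes) -> bool:
    # Stand-in for a stateless data path that drops the first attempt.
    attempts.append(event_id)
    return len(attempts) > 1


agent = EdgeAgent(lossy_path)
agent.submit(b"a")        # first attempt is dropped, so the event stays pending
agent.retry_pending()     # second attempt is acknowledged
```

A retransmitting sender like this can deliver the same event twice if an acknowledgement is lost, which is one example of the weakened guarantee the slide lists as a disadvantage.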
Slide 15: Scalability and reliability in Flume
• The data path is ‘horizontally scalable’
  • Add more machines, get more performance
  • Typically the bottleneck is write performance at the collector
  • If machines fail, others automatically take their place
• The master only requires a few machines
  • Consistency and replication handled by ZooKeeper + gossip
  • A cluster of five or seven machines can handle thousands of nodes
  • Can add more if you manage to hit the limit
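The point that other machines take over when one fails can be pictured as a sink that tries a list of collectors in order. This is a simplified sketch, assuming a hypothetical `FailoverSink` class and that an unreachable collector surfaces as Python's built-in `ConnectionError`; the slides do not say how Flume actually chooses a replacement.

```python
from typing import Callable, List, Sequence

Sink = Callable[[bytes], None]


class FailoverSink:
    """Send each event to the first collector that accepts it
    (hypothetical illustration, not Flume's implementation)."""

    def __init__(self, sinks: Sequence[Sink]):
        self.sinks = list(sinks)

    def __call__(self, body: bytes) -> int:
        for index, sink in enumerate(self.sinks):
            try:
                sink(body)
                return index            # index of the collector that took the event
            except ConnectionError:
                continue                # this collector is down, try the next one
        raise ConnectionError("all collectors are unavailable")


def broken_collector(body: bytes) -> None:
    # Stand-in for a collector machine that has failed.
    raise ConnectionError("collector down")


received: List[bytes] = []
sink = FailoverSink([broken_collector, received.append])
used = sink(b"event-1")
```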
Slide 16: Flume as Open Source
• http://github.com/cloudera/flume
• Already vibrant contributor community
• Flume 0.9.1 is at release candidate 0 right now
• Cloudera provides
  • Packages
  • Standardisation
  • Support

   