From: margulis

Cloud Computing: Hadoop 



Tags: cloud computing, scalability, yahoo, hadoop, pig, mapreduce
Published:  April 27, 2010
 
Notes:
 
Slide 1: Data Processing in the Cloud
Parand Tony Darugar
http://parand.com/say/
darugar@yahoo-inc.com
Slide 2: What is Hadoop
Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
Slide 3: Why?
A common infrastructure pattern extracted from building distributed systems
- Scale
- Incremental growth
- Cost
- Flexibility
Slide 4: Built-in Resilience to Failure
- When dealing with large numbers of commodity servers, failure is a fact of life
- Assume failure, build protections and recovery into your architecture
  - Data level redundancy
  - Job/Task level monitoring and automated restart and re-allocation
Slide 5: Current State of Hadoop Project
- Top level Apache Foundation project
- In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
- Large, active user base, mailing lists, user groups
- Very active development, strong development team
Slide 6: Widely Adopted
- A valuable and reusable skill set
  - Taught at major universities
  - Easier to hire for
  - Easier to train on
  - Portable across projects, groups
Slide 7: Plethora of Related Projects
- Pig
- Hive
- HBase
- Cascading
- Hadoop on EC2
- JAQL, X-Trace, Happy, Mahout
Slide 8: What is Hadoop
The Linux of distributed processing.
Slide 9: How Does Hadoop Work?
Slide 10: Hadoop File System
- A distributed file system for large data
  - Your data in triplicate
  - Built-in redundancy, resiliency to large scale failures
  - Intelligent distribution, striping across racks
  - Accommodates very large data sizes
  - On commodity hardware
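
To make "your data in triplicate" and "striping across racks" concrete, here is a minimal sketch in Python. It is a toy model only: the function names, the rack layout and the rotation policy are assumptions made for illustration, and it is not the placement algorithm Hadoop itself uses.

```python
REPLICAS = 3  # "your data in triplicate"


def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, racks):
    """Pick REPLICAS distinct nodes per block, spanning at least two racks.

    `racks` maps a rack name to a list of unique node names.
    Hypothetical toy policy, not the real Hadoop placement logic.
    """
    nodes = [(rack, node) for rack, members in racks.items() for node in members]
    if len({rack for rack, _ in nodes}) < 2 or len(nodes) < REPLICAS:
        raise ValueError("need nodes on at least two racks and three nodes in total")

    placement = {}
    for block in range(num_blocks):
        # Rotate the starting point so blocks spread over the whole cluster.
        start = block % len(nodes)
        rotated = nodes[start:] + nodes[:start]
        first_rack, first = rotated[0]
        # Second copy goes to the first node found on a different rack.
        second = next(node for rack, node in rotated if rack != first_rack)
        # Third copy goes to any node not used yet.
        third = next(node for _, node in rotated if node not in (first, second))
        placement[block] = [first, second, third]
    return placement


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    for block_id, holders in place_replicas(len(blocks), cluster).items():
        print(block_id, holders)
```

Because every block lives on more than one rack in this model, losing a single node or a single rack still leaves at least one copy of each block.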
Slide 11: Programming Model: Map/Reduce
- Very simple programming model:
  - Map(anything) -> key, value
  - Sort, partition on key
  - Reduce(key, value) -> key, value
- No parallel processing / message passing semantics
- Programmable in Java or any other language (streaming)
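
The three steps of the programming model can be sketched in a few lines of plain Python. This runs in a single process and does not use Hadoop at all; the names map_words, reduce_counts and run_job are made up for this example, and word counting is simply a convenient illustration of the model.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: collapse all values seen for one key into a single pair."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Run map, then sort on key, then reduce, all in one process."""
    pairs = [pair for record in records for pair in mapper(record)]
    pairs.sort(key=itemgetter(0))  # stands in for the sort/partition step
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in run_job(lines, map_words, reduce_counts):
        print(word, count)
```

Note that neither map_words nor reduce_counts knows anything about threads, machines or messages. That is the point of the model: the framework owns the distribution, and the programmer supplies only the two functions.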
Slide 12: Processing Model
- Create or allocate a cluster
- Put data onto the file system:
  - Data is split into blocks, stored in triplicate across your cluster
- Run your job:
  - Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
    - Move computation to data, not data to computation
Slide 13: Processing Model (continued)
- Run your job (continued):
  - Monitor workers, automatically restarting failed or slow tasks
  - Gather output of Map, sort and partition on key
  - Run Reduce tasks
    - Monitor workers, automatically restarting failed or slow tasks
- Results of your job are now available on the Hadoop file system
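
Two ideas from the processing model, "move computation to data" and "automatically restarting failed tasks", can be sketched as follows. This is an illustrative toy, not Hadoop's scheduler: pick_node and run_with_restart are hypothetical helpers, failures are modelled as RuntimeError, and the handling of slow tasks is left out.

```python
def pick_node(block_locations, free_nodes):
    """Prefer a free node that already stores the block; else take any free node.

    Returns (node, is_local). `block_locations` is the set of nodes holding
    a copy of the block, `free_nodes` a non-empty list of idle nodes.
    """
    for node in free_nodes:
        if node in block_locations:
            return node, True
    return free_nodes[0], False


def run_with_restart(task, max_attempts=3):
    """Call task(attempt) and restart it when it raises RuntimeError.

    Returns (result, attempts_used); gives up after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError:
            continue  # simulated worker failure: try again
    raise RuntimeError(f"task failed after {max_attempts} attempts")


if __name__ == "__main__":
    # The block lives on n2 and n3; n1 and n2 are idle, so n2 is chosen.
    print(pick_node({"n2", "n3"}, ["n1", "n2"]))

    def flaky(attempt):
        if attempt < 2:
            raise RuntimeError("simulated worker failure")
        return "ok"

    print(run_with_restart(flaky))
```

Running the task on a node that already holds the block avoids copying the block over the network, which is what the slide means by moving computation to data.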
Slide 14: Hadoop on the Grid
- Managed Hadoop clusters
- Shared resources - improved utilization
- Standard data sets, storage
- Shared, standardized operations management
- Hosted internally or externally (e.g. on EC2)
Slide 15: Usage Patterns
Slide 16: ETL
- Put large data source (e.g. log files) onto the Hadoop File System
- Perform aggregations, transformations, normalizations on the data
- Load into RDBMS / data mart
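
As a rough illustration of the ETL pattern, the sketch below aggregates log lines in the streaming style (tab-separated key and value as plain text) and formats the result as CSV for a database bulk load. The three-field log format, the function names and the CSV columns are all assumptions invented for this example. In a real streaming job the map and reduce steps would be separate scripts reading standard input and writing standard output; here they are ordinary functions so the example runs on its own.

```python
import csv
import io
from itertools import groupby


def map_log_lines(lines):
    """Map step: emit 'status<TAB>1' for each well-formed log line.

    Assumes a hypothetical 'timestamp status path' format; other lines are
    skipped, which is one simple form of normalization.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        _timestamp, status, _path = fields
        yield f"{status}\t1"


def sum_by_key(sorted_lines):
    """Reduce step: sum the counts per key; input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for status, group in groupby(parsed, key=lambda pair: pair[0]):
        yield status, sum(int(count) for _, count in group)


def to_csv(rows):
    """Format aggregated rows as CSV text, ready for a bulk load."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["status", "hits"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    logs = [
        "2010-04-27T10:00 200 /index",
        "2010-04-27T10:01 404 /missing",
        "bad line",
        "2010-04-27T10:02 200 /about",
    ]
    mapped = sorted(map_log_lines(logs))  # sorted() stands in for the shuffle
    print(to_csv(sum_by_key(mapped)), end="")
```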
Slide 17: Reporting and Analytics
- Run canned and ad-hoc queries over large data
- Run analytics and data mining operations on large data
- Produce reports for end-user consumption or loading into data mart
Slide 18: Data Processing Pipelines
- Multi-step pipelines for data processing
- Coordination, scheduling, data collection and publishing of feeds
- SLA carrying, regularly scheduled jobs
Slide 19: Machine Learning & Graph Algorithms
- Traverse large graphs and data sets, building models and classifiers
- Implement machine learning algorithms over massive data sets
Slide 20: General Back-End Processing
- Implement significant portions of back-end, batch oriented processing on the grid
- General computation framework
- Simplify back-end architecture
Slide 21: What Next?
- Download Hadoop:
  - http://hadoop.apache.org/
- Try it on your laptop
- Try Pig
  - http://hadoop.apache.org/pig/
- Deploy to multiple boxes
- Try it on EC2

   