From: cabirduk
Slide 1: Data Processing in the Cloud
Parand Tony Darugar http://parand.com/say/ darugar@yahoo-inc.com
Slide 2: What is Hadoop
Flexible infrastructure for large scale computation and data processing on a network of commodity hardware.
Slide 3: Why?
A common infrastructure pattern extracted from building distributed systems
Scale
Incremental growth
Cost
Flexibility
Slide 4: Built-in Resilience to Failure
When dealing with large numbers of commodity servers, failure is a fact of life.
Assume failure; build protections and recovery into your architecture:
- Data level redundancy
- Job/Task level monitoring and automated restart and re-allocation
Slide 5: Current State of Hadoop Project
Top level Apache Foundation project
In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
Large, active user base, mailing lists, user groups
Very active development, strong development team
Slide 6: Widely Adopted
A valuable and reusable skill set
Taught at major universities
Easier to hire for
Easier to train on
Portable across projects, groups
Slide 7: Plethora of Related Projects
Pig
Hive
HBase
Cascading
Hadoop on EC2
JAQL, X-Trace, Happy, Mahout
Slide 8: What is Hadoop
The Linux of distributed processing.
Slide 9: How Does Hadoop Work?
Slide 10: Hadoop File System
A distributed file system for large data
- Your data in triplicate
- Built-in redundancy, resiliency to large scale failures
- Intelligent distribution, striping across racks
- Accommodates very large data sizes
- On commodity hardware
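To make "in triplicate" and "striping across racks" concrete, here is a minimal sketch of splitting data into blocks and choosing three nodes for each block so that every block lands on more than one rack. The function names, the tiny block size, and the placement rule are assumptions made for illustration; this is not the placement policy Hadoop itself uses.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, racks, replication=3):
    """Pick `replication` distinct nodes per block, spanning racks when possible.

    racks: dict mapping a rack name to the list of node names in that rack.
    Returns a dict mapping block index -> list of chosen node names.
    Illustrative rule only, not Hadoop's real placement algorithm.
    """
    nodes = [(rack, node) for rack, members in racks.items() for node in members]
    placement = {}
    for block in range(num_blocks):
        # Rotate the node list so successive blocks start on different nodes.
        start = block % len(nodes)
        rotated = nodes[start:] + nodes[:start]
        first_rack, first_node = rotated[0]
        # Second copy: the first node found on a different rack, if any.
        off_rack = [n for r, n in rotated[1:] if r != first_rack]
        chosen = [first_node] + off_rack[:1]
        # Fill the remaining copies with any nodes not used yet.
        for _, node in rotated[1:]:
            if len(chosen) >= replication:
                break
            if node not in chosen:
                chosen.append(node)
        placement[block] = chosen
    return placement


# Hypothetical two-rack cluster and a toy 4-byte block size.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), racks)
for block, holders in placement.items():
    print(block, holders)
```

Losing any single node, or a whole rack, still leaves at least one copy of every block in this toy cluster, which is the property the slide is pointing at.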
Slide 11: Programming Model: Map/Reduce
Very simple programming model:
- Map(anything) -> key, value
- Sort, partition on key
- Reduce(key, value) -> key, value
No parallel processing / message passing semantics
Programmable in Java or any other language (streaming)
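The three steps of the model can be sketched in a few lines of plain Python. This is a single-process illustration of map, sort/partition, and reduce using word counting as the example; it does not use Hadoop, and the names `map_fn`, `reduce_fn`, and `run_job` as well as the partitioning function are assumptions made for the sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(record):
    """Map: take any input record and emit (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: take a key and all of its values, emit (key, value) pairs."""
    yield key, sum(values)


def run_job(records, num_partitions=2):
    # Map phase: every record is processed independently of the others.
    mapped = [pair for record in records for pair in map_fn(record)]

    # Partition on key: each key goes to exactly one partition (one reducer).
    # A byte sum is used so the result is the same on every run.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in mapped:
        partitions[sum(key.encode()) % num_partitions].append((key, value))

    # Sort each partition by key, then reduce each run of equal keys.
    results = {}
    for partition in partitions:
        partition.sort(key=itemgetter(0))
        for key, group in groupby(partition, key=itemgetter(0)):
            for out_key, out_value in reduce_fn(key, (v for _, v in group)):
                results[out_key] = out_value
    return results


counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"])  # prints: 3 2
```

Note that neither `map_fn` nor `reduce_fn` knows anything about threads, machines, or messages; all coordination lives in `run_job`, which is the role the framework plays.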
Slide 12: Processing Model
Create or allocate a cluster
Put data onto the file system:
- Data is split into blocks, stored in triplicate across your cluster
Run your job:
- Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
• Move computation to data, not data to computation
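"Move computation to data" can be illustrated with a small scheduling sketch: for each block, prefer a node that already holds a copy and has a free slot, and fall back to any other free node only when none does. The function name, the slot model, and the node and block names are hypothetical; the real scheduler is considerably more involved.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that hold the block.

    block_locations: block id -> list of nodes storing a copy of that block.
    free_slots: node -> number of map tasks the node can still accept.
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy; do not mutate the caller's dict
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder has capacity: the block must be read over the network.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster: four nodes with one slot each, blocks stored in triplicate.
locations = {
    "b0": ["n1", "n2", "n3"],
    "b1": ["n1", "n2", "n4"],
    "b2": ["n1", "n2", "n3"],
    "b3": ["n1", "n2", "n3"],
}
assignment = assign_map_tasks(locations, {"n1": 1, "n2": 1, "n3": 1, "n4": 1})
for block, (node, is_local) in assignment.items():
    print(block, node, "local" if is_local else "remote")
```

In this run the first three blocks are processed where they are stored, and only `b3` has to be read remotely because every node holding it is already busy.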
Slide 13: Processing Model
• Monitor workers, automatically restarting failed or slow tasks
- Gather output of Map, sort and partition on key
- Run Reduce tasks
• Monitor workers, automatically restarting failed or slow tasks
Results of your job are now available on the Hadoop file system
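The "monitor and restart" behaviour can be sketched as a loop that re-queues any task whose attempt fails, up to a limit. This is a single-process illustration under stated assumptions: `run_with_retries`, the `max_attempts` value, and the simulated failure are all hypothetical, and the restarting of slow (rather than failed) tasks is not modelled.

```python
def run_with_retries(tasks, run_task, max_attempts=4):
    """Run every task, re-queueing failed ones until they succeed.

    tasks: list of task ids.
    run_task: callable (task_id, attempt_number) -> result; raises on failure.
    Returns (results, attempts) keyed by task id.
    """
    results = {}
    attempts = {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = run_task(task, attempts[task])
        except Exception:
            if attempts[task] >= max_attempts:
                raise RuntimeError(f"task {task} failed {max_attempts} times")
            pending.append(task)  # re-queue the task for another attempt
    return results, attempts


def flaky(task_id, attempt):
    """Simulated worker: the first attempt of map-1 fails, as if its node died."""
    if task_id == "map-1" and attempt == 1:
        raise IOError("simulated node failure")
    return f"{task_id} done"


results, attempts = run_with_retries(["map-0", "map-1", "map-2"], flaky)
print(attempts)  # prints: {'map-0': 1, 'map-1': 2, 'map-2': 1}
```

The job as a whole still completes; the failure of one attempt costs only the rerun of that one task.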
Slide 14: Hadoop on the Grid
Managed Hadoop clusters
Shared resources
- improved utilization
Standard data sets, storage
Shared, standardized operations management
Hosted internally or externally (e.g. on EC2)
Slide 15: Usage Patterns
Slide 16: ETL
Put a large data source (e.g. log files) onto the Hadoop File System
Perform aggregations, transformations, normalizations on the data
Load into RDBMS / data mart
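The shape of such an ETL job can be sketched on a single machine: extract fields from raw log lines, aggregate them, and load the result into a relational table. The log format, the sample lines, and the `daily_hits` table are invented for the illustration, and an in-memory SQLite database stands in for the RDBMS; on Hadoop the extract and transform steps would run as Map and Reduce tasks over data in the Hadoop File System.

```python
import sqlite3
from collections import Counter

# Hypothetical log lines: date, time, method, path, status.
LOG_LINES = [
    "2009-01-01 10:00:01 GET /index.html 200",
    "2009-01-01 10:00:02 GET /about.html 200",
    "2009-01-01 10:00:03 GET /index.html 404",
    "malformed line",
    "2009-01-02 09:15:00 GET /index.html 200",
]


def extract(lines):
    """Parse raw lines into (date, path, status), skipping malformed ones."""
    for line in lines:
        parts = line.split()
        if len(parts) == 5:
            date, _time, _method, path, status = parts
            yield date, path, int(status)


def transform(records):
    """Aggregate: count successful requests per (date, path)."""
    hits = Counter()
    for date, path, status in records:
        if status == 200:
            hits[(date, path)] += 1
    return hits


def load(hits, connection):
    """Load the aggregated rows into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits (day TEXT, path TEXT, hits INTEGER)"
    )
    connection.executemany(
        "INSERT INTO daily_hits VALUES (?, ?, ?)",
        [(day, path, count) for (day, path), count in hits.items()],
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(LOG_LINES)), connection)
rows = connection.execute(
    "SELECT day, path, hits FROM daily_hits ORDER BY day, path"
).fetchall()
for row in rows:
    print(row)
```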
Slide 17: Reporting and Analytics
Run canned and ad-hoc queries over large data
Run analytics and data mining operations on large data
Produce reports for end-user consumption or loading into data mart
Slide 18: Data Processing Pipelines
Multi-step pipelines for data processing
Coordination, scheduling, data collection and publishing of feeds
SLA carrying, regularly scheduled jobs
Slide 19: Machine Learning & Graph Algorithms
Traverse large graphs and data sets, building models and classifiers
Implement machine learning algorithms over massive data sets
Slide 20: General Back-End Processing
Implement significant portions of back-end, batch oriented processing on the grid
General computation framework
Simplify back-end architecture
Slide 21: What Next?
Download Hadoop:
- http://hadoop.apache.org/
Try it on your laptop
Try Pig:
- http://hadoop.apache.org/pig/
Deploy to multiple boxes
Try it on EC2