ait's picture
From ait rss RSS  subscribe Subscribe

Introduction to MapReduce Data Transformations 



 

 
 
Tags:  aster  database  map  reduce  transformations  tdwi  parallel  systems 
Views:  924
Downloads:  5
Published:  January 02, 2010
 
0
download

Share plick with friends Share
save to favorite
Report Abuse Report Abuse
 
Related Plicks
No related plicks found
 
More from this user
Facebook marketing: Best practices to boost ROI

Facebook marketing: Best practices to boost ROI

From: ait
Views: 8
Comments: 0

Domain Names: What You Need to Know About Domain Names

Domain Names: What You Need to Know About Domain Names

From: ait
Views: 167
Comments: 0

Chanimal Building A Saa S Channel Presentation

Chanimal Building A Saa S Channel Presentation

From: ait
Views: 10
Comments: 0

index.doc

index.doc

From: ait
Views: 26
Comments: 0

Library Resources for Research 2009

Library Resources for Research 2009

From: ait
Views: 280
Comments: 0

Microsoft-HP E5000 Exchange 2010 Appliance

Microsoft-HP E5000 Exchange 2010 Appliance

From: ait
Views: 46
Comments: 0

See all 
 
 
 URL:          AddThis Social Bookmark Button
Embed Thin Player: (fits in most blogs)
Embed Full Player :
 
 

Name

Email (will NOT be shown to other users)

 

 
 
Comments: (watch)
 
 
Notes:
 
Slide 1: Introduction to Map/Reduce Data Transformations Tasso Argyros CTO and Co-Founder Aster Data Systems tasso@asterdata.com
Slide 2: A Brief History of MapReduce 2 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 3: What is MapReduce? It’s the simplest API you have ever seen It has just two functions 1. Map() and 2. Reduce() Plus: it’s language independent (Java, Perl, Python, …) 3 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 4: Why is MapReduce Useful? It simplifies distributed applications… …by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)… …behind two simple functions. But let’s see an example 4 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 5: The quick brown fox jumps over the lazy dog. The world only needs five computers. Hello world. To be or not to be: that is the question. In-Database MapReduce is the future. MapReduce is a very powerful programming paradigm. Server A Server B Server C Server D Switch 5 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 6: Goal We Want to Count the # of Times Each Word Occurs 6 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 7: 1st Approach 1st Approach No MapReduce No MapReduce 7 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 8: The quick brown fox jumps over the lazy dog In-Database MapReduce is the future. the quick brown fox jumps over the lazy dog The world only needs five computers. the world only needs five computers Hello world. hello world To be or not to be: that is the question. to be or not to be that is the question in database mapreduce is the future MapReduce is mapreduce a very is powerful a concept. very powerful concept Server A Server B Server C Server D Switch 8 Confidential and proprietary. Copyright © 2008 Aster Data Systems the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers hello world mapreduce is a very powerful concept to be or not to be that is the question
Slide 9: Server 4 Final Result File the is mapreduce … 5 3 2 … 9 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 10: What Did We Do? 1. Write a script to parse the documents and output word lists 2. FTP all the word lists to server 4 3. Write another script to count each word on Server 4 Problem: (2) and (3) do not scale! 10 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 11: 2nd Approach No MapReduce Fully Distributed 11 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 12: The quick brown fox jumps over the lazy dog In-Database MapReduce is the future. the quick brown fox jumps over the lazy dog The world only needs five computers. the world only needs five computers Hello world. hello world To be or not to be: that is the question. to be or not to be that is the question in database mapreduce is the future MapReduce is mapreduce a very is powerful a concept. very powerful concept Server A Server B Server C mapreduce mapreduce be be to jumps computers hello Server D is is is question over a that the the world the world the powerful the lazy database brown database future Confidential and proprietary. Copyright © 2008 Aster Data Systems 12 Switch
Slide 13: Server 1 Final Result File the … 5 …. Server 2 Final Result File world … 2 …. Server 3 Final Result File mapreduce … 2 …. Server 4 Final Result File is … 13 Confidential and proprietary. Copyright © 2008 Aster Data Systems 3 ….
Slide 14: 2nd Approach: No MapReduce, Distributed 14 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 15: Does it work? Yes Does it take lots of time? Yes! Is it a pain? Yes!! Would you do it? No!!! 15 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 16: Moreover… Who will manage your files? What if nodes fail? What if you want to add more nodes? What if… What if… What if… 16 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 17: Input Any file (e.g. documents) Input All <key, value> pairs with the same key grouped (e.g. all <word, count> pairs where word = “the”) up in g Output Stream of <key, value> pairs (e.g. <word, count> pairs) 17 Confidential and proprietary. Copyright © 2008 Aster Data Systems Da ta Re di st r ib ut io Map() an d Gr o Reduce() n Output Anything (e.g. sum of counts for a specific word)
Slide 18: The quick brown fox jumps over the lazy dog In-Database MapReduce is the future. Map() <the, 1> <quick, 1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1> Map() and Redistributi on Phase Map() <in, 1> <database, 1> <mapreduce,1> <is,1> <the,1> <future,1> Server A <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1> 18 Server B Server C Server D <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1> <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> Switch Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 19: Grouping and Reduce() Phase (on Server 1) <the, <the, <the, <the, <the, <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1> 1> 1> 1> 1> 1> Reduce() Server 1 Final Result File <database,1> <database,1> Reduce() the database future 5 2 1 <future,1> Reduce() 19 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 20: What Just Happened? By writing two small scripts with a few lines of code… … we achieved exactly the same result! Plus, our code did not have to care about: • the # of servers on the system (4 or 400?) • which server to send each word • any network communication aspects • any fault tolerance aspects •… 20 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 21: Word Count was Only an Example! Google does all web indexing on MapReduce “The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.” Google 2004 MapReduce paper 21 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 22: Word Count was Only an Example! Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model “We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).” Stanford 2006 AI Lab paper 22 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 23: Result? MapReduce makes writing parallel programs extremely easy… …and can accommodate from trivial to very complex algorithms… …thus enabling the processing of petabytes of data with a few lines of code! 23 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 24: But… Today MapReduce is used only by hardcore coders/ programmers/hackers Changes in MapReduce queries require changes in the MapReduce code itself • Constantly keep coding Using MapReduce with database data is hard and cumbersome… …when most of the structured data in the enterprise are stored in databases! 24 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 25: Beyond SQL and MapReduce 25 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 26: SQL vs MapReduce: Two different worlds? SQL Declarative • Specifies what needs to happen MapReduce Procedural • Specifies how it needs to happen Execution plans optimized dynamically Input/output is structured Data redistribution inferred from SQL statement (in MPP Databases) Code compiled once; MapReduce plans are static Input/output is unstructured Data redistribution based on <keys> in Reduce() phase 26 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 27: Implementing MR in the Database Uses Polymorphic SQL operators to embed MapReduce functions to SQL Introduces a “PARTITION BY” clause to specify data redistribution Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions Best of both worlds • Planning is still dynamic • MapReduce functions can be used like custom SQL operators • MapReduce functions can implement any algorithm or transformation • Code Once – Use Many (through SQL) model 27 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 28: The SQL/MR Process 28 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 29: SQL/MR Function: Syntax SELECT… FROM MR_Function ( ON source_data [ PARTITION BY column ] [ ORDER BY column ] [Function Arguments] ) WHERE … GROUP BY … HAVING … ORDER BY … LIMIT …; (4) Java/Python/… MR function (1) Source table or sub-select (2) <key> for data redistribution (3) Sort before the MR function Optional MR_Function Arguments (5) Select output (eg. count) Optional conditions & filters 29 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 30: Example 1: Tokenization Demo #1: Only Map (Tokenization) in SQL/MR SELECT word, count(*) AS wordcount FROM Tokenize( ON blogs ) GROUP BY word ORDER BY wordcount DESC LIMIT 20; Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/ MR SELECT key AS word, value AS wordcount FROM WordCountReduce ( ON Tokenize ( ON blogs ) PARTITION BY key ) ORDER BY wordcount DESC LIMIT 20; Demo #3: Why do Reduce when you have SQL? SELECT word, count(*) AS wordcount FROM Tokenize( ON blogs ) GROUP BY word ORDER BY wordcount DESC LIMIT 20; 30 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 31: Example 2: Sessionization What Is Sessionize? An example Aster SQL/MR function Leverages Aster’s Java library API What Does It Do? User specified a column (eg. timestamp) and a session timeout value (in seconds) Spits out unique session identifiers (sessionid column) Usage CREATE TABLE sessionized_clicks AS SELECT ts, userid, sessionid, ... FROM Sessionize( ON clicks PARTITION BY userid ORDER BY ts TIMEOUT 60 ); 31 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 32: Example 2: Sessionization Clickstream timesta mp 10:00:0 0 00:58:2 4 10:00:2 4 02:30:3 3 10:01:2 3 10:02:4 0 userid Shawn1 PrezBus h Shawn1 PrezBus h Shawn1 Shawn1 timesta mp 10:00:0 0 10:00:2 4 10:01:2 3 10:02:4 0 userid Shawn1 Shawn1 Shawn1 Shawn1 sessioni d 0 0 0 1 Session Timeout = 60 seconds timesta mp 00:58:2 4 02:30:3 3 userid PrezBus h PrezBus h OUTPUT Slide 32 session id 0 1 INPUT 32 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 33: MR Applications in the Database ELT Text and data transformations, in-parallel, in-database Queries that become too complex for SQL E.g. Sessionize(), customer segmentation, predictive analytics, … Queries that SQL inherently cannot handle well Time series analytics Aster has a set of pre-defined SQL/MR functions for this Data structures that do not fit well the relational model Time series (again) Graphs, spatial data Any analytical or reporting application that requires more performance and data proximity! 33 Confidential and proprietary. Copyright © 2008 Aster Data Systems
Slide 34: Summary Growing challenges in scaling analytical applications and reporting MapReduce is driving a data revolution (see: Google) In-Database MapReduce will open up databases to a host of new applications tasso@asterdata.com (Questions, Comments) asterdata.com/blog (Lots of technical details) (Any other information) 34 Confidential and proprietary. Copyright © 2008 Aster Data Systems 1.888.Aster.Data

   
Time on Slide Time on Plick
Slides per Visit Slide Views Views by Location