Slide 1: Should ETL Become Obsolete?
Why a Business-Rules driven “E-LT” Architecture may be better
Slide 2: Introduction
In today’s information-based economy, organizations must be able to integrate vast amounts of data from disparate sources in order to support strategic IT initiatives such as:
Business Intelligence Corporate Performance Management, including Master Data Management, Data Warehousing and Data Marts.
At the same time, IT organizations are under constant pressure to get more done with fewer resources.
Slide 3: The only way to satisfy these conflicting goals is to adopt a cost-effective integration solution that enhances the productivity of the IT organization and helps to streamline a broad range of integration initiatives. Over the past several decades, many organizations have turned to commercial ETL (Extract, Transform, and Load) tools as a means to reduce the effort associated with the most common integration approach: Manual Coding. Using a centralized ETL “engine” as an integration hub, and powered by a single, platform independent language, most ETL tools do reduce the effort associated with point-topoint manual integration.
Slide 4: However, many of the users of these tools have also encountered a number of issues that are a consequence of the traditional ETL architecture. This begs the question:
Is the conventional approach to ETL obsolete?
Slide 5: The Problems with the Traditional ETL Approach
Traditional ETL tools operate by: First, Extracting the data from various sources Second, Transforming the data on a proprietary, middle-tier ETL engine Third, Loading the transformed data into the target Data Warehouse or integration server. Hence the term “ETL” represents both the names and the order of the operations performed, as shown in the next slide.
Slide 7: Most ETL tools use a graphical programming model to shield the user from the complexity of coding the transformations and make the tools easier to learn and use. While in theory ETL tools should improve Developer productivity, this is not always the case. The primary issues with the traditional ETL approach fall into the following three categories: 1. Productivity/maintainability 2. Performance 3. Cost.
Slide 8: Virtually all ETL tools use some kind of graphical programming model as an alternative to manual coding. While at first glance many of these data-flow oriented GUIs look similar, there are significant differences that impact the number of intermediate steps that must be defined, and the maintainability and reuse of the job and its Components over time. The conventional ETL approach first requires the user to describe what they want to do in plain English. That means describing the Business Rules as text and then detailing precisely how to implement these rules, step by step. This phase of development is often referred to as “designing the Data Flow”.
Productivity/Maintainability Issues
Slide 9: This requires a strong understanding of the structure of the data and the architecture of the IT system. As illustrated in the next figure, the Data Flow needs to be defined manually, and repeated for each individual process. Users must understand not only what the overall transformation is supposed to do, but also define what each incremental step is needed to perform it. This usually ends up as a long chain of multiple mapping operations tied to a number of temporary outputs. Defining all of these intermediate stages requires additional analysis and development work, and hence, has a negative impact on productivity.
Slide 10: Defining all of these intermediate stages requires additional analysis and development work, and hence, has a negative impact on productivity.
Slide 11: What’s worse, when there is a need to change a Business Rule or accommodate additional data sources or targets, significant rework may be required to the existing Data Flow since the transformations are highly fragmented, requiring edits to many sub-blocks. This can make maintenance even more challenging – especially when the resource maintaining the processes is not the same as the resource who created them.
Slide 12: Performance Issues with Traditional ETL
The data transformation step of the ETL process is by far the most computer-intensive, and is performed entirely by the proprietary ETL engine on a dedicated server. The ETL engine performs data transformations (and sometimes data quality checks) on a row-by-row basis, and hence, can easily become the bottleneck in the overall process. In addition, the data must be moved over the network twice – once between the sources and the ETL server, and again between the ETL server and the target Data Warehouse.
Slide 13: Moreover, if one wants to ensure referential integrity by comparing Data Flow references against values from the target Data Warehouse, the referenced data must be downloaded from the target to the engine. Thus further increasing network traffic, download time, and leading to additional performance issues. Let’s consider, for example, how a traditional ETL job would look up values from the target database to enrich data coming from source systems. To perform such a job, a traditional ETL tool could be used in one of the following three ways:
Slide 14: 1.
Load look-up tables into memory: • The entire look-up table is retrieved from the target server and loaded into the engine’s memory. • Matching (or joining) this look-up data with source records is done in memory before the resulting transformed data is written back to the target server. • If the look-up table is large, the operation will require a large amount of memory and a long time to download its data and re-index it in the engine.
Slide 15: 1.
Perform row-by-row look-ups “on the fly”: • For every row, the ETL engine sends a query to the look-up table located on the target server. • The query returns a single row that is matched (or joined) to the current row of the flow. If the lookup table contains, for example, 500,000 rows, the ETL engine will send 500,000 queries. • This will dramatically slow down the data integration process and add significant overhead to your target system.
Slide 16: 1.
Use manual coding within the ETL job: • Use the ETL engine only for loading source data to the target RDBMS and manually write SQL code to join this data to the target look-up table. • This raises the question: why would you buy a tool that requires manual coding on the target server, knowing that you lose all the benefits of metadata management and development productivity by doing so? • Unfortunately, this is what many users end up doing once they notice (10 x) ten-times degradation in the overall performance of the integration process (when compared to the same operations executed by manual code).
Slide 17: Cost Issues
Most ETL tool purchases are justified based on potential labor savings. Unfortunately, there are other up-front and recurring costs that must be considered in the ROI analysis. The most obvious initial cost is that of the dedicated server and proprietary ETL engine software. Because these middle tier components carry out all the compute-intensive transformation operations, a powerful server is required, and in some cases multiple servers and run-time engines are necessary to meet the throughput requirements.
Slide 18: There are also ongoing hardware and software maintenance costs associated with these assets. This can result in hundreds of thousands of dollars in additional hardware, software and maintenance expenses. In addition, as the Data Warehouse grows to accommodate higher throughput demands, the ETL hub server will need to scale up with it, necessitating additional hardware and software purchases in the future. Conventional ETL tools also have a number of hidden costs, including the consulting expenses required for setup and tuning, and the rip-up and re-write of code as integration requirements evolve over time.
Slide 19: A Better Approach: “E-LT” Architecture + Business Rules
In response to the issues described before, a new architecture has emerged, which in many ways incorporates the best aspects of both manual coding and ETL approaches in the same solution. Known as “E-LT”, this new approach changes where and how data transformation takes place, and leverages the existing Developer skills, RDBMS engines and server hardware to the greatest extent possible. In essence, E-LT moves the data transformation step to the target RDBMS, changing the order of operations to: Extract the data from the source tables, Load the tables into the destination server, and then Transform the data on the target RDBMS using native SQL operators.
Slide 20: Note, with E-LT there is no need for a middle-tier engine or server as shown in the figure below.
Slide 21: Why the ETL Market is Changing
When commercial ETL tools first appeared in the 1990’s the most widely used RDBMSs such as Oracle, DB2, Teradata and Sybase did not support a rich enough set of SQL operators to handle the complex data transformation tasks required for Data Warehouse applications. Hence the dedicated ETL engine and proprietary transformation language emerged as the best alternative to laborious manual coding at the time. However, over the past decade the RDBMS vendors have increased the functionality of the SQL provided to Developers by an order of magnitude, while improving the performance and reliability of their engines at the same time.
Slide 22: For example:
The CASE…WHEN statement (equivalent to an IF… THEN… ELSE…) can be used for complex transformation rules. Outer joins (LEFT OUTER, RIGHT OUTER, or FULL JOIN) can be used to easily join data sets in a variety of different manners. Ranking and windowing functions (MIN OVER PARTITION, MAX OVER PARTITION, LEAD, LAG, and RANK) allow for more effective handling of complex aggregations of large volumes of data.
Slide 23: For example:
Complementing the richer language support, RDBMS vendors now provide a long list of out-of-the-box features and utilities that enable impressive performance when executing ETL-type operations: Some features are dedicated to efficient loading data from sources to targets Others directly process various data formats such as XML files. These are just a few examples of what can be done with the native SQL solutions provided with RDBMS packages today.
Slide 24: E-LT Architecture Offers Better Performance with Bulk Processing
By generating efficient, native SQL code, the E-LT approach leverages the powerful bulk data transformation capabilities of the RDBMS, and the power of the server(s) that hosts it. In addition, since the data is loaded directly from source systems to the target server, only one set of network transfers are required, not two or more as with the traditional ETL approach. Only relational DBMS engines can perform set operations and bulk data transformations, enabling processes to achieve higher performance. Inserts and updates are handled as “Bulk” operations and no longer performed row-by-row.
Slide 25: Thanks to “Set processing” logic, the E-LT approach can achieve exceptional performance with data transformations up to ten to twenty times more efficiently than traditional ETL tools.
Slide 26: Business-Rules-Driven Approach Brings Better Productivity and Maintainability
With a business-rules-driven paradigm, the developer only defines what they want to do, and the data integration tool automatically generates the data flow, including whatever intermediate steps are required, based on a library of “knowledge modules”. The “what to do”, i.e. the Business Rules, are specified using expressions that would make sense to Business Analysts, and are stored in a central metadata repository where they can be easily reused.
Slide 27: The implementation details, specifying “how to do it”, are stored in a separate knowledge module library, and can be shared between multiple Business Rules within multiple ETL processes. The key advantage of this approach is that it is very easy to make incremental changes either to the rules or to the implementation details, as they are, in essence, independent. When a change needs to be applied to operations logic (e.g. creating a backup copy of every target table before loading the new records) it is simply implemented in the appropriate knowledge module.
Slide 28: That change is then automatically reflected in the hundreds of processes that reference it, without having to touch the Business Rules. With a traditional ETL approach, such a change would require opening every process to manually add new steps, increasing the risks of errors and inconsistencies. This makes a huge difference in Developer productivity, especially in long-term program maintenance.
Slide 29: Combine E-LT with Business Rules to Lower the TCO
With no middle-tier ETL engine and no dedicated ETL server hardware required, the initial hardware and software capital costs are significantly lower, as are the ongoing software and hardware maintenance expenses. E-LT software also tends to be less expensive because it does not require the development of a proprietary engine for transformations. It uses any standard RDBMS to execute the ETL jobs. This means significant savings for both the software provider and the customer.
Slide 30: These savings are on top of the Developer productivity improvement, which enables Information Technology organizations to dramatically reduce the cost of developing and maintaining comprehensive Data Warehouses.
Slide 31: Conclusion: Traditional ETL is Indeed Becoming Obsolete
Now that we’ve compared the traditional ETL approach to the newer Business Rules-driven E-LT paradigm, the answer to the original question is somewhat clearer: Conventional ETL tools should be considered obsolete and phased out of the Enterprise Architecture, and tools based on Business Rules and E-LT should begin to take their place.