ETL vs ESB data flows: 5 key points to choose the right solution
What is an ETL?
ETL, Extract Transform Load, is a middleware that transfers data from point A to point B at defined intervals. These three verbs summarize each data flow’s three phases, or processes.
The first phase consists of data extraction, which can come from any source: a database, FTP, Amazon S3, Dropbox, Google Drive, Linux folder, etc.
Depending on the source, the data can also be extracted in any format: SQL, JSON, XML, EDI, positional, Excel…
A second transformation phase occurs to apply business rules, clean the data, filter out irrelevant information, etc.
A third phase loads the data into a target destination which, like the source, can be any system and any format.
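The three phases above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the inline CSV source, the `customers` table, the function names, and the "drop rows without a country" rule are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a file, an FTP drop, a bucket, etc.
SOURCE_CSV = """id,name,country
1, Alice ,fr
2,Bob,
3,Carol,us
"""

def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean the values and filter out irrelevant rows."""
    cleaned = []
    for row in rows:
        country = row["country"].strip().upper()
        if not country:
            continue  # assumed business rule: rows without a country are dropped
        cleaned.append((int(row["id"]), row["name"].strip(), country))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """One complete run: extract, then transform, then load."""
    load(transform(extract(text)), conn)

conn = sqlite3.connect(":memory:")  # in-memory target, for illustration only
run_etl(SOURCE_CSV, conn)
result = conn.execute("SELECT id, name, country FROM customers ORDER BY id").fetchall()
print(result)
```

Keeping each phase in its own function is one way to limit the "black box" effect: every step can be read, tested, and replaced on its own.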
The advantage of ETL is its flexibility and ability to handle the most common cases thanks to the diversity of input/output sources and formats.
The drawback of ETL is that it can become a “black box,” making it difficult to maintain or evolve.
What is an ESB?
ESB, or Enterprise Service Bus, allows applications within the same information system, which were not designed to work together, to communicate. This issue is solved by creating a bus that listens and transmits data from application A to application B, and from application B to application A.
This way, if a client’s information system grows—due to a merger or the addition of a new tool for business teams, for example—the new application only needs to be “connected” to the ESB to be integrated into the existing system.
The advantage is that the information system does not have to be overhauled every time new software or a new application is integrated.
The disadvantage is that applications must be able to exchange information in real time, typically through SOAP or REST interfaces.
What’s the difference between ETL and ESB?
Data retrieval
The main difference between ETL and ESB lies in how data is retrieved.
ETL works in a “pull” manner: the flow runs on a schedule or on demand and fetches (pulls) the data from a defined source to perform the expected task.
The ETL flow is an active flow that fetches the data.
ESB works in a “push” manner, where the flow is “event-driven” and will execute as soon as data is received from a source application.
Upon receiving it, the bus distributes (=push) the received information to target applications via a “publisher/subscriber” system.
The ESB flow is a passive flow that transfers the data.
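The contrast between pull and push can be illustrated with a toy in-process example. The `Bus` class, the `pull` function, and the topic name `customer.updated` are hypothetical names chosen for this sketch; a real ESB adds transport, routing, and error handling on top of the same idea.

```python
from collections import defaultdict

def pull(source_rows, target):
    """Pull (ETL style): the flow goes to the source and fetches what is there."""
    target.extend(source_rows)

class Bus:
    """Push (ESB style): a toy publisher/subscriber bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a target application's handler for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Forward the message to every subscriber as soon as it is received."""
        for handler in self._subscribers.get(topic, ()):
            handler(message)

# Pull: nothing moves until the flow is run.
warehouse = []
pull([{"id": 1}, {"id": 2}], warehouse)

# Push: the source application publishes, the bus distributes immediately.
crm_inbox, billing_inbox = [], []
bus = Bus()
bus.subscribe("customer.updated", crm_inbox.append)
bus.subscribe("customer.updated", billing_inbox.append)
bus.publish("customer.updated", {"id": 42, "email": "a@example.com"})
```

In the pull case the flow decides when data moves; in the push case the source application does, and the bus only relays.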
Data sync waiting times
The second difference, linked to the first, concerns the waiting times of data synchronization.
Since ETL operates on a scheduled basis, data will be integrated when the scheduled flow completes its process.
Whatever the trigger, the target is only up to date once the run has finished executing.
These triggers can be set hourly, daily, or at any regular intervals.
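To make the waiting time concrete, the following sketch computes when a change made in the source becomes available in the target of a scheduled flow. The function name `next_sync` and the assumption that runs start at fixed intervals and take a constant time are simplifications made for illustration.

```python
from datetime import datetime, timedelta

def next_sync(change_time, first_run, interval, run_duration):
    """Return when a change made at change_time is available in the target.

    Assumes runs start at first_run, first_run + interval, ... and that a run
    picks up every change made up to the moment it starts.
    """
    if change_time <= first_run:
        run_start = first_run
    else:
        elapsed = change_time - first_run
        runs_passed = -(-elapsed // interval)  # ceiling division on timedeltas
        run_start = first_run + runs_passed * interval
    return run_start + run_duration

change = datetime(2024, 1, 1, 9, 10)
available = next_sync(
    change_time=change,
    first_run=datetime(2024, 1, 1, 0, 0),
    interval=timedelta(hours=1),       # hourly trigger
    run_duration=timedelta(minutes=5), # assumed constant run time
)
print(available - change)  # waiting time for this particular change
```

With an hourly trigger and a five-minute run, a change made ten minutes after a run started waits 55 minutes before it reaches the target.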
ESB, on the other hand, is event-driven. It will receive the data in the bus as soon as it is created or modified in the source application.
Once received in the bus, the data is sent to the targets in real time after simple transformation steps and via a “publisher/subscriber” system.
Unlike with ETL, integration happens in real time.
Volume
The third difference, linked to the second, concerns the data volume.
ETL is designed to handle large volumes at scheduled times during the day, so it operates at low intensity.
Example: a flow runs once a day, reading a table with ten thousand rows in a database.
In this case, it processes ten thousand rows and then stops.
Conversely, ESB is designed to handle smaller volumes, but in real time, so it operates at high intensity.
Example: the bus receives ten thousand independent data entries throughout the day.
In this case, it processes ten thousand single-row transactions over the day and goes back to listening after each one (the ESB flow never stops).
Complexity
The fourth difference concerns implementation complexity. Because of the volume, the ETL flow is more complex to implement: the code must be optimized to handle the load.
Additionally, the “T” step in ETL typically involves more transformations.
An optimization layer is necessary to prevent performance degradation on the server where the flow is installed.
This involves writing data to temporary files, for example, to increase the number of “sub-processes” within the flow, allowing data processing to be broken into smaller parts.
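The idea of breaking a large extraction into smaller parts through temporary files can be sketched as follows. The chunk size, the file naming, and the `process_chunk` step (summing an amount column) are assumptions chosen to keep the example short.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path

def split_into_chunks(rows, chunk_size, workdir):
    """Write rows to temporary CSV files of at most chunk_size rows each."""
    paths = []
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))  # only one chunk in memory
        if not chunk:
            break
        path = Path(workdir) / f"chunk_{len(paths):04d}.csv"
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(chunk)
        paths.append(path)
    return paths

def process_chunk(path):
    """Illustrative sub-process: sum the amount column of one chunk."""
    with path.open(newline="") as f:
        return sum(int(amount) for _row_id, amount in csv.reader(f))

# A generator stands in for the source, so the rows are never all in memory.
rows = ((i, i * 2) for i in range(10_000))

with tempfile.TemporaryDirectory() as workdir:
    chunk_paths = split_into_chunks(rows, 1_000, workdir)
    chunk_count = len(chunk_paths)
    total = sum(process_chunk(path) for path in chunk_paths)

print(chunk_count, total)
```

Each chunk can then be handled by its own sub-process, retried on failure, or processed on another machine, at the cost of extra disk I/O.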
The ESB flow, however, only retrieves data pushed from the source. Its main purpose is to transfer the data so that applications remain synchronized.
The number and complexity of transformations in an ESB flow are smaller compared to ETL.
Machine resources and parallelization
The fifth and final difference concerns the number of processes and the resources consumed simultaneously.
For ETL, each job is launched, runs to completion, and then stops.
Every run therefore has to launch and initialize the flow before executing its processes, which requires more machine resources.
Coupled with the volume and number of potential flows, the server load can double (or more) if all flows start at the same time.
By contrast, ESB is constantly listening, meaning the flow is already instantiated, running with resources ready to be used, and data arrives according to the load from source applications (=application users).
This way, multiple requests can be processed simultaneously without straining machine resources as much as ETL does.
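The "already instantiated, constantly listening" mode can be approximated with a small pool of long-lived worker threads reading from a queue. This is only a sketch of the operating mode: the `start_listeners` helper, the worker count, and the `None` sentinel used to stop the example are assumptions, and a real bus would not be shut down this way.

```python
import queue
import threading

def start_listeners(inbox, handler, worker_count):
    """Start the workers once; they stay alive and wait for messages."""
    def listen():
        while True:
            message = inbox.get()  # blocks while idle instead of restarting
            try:
                if message is None:  # sentinel, used only to end this sketch
                    return
                handler(message)
            finally:
                inbox.task_done()

    workers = [threading.Thread(target=listen, daemon=True) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    return workers

processed = []
lock = threading.Lock()

def handle(message):
    """Illustrative handler: record the id of each message received."""
    with lock:
        processed.append(message["id"])

inbox = queue.Queue()
workers = start_listeners(inbox, handle, worker_count=4)

for i in range(100):        # messages arrive as source applications emit them
    inbox.put({"id": i})
inbox.join()                # wait until every message has been handled

for _ in workers:           # stop the workers so the example terminates
    inbox.put(None)
for worker in workers:
    worker.join()
```

The initialization cost is paid once when the workers start; after that, several messages can be handled at the same time without launching a new flow per message.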
How to choose? And do we have to choose?
Before determining whether the need is for ETL or ESB, the following five questions must be considered:
- Is the need for data immediate? If yes, ESB; otherwise, ETL.
- Are there complex transformations? If yes, ETL; otherwise, ESB.
- Is the data volume significant? If yes, ETL; otherwise, ESB.
- Will I need to add other applications in the future? If yes, ESB; otherwise, ETL.
- Do I want to limit my budget? If yes, ETL; otherwise, ESB.
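The five questions above can be turned into a simple tally. The article does not prescribe a scoring rule, so counting one vote per question with equal weight is an assumption made here, and the `recommend` function and its parameter names are hypothetical.

```python
def recommend(immediate, complex_transformations, large_volume,
              future_applications, limited_budget):
    """Count how many of the five answers point to ESB and how many to ETL."""
    votes_for_esb = [
        immediate,                    # immediate need for data -> ESB
        not complex_transformations,  # complex transformations -> ETL
        not large_volume,             # significant volume -> ETL
        future_applications,          # more applications to come -> ESB
        not limited_budget,           # limited budget -> ETL
    ]
    esb = sum(votes_for_esb)
    etl = len(votes_for_esb) - esb
    # Five questions, so there is never a tie.
    return {"ESB": esb, "ETL": etl, "choice": "ESB" if esb > etl else "ETL"}

nightly_reporting = recommend(
    immediate=False, complex_transformations=True, large_volume=True,
    future_applications=False, limited_budget=True,
)
app_synchronization = recommend(
    immediate=True, complex_transformations=False, large_volume=False,
    future_applications=True, limited_budget=False,
)
print(nightly_reporting["choice"], app_synchronization["choice"])
```

A split result such as three votes against two is a hint that neither tool alone covers the need.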
“What if I want the benefits of both ETL and ESB?”
The distinction between ETL and ESB is becoming increasingly thin. Integrators are combining these two technologies to gain the advantages of both without the drawbacks, all within the same platform (see Talend API Cloud Services).
Thus, the choice is no longer a functional one but a purely technical one, based on business needs, the data received, the processes to be executed, logistical constraints, etc.
The development team is responsible for these issues, while the enterprise/management perspective is to choose the right application provider (based on needs, scalability, maintainability, and updates) and demand a generic framework to accommodate each evolution within this ETL + ESB platform.
At no point should it be necessary to completely refactor or change the architecture every time an application is added. Technical expertise is therefore crucial when considering a long-term vision.