Execute Periodic ETL Automatically


Up to the last post, Validate Data Model by Studio Analysis, I completed creating the ETL programs in Data Integration. But this is still a test product, because these programs are executed manually by an operator. In this post, I would like to share the final piece of Data Integration: how to run ETL programs automatically after the production release.

There are several ways to achieve automatic ETL. For example, in my production system I use the scheduler in Machine Learning Workbench to trigger a Jupyter Notebook, which then operates Data Integration via Pycelonis (the Python API for Celonis EMS). I would like to write posts about Pycelonis later, but I do not use it today because it takes a lot to explain. Instead, today I use the built-in Scheduler in Data Integration; it is easy to set up and good enough for production use too.
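Just to give a flavor of that alternative approach, here is a rough sketch of such a notebook, assuming Pycelonis 1.x as bundled in Machine Learning Workbench at the time of writing. The pool and job names are placeholders from my side, and method names like pools.find, data_jobs and execute may differ in other Pycelonis versions, so please check the Pycelonis documentation before relying on it.

# Rough sketch only: trigger a Data Integration data job from a scheduled notebook.
from pycelonis import get_celonis

celonis = get_celonis()                    # in ML Workbench the credentials come from the environment
pool = celonis.pools.find("my data pool")  # placeholder pool name
job = pool.data_jobs.find("main job")      # placeholder data job name
job.execute()                              # starts the extraction and transformations of that job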

If you remember the discussion in Tune Endpoint Parameter Relevant to Delta Load, I created the Data Pool Parameter today for filtering the extraction scope. This parameter should be updated to the current time before each extraction, so I will introduce a new (global) initial job. The SQL in this job just gets the current time, like below.

-- Recreate the helper table that stores the current timestamp.
-- IF EXISTS avoids an error on the very first run, when the table does not exist yet.
DROP TABLE IF EXISTS _CEL_CURRENT_TIME;
CREATE TABLE _CEL_CURRENT_TIME AS
SELECT NOW()
;

Next I would like to fetch this value for the extraction. Open Data Pool Parameters and update the today setting so that it reads its value from the table _CEL_CURRENT_TIME above. Then go to the Extraction setting and change the parameter today to use the value of the Data Pool Parameter.

[Screenshot: Data Pool Parameter today reading its value from _CEL_CURRENT_TIME]

[Screenshot: Extraction setting where the parameter today uses the Data Pool Parameter value]
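As a reminder of how such a parameter is consumed (a generic sketch, not my exact endpoint setting from the earlier post): in extraction filters and transformations, Data Integration replaces the placeholder <%=today%> with the parameter value at runtime, so a delta-style filter could look like updated_on <= <%=today%>, where updated_on is just a hypothetical column name.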

OK, then I would like to collect the extraction and transformations into the main job, and the global view creation and the Data Model reload into the final job, like below.

[Screenshot: main job containing the extraction and transformations]

[Screenshot: final job containing the global view creation and the Data Model reload]

Now I am ready to schedule the data jobs. Click the Scheduling menu and create a new schedule. From bottom to top in the screen below, I set up the scheduling. First, I choose the data jobs to be executed periodically: I select the three jobs above and order them so that the initial job, the main job and the final job run sequentially. Second, I set the Scheduling Plan to Hourly; if I wanted to execute more frequently, I would use Custom Cron and set up its parameter (reference). Third, I enable scheduling and switch to Delta Load. The final setup is as in the screen below.

[Screenshot: scheduling setup with the three data jobs, Hourly plan, Delta Load and scheduling enabled]
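In case you go with Custom Cron: the exact expression format is described in the referenced documentation, but as a rough illustration, in standard five-field cron notation an every-15-minutes schedule would be */15 * * * *. Please verify the field layout that Celonis expects, since cron dialects differ.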

Then just wait for one hour and check whether the jobs are executed successfully. If you would like to monitor the background runs, click the three dots of a data job or data model and set up its notification (or subscription) setting, because the Scheduler itself does not have a notification function.

That is it. I can conclude the Data Integration series now. And I am thankful to you for continuing to read my posts for one year. I would like to share my ideas with you in the second year too.

Kaz


I would like to share my Data Pool file with you. If you would like to copy my pool to your environment, run content-cli in your Machine Learning Workbench like content-cli push data-pool -f planio_pool_20220430.json -p mytraining. Please see Share my Analysis by Content-CLI for details of content-cli.
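If you have not connected content-cli to your team yet, you first need a profile (the -p mytraining part refers to it); as far as I know this is created interactively with content-cli profile create, but please follow the post above for the exact steps.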

The Extractor is excluded from this copy, so please see Tune Endpoint Parameter Relevant to Delta Load to upload the Extractor, and point the connection to the uploaded Extractor.



See also