Parallel Job Execution in Pentaho with Dynamic Configuration

We had to design and build a data warehouse for a multi-tenant architecture. There were multiple clients with different data metrics and source databases, while the data model (dimensions and facts) was common to all of them. We had initially developed the ETL (Extract, Transform & Load) jobs for a single client, and these now had to be scaled to meet the new requirement.

To load the data using the same ETL job, one can change the source and target database configurations in the jdbc.properties file, but this is not a scalable approach, as the properties file needs to be modified every time a job has to be executed. Moreover, since Pentaho refers to JNDI (Java Naming and Directory Interface) connections by name, one cannot define two JNDI connections with the same name. There is also a good chance that the same job will be executed in parallel for different source and target databases; in such situations, parameters may get overridden during parallel execution because of the shared kettle.properties file.
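For context, a JNDI connection in the simple-jndi jdbc.properties file is defined by a group of entries keyed by the connection name, which is what the jobs look up. The entries below are a minimal sketch with a hypothetical connection name, host and credentials:

# Hypothetical JNDI connection "dwh_source" defined in jdbc.properties (simple-jndi format)
dwh_source/type=javax.sql.DataSource
dwh_source/driver=com.mysql.jdbc.Driver
dwh_source/url=jdbc:mysql://client1-db-host:3306/warehouse
dwh_source/user=etl_user
dwh_source/password=etl_password

Because the job always resolves the connection by this name, two clients cannot safely share one jdbc.properties file with different definitions of it.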

An inefficient way to tackle this problem would be to create a separate job for each client. Imagine a scenario where there are hundreds of ETL jobs and tens of clients!  It would lead to a huge duplication of effort and would be operationally ineffective.

We, therefore, wanted to achieve our objective with minimal rework and a design that would be maintainable and operationally effective.

The Solution:

If one has installed PDI on a server at the path “/opt/data-integration” and runs a PDI job through the kitchen command below, then by default it looks for kettle.properties and repositories.xml under “$KETTLE_HOME/.kettle” and for jdbc.properties under “/opt/data-integration/simple-jndi”.

/opt/data-integration/kitchen.sh -rep test -job load_data_job
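With this default invocation, the configuration files resolve roughly to the following locations (when KETTLE_HOME is not set, it defaults to the user’s home directory, so the first two typically live under ~/.kettle):

$KETTLE_HOME/.kettle/kettle.properties
$KETTLE_HOME/.kettle/repositories.xml
/opt/data-integration/simple-jndi/jdbc.properties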

To overcome this problem, one can assign these paths explicitly while firing the kitchen command.

For example:

KETTLE_HOME="/data/client1/" KETTLE_JNDI_ROOT="/data/client1/jndi/" /opt/data-integration/kitchen.sh -rep test -job load_data_job

KETTLE_HOME="/data/client2/" KETTLE_JNDI_ROOT="/data/client2/jndi/" /opt/data-integration/kitchen.sh -rep test -job load_data_job

In the above example, we have kept separate kettle.properties and jdbc.properties files for the two clients at different locations.
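Concretely, the per-client layout would look something like this (the directory names are illustrative); note that kettle.properties is read from the .kettle subdirectory under KETTLE_HOME:

/data/client1/.kettle/kettle.properties
/data/client1/jndi/jdbc.properties
/data/client2/.kettle/kettle.properties
/data/client2/jndi/jdbc.properties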

If you run the first kitchen command, it looks for kettle.properties under “/data/client1/.kettle” and for the jdbc.properties file under “/data/client1/jndi/”.

If you run the second kitchen command, it looks for kettle.properties under “/data/client2/.kettle” and for the jdbc.properties file under “/data/client2/jndi/”.

Similarly, for each new client one can set up the configuration in a new location and point the kitchen command at the new kettle and jdbc properties files. This gives the benefit of reusing existing ETL jobs and avoiding conflicts between parallel executions, while providing the flexibility to scale when required.
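To make this repeatable, the per-client invocation can be wrapped in a small shell script. The sketch below is only an illustration: the script name, the /data/<client> layout and the repository name “test” follow the examples above, and nothing here is specific to Pentaho beyond the two environment variables already discussed.

#!/bin/bash
# run_client_job.sh (hypothetical wrapper): run a PDI job with client-specific configuration
# Usage: ./run_client_job.sh <client_name> <job_name>
CLIENT="$1"
JOB="$2"

# Point PDI at the client-specific kettle.properties and jdbc.properties
KETTLE_HOME="/data/${CLIENT}/" \
KETTLE_JNDI_ROOT="/data/${CLIENT}/jndi/" \
/opt/data-integration/kitchen.sh -rep test -job "${JOB}"

The same script can then be launched in parallel for different clients without the configurations interfering with each other, for example:

./run_client_job.sh client1 load_data_job &
./run_client_job.sh client2 load_data_job &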
