How to use Python in PDI

Introduction

Python is an open-source, interpreted, high-level programming language. It supports object-oriented programming and is one of the languages most widely used by data scientists and data analysts across a range of applications and projects. It offers strong support for mathematical, statistical, and scientific computing, and it has a rich set of libraries for data science applications.

Pentaho Data Integration and Python

Pentaho is Business Intelligence (BI) software that provides data integration as one of its services. Using Pentaho Data Integration (PDI), one can access, prepare, and blend data faster. PDI also provides seamless orchestration for building data pipeline services.

PDI also provides capabilities to operationalize Python, so that data scientists can use the strengths of this versatile language alongside existing PDI steps to develop predictive solutions. Python can be integrated into Pentaho via:

CPython Script Executor (Marketplace Plugin)

The CPython Script Executor can load the output of a PDI Table Input step as a pandas data frame. It can execute a Python script from a file path, or run Python code embedded in the step itself.
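
As a rough illustration of the kind of code such a step might run, the sketch below treats incoming rows as a pandas data frame and derives a new column. The frame is built by hand so the snippet runs on its own. The column names, the sample data, and the `add_total` helper are hypothetical, and how the step names the incoming frame depends on its configuration.

```python
import pandas as pd


def add_total(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with a derived 'total' column."""
    result = frame.copy()
    result["total"] = result["quantity"] * result["unit_price"]
    return result


# Stand-in for rows that would arrive from an upstream step (made-up data).
rows = pd.DataFrame(
    {"item": ["a", "b"], "quantity": [2, 3], "unit_price": [1.5, 4.0]}
)
output = add_total(rows)
```

Working on a copy keeps the incoming frame unchanged, which makes it easier to compare input and output while developing a script.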

Integrating Python in Pentaho Data Integration (PDI)

Before using the CPython Script Executor, complete the following steps:

  • Install a suitable Python version and set the system path variables accordingly
  • Install the core libraries, such as:
    • pandas
    • NumPy
    • Py4J
    • Matplotlib
    • scikit-learn (imported as sklearn)
  • Install the CPython Script Executor from the Marketplace provided in your PDI
  • Restart the DI Server and Spoon so that the environment changes take effect
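
One quick way to confirm that the libraries above are visible to the Python interpreter on your path is a small check like the following. This is a minimal sketch using only the standard library; the `check_modules` helper and the `REQUIRED` list are illustrative names, not part of PDI or the plugin.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names of the libraries listed above (scikit-learn imports as "sklearn").
REQUIRED = ["pandas", "numpy", "py4j", "matplotlib", "sklearn"]

if __name__ == "__main__":
    for name, found in check_modules(REQUIRED).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it with the same interpreter that PDI will use, since a library installed for a different Python installation will show up as missing here.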

Once the above steps are done, you can create a new transformation and start working with the CPython Script Executor.

CPython Script Executor

The step uses the C implementation of the Python programming language. JVM-based solutions such as Jython allow a more tightly integrated experience when executing in the JVM, but they do not support many of the high-powered Python libraries for scientific computing, because those libraries include highly optimized components written in C or Fortran. To gain access to such libraries, the PDI step launches, and communicates with, a micro-service running in the CPython environment.

The CPython Script Executor step requires the developer to map the step to the Python environment and to map its input, its output, or both.

  • Step name: Specifies the unique name of the CPython Script Executor step on the canvas. You can customize the name or leave it as the default.

The step itself offers a great deal of flexibility when it comes to dealing with data. It can act as a starting point or data source in PDI, which lets the developer source data directly in Python code if desired, or it can accept data from an upstream step and push it into the Python environment. In the latter case, the user can choose to send all incoming rows to Python in a single batch, send fixed-size batches of rows, or send rows one at a time.
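
The three ways of handing rows to Python can be pictured with a small pure-Python sketch. This only illustrates the batching idea and is not how the step is implemented; the `batches` helper and its `size` parameter are hypothetical.

```python
from itertools import islice


def batches(rows, size=None):
    """Yield lists of rows.

    size=None -> one batch containing every row
    size=1    -> one row at a time
    size=n    -> fixed-size batches (the last one may be shorter)
    """
    iterator = iter(rows)
    if size is None:
        all_rows = list(iterator)
        if all_rows:
            yield all_rows
        return
    if size < 1:
        raise ValueError("size must be a positive integer")
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

For example, `list(batches(range(5), 2))` groups five rows into two full batches and one shorter final batch. A script that aggregates over the whole data set usually needs every row at once, while a script that scores rows independently can work on smaller batches.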

A Python script can be specified in the built-in editor or loaded dynamically from a file at runtime.