Pentaho and Data Science


Introduction

The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. Less well known, however, is how PDI as a platform, with all its data munging power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM life-cycle for the data science practitioner: generic data preparation and feature engineering, predictive modeling, and model deployment.

When it comes to deploying a predictive solution, Pentaho Data Integration (PDI) accelerates the process of operationalizing machine learning by working seamlessly with popular libraries and languages such as R, Python, Weka and Spark MLlib. This allows output from team members developing in different environments to be integrated within the same framework, without dictating the use of a single predictive tool.

Orchestration Capability of Pentaho Data Integration

Most enterprises struggle to put models to work because data professionals often operate in silos, creating bottlenecks in the workflow from data preparation to model updates. The Pentaho platform enables collaboration and removes bottlenecks in five key areas:

Data Preparation and Feature Engineering

Pentaho makes it easy to prepare and blend traditional data sources with big data sources like sensors and social media. Pentaho also accelerates the notoriously difficult and costly task of feature engineering, automating data onboarding, data transformation and data validation in an easy-to-use drag-and-drop environment.
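To make that concrete, here is a minimal Python sketch of the kind of feature-engineering logic that PDI's drag-and-drop steps (such as Filter Rows, Calculator and Group By) encapsulate. The column names (customer_id, event_time, amount) are hypothetical placeholders, not a prescribed schema.

```python
# A minimal pandas sketch of typical feature-engineering work that PDI's
# drag-and-drop transformation steps encapsulate. All column names here
# are hypothetical examples.
import pandas as pd

def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Validate: drop rows with missing keys and reject negative amounts.
    df = df.dropna(subset=["customer_id", "event_time"])
    df = df[df["amount"] >= 0]
    # Derive time-based features from the raw timestamp.
    df["event_time"] = pd.to_datetime(df["event_time"])
    df["is_weekend"] = df["event_time"].dt.dayofweek >= 5
    # Aggregate per customer into a blended feature set.
    features = df.groupby("customer_id").agg(
        total_amount=("amount", "sum"),
        avg_amount=("amount", "mean"),
        txn_count=("amount", "count"),
        weekend_share=("is_weekend", "mean"),
    ).reset_index()
    return features
```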

Model Training, Testing and Tuning

Data scientists often apply a trial-and-error methodology to strike the right balance between performance and accuracy in their models. With integrations for languages like R and Python, and for machine learning libraries like Spark MLlib and Weka, Pentaho allows data scientists to seamlessly build, train, tune and test models faster.
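As an illustration, the sketch below shows a train/tune/test loop in scikit-learn of the sort a data scientist might run from PDI's Python integration. The feature matrix X and labels y are assumed to come from the prepared data set, and the estimator and parameter grid are arbitrary choices for the example.

```python
# A minimal scikit-learn sketch of the trial-and-error tuning loop.
# X and y are assumed to be the prepared features and labels.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train_and_tune(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Grid search automates part of the trial-and-error tuning.
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
        cv=5,
    )
    search.fit(X_train, y_train)
    # Hold-out evaluation checks the accuracy/performance balance.
    test_accuracy = accuracy_score(y_test, search.predict(X_test))
    return search.best_estimator_, test_accuracy
```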

Operationalization and Deployment of Models

Pentaho allows data professionals to easily embed models developed by a data scientist directly in an operational workflow. They can leverage existing data and feature engineering efforts, significantly reducing the time taken to deploy.
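A minimal sketch of what that embedding might look like through PDI's Python integration follows. The model path is hypothetical, and the engineer_features function from the earlier data-preparation sketch is reused so the same feature logic serves both training and scoring.

```python
# A minimal sketch of scoring incoming rows inside an operational
# workflow. The model path is hypothetical; engineer_features is the
# helper from the data-preparation sketch above.
import pickle

import pandas as pd

def score_batch(model_path: str, incoming: pd.DataFrame) -> pd.DataFrame:
    # Load the previously trained and persisted model.
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    # Reuse the existing feature-engineering logic before scoring.
    features = engineer_features(incoming)
    features["prediction"] = model.predict(
        features.drop(columns=["customer_id"])
    )
    return features[["customer_id", "prediction"]]
```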

Data Visualization

Pentaho CTools can be used to build the dashboards and reports that visualize model outputs and predictions.

Regular Update of Models

With Pentaho, data engineers and data scientists can re-train existing models with new data sets or make feature updates using custom execution steps for R, Python, Spark MLlib and Weka. Pre-built workflows can automatically update models and archive existing ones.
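The sketch below outlines a scheduled refresh job of the kind such a pre-built workflow could automate: archive the current model with a timestamp, re-train on the new data, and publish the replacement. The paths are hypothetical, and train_and_tune is the helper from the earlier tuning sketch.

```python
# A minimal sketch of a scheduled model-refresh job: archive the current
# model, re-train on new data, and publish the replacement. Paths are
# hypothetical; train_and_tune is the helper from the tuning sketch.
import pickle
import shutil
from datetime import datetime
from pathlib import Path

def refresh_model(model_path: str, X_new, y_new) -> None:
    current = Path(model_path)
    # Archive the existing model with a timestamp before replacing it.
    if current.exists():
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        archive = current.with_name(f"{current.stem}_{stamp}{current.suffix}")
        shutil.move(str(current), str(archive))
    # Re-train on the new data set and publish the updated model.
    model, accuracy = train_and_tune(X_new, y_new)
    with open(current, "wb") as f:
        pickle.dump(model, f)
    print(f"Updated model saved (hold-out accuracy: {accuracy:.3f})")
```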

Conclusion

Pentaho fills a gap in operationalizing the data integration process for advanced and predictive analytics, and it makes it easy to onboard a wide variety of data sources into your data management environment.