Pentaho+ and Data Science
Introduction
The power of Pentaho+ Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. Less well known is how PDI as a platform, with all its data munging power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM life cycle for the data science practitioner: generic data preparation and feature engineering, predictive modeling, and model deployment.
When it comes to deploying a predictive solution, PDI accelerates the process of operationalizing machine learning by working with popular libraries and languages such as R, Python, Weka and Spark MLlib. This allows output from team members developing in different environments to be integrated within the same framework, without dictating the use of a single predictive tool.
Orchestration Capability of Pentaho+ Data Integration
Most enterprises struggle to put models to work because data professionals often operate in silos, which creates bottlenecks in the workflow that runs from data preparation to model updates. The Pentaho+ platform enables collaboration and removes bottlenecks in five key areas:
Data Preparation and Feature Engineering
Pentaho+ makes it easy to prepare and blend traditional data sources with big data sources such as sensors and social media. Pentaho+ also accelerates the notoriously difficult and costly tasks of feature engineering by automating data onboarding, data transformation and data validation in an easy-to-use drag-and-drop environment.
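To make the validation and feature engineering tasks concrete, here is a minimal plain-Python sketch of the kind of logic such a transformation automates. It does not use any Pentaho+ API; the field names (`sensor_id`, `temperature_c`, `timestamp`) and the 80 degree threshold are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def validate(record):
    """Return a list of validation problems; an empty list means the record is clean."""
    problems = []
    if not record.get("sensor_id"):
        problems.append("missing sensor_id")
    try:
        float(record.get("temperature_c", ""))
    except (TypeError, ValueError):
        problems.append("temperature_c is not numeric")
    try:
        datetime.fromisoformat(record.get("timestamp", ""))
    except (TypeError, ValueError):
        problems.append("timestamp is not ISO 8601")
    return problems


def engineer_features(record):
    """Derive model-ready features from one validated record."""
    ts = datetime.fromisoformat(record["timestamp"])
    temp = float(record["temperature_c"])
    return {
        "sensor_id": record["sensor_id"],
        "temperature_c": temp,
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday is 5, Sunday is 6
        "is_overheated": temp > 80.0,     # hypothetical threshold
    }


def prepare(records):
    """Split records into engineered clean rows and rejected rows with reasons."""
    clean, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(engineer_features(record))
    return clean, rejected
```

Keeping rejected rows together with their reasons, rather than silently dropping them, mirrors the idea of routing bad records to an error stream for later inspection.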
Model Training, Testing and Tuning
Data scientists often apply a trial-and-error methodology to strike the right balance of performance and accuracy in their models. With integrations for languages such as R and Python, and for machine learning libraries such as Spark MLlib and Weka, Pentaho+ allows data scientists to train, tune, build and test models faster.
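The trial-and-error loop described above can be sketched in a few lines of plain Python. This is an illustrative example only, not a Pentaho+ feature: it tunes a single decision threshold on a toy training set and then checks the chosen value on held-out test data. The data, the candidate values and the function names are all assumptions made for the example.

```python
def accuracy(threshold, rows):
    """Fraction of (value, label) rows classified correctly by `value >= threshold`."""
    correct = sum((value >= threshold) == label for value, label in rows)
    return correct / len(rows)


def tune_threshold(train_rows, candidates):
    """Try every candidate and keep the first one with the best training accuracy."""
    return max(candidates, key=lambda t: accuracy(t, train_rows))


# Toy data: each row is (measurement, label).
TRAIN = [(1.0, False), (2.0, False), (3.0, False), (6.0, True), (7.0, True), (8.0, True)]
TEST = [(2.5, False), (6.5, True), (4.0, False), (9.0, True)]
CANDIDATES = [1.5, 3.5, 4.5, 7.5]

best = tune_threshold(TRAIN, CANDIDATES)
print("best threshold:", best)
print("training accuracy:", accuracy(best, TRAIN))
print("test accuracy:", accuracy(best, TEST))
```

Scoring the selected setting on data that was not used for tuning is what shows whether the apparent accuracy holds up, which is why training and testing are kept as separate steps.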
Operationalization and Deployment of Models
Pentaho+ allows data professionals to embed models developed by a data scientist directly in an operational workflow. They can reuse existing data preparation and feature engineering work, which significantly reduces the time taken to deploy.
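As a rough illustration of what embedding a model in a workflow involves, the sketch below loads a previously saved model and scores rows one at a time as they stream through. It is plain Python rather than a Pentaho+ step, and it assumes a hypothetical JSON model file holding logistic-regression parameters under the keys `intercept` and `weights`.

```python
import json
import math


def load_model(path):
    """Load model parameters from a JSON file (hypothetical format)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def score(model, features):
    """Return the logistic-regression probability for one feature row."""
    z = model["intercept"] + sum(
        weight * features[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(model, rows):
    """Yield each incoming row with a `score` field appended."""
    for row in rows:
        yield {**row, "score": score(model, row)}
```

Because the scoring function consumes the same engineered features that were used for training, the feature engineering logic only has to be written once and can be shared between the training and production paths.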
Data Visualization
Pentaho+ CTools are used to build dashboards and reports that visualize the results.
Regular Update of Models
With Pentaho+, data engineers and scientists can retrain existing models with new data sets or make feature updates using custom execution steps for R, Python, Spark MLlib and Weka. Pre-built workflows can automatically update models and archive existing ones.
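The update-and-archive pattern can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the way Pentaho+ implements it: the "model" is a trivial mean baseline, and the file layout and the names `retrain` and `archive_and_replace` are hypothetical.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def retrain(values):
    """Fit a trivial baseline 'model' (the mean) on a non-empty data set."""
    return {"mean": sum(values) / len(values), "n": len(values)}


def archive_and_replace(model_path, new_model, archive_dir):
    """Move any existing model file into archive_dir, then write the new model.

    Returns the archived file's path, or None if there was no previous model.
    """
    model_path = Path(model_path)
    archive_dir = Path(archive_dir)
    archived = None
    if model_path.exists():
        archive_dir.mkdir(parents=True, exist_ok=True)
        # A UTC timestamp with microseconds keeps archived names distinct.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        archived = archive_dir / f"{model_path.stem}-{stamp}{model_path.suffix}"
        shutil.move(str(model_path), str(archived))
    model_path.parent.mkdir(parents=True, exist_ok=True)
    model_path.write_text(json.dumps(new_model), encoding="utf-8")
    return archived
```

Archiving the previous model before writing the new one means an earlier version is always available for rollback or comparison if the retrained model performs worse.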
Conclusion
Pentaho+ fills a gap by operationalizing the data integration process for advanced and predictive analytics. It also makes it easy to onboard a wide variety of data sources into your data management environment.