Pentaho+ Data Processing Methodology


Introduction

Data processing occurs when data is collected and translated into usable information. It is usually performed by a data scientist or a team of data scientists, and it must be done correctly so that errors do not carry through to the end product or output.

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers for further utilisation.

Data is collected from various sources and loaded into the data warehouse using Pentaho+ Data Integration. The Pentaho+ Data Integration tool cleanses and transforms the data, applies rules, and stores the result in the data warehouse.

Phases of Data Processing

There are six phases in data processing:

  • Data collection

Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources are trustworthy and well-built so that the data collected (and later used as information) is of the highest possible quality.

  • Data preparation

Once the data is collected, it enters the data preparation stage. Data preparation, often referred to as “pre-processing”, is the stage at which raw data is cleaned and organized for further processing. During preparation, raw data is diligently checked for errors. The purpose of this step is to eliminate bad data (redundant, incomplete or incorrect data) and create high-quality data for the best results.

  • Data input

The clean data is fed to the system and translated into a language that it can understand. Data input is the stage at which raw data begins to look like usable information.

  • Processing

During this stage, the data fed to the system in the previous stage is processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of data being processed (data lakes, social networks, connected devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).

  • Data interpretation

The output/interpretation stage is where data is finally rendered usable. The data is translated into a readable form, often as graphs, videos, images, plain text, etc.

  • Data storage

The final stage of data processing is storage. After all of the data is processed, it is stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. Properly stored data is also a necessity for compliance with data protection legislation such as the GDPR (General Data Protection Regulation). When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.
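To make the preparation and input phases above more concrete, here is a minimal sketch in plain Python. It is not how Pentaho+ Data Integration is configured; the sample records, the required fields and the validity rule (an age between 0 and 120) are all assumptions made up for illustration.

```python
# Hypothetical raw records, as they might arrive from a source system.
RAW_RECORDS = [
    {"id": "1", "name": "Alice", "age": "34"},
    {"id": "1", "name": "Alice", "age": "34"},  # redundant (exact duplicate)
    {"id": "2", "name": "Bob", "age": ""},      # incomplete (missing age)
    {"id": "3", "name": "Carol", "age": "-5"},  # incorrect (negative age)
    {"id": "4", "name": "Dan", "age": "41"},
]

# Assumed list of fields every record must carry.
REQUIRED_FIELDS = ("id", "name", "age")


def is_complete(record: dict) -> bool:
    """A record is complete when every required field is present and non-empty."""
    return all(str(record.get(field, "")).strip() for field in REQUIRED_FIELDS)


def is_correct(record: dict) -> bool:
    """Assumed validity rule: age must be an integer between 0 and 120."""
    try:
        return 0 <= int(record["age"]) <= 120
    except (KeyError, TypeError, ValueError):
        return False


def prepare(records: list[dict]) -> list[dict]:
    """Data preparation: drop redundant, incomplete and incorrect records."""
    seen = set()
    clean = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # redundant: an identical record was already processed
        seen.add(key)
        if is_complete(record) and is_correct(record):
            clean.append(record)
    return clean


def to_typed(record: dict) -> dict:
    """Data input: convert string fields into the types later stages expect."""
    return {
        "id": int(record["id"]),
        "name": record["name"],
        "age": int(record["age"]),
    }


clean_records = [to_typed(r) for r in prepare(RAW_RECORDS)]
print(clean_records)
# [{'id': 1, 'name': 'Alice', 'age': 34}, {'id': 4, 'name': 'Dan', 'age': 41}]
```

Only two of the five sample records survive: the duplicate, the record with a missing age and the record with a negative age are all filtered out before the remaining rows are converted to typed values.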

Key requirements

Data lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

Data integration

Data integration is the process of combining data from various sources or platforms to provide a unified view.
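As a rough illustration of what a "unified view" means, the sketch below combines two small in-memory sources on a shared key using plain Python. The source names (`CRM_ROWS`, `BILLING_ROWS`), the `customer_id` key and the output fields are hypothetical; a real integration job in a tool such as Pentaho+ Data Integration would be built differently.

```python
from collections import defaultdict

# Hypothetical extracts from two separate source systems.
CRM_ROWS = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
BILLING_ROWS = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 3, "amount": 75.0},  # no matching CRM record
]


def integrate(crm_rows, billing_rows):
    """Combine both sources into one record per customer, keyed by customer_id."""
    # Sum billing amounts per customer.
    totals = defaultdict(float)
    for row in billing_rows:
        totals[row["customer_id"]] += row["amount"]

    # Index CRM names by customer.
    names = {row["customer_id"]: row["name"] for row in crm_rows}

    # Keep every customer seen in either source (a full outer join).
    unified = []
    for customer_id in sorted(set(names) | set(totals)):
        unified.append({
            "customer_id": customer_id,
            "name": names.get(customer_id),  # None if absent from the CRM
            "total_billed": totals.get(customer_id, 0.0),
        })
    return unified


unified_view = integrate(CRM_ROWS, BILLING_ROWS)
for row in unified_view:
    print(row)
```

Keeping customers that appear in only one source is a design choice here: gaps stay visible in the unified view (a missing name or a zero total) instead of being dropped silently.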

Data platform

A key role of a data management platform is to collect structured, semi-structured and unstructured data from a range of internal and external sources, then integrate, harmonise and store it.

Conclusion

Advantages of data processing include increased productivity, better decision-making, accuracy, reliability, cost reduction, ease of storage and better analysis. Data mining and data management come into play during data processing, without which optimal results cannot be obtained. Each stage, starting from data collection to presentation, has a direct effect on the output and usefulness of the processed data.