Data Cleansing for Machine Learning

- Pentaho


Data accumulated from multiple sources is usually large and unstructured. Processing such data can be overwhelming for an individual, and disorganised data can lead to stress and produce misleading results.

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records in a data set. It involves removing or fixing the incomplete, incorrect, inaccurate or irrelevant parts of the data.

Data Cleaning Process

Different types of data require different methods of cleaning. However, the following steps can always serve as a good starting point:

  • Making the data uniform by removing as many errors as possible.
  • Checking that all values in a column are of the same data type.
  • Checking that all values in a column follow a uniform pattern.
  • Ensuring that values such as dates or numbers fall within an expected range.
  • Checking if any duplicates are present in the data and removing them.
  • Validating the values of any variable that depends on other variables in the data set, so that related columns stay consistent with each other.
  • Checking the values of the categorical variables and removing those that don’t belong to any category.
  • Removing the columns which do not contribute towards the analysis.
  • Checking whether the columns contain any NA (not available) values. These are called missing values. One way to deal with missing values is to drop the rows containing them, but this results in loss of data.
  • Imputing missing values based on the values of other variables. This may not always be possible.
  • Estimating missing values using statistics such as the mean or median, or based on the distribution that the attribute follows.
  • Formatting the data, once uniformity is achieved, so that insights can be captured easily.
  • Sorting the data based on the key attribute.
  • Analysing and grouping the data based on common factors, if present.
  • Combining different columns or adding data from other sources, if necessary, to interpret the data easily.
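
Several of the steps above (type checks, range checks, duplicate removal, category validation, median imputation and sorting) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names (id, age, city, joined), the allowed cities and the age bounds are all hypothetical, and treating an out-of-range age as a missing value is just one possible design choice.

```python
from datetime import date
from statistics import median

# Hypothetical raw records; column names and values are made up for illustration.
RAW_ROWS = [
    {"id": "1", "age": "34", "city": "Pune", "joined": "2019-03-01"},
    {"id": "2", "age": "n/a", "city": "pune ", "joined": "2020-07-15"},
    {"id": "2", "age": "n/a", "city": "pune ", "joined": "2020-07-15"},    # exact duplicate
    {"id": "3", "age": "210", "city": "Delhi", "joined": "2018-11-30"},    # age out of range
    {"id": "4", "age": "41", "city": "Atlantis", "joined": "2021-01-10"},  # unknown category
    {"id": "5", "age": "29", "city": "Delhi", "joined": "2021-05-02"},
    {"id": "6", "age": "41", "city": "Pune", "joined": "2022-02-20"},
]

VALID_CITIES = {"Pune", "Delhi"}  # assumed set of allowed categories
AGE_RANGE = (0, 120)              # assumed plausible bounds for an age


def parse_age(text):
    """Return the age as an int, or None if it is missing, malformed or out of range."""
    try:
        age = int(text)
    except (TypeError, ValueError):
        return None
    low, high = AGE_RANGE
    return age if low <= age <= high else None


def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        # Remove exact duplicates, compared on the raw values.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)

        # Enforce a uniform pattern for the categorical column,
        # then drop rows that belong to no known category.
        city = row["city"].strip().title()
        if city not in VALID_CITIES:
            continue

        # Convert every column to a single, consistent data type.
        cleaned.append({
            "id": int(row["id"]),
            "age": parse_age(row["age"]),
            "city": city,
            "joined": date.fromisoformat(row["joined"]),
        })

    # Impute missing ages with the median of the observed ages.
    observed = [r["age"] for r in cleaned if r["age"] is not None]
    fill = median(observed) if observed else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill

    # Sort the data on the key attribute.
    return sorted(cleaned, key=lambda r: r["id"])


CLEANED = clean(RAW_ROWS)
```

Here the duplicate record and the row with an unknown city are dropped, while the missing and the implausible age are both replaced by the median of the remaining ages. Dropping those rows instead would be equally valid, at the cost of losing data.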


Data cleaning consumes a large amount of time. However, it is the most important step, as better data produces more accurate models. In fact, with a properly cleaned data set, even simple algorithms can derive impressive insights from the data.