Improving Data Quality using Data Cleansing and Normalization – Part Two

I. Introduction

Hello, welcome back! This blog is a continuation of 'Data Cleansing and Normalization', where I briefly explained some of the methods used when cleaning data. Data Cleansing is part of Exploratory Data Analysis; remember, the better we clean the data, the better the outcome of any statistical model.

In this blog, I will explain how to further refine technically correct data into 'Consistent Data', and the methodologies adopted to do so. I will be using the 'R' language and its associated libraries to provide suitable examples wherever required.

II. Data Cleaning Process

Consistent data are technically correct data that are fit for statistical analysis. They are data in which missing values, special values, (obvious) errors and outliers are either removed, corrected or imputed. The data are consistent with constraints based on real-world knowledge about the subject that the data describe.

Consistency can be understood to include in-record consistency, meaning that no contradictory information is stored in a single record, and cross-record consistency, meaning that statistical summaries of different variables do not conflict with each other.

In this blog, I will mainly focus on methods dealing with in-record consistency, with the exception of outlier handling which can be considered a cross-record consistency issue.

The process towards consistent data always involves the following three steps.

  • Detection of an inconsistency. For example, an age variable is constrained to non-negative values.
  • Selection of the field or fields causing the inconsistency. For example, the marital status of a child must be 'unmarried'. In the case of a violation, it is not immediately clear whether age, marital status, or both are wrong.
  • Correction of the fields that are deemed erroneous by the selection method. This may be done through deterministic (model-based) or stochastic methods.
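The first two steps can be sketched in base R. This is a minimal illustration on a made-up data frame (the variable names `age` and `status`, and the rules themselves, are assumptions for the example, not from any particular data set):

```r
# A small made-up data set with two deliberate inconsistencies:
# a negative age, and a married child.
people <- data.frame(
  age    = c(21, -3, 10),
  status = c("married", "single", "married"),
  stringsAsFactors = FALSE
)

# Step 1: detection -- encode real-world rules as logical checks.
violates_age    <- people$age < 0
violates_status <- people$age < 16 & people$status == "married"

# Step 2: selection -- pull out the records that break at least one rule,
# so an analyst (or a correction rule) can decide which field is wrong.
people[violates_age | violates_status, ]
```

The correction step (step 3) would then overwrite or impute the fields flagged here, deterministically or stochastically, depending on the chosen model.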

II.A. Missing Values

A missing value, represented by NA in R, is a placeholder for a datum whose type is known but whose value isn't. Most statistical computations cannot proceed when one or more values in the data are missing. One may choose either to omit elements from a data set that contain missing values or to impute a value, but missing values are something to be dealt with prior to any analysis.

The behaviour of R's core functionality is consistent with the idea that the analyst must decide what to do with missing data. A common choice, namely 'leave out records with missing data', is supported by many base functions through the na.rm option.

age <- c(23, 16, NA)
mean(age)
## [1] NA
mean(age, na.rm = TRUE)
## [1] 19.5

Functions such as sum, prod, quantile, and sd all have this option. Functions implementing bivariate statistics, such as cor and cov, offer options to use complete or pairwise-complete observations.
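As a quick sketch of the difference, on a toy pair of vectors (values made up for illustration):

```r
# Toy data with one missing value in y.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

cor(x, y)                        # propagates the NA
## [1] NA
cor(x, y, use = "complete.obs")  # drops the incomplete pair first
## [1] 1
```

Here the remaining pairs lie exactly on a line, so the correlation over complete observations is 1.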

II.B. Special Values

Calculations involving special values often result in special values, and since a statistical statement about a real-world phenomenon should never include a special value, it is desirable to handle special values prior to analysis.

For numeric variables, special values indicate values that are not an element of the mathematical set of real numbers (R). The function is.finite determines which values are 'regular' values.

is.finite(c(1, Inf, NaN, NA))
## [1]  TRUE FALSE FALSE FALSE

This function accepts vectorial input. With little effort we can write a function that may be used to check every numerical column in a data frame.
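One possible version of such a function is sketched below (the name `count_special` and the example data frame are mine, chosen for illustration):

```r
# For each numeric column of a data frame, count the entries that are
# not regular numbers (NA, NaN, Inf, -Inf).
count_special <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[num_cols], function(col) sum(!is.finite(col)), integer(1))
}

d <- data.frame(height = c(1.7, Inf, 1.6), weight = c(60, NA, NaN))
count_special(d)
## height weight
##      1      2
```

A report like this is a useful first screen: any column with a nonzero count needs attention before analysis.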

II.C. Outlier Detection

Outliers are values that lie far outside the range expected for a feature. Their inclusion can lead to a bad fit later while building a model: in a linear regression, for example, a single extreme point can pull the fitted line well away from the bulk of the data.
Outliers may thus lead to a bad fit and should be handled carefully. They do not always need to be removed: whether we remove them, let them remain, or reduce their influence while creating the model is a statistical decision that comes out of the intuition of the data scientist.
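One common way to flag candidate outliers is Tukey's rule, the same rule behind R's boxplots: a point is suspicious if it lies more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch on made-up data:

```r
# A vector with one suspicious value.
income <- c(32, 35, 31, 40, 38, 36, 900)

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles.
q     <- quantile(income, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
income[income < lower | income > upper]
## [1] 900
```

Flagging is only the detection step; what to do with the flagged value remains the analyst's decision, as discussed above.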

II.D. Obvious Inconsistencies

Consider a case where the age of a person is negative in a record. These kinds of inconsistencies are quite common in real data. Hence, it is always good to determine the valid range of a column's values and flag the unacceptable ones. There may also arise a need to preprocess only a certain part of the data while leaving the rest as-is, such as converting a length feature from meters to centimeters. Many such inconsistencies can occur, and we need to handle them carefully. While doing so, it is always better to log the details of our process for future reference.
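Both checks mentioned above can be sketched in a few lines of base R. The data frame, the "height below 3 means meters" heuristic, and the logging via message() are all assumptions made for this example:

```r
# Made-up records: one impossible age, and heights recorded in a
# mix of meters and centimeters.
person <- data.frame(age = c(21, -3, 45), height = c(1.75, 180, 1.62))

# Range check: flag obviously impossible ages for later correction.
bad_age <- person$age < 0

# Partial preprocessing: convert only the rows apparently recorded
# in meters (heuristic: height < 3), leaving the rest untouched.
in_meters <- person$height < 3
person$height[in_meters] <- person$height[in_meters] * 100

# Log what was done, for future reference.
message(sum(bad_age), " invalid age(s) flagged; ",
        sum(in_meters), " height value(s) converted from m to cm")
```

After this step all heights are in centimeters, and the flagged ages can be passed on to the correction or imputation stage.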

II.E. Imputations

Imputation is the process of estimating or deriving values for fields where data is missing. There is no one single best imputation method that works in all cases. The imputation model of choice depends on what auxiliary information is available and whether there are (multivariate) edit restrictions on the data to be imputed.

The three major types of imputation models are:

  • Imputation with Mean : In this method, we fill the missing value with mean of the column.
  • Hot-Deck Imputation : In this method, a missing value is filled with the observed value from a similar ("donor") record. When donors are chosen meaningfully, hot-deck imputation introduces little distortion into the data analysis.
  • Predictive Mean Matching with Hot-Deck : In this method, we predict the missing value with a model and then fill in the observed value from the nearest-neighbour donor record, i.e. the record whose predicted value is closest.
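The first two methods can be sketched in base R. This is a deliberately minimal illustration on a made-up vector; the random hot-deck here draws donors from the observed values of the same variable, which is the simplest possible donor choice, not a full similar-record matching:

```r
x <- c(12, NA, 15, 11, NA)

# Imputation with mean: replace each NA with the mean of the
# observed values.
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Simple random hot-deck: replace each NA with a value drawn from
# the observed ("donor") values.
set.seed(1)  # for reproducibility
donors <- x[!is.na(x)]
x_hot  <- x
x_hot[is.na(x_hot)] <- sample(donors, sum(is.na(x)), replace = TRUE)
```

Note the trade-off: mean imputation shrinks the variance of the variable, while hot-deck imputation keeps imputed values inside the observed range.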

III. Conclusion

Data Cleaning is a process that relies more on the intuition of the data scientist or statistician than on a fixed set of functions. The best data cleaning method for a given data set emerges only after thoroughly scrutinising the data.