Improving Data Quality using Data Cleansing and Normalization - Pentaho

I. Introduction

Hello, Welcome Back! This blog is a continuation of ‘Importance of Exploratory Data Analysis’, where I have explained about the importance of performing EDA before doing any statistical modelling.

In this blog, I will be explaining about the importance of Data Cleaning and Normalization. In practice, a ‘Data Analyst’ spends much of his time on preparing the data before doing any statistical procedures. It is very rare that the raw data one works with are in the correct format, without any errors, are complete and have all the correct labels and codes that are needed for analysis.

Data Cleaning is the process of transforming raw data into consistent data that can be analyzed. In short, data cleaning should be considered as a statistical operations, since, it can influence the result of the analysis and hence, must be performed in a reproducible manner.

I will be using ‘R’ Language and it’s associated libraries to provide suitable examples wherever required.

II. Data Cleaning Process

The first step in any data cleaning process is converting the ‘Raw Data’ into ‘Technical Data’ and then converting the ‘Technical Data’ into ‘Consistent Data’.

To make the blog simple and easy to understand, I will focus only on textual data formats, this may have its drawbacks, but there several favourable properties of textual formats over binary formats;

– It is human-readable.

– Text is very permissive in the types values that are stored, allowing for comments and annotations.

Before I proceed any further on steps related to Data Cleaning, let us first understand the different types of special values present in ‘R Language’;

 Special Values Description NA Stands for not available. NA is a placeholder for a missing value. NULL NULL is special since it has no class (its class is NULL ) and has length 0 so it does not take up any space in a vector. Inf Stands for infinity and only applies to vectors of class numeric . NAN Stands for not a number. This is generally the result of a calculation of which the result is unknown, but it is surely not a number.

II.A. Variable Conversion

Converting a variable from one type to another is called ‘Coercion’. Many of the readers must be familiar with R ‘s basic coercion functions, but as a reference, the basic coercion functions are listed below;

* as.numeric

* as.integer

* as.character

* as.logical

* as.factor

* as.ordered

Each of these functions takes an R object and tries to convert it to the class specified behind the ‘as.’. By default, values that cannot be converted to the specified type will be converted to a NA value while a warning is issued.

```as.numeric(c("7", "7*", "7.0", "7,0"))
## Warning: NAs introduced by coercion
## [1] 7 NA 7 NA```

II.B. Recoding Factors

In R , the value of categorical variables is stored in factor variables. A factor is an integer vector endowed with a table specifying what integer value corresponds to what level . The values in this translation table can be requested with the levels function.

```f <- factor(c("a", "b", "a", "a", "c"))
levels(f)
## [1] "a" "b" "c"```

The use of integers combined with a translation table is not uncommon in statistical software,so chances are that you eventually have to make such a translation by hand.

II.C. Date Conversion

The base R installation has three types of objects to store a time instance: Date , POSIXlt and POSIXct . The Date object can only be used to store dates, the other two store date and/or time. Here, we focus on converting text to POSIXct objects since this is the most portable way to store such information.

Under the hood, a POSIXct object stores the number of seconds that have passed since January 1, 1970 00:00. Such a storage format facilitates the calculation of durations by subtraction of two POSIXct objects.

When a POSIXct object is printed, R shows it in a human-readable calender format. For example, the command Sys.time returns the system time provided by the operating system in POSIXct format.

```current_time <- Sys.time()
class(current_time)
## [1] "POSIXct" "POSIXt"
current_time
## [1] "2018-09-28 11:10:37 IST"

Here, Sys.time uses the time zone that is stored in the locale settings of the machine running R.```

II.D. Character Manipulation

Because of the many ways people can write the same things down, character data can be difficult to process. For example, consider the following excerpt of a data set with a gender variable.

```## gender
## 1 M
## 2 male
## 3 Female
## 4 fem.```

If this would be treated as a factor variable without any preprocessing, obviously four, not two classes would be stored. The job at hand is therefore to automatically recognize from the above data whether each element pertains to male or female . In statistical contexts, classifying such ‘messy’ text strings into a number of fixed categories is often referred to as coding.

String Normalization:

String normalization techniques are aimed at transforming a variety of strings to a smaller set of string values which are more easily processed. By default, R comes with extensive string manipulation functionality that is based on the two basic string operations: finding a pattern in a string and replacing one pattern with another.

The stringr package offers a number of functions that make some some string manipulation tasks a lot easier than they would be with R’s base functions. For example, extra white spaces at the beginning or end of a string can be removed using str_trim .

```library(stringr)
str_trim(" hello world ")
## [1] "hello world"
str_trim(" hello world ", side = "left")
## [1] "hello world "
str_trim(" hello world ", side = "right")
## [1] " hello world"```

Conversely, strings can be padded with spaces or other characters with str_pad to a certain

width.Both str_trim and str_pad accept a side argument to indicate whether trimming or padding should occur at the beginning (left), end (right) or both sides of the string. Converting strings to complete upper or lower case can be done with R ‘s built-in toupper and tolower functions.

Approximate Text Matching:

There are two forms of string matching. The first consists of determining whether a (range of) substring(s) occurs within another string. In this case one needs to specify a range of substrings (called a pattern) to search for in another string. In the second form one defines a distance metric between strings that measures how “different” two strings are.

There are several pattern matching functions that come with base R . The most used are probably grep and grepl . Both functions take a pattern and a character vector as input. The output only differs in that grepl returns a logical index, indicating which element of the input character vector contains the pattern, while grep returns a numerical index.

In the most simple case, the pattern to look for is a simple substring. Let us consider the following example given below;

```gender <- c("M", "male ", "Female", "fem.")
grepl("m", gender)
## [1] FALSE TRUE TRUE TRUE
grep("m", gender)
## [1] 2 3 4```

Note that the result is case sensitive: the capital M in the first element of gender does not match the lower case m. There are several ways to circumvent this case sensitivity. Either by case normalization or by the optional argument ignore case.

III. Conclusion

Data Cleansing and Normalization is a huge topic, In this blog, I have aptly covered the first part of data cleansing procedure which is converting ‘Raw Data’ into ‘Technical Data’. In my next blog, I will talk about how to convert the technical data into consistent data, which can help us to perform complex statistical procedures.