Java: Data Science Made Easy

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before the data can be properly analysed. Trying to process data that is missing information is a lot like trying to follow a conversation in which a word is dropped every once in a while. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what the speaker is trying to convey.

Statistical analysts differ on how missing data should be handled, but the most common approaches involve replacing missing data either with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The method for determining a representative value and assigning it to a location within the data varies with the data, and we cannot illustrate every case in this chapter. As one example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the other temperatures in the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   // Assumes: import static java.lang.System.out;
   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52};
   double sum = 0;
   // Add up the twelve monthly averages
   for(double d : tempList){
       sum += d;
   }
   out.printf("The average temperature is %1$,.2f", sum/tempList.length);

With these values, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   sum = 0;          // reset the running total declared earlier
   tempList[0] = 0;  // simulate a missing reading
   for(double d : tempList){
       sum += d;
   }
   out.printf("The average temperature is %1$,.2f", sum/tempList.length);

This will change the average temperature displayed in our output:

The average temperature is 66.17

Although this change may seem rather minor, a single substituted zero lowers the computed average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative is to calculate the average of the values in the array, excluding zeros or nulls, and then substitute that average in each position with missing data. It is important to consider the type of data and the purpose of the analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not, if the temperatures were averages for Antarctica.
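The averaging approach described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation. It assumes that missing readings are marked with Double.NaN instead of zero, so that a genuine zero reading is never mistaken for a gap. The class name MeanImputation and the method name imputeWithMean are hypothetical names chosen for this sketch.

```java
import java.util.Arrays;

public class MeanImputation {

    // Returns a copy of data in which every NaN entry is replaced by the
    // mean of the non-missing entries. Using NaN as the "missing" marker
    // is an assumption made for this sketch.
    public static double[] imputeWithMean(double[] data) {
        double sum = 0;
        int count = 0;
        for (double d : data) {
            if (!Double.isNaN(d)) {
                sum += d;
                count++;
            }
        }
        double[] result = data.clone();
        if (count == 0) {
            // Nothing was observed, so there is no mean to substitute.
            return result;
        }
        double mean = sum / count;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same monthly temperatures as before, with the first one missing.
        double[] tempList =
            {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        double average = Arrays.stream(filled).average().orElse(Double.NaN);
        System.out.printf("Imputed value: %.2f%n", filled[0]);
        System.out.printf("Average after imputation: %.2f%n", average);
    }
}
```

With the first reading missing, the imputed value and the resulting average both come to roughly 72.18, because substituting the mean of the observed values leaves the overall mean unchanged. That is closer to the original 70.33 than the 66.17 obtained by substituting zero, although it is still only an estimate.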

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

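   // Requires: import java.util.Optional;
   String useName = "";
   String[] nameList =
       {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"};
   Optional<String> tempName;
   for(String name : nameList){
       // Wrap the element; the Optional is empty when name is null
       tempName = Optional.ofNullable(name);
       // Use the wrapped value, or "DEFAULT" when it is empty
       useName = tempName.orElse("DEFAULT");
       out.println("Name to use = " + useName);
   }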

We first created a variable called useName to hold the name we will actually print out. We also declared a variable of type Optional<String> called tempName, which we use to handle array values that may be null. We then loop through the array and call the Optional class's ofNullable method on each element. This method wraps the value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to assign either the value from the array to useName or, if the element is null, the string DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
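As a brief, hedged illustration of a few of those other methods, the sketch below uses map, filter, isPresent, and ifPresent, all of which are part of java.util.Optional in Java 8. The class name OptionalDemo and the helper methods normalize and countPresent are hypothetical names introduced only for this example.

```java
import java.util.Optional;

public class OptionalDemo {

    // Trims and upper-cases a name, falling back to the given default
    // when the name is null or blank.
    public static String normalize(String name, String fallback) {
        return Optional.ofNullable(name)
                .map(String::trim)             // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())     // treat blank strings as missing
                .map(String::toUpperCase)
                .orElse(fallback);
    }

    // Counts the non-null entries in an array using isPresent.
    public static int countPresent(String[] names) {
        int count = 0;
        for (String name : names) {
            if (Optional.ofNullable(name).isPresent()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", null, "  ", "Betsy"};
        for (String name : nameList) {
            // ifPresent runs the action only when a value exists.
            Optional.ofNullable(name)
                    .ifPresent(n -> System.out.println("Raw value = [" + n + "]"));
            System.out.println("Name to use = " + normalize(name, "DEFAULT"));
        }
        System.out.println("Non-null entries = " + countPresent(nameList));
    }
}
```

Chaining map and filter in this way keeps the null check in one place: each step is skipped when the Optional is empty, and orElse supplies the substitute at the end.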