
Handling CSV data
A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to use this type of data in our analysis efforts. Working with CSV data raises several issues, including escaped data and commas embedded within fields.
We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class to read in tokens and the String class split method to separate the data and store it in the array. Next, we will explore using the third-party library, OpenCSV, which offers a more efficient technique.
The first approach may be appropriate only for quick-and-dirty processing of data. We will nevertheless discuss both techniques, since each is useful in different situations.
We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean, and the solutions shown next allow for rows of differing lengths, that is, jagged arrays.
First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList since we will not always know how many rows our data contains.
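// Requires java.io.File, java.io.FileNotFoundException,
// java.util.ArrayList, and java.util.Scanner
// The list is declared outside the try block so it remains usable afterwards
ArrayList<String> list = new ArrayList<String>();
try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    // hasNextLine pairs with nextLine, so blank lines are not skipped
    while (csvData.hasNextLine()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}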
The list is converted to an array using the toArray method. This version of the method takes a String array as an argument so that the method will know what type of array to create. Passing a zero-length array avoids a stray null element in the result when the list is empty. A two-dimensional array is then created to hold the CSV data.
String[] tempArray = list.toArray(new String[0]);
String[][] csvArray = new String[tempArray.length][];
The split method is used to create an array of Strings for each row. This array is assigned to a row of the csvArray.
for (int i = 0; i < tempArray.length; i++) {
    csvArray[i] = tempArray[i].split(",");
}
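The steps above can be combined into one small program. What follows is a minimal sketch rather than a definitive implementation: it reads from an in-memory string instead of Demographics.csv so that it is self-contained, and the class name SplitCsvSketch, the method name toRows, and the sample rows are all hypothetical. It also makes two limits of the split method visible: trailing empty fields are dropped, and a comma inside a quoted field is treated as a separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class SplitCsvSketch {
    // Reads every line from the scanner and splits each one on commas.
    // Rows may end up with different lengths (a jagged array).
    static String[][] toRows(Scanner csvData) {
        List<String> lines = new ArrayList<>();
        while (csvData.hasNextLine()) {
            lines.add(csvData.nextLine());
        }
        String[][] rows = new String[lines.size()][];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = lines.get(i).split(",");
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data (not taken from Demographics.csv):
        // a header, a short row, a row with trailing empty fields,
        // and a row with a comma inside a quoted field.
        String sample = "zip,count,label\n"
                + "10001,44\n"
                + "10002,,\n"
                + "10003,35,\"Smith, J\"\n";
        String[][] rows = toRows(new Scanner(sample));
        for (String[] row : rows) {
            System.out.println(row.length + " columns: " + String.join(" | ", row));
        }
    }
}
```

If trailing empty fields need to be preserved, calling split(",", -1) keeps them, because a negative limit disables the removal of trailing empty strings. The embedded-comma problem cannot be solved by split alone.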
Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.
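To make the embedded-comma problem concrete, here is a minimal sketch of quote-aware splitting that uses only the standard library. It is not OpenCSV's implementation and is no substitute for it. It assumes the common convention that a field may be wrapped in double quotes and that a doubled quote inside a quoted field stands for a literal quote, and it handles a single line only, so embedded line breaks are not covered. The class name QuotedLineSketch and the method name parseLine are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedLineSketch {
    // Splits one line on commas, ignoring commas that appear inside
    // double-quoted fields. Surrounding quotes are removed from the result.
    static String[] parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are kept while quoted
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field on the line
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical sample line with an embedded comma and escaped quotes
        String[] fields = parseLine("10003,35,\"Smith, J\",\"say \"\"hi\"\"\"");
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

Even this small amount of state tracking is easy to get wrong, which is one reason to prefer a tested library for anything beyond quick experiments.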
First, we need to create an instance of the CSVReader class. Notice that the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. (This two-argument constructor belongs to older OpenCSV releases; newer releases set the separator through CSVParserBuilder and CSVReaderBuilder instead.) If we want to read the entire file at one time, we use the readAll method, which returns a List of String arrays, one array per row.
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
List<String[]> holdData = dataReader.readAll();
Because each row has already been split into tokens, no further use of the String class split method is needed. If a two-dimensional array is required, holdData.toArray(new String[0][]) produces one. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.
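// readNext and close can throw IOException, which the enclosing code must handle
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
String[] nextLine;
// readNext returns null once the end of the file is reached
while ((nextLine = dataReader.readNext()) != null) {
    for (String token : nextLine) {
        System.out.println(token);
    }
}
dataReader.close();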
We can now clean or otherwise process the array.
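As one illustration of such processing, the following sketch trims whitespace from every token and reports which rows have a different number of columns than the first row. It is only an example of a possible cleaning step, not a prescribed one; the class name CleanCsvSketch, the method name trimAndFindJaggedRows, and the sample array are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CleanCsvSketch {
    // Trims each token in place and returns the indices of rows whose
    // column count differs from that of the first row.
    static List<Integer> trimAndFindJaggedRows(String[][] csvArray) {
        List<Integer> jagged = new ArrayList<>();
        if (csvArray.length == 0) {
            return jagged;
        }
        int expected = csvArray[0].length;
        for (int i = 0; i < csvArray.length; i++) {
            for (int j = 0; j < csvArray[i].length; j++) {
                csvArray[i][j] = csvArray[i][j].trim();
            }
            if (csvArray[i].length != expected) {
                jagged.add(i);
            }
        }
        return jagged;
    }

    public static void main(String[] args) {
        // Hypothetical sample array standing in for csvArray
        String[][] csvArray = {
            {"zip", " count "},
            {"10001"},
            {"10002", " 35"}
        };
        List<Integer> jagged = trimAndFindJaggedRows(csvArray);
        System.out.println("Rows with unexpected length: " + jagged);
    }
}
```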