Java:Data Science Made Easy

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here that the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

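// Assumes imports of java.util.*, java.io.File, and a static import of System.out
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Assumption: dirtyText is a String; split it on whitespace so each word is one list element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split("\\s+")));
out.println("Original clean text: " + words.toString());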
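The cleaning and normalization assumed above could be done in several ways. The following is a minimal sketch using only the standard library, not necessarily the approach used to prepare the text in this example. The class name TextNormalizer and the method normalize are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only
class TextNormalizer {

    // Lowercases the text, replaces every character that is not a letter,
    // digit, or whitespace with a space, and splits the result on whitespace.
    static List<String> normalize(String rawText) {
        String cleaned = rawText.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return new ArrayList<>();
        }
        // Wrap in an ArrayList so elements can be removed later;
        // the list returned by Arrays.asList is fixed-size.
        return new ArrayList<>(Arrays.asList(cleaned.split("\\s+")));
    }

    public static void main(String[] args) {
        String raw = "Call me Ishmael. Some years ago- never mind how long precisely";
        System.out.println(normalize(raw));
        // prints [call, me, ishmael, some, years, ago, never, mind, how, long, precisely]
    }
}
```

Note that this simple rule also splits contractions such as don't into two tokens, which may or may not be what your application needs.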

We also create a new ArrayList to hold the stop words actually found in our text. This will allow us to use the ArrayList class's removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice that we also call the toLowerCase and trim methods against each stop word. This ensures that our stop words match the formatting of our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

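ArrayList<String> foundWords = new ArrayList<>();
while (readStop.hasNextLine()) {
    // Normalize each stop word so it matches the lowercase, trimmed text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if (words.contains(stopWord)) {
        foundWords.add(stopWord);
    }
}
readStop.close();
// removeAll deletes every occurrence of each found stop word
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());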

The output depends on the words designated as stop words. If your stop words file contains different words from those used in this example, your output will differ. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
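As a variation on the approach above, the stop words can be held in a HashSet so that each word in the text is checked once, rather than searching the whole text once per stop word. The following is a minimal sketch under stated assumptions: the stop words are supplied as an in-memory list rather than read from a file, the text is already lowercase, and the names StopWordFilter and removeStopWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only
class StopWordFilter {

    // Returns a new list holding the words that are not stop words.
    // Stop words are lowercased and trimmed before comparison;
    // the words themselves are assumed to be lowercase already.
    static List<String> removeStopWords(List<String> words, List<String> stopWords) {
        Set<String> stopSet = new HashSet<>();
        for (String stopWord : stopWords) {
            stopSet.add(stopWord.toLowerCase(Locale.ROOT).trim());
        }
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopSet.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("call me ishmael some years ago".split(" "));
        // Assumed stop list; a real list would normally be much longer
        List<String> stopWords = Arrays.asList("Me ", "some", "the");
        System.out.println(removeStopWords(words, stopWords));
        // prints [call, ishmael, years, ago]
    }
}
```

Unlike removeAll, this version leaves the original list untouched and returns a new one, which can be convenient when the unfiltered text is still needed.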

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain a TokenizerFactory from IndoEuropeanTokenizerFactory.INSTANCE and wrap it in an EnglishStopTokenizerFactory, which sets our factory to use the default English stop words. Finally, we tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to tokenize; here we pass the full length of the text:

text = text.toLowerCase().trim();
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE;
fact = new EnglishStopTokenizerFactory(fact);
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length());
for (String word : tok) {
    out.print(word + " ");
}

The original text is shown first for comparison, followed by the output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences from our previous example. First, we did not clean the text as thoroughly, so special characters such as the hyphen remain in the text. Second, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
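To see how the tokenization rule and the stop list together shape the result, the following sketch uses only the standard library to keep punctuation as separate tokens and then filter against a small stop list. It does not reproduce LingPipe's tokenizer or its stop list; the regular expression, the stop words, and the names PunctuationKeepingFilter and tokenizeAndFilter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only
class PunctuationKeepingFilter {

    // A token is either a run of word characters or a single
    // non-word, non-whitespace character such as '.' or '-'.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenizeAndFilter(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text.toLowerCase(Locale.ROOT).trim());
        while (matcher.find()) {
            String token = matcher.group();
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed stop list, chosen only for this demonstration
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("a", "and", "the", "of", "would", "about"));
        String text = "I thought I would sail about a little and see the watery part of the world.";
        System.out.println(String.join(" ", tokenizeAndFilter(text, stopWords)));
        // prints i thought i sail little see watery part world .
    }
}
```

Swapping in a longer or shorter stop list, or a different token pattern, changes the result, which is why both choices should be made with the target application in mind.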