Natural Language Processing Fundamentals

Feature Extraction from Texts

Let's understand feature extraction with real-life examples. Features represent the characteristics of a person or a thing, and these characteristics may or may not uniquely identify that person or thing. For instance, general characteristics such as the number of ears, hands, and legs are usually not enough to identify a person uniquely, but characteristics such as fingerprints and DNA sequences can identify that person distinctly. Similarly, in feature extraction, we try to extract attributes from texts that represent those texts uniquely; these attributes are called features. Machine learning algorithms take only numeric features as input, so it is of utmost importance to represent texts as numeric features. When dealing with texts, we extract both general and specific features. Some features, such as the language of a text or its total number of words, do not depend directly on the individual words constituting the text; these can be referred to as general features. Specific features, such as bag-of-words and TF-IDF representations, are built from the individual words themselves. Let's understand these in the coming sections.

Extracting General Features from Raw Text

General features refer to those that are not directly dependent on the individual tokens constituting a text corpus, such as the number of words, the number of occurrences of each part of speech, and the number of uppercase and lowercase words.

Let's consider two sentences: "The sky is blue." and "The pillar is yellow." Here, both sentences have the same number of words (a general feature), that is, four, but their individual constituent tokens are different. A quick sketch of this general feature appears below, followed by an exercise to understand it better.
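As a minimal sketch of this general feature (using only the Python standard library; the exercise below uses TextBlob instead), the word counts of the two sentences can be computed with a simple whitespace split:

    # Counting words -- a general feature that ignores which tokens actually occur
    sentences = ["The sky is blue.", "The pillar is yellow."]

    for sentence in sentences:
        word_count = len(sentence.split())  # split on whitespace; punctuation stays attached
        print(sentence, '->', word_count, 'words')

    # Both sentences contain 4 words, even though their individual tokens differ.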

Exercise 22: Extracting General Features from Raw Text

In this exercise, we will extract general features from input text. These general features include the number of words, the presence of wh- words (question words such as "why", "what", and "how"), the polarity, the subjectivity, and the language in which the text is written. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the pandas library and create a DataFrame with four sentences. Add the following code to implement this:

    import pandas as pd

    df = pd.DataFrame([['The interim budget for 2019 will be announced on 1st February.'], ['Do you know how much expectation the middle-class working population is having from this budget?'], ['February is the shortest month in a year.'], ['This financial year will end on 31st March.']])

    df.columns = ['text']

    df

    The code generates the following output:

    Figure 2.29: DataFrame consisting of four sentences

  3. Use the apply function to iterate through each row of the column text, convert them to TextBlob objects, and extract words from them. Add the following code to implement this:

    from textblob import TextBlob

    df['number_of_words'] = df['text'].apply(lambda x : len(TextBlob(str(x)).words))

    df['number_of_words']

    The code generates the following output:

    Figure 2.30: Number of words in a sentence

  4. Now we'll again use the apply function to iterate through each row of the text column, convert each row to a TextBlob object, and check whether any of its words belong to a set of wh- words, which we declare first:

    wh_words = set(['why', 'who', 'which', 'what', 'where', 'when', 'how'])

    df['is_wh_words_present'] = df['text'].apply(lambda x : len(set(TextBlob(str(x)).words).intersection(wh_words)) > 0)

    df['is_wh_words_present']

    The code generates the following output:

    Figure 2.31: Checking the presence of wh words

  5. Use the apply function to iterate through each row of the column text, convert them to TextBlob objects, and extract their sentiment scores:

    df['polarity'] = df['text'].apply(lambda x : TextBlob(str(x)).sentiment.polarity)

    df['polarity']

    The code generates the following output:

    Figure 2.32: Polarity of each sentence

  6. Use the apply function to iterate through each row of the column text, convert them to TextBlob objects, and extract their subjectivity scores:

    df['subjectivity'] = df['text'].apply(lambda x : TextBlob(str(x)).sentiment.subjectivity)

    df['subjectivity']

    The code generates the following output:

    Figure 2.33: Subjectivity of each sentence

    Note

    Sentiment scores such as subjectivity and polarity will be explained in detail in Chapter 8, Sentiment Analysis.

  7. Use the apply function to iterate through each row of the column text, convert them to TextBlob objects, and detect their languages:

    df['language'] = df['text'].apply(lambda x : TextBlob(str(x)).detect_language())

    df['language']

    The code generates the following output:

Figure 2.34: Language of each sentence

We have learned how to extract general features from a given text. In the next section, we will solve an activity to get a better understanding of this.

Activity 2: Extracting General Features from Text

In this activity, we will extract various general features from documents. The dataset that we are using consists of random statements. Our objective is to find general features such as the number of punctuation marks, uppercase and lowercase words, letters, digits, words, and whitespaces.

Note

The data.csv dataset used in this activity can be found at this link: https://bit.ly/2CIkCa4.

Follow these steps to implement this activity:

  1. Open a Jupyter notebook.
  2. Import pandas, nltk, and the other necessary libraries.
  3. Load data.csv using pandas' read_csv function.
  4. Find the number of occurrences of each part of speech (PoS). You can see the PoS that nltk provides by loading it from help/tagsets/upenn_tagset.pickle.
  5. Find the number of punctuation marks.
  6. Find the number of uppercase and lowercase words.
  7. Find the number of letters.
  8. Find the number of digits.
  9. Find the number of words.
  10. Find the number of whitespaces in each sentence.

    Note

    The solution for this activity can be found on page 259.

We have learned how to extract general features from a given text. In the next section, we will explore specific features that can be extracted from a given text.

Bag of Words

The Bag-of-Words (BoW) model is one of the most popular methods for extracting features from raw texts. The output of this model for a set of text documents is a matrix in which each row corresponds to one of the text documents, each column represents a word from the vocabulary, and each cell holds the number of times that word occurs in that document. Here, "vocabulary" refers to the set of unique words present across the documents. Let's understand this with an example. Suppose you have two text documents:

Document 1: I like detective Byomkesh Bakshi.

Document 2: Byomkesh Bakshi is not a detective, he is a truth seeker.

The corresponding BoW representation would be as follows:

Figure 2.35: Diagram of the BoW model

The tabular representation of the BoW model would be as follows:

Figure 2.36: Tabular representation of the BoW model

Let's see how BoW can be implemented using Python.
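As a minimal sketch (assuming scikit-learn and pandas are installed; the two example documents and variable names below are ours), a table like the one above can be reproduced with CountVectorizer. Note that its default tokenizer lowercases the text and drops single-character tokens such as "I" and "a":

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    documents = ['I like detective Byomkesh Bakshi.',
                 'Byomkesh Bakshi is not a detective, he is a truth seeker.']

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(documents)  # sparse matrix of word counts

    # Rows correspond to documents, columns to vocabulary words, cells to counts
    print(pd.DataFrame(counts.todense(), columns=sorted(vectorizer.vocabulary_)))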

Exercise 23: Creating a BoW

In this exercise, we will create a BoW representation for all the terms in a document and ascertain the 10 most frequent terms. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the necessary libraries and declare a list corpus. Add the following code to implement this:

    import pandas as pd

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [

    'Data Science is an overlap between Arts and Science',

    'Generally, Arts graduates are right-brained and Science graduates are left-brained',

    'Excelling in both Arts and Science at a time becomes difficult',

    'Natural Language Processing is a part of Data Science'

    ]

  3. Now we'll make use of the CountVectorizer class to create the BoW model. Add the following code to do this:

    bag_of_words_model = CountVectorizer()

    print(bag_of_words_model.fit_transform(corpus).todense())

    bag_of_word_df = pd.DataFrame(bag_of_words_model.fit_transform(corpus).todense())

    bag_of_word_df.columns = sorted(bag_of_words_model.vocabulary_)

    bag_of_word_df.head()

    The code generates the following output:

    Figure 2.37: DataFrame of the output of the BoW model

  4. Now we create a BoW model for the 10 most frequent terms. Add the following code to implement this:

    bag_of_words_model_small = CountVectorizer(max_features=10)

    bag_of_word_df_small = pd.DataFrame(bag_of_words_model_small.fit_transform(corpus).todense())

    bag_of_word_df_small.columns = sorted(bag_of_words_model_small.vocabulary_)

    bag_of_word_df_small.head()

    The code generates the following output:

Figure 2.38: DataFrame of the output of the BoW model for the 10 most frequent terms

Zipf's Law

According to Zipf's law, "for a given corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table." In simple terms, if the words in a corpus are arranged in descending order of their frequency of occurrence, then the frequency of the word at the ith rank will be proportional to 1/i. A quick numeric sketch of this relationship follows; to get a better understanding, we'll then work through an exercise in the next section.
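As a quick sketch of what this proportionality means (plain Python; the rank-1 frequency of 1,000 is just an assumed number for illustration):

    # Expected frequencies under Zipf's law: frequency at rank i ~ frequency at rank 1 / i
    top_frequency = 1000  # assumed frequency of the most common word

    for rank in range(1, 6):
        expected = top_frequency / rank
        print(f"rank {rank}: expected frequency ~ {expected:.0f}")

    # rank 1: 1000, rank 2: 500, rank 3: 333, rank 4: 250, rank 5: 200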

Exercise 24: Zipf's Law

In this exercise, we will plot the actual ranks and frequencies of tokens, along with the expected ranks and frequencies, with the help of Zipf's law. We will be using the 20newsgroups dataset provided by the sklearn library, which is a collection of newsgroup documents. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the necessary libraries, declare a newsgroups_data_sample variable, and fetch the dataset provided by sklearn, that is, fetch_20newsgroups. Add the following code to do this:

    from pylab import *

    import nltk

    nltk.download('stopwords')

    from sklearn.datasets import fetch_20newsgroups

    from nltk import word_tokenize

    from nltk.corpus import stopwords

    import matplotlib.pyplot as plt

    %matplotlib inline

    import re

    import string

    from collections import Counter

    newsgroups_data_sample = fetch_20newsgroups(subset='train')

  3. Now we'll add the individual printable characters (single letters, digits, punctuation, and whitespace characters) to the list of stop words so that they are filtered out along with common English words. Add the following code to implement this:

    stop_words = stopwords.words('english')

    stop_words = stop_words + list(string.printable)

  4. To tokenize the corpus, add the following code:

    tokenized_corpus = [word.lower() for sentence in newsgroups_data_sample['data']
                        for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', sentence))
                        if word.lower() not in stop_words]

  5. Add the following code to calculate the frequency of each token:

    token_count_di = Counter(tokenized_corpus)

    token_count_di.most_common(50)

    The code generates the following output:

    Figure 2.39: The 50 most frequent words of the corpus

  6. Now, to plot the actual ranks and frequencies of the tokens along with the expected ranks and frequencies as per Zipf's law, we add the following code:

    frequencies = [b for (a,b) in token_count_di.most_common(10000)]

    tokens = [a for (a,b) in token_count_di.most_common(10000)]

    ranks = range(1, len(frequencies)+1)

    plt.figure(figsize=(8,8))

    plt.ylim(1,10**4)

    plt.xlim(1,10**4)

    #Actual ranks and frequencies

    obtained_line, = loglog(ranks, frequencies, marker=".", label="Line obtained from the Text Corpus")

    obtained_legend = plt.legend(handles=[obtained_line], loc=1)

    ax = plt.gca().add_artist(obtained_legend)

    #Expected ranks and frequencies as per Zipf's law

    expected_line, = plt.plot([1,frequencies[0]],[frequencies[0],1],color='r',label="Line expected as per Zipf's Law")

    plt.legend(handles=[expected_line], loc=4)

    title("Plot stating Zipf law's in log-log scale")

    xlabel("Rank of token in descending order of frequency of occurrence")

    ylabel("Frequency of ocurrence of token")

    grid(True)

    The code generates the following output:

Figure 2.40: Illustration of Zipf's law

TF-IDF

Previously, we looked at the BoW model. That model has a severe drawback: the raw frequency of a token does not fully represent how much information it carries about a document, because a term that occurs many times across many documents conveys little information, whereas rare terms can carry much more information about the documents they appear in. TF-IDF, or Term Frequency-Inverse Document Frequency, is a method of representing text data in a matrix format (row-column/table format) using numbers that quantify how much information the terms carry in the given documents. Just as in the BoW model, each row, i, represents a text document from the given set of text documents, and each column, j, corresponds to a word from the vocabulary.

The Term Frequency (TF) for a given term, j, in a document, i, is equal to the number of times term j occurs in document i. Rarely occurring terms are more informative than frequently occurring general terms. To account for this, we need to multiply the TF by another factor that denotes how specific a term is to a given document. This factor is called the Inverse Document Frequency (IDF).

The IDF for a given term, j, is given by the following formula:

IDF(j) = log(N / df(j))

Here, df(j) refers to the number of documents containing term j, and N is the total number of documents; the fewer documents a term appears in, the larger its IDF. Thus, the TF-IDF score for term j in document i will be as follows:

TF-IDF(i, j) = TF(i, j) × IDF(j) = TF(i, j) × log(N / df(j))
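As a minimal sketch of these formulas (plain Python and math.log, with a tiny made-up two-document corpus; note that scikit-learn's TfidfVectorizer, used in the next exercise, applies a smoothed IDF and normalizes each row, so its numbers will differ):

    import math

    # Tiny illustrative corpus: two pre-tokenized documents
    documents = [['data', 'science', 'is', 'fun'],
                 ['data', 'analysis', 'is', 'useful']]
    N = len(documents)

    def tf_idf(term, document):
        tf = document.count(term)                        # number of times the term occurs in this document
        df = sum(1 for doc in documents if term in doc)  # number of documents containing the term
        return tf * math.log(N / df)                     # TF multiplied by IDF

    print(tf_idf('data', documents[0]))     # 0.0 -> 'data' appears in every document, so log(2/2) = 0
    print(tf_idf('science', documents[0]))  # ~0.693 -> 'science' appears in one document: 1 * log(2/1)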

Let's do an exercise in the next section and learn how TF-IDF can be implemented in Python using scikit-learn.

Exercise 25: TF-IDF Representation

In this exercise, we will create a TF-IDF representation of the input texts for all the terms in a given corpus and identify the 10 most frequent terms. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import all the necessary libraries and create a DataFrame consisting of the sentences. Add the following code to implement this:

    import pandas as pd

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [

    'Data Science is an overlap between Arts and Science',

    'Generally, Arts graduates are right-brained and Science graduates are left-brained',

    'Excelling in both Arts and Science at a time becomes difficult',

    'Natural Language Processing is a part of Data Science'

    ]

  3. Now, to create a TF-IDF model, we write the following code:

    tfidf_model = TfidfVectorizer()

    print(tfidf_model.fit_transform(corpus).todense())

    The code generates the following output:

    Figure 2.41: TF-IDF representation of the corpus in matrix form

  4. Now, to create a DataFrame from the generated tf-idf matrix, we write the following code:

    tfidf_df = pd.DataFrame(tfidf_model.fit_transform(corpus).todense())

    tfidf_df.columns = sorted(tfidf_model.vocabulary_)

    tfidf_df.head()

    The code generates the following output:

    Figure 2.42: TF-IDF representation of a corpus in DataFrame form

  5. Now we'll create a DataFrame from the tf-idf matrix for the 10 most frequent terms. Add the following code to implement this:

    tfidf_model_small = TfidfVectorizer(max_features=10)

    tfidf_df_small = pd.DataFrame(tfidf_model_small.fit_transform(corpus).todense())

    tfidf_df_small.columns = sorted(tfidf_model_small.vocabulary_)

    tfidf_df_small.head()

    The code generates the following output:

Figure 2.43: TF-IDF representation of the 10 most frequent terms

In the next section, we will solve an activity to extract specific features from texts.

Activity 3: Extracting Specific Features from Texts

In this activity, we will extract specific features from the texts present in a dataset. The dataset that we will be using here is fetch_20newsgroups, provided by the sklearn library. Follow these steps to implement this activity:

  1. Import the necessary packages.
  2. Fetch the dataset provided by sklearn, fetch_20newsgroup, and store the data in a DataFrame.
  3. Clean the data in the DataFrame.
  4. Create a BoW model.
  5. Create a TF-IDF model.
  6. Compare both models on the basis of the 20 most frequently occurring words.

    Note

    The solution for this activity can be found on page 263.

We have learned how to compare the BoW and TF-IDF models. In the next section, we will learn more about feature engineering.