Another way that we can get rid of uninformative words is by discarding words that are too frequent to be informative. There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently. scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
- from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
- print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
- print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
Clearly, removing the stopwords in the list can only decrease the number of features by the length of the list—here, 318—but it might lead to an improvement in performance. Let’s give it a try:
- # Specifying stop_words="english" uses the built-in list.
- # We could also augment it and pass our own.
- vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
- X_train = vect.transform(text_train)
- print("X_train with stop words:\n{}".format(repr(X_train)))
There are now 305 (27,271–26,966) fewer features in the dataset, which means that most, but not all, of the stopwords appeared. Let’s run the grid search again:
- from sklearn.model_selection import GridSearchCV
- param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
- grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
- grid.fit(X_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
- print("Best parameters: ", grid.best_params_)
The grid search performance decreased slightly using the stopwords—not enough to worry about, but given that excluding 305 features out of over 27,000 is unlikely to change performance or interpretability a lot, it doesn’t seem worth using this list. Fixed lists are mostly helpful for small datasets, which might not contain enough information for the model to determine which words are stopwords from the data itself. As an exercise, you can try out the other approach, discarding frequently appearing words, by setting the max_df option of CountVectorizer and see how it influences the number of features and the performance.
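As a starting point for that exercise, here is a minimal sketch; the 0.8 threshold (ignore words that appear in more than 80% of the documents) is an arbitrary value chosen purely for illustration:
- # max_df drops words that appear in more than the given fraction of documents
- vect = CountVectorizer(min_df=5, max_df=0.8).fit(text_train)
- X_train = vect.transform(text_train)
- print("X_train with max_df=0.8:\n{}".format(repr(X_train)))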
Rescaling the Data with tf–idf
Instead of dropping features that are deemed unimportant, another approach is to rescale features by how informative we expect them to be. One of the most common ways to do this is using the term frequency–inverse document frequency (tf–idf) method. The intuition of this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document. scikit-learn implements the tf–idf method in two classes: TfidfTransformer, which takes in the sparse matrix output produced by CountVectorizer and transforms it, and TfidfVectorizer, which takes in the text data and does both the bag-of-words feature extraction and the tf–idf transformation. There are several variants of the tf–idf rescaling scheme, which you can read about on Wikipedia. The tf–idf score for word w in document d as implemented in both the TfidfTransformer and TfidfVectorizer classes is given by:
tfidf(w, d) = tf × (log((N + 1) / (Nw + 1)) + 1)
where N is the number of documents in the training set, Nw is the number of documents in the training set that the word w appears in, and tf (the term frequency) is the number of times that the word w appears in the query document d (the document you want to transform or encode). Both classes also apply L2 normalization after computing the tf–idf representation; in other words, they rescale the representation of each document to have Euclidean norm 1. Rescaling in this way means that the length of a document (the number of words) does not change the vectorized representation.
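To make the formula and the L2 normalization concrete, here is a small sketch on a made-up three-document corpus (the documents are invented only for illustration); it verifies that every transformed document vector has Euclidean norm 1:
- import numpy as np
- from sklearn.feature_extraction.text import TfidfVectorizer
- toy_docs = ["the movie was good", "the movie was bad", "a truly great movie"]
- # defaults: smooth_idf=True (the formula above) and norm="l2"
- tfidf = TfidfVectorizer()
- X_toy = tfidf.fit_transform(toy_docs)
- # each row is rescaled to unit Euclidean length
- print(np.linalg.norm(X_toy.toarray(), axis=1))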
Because tf–idf actually makes use of the statistical properties of the training data, we will use a pipeline, as described in Chapter 7, to ensure the results of our grid search are valid. This leads to the following code:
- ch8_t04.py
- import numpy as np
- from sklearn.datasets import load_files
- from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
- print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
- print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
- reviews_train = load_files("data/aclImdb/train/")
- # load_files returns a bunch, containing training texts and training labels
- text_train, y_train = reviews_train.data, reviews_train.target
- print("type of text_train: {}".format(type(text_train)))
- print("length of text_train: {}".format(len(text_train)))
- print("text_train[1]:\n{}".format(text_train[1]))
- print("Samples per class (training): {}".format(np.bincount(y_train)))
- reviews_test = load_files("data/aclImdb/test/")
- text_test, y_test = reviews_test.data, reviews_test.target
- print("Number of documents in test data: {}".format(len(text_test)))
- print("Samples per class (test): {}".format(np.bincount(y_test)))
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.pipeline import make_pipeline
- from sklearn.model_selection import cross_val_score
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import GridSearchCV
- pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
- LogisticRegression())
- param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(text_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
As you can see, there is some improvement when using tf–idf instead of just word counts. We can also inspect which words tf–idf found most important. Keep in mind that the tf–idf scaling is meant to find words that distinguish documents, but it is a purely unsupervised technique. So, “important” here does not necessarily relate to the “positive review” and “negative review” labels we are interested in. First, we extract the TfidfVectorizer from the pipeline:
- vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
- # transform the training dataset
- X_train = vectorizer.transform(text_train)
- # find maximum value for each of the features over the dataset
- # numpy.ravel(a, order='C'): Return a contiguous flattened array.
- max_value = X_train.max(axis=0).toarray().ravel()
- sorted_by_tfidf = max_value.argsort()
- # get feature names
- feature_names = np.array(vectorizer.get_feature_names())
- print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
- print("Features with highest tfidf: \n{}".format(feature_names[sorted_by_tfidf[-20:]]))
Below is a quick reference for the NumPy APIs used in the sample code above:
- numpy.ravel(a, order='C'): Return a contiguous flattened array.
- numpy.argsort(a, axis=-1, kind='quicksort', order=None): Returns the indices that would sort an array.
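A couple of one-liners, only to illustrate what those two calls return:
- import numpy as np
- print(np.ravel([[1, 2], [3, 4]]))       # [1 2 3 4]
- print(np.argsort(np.array([3, 1, 2])))  # [1 2 0], the indices that sort the array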
Features with low tf–idf are those that either are very commonly used across documents or are only used sparingly, and only in very long documents. Interestingly, many of the high-tf–idf features actually identify certain shows or movies. These terms only appear in reviews for this particular show or franchise, but tend to appear very often in these particular reviews. This is very clear, for example, for "pokemon", "smallville", and "doodlebops", but "scanners" here actually also refers to a movie title. These words are unlikely to help us in our sentiment classification task (unless maybe some franchises are universally reviewed positively or negatively) but certainly contain a lot of specific information about the reviews.
We can also find the words that have low inverse document frequency—that is, those that appear frequently and are therefore deemed less important. The inverse document frequency values found on the training set are stored in the idf_ attribute:
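A short sketch of that lookup, reusing the vectorizer and feature_names objects from the code above; it prints the 100 terms with the lowest idf, that is, the ones that appear in the most documents:
- # sort feature indices by inverse document frequency (low idf = very frequent word)
- sorted_by_idf = np.argsort(vectorizer.idf_)
- print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))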
As expected, these are mostly English stopwords like "the" and "no". But some are clearly domain-specific to the movie reviews, like "movie", "film", "time", "story", and so on. Interestingly, "good", "great", and "bad" are also among the most frequent and therefore “least relevant” words according to the tf–idf measure, even though we might expect these to be very important for our sentiment analysis task.
Investigating Model Coefficients
Finally, let’s look in a bit more detail into what our logistic regression model actually learned from the data. Because there are so many features—27,271 after removing the infrequent ones—we clearly cannot look at all of the coefficients at the same time. However, we can look at the largest coefficients, and see which words these correspond to. We will use the last model that we trained, based on the tf–idf features.
The following bar chart (Figure 8-2) shows the 40 largest and 40 smallest coefficients of the logistic regression model (matching n_top_features=40 in the code below), with the bars showing the size of each coefficient:
- mglearn.tools.visualize_coefficients(
- grid.best_estimator_.named_steps["logisticregression"].coef_,
- feature_names, n_top_features=40)
The negative coefficients on the left belong to words that according to the model are indicative of negative reviews, while the positive coefficients on the right belong to words that according to the model indicate positive reviews. Most of the terms are quite intuitive, like "worst", "waste", "disappointment", and "laughable" indicating bad movie reviews, while "excellent", "wonderful", "enjoyable", and "refreshing" indicate positive movie reviews. Some words are slightly less clear, like "bit", "job", and "today", but these might be part of phrases like “good job” or “best today.”
Bag-of-Words with More Than One Word (n-Grams)
One of the main disadvantages of using a bag-of-words representation is that word order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and “it’s good, not bad at all” have exactly the same representation, even though the meanings are inverted. Putting “not” in front of a word is only one example (if an extreme one) of how context matters. Fortunately, there is a way of capturing context when using a bag-of-words representation, by not only considering the counts of single tokens, but also the counts of pairs or triplets of tokens that appear next to each other. Pairs of tokens are known as bigrams, triplets of tokens are known as trigrams, and more generally sequences of tokens are known as n-grams. We can change the range of tokens that are considered as features by changing the ngram_range parameter of CountVectorizer or TfidfVectorizer. The ngram_range parameter is a tuple, consisting of the minimum length and the maximum length of the sequences of tokens that are considered. Here is an example on the toy data we used earlier:
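A sketch of that example; bards_words is the two-sentence toy corpus used in the earlier bag-of-words examples, restated here so the snippet is self-contained:
- from sklearn.feature_extraction.text import CountVectorizer
- bards_words = ["The fool doth think he is wise,",
-                "but the wise man knows himself to be a fool"]
- # ngram_range=(1, 1) is the default: unigrams only
- cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))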
Using longer sequences of tokens usually results in many more features, and in more specific features. There is no common bigram between the two phrases in bards_words:
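To see this, here is a sketch that extracts only bigrams (ngram_range=(2, 2)) from the corpus defined in the previous snippet:
- # bigrams only: every feature is a pair of adjacent tokens
- cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))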
For most applications, the minimum number of tokens should be one, as single words often capture a lot of meaning. Adding bigrams helps in most cases. Adding longer sequences—up to 5-grams—might help too, but this will lead to an explosion of the number of features and might lead to overfitting, as there will be many very specific features. In principle, the number of bigrams could be the number of unigrams squared and the number of trigrams could be the number of unigrams to the power of three, leading to very large feature spaces. In practice, the number of higher n-grams that actually appear in the data is much smaller, because of the structure of the (English) language, though it is still large.
Here is what using unigrams, bigrams, and trigrams on bards_words looks like:
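A sketch of that setting, again reusing bards_words:
- # unigrams, bigrams, and trigrams together
- cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))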
Let’s try out the TfidfVectorizer on the IMDb movie review data and find the best setting of n-gram range using a grid search:
- ch8_t07.py
- ...
- pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
- # running the grid search takes a long time because of the
- # relatively large grid and the inclusion of trigrams
- param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
- "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(text_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
- print("Best parameters:\n{}".format(grid.best_params_))
As you can see from the results, we improved performance by a bit more than a percent by adding bigram and trigram features. We can visualize the cross-validation accuracy as a function of the ngram_range and C parameter as a heat map, as we did in Chapter 6 (see Figure 8-3):
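Here is a sketch of how such a heat map can be drawn from the fitted grid search, using matplotlib directly rather than a helper function. It assumes the parameter combinations in cv_results_ are ordered with ngram_range varying fastest (scikit-learn iterates the grid with the parameter names sorted alphabetically, so the three ngram_range values come consecutively for each value of C):
- import matplotlib.pyplot as plt
- # mean cross-validation scores, reshaped so rows are ngram_range and columns are C
- scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T
- plt.matshow(scores)
- plt.xlabel("C")
- plt.ylabel("ngram_range")
- plt.colorbar()
- plt.xticks(range(6), [str(c) for c in param_grid['logisticregression__C']])
- plt.yticks(range(3), [str(r) for r in param_grid['tfidfvectorizer__ngram_range']])
- plt.show()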
Figure 8-3. Heat map visualization of mean cross-validation accuracy as a function of the parameters ngram_range and C
From the heat map we can see that using bigrams increases performance quite a bit, while adding trigrams only provides a very small benefit in terms of accuracy. To understand better how the model improved, we can visualize the important coefficients for the best model, which includes unigrams, bigrams, and trigrams (see Figure 8-4):
- # extract feature names and coefficients
- vect = grid.best_estimator_.named_steps['tfidfvectorizer']
- feature_names = np.array(vect.get_feature_names())
- coef = grid.best_estimator_.named_steps['logisticregression'].coef_
- mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)
There are particularly interesting features containing the word “worth” that were not present in the unigram model: "not worth" is indicative of a negative review, while "definitely worth" and "well worth" are indicative of a positive review. This is a prime example of context influencing the meaning of the word “worth.”
Next, we’ll visualize only trigrams, to provide further insight into why these features are helpful. Many of the useful bigrams and trigrams consist of common words that would not be informative on their own, as in the phrases "none of the", "the only good", "on and on", "this is one", "of the most", and so on. However, the impact of these features is quite limited compared to the importance of the unigram features, as you can see in Figure 8-5:
- # find 3-gram features
- mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
- # visualize only 3-gram features
- mglearn.tools.visualize_coefficients(coef.ravel()[mask],
- feature_names[mask], n_top_features=40)
Supplement
* Differences between the L1-norm and the L2-norm