Another way that we can get rid of uninformative words is by discarding words that are too frequent to be informative. There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently. scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
- from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
- print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
- print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
Clearly, removing the stopwords in the list can only decrease the number of features by the length of the list—here, 318—but it might lead to an improvement in performance. Let’s give it a try:
- # Specifying stop_words="english" uses the built-in list.
- # We could also augment it and pass our own.
- vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
- X_train = vect.transform(text_train)
- print("X_train with stop words:\n{}".format(repr(X_train)))
There are now 305 (27,271–26,966) fewer features in the dataset, which means that most, but not all, of the stopwords appeared. Let’s run the grid search again:
- from sklearn.model_selection import GridSearchCV
- param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
- grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
- grid.fit(X_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
- print("Best parameters: ", grid.best_params_)
The grid search performance decreased slightly using the stopwords—not enough to worry about, but given that excluding 305 features out of over 27,000 is unlikely to change performance or interpretability a lot, it doesn’t seem worth using this list. Fixed lists are mostly helpful for small datasets, which might not contain enough information for the model to determine which words are stopwords from the data itself. As an exercise, you can try out the other approach, discarding frequently appearing words, by setting the max_df option of CountVectorizer and see how it influences the number of features and the performance.
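As a starting point for that exercise, here is a minimal sketch; the 0.8 threshold (ignore words that appear in more than 80% of the documents) is an arbitrary value chosen purely for illustration:
- # max_df drops words that appear in more than the given fraction of documents
- vect = CountVectorizer(min_df=5, max_df=0.8).fit(text_train)
- X_train = vect.transform(text_train)
- print("X_train with max_df=0.8:\n{}".format(repr(X_train)))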
Rescaling the Data with tf–idf
Instead of dropping features that are deemed unimportant, another approach is to rescale features by how informative we expect them to be. One of the most common ways to do this is using the term frequency–inverse document frequency (tf–idf) method. The intuition of this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document. scikit-learn implements the tf–idf method in two classes: TfidfTransformer, which takes in the sparse matrix output produced by CountVectorizer and transforms it, and TfidfVectorizer, which takes in the text data and does both the bag-of-words feature extraction and the tf–idf transformation. There are several variants of the tf–idf rescaling scheme, which you can read about on Wikipedia. The tf–idf score for word w in document d as implemented in both the TfidfTransformer and TfidfVectorizer classes is given by:
tfidf(w, d) = tf × (log((N + 1) / (Nw + 1)) + 1)
where N is the number of documents in the training set, Nw is the number of documents in the training set that the word w appears in, and tf (the term frequency) is the number of times that the word w appears in the query document d (the document you want to transform or encode). Both classes also apply L2 normalization after computing the tf–idf representation; in other words, they rescale the representation of each document to have Euclidean norm 1. Rescaling in this way means that the length of a document (the number of words) does not change the vectorized representation.
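To make the formula and the L2 normalization concrete, here is a small sketch on a made-up three-document corpus (the documents are invented only for illustration); it verifies that every transformed document vector has Euclidean norm 1:
- import numpy as np
- from sklearn.feature_extraction.text import TfidfVectorizer
- toy_docs = ["the movie was good", "the movie was bad", "a truly great movie"]
- # defaults: smooth_idf=True (the formula above) and norm="l2"
- tfidf = TfidfVectorizer()
- X_toy = tfidf.fit_transform(toy_docs)
- # each row is rescaled to unit Euclidean length
- print(np.linalg.norm(X_toy.toarray(), axis=1))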
Because tf–idf actually makes use of the statistical properties of the training data, we will use a pipeline, as described in Chapter 7, to ensure the results of our grid search are valid. This leads to the following code:
- ch8_t04.py
- import numpy as np
- from sklearn.datasets import load_files
- from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
- print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
- print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
- reviews_train = load_files("data/aclImdb/train/")
- # load_files returns a bunch, containing training texts and training labels
- text_train, y_train = reviews_train.data, reviews_train.target
- print("type of text_train: {}".format(type(text_train)))
- print("length of text_train: {}".format(len(text_train)))
- print("text_train[1]:\n{}".format(text_train[1]))
- print("Samples per class (training): {}".format(np.bincount(y_train)))
- reviews_test = load_files("data/aclImdb/test/")
- text_test, y_test = reviews_test.data, reviews_test.target
- print("Number of documents in test data: {}".format(len(text_test)))
- print("Samples per class (test): {}".format(np.bincount(y_test)))
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.pipeline import make_pipeline
- from sklearn.model_selection import cross_val_score
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import GridSearchCV
- pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
- LogisticRegression())
- param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(text_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
As you can see, there is some improvement when using tf–idf instead of just word counts. We can also inspect which words tf–idf found most important. Keep in mind that the tf–idf scaling is meant to find words that distinguish documents, but it is a purely unsupervised technique. So, “important” here does not necessarily relate to the “positive review” and “negative review” labels we are interested in. First, we extract the TfidfVectorizer from the pipeline:
- vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
- # transform the training dataset
- X_train = vectorizer.transform(text_train)
- # find maximum value for each of the features over the dataset
- # numpy.ravel(a, order='C'): Return a contiguous flattened array.
- max_value = X_train.max(axis=0).toarray().ravel()
- sorted_by_tfidf = max_value.argsort()
- # get feature names
- feature_names = np.array(vectorizer.get_feature_names())
- print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
- print("Features with highest tfidf: \n{}".format(feature_names[sorted_by_tfidf[-20:]]))
Below is a quick reference for the NumPy APIs used in the sample code above:
- numpy.ravel(a, order='C'): Return a contiguous flattened array.
- numpy.argsort(a, axis=-1, kind='quicksort', order=None): Returns the indices that would sort an array.
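A couple of one-liners, only to illustrate what those two calls return:
- import numpy as np
- print(np.ravel([[1, 2], [3, 4]]))       # [1 2 3 4]
- print(np.argsort(np.array([3, 1, 2])))  # [1 2 0], the indices that sort the array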
Features with low tf–idf are those that either are very commonly used across documents or are only used sparingly, and only in very long documents. Interestingly, many of the high-tf–idf features actually identify certain shows or movies. These terms only appear in reviews for this particular show or franchise, but tend to appear very often in these particular reviews. This is very clear, for example, for "pokemon", "smallville", and "doodlebops", but "scanners" here actually also refers to a movie title. These words are unlikely to help us in our sentiment classification task (unless maybe some franchises are universally reviewed positively or negatively) but certainly contain a lot of specific information about the reviews.
We can also find the words that have low inverse document frequency—that is, those that appear frequently and are therefore deemed less important. The inverse document frequency values found on the training set are stored in the idf_ attribute:
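A short sketch of that lookup, reusing the vectorizer and feature_names objects from the code above; it prints the 100 terms with the lowest idf, that is, the ones that appear in the most documents:
- # sort feature indices by inverse document frequency (low idf = very frequent word)
- sorted_by_idf = np.argsort(vectorizer.idf_)
- print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))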
As expected, these are mostly English stopwords like "the" and "no". But some are clearly domain-specific to the movie reviews, like "movie", "film", "time", "story", and so on. Interestingly, "good", "great", and "bad" are also among the most frequent and therefore “least relevant” words according to the tf–idf measure, even though we might expect these to be very important for our sentiment analysis task.
Investigating Model Coefficients
Finally, let’s look in a bit more detail into what our logistic regression model actually learned from the data. Because there are so many features—27,271 after removing the infrequent ones—we clearly cannot look at all of the coefficients at the same time. However, we can look at the largest coefficients, and see which words these correspond to. We will use the last model that we trained, based on the tf–idf features.
The following bar chart (Figure 8-2) shows the 40 largest and 40 smallest coefficients of the logistic regression model (matching n_top_features=40 in the code below), with the bars showing the size of each coefficient:
- mglearn.tools.visualize_coefficients(
- grid.best_estimator_.named_steps["logisticregression"].coef_,
- feature_names, n_top_features=40)
The negative coefficients on the left belong to words that according to the model are indicative of negative reviews, while the positive coefficients on the right belong to words that according to the model indicate positive reviews. Most of the terms are quite intuitive, like "worst", "waste", "disappointment", and "laughable" indicating bad movie reviews, while "excellent", "wonderful", "enjoyable", and "refreshing" indicate positive movie reviews. Some words are slightly less clear, like "bit", "job", and "today", but these might be part of phrases like “good job” or “best today.”
Bag-of-Words with More Than One Word (n-Grams)
One of the main disadvantages of using a bag-of-words representation is that word order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and “it’s good, not bad at all” have exactly the same representation, even though the meanings are inverted. Putting “not” in front of a word is only one example (if an extreme one) of how context matters. Fortunately, there is a way of capturing context when using a bag-of-words representation, by not only considering the counts of single tokens, but also the counts of pairs or triplets of tokens that appear next to each other. Pairs of tokens are known as bigrams, triplets of tokens are known as trigrams, and more generally sequences of tokens are known as n-grams. We can change the range of tokens that are considered as features by changing the ngram_range parameter of CountVectorizer or TfidfVectorizer. The ngram_range parameter is a tuple, consisting of the minimum length and the maximum length of the sequences of tokens that are considered. Here is an example on the toy data we used earlier:
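A sketch of that example; bards_words is the two-sentence toy corpus used in the earlier bag-of-words examples, restated here so the snippet is self-contained:
- from sklearn.feature_extraction.text import CountVectorizer
- bards_words = ["The fool doth think he is wise,",
-                "but the wise man knows himself to be a fool"]
- # ngram_range=(1, 1) is the default: unigrams only
- cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))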
Using longer sequences of tokens usually results in many more features, and in more specific features. There is no common bigram between the two phrases in bards_words:
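To see this, here is a sketch that extracts only bigrams (ngram_range=(2, 2)) from the corpus defined in the previous snippet:
- # bigrams only: every feature is a pair of adjacent tokens
- cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))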
For most applications, the minimum number of tokens should be one, as single words often capture a lot of meaning. Adding bigrams helps in most cases. Adding longer sequences—up to 5-grams—might help too, but this will lead to an explosion of the number of features and might lead to overfitting, as there will be many very specific features. In principle, the number of bigrams could be the number of unigrams squared and the number of trigrams could be the number of unigrams to the power of three, leading to very large feature spaces. In practice, the number of higher n-grams that actually appear in the data is much smaller, because of the structure of the (English) language, though it is still large.
Here is what using unigrams, bigrams, and trigrams on bards_words looks like:
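A sketch of that setting, again reusing bards_words:
- # unigrams, bigrams, and trigrams together
- cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
- print("Vocabulary size: {}".format(len(cv.vocabulary_)))
- print("Vocabulary:\n{}".format(cv.get_feature_names()))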
Let’s try out the TfidfVectorizer on the IMDb movie review data and find the best setting of n-gram range using a grid search:
- ch8_t07.py
- ...
- pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
- # running the grid search takes a long time because of the
- # relatively large grid and the inclusion of trigrams
- param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
- "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(text_train, y_train)
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
- print("Best parameters:\n{}".format(grid.best_params_))
As you can see from the results, we improved performance by a bit more than a percent by adding bigram and trigram features. We can visualize the cross-validation accuracy as a function of the ngram_range and C parameter as a heat map, as we did in Chapter 6 (see Figure 8-3):
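Here is a sketch of how such a heat map can be drawn from the fitted grid search, using matplotlib directly rather than a helper function. It assumes the parameter combinations in cv_results_ are ordered with ngram_range varying fastest (scikit-learn iterates the grid with the parameter names sorted alphabetically, so the three ngram_range values come consecutively for each value of C):
- import matplotlib.pyplot as plt
- # mean cross-validation scores, reshaped so rows are ngram_range and columns are C
- scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T
- plt.matshow(scores)
- plt.xlabel("C")
- plt.ylabel("ngram_range")
- plt.colorbar()
- plt.xticks(range(6), [str(c) for c in param_grid['logisticregression__C']])
- plt.yticks(range(3), [str(r) for r in param_grid['tfidfvectorizer__ngram_range']])
- plt.show()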
Figure 8-3. Heat map visualization of mean cross-validation accuracy as a function of the parameters ngram_range and C
From the heat map we can see that using bigrams increases performance quite a bit, while adding trigrams only provides a very small benefit in terms of accuracy. To understand better how the model improved, we can visualize the important coefficients for the best model, which includes unigrams, bigrams, and trigrams (see Figure 8-4):
- # extract feature names and coefficients
- vect = grid.best_estimator_.named_steps['tfidfvectorizer']
- feature_names = np.array(vect.get_feature_names())
- coef = grid.best_estimator_.named_steps['logisticregression'].coef_
- mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)
There are particularly interesting features containing the word “worth” that were not present in the unigram model: "not worth" is indicative of a negative review, while "definitely worth" and "well worth" are indicative of a positive review. This is a prime example of context influencing the meaning of the word “worth.”
Next, we’ll visualize only trigrams, to provide further insight into why these features are helpful. Many of the useful bigrams and trigrams consist of common words that would not be informative on their own, as in the phrases "none of the", "the only good", "on and on", "this is one", "of the most", and so on. However, the impact of these features is quite limited compared to the importance of the unigram features, as you can see in Figure 8-5:
- # find 3-gram features
- mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
- # visualize only 3-gram features
- mglearn.tools.visualize_coefficients(coef.ravel()[mask],
- feature_names[mask], n_top_features=40)
Supplement
* Differences between the L1-norm and the L2-norm