Monday, April 10, 2017

[ Intro2ML ] Ch8. Working with Text Data - Part1

Introduction
In Chapter 5, we talked about two kinds of features that can represent properties of the data: continuous features that describe a quantity, and categorical features that are items from a fixed list. There is a third kind of feature that can be found in many applications, which is text. For example, if we want to classify an email message as either a legitimate email or spam, the content of the email will certainly contain important information for this classification task. Or maybe we want to learn about the opinion of a politician on the topic of immigration. Here, that individual’s speeches or tweets might provide useful information. In customer service, we often want to find out if a message is a complaint or an inquiry. We can use the subject line and content of a message to automatically determine the customer’s intent, which allows us to send the message to the appropriate department, or even send a fully automatic reply.

Text data is usually represented as strings, made up of characters. In any of the examples just given, the length of the text data will vary. This feature is clearly very different from the numeric features that we’ve discussed so far, and we will need to process the data before we can apply our machine learning algorithms to it.

Types of Data Represented as Strings
Before we dive into the processing steps that go into representing text data for machine learning, we want to briefly discuss different kinds of text data that you might encounter. Text is usually just a string in your dataset, but not all string features should be treated as text. A string feature can sometimes represent categorical variables, as we discussed in Chapter 6. There is no way to know how to treat a string feature before looking at the data.

There are four kinds of string data you might see:
* Categorical data
* Free strings that can be semantically mapped to categories
* Structured string data
* Text data

Categorical data is data that comes from a fixed list. Say you collect data via a survey where you ask people their favorite color, with a drop-down menu that allows them to select from “red,” “green,” “blue,” “yellow,” “black,” “white,” “purple,” and “pink.” This will result in a dataset with exactly eight different possible values, which clearly encode a categorical variable. You can check whether this is the case for your data by eyeballing it (if you see very many different strings it is unlikely that this is a categorical variable) and confirm it by computing the unique values over the dataset, and possibly a histogram over how often each appears. You also might want to check whether each variable actually corresponds to a category that makes sense for your application. Maybe halfway through the existence of your survey, someone found that “black” was misspelled as “blak” and subsequently fixed the survey. As a result, your dataset contains both “blak” and “black,” which correspond to the same semantic meaning and should be consolidated.
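For example, a quick way to run this check (not shown in the original text) is to compute the unique values and their frequencies with NumPy on a hypothetical color column:

import numpy as np

# Hypothetical survey responses; "blak" is a misspelling of "black" that should be consolidated.
colors = np.array(["blue", "red", "blak", "black", "blue", "green"])
values, counts = np.unique(colors, return_counts=True)  # unique values and how often each appears
print(dict(zip(values, counts)))

colors[colors == "blak"] = "black"  # consolidate the misspelling
print(np.unique(colors))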

Now imagine instead of providing a drop-down menu, you provide a text field for the users to provide their own favorite colors. Many people might respond with a color name like “black” or “blue.” Others might make typographical errors, use different spellings like “gray” and “grey,” or use more evocative and specific names like “midnight blue.” You will also have some very strange entries. Some good examples come from the xkcd Color Survey, where people had to name colors and came up with names like “velociraptor cloaka” and “my dentist’s office orange. I still remember his dandruff slowly wafting into my gaping yaw,” which are hard to map to colors automatically (or at all). The responses you can obtain from a text field belong to the second category in the list, free strings that can be semantically mapped to categories. It will probably be best to encode this data as a categorical variable, where you can select the categories either by using the most common entries, or by defining categories that will capture responses in a way that makes sense for your application. You might then have some categories for standard colors, maybe a category “multicolored” for people that gave answers like “green and red stripes,” and an “other” category for things that cannot be encoded otherwise. This kind of preprocessing of strings can take a lot of manual effort and is not easily automated. If you are in a position where you can influence data collection, we highly recommend avoiding manually entered values for concepts that are better captured using categorical variables.
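A minimal sketch of such a manual mapping (hypothetical category names and entries, not from the original text) might look like this:

# Map free-text color responses onto a small set of categories; anything
# unrecognized falls into "other". The mapping itself has to be curated by hand.
color_map = {"black": "black", "blue": "blue", "midnight blue": "blue",
             "gray": "gray", "grey": "gray",
             "green and red stripes": "multicolored"}

def to_category(response):
    return color_map.get(response.strip().lower(), "other")

print(to_category("Midnight Blue"))        # blue
print(to_category("velociraptor cloaka"))  # other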

Often, manually entered values do not correspond to fixed categories, but still have some underlying structure, like addresses, names of places or people, dates, telephone numbers, or other identifiers. These kinds of strings are often very hard to parse, and their treatment is highly dependent on context and domain. A systematic treatment of these cases is beyond the scope of this book.

The final category of string data is free form text data that consists of phrases or sentences. Examples include tweets, chat logs, and hotel reviews, as well as the collected works of Shakespeare, the content of Wikipedia, or the Project Gutenberg collection of 50,000 ebooks. All of these collections contain information mostly as sentences composed of words. For simplicity’s sake, let’s assume all our documents are in one language, English. In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document. These terms come from the information retrieval (IR) and natural language processing (NLP) community, which both deal mostly in text data.

Example Application: Sentiment Analysis of Movie Reviews
As a running example in this chapter, we will use a dataset of movie reviews from the IMDb (Internet Movie Database) website collected by Stanford researcher Andrew Maas. This dataset contains the text of the reviews, together with a label that indicates whether a review is “positive” or “negative.” The IMDb website itself contains ratings from 1 to 10. To simplify the modeling, this annotation is summarized as a two-class classification dataset where reviews with a score of 6 or higher are labeled as positive, and the rest as negative. We will leave the question of whether this is a good representation of the data open, and simply use the data as provided by Andrew Maas.

After unpacking the data, the dataset is provided as text files in two separate folders, one for the training data and one for the test data. Each of these in turn has two subfolders, one called pos and one called neg:
# wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# tar -xvf aclImdb_v1.tar.gz
# tree -dL 2 aclImdb
aclImdb
├── test
│   ├── neg
│   └── pos
└── train
    ├── neg
    ├── pos
    └── unsup

# rm -rf aclImdb/train/unsup

The pos folder contains all the positive reviews, each as a separate text file, and similarly for the neg folder. The unsup folder contains unlabeled data, which we won’t use, and therefore remove. There is a helper function in scikit-learn to load files stored in such a folder structure, where each subfolder corresponds to a label, called load_files. We apply the load_files function first to the training data:
- ch8_t01.py
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("data/aclImdb/train/")
# load_files returns a bunch, containing training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[1]:\n{}".format(text_train[1]))
Output:
type of text_train: <class 'list'>
length of text_train: 25000
text_train[1]:
Words can't describe how bad this movie is. ...

You can see that text_train is a list of length 25,000, where each entry is a string containing a review. We printed the review at index 1 as an example. The type of the entries of text_train will depend on your Python version. In Python 3, they will be of type bytes, which represents a binary encoding of the string data. In Python 2, text_train contains strings. We won’t go into the details of the different string types in Python here, but we recommend that you read the Python 2 and/or Python 3 documentation regarding strings and Unicode.
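If you are working in Python 3 and prefer to work with regular strings, one option (not part of the original example) is to decode the raw bytes, assuming the review files are UTF-8 encoded:

# A minimal sketch: decode each review from bytes to str.
# errors="replace" substitutes any undecodable bytes instead of raising an exception.
text_train = [doc.decode("utf-8", errors="replace") for doc in text_train]
print(type(text_train[1]))  # <class 'str'> after decoding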

The dataset was collected such that the positive class and the negative class are balanced, so that there are as many positive as negative strings:
numpy.bincount(x, weights=None, minlength=0) counts the number of occurrences of each value in an array of non-negative integers, which makes it easy to verify the class balance:
>>> import numpy as np
>>> print("Samples per class (training): {}".format(np.bincount(y_train)))
Samples per class (training): [12500 12500]

We load the test dataset in the same manner:
...
reviews_test = load_files("data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
Output:
Number of documents in test data: 25000
Samples per class (test): [12500 12500]

Representing Text Data as a Bag of Words
One of the most simple but effective and commonly used ways to represent text for machine learning is using the bag-of-words representation. When using this representation, we discard most of the structure of the input text, like chapters, paragraphs, sentences, and formatting, and only count how often each word appears in each text in the corpus. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a “bag.”

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:
1. Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting them on whitespace and punctuation.
2. Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
3. Encoding. For each document, count how often each of the words in the vocabulary appear in this document.

There are some subtleties involved in step 1 and step 2, which we will discuss in more detail later in this chapter. For now, let’s look at how we can apply the bag-of-words processing using scikit-learn. Figure 8-1 illustrates the process on the string "This is how you get ants.".

Figure 8-1. Bag-of-words processing

The output is one vector of word counts for each document. For each word in the vocabulary, we have a count of how often it appears in each document. That means our numeric representation has one feature for each unique word in the whole dataset. Note how the order of the words in the original string is completely irrelevant to the bag-of-words feature representation.
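To make the three steps concrete, here is a minimal pure-Python sketch (not part of the original example) that tokenizes the string from Figure 8-1, builds a vocabulary, and encodes the document as a count vector:

import re

docs = ["This is how you get ants."]

# 1. Tokenization: lowercase and extract tokens of at least two word characters
tokens_per_doc = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# 2. Vocabulary building: collect all distinct tokens and number them alphabetically
all_tokens = set(word for tokens in tokens_per_doc for word in tokens)
vocabulary = {word: i for i, word in enumerate(sorted(all_tokens))}

# 3. Encoding: count how often each vocabulary word appears in each document
counts = [[tokens.count(word) for word in sorted(vocabulary)] for tokens in tokens_per_doc]
print(vocabulary)  # {'ants': 0, 'get': 1, 'how': 2, 'is': 3, 'this': 4, 'you': 5}
print(counts)      # [[1, 1, 1, 1, 1, 1]]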

Applying Bag-of-Words to a Toy Dataset
The bag-of-words representation is implemented in CountVectorizer, which is a transformer. Let’s first apply it to a toy dataset, consisting of two samples, to see it working:
bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]
We import and instantiate the CountVectorizer and fit it to our toy data as follows:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)
Fitting the CountVectorizer consists of the tokenization of the training data and building of the vocabulary, which we can access as the vocabulary_ attribute:
  1. print("Vocabulary size: {}".format(len(vect.vocabulary_)))  
  2. print("Vocabulary content:")  
  3. for key in sorted(vect.vocabulary_.iterkeys()):  
  4.     print "%s: %s" % (key, vect.vocabulary_[key])  
Output:
Vocabulary size: 13
Vocabulary content:
be: 0
but: 1
doth: 2
fool: 3
he: 4
himself: 5
is: 6
knows: 7
man: 8
the: 9
think: 10
to: 11
wise: 12

The vocabulary consists of 13 words, from "be" to "wise". To create the bag-of-words representation for the training data, we call the transform method:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))
Output:
bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

The bag-of-words representation is stored in a SciPy sparse matrix that only stores the entries that are nonzero. The matrix is of shape 2×13, with one row for each of the two data points and one feature for each of the words in the vocabulary. A sparse matrix is used as most documents only contain a small subset of the words in the vocabulary, meaning most entries in the feature array are 0. Think about how many different words might appear in a movie review compared to all the words in the English language (which is what the vocabulary models). Storing all those zeros would be prohibitive, and a waste of memory. To look at the actual content of the sparse matrix, we can convert it to a “dense” NumPy array (that also stores all the 0 entries) using the toarray method:
  1. print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))  
Output:
Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
[1 1 0 1 0 1 0 1 1 1 0 1 1]]

We can see that the word counts for each word are either 0 or 1; neither of the two strings in bards_words contains a word twice. Let’s take a look at how to read these feature vectors. The first string ("The fool doth think he is wise,") is represented as the first row of the feature array. It contains the first word in the vocabulary, "be", zero times. It also contains the second word in the vocabulary, "but", zero times. It contains the third word, "doth", once, and so on. Looking at both rows, we can see that the fourth word, "fool", the tenth word, "the", and the thirteenth word, "wise", appear in both strings.
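To make the mapping between columns and words easier to read, one option (not in the original example) is to put the dense array into a pandas DataFrame, assuming pandas is available; note that newer scikit-learn versions replace get_feature_names with get_feature_names_out:

import pandas as pd

# Columns are the vocabulary words, rows are the two documents in bards_words.
feature_names = vect.get_feature_names()  # use vect.get_feature_names_out() on newer scikit-learn
df = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
print(df)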

Bag-of-Words for Movie Reviews
Now that we’ve gone through the bag-of-words process in detail, let’s apply it to our task of sentiment analysis for movie reviews. Earlier, we loaded our training and test data from the IMDb reviews into lists of strings (text_train and text_test), which we will now process:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))
Output:
X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3445861 stored elements in Compressed Sparse Row format>

The shape of X_train, the bag-of-words representation of the training data, is 25,000×74,849, indicating that the vocabulary contains 74,849 entries. Again, the data is stored as a SciPy sparse matrix. Let’s look at the vocabulary in a bit more detail. Another way to access the vocabulary is using the get_feature_names method of the vectorizer, which returns a convenient list where each entry corresponds to one feature:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))
Output:
Number of features: 74849
First 20 features:
[u'00', u'000', u'0000000000001', u'00001', u'00015', u'000s', u'001', u'003830', u'006', u'007', u'0079', u'0080', u'0083', u'0093638', u'00am', u'00pm', u'00s', u'01', u'01pm', u'02']
Features 20010 to 20030:
[u'dratted', u'draub', u'draught', u'draughts', u'draughtswoman', u'draw', u'drawback', u'drawbacks', u'drawer', u'drawers', u'drawing', u'drawings', u'drawl', u'drawled', u'drawling', u'drawn', u'draws', u'draza', u'dre', u'drea']
Every 2000th feature:
[u'00', u'aesir', u'aquarian', u'barking', u'blustering', u'b\xeate', u'chicanery', u'condensing', u'cunning', u'detox', u'draper', u'enshrined', u'favorit', u'freezer', u'goldman', u'hasan', u'huitieme', u'intelligible', u'kantrowitz', u'lawful', u'maars', u'megalunged', u'mostey', u'norrland', u'padilla', u'pincher', u'promisingly', u'receptionist', u'rivals', u'schnaas', u'shunning', u'sparse', u'subset', u'temptations', u'treatises', u'unproven', u'walkman', u'xylophonist']

As you can see, possibly a bit surprisingly, the first 10 entries in the vocabulary are all numbers. All these numbers appear somewhere in the reviews, and are therefore extracted as words. Most of these numbers don’t have any immediate semantic meaning—apart from "007", which in the particular context of movies is likely to refer to the James Bond character. Weeding out the meaningful from the nonmeaningful “words” is sometimes tricky. Looking further along in the vocabulary, we find a collection of English words starting with “dra”. You might notice that for "draught", "drawback", and "drawer" both the singular and plural forms are contained in the vocabulary as distinct words. These words have very closely related semantic meanings, and counting them as different words, corresponding to different features, might not be ideal.
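As a quick sanity check (not part of the original text), we can count how many vocabulary entries consist only of digits:

# A minimal sketch: count purely numeric tokens in the fitted vocabulary.
n_numeric = sum(name.isdigit() for name in feature_names)
print("Purely numeric tokens: {} of {}".format(n_numeric, len(feature_names)))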

Before we try to improve our feature extraction, let’s obtain a quantitative measure of performance by actually building a classifier. We have the training labels stored in y_train and the bag-of-words representation of the training data in X_train, so we can train a classifier on this data. For high-dimensional, sparse data like this, linear models like LogisticRegression often work best.

Let’s start by evaluating LogisticRegression using cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))
Output:
Mean cross-validation accuracy: 0.88

We obtain a mean cross-validation score of 88%, which indicates reasonable performance for a balanced binary classification task. We know that LogisticRegression has a regularization parameter, C, which we can tune via cross-validation:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
Output:
Best cross-validation score: 0.89
Best parameters: {'C': 0.1}

We obtain a cross-validation score of 89% using C=0.1. We can now assess the generalization performance of this parameter setting on the test set:
X_test = vect.transform(text_test)
print("{:.2f}".format(grid.score(X_test, y_test)))
Output:
0.88

Now, let’s see if we can improve the extraction of words. The CountVectorizer extracts tokens using a regular expression. By default, the regular expression that is used is "\b\w\w+\b". If you are not familiar with regular expressions, this means it finds all sequences of characters that consist of at least two letters or numbers (\w) and that are separated by word boundaries (\b). It does not find single-letter words, and it splits up contractions like “doesn’t” or “bit.ly”, but it matches “h8ter” as a single word. The CountVectorizer then converts all words to lowercase characters, so that “soon”, “Soon”, and “sOon” all correspond to the same token (and therefore feature). This simple mechanism works quite well in practice, but as we saw earlier, we get many uninformative features (like the numbers). One way to cut back on these is to only use tokens that appear in at least two documents (or at least five documents, and so on). A token that appears only in a single document is unlikely to appear in the test set and is therefore not helpful. We can set the minimum number of documents a token needs to appear in with the min_df parameter:
- ch8_t03.py
...
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))
Output:
X_train with min_df:
<25000x27272 sparse matrix of type '<class 'numpy.int64'>'
	with 3368680 stored elements in Compressed Sparse Row format>

By requiring at least five appearances of each token, we can bring down the number of features to 27,272, as seen in the preceding output (only about a third of the original features). Let’s look at some tokens again:
feature_names = vect.get_feature_names()
print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))
Output:


There are clearly many fewer numbers, and some of the more obscure words or misspellings seem to have vanished. Let’s see how well our model performs by doing a grid search again:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
Output:
Best cross-validation score: 0.89

The best validation accuracy of the grid search is still 89%, unchanged from before. We didn’t improve our model, but having fewer features to deal with speeds up processing and throwing away useless features might make the model more interpretable.
Note.
If the transform method of CountVectorizer is called on a document that contains words that were not contained in the training data, these words will be ignored as they are not part of the dictionary. This is not really an issue for classification, as it’s not possible to learn anything about words that are not in the training data. For some applications, like spam detection, it might be helpful to manually add a feature that encodes how many so-called “out of vocabulary” words there are in a particular document, though. For this to work, you need to set min_df; otherwise, this feature will never be active during training.
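As a small illustration of this behavior (not part of the original text), transforming a document made up entirely of unseen words yields an all-zero row, and the number of out-of-vocabulary tokens could be counted separately with a sketch like the one below, which assumes vect is the CountVectorizer fitted with min_df=5 above:

# Words not in the learned vocabulary are silently ignored by transform.
unseen = ["qwertyzzz plonkfoo anothermadeupword"]
row = vect.transform(unseen)
print(row.sum())  # 0: none of these tokens are in the vocabulary

# Hypothetical extra feature: how many tokens of a document fall outside the vocabulary.
analyzer = vect.build_analyzer()  # applies the same tokenization and lowercasing as the vectorizer
tokens = analyzer(unseen[0])
n_oov = sum(token not in vect.vocabulary_ for token in tokens)
print("Out-of-vocabulary tokens: {}".format(n_oov))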

