Monday, April 13, 2020

[ ML Article Collection ] Evaluate Topic Models: Latent Dirichlet Allocation (LDA)

Source From Here
Preface
In the previous article, I introduced the concept of topic modeling and walked through the code for developing your first topic model using the Latent Dirichlet Allocation (LDA) method in Python with the sklearn implementation.

Building on that understanding, in this article we'll go a few steps deeper by outlining the framework used to quantitatively evaluate topic models through the measure of topic coherence, and share a code template in Python, using the Gensim implementation, to allow for end-to-end model development.

Why evaluate topic models?
We know that probabilistic topic models, such as LDA, are popular tools for text analysis, providing both a predictive and a latent topic representation of the corpus. However, there is a longstanding assumption that the latent space discovered by these models is generally meaningful and useful, and evaluating such assumptions is challenging due to the unsupervised training process. Besides, there is no gold-standard list of topics to compare against for every corpus.

Nevertheless, it is equally important to identify whether a trained model is objectively good or bad, as well as to have the ability to compare different models/methods. To do so, one would require an objective measure of quality. Traditionally, and still for many practical applications, implicit knowledge and "eyeballing" approaches are used to evaluate whether "the correct thing" has been learned about the corpus. Ideally, we'd like to capture this information in a single metric that can be maximized and compared.

Let's take a look at the approaches commonly used for evaluation:

Eye Balling Models
* Top N words
* Topics / Documents

Intrinsic Evaluation Metrics
* Capturing model semantics
* Topics interpretability

Human Judgements
* What is a topic

Extrinsic Evaluation Metrics/Evaluation at task
* Is the model good at performing predefined tasks, such as classification?

Natural language is messy, ambiguous and full of subjective interpretation, and sometimes trying to cleanse ambiguity reduces the language to an unnatural form. In this article, we’ll explore more about topic coherence, an intrinsic evaluation metric, and how you can use it to quantitatively justify the model selection.

What is Topic Coherence?
The concept of topic coherence combines a number of measures into a framework to evaluate the coherence between topics inferred by a model. But before that…

What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference. But …

What is coherence?
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is: "the game is a team sport", "the game is played with a ball", "the game demands great physical effort".

Coherence Measures
Let's take a quick look at different coherence measures and how they are calculated (a short Gensim sketch follows the list):

1. C_v measure is based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity

2. C_p is based on a sliding window, one-preceding segmentation of the top words and the confirmation measure of Fitelson’s coherence

3. C_uci measure is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words

4. C_umass is based on document cooccurrence counts, a one-preceding segmentation and a logarithmic conditional probability as confirmation measure

5. C_npmi is an enhanced version of the C_uci coherence using the normalized pointwise mutual information (NPMI)

6. C_a is based on a context window, a pairwise comparison of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity
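
Several of these measures are available directly through Gensim's CoherenceModel via its coherence argument ('c_v', 'c_uci', 'c_npmi' and 'u_mass'; 'c_p' and 'c_a' are not included). Here is a minimal sketch, assuming a trained lda_model, the tokenized texts data_lemmatized and the dictionary id2word that we build later in this article:

  from gensim.models import CoherenceModel

  # Compare the coherence measures Gensim supports on the same model
  # (for 'u_mass' the corpus is derived internally from texts + dictionary)
  for measure in ['c_v', 'c_uci', 'c_npmi', 'u_mass']:
      cm = CoherenceModel(model=lda_model, texts=data_lemmatized,
                          dictionary=id2word, coherence=measure)
      print(measure, cm.get_coherence())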

There is, of course, a lot more to the concept of topic model evaluation and the coherence measure. However, keeping in mind the length and purpose of this article, let's apply these concepts to develop a model that is at least better than one with the default parameters. Also, we'll be re-purposing already available online pieces of code to support this exercise instead of re-inventing the wheel.

Model Implementation
The complete code is available as a Jupyter Notebook on GitHub (my version):
1. Loading data
2. Data Cleaning
3. Phrase Modeling: Bi-grams and Tri-grams
4. Data transformation: Corpus and Dictionary
5. Base Model Performance
6. Hyperparameter Tuning
7. Final Model
8. Visualize Results

Loading Data
For this tutorial, we’ll use the dataset of papers published in NIPS conference. The NIPS conference (Neural Information Processing Systems) is one of the most prestigious yearly events in the machine learning community. The CSV data file contains information on the different NIPS papers that were published from 1987 until 2016 (29 years!).

Let’s start by looking at the content of the file papers.csv (Download NIPS Papers.zip and unzip it)
  # Importing modules
  import pandas as pd
  import os
  import sys

  # Show Python version
  print(sys.version)

  # Read data into papers
  papers = pd.read_csv('data/papers.csv')

  # Print head
  papers.head()


Data Cleaning
Since the goal of this analysis is to perform topic modeling, we will focus solely on the text data from each paper and drop the other metadata columns:
  # Remove the metadata columns
  papers = papers.drop(columns=['id', 'title', 'abstract', 'event_type', 'pdf_name', 'year'])

  # Sample only 10 papers - for demonstration purposes
  papers = papers.sample(10)

  # Print out the first rows of papers
  papers.head()


Remove punctuation/lower casing
Next, let's perform some simple preprocessing on the content of the paper_text column to make it more amenable to analysis and produce reliable results. To do that, we'll use regular expressions to remove punctuation and numbers, and then lowercase the text:
  # Load the regular expression library
  import re

  # Remove punctuation
  papers['paper_text_processed'] = papers['paper_text'].map(lambda x: re.sub(r'[,\.!?]', '', x))
  # Remove other non-alphabetic characters (note the escaped '-' so it is not read as a range)
  papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: re.sub(r'[~\'(){}:;+\-=*"&]', ' ', x))
  # Remove numbers
  papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: re.sub(r'[0-9]+', '', x))
  # Convert the text to lowercase
  papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: x.lower().strip())
  # Print out the first rows of papers
  papers['paper_text_processed'].head()


Tokenize words and further clean-up text
Let's tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether:
  %%time
  import gensim
  from gensim.utils import simple_preprocess

  def sent_to_words(sentences):
      for sentence in sentences:
          # deacc=True removes punctuation
          yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

  data = papers.paper_text_processed.values.tolist()
  data_words = list(sent_to_words(data))
  print(data_words[:1][0][:100])  # The first 100 words of the first article


Phrase Modeling: Bi-grams and Tri-grams
Bigrams are two words that frequently occur together in a document; trigrams are three words that frequently occur together. Some examples are: 'back_bumper', 'oil_leakage', 'maryland_college_park', etc.

Gensim's Phrases model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold: the higher the values of these parameters, the harder it is for words to be combined into phrases.
  # Build the bigram and trigram models
  bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold => fewer phrases
  trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

  # Faster way to get a sentence clubbed as a trigram/bigram
  bigram_mod = gensim.models.phrases.Phraser(bigram)
  trigram_mod = gensim.models.phrases.Phraser(trigram)
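To get a feel for what the phrase model learned, you can push one tokenized document through bigram_mod and look for tokens joined by an underscore (an illustrative check; the exact output depends on which papers were sampled):

  # Peek at the bigrams detected in the first document (tokens joined with "_")
  print([token for token in bigram_mod[data_words[0]] if '_' in token][:10])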
Remove Stopwords, Make Bigrams and Lemmatize
The phrase models are ready. Let's define the functions to remove stopwords, make bigrams/trigrams and lemmatize, and then call them sequentially.
  import nltk

  nltk.download('stopwords')

  # Optional
  !pip install spacy
  !python -m spacy download en_core_web_sm
The code above downloads the stopword resource from NLTK and prepares the spaCy model. Next come the helper functions:
  # NLTK stop words
  from nltk.corpus import stopwords
  stop_words = stopwords.words('english')
  stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

  # Define functions for stopwords, bigrams, trigrams and lemmatization
  def remove_stopwords(texts):
      return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

  def make_bigrams(texts):
      return [bigram_mod[doc] for doc in texts]

  def make_trigrams(texts):
      return [trigram_mod[bigram_mod[doc]] for doc in texts]

  def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
      """https://spacy.io/api/annotation"""
      texts_out = []
      for sent in texts:
          doc = nlp(" ".join(sent))
          texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
      return texts_out
Let’s call the functions in order:
  import spacy

  # Remove stop words
  data_words_nostops = remove_stopwords(data_words)
  # Form bigrams
  data_words_bigrams = make_bigrams(data_words_nostops)
  # Initialize the spaCy 'en' model, keeping only the tagger component (for efficiency)
  nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
  # Do lemmatization, keeping only nouns, adjectives, verbs and adverbs
  data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
  print(data_lemmatized[:1][0][:10])
Output:
['mori', 'speech', 'dept', 'speech', 'science', 'build', 'speech_recognition', 'system', 'knowledge', 'utilize']

Data Transformation: Corpus and Dictionary
The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them:
  import gensim.corpora as corpora

  # Create Dictionary
  id2word = corpora.Dictionary(data_lemmatized)
  # Create Corpus
  texts = data_lemmatized
  # Term-document frequency (bag-of-words for each document)
  corpus = [id2word.doc2bow(text) for text in texts]
  # View
  print(corpus[:1][0][:10])
Output:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 10), (6, 1), (7, 2), (8, 1), (9, 3)]

Gensim creates a unique id for each word in the documents. The produced corpus shown above is a mapping of (word_id, word_frequency) pairs. For example, (0, 1) above implies that word id 0 occurs once in the first document. Likewise, (5, 10) implies that word id 5 occurs ten times, and so on.
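
If you prefer a human-readable view of those pairs, you can map the ids back through the dictionary, for example:

  # Human-readable (word, frequency) pairs for the first ten entries of the first document
  print([(id2word[word_id], freq) for word_id, freq in corpus[:1][0][:10]])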

Base Model
We have everything required to train the base LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well. Apart from that, alpha and beta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a 1.0/num_topics prior (we'll use the defaults for the base model).

chunksize controls how many documents are processed at a time by the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory.

passes controls how often we train the model on the entire corpus (set to 10). Another word for passes might be "epochs". iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of passes and iterations high enough.
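
A practical way to judge whether passes and iterations are high enough is to turn on Gensim's INFO-level logging before training and watch the progress messages it prints for each pass; a small optional sketch:

  # Optional: enable INFO logging so Gensim reports training progress,
  # which helps judge whether passes/iterations are set high enough
  import logging
  logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                      level=logging.INFO)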

  # Build the base LDA model
  lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                         id2word=id2word,
                                         num_topics=10,
                                         random_state=100,
                                         chunksize=100,
                                         passes=10,
                                         per_word_topics=True)
View the topics in LDA model
The above LDA model is built with 10 different topics, where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic. You can see the keywords for each topic and the weightage (importance) of each keyword using lda_model.print_topics():
  from pprint import pprint

  # Print the keywords in the 10 topics
  pprint(lda_model.print_topics())


Compute Model Perplexity and Coherence Score
Let’s calculate the baseline coherence score:
  from gensim.models import CoherenceModel

  # Compute the coherence score
  coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
  coherence_lda = coherence_model_lda.get_coherence()
  print('\nCoherence Score: ', coherence_lda)
Output:
Coherence Score: 0.41369518132594835
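
The section title also mentions perplexity. Gensim's LDA models expose a log_perplexity() method that returns a per-word likelihood bound, which can be reported alongside coherence; a minimal optional sketch:

  # Compute the per-word likelihood bound (often reported as perplexity);
  # a useful sanity check, though it does not always track human-judged topic quality
  print('\nLog perplexity: ', lda_model.log_perplexity(corpus))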

Hyperparameter Tuning
First, let's differentiate between model hyperparameters and model parameters:

Model hyperparameters can be thought of as settings for a machine learning algorithm that are tuned by the data scientist before training. Examples would be the number of trees in a random forest or, in our case, the number of topics K.

Model parameters can be thought of as what the model learns during training, such as the weights for each word in a given topic.

Now that we have the baseline coherence score for the default LDA model, let’s perform a series of sensitivity tests to help determine the following model hyperparameters:
* Number of Topics (K)
* Dirichlet hyperparameter alpha: Document-Topic Density
* Dirichlet hyperparameter beta: Word-Topic Density

We'll perform these tests in sequence, one parameter at a time, keeping the others constant, and run them over the validation corpus sets (here only the full corpus is used; the clipped subsets are commented out in the code below). We'll use C_v as our metric for performance comparison. Below is the helper function to calculate the score:
  # Supporting function to compute C_v coherence for a given parameter combination
  def compute_coherence_values(corpus, dictionary, k, a, b):
      lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                             id2word=dictionary,
                                             num_topics=k,
                                             random_state=100,
                                             chunksize=100,
                                             passes=10,
                                             alpha=a,
                                             eta=b,
                                             per_word_topics=True)

      coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized,
                                           dictionary=dictionary, coherence='c_v')

      return coherence_model_lda.get_coherence()
Let's call the function and iterate it over the range of topics, alpha, and beta parameter values (this took about 40 minutes on my laptop):
  %%time
  import numpy as np
  import tqdm

  grid = {}
  grid['Validation_Set'] = {}

  # Topics range
  min_topics = 2
  max_topics = 11
  step_size = 1
  topics_range = range(min_topics, max_topics, step_size)

  # Alpha parameter
  alpha = list(np.arange(0.01, 1, 0.3))
  alpha.append('symmetric')
  alpha.append('asymmetric')

  # Beta parameter
  beta = list(np.arange(0.01, 1, 0.3))
  beta.append('symmetric')

  # Validation sets (the clipped subsets are commented out; only the full corpus is used)
  num_of_docs = len(corpus)
  corpus_sets = [# gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.25)),
                 # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.5)),
                 # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)),
                 corpus]
  corpus_title = ['100% Corpus']

  model_results = {'Validation_Set': [],
                   'Topics': [],
                   'Alpha': [],
                   'Beta': [],
                   'Coherence': []
                  }

  # Can take a long time to run
  if 1 == 1:
      loop_num = len(corpus_sets) * len(topics_range) * len(alpha) * len(beta)
      pbar = tqdm.tqdm(total=loop_num)

      # Iterate through validation corpuses
      for i in range(len(corpus_sets)):
          # Iterate through number of topics
          for k in topics_range:
              # Iterate through alpha values
              for a in alpha:
                  # Iterate through beta values
                  for b in beta:
                      # Get the coherence score for the given parameters
                      cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, k=k, a=a, b=b)
                      # Save the model results
                      model_results['Validation_Set'].append(corpus_title[i])
                      model_results['Topics'].append(k)
                      model_results['Alpha'].append(a)
                      model_results['Beta'].append(b)
                      model_results['Coherence'].append(cv)

                      pbar.update(1)
      pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
      pbar.close()
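The loop above also writes the results to lda_tuning_results.csv. As a small optional convenience (assuming the same file path), the saved results can be reloaded and ranked in a later session, e.g.:

  # Reload and rank the saved tuning results; note that mixed alpha/beta columns
  # (numbers plus 'symmetric'/'asymmetric') come back from the CSV as strings
  tuning_df = pd.read_csv('lda_tuning_results.csv')
  print(tuning_df.sort_values('Coherence', ascending=False).head())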
Investigate Results
Let's start by determining the optimal number of topics. The chart below outlines the coherence score C_v for different numbers of topics, with fixed alpha = 0.01 and beta = 0.01:
  # Prepare the data for drawing the chart
  target_alpha = 0.01
  target_beta = 0.01
  topic_nums = list(range(2, 11))
  target_co_pos_set = set()
  for i, t in enumerate(zip(model_results['Alpha'], model_results['Beta'])):
      if t[0] == target_alpha and t[1] == target_beta:
          target_co_pos_set.add(i)

  coherences = []
  for i, co in enumerate(model_results['Coherence']):
      if i in target_co_pos_set:
          coherences.append(co)

  for topic_num, coherence in zip(topic_nums, coherences):
      print("Topic number={} with coherence value={:.02f}".format(topic_num, coherence))
Output:
Topic number=2 with coherence value=0.27
Topic number=3 with coherence value=0.34
Topic number=4 with coherence value=0.42
Topic number=5 with coherence value=0.44
Topic number=6 with coherence value=0.45
Topic number=7 with coherence value=0.38
Topic number=8 with coherence value=0.39
Topic number=9 with coherence value=0.46
Topic number=10 with coherence value=0.39
Let's draw a chart based on the above collection:
  import matplotlib.pyplot as plt

  plt.figure(figsize=(10, 6), dpi=100, linewidth=2)
  plt.plot(topic_nums, coherences, 's-', color='r', label="Coherence score (alpha=beta=0.01)")
  plt.title("Topic Coherence: Determining optimal topic number", x=0.5, y=1.03)
  plt.xticks(fontsize=7)
  plt.yticks(fontsize=7)
  plt.xlabel("Number of topics", fontsize=10, labelpad=15)
  plt.ylabel("Coherence score", fontsize=10, labelpad=20)

  plt.legend(loc="best", fontsize=10)
  plt.show()


Since the coherence score seems to keep increasing with the number of topics, it may make better sense to pick the model that gives the highest C_v before it flattens out or drops sharply. In this case, we picked K=9. Next, we want to select the optimal alpha and beta parameters. While there are more sophisticated approaches to the selection process, for this tutorial we simply choose the values that yielded the maximum C_v score for K=9:
  target_topic_num = 9
  target_collection = []
  for k, a, b, c in zip(model_results['Topics'], model_results['Alpha'], model_results['Beta'], model_results['Coherence']):
      if k == target_topic_num:
          target_collection.append((a, b, c))

  target_collection = sorted(target_collection, key=lambda t: t[2], reverse=True)
  target_collection[:10]
Output:
[(0.01, 0.61, 0.4668364721990676),
(0.31, 0.61, 0.4668364721990676),
('symmetric', 0.61, 0.4668364721990676),
(0.01, 0.31, 0.4659944830252864),
(0.01, 'symmetric', 0.4659944830252864),
(0.31, 0.31, 0.4659944830252864),
(0.31, 'symmetric', 0.4659944830252864),
('symmetric', 0.31, 0.4659944830252864),
('symmetric', 'symmetric', 0.4659944830252864),
('asymmetric', 0.01, 0.4659944830252864)]

Output the result in a table:
  import plotly.graph_objects as go

  a_values = []
  b_values = []
  c_values = []
  for a, b, c in target_collection[:10]:
      a_values.append(a)
      b_values.append(b)
      c_values.append(c)

  fig = go.Figure(data=[
      go.Table(header=dict(values=['Alpha', 'Beta', 'Coherence']),
               cells=dict(values=[a_values, b_values, c_values]))
  ])
  fig.show()


That yields an approximately 13% improvement over the baseline score with Alpha=0.01, Beta=0.61 and K=9:
  best_co = target_collection[0][2]
  improve_pert = (best_co - coherence_lda) * 100 / coherence_lda
  print("Coherence score is improved by {:.01f}%".format(improve_pert))
Output:
Coherence score is improved by 12.8%
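
The outline above also lists a "Final Model" step. As a minimal sketch, assuming the tuned values found here (K=9, alpha=0.01, eta=0.61), the final model could be trained and inspected like this (final_lda_model is just an illustrative name):

  # Train the final model with the tuned hyperparameters (K=9, alpha=0.01, eta=0.61)
  final_lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                               id2word=id2word,
                                               num_topics=9,
                                               random_state=100,
                                               chunksize=100,
                                               passes=10,
                                               alpha=0.01,
                                               eta=0.61,
                                               per_word_topics=True)

  # Inspect the resulting topics
  pprint(final_lda_model.print_topics())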

Closing Notes
We started with understanding why evaluating the topic model is essential. Next, we reviewed existing methods and scratched the surface of topic coherence, along with the available coherence measures. Then we built a default LDA model using Gensim implementation to establish the baseline coherence score and reviewed practical ways to optimize the LDA hyperparameters.

Hopefully, this article has managed to shed some light on the underlying topic-model evaluation strategies and the intuition behind them.
