Question
I am running the test script below:
test.py:

#!/usr/bin/env python
import spacy
import nltk

# Load spacy's English-language models
en_nlp = spacy.load('en')

# Instantiate nltk's Porter stemmer
stemmer = nltk.stem.PorterStemmer()

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Define function to compare WordNet lemmatization with Porter stemming
# (spacy is used only for tokenization)
def compare_normalization(doc):
    # tokenize document in spacy
    doc_spacy = en_nlp(doc)
    # print lemmas found by the WordNet lemmatizer
    print("Lemmatization:")
    print([wordnet_lemmatizer.lemmatize(token.text) for token in doc_spacy])
    # print stems found by the Porter stemmer
    print("Stemming:")
    print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])

compare_normalization(u"Our meeting today was worse than yesterday, "
                      "I'm scared of meeting the clients tomorrow.")
How-To
What ended up working for me was creating an 'nltk_data' directory in the application's folder itself, downloading the corpus to that directory, and adding a line to my code that lets nltk know to look in that directory.
Step 1: Enter the nltk downloader
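The downloader can be launched from an interactive Python session; it opens a GUI window where available, or a text menu on headless machines. A minimal sketch:

import nltk

# Open the NLTK downloader UI (GUI where available, text menu otherwise)
nltk.download()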
Step 2: Download Corpus
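In the downloader UI, set the download directory to the nltk_data folder inside your application, then select and download the wordnet package. The same thing can be done non-interactively from the shell; a sketch, reusing the placeholder path from below:

python -m nltk.downloader -d whatever_the_absolute_path_to_myapp_is/nltk_data wordnet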
Or in one step from Python code:
nltk.download("wordnet", "whatever_the_absolute_path_to_myapp_is/nltk_data/")
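(The second argument to nltk.download is the download directory, so this downloads wordnet straight into the app-local folder.)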
nltk looks for data, resources, etc. in the locations listed in the nltk.data.path variable. All you need to do is add nltk.data.path.append('whatever_the_absolute_path_to_myapp_is/nltk_data/') to the Python file that actually uses nltk, and it will look for corpora, tokenizers, and such in that directory in addition to the default paths.
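Putting it together for the script in the question, the append just has to happen before the WordNet lemmatizer is first used. A minimal sketch, reusing the same placeholder path:

import nltk

# Make nltk also search the app-local nltk_data directory
nltk.data.path.append('whatever_the_absolute_path_to_myapp_is/nltk_data/')

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# With the path appended, the wordnet corpus is found and no LookupError is raised
print(wordnet_lemmatizer.lemmatize('meetings'))  # -> 'meeting'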