8. NLTK

The Natural Language Toolkit, or NLTK, is an API for processing text data. Many Natural Language Processing (NLP) tasks can be accomplished with NLTK.
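
NLTK keeps its corpora and trained models as separate downloads. The examples below assume the relevant resources are already installed; a minimal setup sketch is shown here (resource names may vary slightly across NLTK versions).

import nltk

# One-time downloads of the NLTK data used in this section.
nltk.download('stopwords')                   # stop word lists
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('wordnet')                     # lexicon behind WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # default POS tagger
nltk.download('tagsets')                     # Penn Treebank tag descriptions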

[1]:
text = [
    'Data Science from Scratch: First Principles with Python',
    'Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking',
    'Practical Statistics for Data Scientists',
    'Build a Career in Data Science',
    'Python Data Science Handbook',
    'Storytelling with Data: A Data Visualization Guide for Business Professionals',
    'R for Data Science: Import, Tidy, Transform, Visualize, and Model Data',
    'Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control',
    'A Hands-On Introduction to Data Science',
    'Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud',
    'How Finance Works: The HBR Guide to Thinking Smart About the Numbers',
    'The Intelligent Investor: The Definitive Book on Value Investing. A Book of Practical Counsel',
    'Introduction to Finance: Markets, Investments, and Financial Management',
    'Python for Finance: Mastering Data-Driven Finance',
    'The Infographic Guide to Personal Finance: A Visual Reference for Everything You Need to Know',
    'Personal Finance For Dummies',
    'Corporate Finance For Dummies',
    'Lords of Finance: The Bankers Who Broke the World',
    'Real Estate Finance & Investments',
    'Real Estate Finance and Investments Risks and Opportunities'
]

8.1. Stop words

It is common to remove stop words from a corpus of documents. NLTK ships with stop word lists for several languages. Below, we show how to use the English stop words from NLTK with scikit-learn's CountVectorizer.
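
If you just want to see what NLTK provides, the stop words corpus reader can be inspected directly; a quick sketch (outputs not shown).

from nltk.corpus import stopwords

# Languages that ship with a stop word list, and a peek at the English one.
print(stopwords.fileids())
print(len(stopwords.words('english')))
print(stopwords.words('english')[:10])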

[2]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

stop_words = set(stopwords.words('english'))

vectorizer = CountVectorizer(binary=True, stop_words=stop_words)
X = vectorizer.fit_transform(text).todense()

# Note: in newer scikit-learn, this method is get_feature_names_out().
bool_df = pd.DataFrame(X, columns=vectorizer.get_feature_names())

print(bool_df.shape)
bool_df.columns
(20, 74)
[2]:
Index(['ai', 'analytic', 'bankers', 'big', 'book', 'broke', 'build',
       'business', 'career', 'cloud', 'computer', 'control', 'corporate',
       'counsel', 'data', 'definitive', 'driven', 'dummies', 'dynamical',
       'engineering', 'estate', 'everything', 'finance', 'financial', 'first',
       'guide', 'handbook', 'hands', 'hbr', 'import', 'infographic',
       'intelligent', 'intro', 'introduction', 'investing', 'investments',
       'investor', 'know', 'learning', 'lords', 'machine', 'management',
       'markets', 'mastering', 'mining', 'model', 'need', 'numbers',
       'opportunities', 'personal', 'practical', 'principles', 'professionals',
       'program', 'python', 'real', 'reference', 'risks', 'science',
       'scientists', 'scratch', 'smart', 'statistics', 'storytelling',
       'systems', 'thinking', 'tidy', 'transform', 'value', 'visual',
       'visualization', 'visualize', 'works', 'world'],
      dtype='object')

8.2. Tokenization

NLTK can also tokenize text into words. Notice how characters such as : and & are separated into their own tokens?
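
Compare this with a naive str.split(), which leaves punctuation glued to the neighboring word; a small sketch using the first title from the list above.

import nltk

# 'Scratch:' stays fused with str.split(), but word_tokenize splits off the ':'.
print(text[0].split())
print(nltk.word_tokenize(text[0]))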

[3]:
import nltk

for title in text:
    print(nltk.word_tokenize(title))
['Data', 'Science', 'from', 'Scratch', ':', 'First', 'Principles', 'with', 'Python']
['Data', 'Science', 'for', 'Business', ':', 'What', 'You', 'Need', 'to', 'Know', 'about', 'Data', 'Mining', 'and', 'Data-Analytic', 'Thinking']
['Practical', 'Statistics', 'for', 'Data', 'Scientists']
['Build', 'a', 'Career', 'in', 'Data', 'Science']
['Python', 'Data', 'Science', 'Handbook']
['Storytelling', 'with', 'Data', ':', 'A', 'Data', 'Visualization', 'Guide', 'for', 'Business', 'Professionals']
['R', 'for', 'Data', 'Science', ':', 'Import', ',', 'Tidy', ',', 'Transform', ',', 'Visualize', ',', 'and', 'Model', 'Data']
['Data-Driven', 'Science', 'and', 'Engineering', ':', 'Machine', 'Learning', ',', 'Dynamical', 'Systems', ',', 'and', 'Control']
['A', 'Hands-On', 'Introduction', 'to', 'Data', 'Science']
['Intro', 'to', 'Python', 'for', 'Computer', 'Science', 'and', 'Data', 'Science', ':', 'Learning', 'to', 'Program', 'with', 'AI', ',', 'Big', 'Data', 'and', 'The', 'Cloud']
['How', 'Finance', 'Works', ':', 'The', 'HBR', 'Guide', 'to', 'Thinking', 'Smart', 'About', 'the', 'Numbers']
['The', 'Intelligent', 'Investor', ':', 'The', 'Definitive', 'Book', 'on', 'Value', 'Investing', '.', 'A', 'Book', 'of', 'Practical', 'Counsel']
['Introduction', 'to', 'Finance', ':', 'Markets', ',', 'Investments', ',', 'and', 'Financial', 'Management']
['Python', 'for', 'Finance', ':', 'Mastering', 'Data-Driven', 'Finance']
['The', 'Infographic', 'Guide', 'to', 'Personal', 'Finance', ':', 'A', 'Visual', 'Reference', 'for', 'Everything', 'You', 'Need', 'to', 'Know']
['Personal', 'Finance', 'For', 'Dummies']
['Corporate', 'Finance', 'For', 'Dummies']
['Lords', 'of', 'Finance', ':', 'The', 'Bankers', 'Who', 'Broke', 'the', 'World']
['Real', 'Estate', 'Finance', '&', 'Investments']
['Real', 'Estate', 'Finance', 'and', 'Investments', 'Risks', 'and', 'Opportunities']

8.3. Stemming

Stemming reduces a word to its base form. We typically stem words after tokenization. Notice how the stemmer lowercases the words (although very short tokens such as 'A', 'R', and 'AI' pass through unchanged)?
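
Here is a quick sketch of the stemmer on a few individual words before applying it to the titles (expected stems noted in the comments).

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stems are lowercased and need not be dictionary words.
print(stemmer.stem('Visualization'))  # visual
print(stemmer.stem('Storytelling'))   # storytel
print(stemmer.stem('studies'))        # studi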

[4]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

for title in text:
    print([stemmer.stem(token) for token in nltk.word_tokenize(title)])
['data', 'scienc', 'from', 'scratch', ':', 'first', 'principl', 'with', 'python']
['data', 'scienc', 'for', 'busi', ':', 'what', 'you', 'need', 'to', 'know', 'about', 'data', 'mine', 'and', 'data-analyt', 'think']
['practic', 'statist', 'for', 'data', 'scientist']
['build', 'a', 'career', 'in', 'data', 'scienc']
['python', 'data', 'scienc', 'handbook']
['storytel', 'with', 'data', ':', 'A', 'data', 'visual', 'guid', 'for', 'busi', 'profession']
['R', 'for', 'data', 'scienc', ':', 'import', ',', 'tidi', ',', 'transform', ',', 'visual', ',', 'and', 'model', 'data']
['data-driven', 'scienc', 'and', 'engin', ':', 'machin', 'learn', ',', 'dynam', 'system', ',', 'and', 'control']
['A', 'hands-on', 'introduct', 'to', 'data', 'scienc']
['intro', 'to', 'python', 'for', 'comput', 'scienc', 'and', 'data', 'scienc', ':', 'learn', 'to', 'program', 'with', 'AI', ',', 'big', 'data', 'and', 'the', 'cloud']
['how', 'financ', 'work', ':', 'the', 'hbr', 'guid', 'to', 'think', 'smart', 'about', 'the', 'number']
['the', 'intellig', 'investor', ':', 'the', 'definit', 'book', 'on', 'valu', 'invest', '.', 'A', 'book', 'of', 'practic', 'counsel']
['introduct', 'to', 'financ', ':', 'market', ',', 'invest', ',', 'and', 'financi', 'manag']
['python', 'for', 'financ', ':', 'master', 'data-driven', 'financ']
['the', 'infograph', 'guid', 'to', 'person', 'financ', ':', 'A', 'visual', 'refer', 'for', 'everyth', 'you', 'need', 'to', 'know']
['person', 'financ', 'for', 'dummi']
['corpor', 'financ', 'for', 'dummi']
['lord', 'of', 'financ', ':', 'the', 'banker', 'who', 'broke', 'the', 'world']
['real', 'estat', 'financ', '&', 'invest']
['real', 'estat', 'financ', 'and', 'invest', 'risk', 'and', 'opportun']

Tokenization, stemming, and removing stop words all go hand-in-hand, but it's tricky to use NLTK's components out-of-the-box with scikit-learn. Below is a demonstration that brings together all three steps. Notice how we have two stop word variables?

  • en_stop_words is used to remove stop words (and punctuation) during tokenization, before stemming

  • stop_words is the stop word list itself, re-tokenized with NLTK so the entries passed to CountVectorizer line up with the tokenizer's output (see the quick check below)
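
As a quick check of why the re-tokenization matters: NLTK's tokenizer splits contractions into separate tokens, so a raw stop word string like "isn't" would never match the tokens produced by the custom tokenizer.

import nltk

# "isn't" becomes two tokens, so the stop word list is re-tokenized to line up
# with what the custom tokenizer emits.
print(nltk.word_tokenize("isn't"))  # ['is', "n't"]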

[5]:
import string

stemmer = PorterStemmer()

# Stop words plus punctuation, removed during tokenization (before stemming).
en_stop_words = set(stopwords.words('english') + list(string.punctuation))
# The stop words themselves, re-tokenized with NLTK to match the tokenizer's output.
stop_words = nltk.word_tokenize(' '.join(nltk.corpus.stopwords.words('english')))

def tokenize(s):
    # Tokenize, drop stop words and punctuation, then stem what remains.
    return [stemmer.stem(t) for t in nltk.word_tokenize(s) if t not in en_stop_words]

vectorizer = CountVectorizer(binary=True, tokenizer=tokenize, stop_words=stop_words)
vectorizer.fit_transform(text)
[5]:
<20x71 sparse matrix of type '<class 'numpy.int64'>'
        with 115 stored elements in Compressed Sparse Row format>

8.4. Lemmatization

Lemmatization is another way to reduce a word to its base form. The key difference between stemming and lemmatizing is that the former may produce something that is not an actual word, while the latter returns a valid dictionary form (the lemma). Lemmatization is often preferred over stemming for this reason.
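
As a small illustration, here is a sketch comparing the two on individual words (the outputs noted in the comments are what NLTK typically returns). Note that WordNetLemmatizer treats every token as a noun unless a part of speech is passed, and its WordNet lookups are case-sensitive, which is why the capitalized title tokens in the output below come back largely unchanged.

from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# The stem need not be a dictionary word; the lemma is.
print(stemmer.stem('studies'), lemmatizer.lemmatize('studies'))                      # studi study
# Nouns are assumed unless pos= says otherwise.
print(lemmatizer.lemmatize('thinking'), lemmatizer.lemmatize('thinking', pos='v'))   # thinking think
# Lookups are case-sensitive, so capitalized tokens often pass through unchanged.
print(lemmatizer.lemmatize('Investments'), lemmatizer.lemmatize('investments'))      # Investments investment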

[6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for title in text:
    print([lemmatizer.lemmatize(token) for token in nltk.word_tokenize(title)])
['Data', 'Science', 'from', 'Scratch', ':', 'First', 'Principles', 'with', 'Python']
['Data', 'Science', 'for', 'Business', ':', 'What', 'You', 'Need', 'to', 'Know', 'about', 'Data', 'Mining', 'and', 'Data-Analytic', 'Thinking']
['Practical', 'Statistics', 'for', 'Data', 'Scientists']
['Build', 'a', 'Career', 'in', 'Data', 'Science']
['Python', 'Data', 'Science', 'Handbook']
['Storytelling', 'with', 'Data', ':', 'A', 'Data', 'Visualization', 'Guide', 'for', 'Business', 'Professionals']
['R', 'for', 'Data', 'Science', ':', 'Import', ',', 'Tidy', ',', 'Transform', ',', 'Visualize', ',', 'and', 'Model', 'Data']
['Data-Driven', 'Science', 'and', 'Engineering', ':', 'Machine', 'Learning', ',', 'Dynamical', 'Systems', ',', 'and', 'Control']
['A', 'Hands-On', 'Introduction', 'to', 'Data', 'Science']
['Intro', 'to', 'Python', 'for', 'Computer', 'Science', 'and', 'Data', 'Science', ':', 'Learning', 'to', 'Program', 'with', 'AI', ',', 'Big', 'Data', 'and', 'The', 'Cloud']
['How', 'Finance', 'Works', ':', 'The', 'HBR', 'Guide', 'to', 'Thinking', 'Smart', 'About', 'the', 'Numbers']
['The', 'Intelligent', 'Investor', ':', 'The', 'Definitive', 'Book', 'on', 'Value', 'Investing', '.', 'A', 'Book', 'of', 'Practical', 'Counsel']
['Introduction', 'to', 'Finance', ':', 'Markets', ',', 'Investments', ',', 'and', 'Financial', 'Management']
['Python', 'for', 'Finance', ':', 'Mastering', 'Data-Driven', 'Finance']
['The', 'Infographic', 'Guide', 'to', 'Personal', 'Finance', ':', 'A', 'Visual', 'Reference', 'for', 'Everything', 'You', 'Need', 'to', 'Know']
['Personal', 'Finance', 'For', 'Dummies']
['Corporate', 'Finance', 'For', 'Dummies']
['Lords', 'of', 'Finance', ':', 'The', 'Bankers', 'Who', 'Broke', 'the', 'World']
['Real', 'Estate', 'Finance', '&', 'Investments']
['Real', 'Estate', 'Finance', 'and', 'Investments', 'Risks', 'and', 'Opportunities']

8.5. Part-of-speech tagging

NLTK can also be used for part-of-speech (POS) tagging.

[7]:
tokens = nltk.word_tokenize(text[0])
tags = nltk.pos_tag(tokens)
print(tags)
[('Data', 'NNS'), ('Science', 'NN'), ('from', 'IN'), ('Scratch', 'NN'), (':', ':'), ('First', 'JJ'), ('Principles', 'NNS'), ('with', 'IN'), ('Python', 'NNP')]

If you are wondering what these tags mean, the borrowed code below can help decipher them. The tags are based on the Penn Treebank tag set.

[8]:
from nltk.data import load as nltk_load

tag_dict = nltk_load('help/tagsets/upenn_tagset.pickle')
for k, v in tag_dict.items():
    print(f'{k} : {v[0]}')
LS : list item marker
TO : "to" as preposition or infinitive marker
VBN : verb, past participle
'' : closing quotation mark
WP : WH-pronoun
UH : interjection
VBG : verb, present participle or gerund
JJ : adjective or numeral, ordinal
VBZ : verb, present tense, 3rd person singular
-- : dash
VBP : verb, present tense, not 3rd person singular
NN : noun, common, singular or mass
DT : determiner
PRP : pronoun, personal
: : colon or ellipsis
WP$ : WH-pronoun, possessive
NNPS : noun, proper, plural
PRP$ : pronoun, possessive
WDT : WH-determiner
( : opening parenthesis
) : closing parenthesis
. : sentence terminator
, : comma
`` : opening quotation mark
$ : dollar
RB : adverb
RBR : adverb, comparative
RBS : adverb, superlative
VBD : verb, past tense
IN : preposition or conjunction, subordinating
FW : foreign word
RP : particle
JJR : adjective, comparative
JJS : adjective, superlative
PDT : pre-determiner
MD : modal auxiliary
VB : verb, base form
WRB : Wh-adverb
NNP : noun, proper, singular
EX : existential there
NNS : noun, common, plural
SYM : symbol
CC : conjunction, coordinating
CD : numeral, cardinal
POS : genitive marker
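
If the fine-grained Penn Treebank tags are more detail than you need, pos_tag can also map them onto the coarser universal tag set (NOUN, VERB, ADJ, ...). A sketch, assuming the 'universal_tagset' resource has been downloaded.

import nltk

# Same tokens as above, coarser tags.
tokens = nltk.word_tokenize(text[0])
print(nltk.pos_tag(tokens, tagset='universal'))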

8.6. n-grams

We can also find n-grams using NLTK. To do so, we first merge all the text data together (treating the titles as if they were one long title).

[9]:
from itertools import chain

stop_words = set(stopwords.words('english') + list(string.punctuation))

is_valid = lambda t: t not in stop_words
tokenize = lambda t: nltk.word_tokenize(t.lower())
lemmatize = lambda t: lemmatizer.lemmatize(t)

# Tokenize each title, drop stop words and punctuation, lemmatize, and flatten into one token stream.
tokens = list(chain(*[[lemmatize(t) for t in tokenize(title) if is_valid(t)] for title in text]))

BigramCollocationFinder.from_words() can then be used to find collocated word pairs, and any of the BigramAssocMeasures can be used to judge the relative strength of each collocation.

[10]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import pandas as pd

finder = BigramCollocationFinder.from_words(tokens)

measures = [
    BigramAssocMeasures.chi_sq,
    BigramAssocMeasures.dice,
    BigramAssocMeasures.fisher,
    BigramAssocMeasures.jaccard,
    BigramAssocMeasures.likelihood_ratio,
    BigramAssocMeasures.mi_like,
    BigramAssocMeasures.phi_sq,
    BigramAssocMeasures.poisson_stirling,
    BigramAssocMeasures.raw_freq,
    BigramAssocMeasures.student_t
]

results = {}
for measure in measures:
    name = measure.__name__
    result = [{'measure': name, 'w1': w1, 'w2': w2, 'score': score}
               for (w1, w2), score in finder.score_ngrams(measure)]
    result = pd.DataFrame(result).sort_values(['score'], ascending=[False]).reset_index(drop=True)
    results[name] = result

These are the top bigrams found by each association measure.

[11]:
pd.DataFrame([df.iloc[0] for _, df in results.items()])
[11]:
            measure    w1       w2       score
0            chi_sq    ai      big  124.000000
0              dice    ai      big    1.000000
0            fisher    ai      big    1.000000
0           jaccard    ai      big    1.000000
0  likelihood_ratio  data  science   26.571291
0           mi_like  data  science    2.931624
0            phi_sq    ai      big    1.000000
0  poisson_stirling  data  science   13.238306
0          raw_freq  data  science    0.056452
0         student_t  data  science    2.289124

You can use the nbest() method to retrieve the top n-grams.

[12]:
max_ngrams = 1

[f'{w1} {w2}' for w1, w2 in [finder.nbest(m, max_ngrams)[0] for m in measures]]
[12]:
['ai big',
 'ai big',
 'ai big',
 'ai big',
 'data science',
 'data science',
 'ai big',
 'data science',
 'data science',
 'data science']
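
Because rare pairs such as 'ai big' (seen only once) tend to dominate measures like chi-squared, it can help to prune low-frequency bigrams before scoring. The finder supports this via apply_freq_filter(); a sketch on a fresh finder so the one above is left untouched.

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Keep only bigrams observed at least twice, then rank what remains.
filtered_finder = BigramCollocationFinder.from_words(tokens)
filtered_finder.apply_freq_filter(2)
print(filtered_finder.nbest(BigramAssocMeasures.likelihood_ratio, 5))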

8.7. Frequency distribution

You can use the FreqDist class to count word frequencies.

[13]:
stop_words = set(stopwords.words('english') + list(string.punctuation))

is_valid = lambda t: t not in stop_words
tokenize = lambda t: nltk.word_tokenize(t.lower())
lemmatize = lambda t: lemmatizer.lemmatize(t)

# Tokenize each title, drop stop words and punctuation, lemmatize, and flatten into one token stream.
tokens = list(chain(*[[lemmatize(t) for t in tokenize(title) if is_valid(t)] for title in text]))

frequencies = nltk.FreqDist(tokens)
s = pd.Series([v for _, v in frequencies.items()], frequencies.keys()).sort_values(ascending=False)
_ = s.plot(kind='bar', figsize=(15, 4))
(Figure: bar chart of token frequencies across all of the titles, sorted in descending order.)
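
Since FreqDist behaves like a Counter, you can also get the top tokens without plotting; a quick sketch using the frequencies object from the cell above.

# Top 10 tokens by raw count; 'data', 'science', and 'finance' dominate these titles.
print(frequencies.most_common(10))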