Introduction

About Me

Outline of Talk

  • Introduce NLTK
  • Basic Tools of NLTK
  • Integrating NLTK with SciKit-Learn

Housekeeping


Downloads & Requirements

  • Code for examples is available on Github
    • www.github.com/notthatbreezy/dataphilly-talk
  • Slides are available on my website
    • www.cmbrown.org/static/media/dataphilly.html
  • Using the Anaconda Python distribution from Continuum Analytics

What is Natural Language Processing?

  • Combination of computer science, linguistics, and artificial intelligence
  • Have computers perform useful tasks involving human language
    • Spam vs. Ham
    • Question answering
    • Finding relevant content (clustering/topic discovery)
    • Classifying documents to an existing topic structure
    • Document summarization
    • Analyzing sentiment in social media
  • Division between probabilistic/statistical and rule-based approaches (see the Norvig vs. Chomsky debate)

Why Natural Language Toolkit?

  • Python package designed to enable quick and easy prototyping by providing key functionality for NLP
  • Includes modules that accomplish the following:
    • Dealing with large corpora
    • Processing strings
    • Part-of-Speech Tagging
    • Parsing and Chunking
  • Designed with modularity in mind

Getting Started with NLTK (1/2)

Install NLTK data (corpora, stemmers, taggers)
  • Over 70 corpora available for training, prototyping, testing, etc.
  • Command Line: python -m nltk.downloader all
  • From Python: import nltk; nltk.download()
NLTK Download GUI
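
A quick way to confirm the data downloaded correctly is to load one of the corpora (a minimal sanity check; the Brown corpus is used again on the next slide):

from nltk.corpus import brown

# If the download worked, this prints the Brown corpus categories
print brown.categories()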

Getting Started with NLTK (2/2)

import nltk
from nltk.corpus import brown

modals = ['can', 'could', 'may', 'might', 'must', 'will']
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
cfd = nltk.ConditionalFreqDist(
      (genre, word)
      for genre in brown.categories()
      for word in brown.words(categories=genre))
cfd.tabulate(conditions=genres, samples=modals)

#                 can could  may might must will
#           news   93   86   66   38   50  389
#       religion   82   59   78   12   54   71
#        hobbies  268   58  131   22   83  264
#science_fiction   16   49    4   12    8   16
#        romance   74  193   11   51   45   43
#          humor   16   30    8    8    9   13

Tokenizers and Stemmers (1/2)

NLTK has built-in utilities for tokenizing, removing stopwords, and stemming words

from nltk import ngrams, wordpunct_tokenize
s = "The Democrat admitted he's eying the mayor's race in a 
... attempts to stay out of the limelight 
yea over the past two years-until now."

tokens = wordpunct_tokenize(s)
print tokens
['The', 'Democrat', 'admitted', 'he', ... 'Hillary', 
'Clinton', ',', 'and', 'their', 'attempts', 'to', 'stay', 
'out', 'of', 'the', 'limelight', 'over', 'the', 'past', 
'two', 'years', '-', 'until', 'now', '.']
# ngrams
print ngrams(tokens, 2)
[('The', 'Democrat'), ('Democrat', 'admitted'), ('admitted', 'he'), 
... ('the', 'past'), ('past', 'two'), ('two', 'years'), ('years', '-'), 
('-', 'until'), ('until', 'now'), ('now', '.')]
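
The stopword removal mentioned above works the same way; a minimal sketch that filters the tokens from this example against NLTK's built-in English stopword list (assumes the stopwords corpus has been downloaded):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
# Drop common words like 'the', 'of', 'to' from the token list above
filtered = [t for t in tokens if t.lower() not in stops]
print filtered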

Tokenizers and Stemmers (2/2)

Stemmers reduce words to their base form - NLTK has implementations of a few different ones

from nltk.stem.porter import PorterStemmer
... # tokens from before
tokens = wordpunct_tokenize(s)
st = PorterStemmer()
stemmed = [st.stem(w) for w in tokens]
print stemmed
['The', 'Democrat', ... , 'Secretari', 'of', 'State', 'Hillari', 
'Clinton', ',', 'and', 'their', 'attempt', 'to', 'stay', 'out', 'of', 
'the', 'limelight', 'over', 'the', 'past', 'two', 'year', '-', 'until', 
'now', '.']
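
Since NLTK ships more than one stemmer, it is easy to compare them; a short sketch swapping in the more aggressive Lancaster stemmer on the same tokens:

from nltk.stem.lancaster import LancasterStemmer

lancaster = LancasterStemmer()
# Lancaster typically truncates more aggressively than Porter
print [lancaster.stem(w) for w in tokens]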

POS Tagging

Also possible to do part-of-speech tagging

from nltk.tag import pos_tag
print pos_tag(tokens)
[('The', 'DT'), ('Democrat', 'NNP'), ('admitted', 'VBD'), ('he', 'PRP'),
... ('Hillary', 'NNP'), ('Clinton', 'NNP'), (',', ','), ('and', 'CC'), 
('their', 'PRP$'), ('attempts', 'NNS'), ('to', 'TO'), ('stay', 'VB'), 
('out', 'RP'), ('of', 'IN'), ('the', 'DT'), ('limelight', 'NN'), ('over', 'IN'), 
('the', 'DT'), ('past', 'JJ'), ('two', 'CD'), ('years', 'NNS'), 
('-', ':'), ('until', 'IN'), ('now', 'RB'), ('.', '.')]
List of tag meanings available here: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html#tab-simplified-tagset
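
One common use of the tags is filtering tokens by part of speech; a minimal sketch that keeps only the noun tokens from the tagged output above:

tagged = pos_tag(tokens)
# NN, NNS, NNP, NNPS are the noun tags in the Penn Treebank tagset
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print nouns
# e.g. ['Democrat', ..., 'Hillary', 'Clinton', 'attempts', 'limelight', 'years']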

Chunking and Diagrams

Additionally, you can go a step further and diagram 'chunks' of sentences after POS tagging

"PhillyPUG and DataPhilly are the best meetups in Philadelphia!"

from nltk.chunk import ne_chunk
...
tree = ne_chunk(pos)
Tree('S', [Tree('ORGANIZATION', [('PhillyPUG', 'NNP')]), 
('and', 'CC'), Tree('ORGANIZATION', [('DataPhilly', 'NNP')]), 
('are', 'VBP'), ('the', 'DT'), ('best', 'JJS'), ('meetups', 'NNS'),
('in', 'IN'), Tree('GSP', [('Philadelphia', 'NNP')]), ('!', '.')])
tree.draw()
NLTK Example Tree
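
The '...' above skips the tokenizing and tagging steps; a minimal end-to-end sketch of the same pipeline for the example sentence:

from nltk import word_tokenize, pos_tag
from nltk.chunk import ne_chunk

sentence = "PhillyPUG and DataPhilly are the best meetups in Philadelphia!"
pos = pos_tag(word_tokenize(sentence))  # tag each token first
tree = ne_chunk(pos)                    # then chunk named entities
tree.draw()                             # opens the diagram shown above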

Clustering Political News - Memeorandum

  • Memeorandum - a political news site that aggregates articles (at least partially) automatically
  • We'll see if we can replicate their topics with NLTK and SciKit-Learn

Initial Data

  • Downloaded all articles Monday morning (code available in repo)
  • 104 articles, spread across 12 categories (probably not ideal)
  • Goal here is to show how to integrate NLTK with SciKit-Learn

Cleaning up Data

import nltk
from readability.readability import Document
...
def clean_html_directory(source_directory, target_directory):
    for f in html_files:
        ...
        relevant = Document(html_text).summary()
        cleaned = nltk.clean_html(relevant)
        ...
        output.write(cleaned.encode('utf-8', 'ignore'))
        output.close()
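
A hypothetical call to the helper above (the directory names are placeholders, not paths from the talk's repo):

# Read raw article HTML from one directory, write cleaned plain text to another
clean_html_directory('memeorandum_html/', 'memeorandum_clean/')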
      

Pre-Processing Pages

For pre-processing I will experiment with removing stopwords, using ngrams 1 to 3 tokens long, and stemming words

from collections import defaultdict

from nltk import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.util import ingrams

def _remove_stopwords_lowercase(words):
    return [w.lower() for w in words if not w.lower()
        in stopwords.words('english') and w.isalpha()]

def _stemmer(words):
    st = PorterStemmer()
    return [st.stem(w) for w in words]

def make_ngram_dict(string, n_min=1, n_max=1):
    tokens = wordpunct_tokenize(string)
    lowercase = _remove_stopwords_lowercase(tokens)
    stemmed = _stemmer(lowercase)
    feature_dict = defaultdict(int)
    while (n_min <= n_max):
        for ngram in ingrams(stemmed, n_min):
            feature_dict[' '.join(ngram)] += 1
        n_min += 1
    return feature_dict
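
As a quick check, here is make_ngram_dict run on a made-up sentence with unigrams and bigrams (the keys look truncated because they are stemmed):

features = make_ngram_dict("The mayor praised the new transit plan", n_min=1, n_max=2)
print dict(features)
# roughly: {'mayor': 1, 'prais': 1, 'new': 1, 'transit': 1, 'plan': 1,
#           'mayor prais': 1, 'prais new': 1, 'new transit': 1, 'transit plan': 1}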

Using DictVectorizer

Transforms a list of feature-value mappings to vectors

from sklearn.feature_extraction import DictVectorizer

feature_dicts = [{u'feel': 1, u'heard': 1, u'asham': 2, u'pang': 1,...},
                 {u'farhan': 1, u'global': 1, u'month': 2, u'four': 1,...},
                 ...]
vec = DictVectorizer()
arr = vec.fit_transform(feature_dicts).toarray()
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 3.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
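
The vectorizer also remembers which column corresponds to which ngram, which is handy for inspecting clusters later:

# Map column indices back to the ngram features they represent
feature_names = vec.get_feature_names()
print feature_names[:5]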

Converting to Tf-idf Scores

scikit-learn can then take this array and transform it into tf-idf weighted scores

Tf-idf scores assign more weight to words that have a high term frequency and low document frequency

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()
unitfidf = tfidf_trans.fit_transform(arr).toarray()
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.07727961,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
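
The two steps can also be chained with scikit-learn's Pipeline (a sketch of an alternative arrangement, not how the talk's code is organized):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Vectorize the feature dicts and tf-idf weight them in one call
pipeline = Pipeline([('vec', DictVectorizer()),
                     ('tfidf', TfidfTransformer())])
tfidf_matrix = pipeline.fit_transform(feature_dicts)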
      

Pick your favorite Clustering Algorithm

Now the data is in a format you can use for clustering

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=12)
results = kmeans.fit_predict(unitfidf)
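
To see what ended up together, the predicted labels can be grouped back with the article filenames (a sketch; filenames is assumed to be the list of documents in the same order as the rows of unitfidf):

from collections import defaultdict

clusters = defaultdict(list)
for filename, label in zip(filenames, results):  # filenames: assumed document list
    clusters[label].append(filename)
for label, members in sorted(clusters.items()):
    print label, len(members), members[:3]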

Is there a better way though?

Extracting Named Entities (1/2)

You can also use the chunking and tree diagramming to extract named entities, which can sometimes be a better way to identify topics

from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.chunk import batch_ne_chunk

def chunk_document(document):
    sentences = sent_tokenize(document)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]
    return batch_ne_chunk(tagged_sentences, binary=True)

def extract_entities(doc_tree):
    entities = []
    for sent_trees in doc_tree:
        for t in sent_trees:
            if hasattr(t, 'node') and t.node == 'NE':
                entities.append(t[0][0])
    return entities
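
Putting the two helpers together on a single article (a sketch; document is assumed to be one of the cleaned article strings):

doc_tree = chunk_document(document)   # document: one cleaned article's text
entities = extract_entities(doc_tree)
print entities[:10]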
      

Extracting Named Entities (2/2)

Then I can use NLTK's FreqDist object to pull out the most frequently used entities

from nltk import FreqDist
entities = extract_entities(doc_tree)  # list from the extract_entities function on the previous slide
fdist = FreqDist(entities)
print "10 most common entities ({0})".format(catname)  # catname: current category name
for entity in fdist.keys()[:10]:
    print entity

Example results: Obama, Guantnamo, Yemen, Congress, Guantanamo, Yemeni, Gitmo, New, US, American

Conclusion

  • Scratched the surface of what NLTK can do
  • Further Reading: www.nltk.org/book
  • Might make sense to use NLTK for data processing, scikit-learn for machine learning

Thank You!