In [1]:
from helper import *
In [4]:
# ! pip install pandas nltk gensim pyldavis
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Python libraries needed:

  • pandas
  • nltk
    • corpora to be downloaded using nltk.download()
      • stopwords
      • wordnet
  • gensim
  • pyldavis
In [3]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[3]:
True

What does the result of a topic model look like?

  • assume you have 5 articles, each written in only 100 words
    • and you expect there to be 3 topics among the five articles

A topic model will give you

  • how strongly each article belongs to each topic, expressed as a percentage
In [4]:
article_topic
Out[4]:
          topic1  topic2  topic3
article1   70.0%   65.0%   61.0%
article2   49.0%   63.0%   34.0%
article3   76.0%   61.0%   66.0%
article4   86.0%   46.0%   17.0%
article5   22.0%   63.0%   63.0%

which words each topic contains

  • this illustration uses tf-idf with NMF to build a simple topic model (see the sketch after the table below)
  • note: a topic model like LDA generates topics through sampling techniques and iterations, so rather than saying a topic "contains" certain words, it is more accurate to say that a list of words "belongs" to a topic
In [5]:
topic_term
Out[5]:
word1 word2 word3 word4 word5 word6 word7 word8 word9 word10 ... word91 word92 word93 word94 word95 word96 word97 word98 word99 word100
topic1 Yes Yes No Yes No Yes Yes No No Yes ... Yes No No Yes No Yes Yes Yes No No
topic2 Yes Yes No No No No No No Yes No ... No Yes No Yes Yes Yes Yes No Yes No
topic3 Yes No No No Yes Yes No No No No ... Yes No No No No Yes Yes Yes Yes Yes

3 rows × 100 columns
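
To make the tf-idf + NMF alternative mentioned above concrete, here is a minimal, hypothetical sketch using scikit-learn (not used elsewhere in this notebook; the documents and parameter values are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Three made-up documents, only to illustrate the shape of the pipeline.
docs = ["the young boy saves his friend",
        "a love story set in new york",
        "a family moves to a new city"]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)                    # documents x terms tf-idf matrix

nmf = NMF(n_components=2, random_state=0).fit(X)
# Each row of nmf.components_ scores every term for one topic;
# the highest-scoring terms are the words that "belong" to that topic.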

Parameters of LDA

  • num_topics
    • specify how many topics you would like to extract from the documents
  • alpha
    • document-topic density
      • the higher the alpha, the more topics each document is assigned to, and vice versa
  • eta
    • topic-word density
      • the higher the eta, the more words each topic contains, and vice versa (a usage sketch of all three parameters follows this list)
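
A minimal sketch of how these three parameters map onto gensim's LdaModel call, using a throwaway toy corpus (the values here are illustrative, not recommendations):

from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Toy two-document corpus, only to exercise the parameters.
toy_docs = [['cat', 'dog', 'pet'], ['stock', 'market', 'trade']]
toy_dict = corpora.Dictionary(toy_docs)
toy_bow = [toy_dict.doc2bow(d) for d in toy_docs]

toy_lda = LdaModel(toy_bow,
                   num_topics=2,     # how many topics to extract
                   id2word=toy_dict,
                   alpha='auto',     # document-topic density ('auto' lets gensim learn it)
                   eta='auto',       # topic-word density ('auto' lets gensim learn it)
                   passes=10)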

data preprocessing

  • stopwords: remove general words like I, is, to
  • punctuation: remove punctuation marks
  • lemmatize: reduce related forms of a word to a common base
    • e.g.
      • am, are, is -> be
      • car, cars, car's, cars' -> car
In [6]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

stopwords = set(stopwords.words('english'))
punctuation = set(string.punctuation) 
lemmatize = WordNetLemmatizer()

def cleaning(article):
    # lowercase and drop stopwords
    one = " ".join([i for i in article.lower().split() if i not in stopwords])
    # strip punctuation characters
    two = "".join(i for i in one if i not in punctuation)
    # reduce each token to its lemma (noun lemmas by default)
    three = " ".join(lemmatize.lemmatize(i) for i in two.split())
    return three
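
A quick sanity check on a made-up sentence (the expected output is shown in the comment):

cleaning("The cars are driving to the cities!")
# expected: 'car driving city'  (stopwords and punctuation gone, nouns lemmatized)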
  • Data preparation couldn't be simpler: all you need is a list of documents.
    • The shorter each document, the less time a topic model takes to train.
    • Good starting points: tweets, e-commerce reviews, or movie reviews (a made-up example follows).
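
A toy illustration of that input shape:

# a corpus can literally be a plain list of short strings
docs = ["great phone but the battery dies fast",
        "the movie starts slow yet the ending is worth it"]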
In [7]:
import pandas as pd

df = pd.read_table('plot.tok.gt9.5000', names=['text'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 1 columns):
text    5000 non-null object
dtypes: object(1)
memory usage: 39.1+ KB
In [8]:
df.head(3)
Out[8]:
text
0 the movie begins in the past where a young boy...
1 emerging from the human psyche and showing cha...
2 spurning her mother's insistence that she get ...
In [9]:
text = df.applymap(cleaning)['text']
text_list = [i.split() for i in text]
len(text_list)
Out[9]:
5000
In [10]:
text_list[0]
Out[10]:
[u'movie',
 u'begin',
 u'past',
 u'young',
 u'boy',
 u'named',
 u'sam',
 u'attempt',
 u'save',
 u'celebi',
 u'hunter']

add logging to record model-fitting data during training

In [11]:
from time import time
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO,
                   filename='running.log',filemode='w')

All the text documents combined are known as the corpus.

  • To run any mathematical model on a text corpus, it is good practice to convert it into a matrix representation.
  • The LDA model looks for repeating term patterns in the entire document-term (DT) matrix.
  • Python provides many great libraries for text mining; "gensim" is one clean and beautiful library for handling text data, and it is scalable, robust and efficient.
  • The following code shows how to convert a corpus into a document-term matrix.

build dictionary

  • and save for future use
In [12]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(text_list)
dictionary.save('dictionary.dict')
print dictionary
Dictionary(13417 unique tokens: [u'halebopp', u'yellow', u'narcotic', u'four', u'billing']...)
C:\Program Files\Anaconda2\lib\site-packages\gensim\utils.py:843: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
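
Optionally (not done in this notebook), very rare and very common tokens can be pruned before building the corpus; the thresholds below are illustrative:

# keep tokens appearing in at least 5 documents but in no more than half of them
# dictionary.filter_extremes(no_below=5, no_above=0.5)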

build corpus

  • and save for future use
In [13]:
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in text_list]
corpora.MmCorpus.serialize('corpus.mm', doc_term_matrix)

print len(doc_term_matrix)
print doc_term_matrix[100]
5000
[(173, 1), (282, 1), (638, 1), (906, 1), (957, 1), (958, 1), (959, 1), (960, 1), (961, 1), (962, 1), (963, 1), (964, 1), (965, 1)]
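
Each (term_id, count) pair can be mapped back to its token through the dictionary, e.g. (Python 2 print, as elsewhere in this notebook):

for term_id, count in doc_term_matrix[100][:3]:
    print dictionary[term_id], count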

Running LDA Model

  • The next step is to create an LDA model object and train it on the document-term matrix.
  • Training requires a few parameters as input, which were explained in the section above.
  • The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.

check out running.log in the home directory while the model is running

  • to track progress
In [14]:
start = time()
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50)
print 'used: {:.2f}s'.format(time()-start)
used: 173.02s
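
If 50 passes over 5,000 documents is too slow, newer gensim versions also ship a parallel trainer that takes the same core parameters (the worker count below is illustrative):

# from gensim.models.ldamulticore import LdaMulticore
# ldamodel = LdaMulticore(doc_term_matrix, num_topics=10, id2word=dictionary,
#                         passes=50, workers=3)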
In [15]:
print(ldamodel.print_topics(num_topics=2, num_words=4))
[(0, u'0.009*"one" + 0.006*"get" + 0.005*"film" + 0.005*"turn"'), (9, u'0.008*"love" + 0.006*"make" + 0.006*"friend" + 0.006*"father"')]
In [16]:
for i in ldamodel.print_topics(): 
    for j in i: print j
0
0.009*"one" + 0.006*"get" + 0.005*"film" + 0.005*"turn" + 0.005*"story" + 0.005*"two" + 0.005*"way" + 0.004*"girl" + 0.004*"friend" + 0.004*"life"
1
0.012*"life" + 0.009*"year" + 0.008*"find" + 0.007*"one" + 0.007*"love" + 0.007*"father" + 0.007*"man" + 0.005*"son" + 0.005*"time" + 0.005*"go"
2
0.011*"love" + 0.011*"story" + 0.008*"young" + 0.006*"new" + 0.005*"man" + 0.005*"fall" + 0.005*"come" + 0.004*"find" + 0.004*"girl" + 0.004*"first"
3
0.008*"young" + 0.007*"year" + 0.006*"film" + 0.006*"time" + 0.006*"world" + 0.005*"new" + 0.004*"city" + 0.004*"york" + 0.003*"tell" + 0.003*"life"
4
0.007*"new" + 0.006*"two" + 0.005*"get" + 0.004*"work" + 0.004*"school" + 0.004*"case" + 0.003*"hand" + 0.003*"kill" + 0.003*"around" + 0.003*"upon"
5
0.009*"family" + 0.006*"life" + 0.006*"one" + 0.005*"experience" + 0.005*"set" + 0.005*"kill" + 0.005*"control" + 0.005*"need" + 0.005*"two" + 0.005*"year"
6
0.007*"life" + 0.006*"find" + 0.006*"young" + 0.005*"friend" + 0.005*"way" + 0.005*"meet" + 0.005*"take" + 0.005*"man" + 0.005*"dead" + 0.004*"two"
7
0.011*"life" + 0.008*"film" + 0.007*"world" + 0.005*"one" + 0.005*"come" + 0.005*"story" + 0.004*"new" + 0.004*"night" + 0.004*"get" + 0.004*"back"
8
0.008*"school" + 0.008*"life" + 0.007*"day" + 0.005*"story" + 0.005*"one" + 0.004*"men" + 0.004*"high" + 0.004*"take" + 0.004*"love" + 0.004*"plan"
9
0.008*"love" + 0.006*"make" + 0.006*"friend" + 0.006*"father" + 0.005*"good" + 0.005*"new" + 0.005*"god" + 0.004*"go" + 0.004*"family" + 0.004*"find"

save model for future use

In [17]:
ldamodel.save('topic.model')

load saved model

In [19]:
from gensim.models import LdaModel
loading = LdaModel.load('topic.model')
In [20]:
print(loading.print_topics(num_topics=2, num_words=4))
[(5, u'0.009*"family" + 0.006*"life" + 0.006*"one" + 0.005*"experience"'), (6, u'0.007*"life" + 0.006*"find" + 0.006*"young" + 0.005*"friend"')]

predicting (classifying) new (or existing) docs

a helper function that parses a new doc into tokens

In [21]:
def pre_new(doc):
    # clean and tokenize the raw text
    one = cleaning(doc).split()
    # map tokens to (term_id, count) pairs with the trained dictionary
    two = dictionary.doc2bow(one)
    return two
In [22]:
pre_new('new article that to be classified by trained model!')
Out[22]:
[(652, 1), (2868, 1), (4504, 1), (4858, 1)]

pass tokens to the model

  • returns the probability that the document belongs to each topic
In [23]:
belong = loading[(pre_new('new article that to be classified by trained model!'))]
belong
Out[23]:
[(0, 0.24003193096917749),
 (1, 0.020001225111365213),
 (2, 0.31763528643561079),
 (3, 0.30231838613947093),
 (4, 0.020003073834703022),
 (5, 0.020002636729036846),
 (6, 0.020001856143880722),
 (7, 0.020002015284078991),
 (8, 0.020001382653171678),
 (9, 0.020002206699504397)]
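
The same inference can be run through the model's get_document_topics method, which accepts a probability floor so near-zero topics are dropped (the 0.1 threshold is illustrative):

loading.get_document_topics(pre_new('new article that to be classified by trained model!'),
                            minimum_probability=0.1)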

sort topics by probability

In [24]:
new = pd.DataFrame(belong,columns=['id','prob']).sort_values('prob',ascending=False)
new['topic'] = new['id'].apply(loading.print_topic)
new
Out[24]:
id prob topic
2 2 0.317635 0.011*"love" + 0.011*"story" + 0.008*"young" +...
3 3 0.302318 0.008*"young" + 0.007*"year" + 0.006*"film" + ...
0 0 0.240032 0.009*"one" + 0.006*"get" + 0.005*"film" + 0.0...
4 4 0.020003 0.007*"new" + 0.006*"two" + 0.005*"get" + 0.00...
5 5 0.020003 0.009*"family" + 0.006*"life" + 0.006*"one" + ...
9 9 0.020002 0.008*"love" + 0.006*"make" + 0.006*"friend" +...
7 7 0.020002 0.011*"life" + 0.008*"film" + 0.007*"world" + ...
6 6 0.020002 0.007*"life" + 0.006*"find" + 0.006*"young" + ...
8 8 0.020001 0.008*"school" + 0.008*"life" + 0.007*"day" + ...
1 1 0.020001 0.012*"life" + 0.009*"year" + 0.008*"find" + 0...
In [25]:
new['topic']
Out[25]:
2    0.011*"love" + 0.011*"story" + 0.008*"young" +...
3    0.008*"young" + 0.007*"year" + 0.006*"film" + ...
0    0.009*"one" + 0.006*"get" + 0.005*"film" + 0.0...
4    0.007*"new" + 0.006*"two" + 0.005*"get" + 0.00...
5    0.009*"family" + 0.006*"life" + 0.006*"one" + ...
9    0.008*"love" + 0.006*"make" + 0.006*"friend" +...
7    0.011*"life" + 0.008*"film" + 0.007*"world" + ...
6    0.007*"life" + 0.006*"find" + 0.006*"young" + ...
8    0.008*"school" + 0.008*"life" + 0.007*"day" + ...
1    0.012*"life" + 0.009*"year" + 0.008*"find" + 0...
Name: topic, dtype: object

plotting

  • need
    • model
    • corpus
    • dictionary
In [5]:
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()
In [6]:
d = gensim.corpora.Dictionary.load('dictionary.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')
In [7]:
data = pyLDAvis.gensim.prepare(lda, c, d)
data
Out[7]:
(interactive pyLDAvis topic visualization rendered here)
In [10]:
pyLDAvis.save_html(data,'vis.html')
In [11]:
# %%HTML
# <iframe width="100%" height="500" src="http://www.jishichao.com/vis"></iframe>
In [ ]: