Wednesday, November 26, 2014

Calling pydoc from anaconda environment

I was trying to get documentation for the twitter package that I had recently installed. However, doing so from the Mac terminal did not return anything. Eventually, I figured out a way to view the documentation from within the Anaconda environment.

From within Anaconda, launch the IPython QtConsole and import the pydoc package:

 import pydoc  

Once you have imported the pydoc package, you can look at the documentation for any installed package with a simple help() call, as follows (output image cropped due to lack of space):

 help('twitter')



Do not forget to wrap the package name in quotes. Otherwise, Python looks for an object named twitter in your current session rather than the installed package, and you may not get any results.
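As an aside, the same quoted-name lookup works for any importable module, and pydoc can also return the documentation as a string instead of paging it. Here is a minimal sketch using the standard-library json module as a stand-in for twitter (which may not be installed everywhere):

```python
import pydoc

# render_doc returns the same text that help('json') would page through;
# the plaintext renderer strips the terminal bold/underline formatting.
doc_text = pydoc.render_doc('json', renderer=pydoc.plaintext)

# Print just the title line rather than the full (long) documentation.
print(doc_text.splitlines()[0])
```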



Thursday, November 20, 2014

Lexical diversity and corpora...what are they?

In NLTK, we come across terms like lexical diversity, vocabulary, and corpora. In this article, we are going to explore the meaning of these terms with some related examples, and we are going to implement their computation with Python code snippets.

- Vocabulary: the set of unique words in a given text
- Lexical diversity: the average number of times each word is repeated across the text. Simply put, lexical diversity = (total # of words in text) / (# of unique words in text)
- Corpora (plural of corpus): bodies of text that we may be interested in exploring. The NLTK package ships with various sample corpora from different contexts. For the available corpora and their details, check out http://www.nltk.org/book/ch02.html
- Tokenize: the linguistic processing step that splits a text into words and punctuation marks

For example, consider following tongue twister:

Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where is the peck of pickled peppers Peter Piper picked?

Applying the above definitions to our example, we have:

- Vocabulary: 15 unique tokens (ignoring case) - 'a', 'peter', 'of', 'is', 'piper', 'pickled', '.', 'picked', 'peppers', 'the', 'peck', 'where', ',', '?', 'if'

- Lexical diversity: after we run the code snippet, we see that lexical diversity is 2. (Python 2's integer division truncates the exact ratio, 39 / 15 = 2.6.) This means that each unique word repeats between two and three times, on average, within our text.

- Corpora: in our case, this tongue twister is a corpus, as it is the body of text that we are interested in exploring further by computing some stats around it.

- Tokens: each word and each punctuation mark ('.', ',', '?') in the text above is a token; only the whitespace between them is discarded.


Following is a code snippet and its corresponding output using Python.

Code (Python 2; you can copy and paste it into your environment to see the output):


import nltk
from nltk import word_tokenize

tTwister = 'Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where is the peck of pickled peppers Peter Piper picked?'

print len(tTwister)  # length of the raw string, in characters
print
print 'Converting to text without tokenizing'
tTwister_text = nltk.Text(tTwister)  # a plain string is treated as a sequence of characters
print len(tTwister_text)
print tTwister_text[:50]
print
print 'Tokenizing'
tTwister_tokens = word_tokenize(tTwister)  # split into words and punctuation marks
print len(tTwister_tokens)
print tTwister_tokens
print
print 'Converting to text after tokenizing'
tTwister_text = nltk.Text(tTwister_tokens)
print len(tTwister_text)
print tTwister_text[:50]
print
print 'Find vocab in text'

def vocab(sampleText):
    # number of unique, lower-cased tokens
    return len(set(word.lower() for word in sampleText))

def lexicalDiversity(sampleText):
    # total tokens / unique tokens (integer division in Python 2)
    return len(sampleText) / vocab(sampleText)

print 'Lexical Diversity =', lexicalDiversity(tTwister_text)
Output (screenshot not reproduced here):


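For comparison, the same numbers can be reproduced in Python 3 without NLTK. This is only a sketch; the regular expression below is a crude stand-in for word_tokenize, though it yields the same split on this particular text:

```python
import re

tTwister = ('Peter Piper picked a peck of pickled peppers. '
            'A peck of pickled peppers Peter Piper picked. '
            'If Peter Piper picked a peck of pickled peppers, '
            'Where is the peck of pickled peppers Peter Piper picked?')

# Crude tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", tTwister)

# Vocabulary: unique tokens, ignoring case.
vocabulary = set(token.lower() for token in tokens)

# Lexical diversity: total tokens divided by unique tokens.
lexical_diversity = len(tokens) / len(vocabulary)

print(len(tokens))        # 39 tokens
print(len(vocabulary))    # 15 unique tokens
print(lexical_diversity)  # 2.6
```

Note that / is true division in Python 3, so the result is the exact 2.6 rather than the truncated 2 that the Python 2 snippet prints.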
These are just simple starting steps to dive into NLP. More to follow!

Friday, November 14, 2014

Converting dual level dictionary to pandas dataframe

A pandas DataFrame is much like a crosstab: a table with rows and columns, and data available at the intersection of those rows and columns.

I am currently working on computing Euclidean distance in order to build a collaborative-filtering model that recommends items based on other users' ratings. More on that later.

In this post, I am simply going to show how to convert a dictionary of dictionaries into a DataFrame.

Following is the code snippet to do this with a simple example:


Let us say we have a dictionary of user ratings, userRatings, as follows:




Following is quick code to convert this into a DataFrame for easier data analysis.
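The original post showed the dictionary and the conversion code as screenshots, which are not reproduced here, so the sketch below uses a made-up userRatings dictionary; the conversion itself is just DataFrame.from_dict applied to the nested dictionary:

```python
import pandas as pd

# Hypothetical ratings: outer keys are users, inner keys are items.
userRatings = {
    'Alice': {'Snow Crash': 4.0, 'Dune': 3.5},
    'Bob': {'Snow Crash': 5.0, 'Neuromancer': 2.0},
}

# orient='index' turns each outer key (user) into a row;
# the default orientation would make each user a column instead.
df = pd.DataFrame.from_dict(userRatings, orient='index')

# Items a user never rated show up as NaN at that row/column intersection.
print(df)
```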