Thursday, November 20, 2014

Lexical diversity and corpora...what are they?

In NLTK, we come across terms like lexical diversity, vocabulary, and corpora. In this article, we are going to explore the meaning of these terms with some related examples, and we are also going to implement their computations using Python code snippets.

• Vocabulary: the set of unique words in a given text.
• Lexical diversity: the average number of times each unique word appears in the text. Simply put, lexical diversity = (total # of words in the text) / (# of unique words in the text). A short code sketch of this computation follows the list below.
• Corpora (plural of corpus): a body of text that we are interested in exploring. The NLTK package ships with sample corpora from various contexts; for the available corpora and their details, check out http://www.nltk.org/book/ch02.html
• Tokenize: a linguistic processing step that splits text into words and punctuation marks.
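
As a minimal sketch of these definitions, here is a tiny made-up example (the sentence 'The cat sat on the mat.' is only an illustration, not part of the corpus used below); it assumes NLTK and its tokenizer models are installed, and uses Python 2 syntax to match the snippet later in this post:

 from nltk import word_tokenize

 toy = 'The cat sat on the mat.'              # hypothetical toy sentence
 tokens = word_tokenize(toy)                  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
 vocabulary = set(w.lower() for w in tokens)  # unique lower-cased tokens

 print len(tokens)                            # 7 tokens
 print len(vocabulary)                        # 6 unique tokens ('the' repeats once case is ignored)
 print float(len(tokens)) / len(vocabulary)   # lexical diversity of about 1.17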

For example, consider the following tongue twister:

Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where is the peck of pickled peppers Peter Piper picked?

Applying the above definitions to our example, we have:

• Vocabulary: 15 unique tokens (ignoring case, and counting punctuation marks) - 'a', 'peter', 'of', 'is', 'piper', 'pickled', '.', 'picked', 'peppers', 'the', 'peck', 'where', ',', '?', 'if'

• Lexical Diversity: after we run the code snippet, we see a lexical diversity of 2. The tokenized text has 39 tokens and 15 unique (lower-cased) tokens, so the exact ratio is 39 / 15 ≈ 2.6, which Python 2's integer division truncates to 2. In other words, each unique word appears between two and three times, on average, within our text.

• Corpora: in our case, we can say that this tongue twister is a corpus, as it is the body of text that we are interested in exploring further by computing some stats around it.

• Tokens: each word and each punctuation mark ('.', ',', '?') in the text above is a token; only the white space between them is discarded. A quick illustration follows below.
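
For instance, here is how word_tokenize handles just the first line of the tongue twister (again assuming NLTK and its tokenizer models are installed):

 from nltk import word_tokenize
 print word_tokenize('Peter Piper picked a peck of pickled peppers.')
 # ['Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers', '.']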


The following is a code snippet and its corresponding output in Python.

Code (Python 2; you can copy-paste this code into your environment to see the output):


 import nltk
 from nltk import word_tokenize

 tTwister = 'Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where is the peck of pickled peppers Peter Piper picked?'

 # Length of the raw string (number of characters, not words)
 print len(tTwister)
 print

 print 'Converting to text without tokenizing'
 # Passing the raw string to nltk.Text treats every character as a token
 tTwister_text = nltk.Text(tTwister)
 print len(tTwister_text)
 print tTwister_text[:50]
 print

 print 'Tokenizing'
 # word_tokenize splits the string into words and punctuation marks
 tTwister_tokens = word_tokenize(tTwister)
 print len(tTwister_tokens)
 print tTwister_tokens
 print

 print 'Converting to text after tokenizing'
 # Now each element of the Text is a word or a punctuation mark
 tTwister_text = nltk.Text(tTwister_tokens)
 print len(tTwister_text)
 print tTwister_text[:50]
 print

 print 'Find vocab in text'
 def vocab(sampleText):
   # unique words, ignoring case
   vocabulary = set(words.lower() for words in sampleText)
   return len(vocabulary)

 def lexicalDiversity(sampleText):
   # total tokens divided by unique words (integer division in Python 2)
   return len(sampleText) / vocab(sampleText)

 print 'Lexical Diversity =', lexicalDiversity(tTwister_text)
Output:


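The snippet above uses Python 2 (print statements, integer division). If you are on Python 3, a roughly equivalent version, a sketch rather than a drop-in replacement, might look like this:

 import nltk
 from nltk import word_tokenize

 # nltk.download('punkt')  # uncomment on a fresh install to fetch the tokenizer models

 tTwister = ('Peter Piper picked a peck of pickled peppers. '
             'A peck of pickled peppers Peter Piper picked. '
             'If Peter Piper picked a peck of pickled peppers, '
             'Where is the peck of pickled peppers Peter Piper picked?')

 tokens = word_tokenize(tTwister)      # words and punctuation marks
 text = nltk.Text(tokens)              # wrap the tokens in an NLTK Text object

 def vocab_size(sample_text):
     # number of unique, lower-cased tokens
     return len(set(word.lower() for word in sample_text))

 def lexical_diversity(sample_text):
     # true division in Python 3, so this prints ~2.6 rather than the truncated 2
     return len(sample_text) / vocab_size(sample_text)

 print(len(tokens), 'tokens')
 print(vocab_size(text), 'unique tokens')
 print('Lexical Diversity =', lexical_diversity(text))
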
These are just simple starting steps to dive into NLP. More to follow!
