In NLTK, we come across terms like lexical diversity, vocabulary, and corpora. In this article, we will explore the meaning of these terms with some related examples, and we will also implement their computations using Python code snippets.
- Vocabulary: the set of unique words in a given text.
- Lexical diversity: the average number of times each unique word appears across the text. Simply put, lexical diversity = (total # of words in the text) / (# of unique words in the text).
- Corpora (plural of corpus): a body of text that we may be interested in exploring. The NLTK package ships with various sample corpora from different contexts. For the available corpora and their related details, check out http://www.nltk.org/book/ch02.html
- Tokenize: a linguistic processing method by which we separate a text into words and punctuation marks.
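Before turning to NLTK, here is a minimal, dependency-free sketch of the first two definitions. It uses a naive whitespace split as a stand-in for a real tokenizer (the example text and variable names are made up for illustration; NLTK's word_tokenize, used later, also splits off punctuation):

```python
# A rough sketch of "vocabulary" and "lexical diversity" using only
# built-in Python. A plain .split() is a naive tokenizer: punctuation
# would stay attached to words, so we use a punctuation-free example.
text = "the cat sat on the mat because the cat was tired"

tokens = text.lower().split()   # naive tokenization: 11 tokens
vocabulary = set(tokens)        # unique words: 8 of them

lexical_diversity = len(tokens) / len(vocabulary)

print(len(tokens))              # 11
print(len(vocabulary))          # 8
print(lexical_diversity)        # 1.375 -> each word appears ~1.4 times on average
```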
For example, consider the following tongue twister:
Peter Piper
picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where is the peck of pickled peppers Peter Piper picked?
Applying the
above definitions to our example, we have:
- Vocabulary: 15 unique tokens (ignoring case) - 'a', 'peter', 'of', 'is', 'piper', 'pickled', '.', 'picked', 'peppers', 'the', 'peck', 'where', ',', '?', 'if'
- Lexical diversity: after we run the code snippet, we see that the lexical diversity is 2.6 (39 tokens / 15 unique tokens). This means that each unique word appears between two and three times (on average) within our text.
- Corpora: in our case, we can say that this tongue twister is a corpus, as it is the body of text that we are interested in exploring further by computing some statistics around it.
- Tokens: each word and each punctuation mark ('.', ',', '?') in the text above is a token; white space is not.
Following is a code snippet using Python. (You can copy-paste it into your environment to see the output.)
import nltk
from nltk import word_tokenize

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer model

tTwister = 'Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where is the peck of pickled peppers Peter Piper picked?'
print(len(tTwister))
print()

print('Converting to text without tokenizing')
tTwister_text = nltk.Text(tTwister)  # built from a raw string: items are characters
print(len(tTwister_text))
print(tTwister_text[:50])
print()

print('Tokenizing')
tTwister_tokens = word_tokenize(tTwister)
print(len(tTwister_tokens))
print(tTwister_tokens)
print()

print('Converting to text after tokenizing')
tTwister_text = nltk.Text(tTwister_tokens)  # now items are word/punctuation tokens
print(len(tTwister_text))
print(tTwister_text[:50])
print()

print('Find vocab in text')

def vocab(sampleText):
    # number of unique (case-insensitive) tokens
    return len(set(word.lower() for word in sampleText))

def lexicalDiversity(sampleText):
    # total tokens divided by unique tokens
    return len(sampleText) / vocab(sampleText)

print('Lexical Diversity =', lexicalDiversity(tTwister_text))
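As a possible next step (not part of the original snippet), we can count how often each individual token appears. The sketch below uses only the standard library, with a naive tokenizer that strips trailing punctuation, so it runs even without NLTK installed:

```python
# Counting token frequencies with the standard library (a hypothetical
# extension of the snippet above; uses a naive tokenizer, not NLTK's).
from collections import Counter

tTwister = ('Peter Piper picked a peck of pickled peppers. '
            'A peck of pickled peppers Peter Piper picked. '
            'If Peter Piper picked a peck of pickled peppers, '
            'Where is the peck of pickled peppers Peter Piper picked?')

# Lowercase each word and strip the punctuation attached by .split()
tokens = [w.strip('.,?').lower() for w in tTwister.split()]

counts = Counter(tokens)
print(len(tokens))               # 35 word tokens (punctuation stripped here)
print(counts.most_common(3))     # the three most frequent tokens
```

Note the count differs from the 39 tokens in the main snippet because this naive approach discards punctuation instead of keeping it as separate tokens, as word_tokenize does.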
These are
just simple starting steps to dive into NLP. More to follow!