This is my next article about NLTK (the Natural Language Toolkit for Python). In this blog post I will highlight some of the key features of NLTK that can be useful for any developer who has to process and understand text programmatically.
Tokenization: the transformation of text into understandable chunks
In natural language processing, a token is a small piece of text. There are different tokenization techniques, but keep in mind that the process always consists in splitting a body of text into smaller units. This is usually the very first step of a more complex NLP pipeline.
Sentence tokenization
If you want to analyze a text sentence by sentence, you can use the Punkt tokenizer.
Let’s say that you have stored the text you want to tokenize in the data variable:
data = "This is my text. I want to tokenize it. But what is exactly tokenization ?"
The idea is to first import the module needed to load a tokenizer, and then to create a variable that holds the English Punkt tokenizer. Note that NLTK provides tokenizers for different languages!
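Here is a minimal sketch of those two steps (it assumes the Punkt model has already been fetched with nltk.download('punkt'), and reuses the data variable defined above):

import nltk.data
# Load the pre-trained English Punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split the text stored in data into a list of sentences
sentences = tokenizer.tokenize(data)
print(sentences)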
The output will be:
['This is my text.', 'I want to tokenize it.', 'But what is exactly tokenization ?']
Note that the punctuation is included in the list.
Word tokenization
The sentence scope might not be fine-grained enough if you want to pursue a deeper text analysis. You will probably need to go down to the word level:
from nltk.tokenize import TreebankWordTokenizer
Again, we import the TreebankWordTokenizer class from nltk.tokenize. You then have to instantiate the class: tokenizer = TreebankWordTokenizer() gives you an instance of it. You can then call its tokenize method and provide it with your string of data.
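A minimal sketch of those steps, reusing the data string from above:

from nltk.tokenize import TreebankWordTokenizer
# Instantiate the Treebank word tokenizer
tokenizer = TreebankWordTokenizer()
# Split the text into individual word tokens
words = tokenizer.tokenize(data)
print(words)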
The output will look like this:
['This', 'is', 'my', 'text']
Stop words
Stop words can be really interesting. A text contains many of them, and they do not carry vital information for understanding it. Hence they can be removed in order to perform a better analysis of a corpus.
NLTK provides a list of common stop words that you can use to filter a text.
from nltk.corpus import stopwords
This list is exposed inside nltk.corpus. The line above imports the stopwords corpus. We then access the list by invoking its words method and providing 'english' as a parameter, to load only the English stop words.
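A minimal sketch (it assumes the corpus has already been fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords
# Load the list of English stop words
stop_words = stopwords.words('english')
print(len(stop_words))  # the list contains common words such as "you", "the", "is"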
Now, how to filter a text from its stop words:
words_list = ["baltimore", "you", "love", "hate", "can"]
You first have a list of tokenized words. Then create another one with the following method (see the sketch after this list):
- We take each element of the first list
- If the word is in the stop_words list, we do not include it in the newly created list
- Else we include it
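A minimal sketch of that filtering, written as a list comprehension (it assumes the stop_words list built above and the words_list example):

# Keep only the words that are not in the English stop word list
filtered_words = [word for word in words_list if word not in stop_words]
print(filtered_words)  # stop words such as "you" are dropped, "baltimore" is kept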
Synset: how to get the definition of a word token
WordNet is an English lexical database that gives you the ability to look up the definitions and synonyms of a word. With a Synset instance you can ask for the definition of a word:
from nltk.corpus import wordnet
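Building on that import, here is a minimal sketch of the lookup (it assumes the WordNet data has been fetched with nltk.download('wordnet'); the word "token" is just an example):

# Look up all the synsets (groups of synonyms) for the word "token"
synsets = wordnet.synsets("token")
# Ask the first synset for its definition
print(synsets[0].definition())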
The code is pretty simple!
In my next blog post I will talk about hypernyms and how to use them. I will also try to find a way to use NLTK with npm and provide a tutorial for installing it that way. Thanks for reading!