M@XCode

Personal blog dedicated to computer science

How to use tokenization, stopwords and synsets with NLTK (Python)

This is my next article about NLTK, the Natural Language Toolkit for Python. In this blog post I will highlight some of the key features of NLTK that are useful to any developer who has to process and understand text programmatically.

Tokenization: the transformation of text into understandable chunks

In natural language processing a token is a small piece of text. There are different tokenization techniques, but keep in mind that the process always consists in cutting a stream of text into smaller units such as sentences or words. This is usually the very first step of a more complex NLP pipeline.

Sentence tokenization

If you want to analyze all the sentences of a given text you can use the Punkt tokenizer.

Let’s say that you have stored the text you want to tokenize in a data variable:

data= "This is my text. I want to tokenize it. But what is exactly tokenization ?"
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(data)

The first line imports nltk.data, which we need to load a tokenizer. We then define the text to analyze, load the English Punkt tokenizer and call its tokenize method on our string. Note that NLTK provides tokenizers for several other languages!

The output of this code chunk will be:

['This is my text.', 'I want to tokenize it.', 'But what is exactly tokenization ?']

Note that the punctuation is included in each sentence of the list.
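
NLTK ships Punkt models for other languages as well. As a quick sketch, here is how you could load the French model (assuming the punkt data has been downloaded, e.g. with nltk.download('punkt'); the exact path may vary with your NLTK version):

# assumption: the French Punkt model is present in your punkt data
tokenizer_fr = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
tokenizer_fr.tokenize("Bonjour M. Dupont. Comment allez-vous ?")
# should return ['Bonjour M. Dupont.', 'Comment allez-vous ?']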

Word tokenization

The sentence level might not be fine-grained enough if you want to push the analysis further. You will probably need to go down to the word level:

from nltk.tokenize import TreebankWordTokenizer
# tokenizer following the Penn Treebank conventions
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('This is my text')

Again we import the TreebankWordTokenizer class from nltk.tokenize. You then have to instantiate it: tokenizer will be an instance of that class. You can then call its tokenize method and provide it with your string of data.

The output will be:

['This', 'is', 'my', 'text']
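
Unlike a naive split on whitespace, the Treebank tokenizer also separates punctuation and splits contractions. Reusing the tokenizer instance from above:

tokenizer.tokenize("Don't hesitate to tokenize!")
# ['Do', "n't", 'hesitate', 'to', 'tokenize', '!']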

Stop words

Stop words are very common words ("the", "is", "you"...) that occur many times in a text but carry little information about its meaning. Hence they can be removed in order to perform a better analysis of a corpus.

NLTK provides a list of usual stop words that you can use to filter a text.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

This list is exposed inside nltk.corpus. The first line imports the stopwords corpus. We then access the list by invoking the words method and providing 'english' as a parameter to load only the English stop words. Wrapping the result in a set makes the membership tests below faster.
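
To get a feel for what the list contains, you can inspect it (a quick check; the exact size and content depend on your NLTK data version):

print(len(stop_words))       # around 179 entries, depending on the NLTK version
print('you' in stop_words)   # True
print('love' in stop_words)  # False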

Here is how to filter a text from its stop words:

words_list = ["baltimore", "you", "love", "hate", "can"]
[w for w in words_list if w not in stop_words]

You first have a list of tokenized words. Then you build another one with a list comprehension (the expected result is shown below):

  1. We take each element of the first list.
  2. If the word is in the stop_words set we do not include it in the newly created list.
  3. Otherwise we include it.
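
With NLTK's English stop word list, "you" and "can" should be filtered out, leaving:

['baltimore', 'love', 'hate']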

Synset: how to get the definition of a word token

WordNet is a lexical database of English that gives you the ability to look up the definitions and synonyms of a word.
With a Synset instance you can ask for the definition of a word:

from nltk.corpus import wordnet
# take the first synset, i.e. the first sense of the word 'computer'
s = wordnet.synsets('computer')[0]
s.definition()

The code is pretty simple!
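
WordNet also stores synonyms. As a sketch of what else you can do with synsets, here is how you could collect the lemma names of every sense of 'computer' (the exact output depends on your WordNet version):

from nltk.corpus import wordnet
# gather the synonyms of 'computer' across all of its senses
synonyms = set()
for syn in wordnet.synsets('computer'):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
print(synonyms)
# e.g. {'computer', 'computing_machine', 'data_processor', 'calculator', ...}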

In my next blog post I will talk about hypernyms and how to use them. I will also try to find a way to use NLTK with npm and provide a tutorial for installing it that way. Thanks for reading!