M@XCode

Personal blog dedicated to computer science

How to install NLTK and compute basic statistics on a text

NLTK is a free library for NLP

NLTK (Natural Language Toolkit) is a free Python library that is really helpful for NLP (Natural Language Processing) tasks. The main challenge of NLP is to give an algorithm the ability to understand the meaning of a text written by a human. One of the first applications of NLP dates back to the 1950s, with the Georgetown experiment (a joint demonstration by Georgetown University and IBM). The objective of that experiment was to show the government the capabilities of machines in the translation field.

How to install it

For Windows users:

  • Go to https://pypi.python.org/pypi/nltk and download the nltk-3.2.1.win32.exe file
  • Once the download is complete, double-click the file and follow the installation process
  • To check your installation, launch Python (Start menu > Python)
    and type the following at the prompt:
>>> import nltk

If you get no errors, you are good to go!

For Mac users:

The installation procedure is easier. Open your terminal and type:

$ sudo pip install -U nltk
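
Whichever platform you are on, you can also confirm the installation from a script by asking Python whether the package can be found. The snippet below is a minimal sketch; the helper name nltk_is_installed is made up for this example.

```python
import importlib.util


def nltk_is_installed():
    """Return True if Python can locate the nltk package."""
    return importlib.util.find_spec("nltk") is not None


if nltk_is_installed():
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK was not found; check the installation steps above.")
```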

Import your own text from the filesystem

We will use Python's built-in file handling. The open(filename, mode) function lets us load a file from the filesystem so that we can run NLTK on it.

with open("my_file.txt", "r") as f:
    text = f.read()

The first parameter is the name of the file (placed at the root of your project). The second argument is the mode in which the file is opened. For this application we only need to read the file, so we use the r mode.

The second line stores the raw content of the file, as a single string, in the text variable.
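
A slightly more defensive variant of the same idea is sketched below. It assumes the file is UTF-8 encoded (adjust the encoding if yours is not), and the helper name load_text is hypothetical. The with statement closes the file automatically, even if reading fails.

```python
def load_text(path, encoding="utf-8"):
    """Read a whole text file and return its content as one string."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


try:
    text = load_text("my_file.txt")
    print(len(text), "characters read")
except FileNotFoundError:
    text = ""  # fall back to an empty string so the script can continue
    print("my_file.txt was not found in the current directory")
```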

Compute basic statistics

Words that appear in the same contexts

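The similar method of an nltk.Text object finds other words that appear in similar contexts. It prints them with the most similar words first. Since the text variable is still a plain string at this point, we first split it into tokens and wrap them in an nltk.Text:

text = nltk.Text(nltk.wordpunct_tokenize(text))
text.similar('banana')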
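
To make the notion of "same context" more concrete, here is a plain-Python sketch of the idea. It is not NLTK's implementation: it simply treats the context of a word as the pair (previous word, next word) and ranks other words by how many such pairs they share with the target. The function name similar_words and the sample sentence are made up for this illustration.

```python
from collections import Counter, defaultdict


def similar_words(tokens, target):
    """Rank words by the number of (previous, next) contexts shared with target."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()

    # Map each word to the set of (previous word, next word) pairs around it.
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    target_contexts = contexts.get(target, set())

    # Score every other word by the number of contexts it shares with the target.
    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared:
            scores[word] = shared

    return [word for word, _ in scores.most_common()]


sample = "I eat a banana every day and I eat a mango every day".split()
print(similar_words(sample, "banana"))  # ['mango']
```

Here "banana" and "mango" both appear between "a" and "every", so "mango" is reported as similar.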

Counting words

An nltk.Text object provides a count method that returns how many times a word appears in the text:

text.count('alexa')

It returns an integer. On an nltk.Text object, the match is on whole tokens and is case-sensitive.
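
Counting on a list of tokens is not the same as calling count on the raw string returned by f.read(), because str.count looks for substrings. The sketch below uses only the standard library to show the difference; the sample sentence and the regular expression are assumptions for illustration, not NLTK's own tokenizer.

```python
import re

raw = "Alexa asked alexa about alexander."

# Split the string into word tokens and punctuation tokens.
tokens = re.findall(r"\w+|[^\w\s]+", raw)

substring_hits = raw.count("alexa")  # also matches inside "alexander"
token_hits = tokens.count("alexa")   # exact, case-sensitive token matches only

print(substring_hits)  # 2
print(token_hits)      # 1
```

If you want "Alexa" and "alexa" to be counted together, lowercase the tokens before counting.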

Get the number of words and punctuation

You can use the len function, which returns an integer. Be careful: on an nltk.Text object, this function counts every token, so punctuation marks are counted as well as words.

len(text)
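
If you need the number of words without the punctuation, you can filter the tokens before counting. The following is a minimal standard-library sketch; the sample sentence and the regular expression used to split it are assumptions for illustration, and the same filtering idea applies to the tokens of an nltk.Text.

```python
import re

raw = "Hello, world! Hello again."

# Split the string into word tokens and punctuation tokens.
tokens = re.findall(r"\w+|[^\w\s]+", raw)

# Keep purely alphabetic tokens as words; treat non-alphanumeric ones as punctuation.
words = [t for t in tokens if t.isalpha()]
punctuation = [t for t in tokens if not t.isalnum()]

print(len(tokens))       # 7
print(len(words))        # 4
print(len(punctuation))  # 3
```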

In the next blog article I will focus on text classifiers (Bayesian text classifiers). NLTK makes it really easy to build one.