NLTK is a free library for NLP
NLTK (Natural Language Toolkit) is a free Python library that is very helpful for carrying out NLP (Natural Language Processing) tasks. The main challenge of NLP is to give an algorithm the ability to understand the meaning of a text written by a human. One of the first applications of NLP dates back to the 1950s, with the Georgetown experiment (1954), carried out jointly by Georgetown University and IBM. The objective of that experiment was to show the government what machines were capable of in the translation field.
How to install it
For Windows users:
- Go to https://pypi.python.org/pypi/nltk and download the nltk-3.2.1.win32.exe file.
- Once the download is complete, double-click the file and follow the installation process.
- To check your installation, launch Python (Start menu > Python) and type the following in the black console window:
>>> import nltk
If you get no errors, you are good to go!
For Mac users:
The installation procedure is easier. Open your terminal and type:
$ sudo pip install -U nltk
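Whichever platform you are on, it can be handy to check the installation from a script rather than by hand. The following is only a small sketch using the standard library: the helper name is_installed is hypothetical (it is not part of NLTK), and it simply asks the current interpreter whether it can locate the package.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can locate the package.

    is_installed is a hypothetical helper name, not an NLTK function.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("nltk"):
        import nltk
        print("NLTK version:", nltk.__version__)
    else:
        print("NLTK is not installed for this interpreter.")
```

If the script reports that NLTK is missing even though you installed it, the most likely cause is that pip installed it for a different Python interpreter than the one you are running.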
Import your own text from the filesystem
We will use Python's built-in file handling. The open(filename, mode) function lets us load our file from the filesystem so that we can run NLTK on it.
f = open("my_file.txt", "r")
text = f.read()
The first parameter is the name of the file (placed at the root of your project). The second argument is the mode in which the file is opened. For this application we only need to read the file, so we use the r option.
The second line stores the raw input in the text variable. At this point text is a plain Python string; the NLTK methods used in the rest of this article expect an nltk.Text object built from a list of tokens.
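To go from the raw file content to something the NLTK methods can work on, the text has to be split into tokens and wrapped in an nltk.Text object. Here is a minimal sketch of one way to do it. It rests on a few assumptions of mine: the file is UTF-8 encoded, wordpunct_tokenize (a simple regex-based tokenizer that needs no extra data download) is good enough for your text, and load_text is a hypothetical helper name.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize


def load_text(path, encoding="utf-8"):
    """Read a file and wrap its tokens in an nltk.Text object.

    load_text is a hypothetical helper name, not part of NLTK.
    """
    # The with-statement closes the file even if reading fails.
    with open(path, "r", encoding=encoding) as f:
        raw = f.read()
    # Split the raw string into word and punctuation tokens.
    tokens = wordpunct_tokenize(raw)
    return nltk.Text(tokens)


if __name__ == "__main__":
    # Write a small sample file to a temporary folder so the sketch
    # runs on its own without touching your real files.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "my_file.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("Alexa, play some music. Alexa, stop the music!")
        text = load_text(path)
        print(text.tokens[:5])
```

Other tokenizers exist in NLTK (for example word_tokenize), but some of them require downloading additional data first, which is why this sketch sticks to the regex-based one.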
Compute basic statistics
Words that appear in the same contexts
With this method you can find other words that appear in similar contexts. The words are ordered so that the most similar ones appear first in the output. Note that similar is a method of nltk.Text objects, and that it prints its result instead of returning it.
text.similar('banana')
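To see what "similar context" means in practice, here is a tiny self-contained sketch. The sample sentences are made up for illustration, and I am assuming wordpunct_tokenize as the tokenizer. Two words count as sharing a context when they have the same word immediately before and after them.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Illustrative sample sentences (not taken from a real corpus).
raw = (
    "I eat a banana every day. "
    "I eat a mango every day. "
    "I drive a car every week."
)
text = nltk.Text(wordpunct_tokenize(raw))

# 'banana' sits between 'a' and 'every'; similar() prints the other
# words found between the same neighbours. It returns None, so there
# is nothing useful to assign from the call.
text.similar("banana")
```

On a text this small the result is of course trivial; the method becomes interesting on a corpus of at least a few thousand words.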
Counting words
NLTK provides an easy method to count how many times a word appears in a text:
text.count('alexa')
It will output an integer.
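One thing to keep in mind is that count compares whole tokens exactly, so it is case-sensitive. The short sketch below, on a made-up sentence and again assuming wordpunct_tokenize as the tokenizer, shows the difference and one possible way to get a case-insensitive count.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Illustrative sample sentence with two spellings of the same word.
raw = "Alexa, play some music. alexa, stop the music!"
tokens = wordpunct_tokenize(raw)
text = nltk.Text(tokens)

# count() matches whole tokens exactly, so 'Alexa' is not counted here.
exact = text.count("alexa")

# Lowercasing every token first gives a case-insensitive count.
lowered = nltk.Text([t.lower() for t in tokens])
any_case = lowered.count("alexa")

print(exact, any_case)
```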
Get the number of words and punctuation
You can use the len function, which will output an integer. Be careful: this function returns the number of words plus the number of punctuation marks in the corpus, because each punctuation mark counts as a token.
len(text)
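If you only want the words, you can filter the punctuation out before counting. This is a minimal sketch on a made-up sentence, assuming wordpunct_tokenize as the tokenizer and treating "a word" as any token made only of letters (so numbers and hyphenated tokens would be excluded too).

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Illustrative sample sentence.
raw = "Alexa, play some music. Alexa, stop the music!"
text = nltk.Text(wordpunct_tokenize(raw))

# Every token counts, punctuation included.
total_tokens = len(text)

# Keep only the tokens made of letters.
words_only = [t for t in text.tokens if t.isalpha()]

# Distinct words, ignoring case.
distinct = set(w.lower() for w in words_only)

print(total_tokens, len(words_only), len(distinct))
```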
In the next blog article I will focus on text classifiers (Bayesian text classifiers). NLTK lets you build one really easily.