NOTE: This post was originally written in 2012 so it may be dated. I’m resurrecting it due to relative popularity. This post has been copied between several blogging systems (some of which were home-brewed) and some formatting has been lost along the way.
Over the last few years I’ve developed a bit of an interest in natural-language processing. It’s never been the focus of my work, but when you’re exposed to as many enterprise-class data storage/search systems as I have been, you have no choice but to absorb some details. Several hobby projects, some involving home-brewed full-text search, have also popped up that required at least a cursory understanding of stemming and phonetic algorithms. Another recurring theme in my hobby work has been classification, both for custom spam filtering and for analyzing Twitter sentiment.
In general, accomplishing these goals simply required the use of someone else’s hard work, whether that meant having Solr/Lucene stem my corpora at the office, using the Ruby classifier gem to analyze tweets about stocks, or using the Python Natural Language Toolkit for… well, pretty much anything.
Recent months have brought a new platform into my hobby work: node.js, which, while stable, still has some maturing to do. As with most things I work with these days, the need for natural-language facilities arose, and I found the pickings pretty slim. I have to be honest: that’s *exactly* what I was hoping for; an opportunity to sink my teeth into the algorithms themselves.
Thus I began work on natural, a module of natural-language algorithms for node.js. The idea is loosely based on the Python NLTK in that all of the algorithms live in the same package, though it will likely never be anywhere near as complete. I’d be lucky for “natural” to ever do half of what NLTK does without plenty of help. As of version 0.0.17 it has two stemmers (Porter and Lancaster), one classifier (Naive Bayes), two phonetic algorithms (Metaphone and SoundEx) and an inflector.
The strategy was first to cast a wide enough net to see how the algorithms might fit together in terms of interface and dependencies. Making them performant and perfectly accurate is step two, which admittedly still requires some work. At the time of writing “natural” is at version 0.0.17 and everything seems to work (it’s not in an official beta of any kind), but until the version ticks over to 0.1.0 it’s subject to significant internal change. Hopefully the interfaces will stay the same.
With the exception of the Naive Bayes classifier (to which you can supply tokens stemmed however you like), none of these algorithms has any real applicability outside of English. This is a problem I’d like to rectify after solidifying a 0.1.0 release, and I’d love to get more people involved to accomplish it.
Installing
In order to use “natural” you have to install it… naturally. Like most node modules, “natural” is packaged up as an NPM and can be installed from the command line as such:
npm install natural
If you want to install from source (which can be found on GitHub), pull it and install the npm from the source directory:
git clone git://github.com/NaturalNode/natural.git
cd natural
npm install .
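Either way, a quick smoke test will confirm the module resolves. This just prints whatever the package exports; the exact list of sub-modules will vary by version:

var natural = require('natural');

// if the install worked, this prints the module's exported sub-modules
console.log(Object.keys(natural));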
Stemming
The first class of algorithms I’d like to outline is stemming. As stated above, the Lancaster and Porter algorithms are supported as of 0.0.17. Here’s a basic example of stemming a word with a Porter stemmer:
var natural = require('natural'),
    stemmer = natural.PorterStemmer;
var stem = stemmer.stem('stems');
console.log(stem);
stem = stemmer.stem('stemming');
console.log(stem);
stem = stemmer.stem('stemmed');
console.log(stem);
stem = stemmer.stem('stem');
console.log(stem);
Above I simply required-up the main “natural” module and grabbed the PorterStemmer sub-module from within. The stem function takes an arbitrary string and returns its stem. The above code produces the following output:
stem
stem
stem
stem
For convenience, stemmers can patch String with methods that simplify the process; just call the attach method and String objects will then have a stem method.
stemmer.attach();
stem = 'stemming'.stem();
console.log(stem);
Generally you’d be interested in stemming an entire corpus. The attach method provides a tokenizeAndStem method to accomplish this. It breaks the owning string up into an array of strings, one for each word, and stems them all. For example:
var stems = 'stems returned'.tokenizeAndStem();
console.log(stems);
produces the output:
[ 'stem', 'return' ]
Note that by default the tokenizeAndStem method will omit certain words considered irrelevant (stop words) from the returned array. To instruct the stemmer not to omit stop words, pass true to tokenizeAndStem for the keepStops parameter.
Consider:
console.log('i stemmed words.'.tokenizeAndStem());
console.log('i stemmed words.'.tokenizeAndStem(true));
outputting:
[ 'stem', 'word' ]
[ 'i', 'stem', 'word' ]
All of the code above would also work with a Lancaster stemmer by requiring the LancasterStemmer module instead, like:
var natural = require('natural'),
    stemmer = natural.LancasterStemmer;
Of course the actual stems produced could be different depending on the algorithm chosen.
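For instance, here’s a quick sketch showing the two algorithms running on the same input. I’m not promising specific outputs here; the exact stems depend on each algorithm’s rule set, with Lancaster generally being the more aggressive of the two:

var natural = require('natural');

// the same word through both stemmers; Lancaster's heavier rules
// will often cut deeper than Porter's
console.log(natural.PorterStemmer.stem('generalization'));
console.log(natural.LancasterStemmer.stem('generalization'));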
Phonetics
Phonetic algorithms are also provided to determine what words sound like and compare them accordingly. The old (and I mean old… like 1918 old) SoundEx and the more modern Metaphone algorithm are supported as of 0.0.17.
The following example compares the string “phonetics” and the intentional misspelling “fonetix” and determines they sound alike according to the Metaphone algorithm.
var natural = require('natural'),
    phonetic = natural.Metaphone;
var wordA = 'phonetics';
var wordB = 'fonetix';
if(phonetic.compare(wordA, wordB))
    console.log('they sound alike!');
The raw phonetic code the algorithm produces for a word can be retrieved with the process method:
var phoneticCode = phonetic.process('phonetics');
console.log(phoneticCode);
resulting in:
FNTKS
Like the stemming implementations, the phonetic modules have an attach method that patches String with shortcut methods, most notably soundsLike for comparison:
phonetic.attach();
if(wordA.soundsLike(wordB))
    console.log('they sound alike!');
attach also patches in phonetics and tokenizeAndPhoneticize methods to retrieve the phonetic code for a single word and for an entire corpus, respectively.
console.log('phonetics'.phonetics());
console.log('phonetics rock'.tokenizeAndPhoneticize());
which outputs:
FNTKS
[ 'FNTKS', 'RK' ]
The above code could also use SoundEx by substituting the following in for the require:
var natural = require('natural'),
    phonetic = natural.SoundEx;
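Note that SoundEx output looks nothing like Metaphone’s: instead of a string of letters you get the standard SoundEx letter-plus-digits format (e.g. ‘R163’ for ‘Robert’). A quick sketch using the require above:

// SoundEx codes are a letter followed by digits,
// unlike Metaphone's all-letter output
console.log(phonetic.process('phonetics'));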
Inflector
Basic inflectors are in place to convert nouns between plural and singular forms and to turn integers into string counters (e.g. ‘1st’, ‘2nd’, ‘3rd’, ‘4th’, etc.).
The following example converts the word “radius” into its plural form “radii”.
var natural = require('natural'),
    nounInflector = new natural.NounInflector();
var plural = nounInflector.pluralize('radius');
console.log(plural);
Singularization follows the same pattern, as illustrated in the following example, which converts the word “beers” to its singular form, “beer”.
var singular = nounInflector.singularize('beers');
console.log(singular);
Just like the stemming and phonetic modules, an attach method is provided to patch String with shortcut methods.
nounInflector.attach();
console.log('radius'.pluralizeNoun());
console.log('beers'.singularizeNoun());
A NounInflector instance can do custom conversions if you provide expressions via the addPlural and addSingular methods. Because these conversions aren’t always symmetric (sometimes more patterns are required to singularize forms than to pluralize them), there needn’t be a one-to-one relationship between addPlural and addSingular calls.
nounInflector.addPlural(/(code|ware)/i, '$1z');
nounInflector.addSingular(/(code|ware)z/i, '$1');
console.log('code'.pluralizeNoun());
console.log('ware'.pluralizeNoun());
console.log('codez'.singularizeNoun());
console.log('warez'.singularizeNoun());
which would result in:
codez
warez
code
ware
Here’s an example of using the CountInflector module to produce string counters for integers:
var natural = require('natural'),
    countInflector = natural.CountInflector;
console.log(countInflector.nth(1));
console.log(countInflector.nth(2));
console.log(countInflector.nth(3));
console.log(countInflector.nth(4));
console.log(countInflector.nth(10));
console.log(countInflector.nth(11));
console.log(countInflector.nth(12));
console.log(countInflector.nth(13));
console.log(countInflector.nth(100));
console.log(countInflector.nth(101));
console.log(countInflector.nth(102));
console.log(countInflector.nth(103));
console.log(countInflector.nth(110));
console.log(countInflector.nth(111));
console.log(countInflector.nth(112));
console.log(countInflector.nth(113));
producing:
1st
2nd
3rd
4th
10th
11th
12th
13th
100th
101st
102nd
103rd
110th
111th
112th
113th
Classification
At the moment classification is supported only by the Naive Bayes algorithm. There are two basic steps involved in using the classifier: training and classification.
The following example requires-up the classifier and trains it with data. The addDocument method accepts a sample corpus along with the name of its classification; once the samples are in, the train method processes them:
var natural = require('natural'),
    classifier = new natural.BayesClassifier();
classifier.addDocument("my unit-tests failed.", 'software');
classifier.addDocument("tried the program, but it was buggy.", 'software');
classifier.addDocument("the drive has a 2TB capacity.", 'hardware');
classifier.addDocument("i need a new power supply.", 'hardware');
classifier.train();
By default the classifier will tokenize the corpus and stem it with a LancasterStemmer. You can use a PorterStemmer by passing it in to the BayesClassifier constructor as such:
var natural = require('natural'),
    stemmer = natural.PorterStemmer,
    classifier = new natural.BayesClassifier(stemmer);
With the classifier trained it can now classify documents via the classify method:
console.log(classifier.classify('did the tests pass?'));
console.log(classifier.classify('did you buy a new drive?'));
resulting in the output:
software
hardware
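If you want more than just the winning category, the classifier can also report its relative scores. A minimal sketch, assuming a version of “natural” that exposes the getClassifications method (it’s present in later releases; check yours):

// assumption: getClassifications returns one { label, value } pair
// per trained category, sorted by score
classifier.getClassifications('did the tests pass?').forEach(function(c) {
    console.log(c.label + ': ' + c.value);
});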
Similarly, the classifier can be trained on arrays rather than strings, bypassing tokenization and stemming. This allows the consumer to perform custom tokenization and stemming, if any at all, and is especially useful in non-natural-language scenarios.
classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');
classifier.train();
It's possible to persist and recall the results of training via the save method:
var natural = require('natural'),
    classifier = new natural.BayesClassifier();

classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');

classifier.train();

classifier.save('classifier.json', function(err, classifier) {
    // the classifier is saved to the classifier.json file!
});
The training could then be recalled later with the load method:
var natural = require('natural'),
    classifier = new natural.BayesClassifier();

natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
    console.log(classifier.classify('did the tests pass?'));
});
Conclusion
This concludes the current state of "natural". Like I said in the introduction, there is certainly room for improvement in terms of both accuracy and performance. Now that 0.0.17 has been released, features are frozen while I focus on improving both for 0.1.0.
Post-0.1.0 I intend to make "natural" more complete, slowly starting to match the NLTK with additional algorithms of all classifications and, hopefully, for additional languages. For that I humbly ask assistance. :)