Porter stemming

In 1980, Martin Porter published a stemming algorithm for English words; stemming is essentially the reduction of a word to its stem by removing its suffixes. For example, Porter’s algorithm reduces the word “stemming” to its stem “stem”. This is immediately useful for searching: a literal search for “stemming” will not match “stem”, but once both words are reduced to their stems by suffix removal, they match.
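To make the matching idea concrete, here is a minimal sketch in Python. It is not the Porter algorithm: `toy_stem` and its three-suffix list are hypothetical stand-ins that only strip a few endings and undo a doubled final consonant, which is just enough to show why stemmed forms match where literal strings do not.

```python
# Toy suffix list, assumed for illustration; the real algorithm handles many more
# suffixes and applies them in ordered steps with extra conditions.
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one suffix from the toy list and undouble a trailing consonant."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        # Only strip if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            # "stemm" -> "stem"; vowels and l, s, z are left doubled.
            if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeioulsz":
                base = base[:-1]
            return base
    return word


def stems_match(query, candidate):
    """Compare two words by their toy stems rather than literally."""
    return toy_stem(query) == toy_stem(candidate)


print("stemming" == "stem")             # literal comparison fails
print(stems_match("stemming", "stem"))  # stem comparison succeeds
```

In a real search tool the stand-in would be replaced by one of the implementations linked from Martin Porter’s site.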

The Porter algorithm removes about 150 common suffixes and occupies about 400 lines of Pascal, or about 100 lines of Perl thanks to the latter’s built-in regular expressions. For those looking for implementations of the algorithm in a variety of languages, the best starting point is Martin Porter’s own website, tartarus.org/~martin/PorterStemmer/.

The algorithm is useful for searching (Google makes use of it) and a variety of language processing tasks. One interesting use is to find alternate bases for synonym searching. In my “weblex” text editor, when a user seeks synonyms for the word “interesting” the following happens:

  1. synonyms for “interesting” are sought in the synonym dictionary and presented to the user
  2. the stem “interest” is determined using the Porter algorithm
  3. synonyms for the stem are sought
  4. about 150 suffixes are added to the stem and, for each Frankensteinian suffixed form, synonyms are sought and presented to the user together with the suffixed form
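The four steps above can be sketched roughly as follows. This is not the “weblex” code: the synonym dictionary, the short suffix list (standing in for the roughly 150 suffixes) and the `strip_ing` stemmer are all hypothetical placeholders chosen so the example runs on its own.

```python
# Hypothetical synonym dictionary; a real one would be far larger.
SYNONYMS = {
    "interesting": ["absorbing", "engaging"],
    "interest": ["curiosity", "concern"],
    "interested": ["curious", "attentive"],
}

# Short stand-in for the roughly 150 suffixes mentioned in step 4.
SUFFIXES = ["ed", "ing", "s", "ly", "ness"]


def strip_ing(word):
    """Placeholder for a Porter stemmer: only removes a trailing 'ing'."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def synonym_candidates(word, stem=strip_ing, synonyms=SYNONYMS, suffixes=SUFFIXES):
    """Return (form, synonyms) pairs for the word, its stem and suffixed forms."""
    results = []
    seen = set()

    def look_up(form):
        # Skip forms already tried, and forms with no dictionary entry.
        if form in seen:
            return
        seen.add(form)
        found = synonyms.get(form)
        if found:
            results.append((form, found))

    look_up(word)              # step 1: the word itself
    base = stem(word)          # step 2: determine the stem
    look_up(base)              # step 3: synonyms for the stem
    for suffix in suffixes:    # step 4: every suffixed form of the stem
        look_up(base + suffix)
    return results


for form, found in synonym_candidates("interesting"):
    print(form, "->", ", ".join(found))
```

Non-words such as “interestly” simply find no entry and are dropped, so they never reach the user.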

One possible optimisation would be to check whether each suffixed form is an English word before seeking synonyms, but as the process described above completes in milliseconds this does not seem to be necessary.
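For completeness, the optional check could look something like this sketch, assuming a set of known English words is available in memory; the `ENGLISH_WORDS` set and the `plausible_forms` helper are hypothetical names used only for illustration.

```python
# Hypothetical word list; in practice this would be loaded from a dictionary file.
ENGLISH_WORDS = {"interest", "interested", "interesting", "interests"}


def plausible_forms(base, suffixes, known_words=ENGLISH_WORDS):
    """Keep only the suffixed forms that appear in the word list."""
    return [base + suffix for suffix in suffixes if base + suffix in known_words]


print(plausible_forms("interest", ["ed", "ing", "ly", "ness", "s"]))
```

Only the forms that survive this filter would then be passed to the synonym lookup.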



Copyright © 2008 Modulus Pty. Ltd.