Posts Tagged ‘algorithms’

porter stemming

Monday, November 24th, 2008

In 1980, Martin Porter published a stemming algorithm for English words; stemming is essentially the reduction of a word to its stem by removing its suffixes. Thus, for example, the word “stemming” can be reduced to its stem “stem” by use of Porter’s algorithm. This is immediately useful, since, when searching for the word “stemming”, we can also search for the word “stem”. Normally “stemming” will not match with “stem”, but, when reduced to its stem by suffix removal, it will.

The Porter algorithm removes about 150 common suffixes by way of an algorithm that occupies about 400 lines of Pascal or about 100 lines of Perl, courtesy of the latter’s regular expression library. For those looking for implementations of the algorithm in a variety of languages, the best starting point is Martin Porter’s own website,
tartarus.org/~martin/PorterStemmer/.

The algorithm is useful for searching (Google makes use of it) and a variety of language processing tasks. One interesting use is to find alternate bases for synonym searching. In my “weblex” text editor, when a user seeks synonyms for the word “interesting” the following happens:

  1. synonyms for “interesting” are sought in the synonym dictionary and presented to the user
  2. the stem “interest” is determined using the Porter algorithm
  3. synonyms for the stem are sought
  4. about 150 suffixes are added to the stem and, for each Frankenstinean suffixed form, synonyms are sought and presented to the user together with the suffixed form

One possible optimisation would be to check whether each suffixed form is an English word before seeking synonyms, but as the process descibed above completes in milliseconds this
does not seem to be necessary.

plurality

Wednesday, November 12th, 2008

We become accustomed to seeing system messages such as “3 file(s) deleted.” However, this smacks of laziness. One useful function that can be written in your programming language of predilection is plural(). The plural() function is fed parameters of the number and word to be inflected and returns the word approriately inflected. For example, if we call plural(3,'file') it should return ‘files’. When we call plural(1,'file') it should return ‘file’ in order that we can say “1 file deleted” and, oddly enough, plural(0,'file') should return ‘files’, in order that we can say “0 files deleted.”

Of course, English formation of plurals is not regular, so a good implementation of the plural() function would have to deal with the many exceptions in pluralisation, such as class=>classes, potato=>potatoes, data=>data, calf=calves etc. Even then, when the system message is “3 files were deleted”, we also have to deal with the declension of the verb “to be” so as not to end up spouting nonsense like “1 file(s) were deleted”; to avoid the endless complexities of declining irregular verbs, it seems best to avoid verbs in locations in system messages where number is important, i.e. “3 files deleted” rather than “3 files were deleted”.

So, how do we deal with the apparently overwhelming complexities of English grammar, to put out a simple but gramatically-correct system message when we delete a number of files? The resolution is surprisingly easy, achieved by modifying the form of the function plural() so that it takes three arguments: the number, the singular form of the word and the plural form of the word. Thus plural(3,'platypus','platypuses') will return ‘platypuses’, whereas plural(3,'radius','radii') will return ‘radii’, without the need for a long and complex function. Thus "1 "+plural(1,'file,'files')+plural(1,'was,'were')+" deleted" correctly issues
“1 file was deleted” whereas "3 "+plural(3,'file,'files')+plural(3,'was,'were')+" deleted" correctly issues “3 files were deleted”.

Here is the whole of my Perl plural() function:

#**
#* Returns a string with correct plural treatment.
#*
#* @author Modulus Pty. Ltd. - prh
#* @version 2008 1.0
#* @param $nr the number governing the result
#* @param $sstr the singular string e.g. 'match'
#* @param $pstr the plural string e.g. 'matches'
#* @return $sstr or $pstr
sub lib_plural{
   shift == 1 ? shift: $_[1];
}

merging html files

Wednesday, November 12th, 2008

This morning I took on the task of merging 200 articles which have been published over four years at Kevin Dwyer’s Change Factory website. The aim of the exercise is to create a compendium of change management and business process management advice from the articles; they will be sorted into appropriate chapters, Kevin will write some brief “glue” text and an introduction, but we want to do it without wasted effort.

Fortunately the articles are quite consistent in format, as they are written using a template. I wrote a Perl program to work through the articles directory, retrieve each article, remove its headers and footers, discard some unnecessary content, and add the article to the (large) merged HTML file. After testing its basic functionality and fine tuning for a couple of minor exceptions, the HTML file was created, weighing in at 1.1 MB, and then read in to Microsoft Word and saved as a Word document, ready for Kevin to add the glue text and introduction. The Perl program is, in itself, quite unremarkable; any programmer could write the same in a language of their choice in half an hour.

It is convenient to have each article start a new page, but as HTML does not have a paging paradigm, where to start? This turns out to be dead easy; the aforementioned Perl program creates a level 2 HTML header block for each article, e.g. <h2>This is an Article Title</h2> with the article’s title. When read into Microsoft Word, these were automatically transformed into Words “Heading 2″ elements. All that remained was the page break, and this was readily achieved by creating a style for the merged HTML document using the print-oriented part of the CSS 2 specification, viz:

h2 {
   page-break-before: always;
}

Word (which does have a paging paradigm) recognised the instruction and promptly inserted page breaks before each “Heading 2″ element, just what was wanted. It sometimes surprises me when things turn out to be so simple to achieve. Something a bit more sophisticated could be achieved using the CSS 2 “widows and orphans” controls, but as this Word document is a means to an end rather than a finished product, I heeded Voltaire’s advice that “The perfect is the enemy of the good.”

readability & markup ratios

Wednesday, November 12th, 2008

Following on from counting syllables, I have implemented a range of readability indices in my web editor (Flesch Reading Ease, Flesch Grade count, Gunning Fog Index, Coleman-Liau Index, SMOG Index and Automated Readability Index). In addition, for SGML marked-up pages, I have included a “Markup Ratio”, which is simply the percentage of a document’s characters which are markup characters. Some initial measurements show that this percentage varies from as low as 5% (a very simple HTML help page) to as high as 28% (Microsoft’s home page).

A low ratio (i.e. a high proportion of real content to markup) is said to be a good thing for certain Internet search engines, which favour simply marked-up pages in their page rankings. At this stage it looks to me that, for a real-world, commercial web page that a target of less than 20% is achievable but would require some thought and discipline, e.g. it is not going to be met by the post-modern replacement of <b>wow</b> with <span class=”bold”>wow</span>.

counting syllables

Wednesday, November 12th, 2008

Measures such as “Flesch Reading Ease” and “Gunning’s Fog Index” are measures of the ease of reading a document, based on the average sentence length (words per sentence) and the average syllables per word in the document. Counting sentences and words is relatively easy (ignoring, for the moment, implementation isuues such as underlines, numbers, quoted sentences, etc.) but at first glance counting the syllables in a word looks more challenging. In practice, there is a simple and easily implemented algorithm for counting syllables (in English).

Here’s the algorithm, demonstrated with the word “counted”.

  1. words of 3 or fewer letters return 1
  2. discard trailing “es” and “ed”, thus “counted” -> “count”
  3. discard trailing “e”, except where the ending is “le”
  4. remove all consecutive vowels, thus “count” -> “cont”
  5. count the remaining vowels
  6. add one to the count if the word starts with “mc”

Thus for “counted”, the result is 1. This looks rather surprising since the usual pronunciation of “counted” is as 2 syllables, but discarding “ed” works well for most words, e.g. “raced”, “ripped” etc. are generally pronounced as a monosyllabic sound. It is tempting to make an exception for words ending in “ted”, but English pronunciation is so replete with exceptions that simplicity and consistency with other implementations is probably more valuable than an imaginary exactitude.

The first paragraph of this blog entry has a Gunning Fog Index of 17.0, which is rather foggy; these readability measures reward short sentences with short words, which can be over-simplistic. Still, they are a useful way to attach a concrete measure to a tranche of text, e.g. the mission statement on a website.



Copyright © 2008 Modulus Pty. Ltd.
Were you looking for another 'modulus' site?

tags: Modulus,website,design, tools, applications, consultancies, designers, kpis, balanced scorecard, business process management, internal controls, analog log analysis, computer-based training, cbt, guestbook, link checking, locations, postcodes, distance calculation, find nearest postcode, modular process module, parcel post, calculate postal charges, risk management, as/nzs 4360:2004, sitemap, seo, site search, spell-check in forms

light source
Valid XHTML 1.0 Transitional home | about modulus | bpm e-books | modules | services | links | the blog | contact us Valid CSS!