Posts Tagged ‘software development’

a slimmer thesaurus

Tuesday, December 23rd, 2008

As I often do, when planning development of my text processor I consider first what capability Microsoft Word has. I have Word 2000; its “Thesaurus” function is really a list of synonyms (with the occasional antonym) rather than a proper thesaurus. When you look up “updated”, the Thesaurus form looks like this:
[Screenshot: Word 2000 thesaurus]
It’s not very pretty. There are four input boxes, four buttons, three labels and a control alignment which appears to be largely random. The substantive content is a flat list of synonyms. It occurred to me that there might be better designs available.

After some trial and error, I have ended up with the following design in my text processor, Weblex:

[Screenshot: Weblex thesaurus]
The synonym list uses a Windows tree view; when you select a synonym by clicking on it, synonyms for that synonym are presented, as is the case for “modernism” displayed above.

There are two input boxes, three buttons and one label. One of the buttons (“Find on WWW”) is additional to the Word functionality, so the net reduction in clutter is quite significant.

The moral of this story is that, despite the overwhelming number of designers, software engineers, ergonomists, hair-dressers and spin doctors in the larger software development firms, it is still not only possible but indeed feasible realistic practical workable viable to compete with them on good design.

porter stemming

Monday, November 24th, 2008

In 1980, Martin Porter published a stemming algorithm for English words; stemming is essentially the reduction of a word to its stem by removing its suffixes. Thus, for example, the word “stemming” can be reduced to its stem “stem” by use of Porter’s algorithm. This is immediately useful, since, when searching for the word “stemming”, we can also search for the word “stem”. Normally “stemming” will not match with “stem”, but, when reduced to its stem by suffix removal, it will.

The Porter algorithm removes about 150 common suffixes by way of an algorithm that occupies about 400 lines of Pascal or about 100 lines of Perl, courtesy of the latter’s regular expression library. For those looking for implementations of the algorithm in a variety of languages, the best starting point is Martin Porter’s own website, tartarus.org/~martin/PorterStemmer/.

The algorithm is useful for searching (Google makes use of it) and a variety of language processing tasks. One interesting use is to find alternate bases for synonym searching. In my “weblex” text editor, when a user seeks synonyms for the word “interesting” the following happens:

  1. synonyms for “interesting” are sought in the synonym dictionary and presented to the user
  2. the stem “interest” is determined using the Porter algorithm
  3. synonyms for the stem are sought
  4. about 150 suffixes are added to the stem and, for each Frankensteinian suffixed form, synonyms are sought and presented to the user together with the suffixed form

One possible optimisation would be to check whether each suffixed form is an English word before seeking synonyms, but as the process described above completes in milliseconds this does not seem to be necessary.
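
As a rough illustration of the four steps above, here is a minimal Python sketch. It is not the Weblex code: the synonym dictionary, the five-entry suffix list and the names crude_stem and expand_synonyms are all made up for the example, and crude_stem is a deliberately naive stand-in for the real Porter stemmer.

```python
# Toy data: both the dictionary and the suffix list are illustrative assumptions.
SYNONYMS = {
    "interesting": ["absorbing", "engaging"],
    "interest": ["curiosity", "concern"],
    "interested": ["curious", "attentive"],
}
SUFFIXES = ["s", "ed", "ing", "ingly", "less"]  # the post describes about 150


def crude_stem(word):
    """Naive stand-in for the Porter stemmer: strip the longest known suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_synonyms(word):
    """Return (form, synonyms) pairs for the word, its stem and suffixed forms."""
    results = []
    seen = set()

    def look_up(form):
        # Record each form once, and only if the dictionary knows it.
        if form not in seen and form in SYNONYMS:
            results.append((form, SYNONYMS[form]))
        seen.add(form)

    look_up(word)                # step 1: the word itself
    stem = crude_stem(word)      # step 2: determine the stem
    look_up(stem)                # step 3: synonyms for the stem
    for suffix in SUFFIXES:      # step 4: the stem plus each suffix
        look_up(stem + suffix)
    return results


for form, synonyms in expand_synonyms("interesting"):
    print(form, "->", ", ".join(synonyms))
```

Suffixed forms that are not in the dictionary (such as “interestless”) simply produce no entry, which is why checking each form against a word list first is not strictly needed.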

functional documentation

Sunday, November 16th, 2008

Software documentation and comments are highly variable in quality and usefulness. At the useless end of the spectrum are the detailed source comments which tell you what the next line of code is going to do. As the next line of code tells us exactly the same thing, this is at best pointless; at worst, when the comment does not accurately describe what the next line of code is going to do, it can be seriously misleading; here’s an example of what I’m describing:

#increase each object's value by 10
for (sort keys %objects){
    $_ += 10;
}

Perl programmers will notice that the code never touches the object values: it adds 10 to a sorted copy of the keys of the %objects hash and leaves the hash itself unchanged, so the comment is not only redundant but also plain wrong.

On the other hand, very high-level, abstract documentation which tells us what a program or function library does conceptually is only useful in a very high-level, abstract sort of way.

One type of documentation which is much more successful is the “Javadocs” style of functional documentation, which describes each function (or subroutine or procedure) principally in terms of its inputs and outputs. “Javadocs” is a documentation methodology maintained by Sun Microsystems, originally intended for documenting Java functions, but easily adapted to almost any other language. Here’s an example of the use of Javadocs on a PHP function:

/**
 * Returns the capability for a given capability name.
 *
 * @author Modulus Pty. Ltd. - prh
 * @version 2008 1.0
 * @param $id unique string id of the device
 * @param $name string name of the capability
 * @param $fallback boolean for considering fallback
 * @param $fallbackChain array of strings, where known, provide this to avoid unnecessary repetitive lookups
 * @return string capability
 */
function lib_getCapability($id, $name, $fallback, $fallbackChain) {
   ...
}

This documentation is immediately useful in the source code for developers maintaining or altering that code. Furthermore, the effort required to create and maintain the documentation is quite limited in relation to the benefit derived. However, you may wish to publish an API to your function library without publishing the source code. One way to do this is to use our javadoc.module, which, for a modest $19.95, creates elegant, valid and conformant XHTML documentation from Perl, PHP, Javascript or Java source code. Here’s an example of the output generated:

lib_getCapability

function lib_getCapability($id, $name, $fallback, $fallbackChain)

Returns the capability for a given capability name.

author
Modulus Pty. Ltd. – prh
version
2008 1.0
param
$id unique string id of the device
$name string name of the capability
$fallback boolean for considering fallback
$fallbackChain array of strings, where known, provide this to avoid unnecessary repetitive lookups
return
string capability
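
To give a feel for how little machinery this style of documentation needs, here is a minimal Python sketch that splits a doc comment into a summary and a set of grouped tags, roughly the structure shown in the output above. It is not how javadoc.module works internally; the function name parse_docblock and the shortened example comment are assumptions made for the illustration.

```python
import re


def parse_docblock(comment):
    """Split a /** ... */ comment into (summary, {tag: [values]})."""
    summary = []
    tags = {}
    last_tag = None  # the tag whose value a plain line would continue
    for raw in comment.splitlines():
        line = raw.strip()
        # Remove the comment delimiters and the leading "*" decoration.
        if line.startswith("/**"):
            line = line[3:]
        if line.endswith("*/"):
            line = line[:-2]
        line = line.lstrip("*").strip()
        if not line:
            continue
        match = re.match(r"@(\w+)\s+(.*)", line)
        if match:
            last_tag = match.group(1)
            tags.setdefault(last_tag, []).append(match.group(2))
        elif last_tag is not None:
            tags[last_tag][-1] += " " + line  # continuation of a tag value
        else:
            summary.append(line)
    return " ".join(summary), tags


EXAMPLE = """
/**
 * Returns the capability for a given capability name.
 *
 * @author Modulus Pty. Ltd. - prh
 * @param $id unique string id of the device
 * @param $name string name of the capability
 * @return string capability
 */
"""

summary, tags = parse_docblock(EXAMPLE)
print(summary)
for tag, values in tags.items():
    print(tag)
    for value in values:
        print("  " + value)
```

Grouping repeated tags such as @param under one heading is what allows the generated page to list all parameters together.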

readability & markup ratios

Wednesday, November 12th, 2008

Following on from counting syllables, I have implemented a range of readability indices in my web editor (Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index and Automated Readability Index). In addition, for SGML marked-up pages, I have included a “Markup Ratio”, which is simply the percentage of a document’s characters which are markup characters. Some initial measurements show that this percentage varies from as low as 5% (a very simple HTML help page) to as high as 28% (Microsoft’s home page).

A low ratio (i.e. a high proportion of real content to markup) is said to be a good thing for certain Internet search engines, which favour simply marked-up pages in their page rankings. At this stage it looks to me as though, for a real-world, commercial web page, a target of less than 20% is achievable but would require some thought and discipline, e.g. it is not going to be met by the post-modern replacement of <b>wow</b> with <span class="bold">wow</span>.
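
A minimal Python sketch of the markup ratio might look like the following. It assumes that “markup characters” means everything inside angle-bracket tags; how the editor itself treats comments, entities or embedded scripts is not stated here, and the name markup_ratio is made up for the example.

```python
import re

# Anything from "<" to the next ">" is counted as markup (an assumption).
TAG_PATTERN = re.compile(r"<[^>]*>")


def markup_ratio(document):
    """Percentage of the document's characters that sit inside <...> tags."""
    if not document:
        return 0.0
    markup_chars = sum(len(tag) for tag in TAG_PATTERN.findall(document))
    return 100.0 * markup_chars / len(document)


print(markup_ratio("<b>wow</b>"))
print(markup_ratio('<span class="bold">wow</span>'))
```

For the three-letter “wow” the first call gives 70.0, and the span version scores higher still, which is the point of the comparison above.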

counting syllables

Wednesday, November 12th, 2008

Measures such as “Flesch Reading Ease” and “Gunning’s Fog Index” are measures of the ease of reading a document, based on the average sentence length (words per sentence) and the average syllables per word in the document. Counting sentences and words is relatively easy (ignoring, for the moment, implementation issues such as underlines, numbers, quoted sentences, etc.) but at first glance counting the syllables in a word looks more challenging. In practice, there is a simple and easily implemented algorithm for counting syllables (in English).

Here’s the algorithm, demonstrated with the word “counted”.

  1. words of 3 or fewer letters return 1
  2. discard trailing “es” and “ed”, thus “counted” -> “count”
  3. discard trailing “e”, except where the ending is “le”
  4. collapse each run of consecutive vowels to a single vowel, thus “count” -> “cont”
  5. count the remaining vowels
  6. add one to the count if the word starts with “mc”

Thus for “counted”, the result is 1. This looks rather surprising since the usual pronunciation of “counted” is as 2 syllables, but discarding “ed” works well for most words, e.g. “raced”, “ripped” etc. are generally pronounced as a single syllable. It is tempting to make an exception for words ending in “ted”, but English pronunciation is so replete with exceptions that simplicity and consistency with other implementations are probably more valuable than an imaginary exactitude.
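
The six steps translate almost line for line into code. The Python sketch below follows them literally; treating “y” as a vowel is an assumption (the steps do not say either way), the name count_syllables is made up for the example, and no minimum of one syllable is enforced because the steps do not mention one.

```python
import re

VOWELS = "aeiouy"  # counting "y" as a vowel is an assumption


def count_syllables(word):
    """Estimate the syllables in an English word using the six steps above."""
    word = word.lower()
    if len(word) <= 3:                                   # step 1
        return 1
    word = re.sub(r"(es|ed)$", "", word)                 # step 2
    if word.endswith("e") and not word.endswith("le"):   # step 3
        word = word[:-1]
    word = re.sub(r"([aeiouy])[aeiouy]+", r"\1", word)   # step 4
    count = sum(1 for ch in word if ch in VOWELS)        # step 5
    if word.startswith("mc"):                            # step 6
        count += 1
    return count


for sample in ["counted", "raced", "table", "mcdonald"]:
    print(sample, count_syllables(sample))
```

As in the worked example, count_syllables("counted") returns 1, while “table” keeps its final “le” and comes out as 2.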

The first paragraph of this blog entry has a Gunning Fog Index of 17.0, which is rather foggy; these readability measures reward short sentences with short words, which can be over-simplistic. Still, they are a useful way to attach a concrete measure to a tranche of text, e.g. the mission statement on a website.
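
For reference, the commonly published formulas behind two of these measures are short enough to sketch in Python. The function names are made up for the example, and the sentence, word and syllable counts are assumed to have been worked out already; note that the Fog Index counts “complex” words (three or more syllables) rather than averaging syllables per word.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: roughly the years of schooling needed."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


# A made-up passage: 100 words in 5 sentences, 150 syllables, 10 complex words.
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(gunning_fog(100, 5, 10), 1))
```

Both formulas fall as sentences and words get shorter, which is why these measures reward short sentences with short words.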



Copyright © 2008 Modulus Pty. Ltd.