porter stemming

November 24th, 2008

In 1980, Martin Porter published a stemming algorithm for English words; stemming is essentially the reduction of a word to its stem by removing its suffixes. Thus, for example, the word “stemming” can be reduced to its stem “stem” by use of Porter’s algorithm. This is immediately useful, since, when searching for the word “stemming”, we can also search for the word “stem”. Normally “stemming” will not match with “stem”, but, when reduced to its stem by suffix removal, it will.

The Porter algorithm removes about 150 common suffixes by way of an algorithm that occupies about 400 lines of Pascal or about 100 lines of Perl, courtesy of the latter’s regular expression library. For those looking for implementations of the algorithm in a variety of languages, the best starting point is Martin Porter’s own website,
tartarus.org/~martin/PorterStemmer/.

The algorithm is useful for searching (Google makes use of it) and a variety of language processing tasks. One interesting use is to find alternate bases for synonym searching. In my “weblex” text editor, when a user seeks synonyms for the word “interesting” the following happens:

  1. synonyms for “interesting” are sought in the synonym dictionary and presented to the user
  2. the stem “interest” is determined using the Porter algorithm
  3. synonyms for the stem are sought
  4. about 150 suffixes are added to the stem and, for each Frankensteinian suffixed form, synonyms are sought and presented to the user together with the suffixed form

One possible optimisation would be to check whether each suffixed form is an English word before seeking synonyms, but as the process described above completes in milliseconds this does not seem to be necessary.
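The four steps above can be sketched roughly as follows. This is a Python illustration rather than the weblex code: the synonym dictionary, the four-entry suffix list and the toy_stem() helper are all hypothetical stand-ins, and toy_stem() is a crude suffix stripper, not the Porter algorithm.

```python
# Toy data: a real system would use a full synonym dictionary and a real
# Porter stemmer. Every name here is a hypothetical stand-in.
SYNONYMS = {
    "interesting": ["absorbing", "engaging"],
    "interest": ["curiosity", "concern"],
    "interested": ["curious", "attentive"],
}
SUFFIXES = ["ed", "ing", "s", "ly"]  # the post uses about 150


def toy_stem(word):
    """Crude stand-in for the Porter algorithm: strip one known suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def synonym_candidates(word):
    """Return (form, synonyms) pairs following the four steps."""
    results = []
    seen = set()

    def look_up(form):
        # Record each form once, and only if the dictionary knows it.
        if form not in seen and form in SYNONYMS:
            results.append((form, SYNONYMS[form]))
        seen.add(form)

    look_up(word)            # step 1: the word itself
    stem = toy_stem(word)    # step 2: determine the stem
    look_up(stem)            # step 3: synonyms for the stem
    for suffix in SUFFIXES:  # step 4: stem plus each suffix
        look_up(stem + suffix)
    return results
```

Calling synonym_candidates("interesting") with this toy data yields entries for "interesting", "interest" and "interested"; suffixed forms that are not in the dictionary are silently skipped, which is why checking them against a word list first is not needed.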

functional documentation

November 16th, 2008

Software documentation and comments are highly variable in quality and usefulness. At the useless end of the spectrum are the detailed source comments which tell you what the next line of code is going to do. As the next line of code tells us exactly the same thing, this is at best pointless; at worst, when the comment does not accurately describe what the next line of code is going to do, it can be seriously misleading; here’s an example of what I’m describing:

#increase each object's value by 10
for (sort keys %objects){
$_ += 10;
}

Perl programmers will notice that the code does not increase the object values, but rather the keys of the %objects hash, so the comment is not only redundant but also plain wrong.

On the other hand, very high-level, abstract documentation which tells us what a program or function library does conceptually is only useful in a very high-level, abstract sort of way.

One type of documentation which is much more successful is the “Javadocs” style of functional documentation, which describes each function (or subroutine or procedure) principally in terms of its inputs and outputs. “Javadocs” is a documentation methodology maintained by Sun Microsystems, originally intended for documenting Java functions, but easily adapted to almost any other language. Here’s an example of the use of Javadocs on a PHP function:

/**
 * Returns the capability for a given capability name.
 *
 * @author Modulus Pty. Ltd. - prh
 * @version 2008 1.0
 * @param $id unique string id of the device
 * @param $name string name of the capability
 * @param $fallback boolean for considering fallback
 * @param $fallbackChain array of strings, where known, provide this to avoid unnecessary repetitive lookups
 * @return string capability
 */

function lib_getCapability($id, $name, $fallback, $fallbackChain) {
   ...
}

This documentation is immediately useful in the source code for developers maintaining or altering the source code. Furthermore, the effort required to create and maintain the documentation is quite limited in relation to the benefit derived. However, you may wish to publish an API to your functional library without publishing the source code. One way to do this is to use our javadoc.module,
which, for a modest $19.95, creates elegant, valid and conformant XHTML documentation from Perl, PHP, Javascript or Java source code. Here’s an example of the output generated:

lib_getCapability

function lib_getCapability($id, $name, $fallback, $fallbackChain)

Returns the capability for a given capability name.

author
Modulus Pty. Ltd. – prh
version
2008 1.0
param
$id unique string id of the device
$name string name of the capability
$fallback boolean for considering fallback
$fallbackChain array of strings, where known, provide this to avoid unnecessary repetitive lookups
return
string capability
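
To give a feel for how such output can be derived from the comment block, here is a minimal Python sketch that splits a Javadoc-style comment into a summary and its tags. It is an illustration only and does not describe how javadoc.module works; the function name parse_doc_comment is hypothetical.

```python
import re


def parse_doc_comment(comment):
    """Split a Javadoc-style comment into (summary, {tag: [values]})."""
    summary_lines = []
    tags = {}
    for raw in comment.splitlines():
        # Drop the comment delimiters and the leading "*" decoration.
        line = re.sub(r"^\s*(/\*\*|\*/|\*)\s?", "", raw).strip()
        if not line:
            continue
        match = re.match(r"@(\w+)\s+(.*)", line)
        if match:
            # Repeated tags such as @param accumulate in order.
            tags.setdefault(match.group(1), []).append(match.group(2))
        elif not tags:
            # Free text before the first tag is the summary.
            summary_lines.append(line)
    return " ".join(summary_lines), tags
```

The returned dictionary maps each tag name (author, version, param, return) to the list of lines that followed it, which is enough to lay out a page like the one above.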

variations on RACI

November 15th, 2008

RACI aide memoire

Before discussing RACI variants, let’s recall the basic RACI model; RACI is a technique used within BPM to clarify responsibilities
and accountabilities for processes. The term RACI is an acronym for “Responsible”, “Accountable”, “Consulted” and “Informed”, and the
technique of RACI is to determine, for each and every process, who is Responsible for the execution of the process, who is Accountable for the outcomes of the
process, who is Consulted in the execution of the process and who is Informed by the process. These four characteristics of people involved with managing business
processes describe each key player’s degree of involvement in and liability for the process.

There are a number of models which vary or extend the RACI model. These variants exist partly due to a slight initial awkwardness in distinguishing between the “Accountable” and “Responsible” roles
in the traditional RACI model.

Our purpose in discussing the variants here is to enable you to identify and understand them; we do not suggest that you adopt any of these variants as the benefits of the variants do not, in our view, outweigh the downside
of using a non-standard model. We believe that standard RACI has the necessary and sufficient role identifications to support process-based management.

RACI-VS

RACI-VS adds two roles, viz:

  • “Verifies” – the party that checks that a product or service matches its established standards and
  • “Signs-off” – the party who approves the verification step and authorises the release of the product or service

This variant adds minor roles and potentially introduces confusion between the two roles it introduces. Additionally, it is quite common for verification processes to be separate processes from the main processes which generate
the product or service to be verified and thus these roles may only exist in the verification process, again inviting some confusion between the verification role and the verification process.

CAIRO

CAIRO adds the “Omitted” role (or, as our American friends might prefer, “Out of the Loop”). This extension is potentially useful in identifying specific cases where an organisational role, which might be thought to be involved in a
process, is specifically not Accountable, Responsible, Consulted or Informed. On the negative side, it should be noted that most organisational roles are not involved in most processes, so this extension runs the risk
that numerous positions need to be identified as “Omitted” for each process.

RASCI

The RASCI extension adds the “Support” role to the basic model, allowing for people who are co-opted to support the “Responsible” role in performing the task. There is a subtle distinction
between the “Consulted” and “Support” roles, as the latter may be tasked with performing part of the work of the “Responsible” party.
Again, the value of the extension seems limited relative to the cost of non-standard terminology.

RACI (Alternate mapping)

The one variant of RACI which addresses the most common RACI difficulty is the alternate mapping where:

  • “Responsible” is the person responsible for the performance of the task; this is roughly equivalent to the standard meaning for “Accountable” and
  • “Assists” is the party who assists the “Responsible” party in performance of the work and may perform the bulk of the work

This variant appears, at first sight, to resolve the difficulty in standard RACI of understanding the relative accountabilities and responsibilities of the “Accountable” and “Responsible” roles but, in our view, only weakens the
proper understanding of the roles. The notion of someone whose role is purely to “Assist” weakens accountability and, consequently, makes process-based management more difficult to drive through.

This article is an extract from the “Teal Book” (Identifying, Documenting & Analysing Business Processes) available for on-line purchase at e-books.

plurality

November 12th, 2008

We become accustomed to seeing system messages such as “3 file(s) deleted.” However, this smacks of laziness. One useful function that can be written in your programming language of predilection is plural(). The plural() function is fed parameters of the number and word to be inflected and returns the word appropriately inflected. For example, if we call plural(3,'file') it should return ‘files’. When we call plural(1,'file') it should return ‘file’ in order that we can say “1 file deleted” and, oddly enough, plural(0,'file') should return ‘files’, in order that we can say “0 files deleted.”

Of course, English formation of plurals is not regular, so a good implementation of the plural() function would have to deal with the many exceptions in pluralisation, such as class=>classes, potato=>potatoes, data=>data, calf=>calves etc. Even then, when the system message is “3 files were deleted”, we also have to deal with the conjugation of the verb “to be” so as not to end up spouting nonsense like “1 file(s) were deleted”; to avoid the endless complexities of conjugating irregular verbs, it seems best to avoid verbs in locations in system messages where number is important, i.e. “3 files deleted” rather than “3 files were deleted”.

So, how do we deal with the apparently overwhelming complexities of English grammar, to put out a simple but grammatically correct system message when we delete a number of files? The resolution is surprisingly easy, achieved by modifying the form of the function plural() so that it takes three arguments: the number, the singular form of the word and the plural form of the word. Thus plural(3,'platypus','platypuses') will return ‘platypuses’, whereas plural(3,'radius','radii') will return ‘radii’, without the need for a long and complex function. Thus "1 "+plural(1,'file','files')+" "+plural(1,'was','were')+" deleted" correctly issues “1 file was deleted” whereas "3 "+plural(3,'file','files')+" "+plural(3,'was','were')+" deleted" correctly issues “3 files were deleted”.

Here is the whole of my Perl plural() function:

#**
#* Returns a string with correct plural treatment.
#*
#* @author Modulus Pty. Ltd. - prh
#* @version 2008 1.0
#* @param $nr the number governing the result
#* @param $sstr the singular string e.g. 'match'
#* @param $pstr the plural string e.g. 'matches'
#* @return $sstr or $pstr
sub lib_plural{
   shift == 1 ? shift: $_[1];
}
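
For readers who do not use Perl, the same three-argument idea can be sketched in Python. The helper deleted_message() is a hypothetical wrapper added only to show the function in use.

```python
def plural(number, singular, plural_form):
    """Return the singular form only when number is exactly 1."""
    return singular if number == 1 else plural_form


def deleted_message(count):
    """Build a message whose noun and verb both agree with count."""
    noun = plural(count, "file", "files")
    verb = plural(count, "was", "were")
    return f"{count} {noun} {verb} deleted"
```

Because the caller supplies both forms, irregular nouns and verbs need no special handling: deleted_message(1) gives "1 file was deleted", while a count of 0 or 3 takes the plural forms.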

merging html files

November 12th, 2008

This morning I took on the task of merging 200 articles which have been published over four years at Kevin Dwyer’s Change Factory website. The aim of the exercise is to create a compendium of change management and business process management advice from the articles; they will be sorted into appropriate chapters, Kevin will write some brief “glue” text and an introduction, but we want to do it without wasted effort.

Fortunately the articles are quite consistent in format, as they are written using a template. I wrote a Perl program to work through the articles directory, retrieve each article, remove its headers and footers, discard some unnecessary content, and add the article to the (large) merged HTML file. After testing its basic functionality and fine tuning for a couple of minor exceptions, the HTML file was created, weighing in at 1.1 MB, and then read in to Microsoft Word and saved as a Word document, ready for Kevin to add the glue text and introduction. The Perl program is, in itself, quite unremarkable; any programmer could write the same in a language of their choice in half an hour.

It is convenient to have each article start a new page, but as HTML does not have a paging paradigm, where to start? This turns out to be dead easy; the aforementioned Perl program creates a level 2 HTML header block for each article, e.g. <h2>This is an Article Title</h2> with the article’s title. When read into Microsoft Word, these were automatically transformed into Word’s “Heading 2” elements. All that remained was the page break, and this was readily achieved by creating a style for the merged HTML document using the print-oriented part of the CSS 2 specification, viz:

h2 {
   page-break-before: always;
}

Word (which does have a paging paradigm) recognised the instruction and promptly inserted page breaks before each “Heading 2” element, just what was wanted. It sometimes surprises me when things turn out to be so simple to achieve. Something a bit more sophisticated could be achieved using the CSS 2 “widows and orphans” controls, but as this Word document is a means to an end rather than a finished product, I heeded Voltaire’s advice that “The perfect is the enemy of the good.”
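
The merging step itself can be sketched in a few lines. The following Python illustration is not the Perl program described above: it assumes the headers and footers have already been stripped, so each article arrives as a (title, body) pair, and the function name merge_articles is hypothetical.

```python
import html

# The print-oriented rule from the post: start every article on a new page.
PAGE_BREAK_CSS = "h2 { page-break-before: always; }"


def merge_articles(articles):
    """Build one HTML document from (title, body_html) pairs."""
    parts = [
        "<html><head>",
        f'<style type="text/css">{PAGE_BREAK_CSS}</style>',
        "</head><body>",
    ]
    for title, body_html in articles:
        # Each title becomes a level 2 heading; the body is passed through.
        parts.append(f"<h2>{html.escape(title)}</h2>")
        parts.append(body_html)
    parts.append("</body></html>")
    return "\n".join(parts)
```

Titles are escaped so that an ampersand or angle bracket in a title cannot break the merged markup; the article bodies are assumed to be valid HTML fragments already.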

towards a gentler “captcha”

November 12th, 2008

In Internet terms, a “captcha” is a “Completely Automated Public Turing test to tell Computers and Humans Apart”, i.e. one of those images which requires you to replicate the text in order to pass to some next step, such as submitting a comment. The idea of captchas is not without its problems; for the disabled it can be a complete barrier to site access. Even for people without disabilities, entering some random alpha-numeric characters which have been deliberately distorted can be difficult.

Here are five examples of captchas which I consider to be quite difficult to get right:
[Images: five difficult captcha examples]

Over-simplification of captchas can mean that they can potentially be solved by computers. However, in the vast majority of cases the “treasure” protected by a captcha is not of sufficient worth to bother trying to break the security by OCR or artificial intelligence techniques. Indeed, for the vast majority of sites using a captcha, the enemy is a simple-minded spambot.

One approach I’m trialling at the moment is to use text that is not composed of random alpha-numeric characters, but is a valid English word, randomly selected from a largish subset of English words. We want a set of words of moderate length (say 6-9 characters) and which are reasonably familiar to most readers. We can, for example, use the names of vegetables; these names are reasonably familiar to most site visitors, tend to be about the right length and, consequently, a human site visitor has a much higher chance of typing in the right text in response to a word like “spinach”, whilst very little advantage is conferred to the simple-minded spambot by the fact that the word is a familiar English word. The use of real words decreases confusion between the letters “l”, “i” and “1”, as we know that “spinach” is spelled with an “i”, not a “1” or “l”.

Here is what I’m trialling at the moment:
[Image: easier captcha example]

This approach does reduce security and could be broken by a sophisticated program which combined image processing capability with dictionary lookup, but that doesn’t really concern me, as I am not using captchas to protect anything more valuable than the right to leave a comment in a guestbook or send an email.
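
The word-selection side of this idea can be sketched as follows. This Python illustration covers only choosing and checking the word, not rendering the distorted image; the vegetable list and the function names are hypothetical, and comparing case-insensitively is an assumption made for the visitor's convenience.

```python
import random

# Hypothetical word list; the post suggests vegetable names of 6-9 letters.
VEGETABLES = [
    "spinach", "carrot", "pumpkin", "broccoli", "parsnip", "zucchini",
    "cabbage", "celery", "lettuce", "radish", "pea", "cauliflower",
]


def candidate_words(words, min_len=6, max_len=9):
    """Keep only words of moderate length."""
    return [w for w in words if min_len <= len(w) <= max_len]


def pick_challenge(words, rng=random):
    """Choose the word to be rendered as the captcha image."""
    return rng.choice(candidate_words(words))


def is_correct(challenge, response):
    """Compare the visitor's answer, ignoring case and stray spaces."""
    return response.strip().lower() == challenge.lower()
```

Passing a seeded random.Random instance as rng makes the choice repeatable for testing, while the default uses the module-level generator.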

what are kpis?

November 12th, 2008

Key Performance Indicators (KPIs) are quantitative and qualitative measures used to measure an organisation’s performance. These are established as targets in a hierarchy, e.g. by departments and individuals. The achievement of these targets is reviewed regularly.

KPIs are used to monitor the performance of a company, department, business process or even an individual machine. They also help to establish and shape the culture of the organisation, i.e. KPIs aid in modifying individual and organisational behaviour.

KPIs need to adapt to the changing goals of the organisation, i.e. the KPIs need to be established in the context of the organisation’s goals. Goals change as the organisation changes in reaction to external factors or as it improves or worsens in relation to achievement of its goals.

KPIs are cascaded down from the organisation’s goals to departmental KPIs and down to individual KPIs, and need to reflect the organisation’s culture and values, by indicating the behaviours
and performances that the organisation will recognise as ‘successful’ and reward employees for.

KPIs need to be measurable and reflect a balance between operational and people-orientated measures.

KPIs are a fundamental component of sustaining a change process and maintaining a performance management culture. KPIs should be aligned with the organisation’s vision and direction and this is achieved by cascading the KPI sets down from the organisation’s goals.

When performance is measured, and the results are made visible, organisations can act to improve.

S.M.A.R.T. KPIs

The acronym S.M.A.R.T. is often used to describe well-formed KPIs. The elements of S.M.A.R.T. KPIs are Specific, Measurable, Achievable, Relevant and Timely.

specific

KPIs need to be specific to the individual job and if possible expressed as statements of actual on-the-job behaviours.

For example, a KPI should:

  • explain clearly to the employee how to perform to be successful
  • have an impact on successful job performance, i.e. distinguish between effective and ineffective performance
  • focus on the behaviour itself, rather than personality attributes such as ‘attitude to customers’.

Terms such as ‘work quality’, and ‘job knowledge’ are too vague to be of much use in and of themselves; KPIs should establish specific, quantifiable and measurable targets for ‘work quality’ and ‘job knowledge’.

measurable

KPIs must be measurable, that is based on behaviour that can be observed and documented, and which is job-related. They should also provide employees with continuous feedback on their standard of performance.

achievable

Performance management needs to be an open, two-way communication process. KPIs must be seen as achievable by all parties to the KPI. The KPI must be realistically achievable. If it is set too high for the circumstances (such as an ambitious production target) it will ensure failure.

relevant

It is essential that employees clearly understand the KPIs, and that they have the same meaning to both parties. Joint development of KPIs is more likely to result in relevant and valid standards than top-down edicts.

timely

KPIs should measure performance against an agreed time frame.

It should be possible to collect the relevant information either immediately or shortly thereafter and disseminate it quickly, otherwise it will lose its relevance.

HR-related functions, including training and development, recruitment and selection, rewards and recognition, career planning etc. must be aligned with the KPIs and must act in a manner supportive of the KPIs. Thus the tangible reward system should directly reward KPI-based performance.

business aspects that require KPIs

KPIs should cover all aspects of the business. The selected KPI sets should cover a balanced-scorecard of KPIs. Examples are:

  • customer satisfaction
  • employee satisfaction
  • staff turnover
  • absenteeism
  • departmental & divisional specific measures
  • triple bottom line: financial, environmental and social responsibility
  • finance including revenue and costs
  • OHS reporting including incidents and related costs
  • equipment usage and OEE
  • maintenance costs and effectiveness
  • new product development & innovation
  • lead times and down times
  • quality.

KPI components

KPIs should identify the required outcomes, for example:

  • the minimum acceptable performance e.g. daily break-even point
  • target performance e.g. desired daily output.

KPIs should be communicated to all staff so that they are aware of how they are to be measured and how their KPIs impact on the organisation as a whole. KPIs should also be aligned with the vision and direction of the organisation and have relevant reward and recognition criteria linked to each KPI.

When implementing new KPIs, having baseline data to measure improvements is very important. Progress on KPIs should be communicated at regular times to highlight emerging trends. As these trends emerge, corrective action can be implemented in a timely fashion. KPIs need to be communicated via multiple media.

The measures that are selected must be carefully specified to ensure they do not cause unintended behaviours. There needs to be a balance of qualitative and quantitative factors to encourage the correct behaviours.

Listed below are some examples of the behaviours and outcomes that ill-considered KPIs can cause.

Measurement area | Behaviour | Outcome
Production output | Make more | Overproduction
Machine efficiency | Run machine longer; run in most efficient sequence for machine | Unnecessary stock; customer orders late
Maintenance costs | Reduction in maintenance activities to reduce costs | Machine breakdowns
Cash flow performance | Pay suppliers as late as possible | Supplier deliveries unreliable

creating KPIs

This section addresses the practical matter of how to generate candidate KPIs; in other words, given a set of business processes, how can we go about generating a good set of candidate KPIs for those processes. We will examine several useful aids for generating KPIs and then describe the “in practice” process of generation.

cascading corporate goals down

Many ways of generating KPIs for a department or sector of an organisation can be used, but since at some point in the process the KPIs will need to be assessed for their contribution to corporate goals, one approach which can shortcut some work is to determine the department’s goals by cascading the corporate goals down to the department level.

A department’s goals should contribute to the corporate goals in the appropriate manner for the department’s nature. For example, given a corporate goal of “Differentiate top-tier products by market-leading quality”, the Purchasing Department’s goals might include “Seek alternate suppliers with higher quality components at competitive rates and supply conditions”, i.e. the departmental role contributes to the corporate role within the limitations of the department’s raison d’etre.

Once the department’s goals are known and agreed, there is a basis for designing metrics. KPIs can be chosen which not only satisfy the fundamental properties of KPIs (key indicators of process performance) but which also can be directly understood in light of the department’s goals. This approach significantly facilitates generating KPIs.

KPI examples

We provide over 500 common KPIs for your consideration at modulus KPIs.

This note has been a high-level summary of what a KPI is. For a much more detailed analysis of KPIs, and especially of choosing KPI sets, consider our e-book “Generating and Selecting KPI Sets”.

line length, white space, justification & hyphen-ation

November 12th, 2008

introduction

The way in which people read text has been extensively studied since the 19th century, yet when one looks for definitive information on choosing an appropriate line-length for a website for optimum legibility, a wide range of conclusions is found. Recent experiments have shown that on-screen reading speed increases with longer line lengths (testing in the range 35 – 95 characters per line). Other experiments have shown that line length had little effect on readability, but that readability was substantially affected by the size of the surrounding white-space margins, with large margins substantially enhancing readability. Another recent experiment showed that margins slow reading speed, but increase comprehension. These studies are not necessarily conflicting; taken together they indicate that line length itself is not as key as might have been thought, but providing adequate white-space margins (which, coincidentally, will somewhat decrease line length) is important to comprehension, which is even more important to a commercial website than reading speed.

My local newspaper (The Age) uses a principal layout comprising eight columns, each only 45 mm. wide, which results in 28 – 34 characters per line – so short that, in combination with the fact that the text is justified, about 15% of lines end in a hyphen.

Unlike a newspaper or magazine, which has a set size combined with a set font and font size, the website visitor has a high degree of control over properly-designed websites and can change the overall size, text size, font size and even font family. This degree of flexibility more or less rules out hyphenation of text, which, in contrast with newspapers and magazines, means that very narrow columns work less well.

Justification (i.e. where inter-word padding is inserted to make all lines equal in measured length) is the rule rather than the exception in books, magazines and newspapers, but the exception rather than the rule on websites. Justification looks much better when combined with hyphenation, and is therefore less appropriate for websites. In addition, only a limited range of browsers support justification.

techniques

The simplest and cleanest way to ensure that you have adequate white-space margins around text is to set the margins directly in CSS e.g.
margin-left: 15%;
margin-right: 15%
etc.

Note that it is preferable to set these margins as percentages of the available width (rather than absolute values) so that they adjust appropriately when the page is resized. Most websites will, however, be more complex in their layout and we must also bear in mind that “white-space” is not necessarily unoccupied.

Setting CSS to justify text (text-align: justify;) or to right-justify text (text-align: right;) works with Mozilla Firefox 2.0.0.13 and its cousin Netscape Navigator 8.0.4. With Internet Explorer 6.0.28
or Opera 9.27, these justifications can be made to work, but the inheritance mechanism in their CSS implementations is not correct and so each element that is to be justified (other than the default ‘left’) has to have the justification specifically set i.e.
body {text-align: justify;}
may not be sufficient; you may need, for example,
body {text-align: justify;}
td {text-align: justify;}
to reliably achieve the justification.

This article is an extract from the “Saffron Book” (Good Practice for Commercial Website Design) available for on-line purchase at e-books.

readability & markup ratios

November 12th, 2008

Following on from counting syllables, I have implemented a range of readability indices in my web editor (Flesch Reading Ease, Flesch Grade count, Gunning Fog Index, Coleman-Liau Index, SMOG Index and Automated Readability Index). In addition, for SGML marked-up pages, I have included a “Markup Ratio”, which is simply the percentage of a document’s characters which are markup characters. Some initial measurements show that this percentage varies from as low as 5% (a very simple HTML help page) to as high as 28% (Microsoft’s home page).

A low ratio (i.e. a high proportion of real content to markup) is said to be a good thing for certain Internet search engines, which favour simply marked-up pages in their page rankings. At this stage it looks to me that, for a real-world commercial web page, a target of less than 20% is achievable but would require some thought and discipline, e.g. it is not going to be met by the post-modern replacement of <b>wow</b> with <span class="bold">wow</span>.
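
One plausible way to compute such a ratio is sketched below in Python. It assumes that "markup characters" means everything between an opening "<" and the next ">", brackets included; the editor's actual counting rules are not given in the post, and this simple pattern ignores subtleties such as entities, comments and script content.

```python
import re

# Anything from "<" to the next ">" is treated as markup.
TAG_PATTERN = re.compile(r"<[^>]*>")


def markup_ratio(document):
    """Percentage of the document's characters that sit inside tags."""
    if not document:
        return 0.0
    markup_chars = sum(len(tag) for tag in TAG_PATTERN.findall(document))
    return 100.0 * markup_chars / len(document)
```

Applied to the two fragments above, the <b> version has 7 markup characters out of 10, while the <span class="bold"> version has 26 out of 29, which is the point about the replacement working against a low ratio.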

counting syllables

November 12th, 2008

Measures such as “Flesch Reading Ease” and “Gunning’s Fog Index” are measures of the ease of reading a document, based on the average sentence length (words per sentence) and the average syllables per word in the document. Counting sentences and words is relatively easy (ignoring, for the moment, implementation issues such as underlines, numbers, quoted sentences, etc.) but at first glance counting the syllables in a word looks more challenging. In practice, there is a simple and easily implemented algorithm for counting syllables (in English).

Here’s the algorithm, demonstrated with the word “counted”.

  1. words of 3 or fewer letters return 1
  2. discard trailing “es” and “ed”, thus “counted” -> “count”
  3. discard trailing “e”, except where the ending is “le”
  4. collapse each run of consecutive vowels to a single vowel, thus “count” -> “cont”
  5. count the remaining vowels
  6. add one to the count if the word starts with “mc”

Thus for “counted”, the result is 1. This looks rather surprising since the usual pronunciation of “counted” is as 2 syllables, but discarding “ed” works well for most words, e.g. “raced”, “ripped” etc. are generally pronounced as a monosyllabic sound. It is tempting to make an exception for words ending in “ted”, but English pronunciation is so replete with exceptions that simplicity and consistency with other implementations is probably more valuable than an imaginary exactitude.
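
The six steps can be sketched in Python as follows. Two details are assumptions rather than part of the algorithm as stated: "y" is treated as a vowel, and the result is never allowed to fall below 1.

```python
import re

VOWELS = "aeiouy"  # assumption: the post does not say whether "y" counts


def count_syllables(word):
    """Estimate the syllables in an English word using the six steps."""
    word = word.lower()
    if len(word) <= 3:                                   # step 1
        return 1
    word = re.sub(r"(es|ed)$", "", word)                 # step 2
    if word.endswith("e") and not word.endswith("le"):   # step 3
        word = word[:-1]
    # Step 4: keep only the first vowel of each run, e.g. "count" -> "cont".
    word = re.sub(rf"([{VOWELS}])[{VOWELS}]+", r"\1", word)
    count = sum(1 for ch in word if ch in VOWELS)        # step 5
    if word.startswith("mc"):                            # step 6
        count += 1
    return max(count, 1)  # assumption: every word has at least one syllable
```

As in the worked example, count_syllables("counted") returns 1, while "table" keeps its final "le" and returns 2.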

The first paragraph of this blog entry has a Gunning Fog Index of 17.0, which is rather foggy; these readability measures reward short sentences with short words, which can be over-simplistic. Still, they are a useful way to attach a concrete measure to a tranche of text, e.g. the mission statement on a website.
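
For reference, the two measures named in this entry can be computed from simple counts, as in the Python sketch below. The formulas are the standard published ones; how words, sentences and syllables are counted is left to the caller, so results may differ from those of any particular editor. Gunning's index counts "complex" words, conventionally those of three or more syllables.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: roughly the years of schooling needed."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))
```

Both formulas penalise long sentences and long words, which is why short sentences built from short words score as easy reading.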



Copyright © 2008 Modulus Pty. Ltd.