Merging HTML files

This morning I took on the task of merging 200 articles published over four years on Kevin Dwyer’s Change Factory website. The aim is to create a compendium of change management and business process management advice from the articles. They will be sorted into appropriate chapters, and Kevin will write some brief “glue” text and an introduction, but we want to do all this without wasted effort.

Fortunately the articles are quite consistent in format, as they were written using a template. I wrote a Perl program to work through the articles directory, retrieve each article, remove its headers and footers, discard some unnecessary content, and append the article to the (large) merged HTML file. After I had tested its basic functionality and fine-tuned it for a couple of minor exceptions, the program produced the HTML file, weighing in at 1.1 MB, which was then read into Microsoft Word and saved as a Word document, ready for Kevin to add the glue text and introduction. The Perl program is, in itself, quite unremarkable; any programmer could write the same in a language of their choice in half an hour.
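For readers who want a starting point, here is a minimal sketch of the same idea in Python rather than the original Perl. It is not the program described above. The marker comments `<!-- article-start -->` and `<!-- article-end -->`, the function names, and the use of the page's `<title>` as the article title are all assumptions made for illustration; a real template will have its own landmarks.

```python
import re
from pathlib import Path

# Hypothetical template landmarks; substitute whatever the real template uses.
BODY_START = "<!-- article-start -->"
BODY_END = "<!-- article-end -->"


def extract_article(html):
    """Return (title, body) for one templated page, dropping header and footer."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else "Untitled"

    start = html.find(BODY_START)
    end = html.find(BODY_END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("template markers not found")

    # Keep only what sits between the two markers.
    body = html[start + len(BODY_START):end].strip()
    return title, body


def merge_articles(pages):
    """Join the pages into one fragment, each article under its own <h2>."""
    parts = []
    for page in pages:
        title, body = extract_article(page)
        parts.append(f"<h2>{title}</h2>\n{body}")
    return "\n".join(parts)


def merge_directory(directory):
    """Merge every .html file in a directory, in file-name order."""
    paths = sorted(Path(directory).glob("*.html"))
    return merge_articles([p.read_text(encoding="utf-8") for p in paths])
```

Pages that lack the markers raise an error instead of being merged silently, which makes the "minor exceptions" easy to spot on the first run.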

It is convenient to have each article start on a new page, but as HTML has no paging paradigm, where to start? This turns out to be dead easy. The aforementioned Perl program creates a level 2 HTML heading for each article containing the article’s title, e.g. <h2>This is an Article Title</h2>. When read into Microsoft Word, these were automatically transformed into Word’s “Heading 2” elements. All that remained was the page break, and this was readily achieved by creating a style for the merged HTML document using the print-oriented part of the CSS 2 specification, viz:

h2 {
   page-break-before: always;
}

Word (which does have a paging paradigm) recognised the instruction and promptly inserted page breaks before each “Heading 2” element, just what was wanted. It sometimes surprises me when things turn out to be so simple to achieve. Something a bit more sophisticated could be achieved using the CSS 2 “widows and orphans” controls, but as this Word document is a means to an end rather than a finished product, I heeded Voltaire’s advice that “The perfect is the enemy of the good.”
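As a rough illustration of where the style rule lives, the Python sketch below wraps a merged fragment in a complete HTML document with the rule embedded in the head. The function name `wrap_document`, the default title, and the `widows`/`orphans` values are assumptions for the example. Both properties are part of CSS 2, but whether Word honours them on import was not tested here.

```python
# The h2 rule is the one used above; the p rule is an untested extra.
PRINT_STYLE = """h2 {
   page-break-before: always;
}
p {
   orphans: 3;
   widows: 3;
}"""


def wrap_document(merged_body, title="Merged articles", style=PRINT_STYLE):
    """Return a full HTML document with the print style in its head."""
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        f'<style type="text/css">\n{style}\n</style>\n'
        "</head>\n<body>\n"
        f"{merged_body}\n"
        "</body>\n</html>\n"
    )
```

Writing the result of `wrap_document(...)` to a file gives a single document that can be opened in a word processor in the way described above.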




Copyright © 2008 Modulus Pty. Ltd.