Anglicise

A mini-app for processing text, converting American spellings to British English.

I developed Anglicise for my own use; several people have expressed an interest, so I'm making it available. I'm happy to answer questions and hear comments and suggestions, but this isn't an 'official' supported release, so caveat executor...

Usage

Anglicise is written in Java, so it should be usable almost anywhere. (At least, anywhere with a computer.) Source is included; it needs JDK 1.2 or higher.

It's intended as a small, neat tool rather than a big, powerful app, so the interface is fairly minimal. You can run it from the command line, giving the names of files to be processed; each gets replaced by the processed version, with the original renamed to *.bak. With no parameters, or if launched from a GUI, it will let you select file(s) to process in a file dialog.
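As a rough illustration of the replace-and-keep-a-backup behaviour described above, here is a minimal sketch. It is not taken from the Anglicise source: the class and method names are hypothetical, it assumes the backup name is formed by appending .bak to the full file name, it assumes UTF-8 text, and it uses java.nio.file, which needs a much newer JDK than 1.2.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Anglicise: the "replace in place, keep a .bak" pattern.
final class BackupReplaceSketch {

    /** Replaces the contents of file with newText, keeping the original as <name>.bak. */
    static Path replaceKeepingBackup(Path file, String newText) throws IOException {
        // Assumption: "report.txt" is backed up as "report.txt.bak".
        Path backup = file.resolveSibling(file.getFileName() + ".bak");
        // Write the new text to a temporary file first, so a failed write
        // never leaves a half-written file under the original name.
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "anglicise", ".tmp");
        Files.write(temp, newText.getBytes(StandardCharsets.UTF_8));
        // Rename the original to the backup name (overwriting any older backup),
        // then move the processed version into its place.
        Files.move(file, backup, StandardCopyOption.REPLACE_EXISTING);
        Files.move(temp, file);
        return backup;
    }
}
```

Writing to a temporary file before renaming anything is a design choice of this sketch; it means the original is only touched once the processed text is safely on disk.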

Certain replacements depend upon context (e.g. 'meter' could refer to a unit of length, spelled 'metre' in British English, or to a measuring device, which stays 'meter'): a dialog is shown asking the user about these cases. If you specify the -i flag, it shows a dialog for every replacement; with -n, it shows none.
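The prompting rules above boil down to a small decision table, which might be sketched like this. Only the -i and -n flags and their described effect come from the text; the class, enum and method names are hypothetical, and the choice that the first flag found wins is an assumption of the sketch.

```java
// Hypothetical names throughout; only the -i and -n flags come from the description above.
final class PromptPolicySketch {

    enum Mode { DEFAULT, ASK_ALL, ASK_NONE }

    /** Maps command-line arguments to a mode. Assumption: the first flag found wins. */
    static Mode modeFor(String[] args) {
        for (String arg : args) {
            if (arg.equals("-i")) return Mode.ASK_ALL;   // dialog for every replacement
            if (arg.equals("-n")) return Mode.ASK_NONE;  // no dialogs at all
        }
        return Mode.DEFAULT;
    }

    /** Decides whether a dialog is shown for one candidate replacement. */
    static boolean shouldAsk(Mode mode, boolean contextDependent) {
        switch (mode) {
            case ASK_ALL:  return true;
            case ASK_NONE: return false;
            default:       return contextDependent; // by default, ask only about ambiguous words
        }
    }
}
```

What happens to a context-dependent word such as 'meter' when no dialog is shown is not covered by this sketch.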

Replacements

The list of words to replace is stored in a text file, which can be edited. (Take a shufti at it for details of the format, &c.) I've compiled it from some dictionary entries, web sites, and observation over a long period. Some changes are a matter of personal taste rather than right or wrong, so I make no excuses for it reflecting my own preferences. It also includes some non-US-related preferences, fixes for common errors, and some diacritics and accented characters in foreign words and names. It currently runs to about 1600 separate replacements.

Setup

The top-level class is uk.co.cix.gidds.text.Anglicise; you can run it from the command line with a command like:

java uk.co.cix.gidds.text.Anglicise <file>...

(assuming your classpath is set up). There's only one complication: the text file listing the replacements to perform should be in the classpath as well.

Alternatively, you can run it straight from the .jar file, which includes that file.
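One plausible reason for the classpath requirement is that the replacement list is read as a classpath resource, which is also what would let it be found inside the .jar. That is an assumption, not something confirmed here; the sketch below only shows the general technique. The class and method names are hypothetical, the resource name is left to the caller because the real file name isn't stated, and UTF-8 is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Anglicise: reading a text file via the classpath.
final class ResourceLoadSketch {

    /** Reads all lines of a classpath resource; fails clearly if it cannot be found. */
    static List<String> readLines(String resourceName) throws IOException {
        // The class loader searches directories and .jar files on the classpath alike.
        InputStream in = ResourceLoadSketch.class.getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Not found on classpath: " + resourceName);
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```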

Implementation

The matching is done using a Deterministic Finite-state Automaton (DFA). This is substantially faster than the non-deterministic finite automata (NFAs) commonly used, e.g. for regular expressions. I couldn't find anything in the literature about generating such a DFA, so I've developed my own algorithm; I'm quite proud of it. On my machine, generating the DFA takes about a second, and after that it zips through text like there's no tomorrow.
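To give a flavour of DFA-based matching, here is a minimal sketch that uses a trie as a DFA over whole words. It is emphatically not the StringDFA algorithm, whose details aren't described here: the class name is hypothetical, it matches only complete, case-sensitive words, it ignores context-dependent cases, and it needs Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a trie used as a DFA over whole words.
// This is NOT the StringDFA algorithm from Anglicise.
final class WordDfaSketch {

    // One DFA state: outgoing transitions, plus a replacement if the state is accepting.
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        String replacement; // null means "not an accepting state"
    }

    private final State start = new State();

    /** Adds one word/replacement pair by extending the path of states that spells the word. */
    void add(String word, String replacement) {
        State s = start;
        for (char c : word.toCharArray()) {
            s = s.next.computeIfAbsent(c, k -> new State());
        }
        s.replacement = replacement;
    }

    /** Scans the text once; each whole word that ends in an accepting state is replaced. */
    String process(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) {
                out.append(text.charAt(i++)); // copy non-word characters unchanged
                continue;
            }
            int wordStart = i;
            State s = start;
            // Follow one transition per letter; a null state acts as the dead state.
            while (i < text.length() && Character.isLetter(text.charAt(i))) {
                if (s != null) {
                    s = s.next.get(text.charAt(i));
                }
                i++;
            }
            if (s != null && s.replacement != null) {
                out.append(s.replacement);
            } else {
                out.append(text, wordStart, i); // no match: keep the word as it was
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        WordDfaSketch dfa = new WordDfaSketch();
        dfa.add("color", "colour");
        dfa.add("colors", "colours");
        dfa.add("center", "centre");
        // prints "The colours at the centre of the epicenter."
        System.out.println(dfa.process("The colors at the center of the epicenter."));
    }
}
```

Because each state has at most one transition per character, the scan never backtracks: every letter of the input is looked at once, however many words are in the list.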

The two main classes are uk.co.cix.gidds.text.Anglicise, which is the top-level class and handles the scanning and replacement, and uk.co.cix.gidds.text.StringDFA, which generates and simulates the DFA. The app also uses several support classes for file handling, argument parsing, GUI, &c; these were mostly written for other projects, hence their odd arrangement and quirkiness. They could probably be made much neater if I were writing only for this app.

Licence

Anglicise is released under the GNU General Public Licence.

This doesn't require that you let me know of any modifications, but if you've made any neat improvements, or have any other suggestions (anatomical or otherwise), I'd appreciate hearing.

Credits

Like most of my apps, Anglicise is a work in progress, with regular tweaks and adjustments. I don't know whether I'll release any further versions, but if so, they'll be available here.

Download

Anglicise v1.01 (84KB)

– a zip file containing an executable JAR, this documentation, and JARred source code.

Last updated: 2/Nov/2004.
