Geoff's Sound Change Applier

Last update: 2 September 2008


About SCA

Geoff's Sound Change Applier, SCA hereafter, is a text-based program which applies sets of sound-change rules to one or more words. It was originally based on a C program written by Mark Rosenfelder, which is fine for what it does, but I needed something more powerful for my purposes, which frequently require converting one word simultaneously into several descendant cognates. I recommend reading the documentation for his program anyway, since although it works somewhat differently from mine, many of the underlying concepts and principles are the same.

To use SCA, you'll need to install Python on your system; most Linices should have it already installed. You can get it from here. I leave writing a Perl version up to someone else; I tried it myself once, but it got horribly messy!

SCA works well enough for me, and while I hope it will be found useful, I can't guarantee that it will be suitable for your requirements; if you've any comments or suggestions for improvements, let me know. It works with Python 2.4 and 2.5.1, but not with 1.5, for example.

Meanwhile, a nice chap by the name of "pharazon" has turned SCA into a CGI-based webpage, which I would have done myself if my ISP allowed CGI...

Downloading

All of the necessary files are stored in this file (15K):


Using SCA

SCA is invoked from the command-line like this:
python sc_apply.py -c<sc-file> [arguments] <words>

sc-file is the name of the file which contains the sound changes; you must specify this, or the program won't work. You can use --scfile= instead of -c if you prefer. The filename should end with ".sc"; this extension can be omitted from the command-line. The format of the file is described below.

words are the words to process, separated by spaces. If no words are supplied, and no -l argument is present, you will see the banner, but nothing else interesting will happen.

All of the remaining arguments are optional.

If you have specified an input file, you can also use the following, which are useful if, for example, your input file has words and meanings on the same line:

The following arguments affect the display of the output:

The options -R -a, -r -a, and -r replace the options -v1 -v2 -v3 in earlier versions of the program. They are ignored if you read your words from an input file, based on the entirely reasonable assumption that they may generate far more output than you really want to look at.

As a simple example:

python sc_apply.py -cspanish.sc flamma gutta ossu petra cu:pa
>>> flamma
S: llama
>>> gutta
S: gota
>>> ossu
S: hueso
>>> petra
S: piedra
>>> cu:pa
S: cuba
Of course, it can't get everything right; if you have any improvements, let me know!

If you want to save your output to a file, on Linux and Window$ just add " > name-of-file" to the command-line. I don't know what to do on a Mac.


Format of the sound change file

Each line in the sound change file is considered to be one of the following:

Special words

The words SKIP NOSKIP END, when appearing alone on a line, have special meanings, and may be used to disable chunks of your sound change files if they're giving problems:

Note that these do not nest; END is obeyed even after SKIP.
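
As a rough illustration of the non-nesting behaviour, here is a minimal Python 3 sketch. It assumes that SKIP starts ignoring lines, NOSKIP resumes reading, and END stops reading altogether; the function name active_lines is made up for this example and is not part of SCA.

```python
def active_lines(lines):
    """Return the lines that remain once SKIP/NOSKIP/END are honoured.

    Assumed semantics (not taken from SCA's source): SKIP starts ignoring
    lines, NOSKIP resumes, END stops reading the file altogether.
    """
    kept = []
    skipping = False
    for line in lines:
        word = line.strip()
        if word == "END":
            break               # obeyed even while skipping
        if word == "SKIP":
            skipping = True     # a simple flag, not a counter: no nesting
        elif word == "NOSKIP":
            skipping = False
        elif not skipping:
            kept.append(line)
    return kept


demo = ["rule1", "SKIP", "rule2", "SKIP", "rule3",
        "NOSKIP", "rule4", "END", "rule5"]
print(active_lines(demo))  # ['rule1', 'rule4']
```

Because skipping is a single flag, the second SKIP changes nothing and one NOSKIP is enough to resume.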

Sound category definitions

Sound category definitions have the format category = sounds, where category is a string of letters or digits, and sounds is either one or more characters representing individual sounds (phonemes, in the jargon), or a list of sounds and categories separated by '~'. Each '~'-separated element is treated as a category if it identifies one which already exists, otherwise it is taken to be a list of sounds.

For example, these two lines define two categories called ufric and vfric:

ufric = pTsx
vfric = fDzG

This line defines a category fric which consists of all the sounds in the two categories already defined:

fric  = ufric~vfric

And this line defines foo to be fric plus two other sounds:

foo   = fric~hH

Note that the letters which represent sounds are meaningful only to you and to the sound change rules; the program has no way of knowing that ufric is supposed to be a voiceless fricative, for example.

Note too that if you really want to define one category to be equal to another, you have to add a tilde to the second definition, for example:

avowels = aáâàäã
low     = avowels~
Otherwise Strange Things will happen; in this example, the phonemes represented by a, v, o, w, e, l and s will be treated as low vowels, which probably isn't what was intended.
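
The tilde behaviour described above can be sketched in a few lines of Python 3. This is only an illustration of the rules as stated here, not SCA's actual code; the function name define_category and the dictionary cats are invented for the example.

```python
def define_category(categories, line):
    """Parse 'name = value' and store the expanded sounds in `categories`."""
    name, value = (part.strip() for part in line.split("=", 1))
    if "~" in value:
        sounds = ""
        for element in value.split("~"):
            # An element naming an existing category expands to its sounds;
            # anything else is taken as a literal list of sounds.
            sounds += categories.get(element, element)
    else:
        sounds = value  # no tilde: always a literal list of sounds
    categories[name] = sounds
    return sounds


cats = {}
for line in ["ufric = pTsx", "vfric = fDzG", "fric  = ufric~vfric",
             "foo   = fric~hH", "avowels = aáâàäã", "low = avowels~",
             "oops = avowels"]:
    define_category(cats, line)

print(cats["fric"])  # pTsxfDzG
print(cats["foo"])   # pTsxfDzGhH
print(cats["low"])   # aáâàäã
print(cats["oops"])  # avowels
```

The last line shows the Strange Things case: without the trailing tilde, the letters of the name avowels are themselves stored as sounds.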

Special categories

There are two "special" sound categories, identified by the names dialects and include.

include = FILE will immediately read in the contents of the file named FILE, which is handy if for example you have a set of common definitions you don't want to copy into several other files.

dialects = DIAL indicates that each character in DIAL is a separate dialect; the program will output one word for each dialect. For example, dialects = PGSCFOIR might be used by someone who's interested in comparing cognates in the Romance languages.


Sound change rules

This is the real meat of the program.

A rule consists of four or five fields, which are separated by spaces or tabs:

DIALECTS   BEFORE  AFTER  ENV  FLAGS

which informally mean "for each dialect in DIALECTS, BEFORE changes to AFTER in the environment ENV", using FLAGS if any are specified.

DIALECTS is a string of letters or numbers which specifies which dialects the rule applies to. You can use other padding characters to make the file visually easier to follow; for example:

PGSC....  (rule applying to Portuguese, Galician, Spanish, and Catalan)
...CFO..  (ditto Catalan, French, and Occitan)

DIALECTS is omitted from example rules below.

BEFORE and AFTER can contain a sound, a category, or a sequence of sounds or categories; these must not be empty! If you really mean "nothing", you must use '0' (i.e. zero).

ENV contains the character '_' (underscore), optionally preceded and followed by a sound, a category, or a sequence of sounds or categories. Essentially, a_b means "when preceded by a and followed by b"; _ by itself means "always".

Categories are always enclosed in <angle brackets>; thus, the following rule:

<ufric>  <vfric>  <vowel>_<vowel>

means "voiceless fricatives become voiced fricatives between vowels" - assuming, of course, that the categories have previously been defined to consist of characters which you use to represent the sounds in question.

If the name of the category is preceded by "^", the category is complemented; in other words <^category> means "anything not in category". So:

<front>  <back>  _<^soft>

might mean "a front vowel becomes the corresponding back vowel when not before a soft consonant".
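
Since rules end up as regular expressions, one plausible way to picture a category reference is as a character class. The Python 3 sketch below assumes that translation; the function to_regex and the category contents are invented for illustration and may not match what SCA does internally.

```python
import re


def to_regex(text, categories):
    """Replace <name> with [sounds] and <^name> with [^sounds]."""
    def expand(match):
        negate, name = match.group(1), match.group(2)
        return "[" + negate + re.escape(categories[name]) + "]"
    return re.sub(r"<(\^?)(\w+)>", expand, text)


# Invented category contents, purely for the example.
cats = {"front": "ei", "back": "ou", "soft": "jS"}

print(to_regex("<front>", cats))   # [ei]
print(to_regex("<^soft>", cats))   # [^jS]
```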

Special characters

The following characters have special meanings:

Character   Where valid      Meaning
#           ENV              beginning or end of a word
%           ENV              BEFORE
0 (zero)    BEFORE, AFTER    null phoneme, empty, blank
<           BEFORE, AFTER    part of ENV preceding the underscore
>           BEFORE, AFTER    part of ENV following the underscore

Within BEFORE, AFTER and ENV, categories are enclosed in <angle brackets>. If AFTER consists solely of a category, so must BEFORE, and both categories must contain the same number of sounds; a sound in BEFORE which occurs in the appropriate environment is then replaced by the corresponding sound in AFTER.
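
The position-by-position replacement can be sketched as follows in Python 3. It is a simplified model, assuming single-character sounds and a one-sound environment on each side; apply_category_rule is a hypothetical name, and the vowel category is invented.

```python
import re


def apply_category_rule(word, before, after, left, right):
    """Replace each sound of `before` by the sound at the same position in
    `after`, when it stands between a `left` sound and a `right` sound."""
    if len(before) != len(after):
        raise ValueError("Translation strings different lengths")
    table = dict(zip(before, after))
    pattern = "([%s])([%s])([%s])" % (
        re.escape(left), re.escape(before), re.escape(right))
    return re.sub(
        pattern,
        lambda m: m.group(1) + table[m.group(2)] + m.group(3),
        word)


ufric, vfric, vowel = "pTsx", "fDzG", "aeiou"
print(apply_category_rule("axe", ufric, vfric, vowel, vowel))   # aGe
print(apply_category_rule("asta", ufric, vfric, vowel, vowel))  # asta
```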

Because the strings are converted internally to regular expressions, regular expression metacharacters may also appear in any of BEFORE, AFTER and ENV. A dot (.) may be used to represent "any character"; the most useful of the others are |, +, * and ?. For example:

h       0       _<vobs>|<fric>
u       o       _<cons>*a
u       o       _<cons>+a
ja      E       #h?_

mean respectively:

Note that a+ is equivalent to aa*, and that ENV in the first example rule could just as easily be _<foo>, where foo is defined earlier as fric~vobs.
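
To see how *, + and ? behave, here is a small Python 3 experiment using the standard re module directly. It is not SCA itself: the consonant class is invented, and expressing ENV as a lookahead or an anchored group is a choice made for this sketch.

```python
import re

cons = "[ptkbdgmnlrs]"  # invented stand-in for <cons>

# u > o before zero or more consonants followed by a
zero_or_more = re.compile("u(?=" + cons + "*a)")
# u > o before one or more consonants followed by a
one_or_more = re.compile("u(?=" + cons + "+a)")

for word in ["lua", "luna", "lunda", "lune"]:
    print(word, zero_or_more.sub("o", word), one_or_more.sub("o", word))


def ja_to_E(word):
    """ja > E at the start of a word, optionally after h."""
    return re.sub(r"^(h?)ja", r"\1E", word)


for word in ["jan", "hjan", "kjan"]:
    print(word, ja_to_E(word))
```

Only lua separates the two quantifiers: with * it becomes loa, with + it is left alone because at least one consonant is required.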

More special characters

These are in a separate section to draw attention to the fact that they allow you to do very clever things and should really be used sparingly, and only if you know what you are doing. They are valid within AFTER only.

Retrieving substrings

The hash character #, when followed by a number (which may begin with an optional minus sign), refers to a specific phoneme in BEFORE: #1 to the first, #2 to the second, and so on. For example, you can implement metathesis of two sounds with #2#1 (replacing the ~ in older versions of SCA), like this:

<vowel><liquid> #2#1       _<cons>|#

Negative numbers, if you need them, count from the end: #-1 refers to the last, #-2 to the second-last, and so on; the #2#1 above could thus also be one of #-1#1 #-1#-2 #2#-2, for whatever that's worth.
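
The indexing can be checked with a short Python 3 sketch. It assumes one character per phoneme and takes the matched phonemes as a list; expand_refs is a hypothetical helper, not a function from SCA.

```python
import re


def expand_refs(after, phonemes):
    """Replace each #n in `after` with the n-th phoneme of the match.

    Positive n counts from the start (#1 is the first phoneme);
    negative n counts from the end (#-1 is the last).
    """
    def pick(match):
        n = int(match.group(1))
        if n == 0:
            raise ValueError("#0 does not refer to anything")
        return phonemes[n - 1] if n > 0 else phonemes[n]
    return re.sub(r"#(-?\d+)", pick, after)


matched = ["a", "r"]  # e.g. <vowel><liquid> matched "ar"
for after in ["#2#1", "#-1#1", "#-1#-2", "#2#-2"]:
    print(after, expand_refs(after, matched))  # each prints "ra"
```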

Another example comes courtesy of the spelling rules of Liotan and Breathanach, which require purely orthographic "glide" vowels to be inserted in certain situations. Here I is a pseudo-phoneme which indicates that a preceding consonant is slender, and evow and ivow are all varieties (long and short) of /e/ and /i/ respectively:

I<evow>    #2i        _<slender>
I<evow>    #2a        _
I<ivow>    #2         _<slender>
I<ivow>    #2o        _

You can see how the rules make it easy to throw the I away when the relevant rule is applied. Without the indexing, I needed eleven separate rules to do this.

Assimilations

The backquote ` is the most powerful feature of the program, and the one with the greatest potential for surprises, so don't rely on it until you are sure you know what it does. Its main function is to combine several related rules into one. In the following example, lvow represents the three vowels /e/, /a/ and /o/; the rule converts a sequence of two of these vowels to the long version of the first one (/ae/ to long /a/, etc.):

*    <lvow><lvow> #1`short`long _

Here the strange-looking AFTER field means "take the first phoneme in BEFORE, find its position in short, and use the character at the same position in long". Note that there are no angle brackets around the sound category names in AFTER if you are using a backquote rule.

Here's another example. The following rules, which assimilate a nasal to a following voiced stop consonant:

nasal = mnN

<nasal>    m      _b
<nasal>    n      _d
<nasal>    N      _g

can be represented more succinctly by this:

vstop = bdg

<nasal>    >`vstop`nasal     _<vstop>

With a little bit of thought you should be able to work out that this rule will convert nb to mb.
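
If the thought doesn't come easily, this Python 3 sketch spells out the lookup: find the following sound's position in vstop, then take the sound at that position in nasal. It models only this one rule shape with single-character sounds; assimilate is an invented name.

```python
def assimilate(word, targets, source_cat, result_cat):
    """Replace a `targets` sound by the `result_cat` sound whose position
    matches that of the following sound in `source_cat`."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in targets and nxt and nxt in source_cat:
            out.append(result_cat[source_cat.index(nxt)])
        else:
            out.append(ch)
    return "".join(out)


nasal, vstop = "mnN", "bdg"
print(assimilate("anba", nasal, vstop, nasal))  # amba
print(assimilate("amga", nasal, vstop, nasal))  # aNga
```

The trick only works because b, d, g and m, n, N are listed in the same order of place of articulation.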

There is a shorthand for certain types of backquoting rule. If you have rules like:

<A>    >`B`A     _<B>
<A>    <`B`A     <B>_

you can replace them with, respectively:

<A>    `>    _<B>
<A>    `<    <B>_

So the second example could also be:

<nasal>    `>     _<vstop>

which is conveniently read as "a nasal assimilates to a following voiced stop" - assuming, of course, that the categories nasal and vstop represent the same places of articulation in the same order. Similarly, assimilation in the other direction (/nb/ to /nd/) could be represented by:

<vstop>    `<     <nasal>_

The underscore character ("_") can be used in AFTER to break up otherwise confusing parts of a rule; for example:

<short><short> #1`short`long_h _

will convert two successive short vowels to the corresponding long vowel followed by h.


Flags

The FLAGS field consists of zero or more characters, which alter the way the rule works. At present, two flags are supported, and all others are ignored.

The banana problem

Suppose you have rules like the following:

vstop = bdg
vfric = vDG
vowel = aeiou

<vstop>   <vfric>   <vowel>_<vowel>

and you run "abadaga" through it. You'll get "avadaGa" back, not "avaDaGa" as you might expect. So you try this:

0     ;        <vowel><vstop>_<vowel>
<vstop>    <vfric>  _;
;     0        _

with the same result. Before you take my name in vain, don't panic: this is an instance of the banana problem (search for it in the Jargon File). Essentially, are there one or two instances of the string ana in the word banana? What is happening in this example is that once the trailing <vowel> matches the "a" following the "b", this "a" is then considered to have been processed, and next time round the <vowel> section matches the "a" after the "d" instead.

In summary, if obvious-looking conversions are being missed, and there's some overlap between the two parts of ENV, you're probably looking at this problem. The solution in older versions of the program was to include the rule twice in succession; however, this is unsatisfactory if, for example, you need to do vowel harmony counting from the first syllable and don't know how long the words will be. The solution now is to use the B flag:

<vstop>   <vfric>   <vowel>_<vowel>   B

What this flag means is essentially "keep applying the rule over and over until the word doesn't change any more", instead of just scanning once through to the end.
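
The same effect is easy to reproduce with Python 3's re module, whose sub function also never lets matches overlap. The sketch below is a model of the behaviour described here, not SCA's code; the two function names are invented.

```python
import re

VSTOP, VFRIC = "bdg", "vDG"
RULE = re.compile("([aeiou])([bdg])([aeiou])")


def apply_once(word):
    """One left-to-right pass; a vowel used by one match cannot start the next."""
    return RULE.sub(
        lambda m: m.group(1) + VFRIC[VSTOP.index(m.group(2))] + m.group(3),
        word)


def apply_until_stable(word):
    """Repeat the pass until the word stops changing (the idea behind B)."""
    while True:
        changed = apply_once(word)
        if changed == word:
            return word
        word = changed


print(apply_once("abadaga"))          # avadaGa
print(apply_until_stable("abadaga"))  # avaDaGa
```

The loop needs no knowledge of word length, which is what makes it preferable to writing the rule out twice.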

Persistent rules

The P flag marks a rule as "persistent". After every other rule has been applied, all persistent rules are applied in turn, in the order in which they appear in the file. For example, a persistent nasal harmony rule like this:

<nasal>    `>     _<vstop>   P

will always ensure that nasals are homorganic to following voiced stops.
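
One way to picture the control flow is the Python 3 sketch below. It assumes that a persistent rule takes effect from the point where it is declared and is re-run after each later ordinary rule; that detail, the function run_rules, and the three toy rules are all assumptions made for illustration.

```python
def run_rules(word, rules):
    """Apply `rules`, a list of (function, is_persistent) pairs in file order.

    Assumption: a persistent rule is only registered when it is reached,
    then re-applied after every ordinary rule that follows it.
    """
    persistent = []
    for rule, is_persistent in rules:
        if is_persistent:
            persistent.append(rule)
            continue
        word = rule(word)
        for p in persistent:
            word = p(word)
    return word


# Toy rules written as plain string replacements.
def nasal_assimilation(w):
    return w.replace("nb", "mb")

def drop_e(w):
    return w.replace("e", "")

def voice_p(w):
    return w.replace("p", "b")


rules = [(nasal_assimilation, True), (drop_e, False), (voice_p, False)]
print(run_rules("anepa", rules))  # amba
```

Here nb only comes into being after the last ordinary rule, and the persistent rule still catches it.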


Example rules

Here are some illustrations of how all this translates to actual phonological rules.

Type of change   What to do                Rule                                          BEFORE            AFTER   ENV
Loss             Use 0 (zero) for AFTER    /t/ disappears word-finally                   t                 0       _#
Epenthesis       Use 0 (zero) for BEFORE   /d/ is inserted between /n/ and /r/           0                 d       n_r
Assimilation     Use < or > for AFTER      /h/ assimilates to a preceding fricative      h                 <       <fric>_
Simplification   Use % in ENV              Two identical fricatives are reduced to one   <fric>            0       _% or %_
Metathesis       Use #2#1 for AFTER        <nasal> before <liquid> moves to after it     <nasal><liquid>   #2#1    _

By the metathesis example rule, assuming the categories are properly defined, canra will become carna.


Notes

What not to do (gotchas)

If you try something like this:

<cat1>X        <cat1>       _

to mean "X disappears after something of category cat1", you'll get the error "Error: Translation strings different lengths". The correct way is this:

X              0            <cat1>_

or, if you're feeling clever:

<cat1>X        #1           _

Environmental damage

You can use otherwise unused characters to temporarily represent environments. For example, consider the following nasalisation rules from one of my conlangs:

*  0         ;          <vowel>_<nasal><obs>
*  0         ;          <vowel>_<nasal><liquid>
*  0         ;          <vowel>_<nasal>#
*  a|u       A          _;
*  e|i       E          _;
*  á|â|ú|û   U          _;
*  é|ê|í|î   I          _;
*  ;         0          _

The first three rules set up the nasalisation environment, the next four nasalise the vowels, and the last rule removes the environment since it's no longer useful. Without this there would need to be twelve rules, since for technical reasons it's not possible to have a rule like the following:

*  <vowel>   <nasvow>        _<nasal>(<obs>|<liquid>|#)

Although you can of course do this:

*  <vowel>   <nasvow>        _<nasal>#
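
The marker technique itself (mark the environment, change what is marked, remove the marker) can be tried out in Python 3. The sketch below uses the re module directly, so it can group alternatives in a way the rule above cannot; the category contents are invented and only the unaccented vowels are handled.

```python
import re


def nasalise(word):
    # 1. Put ";" after a vowel that stands before a nasal followed by an
    #    obstruent, a liquid, or the end of the word.
    word = re.sub(r"([aeiu])(?=[mn](?:[ptkbdglr]|$))", r"\1;", word)
    # 2. Nasalise the marked vowels.
    word = re.sub(r"[au](?=;)", "A", word)
    word = re.sub(r"[ei](?=;)", "E", word)
    # 3. Remove the marker, which is no longer useful.
    return word.replace(";", "")


for word in ["kanta", "pin", "unra", "kana"]:
    print(word, nasalise(word))
```

The marker must be a character that never occurs in your words, otherwise step 3 will delete real material.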

Spelling

If the orthography and phonology of your language are related in predictable ways, you can use SCA to convert between them. Note that this sort of conversion is typically more complicated and fiddly than merely converting sounds, especially if you have a complicated orthography. The last section of spanish.sc, which begins with the comment "Spellings!", is a straightforward example.