This is the second post in a tutorial series on applying Parsec in historical linguistics. We’ve begun by providing a more formal description of sound change rule grammars and will end by building a full-fledged sound change applier.
In my last post we established a BNF grammar for files which describe sound change rules:
<file> ::= (<phoneme-class-defn> <EOL>)* (<rule> <EOL>)+
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
<rule> ::= <context> ">" <replacement> ["/" <condition>]
<condition> ::= <context>_<context>
<context> ::= (<phoneme> | <phoneme-class>)+
<phoneme> ::= <lowercase-letter>
<phoneme-class> ::= <uppercase-letter>
Before we begin parsing, let’s set up some basic datatypes which can be used to store the parse results.
Referencing the BNF grammar, we can use these types to build the returns
for our parsers. Let’s start with the simplest Parsec rules,
anyPhoneme
and anyPhonemeClass
. Any uppercase character in sound
change rules should be interpreted as a phoneme class reference, and any
lowercase character must be a phoneme.
As evidenced by the given type annotations, our parsers (for the moment)
will have a stream type of String
, a user state type of ()
, and a
return type that varies based on their purpose.
Our first lift
We need to next build the parser for phoneme class definitions. As a
first try, we could have our parser return a pair of type (Char,
[Phoneme])
, matching with the type of a PhonemeClassMap
. Let’s start:
This doesn’t work! What gives?
Let’s look at the type of (,)
, a tuple constructor:
And check the type of anyPhonemeClass
and many1 anyPhoneme
:
These have the right types Char
and [Phoneme]
, except they’re
contained within a ParsecT
type.
Good news: ParsecT s u m
is a functor! This means that we can “lift”
functions into the context defined by the type. Check the type of fmap
and fmap (,)
:
You can check that the type of (,)
corresponds with the type of fmap
(,)
. (In fmap
’s type signature, b
corresponds to b -> (a, b)
from
(,)
’s type.)
Let’s provide fmap (,)
with that first argument f a
, where f
is
ParsecT String () Identity
and a
is Char
. (This looks like the
type of anyPhonemeClass
!)
Great - just as fmap
’s signature described, (,)
was lifted into the
ParsecT String () Identity
context and a
was clarified to be a
Char
. We can make our expression look a bit nicer by using an infix
alias for fmap
from Control.Applicative
, <$>
:
Applying within a context
Looking at the types, we’re almost there: we want our final parser to
have a return type of (Char, [Phoneme])
and the current parser has a
return type of b -> (Char, b)
. How can we supply a type b
?
The answer comes from the fact that ParsecT s u m
is not only a
functor but an applicative functor. This means that we can apply
functions already within the context (like b -> (Char, b)
!) to values
within the context (like anyPhoneme
!).
This contextual application is invoked by Control.Applicative
’s <*>
.
Compare its type with the type of $
: the only difference is that <*>
operates within a context f
.
Let’s apply the lifted and partially applied function from the last
section to anyPhoneme
:
Close: our return type is (Char, Phoneme)
. Let’s apply with a many1
anyPhoneme
instead, which will produce a parser that accepts one or
more phonemes.
Great! Our parser returns the proper type. Let’s write the actual implementation of our phoneme class definition rule before continuing:
We must do a bit of bookkeeping. In the original BNF, we stated that a phoneme class definition was of the form
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
We need to account for the “useless” colon in this expression. It’s
useless in that it contributes nothing to the parse result. Using the
*>
function from Control.Applicative
, we can consume a ':'
character and discard its result:
Modifying user state
There’s one significant problem left with this parser. True, it eats up strings without a problem:
Our problem is that we need to reference these definitions in another parser, specifically the context parser:
<context> ::= (<phoneme> | <phoneme-class>)+
Since Parsec has no idea of what a phoneme class is, when we build this parser we’ll need to identify exactly what we should look for in test words given that we saw “V” or “A” in a rule. How can we have the phoneme class definitions “carry over?”
It’s simple using Parsec’s built-in “user state” feature. (It shows up
in the u
in ParsecT s u m a
.) Rather than using ()
as our user
state type, let’s carry along a PhonemeClassMap
as state. Each rule’s
type needs to now be redefined (but the implementation for those not
using the state data need not change):
In phonemeClassDefinition
we’ll need to use Parsec’s modifyState
function:
This type annotation does a great job of helping us understand what
exactly happens within the function. Given some user state modifier
(i.e., a function which takes an old user state of type u
and creates
a new one), a new parser is yielded which has a user state of type u
and returns nothing.
Now phonemeClassDefinition
will return nothing and instead modify the
parser’s state (i.e., add entries to the phoneme class map).
We want to modify this map by insert
ing an entry whose contents will
be equal . We run into a familiar problem, however, since insert
was
not built explicitly for use with the ParsecT s u m
context:
Let’s lift insert
into the ParsecT s u m
functor:
Close, like before: we can now provide fmap insert
with a first
argument in the ParsecT s u m
context, but the a1
in the type
annotation has no concept of context. Using <*>
once more, we can fix
the problem:
Before continuing, let’s give a name to the parser created in this section.
Notice that Map Char [Phoneme]
is equivalent to PhonemeClassMap
, or
our parser’s user state u
. We just did all this work to lift and apply
insert
within a context, but now, upon revisiting modifyState
’s
type, we see we’ll need to head in the other direction:
Binding
If we simplify the types here, the next step should be obvious. (This is pseudo-Haskell.)
We need some function that, with an f a
and a -> f b
, derive an
f b
. This sounds just like >>=
, the monadic bind operation!
That’s it – we’ve found our definition for phonemeClassDefinition
!
With some reformatting:
We’ve finished with the hardest parser of the set. In the next post, we’ll tackle the remaining parsers, most of which are simple combinations of those constructed today.