This is the second post in a tutorial series on applying Parsec in historical linguistics. We’ve begun by providing a more formal description of sound change rule grammars and will end by building a full-fledged sound change applier.
In my last post we established a BNF grammar for files which describe sound change rules:
<file> ::= (<phoneme-class-defn> <EOL>)* (<rule> <EOL>)+
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
<rule> ::= <context> ">" <replacement> ["/" <condition>]
<condition> ::= <context>_<context>
<context> ::= (<phoneme> | <phoneme-class>)+
<phoneme> ::= <lowercase-letter>
<phoneme-class> ::= <uppercase-letter>
Before we begin parsing, let’s set up some basic datatypes which can be used to store the parse results.
import Data.Map (Map)
-- A single phoneme.
type Phoneme = Char
-- Phoneme class storage, mapping from a single character ('V', 'A',
-- 'F', etc.) to a collection of phonemes.
type PhonemeClassMap = Map Char [Phoneme]
-- A string of phonemes used to match a given context.
type Context = [Phoneme]
-- A complete sound change rule.
data Rule = Rule { replacement :: [Phoneme],
beforeContext :: Context,
inContext :: Context,
afterContext :: Context }
instance Show Rule where
show (Rule r b i a) = show i ++ " > " ++ show r ++ " / " ++ show b
++ "_" ++ show a
Referencing the BNF grammar, we can use these types to build the returns
for our parsers. Let’s start with the simplest Parsec rules,
anyPhoneme
and anyPhonemeClass
. Any uppercase character in sound
change rules should be interpreted as a phoneme class reference, and any
lowercase character must be a phoneme.
import Text.Parsec
anyPhoneme :: Parsec String () Phoneme
anyPhoneme = lower
anyPhonemeClass :: Parsec String () Char
anyPhonemeClass = upper
As evidenced by the given type annotations, our parsers (for the moment)
will have a stream type of String
, a user state type of ()
, and a
return type that varies based on their purpose.
Our first lift
We need to next build the parser for phoneme class definitions. As a
first try, we could have our parser return a pair of type (Char,
[Phoneme])
, matching with the type of a PhonemeClassMap
. Let’s start:
phonemeClassDefinition :: Parsec String () (Char, [Phoneme])
phonemeClassDefinition = (,) (anyPhonemeClass) (many1 anyPhoneme)
This doesn’t work! What gives?
Let’s look at the type of (,)
, a tuple constructor:
(,) :: a -> b -> (a, b)
And check the type of anyPhonemeClass
and many1 anyPhoneme
:
anyPhonemeClass :: ParsecT String () Identity Char
many1 anyPhoneme :: ParsecT String () Identity [Phoneme]
These have the right types Char
and [Phoneme]
, except they’re
contained within a ParsecT
type.
Good news: ParsecT s u m
is a functor! This means that we can “lift”
functions into the context defined by the type. Check the type of fmap
and fmap (,)
:
fmap :: Functor f => (a -> b) -> f a -> f b
fmap (,) :: Functor f => f a -> f (b -> (a, b))
You can check that the type of (,)
corresponds with the type of fmap
(,)
. (In fmap
’s type signature, b
corresponds to b -> (a, b)
from
(,)
’s type.)
Let’s provide fmap (,)
with that first argument f a
, where f
is
ParsecT String () Identity
and a
is Char
. (This looks like the
type of anyPhonemeClass
!)
fmap (,) anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))
Great - just as fmap
’s signature described, (,)
was lifted into the
ParsecT String () Identity
context and a
was clarified to be a
Char
. We can make our expression look a bit nicer by using an infix
alias for fmap
from Control.Applicative
, <$>
:
(,) <$> anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))
Applying within a context
Looking at the types, we’re almost there: we want our final parser to
have a return type of (Char, [Phoneme])
and the current parser has a
return type of b -> (Char, b)
. How can we supply a type b
?
The answer comes from the fact that ParsecT s u m
is not only a
functor but an applicative functor. This means that we can apply
functions already within the context (like b -> (Char, b)
!) to values
within the context (like anyPhoneme
!).
This contextual application is invoked by Control.Applicative
’s <*>
.
Compare its type with the type of $
: the only difference is that <*>
operates within a context f
.
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
($) :: (a -> b) -> a -> b
Let’s apply the lifted and partially applied function from the last
section to anyPhoneme
:
anyPhoneme :: ParsecT String () Identity Phoneme
(,) <$> anyPhonemeClass :: ParsecT String () Identity (b -> (Char, b))
(,) <$> anyPhonemeClass <*> anyPhoneme
:: ParsecT String () Identity (Char, Phoneme)
Close: our return type is (Char, Phoneme)
. Let’s apply with a many1
anyPhoneme
instead, which will produce a parser that accepts one or
more phonemes.
(,) <$> anyPhonemeClass <*> many1 anyPhoneme
:: ParsecT String () Identity (Char, [Phoneme])
Great! Our parser returns the proper type. Let’s write the actual implementation of our phoneme class definition rule before continuing:
import Control.Applicative ((<$>), (<*>))
phonemeClassDefinition :: ParsecT String () Identity (Char, [Phoneme])
phonemeClassDefinition = (,) <$> anyPhonemeClass <*> many1 anyPhoneme
We must do a bit of bookkeeping. In the original BNF, we stated that a phoneme class definition was of the form
<phoneme-class-defn> ::= <phoneme-class> ":" <phoneme>+
We need to account for the “useless” colon in this expression. It’s
useless in that it contributes nothing to the parse result. Using the
*>
function from Control.Applicative
, we can consume a ':'
character and discard its result:
import Control.Applicative ((<$>), (<*>), (*>))
phonemeClassDefinition :: ParsecT String () Identity (Char, [Phoneme])
phonemeClassDefinition :: (,) <$> anyPhonemeClass
<*> (char ':' *> many1 anyPhoneme)
Modifying user state
There’s one significant problem left with this parser. True, it eats up strings without a problem:
> parse phonemeClassDefinition "" "V:aeiou"
Right ('V', "aeiou")
Our problem is that we need to reference these definitions in another parser, specifically the context parser:
<context> ::= (<phoneme> | <phoneme-class>)+
Since Parsec has no idea of what a phoneme class is, when we build this parser we’ll need to identify exactly what we should look for in test words given that we saw “V” or “A” in a rule. How can we have the phoneme class definitions “carry over?”
It’s simple using Parsec’s built-in “user state” feature. (It shows up
in the u
in ParsecT s u m a
.) Rather than using ()
as our user
state type, let’s carry along a PhonemeClassMap
as state. Each rule’s
type needs to now be redefined (but the implementation for those not
using the state data need not change):
anyPhoneme :: ParsecT String PhonemeClassMap Identity Phoneme
anyPhonemeClass :: ParsecT String PhonemeClassMap Identity Char
phonemeClassDefinition
:: ParsecT String PhonemeClassMap Identity (Char, [Phoneme])
In phonemeClassDefinition
we’ll need to use Parsec’s modifyState
function:
modifyState :: Monad m => (u -> u) -> ParsecT s u m ()
This type annotation does a great job of helping us understand what
exactly happens within the function. Given some user state modifier
(i.e., a function which takes an old user state of type u
and creates
a new one), a new parser is yielded which has a user state of type u
and returns nothing.
Now phonemeClassDefinition
will return nothing and instead modify the
parser’s state (i.e., add entries to the phoneme class map).
phonemeClassDefinition :: ParsecT String PhonemeClassMap Identity ()
We want to modify this map by insert
ing an entry whose contents will
be equal . We run into a familiar problem, however, since insert
was
not built explicitly for use with the ParsecT s u m
context:
insert :: a -> b -> Map a b -> Map a b
Let’s lift insert
into the ParsecT s u m
functor:
(<$>) insert :: (Functor f, Ord a) => f a -> f (a1 -> Map a a1
-> Map a a1)
insert <$> anyPhonemeClass
:: ParsecT String PhonemeClassMap Identity (a -> Map Char a
-> Map Char a)
Close, like before: we can now provide fmap insert
with a first
argument in the ParsecT s u m
context, but the a1
in the type
annotation has no concept of context. Using <*>
once more, we can fix
the problem:
insert <$> anyPhonemeClass <*> (char ':' *> many1 anyPhoneme)
:: ParsecT String PhonemeClassMap Identity
(Map Char [Phoneme] -> Map Char [Phoneme])
Before continuing, let’s give a name to the parser created in this section.
modifier :: ParsecT String PhonemeClassMap Identity
(Map Char [Phoneme] -> Map Char [Phoneme])
modifier = insert <$> anyPhonemeClass <*> many1 anyPhoneme
Notice that Map Char [Phoneme]
is equivalent to PhonemeClassMap
, or
our parser’s user state u
. We just did all this work to lift and apply
insert
within a context, but now, upon revisiting modifyState
’s
type, we see we’ll need to head in the other direction:
phonemeClassDefinition :: ParsecT String u Identity ()
modifier :: ParsecT String u Identity (u -> u)
modifyState :: Monad m => (u -> u) -> ParsecT s u m ()
Binding
If we simplify the types here, the next step should be obvious. (This is pseudo-Haskell.)
u :: PhonemeClassMap
f :: ParsecT String u Identity
a :: u -> u
b :: ()
modifier :: f a
modifyState :: Monad m => a -> m b
We need some function that, with an f a
and a -> f b
, derive an
f b
. This sounds just like >>=
, the monadic bind operation!
(>>=) :: Monad m => m a -> (a -> m b) -> m b
modifier >>= modifyState :: ParsecT s PhonemeClassMap m ()
That’s it – we’ve found our definition for phonemeClassDefinition
!
With some reformatting:
phonemeClassDefinition :: ParsecT String PhonemeClassMap Identity ()
phonemeClassDefinition = modifier >>= modifyState
where modifier = insert <$> anyPhonemeClass
<*> defn
defn = char ':' >> many1 anyPhoneme
We’ve finished with the hardest parser of the set. In the next post, we’ll tackle the remaining parsers, most of which are simple combinations of those constructed today.