Understanding Computation (16 page)

Read Understanding Computation Online

Authors: Tom Stuart

Tags: #COMPUTERS / Programming / General

And check that it works:

>>
pattern
=
Repeat
.
new
(
Literal
.
new
(
'a'
))
=> /a*/
>>
pattern
.
matches?
(
''
)
=> true
>>
pattern
.
matches?
(
'a'
)
=> true
>>
pattern
.
matches?
(
'aaaa'
)
=> true
>>
pattern
.
matches?
(
'b'
)
=> false

Now that we have
#to_nfa_design
implementations for each class of regular expression syntax, we can
build up complex patterns and use them to match strings:

>>
pattern
=
Repeat
.
new
(
Concatenate
.
new
(
Literal
.
new
(
'a'
),
Choose
.
new
(
Empty
.
new
,
Literal
.
new
(
'b'
))
)
)
=> /(a(|b))*/
>>
pattern
.
matches?
(
''
)
=> true
>>
pattern
.
matches?
(
'a'
)
=> true
>>
pattern
.
matches?
(
'ab'
)
=> true
>>
pattern
.
matches?
(
'aba'
)
=> true
>>
pattern
.
matches?
(
'abab'
)
=> true
>>
pattern
.
matches?
(
'abaab'
)
=> true
>>
pattern
.
matches?
(
'abba'
)
=> false

This is a nice result. We began with a syntax for patterns and
have now given a semantics for that syntax by showing how to convert any
pattern into an NFA, a kind of abstract machine that we already know how
to execute. In conjunction with a parser, this gives us a practical way
of reading a regular expression and deciding whether it matches a
particular string. Free moves are useful for this conversion because
they provide an unobtrusive way to glue together smaller machines into
larger ones without affecting the behavior of any of the
components.

Note

The majority of real-world implementations of regular
expressions, like the Onigmo library used by Ruby, don’t work by
literally compiling patterns into finite automata and simulating their
execution. Although it’s a fast and efficient way of matching regular
expressions against strings, this approach makes it harder to support
more advanced features like capture groups and lookahead/lookbehind
assertions. Consequently most libraries use some kind of
backtracking algorithm
that deals with regular
expressions more directly instead of converting them into finite
automata.

Russ Cox’s
RE2
library
is a production-quality C++ regular expression
implementation that
does
compile patterns into
automata,
[
25
]
while Pat Shaughnessy has written a detailed
blog
post
exploring how Ruby’s regular expression algorithm
works.

Parsing

We’ve almost
built a complete (albeit basic) regular expression implementation. The only
missing piece is a parser for pattern syntax: it would be much more convenient if we could
just write
(a(|b))*
instead of building the abstract
syntax tree manually with
Repeat.new(Concatenate.new(Literal.new('a'), Choose.new(Empty.new,
Literal.new
('b'))))
. We saw in
Implementing Parsers
that it’s
not difficult to use Treetop to generate a parser that can automatically transform raw
syntax into an AST, so let’s do that here to finish off our implementation.

Here’s a Treetop
grammar for simple regular expressions:

grammar
Pattern
rule
choose
first
:
concatenate_or_empty
'|'
rest
:
choose
{
def
to_ast
Choose
.
new
(
first
.
to_ast
,
rest
.
to_ast
)
end
}
/
concatenate_or_empty
end
rule
concatenate_or_empty
concatenate
/
empty
end
rule
concatenate
first
:
repeat
rest
:
concatenate
{
def
to_ast
Concatenate
.
new
(
first
.
to_ast
,
rest
.
to_ast
)
end
}
/
repeat
end
rule
empty
''
{
def
to_ast
Empty
.
new
end
}
end
rule
repeat
brackets
'*'
{
def
to_ast
Repeat
.
new
(
brackets
.
to_ast
)
end
}
/
brackets
end
rule
brackets
'('
choose
')'
{
def
to_ast
choose
.
to_ast
end
}
/
literal
end
rule
literal
[a-z]
{
def
to_ast
Literal
.
new
(
text_value
)
end
}
end
end
Note

Again, the order of rules reflects the precedence of each operator: the
|
operator binds loosest, so the
choose
rule goes first, with the higher precedence operator rules appearing
farther down the grammar.

Now we have all the pieces we need to parse a regular expression,
turn it into an abstract syntax tree, and use it to match
strings:

>>
require
'treetop'
=> true
>>
Treetop
.
load
(
'pattern'
)
=> PatternParser
>>
parse_tree
=
PatternParser
.
new
.
parse
(
'(a(|b))*'
)
=> SyntaxNode+Repeat1+Repeat0 offset=0, "(a(|b))*" (to_ast,brackets):
SyntaxNode+Brackets1+Brackets0 offset=0, "(a(|b))" (to_ast,choose):
SyntaxNode offset=0, "("
SyntaxNode+Concatenate1+Concatenate0 offset=1, "a(|b)" (to_ast,first,rest):
SyntaxNode+Literal0 offset=1, "a" (to_ast)
SyntaxNode+Brackets1+Brackets0 offset=2, "(|b)" (to_ast,choose):
SyntaxNode offset=2, "("
SyntaxNode+Choose1+Choose0 offset=3, "|b" (to_ast,first,rest):
SyntaxNode+Empty0 offset=3, "" (to_ast)
SyntaxNode offset=3, "|"
SyntaxNode+Literal0 offset=4, "b" (to_ast)
SyntaxNode offset=5, ")"
SyntaxNode offset=6, ")"
SyntaxNode offset=7, "*"
>>
pattern
=
parse_tree
.
to_ast
=> /(a(|b))*/
>>
pattern
.
matches?
(
'abaab'
)
=> true
>>
pattern
.
matches?
(
'abba'
)
=> false
Equivalence

This chapter has
described the idea of a deterministic state machine and
added more features to it: first nondeterminism, which makes it possible
to design machines that can follow many possible execution paths instead
of a single path, and then free moves, which allow nondeterministic
machines to change state without reading any input.

Nondeterminism and free moves make it easier to design finite state
machines to perform specific jobs—we’ve already seen that they’re very
useful for translating regular expressions into state machines—but do they
let us do anything that we can’t do with a standard DFA?

Well, it turns out that it’s possible to
convert any nondeterministic finite automaton into a deterministic one that
accepts exactly the same strings. This might be surprising given the extra constraints of a
DFA, but it makes sense when we think about the way we simulated the execution of both kinds
of machine.

Imagine we have a particular DFA whose behavior we want to simulate.
The simulation of this hypothetical DFA reading a particular sequence of
characters might go something like this:

  • Before the machine has read any input, it’s in state 1.

  • The machine reads the character
    'a'
    , and now it’s
    in state 2.

  • The machine reads the character
    'b'
    , and now it’s
    in state 3.

  • There is no more input, and state 3 is an accept state, so the string
    'ab'
    has been accepted.

There is something slightly subtle going on here: the simulation,
which in our case is a Ruby program running on a real computer, is
recreating the behavior of the DFA, which is an abstract machine that
can’t run at all because it doesn’t exist. Every time the imaginary DFA
changes state, so does the simulation that we are running—that’s what
makes it a simulation.

It’s hard to see this separation, because both the DFA and the simulation are
deterministic and their states match up exactly: when the DFA is in state 2, the simulation is
in a state that means “the DFA is in state 2.” In our Ruby simulation, this
simulation state
is effectively the value of the
DFA
instance’s
current_state
attribute.

Despite the extra overhead of dealing with nondeterminism and free
moves, the simulation of a hypothetical NFA reading some characters
doesn’t look hugely different:

  • Before the machine has read any input, it’s possible for it to be in either state 1 or
    state 3.
    [
    26
    ]

  • The machine reads the character
    c
    , and now it’s
    possible for it to be in one of states 1, 3, or 4.

  • The machine reads the character
    d
    , and now it’s
    possible for it to be in either state 2 or state 5.

  • There is no more input, and state 5 is an accept state, so the string
    'cd'
    has been accepted.

This time it’s easier to see that the state of the simulation is not
the same thing as the state of the NFA. In fact, at every point of this
simulation we are never certain which state the NFA is in, but the
simulation itself is still deterministic because it has states that
accommodate that uncertainty. When it’s
possible
for
the NFA to be in one of states 1, 3 or 4, we are
certain
that the simulation is in the single state
that means “the NFA is in state 1, 3, or 4.”

The only real difference between these two examples is that the DFA simulation moves from
one current state to another, whereas the NFA simulation moves from one current
set
of possible states
to another. Although an NFA’s rulebook can be
nondeterministic, the decision about which possible states follow from the current ones for a
given input is always completely deterministic.

This determinism means that we can always construct a DFA whose job
is to simulate a particular NFA. The DFA will have a state to represent
each set of possible states of the NFA, and the rules between these DFA
states will correspond to the ways in which a deterministic simulation of
the NFA can move between its sets of possible states. The resulting DFA
will be able to completely simulate the behavior of the NFA, and as long
as we choose the right accept states for the DFA—as per our Ruby
implementation, these will be any states that correspond to the NFA
possibly
being in an accept state—it’ll accept the
same strings too.

Let’s try doing the conversion for a specific NFA. Take this
one:

It’s possible for this NFA to be in state 1 or state 2 before it has read any input (state
1 is the start state, and state 2 is reachable via a free move), so the simulation will begin
in a state we can call “1 or 2.” From this starting point the simulation will end up in
different states depending on whether it reads
a
or
b
:

  • If it reads an
    a
    , the simulation will remain in
    state “1 or 2”: when the NFA’s in state 1 it can read an
    a
    and either follow the rule that keeps it in state 1 or the rule that takes
    it into state 2, while from state 2, it has no way of reading an
    a
    at all.

  • If it reads a
    b
    , it’s possible for the NFA to end
    up in state 2 or state 3—from state 1, it can’t read a
    b
    , but from state 2, it can move into state 3 and potentially take a free
    move back into state 2—so we’ll say the simulation moves into a state called “2 or 3” when
    the input is
    b
    .

By thinking through the behavior of a simulation of the NFA, we’ve
begun to construct a state machine for that simulation:

Note

“2 or 3” is an accept state for the simulation, because state 3 is an accept state for
the NFA.

We can continue this process of discovering new states of the simulation until there are
no more to discover, which must happen eventually because there are only a limited number of
possible combinations of the original NFA’s states.
[
27
]
By repeating the discovery process for our example NFA, we find that there are
only four distinct combinations of states that its simulation can encounter by starting at “1
or 2” and reading sequences of
a
s and
b
s:

If the NFA is in state(s)…
and reads the character…
it can end up in state(s)…
1 or 2
a
1 or 2
b
2 or 3
2 or 3
a
none
b
1, 2, or 3
None
a
none
b
none
1, 2, or 3
a
1 or 2
b
1, 2, or 3

This table completely describes a DFA, pictured below, that accepts the same strings as
the original NFA:

Note

This DFA only has one more state than the NFA we started with, and for some NFAs, this
process can produce a DFA with
fewer
states than the original machine.
In the worst case, though, an NFA with
n
states may require a DFA with
2
n
states, because there are a total of
2
n
possible combinations of
n
states—think
of representing each combination as a different
n
-bit number, where the
n
th
bit indicates whether state
n
is included in that combination—and the simulation might need to be
able to visit all of them instead of just a few.

Let’s try implementing this NFA-to-DFA conversion in Ruby. Our
strategy is to introduce a new class,
NFASimulation
, for collecting information about
the simulation of an NFA and then assembling that information into a DFA.
An instance of
NFASimulation
can be
created for a specific
NFADesign
and
will ultimately provide a
#to_dfa_design
method for converting it to an
equivalent
DFADesign
.

We already have an
NFA
class that
can simulate an NFA, so
NFASimulation
can work by creating and driving instances of
NFA
to find out how they respond to all possible
inputs. Before starting on
NFASimulation
, let’s go back to
NFADesign
and add an optional “current states”
parameter to
NFADesign#to_nfa
so that
we can build an
NFA
instance with any
set of current states, not just the
NFADesign
’s start state:

class
NFADesign
def
to_nfa
(
current_states
=
Set
[
start_state
]
)
NFA
.
new
(
current_states
,
accept_states
,
rulebook
)
end
end

Previously, the simulation of an NFA could only begin in its start state, but this new
parameter gives us a way of jumping in at any other point:

>>
rulebook
=
NFARulebook
.
new
(
[
FARule
.
new
(
1
,
'a'
,
1
),
FARule
.
new
(
1
,
'a'
,
2
),
FARule
.
new
(
1
,
nil
,
2
),
FARule
.
new
(
2
,
'b'
,
3
),
FARule
.
new
(
3
,
'b'
,
1
),
FARule
.
new
(
3
,
nil
,
2
)
]
)
=> #
>>
nfa_design
=
NFADesign
.
new
(
1
,
[
3
]
,
rulebook
)
=> #
>>
nfa_design
.
to_nfa
.
current_states
=> #
>>
nfa_design
.
to_nfa
(
Set
[
2
]
)
.
current_states
=> #
>>
nfa_design
.
to_nfa
(
Set
[
3
]
)
.
current_states
=> #

Other books

Wee Scotch Whisky Tales by Ian R Mitchell
An Ideal Duchess by Evangeline Holland
Crystal Throne (Book 1) by D.W. Jackson
The Song of Troy by Colleen McCullough
Only Hers by Francis Ray