Exploiting Finite State Automata for Efficient Lexical Analysis: A Rust Implementation
Abstract
In this post, I will discuss the process of exploiting Finite State Automata (FSA), in particular Non-Deterministic Finite Automata (NFA), for efficient lexical analysis. I will provide a brief introduction to FSA and NFA, and then I will show how to implement a lexical analyzer using an NFA in Rust.
Introduction
I recently had the opportunity to work on a project that required me to implement a lexical analyzer for a custom programming language. The language was functional in nature, and the lexical analyzer was required to be as efficient as possible to ensure that the language could be used in real-time applications.
In my previous experience, I used backtracking algorithms to implement lexical analyzers. However, I knew that backtracking algorithms could be inefficient for certain types of languages. So, I decided to explore other options to see if I could find a more efficient solution.
Finite State Automata
Finite state automata (FSA) are commonly used in computer science to model the behavior of systems that can be in one of a finite number of states at any given time. They are used in a wide range of applications, including lexical analysis, parsing, and pattern matching. In the literature, there are two main types of finite state automata: deterministic finite automata (DFA) and non-deterministic finite automata (NFA). NFA is a generalization of DFA: the two recognize exactly the same class of languages (the regular languages), but an NFA can be far more compact, sometimes exponentially smaller than the equivalent DFA.
Theoretically speaking, a finite state automaton is a 5-tuple \((Q, \Sigma, \delta, q_0, F)\), where: \[ \begin{align*} Q & : \text{a finite non-empty set of states} \\ \Sigma & : \text{a finite set of input symbols} \\ \delta & : Q \times \Sigma \rightarrow Q \text{, the transition function} \\ q_0 \in Q & : \text{the initial state} \\ F \subseteq Q & : \text{the set of final states, possibly empty} \end{align*} \]
Note that the transition function \(\delta\) is a partial function, meaning that it is not defined for all possible inputs. In other words, there may be some states in the automaton that do not have transitions for all possible input symbols.
An NFA differs from a DFA in the transition function \(\delta\): a state can transition to several states on the same symbol, and can even transition without reading a symbol at all (\(\epsilon\)-transitions). For a non-deterministic automaton, the transition function has the form \(\delta : Q \times (\Sigma \cup \{\epsilon\}) \rightarrow 2^Q\), where \(2^Q\) is the power set of \(Q\). In other words, the transition function returns a set of states.
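To make this concrete, consider a small example NFA (my own illustration, not part of the formal material above) that recognizes identifiers made of letters and digits, starting with a letter: \[ Q = \{q_0, q_1\}, \quad q_0 \text{ initial}, \quad F = \{q_1\}, \] \[ \delta(q_0, \text{letter}) = \{q_1\}, \quad \delta(q_1, \text{letter}) = \delta(q_1, \text{digit}) = \{q_1\}, \] and every pair not listed maps to the empty set \(\emptyset\). Reading the string foo1 takes the automaton from \(q_0\) to \(q_1\) and keeps it there, ending in a final state, so the string is accepted.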
In this post, I will implement a lexical analyzer using NFA in Rust. I will show how to define the states, transitions, and the lexical rules using NFA.
Advantages of FSA
Time Complexity
One of the most significant advantages of using FSA over backtracking algorithms is that FSA operates in linear time. This means that the time complexity of an FSA-based lexical analyzer is \(O(n)\), where \(n\) is the length of the input string. In contrast, the time complexity of a backtracking-based lexical analyzer can be exponential in the worst case.
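To see where the linearity comes from, here is a minimal, self-contained sketch (not the lexer built later in this post) of a table-driven DFA that recognizes integer literals: the scan makes exactly one transition per input character and never backtracks.

```rust
// Minimal sketch: a DFA recognizing non-empty digit strings.
// One transition per character means a single O(n) pass over the input.
fn accepts_digits(input: &str) -> bool {
    // States: 0 = start, 1 = seen at least one digit, 2 = error (sink).
    let mut state = 0usize;
    for c in input.chars() {
        state = match (state, c.is_ascii_digit()) {
            (0, true) | (1, true) => 1,
            _ => 2, // any non-digit sends us to the sink state
        };
    }
    state == 1 // accept iff we end in the accepting state
}

fn main() {
    assert!(accepts_digits("123"));
    assert!(!accepts_digits("12a"));
    assert!(!accepts_digits(""));
    println!("ok");
}
```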
Space Complexity
Another advantage of using FSA is that it has a lower space complexity compared to backtracking algorithms. FSA uses a fixed amount of memory to store the states and transitions, while backtracking algorithms may need to store multiple paths in memory, which can lead to high space complexity.
Error Handling
With FSA, error states can be explicitly defined, making it easier to handle unexpected input gracefully. When an FSA encounters an invalid character or sequence, it can transition to an error state and provide informative feedback to the user. In backtracking algorithms, error handling can be more complex and less straightforward, as it might require unwinding multiple levels of recursive calls and tracking the source of the error.
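As a small illustration of the idea (a sketch, not code from the lexer below), an automaton with an explicit error state can report exactly which character broke the scan, with no unwinding:

```rust
// Sketch: a two-state scan (ok / error sink) over digit strings.
// Reaching the error state immediately yields a precise diagnostic.
fn scan_number(input: &str) -> Result<&str, String> {
    for (i, c) in input.char_indices() {
        if !c.is_ascii_digit() {
            // Transition to the error state: report the offending input.
            return Err(format!("unexpected character {:?} at offset {}", c, i));
        }
    }
    Ok(input)
}

fn main() {
    assert_eq!(scan_number("123"), Ok("123"));
    assert!(scan_number("12a").is_err());
    println!("{:?}", scan_number("12a"));
}
```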
Implementation
Now that we have discussed the advantages of using FSA for lexical analysis, let's see how we can implement a lexical analyzer using NFA in Rust. In this example, we will implement a simple lexical analyzer that recognizes the following tokens:
- Integer literals (e.g. 123)
- Identifiers (e.g. foo)
For instance, consider the following input string: 123 foo. The lexical analyzer should output the following tokens:
- IntegerLiteral(123)
- Identifier(foo)
The Token Struct
We need a struct to represent tokens. The Token struct has two fields: kind to represent the type of the token and lexeme to store the lexeme of the token. The TokenKind enum represents the different types of tokens that the lexer can recognize; it has three variants: Identifier, Number, and EOF (end of file). The Token struct and TokenKind enum are defined as follows:
#[derive(Debug)]
pub struct Token {
pub kind: TokenKind,
pub lexeme: String,
}
#[derive(Debug)]
pub enum TokenKind {
Identifier,
Number,
EOF,
}
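As a quick usage check (the definitions are repeated here so the snippet compiles on its own), a token can be built and printed directly thanks to the derived Debug implementation:

```rust
#[derive(Debug)]
pub struct Token {
    pub kind: TokenKind,
    pub lexeme: String,
}

#[derive(Debug)]
pub enum TokenKind {
    Identifier,
    Number,
    EOF,
}

fn main() {
    // #[derive(Debug)] lets us print tokens with the {:?} formatter.
    let token = Token {
        kind: TokenKind::Number,
        lexeme: "123".to_string(),
    };
    println!("{:?}", token);
}
```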
The Cursor Struct
Next, we need a struct to represent the cursor that will be used to traverse the input string. The Cursor struct has three fields: input to store the input string, position to store the start of the current lexeme, and offset to store the current scanning offset. The Cursor struct also has several methods to manipulate the cursor: is_eof to check if the cursor has reached the end of the input string, peek to peek at the current character, advance to advance the cursor to the next character (incrementing the offset), and align to skip the character that ended the current token and align position with offset, marking the start of the next lexeme. The Cursor struct and its methods are defined as follows:
pub struct Cursor {
pub input: String,
pub position: usize,
pub offset: usize,
}
impl Cursor {
pub fn new(input: String) -> Cursor {
Cursor {
input,
position: 0,
offset: 0,
}
}
pub fn is_eof(&self) -> bool {
self.offset >= self.input.len()
}
pub fn peek(&self) -> Option<char> {
if self.is_eof() {
return None;
}
// Note: position and offset are byte indices (they are used to slice
// the input when building lexemes), and advance moves one byte at a
// time, so this lexer assumes ASCII input.
self.input[self.offset..].chars().next()
}
pub fn advance(&mut self) {
if self.is_eof() {
return;
}
self.offset += 1;
}
pub fn align(&mut self) {
if self.is_eof() {
return;
}
// Skip the character that terminated the current token, then start
// the next lexeme right after it. Skipping before copying offset into
// position keeps the delimiter out of the next lexeme.
self.offset += 1;
self.position = self.offset;
}
}
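To see the cursor mechanics in isolation, here is a trimmed copy of Cursor (only the methods exercised below, repeated so the snippet compiles on its own):

```rust
// Trimmed copy of the Cursor from the post: new, is_eof, peek, advance.
pub struct Cursor {
    pub input: String,
    pub position: usize,
    pub offset: usize,
}

impl Cursor {
    pub fn new(input: String) -> Cursor {
        Cursor { input, position: 0, offset: 0 }
    }
    pub fn is_eof(&self) -> bool {
        self.offset >= self.input.len()
    }
    pub fn peek(&self) -> Option<char> {
        if self.is_eof() {
            return None;
        }
        self.input[self.offset..].chars().next()
    }
    pub fn advance(&mut self) {
        if self.is_eof() {
            return;
        }
        self.offset += 1;
    }
}

fn main() {
    let mut cursor = Cursor::new("ab".to_string());
    assert_eq!(cursor.peek(), Some('a')); // peek does not move the cursor
    cursor.advance();
    assert_eq!(cursor.peek(), Some('b'));
    cursor.advance();
    assert!(cursor.is_eof());
    assert_eq!(cursor.peek(), None); // peeking past the end yields None
    println!("ok");
}
```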
State and Transition
Now, we need to define the State and the Transition for the NFA. The State trait has a single method, visit, that takes a Cursor and returns an Option<Transition>. The Transition struct has two fields: state to represent the next state and transition_kind to represent the kind of transition. The TransitionKind enum represents the different types of transitions that the NFA can have; it has three variants: Advance, EmitToken, and End. The State trait, Transition struct, and TransitionKind enum are defined as follows:
pub trait State {
fn visit(&self, cursor: &mut Cursor) -> Option<Transition>;
}
pub struct Transition {
pub state: Box<dyn State>,
pub transition_kind: TransitionKind,
}
pub enum TransitionKind {
Advance,
EmitToken(Token),
End,
}
impl TransitionKind {
pub fn apply(&self, cursor: &mut Cursor) {
match self {
TransitionKind::Advance => cursor.advance(),
TransitionKind::EmitToken(_) => cursor.align(),
TransitionKind::End => {}
}
}
}
Basically, the idea is that when we are in a state, we visit it with a cursor, and the state returns a transition that tells us both the next state (the state field) and what to do while transitioning (the transition_kind field). In the following, we will define the states, and everything will become clearer with some examples.
The Lexer Struct
Finally, we need a struct to represent the lexer. The Lexer struct has two fields: cursor to store the cursor and state to store the current state of the lexer. The Lexer struct has two methods: new to create a new lexer with the given input string, and proceed to build a transition to the given next state with the given transition kind. The Lexer struct and its methods are defined as follows:
pub struct Lexer {
cursor: Cursor,
state: Box<dyn State>,
}
impl Lexer {
pub fn new(input: String) -> Lexer {
Lexer {
cursor: Cursor::new(input),
state: Box::new(StateStart),
}
}
pub fn proceed(state: Box<dyn State>, transition_kind: TransitionKind) -> Transition {
Transition {
state,
transition_kind,
}
}
}
impl Iterator for Lexer {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
loop {
let transition = self.state.visit(&mut self.cursor)?;
if let TransitionKind::End = transition.transition_kind {
return None;
}
self.state = transition.state;
transition.transition_kind.apply(&mut self.cursor);
if let TransitionKind::EmitToken(token) = transition.transition_kind {
return Some(token);
}
}
}
}
Usually, a lexer has a method to get a list of tokens from the input string. In this case, we implement the Iterator trait for the Lexer struct, so the next method returns the next token in the input string. The next method uses a loop to repeatedly visit the current state with the cursor and get the transition. If the transition kind is End, next returns None, indicating that there are no more tokens to emit (stopping the iteration). Otherwise, the state field of the Lexer struct is updated with the next state and the transition_kind of the transition is applied to the cursor. If the transition kind is EmitToken, next returns the token. Simple, right?
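To show this Iterator pattern in miniature, here is a toy lexer (deliberately not the NFA design of this post) that yields whitespace-separated words; it has the same next()-returns-Option shape, where None stops the iteration:

```rust
// Toy illustration of an Iterator-based lexer: yields whitespace-separated
// words as "tokens". Same shape as the post's Lexer, much simpler machinery.
struct ToyLexer {
    input: Vec<char>,
    offset: usize,
}

impl Iterator for ToyLexer {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Skip separators before the next token.
        while self.offset < self.input.len() && self.input[self.offset].is_whitespace() {
            self.offset += 1;
        }
        let start = self.offset;
        // Consume the token itself.
        while self.offset < self.input.len() && !self.input[self.offset].is_whitespace() {
            self.offset += 1;
        }
        if start == self.offset {
            None // no more tokens: stop the iteration
        } else {
            Some(self.input[start..self.offset].iter().collect())
        }
    }
}

fn main() {
    let lexer = ToyLexer { input: "123 foo".chars().collect(), offset: 0 };
    let tokens: Vec<String> = lexer.collect();
    assert_eq!(tokens, vec!["123", "foo"]);
    println!("{:?}", tokens);
}
```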
The States
The Start State
The StateStart is the initial state of the NFA: the starting point of the lexical analysis process. Being a state, the StateStart struct implements the State trait and defines the visit method. The visit method uses the peek method of the Cursor to peek at the current character. If the current character is alphabetic or an underscore, visit returns a Transition to the StateWord state, advancing the cursor to the next character. Analogously, if the current character is numeric, visit returns a Transition to the StateNumber state, advancing the cursor. In every other case (including end of input), visit returns a Transition to the StateEOF state. The StateStart struct and its implementation are defined as follows:
#[derive(Debug)]
pub struct StateStart;
impl State for StateStart {
fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
match cursor.peek() {
Some(c) if c.is_alphabetic() || c.eq(&'_') => {
Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
}
Some(c) if c.is_numeric() => Some(Lexer::proceed(
Box::new(StateNumber),
TransitionKind::Advance,
)),
_ => Some(Lexer::proceed(Box::new(StateEOF), TransitionKind::Advance)),
}
}
}
I think that the code is self-explanatory, but note that I am not covering the case in which the current character is whitespace. In that case, we should skip these characters and advance the cursor to the next character. As it stands, if the input string is "   foo 123" (starting with spaces), the StateStart will go directly to the StateEOF state, which is not the expected behavior; we should skip the spaces and reach the StateWord state.
I will leave this as an exercise for the reader.
The Word State
The StateWord is a state that represents a word token. The StateWord struct implements the State trait and defines the visit method, which peeks at the current character. If the character is alphanumeric or an underscore, visit returns a Transition back to the StateWord state, advancing the cursor; in effect, StateWord is a loop that consumes all alphanumeric characters and underscores. Otherwise, visit returns a Transition to the StateStart state, emitting a token with kind Identifier and a lexeme equal to the substring from the cursor position to the cursor offset. The StateWord struct and its implementation are defined as follows:
#[derive(Debug)]
pub struct StateWord;
impl State for StateWord {
fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
match cursor.peek() {
Some(c) if c.is_alphanumeric() || c.eq(&'_') => {
Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
}
_ => Some(Lexer::proceed(
Box::new(StateStart),
TransitionKind::EmitToken(Token {
kind: TokenKind::Identifier,
lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
}),
)),
}
}
}
The Number State
The StateNumber is a state that represents a number token. The StateNumber struct implements the State trait and defines the visit method, which peeks at the current character. If the character is numeric, visit returns a Transition back to the StateNumber state, advancing the cursor. Otherwise, visit returns a Transition to the StateStart state, emitting a token with kind Number and a lexeme equal to the substring from the cursor position to the cursor offset. The StateNumber struct and its implementation are defined as follows:
#[derive(Debug)]
pub struct StateNumber;
impl State for StateNumber {
fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
match cursor.peek() {
Some(c) if c.is_numeric() => Some(Lexer::proceed(
Box::new(StateNumber),
TransitionKind::Advance,
)),
_ => Some(Lexer::proceed(
Box::new(StateStart),
TransitionKind::EmitToken(Token {
kind: TokenKind::Number,
lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
}),
)),
}
}
}
The EOF State
I am treating the EOF as a token. The StateEOF is a state that represents the end of the input string. The StateEOF struct implements the State trait and defines the visit method, which returns a Transition to the StateEnd state, emitting a token with kind EOF and an empty lexeme. The StateEOF struct and its implementation are defined as follows:
#[derive(Debug)]
pub struct StateEOF;
impl State for StateEOF {
fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
Some(Lexer::proceed(
Box::new(StateEnd),
TransitionKind::EmitToken(Token {
kind: TokenKind::EOF,
lexeme: "".to_string(),
}),
))
}
}
The End State
The StateEnd is the final state of the NFA: the ending point of the lexical analysis process. The StateEnd struct implements the State trait and defines the visit method, which returns a Transition that stays in StateEnd with the End transition kind, signalling that there are no more tokens to emit. The StateEnd struct and its implementation are defined as follows:
#[derive(Debug)]
pub struct StateEnd;
impl State for StateEnd {
fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
Some(Transition {
state: Box::new(StateEnd),
transition_kind: TransitionKind::End,
})
}
}
Conclusion
It is important to note that every time we need to emit a token we return to StateStart. This is because the start state should know all the ways a token can start; to extend this solution while keeping it consistent, the start state must always be able to choose the next state unambiguously.
A possible question might be: how do I distinguish, in a practical case, an identifier from a keyword? Well, it is simple: before emitting a word token, pattern match on the lexeme to emit the correct TokenKind.
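A minimal sketch of that keyword check (the Let and Fn variants here are hypothetical additions for illustration; they are not part of the TokenKind defined in this post):

```rust
// Hypothetical keyword lookup: map a scanned word to a keyword kind
// before emitting, falling back to Identifier for everything else.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Identifier,
    Let, // hypothetical keyword variant, for illustration only
    Fn,  // hypothetical keyword variant, for illustration only
}

fn kind_of_word(lexeme: &str) -> TokenKind {
    match lexeme {
        "let" => TokenKind::Let,
        "fn" => TokenKind::Fn,
        _ => TokenKind::Identifier,
    }
}

fn main() {
    assert_eq!(kind_of_word("let"), TokenKind::Let);
    assert_eq!(kind_of_word("fn"), TokenKind::Fn);
    assert_eq!(kind_of_word("foo"), TokenKind::Identifier);
    println!("ok");
}
```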
I hope you enjoyed this article. If you have any questions or suggestions, please contact me.