Exploiting Finite State Automata for Efficient Lexical Analysis: A Rust Implementation
Abstract
In this post, I will discuss how to leverage Finite State Automata (FSA), in particular Non-Deterministic Finite Automata (NFA), for efficient lexical analysis. I will provide a brief introduction to FSA and NFA, and then I will show how to implement a lexical analyzer using an NFA in Rust.
Introduction
I recently had the opportunity to work on a project that required me to implement a lexical analyzer for a custom programming language. The language was functional in nature, and the lexical analyzer was required to be as efficient as possible to ensure that the language could be used in real-time applications.
In my previous experience, I used backtracking algorithms to implement lexical analyzers. However, I knew that backtracking algorithms could be inefficient for certain types of languages. So, I decided to explore other options to see if I could find a more efficient solution.
Finite State Automata
Finite state automata (FSA) are commonly used in computer science to model the behavior of systems that can be in one of a finite number of states at any given time. They are used in a wide range of applications, including lexical analysis, parsing, and pattern matching. In the literature, there are two main types of finite state automata: deterministic finite automata (DFA) and non-deterministic finite automata (NFA). NFA is a generalization of DFA; although both recognize exactly the regular languages, an NFA can often express the same language more compactly and conveniently.
Formally, a finite state automaton is a 5-tuple \((Q, \Sigma, \delta, q_0, F)\), where: \[ \begin{align*} Q & : \text{a finite non-empty set of states} \\ \Sigma & : \text{a finite set of input symbols} \\ \delta & : Q \times \Sigma \rightarrow Q, \text{ the transition function} \\ q_0 \in Q & : \text{the initial state} \\ F \subseteq Q & : \text{the set of final states, possibly empty} \end{align*} \]
Note that the transition function \(\delta\) is a partial function, meaning that it is not defined for all possible inputs. In other words, there may be some states in the automaton that do not have transitions for all possible input symbols.
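To make the definition concrete, here is a minimal sketch of my own (a toy example, not part of the lexer built later): a DFA that accepts non-empty digit strings, where the partial transition function is modeled with `Option`.

```rust
// The 5-tuple of a DFA accepting non-empty digit strings:
// Q = {Start, Digits}, Sigma = characters, q0 = Start, F = {Digits}.
#[derive(Clone, Copy, PartialEq)]
enum DfaState {
    Start,
    Digits,
}

// The partial transition function delta: Q x Sigma -> Q.
// Returning None models "no transition defined for this input".
fn delta(state: DfaState, symbol: char) -> Option<DfaState> {
    match (state, symbol) {
        (DfaState::Start, c) if c.is_ascii_digit() => Some(DfaState::Digits),
        (DfaState::Digits, c) if c.is_ascii_digit() => Some(DfaState::Digits),
        _ => None,
    }
}

// Run the automaton: a single left-to-right pass over the input,
// so each character is examined exactly once.
fn accepts(input: &str) -> bool {
    let mut state = DfaState::Start;
    for c in input.chars() {
        match delta(state, c) {
            Some(next) => state = next,
            None => return false, // transition undefined: reject
        }
    }
    state == DfaState::Digits // accept iff we end in a final state
}
```

Note how the single pass already hints at the linear-time behavior discussed below.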
An NFA differs from a DFA in the transition function \(\delta\): a state can transition to other states without reading a symbol (so-called \(\epsilon\)-transitions), and the transition function has the form \(\delta : Q \times \Sigma \rightarrow 2^Q\), where \(2^Q\) is the power set of \(Q\). In other words, the transition function returns a set of states.
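As a sketch of this non-deterministic shape (again a toy example of my own, not the lexer's NFA), the transition function can return a set of successor states, and the automaton is simulated by tracking every state it could currently be in:

```rust
use std::collections::HashSet;

// Toy NFA over {a, b} that accepts strings ending in "ab".
// States: 0 = start, 1 = seen the final 'a', 2 = accepting.
type NfaState = u8;

// delta: Q x Sigma -> 2^Q, so it returns a *set* of states.
fn delta(state: NfaState, symbol: char) -> HashSet<NfaState> {
    match (state, symbol) {
        // From the start state, 'a' non-deterministically either stays
        // (this 'a' is not part of the final "ab") or begins the suffix.
        (0, 'a') => HashSet::from([0, 1]),
        (0, 'b') => HashSet::from([0]),
        (1, 'b') => HashSet::from([2]),
        _ => HashSet::new(),
    }
}

// Subset simulation: follow every possible path at once.
fn accepts(input: &str) -> bool {
    let mut current: HashSet<NfaState> = HashSet::from([0]);
    for c in input.chars() {
        current = current.iter().flat_map(|&s| delta(s, c)).collect();
    }
    current.contains(&2) // accept if any path reached the final state
}
```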
In this post, I will implement a lexical analyzer using NFA in Rust. I will show how to define the states, transitions, and the lexical rules using NFA.
Advantages of FSA
Time Complexity
One of the most significant advantages of using FSA over backtracking algorithms is that FSA operates in linear time. This means that the time complexity of an FSA-based lexical analyzer is \(O(n)\), where \(n\) is the length of the input string. In contrast, the time complexity of a backtracking-based lexical analyzer can be exponential in the worst case.
Space Complexity
Another advantage of using FSA is that it has a lower space complexity compared to backtracking algorithms. FSA uses a fixed amount of memory to store the states and transitions, while backtracking algorithms may need to store multiple paths in memory, which can lead to high space complexity.
Error Handling
With FSA, error states can be explicitly defined, making it easier to handle unexpected input gracefully. When an FSA encounters an invalid character or sequence, it can transition to an error state and provide informative feedback to the user. In backtracking algorithms, error handling can be more complex and less straightforward, as it might require unwinding multiple levels of recursive calls and tracking the source of the error.
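As a minimal sketch of this idea (a toy example, not part of the lexer below), an automaton can carry an explicit error state that records the position and character that caused the failure:

```rust
#[derive(Debug, PartialEq)]
enum ScanState {
    Digits,
    // Explicit error state carrying enough context for a useful message.
    Error { position: usize, found: char },
}

// Run a digit-recognizing scan; on invalid input the machine parks in
// the Error state instead of failing silently or unwinding a recursion.
fn scan_digits(input: &str) -> ScanState {
    for (i, c) in input.char_indices() {
        if !c.is_ascii_digit() {
            return ScanState::Error { position: i, found: c };
        }
    }
    ScanState::Digits
}
```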
Implementation
Now that we have discussed the advantages of using FSA for lexical analysis, let's see how we can implement a lexical analyzer using NFA in Rust. In this example, we will implement a simple lexical analyzer that recognizes the following tokens:
- Integer literals (e.g. `123`)
- Identifiers (e.g. `foo`)

For instance, consider the input string `123 foo`. The lexical analyzer should output the following tokens:

- `Number("123")`
- `Identifier("foo")`

The NFA that produces them is built from the states described in the following sections.
The Token Struct
We need a struct to represent tokens. The `Token` struct has two fields: `kind` to represent the type of the token and `lexeme` to store the lexeme of the token. The `TokenKind` enum represents the different types of tokens that the lexer can recognize, and it has three variants: `Identifier`, `Number`, and `EOF` (end of file). The `Token` struct and the `TokenKind` enum are defined as follows:
```rust
#[derive(Debug)]
pub struct Token {
    pub kind: TokenKind,
    pub lexeme: String,
}

#[derive(Debug)]
pub enum TokenKind {
    Identifier,
    Number,
    EOF,
}
```
The Cursor Struct
Next, we need a struct to represent the cursor that will be used to traverse the input string. The `Cursor` struct has three fields: `input` to store the input string, `position` to store the current position in the input string, and `offset` to store the current offset for the current token. The `Cursor` struct also has several methods to manipulate the cursor: `is_eof` to check if the cursor has reached the end of the input string, `peek` to peek at the current character, `advance` to advance the cursor to the next character (incrementing the offset), and `align` to align the position to the current offset. The `Cursor` struct and its methods are defined as follows:
```rust
pub struct Cursor {
    pub input: String,
    pub position: usize,
    pub offset: usize,
}

impl Cursor {
    pub fn new(input: String) -> Cursor {
        Cursor {
            input,
            position: 0,
            offset: 0,
        }
    }

    pub fn is_eof(&self) -> bool {
        self.offset >= self.input.len()
    }

    pub fn peek(&self) -> Option<char> {
        if self.is_eof() {
            return None;
        }
        // Note: this treats `offset` as a character index, while `is_eof`
        // and the lexeme slicing in the states use byte indices; the two
        // coincide only for ASCII input.
        self.input.chars().nth(self.offset)
    }

    pub fn advance(&mut self) {
        if self.is_eof() {
            return;
        }
        self.offset += 1;
    }

    pub fn align(&mut self) {
        if self.is_eof() {
            return;
        }
        self.position = self.offset;
        self.offset += 1;
    }
}
```
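The key invariant is that `position` marks the start of the token being scanned while `offset` is one past the character just examined; the lexeme of a token is the slice between the two. A minimal illustration, with plain arguments standing in for the `Cursor` fields (byte indices, so ASCII input is assumed):

```rust
// The lexeme is the slice of the input between `position` (inclusive)
// and `offset` (exclusive), exactly as the states below compute it.
fn lexeme(input: &str, position: usize, offset: usize) -> &str {
    &input[position..offset]
}
```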
State and Transition
Now, we need to define the `State` and the `Transition` for the NFA. The `State` trait has a single method, `visit`, that takes a `Cursor` and returns an `Option<Transition>`. The `Transition` struct has two fields: `state` to represent the next state and `transition_kind` to represent the kind of transition. The `TransitionKind` enum represents the different types of transitions that the NFA can take, and it has three variants: `Advance`, `EmitToken`, and `End`. The `State` trait, the `Transition` struct, and the `TransitionKind` enum are defined as follows:
```rust
pub trait State {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition>;
}

pub struct Transition {
    pub state: Box<dyn State>,
    pub transition_kind: TransitionKind,
}

pub enum TransitionKind {
    Advance,
    EmitToken(Token),
    End,
}

impl TransitionKind {
    pub fn apply(&self, cursor: &mut Cursor) {
        match self {
            TransitionKind::Advance => cursor.advance(),
            TransitionKind::EmitToken(_) => cursor.align(),
            TransitionKind::End => {}
        }
    }
}
```
Basically, the idea is that when we are in a state, we visit it with a cursor, and the state returns a transition telling us what the next state will be (the `state` field) and what we should do during the transition (the `transition_kind` field). In the following sections, we will define the states, and everything will become clearer with some examples.
The Lexer Struct
Finally, we need a struct to represent the lexer. The `Lexer` struct has two fields: `cursor` to store the cursor and `state` to store the current state of the lexer. The `Lexer` struct has two methods: `new` to create a new lexer for the given input string and `proceed` to build the transition to the next state with the given transition kind. The `Lexer` struct and its methods are defined as follows:
```rust
pub struct Lexer {
    cursor: Cursor,
    state: Box<dyn State>,
}

impl Lexer {
    pub fn new(input: String) -> Lexer {
        Lexer {
            cursor: Cursor::new(input),
            state: Box::new(StateStart),
        }
    }

    pub fn proceed(state: Box<dyn State>, transition_kind: TransitionKind) -> Transition {
        Transition {
            state,
            transition_kind,
        }
    }
}
```
```rust
impl Iterator for Lexer {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let transition = self.state.visit(&mut self.cursor)?;
            if let TransitionKind::End = transition.transition_kind {
                return None;
            }
            self.state = transition.state;
            transition.transition_kind.apply(&mut self.cursor);
            if let TransitionKind::EmitToken(token) = transition.transition_kind {
                return Some(token);
            }
        }
    }
}
```
Usually, a `Lexer` has a method to get a list of tokens from the input string. In this case, we implement the `Iterator` trait for the `Lexer` struct: the `next` method returns the next token in the input string. It uses a loop to repeatedly visit the current state with the cursor and get the transition. If the transition kind is `End`, `next` returns `None`, indicating that there are no more tokens to emit (stopping the iteration). Otherwise, the `state` field of the `Lexer` struct is updated with the next state, and the `transition_kind` of the transition is applied to the cursor. If the transition kind is `EmitToken`, `next` returns the token. Simple, right?
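The drive loop is easier to see on a toy machine. The sketch below is my own stripped-down example, not the lexer itself: it keeps the same `State`/`Transition` shape, counts down from a starting value, and emits only the even numbers, with `Advance` playing the same role as in the lexer (make progress without emitting anything).

```rust
trait State {
    fn visit(&self, counter: &mut u32) -> Transition;
}

struct Transition {
    state: Box<dyn State>,
    kind: TransitionKind,
}

enum TransitionKind {
    Advance,
    Emit(u32),
    End,
}

struct Counting;
struct Done;

impl State for Counting {
    fn visit(&self, counter: &mut u32) -> Transition {
        if *counter == 0 {
            return Transition { state: Box::new(Done), kind: TransitionKind::End };
        }
        let value = *counter;
        *counter -= 1;
        if value % 2 == 0 {
            Transition { state: Box::new(Counting), kind: TransitionKind::Emit(value) }
        } else {
            // Like Advance in the lexer: consume input without emitting,
            // so the drive loop keeps spinning.
            Transition { state: Box::new(Counting), kind: TransitionKind::Advance }
        }
    }
}

impl State for Done {
    fn visit(&self, _counter: &mut u32) -> Transition {
        Transition { state: Box::new(Done), kind: TransitionKind::End }
    }
}

struct Machine {
    counter: u32,
    state: Box<dyn State>,
}

// The same drive loop as the lexer's `next`: visit, stop on End,
// swap in the next state, and return an item only on Emit.
impl Iterator for Machine {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            let transition = self.state.visit(&mut self.counter);
            if let TransitionKind::End = transition.kind {
                return None;
            }
            self.state = transition.state;
            if let TransitionKind::Emit(value) = transition.kind {
                return Some(value);
            }
        }
    }
}
```

For instance, `Machine { counter: 5, state: Box::new(Counting) }` collected into a `Vec` yields `[4, 2]`.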
The States
The Start State
The `StateStart` is the initial state of the NFA: the starting point of the lexical analysis process. Trivially, the `StateStart` struct, being a state, implements the `State` trait and defines the `visit` method. The `visit` method uses the `peek` method of the `Cursor` to peek at the current character. If the current character is alphabetic or an underscore, `visit` returns a `Transition` to the `StateWord` state, advancing the cursor to the next character. Analogously, if the current character is numeric, `visit` returns a `Transition` to the `StateNumber` state, advancing the cursor to the next character. If the current character is neither alphabetic nor numeric, `visit` returns a `Transition` to the `StateEOF` state. The `StateStart` struct and its implementation are defined as follows:
```rust
#[derive(Debug)]
pub struct StateStart;

impl State for StateStart {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_alphabetic() || c.eq(&'_') => {
                Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
            }
            Some(c) if c.is_numeric() => Some(Lexer::proceed(
                Box::new(StateNumber),
                TransitionKind::Advance,
            )),
            _ => Some(Lexer::proceed(Box::new(StateEOF), TransitionKind::Advance)),
        }
    }
}
```
I think that the code is self-explanatory, but note that I am not covering the case in which the current character is whitespace. We should ignore such characters and advance the cursor to the next character. In fact, if the input string is `␣␣␣foo 123` (starting with spaces), the `StateStart` will go directly to the `StateEOF` state, which is not the expected behavior: it should skip the spaces and go to the `StateWord` state. I will leave this as an exercise for the reader.
The Word State
The `StateWord` is the state that recognizes a word token. The `StateWord` struct, being a state, implements the `State` trait and defines the `visit` method, which uses the `peek` method of the `Cursor` to peek at the current character. If the current character is alphanumeric or an underscore, `visit` returns a `Transition` back to the `StateWord` state, advancing the cursor to the next character; basically, the `StateWord` state is a loop that consumes all alphanumeric characters and underscores. If the current character is not alphanumeric, `visit` returns a `Transition` to the `StateStart` state, emitting a token with the kind `Identifier` and, as lexeme, the substring from the cursor position to the cursor offset. The `StateWord` struct and its implementation are defined as follows:
```rust
#[derive(Debug)]
pub struct StateWord;

impl State for StateWord {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_alphanumeric() || c.eq(&'_') => {
                Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
            }
            _ => Some(Lexer::proceed(
                Box::new(StateStart),
                TransitionKind::EmitToken(Token {
                    kind: TokenKind::Identifier,
                    lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
                }),
            )),
        }
    }
}
```
The Number State
The `StateNumber` is the state that recognizes a number token. The `StateNumber` struct, being a state, implements the `State` trait and defines the `visit` method, which uses the `peek` method of the `Cursor` to peek at the current character. If the current character is numeric, `visit` returns a `Transition` back to the `StateNumber` state, advancing the cursor to the next character. If the current character is not numeric, `visit` returns a `Transition` to the `StateStart` state, emitting a token with the kind `Number` and, as lexeme, the substring from the cursor position to the cursor offset. The `StateNumber` struct and its implementation are defined as follows:
```rust
#[derive(Debug)]
pub struct StateNumber;

impl State for StateNumber {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_numeric() => Some(Lexer::proceed(
                Box::new(StateNumber),
                TransitionKind::Advance,
            )),
            _ => Some(Lexer::proceed(
                Box::new(StateStart),
                TransitionKind::EmitToken(Token {
                    kind: TokenKind::Number,
                    lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
                }),
            )),
        }
    }
}
```
The EOF State
I am treating the `EOF` as a token. The `StateEOF` is the state that represents the end of the input string. The `StateEOF` struct, being a state, implements the `State` trait and defines the `visit` method, which returns a `Transition` to the `StateEnd` state, emitting a token with the kind `EOF` and an empty lexeme. The `StateEOF` struct and its implementation are defined as follows:
```rust
#[derive(Debug)]
pub struct StateEOF;

impl State for StateEOF {
    fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
        Some(Lexer::proceed(
            Box::new(StateEnd),
            TransitionKind::EmitToken(Token {
                kind: TokenKind::EOF,
                lexeme: "".to_string(),
            }),
        ))
    }
}
```
The End State
The `StateEnd` is the final state of the NFA: the ending point of the lexical analysis process. Trivially, the `StateEnd` struct, being a state, implements the `State` trait and defines the `visit` method, which returns a `Transition` that remains in the `StateEnd` state with the `End` transition kind, stopping the iteration. The `StateEnd` struct and its implementation are defined as follows:
```rust
#[derive(Debug)]
pub struct StateEnd;

impl State for StateEnd {
    fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
        Some(Transition {
            state: Box::new(StateEnd),
            transition_kind: TransitionKind::End,
        })
    }
}
```
Conclusion
It is important to note that every time we emit a token we return to `StateStart`; this is because the start state must know all the ways a token can start. So, to extend this solution while keeping it consistent, the start state should always know unambiguously where to go.
A possible question might be: in practice, how do I distinguish an identifier from a keyword?
Well, it is trivial: before emitting a word token, I can pattern match on the substring to emit the correct `TokenKind`.
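As a sketch of that check (the `Keyword` variant and the keyword set here are made up for illustration; the `TokenKind` defined above has no such variant):

```rust
#[derive(Debug, PartialEq)]
enum WordKind {
    Identifier,
    Keyword, // hypothetical variant, not part of the lexer above
}

// Decide the kind of a word token right before emitting it.
fn word_kind(lexeme: &str) -> WordKind {
    match lexeme {
        // The keyword set is invented for this sketch.
        "let" | "fn" | "if" | "else" => WordKind::Keyword,
        _ => WordKind::Identifier,
    }
}
```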
I hope you enjoyed this article. If you have any questions or suggestions, please contact me.