Exploiting Finite State Automata for Efficient Lexical Analysis: A Rust Implementation

Abstract

In this post, I will discuss how to exploit Finite State Automata (FSA), in particular Non-Deterministic Finite Automata (NFA), for efficient lexical analysis. I will provide a brief introduction to FSA and NFA, and then I will show how to implement a lexical analyzer using an NFA in Rust.

Introduction

I recently had the opportunity to work on a project that required me to implement a lexical analyzer for a custom programming language. The language was functional in nature, and the lexical analyzer was required to be as efficient as possible to ensure that the language could be used in real-time applications.

In my previous experience, I used backtracking algorithms to implement lexical analyzers. However, I knew that backtracking algorithms could be inefficient for certain types of languages. So, I decided to explore other options to see if I could find a more efficient solution.

Finite State Automata

Finite state automata (FSA) are commonly used in computer science to model the behavior of systems that can be in one of a finite number of states at any given time. They are used in a wide range of applications, including lexical analysis, parsing, and pattern matching. In the literature, there are two main types of finite state automata: deterministic finite automata (DFA) and non-deterministic finite automata (NFA). An NFA is a generalization of a DFA: both recognize exactly the regular languages, but an NFA can often describe the same language with fewer states and is more convenient to specify.

Theoretically speaking, a finite state automaton is a 5-tuple \((Q, \Sigma, \delta, q_0, F)\), where: \[ \begin{align*} Q & : \text{a finite non-empty set of states} \\ \Sigma & : \text{a finite set of input symbols} \\ \delta & : Q \times \Sigma \rightarrow Q \text{ a transition function} \\ q_0 \in Q & : \text{the initial state} \\ F \subseteq Q & : \text{a set of final states, possibly empty} \end{align*} \]

Note that the transition function \(\delta\) is a partial function, meaning that it is not defined for all possible inputs. In other words, there may be some states in the automaton that do not have transitions for all possible input symbols.

An NFA differs from a DFA in the transition function \(\delta\): a state can transition to others without reading a symbol (so-called \(\epsilon\)-transitions), and the transition function has the form \(\delta : Q \times (\Sigma \cup \{\epsilon\}) \rightarrow 2^Q\), where \(2^Q\) is the power set of \(Q\). In other words, the transition function can return a set of states.

In this post, I will implement a lexical analyzer using NFA in Rust. I will show how to define the states, transitions, and the lexical rules using NFA.

Advantages of FSA

Time Complexity

One of the most significant advantages of using FSA over backtracking algorithms is that FSA operates in linear time. This means that the time complexity of an FSA-based lexical analyzer is \(O(n)\), where \(n\) is the length of the input string. In contrast, the time complexity of a backtracking-based lexical analyzer can be exponential in the worst case.
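As a rough illustration of the linear-time claim, here is a sketch (the states and the tiny language, non-empty digit strings, are invented for the example) of a table-driven recognizer: it performs exactly one transition per input character and never backtracks, so the number of steps is \(n\) for any input:

```rust
// A tiny automaton recognizing non-empty digit strings.
// One transition is performed per input character: O(n) overall.
#[derive(Clone, Copy, PartialEq, Debug)]
enum S {
    Start,
    Digits,
    Dead,
}

fn step(state: S, c: char) -> S {
    match (state, c.is_ascii_digit()) {
        (S::Start, true) | (S::Digits, true) => S::Digits,
        _ => S::Dead, // no backtracking: invalid input just falls into a sink state
    }
}

fn accepts(input: &str) -> bool {
    let mut state = S::Start;
    let mut steps = 0usize;
    for c in input.chars() {
        state = step(state, c);
        steps += 1;
    }
    // Exactly one step per character, never more.
    assert_eq!(steps, input.chars().count());
    state == S::Digits
}

fn main() {
    assert!(accepts("123"));
    assert!(!accepts("12a3"));
    assert!(!accepts(""));
    println!("ok");
}
```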

Space Complexity

Another advantage of using FSA is that it has a lower space complexity compared to backtracking algorithms. FSA uses a fixed amount of memory to store the states and transitions, while backtracking algorithms may need to store multiple paths in memory, which can lead to high space complexity.

Error Handling

With FSA, error states can be explicitly defined, making it easier to handle unexpected input gracefully. When an FSA encounters an invalid character or sequence, it can transition to an error state and provide informative feedback to the user. In backtracking algorithms, error handling can be more complex and less straightforward, as it might require unwinding multiple levels of recursive calls and tracking the source of the error.
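As a sketch of this idea (the names and the digit-only language are invented for the example), an explicit error state can simply be another variant that carries enough context for an informative message, with no unwinding involved:

```rust
// An automaton outcome with an explicit error state that remembers
// where the invalid character occurred.
#[derive(Debug, PartialEq)]
enum Scan {
    Ok,
    Error { at: usize, found: char },
}

fn scan_digits(input: &str) -> Scan {
    for (i, c) in input.chars().enumerate() {
        if !c.is_ascii_digit() {
            // Transition to the error state: just a state change carrying
            // the position and the offending character.
            return Scan::Error { at: i, found: c };
        }
    }
    Scan::Ok
}

fn main() {
    assert_eq!(scan_digits("123"), Scan::Ok);
    assert_eq!(scan_digits("12a3"), Scan::Error { at: 2, found: 'a' });
    println!("ok");
}
```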

Implementation

Now that we have discussed the advantages of using FSA for lexical analysis, let's see how we can implement a lexical analyzer using NFA in Rust. In this example, we will implement a simple lexical analyzer that recognizes three kinds of tokens: identifiers (such as foo), numbers (such as 123), and an end-of-file marker.

For instance, consider the following input string: 123 foo. The lexical analyzer should output a Number token with lexeme 123, an Identifier token with lexeme foo, and finally an EOF token.

The NFA we will build has a start state, a word state and a number state that loop while consuming characters, an EOF state, and an end state; a token is emitted each time the automaton leaves one of the looping states.

The Token Struct

We need a struct to represent tokens. The Token struct has two fields: kind to represent the type of the token and lexeme to store the lexeme of the token. The TokenKind enum represents the different types of tokens that the lexer can recognize and it has three variants: Identifier, Number, and EOF (end of file). The Token struct and TokenKind enum are defined as follows:

#[derive(Debug)]
pub struct Token {
    pub kind: TokenKind,
    pub lexeme: String,
}

#[derive(Debug)]
pub enum TokenKind {
    Identifier,
    Number,
    EOF,
}

The Cursor Struct

Next, we need a struct to represent the cursor that will be used to traverse the input string. The Cursor struct has three fields: input to store the input string, position to mark where the current token starts, and offset to track the read head as it scans ahead. The Cursor struct also has several methods to manipulate the cursor: is_eof to check if the cursor has reached the end of the input string, peek to peek at the current character, advance to move the read head to the next character (incrementing the offset), and align to realign the position to the current offset once a token has been emitted. The Cursor struct and its methods are defined as follows:

pub struct Cursor {
    pub input: String,
    pub position: usize,
    pub offset: usize,
}

impl Cursor {
    pub fn new(input: String) -> Cursor {
        Cursor {
            input,
            position: 0,
            offset: 0,
        }
    }

    pub fn is_eof(&self) -> bool {
        self.offset >= self.input.len()
    }

    // NOTE: `offset` is used here as a character index (chars().nth), but the
    // lexeme slices below use it as a byte index; the two agree for ASCII input.
    pub fn peek(&self) -> Option<char> {
        if self.is_eof() {
            return None;
        }

        self.input.chars().nth(self.offset)
    }

    pub fn advance(&mut self) {
        if self.is_eof() {
            return;
        }

        self.offset += 1;
    }

    // Called right after a token is emitted: `position` is moved onto the
    // character that terminated the token (assumed to start the next one),
    // and the read head skips past it.
    pub fn align(&mut self) {
        if self.is_eof() {
            return;
        }

        self.position = self.offset;
        self.offset += 1;
    }
}

State and Transition

Now, we need to define the State and the Transition for the NFA. The State trait has a single method visit that takes a Cursor and returns an Option<Transition>. The Transition struct has two fields: state to represent the next state and transition_kind to represent the kind of transition. The TransitionKind enum represents the different types of transitions that the NFA can have and it has three variants: Advance, EmitToken, and End. The State trait, Transition struct, and TransitionKind enum are defined as follows:

pub trait State {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition>;
}

pub struct Transition {
    pub state: Box<dyn State>,
    pub transition_kind: TransitionKind,
}

pub enum TransitionKind {
    Advance,
    EmitToken(Token),
    End,
}

impl TransitionKind {
    pub fn apply(&self, cursor: &mut Cursor) {
        match self {
            TransitionKind::Advance => cursor.advance(),
            TransitionKind::EmitToken(_) => cursor.align(),
            TransitionKind::End => {}
        }
    }
}

Basically, the idea is that when we visit a state with a cursor, the state returns a transition telling us which state comes next (the state field) and what to do to the cursor along the way (the transition_kind field). In the following sections, we will define the states, and everything will become clearer with some examples.

The Lexer Struct

Finally, we need a struct to represent the lexer. The Lexer struct has two fields: cursor to store the cursor and state to store the current state of the lexer. The Lexer struct has two methods: new to create a new lexer from the given input string, and proceed, a helper that builds a Transition from a next state and a transition kind. The Lexer struct and its methods are defined as follows:

pub struct Lexer {
    cursor: Cursor,
    state: Box<dyn State>,
}

impl Lexer {
    pub fn new(input: String) -> Lexer {
        Lexer {
            cursor: Cursor::new(input),
            state: Box::new(StateStart),
        }
    }

    pub fn proceed(state: Box<dyn State>, transition_kind: TransitionKind) -> Transition {
        Transition {
            state,
            transition_kind,
        }
    }
}

impl Iterator for Lexer {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let transition = self.state.visit(&mut self.cursor)?;
            if let TransitionKind::End = transition.transition_kind {
                return None;
            }
            self.state = transition.state;
            transition.transition_kind.apply(&mut self.cursor);
            if let TransitionKind::EmitToken(token) = transition.transition_kind {
                return Some(token);
            }
        }
    }
}

Usually, a lexer has a method to get a list of tokens from the input string. In this case, we implement the Iterator trait for the Lexer struct, so the next method returns the next token in the input string. The next method loops: it visits the current state with the cursor to get a transition; if the transition kind is End, it returns None, indicating that there are no more tokens to emit (stopping the iteration); otherwise, it updates the state field of the Lexer with the next state and applies the transition kind to the cursor. If the transition kind is EmitToken, the next method returns the token. Simple, right?

The States

The Start State

The StateStart is the initial state of the NFA. It is the starting point of the lexical analysis process. Trivially, the StateStart struct, being a state, implements the State trait and defines the visit method. The visit method uses the peek method of the Cursor to peek at the current character. If the current character is alphabetic or an underscore, the visit method returns a Transition that moves to the StateWord state, advancing the cursor to the next character. Analogously, if the current character is numeric, the visit method returns a Transition that moves to the StateNumber state, advancing the cursor. If the current character matches none of these cases, the visit method returns a Transition that moves to the StateEOF state. The StateStart struct and its implementation are defined as follows:

#[derive(Debug)]
pub struct StateStart;

impl State for StateStart {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_alphabetic() || c.eq(&'_') => {
                Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
            }
            Some(c) if c.is_numeric() => Some(Lexer::proceed(
                Box::new(StateNumber),
                TransitionKind::Advance,
            )),
            _ => Some(Lexer::proceed(Box::new(StateEOF), TransitionKind::Advance)),
        }
    }
}
I think the code is self-explanatory, but note that I am not covering the case in which the current character is whitespace. Such characters should be skipped by moving the cursor past them: with the code as written, if the input string is ␣␣␣foo 123 (starting with spaces), the StateStart goes directly to the StateEOF state, which is not the expected behavior; it should skip the spaces and go to the StateWord state. Whitespace in the middle of the input has a related problem: since align leaves position on the character that ended the previous token, a space there ends up at the front of the next lexeme.
I will leave this as an exercise for the reader.

The Word State

The StateWord is a state that represents a word token. The StateWord struct, being a state, implements the State trait and defines the visit method. The visit method uses the peek method of the Cursor to peek at the current character. If the current character is alphanumeric or an underscore, the visit method returns a Transition that stays in the StateWord state, advancing the cursor to the next character; in effect, the StateWord state is a loop that consumes alphanumeric characters and underscores. If the current character is neither alphanumeric nor an underscore, the visit method returns a Transition that moves back to the StateStart state, emitting a token with the kind Identifier and, as lexeme, the substring from the cursor position to the cursor offset. The StateWord struct and its implementation are defined as follows:

#[derive(Debug)]
pub struct StateWord;

impl State for StateWord {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_alphanumeric() || c.eq(&'_') => {
                Some(Lexer::proceed(Box::new(StateWord), TransitionKind::Advance))
            }
            _ => Some(Lexer::proceed(
                Box::new(StateStart),
                TransitionKind::EmitToken(Token {
                    kind: TokenKind::Identifier,
                    lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
                }),
            )),
        }
    }
}

The Number State

The StateNumber is a state that represents a number token. The StateNumber struct, being a state, implements the State trait and defines the visit method. The visit method uses the peek method of the Cursor to peek at the current character. If the current character is numeric, the visit method returns a Transition that transitions to the StateNumber state advancing the cursor to the next character. If the current character is not numeric, the visit method returns a Transition that transitions to the StateStart state emitting a token with the kind Number and the lexeme being the substring from the cursor position to the cursor offset. The StateNumber struct and its implementation are defined as follows:

#[derive(Debug)]
pub struct StateNumber;

impl State for StateNumber {
    fn visit(&self, cursor: &mut Cursor) -> Option<Transition> {
        match cursor.peek() {
            Some(c) if c.is_numeric() => Some(Lexer::proceed(
                Box::new(StateNumber),
                TransitionKind::Advance,
            )),
            _ => Some(Lexer::proceed(
                Box::new(StateStart),
                TransitionKind::EmitToken(Token {
                    kind: TokenKind::Number,
                    lexeme: cursor.input[cursor.position..cursor.offset].to_string(),
                }),
            )),
        }
    }
}

The EOF State

I am treating the EOF as a token. The StateEOF is a state that represents the end of the input string. The StateEOF struct, being a state, implements the State trait and defines the visit method. The visit method returns a Transition that transitions to the StateEnd state emitting a token with the kind EOF and an empty lexeme. The StateEOF struct and its implementation are defined as follows:

#[derive(Debug)]
pub struct StateEOF;

impl State for StateEOF {
    fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
        Some(Lexer::proceed(
            Box::new(StateEnd),
            TransitionKind::EmitToken(Token {
                kind: TokenKind::EOF,
                lexeme: "".to_string(),
            }),
        ))
    }
}

The End State

The StateEnd is the final state of the NFA. It is the ending point of the lexical analysis process. Trivially, the StateEnd struct, being a state, implements the State trait and defines the visit method. The visit method returns a Transition that stays in the StateEnd state with the End transition kind, which stops the iteration without emitting any further tokens. The StateEnd struct and its implementation are defined as follows:

#[derive(Debug)]
pub struct StateEnd;

impl State for StateEnd {
    fn visit(&self, _cursor: &mut Cursor) -> Option<Transition> {
        Some(Transition {
            state: Box::new(StateEnd),
            transition_kind: TransitionKind::End,
        })
    }
}

Conclusion

It is important to note that every time we need to emit a token we return to StateStart. This is because the start state must know all the ways a token can start: to extend the lexer while staying consistent with this design, the start state should always know, unambiguously, where to go next.

A possible question might be: in practice, how do I distinguish an identifier from a keyword?
Well, it is fairly simple: before emitting a word, we can pattern match on the lexeme to emit the correct TokenKind.
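As a sketch, assuming a hypothetical Keyword variant on TokenKind and an invented keyword set (let and fn), the word state could classify the lexeme just before emitting:

```rust
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum TokenKind {
    Identifier,
    Keyword, // hypothetical variant, not part of the lexer above
    Number,
    EOF,
}

// Decide the kind of a word token right before emitting it.
// The keyword set here ("let", "fn") is invented for the example.
fn classify_word(lexeme: &str) -> TokenKind {
    match lexeme {
        "let" | "fn" => TokenKind::Keyword,
        _ => TokenKind::Identifier,
    }
}

fn main() {
    assert_eq!(classify_word("let"), TokenKind::Keyword);
    assert_eq!(classify_word("fn"), TokenKind::Keyword);
    assert_eq!(classify_word("foo"), TokenKind::Identifier);
    println!("ok");
}
```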

I hope you enjoyed this article. If you have any questions or suggestions, please contact me.