x86 Assembler: Building a Lexer (Part 1)

A lexer, or tokenizer, is a tool that transforms text into a series of tokens. These tokens are easier to work with than raw text when building a parser. Typically one would use regular expressions (as most lexical syntax is a regular language) to turn text into tokens, but since the C standard provides no support for regular expressions, we'll be writing our own tokenizing functions.

There are quite a few ways to write a lexer. We'll be making one that generates all of the tokens up front before handing them to the parser; you could instead yield a token only as the parser needs it, or read from a file incrementally as needed.
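
Throughout this post we'll lean on a handful of standard headers. Assuming everything lives in a single file, the includes look something like,

#include <ctype.h>   /* isalpha, isspace */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* malloc, EXIT_SUCCESS */
#include <string.h>  /* memcpy, strlen */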

We'll start with a structure to represent our lexer,

typedef struct tagLEXER {
    char const *pszBuffer;
    size_t cbBufferSize;
    size_t cbPosition;
} LEXER, *PLEXER;

pszBuffer will be our text, cbBufferSize the length of that text, and cbPosition the lexer's current position in the text.

Next, we'll write a function to initialize our lexer,

void LexerInitialize(PLEXER pLexer, char const *pszBuffer,
             size_t cbBufferSize) {
    pLexer->pszBuffer = pszBuffer;
    pLexer->cbBufferSize = cbBufferSize;
    pLexer->cbPosition = 0;
}

And now we'll define the first basic token types we want to recognize,

typedef enum tagTOKEN_TYPE {
    TOKEN_TYPE_IDENTIFIER,
    TOKEN_TYPE_COMMA
} TOKEN_TYPE;

In the future we'll add support for integers and strings.

And finally, the structure for our token,

typedef union tagUTOKEN_VALUE {
    char const *pszString;
    int iInteger;
    char cCharacter;
} UTOKEN_VALUE;

typedef struct tagTOKEN TOKEN, *PTOKEN, **PPTOKEN;
struct tagTOKEN {
    TOKEN_TYPE tokenType;
    UTOKEN_VALUE tokenValue;
    PTOKEN pNext;
};

We'll be storing our tokens as a linked list since it's quite an easy data structure to implement. We use a union to hold the token value; which member is set depends on the token type.

We'll be using C99 compound literals to make creating tokens convenient. First we'll write a function that allocates and fills in a new token,

PTOKEN CreateToken(TOKEN_TYPE tokenType, UTOKEN_VALUE tokenValue) {
    PTOKEN pToken;

    pToken = malloc(sizeof(TOKEN));
    if(!pToken)
        return NULL;

    pToken->tokenType = tokenType;
    pToken->tokenValue = tokenValue;
    pToken->pNext = NULL;

    return pToken;
}

We can call this function like so,

CreateToken(TOKEN_TYPE_IDENTIFIER, (UTOKEN_VALUE) { .pszString = "example" });

Now, on to the actual tokenizing: we'll want a function that transforms the text into a list of tokens,

PTOKEN LexerRun(PLEXER pLexer) {
    PTOKEN pTokens = NULL;

    do {
        char c = pLexer->pszBuffer[pLexer->cbPosition];
        ...
    } while(pLexer->cbPosition < pLexer->cbBufferSize);

    return pTokens;
}

We'll want a function that advances the lexer by some number of bytes, updating cbPosition,

void LexerAdvance(PLEXER pLexer, size_t cBytes) {
    pLexer->cbPosition += cBytes;
}

and a function that prepends a token to our token list,

void TokenPrepend(PPTOKEN ppToken, PTOKEN pToken) {
    if(!*ppToken) {
        pToken->pNext = NULL;
        *ppToken = pToken;
        return;
    }

    pToken->pNext = *ppToken;
    *ppToken = pToken;
}
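
A quick note on ordering (this snippet isn't part of the assembler itself): prepending "first" and then "second" leaves "second" at the head of the list; we'll come back to this in a moment,

/* NULL checks on CreateToken omitted for brevity */
PTOKEN pList = NULL;

TokenPrepend(&pList, CreateToken(TOKEN_TYPE_IDENTIFIER,
    (UTOKEN_VALUE) { .pszString = "first" }));
TokenPrepend(&pList, CreateToken(TOKEN_TYPE_IDENTIFIER,
    (UTOKEN_VALUE) { .pszString = "second" }));

/* pList->tokenValue.pszString is now "second",
   pList->pNext->tokenValue.pszString is "first" */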

Now we can focus on the main loop of our lexer; first we'll handle whitespace and commas.

(you'll want to include ctype.h for functions isspace and isalpha)

do {
    char c = pLexer->pszBuffer[pLexer->cbPosition];
    if(c == ',') {
        PTOKEN pToken = CreateToken(
            TOKEN_TYPE_COMMA,
            (UTOKEN_VALUE) { .cCharacter = c }
        );

        TokenPrepend(&pTokens, pToken);
        LexerAdvance(pLexer, 1);
    }
    else if(isspace(c)) {
        LexerAdvance(pLexer, 1);
    } 
} while(pLexer->cbPosition < pLexer->cbBufferSize);

We look at the current character and check whether it's a comma; if it is, we generate a token and prepend it to the list, and if it's whitespace we simply advance to the next character. With that, we'd expect to be able to tokenize something like ",, ,\t, ,", some string of whitespace and commas.
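
For reference, with that loop dropped into the LexerRun skeleton from earlier, the whole function at this stage looks like this (assembled from the pieces above),

PTOKEN LexerRun(PLEXER pLexer) {
    PTOKEN pTokens = NULL;

    do {
        char c = pLexer->pszBuffer[pLexer->cbPosition];

        if(c == ',') {
            PTOKEN pToken = CreateToken(
                TOKEN_TYPE_COMMA,
                (UTOKEN_VALUE) { .cCharacter = c }
            );

            TokenPrepend(&pTokens, pToken);
            LexerAdvance(pLexer, 1);
        }
        else if(isspace(c)) {
            LexerAdvance(pLexer, 1);
        }
    } while(pLexer->cbPosition < pLexer->cbBufferSize);

    return pTokens;
}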

We have a little more busy work to do. You may have noticed we're prepending to the list instead of appending, so our tokens currently come out backwards, and we'll want a function to reverse them.

We'll also want a way to print the tokens in our list, so we'll write functions to do both.

void TokenReverse(PPTOKEN ppToken) {
    PTOKEN pReversed = NULL;

    for(PTOKEN pCurrent = *ppToken; pCurrent;) {
        PTOKEN pTemp = pCurrent->pNext;
        TokenPrepend(&pReversed, pCurrent);
        pCurrent = pTemp;
    }

    *ppToken = pReversed;
}

We'll make a function that applies a callback to each token in the list; that way we can write a PrintToken function to print each token's type and value,

typedef void (*TOKEN_MAP_FN_ROUTINE)(PTOKEN, void *);
void TokenMapFn(PTOKEN pToken, TOKEN_MAP_FN_ROUTINE pMapFn,
        void *pUserData) {
    for(PTOKEN pCurrent = pToken; pCurrent; pCurrent = pCurrent->pNext) {
        pMapFn(pCurrent, pUserData);
    }
}

void PrintToken(PTOKEN pToken, void *pUserData) {
    switch(pToken->tokenType) {
        case TOKEN_TYPE_COMMA:
            printf("comma: %c\n", pToken->tokenValue.cCharacter);
            break;
        default:
            printf("unknown: %d\n", pToken->tokenType);
    }
}
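
PrintToken ignores pUserData, but that parameter is what lets a callback carry state. As a quick illustration (a hypothetical CountToken, not part of the assembler itself), you could count tokens like this,

void CountToken(PTOKEN pToken, void *pUserData) {
    (void) pToken;              /* we only care about the count */
    ++*(size_t *) pUserData;    /* bump the caller's counter */
}

/* usage:
 *   size_t cTokens = 0;
 *   TokenMapFn(pTokens, CountToken, &cTokens);
 */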

And now for our main function,

int main(int argc, char **argv) {
    LEXER lexer;
    char const *pszBuffer = ",, ,\t, ,";

    LexerInitialize(&lexer, pszBuffer, strlen(pszBuffer));
    PTOKEN pTokens = LexerRun(&lexer);
    TokenReverse(&pTokens);
    TokenMapFn(pTokens, PrintToken, NULL);

    return EXIT_SUCCESS;
}

And compiling and running will yield,

C:\Users\Jacob Lusk\jasm>main.exe
comma: ,
comma: ,
comma: ,
comma: ,
comma: ,

Now, this isn't too interesting; we can't do much if commas are the only thing we tokenize, so let's do identifiers next.

To tokenize an identifier we first have to decide what we're looking for. To start, we'll say an identifier is simply a string of contiguous alphabetical characters.

We'll define a function, LexerIdentifier, like so,

PTOKEN LexerIdentifier(PLEXER pLexer) {
    size_t cLexemeLength = 0;
    char c;

    do {
        cLexemeLength++;
        c = pLexer->pszBuffer[pLexer->cbPosition + cLexemeLength];
    } while(
        pLexer->cbBufferSize > (pLexer->cbPosition + cLexemeLength)
        && isalpha(c)
    );

    char *pszLexeme = malloc(sizeof(char) * (cLexemeLength + 1));
    pszLexeme[cLexemeLength] = '\0';

    memcpy(
        pszLexeme,
        pLexer->pszBuffer + pLexer->cbPosition,
        cLexemeLength
    );

    PTOKEN pToken = CreateToken(
        TOKEN_TYPE_IDENTIFIER,
        (UTOKEN_VALUE) { .pszString = pszLexeme }
    );

    LexerAdvance(pLexer, cLexemeLength);
    return pToken;
}

Here we count how many contiguous alphabetical characters there are, allocate a new string, and copy the data from the text into it so we can place it inside our token. And with a small modification to our LexerRun we should be able to tokenize identifiers correctly now.

    ...
    else if(isspace(c)) {
        LexerAdvance(pLexer, 1);
    } else if(isalpha(c)) {
        PTOKEN pToken = LexerIdentifier(pLexer);
        TokenPrepend(&pTokens, pToken);
    }
    ...

And now compiling with a more interesting string such as "mov rax, rbx":

char const *pszBuffer = "mov rax, rbx";

And modifying our PrintToken function to support identifiers,

case TOKEN_TYPE_IDENTIFIER:
    printf("identifier: %s\n", pToken->tokenValue.pszString);
    break;

And now running the executable,

C:\Users\Jacob Lusk\jasm>main.exe
identifier: mov
identifier: rax
comma: ,
identifier: rbx
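
One last note: every token, and each identifier lexeme that LexerIdentifier allocates, lives on the heap and is never freed in this post. If you want to clean up at the end, a hypothetical TokenListFree (assuming identifier strings always come from LexerIdentifier's malloc, unlike the string literal in the earlier CreateToken example) might look like,

void TokenListFree(PTOKEN pToken) {
    while(pToken) {
        PTOKEN pNext = pToken->pNext;

        /* identifier tokens own a heap-allocated copy of their lexeme */
        if(pToken->tokenType == TOKEN_TYPE_IDENTIFIER)
            free((void *) pToken->tokenValue.pszString);

        free(pToken);
        pToken = pNext;
    }
}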

That should be enough for part 1. In part 2 we'll tackle error handling, which we're completely ignoring here, and finish up the lexer.

You can find the source for this here.