Writing a database in Go: Part 2

Lexing


This is a continuation of the first part of this series where we discussed the data model and query language for this new database. In this second part we will setup our lexer.

Lexing

Lexing is the process for transforming a textual input into tokens representing each keyword, character or puncuation. Fortunately, there’s little we have to do for lexing the input due to a library I have written previous.

The library already handles a large majority of the syntax of the query language we designed previously. However, we will need to create the keyword tokens. We revist the first post and simply record each keyword as a constant.

package lexer

import (
    "github.com/eliquious/lexer"
)

const (
    // Starts the keywords with an offset from the built in tokens so our new tokens don't overlap.
    startKeywords lexer.Token = iota + 1000

    // CREATE starts a CREATE KEYSPACE query.
    CREATE

    // SELECT starts a SELECT FROM query.
    SELECT

    // DELETE deletes keys from a keyspace.
    DELETE

    // DROP deletes an entire keyspace.
    DROP

    // FROM specifies which keyspace to select and delete from.
    FROM

    // WHERE allows for key filtering when selecting or deleting.
    WHERE

    // KEYSPACE signifies what is being created.
    KEYSPACE

    // WITH is used to prefix query options.
    WITH

    // KEY signifies a key attribute follows.
    KEY

    // KEYS signifies several key attributes follow.
    KEYS
    endKeywords

    // Separates the keywords from the conditionals
    startConditionals

    // BETWEEN filters an attribute by two values.
    BETWEEN
    endConditionals
)

Now that the tokens are created we need to load them into the lexer. This is super easy to do. We just create a map of the tokens we want to add and call lexer.LoadTokenMap().

func init() {
    // Loads keyword tokens into lexer
    lexer.LoadTokenMap(tokenMap)
}

var tokenMap = map[lexer.Token]string{
    CREATE:   "CREATE",
    SELECT:   "SELECT",
    DELETE:   "DELETE",
    DROP:     "DROP",
    FROM:     "FROM",
    WHERE:    "WHERE",
    KEYSPACE: "KEYSPACE",
    WITH:     "WITH",
    KEY:      "KEY",
    BETWEEN:  "BETWEEN",
}

Testing the lexer

Now we need to test to see if the keywords have been loaded and the lexer can lex our input. To test the lexer we can write a short program to read from os.Stdin and print the tokens as well as their locations.

package main

import (
        "flag"
        "fmt"
        "os"
        "strconv"

        "github.com/eliquious/lexer"
        _ "github.com/eliquious/prefixdb/lexer"
)

var skipWhitespace = flag.Bool("w", false, "skip whitespace tokens")

func main() {
        flag.Parse()
        l := lexer.NewScanner(os.Stdin)
        for {
                tok, pos, lit := l.Scan()

                // exit if EOF
                if tok == lexer.EOF {
                        break
                }

                // skip whitespace tokens
                if tok == lexer.WS && *skipWhitespace {
                        continue
                }

                // Print token
                if len(lit) > 0 {
                        fmt.Printf("[%4d:%-3d] %10s - %s\n", pos.Line, pos.Char, tok, strconv.QuoteToASCII(lit))
                } else {
                        fmt.Printf("[%4d:%-3d] %10s\n", pos.Line, pos.Char, tok)
                }
        }
}

Now that we have a simple tool, let’s test some queries.

Lexer Output

CREATE KEYSPACE

echo "CREATE KEYSPACE users WITH KEY username" | ./lexer -w=true
[   0:0  ]     CREATE
[   0:7  ]   KEYSPACE
[   0:16 ]      IDENT - "users"
[   0:22 ]       WITH
[   0:27 ]        KEY
[   0:31 ]      IDENT - "username"

SELECT

echo "SELECT FROM users WHERE username = 'eliquious'" | ./lexer -w=true
[   0:0  ]     SELECT
[   0:7  ]       FROM
[   0:12 ]      IDENT - "users"
[   0:18 ]      WHERE
[   0:24 ]      IDENT - "username"
[   0:33 ]          =
[   0:34 ]    TEXTUAL - "eliquious"

I could show all the queries, but I’ll let you run the rest of them. Setting up the lexer didn’t take much time at all, but the parser and AST nodes will take a while longer. We’ll tackle those next time.