tef

bad poster & mediocre photographer

  • they/them

well, more accurately, i wrote a grammar interpreter again.

i've written a lot of parsers, and a lot of prolog. after writing some json parser for myself, and then a yaml parser for a friend, i wondered if there was an easier way to write recursive descent parsers in python


the answer is yes. the code might not look great, but it looks a hell of lot nicer than doing it by hand. especially because you can add special operators to the parser to make your life easier.

 @rule()
    def string_literal(self):
        self.literal("\"")
        with self.capture_node("string"), self.repeat(), self.choice():
            with self.case():
                self.literal("\\u")
                self.range("0-9", "a-f", "A-F")
                self.range("0-9", "a-f", "A-F")
                self.range("0-9", "a-f", "A-F")
                self.range("0-9", "a-f", "A-F")
            with self.case():
                self.literal("\\")
                self.range(
                    "\"", "\\", "/", "b", 
                    "f", "n", "r", "t",
                )
            with self.case():
                self.range("\\", "\"", invert=True)
        self.literal("\"")

i then went a little mad with power. i asked myself, is it possible write a grammar for markdown this way? the answer was also yes.

@rule()
    def atx_heading(self):
        self.whitespace(max=3)
        with self.variable(0) as num, self.capture_node("atx_heading", value=num):
            with self.count(char='#') as level:
                with self.repeat(min=1, max=6):
                    self.literal("#")
            self.set_variable(num, level)
            with self.choice():
                with self.case():
                    self.atx_heading_end()
                with self.case():
                    self.whitespace(min=1)
                    self.inline_element()
                    with self.repeat():
                        with self.reject():
                            self.atx_heading_end()
                        with self.capture_node("whitespace"):
                            self.whitespace()
                        self.inline_element()
                    with self.optional():
                        self.literal("\\")
                        self.capture_value("\\")
                    self.atx_heading_end()

there's nothing that special about the builder, but you might notice some off brand parsing actions. there's something counting characters, and another thing sticking it in the parse tree.

along the way, i ended up adding operators like parallel to parse a section twice over, backref to extract some text just matched to pass into another operator, indented to mark columns for indent and dedent, and some absolutely batshit code to handle tabstops.

does your parser handle "half of a tab"? for someone matches whitespace(4) against a single tab character, and then wants whitespace(4) to handle the rest.

it's this sort of weirdness that i'm proud of. most parser generators are about "look at my fancy algorithm" or "here's backtracking and a kleene star, enjoy", and this one was much more about "here is the precise tool you need for this awful grammar".

which is why i ended up porting it to go:

parser = BuildParser(func(g *G) {
	g.Define("expr", func() {
		g.String("x")
		g.Optional(func() {
			g.Call("expr")
		})
	})
})
if parser.Err() != nil {
    return
}

i already wrote a post about this style of builder, so i won't repeat myself, but it is pleasing to note that this style of definition is much easier in go than in python. go even has enough runtime inspection going around to get decent error messages out of the code, too.

ez_test.go:239: error in Grammar(), 3 in total
ez_test.go:243:expr: error in Cut(), cut must be called directly inside choice, sorry.
ez_test.go:244:expr: error in Byte(), cannot call Byte() in text mode
ez_test.go:245:expr: error in String(), String("\t") contains reserved string "\t"

i managed to get as far as a JSON parser this time, before losing steam. i might pick the library up again in a few days when another cool idea strikes me, but i just haven't found the energy to push forward to write yet another markdown parser

not yet, anyway


You must log in to comment.

in reply to @tef's post:

I noticed for the python side, you've got with self.choice(): followed by with self.case(). Was there any decision on doing it the selfapproach versuswith self.choice() as choice, with choice.case() as case` approach, or was that just too much of a burden to implement?

Also, those additions regarding parallel, or backref (and especially whitespace) are actually nice. I wish more stuff had them. :eggbug-pensive:

i didn't think about doing with self.choice() as c, with c.case() i guess

i'm really not sure what it gets you, because i still have to check you've called case afterwards