tef

i can't promise much but i'm about 80% sure this will be more coherent than my last post on the topic (which you don't need to read)

here is the problem:

  • i have a parsing expression grammar (peg) style parser (mostly)
  • this means i have operators like * and ? or similar
  • and operators like lookahead or range or choice
  • i want to be able to parse an indented block using operators
  • what should the operator look like?

originally, i went with something like this


i added an indented operator that takes a nested block of parse actions, and an indent operator, to match against the current level of indentation.

literal "block"
newline
whitespace(min=1) # we look for the first indent
indented { # set the indent level at this point 
   expression
   newline
   repeat(min=0) { 
      indent # read in whitespace to match column
      expression
      newline
   }
}

which would parse input like this

block
   expression
   expression
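
to make the hidden state concrete, here's a rough python sketch of that grammar written out by hand. it is not my actual engine: every function name is made up for illustration, it only handles spaces, and an "expression" is assumed to be any non-empty line that doesn't start with a space. the thing to notice is column_at, which stands in for the column value the engine has to carry around.

```python
def column_at(text, pos):
    """column of pos within its line: the implicit state the engine tracks"""
    return pos - (text.rfind("\n", 0, pos) + 1)

def match_indent(text, pos, column):
    """the `indent` operator: consume exactly `column` spaces, or fail"""
    end = pos + column
    return end if text[pos:end] == " " * column else None

def expression_line(text, pos):
    """expression newline (assumed: a non-empty line not starting with a space)"""
    end = text.find("\n", pos)
    if end in (-1, pos) or text[pos] == " ":
        return None
    return text[pos:end], end + 1

def parse_block(text):
    head = "block\n"
    if not text.startswith(head):
        return None
    pos = len(head)
    # whitespace(min=1): look for the first indent
    while text[pos:pos + 1] == " ":
        pos += 1
    if pos == len(head):
        return None
    # indented { ... }: capture whatever column the engine is at right now
    column = column_at(text, pos)
    first = expression_line(text, pos)
    if first is None:
        return None
    items, pos = [first[0]], first[1]
    # repeat(min=0) { indent expression newline }
    while True:
        after_indent = match_indent(text, pos, column)
        if after_indent is None:
            break
        line = expression_line(text, after_indent)
        if line is None:
            break
        items.append(line[0])
        pos = line[1]
    return items, pos
```

calling parse_block("block\n   a\n   b\n") gives back the two expressions and the position where parsing stopped. a line indented deeper or shallower than the first one just ends the block.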

it doesn't look bad, and i know it works, but there's one annoying problem: the parsing engine now has to keep a column value internally. this didn't feel like a big problem at first, as having the column value around was useful for debugging. this time around, i made the indented text stuff optional, and it just feels really clunky underneath.

i hoped that another style of operator might improve matters, but nothing really came of it. one option was to have a parse-ahead operator, another was to have variables and pass them in explicitly, but neither one really lifts the need to track columns implicitly.

still, they are useful things to have in a parser toolkit. the parse-ahead operator is "parse the input using this rule to work out a length, then parse it again using the next rule, but only up to that length, exactly", and passing in variables is one way of handling things like operator precedence, or data dependent grammars, but i digress.
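
a small sketch of the parse-ahead idea, in python. the shape is an assumption on my part for illustration: a rule is a function taking (text, pos, end) and returning the new position or None, and the names literal, until, many and parse_ahead are all hypothetical. the first rule only measures, the second rule gets a window that stops at the measured length, and it has to consume that window exactly.

```python
def literal(s):
    """match the string s, staying inside the window [pos, end)"""
    def rule(text, pos, end):
        stop = pos + len(s)
        return stop if stop <= end and text[pos:stop] == s else None
    return rule

def until(ch):
    """consume up to, but not including, the next ch (or the end of the window)"""
    def rule(text, pos, end):
        found = text.find(ch, pos, end)
        return end if found == -1 else found
    return rule

def many(inner):
    """repeat inner zero or more times, stopping when it fails or stalls"""
    def rule(text, pos, end):
        while True:
            nxt = inner(text, pos, end)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return rule

def parse_ahead(measure, inner):
    def rule(text, pos, end):
        stop = measure(text, pos, end)   # first pass: work out a length
        if stop is None:
            return None
        result = inner(text, pos, stop)  # second pass: window ends at stop
        return stop if result == stop else None  # must use the length exactly
    return rule

# usage: a line made only of "ab" pairs, measured up to the newline first
line = parse_ahead(until("\n"), many(literal("ab")))
```

here line("abab\nab", 0, 7) succeeds and stops at the newline, while line("aba\n", 0, 4) fails because the inner rule can't account for the whole measured length.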

the only option i saw on the table was to use a "special tokenizer" and match over indent and dedent tokens, but that felt like cheating.
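
for comparison, the tokenizer route looks roughly like the python sketch below. it's the same general trick python's own tokenizer uses (a stack of indent widths), but the token names and shapes here are just assumptions for illustration, and it only understands spaces.

```python
def tokenize(text):
    """turn lines into LINE tokens, with INDENT/DEDENT tokens around blocks"""
    tokens = []
    stack = [0]  # indent widths of the blocks we are currently inside
    for line in text.splitlines():
        body = line.lstrip(" ")
        if not body:
            continue  # blank lines don't change the indent level
        width = len(line) - len(body)
        if width > stack[-1]:
            stack.append(width)
            tokens.append(("INDENT", width))
        while width < stack[-1]:
            stack.pop()
            tokens.append(("DEDENT", stack[-1]))
        if width != stack[-1]:
            raise ValueError("dedent does not match any outer indent")
        tokens.append(("LINE", body))
    # close any blocks still open at the end of the input
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", stack[-1]))
    return tokens
```

the grammar then matches INDENT and DEDENT like any other token, which is why it feels like cheating: all the interesting state lives in the tokenizer's stack and the parser never sees it.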

yesterday i stumbled onto another way to express the operator. the code isn't that much different.

literal "block"
newline
# we don't look for whitespace here anymore
indented() { # we don't use the current column value
   set-indent { # we update the indent here
        indent # whatever the previous indent was
        whitespace(min=1) # but a little longer
   }
   expression
   newline
   repeat(min=0) { 
      indent # match same indent set earlier, as before
      expression
      newline
   }
}

indent now means 'find a new indent on first call, then match that exact amount on subsequent calls', rather than 'match the column where indented was called'. indented now means 'prepare to find a new indent' and not 'look up the current column...'.
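
here's the same kind of hand-written python sketch for the new version. again, the names are made up, it only handles spaces, and an "expression" is assumed to be any non-empty line that doesn't start with a space. the explicit state is a stack of indent strings passed in by the caller, and nothing in here asks what column we're at.

```python
def match_prefix(text, pos, prefix):
    """the `indent` operator: match the stored indent string exactly"""
    end = pos + len(prefix)
    return end if text[pos:end] == prefix else None

def set_indent(text, pos, indents):
    """set-indent { indent whitespace(min=1) }: previous indent, a little longer"""
    start = match_prefix(text, pos, indents[-1])
    if start is None:
        return None
    end = start
    while text[end:end + 1] == " ":
        end += 1
    if end == start:
        return None
    indents.append(text[pos:end])  # the new indent becomes the current one
    return end

def expression_line(text, pos):
    """expression newline (assumed: a non-empty line not starting with a space)"""
    end = text.find("\n", pos)
    if end in (-1, pos) or text[pos] == " ":
        return None
    return text[pos:end], end + 1

def parse_block_explicit(text, pos=0, indents=None):
    indents = [""] if indents is None else indents
    head = "block\n"
    if not text.startswith(head, pos):
        return None
    pos += len(head)
    depth = len(indents)  # indented() { ... }: prepare to find a new indent
    try:
        pos = set_indent(text, pos, indents)
        if pos is None:
            return None
        first = expression_line(text, pos)
        if first is None:
            return None
        items, pos = [first[0]], first[1]
        # repeat(min=0) { indent expression newline }
        while True:
            after_indent = match_prefix(text, pos, indents[-1])
            if after_indent is None:
                break
            line = expression_line(text, after_indent)
            if line is None:
                break
            items.append(line[0])
            pos = line[1]
        return items, pos
    finally:
        del indents[depth:]  # leaving `indented` restores the outer indent
```

one design choice in this sketch: it stores the indent as a string rather than a width, so the match is a plain prefix comparison. whether that's the right representation for a real engine is a separate question.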

this means i don't have to track the current column value inside the parsing engine all the time. i still have to track the column when i match a new indent, but that's it. the other nice thing here is that "all the code that matches the indented block is inside the indented operator", which wasn't true before. i could even add some sugar to avoid calling set-indent but that's a little beside the point.

the actual point: the problem with adding operators to a peg, especially contextual ones, is that it can be real easy to rely on implicit state to make them work, and designing them with explicit state can be even harder.

i'm not even sure if i'm on the right track but it feels worth exploring. that's all for my hobby notes update.

