edit: https://cohost.org/tef/post/4121852-thinking-out-loud-p
i wrote another post, it's a lot more coherent. you're probably better off reading it instead.
but, well, here's the old post if you're curious:
i'd title this part two, but more accurately it's part three. this is less of a "note to a future me" technical post and more of a "present me is thinking out loud" post
let's talk about how to build indentation sensitive grammars:
forgive me for diving right in, but i assume anyone interested enough in parsing has already experienced at least one or two indentation sensitive languages. python makes an obvious example.
python handles indentation in the most classical manner, jamming shit in the lexer. the lexer keeps track of the indentation levels, and emits INDENT and DEDENT tokens, so the parser can handle indented blocks much like it handles parentheses or begin and end.
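here's a rough sketch of that lexer trick. it's not python's actual tokenizer: `tokenize`, the `LINE` token kind, and the spaces-only rule are all made up for illustration, and every non-blank line becomes a single token.

```python
def tokenize(source):
    """yield (kind, text) pairs, wrapping indented blocks in INDENT / DEDENT."""
    stack = [0]  # indentation columns that are currently open
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines never change the indentation level
        col = len(line) - len(stripped)
        if col > stack[-1]:
            stack.append(col)
            yield ("INDENT", "")
        while col < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        if col != stack[-1]:
            raise SyntaxError("dedent does not match any outer level")
        yield ("LINE", stripped)
    # close whatever is still open at the end of the input
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")


tokens = list(tokenize("if x:\n    a\n    b\nc\n"))
```

the parser downstream never sees whitespace, only the two bracket-like tokens, which is why this works with pretty much any parsing algorithm.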
this is a great "there i fixed it" approach that's compatible with a wide range of parsing algorithms, and this is how most things handle indentation in practice.
it also sucks ass.
i shouldn't have to explain why this isn't a great method for parser generators to employ: it's essentially shrugging off the responsibility back to the user. surely we can make doing weird shit easier.
one parser-generator opts for "binary grammars" where you can say "this thing has to parse according to this rule, and this other rule at the same time", and in a parser generator, it might look something like this:
# a regular normal rule
if_expr :-- "if", expr, ":", newline, indented { statements };
# a constraint rule that gets parsed alongside
constraint indented :-- (whitespace^{N}, any*, newline)*;
in practice the implementation would probably work something like this: when we expect an indented block, we parse ahead, caring only about whitespace and newlines, to get the size of the block. we then parse this substring with the real parse rules.
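a minimal sketch of that two-pass idea, working on a list of lines rather than a character stream. the names (`indented_extent`, `parse_if`, `parse_statements`) and the fixed `n` are assumptions for illustration, not anything a real parser generator exposes:

```python
def indented_extent(lines, start, n):
    """first pass: find where the run of lines indented by n spaces ends.
    only looks at leading whitespace, never at what the lines say."""
    end = start
    while end < len(lines) and lines[end].startswith(" " * n):
        end += 1
    return end


def parse_if(lines, start, parse_statements, n=4):
    """second pass: hand the de-indented substring to the real rule.
    parse_statements is a stand-in for whatever parses the block body."""
    header = lines[start]
    end = indented_extent(lines, start + 1, n)
    body = [line[n:] for line in lines[start + 1:end]]
    return ("if", header, parse_statements(body)), end


node, end = parse_if(["if x:", "    a", "    b", "c"], 0, list)
```

here `list` plays the part of the statement parser, so the body comes back as the de-indented lines.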
this is a lot of work for something we could just special case. we could just provide built in operators for handling indentation directly.
in the example below, indented means "set the indentation level to this column", and indent means "the right amount of whitespace at the start of a line":
if_expr :--
    "if", expr, ":", newline,
    whitespace, indented {
        statement, newline,
        (indent, statement, newline)*
    };
this is the approach I used in my last parser generator, and i'm still not sure if it's the best approach.
it did surface problems, like how to support "matching half of a tab stop to the indent and the other half to whitespace inside the block itself", but it also made it easy to handle stuff like "bird notation" code (that's when every line has a prefix like > rather than just whitespace).
on the other hand, it's just not as clear as indent and dedent blocks.
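to make the two operators concrete, here's a hand-written sketch of what they might do inside a recursive descent parser. it's not the code from my parser generator: the `Parser` class, the stack of columns, and the "a statement is any line that doesn't start with a space" rule are all assumptions to keep it short.

```python
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.indents = []  # columns set by enclosing indented blocks

    def column(self):
        # distance from the start of the current line
        return self.pos - (self.text.rfind("\n", 0, self.pos) + 1)

    def whitespace(self):
        while self.text.startswith(" ", self.pos):
            self.pos += 1

    def indent(self):
        """`indent`: the right amount of whitespace at the start of a line."""
        prefix = " " * self.indents[-1]
        if self.column() != 0 or not self.text.startswith(prefix, self.pos):
            return False
        self.pos += len(prefix)
        return True

    def statement_line(self):
        """stand-in for `statement, newline`."""
        end = self.text.find("\n", self.pos)
        if end == -1:
            end = len(self.text)
        line = self.text[self.pos:end]
        if not line or line.startswith(" "):
            return None
        self.pos = min(end + 1, len(self.text))
        return line

    def indented_block(self):
        """`whitespace, indented { ... }`: set the level to the current column."""
        self.whitespace()
        self.indents.append(self.column())
        try:
            first = self.statement_line()
            if first is None:
                raise SyntaxError("expected a statement")
            body = [first]
            while True:
                saved = self.pos
                if self.indent():
                    line = self.statement_line()
                    if line is not None:
                        body.append(line)
                        continue
                self.pos = saved  # backtrack: the block has ended
                return body
        finally:
            self.indents.pop()


def parse_if(text):
    header, _, _ = text.partition("\n")
    p = Parser(text)
    p.pos = len(header) + 1  # skip the header line without really parsing it
    return header, p.indented_block(), p.pos


header, body, pos = parse_if("if x:\n    a\n    b\nc\n")
```

the first line of the block decides the column, and every later line has to match it exactly, which is the whole trick.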
another approach is to expose column information to the grammar:
c = get_column(), ... , newline
start_of_line(), whitespace(c), statement()
this is mentally simpler than special indented rules, but the grammar can end up a lot clunkier, and worse, there's no easy way to nest indented expressions unless they take "current indent" as an argument. this sort of variable passing would work better for handling "type length value" or data dependent grammars.
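as a sketch of the "take the current indent as an argument" version, here's a tiny recursive function where the column is just a parameter. `parse_block` is a made-up name, it assumes spaces only and no blank lines, and it's lenient about dedents that don't line up with an outer level:

```python
def parse_block(lines, i, c):
    """parse lines indented by exactly c columns, starting at lines[i].
    a more deeply indented line opens a nested block with its own column."""
    block = []
    while i < len(lines):
        line = lines[i]
        col = len(line) - len(line.lstrip(" "))
        if col < c:
            break  # dedent: hand control back to the caller
        if col > c:
            # nesting works only because the column is passed down explicitly
            child, i = parse_block(lines, i, col)
            block.append(child)
        else:
            block.append(line[c:])
            i += 1
    return block, i


tree, end = parse_block(
    ["if x:", "    a", "    if y:", "        b", "c"], 0, 0
)
```

every rule that can contain a block has to thread `c` through like this, which is where the clunkiness comes from.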
so we've got several styles so far:
- indent and dedent tokens emitted after start of line / newline
- binary grammars, or parsing once to get a substring, then reparsing that substring
- indentation operators
- explicit argument passing
the fun thing? i've had to use almost all of these to parse markdown.
- there's sections that you can't accurately parse until the end of a document, so i parse and reparse sections and leave both fragments in the tree.
- there's things that require matching arbitrary delimiters, so i have ways of saving input into a variable and passing it into another rule.
- and i put in indentation operators, but they did more than handle whitespace
markdown also has "prefixed blocks" where they have a > at the start, so i modified the indented operator to take a prefix. markdown also has a lot of "you can indent this if you like" blocks, where it's the absence of certain characters that mark it out.
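the prefix idea can be sketched the same way: treat the indentation as an arbitrary string that every line of the block has to start with. `prefixed_block` is a hypothetical helper, and it ignores markdown's real rules (like lazy continuation lines), so take it as the shape of the operator rather than a markdown parser:

```python
def prefixed_block(lines, start, prefix):
    """collect the run of lines starting with prefix, with the prefix removed.
    the prefix can be plain indentation or something like "> "."""
    end = start
    while end < len(lines) and lines[end].startswith(prefix):
        end += 1
    return [line[len(prefix):] for line in lines[start:end]], end


quote, after_quote = prefixed_block(["> a", "> b", "c"], 0, "> ")
code, after_code = prefixed_block(["    x = 1", "    y = 2", "z"], 0, "    ")
```

the same function covers both the quoted block and the indented one, since four spaces are just another prefix.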
frankly, i'm really not sure if there's any approach here that makes sense, because it's markdown, and it's built out of special case rules. i was unhappy with the indented approach in my last grammar, but honestly i'm not really sure if there's a better approach i'm missing
i think i'm going to have to do everything, but slightly different this time around
