This conversation would be a lot clearer with a distinction between "regexes" and "regular languages". The former is what Perl/Python/etc. have, and the latter is a mathematical concept (and automata-based non-backtracking engines like RE2, re2c, and rust/regex are closer to this set-based definition).
With those definitions, this part of the snarky answer is wrong:
HTML is not a regular language and hence cannot be parsed by regular expressions
That is, regular expressions as found in the wild can parse more than regular languages. (And that does happen to be useful in the HTML case!)
This answer is also irrelevant, since the poster is asking for a solution with regexes, NOT regular languages:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).
In this post, the example given IS a regex, but it IS NOT a regular language:
<!-- .*? --> # comment
The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.
> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".
Very much so.
> In this post, the example given IS a regex, but it IS NOT a regular language: `<!-- .*? --> # comment` The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.
Actually, that's [edit: "it IS NOT a regular language"] wrong, at least in principle. If you're limiting it to only the shortest match (which is how HTML (and most other) comments actually work), then (just like `(abc)+` is shorthand for `(abc)(abc)*`) `<!-- .*? -->` is shorthand for (assuming I haven't made a mistake in the for-lack-of-a-better-word-arithmetic):
<!-- ([^ ]| [^-]| -[^-]| --[^>])* -->
That is, shortest-repetition-only can be implemented in a purely regular system.
On the other hand, if you want to allow longer matches when actually needed, then for purely-regular purposes (where it either matches or not) `<!-- .*? -->` is just a wierd way of writing `<!-- .* -->` (which is quite obviously a regular language).
Yeah I think you are right -- the nongreedy match can be simply written as a more awkward pattern. I think regular language negation and intersection also help (which the rarer derivatives-based implementations seem to have). They are still regular and equivalent in power, but there's a richer set of operators for writing patterns.
I would also divide it into the "recognition problem" and the "parsing/capturing problem".
Recognition is just yes/no -- does it match? And notably that excludes common regex APIs that either return the end position (Python's re.match) or search (re.search). It is more like math -- is it in the set or not?
For the recognition problem, there's no difference between greedy and non-greedy, in terms of the result. (It does matter in terms of performance, which shows up in practice in backtracking engines!)
But parsing / capturing is more relevant to programmers. I don't remember all the details but there is some discussion on the interaction between greediness and capturing here: https://swtch.com/~rsc/regexp/regexp2.html
It looks like it can all be done in linear time and RE2 supports it fine.
So in that case maybe the 2 questions are closer than I thought. I have a very similar set of regexes that I use for parsing (my own) HTML, but I use one regex for each kind of token, and I do it with the help of an explicit stack.
I'm using Python's engine so I think it is better to separate each case, for performance reasons. But maybe with RE2 it would be fine to combine them all into one big expression as is done here, for "parallel" matching. The expression is a little weird because it's solving a specific problem and not something more general (which is useful and natural).
> common regex APIs that either return the end position (Python's re.match) or search (re.search).
Yeah, I probably should have explicitly said that's what the first translation (`[^a]|a[^b]|ab[^c]|...`) was for; it's a optimization (possibly-backtracking-parser -> [ND]FA) I've used a couple of times to beat things into guaranteed O(N) time.
> But parsing / capturing is more relevant to programmers.
I'd debate "more", since that's a additional thing on top of matching and searching. Any case where you need the former, you also need the latter to even know what to capture. But it's definitely something you do frequently need.
Does your regex assumes that "-->" must be prefixed by a space? This is not the case in XML. (Also the string "--" must not occur inside a comment, so the last clause is not necessary.)
> Does your regex assumes that "-->" must be prefixed by a space?
Yep, because the quoted regex assumed the same thing, and I didn't see a point in editorializing more than necessary.
> Also the string "--" must not occur inside a comment, so the last clause is not necessary.
"Must not" seems unreliable in webpage parsing. What page does your XHTML parser produce when fed text of the form `<!-- --is this a comment? <a href=-->`, for example?
Handling invalid syntax is another can of worms! It would make all parts of the tokenizer much more complex, not just comments.
In the case of XHTML, a parser is supposed to reject any document which is not well formed. HTML parsers typically try to "gracefully recover" from all syntax errors, but this is a crazy complex algorithm.
> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".
This is often an important distinction, but the point of the article is that the Stackoverflow question does not require recursion or any other non-standard regex features, and therefore can be solved using a vanilla regular expression.
So in this particular question the distinction is not important.
https://www.oilshell.org/blog/2020/07/eggex-theory.html
With those definitions, this part of the snarky answer is wrong:
HTML is not a regular language and hence cannot be parsed by regular expressions
That is, regular expressions as found in the wild can parse more than regular languages. (And that does happen to be useful in the HTML case!)
This answer is also irrelevant, since the poster is asking for a solution with regexes, NOT regular languages:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar).
In this post, the example given IS a regex, but it IS NOT a regular language:
The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.I gave my analysis here and listed 3 or 4 caveats: https://news.ycombinator.com/item?id=26359556
I prefer to use regular languages and an explicit stack, but this is not really what the original question was asking.