This conversation would be a lot clearer with a distinction between "regexes" an...

a1369209993 · on May 9, 2021

> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".

Very much so.

> In this post, the example given IS a regex, but it IS NOT a regular language: ` # comment` The nongreedy match of .*? isn't a mathematical construct; it implies a backtracking engine.

Actually, that's [edit: "it IS NOT a regular language"] wrong, at least in principle. If you're limiting it to only the shortest match (which is how HTML (and most other) comments actually work), then (just like `(abc)+` is shorthand for `(abc)(abc)*`) `` is shorthand for (assuming I haven't made a mistake in the for-lack-of-a-better-word-arithmetic):

  <!-- ([^ ]| [^-]| -[^-]| --[^>])* -->

That is, shortest-repetition-only can be implemented in a purely regular system.

On the other hand, if you want to allow longer matches when actually needed, then for purely-regular purposes (where it either matches or not) `` is just a wierd way of writing `` (which is quite obviously a regular language).

chubot · on May 9, 2021

Yeah I think you are right -- the nongreedy match can be simply written as a more awkward pattern. I think regular language negation and intersection also help (which the rarer derivatives-based implementations seem to have). They are still regular and equivalent in power, but there's a richer set of operators for writing patterns.

I would also divide it into the "recognition problem" and the "parsing/capturing problem".

Recognition is just yes/no -- does it match? And notably that excludes common regex APIs that either return the end position (Python's re.match) or search (re.search). It is more like math -- is it in the set or not?

For the recognition problem, there's no difference between greedy and non-greedy, in terms of the result. (It does matter in terms of performance, which shows up in practice in backtracking engines!)

But parsing / capturing is more relevant to programmers. I don't remember all the details but there is some discussion on the interaction between greediness and capturing here: https://swtch.com/~rsc/regexp/regexp2.html

It looks like it can all be done in linear time and RE2 supports it fine.

So in that case maybe the 2 questions are closer than I thought. I have a very similar set of regexes that I use for parsing (my own) HTML, but I use one regex for each kind of token, and I do it with the help of an explicit stack.

I'm using Python's engine so I think it is better to separate each case, for performance reasons. But maybe with RE2 it would be fine to combine them all into one big expression as is done here, for "parallel" matching. The expression is a little weird because it's solving a specific problem and not something more general (which is useful and natural).

a1369209993 · on May 9, 2021

> common regex APIs that either return the end position (Python's re.match) or search (re.search).

Yeah, I probably should have explicitly said that's what the first translation (`[^a]|a[^b]|ab[^c]|...`) was for; it's a optimization (possibly-backtracking-parser -> [ND]FA) I've used a couple of times to beat things into guaranteed O(N) time.

> But parsing / capturing is more relevant to programmers.

I'd debate "more", since that's a additional thing on top of matching and searching. Any case where you need the former, you also need the latter to even know what to capture. But it's definitely something you do frequently need.

goto11 · on May 9, 2021

Does your regex assumes that "-->" must be prefixed by a space? This is not the case in XML. (Also the string "--" must not occur inside a comment, so the last clause is not necessary.)

a1369209993 · on May 9, 2021

> Does your regex assumes that "-->" must be prefixed by a space?

Yep, because the quoted regex assumed the same thing, and I didn't see a point in editorializing more than necessary.

> Also the string "--" must not occur inside a comment, so the last clause is not necessary.

"Must not" seems unreliable in webpage parsing. What page does your XHTML parser produce when fed text of the form ``, for example?

goto11 · on May 10, 2021

Handling invalid syntax is another can of worms! It would make all parts of the tokenizer much more complex, not just comments.

In the case of XHTML, a parser is supposed to reject any document which is not well formed. HTML parsers typically try to "gracefully recover" from all syntax errors, but this is a crazy complex algorithm.

jameshart · on May 10, 2021

The supplied regex is using extended syntax which ignores whitespace (you can tell because it also includes comments)

goto11 · on May 10, 2021

> This conversation would be a lot clearer with a distinction between "regexes" and "regular languages".

This is often an important distinction, but the point of the article is that the Stackoverflow question does not require recursion or any other non-standard regex features, and therefore can be solved using a vanilla regular expression.

So in this particular question the distinction is not important.