6
pk76
5y

Spend several months designing an alternative approach to string matching/parsing not based on parser combination or Regex. Benchmark and profile the ever living hell out of it. Find out the performance is highly competitive with FParsec and far easier on memory than either approach. Discover FParsec responds very badly when failing. Document all of this, do a presentation on the design. Upload video of presentation with links to it on a few sites. Presentation gets downvoted, and receive hatemail. Yay.

Comments
  • 0
    How dare you think outside the box! Link so we can check it out?
  • 2
    @Demolishun github.com/entomy/stringier
    YouTube links are weird but a search for "introducing Stringier" should have it as the first result.
  • 0
  • 0
    @pk76 I only went over the link cursorily, not enough time, but could you explain what makes your approach better than regexes and parser combinators (or techniques like fast and efficient table-driven LR parsers or dynamic programming algorithms like CYK)?
  • 1
    @RememberMe sure.

    Unlike Regex, patterns are reusable, so if you're defined something like an "identifier" for a programming language, and want to put that inside of a larger pattern, you can easily do so because these are assigned to variables and are combined. With most Regex implementations you have to copypasta. There's also error reporting, right down to the specific part of the pattern which failed to match.

    Unlike parser combinators you're working with patterns, descriptions of what to parse, rather than combining smaller parsers. This makes the approach a bit more declarative. It also allows some optimizations made during initialization, based on rewriting parts of the pattern in ways that work faster, similar to optimizing compilers. As long as the patterns would otherwise be identical, a rewrite can occur. Parser combinators require you to know/reference what approach is optimal.

    As compared to more advanced approaches like what you listed, it's complicated...
  • 1
    @pk76 @RememberMe

    In a very broad sense I could claim ease of use. But as I haven't gotten around to benchmarking these, nor doing any serious comparisons for advantages/disadvantages/pitfalls, I don't feel comfortable comparing against these yet as it's too likely to be unfair or inaccurate.

    Which then brings me to performance. There's charts in the presentation, but the tl;dw is, highly competitive performance despite infancy, with substantially lower memory usage.
  • 1
  • 0
    @pk76
    Firstly, FParsec. That library can match context sensitive grammars and is intended for ease of use compared to handrolling or LR, because correctly parsing advanced grammars is hard. Ref: https://microsoft.com/en-us/...

    Even theoretically for full CFGs the best algorithms like CYK work in O(n^3) time, so comparing FParsec to regular grammar parsers like regex/yours is absurd, it's an apples to oranges comparison. Your benchmark is all regular stuff like identifiers and addresses, with no recursive structure.

    Also it's an established design flaw of combinators that they use a ton of memory, have horrible cache friendliness, tons of indirection, and are usually opaque to optimisation because of their construction (it's just a bunch of higher order functions, so they can't observe their own structure to optimise/better error reporting). Error reporting is also relative to the expressive power of the parser.
  • 0
    @pk76 (note: "regular" here means the type of grammar that regex can accept, it's a technical word. Could also say left-linear/right-linear).

    I also don't think that lifting regex building blocks into objects and operators is that big a deal, there is no effective difference that I can see between your patterns and regex patterns in terms of functionality, just that regex uses strings while you use pattern objects. One could easily write a wrapper class above regexes that internally called the msregex implementation, and lifting regex string concatenation into that class would also cover your library's other USP, the reusable patterns (which add no theoretical expressive power anyway, unless they can be called recursively).

    It is neater though, I'll give it that, personally I too prefer regexes declared like this instead of messy strings, but yeah.

    Don't want to sound dismissive or anything, and I might have missed something, those are just my opinions about the library.
  • 1
    @RememberMe So because I don't have an example of LR/LR/LALR/x in the benchmarks, it's only capable of handling those? Isn't it more likely I haven't added those benchmarks yet, as I'm trying to provide the fairest comparisons and don't want to use a bad FParsec implementation? You clearly watched the video as you extracted a screenshot from it, on the very slide you extracted I had stated exactly what I've had to here: The benchmarks are not complete and I would like to add more, but I'm not going to publish misleading benchmarks and need to figure out, or find, good implementations for each.
  • 1
    @RememberMe I've been using this for parsing a language with at least LL grammar, and haven't had a reason to push farther given my requirements. So no, it's not just limited to regular grammars and you're definately missing something; I'd suggest not basing capabilities of a system off its benchmarks.

    I should see how complex of grammars this will handle, you are not incorrect to wonder that. However, handling grammars more complex than I needed was never a design goal, so I've only ever validated up to that requirement, and then focused on other matters.
  • 0
    @RememberMe "regular here means type of grammar"

    Thank you for just assuming I'm an idiot. Cool.
  • 0
    @pk76 oh you do plan to support more advanced grammars? in the video you primarily kept bringing up comparisons to regex, leading me to believe it was intended to replace regexes only. My bad.

    Having gone through the documentation, I keep noticing that you try to differentiate your library from parsec, but don't mention anything that parsec doesn't already provide and the design of things is basically the same as what parsec does. How is it different then? The parsec style is constrained by being designed with purity in mind, but a general parser combinator is just that - parsers and combinators which act on parsers, which is basically what your library is.
  • 1
    @pk76 the "regular" bit was a clarification in case you thought I meant "ye olde grammar", because language ambiguity. In the same paragraph I also used terms like left-linear and right-linear, and I've been freely throwing terms like LR around which I wouldn't expect an idiot to know, so you can rest assured I don't think you're one.
  • 0
    @RememberMe pattern descriptions and parsers are technically separated with this approach, although I will say I've only implemented on parser currently. I will admit it's a subtle difference: combining pattern descriptions instead of combining parser functions. I would say it's derivative but highly inspired by the approach. The OO approach to the same goal otherwise.
  • 1
    @RememberMe my apologies then. I'll admit I'm a bit prone to hyperdefensive attitudes for various reasons, and it is especially agitated by highly academic stuff.
  • 1
    @RememberMe I would like to. Obviously I'm primarily focused on my own requirements first, but I would like to support at least recursive definitions in time. Since recursion is just a tail call to itself, I believe this should easily be possible, and would probably just require a .recursive() method or something to work around the host language not always supporting recursive variable definitions. I'll work out the specifics when I'm ready for it, as right now I'm primarily working on what I made this for, as well as some additional "checker" patterns which aren't covered in the presentation but are in the documentation under "advanced concepts".
  • 1
    @RememberMe It's a presentation meant for browsers of YouTube. I would give a very different presentation at a professional conference. Know your audience. ;)
Add Comment