Comments
-
retoor1197223dI kept performing, that's why I continued until around four AM.
New York taxi syndrome (it was explained to me under this name; there is no source on the internet backing it): when things go badly, you don't work longer to achieve results but shorter. On good days you keep working longer. In this philosophy, a taxi driver goes home when there aren't many customers on the street. It's the reverse of how we often think.
But you probably had to finish it on time -
atheist991723d@retoor "finish it on time" nooooo... But you are right, in some way it was going well, I found a way to make my library faster by changing it to a regex and in the simple case it works correctly and even faster, but python's regex library doesn't lazy evaluate matches so there's an edge case where I only care about the first few matches where it's got significantly slower. So I'm now trying to regex only the first few bits... Hence, off by 1..
-
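An aside on the "first few matches" case above: in Python, one way to avoid scanning the whole input is to pull matches lazily from re.finditer and cut the iteration off with itertools.islice. A minimal sketch, assuming that fits the use case (the pattern here is a stand-in, not the library from the thread):
# Sketch: take only the first few matches without materializing every match.
import re
from itertools import islice

pattern = re.compile(r"\d+")                 # stand-in pattern, for illustration
text = "a1 b22 c333 " * 1_000_000

first_three = [m.group() for m in islice(pattern.finditer(text), 3)]
print(first_three)                           # ['1', '22', '333']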
atheist991723dBut also I've reached that point where I've written so many little tools that I had a basically working prototype in under an hour and had it going faster than the previous approach 30 minutes later. And it's kinda making me enjoy programming again. I need to do real work to make things go brrrr (high performance).
-
retoor1197223d@atheist that's very cool! Yeah, the performance of regex is sick. I wonder why PHP built its own regex instead of using the C one. Python's regex supports a bit more AFAIK, maybe that's why.
The performance of the glibc regex is relatively low for one-off validation since the regex compile function is slow, but a regex compiled by glibc is nearly unbeatable. My regex wins only in some cases. I have no idea how they've built it; it doesn't use JIT. This regex doesn't make sense: [a][b][c], and it performed slower in my regex than in glibc. Then my regex compiler removed the [] brackets wherever there was only one option and got exactly the same performance as the C one. So that's one optimization trick I have discovered. Later I removed the compiler for my regex.
On a single pass my regex is always faster than C, on multiple passes often slower. I'm happy with the result. My parser is so fast that it's only 6x slower than native C in validating \d\d\d\d\s*\w\w (a Dutch zipcode). That's amazing -
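For the compile-once vs. compile-every-time point above, here is a rough sketch of the same trade-off measured from Python (it says nothing about glibc itself; re.purge() just defeats Python's internal pattern cache, and the zipcode pattern is the one from the thread):
# Sketch: compare the cost of compiling the pattern against matching with an
# already-compiled pattern.
import re
import timeit

PATTERN = r"\d\d\d\d\s*\w\w"
compiled = re.compile(PATTERN)

compile_time = timeit.timeit(
    "re.purge(); re.compile(PATTERN)",        # purge defeats re's cache
    globals=globals(), number=10_000,
)
match_time = timeit.timeit(
    "compiled.fullmatch('1234 AB')",
    globals=globals(), number=10_000,
)
print(f"compile: {compile_time:.3f}s  match: {match_time:.3f}s")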
retoor1197223d@atheist I've written all my tools in separate .h files and made a tool that merges all those .h files in the correct order by parsing a header file that includes them all. It recursively includes everything and prevents double inclusion, so the order of the source is always right. I only have to #include <rlib.h> to get a JSON parser, HTTP server engine, hash tables, (string) formatters, time functions, benchmarking, etc. Also, the original malloc is overridden by a malloc that keeps statistics, so you can be sure that you don't have leaks. It's easier to work with than valgrind, but not a complete replacement for valgrind. My applications can end with memory statistics and return 1 on a memory issue just by implementing a one-liner
-
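The single-header idea above (one master header that includes everything, merged with double inclusion suppressed) can be sketched in a few lines. This is a hypothetical illustration, not rlib's actual tool, and it assumes local includes use the #include "name.h" form:
# Sketch of a single-header amalgamator: recursively inline local includes
# in the order they appear, emitting each header only once.
import re
from pathlib import Path

INCLUDE_RE = re.compile(r'^\s*#include\s+"([^"]+)"')

def amalgamate(header: Path, seen=None) -> str:
    seen = seen if seen is not None else set()
    if header.name in seen:                   # prevent double inclusion
        return ""
    seen.add(header.name)
    out = []
    for line in header.read_text().splitlines():
        m = INCLUDE_RE.match(line)
        if m:
            out.append(amalgamate(header.parent / m.group(1), seen))
        else:
            out.append(line)
    return "\n".join(out)

# print(amalgamate(Path("rlib_master.h")))    # hypothetical entry header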
atheist991723d@retoor your point about regex optimization is interesting. I've got a mini library for generating regexes. It basically uses classes like group, set, sequence, etc. that take nested regex sequences, where the appropriate classes have named functions in place of special characters, like repeat, ng_repeat (non-greedy), optional, negation, etc., because I have the memory of a goldfish. It's designed to reduce mistakes, so it definitely adds a lot of extra group/non-capture group/set characters and may also include several groups of size 1. Something I'll have to look at when I've got it working properly. But for now I'm taking a break.
I've also not touched regexes in a couple of months, but it was painless to write something fairly complex again, so it clearly does its job. -
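w2's internals aren't shown in this thread, so purely as an illustration of the general idea, here is a hypothetical mini-composer (invented names, not atheist's actual API) that renders readable builder calls into a pattern string:
# Hypothetical sketch of a regex composer: small objects that render to a
# pattern string, so special characters get readable names.
class R:
    def __init__(self, pattern: str):
        self.pattern = pattern
    def repeat(self):                # zero or more, greedy
        return R(f"(?:{self.pattern})*")
    def req_repeat(self):            # one or more
        return R(f"(?:{self.pattern})+")
    def optional(self):
        return R(f"(?:{self.pattern})?")

def seq(*parts: R) -> R:
    return R("".join(p.pattern for p in parts))

def or_g(*parts: R) -> R:
    return R("(?:" + "|".join(p.pattern for p in parts) + ")")

digit = R(r"\d")
zipcode = seq(digit, digit, digit, digit, R(r"\s").repeat(), R(r"\w"), R(r"\w"))
print(zipcode.pattern)               # \d\d\d\d(?:\s)*\w\w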
retoor1197223d@atheist wow, generating regexes, that's freaking amazing. Not easy either.
Side note, I've just learned this:
Easy is a word people use to describe other people's jobs. Be careful not to assume the things you don't know, or don't routinely do, are easy. [Source: 37signals.com, the maker of Ruby on Rails. Very cool dude! Very enthusiastic about development] -
atheist991723d@retoor I'm not saying it's "good" but... 🤷♂️
The regex for clustering hangul (Korean) graphemes is:
L* (V+ | LV V* | LVT) T* | L+ | T+
Where L, V, T, LV, LVT are sets of characters. Which, if you're very familiar with regex, sure... In my lib (w2) the equivalent is:
hangul = w2.or_g(
w2.g(
w2.g(L).repeat,
w2.or_g(
w2.g(V).req_repeat,
w2.g(LV, w2.g(V).repeat),
LVT,
),
w2.g(T).repeat,
),
w2.g(L).req_repeat,
w2.g(T).req_repeat,
)
Which, even if you don't know regex, you can probably guess roughly what it does. It's not clever: you can generate invalid sequences (I might do something to reduce that) and you can mix raw regex with structural regex. But it's kinda nice. -
atheist991723dMostly it's useful so I don't have to remember the difference between [] () {} and (?<! vs (?!. And when I come back to it in 6 months it's not going to take me a fucking age to work out what it's doing. That is also then composed into an or group of 4 other regex groups. So it's way worse than it looks.
-
retoor1197223d@atheist I like the distinction you made between repeat and req_repeat. Also nice with the nesting. In what case would it produce a bad sequence? The source code you provided is exactly L* (V+ | LV V* | LVT) T* | L+ | T+ I guess. Flawless. But I really hate trailing commas. As a writer of an interpreter I know why it probably doesn't cause an issue, but I still hate it.
So w2 is the new lib you're working on? Cool -
atheist991723d@retoor I think currently I could nest a group in a set and it wouldn't complain but the resulting regex wouldn't be meaningfully equivalent to the functional implementation. I think I can narrow my type annotations but at the same time the reason I wrote it is because I *don't* understand regex 😅😂
-
atheist991723dI tend towards trailing commas (at least when split across multiple lines) because when I inevitably later come back to add an extra parameter or something it's a bit easier to my brain.
-
atheist991723d@retoor back to your optimization comment, holy crap that makes a massive difference. I've removed a single unnecessary capture group. It was one enclosing the entire expression, so it would just be the same as the default match. The full regex went from 166 characters to 164. In the previous worst case it got 10% faster.
Thank you! -
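Whether a redundant enclosing capture group costs that much depends on the pattern and the engine, but it's easy to measure. A sketch with stand-in patterns, not the 166-character regex from the thread:
# Sketch: compare a pattern wrapped in one redundant enclosing capture group
# against the same pattern without it.
import re
import timeit

inner = r"(?:ab|[abc]|[^abc])+"
wrapped = re.compile(f"({inner})")    # extra group around the whole expression
bare = re.compile(inner)              # group(0) already gives the full match
text = "abcxyz" * 10_000

for name, pat in (("wrapped", wrapped), ("bare", bare)):
    t = timeit.timeit(lambda: pat.search(text), number=1_000)
    print(name, f"{t:.3f}s")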
retoor1197223d@atheist I think you can't be bad at regex if you've written such a tool.
I created my parser when I didn't know regex yet. That fucked me from behind a few times. I didn't understand 'greedy' for a long time. For example, greedy matching of (r.*t) on "Harry Potter Harry Potter" would capture "rry Potter Harry Pott". My parser did "rry Pott". I like mine more, but it's wrong compared to all the parsers at regex101 (a site for validating regex). In the case of matching a*bc, my interpreter constantly checks after every char whether the next one is c, or else continues, and quits then. It does matter for performance: backtracking is slow for long content (a book or something). Mine can parse a book of unlimited length easily. glibc regex doesn't finish if the string to validate is > 10 MB IIRC. I don't have such issues, but yeah, mine is officially wrong, but it's just a matter of preference imho -
atheist991721d@retoor weeeellll I've been having some fun... Through a combination of simple assertions and type constraints I've apparently written a library that "helps" simplify regexes...
Most of what I've gotten rid of now is redundant non-capture groups, which it seems didn't hurt performance anyway, but it's interesting. I'm down to 92 characters from 166. -
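One cheap guard when stripping groups like that is to check that the simplified pattern still agrees with the original on sample inputs. A sketch with toy stand-in patterns:
# Sketch: confirm a simplified pattern behaves like the original on samples.
import re

original = re.compile(r"(?:(?:ab)|(?:[abc]))+")
simplified = re.compile(r"(?:ab|[abc])+")

samples = ["", "a", "ab", "abc", "abab", "xyz", "aabbcc"]
for s in samples:
    a = original.fullmatch(s)
    b = simplified.fullmatch(s)
    assert (a is None) == (b is None), s
print("patterns agree on all samples")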
lorentz1526621dName them. I've been doing that for as long as I've known it's possible, and it made regex into an almost pleasant parser representation
-
atheist991721d@lorentz I've named my capture group. It's called idx:
(?:ab|[abc]|i*(?:d*(?:e+|fe*|h)g*|d+|ll|k(?:n*pk)*|j(?:[nqp]*?q[nqp]*j)*|[^abc])[npoq]*){100}(?P<idx>(?:ab|[abc]|i*(?:d*(?:e+|fe*|h)g*|d+|ll|k(?:n*pk)*|j(?:[nqp]*?q[nqp]*j)*|[^abc])[npoq]*)) -
atheist991721dWhereas the capture group in my composer library syntax:
cr_lf = w2.seq(CR, LF)
any_ctl = w2.ch_set(CR, LF, Control)
non_ctl = ~any_ctl
hangul_inner = w2.seq(
w2.ch(L).repeat,
w2.or_g(
w2.ch(V).req_repeat,
w2.seq(LV, w2.ch(V).repeat),
LVT,
),
w2.ch(T).repeat,
)
hangul = w2.or_seq(
hangul_inner,
w2.ch(L).req_repeat,
w2.ch(T).req_repeat,
)
ri_ri = w2.seq(Regional_Indicator, Regional_Indicator)
xpicto = w2.seq(
Extended_Pictographic,
w2.g(
w2.ch(Extend).repeat,
ZWJ,
Extended_Pictographic
).repeat,
)
incb = w2.seq(
InCB_Consonant,
w2.g(
w2.ch_set(
Extend,
InCB_Linker,
ZWJ,
).ng_repeat,
InCB_Linker,
w2.ch_set(
Extend,
InCB_Linker,
ZWJ,
).repeat,
InCB_Consonant,
).repeat
)
pre_core = w2.ch(Prepend)
core = w2.or_g(
hangul,
ri_ri,
xpicto,
incb,
non_ctl,
)
post_core = w2.ch_set(
Extend,
ZWJ,
SpacingMark,
InCB_Linker,
)
egc_re = w2.or_seq(
cr_lf,
any_ctl,
w2.seq(
pre_core.repeat,
core,
post_core.repeat,
)
) -
atheist991721dAnd if anyone wants to know what that insanity is, it's a regex to match grapheme clusters https://unicode.org/reports/tr29/...
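As a cross-check (not one of the libraries in this thread), the third-party regex module exposes \X for extended grapheme clusters, which is handy for comparing against a hand-built TR29 pattern:
# Sketch: the third-party `regex` module (pip install regex) matches one
# extended grapheme cluster per \X, per UAX #29.
import regex

s = "e\u0301" + "\u1100\u1161\u11A8"   # 'e' + combining accent, then decomposed Hangul L V T
print(regex.findall(r"\X", s))         # expect two grapheme clusters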
Regex and off by 1 capture group...
rant