Heemers

9y

I spent almost 10 hours coming up with this RegEx. Trial and erroring my way to hell. First I had to get rid of the HTML tags (which was easy-ish), then I spent most of my time trying to figure out how to remove the god damn dash but keep hyphenated words... Then I found \B and lookbehinds...
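For anyone fighting the same dash: the actual regex from the rant isn't shown here, so this is only a guess at the idea, not the original pattern. It drops any hyphen that does not have a word character on both sides and leaves hyphenated words alone. The name `strip_loose_dashes` is made up for the example, and `\B-|-\B` should behave the same way as the lookaround version used below.

```python
import re

# A hyphen is "loose" when it is NOT sitting between two word characters:
#   (?<!\w)-   hyphen with no word character right before it
#   -(?!\w)    hyphen with no word character right after it
LOOSE_DASH = re.compile(r"(?<!\w)-|-(?!\w)")


def strip_loose_dashes(text):
    """Replace loose dashes with spaces, then collapse runs of whitespace."""
    without_dashes = LOOSE_DASH.sub(" ", text)
    return " ".join(without_dashes.split())


print(strip_loose_dashes("a well-known - fact -- end-"))
```

Hyphens inside words like "well-known" survive because both neighbours are word characters, so neither alternative matches.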

I am making it a point to get good at this shit... Because right now I am petrified of it... Fuck you Regular expression you have taken away all my emotions...

regex

regular expressions

python

Comments

1

Eariel

1892

9y

Just curious, what are you trying to achieve?
0

Heemers

493

9y

@Eariel I have this corpus of wikipedia comments... I need to find the highest word frequency.
17

Bagul

1144

9y

Since RegEx took away all your emotions, does that mean you can't use regular expressions in everyday conversations anymore?
2

arcadesdude

6303

9y

Regular exasperations
Regular depressions
Regular expletives

Yup I hear ya
0

Heemers

493

9y

@vinerz tell me more
0

Heemers

493

9y

@theOverseer it. Is. Too. Tire.
10

Voxera

10883

9y

Regex is good for finding patterns in text OR for parsing well-formatted text.

For anything else I either build something custom or use a mix of techniques.

Or possibly use several regexes one after another.

They are not good for doing more than one thing since every layer usually requires repetition of the same regex part and it quickly grows beyond comprehension.

Also, many times you eventually have to come back to it to tweak it, and that's when you start hating your former self for not separating out the different operations.
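A rough sketch of what "several regexes one after another" could look like in Python, where each pass does exactly one job and can be tweaked on its own. The pass names and patterns are assumptions made up for the example. The tag pattern in particular is naive and only meant for simple, well-formed snippets.

```python
import re

# Each pass is (name, pattern, replacement). All of these are illustrative
# assumptions, not patterns taken from the original rant.
PASSES = [
    ("tags", re.compile(r"<[^>]+>"), " "),         # naive tag stripper
    ("urls", re.compile(r"https?://\S+"), " "),    # bare http/https links
    ("whitespace", re.compile(r"\s+"), " "),       # collapse leftover gaps
]


def clean(text):
    """Run every pass in order; each one sees the output of the previous."""
    for _name, pattern, replacement in PASSES:
        text = pattern.sub(replacement, text)
    return text.strip()


print(clean("<p>see  http://example.com/x now</p>"))
```

When one step misbehaves you only have to touch its own line, instead of untangling one giant expression.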
4

ZaLiTHkA

836

9y

@Heemers, if it's just word frequency you're after, why not simply split the string on word boundaries (\b), obviously catering for words that would be inside links, then iterate over the array of results and discard all invalid strings?

I love RegEx, but as powerful as it is, it's by no means the solution to every string manipulation problem. O.o
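Something like this, as a minimal sketch in Python: split on \b, throw away every piece that is not purely letters, and count what is left. The function name `word_frequencies` is made up for the example, lowercasing is my own assumption, and splitting on an empty match like \b needs Python 3.7 or newer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count words by splitting on word boundaries and discarding the rest."""
    # Splitting on \b yields words AND the gaps between them
    # (spaces, punctuation, empty strings at the ends).
    pieces = re.split(r"\b", text.lower())
    # Keep only purely alphabetic pieces; everything else is "invalid".
    words = [piece for piece in pieces if piece.isalpha()]
    return Counter(words)


counts = word_frequencies("The cat saw the other cat.")
print(counts.most_common(2))
```

One catch: \b also splits hyphenated words into their parts, so "well-known" is counted as "well" and "known". If those need to stay whole, a different tokenising step would be required.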
8

login

104

9y

Each time you use regex to parse HTML a fairy dies: http://stackoverflow.com/questions/...
4

k0pernikus

5248

9y

Regular expressions can only match regular languages.

HTML is a context-free language.

The pain you are experiencing is using a frozen fish for a hammer.

I advise you to read this: http://stackoverflow.com/q/6751105/...
1

jackgreen

1435

9y

why don't you use BeautifulSoup or something? also any decent scraper would do it, say Scrapy...
that's way too much hassle with raw regex!
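Even without installing anything, a real parser beats a tag regex. The sketch below swaps in Python's built-in `html.parser` instead of BeautifulSoup, purely so it runs as-is. The names `TextExtractor` and `html_to_text` are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the pieces and collapse whitespace left over from the markup.
    return " ".join(" ".join(parser.chunks).split())


print(html_to_text("<p>Hello <b>big</b> world</p>"))
```

Note that this also keeps the contents of script and style elements, so those would need extra handling on real pages.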
0

anekix

367

9y

I was wondering if it could be done with half-a-word code maybe? Try XPath?? As simple as:
//text()[normalize-space()]

FYI XPath is not a library or framework
0

anekix

367

9y

@jackgreen yup that's exactly right, but it would be simpler to use XPath instead.
1

Eariel

1892

9y

@blueCat1301, that SO answer is a classic 😊

devRant © 2021 Hexical Labs LLC