Search - "commoncrawl"
I'm back from the dead to rant again. This time it's punycode.
My job involves processing the commoncrawl web archives, and for some reason one in 20,000,000 archived webpages crashed my program. After some debugging I found this issue, which seems to be the reason my code crashes: https://github.com/servo/rust-url/...
To summarize the issue: thanks to Punycode, Unicode characters can be encoded into domain names, but not every character is allowed. Not only do these invalid domains get registered anyway, I also need in-depth knowledge of Unicode to understand what is wrong with them.
How did we turn domain names into something so complicated?
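For anyone hitting the same wall, here is a minimal Python sketch of the defensive idea: convert a host name to its ASCII (Punycode) form and treat a failure as "skip this record" instead of letting it crash the run. This is not the fix from the rust-url issue; `to_ascii_host` is a hypothetical helper name, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so it may accept or reject different names than rust-url does.

```python
from typing import Optional


def to_ascii_host(host: str) -> Optional[str]:
    """Return the ASCII (Punycode) form of a host name, or None if encoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        # The standard-library "idna" codec applies Nameprep and Punycode
        # to each dot-separated label (IDNA 2003 behaviour).
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty labels, labels of 64+ characters, and
        # characters that Nameprep prohibits.
        return None


if __name__ == "__main__":
    # Made-up sample hosts; a real run would take them from the crawl records.
    for candidate in ["bücher.example", "example.com", "a..b"]:
        print(repr(candidate), "->", to_ascii_host(candidate))
```

Returning None keeps the decision with the caller, who can count or log the rejected hosts rather than lose the whole batch to one bad domain.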
My current task involves processing the commoncrawl web archive, and it's like a box of junk you buy at a flea market. You find so much useless stuff, broken stuff, stuff that makes you question people...
My latest find makes me wonder what lies out there if what I found was in plain sight. I found tens of thousands of websites that look like someone used Markov chains to generate pron ads. Those websites exist in 10+ languages, use the same URL scheme, read like a dyslexic camgirl reading alphabet soup and are hosted on the same three IP addresses. There is no JavaScript involved and some pages link to a variety of Twitter accounts.
I queried a few commoncrawl files and amassed 4 GB of this spam. Every time I look at it, it gets weirder. There is an Italian article about malware in there too.
Here's a text sample:
"Not from her bedroom, she her stream view and meet new experience. In hd india, because swimsuit still laws exist no interaction or frigthened and."