Comments
-
awesomeest 1238 191d
I'm a data nerd and even I'm wondering why so much... then thinking about how shit most data architectures are
-
vane 11280 191d
@awesomeest yeah, going through a couple of WARC files, it looks like it’s mostly crap, but well, it would last for a couple of years.
@Demolishun my work is my hobby :)
-
vane 11280 191d
@Demolishun I’m thinking about crawling data and building an index and search for non-commercial websites
-
awesomeest 1238 191d
@vane any time I see piles of data I want to reformat it. It's like OCD when you know more than most humans ever should. I'm prone to over-engineering, know way too much math and know hardware down to the very literal bits... I have yet to see a data system I couldn't optimise.
-
TeachMeCode 5175 191d
@awesomeest I had to do a shit ton on the last architecture I worked with at my last company, it was a mess. Good thing I’m now joining a startup (gaming but with a twist, won’t reveal details) so I don’t inherit crap lol
-
vane 11280 191d
@TeachMeCode obviously a new internet, what else can you do when AI will own the old one in a couple of years
-
vane 11280 190d
@awesomeest you got me thinking a little. Do you think it’s possible to compress 80TB of content down to 1TB of the highest-value things, with a search index built in?
The content needs to be locked in a blockchain, so every block would be a website whose hash anyone can compare for integrity.
I’m thinking about building a filter that would pick out the most valuable 850GB, then adding a 10GB index and a cross-platform desktop search app, then sharing it over P2P so people can download and search old content.
Probably nobody would want stuff like that but well, that’s the plan:
offline internet on a pendrive
-
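A rough sketch of the "every block is a website, anyone can compare the hash" idea from the comment above. This is only a plain hash chain, not a real blockchain (no consensus, no network), and SHA-256 plus the names `Block`, `build_chain` and `verify_chain` are assumptions made for illustration, not anything vane specified.

```python
import hashlib
from dataclasses import dataclass

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


@dataclass(frozen=True)
class Block:
    url: str
    content_hash: str  # SHA-256 of the archived page bytes
    prev_hash: str     # block_hash of the previous block in the chain
    block_hash: str    # SHA-256 over prev_hash, url and content_hash


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_block(url: str, content: bytes, prev_hash: str) -> Block:
    content_hash = sha256_hex(content)
    header = f"{prev_hash}\n{url}\n{content_hash}".encode("utf-8")
    return Block(url, content_hash, prev_hash, sha256_hex(header))


def build_chain(pages):
    """Build one block per (url, content) pair, each linked to the previous."""
    chain, prev = [], GENESIS
    for url, content in pages:
        block = make_block(url, content, prev)
        chain.append(block)
        prev = block.block_hash
    return chain


def verify_chain(chain, pages) -> bool:
    """Recompute every block from the local pages and compare with the chain."""
    if len(chain) != len(pages):
        return False
    prev = GENESIS
    for block, (url, content) in zip(chain, pages):
        if make_block(url, content, prev) != block:
            return False
        prev = block.block_hash
    return True
```

Someone who downloads the archive would run `verify_chain` against the published chain: changing any page changes its `content_hash`, which breaks that block and every block after it.
-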
awesomeest 1238 190d
@vane ummm, by how you asked that question you're far out of your depth. There are sooooo many variables before compression can be a math equation.
An oversimplification would be:
If you have a giant DB full of millions of strings that are all sentences in standard English, using only the 26-letter alphabet, numbers, . , ' " ! ? and space, and assuming there are no weird/additional capitalisations... if it used ASCII originally, you could pare it down to a binary encoding based on only those chars. The decoder would then capitalise the first letter of each sentence (and add the 2 space char after the previous .), then dictionary-check every word, so non-standard words like names, and things like "I", would end up capitalised without doubling the char count and the storage needed.
Like I said, oversimplification. But the concept is still the same. Redundancies and data type/format are why compression works at all. My simple example could easily cut the size to a 10th, but wouldn't be helpful for pictures.
-
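A minimal sketch of just the "pare it down to a binary based on only those chars" step from the comment above, assuming lowercase text. The 43-symbol alphabet fits in 6 bits, so fixed-width packing on its own takes 8-bit ASCII down to 6/8 of its size (a 25% saving); anything beyond that would have to come from extra redundancy such as a word dictionary or entropy coding, which is not shown here. `ALPHABET`, `pack` and `unpack` are made-up names.

```python
# 26 letters + 10 digits + . , ' " ! ? + space = 43 symbols, which fits in 6 bits
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789.,'\"!? "
BITS = 6
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(text: str) -> bytes:
    """Pack each character into 6 bits, zero-padding the tail to a whole byte."""
    acc = 0
    for ch in text:
        acc = (acc << BITS) | CODE[ch]  # KeyError if ch is outside the alphabet
    nbits = len(text) * BITS
    pad = (-nbits) % 8
    # One big integer keeps the sketch short; a real packer would stream bytes.
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack; the character count must be stored alongside the data."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    chars = []
    for i in range(length):
        shift = total_bits - (i + 1) * BITS
        chars.append(ALPHABET[(acc >> shift) & 0x3F])
    return "".join(chars)


text = "hello world. it is 42!"
packed = pack(text)
print(len(text), len(packed))  # 22 ASCII bytes vs 17 packed bytes
```

Restoring the capitals on decode, as described in the comment, would be a separate pass over the unpacked text.
-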
awesomeest 1238 190d
@vane I assume since you want to add it on 'the blockchain' you'd want it immutable/static forever? Altering anything brings a bunch of other issues... also, as a dev working on several blockchains... you reeeeeallly need to decide which blockchain you're talking about... there's so much that matters that's blockchain-specific. If you value your sanity I'd stay well clear of kleverchain... even if they offer you a grant.
-
Demolishun 34913 190d
@vane maybe talk to the Mojeek guy. I heard it was a search engine in some dude's closet in the UK. That might be an exaggeration, but I can find things there I cannot find elsewhere.
-
vane 11280 189d
@Demolishun that's an urban legend: it’s incorporated, and they use spiders so their results are not deterministic. I want deterministic results.
Related Rants
I’m expanding my storage with 8x 20TB hard drives. With RAID 5 on them I would get approximately 126TB of storage space.
This would allow me to download the full Common Crawl dataset and play with it locally.
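A quick back-of-the-envelope check of that number, as a sketch: RAID 5 spends one drive's worth of space on parity, so usable capacity is (n - 1) drives. The guess that the roughly 126TB figure is really TiB (binary units) minus some filesystem overhead is an assumption, not something stated in the rant.

```python
TB = 10**12   # decimal terabyte, what drive vendors advertise
TIB = 2**40   # binary tebibyte, what many OS tools report as "TB"


def raid5_usable_bytes(n_drives: int, drive_bytes: int) -> int:
    """RAID 5 keeps one drive's worth of parity, leaving n - 1 drives of data."""
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (n_drives - 1) * drive_bytes


usable = raid5_usable_bytes(8, 20 * TB)
print(usable / TB)              # 140.0 decimal TB
print(round(usable / TIB, 2))   # 127.33 TiB, before filesystem overhead
```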
random
dataset