Comments
-
Holy fuck. I wouldn't expect that all movie subtitles in the world would be 130gb
-
@retoor it's kind of impressive isn't it?
I imagine training on that data would get funky with all the time codes in it. -
@iSwimInTheC hmm, you should create a c program or so to strip all times. Python could be even too slow for that. And how much formats will be there?
It's interesting project. -
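Stripping the time codes could be sketched in Python before reaching for C (a minimal sketch assuming SRT-formatted input; other subtitle formats would need their own parsers):

```python
import re

# SRT time-code lines look like: 00:01:02,500 --> 00:01:05,000
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def strip_srt(text: str) -> str:
    """Keep only dialogue lines: drop cue numbers, time codes and blanks."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():        # cue index like "42"
            continue
        if TIMECODE.match(line):  # time-code line
            continue
        kept.append(line)
    return "\n".join(kept)

sample = (
    "1\n00:00:01,000 --> 00:00:03,000\nHello there.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nGeneral Kenobi!\n"
)
print(strip_srt(sample))  # prints only the two dialogue lines
```

Whether Python is actually too slow for 130gb of this would come down to I/O, not the regex.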
@retoor i wouldn't even believe it to be 130GB uncompressed. but as a zip? absolutely not.
-
@retoor nah, he probably just downloaded 130gb of garbage.
but let's wait until he _validated_ that all of the data he got actually consists of meaningful subtitles ;) -
@tosensei @retoor you can download the torrent from this link
https://reddit.com/r/DataHoarder/...
123gb torrent + 7gb zip from archive.org
it's a single sqlite database with zip blobs in it, and the names are in a nice format: you get some.title.(year).language.(id in database).zip. Walking through all of it takes about 2.5h in Python on my NAS, where you traverse each record in sqlite, open the zip in memory and extract the subtitle, skipping everything else. Not much time for 130gb. -
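The traversal described above could look roughly like this in Python (the real schema isn't given in the thread, so the table and column names `subs(name, content)` are assumptions for illustration):

```python
import io
import sqlite3
import zipfile

def iter_subtitles(db_path: str):
    """Yield (zip_name, member_name, text) for each subtitle file stored
    as a zip blob inside the sqlite database.

    Hypothetical schema: subs(name TEXT, content BLOB)."""
    con = sqlite3.connect(db_path)
    try:
        for name, blob in con.execute("SELECT name, content FROM subs"):
            # open the zip entirely in memory, no temp files on disk
            with zipfile.ZipFile(io.BytesIO(blob)) as zf:
                for member in zf.namelist():
                    if member.endswith(".srt"):  # skip everything else
                        text = zf.read(member).decode("utf-8", errors="replace")
                        yield name, member, text
    finally:
        con.close()
```

Since it yields lazily, you never hold more than one zip in memory at a time, which matters when the database itself is over 120gb.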
Now I just wonder what the most common phrases in media are, by category.
If you spoke to this AI, would it feel like the average Disney fan?
That would be hilarious.
Related Rants
Downloaded 130gb of movie subtitle zip files.
If I find some power deep in my heart, I will normalize the data and launch training on a generative transformer to see if it produces decent dialogue.
It will probably stop at the planning phase because I'm diving deeper towards depression.
rant
dataset