9
vane
1y

Downloaded 130 GB of movie subtitle zip files.

If I find the strength somewhere deep in my heart, I'll normalize the data and train a generative transformer on it to see if it produces decent dialogue.

It will probably stall at the planning phase, because I'm sinking deeper into depression.

Comments
  • 2
    Holy fuck. I wouldn't have expected all the movie subtitles in the world to add up to 130 GB.
  • 2
    @retoor it's kind of impressive isn't it?

    I imagine training on that data would get funky with all the time codes in it.
  • 1
    @iSwimInTheC hmm, you should write a C program or something to strip all the timestamps. Python might even be too slow for that. And how many formats will there be?

    It's an interesting project.
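
For reference, stripping the time codes could look roughly like the sketch below. It assumes SubRip (.srt) input only; the thread does not say which formats are in the dump, so other formats would need their own handling. The function name `strip_srt` is made up for illustration.

```python
import re

# SubRip timing line, e.g. "00:01:02,500 --> 00:01:05,000"
TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{3}"
)
# Inline markup such as <i>...</i> or {\an8} position tags
TAG = re.compile(r"<[^>]+>|\{\\[^}]*\}")


def strip_srt(text):
    """Return only the dialogue lines of an .srt file (hypothetical helper)."""
    rows = [row.strip() for row in text.lstrip("\ufeff").splitlines()]
    dialogue = []
    for i, line in enumerate(rows):
        # Drop blank separators and timing lines.
        if not line or TIMING.match(line):
            continue
        # A cue counter is a digit-only line directly followed by a timing line;
        # digit-only dialogue (e.g. "1984") is kept.
        if line.isdigit() and i + 1 < len(rows) and TIMING.match(rows[i + 1]):
            continue
        line = TAG.sub("", line).strip()
        if line:
            dialogue.append(line)
    return dialogue
```

Whether this is fast enough for the whole dump is something only a test run on real data would show.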
  • 2
    Hollywood dialogues
  • 1
    @retoor i wouldn't even believe it to be 130GB uncompressed. but as a zip? absolutely not.
  • 0
    @tosensei so, is vane a liar? 🤥
  • 1
    @retoor nah, he probably just downloaded 130gb of garbage.

    but let's wait until he's _validated_ that all of the data he got actually consists of meaningful subtitles ;)
  • 0
    @vane keep us up to date on this :) It's very interesting
  • 1
    @tosensei @retoor you can download the torrent from this link

    https://reddit.com/r/DataHoarder/...

    123 GB torrent + 7 GB zip from archive.org

    It's a single SQLite database with zip blobs in it. The names are in a nice format: some.title.(year).language.(id in database).zip. Walking through all of it takes about 2.5 h in Python on my NAS: you traverse each record in SQLite, open the zip in memory, and extract the subtitle, skipping everything else. Not much time for 130 GB.
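
A minimal sketch of that traversal, using only the standard library. The real schema is not given in this thread, so the table name `subtitles`, the columns `name` and `content`, and the list of subtitle extensions are all assumptions for illustration.

```python
import io
import sqlite3
import zipfile

# Assumed set of subtitle extensions; everything else in a zip is skipped.
SUBTITLE_EXTENSIONS = (".srt", ".sub", ".ssa", ".ass", ".vtt")


def iter_subtitles(db_path, table="subtitles", name_col="name", blob_col="content"):
    """Yield (archive_name, member_name, raw_bytes) per subtitle file.

    Table and column names are hypothetical placeholders.
    """
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(f"SELECT {name_col}, {blob_col} FROM {table}")
        for archive_name, blob in cursor:
            try:
                # Open the zip straight from the blob, without touching disk.
                archive = zipfile.ZipFile(io.BytesIO(blob))
            except zipfile.BadZipFile:
                continue  # not a valid zip: skip the record
            with archive:
                for member in archive.namelist():
                    if member.lower().endswith(SUBTITLE_EXTENSIONS):
                        yield archive_name, member, archive.read(member)
    finally:
        conn.close()
```

Counting how many records get skipped as invalid zips would also give a rough answer to the "130 GB of garbage" question.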
  • 2
    Now I just wonder what the most common phrases in media are, by category.

    If you spoke to this AI, would it feel like talking to the average Disney fan?

    That would be hilarious.
  • 0
    @retoor perhaps some are photographs?