I've noticed loading half of a small dataset twice is faster than loading the full small dataset once from files around half a terabyte.
Any tips?
help
Comments
-
uhhhh... half of a terabyte is pretty fucking big. That can't be cached in memory, so you are going to be limited by the speed of your hard drive.
Or are you joking and I missed it? lol
-
@iam13islucky I'm loading "a small dataset" of around 256 MiB from the 512 GiB file.
-
kraator: Huge files are not cached entirely, but only those ranges you actually access. When reading a small chunk a second time, it will likely still be in RAM. Other parts of the same file might not yet be cached (even though the OS sometimes reads more than requested to speed up future read requests).
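A minimal sketch of how one could see this warm-cache effect, assuming Linux-style page caching and a made-up file path (not from the thread):

    import time

    PATH = "/data/huge_file.bin"       # hypothetical 512 GiB file
    CHUNK = 256 * 1024 * 1024          # 256 MiB

    def timed_read(offset, size):
        # Read `size` bytes at `offset`; return (elapsed seconds, bytes read).
        start = time.perf_counter()
        with open(PATH, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        return time.perf_counter() - start, len(data)

    cold, n = timed_read(0, CHUNK)   # first pass: likely hits the disk
    warm, _ = timed_read(0, CHUNK)   # second pass: likely served from RAM
    print(f"cold: {cold:.2f}s  warm: {warm:.2f}s  ({n} bytes read)")

If the second read is much faster, those pages were still sitting in the page cache.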
-
@kraator my wording was not so good. I meant that loading each half of the small dataset in turn is faster than loading the full small dataset at once. I could see that happening if each load puts roughly twice the requested size in the cache. Maybe loading 256 MiB puts 512 MiB in the cache? But then the first 128 MiB load would happen to put only a small readahead buffer covering the second 128 MiB block in the cache?
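One way to test the readahead hypothesis, as a sketch: turn readahead off and see if the gap disappears. This assumes Linux and the same hypothetical path as above; os.posix_fadvise is not available on all platforms.

    import os

    fd = os.open("/data/huge_file.bin", os.O_RDONLY)
    # Hint that access is random, so the kernel should skip readahead:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    # ... time the 256 MiB load vs. the two 128 MiB loads here ...
    # Evict this file's cached pages before the next trial:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)

If the two orderings now take the same time, readahead (not the load size itself) was the difference.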
-
@thejohnhoffer ah, but I'm doing these loads in series. Maybe that's the obvious problem here...
-
@thejohnhoffer I'm loading the full 256 MiB block. Then, immediately after, I load two 128 MiB blocks from the same place in the same file. At least some of that must be cached. I'll redo the experiment in reverse order to see if that changes anything.
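A sketch of that reversed trial, reusing the hypothetical timed_read helper from above:

    MiB = 1024 * 1024
    # Halves first this time, then the full block over the same range:
    t_halves = sum(timed_read(off, 128 * MiB)[0] for off in (0, 128 * MiB))
    t_full, _ = timed_read(0, 256 * MiB)
    print(f"two halves: {t_halves:.2f}s  full block: {t_full:.2f}s")

If the full block is now the fast one, the ordering (and therefore the cache) explains the original measurement rather than the load size.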
-
Not an OS expert, but I'm pretty sure you'll find some great research papers on this topic.