I've noticed loading half of a small dataset twice is faster than loading the full small dataset once from files around half a terabyte.
Any tips?
help
Comments
-
uhhhh... half of a terabyte is pretty fucking big. That can't be cached in memory, so you are going to be limited by the speed of your hard drive.
Or are you joking and I missed it? lol
-
@iam13islucky I'm loading "a small dataset" of around 256 MiB from the 512 GiB file.
-
kraator: Huge files are not cached entirely, but only those ranges you actually access. When reading a small chunk a second time, it will likely still be in RAM. Other parts of the same file might not yet be cached (even though the OS sometimes reads more than requested to speed up future read requests).
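A minimal sketch of how one could see this warm-cache effect, assuming Linux-style page caching and a made-up file path (not from the thread):

    import time

    PATH = "/data/huge_file.bin"       # hypothetical 512 GiB file
    CHUNK = 256 * 1024 * 1024          # 256 MiB

    def timed_read(offset, size):
        # Read `size` bytes at `offset`; return (elapsed seconds, bytes read).
        start = time.perf_counter()
        with open(PATH, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        return time.perf_counter() - start, len(data)

    cold, n = timed_read(0, CHUNK)   # first pass: likely hits the disk
    warm, _ = timed_read(0, CHUNK)   # second pass: likely served from RAM
    print(f"cold: {cold:.2f}s  warm: {warm:.2f}s  ({n} bytes read)")

If the second read is much faster, those pages were still sitting in the page cache.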
-
@kraator my wording was not so good. I meant that loading each half of the small dataset in turn is faster than loading the full small dataset at once. I could see that happening if each load puts roughly twice the requested size in the cache. Maybe loading 256 MiB puts 512 MiB in the cache? But then the first 128 MiB load would happen to put only a small readahead buffer covering the second 128 MiB block in the cache?
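One way to test the readahead hypothesis, as a sketch: turn readahead off and see if the gap disappears. This assumes Linux and the same hypothetical path as above; os.posix_fadvise is not available on all platforms.

    import os

    fd = os.open("/data/huge_file.bin", os.O_RDONLY)
    # Hint that access is random, so the kernel should skip readahead:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    # ... time the 256 MiB load vs. the two 128 MiB loads here ...
    # Evict this file's cached pages before the next trial:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)

If the two orderings now take the same time, readahead (not the load size itself) was the difference.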
-
@thejohnhoffer ah, but I'm doing these loads in series. Maybe that's the obvious problem here...
-
@thejohnhoffer I'm loading the full 256 MiB block. Then, immediately after, I load two 128 MiB blocks from the same place in the same file. At least some of that must be cached. I'll redo the experiment in reverse order to see if that changes anything.
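A sketch of that reversed trial, reusing the hypothetical timed_read helper from above:

    MiB = 1024 * 1024
    # Halves first this time, then the full block over the same range:
    t_halves = sum(timed_read(off, 128 * MiB)[0] for off in (0, 128 * MiB))
    t_full, _ = timed_read(0, 256 * MiB)
    print(f"two halves: {t_halves:.2f}s  full block: {t_full:.2f}s")

If the full block is now the fast one, the ordering (and therefore the cache) explains the original measurement rather than the load size.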
-
Not an OS expert, but I'm pretty sure you'll find some great research papers on this topic.