Manager: our file IO is slow, any suggestions to make it faster?

Code: multithread writing to a few hundred small (temp) files then single thread combine to one big file and delete the temp files.

Eyes: bleeding

  • 12
    Looks like someone has just blindly re-implemented what a data scientist wrote, so........

    That's great.
  • 19
    The hilarious thing is a code review comment is like "this loop is implementing parallel reduce, there's a library version that doesn't preserve order" and it's not saying "Why the fuck are we doing this stupid shit with temp files?"
  • 6
    Why not just do sequential_writes...
  • 11
    @netikras I think you're overestimating the ability of our dev team.
  • 8
    optimizedn't? 😅
  • 5
    @netikras Why not just use memory buffers?
  • 1
    Because that's how harddrives work, yes
  • 5
    The answer is: write to disk twice while incurring massive file system operation overhead. Kills performance every time... We can just write it once to the single file.

    I mean, if you really need to have the small files in an archive I can understand this approach, even though there are streaming solutions. But this is as stupid as it gets.
  • 15
    This is genius!
    When you create as many threads as there are bytes in the file, you can write that file instantly, no matter its size!
  • 2
    Tell me it's a flash drive... tell me...
  • 2
    If those are small files, why not use SQLite instead of files?
    They already implemented most of the filesystem optimizations there.

  • 4
    Y'all assuming it's some super complicated use case, it's literally writing out a 2d matrix to a file, just with the original implementation written by someone that was good at maths, not high performance code, then someone else "making it faster" by doing the same thing in c++ instead of python.

    The big difference between the C++ and the Python seems to be that the Python didn't delete these temp files, and instead had a comment suggesting the user delete them. So they've automated the comment...
  • 1
    @Fast-Nop because memory buffers are not persistent?
    You *may* use aio to flush them, but then you are never sure whether/when the flush is complete before doing another write. Also, additional hurdle of managing the buffer with aio... painful.
  • 1
    @netikras If the goal is to generate one final output file anyway, there's no point in persistent temporary files because they will be deleted anyway. You don't need aio for the buffers because you eliminate the io part from the equation.

    Sure, you need the parallel threads to finish their work before you can batch up their individual result buffers into the result file, but that's just joining threads.
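
A minimal Python sketch of the buffer-and-join idea described above, assuming the output is a 2D matrix written as comma-separated text. The names `format_chunk` and `write_matrix`, the chunking scheme, and the CSV-style layout are illustrative assumptions, not the code under discussion. Each worker formats its own slice into its own in-memory string, so nothing is shared, no lock is needed, and nothing touches the disk until the single final write. In CPython the GIL limits how much pure-Python formatting gains from threads, so treat this as a picture of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def format_chunk(rows):
    """Format a slice of matrix rows into one in-memory string (no disk IO)."""
    return "".join(",".join(str(v) for v in row) + "\n" for row in rows)


def write_matrix(matrix, path, workers=4):
    """Format row chunks in parallel, then write the result file exactly once."""
    chunk_size = max(1, -(-len(matrix) // workers))  # ceiling division
    chunks = [matrix[i:i + chunk_size] for i in range(0, len(matrix), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order; leaving the "with" block
        # waits for all workers, which is the "joining threads" step.
        buffers = list(pool.map(format_chunk, chunks))
    with open(path, "w", newline="") as out:
        out.writelines(buffers)
```

Because `map()` hands the buffers back in submission order, row order in the output matches the input without any extra bookkeeping.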
  • 2
    @atheist hahaha. The good old everything is faster in C approach.
    Just to spite the assholes you should refactor the Python one (or insert favourite language here). It will be orders of magnitude faster than the C++ one.

    Don't know what the output is used for or how it is built up, but a SQLite table or CSV can be a 2D matrix. Unfortunately I need to build an Excel file that is built up horizontally as well.
  • 1
    @Fast-Nop missed the "lots of small files" part. My bad
  • 4
    @hjk101 I mean... It *is* faster in c++, the maths being done has gone from an hour to about 10 minutes, but apparently most of the remaining 10 minutes is file IO.

    Like, my background is literally "make shit fast", I'm used to being pleased if I can save a few clock cycles by making something SIMD/vectorized, this stuff I'm like
    * don't write this out to the HD twice. HDs are really really slow.
    * pass vectors by reference, this is copying everything. Copies are slow.
    * mutexes are slow, try to avoid them, here's a design pattern that makes it redundant.
  • 0
    Wow the math in Python taking an hour vs 10 min. Either that is something that is super inefficient in python or it has a similar bad design.

    It might surprise you that pointer dereferencing can be slower than copying the blasted thing. It's actually a compiler optimisation strategy to remove the derefs.
  • 1
    @hjk101 There is a reason that heavy math lifting in Python usually goes via libs that are written in C, such as NumPy. You don't do that in Python directly. If the previous devs did, that explains why it was slow.
  • 1
    @hjk101 python is "not great" for threaded code (Python GIL). I'm also eyeballing the performance difference, it still takes 10 minutes to run in its current state, but it's still a lot faster than the previous implementation. We're talking 32 core machine.
  • 1
    @Fast-Nop It was some numpy stuff, but even then, threads.
  • 2
    @hjk101 re pointer dereferencing, not sure I follow. The copy would require a malloc which acquires a mutex. Mutex is slooowww.

    I'm a very niche dev...
  • 2
    @Fast-Nop yeah, that was indeed what I meant: in Python, crunching numbers works fine as long as you use optimised precompiled stuff where it matters.

    @atheist ah didn't know it was that heavy on the parallelization. Also didn't know python was that poor at it. I always go to Go as my go-to language for that kinda stuff.
  • 0
    @atheist perhaps worth a shot to look at https://nifi.apache.org
  • 2
    Follow up: I was told that testing showed multithreaded IO was faster.

    I just rewrote the IO to be 5x faster (single threaded vs single threaded).
  • 1
    @atheist I'd still guess that "no temporary IO" is even faster.
  • 2
    @Fast-Nop yup, but basically "building the strings to put into a file" was so slow it was benefitting from being multithreaded, even with the temporary files. It's now so much faster that it actually is faster to do single threaded without the temp files.
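
For illustration, a single-threaded, no-temp-file writer might look like the Python sketch below. The thread does not say what made the original string building slow, so this only shows the end state being described; the function name `write_matrix_single`, the comma-separated layout, and the 1 MiB buffer size are assumptions.

```python
def write_matrix_single(matrix, path, buffer_size=1 << 20):
    """Write a 2D matrix as text in one pass: no threads, no temp files."""
    with open(path, "w", buffering=buffer_size, newline="") as out:
        for row in matrix:
            # One join per row avoids repeated string concatenation.
            out.write(",".join(map(str, row)))
            out.write("\n")
```

The file object's own buffer batches the small writes, so the data goes to disk once and in large pieces.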
  • 2
    @atheist By "tempfile" I assume a file on a storage device.
    Wouldn't it have been waaaaayyy faster to store the strings on the heap / system memory and then write everything to the target file in one go?
  • 2
    @PonySlaystation yes.

    But that also overestimates the ability of our dev team.
  • 1
    @atheist Do people use RAM drives for stuff like this anymore?
  • 1
    @Demolishun I have in the past for writing out a lot of data for inspection (image analysis), but this just doesn't need it. 😅