
What would be the better approach for loading files that are very large in size or in quantity in Java?
1) Loading data parallel through multiple threads
2) Loading data in series in a single thread
3) Any other methodology?

I'm asking because the loading time varies between the two cases.

Comments
  • 0
    depends on what you'll be doing with it afterwards
  • 0
@netikras It's a heavy dataset for text classification. I'm writing my own implementation of Gaussian Naive Bayes. It contains lots and lots of text files, each around 5 to 8 lines long.
  • 0
@BlackSparrow does it have to stay in memory once loaded? I.e., will you be reusing the files' contents you're trying to load?
  • 1
@netikras No, once I preprocess the data, the original won't be used anymore.
  • 4
AFAIK the classic java.io File API gives you neither multiple cursors into the same file nor a way to mmap it (NIO's FileChannel covers the latter), so the reading itself can be done on one thread per file.
If there are any transformations involved, these can be spread across multiple threads. So a simple single-producer-multiple-consumers pipeline pattern might be a good choice IMO. While the producer (file reader) thread is waiting for input from the storage device, the consumers (workers) will be transforming data that has already been read in.
If there are multiple files to read, I'd use multiple producers reading files in and pushing data onto a queue the consumers read from.
The consumers would do the required transformations. If the transformed data needs to be written out, I'd add another single-threaded aggregator that collects input from the workers and flushes it to the output, or multiple aggregators if there should be more than one output. A minimal sketch of this pipeline follows below.
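A minimal, hedged sketch of the single-producer-multiple-consumers pipeline described above. The `data` directory, queue capacity, consumer count, and the `process()` hook are all illustrative placeholders, not anything from the original discussion:

```java
import java.nio.file.*;
import java.util.concurrent.*;

public class PipelineSketch {
    // sentinel telling a consumer there is no more work (one per consumer)
    private static final String POISON = "\u0000POISON\u0000";
    private static final int CONSUMERS = 4;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000); // bounded: producer blocks if consumers fall behind
        ExecutorService consumers = Executors.newFixedThreadPool(CONSUMERS);

        for (int i = 0; i < CONSUMERS; i++) {
            consumers.submit(() -> {
                try {
                    for (String line = queue.take(); !POISON.equals(line); line = queue.take()) {
                        process(line); // transformation happens off the reader thread
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // single producer: read every file sequentially and hand lines to the queue
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("data"))) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) continue;
                for (String line : Files.readAllLines(file)) {
                    queue.put(line);
                }
            }
        }
        for (int i = 0; i < CONSUMERS; i++) {
            queue.put(POISON); // wake each consumer and let it exit
        }
        consumers.shutdown();
        consumers.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static void process(String line) {
        // placeholder for the real transformation (e.g. tokenizing text for the classifier)
    }
}
```

Multiple producers are simply more reader threads submitting to the same queue; the poison-pill count stays one per consumer.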
  • 4
Simply put, I'd separate read() operations from process() and dedicate thread-per-file (or thread_pool-per-files_pool) readers alongside multiple processor threads.

Having multiple threads doing I/O via the same file handle does not win you any performance. On the contrary, you'll lose on context switching. You'll lose even more if read() and process() are done on the same thread.
  • 1
Loading data is not a CPU-intensive task, so there is no need to split it across threads.
Also, issuing more read requests won't make the storage device any faster.
Data processing is a different question.
  • 2
@ravijojila while it is not CPU-intensive, unless NIO is at play the read() syscall is a blocking one. So if there are 1000 files, a single thread can only read them in sequentially, one by one. A pool of reader threads, however, would be more efficient: while thread #3 is waiting for the interrupt from storage on file aaa.txt, thread #4 might already have received the required interrupt on bbb.txt and be doing the reading. (A non-blocking NIO sketch follows below.)
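For completeness, a small sketch of the non-blocking NIO path mentioned above, using AsynchronousFileChannel. The path `data/aaa.txt` is just a placeholder echoing the file name in the comment:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.Future;

public class NioReadSketch {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("data/aaa.txt");
        try (AsynchronousFileChannel ch = AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) Files.size(file));
            Future<Integer> pending = ch.read(buf, 0); // returns immediately; the read runs in the background
            // ... this thread is free to do other work while the bytes arrive ...
            pending.get();                             // block only when the contents are actually needed
            buf.flip();
            System.out.println(StandardCharsets.UTF_8.decode(buf));
        }
    }
}
```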
  • 3
@netikras there are about 25,000 files of 1-2 kB each. So yeah, I guess multiple producers are a good option. Thank you👌
  • 0
Are streams an option? If yes, go for streams after reading and storing the files (sketch below). File reads are usually very fast, so the reading itself will not be an issue, but mapping the files could be a lot more resource-intensive.
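If streams are on the table, a parallel stream can fold the fan-out into the pipeline itself. A minimal sketch, assuming Java 11+ for Files.readString and a placeholder `data` directory; the toLowerCase step merely stands in for whatever preprocessing the classifier needs:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamLoadSketch {
    public static void main(String[] args) throws IOException {
        try (Stream<Path> paths = Files.list(Paths.get("data"))) {
            List<String> docs = paths
                .filter(Files::isRegularFile)
                .parallel()                          // the common fork-join pool handles the fan-out
                .map(p -> {
                    try {
                        return Files.readString(p);  // Java 11+
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .map(String::toLowerCase)            // placeholder preprocessing step
                .collect(Collectors.toList());
            System.out.println("loaded " + docs.size() + " documents");
        }
    }
}
```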
  • 0
@BlackSparrow 25k threads are likely to hit the nproc and nofile ulimits. I'd use a limited producer pool, say 100 threads or so (see the sketch below). An unbounded thread pool is by no means a good idea: the OS might kill your whole process just because it spawned too many child processes (a thread is a special kind of process). And 25k threads would likely slow you down anyway.
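A sketch of such a bounded reader pool, again assuming Java 11+ for Files.readString; the pool size of 100 and the `data` directory are illustrative:

```java
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class BoundedReaderPool {
    public static void main(String[] args) throws Exception {
        ExecutorService readers = Executors.newFixedThreadPool(100); // bounded pool, not thread-per-file
        List<Future<String>> pendingReads = new ArrayList<>();

        // submit all 25k files; only 100 reads are ever in flight at once
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("data"))) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    pendingReads.add(readers.submit(() -> Files.readString(file)));
                }
            }
        }
        for (Future<String> read : pendingReads) {
            String contents = read.get(); // blocks until that particular file has been read
            // hand `contents` off to the processing stage here
        }
        readers.shutdown();
    }
}
```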
  • 2
@netikras yeah, I tried different numbers of threads and settled on 25 as good enough (that too is passed as a parameter). However, the performance with 25 and 50 threads seems pretty close.
  • 4
I think the main problem in your case is the performance of the storage device the files reside on. Depending on the use case, one solution would be to wrap the files in a container filesystem (ISO or something else) and then mount it as a ramdrive (if enough memory is available). Then use a limited number of threads to do the reading; one per processor core should be enough (see the sizing snippet below).
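The mounting itself happens at the OS level, but the thread-per-core sizing can be done in Java rather than hard-coded; a trivial sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CoreSizedPool {
    public static void main(String[] args) {
        // size the reader pool to the machine instead of hard-coding a count
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService readers = Executors.newFixedThreadPool(cores);
        System.out.println("reader pool size: " + cores);
        readers.shutdown();
    }
}
```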
  • 0
@BlackSparrow yeah, testing various configurations will help you tune your pool sizes. However, the same configuration might not be as efficient on another computer: faster storage, smaller files, a caching storage controller, more CPUs/threads, etc. can all affect your readers' efficiency.
  • 0
25k 1-2 kB files. That reminds me of an ARFF files assignment I once had at college 😁 I had to read a shitload of data and map values across all the files using threads.
  • 0
@netikras all the assignments we get at college are just pointless: copying a research paper for marks. No learning, just doing shit for marks.