
I know streams are useful for faster per-chunk reading of large files (e.g. audio/video), and in Node they can be piped, which also keeps memory usage in check (when done correctly). But suppose I have a large JSON file of 500MB (say, from a scraper) that I want to run some string content replacements on. Are streams fit for this kind of purpose? How do you go about altering the JSON 'chunks' separately, when the Buffer.toString() of a chunk would probably be invalid partial JSON?

I guess I could rephrase it as: what is the best way to read large, structured text files (JSON, HTML, etc.), manipulate their contents, and write them back (without reading them in memory at once)?

Comments
  • 4
    "Without reading them in memory at once" - guess you mean reading the whole file to memory?

    Usually you have low-level implementations that work on a stream that gets tokenized.

    Tokenized as in "reading char by char", using a stack or similar structure to keep only the current token alive, plus whatever information is needed for validation.

    E.g.
    http://tutorials.jenkov.com/java-js...

    Modifying gets trickier - but not impossible. The tricky part is removing _nested_ structures, as you have to forward the input stream to skip to the end of the nested structure... All other content can be written from the input stream to e.g. a file, token by token. (A sketch of this approach follows the comments below.)
  • 1
    You were right when guessing that you would have to operate on the file in chunks that are not, by themselves, valid JSON.

    However, some string manipulations could be implemented with a seek pointer, reading overlapping windows of several chars per chunk.

    Let's say you want to replace the word "pony" with "bronco" in all keys that contain it.
    You would need a sliding window at least twice the length of the string "pony" (so, 8 chars) that slides by at most half the length of "pony" (so, 2 chars). You would also need a regex to detect that a given window contains a key declaration including the word "pony".
    Finally, you would need to pipe the altered content to another file, advancing a pointer so the overlapping parts of consecutive windows are not written twice. (See the second sketch after the comments.)

    It works especially well for NLP.
  • 1
    @IntrusionCM @JsonBoa thanks for the detailed replies! I read the linked article and learned something, but as far as I can tell this is all too low-level for a JS static site generator.
    If I end up using any readable or writable streams, it will have to be only for the advantages other than a "lower memory footprint".
  • 0
    @webketje

    Yes, it's only useful if you are severely memory-constrained or have true "beasts" in terms of size...

    E.g. log analysis becomes so much easier when a file of several gigs is in JSON and can be tokenized for a quick peek-a-boo.
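
Below is a minimal sketch of the char-by-char tokenizing approach described in the first comment, in Node.js. It is illustrative only: the class name is made up, and it merely tracks nesting depth while passing bytes through, which is the bookkeeping you would build skipping or rewriting on top of. In practice, npm packages such as stream-json or JSONStream provide ready-made streaming JSON tokenizers of this kind.

    // Sketch (assumed code, not from the thread): a Transform stream that scans
    // JSON char by char and tracks nesting depth with a stack, as the first
    // comment describes. Output is passed through byte-for-byte.
    const { Transform } = require('stream');

    class JsonDepthTracker extends Transform {
      constructor(options) {
        super(options);
        this.stack = [];       // currently open '{' / '[' characters
        this.inString = false; // inside a JSON string literal?
        this.escaped = false;  // was the previous char a backslash?
      }

      _transform(chunk, encoding, callback) {
        // JSON's structural characters are ASCII, so even if a chunk boundary
        // splits a multibyte character, the depth tracking stays correct.
        for (const ch of chunk.toString('utf8')) {
          if (this.inString) {
            if (this.escaped) this.escaped = false;
            else if (ch === '\\') this.escaped = true;
            else if (ch === '"') this.inString = false;
          } else if (ch === '"') {
            this.inString = true;
          } else if (ch === '{' || ch === '[') {
            this.stack.push(ch);
          } else if (ch === '}' || ch === ']') {
            this.stack.pop(); // assumes well-formed input
          }
          // this.stack.length is the current nesting depth. To drop a nested
          // structure, suppress output from the char where it opens until the
          // stack shrinks back to the depth where it started.
        }
        this.push(chunk); // pass-through
        callback();
      }
    }

    // Usage: fs.createReadStream('big.json').pipe(new JsonDepthTracker()).pipe(process.stdout);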
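And here is a sketch of the second comment's overlapping-window replacement. Again the code is assumed, not from the thread: instead of a fixed 8-char window, it holds back a carry of needle.length - 1 characters between chunks, which serves the same purpose (a match can never straddle a chunk boundary unseen). For brevity it replaces the needle everywhere rather than only inside keys; the comment's key-detecting regex would slot in where the split/join is. It also assumes the needle and replacement do not overlap each other, which holds for "pony"/"bronco"; a production version would track match state explicitly.

    // Sketch (assumed code): streaming string replacement with an overlap carry,
    // so a "pony" split across two chunks is still found and replaced.
    const { Transform } = require('stream');
    const { StringDecoder } = require('string_decoder');

    class StreamReplace extends Transform {
      constructor(needle, replacement, options) {
        super(options);
        this.needle = needle;
        this.replacement = replacement;
        this.decoder = new StringDecoder('utf8'); // keeps multibyte chars intact
        this.carry = '';                          // unemitted tail of previous text
      }

      _transform(chunk, encoding, callback) {
        const text = this.carry + this.decoder.write(chunk);
        const replaced = text.split(this.needle).join(this.replacement);
        // Hold back the last needle.length - 1 chars: they could be the start
        // of a match that the next chunk completes.
        const boundary = Math.max(0, replaced.length - (this.needle.length - 1));
        this.carry = replaced.slice(boundary);
        this.push(replaced.slice(0, boundary));
        callback();
      }

      _flush(callback) {
        // The carry is shorter than the needle, so it cannot hold a full match.
        this.push(this.carry + this.decoder.end());
        callback();
      }
    }

    // Usage, with the comment's example values:
    // fs.createReadStream('in.json')
    //   .pipe(new StreamReplace('pony', 'bronco'))
    //   .pipe(fs.createWriteStream('out.json'));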