1
donuts
4y

How is it possible to read a zipped text file (e.g. with zgrep) without first extracting the whole thing?

Comments
  • 5
  • 1
    @Ranchonyx ... Of Iterators / Generators / Streams.

    What language specifically?

    Usually you need a library to decode the stream, feeding it a file descriptor or a stream that is read in chunks.

    The compression format decides what's possible.

    Not sure which but I think some formats only allow extracting specific files, not reading in a streaming manner.

    I think it was simple ZIP.
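
A minimal sketch of the "library that decodes the stream" idea, using Python's standard `gzip` module. The function name `grep_gzip` is made up for illustration, and this is not how zgrep itself is implemented; it only shows that lines can be matched while the file is decompressed incrementally.

```python
import gzip
import re
from typing import Iterator


def grep_gzip(path: str, pattern: str) -> Iterator[str]:
    """Yield lines of a gzip-compressed text file that match `pattern`."""
    regex = re.compile(pattern)
    # gzip.open returns a file-like object; iterating over it decompresses
    # the data piece by piece instead of extracting the file first.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if regex.search(line):
                yield line.rstrip("\n")
```

Because it is a generator, a caller can stop after the first match and the rest of the file is never decompressed.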
  • 1
    @IntrusionCM well, gzip and Linux commands like zgrep, zhead, ztail work if the gz/tar(?) only contains a text file.

    So seems like it's reading it in memory just like a normal text file so just wondering how.

    Or maybe how compression works in general... Usually compression on txt is really high vs images.
  • 2
    Yeah
    Mixed it up in my brain somewhere.

    I think brain was stuck in the old PHP 5.3 ZIP module where you could extract files from ZIP via API but streaming wasn't possible via the API.

    Simply put: Compression works by splitting / tokenizing a byte stream and reducing it.

    https://www.hanshq.net/zip.html

    If you wanna dig deep.
  • 1
    It's not all that magical to understand when you think about it... depending on the compression format a lot is possible in a stream...

    if the compression can be expanded in place or from substitutions you can easily stream it and just expand it as you go

    think about compressing aaaaaaaaaabbbbbbbbbb as "a10b10"

    nothing would stop you from streaming 3 characters "a10" into the unzipper, which would stream out "aaaaaaaaaa" before reading "b10"

    if you're dealing with something like Huffman coding it can literally be treated as a lookup table as you go through the file... read some bits... expand... read next bits... expand... and so on...

    in both cases it plays pretty nicely with something like searching or regex, which *mostly* happens in a forward motion anyway

    I don't know the ins and outs of every compression format, but essentially if something can be expanded fully, then it can be expanded partially as well. There might be exceptions but I don't know of any
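
The "a10b10" idea above can be sketched as a streaming decoder. This is a toy format assumed purely for illustration (symbols are never digits, each symbol is followed by a decimal count), and `rle_decode_stream` is a hypothetical name. One small caveat: the decoder only knows a count is complete once it sees the next symbol or the end of input, so "a10" is emitted when "b" arrives.

```python
from typing import Iterable, Iterator


def rle_decode_stream(chars: Iterable[str]) -> Iterator[str]:
    """Expand a toy run-length stream such as "a10b10" one run at a time."""
    symbol = None
    digits = ""
    for ch in chars:
        if "0" <= ch <= "9":
            if symbol is None:
                raise ValueError("count without a preceding symbol")
            digits += ch
        else:
            # A new symbol means the previous run's count is complete,
            # so the previous run can be emitted immediately.
            if symbol is not None:
                yield symbol * int(digits or "1")
            symbol, digits = ch, ""
    # End of input closes the last run.
    if symbol is not None:
        yield symbol * int(digits or "1")
```

The decoder only ever holds one symbol and its count, no matter how long the input is, which is what makes searching on top of it cheap.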
  • 1
    @Hazarth but where's the lookup table stored? At the head?

    I'm just thinking even in text files it's pretty rare to get something like a10... It's mostly like random text, though I guess if you just translate the most common letters to a smaller bit value that would save a lot of space
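
The "most common letters get a smaller bit value" idea is what Huffman coding does. Below is a rough sketch that builds such a table from character frequencies; `huffman_codes` is an illustrative name, and real formats differ in detail. In DEFLATE (the method behind gzip and most ZIP files), for instance, a block that uses its own Huffman codes carries a description of them at the start of that block, so a streaming reader sees the table before the data that needs it.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a Huffman table mapping each character to a bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {char: code so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their codes grow by one bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

Frequent characters end up near the root of the tree and therefore get short codes, while rare ones get longer codes.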
  • 2
    @donuts

    I mean, there's definitely some footprint in the memory. It's just that the entire archive doesn't need to be extracted for something like grep to work, it can be executed on top of a buffer.

    if a lookup table is required I'd assume it's loaded into memory as part of the header of the zip file before buffering the contents too...

    I mean, I don't know, but it's the logical assumption to make... load the header... keep that in... and then use that to expand and stream the contents on the fly, something like a regex state machine wouldn't need to look back, so it can just keep going forward in the file, pushing the next token into the state machine and then forgetting about it and moving to the next

    at least that's how I'd do it if I needed to stream over a compressed archive
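
A rough sketch of "run the search on top of a buffer", using Python's standard `zlib` module to decompress a gzip file chunk by chunk. The name `gzip_contains` and the chunk size are assumptions for illustration. It handles a single gzip member only and does not cap how much one compressed chunk may expand to.

```python
import zlib


def gzip_contains(path: str, needle: bytes, chunk_size: int = 64 * 1024) -> bool:
    """Return True if `needle` occurs in the decompressed data of a gzip file."""
    if not needle:
        return True
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    keep = len(needle) - 1  # bytes carried over so matches spanning chunks are seen
    tail = b""
    with open(path, "rb") as fh:
        while not decomp.eof:
            compressed = fh.read(chunk_size)
            if not compressed:
                break
            window = tail + decomp.decompress(compressed)
            if needle in window:
                return True
            # Forget everything except the short overlap and move forward.
            tail = window[-keep:] if keep else b""
    return needle in tail + decomp.flush()
```

Only the current chunk and a short overlap are kept around; earlier output is discarded, which matches the forward-only behaviour described above.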
  • 2
    Look at the URL I posted earlier.

    It's very detailed.

    In the case of ZIP, the information about how and what is structured is stored at the end of the file.

    Usually you'll read specific positions from the file header (like a table of contents)… which itself refers to position ranges inside the file.

    As soon as you know which position ranges you'll have, it's reading and decompressing.

    You'll need the position ranges to separate compressed content from metadata / structure information.

    As ZIP can include multiple files, it has an index of files and references to them, too.
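
The index and position ranges described above can be inspected with Python's standard `zipfile` module. This is a small sketch; the helper names `list_members` and `iter_member_lines` are made up for illustration.

```python
import zipfile
from typing import Iterator, List, Tuple


def list_members(path: str) -> List[Tuple[str, int, int]]:
    """Return (name, local header offset, compressed size) for each member."""
    with zipfile.ZipFile(path) as zf:
        # infolist() is built from the central directory at the end of the file.
        return [(i.filename, i.header_offset, i.compress_size) for i in zf.infolist()]


def iter_member_lines(path: str, member: str) -> Iterator[str]:
    """Stream one member line by line without extracting the archive."""
    with zipfile.ZipFile(path) as zf:
        # zf.open returns a file-like object that decompresses as it is read.
        with zf.open(member) as fh:
            for raw in fh:
                yield raw.decode("utf-8", errors="replace").rstrip("\n")
```

Reading a single member this way seeks to its recorded offset and decompresses only that range; the other members in the archive are never touched.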
  • 2
    Most compression formats work with blocks that can be uncompressed one at a time; you are not required to do it from beginning to end.

    That way a program can selectively uncompress just those blocks it currently requires.

    That's why in most such programs, at least the GUI-based ones (I have never had to do a partial uncompress on the command line, but I assume it works fine there too), you can extract selected files from an archive.

    And as soon as you have that you can stream blocks as they are needed.
  • 0
    I think the brain has been trapped in the old PHP 5.3 ZIP module, where ZIP files could be extracted via the API but streaming via the API was not possible.