Comments
-
In the end, it's not necessary to identify a file as some specific type I'd say. This knowledge is useless, as long as the file does not follow any standard encoding (json, xml, ...). So all you really need to know is that a specific program can read it.
And that's why we got extensions. To remember the program it's meant to be used with. -
Hazarth: Feels like you're talking about data recovery.
The issue with identifying content by its actual content is that you can never be sure what type, version, encoding, compression, encryption or other types of obfuscation are applied. There are way too many variables for this to be 100% effective. However, you certainly could try to recover at least some files purely by their content, even with completely damaged headers and missing magic numbers. I can imagine that neural networks could do some of the job for you, say by either direct classification (e.g. find a way to classify raw data?) or by output classification (e.g. try to read the file as all types of text, pixel or audio data and ask the NN if the output makes sense instead of just being random noise?)
But then, you could also get a file with actual random data, and then how do you discern randomness from just a filetype you've never seen before? -
You can try to find some known structure in the partial data, or just try to interpret it as something and check whether that works out.
There are some formats that will always match: Everything could be part of an encrypted archive. Everything could be uncompressed image data.
But you could use some heuristics to assess the likelihood of the data being of a certain type. Encrypted stuff normally doesn't show strong patterns. People don't often save pure noise images.
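One such heuristic is byte entropy. Here is a minimal sketch, assuming Shannon entropy over byte frequencies is a good enough stand-in for "shows strong patterns or not". The function names and both thresholds are made up for illustration and are not calibrated against real files.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rough_guess(data: bytes) -> str:
    """Bucket a chunk by entropy. Thresholds are illustrative assumptions."""
    h = shannon_entropy(data)
    if h > 7.5:
        return "high entropy: compressed, encrypted, or random"
    if h < 5.0:
        return "low entropy: text or other strongly patterned data"
    return "medium entropy: inconclusive"
```

Note that this cannot tell encrypted data from compressed data or from actual noise. It only narrows things down, which is all a heuristic like this can do.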
I recommend starting the research by looking at existing data recovery software, which normally has to deal with partially overwritten, and therefore partial, data. -
@nitwhiz Note I mentioned a lack of header information. It seems to me the answer is that an AI could be trained to come up with a few best guesses as to what it is looking at, unless there is a perfectly reliable way of identifying usable data.
For example, in a packed file database you might have a packed byte record that contains a starting autoincrement column in an 8-byte storage location, followed by a 20-character varchar field consisting of a length descriptor and x bytes of alphanumeric data.
That an AI might be able to identify.
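For what it's worth, a layout like that can also be found with a plain scan instead of a trained model. Below is a rough sketch, with the caveat that the details are assumptions not stated above: a little-endian unsigned 8-byte id, a 1-byte length descriptor, non-empty ASCII alphanumeric text, and records stored back to back. `parse_record` and `find_record_runs` are hypothetical helper names.

```python
import struct


def parse_record(buf: bytes, pos: int):
    """Try to read one record at pos; return (id, text, next_pos) or None."""
    if pos + 9 > len(buf):
        return None
    (rec_id,) = struct.unpack_from("<Q", buf, pos)  # assumed little-endian id
    length = buf[pos + 8]                           # assumed 1-byte length
    end = pos + 9 + length
    if length > 20 or end > len(buf):
        return None
    text = buf[pos + 9:end]
    if not text.isalnum():  # bytes.isalnum() is ASCII-only, False when empty
        return None
    return rec_id, text, end


def find_record_runs(buf: bytes, min_run: int = 3):
    """Yield (offset, records) for runs of records whose ids increase by 1."""
    pos = 0
    while pos < len(buf):
        run, cur, expected = [], pos, None
        while True:
            rec = parse_record(buf, cur)
            if rec is None or (expected is not None and rec[0] != expected):
                break
            run.append(rec[:2])
            expected, cur = rec[0] + 1, rec[2]
        if len(run) >= min_run:
            yield pos, run
            pos = cur  # continue after the run that was just reported
        else:
            pos += 1   # no convincing run here, slide forward one byte
```

A single record proves nothing, since almost any bytes parse as "some id plus a short string". Requiring several consecutive ids in a row is what makes a hit believable.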
Pixel data would be harder unless you added an Image classifier and the image itself actually depicted something recognizable beyond a shadow of a doubt. -
@Oktokolo you mean like PhotoRec ? I betcha anything that it looks for the header in most cases.
Even shapefiles contain record counts for example. -
@SlaveOfTime I only used scalpel and it actually does only magic byte sequence detection too. But it surely is possible to search for known structures inside a file too.
An easy example would be searching for partial plain text files: One could search for sectors containing only valid Unicode (allow for truncation at the start and end of the sector) in a common encoding and match the contained words against a dictionary of common words to make sure that it actually contains text. -
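A minimal sketch of that text check, assuming UTF-8 as the "common encoding" and a tiny English stand-in word list (a real tool would load a proper dictionary). The function names and the 0.2 ratio are illustrative assumptions.

```python
import codecs
import re

# Tiny stand-in word list; a real tool would load a full dictionary.
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}


def decode_partial_utf8(sector: bytes):
    """Decode UTF-8 that may be cut mid-character at either end, else None."""
    start = 0
    # Skip up to 3 leading continuation bytes (0b10xxxxxx) left by truncation.
    while start < min(3, len(sector)) and (sector[start] & 0xC0) == 0x80:
        start += 1
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates an incomplete character at the very end.
        return decoder.decode(sector[start:], final=False)
    except UnicodeDecodeError:
        return None


def looks_like_text(sector: bytes, min_ratio: float = 0.2) -> bool:
    """True if the sector decodes and enough of its words are common words."""
    text = decode_partial_utf8(sector)
    if text is None:
        return False
    # Drop the first and last word: the sector edge may have cut them off.
    words = re.findall(r"[a-z]+", text.lower())[1:-1]
    if not words:
        return False
    hits = sum(w in COMMON_WORDS for w in words)
    return hits / len(words) >= min_ratio
```

The dictionary step matters because plenty of binary data happens to be valid UTF-8 (a run of zero bytes, for instance), so a successful decode alone is weak evidence.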
Another example would be PDF: That Postscript dialect is full of similar-looking object headers (even if the actual data is compressed). So you could be reasonably sure to have found a partial PDF if you suddenly stumble upon such headers and can actually use the information in the header to obtain a valid PDF code segment. Obviously you can't do shit with incomplete objects, but if you get some fully intact objects containing drawings, images or text, you could just place them on a blank page and save them as a new file. If you get lucky and actually find a full page worth of definitions, you could also save that whole page...
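To make the PDF case concrete, here is a rough sketch that pulls intact objects out of a fragment by looking for the `N G obj` header and the matching `endobj` keyword. `find_pdf_objects` is a hypothetical helper, and it assumes `endobj` never shows up by accident inside binary stream data, which a real tool could not take for granted.

```python
import re

HEADER_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
END_RE = re.compile(rb"\bendobj\b")


def find_pdf_objects(fragment: bytes):
    """Return (object number, generation, body) for each complete object."""
    headers = list(HEADER_RE.finditer(fragment))
    objects = []
    for i, head in enumerate(headers):
        # Only look for the end marker before the next object header starts.
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(fragment)
        end = END_RE.search(fragment, head.end(), limit)
        if end is None:
            continue  # truncated object, nothing usable
        body = fragment[head.end():end.start()].strip()
        objects.append((int(head.group(1)), int(head.group(2)), body))
    return objects
```

Objects cut off at the start of the fragment have no header and objects cut off at the end have no `endobj`, so both are skipped and only the fully intact ones come back.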
Searching for and handling partial data in a meaningful way is harder than just concentrating on whole files. But it certainly can be done and depending on what you try to recover, it might very well be worth the dev time. -
@Oktokolo this is one of those examples where I think I saved time by just specifically honing in on my own personal use for the thing. sigh. still I feel like playing around a bit.
you wouldn't believe all the things a person never heard of in here ! -
@SlaveOfTime we got AoK - so chances are, we all will have read it all eventually...
I've been sitting here staring at extension types and I wonder: what if I had a partial file with partial data?
In general, one could say that whenever a header that would ALWAYS carry some identifying characteristics is missing, then even given a characteristic, statistically frequent pattern in the data, there is always a null case that appears as total chaos.
But I wonder: beyond simply trying every goddamn possible combination of things until meaningful data is extracted, is there a way to identify a file by its content when the part of that content usually used for such a purpose is missing?
What kind of application or technology would be required for this? Certainly not neural networks, but obviously some kind of AI, right?
rant