Comments
-
The original 0.15% entropy improvement over Shannon entropy shouldn't even be possible, and it's what got me thinking.
Because I ran out of room in the original post:
Each output of [H0, H1, H2, H3, salt, depth, starting value] becomes a block.
We append all these blocks together into a layer.
This layer can then be converted into *another* block and fed through the entire process.
Blocks can be processed in parallel, whether encoding or decoding. During decoding, layers have to be decoded serially back into blocks before those blocks can be decoded in parallel, in a process of unpacking.
We repeat this process, blocks into a layer, layers into blocks.
If it works, it should improve on the lowest demonstrated SOTA limit of compression.
To make block encoding/decoding at least break even, the depth of numbers a block can reconstruct must be at least 22, because each block has 8 numbers to reconstruct the hashes, along with the final salt value, the depth, and its starting value.
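A minimal sketch of the block layout as I understand it (field names, types, and the breaks_even helper are my own assumptions, not the actual code):

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layout of one block: four hash functions at two
# coefficients each (8 numbers), plus salt, depth, and starting value.
@dataclass
class Block:
    h0: Tuple[int, int]   # (a, b) pairs defining each modular hash
    h1: Tuple[int, int]
    h2: Tuple[int, int]
    h3: Tuple[int, int]
    salt: int             # constant offset recovered during decoding
    depth: int            # how many original values this block reconstructs
    start: int            # first value of the input slice it encodes

def breaks_even(block: Block, min_depth: int = 22) -> bool:
    # Per the comment above, a block only pays for itself once its
    # depth reaches roughly 22 reconstructed values.
    return block.depth >= min_depth
```
-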
retoor: I'm happy to be mentioned (typosaurus). Love this article but didn't understand all of it. If you want, publish your code. If you need an anonymous repo, use molodetz.nl with a false email address; there's no verification needed! If you sign in, you'll see a GitHub-like environment. Works exactly the same.
Again, thanks for the mention. Very happii. -
@retoor Hypothetically it lets you store absolutely absurd amounts of data in 13 measly values. 2 values for each hash, and one value per other variable per block.
Layer stack size and block count are of course unbounded values, commensurate with the number of blocks and layers necessary to fully compress a given set of data. Block count will probably be a fixed count, found by parameter sweep, coefficient estimation, or curve fitting to some polynomial, so that leaves layer count as a potentially big number that grows at some unknown rate with the size of the data.
Hypothetically you could compress any amount of data into those 13 values, as long as you're comfortable with absurd amounts of decompression time.
I jailbroke Shannon entropy (if it works, big if), to the lowest limit that will likely ever be achieved, by anyone, ever.
If not, I still got that measly 0.15% entropy improvement over the known theoretical limit. -
@Wisecrack I'm still reading through it, and it reminds me of a friend telling me about people storing data using pi (the end result was just some random offset and length), but it looks fascinating :D
-
@BordedDev Clever use of pi btw. I've seen mathematicians ask whether pi contains all sequences. If it does, then it's sufficient to find the offset into pi that contains your data.
If pi doesn't, then the question becomes: is there an irrational number that *does* contain all sequences, and how would we go about finding, defining, and calculating it? Then the next question becomes: is there a way to calculate its digits at any given arbitrary offset?
And is there a way to find an offset matching any arbitrary data, without calculating all the prior digits?
That's a fun one to think about.
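A toy illustration of the offset idea, assuming mpmath is installed (find_in_pi and the digit count are mine, purely for demonstration):

```python
from mpmath import mp

def find_in_pi(target: str, digits: int = 100_000) -> int:
    """Search the first `digits` decimal digits of pi for a digit string
    and return its offset into the fractional part, or -1 if absent."""
    mp.dps = digits + 10                  # working precision
    pi_str = mp.nstr(mp.pi, digits)       # "3.14159..."
    fractional = pi_str.split(".", 1)[1]
    return fractional.find(target)

# "Storing" a payload as (offset, length): recompute pi to decompress.
payload = "265358"
print(find_in_pi(payload), len(payload))
```
-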
@BordedDev also the image here is dense, but the bottom right is how individual values are processed in the normal case.
The bottom-left corner has 3 notes that really contain the meat of the work.
The heavy lifting is done by finding a hash function h2 that, given z(n), outputs x(n-1),
AND also such that h2(y(n)) outputs z(n-1), because we know that h2(z) gives us y for a given z in the single-input case, and h3(y) gives x.
So having a hash function h2(), and a sequence-compressed z, means automatically getting z(n-1), and thus y(n-1), which with h3 gives us the next x value in the sequence during decompression.
The result is we go from z(n-1), to y(n-1), to x(n-1),
then z(n-2), to y(n-2), to x(n-2), and so on, rebuilding the original data in reverse, one element at a time.
People can get confused b/c of hashes like md5, where the output is letters and numbers. Here our hashes are numbers using a modular algorithm. Numbers go in, numbers come out. That means you have to convert inputs to integers first.
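A rough sketch of that backward walk, with h2 and h3 as stand-in affine modular maps (toy constants of my own; the real functions come out of the block search, so this only shows the control flow):

```python
P = 10007  # toy prime modulus, assumed larger than any encoded value

def h2(v: int) -> int:          # stand-in hash: numbers in, numbers out
    return (3511 * v + 17) % P

def h3(v: int) -> int:
    return (4099 * v + 523) % P

def decode_backwards(z_n: int, depth: int) -> list:
    """Walk z -> y -> x, then use y to get the previous z, and repeat,
    rebuilding the original values in reverse (as described above)."""
    xs, z = [], z_n
    for _ in range(depth):
        y = h2(z)               # h2(z) gives the matching y
        xs.append(h3(y))        # h3(y) gives the original x at this position
        z = h2(y)               # h2(y) gives the previous z in the sequence
    return xs[::-1]             # decoded in reverse, so flip back

print(decode_backwards(z_n=42, depth=5))  # structure only; toy hashes, toy output
```
-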
@Lensflare > I only know bloom from computer graphics. Is it related?
I'm not familiar with computer graphics (I say, as I look at a computer monitor while typing this comment). -
I know it's not related. But I wonder if we will ever have a VM that simulates DNA in a cell. So anyone could take a DNA sequence and watch it function in a cell VM.
-
@Lensflare Took a quote from there "One physical basis of bloom is that, in the real world, lenses can never focus perfectly. Even a perfect lens will convolve the incoming image with an Airy disk (the diffraction pattern produced by passing a point light source through a circular aperture).[2] Under normal circumstances, these imperfections are not noticeable, but an intensely bright light source will cause the imperfections to become visible. As a result, the image of the bright light appears to bleed beyond its natural borders.
The Airy disc function falls off very quickly but has very wide tails (actually, infinitely wide tails)."
So not related. But it makes me think: if LLMs sometimes generate wrong answers because of issues with the tails of their distributions, what would a statistical distribution and attention mechanism that mimics the Airy disk look like?
Tails are after all infinite.
Probably need some non-image interpretation of the phenomenon first. It's fascinating though.
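For reference, the Airy intensity profile the quote describes is I(x) = (2*J1(x)/x)^2; a tiny sketch of it (scipy assumed available), just to make the "infinite tails" point concrete:

```python
import numpy as np
from scipy.special import j1   # first-order Bessel function of the first kind

def airy_intensity(x: np.ndarray) -> np.ndarray:
    """Normalized Airy pattern I(x) = (2*J1(x)/x)^2, with I(0) = 1."""
    out = np.ones_like(x, dtype=float)
    nz = x != 0
    out[nz] = (2 * j1(x[nz]) / x[nz]) ** 2
    return out

x = np.linspace(1, 200, 2000)
tail = airy_intensity(x)
print(tail[-5:])   # the envelope falls off roughly like 1/x**3 but never reaches zero
```
-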
@Demolishun "I know its not related. But I wonder if we will ever have a VM that simulates dna in a cell. "
If we ever achieve workable and efficient quantum processors, you can bet thats exactly something that will be developed to run on them. -
@Lensflare
As primitive as quantum is now, it'll probably be a pong clone first, but I wouldn't be surprised at all.
Optical quantum seems more practicable in all cases.
I'm surprised they haven't tried to use whether an output has an interference pattern or not (the quantum eraser, and two-slit experiment) to build optical quantum gates. -
@retoor It still needs to be cooled to sub-zero temperatures, but I thought it was an interesting project all in all, from Microsoft of all places.
Optical quantum computing is a pretty neglected field compared to some of the other approaches. -
retoor: @Wisecrack I watched a video about it, and I think the path they've chosen will always include sub-zero temperatures. But that thing looked like a normal processor tho.
-
@Wisecrack I heard about an electrical engineer who designed optical computing stuff that he tried to patent. They rejected his patents and the government started gang stalking him. He went from computer designer genius to disabled because of the measures they used against him. So I think there is weirdness around optical computing. This also isn't the first time I have heard of the patent office suppressing tech.
I have heard there are people going after optical computing recently though. -
You're a madwoman.
I didn't even bother to read further than the first three paragraphs, as I simply wouldn't understand it.
Good to see you back here. -
@Ranchonyx I'll post some code sometime this weekend so people can play around with the system, see it in action, and work out first principles from demonstration.
The simplest version was generating two hash functions, h0, and h1.
h0 takes a list of numbers x, and outputs a seemingly random looking list of numbers y.
The magic happens with h1. In the earliest versions, h1 took y and output x again.
The trick was to write a function that, given h0, x, and y, found a hash function (h1) that output x again.
It's kinda beautiful actually and way simpler than it appears on the surface.
The first time you run gen_hash, and then plug in a list of numbers, or a converted string, only to get a seemingly random output, and then plug that into *another* hash which returns your original output 'unhashed', you'll understand the entire principle of the thing immediately.
Everything follows from that.
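A minimal sketch of that round trip, assuming simple affine modular hashes of the form (a*x + b) % p; gen_hash_pair and find_decoder are my own names, not the code from the post:

```python
import random

P = 10007  # prime modulus, assumed larger than any value we encode

def make_hash(a: int, b: int, p: int = P):
    """Affine modular hash: numbers go in, numbers come out."""
    return lambda x: (a * x + b) % p

def find_decoder(xs, ys, p: int = P):
    """Search for (a1, b1) such that (a1*y + b1) % p == x for every pair.
    Once a1 is fixed, b1 is forced by the first (x, y) pair."""
    x0, y0 = xs[0], ys[0]
    for a1 in range(1, p):
        b1 = (x0 - a1 * y0) % p
        if all((a1 * y + b1) % p == x for x, y in zip(xs, ys)):
            return a1, b1
    return None

def gen_hash_pair(xs, p: int = P):
    """Pick a random encoder h0, then find an h1 that undoes it."""
    a0, b0 = random.randrange(1, p), random.randrange(p)
    ys = [make_hash(a0, b0)(x) for x in xs]
    return (a0, b0), find_decoder(xs, ys, p), ys

data = [ord(c) for c in "hello devrant"]       # convert input to integers first
h0, h1, encoded = gen_hash_pair(data)
decoded = [make_hash(*h1)(y) for y in encoded]
assert decoded == data                          # seemingly random out, original back
```
-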
@iiii Not as weird as all the prior math.
The math for hash functions is pretty normal; it's the way it's being used that is novel.
Related Rants
Hey, been gone a hot minute from devrant, so I thought I'd say hi to Demolishun, atheist, Lensflare, Root, kobenz, score, jestdotty, figoore, cafecortado, typosaurus, and the raft of other people I've met along the way and got to know somewhat.
All of you have been really good.
And while I'm here it's time for maaaaaaaaath.
So I decided to horribly mutilate the concept of bloom filters.
If you don't know what that is: you take two random numbers, m and p, both prime, where m < p, and generate two numbers a and b that define a function. That function is a hash.
Normally you'd have say five to ten different hashes.
A bloom filter lets you probabilistically say whether you've seen something before, with no false negatives.
It lets you do this very space efficiently, with some caveats.
Each hash function should be uniformly distributed (any value input to it is likely to be mapped to any other value).
Then you interpret these output values as bit indexes.
So Hi might output [0, 1, 0, 0, 0]
while Hj outputs [0, 0, 0, 1, 0]
and Hk outputs [1, 0, 0, 0, 0]
producing [1, 1, 0, 1, 0]
And if your bloom filter has bits set in all those places, congratulations, you've seen that number before.
It's used by big companies like Google to prevent re-indexing pages they've already seen, among other things.
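A bare-bones sketch of that kind of filter, using the (a, b, p, m) hash family just described (sizes are arbitrary):

```python
import random

class BloomFilter:
    def __init__(self, num_bits: int = 64, num_hashes: int = 5, p: int = 10007):
        self.m, self.p = num_bits, p
        self.bits = [0] * num_bits
        # Each hash is defined by two numbers a and b: h(x) = ((a*x + b) % p) % m
        self.params = [(random.randrange(1, p), random.randrange(p))
                       for _ in range(num_hashes)]

    def _indexes(self, x: int):
        return [((a * x + b) % self.p) % self.m for a, b in self.params]

    def add(self, x: int):
        for i in self._indexes(x):
            self.bits[i] = 1

    def maybe_contains(self, x: int) -> bool:
        # No false negatives; false positives are possible.
        return all(self.bits[i] for i in self._indexes(x))

bf = BloomFilter()
bf.add(731)
print(bf.maybe_contains(731))   # True
print(bf.maybe_contains(999))   # almost certainly False
```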
Well I thought, what if instead of using it as a has-been-seen-before filter, we mangled its purpose until a square peg fit in a round hole?
Not long after, I went and wrote a script that 1. generates data, 2. generates a hash function to encode it, and 3. finds a hash function that reverses the encoding.
And it just works. Reversible hashes.
Of course you can't use it for compression strictly, not under normal circumstances, but these aren't normal circumstances.
The first thing I tried was finding a hash function h0, that predicts each subsequent value in a list given the previous value. This doesn't work because of hash collisions by default. A value like 731 might map to 64 in one place, and a later value might map to 453, so trying to invert the output to get the original sequence out would lead to branching. It occurs to me just now we might use a checkpointing system, with lookahead to see if a branch is the correct one, but I digress, I tried some other things first.
The next problem was 1. long sequences are slow to generate. I solved this by tuning the number of iterations of the outer and inner loop. We find h0 first, then h1, put all the inputs through h0 to generate an intermediate list, then put that through h1 and see if the output of h1 matches the original input. If it does, we return h0 and h1. It turns out it can take inordinate amounts of time if h0 lands on a hash function that doesn't play well with h1, so the next step was 2. adding an error margin. Something fun happens here: if you allow a sequence generated by h1 (the decoder) to match *within* an error margin, under a certain error value it'll find potential hash functions hn such that the outputs of h1 are *always* the same distance from their parent values in the original input to h0. This becomes our salt value k.
So our hash-function generator, called encoder_decoder() or 'ed' (lol, two-letter functions), also calculates the k value and outputs it along with the hash functions for our data.
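A sketch of how that salt could fall out of the search, again with toy affine hashes; encoder_decoder here is my reconstruction of the described behaviour, not the real 'ed':

```python
def encoder_decoder(xs, ys, p: int = 10007, margin: int = 8):
    """Search for a decoder (a1, b1) and a constant salt k (|k| <= margin)
    such that (a1*y + b1) % p == x + k at every position. Every decoded
    value then sits the same distance k from its original, so decoding
    is simply h1(y) - k."""
    for a1 in range(1, p):
        for k in range(-margin, margin + 1):
            b1 = (xs[0] + k - a1 * ys[0]) % p   # force the offset k at position 0
            if all((a1 * y + b1) % p == x + k for x, y in zip(xs, ys)):
                return (a1, b1), k
    return None

# Usage (xs is the original data, ys the encoder output):
#   (a1, b1), k = encoder_decoder(xs, ys)
#   decoded = [((a1 * y + b1) % 10007) - k for y in ys]
```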
This is all well and good, but what if we want to go further? With a few tweaks, along with taking the output values, converting them to binary, and left-padding each value with 0s, we can then calculate Shannon entropy in its most essential form.
Turns out with tens of thousands of values (and tens of thousands of bits), the output of h1 with the salt has a higher entropy than the original input. Meaning finding an h1 and h0 hash function for your data is equivalent to compression below the known Shannon limit.
By how much?
Approximately 0.15%
Of course this doesn't factor in the five numbers you need: a0 and b0 to define h0, a1 and b1 to define h1, and the salt value, so it probably works out to about the same. I'd like to see what the savings are with even larger sets though.
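A sketch of the entropy measurement as I read it: fixed-width binary, then per-bit Shannon entropy (the 16-bit width is my assumption):

```python
import math
from collections import Counter

def bit_entropy(values, width: int = 16) -> float:
    """Concatenate the values as zero-padded fixed-width binary strings and
    compute the Shannon entropy of the bit stream, in bits per bit
    (1.0 is the maximum for a binary alphabet)."""
    bits = "".join(format(v, f"0{width}b") for v in values)
    n = len(bits)
    return -sum(c / n * math.log2(c / n) for c in Counter(bits).values())

# The 0.15% figure above comes from comparing this measure on the
# original input against the salted decoder's output.
```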
Next I said, well what if we COULD compress our data further?
What if all we needed were the numbers to define our hash functions, a starting value, a salt, and a number to represent 'depth'?
What if we could rearrange this system so we *could* use the starting value to represent n subsequent elements of our input x?
And thats what I did.
We break the input into blocks of 15-25 items, because that's the fastest to work with and find hashes for.
We then follow the math, to get a block which is
H0, H1, H2, H3, depth (how many items our first item will reproduce), and a starting value, i.e. the first item in this slice of our input.
x goes into h0, giving us y. y goes into h1 -> z, z into h2 -> y, y into h3, giving us back x.
The rest is in the image.
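A toy version of that four-hash chain, with stand-in affine maps just to show the direction of each step (in the real scheme h2 and h3 are found by search, not derived analytically like here):

```python
P = 10007  # toy prime modulus

def affine(a: int, b: int):
    return lambda v: (a * v + b) % P

def affine_inverse(a: int, b: int):
    """Analytic inverse of an affine map mod a prime, standing in for the
    searched-for decoders."""
    a_inv = pow(a, -1, P)
    return affine(a_inv, (-a_inv * b) % P)

h0, h1 = affine(3, 7), affine(11, 5)                  # stand-in encoders
h2, h3 = affine_inverse(11, 5), affine_inverse(3, 7)

x = 1234                 # the block's starting value
y = h0(x)                # x goes into h0, giving y
z = h1(y)                # y goes into h1, giving z
assert h2(z) == y        # z into h2 gives y back
assert h3(h2(z)) == x    # y into h3 gives back x
```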
Anyway good to see you all again.
rant
im back
math