llm

Ranter

Wisecrack

9195

Comments

2

Wisecrack

9195

2y

@retoor first, I checked it out, that looks interesting. Especially because so few people are willing to use LSTMs anymore with the new shiny thing on the block.

I prefer relu because it's pretty standard all things being equal.

Incidentally, there's no non-linearity happening here. There's not even multiplication happening.

It's all about using noise to map a given distribution to another distribution, with the hard part simply being a permutation search for a given key layer that forms an isomorphic projection between the input and output distribution.

The idea is that instead of doing a computation, we do a sampling of noise, and filter by simple comparison for noise that does the mapping for us. (There are also a lot of different types of noise, all with different properties, like pink noise, white noise, black noise, etc.)
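
A minimal sketch of that sample-and-filter idea, assuming the "key layer" can be modeled as a permutation of a tiny vocabulary. The name `find_key`, the toy vocabulary, and the try budget are all hypothetical, and this is an illustration of the principle rather than the actual implementation:

```python
import random

def find_key(inputs, outputs, vocab, tries=10_000, seed=0):
    """Sample random permutations of the vocabulary (the "noise") and keep the
    first one that maps every input token to its paired output token.
    Only sampling and equality comparison are used: no multiplication,
    no non-linearity."""
    rng = random.Random(seed)
    for _ in range(tries):
        shuffled = vocab[:]
        rng.shuffle(shuffled)                # draw one noise sample
        key = dict(zip(vocab, shuffled))     # candidate key layer (a permutation)
        if all(key[i] == o for i, o in zip(inputs, outputs)):  # filter by comparison
            return key
    return None  # nothing matched within the try budget

vocab = list("abcd")
key = find_key("abca", "cdbc", vocab)
print(key)
```

With a four-symbol vocabulary there are only 24 candidate permutations, so random search finds the match quickly. The search space grows factorially with vocabulary size, which is why the permutation search is the hard part.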
2

Wisecrack

9195

2y

I've also left out a TON of information that makes this sound like handwaving and woo.

Things like how you *also* have graph tokens, and a graph token dictionary. Graph tokens are special nodes that seek through scattershot embeddings of a context window, and perform specific functions, like counting instances of words or tokens, checking presence, or lack of tokens, performing matching, performing token substitutions, etc.

The same encoder-decoder that pairs input tokens to output tokens, also does the same for sequences of graphs and subgraph nodes.

How does this interact with inputs and outputs?

Well, during training, many graph contexts are generated for each input, and this is done in parallel with training on tokens.

However, the critical bit is, during inference the input is processed and an output is generated first, and only then are the graphs run on the token context window for the respective input (queries) and outputs (values).
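
A rough sketch of the ordering described above, assuming graph tokens can be modeled as plain functions over a context window. `GRAPH_TOKENS`, `decode`, and `infer` are hypothetical names, and `decode` is only a stand-in that echoes its input rather than a real encoder-decoder:

```python
# Hypothetical graph-token dictionary: each entry performs one function on a context window.
def count_token(target):
    return lambda ctx: ctx.count(target)

def has_token(target):
    return lambda ctx: target in ctx

def substitute(old, new):
    return lambda ctx: [new if t == old else t for t in ctx]

GRAPH_TOKENS = {
    "count_the": count_token("the"),
    "has_cat": has_token("cat"),
    "dog_to_cat": substitute("dog", "cat"),
}

def decode(query_tokens):
    """Stand-in for the token encoder-decoder; here it simply echoes the input."""
    return list(query_tokens)

def infer(query_tokens, graph_names):
    values = decode(query_tokens)            # 1. the output is generated first
    context = list(query_tokens) + values    # 2. context window: input (queries) + output (values)
    results = {name: GRAPH_TOKENS[name](context) for name in graph_names}  # 3. only then do the graphs run
    return values, results

values, results = infer(["the", "dog", "saw", "the", "dog"], ["count_the", "has_cat"])
print(results)
```

The point of the sketch is only the order of operations: the graphs see the finished output alongside the input, instead of taking part in generating it.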
2

Wisecrack

9195

2y

What do the graphs really do here though?

If the same embedding technique is used to encode and decode the graphs as is used for the context tokens, then we really only need to ask: what do the graphs do internally?

First, we assume the graphs themselves are a sort of embedding based on our particular embedding technique.

Second, you'll be prone to saying "given the current technique, you can train the system to output anything, any random string of tokens or words based on any random input." That's correct, but it's also true for all LLMs. They're only as good as their training data specifically, and their architecture and underlying math generally. That is to say, the world is its own ground truth, and much as a sufficiently expressive axiomatic system cannot prove its own consistency from within (incompleteness), there is no way to say, definitively, that anything a machine outputs is congruent with the reality outside that *merely* represented by its own training data.
2

Wisecrack

9195

2y

Third, just as each graph token in the graph token dictionary performs a different function, from looking for particular word token entries (or sets of them), to constraining token order, to counting instances, its function is divorced from its representation. A graph, being an embedding, also HAS an embedding.

What it does is add additional state information, and generate *new* state, based on the input.

Graph objectives are therefore trained and tested and validated in parallel, to input/output pairs.

Why is this important?
2

Wisecrack

9195

2y

Because if you compressed a lot of training data using the Ln autoencoder method, along with relevant graphs, then the noise embedding retrieved for the most similar training data, an embedding introduced by a *new* input and its variance, acts to modify the mean of the graph representation, which changes which graph tokens are generated from that representation's embedding.

Think about it as dual encoder-decoder pairs, where during training they run in parallel, one for tokens, and one for graphs. And during inference, the decoder outputs tokens from the encoder's input, and only *then* does the selected graph for the decoder's network mean output modify said output.
3

Wisecrack

9195

2y

...In much the same way that modern research reveals through MRI that we often formulate our responses automatically without being aware of the fact, and then filter and modify them deliberately on the fly only *after* becoming aware our brain has determined a response. There is supporting evidence of this in the motor system, where the premotor cortex is known to generate many possible responses and motions in any given situation, and uses inhibitory networks to filter and select for optimal movements for the given situation.
2

Wisecrack

9195

2y

DevRant butchering my sentences and linebreaks notwithstanding, I hope any of that is readable at all.

The final piece is aligning the function of the generated trained/tested graphs to their embeddings, so in this way, finding the best embedding for mapping an input to a generated output, is equivalent to performing some function on the output that brings the final output closer to the distribution of the input.
2

Wisecrack

9195

2y

The bonus is that all the learning and functions happen at the symbolic level, without hard coding any rules, making such a network highly modifiable and inspectable.

And for any given artificial rule, it becomes trivial to put one together, and then find a matching embedding that works for the training/test/validation data.

If any of this doesn't make any sense, I've likely butchered it further, and left out pieces while I was trying to splice around devRant's post and comment character limit.

Probably the first *solid* test of the system would be a character-level assessment (ASCII characters as tokens) to teach it to identify word spacing.

Some graph tokens I considered included ones that would produce continuations past a default length, ones that would modify the target distribution, and ones that would embed 'ephemeral tokens' (they don't show up as printable characters or visible tokens but modify the context and the graph's function in some way, or act as state placeholders, etc).
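
For the character-level word-spacing test mentioned above, here is one way the training pairs could be built. This is a sketch under the assumption that the task is framed as "does a space follow this character?"; `spacing_pairs` and `restore_spacing` are hypothetical helper names:

```python
def spacing_pairs(text):
    """Turn spaced text into (characters, labels) for a character-level test:
    labels[i] is 1 if a space followed characters[i] in the original text."""
    chars, labels = [], []
    for ch in text:
        if ch == " ":
            if labels:
                labels[-1] = 1   # mark the previous character as a word end
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels

def restore_spacing(chars, labels):
    """Rebuild the spaced string from characters and (predicted) labels."""
    return "".join(ch + (" " if flag else "") for ch, flag in zip(chars, labels))

chars, labels = spacing_pairs("the cat sat")
print("".join(chars), labels)
print(restore_spacing(chars, labels))
```

A system under test would see only `chars` and be scored on how well its predicted labels match `labels`.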
2

Wisecrack

9195

2y

Divorcing representation from function was probably the major thing.

Ironically it is about as fast as most of the 8b models I ran on my system, producing maybe 20 tokens per minute.

Right now just the token encoder-decoder is built and functional, but it was proof enough that the entire premise has merit.

After character level, I'll train it to do word detection (spaces, character chunking).

That'll be proof that the graph method is also viable.

From there component word extraction (noun, verb, adjective, subject, etc).

A lot of this was initially inspired by a brief article I read on Gaussian splatting, and by research into the subgraph problem that LLMs deal with: they use subgraphs in training data to pattern match (for some definition of pattern matching) in order to solve problems in a way that, once revealed, is counter-intuitive, and not in the fashion the generated output itself may at first even claim.
2

Wisecrack

9195

2y

There are a few bits and bobs I left out, like how Levenshtein distance is used to mutate the transition matrix, or how parts of it (if you squint real hard) are PageRank (or PageRank-like) shoe-horned into a completely non-PageRank-shaped problem.
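
For reference, a standard Levenshtein (edit) distance in plain Python. How the distance is actually used to mutate the transition matrix isn't spelled out here, so this sketch covers only the distance itself:

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions, or
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                       # turning a[:i] into the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free when elements match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

It works on any pair of sequences, so it applies to lists of tokens as well as to strings.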
2

Wisecrack

9195

2y

Also, this is probably the least shit-posty post I've made in a while, thanks for being the first to respond!

I got to go to bed before work, so if I don't respond to any comments right away, I will when I'm back up.
2

Wisecrack

9195

2y

@retoor thanks for that.

I think I read an article a few years back that explained the same exact principle in the same style of experiment. Identifying boolean values.

Backpropagation is still a beautiful thing.

I'll make illustrations with source code before long (with way less over-explanation) when the graph portion of the code comes along.

But for example the entire process could be inverted, with the context window composed of graph tokens, and the prior graph layer being composed of token dictionary embeddings, or at least achieve this in effect by finding token dictionary embeddings, used as a noise layer, that map one graph sequence to another.
2

Wisecrack

9195

2y

The output of the sequence might of course look like the stream-of-subconscious babble a guy might produce while flying high on DMT and talking to machine elves, or some other variation of hallucination: the words produced would look random and have no correlation to their real-world meaning. They'd just be a higher-dimensional vector to represent the mapping, using words (and their random embeddings) instead of straight numbers.

But hypothetically, these random looking phrases, would be functions that compress the graph sequences in question wherever we find an autoencoder input where the output is shorter.

Likewise hidden dimensions of a graph sequence could be blown up by finding these 'random word phrases' (representing random vectors) that map shorter graph sequences to longer graph sequences.

God damn, it really does come off as mere babble.
1

Wisecrack

9195

2y

Also what's your understanding of ReLU vs sigmoid?
1

CoreFusionX

3480

2y

@retoor

Not math wizard like Mr. @Wisecrack, but can try.

ReLU is basically a 0-clamped ramp function, while sigmoid is an S-shaped curve built on an exponential, 1 / (1 + e^-x).

Deep NNs use relu mainly because it's significantly faster to compute (no exponents), significantly faster to backpropagate (just an if), and produces less gradient saturation than sigmoid (as in, more places in the function where the derivative is far enough from 0), which speeds up learning.
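
To make the comparison concrete, a small sketch of both functions and their derivatives using only the standard library. The sample inputs are arbitrary:

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # "Just an if": the slope is 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0, shrinks toward 0 in both tails

for x in (-10.0, -1.0, 0.5, 10.0):
    print(x, relu_grad(x), sigmoid_grad(x))
```

The sigmoid derivative never exceeds 0.25 and collapses toward zero for large positive or negative inputs, which is the saturation mentioned above. The ReLU derivative stays at exactly 1 for every positive input.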
1

figoore

233

2y

the comment butchering made the best cliff hanger there
> why is that important?

And i can confirm… f*ck it… it worked on me 😂
Pretty interesting writings, love it! @Wisecrack
1

figoore

233

2y

@retoor can’t agree more on the jail conspiracy theory 😂
Thank god @Wisecrack is doing maths and not meths 😂
1

Wisecrack

9195

2y

@CoreFusionX "more places in the function where the derivative is far enough from 0"

Isn't it ReLU that produces derivatives close to the mean too, or am I confusing it with another non-linearity?
1

Wisecrack

9195

2y

@figoore I never said I'm not on ritalin.

But for the record I'm not.

I'm glad you enjoyed the post, even though there's no nice and shiny graphics to explain what the fuck I'm even talking about.
1

Wisecrack

9195

2y

@retoor like a generator-discriminator pair!

That's fucking cool.

"beat the machine" as it were.

I like it. I like it a lot.

Same on the sleepless nights. Laying down, staring up at the white painted ceiling in the dark, thoughts going a million miles an hour, while the street lights filter in through my blinds.

Or pacing on my porch, smoking cigarette after cigarette, beneath the moonlight while turning new ideas over in my head, what will work, what won't work, what might work, what offers new avenues to attack the problem, new possibilities.

Nothing quite like it.

And then, the a-ha moment, like Athena springing from the head of Zeus, fully formed, and racing off to code it all.
1

figoore

233

2y

@Wisecrack those nights are the best parts of being a programmer
I really like this wording, such a nice devrant gem
1

Wisecrack

9195

2y

@figoore I'm genuinely flattered. Thanks.
1

Wisecrack

9195

2y

@retoor Nothing like smoking after four hours of going without, the rush, eyes half-lidded, as everything comes into sharp focus.
0

Wisecrack

9195

2y

@retoor some dude brought in a bad-dragon sized vape didn't he?

I bet he did.

And they were like "we draw the line at smoking pole in the office!"
0

Wisecrack

9195

2y

@retoor company policy 178: "you will NOT convert the company's open floor plan into a hookah shop!"

random

machine learning