Comments
-
kamen 6925 2y If it's all boilerplate, what's the point of writing it yourself...
The significance is in knowing what you have to write yourself.
-
I know it's a joke, but I feel like correcting the impression that ChatGPT copies code… No, it generates the code in real time based on an algorithm.
-
@GiddyNaya based on code it’s read elsewhere. Not much different than training ourselves on StackOverflow
-
Hazarth 9506 2y @GiddyNaya that's just the thing with NNs. It's not an algo, it's a mapping A->B.
Just because they use random sampling from a distribution doesn't mean it's smart.
-
@Hazarth But how can we call it a map when it can fix the bugs in code I wrote myself?
-
Hazarth 9506 2y @GiddyNaya because it's word prediction. It doesn't predict entire solutions, just the most likely words in a sentence.
The moment you give it a question, it takes it, together with its previously generated sequence (which is empty right now), as input and outputs a static array of probabilities for what its next word should be. And no matter how many times you run it on the same input, 100% of the time it generates the same output predictions.
E.g. the next word should be "sure": 0.8, "okay": 0.8, "no": 0.001, "yeah": 0.79...
And it outputs the same chances every time. We then sample from that distribution, so "sure" and "okay" are equally likely to be picked, but "yeah" could also make the cut...
This is where the magic happens: since this word is semi-random and the next word already depends on this pick, the output seems like it can be unique, but it isn't. If they used greedy sampling the output would be constant, i.e. "mapped" from A to B.
-
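A minimal sketch of the distinction described above, assuming the four example values are pre-softmax scores. The names `LOGITS`, `softmax`, `greedy_pick` and `sample_pick` are made up for illustration and do not come from any real model API. The probabilities are identical on every call; only the sampling step introduces variation.

```python
import math
import random

# Hypothetical pre-softmax scores, copied from the example in the comment.
LOGITS = {"sure": 0.8, "okay": 0.8, "no": 0.001, "yeah": 0.79}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def greedy_pick(probs):
    """Always take the most likely word (the first one wins on a tie)."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


probs = softmax(LOGITS)
print(probs)

# Greedy decoding: the same input always maps to the same word.
print("greedy: ", [greedy_pick(probs) for _ in range(5)])

# Sampling: same probabilities, but the picked word can differ per draw.
rng = random.Random(0)  # seeded only so the demo is repeatable
print("sampled:", [sample_pick(probs, rng) for _ in range(5)])
```

Real systems typically add knobs such as temperature or top-k on top of this, which the sketch leaves out.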
Hazarth 9506 2y @GiddyNaya It's real time, but it's also only capable of outputting the most likely word based on its training data. Which is why people call it copying, and that's not wrong, in the same sense that connecting two pencils together with a flimsy stick and drawing with one of them while the other roughly traces the same patterns is copying.
It's sorta unique, but it's also exactly the same as the original. ChatGPT does that, but per word. Which makes it seem like it's reacting to you, but really it's reacting to questions similar to yours + what it already said.
-
Now go back and read both @Demolishun's first comment AND @kamen's next to the tune of Bohemian Rhapsody ;)
-
True to some degree, though, unless you look at most of my code at the points where I refigure out established practices lol
-
KALALEX 111 2y Isn't calling a system function or a loop using/copying actual low-level code? Think about that for a moment and your whole perspective will change.
-
We3D 2668 2y Open your mind, look at the skies and see... ChatGPT is just a fancy toy; although it seems smart, it's not...
-
hitko 3145 2y @Hazarth What you described is the extreme case of overfitting, and that's exactly what you try to avoid with any at least half-decent NN. What you described is like a person reading the same book over and over until they know it word for word, and then writing their "own" book from that. The best proof ChatGPT doesn't work like that is in its size: the GPT model contains about 30-40 GB of parameters, while its training set contains well over 1 TB of data (~30x more data than the entire GPT model). There's no way to summarise that much data and its relations into such a small model without loss of information.
NNs really learn by iteratively adjusting parameters, the same way a person would read a bunch of fantasy books and summarise the ideas from all of those books. Clearly a person couldn't write their own copy of any of the original books based on that, but they could write fanfiction or their own book that loosely resembles some of the ideas from the existing books.
-
Hazarth 9506 2y @hitko
I mean, not really.
Yes, I used super high values like 0.9 as an example, to simplify the explanation. That part may represent an overfitted model, though even that's not 100% easy to argue, because I included two different words with the same values. The values provided are *before* a softmax, because I didn't want to bother writing examples that add up to 1.0.
Still, as you said, an NN learns by adjusting parameters, and once adjusted, it always results in the same outputs, which is why I call it a mapping... It really is just a function-fitting algorithm and nothing more; the resulting NN is the modeled function itself. Just a series of multiplications, additions and nonlinearities, but there's no randomness included in the actual model; everything that appears "creative" afterwards is the sampling method. And yes, I explained the simplest possible sampling method, cause I didn't want to go into beam search or more advanced ones for the example, but the example stands.
-
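To make the "series of multiplications, additions and nonlinearities" concrete, here is a toy sketch of a forward pass. The weights in `W1`, `B1`, `W2`, `B2` are invented numbers standing in for trained parameters, and the two-layer shape is an assumption chosen only to keep the example short. With the parameters frozen, the same input gives the same output on every call in this single-threaded setting.

```python
import math

# Hypothetical frozen parameters standing in for a trained network.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]  # 3 hidden units, 2 inputs
B1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.6, 0.2], [0.7, 0.1, -0.4]]    # 2 outputs, 3 hidden units
B2 = [0.05, -0.05]


def dense(weights, biases, x):
    """One layer: a weighted sum of the inputs plus a bias, per unit."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x):
    """Multiply, add, apply a nonlinearity, then multiply and add again."""
    hidden = [math.tanh(h) for h in dense(W1, B1, x)]
    return dense(W2, B2, hidden)


x = [1.0, -2.0]
first = forward(x)
second = forward(x)
print(first)
print(first == second)  # no random step anywhere in the forward pass
```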
We3D 2668 2y Although I'm pretty happy with the AI progress, we all know that the model output is only as good as its training data, and as stated above, their imagination is virtually zero. They are pretty good at combinatorics (which might count as creativity depending on the viewing angle) and pattern finding (and much faster than us), but AFAIK the ML engineers still haven't figured out how to embed in them the true creativity we could call imagination.
-
hitko 3145 2y @Hazarth I was referring mostly to your comment about copying (if an NN produces a near copy of the training data, that's major overfitting), but okay. Yes, neural networks are generally deterministic in the sense that for some given input they produce the same output, but there are some other factors at play. Specifically in GPT models, including ChatGPT, there's some level of randomness where instead of always selecting the word with the highest probability, other probable words are also considered with some non-zero probability of being selected, which completely changes the downstream context.
There's also another thing which affects large neural networks to some degree, and that is the order of operations. Large-scale parallel processors generally don't distribute the workload in a deterministic way; this means that parallel floating-point computations don't always happen in the same order, and that can, under the right circumstances, propagate enough to change the output.
-
Hazarth 9506 2y @hitko Yeah, it's not as much "copying" as "templating". That doesn't change the fact the template is based off examples of real written code.
Look, I'm not trying to say it's plagiarism, or that the generated code should be credited or illegal or anything like that. I consider taking stuff from public sources fair game, it's pretty much what humans do to learn anyway!
But it's also true that it can't do anything it hasn't seen before. It can abstract, but only within the bounds of the training data. If it has seen values from -10 to 10, it can output an 8.548, and maybe even 10.5, but it can never do a 20 or a 50. In that sense it can only "copy" what it has "seen".
Also, talking about nondeterministic behavior in terms of a random bit flip, or synchronizing parallel work, is a moot point. Even you had to admit that it can only happen under the "right circumstances". It's a stretch. All the randomness with language models comes from the sampling method.
-
hitko 3145 2y @Hazarth I'm not talking about bit flips (which btw aren't that uncommon, they happen several times per day on each instance of the ChatGPT model), I'm talking about floating-point arithmetic where 0.2+(0.1+0.4) doesn't give the same result as (0.2+0.1)+0.4; unlike the bit flip, that's a really common and well-known problem with large-scale parallel floating-point computation (see papers like https://hal.science/hal-00949355v1/...). Neural networks encounter that on pretty much every step where a bunch of signals need to be summed up, so it's definitely happening. The reason I added "right circumstances" is because with such a large and complex model as ChatGPT, it's impossible to determine when and how those differences propagate through the model, and when they cancel out or get amplified.
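The grouping example above can be checked directly with ordinary IEEE 754 doubles. The sketch below also sums the same list in two different orders as a rough stand-in for a parallel reduction. The helper `summed_in_order`, the list size and the value range are illustrative assumptions, not a model of how any GPU actually schedules work.

```python
import random

# The two groupings from the comment, evaluated in double precision.
a = 0.2 + (0.1 + 0.4)
b = (0.2 + 0.1) + 0.4
print(a, b, a == b)


def summed_in_order(values, order):
    """Add up values, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


rng = random.Random(0)  # seeded only so the demo is repeatable
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

forward_order = list(range(len(values)))
shuffled_order = forward_order[:]
rng.shuffle(shuffled_order)

s1 = summed_in_order(values, forward_order)
s2 = summed_in_order(values, shuffled_order)

# Any gap here comes purely from rounding in a different order.
print(s1, s2, abs(s1 - s2))
```

Whether a gap of this size ever changes a model's final output is exactly what the two commenters disagree on; the sketch only shows that the rounding depends on the order.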
-
Hazarth 9506 2y @hitko Oh, that's what you mean. Well, that barely matters for such large networks. It's not only impossible to say how it behaves, it's also actually part of the training. FP errors like that show up all over the training as well, not to mention when half precision is used. So NNs are automatically tuned to account for a certain amount of this noise already, especially large ones that can afford to do so. If this wasn't the case, we'd see exploding gradients happen from one run to another in random places, but the noise is so tiny, and already taken into consideration while training just by the nature of backprop, that it doesn't really matter. For all intents and purposes of NN models, you get back pretty much identical output for each run. The cases where you get a probability of 0.000002 in one run and 0.0000021 in another don't affect the predictions when 3-5 words are at 0.2 and are going to be the likely candidates whether there's an 8e-10 error or not.
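A rough sketch of that last claim, under invented numbers: a handful of clearly separated top candidates, a long tail of unlikely words, and noise of at most 1e-7 added to every score. The logits, the noise size and the helper names are assumptions for illustration only. With gaps this wide, the ranking of the top candidates cannot change; nearly tied candidates would be a different story.

```python
import math
import random


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Indices of the k most probable entries, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


# Hypothetical scores: four strong candidates and fifty very unlikely ones.
logits = [2.0, 1.9, 1.8, 1.7] + [-5.0 - 0.01 * i for i in range(50)]

rng = random.Random(0)  # seeded only so the demo is repeatable
noisy = [v + rng.uniform(-1e-7, 1e-7) for v in logits]

clean_probs = softmax(logits)
noisy_probs = softmax(noisy)
max_shift = max(abs(p - q) for p, q in zip(clean_probs, noisy_probs))

print(top_k(clean_probs, 4), top_k(noisy_probs, 4))
print(max_shift)
```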
I feel attacked.
joke/meme