5

Final synopsis.
Neural networks suck.
They just plain suck.
A 5% error rate on the best and most convoluted problem is still way too high.
It's amazing you can make something see an image it's been trained on, that's awesome...

But if I can't get a simple function approximator below a 0.07 difference on a 0-to-1 scale, and the error on a fixed-point system is still pretty damn high even when most of the data roughly fits the inference values it spits back, it is unusable.

Even the turret aimer I trained successfully would sometimes skip around a full circle, pass the target, and only line up after another full circle.

There has to be something LIKE IT that actually delivers on the premise.

I think my behavioral simulation might be a cool idea: primitive environment, primitive being, reward learning, but with an attached DATABASE.

Comments
  • 3
    They're just *difficult* to get working effectively, and you tend to need a reasonably deep understanding to get them to work well (especially in scenarios that aren't a traditionally great "fit".) I haven't played with them much in recent times (last time I touched them was over a decade ago when the landscape was very different) but others like @Nomad certainly have.
  • 1
    @AlmondSauce one consideration I would have: if there isn't an activation function on every layer, you should have a much broader range of values available to combine.

    We already had this discussion, I think.
    Sorry, dirty treasonous garbage picking me up and kidnapping me a decade ago kind of screwed up my memory a tad.
  • 3
    They're difficult to get right. You don't just throw things together and get something working and expect it to be good. There's a reason people dedicate their whole careers to this.
  • 2
    @RememberMe well, my simplistic example was a 3D function:

    z = x^2 + y^2 + 2x^3 + 2y^3

    z was the output,
    x and y the inputs.

    So... with hidden layers I ended up with

    2x12x1024x12x1

    The initial training flies.
    n=400, epochs=20, optimizer=AdamW, loss=MSE,
    lr=0.0001 initially; I drop it to lr=0.0000001 by the time I'm done.

    I randomly generate the training data in each session.

    The domains of x and y are 0 < x < 100 and 0 < y < 100.
    I divide the expected function values and the inputs by their max values.

    Eventually the values start oscillating.
    Before that there's steady convergence for 1000s of training values, if at a low rate.

    I adjust the learning rate down more and more, but eventually that stops working.
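    For reference, a minimal sketch of the setup described above: a plain fully connected PyTorch net in the 2x12x1024x12x1 shape with AdamW + MSE and inputs/targets divided by their max values. The tanh activations and full-batch training here are my assumptions for brevity, not necessarily the exact code being discussed.

    ```python
    # Sketch of the described setup: 2x12x1024x12x1 MLP, AdamW, MSE,
    # data normalized by max values. Assumptions: tanh activations, full batch.
    import torch
    import torch.nn as nn

    def f(x, y):
        return y**2 + x**2 + 2*x**3 + 2*y**3

    n = 400
    xy = torch.rand(n, 2) * 100.0                  # 0 < x, y < 100
    z = f(xy[:, 0], xy[:, 1]).unsqueeze(1)

    xy_n = xy / 100.0                              # divide inputs by their max
    z_n = z / f(torch.tensor(100.0), torch.tensor(100.0))  # divide target by its max

    model = nn.Sequential(
        nn.Linear(2, 12), nn.Tanh(),
        nn.Linear(12, 1024), nn.Tanh(),
        nn.Linear(1024, 12), nn.Tanh(),
        nn.Linear(12, 1),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(model(xy_n), z_n)
        loss.backward()
        opt.step()
        print(epoch, loss.item())
    ```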
  • 1
    @RememberMe with more layers the error remains about the same.

    I have passed the final values through tanh and sigmoid and the end fitting is about the same; I can only get so close.

    and here i am having been bumped into now to try and piss me off twice by the same little boy toucher.

    So obviously something is being missed.
    I almost feel like I just need to write my own, the backprop algorithm for example. I tried to modify the way the data is trained: I spread the values out across partitions for the inputs that are evenly spaced in their starts and ends, etc. That increased the training speed, because it's an odd shape.

    I removed the ReLUs since they stopped training.
  • 1
    @AvatarOfKaine you're using tanh and sigmoid, but have you tried ReLU?
  • 1
    @Wisecrack no, no, I alternated between the two to fit things.

    And no, not as the output's activation function.
  • 1
    @AvatarOfKaine never tried it, but what about arbitrary finely quantized functions or curves? Hell you could set the points on the function *manually*, and then maybe interpolate as needed?
    After all, why only stick with well defined functions like tanh and sigmoid?
  • 2
    Wait... How are you defining your error? And what's the biggest net you've tried?
  • 1
    @atheist refining my error?
    Do you mean the backpropagation optimizer?

    The biggest network I tried had 4 hidden layers of 1024.

    It didn't make any difference; it just slowed everything down.
  • 1
    @atheist but if you mean how I'm trying to ensure more even training: the data I randomly generate I now arrange into evenly spaced partitions, so that each partition has an equal number of randomly generated values between the start and end of its range. That way, when I run a training batch (which I shuffle at the beginning of each epoch), it pulls the weights in a way that I imagine would converge better.
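    If I'm reading that right, the partitioned generation could look something like the sketch below (assuming equal-width bins per input with the same number of random draws in each; the bin counts are arbitrary examples):

    ```python
    # Sketch of stratified ("partitioned") data generation: each input's range is
    # split into equal-width bins and the same number of random values is drawn
    # from every bin, so the samples cover the domain evenly.
    import torch

    def stratified_samples(low, high, n_bins, per_bin):
        edges = torch.linspace(low, high, n_bins + 1)
        widths = edges[1:] - edges[:-1]
        u = torch.rand(n_bins, per_bin)               # uniform draws inside each bin
        return (edges[:-1, None] + u * widths[:, None]).reshape(-1)

    x = stratified_samples(0.0, 100.0, n_bins=20, per_bin=20)   # 400 values
    y = stratified_samples(0.0, 100.0, n_bins=20, per_bin=20)
    perm = torch.randperm(x.numel())                  # shuffle at the start of each epoch
    xy = torch.stack([x[perm], y[perm]], dim=1)
    ```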
  • 1
    @atheist now, last time we had this discussion you answered me, and apparently I wasn't certain what you meant by fixing the error.

    But you did answer :P
    It was likely some time ago, though.
  • 0
    Don't try to get neural networks to do higher math.
    Humans have way more neurons than you can simulate and still struggle at that after years of training...
  • 1
    @Oktokolo I’m not really trying to make it SOLVE an equation
    And the neurons really don’t work this way in our brains

    I’m testing the claim that neural nets can approximate any function.
  • 2
    @AvatarOfKaine I think NNs "can" approximate any function. Getting them to is hard. If I was gonna hand-craft something using ReLU based on your equation, you could do it with 2 separate nets, one for each input, combined at the end, because the equation is separable. Think of it like a Taylor series: lots of small bits that increase the accuracy over longer and longer sections.

    Doing that manually with a complete understanding of the problem space is possible; going from random noise to that is much harder.

    My guesses: making the first layer much wider might help. Otherwise you're vaguely quantizing your input into 12 sections. If you're looking for high accuracy over a 100x100 space, 100,000 should definitely be able to do it (manually, at least, but much slower). And maybe have a couple of layers only connected to each input, then join them. If you wanted to add something like (x^2)(y^3), then the first couple of layers could be split in 3: one for each input, one for both. In theory NNs can learn that relationship. In practice, getting them to is hard.

    They won't be able to extrapolate very well, neither can Taylor series. This is where the whole "big data" comes in, if you've seen enough real world examples, almost everything is interpolation.

    Not AGI of course, but I think we as a society are missing something academic there.

    I'm just eyeballing based on the tiny amount of information I can remember.

    I think the "deep" net stuff is of greater benefit when output is based on a combination of features, eg in images, no one pixel tells you a lot about the answer, whereas in your case, each input can be used to calculate part of the answer.

    And partial derivatives: your net could alternate training on changes in x then y, because the partial derivative with respect to each variable doesn't involve the other.
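    A rough sketch of that two-branch idea, assuming each input gets its own small sub-net and the outputs are simply summed at the end (valid here because z = (x^2 + 2x^3) + (y^2 + 2y^3) has no cross terms); the layer sizes are arbitrary:

    ```python
    # Sketch of a separable architecture: one sub-net per input, outputs summed.
    # Works because z = (x^2 + 2x^3) + (y^2 + 2y^3) has no cross terms.
    import torch
    import torch.nn as nn

    class SeparableNet(nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            def branch():
                return nn.Sequential(
                    nn.Linear(1, hidden), nn.Tanh(),
                    nn.Linear(hidden, hidden), nn.Tanh(),
                    nn.Linear(hidden, 1),
                )
            self.fx = branch()   # sees only x
            self.fy = branch()   # sees only y

        def forward(self, xy):
            x = xy[:, 0:1]
            y = xy[:, 1:2]
            return self.fx(x) + self.fy(y)

    model = SeparableNet()
    print(model(torch.rand(8, 2)).shape)   # torch.Size([8, 1])
    ```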
  • 1
    @atheist amusingly, I was actually thinking of that first part about two separate nets.

    Still kind of invalidates the claim though :P
  • 1
    @atheist again, on the partial derivatives I was thinking the same thing yesterday.

    It's the reason I can't understand why it isn't fitting better, since the two inputs don't serve as coefficients to one another, so the relationship is a sum that isn't very complex.

    As far as suggesting how layers are split up, that's what confuses me when people say that.

    Like, an individual describing image recognition suggested that layers might take on purposes; that seems kind of automatic though, doesn't it?

    As for the 12 connections, I initially started out at 1024, to little avail, and saw only slightly better results.

    In the end it shouldn't matter though, right?

    y_i = sum_j( W_l[i,j] * A_(l-1)[j] ) + b_l[i]

    with every single connection being multiplied and added at each layer,

    and there being quite a few to choose from.
  • 1
    @atheist something confuses me slightly.
    I had thought that a lot of the power came from there being a separate weight value for each neuron's connection to every neuron in the previous and next layer.
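    That is how a standard fully connected layer works: one weight per connection between adjacent layers, plus one bias per output neuron. A quick way to see it in PyTorch (sizes chosen to match the 12 -> 1024 example above):

    ```python
    import torch.nn as nn

    layer = nn.Linear(12, 1024)
    print(layer.weight.shape)   # torch.Size([1024, 12]) -> one weight per connection
    print(layer.bias.shape)     # torch.Size([1024])     -> one bias per output neuron
    ```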
  • 1
    @atheist I tried something kind of like what you're talking about too, alternating on one variable or the other, but I feel it skewed the results. That was in a different problem, though.

    What still confuses me, now that I print more useful statistics than just a single epoch's loss value, is by what criteria to decide to adjust the learning rate: the error adjusts downward over and over, but the overall error occasionally jumps high and then drops back down low within each epoch.

    However, like I said, sometimes the training behavior is strange. Certain values jump up and down, which I can visualize as a kind of shared modifier to the overall equation that, passed through, affects a large chunk of the end value being adjusted along with several others.
  • 1
    @atheist shouldn't introducing equally spaced data that differs each time produce better results?
    I feel like it should.
    I feel like overfitting would occur if I used a small dataset that matched those specifications but never changed it.
  • 1
    @AvatarOfKaine I think with regard to separating layers, in theory NNs can learn that relation (it's a zero weight on one of the inputs), but it's a harder relation to learn.
  • 2
    Re bigger first layers, if your function is 2 separable functions the net may either learn the whole space, or learn the individual functions. The whole space is doable, but would require a bigger network/more data/more time etc.
  • 1
    And I'm by no means an expert.
  • 2
    If you look at each layer as "extracting features from the previous layer", the answer is a function of the 2 input variables, but it's not much of a combination of the two, so it doesn't benefit so much from the layers because each layer combines information from the previous and outputs the combination to the next. So the first layer, conceptually at least, extracts information from the inputs, then you lose the discrete separation of the information you need. The information can still be passed forward, just harder.

    It's possible for a NN to extract the information, you end up with something like lots of small planes (surfaces, not aero) being added together to give the final result, but the problem space can be simplified, and the simpler problem is easier to learn.

    I'm gonna have to brush up on this stuff, aren't I?
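    To make the "lots of small pieces added together" picture concrete, here is a hand-built 1-D illustration (not from the thread): a handful of ReLU units with manually placed knots reproduce the piecewise-linear interpolant of x^2 on [0, 1].

    ```python
    # Hand-built illustration of the "sum of small pieces" view: a few ReLU units
    # with manually placed knots approximate x^2 on [0, 1] as a piecewise-linear curve.
    import torch

    knots = torch.linspace(0.0, 1.0, 6)[:-1]           # hinge locations 0.0 .. 0.8
    x = torch.linspace(0.0, 1.0, 101)

    # Slopes chosen so the piecewise-linear segments interpolate x^2 at the knots.
    step = 0.2
    targets = torch.cat([knots, torch.tensor([1.0])]) ** 2
    seg_slopes = (targets[1:] - targets[:-1]) / step    # slope on each segment
    weights = seg_slopes.clone()
    weights[1:] -= seg_slopes[:-1]                      # slope *changes* at each knot

    approx = (weights[None, :] * torch.relu(x[:, None] - knots[None, :])).sum(dim=1)
    print((approx - x**2).abs().max())                  # max error ~0.01
    ```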
  • 3
    @AvatarOfKaine
    Of course neural nets can approximate any function given enough neurons, layers and interconnects.
    It is just absurdly hard to make them do that.

    Like @atheist wrote, having separate clusters of neurons to solve parts of the problem (like it is done for feature detection in convolutional neural networks) is probably the way to do it.

    You definitely can split the problem into single operations on two inputs and train them separately. Then you can combine the parts to get the full solution.

    In general, keep the models as tiny as possible. More inputs, neurons, layers and interconnects means longer training and more overfitting.
    More of these also means more potential for the model to exploit unexpected relations between inputs. But you don't want that for known functions.

    Happy research.
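    A sketch of that "train the parts separately, then combine" idea, working in unit-scaled coordinates for brevity: two tiny 1-D nets each fit t^2 + 2t^3 on their own input, and the full prediction is just the sum of their outputs (sizes and hyperparameters here are arbitrary assumptions):

    ```python
    # Sketch: train two 1-D sub-nets separately on the separable parts of
    # z = (x^2 + 2x^3) + (y^2 + 2y^3), then combine by summing their outputs.
    import torch
    import torch.nn as nn

    def make_net():
        return nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                             nn.Linear(64, 64), nn.Tanh(),
                             nn.Linear(64, 1))

    def fit(net, target_fn, steps=2000):
        opt = torch.optim.AdamW(net.parameters(), lr=1e-3)
        for _ in range(steps):
            t = torch.rand(256, 1)                  # inputs scaled to (0, 1)
            loss = nn.functional.mse_loss(net(t), target_fn(t))
            opt.zero_grad()
            loss.backward()
            opt.step()
        return net

    g = lambda t: t**2 + 2*t**3                     # unit-scaled stand-in for each 1-D part
    net_x = fit(make_net(), g)
    net_y = fit(make_net(), g)

    def predict(x_n, y_n):                          # x_n, y_n already scaled to (0, 1)
        return net_x(x_n) + net_y(y_n)
    ```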
  • 1
    @atheist by clusters of neurons, do you mean several models in coding terms? That's the part I don't get; looking at the available libraries there seems to be no way of fine-grained control, nor does it seem like it should work that way given the mathematical definition of how the concept works.
  • 0
    @AvatarOfKaine some googling later: as I understand it, are you using fully connected layers? Keras seems to support "partially" connected layers, e.g. https://stackoverflow.com/questions...

    With regards to the mathematical definition, not sure what you mean.
  • 0
    @atheist well, just the weight adjustment part and the concept of an activation layer.
    So... oh.
    Partially connected, interesting.
  • 1
    @atheist I use pytorch lol
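    For what it's worth, PyTorch doesn't need a special "partially connected" layer type: you can either wire separate sub-modules together yourself in forward() (as in the branch sketch earlier) or mask out the connections you don't want. A sketch of the masking approach (the block structure is an arbitrary example):

    ```python
    # Sketch of "partial connectivity" in PyTorch via a fixed binary mask on a
    # Linear layer's weights: masked-out connections stay at zero, so only the
    # allowed connections contribute.
    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Linear):
        def __init__(self, in_features, out_features, mask):
            super().__init__(in_features, out_features)
            self.register_buffer("mask", mask)   # shape (out_features, in_features), 0/1

        def forward(self, x):
            return nn.functional.linear(x, self.weight * self.mask, self.bias)

    # Example: 2 inputs, 8 outputs; first 4 outputs see only x, last 4 only y.
    mask = torch.zeros(8, 2)
    mask[:4, 0] = 1.0
    mask[4:, 1] = 1.0
    layer = MaskedLinear(2, 8, mask)
    print(layer(torch.rand(5, 2)).shape)   # torch.Size([5, 8])
    ```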