In my previous post https://babkin-cep.blogspot.com/2022/10/optimization-7-follow-up-to-neural.html I've mentioned the idea that maybe there should be a V-shaped activation function. After some thinking, I've got an idea of what this function could be. Let's call it A(v):
A(v) = {
v < 0 : k1*v + k2;
v = 0 : k2;
v > 0 : k3*v + k2;
}
It's not necessarily V-shaped as such, it just consists of two linear segments joined at a breaking point. k2 shifts this point vertically. Note that in the neuron formula (using the same variables as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html):
N(v[]) = sum(v[i] * x[i]) + x0
the offset x0 also shifts the result vertically, but does that before activation, effectively shifting the breaking point left or right, while k2 does it after activation and shifts the result up or down.
If the values happen to be k1=0, k2=0, k3=1, then A(v) is equivalent to ReLU. But a major difference here is that k1, k2, k3 also get adjusted by backpropagation. It nicely fits into the backpropagation model: even though the function has a break in its slope at 0, a derivative can be defined at every point, this derivative is generally not 0, and the function is piecewise-linear, so it can even be applied after the last layer of neurons.
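To make the definition concrete, here is a minimal C++ sketch of this activation (the struct and its names are made up for illustration, it's not the actual FloatNeuralNet code):

#include <cstdio>

// The V-shaped activation A(v): two linear segments joined at v = 0,
// with the joint shifted vertically by k2.
struct VActivation {
    double k1, k2, k3; // left slope, vertical shift, right slope

    double value(double v) const {
        if (v < 0.) return k1 * v + k2;
        if (v > 0.) return k3 * v + k2;
        return k2;
    }
};

int main() {
    VActivation relu{0., 0., 1.}; // these values degenerate A(v) to ReLU
    VActivation vee{-1., 0., 1.}; // a true V shape, equal to |v|
    printf("relu(-2)=%g relu(3)=%g\n", relu.value(-2.), relu.value(3.));
    printf("vee(-2)=%g vee(3)=%g\n", vee.value(-2.), vee.value(3.));
    return 0;
}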
Applying the same sample logic for the last layer as in https://babkin-cep.blogspot.com/2022/10/optimization-4-backpropagation-for-last.html, if we have
N(v[1..3]) = v[1]*x[1] + v[2] * x[2] + v[3]*x[3]
then after activation, assuming that N(v[1..3]) < 0 (so the k1 branch of the activation applies), the formula will be:
output = A(N(v[1..3])) = k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2
Then the squared error will be:
f(v[1..3]) = (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b)^2
and using the same logic as before, the partial derivatives will be:
df/dx[1] = 2 * k1 * v[1] * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * k1 * v[1] * (output - b)
df/dk1 = 2 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3])
* (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * N(v[]) * (output - b)
df/dk2 = 2 * (k1 * (v[1]*x[1] + v[2] * x[2] + v[3]*x[3]) + k2 - b) =
= 2 * (output - b)
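As a sanity check, here is a small C++ sketch that compares these analytic derivatives against numeric differentiation, using made-up values of v[], x[], k1, k2 and b (chosen so that N(v[]) < 0, matching the assumption above):

#include <cstdio>

// Squared error (k1 * N(v) + k2 - b)^2 for the k1 branch of the activation.
static double sqerr(const double v[3], const double x[3],
                    double k1, double k2, double b) {
    double n = v[0]*x[0] + v[1]*x[1] + v[2]*x[2];
    return (k1 * n + k2 - b) * (k1 * n + k2 - b);
}

int main() {
    // v[1..3] and x[1..3] of the formulas become v[0..2] and x[0..2] here.
    double v[3] = {0.5, -1.0, 0.25};
    double x[3] = {-0.8, 0.3, -0.6};
    double k1 = 0.7, k2 = 0.1, b = 0.4;

    double n = v[0]*x[0] + v[1]*x[1] + v[2]*x[2]; // comes out negative here
    double output = k1 * n + k2;

    // Analytic derivatives from the formulas above.
    double dx1 = 2. * k1 * v[0] * (output - b);
    double dk1 = 2. * n * (output - b);
    double dk2 = 2. * (output - b);

    // Numeric derivatives by a small step h.
    const double h = 1e-6;
    double xh[3] = {x[0] + h, x[1], x[2]};
    double f0 = sqerr(v, x, k1, k2, b);
    printf("df/dx[1]: analytic %g, numeric %g\n", dx1, (sqerr(v, xh, k1, k2, b) - f0) / h);
    printf("df/dk1:   analytic %g, numeric %g\n", dk1, (sqerr(v, x, k1 + h, k2, b) - f0) / h);
    printf("df/dk2:   analytic %g, numeric %g\n", dk2, (sqerr(v, x, k1, k2 + h, b) - f0) / h);
    return 0;
}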
And when propagating to the previous layers, the propagation back through the activation function will multiply the sigma value by its derivative:
dA/dv = {
v < 0 : k1;
v = 0 : 1;
v > 0 : k3;
}
and the gradient (AKA derivative) for k1 or k3, depending on which path was taken, will be (N(v) * sigma), and for k2 it will be sigma.
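In C++ this propagation step could look like the following sketch (again with made-up names, not the actual FloatNeuralNet interface; the derivative of 1 at the breaking point is just the convention chosen above):

#include <cstdio>

struct VActivation {
    double k1, k2, k3;

    // dA/dv, the multiplier for the sigma propagated through the activation.
    double deriv(double v) const {
        if (v < 0.) return k1;
        if (v > 0.) return k3;
        return 1.; // the convention for the breaking point
    }

    // Gradients of the activation coefficients, given the pre-activation
    // value n = N(v[]) and the sigma that arrived from the next layer.
    void gradient(double n, double sigma,
                  double &gk1, double &gk2, double &gk3) const {
        gk1 = (n < 0.) ? n * sigma : 0.; // only the branch actually taken
        gk3 = (n > 0.) ? n * sigma : 0.; // gets a non-zero gradient
        gk2 = sigma;                     // k2 is shared by both branches
    }
};

int main() {
    VActivation a{-0.5, 0.2, 1.5};
    double n = -0.85, sigma = 0.3;
    double gk1, gk2, gk3;
    a.gradient(n, sigma, gk1, gk2, gk3);
    printf("sigma to propagate below: %g\n", sigma * a.deriv(n));
    printf("gk1=%g gk2=%g gk3=%g\n", gk1, gk2, gk3);
    return 0;
}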
Another interesting property of the V-shaped function is that it can implement a XOR logical operation on two inputs in one layer instead of the classical two. If we have false defined as -1 and true as 1 then we can define:
N(v[1], v[2]) = v[1] - v[2]
k1 = -1
k2 = -1
k3 = 1
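A quick C++ sketch confirms that this single neuron plus the V-shaped activation reproduces the XOR truth table (purely illustrative, not tied to any real network code):

#include <cstdio>

// N(v1, v2) = v1 - v2, i.e. weights x = {1, -1} and offset x0 = 0.
static double neuron(double v1, double v2) { return v1 - v2; }

// A(v) with k1 = -1, k2 = -1, k3 = 1.
static double activate(double v) {
    if (v < 0.) return -1. * v - 1.;
    if (v > 0.) return 1. * v - 1.;
    return -1.;
}

int main() {
    const double vals[2] = {-1., 1.}; // false, true
    for (double v1 : vals)
        for (double v2 : vals)
            printf("XOR(%g, %g) -> %g\n", v1, v2, activate(neuron(v1, v2)));
    // Expected: -1 (false) for equal inputs, 1 (true) otherwise.
    return 0;
}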
It could also be argued that this activation function essentially adds an extra layer to every layer of neurons, but it's a special layer, much cheaper than a normal one. I think implementing an equivalent of the V-shaped activation with plain ReLU would even require three layers to keep k2 common.
So it looks interesting and promising, although I'm not entirely sure that the breaking point will end up moving in the "intuitively correct" directions. Now I need to try and see. And first, reorganize the training code in FloatNeuralNet to support multiple kinds of activation functions, and factor out the repetitive code in it.