I want to elaborate on why the second trick is so important. The formula requires that the chance changes on an event in the opposite ways (by multiplying or dividing by the same number) depending on whether the event is true or false, and in general this condition is not true. It takes a specifically defined event to make it true.
In general if an event is true, the chance of a hypothesis after applying this event is:
Chance(H|E) = P(H|E) / P(~H|E)
The probabilities change on an event as follows:
P(H|E) = P(H) * P(E|H) / P(E)
P(~H|E) = P(~H) * P(E|~H) / P(E)
So then the chance changes (after substituting and reducing the formula):
Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)
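The reduction above can be checked numerically. This is only a minimal sketch: the function names and the sample probabilities are made up for the illustration, and P(E) is computed from the total probability formula.

```python
def chance(p):
    """Convert a probability into a chance: P / (1 - P)."""
    return p / (1.0 - p)

def posterior_chance_direct(p_h, p_e_h, p_e_nh):
    """Chance(H|E) computed the long way, through P(H|E) and P(~H|E)."""
    # Total probability of the event E.
    p_e = p_h * p_e_h + (1.0 - p_h) * p_e_nh
    p_h_e = p_h * p_e_h / p_e
    p_nh_e = (1.0 - p_h) * p_e_nh / p_e
    return p_h_e / p_nh_e

def posterior_chance_short(p_h, p_e_h, p_e_nh):
    """Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)."""
    return chance(p_h) * p_e_h / p_e_nh

# Made-up values: P(H), P(E|H), P(E|~H).
p_h, p_e_h, p_e_nh = 0.3, 0.8, 0.4
print(posterior_chance_direct(p_h, p_e_h, p_e_nh))
print(posterior_chance_short(p_h, p_e_h, p_e_nh))
```

Both functions should print the same value up to the floating-point rounding, since P(E) cancels out in the division.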
If an event is false, the chance changes in a similar but different way:
Chance(H|~E) = P(H|~E) / P(~H|~E)
= Chance(H) * P(~E|H) / P(~E|~H)
Being able to do the "opposites" requires that
Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )
Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)
Which then means after doing the substitution:
P(E|H) / P(E|~H) = P(~E|~H) / P(~E|H)
Which is not always true. But if we take a page from the AdaBoost's book and define an event such that
P(E|H) = P(~E|~H)
and
P(E|~H) = P(~E|H)
Then it all magically works. Again, this magic is not needed for the Bayesian computation in general, it's needed only to fit it to the formula used in the neural networks and in AdaBoost.
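To see the difference on numbers, here is a small sketch that computes both multipliers for an arbitrary event and for an event defined symmetrically as described above. The function name and the probabilities are made up for the illustration.

```python
def multipliers(p_e_h, p_e_nh):
    """Return the chance multipliers for E being true and for E being false.

    p_e_h is P(E|H), p_e_nh is P(E|~H).
    """
    m_true = p_e_h / p_e_nh                     # P(E|H) / P(E|~H)
    m_false = (1.0 - p_e_h) / (1.0 - p_e_nh)    # P(~E|H) / P(~E|~H)
    return m_true, m_false

# An arbitrary event: the multipliers are not the opposites of each other,
# so their product differs from 1.
m_true, m_false = multipliers(0.8, 0.4)
print(m_true * m_false)

# A symmetric event, P(E|~H) = 1 - P(E|H): the product is 1
# (up to the floating-point rounding), so the multipliers are the opposites.
m_true, m_false = multipliers(0.8, 0.2)
print(m_true * m_false)
```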
There are versions of AdaBoost that use the different multipliers for when x[i] < 0 and x[i] > 0. Which means that
l[i]^x[i]
gets replaced with
if (x[i] > 0) {
lpos[i]^x[i]
} else {
lneg[i]^x[i]
}
AdaBoost in this situation still uses the same definition of E=(x == y) for the reasons of its internal logic, but if we don't require this logic then we can define E=(x > 0) and easily implement the chance computation for it, since the requirement of both multipliers being the opposites would disappear.
This opens two possibilities for variations of the neurons:
(1) One with still the definition E=(x==y) but different multipliers for x[i] of different sign.
(2) One with the definition E=(x > 0)
A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure if it would be trainable at all, especially the version (2), since AdaBoost has good reasons to keep its definition of E for the convergence. But maybe it would, and considering that AdaBoost in this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.
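The product with the separate multipliers for the positive and negative inputs could be sketched as follows. This is my illustration, not a tested neuron implementation: the function name is made up, and lpos and lneg are just lists of positive numbers.

```python
def chance_product(x, lpos, lneg):
    """Product of the per-input multipliers, selected by the sign of x[i]."""
    result = 1.0
    for xi, lp, ln_ in zip(x, lpos, lneg):
        # Pick the multiplier by the sign of the input, as in the if/else above.
        base = lp if xi > 0 else ln_
        result *= base ** xi
    return result

# With equal multipliers this degenerates into the usual product( l[i]^x[i] ).
print(chance_product([1, -1], [2.0, 3.0], [2.0, 3.0]))
# With different multipliers the negative inputs use their own values.
print(chance_product([1, -1], [2.0, 3.0], [4.0, 5.0]))
```

The two lists of multipliers are where the doubled memory for the trained values comes from.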
This started as my thoughts on the field of Complex Event Processing, mostly about my OpenSource project Triceps. But now it's about all kinds of software-related things.
Sunday, June 4, 2017
emergents
I've attended a talk about the commercial use of the pre-trained neural networks. The idea is that you take a pre-trained network, throw away the top couple of layers, then put the blank layers on top instead and train only these new layers for your purpose.
The presenter gave an example of taking the Google's general image classifier network and teaching it to recognize your specific images instead: "huggable" vs "non-huggable". It was smart enough to recognize that a knitted cactus is huggable while a real one is not.
Well, it's kind of not surprising: the top layers of the classifier network are trained to recognize the various kinds of images, but to do that the layer below them must produce the characteristics that allow recognizing these various kinds of images. If the original classifier has the wide knowledge (and it does), the previous layer would know what is important to recognize a very wide range of images. They call the output of that previous layer "the emergents". And then when you use them as inputs, the new top layer would have an easy job classifying them for the new goal. You could probably even do a Bayesian classifier on top of them without much difference.
Saturday, June 3, 2017
Neuron in the Bayesian terms, part 2
I've been trying to describe in proper formulas my insight that a neuron in neural networks is a kind of a Bayesian machine (OK, this is apparently known to the people who do the neural networks professionally, but it was an insight for me), and I think that I've finally succeeded. Let's start from the beginning:
A neuron implements the function
y = nonlinear(sum(w[i] * x[i]))
Here y is the output of the neuron, x[i] is the array of its inputs, w[i] is the array of some weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in the other ranges, they can be trivially scaled to the range [-1, 1] by adjusting the weights w[i]).
The nonlinear function is generally some kind of a sigmoid function that matches what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value. Except of course that this nonlinear function deals not with the probabilities in range [0, 1] but with the values in range [-1, 1], -1 meaning "false" and 1 meaning "true".
The Bayesian formula uses the multiplication, not addition, so the first trick here is that one can be converted into another using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:
l[i] = exp(w[i])
w[i] = ln(l[i])
This can be substituted into the expression for the sum:
sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )
Using the rule ln(a)*b = ln(a^b), the sum can be converted further:
sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )
Then using the rule ln(a)+ln(b) = ln(a * b) the sum can be converted into a product:
sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )
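The conversion from the sum to the product is easy to check numerically. A minimal sketch with made-up weights and inputs; the function names are mine, and math.prod requires Python 3.8 or newer.

```python
import math

def weighted_sum(w, x):
    """The neuron's sum: sum( w[i] * x[i] )."""
    return sum(wi * xi for wi, xi in zip(w, x))

def log_of_product(w, x):
    """The same value computed as ln( product( l[i]^x[i] ) ), l[i] = exp(w[i])."""
    l = [math.exp(wi) for wi in w]
    return math.log(math.prod(li ** xi for li, xi in zip(l, x)))

w = [0.5, -1.2, 2.0]   # made-up weights
x = [1.0, -1.0, 0.3]   # made-up inputs in the range [-1, 1]
print(weighted_sum(w, x))
print(log_of_product(w, x))
```

Both values should match up to the floating-point rounding.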
Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:
P(H|E) = P(H) * P(E|H) / P(E)
So as each Bayesian event E[i] gets applied, we can say that the final probability will be:
P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )
Well, technically I maybe should have used another letter instead of i for the index, maybe j, but since the i in both formulas will be soon connected to each other, I'll just use it in both places from the start.
Instead of computing with probabilities we will be computing with chances. In case you haven't used the chances before or seen me mention them in the previous posts, a chance is the ratio of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:
Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))
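The example with the doors can be written down as a couple of small conversion functions. The function names are made up for the illustration; the exact fractions are used to avoid the floating-point noise.

```python
from fractions import Fraction

def chance(p):
    """Chance(H) = P(H) / (1 - P(H))."""
    return p / (1 - p)

def probability(c):
    """The reverse conversion: P(H) = Chance(H) / (1 + Chance(H))."""
    return c / (1 + c)

p = Fraction(2, 5)            # 2 prizes behind 5 doors
print(chance(p))              # prints 2/3
print(probability(chance(p))) # prints 2/5
```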
So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:
P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )
P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )
And the chances will be:
Chance(H) = P(H) / P(~H)
= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )
= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )
If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their relation will be 1 and can be thrown out of the formula:
Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )
Well, almost. The consideration above glossed over the question of what we do if the event is false. The answer is that in this case the factor in the product should use the probabilities of ~E instead of E:
Chance(H) = product(
if (E[i] is true) {
P(E[i]|H) / P(E[i]|~H)
} else {
P(~E[i]|H) / P(~E[i]|~H)
}
)
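The product with the if/else above could be sketched in code like this. The function and parameter names are my assumptions for the illustration; prior_chance stands for P0(H)/P0(~H) and defaults to 1, matching the equal initial probabilities.

```python
def chance_of_h(events, p_e_h, p_e_nh, prior_chance=1.0):
    """Apply the events one by one to the chance of the hypothesis H.

    events[i] is True if E[i] happened, p_e_h[i] is P(E[i]|H),
    p_e_nh[i] is P(E[i]|~H).
    """
    result = prior_chance
    for e, peh, penh in zip(events, p_e_h, p_e_nh):
        if e:
            result *= peh / penh                  # P(E|H) / P(E|~H)
        else:
            result *= (1.0 - peh) / (1.0 - penh)  # P(~E|H) / P(~E|~H)
    return result

# Made-up probabilities: the first event is true, the second one is false.
print(chance_of_h([True, False], [0.8, 0.6], [0.4, 0.3]))
```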
Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probability will be:
P(E[i]|H) = P(y == x[i])
P(~E[i]|~H) = P(y == x[i])
An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that
P(E|H) = P(~E|~H)
P(E|~H) = P(~E|H)
We then use the general property that
P(E|H) = 1 - P(~E|H)
P(E|~H) = 1 - P(~E|~H)
and get
P(E|~H) = P(~E|H) = 1 - P(E|H)
So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:
Chance(H) = product(
if (E[i] is true) {
P(E[i]|H) / (1 - P(E[i]|H))
} else {
(1 - P(E[i]|H)) / P(E[i]|H)
}
)
When x[i] is in range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by the power of x[i]: when x[i]==1, it will leave the expression as-is, when x[i]==-1, the expression will be "turned over":
Chance(H) = product( ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )
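A quick check that the power gives the same factor as the if/else for the inputs from the set {-1, 1}. The function names are made up for the illustration, and p stands for P(E[i]|H).

```python
def factor_if_else(p, x):
    """The factor selected by if/else: the ratio for x == 1, turned over otherwise."""
    ratio = p / (1.0 - p)
    return ratio if x == 1 else 1.0 / ratio

def factor_power(p, x):
    """The same factor expressed as the power of x."""
    return (p / (1.0 - p)) ** x

for x in (1, -1):
    print(x, factor_if_else(0.8, x), factor_power(0.8, x))

# For the values in between, the power interpolates smoothly;
# x == 0 gives the factor of 1, i.e. the input carries no information.
print(factor_power(0.8, 0))
```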
We can now interpret the original formula in terms of the chances. That formula was:
y = nonlinear(sum(w[i] * x[i]))
= nonlinear( ln( product( l[i]^x[i] ) ) )
Here we can substitute the product:
y = nonlinear( ln( Chance(y > 0) ) )
A logarithm is a very convenient thing to apply on the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].
This substitution of the product will mean:
l[i] = P(E[i]|H) / (1 - P(E[i]|H))
w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )
So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.
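Putting it all together, a neuron with the weights computed from the probabilities could be sketched as follows. This is only an illustration under my assumptions: the function names are made up, tanh is picked as one possible sigmoid-like nonlinear function with the range [-1, 1], and p[i] stands for P(E[i]|H).

```python
import math

def weight_from_probability(p):
    """w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )."""
    return math.log(p / (1.0 - p))

def log_chance(x, p):
    """ln( Chance(y > 0) ) computed through the product of the chance multipliers."""
    return math.log(math.prod((pi / (1.0 - pi)) ** xi for pi, xi in zip(p, x)))

def neuron(x, p, nonlinear=math.tanh):
    """y = nonlinear( sum( w[i] * x[i] ) ) with the weights derived from p."""
    w = [weight_from_probability(pi) for pi in p]
    return nonlinear(sum(wi * xi for wi, xi in zip(w, x)))

x = [1.0, -1.0]   # made-up inputs
p = [0.7, 0.6]    # made-up P(E[i]|H)
print(neuron(x, p))
print(math.tanh(log_chance(x, p)))
```

Both printed values should match up to the floating-point rounding. An input with P(E[i]|H) = 0.5 gets the weight of 0, i.e. it doesn't affect the result, which agrees with it being useless as a predictor.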