## Saturday, June 3, 2017

### Neuron in the Bayesian terms, part 2

I've been trying to describe in proper formulas my insight that a neuron in a neural network is a kind of Bayesian machine (OK, this is apparently known to the people who work with neural networks professionally, but it was an insight for me), and I think that I've finally succeeded. Let's start from the beginning:

A neuron implements the function

y = nonlinear(sum(w[i] * x[i]))

Here y is the output of the neuron, x[i] is the array of its inputs, and w[i] is the array of weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in other ranges, they can be trivially scaled to the range [-1, 1] by adjusting the weights w[i]).

The nonlinear function is generally some kind of a sigmoid function, which matches what we do at the end of the Bayesian computations. If the probability of a hypothesis is above some level, we accept it as true, in other words pull it up to 1. If it's below some equal or lower level, we reject it, i.e. pull it down to 0. And if these levels are not the same and the probability is in the middle, then we leave it as some value in the middle, probably modified in some way from the original value. The difference, of course, is that this nonlinear function deals not with probabilities in the range [0, 1] but with values in the range [-1, 1], with -1 meaning "false" and 1 meaning "true".

The Bayesian formula uses multiplication, not addition, so the first trick here is that one can be converted into the other using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:

l[i] = exp(w[i])
w[i] = ln(l[i])

This can be substituted into the expression for the sum:

sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )

Using the rule ln(a)*b = ln(a^b), the sum can be converted further:

sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )

Then using the rule ln(a) + ln(b) = ln(a * b) the sum can be converted into a product:

sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )
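The chain of conversions above can be sanity-checked numerically. This is only a minimal sketch: the weights and inputs are made-up values chosen for illustration, not anything from a trained network.

```python
import math

# Hypothetical weights and inputs, chosen only for illustration.
w = [0.8, -0.3, 1.2]
x = [1.0, -1.0, 0.5]

# l[i] = exp(w[i]), so that w[i] = ln(l[i]).
l = [math.exp(wi) for wi in w]

# Left side: the usual weighted sum of the neuron.
weighted_sum = sum(wi * xi for wi, xi in zip(w, x))

# Right side: the logarithm of the product of l[i]^x[i].
log_of_product = math.log(math.prod(li ** xi for li, xi in zip(l, x)))

# The two values agree up to floating-point rounding.
print(weighted_sum, log_of_product)
```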

Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:

P(H|E) = P(H) * P(E|H) / P(E)

So as each Bayesian event E[i] gets applied, we can say that the final probability will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

Well, technically I maybe should have used another letter instead of i for the index, maybe j, but since the i in both formulas will be soon connected to each other, I'll just use it in both places from the start.

Instead of computing with probabilities we will be computing with chances. In case you haven't used chances before or seen me mention them in the previous posts, a chance (also called odds) is the ratio of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:

Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))
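The doors example can be written out as a small sketch. The function names `chance` and `probability` are invented here for illustration; the inverse conversion P = C / (1 + C) is not in the text above but follows directly from solving the formula for P(H).

```python
def chance(p):
    """Convert a probability into a chance: P / (1 - P)."""
    return p / (1.0 - p)

def probability(c):
    """Convert a chance back into a probability: C / (1 + C)."""
    return c / (1.0 + c)

# 5 doors, prizes behind 2 of them.
p_prize = 2 / 5
c_prize = chance(p_prize)

# The chance comes out as 2/3, and converting back recovers 2/5.
print(c_prize, probability(c_prize))
```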

So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )
P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )

And the chances will be:

Chance(H) = P(H) / P(~H)
= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )
= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )
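The cancellation of P(E[i]) can be checked on numbers. The sketch below assumes that the events are conditionally independent given H and given ~H (the usual naive Bayes assumption, which is my assumption here and is not stated above), that all the events are true, and it uses made-up conditional probabilities. It compares the chance computed directly from the normalized probabilities with the product of the ratios.

```python
import math

# Hypothetical values for illustration only.
prior_h = 0.5                       # P0(H)
p_e_given_h = [0.9, 0.7, 0.6]       # P(E[i]|H)
p_e_given_not_h = [0.2, 0.4, 0.5]   # P(E[i]|~H)

# Direct route: unnormalized weights of H and ~H, then normalize.
joint_h = prior_h * math.prod(p_e_given_h)
joint_not_h = (1.0 - prior_h) * math.prod(p_e_given_not_h)
posterior_h = joint_h / (joint_h + joint_not_h)
chance_direct = posterior_h / (1.0 - posterior_h)

# Chance route: prior chance times the product of the ratios.
chance_by_ratios = (prior_h / (1.0 - prior_h)) * math.prod(
    ph / pn for ph, pn in zip(p_e_given_h, p_e_given_not_h)
)

# Both routes give the same chance, up to floating-point rounding.
print(chance_direct, chance_by_ratios)
```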

If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their ratio will be 1 and can be thrown out of the formula:

Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )

Well, almost. The consideration above glossed over the question of what to do if the event is false. The answer is that in this case the factor in the product should use the probabilities of ~E instead of E:

Chance(H) = product(
if (E[i] is true) {
P(E[i]|H) / P(E[i]|~H)
} else {
P(~E[i]|H) / P(~E[i]|~H)
}
)

Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probabilities will be:

P(E[i]|H) = P(y == x[i])
P(~E[i]|~H) = P(y == x[i])

An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that

P(E|H) = P(~E|~H)
P(E|~H) = P(~E|H)
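As a sketch of how such a probability might be estimated, the code below counts how often one input agrees with the output across a whole training set, without splitting the data by the sign of x[i]. The training pairs are invented for illustration, and the values are restricted to the set {-1, 1} to keep the comparison exact.

```python
# Hypothetical training data for one input: (x_i, y) pairs in {-1, 1}.
samples = [(1, 1), (1, 1), (-1, -1), (-1, 1),
           (1, -1), (-1, -1), (1, 1), (-1, -1)]

def agreement_rate(pairs):
    """Fraction of pairs where x_i equals y, an estimate of P(y == x[i])."""
    return sum(1 for xi, y in pairs if xi == y) / len(pairs)

# One number computed over the whole data set serves as both
# P(E[i]|H) and P(~E[i]|~H).
p_agree = agreement_rate(samples)

# Its complement serves as both P(E[i]|~H) and P(~E[i]|H).
p_disagree = 1.0 - p_agree

print(p_agree, p_disagree)
```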

We then use the general property that

P(E|H) = 1 - P(~E|H)
P(E|~H) = 1 - P(~E|~H)

and get

P(E|~H) = P(~E|H) = 1 - P(E|H)

So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:

Chance(H) = product(
if (E[i] is true) {
P(E[i]|H) / (1 - P(E[i]|H))
} else {
(1 - P(E[i]|H)) / P(E[i]|H)
}
)

When x[i] is in the range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by the power of x[i]: when x[i]==1, it will leave the expression as-is, and when x[i]==-1, the expression will be "turned over":

Chance(H) = product(  ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )
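The equivalence of the two forms for x[i] in the set {-1, 1} can be checked with a short sketch. The function names and the probabilities are hypothetical, and an event is treated as true exactly when x[i] == 1, as in the replacement described above.

```python
import math

def chance_if_else(p, events):
    """Chance(H) with an explicit branch on whether each event is true."""
    result = 1.0
    for pi, is_true in zip(p, events):
        if is_true:
            result *= pi / (1.0 - pi)
        else:
            result *= (1.0 - pi) / pi
    return result

def chance_power(p, x):
    """Chance(H) with the branch replaced by raising to the power x[i]."""
    return math.prod((pi / (1.0 - pi)) ** xi for pi, xi in zip(p, x))

# Hypothetical values of P(E[i]|H) and inputs in {-1, 1}.
p = [0.9, 0.6, 0.75]
x = [1, -1, 1]
events = [xi == 1 for xi in x]

# Both forms give the same chance, up to floating-point rounding.
print(chance_if_else(p, events), chance_power(p, x))
```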

We can now interpret the original formula in terms of the chances. That formula was:

y = nonlinear(sum(w[i] * x[i]))
= nonlinear( ln( product( l[i]^x[i] ) ) )

Here we can substitute the product:

y = nonlinear( ln( Chance(y > 0) ) )

A logarithm is a very convenient thing to apply to the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].

This substitution of the product will mean:

l[i] = P(E[i]|H) / (1 - P(E[i]|H))
w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )

So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.
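To make this interpretation concrete, here is a minimal sketch that builds the weights from hypothetical values of P(E[i]|H) and runs the neuron. The choice of tanh(z/2) as the nonlinear function is my assumption for illustration, not something fixed by the derivation above. It has the convenient property that the output equals 2*P(H) - 1, i.e. the probability of H rescaled from [0, 1] to [-1, 1].

```python
import math

def weights_from_probabilities(p):
    """w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )."""
    return [math.log(pi / (1.0 - pi)) for pi in p]

def neuron(w, x):
    """y = nonlinear(sum(w[i] * x[i])), with tanh(z/2) assumed as the nonlinearity."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.tanh(z / 2.0)

# Hypothetical values of P(E[i]|H) and inputs in {-1, 1}.
p = [0.9, 0.6, 0.75]
x = [1, -1, 1]

w = weights_from_probabilities(p)
y = neuron(w, x)

# The same result through the chances: Chance(H), then P(H), then rescale.
chance_h = math.prod((pi / (1.0 - pi)) ** xi for pi, xi in zip(p, x))
p_h = chance_h / (1.0 + chance_h)

print(y, 2.0 * p_h - 1.0)
```

An input that agrees with the output only half of the time gets P(E[i]|H) = 0.5 and thus the weight ln(1) = 0, so it doesn't affect the result at all.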