I want to elaborate on why the second trick is so important. The reason is that, in general, the condition it relies on does not hold: the chance does not normally change in opposite ways on an event (multiplied by some number when the event is true and divided by the same number when it is false). It takes a specifically defined event to make that true.

In general if an event is true, the chance of a hypothesis after applying this event is:

Chance(H|E) = P(H|E) / P(~H|E)

The probabilities change on an event as follows:

P(H|E) = P(H) * P(E|H) / P(E)

P(~H|E) = P(~H) * P(E|~H) / P(E)

So then the chance changes (after substituting and reducing the formula):

Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)
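The substitution step can be spelled out as follows, using only the definitions above plus Chance(H) = P(H) / P(~H). This is just the same derivation written in LaTeX notation; the point to notice is that P(E) cancels out.

```latex
\[
\mathrm{Chance}(H \mid E)
  = \frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(H)\,P(E \mid H)\,/\,P(E)}{P(\neg H)\,P(E \mid \neg H)\,/\,P(E)}
  = \frac{P(H)}{P(\neg H)} \cdot \frac{P(E \mid H)}{P(E \mid \neg H)}
  = \mathrm{Chance}(H) \cdot \frac{P(E \mid H)}{P(E \mid \neg H)}
\]
```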

If an event is false, the chance changes in a similar but different way:

Chance(H|~E) = P(H|~E) / P(~H|~E)

= Chance(H) * P(~E|H) / P(~E|~H)
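As a sanity check, here is a minimal Python sketch of both updates, compared against the posterior computed directly from Bayes' formula. The function names and the numbers are made up for illustration; they are not from any library or real data.

```python
def chance(p):
    """Convert a probability into a chance (odds): p / (1 - p)."""
    return p / (1.0 - p)

def update_chance(prior_chance, p_e_given_h, p_e_given_not_h, e_is_true):
    """Apply one event to the chance of H.

    If E is true, multiply by P(E|H) / P(E|~H);
    if E is false, multiply by P(~E|H) / P(~E|~H).
    """
    if e_is_true:
        return prior_chance * p_e_given_h / p_e_given_not_h
    return prior_chance * (1.0 - p_e_given_h) / (1.0 - p_e_given_not_h)

def posterior_by_bayes(p_h, p_e_given_h, p_e_given_not_h, e_is_true):
    """Compute P(H|E) or P(H|~E) directly from Bayes' formula, for comparison."""
    like_h = p_e_given_h if e_is_true else 1.0 - p_e_given_h
    like_not_h = p_e_given_not_h if e_is_true else 1.0 - p_e_given_not_h
    p_event = p_h * like_h + (1.0 - p_h) * like_not_h
    return p_h * like_h / p_event

# Illustrative numbers only (assumed, not from any real data).
P_H, P_E_H, P_E_NOT_H = 0.3, 0.8, 0.4

for e_is_true in (True, False):
    via_chance = update_chance(chance(P_H), P_E_H, P_E_NOT_H, e_is_true)
    via_bayes = chance(posterior_by_bayes(P_H, P_E_H, P_E_NOT_H, e_is_true))
    # The two columns should agree up to floating-point rounding.
    print(e_is_true, via_chance, via_bayes)
```

Both routes give the same chance, which is all the formulas above claim.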

Being able to do the "opposites" requires that

Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )

Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)

Which then means after doing the substitution:

P(E|H) / P(E|~H) = P(~E|~H) / P(~E|H)

Which is not always true. But if we take a page from AdaBoost's book and define an event such that

P(E|H) = P(~E|~H)

and

P(E|~H) = P(~E|H)

then it all magically works. Again, this magic is not needed for the Bayesian computation in general; it's needed only to fit the computation to the formula used in neural networks and in AdaBoost.
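A quick numerical check of the "opposites" requirement might look like the sketch below. The helper names and the probabilities are assumptions picked for illustration. Note that both symmetry conditions reduce to P(E|~H) = 1 - P(E|H), so the symmetric case needs only one free number.

```python
import math

def multipliers(p_e_given_h, p_e_given_not_h):
    """Return the chance multipliers for E being true and for E being false."""
    m_true = p_e_given_h / p_e_given_not_h
    m_false = (1.0 - p_e_given_h) / (1.0 - p_e_given_not_h)
    return m_true, m_false

def are_opposites(p_e_given_h, p_e_given_not_h):
    """Check whether the two multipliers are reciprocals of each other."""
    m_true, m_false = multipliers(p_e_given_h, p_e_given_not_h)
    return math.isclose(m_true * m_false, 1.0)

# Arbitrary event: P(E|H) = 0.8, P(E|~H) = 0.4 (assumed values).
print(multipliers(0.8, 0.4), are_opposites(0.8, 0.4))

# Symmetric event: P(E|H) = P(~E|~H) = 0.8, so P(E|~H) = P(~E|H) = 0.2.
print(multipliers(0.8, 0.2), are_opposites(0.8, 0.2))
```

With the arbitrary event the two multipliers are not reciprocals, while with the symmetric event they are.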

There are versions of AdaBoost that use different multipliers for x[i] < 0 and for x[i] > 0, which means that

l[i]^x[i]

gets replaced with

if (x[i] > 0) {
    lpos[i]^x[i]
} else {
    lneg[i]^x[i]
}

In this situation AdaBoost still uses the same definition E=(x == y), for reasons of its internal logic. But if we don't need that logic, we can define E=(x > 0) and easily implement the chance computation for it, since the requirement that the two multipliers be opposites disappears.
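Here is a rough Python sketch of what the chance computation with E=(x > 0) could look like. It is only an illustration under assumptions of mine: the inputs x[i] are -1 or +1, the inputs are treated as independent given H (so their multipliers can simply be multiplied together), and the function names are hypothetical. The lpos and lneg names follow the formula above.

```python
def multipliers_for_positive_event(p_pos_given_h, p_pos_given_not_h):
    """Derive (lpos, lneg) for the event E = (x > 0).

    lpos is the multiplier for E being true: P(E|H) / P(E|~H).
    lneg is stored so that lneg ** -1 equals the multiplier for E being
    false, P(~E|H) / P(~E|~H), matching the form lneg[i]^x[i] with x[i] = -1.
    """
    lpos = p_pos_given_h / p_pos_given_not_h
    lneg = (1.0 - p_pos_given_not_h) / (1.0 - p_pos_given_h)
    return lpos, lneg

def chance_with_split_multipliers(prior_chance, x, lpos, lneg):
    """Fold inputs x[i] in {-1, +1} into a chance using separate multipliers.

    Assumes the inputs are independent given H, which is what lets the
    per-input multipliers be applied one after another.
    """
    chance = prior_chance
    for xi, lp, ln in zip(x, lpos, lneg):
        chance *= lp ** xi if xi > 0 else ln ** xi
    return chance

# Two inputs with made-up probabilities of being positive given H and ~H.
pairs = [multipliers_for_positive_event(0.8, 0.4),
         multipliers_for_positive_event(0.6, 0.5)]
lpos = [p[0] for p in pairs]
lneg = [p[1] for p in pairs]

# Prior chance 1.0 means P(H) = 0.5 before looking at the inputs.
print(chance_with_split_multipliers(1.0, [1, -1], lpos, lneg))
```

Nothing here forces lpos[i] and lneg[i] to be equal, which is exactly the freedom gained by dropping the "opposites" requirement.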

This opens two possibilities for variations of the neurons:

(1) One that keeps the definition E=(x==y) but uses different multipliers for x[i] of different signs.

(2) One with the definition E=(x > 0)

A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure it would be trainable at all, especially version (2), since AdaBoost has good reasons to keep its definition of E for the sake of convergence. But maybe it would, and considering that AdaBoost with this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.
