Tuesday, June 28, 2016

AdaBoost 5: the square root and the confidence value

The question of what should be used as the confidence value for Bayesian computations is a thorny one. In my previous take on it I've used the value without a square root, while the classic AdaBoost formula uses the one with the square root. Which one is more right?

Both of them produce the same classification decisions for the cases. The difference is in the absolute value of the chances or, in the AdaBoost terms, the margin (with the classic AdaBoost also applying the logarithm on top).

I still think that using the value without the square root has the more correct "physical meaning". But either can be used with the same result.
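
To make that concrete, here is a tiny Python sketch of my own (not part of the original argument), using the two confidence formulas derived later in this post: C = sqrt(1-e)/(sqrt(1-e)+sqrt(e)) with the square root and C = 1-e without it. For a few hypothetical per-round training errors and votes, the accumulated chance differs in magnitude between the two versions but falls on the same side of 1, so the decision comes out the same.

import math

def conf_sqrt(e):
    # confidence with the square root: C = sqrt(1-e) / (sqrt(1-e) + sqrt(e))
    return math.sqrt(1 - e) / (math.sqrt(1 - e) + math.sqrt(e))

def conf_nosqrt(e):
    # confidence without the square root: C = 1 - e
    return 1 - e

def chance(votes, errs, conf):
    # accumulate the chance by multiplying C/(1-C) for a +1 vote
    # and (1-C)/C for a -1 vote, as in the final computation of the chances
    c = 1.0
    for v, e in zip(votes, errs):
        C = conf(e)
        c *= C / (1 - C) if v > 0 else (1 - C) / C
    return c

# hypothetical per-round training errors and votes of the partial hypotheses
errs = [0.3, 0.2, 0.4, 0.1]
votes = [+1, +1, -1, +1]

ch_sqrt = chance(votes, errs, conf_sqrt)
ch_nosqrt = chance(votes, errs, conf_nosqrt)
print(ch_sqrt, ch_nosqrt)              # different magnitudes (margins)...
print(ch_sqrt > 1.0, ch_nosqrt > 1.0)  # ...but the same decision

The chance accumulated without the square root comes out as the square of the one with it, which is why the comparison against 1 can never come out differently between the two.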

In the classic AdaBoost, the version of the confidence with the square root follows from the expression

Z = sum for all good cases i (Wi / S) + sum for all bad cases j (Wj * S)
Z = sum for all good cases i (Wi * (1-C) / C) + sum for all bad cases j (Wj * C / (1-C))

Just like in the last post, W is the weight of a particular training case, the same as D(i) in the classic AdaBoost terms; since I've been using weights rather than a distribution, the letter W comes easier to my mind.

The "physical meaning" here is that the weights of the training cases for the next round of boosting are adjusted opposite to how the chances of these training cases get affected by this round during the final computation. If a training case gets predicted correctly by a partial hypothesis, its weight will get multiplied by C and the weights of the other cases will get multiplied by (1-C), so its chances will get multiplied by C/(1-C), and for the next round of boosting the weights of the correctly predicted cases get divided by that to compensate. For the cases that get predicted incorrectly, the opposite happens.
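
As a small sketch of this compensation (mine, not from the post): in one round of the square-root version the weights of the correctly predicted cases get divided by S = C/(1-C) and the weights of the incorrectly predicted ones get multiplied by it, the exact inverse of how this round will later change their chances.

import math

def adjust_weights(weights, correct, e):
    # weights - the per-case weights W_i going into this round
    # correct - True for the cases this round's partial hypothesis predicts correctly
    # e       - the weighted training error of that partial hypothesis
    S = math.sqrt((1 - e) / e)   # S = C/(1-C)
    return [w / S if ok else w * S for w, ok in zip(weights, correct)]

# toy round: four cases of weight 1, the hypothesis gets the last one wrong, e = 1/4
print(adjust_weights([1, 1, 1, 1], [True, True, True, False], e=0.25))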

In formulas, this translates to

C / (1-C) = S = sqrt((1-e) / e) = sqrt(1-e) / sqrt(e)

The last expression does NOT mean

C = sqrt(1-e)

Instead, it means

C = sqrt(1-e) / (sqrt(1-e) + sqrt(e))
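
A quick numeric check of this distinction (my own, just for illustration): C = sqrt(1-e)/(sqrt(1-e)+sqrt(e)) satisfies C/(1-C) = sqrt((1-e)/e), while the naive reading C = sqrt(1-e) does not.

import math

e = 0.2
S = math.sqrt((1 - e) / e)                                      # the target ratio

C_right = math.sqrt(1 - e) / (math.sqrt(1 - e) + math.sqrt(e))
C_wrong = math.sqrt(1 - e)

print(C_right / (1 - C_right), S)   # these match (both 2, up to floating point)
print(C_wrong / (1 - C_wrong), S)   # these do not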

Alternatively, the approach without the square root starts with the premise

Z = sum for all good cases i (Wi / C) + sum for all bad cases j (Wj / (1-C))

The "physical meaning" is as described before: the weights of the training cases for the next round of boosting are adjusted opposite to how the weights of these training cases get affected by this round during the final computation. It seems to me that compensating the weights for the changes in weights is the more correct "physical meaning" than compensating the weights for the changes in chances.

This translates to

C / (1-C) = S^2 = sqrt((1-e) / e)^2 = (1-e) / e

and

C = (1-e)
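
For the same training error e, this C relates to the square-root version in a simple way: its odds C/(1-C) are the square of the other version's odds, S^2 versus S. A short check (my own sketch):

import math

for e in (0.1, 0.25, 0.4):
    C_sqrt = math.sqrt(1 - e) / (math.sqrt(1 - e) + math.sqrt(e))
    C_plain = 1 - e
    odds_sqrt = C_sqrt / (1 - C_sqrt)      # = sqrt((1-e)/e) = S
    odds_plain = C_plain / (1 - C_plain)   # = (1-e)/e = S^2
    print(e, odds_plain, odds_sqrt ** 2)   # the last two values match (up to floating point)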

By the way, chasing this version through the derivatives as shown in the previous post was interesting. I've come to appreciate why the authors of AdaBoost brought the exponents into the formulas: doing the derivatives and integrals with the exponents is much easier than without them. Then I've realized that with the derivatives where S = exp(Alpha) I'm computing dZ/dAlpha, not dZ/dC, and that is the correct approach. So without the exponents I should be computing dZ/dS, and that gets easy again.
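
As a side illustration of why dZ/dS is the easy one (my sketch, done on the symmetric form Z = Wgood/S + Wbad*S that also appears in the listing below), the minimum comes out mechanically at S = sqrt(Wgood/Wbad):

import sympy as sp

S, Wgood, Wbad = sp.symbols('S Wgood Wbad', positive=True)
Z = Wgood / S + Wbad * S            # the symmetric form of the bound
print(sp.diff(Z, S))                # the derivative, Wbad - Wgood/S**2
print(sp.solve(sp.diff(Z, S), S))   # the single positive root, sqrt(Wgood/Wbad)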

So, the real version of AdaBoost described in the third part is this:



Given: (x_1, y_1), ..., (x_m, y_m) where x_i belongs to X, y_i belongs to {-1, +1}.
Initialize: D_1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the weights D_t.
  • Get weak hypothesis h_t: X -> {-1, +1}.
  • Aim: select h_t to minimize the bound on the training error Z_t, where:
    Wgood_t = 0; Wbad_t = 0;
    for i = 1, ..., m {
         if h_t(x_i) = y_i {
              Wgood_t += D_t(i)
         } else {
              Wbad_t += D_t(i)
         }
    }
    C_t = Wgood_t / (Wgood_t + Wbad_t)
    Z_t = Wgood_t/C_t + Wbad_t/(1-C_t)
    which can also be represented symmetrically through S_t:
    Z_t = Wgood_t/S_t + Wbad_t*S_t
    S_t = sqrt(C_t / (1-C_t)) = sqrt(Wgood_t / Wbad_t)
    and substituting S_t:
    Z_t = Wgood_t / sqrt(Wgood_t / Wbad_t) + Wbad_t * sqrt(Wgood_t / Wbad_t)
    = 2 * sqrt(Wgood_t * Wbad_t)
    This gets minimized when either of Wgood_t or Wbad_t gets minimized; to be definitive, we prefer to minimize Wbad_t.
  • Update:
    for i = 1, ..., m {
         if h_t(x_i) != y_i {
              D_{t+1}(i) = D_t(i) / (1-C_t)
         } else {
              D_{t+1}(i) = D_t(i) / C_t
         }
    }
Produce the function for computing the value of the final hypothesis:
H(x) {
     chance = 1;
     for t = 1, ..., T {
          if (h_t(x) > 0) {
               chance *= C_t/(1-C_t);
          } else {
               chance *= (1-C_t)/C_t;
          }
     }
     return sign(chance - 1)
}
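
And since the listing is compact enough, here is a runnable Python sketch of it. This is entirely my own illustration rather than code from this series: the weak learner is assumed to be a simple decision stump (a threshold on one feature) picked to minimize Wbad_t as the listing prefers, the weights start at 1 without normalization, the final hypothesis multiplies the chances exactly as H(x) above, and the function names and the toy data set at the end are made up just to exercise the code.

import random

def stump_predict(stump, x):
    # a decision stump: +1 if the chosen feature exceeds the threshold, else -1,
    # with the polarity possibly flipping the answer
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def train_stump(xs, ys, weights):
    # the basic algorithm: pick the stump that minimizes Wbad, the total weight
    # of the cases it predicts incorrectly
    best, best_wbad = None, float("inf")
    for feature in range(len(xs[0])):
        for threshold in sorted(set(x[feature] for x in xs)):
            for polarity in (+1, -1):
                stump = (feature, threshold, polarity)
                wbad = sum(w for x, y, w in zip(xs, ys, weights)
                           if stump_predict(stump, x) != y)
                if wbad < best_wbad:
                    best, best_wbad = stump, wbad
    return best

def adaboost_train(xs, ys, rounds):
    # the listing above: D_1(i) = 1, C_t = Wgood_t / (Wgood_t + Wbad_t),
    # update D_{t+1}(i) = D_t(i)/C_t for correct cases and D_t(i)/(1-C_t) for incorrect
    weights = [1.0] * len(xs)
    hypotheses = []                          # list of (stump, C_t)
    for t in range(rounds):
        stump = train_stump(xs, ys, weights)
        wgood = sum(w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(stump, x) == y)
        wbad = sum(weights) - wgood
        if wbad == 0.0:                      # a perfect stump: cap C_t to avoid dividing by 0
            hypotheses.append((stump, 0.999999))
            break
        c = wgood / (wgood + wbad)
        hypotheses.append((stump, c))
        weights = [w / c if stump_predict(stump, x) == y else w / (1.0 - c)
                   for x, y, w in zip(xs, ys, weights)]
    return hypotheses

def adaboost_classify(hypotheses, x):
    # H(x) from the listing: multiply the chances and compare against 1
    chance = 1.0
    for stump, c in hypotheses:
        if stump_predict(stump, x) > 0:
            chance *= c / (1.0 - c)
        else:
            chance *= (1.0 - c) / c
    return 1 if chance > 1.0 else -1

# a made-up test: label the points of the unit square by whether x0 + x1 > 1,
# a boundary that no single stump can represent
random.seed(1)
xs = [(random.random(), random.random()) for _ in range(200)]
ys = [1 if x[0] + x[1] > 1.0 else -1 for x in xs]
hyps = adaboost_train(xs, ys, rounds=10)
errors = sum(1 for x, y in zip(xs, ys) if adaboost_classify(hyps, x) != y)
print("training errors:", errors, "out of", len(xs))

Taking the logarithm of the accumulated chance would turn the product into the classic-style weighted vote, matching the note above about the classic AdaBoost applying the logarithm on top.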



Having this sorted out, I can move on to the more creative versions of the algorithm, where on a step of boosting different training cases may get stamped with different confidence values C.
