Tuesday, August 1, 2017

Bayes 27 & AdaBoost: another problem with partial confidence

The solution from the previous installment can handle partial confidence, but it has its own issue. To demonstrate it, let's look at a run with the table from part 26, where both of the fully correlated events E1 and E2 get a confidence of 0.7:

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421
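
As a reminder of the mechanics from part 26 (my reconstruction from the printed numbers, shown for the single-branch event E1): an event applied at confidence C multiplies each hypothesis's weight by P*C + (1-P)*(1-C), where P is that hypothesis's table probability of the event.

# Sketch (mine, reconstructed, not a quote from the script): the
# single-branch weight update, reproducing the E1 step of the run above.
my $c = 0.7;   # the confidence of E1
# hyp1's table probability is 0.66667, hyp2's is 0.33333 (from the table)
printf "hyp1 w=%.5f\n", 3.0 * (0.66667 * $c + 0.33333 * (1. - $c));   # 1.70000
printf "hyp2 w=%.5f\n", 3.0 * (0.33333 * $c + 0.66667 * (1. - $c));   # 1.30000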

The probabilities move after applying E1, and move again after E2. The interesting question is, should they have moved after E2? It really depends on how E1 and E2 have been measured: are their confidences derived independently, or do they track back to the same measurement?

The code assumes that they are independent, so if both are independently measured at a confidence of 0.7, together they provide a higher confidence. But if they both come from the same test, this assumption is wrong.
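
To put a number on "higher confidence", here is a small sketch of mine (not code from this series): assuming a 50-50 prior and truly independent measurements, two confidence-0.7 readings of the same event combine multiplicatively in the odds space.

# Sketch (mine): combining two independent confidence-0.7 readings.
my $c = 0.7;
my $odds = ($c / (1. - $c)) ** 2;    # odds from two independent readings
printf "joint confidence: %.5f\n", $odds / (1. + $odds);   # about 0.84483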

If they came from the same test, then E2 carries no additional information after E1 has been processed. In this case we should have treated the value C(E2)=0.7 (i.e. C(E2)=C(E1)) as meaning "I don't know", and then only a deviation from that value would carry new information.

But this approach has its own problems. First of all, it's fine if the confidence values C(E) represent measured probabilities. But if they represent some expert estimation, then an expert saying "I don't know" and giving a confidence of 0.5 would shift the probabilities instead of leaving them unchanged. Well, it could be re-scaled automatically, but then how do we know that the expert means "I don't know" and not "I've carefully considered everything and I think that it's a 50-50 chance"? So it sounds like getting it right would require two separate numbers: the estimation itself, and separately the absolute confidence that tells how sure we are of that estimation. Which would create all kinds of messes.
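
For illustration, here is one possible shape of such a re-scaling (a sketch of mine, not code from this series): pick a baseline b that represents "I don't know" and divide its odds out, so that a confidence equal to the baseline maps to the neutral 0.5.

# Sketch (mine, not from ex27_01run.pl): re-scale a confidence $c so that
# the baseline $b (the "I don't know" point) maps to the neutral 0.5,
# by dividing the odds of $c by the odds of $b.
sub rescale_confidence {
    my ($c, $b) = @_;
    my $odds = ($c * (1. - $b)) / ((1. - $c) * $b);
    return $odds / (1. + $odds);
}
# rescale_confidence(0.7, 0.7) returns 0.5: no new information;
# rescale_confidence(0.9, 0.7) returns about 0.79412: only the excess counts.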

The second problem: if we want to do such a re-scaling, to which value do we re-scale? It's reasonably easy if two events are copies of each other, but what if the correlation is only partial? There actually is a rather easy answer: the needed value would be P(E), the probability that the event is true at that point (after the previous events have been considered). I actually went and modified the code to compute P(E) based on the weights, only to find out that I'd forgotten an important property of the AdaBoost logic. Let me show its output:

$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1                     ,E2                     ,E3                     ,
hyp1   ,3.00000,0.66667^0.66667@0.66667,0.66667^0.66667@0.66667,0.40000^0.33333@0.50000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.40000^0.33333@0.50000,
hyp2   ,3.00000,0.33333^0.33333@0.33333,0.33333^0.33333@0.33333,0.60000^0.66667@0.75000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.60000^0.66667@0.75000,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
Expected P(E)=0.500000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
Expected P(E)=0.513333
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000
Expected P(E)=0.598615
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421

The computation table now has one more element printed after "@": the estimation of the event's probability for the particular branch, based on the previous events. During the computation these values get mixed together according to the weights of the branches. But notice that when applying E2, the message says:

Expected P(E)=0.513333

It's 0.513333, not 0.7! The reason is two-fold. First, AdaBoost works by adjusting the weights of the training cases so as to bring the probability of the last seen event to 0.5; it does exactly the opposite of the correction I was trying to achieve. Second, the probabilities in the table are the wrong ones for this purpose. They are computed per-branch based on the absolute confidence, so they can't tell whether the actual confidence was 0.7 or 0.3: both look the same from the standpoint of the absolute confidence. But this difference matters a lot when trying to predict P(E)!
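
To illustrate the first point, here is a small sketch of mine in classic AdaBoost terms (not this series' code): after re-weighting the training cases with alpha = 0.5*ln((1-err)/err), the weighted error of the weak classifier that was just applied lands at exactly 0.5, i.e. the weights get pulled toward the neutral point rather than away from it.

# Sketch (mine): the classic AdaBoost re-weighting drives the weighted
# error of the just-used weak classifier to exactly 0.5.
use strict; use warnings;
my @y = (1, 1, -1, -1, 1);    # true labels of the training cases
my @h = (1, -1, -1, 1, 1);    # the weak classifier's answers
my @w = (0.2) x 5;            # initial uniform case weights
my $err = 0.;
for my $i (0 .. $#y) { $err += $w[$i] if $h[$i] != $y[$i]; }
my $alpha = 0.5 * log((1. - $err) / $err);
my $sum = 0.;
for my $i (0 .. $#y) {
    # misclassified cases get boosted, correct ones get dampened
    $w[$i] *= exp(- $alpha * $y[$i] * $h[$i]);
    $sum += $w[$i];
}
my $new_err = 0.;
for my $i (0 .. $#y) { $new_err += $w[$i] / $sum if $h[$i] != $y[$i]; }
printf "weighted error after re-weighting: %f\n", $new_err;   # 0.500000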

Instead we need to honestly compute P(E) based on the previous events, not per the AdaBoostian table but per the original training cases. But if we do that, we end up again with a table of exponential size. The trick of ignoring all the events but the last few might still work, though. The idea would be similar: since we try to order the closely correlated events next to each other, the previous few events would be the ones with the strongest effects. And if the farther events aren't closely correlated, they won't have a big effect anyway, so they can be ignored. Maybe I'll try this next. Or maybe it's simply not worth the trouble.
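
Either way, for reference, here is what that honest computation would look like, straight from the original training cases (a sketch of mine; expected_pe is a hypothetical helper, not a function from this series' code):

# Sketch (mine): the "honest" P(E) computed directly from the original
# training cases. Each case's weight gets scaled by how well it agrees
# with the confidences of the already-applied events; P(E) is then the
# weighted share of the cases where E is true. This needs the original
# cases kept around, which is what the compact table tries to avoid.
sub expected_pe {
    my ($cases, $history, $event) = @_;
    # $cases: ref to [weight, {event name => 0/1}] pairs;
    # $history: ref to [event name, confidence] pairs already applied.
    my ($wtrue, $wtotal) = (0., 0.);
    for my $case (@$cases) {
        my ($w, $ev) = @$case;
        for my $h (@$history) {
            my ($e, $c) = @$h;
            $w *= $ev->{$e} ? $c : (1. - $c);
        }
        $wtrue += $w if $ev->{$event};
        $wtotal += $w;
    }
    return $wtrue / $wtotal;
}
# On the table above, after E1 at confidence 0.7 this gives P(E2) = 0.7:
# the fully correlated E2 at the same confidence adds no new information.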