## Saturday, October 31, 2015

### Bayes 12: fuzzy training

Now let's consider that the training data might also contain the event results with only partial confidence. Previously I've been talking about the confidence only when the model runs and computes its prediction based on the training data. But what is the training data? It's the same kind of information about the cases from the past where we've already ascertained the correct result. Which means that the event information that went into their diagnosis might also be fuzzy. Just as we might be unable to tell the outcome of an event for sure in the future, we might have had the same problem in the past as well. So we need to be able to handle this in the training data.

In the last installment I've shown that during the computation, the confidence C(E) can be handled by splitting the event into a superposition of two events, a positive one that will multiply the weights of the matching cases by C(E) and the negative one that will multiply the weights of the matching cases by 1-C(E). The same approach can be taken to the training data. If we have a case (with the weight of 1) that has an event with a fractional confidence, let's call it TC(E|I) for "training confidence of event E for the case I", it can be seen as two cases: one with the weight TC(E|I) and 1 for the outcome of that event and another one with the weight 1-TC(E|I) and 0 for the outcome of that event. The outcomes of the other events would be copied between both cases.

For example, if we have a case:

```
Wt        evA evB evC
1 * hyp1  1   0   0.75
```

we can split it into two cases:

```
Wt           evA evB evC
0.75 * hyp1  1   0   1
0.25 * hyp1  1   0   0
```
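The split can be sketched in a few lines of Python. This is only an illustration: the tuple layout `(weight, hypothesis, events)` and the name `split_case` are my assumptions for the example, not something defined earlier in the series.

```python
def split_case(weight, hyp, events, name):
    """Split one training case on the fuzzy event `name` into two sub-cases.

    `events` maps event names to training confidences in [0, 1].
    The outcomes of all the other events are copied into both sub-cases.
    """
    tc = events[name]                      # TC(E|I) for the event being split
    positive = dict(events, **{name: 1.0})
    negative = dict(events, **{name: 0.0})
    return [
        (weight * tc, hyp, positive),          # event outcome fixed at 1
        (weight * (1.0 - tc), hyp, negative),  # event outcome fixed at 0
    ]


# The case from the example above, split on evC.
case = (1.0, "hyp1", {"evA": 1.0, "evB": 0.0, "evC": 0.75})
for w, h, ev in split_case(*case, "evC"):
    print(w, h, ev)
```

The two resulting weights always add up to the weight of the original case, so the split doesn't change the total weight of the hypothesis.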

Incidentally, this is the reverse operation of what happens when we're building a classic probability table out of the individual training cases: there we add up the weights of the individual cases and take the weighted average for the outcomes of every event. Well, obviously, if we combine the individual cases, the original weight of each case will be 1, so the sum of the weights will be equal to the count of the cases, and the P(E|H) is a simple average of the outcomes for this event. But in the generalized form it can handle the other weights as well.
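To make the reverse operation concrete, here is a minimal sketch of combining the cases of one hypothesis into a single table row with a weighted average. The function name `combine_cases` and the `(weight, events)` layout are assumptions made for the example.

```python
def combine_cases(cases):
    """Merge (weight, events) cases of one hypothesis into one table row.

    Returns the summed weight and, for every event, the weighted average
    of its outcomes, which plays the role of P(E|H).
    """
    total = sum(w for w, _ in cases)
    names = cases[0][1].keys()
    row = {n: sum(w * ev[n] for w, ev in cases) / total for n in names}
    return total, row


# Merging the two sub-cases from the example recovers the original case.
subcases = [
    (0.75, {"evA": 1.0, "evB": 0.0, "evC": 1.0}),
    (0.25, {"evA": 1.0, "evB": 0.0, "evC": 0.0}),
]
total, row = combine_cases(subcases)
print(total, row)
```

With all the weights equal to 1 this degenerates into the plain average over the count of the cases.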

Let's go through the computation needed for processing of these sub-cases. When we get the confidence value C(E) during the computation, the weights of the sub-cases will be modified as follows. The weight of the first sub-case that has the event outcome as 1, will be:

`W(I1|E) = W(I1)*C(E)`

The weight of the second sub-case that has the event outcome as 0, will be:

`W(I2|E) = W(I2)*(1 - C(E))`

If it were the first event we looked at, the old weights will be:

```
W(I1) = TC(E)
W(I2) = 1 - TC(E)
```

And in this case the results can also be written as:

```
W(I1|E) = TC(E)*C(E)
W(I2|E) = (1 - TC(E))*(1 - C(E))
```

If we want to clump them back together into one case, the weight of that case will be:

`W(I|E) = W(I1|E) + W(I2|E) = TC(E)*C(E) + (1 - TC(E))*(1 - C(E))`

That's assuming that only one case was split, with the original W(I) = 1. If it was a combination of multiple cases, then we need to take its original weight into account:

`W(I|E) = W(I) * ( TC(E)*C(E) + (1 - TC(E))*(1 - C(E)) )`
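A quick numeric check of this, as a sketch: the two functions below (their names are mine, picked for the example) compute the new weight once through the two sub-cases and once through the clumped formula, and the results should agree up to the rounding.

```python
def weight_by_subcases(w, tc, c):
    """Update the two sub-cases separately, then add them up."""
    w1 = (w * tc) * c                      # sub-case with the event outcome 1
    w2 = (w * (1.0 - tc)) * (1.0 - c)      # sub-case with the event outcome 0
    return w1 + w2


def weight_clumped(w, tc, c):
    """Update the unsplit case in one step."""
    return w * (tc * c + (1.0 - tc) * (1.0 - c))


# W(I) = 1, TC(E) = 0.75, C(E) = 0.9 gives about 0.7 both ways.
a = weight_by_subcases(1.0, 0.75, 0.9)
b = weight_clumped(1.0, 0.75, 0.9)
print(a, b, abs(a - b) < 1e-12)
```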

Compare that with the probability formula from the 4th installment:

```
Pc(P, C) = P*C + (1-P)*(1-C)
P(H|C(E)) = P(H) * Pc(P(E|H), C(E))/Pc(P(E), C(E))
```

It's the same. The upper half of the fraction mirrors the weight computation directly, and the lower part shows that the sum of the weights of all cases will change in the same way.
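The equivalence can also be checked numerically. The sketch below uses a made-up table of two hypotheses (the weights and probabilities are hypothetical, chosen only for illustration) and compares the probability formula with the normalized weights.

```python
def pc(p, c):
    """Pc(P, C) = P*C + (1-P)*(1-C)"""
    return p * c + (1.0 - p) * (1.0 - c)


# Hypothetical table: hypothesis -> (weight W(H), P(E|H)).
table = {"hyp1": (6.0, 0.75), "hyp2": (4.0, 0.2)}
c = 0.9  # C(E), the confidence of the event at run time

# The classic way: P(H) and P(E) from the table, then the formula.
total = sum(w for w, _ in table.values())
p_h = {h: w / total for h, (w, _) in table.items()}
p_e = sum(p_h[h] * pe for h, (_, pe) in table.items())
by_formula = {h: p_h[h] * pc(pe, c) / pc(p_e, c) for h, (_, pe) in table.items()}

# The weight way: multiply every weight, normalize by the new sum.
new_w = {h: w * pc(pe, c) for h, (w, pe) in table.items()}
new_total = sum(new_w.values())
by_weights = {h: new_w[h] / new_total for h in table}

for h in table:
    print(h, by_formula[h], by_weights[h])
```

Both columns should match up to the rounding, because the sum of the new weights is the old sum multiplied by Pc(P(E), C(E)).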

We've got two interesting consequences:

First, now we handle the training cases with the fuzzy events.

Second, with this addition the logic for the computation directly by the training cases can also handle the probability table entries, or any mix thereof.

With the computation directly by the training cases, there is a problem with handling the new cases that don't exactly match any of the training cases. This is known as over-fitting. With the computation by probability tables, there is the problem of losing the cross-correlation data of the events (the problem of not matching any training case isn't exactly absent but it's at least reduced). With this new tool we can adjust the processing to be at any mid-point in the spectrum between these two extremes, and thus smoothly adjust the level of the fitting.

For example, we can split each case into two identical sub-cases, one with the weight 0.1, another with the weight 0.9. Then combine all the sub-cases with the weight 0.1 for the same hypothesis into one mixed case, similarly to how we've computed the probability tables, and leave the sub-cases with the weight 0.9 by themselves. This means that the result will be a mix of 10% of the probability-table processing and 90% of the training-case processing. And this split-point can be chosen anywhere within the range [0..1].
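Here is a rough sketch of how such a mix could be built. The function `mix_cases`, its `table_share` parameter and the `(weight, hypothesis, events)` tuple layout are assumptions for the illustration, and the three sample cases are made up.

```python
from collections import defaultdict


def mix_cases(cases, table_share):
    """Move `table_share` of every case's weight into one merged case per hypothesis.

    The remaining (1 - table_share) of the weight stays with the individual case.
    """
    kept = []
    sums = defaultdict(lambda: [0.0, defaultdict(float)])
    for w, hyp, ev in cases:
        kept.append((w * (1.0 - table_share), hyp, dict(ev)))
        acc = sums[hyp]
        acc[0] += w * table_share                  # weight of the merged case
        for name, val in ev.items():
            acc[1][name] += w * table_share * val  # weighted sum of the outcomes
    merged = [
        (tw, hyp, {n: s / tw for n, s in ev_sums.items()})
        for hyp, (tw, ev_sums) in sums.items()
        if tw > 0
    ]
    return merged + kept


# Hypothetical training data: 10% goes to the table-like merged cases.
cases = [
    (1.0, "hyp1", {"evA": 1.0, "evB": 0.0}),
    (1.0, "hyp1", {"evA": 0.0, "evB": 0.0}),
    (1.0, "hyp2", {"evA": 0.0, "evB": 1.0}),
]
for w, h, ev in mix_cases(cases, 0.1):
    print(w, h, ev)
```

With `table_share` of 0 this leaves the training cases as they were, and with 1 all the weight ends up in the merged cases, which is the pure probability table.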

Even if you just want to go exclusively by the classic probability tables, this way of processing is more efficient because it saves a division on every step. You don't need to compute the divisor for every event, you don't need to divide every probability value, and you reduce the rounding errors because you skip these extra computations.
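As an illustration of where the savings come from, here is a minimal sketch of the weight-based processing of a probability table. The function `run_events` and the table layout are assumptions for the example, and the numbers are made up. The loop over the events only multiplies, and the single normalization happens at the end, when the probabilities are actually needed.

```python
def pc(p, c):
    """Pc(P, C) = P*C + (1-P)*(1-C)"""
    return p * c + (1.0 - p) * (1.0 - c)


def run_events(table, confidences):
    """table: hypothesis -> (weight, {event: P(E|H)}); confidences: event -> C(E)."""
    weights = {h: w for h, (w, _) in table.items()}
    for name, c in confidences.items():
        for h, (_, probs) in table.items():
            weights[h] *= pc(probs[name], c)   # no divisor computed per event
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}  # one division per hypothesis


# Hypothetical table with two hypotheses and two events.
table = {
    "hyp1": (6.0, {"evA": 0.75, "evB": 0.5}),
    "hyp2": (4.0, {"evA": 0.2, "evB": 0.5}),
}
print(run_events(table, {"evA": 1.0, "evB": 0.5}))
```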