What causes the "impossible" cases that haven't been seen in the training data? Some of them are just rare occurrences that weren't observed during training. Some of them result from errors made when evaluating and entering the event data. But in my experience most of them are caused by situations where multiple hypotheses are true and their evidence interacts.
It's not such a big deal for the hypotheses that occur often (i.e. they have a high prior P(H)) and also occur together relatively often. Then there will be training cases reflecting these interactions, and they can be recognized fairly well. However if you have 200 hypotheses, there are 200*199/2 = 19900 possible pairs of them occurring together. If you have fewer than 19900 training cases, you're guaranteed not to see them all. And if you remember that two hypotheses being true at once is usually rather rare, and that three or more hypotheses might happen to be true, you won't have a good chance to see all the reasonable combinations until you have millions of training cases.
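For a sense of scale, here is a quick count for the 200 hypotheses used above. It's only a sketch: the square of the number of hypotheses is a loose upper bound that counts every ordered pair (including a hypothesis paired with itself), while the binomial coefficient counts the distinct unordered pairs and triples.

```python
from math import comb

n_hypotheses = 200  # the figure used in the text

# Every ordered (A, B), including (A, A) and both (A, B) and (B, A).
ordered_pairs = n_hypotheses ** 2
# Distinct unordered pairs {A, B} and triples {A, B, C}.
unordered_pairs = comb(n_hypotheses, 2)
unordered_triples = comb(n_hypotheses, 3)

print(ordered_pairs, unordered_pairs, unordered_triples)
```

The number of triples is already over a million, which is why seeing all the reasonable combinations takes so many training cases.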
One way to work around this problem is to try isolating the hypotheses from each other. I've already mentioned it in part 10: if we can figure out which events are relevant for a hypothesis, we can ignore the other events that are relevant for the other hypotheses, thus reducing the interaction between these hypotheses.
How can we figure out which events are relevant and which aren't? One way is to use human expertise and the knowledge about the structure of the object. For example, a burned-out brake light on a car has no connection with a blown engine head gasket.
Some amount of this kind of knowledge engineering is inevitable because the Bayesian logic is notoriously bad at structuring the data. Someone's got to prepare the meaningful events and combine the knowledge into this form before feeding it to this model. For example, the question "is the compression in the rear-most cylinders low?" can be used as an indication that the engine has been mildly overheated. But it combines multiple pieces of knowledge:
- How many cylinders does this car have, and which are the rear-most? Depending on an inline or V (or flat) engine configuration, there might be one or two rear-most cylinders.
- What is the measured compression in these cylinders?
- What is the normal range of compression for this car model?
- Is the coolant flow going from the front of the engine towards the rear? That's the normal situation but in some cars the direction might be opposite, and then we'd have to look at the front-most cylinders instead.
All this would be pretty hard to enter as events into a Bayesian model. But if the data gets combined and pre-processed, it becomes one convenient event. This pre-processing would use the knowledge about the car model to find the cylinders to look at, and then would combine the compression measurement with the lower end of the factory-specified range to give a confidence value that the compression is low.
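Such a pre-processing step might look roughly like the sketch below. Everything in it is hypothetical: the `EngineSpec` fields, the function names, the psi units and especially the linear ramp that turns a measurement into a confidence are my assumptions for illustration, not data or rules from any real car manual.

```python
from dataclasses import dataclass

@dataclass
class EngineSpec:
    """Hypothetical per-model data; the field names are illustrative."""
    cylinders_per_bank: int
    banks: int                   # 1 for inline, 2 for V or flat
    compression_min_psi: float   # lower end of the factory range
    coolant_front_to_rear: bool = True

def cylinders_to_check(spec):
    """(bank, position) of the cylinders at the far end of the coolant flow."""
    pos = spec.cylinders_per_bank - 1 if spec.coolant_front_to_rear else 0
    return [(bank, pos) for bank in range(spec.banks)]

def low_compression_confidence(spec, measured_psi, ramp_psi=15.0):
    """Confidence in [0, 1] that the compression in those cylinders is low.

    measured_psi maps (bank, position) to the measured compression.
    Assumed ramp: 0.5 exactly at the factory minimum, 1.0 at ramp_psi
    below it, 0.0 at ramp_psi above it.
    """
    worst = min(measured_psi[c] for c in cylinders_to_check(spec))
    conf = 0.5 + (spec.compression_min_psi - worst) / (2 * ramp_psi)
    return max(0.0, min(1.0, conf))

# A made-up V6: every cylinder is healthy except the rear one in bank 1.
v6 = EngineSpec(cylinders_per_bank=3, banks=2, compression_min_psi=150.0)
readings = {(b, p): 170.0 for b in range(2) for p in range(3)}
readings[(1, 2)] = 135.0
confidence = low_compression_confidence(v6, readings)
```

The resulting `confidence` is what would be fed to the Bayesian model as C(E) for the single event "the compression in the rear-most cylinders is low".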
The problem is obviously that someone has to program this pre-processing. And the same goes for the expert estimation of relevance of symptoms for a hypothesis: someone has to do this work.
The second problem with the expert estimations is that the experts' knowledge might be limited. I believe I've read about cases in the history of medicine where some easy-to-test symptoms of some illnesses weren't discovered until much later, and in the meantime everyone relied on roundabout and unreliable methods. But I don't remember the details of this story: what illnesses, what symptoms? Well, we'd like such symptoms to be discovered automatically from the data. But if we force the expert knowledge onto the model, it would never discover such dependencies because it would be forced to follow the human opinions.
So how do we discover this relevance automatically? I don't have a good answer. I have some ideas but they occurred to me fairly late. I've had only very limited time to experiment with some of them, and I haven't yet experimented at all with the ones that occurred to me later. I'll describe them anyway, but not yet.
Right now let's discuss a different, simpler aspect: suppose we somehow know which events are relevant to which hypotheses. How do we use this knowledge in the processing?
In part 10 I've shown a way to do it with the independent tables for each hypothesis: to treat an event as irrelevant to a hypothesis, either drop the event from the hypothesis table altogether or set both P(E|H) and P(E|~H) to 0.5.
The computation on the training table allows us to do the same for the combined table. Just add a relevance value to every cell of the table, i.e. to every event of every training case. If this value is 1, process the event on this case as before. If this value is 0, leave the weight of this case unchanged when applying this event.
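To see why 0.5 works as "irrelevant" in an independent per-hypothesis table, here is a minimal sketch of the classic Bayes update on one yes/no event. The function name `update` and the example numbers are made up for illustration.

```python
def update(p_h, p_e_h, p_e_nh, event_true):
    """Classic Bayes update of P(H) on one yes/no event.

    p_e_h is P(E|H), p_e_nh is P(E|~H).
    """
    if not event_true:
        # Use the probabilities of the event being false instead.
        p_e_h, p_e_nh = 1 - p_e_h, 1 - p_e_nh
    p_e = p_e_h * p_h + p_e_nh * (1 - p_h)
    return p_h * p_e_h / p_e

prior = 0.2
# A relevant event moves the probability...
moved = update(prior, 0.9, 0.1, True)
# ...while P(E|H) = P(E|~H) = 0.5 leaves it where it was, either way.
same_yes = update(prior, 0.5, 0.5, True)
same_no = update(prior, 0.5, 0.5, False)
```

With both probabilities at 0.5 the factor 0.5 cancels between the numerator and the denominator, so the answer to the event has no effect on P(H).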
There is also an obvious extension: rather than have the relevance as discrete 0 or 1, make it a number in the range [0..1]. The processing would follow the same principle as for the event confidence: logically split the case into two cases with their relative weights determined by the relevance value R(E|I), one case with the event being fully relevant, another with the event being fully irrelevant, process them separately, and then put back together:
W(I|E) = W(I)*(1 - R(E|I)) + W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )
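As a quick check of the formula: with R(E|I) = 1 it reduces to the usual weight update by the event, and with R(E|I) = 0 it reduces to W(I|E) = W(I). A minimal sketch of applying one event to all the training cases could look like this (the function name and the list-based layout are my assumptions, chosen only for illustration):

```python
def apply_event(weights, tc, relevance, c):
    """Apply one event with confidence c = C(E) to every training case.

    weights[i] is W(I), tc[i] is TC(E|I), relevance[i] is R(E|I).
    Returns the new list of weights W(I|E).
    """
    return [
        w * (1 - r) + w * r * (t * c + (1 - t) * (1 - c))
        for w, t, r in zip(weights, tc, relevance)
    ]

weights = [1.0, 1.0, 1.0, 1.0]
tc = [1.0, 0.0, 0.0, 0.0]          # only the first case expects the event
relevance = [1.0, 1.0, 0.0, 0.5]   # fully relevant, relevant, irrelevant, half
new_weights = apply_event(weights, tc, relevance, c=1.0)
# new_weights == [1.0, 0.0, 1.0, 0.5]
```

The second case contradicts the event and gets dropped to 0, the third has the same contradiction but ignores it because the event is irrelevant to it, and the fourth loses only the relevant half of its weight.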
This concept can also be applied to the classic Bayesian probability tables. It's straightforward: treat the prior probabilities as weights, do the same computation, then divide each resulting weight by the sum of all these weights to get the posterior probability.
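A sketch of that, under my reading that P(H) plays the role of W(I), P(E|H) plays the role of TC(E|I), and the relevance is kept per hypothesis as R(E|H); the function name and the numbers are made up for illustration:

```python
def posterior_with_relevance(priors, p_e_h, relevance, c):
    """One event applied to a classic probability table with relevance.

    priors[i] is P(H), p_e_h[i] is P(E|H), relevance[i] is R(E|H),
    c is the event confidence C(E). Returns the posteriors P(H|E).
    """
    # Same computation as for the training cases, with priors as weights.
    weights = [
        p * (1 - r) + p * r * (pe * c + (1 - pe) * (1 - c))
        for p, pe, r in zip(priors, p_e_h, relevance)
    ]
    total = sum(weights)
    # Normalize so that the posteriors add up to 1 again.
    return [w / total for w in weights]

priors = [0.5, 0.3, 0.2]
p_e_h = [0.9, 0.1, 0.5]
relevance = [1.0, 1.0, 0.0]   # the event is irrelevant to the third hypothesis
posteriors = posterior_with_relevance(priors, p_e_h, relevance, c=1.0)
```

Note that the weight of the third hypothesis stays at its prior before the division, but its posterior still shifts because the sum of all the weights has changed.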