You start with the table of probabilities built from the training data. We'll look at the build later, for now let's just assume that the table is there.
Each event represents some symptom, so you look at the symptoms one by one, check if they are present, making the events true or false (or some confidence value in between), and adjust the probabilities of the hypotheses. At the end we pick the hypothesis that is most likely. It's usually a good idea to pick some absolute probability requirement as well. Basically, if you have 98 hypotheses at 1% probability and one hypothesis at 2% probability, it probably doesn't mean that the 2% hypothesis is the right one. It probably means that the system has no clue about the right answer, and the answer it gives should be "I don't know". You usually want the winning hypothesis to be at least somewhere in the vicinity of 70% probability. The optimal value might take some fine-tuning of your system. There is another important consideration to this point: are the hypotheses mutually exclusive or could multiple of them be true at the same time? I'll discuss this consideration in detail later, and for mow just leave it at that.
There are two ways to look for the symptoms. In one way you have some automated way to look for all the symptoms you know: in medicine that might be some panel of lab tests, in car diagnostics it might be some 107-point inspection, and of course as far as the computers or software are concerned, it's just the perfect application for the automated tests. The uses like the spam detection are just the perfect example: all the information you can get is right there in the e-mail, just go and extract it. If you have the results for all the symptoms, you just go and apply them all, one after another, and at the end you have the probabilities for all the hypotheses and can pick the best one.
In the other way the examination of a symptom requires some manual procedure, such as having a doctor poke the patient in some way, or ordering some test. There is some cost associated with performing the test. In some cases it might even be an an attempt at fixing the issue, mixing the symptoms and hypotheses a bit: such as "try this antibiotic" or "replace the alternator, did the problem go away"? When the symptoms are like this, there are two consequences: first, to save on the cost of diagnostics, you want to stop after examining as few symptoms as possible, and second, the order of examining the symptoms matters.
I've dealt in reality only with the situation when all the symptoms are available up-front, so I'll re-tell the solutions for the expensive symptoms I've read in that old British book (perhaps there are some better ones invented by now?).
A certain way to stop early is in the case if one hypothesis becomes certainly more probable than all the others. What does that mean, "certainly"? It means that if we look at all the symptoms we haven't examined yet and suppose that they go the worst way for this hypothesis, it would still beat all the other hypotheses, satisfying the requirements to accept it as the best one.
Another certain way to stop is if no hypothesis, even if all the yet unexamined symptoms go the best way for them, could satisfy the acceptance requirements. It's just the early admission of impending defeat without the useless thrashing beforehand.
The way you do it is by calculating the best and worst cases for each hypothesis after applying each symptom. You can also optimize it by remembering which hypotheses start failing the cut unconditionally, and skipping them in all the future checks. Obviously, if you have all the events available up-front, there is no point in stopping early, it's faster to just apply all the events rather than re-calculating the best and worst cases after every event.
How exacly do you compute the best and the worst cases? Like this:
- Start with the current probability of the hypothesis, and assign it as teh prior values Pbest(H) and Pworst(H)
- For all the un-examined symptoms:
- Check if P(E|H)/P(E) > 1 (i.e. the event has a positive correlation with this hypothesis)
- if this check is true, compute: Pbest(H) *= P(E|H)/P(E); Pworst(H) *= P(~E|H)/P(~E)
- else, compute Pbest(H): *= P(~E|H)/P(~E); Pworst(H) *= P(E|H)/P(E)
Now, what is the best next symptom to try out? The one that will have the strongest effect by affecting the probabilities of the hypotheses in the strongest way. I don't remember how that book defined the strength but here are some logical considerations, and it would probably work out in the same way as they did in that book.
We can use the difference between the probabilities as the measure of the effect of the event on a hypothesis:
abs(P(H|E) - P(H|~E))
Another possible way would be to divide the probabilities by each other but that is fraught with troubles. Let me show you why. Suppose we
define the effect as
max(P(H|E), P(H|~E)) / min(P(H|E), P(H|~E))
Max() and min() are needed to keep the value always above 1. But the minimum may always become 0, which would result in the infinities, which is not good. If we try to divide the other way around, min by max, we will produce the same results for 0/0.0001 and 0/0.9999, which is obviously also a bad metric. Thus the subtraction it is.
And then the obvious way to get the effect of an event on the whole set of hypotheses is to add up its effects on each individual hypotheses. Whichever event gets the highest metric is the most important event.
But there is also the question of the cost of checking the symptom. As everyone who performs the diagnostics knows, the reasonable approach is to try the cheapest things first. Which means that we should probably multiply that cumulative effect of an event by the cost of checking it, and use that as the metric.
The cost approach can be taken further: each hypothesis means a diagnosis that implies a prescription for a certain treatment. That treatment also has some cost. And again the reasonable thing is to hope for the diagnoses with the cheap treatments. So when computing the effect of an event on the whole set of hypotheses we can give the preference to the hypotheses with a cheaper treatment, for example by multiplying the effect on this hypothesis by 1/cost_of_treatment or log(1/cost_of_treatment) or some such.
What if a single test provides the knowledge about multiple symptoms? Then obviously we need to look for the choice not of the next symptom but of the whole next test. The effect of a test can be computed by adding up the effects of all the symptoms it resolves. Well, not necessarily exactly adding, this has issues with the correlated symptoms, but it's a reasonable first approach, and I'll talk more about this issue later.
Another interesting variation (that's not from the book, that's my own musings) is if the hypothesis are used to mean not just a diagnosis but a confirmed diagnosis, meaning "we digagnosed, treated, and confirmed that the treatment had cured the problem". In this case not just the symptoms as such but also the prescriptions for the treatment become the event (with the event being true if the treatment had worked). This nicely folds the knowledge about the alternative treatments into the same model. Obviously, this approach needs to have some stricter criteria of which treatments are worthy enough to try, the same as for the hypotheses in the more basic approach, to prevent the system from thrashing in vain when it's unable to make any useful conclusion.
This approach can be merged with the automated tests as well: We start by applying the results of all the automated tests as events, and then choose one of the possible treatments as the "next event to test" using the principles described above. If it fixes the issue, great, our job is done. Otherwise we can try the next treatment, and maybe re-run the automated tests, see if anything has changed, and take the changes into account before recommending the next treatment.