The first thing you need to define is some cost metric for comparison. The result of the testing will not be binary "pass or fail", it will be the value of that cost metric, and you'll be trying to minimize that cost.
In the perfect world, you would have the table of cost for replacing every spare part (including both the part itself and the labor). You'd also know the labor cost of performing the diagnosis by a human diagnostician. This cost might vary by the type of problem, and in the even more perfect world you'd have this information too. Moreover, the human diagnostician can make mistakes, so you'd have the information about the frequency of mistakes and the cost of labor and parts spend on them factored into the information about the cost of the human diagnosis. Having these tables, you'd be ready to do the comparisons.
If you don't have the exact tables, you can do some kind of estimations. For a toy model, you can do some very rough estimations: for example, say that the cost of the human diagnosis is 1 unit, and the cost of any particular repair is 3 units.
Then you take the body of the training data. The meaning of the words "training data" is kind of loose here, it's not necessarily the data that you will be using to train the model, you might put at least some of it aside to use in the testing. It all can also be called the "ground truth": the information about the previous cases, diagnosed by humans, and confirmed that the diagnosis is correct. For each test, you'd want to measure how good each version of your model does against the other versions of the model, and also against the humans. After all, if your diagnostic automation results in higher costs than the human diagnosis, there is no point in using it. And if it does better, you'd have some exciting numbers to show to your bosses and/or customers.
There is an absolute minimum cost needed to do the repairs on a set of training data: you take the cost of fixing every confirmed problem in it and add them up.
Cperfect = sum( cost(every_problem) )
To estimate, what the human diagnosis would cost, you'd take the cost for diagnosing one case, multiply it my the number of the cases in the training data and add to Cperfect to get the total cost.
Chuman = Cperfect + cost(diagnosis)*n_of_cases
Of course, if you have the more detailed information about the cost of diagnosis by each type of problem, you can use this detailed information to add them up. Generally there still will be the fixed cost of diagnosing one case, plus the sum of diagnosing each problem in this case:
Chuman = Cperfect + cost(diagnose_case)*n_of_cases + sum( cost(diagnose_every_problem) )
Note that even if you build a perfect diagnoser, the best savings you can hope for are (Chuman - Cperfect). Or if you prefer to see it as a percentage, 100%*(Chuman - Cperfect)/Chuman.
In reality when you run your automatic diagnoser, you'll have some problems misdiagnosed. There will be some false negatives (when your model failed to notice some problem) and some false positives (when your model diagnosed a problem that is not present). If your model has produced a wrong diagnosis, that's obviously a combination of a false negative (for the real problem that got missed) and a false positive (for the wrong problem that got diagnosed).
The cost of the false negatives is the cost of a human diagnosis, because the human diagnostician would have to go and look at this case. The cost of post-repair testing might need to be added as well, because that's what would be detecting that the problem is not fixed before sending it to the human. In many cases the cost of this testing might be negligible compared to the cost of human diagnosis.
The cost of the false positives is the cost of the parts and labor spent on replacing the parts that aren't broken.
With all this, the cost of the repairs per the expert system will be:
C = Cperfect + cost(diagnose_case)*n_of_false_negative_cases + sum( cost(diagnose_every_false_negative_problem) ) + sum( cost(repair_every_false_positive_problem) )
You'd compare the output of your model with the perfect diagnosis, notice the false positives and false negatives, and add their costs.
Now you're able to compare two models: run them on the same data, find the resulting cost, and see which cost is lower and by how much. You can try different algorithms and different constants for these algorithms and see the changes of the cost. And sometimes the results would surprise you, you'll discover that you went for that fancier algorithm only to make things worse.
If you're wondering, what kind of boundary constant value should be used for accepting the hypotheses, the answer is to try multiple values and see, which one works better. If all goes well, you should be able to build a graph of the total cost by the boundary value and see a nice smooth-ish curve with a distinct minimum on it, something like this:
| | * | | * * | * * | | | +-----------------------
If you have two interdependent constants (such as in the algorithm that computes probabilities for both all hypotheses and independent hypotheses, and has different acceptance boundaries for these sub-algorithms), you may want to try taking a couple values of one constant, and for each one of them go through the graphing of the cost by the changing of the other constant. That might give you the hint of how they are interdependent. And then with this knowledge you'd be able to choose a smaller interesting area and go through every combination of both constants in it, compute the cost, and find the minimum on the 3-dimensional graph.
You might be even able to analyze the dependencies, build a formula, and find a good approximation of the minimum analytically.
These boundary constants generally adjust the balance between the false positives and false negatives. If you set the boundary too low, a lot of false positives will be accepted. If you set the boundary too high, the model will reject a lot of diagnoses and thus generate many false negatives. And around the optimum, you'll be trading some false positives for false negatives. In the perfect world there would be some area with no false positives nor false negatives but in reality you'll still have both, and it will come to giving up some in one area to win in the other.
The exact win or loss around the optimum area will depend on the mix of cases in the training data. If you move the boundary up, and lose in lots of cases, and win in a few, your cost will go up. If you move the boundary up, lose in a few cases but win in many, your cost will go down. The proportions of different cases mixed in your training data will determine the balance for the minimal total loss.
This has two consequences: First, there is no point in the very precise picking of the boundary constants. The small changes of these constants one way or the other won't matter, they will depend on which mix of cases did we get in the last time period. And the mix will fluctuate a little over time, there is no way to predict it precisely. Second, it's important to preserve the mix of the cases in the training. If you select a subset of data for the training, the mix of cases in it should match the full real set. Don't use the approaches like "we'll only include into the training data the cases where we've been able to diagnose the problem on the first go". If you do that, you'll be changing the mix, throwing away the more complex data, and your cost estimations won't match the
reality.
The common advice is "split your case data in two, use one half for training, another half for testing". As you can see, how exactly your split the data, will matter. You don't want to change the mix too much. If you take the first half of the data for training and the second half for testing, and the data happens to be sorted on some parameter, you'll get the very different mixes in two halves. It can get as bad as the training in the first half being totally wrong for diagnosing the second half. You need to do the splitting in some more random way. Either by picking the cases by some random number generator, or another good approach is to split them by the timestamp: order the cases by the timestamp and then you can divide them into the first and second half.
But this whole advice of splitting the data in two is not very good. It's good for the follow-up testing but not good for the initial testing. For the initial testing, you want to use the SAME data both for training and for testing. If you trained your model on some data and still can't diagnose everything right when testing with the exact same data, that will give you a good opportunity to analyze, what went wrong. Some people might say "oh, but you'll be over-fitting your model to the particular data". Over-fitting is rarely a problem for the Bayesian models, the typical problem is the opposite. And if the mix of cases in your training data matches the typical real mix, you'll be fitting the expectations close enough to the reality.
Thus start with testing on the same data that you used for training. Identify all the cases of false positives and false negatives. See if there is something common with them. Look in depth at some of the cases, how did the model come up with this result? What caused it to miss the correct result? In my small examples, I've been showing the step-by-step intermediate results of applying every event. This is the kind of data you want to look at. Did the steps match what you expected? If they didn't then why, what values in the tables cause the computation to veer this way?
This detailed data becomes large fairly quickly, so some automatic pre-processing can help with finding the source of trouble. In particular, I've been using the automated code that would pick, which events caused the biggest changes in the probabilities of the hypotheses, both the computed ones and the expected ones. Then it can be printed out in the more compact form that is easier to analyze at a glance. To get this kind of printout automatically, it helps to build the diagnostic support into the program itself. The good plan is to run the test data through the model once, pick the strange cases, and then run them through the model the second time, along with the information about the correct diagnosis and the diagnosis produced by the model. These known diagnoses can then be used to drive the debugging printouts during the second computation.
The different algorithms will give different results. Running multiple algorithms on the same set of training data (with the same data used for training and testing) will let you see the upsides and downsides of each algorithm. That's how I've been coming up with my refinements. Look at what can be adjusted to make the algorithm work better. If one algorithm handles some cases well, and another one handles the other cases well, could we perhaps differentiate these cases somehow and then run them through the different algorithms? Some refinements will work well, some will crash and burn, try the different ones and pick the ones that work.
And only after you've got a system that can do reasonably well on processing the data that were used for training, it's time to test it on the other data. Again, identify the cases that go wrong. In some cases the results will be better or worse than before simply because of a different mix on cases. If you see the same kinds of mistakes as when you tested with the training data but in different proportions, it's probably the mix issue. If you see the different mistakes, well, it means that there is something in this set of data that you haven't seen before and that throws your model off.
A useful experiment is to split your data (nicely and randomly) in two, then use each half to train and test in turn. If we call these parts A and B, then you can do:
- train with part A, test with part A
- train with part B, test with part B
- train with part A, test with part B
- train with part B, test with part A
If you test each algorithm change on both halves of the data, you'll be able to see if it affects them in the same way or differently. Then test the training table you produced from one half of data on the other half of data. Did it do substantially differently than the same algorithm did when trained by the same data as used in the test? If yes then perhaps the mix in two halves of data is different.
After your system is in production, you should still keep collecting the training data, and keep re-training the model. As the devices age, some of their parts will be coming to the end of life (by design or by mis-design), and this will be showing as the increasing frequency of their failures. This is something that you'd want to capture for the future diagnosis.
And the diagnostic data that was useful for testing is useful in production too. It's useful in two ways: First, when the diagnosis turns out to be wrong, and the case goes to the human diagnostician, the human may benefit from knowing, why did the machine make this diagnosis. He might be able to pinpoint the error quickly and cheaply, without repeating the difficult steps. Second, you should always look back at what is going wrong in production? It would never be perfect but can we do it better? For that, it would help to take the misdiagnosed cases and analyze them further. Try to figure out, what went wrong, and the diagnostic information is what lets you find out what went wrong. You might be also able to use the feedback from the human diagnosticians.
Oh, and a little tangent on the subject of the "ground truth". Recently I went to an internal conference on the subject of machine learning, and there they've been saying that there are the systems with the "ground truth" and systems without the "ground truth" (i.e. no known perfect solution, such as finding the topic of a text message). I disagree with that. I think that there are no systems without the "ground truth". There is always at least the "ground truth" benchmark of how would a human do at this task? Can the automated system match the human? Can the automated system beat the human? In that example of finding the topic of a test message, there obviously must be a body of training messages that the humans would read and formulate the topics. Then these messages can be used for testing of the automated models. And the good automated models must come reasonably close to what the humans did. Only after that can we use these automated models to process the real messages in the wild. Otherwise there is no way to tell if the results of these models are any good or if they are some complete garbage.
There is always, ALWAYS a way to test your system and estimate, how good is it doing. If you didn't test it, you're not following the best practices, and you probably have a whole lot of bugs in your program. Testing the large volumes of data manually is impossible but the solution is to pick some samples and check carefully that these samples are producing the correct results. And of course there are other indications: for example, if you ever see a probability value of 15, this means that there is something very wrong with your computation.
***
That concludes what I had to say on the subject of the Bayesian expert systems. Unless I recall something that I forgot :-) I wrote up all the ideas I had, and that's actually more ideas than I've started with, I've had some interesting realizations as I went along. At some point I'll probably provide the code examples for the techniques discussed in the last few installments. We'll see how it will go with my time.