Sunday, December 13, 2015

Bayes 20: Testing

I've already mentioned how to test a model but I want to go into it in more detail and collect all the items on this subject in one place.

The first thing you need to define is some cost metric for comparison. The result of the testing will not be binary "pass or fail", it will be the value of that cost metric, and you'll be trying to minimize that cost.

In the perfect world, you would have the table of cost for replacing every spare part (including both the part itself and the labor). You'd also know the labor cost of performing the diagnosis by a human diagnostician. This cost might vary by the type of problem, and in the even more perfect world you'd have this information too. Moreover, the human diagnostician can make mistakes, so you'd have the information about the frequency of mistakes and the cost of labor and parts spent on them factored into the information about the cost of the human diagnosis. Having these tables, you'd be ready to do the comparisons.

If you don't have the exact tables, you can do some kind of estimations. For a toy model, you can do some very rough estimations: for example, say that the cost of the human diagnosis is 1 unit, and the cost of any particular repair is 3 units.

Then you take the body of the training data. The meaning of the words "training data" is kind of loose here: it's not necessarily the data that you will be using to train the model, you might put at least some of it aside to use in the testing. It all can also be called the "ground truth": the information about the previous cases, diagnosed by humans, with the diagnosis confirmed to be correct. For each test, you'd want to measure how well each version of your model does against the other versions of the model, and also against the humans. After all, if your diagnostic automation results in higher costs than the human diagnosis, there is no point in using it. And if it does better, you'd have some exciting numbers to show to your bosses and/or customers.

There is an absolute minimum cost needed to do the repairs on a set of training data: you take the cost of fixing every confirmed problem in it and add them up.

Cperfect = sum( cost(every_problem) )

To estimate what the human diagnosis would cost, you'd take the cost of diagnosing one case, multiply it by the number of cases in the training data, and add the result to Cperfect to get the total cost.

Chuman = Cperfect + cost(diagnosis)*n_of_cases

Of course, if you have the more detailed information about the cost of diagnosis by each type of problem, you can use this detailed information to add them up. Generally there still will be the fixed cost of diagnosing one case, plus the sum of diagnosing each problem in this case:

Chuman = Cperfect 
    + cost(diagnose_case)*n_of_cases 
    + sum( cost(diagnose_every_problem) )

Note that even if you build a perfect diagnoser, the best savings you can hope for are (Chuman - Cperfect). Or if you prefer to see it as a percentage, 100%*(Chuman - Cperfect)/Chuman.
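
The formulas above can be sketched in a few lines of Python. This is only a toy illustration: the cost constants are the rough estimates mentioned earlier (1 unit per diagnosis, 3 units per repair), and the list of cases with its problem names is made up for the example.

```python
# Toy cost constants from the rough estimate in the text, not real data.
REPAIR_COST = 3.0
DIAGNOSIS_COST = 1.0

# Hypothetical ground truth: each case lists its confirmed problems.
cases = [
    ["bad_battery"],
    ["flat_tire", "worn_brakes"],
    ["bad_battery"],
]

def perfect_cost(cases):
    """Cperfect: the cost of fixing every confirmed problem."""
    return sum(REPAIR_COST for problems in cases for _ in problems)

def human_cost(cases):
    """Chuman: Cperfect plus one human diagnosis per case."""
    return perfect_cost(cases) + DIAGNOSIS_COST * len(cases)

c_perfect = perfect_cost(cases)
c_human = human_cost(cases)
max_savings = c_human - c_perfect
max_savings_pct = 100.0 * max_savings / c_human
print(c_perfect, c_human, max_savings, max_savings_pct)
```

With real cost tables, the constants would become lookups by part and by problem type, but the structure of the sums stays the same.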

In reality when you run your automatic diagnoser, you'll have some problems misdiagnosed. There will be some false negatives (when your model failed to notice some problem) and some false positives (when your model diagnosed a problem that is not present). If your model has produced a wrong diagnosis, that's obviously a combination of a false negative (for the real problem that got missed) and a false positive (for the wrong problem that got diagnosed).

The cost of the false negatives is the cost of a human diagnosis, because the human diagnostician would have to go and look at this case. The cost of post-repair testing might need to be added as well, because that's what would be detecting that the problem is not fixed before sending it to the human. In many cases the cost of this testing might be negligible compared to the cost of human diagnosis.

The cost of the false positives is the cost of the parts and labor spent on replacing the parts that aren't broken.

With all this, the cost of the repairs per the expert system will be:

C = Cperfect 
    + cost(diagnose_case)*n_of_false_negative_cases
    + sum( cost(diagnose_every_false_negative_problem) )
    + sum( cost(repair_every_false_positive_problem) )

You'd compare the output of your model with the perfect diagnosis, notice the false positives and false negatives, and add their costs.
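
A minimal sketch of this comparison, assuming each case is represented as a set of problem names (the names, the constants and the function `model_cost` are all invented for the illustration). The per-problem diagnosis cost is set to zero here to match the toy estimate; plug in real numbers if you have them.

```python
REPAIR_COST = 3.0            # toy cost of any repair
DIAGNOSE_CASE_COST = 1.0     # toy fixed cost of a human looking at a case
DIAGNOSE_PROBLEM_COST = 0.0  # toy per-problem diagnosis cost

def model_cost(truth, predicted):
    """truth, predicted: lists of sets of problem names, one set per case."""
    cost = 0.0
    for real, guessed in zip(truth, predicted):
        false_neg = real - guessed   # real problems the model missed
        false_pos = guessed - real   # problems the model made up
        cost += REPAIR_COST * len(real)  # the Cperfect part
        if false_neg:
            # The case has to go to the human diagnostician.
            cost += DIAGNOSE_CASE_COST
            cost += DIAGNOSE_PROBLEM_COST * len(false_neg)
        # Parts and labor wasted on things that weren't broken.
        cost += REPAIR_COST * len(false_pos)
    return cost

truth = [{"a"}, {"b", "c"}, {"a"}]
predicted = [{"a"}, {"b"}, {"d"}]  # one miss, one wrong diagnosis
print(model_cost(truth, truth), model_cost(truth, predicted))
```

Note how the wrong diagnosis in the last case gets charged twice, once as a false negative and once as a false positive.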

Now you're able to compare two models: run them on the same data, find the resulting cost, and see which cost is lower and by how much. You can try different algorithms and different constants for these algorithms and see the changes of the cost. And sometimes the results would surprise you, you'll discover that you went for that fancier algorithm only to make things worse.

If you're wondering what boundary constant value should be used for accepting the hypotheses, the answer is to try multiple values and see which one works better. If all goes well, you should be able to build a graph of the total cost by the boundary value and see a nice smooth-ish curve with a distinct minimum on it, something like this:

|                   *
|   *           *
|      *    *
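
Building such a graph can be sketched as a simple sweep. The example assumes that the model has already produced a probability for each hypothesis in each case; the `results` list, the hypothesis names and the toy costs are all made up for the illustration.

```python
REPAIR_COST = 3.0
DIAGNOSIS_COST = 1.0

# Hypothetical model output: (computed probabilities, confirmed problems).
results = [
    ({"hyp1": 0.95, "hyp2": 0.30}, {"hyp1"}),
    ({"hyp1": 0.60, "hyp2": 0.20}, {"hyp1"}),
    ({"hyp1": 0.40, "hyp2": 0.85}, {"hyp2"}),
    ({"hyp1": 0.10, "hyp2": 0.55}, {"hyp2"}),
]

def cost_at(boundary):
    """Total cost if hypotheses with probability >= boundary are accepted."""
    cost = 0.0
    for probs, real in results:
        accepted = {h for h, p in probs.items() if p >= boundary}
        cost += REPAIR_COST * len(real)
        if real - accepted:                         # false negatives
            cost += DIAGNOSIS_COST
        cost += REPAIR_COST * len(accepted - real)  # false positives
    return cost

boundaries = [0.1, 0.3, 0.5, 0.7, 0.9]
curve = [(b, cost_at(b)) for b in boundaries]
best_boundary, best_cost = min(curve, key=lambda bc: bc[1])
for b, c in curve:
    print(b, c)
```

On this toy data the cost falls as the false positives disappear, reaches the minimum in the middle, and then creeps up again as the false negatives pile up.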

If you have two interdependent constants (such as in the algorithm that computes probabilities for both all hypotheses and independent hypotheses, and has different acceptance boundaries for these sub-algorithms), you may want to try taking a couple values of one constant, and for each one of them go through the graphing of the cost by the changing of the other constant. That might give you the hint of how they are interdependent. And then with this knowledge you'd be able to choose a smaller interesting area and go through every combination of both constants in it, compute the cost, and find the minimum on the 3-dimensional graph.

You might be even able to analyze the dependencies, build a formula, and find a good approximation of the minimum analytically.

These boundary constants generally adjust the balance between the false positives and false negatives. If you set the boundary too low, a lot of false positives will be accepted. If you set the boundary too high, the model will reject a lot of diagnoses and thus generate many false negatives. And around the optimum, you'll be trading some false positives for false negatives. In the perfect world there would be some area with no false positives nor false negatives but in reality you'll still have both, and it will come to giving up some in one area to win in the other.

The exact win or loss around the optimum area will depend on the mix of cases in the training data. If you move the boundary up, and lose in lots of cases, and win in a few, your cost will go up. If you move the boundary up, lose in a few cases but win in many, your cost will go down. The proportions of different cases mixed in your training data will determine the balance for the minimal total loss.

This has two consequences. First, there is no point in picking the boundary constants very precisely. The small changes of these constants one way or the other won't matter, since their effect will depend on which mix of cases we got in the last time period. And the mix will fluctuate a little over time, there is no way to predict it precisely. Second, it's important to preserve the mix of the cases in the training. If you select a subset of data for the training, the mix of cases in it should match the full real set. Don't use the approaches like "we'll only include into the training data the cases where we've been able to diagnose the problem on the first go". If you do that, you'll be changing the mix, throwing away the more complex data, and your cost estimations won't match the

The common advice is "split your case data in two, use one half for training, another half for testing". As you can see, how exactly you split the data will matter. You don't want to change the mix too much. If you take the first half of the data for training and the second half for testing, and the data happens to be sorted on some parameter, you'll get very different mixes in the two halves. It can get as bad as the training in the first half being totally wrong for diagnosing the second half. You need to do the splitting in some more random way. Either pick the cases by some random number generator, or, as another good approach, split them by the timestamp: order the cases by the timestamp and then divide them into the first and second half.
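
Both ways of splitting can be sketched as follows. The case records with the fields `timestamp` and `problem` are hypothetical, and the `mix` helper is just one simple way to eyeball whether the two halves have a similar mix of cases.

```python
import random
from collections import Counter

def random_split(cases, seed=0):
    """Shuffle a copy of the cases and cut it in half."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def timestamp_split(cases):
    """Order the cases by timestamp and cut them in half."""
    ordered = sorted(cases, key=lambda c: c["timestamp"])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

def mix(cases):
    """Fraction of cases for each problem, for comparing the halves."""
    counts = Counter(c["problem"] for c in cases)
    return {problem: n / len(cases) for problem, n in counts.items()}

# Hypothetical data that happens to be sorted by the problem type.
cases = [{"timestamp": (7 * i) % 10, "problem": "battery" if i < 5 else "brakes"}
         for i in range(10)]

first, second = random_split(cases, seed=1)
early, late = timestamp_split(cases)
print(mix(first), mix(second))
print(mix(early), mix(late))
```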

But this whole advice of splitting the data in two is not very good. It's good for the follow-up testing but not good for the initial testing. For the initial testing, you want to use the SAME data both for training and for testing. If you trained your model on some data and still can't diagnose everything right when testing with the exact same data, that will give you a good opportunity to analyze, what went wrong. Some people might say "oh, but you'll be over-fitting your model to the particular data". Over-fitting is rarely a problem for the Bayesian models, the typical problem is the opposite. And if the mix of cases in your training data matches the typical real mix, you'll be fitting the expectations close enough to the reality.

Thus start with testing on the same data that you used for training. Identify all the cases of false positives and false negatives. See if there is something common with them. Look in depth at some of the cases, how did the model come up with this result? What caused it to miss the correct result? In my small examples, I've been showing the step-by-step intermediate results of applying every event. This is the kind of data you want to look at. Did the steps match what you expected? If they didn't then why, what values in the tables cause the computation to veer this way?

This detailed data becomes large fairly quickly, so some automatic pre-processing can help with finding the source of trouble. In particular, I've been using the automated code that would pick, which events caused the biggest changes in the probabilities of the hypotheses, both the computed ones and the expected ones. Then it can be printed out in the more compact form that is easier to analyze at a glance. To get this kind of printout automatically, it helps to build the diagnostic support into the program itself. The good plan is to run the test data through the model once, pick the strange cases, and then run them through the model the second time, along with the information about the correct diagnosis and the diagnosis produced by the model. These known diagnoses can then be used to drive the debugging printouts during the second computation.
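
As an illustration of this kind of pre-processing (not the actual code I've been using), here is a sketch that takes a step-by-step trace of one hypothesis and picks the events that moved its probability the most. The trace format, a list of (event name, probabilities after this event) pairs starting with the prior, is an assumption made for the example.

```python
def biggest_changes(trace, hypothesis, top=3):
    """Return the events that changed the hypothesis probability the most.

    trace: list of (event_name, {hypothesis: probability}) pairs; the first
    entry holds the prior probabilities before any event was applied.
    """
    changes = []
    for (_, before), (event, after) in zip(trace, trace[1:]):
        changes.append((event, after[hypothesis] - before[hypothesis]))
    # Largest absolute change first, whether up or down.
    changes.sort(key=lambda c: abs(c[1]), reverse=True)
    return changes[:top]

# Hypothetical trace of one case.
trace = [
    ("prior", {"hyp1": 0.30}),
    ("evA", {"hyp1": 0.35}),
    ("evB", {"hyp1": 0.80}),
    ("evC", {"hyp1": 0.20}),
]
for event, delta in biggest_changes(trace, "hyp1", top=2):
    print(f"{event}: {delta:+.2f}")
```

Running the same function for both the diagnosis the model produced and the correct diagnosis shows which events pushed the computation away from the right answer.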

The different algorithms will give different results. Running multiple algorithms on the same set of training data (with the same data used for training and testing) will let you see the upsides and downsides of each algorithm. That's how I've been coming up with my refinements. Look at what can be adjusted to make the algorithm work better. If one algorithm handles some cases well, and another one handles the other cases well, could we perhaps differentiate these cases somehow and then run them through the different algorithms? Some refinements will work well, some will crash and burn, try the different ones and pick the ones that work.

And only after you've got a system that can do reasonably well on processing the data that were used for training, it's time to test it on the other data. Again, identify the cases that go wrong. In some cases the results will be better or worse than before simply because of a different mix of cases. If you see the same kinds of mistakes as when you tested with the training data but in different proportions, it's probably the mix issue. If you see different mistakes, well, it means that there is something in this set of data that you haven't seen before and that throws your model off.

A useful experiment is to split your data (nicely and randomly) in two, then use each half to train and test in turn. If we call these parts A and B, then you can do:

  • train with part A, test with part A
  • train with part B, test with part B
  • train with part A, test with part B
  • train with part B, test with part A

If you test each algorithm change on both halves of the data, you'll be able to see if it affects them in the same way or differently. Then test the training table you produced from one half of data on the other half of data. Did it do substantially differently than the same algorithm did when trained by the same data as used in the test? If yes then perhaps the mix in two halves of data is different.
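
The four runs can be organized with a small harness like the sketch below. The `train` and `evaluate` callables are placeholders for your own training and cost computation; the toy versions here (train on the most frequent problem, count the misses) exist only to make the example runnable.

```python
from collections import Counter

def cross_check(part_a, part_b, train, evaluate):
    """Run all four train/test combinations of the two halves.

    train(cases) -> model table; evaluate(table, cases) -> cost.
    Returns {(trained_on, tested_on): cost}.
    """
    parts = {"A": part_a, "B": part_b}
    results = {}
    for train_name, train_cases in parts.items():
        table = train(train_cases)
        for test_name, test_cases in parts.items():
            results[(train_name, test_name)] = evaluate(table, test_cases)
    return results

# Toy stand-ins: the "model" always guesses the most frequent problem.
def toy_train(cases):
    return Counter(cases).most_common(1)[0][0]

def toy_evaluate(table, cases):
    return sum(1 for problem in cases if problem != table)

part_a = ["x", "x", "y"]
part_b = ["y", "y", "x"]
results = cross_check(part_a, part_b, toy_train, toy_evaluate)
for key, cost in sorted(results.items()):
    print(key, cost)
```

In this deliberately lopsided toy data the cross-tests cost twice as much as the self-tests, which is the kind of difference that hints at a different mix in the two halves.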

After your system is in production, you should still keep collecting the training data, and keep re-training the model. As the devices age, some of their parts will be coming to the end of life (by design or by mis-design), and this will be showing as the increasing frequency of their failures. This is something that you'd want to capture for the future diagnosis.

And the diagnostic data that was useful for testing is useful in production too. It's useful in two ways. First, when the diagnosis turns out to be wrong and the case goes to the human diagnostician, the human may benefit from knowing why the machine made this diagnosis. He might be able to pinpoint the error quickly and cheaply, without repeating the difficult steps. Second, you should always look back at what is going wrong in production. It would never be perfect, but can we do it better? For that, it would help to take the misdiagnosed cases and analyze them further. Try to figure out what went wrong; the diagnostic information is what lets you find it out. You might also be able to use the feedback from the human diagnosticians.

Oh, and a little tangent on the subject of the "ground truth". Recently I went to an internal conference on the subject of machine learning, and there they've been saying that there are the systems with the "ground truth" and systems without the "ground truth" (i.e. no known perfect solution, such as finding the topic of a text message). I disagree with that. I think that there are no systems without the "ground truth". There is always at least the "ground truth" benchmark of how a human would do at this task. Can the automated system match the human? Can the automated system beat the human? In that example of finding the topic of a text message, there obviously must be a body of training messages that the humans would read and formulate the topics. Then these messages can be used for testing of the automated models. And the good automated models must come reasonably close to what the humans did. Only after that can we use these automated models to process the real messages in the wild. Otherwise there is no way to tell if the results of these models are any good or if they are some complete garbage.

There is always, ALWAYS a way to test your system and estimate, how good is it doing. If you didn't test it, you're not following the best practices, and you probably have a whole lot of bugs in your program. Testing the large volumes of data manually is impossible but the solution is to pick some samples and check carefully that these samples are producing the correct results. And of course there are other indications: for example, if you ever see a probability value of 15, this means that there is something very wrong with your computation.
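
A cheap check of this last kind can be built right into the computation. The sketch below is one possible shape for it; the function name and the tolerance for rounding errors are arbitrary choices.

```python
import math

def check_probabilities(probs, tolerance=1e-9):
    """Raise ValueError if any value cannot possibly be a probability."""
    bad = {
        name: p for name, p in probs.items()
        if math.isnan(p) or p < -tolerance or p > 1.0 + tolerance
    }
    if bad:
        raise ValueError(f"impossible probabilities: {bad}")
    return probs

check_probabilities({"hyp1": 0.7, "hyp2": 0.0})  # passes quietly
try:
    check_probabilities({"hyp1": 15.0})
except ValueError as err:
    print(err)
```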


That concludes what I had to say on the subject of the Bayesian expert systems. Unless I recall something that I forgot :-) I wrote up all the ideas I had, and that's actually more ideas than I've started with, I've had some interesting realizations as I went along. At some point I'll probably provide the code examples for the techniques discussed in the last few installments. We'll see how it will go with my time.

Saturday, December 12, 2015

Bayes 19: model information, and enumerated events

Sometimes the diagnosis depends on the features of the device being diagnosed. For example, you wouldn't want to diagnose the power steering fluid leak in a car with the electric power steering (or no power steering at all).

The typical thing would be to at least ask the make and model of the car up front and enter it as the events into the diagnosis. How would it be entered as events? The straightforward way would be to define one event per model. Then when diagnosing a car, the event for its model will be true, and for the rest of the models false.

This would also mean that every single training case will be model-specific. And even if we fold multiple training cases into the per-hypothesis entry, it still will include the information about what models experienced this failure, and at what prior probability.

In one way, this is good. It means that we will diagnose this hypothesis only for the models where this failure can actually occur. In another way, it's bad. For some rarer models we may not have any training cases showing that yes, this problem can happen for this model. There might be plenty of Volvos in the training data to show all their possible problems but not enough Aston Martins to show any but the most frequent problems. Yet the Aston Martins are mechanically similar to the other cars, and much of knowledge about the other cars can be applied to the Aston Martins too. So should we take the model into account or not?

It's hard to tell for sure for each particular case. But the approach of specifying the model information as events actually works fairly well together with the confidence capping. The confidence capping means that even if we can't find an exact match in training data, we'll still keep trying to find the best match, first with the mismatch of one event, then with the mismatch of two events, and so on. Finding a similar training case on another model means the mismatch of two events: the event for the model of that training case will be false instead of true in the input, and the event for the model of the actual car being examined will be true instead of false. So in essence it will mean that the fully matching training case for the same model will be preferred, but if there isn't one, it will become a choice between the otherwise fully matching case from another model and a partially matching case for this model.
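
The counting of mismatches can be illustrated with a small sketch. It only counts the differing events and doesn't do any actual probability computation; the list of models and the symptom values are made up.

```python
MODELS = ["volvo", "aston_martin", "ford"]  # hypothetical list of models

def encode(model, symptoms):
    """One event per model (true only for this model), then the symptoms."""
    return [1 if m == model else 0 for m in MODELS] + list(symptoms)

def mismatches(a, b):
    """Number of events that differ between two event vectors."""
    return sum(x != y for x, y in zip(a, b))

car = encode("aston_martin", [1, 0, 1])

same_symptoms_other_model = encode("volvo", [1, 0, 1])
one_symptom_off_same_model = encode("aston_martin", [1, 1, 1])

print(mismatches(car, same_symptoms_other_model))   # the two model events
print(mismatches(car, one_symptom_off_same_model))  # the one symptom
```

So with this encoding a case for the same model that is off by one symptom wins over a case from another model that matches all the symptoms.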

So far so good but there are more caveats to keep in mind. This would not work well if we try aggressively to find the relevant symptoms/events of the training cases. We might end up with a training case for this particular model that has only the model information and one other event marked as relevant. Well, this event will be a mismatch but it's a mismatch on only one event (since the others are marked as irrelevant), so this case will take priority over the cases for the other model with a matching symptom (since these cases will have two mismatches in the model information events). A possible workaround for this is to not fold the model information into the events but just partition the training data into the separate per-model tables and a common table without the model information. Then look in the per-model table first, and if nothing useful is found there, look in the common table. This workaround has its own problems of course, such as what if there are multiple failures, one of which shows in the per-model table but the second one only in the common table? The second table computation might never happen because the first failure would be found in the per-model table and the processing will stop.

Another complication is that with the large number of events the uncertainty introduced by the confidence capping will accumulate fast. With a hundred events, and using 0.01 for the cap value, by the end of the processing the result will be completely uncertain. That's why the smaller values of the cap, such as 1e-8, are typically more useful.

You might think that if we do the computation directly by the training cases, we preserve the correlation between the events, so instead of having N events for N models, we can encode them as a binary number that can be represented with only log2(N) events. But that's a bad idea. It means that the code distance between the representations of the models will become varying. Instead of differing always by 2 events, they could differ by as few as 1 and as many as log2(N) events. That will seriously mess up the logic.
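
The varying code distance is easy to demonstrate. The sketch below compares the one-event-per-model encoding with the binary encoding for 8 models (so 3 binary events) and collects all the distances that occur between any two models.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions where the two encodings differ."""
    return sum(x != y for x, y in zip(a, b))

def one_hot(i, n):
    """N events for N models, only the i-th one is true."""
    return [1 if j == i else 0 for j in range(n)]

def binary(i, bits):
    """The model number i written as a binary number in 'bits' events."""
    return [(i >> k) & 1 for k in range(bits)]

n, bits = 8, 3
one_hot_distances = {hamming(one_hot(i, n), one_hot(j, n))
                     for i, j in combinations(range(n), 2)}
binary_distances = {hamming(binary(i, bits), binary(j, bits))
                    for i, j in combinations(range(n), 2)}
print(sorted(one_hot_distances))
print(sorted(binary_distances))
```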

A possible workaround for that is to allow some events to have not 2 possible values (true or false) but a number of possible values, as many as hundreds or even thousands. The fuzzy logic can still be used with such events: they would be represented by two values, the confidence value in the range [0..1] and the enumeration value. We'd check if the enumeration value of the event in the input matches the enumeration value of the event in the training case, and then apply the confidence value appropriately, for or against this training case. It also means that the mismatching value of this event will be a mismatch of only one event, not two, and thus improve the standing of the training cases from other models with the capping logic.
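
Here is one possible reading of this idea as a sketch, not a tested implementation. It assumes that every input event is a pair (value, confidence), where the confidence says how sure we are that the event has this value, and that a match applies the confidence for the training case while a mismatch applies it against. The function names, the event names and the 0.5 threshold are all made up for the illustration.

```python
def event_support(input_event, case_value):
    """How much the input supports the training case on this event, 0..1."""
    value, confidence = input_event
    return confidence if value == case_value else 1.0 - confidence

def count_mismatches(inputs, case, threshold=0.5):
    """Count the events where the input speaks against the training case."""
    return sum(1 for name, event in inputs.items()
               if event_support(event, case[name]) < threshold)

# Hypothetical input: the model is one enumerated event, not N boolean ones.
inputs = {
    "model": ("aston_martin", 1.0),
    "engine_noise": (1, 1.0),
    "smoke": (0, 1.0),
}
volvo_case = {"model": "volvo", "engine_noise": 1, "smoke": 0}

print(count_mismatches(inputs, volvo_case))
```

The training case from another model now differs from the input by only one event, the enumerated model event itself.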

A completely different take on this problem is to preprocess the information about the model and split it into the separate events that describe the mechanical features of the car. There might be a separate event for the type of power steering (none, or hydraulic, or electric), the type of the front brakes (disc or drum), the type of the rear brakes, and so on. The approach with the enumerated multiple-valued events can be used here as well, with for example the power steering type becoming one event with 3 possible values. However to use this approach effectively, we really need to handle the event relevance properly: the training cases for the problems in each subsystem must mark only the identifying information about this subsystem as relevant. You wouldn't want to fail recognizing a brake problem just because the car in the training case had a different type of power steering than the one being diagnosed. But this kind of relevance might be difficult to discover automatically, it would require the information about the subsystems and their relations to be added by the humans.

The good part about this splitting is that it would allow diagnosing the cars that have been modified by the owners. If someone installed the Lincoln disc brakes on the rear axle of a Fox-body Ford Mustang that had the drum brakes from the factory, this information can be entered as an override after entering the car model, changing the event for the type of the rear brakes. And the system will still be capable of diagnosing the problems. On the other hand, it might not help much: the aftermarket parts often have their own problems stemming from the mismatch of their characteristics with the original parts. For example, the said Lincoln brake conversion fits mechanically but causes the very wrong proportioning of the brake forces between the front and the rear. On the third hand, a sufficiently smart diagnostics system might be still useful. Even though the misproportioned brake forces do not happen in the cars from the factory, a smart enough system might still be able to point to a faulty master brake cylinder. It wouldn't be faulty as such but it would be faulty in its function for the modified brakes, distributing the brake forces in the wrong proportion.
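
The override logic itself is trivial and can be sketched as a table lookup followed by an update. The feature table below is a made-up illustration (only the factory rear drum brakes come from the example above), and the model key and feature names are hypothetical.

```python
# Hypothetical feature table; the values are for illustration only.
FEATURES_BY_MODEL = {
    "ford_mustang_fox_body": {
        "power_steering": "hydraulic",
        "front_brakes": "disc",
        "rear_brakes": "drum",
    },
}

def features_for(model, overrides=None):
    """Expand the model into feature events, then apply owner modifications."""
    features = dict(FEATURES_BY_MODEL[model])  # copy, keep the table intact
    for name, value in (overrides or {}).items():
        if name not in features:
            raise KeyError(f"unknown feature: {name}")
        features[name] = value
    return features

stock = features_for("ford_mustang_fox_body")
modified = features_for("ford_mustang_fox_body", {"rear_brakes": "disc"})
print(stock["rear_brakes"], modified["rear_brakes"])
```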

Tuesday, December 8, 2015

Bayes 18: finding the relevance

I've previously promised to describe the ideas I have for computing the relevance values, and here we go. I haven't tested these ideas with a set of production data. They're just ideas. But they look like a reasonable start.

The general idea should be to tell apart if this value of this event differentiates this hypothesis/case from the other hypotheses/cases or not. If it does differentiate, it's relevant, otherwise it's irrelevant.

Taking two cases, we should be able to differentiate them based only on the relevant events in them both (and obviously a relevant event in one case can only be used to make a decision for this case, not for the other one).

For example, suppose that we have two cases with outcome of two different hypotheses where the event values are represented as a bit string:

hyp1 0 0 0 1 0 0
hyp2 0 1 0 0 0 0

We can mark some events irrelevant by "-" (how we do this is a separate question, for now let's suppose that we've done it somehow) and still be able to differentiate these cases:

hyp1 - - - 1 - -
hyp2 - 1 - - - -

The bit string "0 0 0 1 0 0" will still match only the case for hyp1, and "0 1 0 0 0 0" will still match only the case for hyp2. Obviously, when working with the full training set, we should be able to differentiate between any two cases in the training set.
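
The matching with the irrelevant events can be sketched like this, writing each case as a pattern string where "-" marks an irrelevant event. The representation is chosen just for the example.

```python
def matches(pattern, bits):
    """True if every relevant event in the pattern equals the input bit."""
    return all(p == "-" or int(p) == b for p, b in zip(pattern, bits))

patterns = {
    "hyp1": "---1--",
    "hyp2": "-1----",
}

def diagnose(bits):
    """Names of all the cases whose relevant events match the input."""
    return [hyp for hyp, pattern in patterns.items() if matches(pattern, bits)]

print(diagnose([0, 0, 0, 1, 0, 0]))
print(diagnose([0, 1, 0, 0, 0, 0]))
```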

As it turns out, this problem of finding the smallest set of differentiating bits is already known. It's the same problem as the optimization of a set of boolean functions in the hardware engineering. The very classic way to approach it is with the Karnaugh maps, though they don't scale so easily to the large number of bits. But there's got to be some newer and better ways for this optimization.

This approach has its pitfalls too. For another example, suppose that we have the cases:

hyp3 0 0 0 1 0 0
hyp4 0 1 0 1 0 0

Then we can differentiate them based purely on one event:

hyp3 - 0 - - - -
hyp4 - 1 - - - -

Of course, this works only as long as we have only these two cases. And it would likely result in the spurious false positives: if all the events in the input are at 0, they would still be taken to mean hyp3.

This can be resolved or at least improved by having the training set include the set of the cases where there is nothing wrong. What is known as the "null hypothesis", and what I've been marking as the "ok" hypothesis in the previous examples. The events that differ in value from the null hypothesis should always be relevant. If we arrange the event polarity so that the null hypothesis has all the events at 0, this rule translates into "if an event is at 1, it must be relevant", and only the events at 0 need to be considered for the possible irrelevance.
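
Here is a sketch of this rule applied to the hyp3/hyp4 example, assuming the polarity is already arranged so that the null hypothesis has all the events at 0. The pattern strings use "-" for the irrelevant events, and the function names are made up.

```python
def matches(pattern, bits):
    """True if every relevant event in the pattern equals the input bit."""
    return all(p == "-" or int(p) == b for p, b in zip(pattern, bits))

def force_ones_relevant(case_bits, pattern):
    """Any event at 1 in the case differs from the null hypothesis,
    so mark it as relevant even if the minimization dropped it."""
    return "".join("1" if b == 1 else p for b, p in zip(case_bits, pattern))

hyp3_bits = [0, 0, 0, 1, 0, 0]
hyp4_bits = [0, 1, 0, 1, 0, 0]
hyp3_minimal = "-0----"
hyp4_minimal = "-1----"

hyp3_fixed = force_ones_relevant(hyp3_bits, hyp3_minimal)
hyp4_fixed = force_ones_relevant(hyp4_bits, hyp4_minimal)

all_ok = [0, 0, 0, 0, 0, 0]
print(matches(hyp3_minimal, all_ok))  # the spurious false positive
print(matches(hyp3_fixed, all_ok))    # gone after applying the rule
```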

Things get more complicated when we have some cases resulting in multiple hypotheses. Splitting their symptoms into the single-hypothesis cases is not obvious. It can be attempted by subtracting the other single-hypothesis cases and looking at what is left. If some hypotheses have no single-hypothesis cases, that becomes even more strange. Then perhaps these cases should be left as-is because they express the special interactions between multiple hypotheses. Or we can try to find the commonality between all the cases that include this particular hypothesis and discard the differences stemming from the other hypotheses included in these cases.

Let's look at the highly pathological example I've shown before and try to do these manipulations with it. The example was:

# tab09_01.txt
!             evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

Previously I've calculated the probabilities by hypothesis:

# tab09_01.txt

Those probabilities P(E|H) can be used as a guidance for relevance. If they're close to 0 or 1 (say, below 0.1 or above 0.9) , we'll say that this event is relevant for this hypothesis, and if it's somewhere in the middle, we'll say that it's irrelevant. Then we can use this knowledge to split the cases:
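
For reference, the per-hypothesis values of P(E|H) can be recomputed directly from the cases of the example table, and the thresholds applied to them. This is a sketch: the case list is a transcription of the table above, while the function names and the data layout are made up.

```python
# The cases of the example: (count, hypotheses, values of evA, evB, evC).
CASES = [
    (1, ["hyp1", "hyp2"], [1, 1, 0]),
    (1, ["hyp2", "hyp3"], [0, 1, 1]),
    (1, ["hyp1", "hyp3"], [1, 0, 1]),
]

def p_event_given_hyp(cases):
    """P(E|H): the weighted fraction of the cases of H where E is true."""
    weight, hits = {}, {}
    for count, hyps, values in cases:
        for hyp in hyps:
            weight[hyp] = weight.get(hyp, 0) + count
            row = hits.setdefault(hyp, [0] * len(values))
            for i, value in enumerate(values):
                row[i] += count * value
    return {hyp: [x / weight[hyp] for x in hits[hyp]] for hyp in weight}

def relevance(p, low=0.1, high=0.9):
    """An event is relevant if P(E|H) is close to 0 or to 1."""
    return {hyp: [v < low or v > high for v in row] for hyp, row in p.items()}

p = p_event_given_hyp(CASES)
relevant = relevance(p)
for hyp in sorted(p):
    print(hyp, p[hyp], relevant[hyp])
```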

!             evA evB evC
1 * hyp1      1   -   -
1 *      hyp2 -   1   -
1 * hyp2      -   1   -
1 *      hyp3 -   -   1
1 * hyp1      1   -   -
1 *      hyp3 -   -   1

Then we can merge the cases that are the same:

!        evA evB evC
2 * hyp1 1   -   -
2 * hyp2 -   1   -
2 * hyp3 -   -   1
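
The splitting and merging can be sketched as follows. The relevance flags are written out by hand here (each hypothesis has exactly one relevant event, as in the split table above), and the function names are made up for the example.

```python
from collections import Counter

# The cases of the example: (count, hypotheses, values of evA, evB, evC).
CASES = [
    (1, ["hyp1", "hyp2"], [1, 1, 0]),
    (1, ["hyp2", "hyp3"], [0, 1, 1]),
    (1, ["hyp1", "hyp3"], [1, 0, 1]),
]

# Which events are relevant for each hypothesis.
RELEVANT = {
    "hyp1": [True, False, False],
    "hyp2": [False, True, False],
    "hyp3": [False, False, True],
}

def split_cases(cases, relevant):
    """One single-hypothesis case per hypothesis, irrelevant events as '-'."""
    result = []
    for count, hyps, values in cases:
        for hyp in hyps:
            pattern = tuple(str(v) if r else "-"
                            for v, r in zip(values, relevant[hyp]))
            result.append((count, hyp, pattern))
    return result

def merge_cases(split):
    """Add up the counts of the identical single-hypothesis cases."""
    merged = Counter()
    for count, hyp, pattern in split:
        merged[(hyp, pattern)] += count
    return merged

merged = merge_cases(split_cases(CASES, RELEVANT))
for (hyp, pattern), count in sorted(merged.items()):
    print(count, "*", hyp, " ".join(pattern))
```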

It worked well for this special example but might be not so great in the bigger world. It needs more thinking and experimentation.