Saturday, May 28, 2016

Summary of Bayes by weight

This is a copy of the post from my MSDN blog. If you've been reading this blog, you've already seen all the ideas described here. But I've realized that reposting it here can be useful for reference, as a short summary of the most interesting parts. So, here it goes.

I recently wrote a series of posts in my other blog on the Bayes expert system. Some time ago I wrote an expert system, and I was surprised by how little information is available on the Internet and how much of it is misleading, so I wanted to write up my experience, and recently I've finally found the time to do it. It turned out larger than I expected, and as I was writing it I gained a deeper understanding of my previous experience and came up with further, better ideas. The text starts from the very basics of the Bayesian formula and goes through its use in expert systems, the typical pitfalls when building expert systems and the solutions to these pitfalls, the ways to handle uncertainties, a deeper dive into how and why these formulas actually work, more of the practical solutions based on that knowledge, and the aspects of testing.

The whole series is here (in the reverse order): http://babkin-cep.blogspot.com/search/label/bayes
The first post is here: http://babkin-cep.blogspot.com/2015/10/bayes-1-basic-terms.html
They are not interspersed with any other posts, so you can read them sequentially.

Right in the first post I refer to Yudkowsky's explanation, so why would you want to read my explanation rather than his? By all means, read his explanation too, it's a good one. But he explains it in a different context. He takes one application of the Bayesian formula and works through it. So do many other texts on artificial intelligence. The expert systems don't apply the formula once. They apply the formula thousands of times to make a single diagnosis. Which brings its own problems of scale that I describe. And there are many other practical issues. Your input data might be "impossible" from the standpoint of probabilities in your knowledge base. The evidence of absence is not the same as the absence of evidence. The events are rarely completely independent. Multiple hypotheses might be valid at the same time. And so on.

The particularly surprising realization for me was that the Bayesian systems and the decision trees are fundamentally the same thing. You can represent either one through the other one. They are traditionally used in somewhat different ways but that's just a question of tuning the parameters. They can be easily tuned either way or anywhere in between. This stuff is in part 11 of my notes. There it requires the context of the previous parts, but here is the short version as a spoiler. The short version might not be very comprehensible due to its brevity, but at least it points to what to look for in the long version.

So, it starts with the basic Bayes formula where the probability of some hypothesis H after taking into account the information about the event E is:

P(H|E) = P(H) * P(E|H) / P(E)

The hypotheses can also be thought of as diagnoses, and the events as symptoms.

In a typical expert system there can be hundreds of hypotheses and hundreds of events to consider. The values of P(E|H) are taken from the table that is computed from the training data. The values of P(H) and P(E) change as the events are applied one after another (P(H|E) for one event becomes P(H) for the next event, and P(E) gets computed from the complete set of values for P(H) and P(E|H)) but their initial values are also sourced from the same table. At the end one or multiple most probable hypotheses are chosen as the diagnosis.
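
To make these mechanics concrete, here is a minimal sketch in Python; the table values are small made-up numbers and the function is my own illustration, not code from the original posts:

# A sketch of applying the Bayesian formula event by event.
# prior maps each hypothesis H to P(H); cond maps (H, E) to P(E|H).
# Both would normally be filled in from the table built from the training data.
def apply_event(prior, cond, event, happened):
    p_e = sum(prior[h] * cond[(h, event)] for h in prior)   # P(E) from the current P(H)
    posterior = {}
    for h in prior:
        p_e_h = cond[(h, event)] if happened else 1.0 - cond[(h, event)]
        posterior[h] = prior[h] * p_e_h / (p_e if happened else 1.0 - p_e)
    return posterior   # P(H|E), which becomes P(H) for the next event

prior = {"H1": 0.6, "H2": 0.4}
cond = {("H1", "E1"): 0.67, ("H2", "E1"): 0.0,
        ("H1", "E2"): 0.0,  ("H2", "E2"): 1.0}
p = apply_event(prior, cond, "E1", happened=True)
p = apply_event(p, cond, "E2", happened=False)
# the most probable hypotheses in p are then chosen as the diagnosis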

What is the training data? It's the list of the previously diagnosed cases, where both the correct winning hypotheses and the values of the events are known. In the simplest way it can be thought of as a bitmap where each row has a hypothesis name and the bits for every event, showing if it was true or false:

    E1  E2  E3  E4
H1   1   0   0   0
H1   1   0   0   0
H1   0   0   0   1
H2   0   1   1   0
H2   0   1   1   0

In reality one case may have a diagnosis of multiple hypotheses, and the values of the events might be not just 0 and 1 but a number in the range between 0 and 1: we might not be completely confident in some symptom but have say a 0.7 confidence that it's true.

How is the table of probabilities built from the training data? For P(H) we take the proportion of the number of cases with this hypothesis to the total number of cases. For P(E|H) we take all the cases for the hypothesis H and average all the values for the event E in them. For P(E) we average all the values for E in all the cases in the whole training data.
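
As a rough sketch in Python (the data layout here is my own placeholder), the building of the table from the bitmap above could look like this:

# A sketch of building the probability table from the training data.
# Each training case is (hypothesis_name, {event_name: value_0_to_1}).
def build_table(cases, events):
    n = len(cases)
    hyps = set(h for h, _ in cases)
    p_h = {h: sum(1 for hh, _ in cases if hh == h) / n for h in hyps}
    p_e = {e: sum(ev[e] for _, ev in cases) / n for e in events}
    p_e_h = {}
    for h in hyps:
        rows = [ev for hh, ev in cases if hh == h]
        for e in events:
            p_e_h[(h, e)] = sum(r[e] for r in rows) / len(rows)
    return p_h, p_e, p_e_h

cases = [
    ("H1", {"E1": 1, "E2": 0, "E3": 0, "E4": 0}),
    ("H1", {"E1": 1, "E2": 0, "E3": 0, "E4": 0}),
    ("H1", {"E1": 0, "E2": 0, "E3": 0, "E4": 1}),
    ("H2", {"E1": 0, "E2": 1, "E3": 1, "E4": 0}),
    ("H2", {"E1": 0, "E2": 1, "E3": 1, "E4": 0}),
]
p_h, p_e, p_e_h = build_table(cases, ["E1", "E2", "E3", "E4"])
# p_h["H1"] == 0.6, p_e_h[("H1", "E1")] == 2/3, p_e["E1"] == 0.4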

As it turns out, this approach doesn't always work so well. Consider the following training data:

    E1 E2
H1   0  0
H1   1  1
H2   1  0
H2   0  1

The meaning of the data is intuitively clear, it's H1 if E1 and E2 are the same and H2 if E1 and E2 are different. But when we start computing P(E|H), all of them end up at 0.5 and the resulting expert system can't tell anything apart.

There is a solution: for the duration of the computation, split each hypothesis into two, say H1A and H1B:

    E1 E2
H1A  0  0
H1B  1  1
H2A  1  0
H2B  0  1

Before making the final decision, add up the probabilities P(H1)=P(H1A)+P(H1B) and use them for the decision. Now the logic works. The hypotheses can be split down to the point where each case in the training data becomes its own sub-hypothesis. Indeed, the current example does exactly that. If there are multiple cases that are exactly the same, we could keep them together by assigning a weight instead of splitting them into separate sub-hypotheses. For example, if there are 5 cases that are exactly the same, we can make them one sub-hypothesis with the weight of 5.

And with such a fine splitting the computation of probabilities can be thought of as striking out the sub-hypotheses that don't match the incoming events. If we're computing the diagnosis and receive E1=1, we can throw away H1A and H2B, leaving only H1B and H2A. If then we receive E2=0, we can throw away H2A, and the only hypothesis left, H1B, becomes the final diagnosis that we'll round up to H1.

We're essentially trying to find a match between the current case and one of the training cases. But that's exactly what the decision trees do! We can represent the same table as a decision tree:

             |
             V
        0   E1    1
        +----+----+
        |         |
        V         V
     0 E2  1   0 E2  1
     +--+--+   +--+--+
     |     |   |     |
     V     V   V     V
    H1A   H2B H2A   H1B

As you can see, it's equivalent: it produces the exact same result on the same input.

But what if we're not entirely confident in the input data? What if we get E1 with the confidence C(E1)=0.7? The way the Bayesian formula

P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )
  / ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )

treats it amounts to "multiply the weight of the cases where E1=1 by 0.7 and multiply the weight of the cases where E1=0 by 0.3". So in this example we won't throw away the cases H1A and H2B but will multiply their weights by 0.3 (getting 0.3 in the result). H1B and H2A don't escape unscathed either, their weights get multiplied by 0.7. Suppose we then get E2=0.8. Now the weights of H1A and H2A get multiplied by 0.2, and of H1B and H2B get multiplied by 0.8. We end up with the weights:

W(H1A) = 1*0.3*0.2 = 0.06
W(H1B) = 1*0.7*0.8 = 0.56
W(H2A) = 1*0.7*0.2 = 0.14
W(H2B) = 1*0.3*0.8 = 0.24

When we add up the weights to full hypotheses, W(H1)=0.62 and W(H2)=0.38, so H1 wins (although whether it actually wins or we consider the data inconclusive depends on the boundaries we set for the final decision).
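
Here is the same weight arithmetic as a small Python sketch, with the sub-hypotheses represented as rows of 0/1 event values (the representation is just one possible choice):

# Sub-hypotheses from the example, one per training case, each starting with weight 1.
subhyps = {
    "H1A": {"E1": 0, "E2": 0},
    "H1B": {"E1": 1, "E2": 1},
    "H2A": {"E1": 1, "E2": 0},
    "H2B": {"E1": 0, "E2": 1},
}
weights = {name: 1.0 for name in subhyps}

def apply_confidence(weights, subhyps, event, conf):
    # multiply the weight of the cases where the event is 1 by conf,
    # and the weight of the cases where it is 0 by (1 - conf)
    for name, row in subhyps.items():
        weights[name] *= conf if row[event] == 1 else 1.0 - conf

apply_confidence(weights, subhyps, "E1", 0.7)   # C(E1) = 0.7
apply_confidence(weights, subhyps, "E2", 0.8)   # C(E2) = 0.8

w_h1 = weights["H1A"] + weights["H1B"]   # 0.06 + 0.56 = 0.62
w_h2 = weights["H2A"] + weights["H2B"]   # 0.14 + 0.24 = 0.38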

Can this be represented as a decision tree? Sure! It just means that when we make a decision at each event node we don't choose one way out. We choose BOTH ways out but with different weights. If the weight of some branch comes down to 0, we can stop following it, but we faithfully follow all the other branches until they come down to the leaves, and keep track of the weights along the way.

Now, what if the table contains not just 0s and 1s but the arbitrary values between 0 and 1? This might be because some training cases had only a partial confidence in their events. Or it might be because we averaged the events of multiple training cases together, building the classic Bayesian table with one line per hypothesis.

We can still compute it with weights. We can just logically split this case into two cases with the partial weights. If we have a value of 0.7 for the hypothesis H and event E, we can split this line into two, one with 1 in this spot and the weight of 0.7, another one with 0 in this spot and the weight of 0.3. And we can keep splitting this case on every event. For example, if we start with

    E1  E2
H1  0.7 0.1

we can first split it into 2 cases on E1:

     E1  E2   W
H1A  1   0.1  0.7
H1B  0   0.1  0.3

And then further split on E2:

     E1  E2   W
H1AA 1   1    0.07
H1AB 1   0    0.63
H1BA 0   1    0.03
H1BB 0   0    0.27

Or we can modify the table as we apply the events: right before applying the event, split each row in the table in twain based on the values in it for this event, apply the event by multiplying the weights appropriately, and then collapse the split rows back into one by adding up their weights. This makes the intermediate values more short-lived and saves on their upkeep. And that's exactly the logic used in the classic Bayesian formula.
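
As a sketch of this split-apply-collapse for a single row and a single event (in Python, with a made-up row layout):

# One row of the table with fractional values, plus its weight. Applying an
# event with some confidence: split the row on its value for that event,
# multiply the weights, and collapse back by adding the weights up.
def apply_event_to_row(row, weight, event, conf):
    p = row[event]                               # fraction of the row where the event is 1
    w_true = weight * p * conf                   # sub-row with the event = 1
    w_false = weight * (1.0 - p) * (1.0 - conf)  # sub-row with the event = 0
    return w_true + w_false                      # collapsed back into one weight

# with full confidence C(E1)=1 the E1=0 sub-row gets struck out:
w = apply_event_to_row({"E1": 0.7, "E2": 0.1}, 1.0, "E1", conf=1.0)   # 0.7

Note that the values of the other events in the row stay unchanged after the collapse, since both sub-rows carry the same values for them.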

Since the operations on the table stay exactly the same once the rows get split, they can still be represented with the decision tree. Just the splitting of the rows in the table would be transformed to the splitting of nodes in the decision tree.

In the end the Bayesian logic and the decision trees are equivalent. They do the same things. Traditionally the Bayesian logic uses the training cases that have been averaged out while the decision trees try to match to the individual training cases. But both can be used in either way. And it's possible to get a mix of both by partitioning the weights between the original training cases and their combined averaged versions. It's a smooth scale of 0 to 1: if you transfer all the weight of the training cases into averaging, you get the traditional Bayesian logic, if you transfer none of it, you get the traditional decision trees, and if you transfer say 0.3 of the weight, you get the logic that is a combination of 0.3 of the traditional Bayesian logic and of 0.7 of the traditional decision trees. Note I've been saying "traditional" because they are really equivalent and can be transformed into each other.

Sunday, May 22, 2016

AdaBoost in simpler formulas 2

I've been reading along in the book on boosting, and I'm up to about 1/3 of it :-) I've finally realized an important thing about how H(x) is built.

For easy reference, here is another copy of the AdaBoost algorithm from the previous installment, simplified slightly further by replacing (1-et)/et with Wgoodt/Wbadt, and et/(1-et) with Wbadt/Wgoodt as mentioned at the end of it, and getting rid of et altogether:



Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the weights Dt.
  • Get weak hypothesis ht: X -> {-1, +1}.
  • Aim: select ht to minimize the weighted error Wbadt/Wgoodt:
    Wgoodt = 0; Wbadt = 0;
    for i = 1, ..., m {
         if ht(xi) = yi {
              Wgoodt += Dt(i)
         } else {
              Wbadt += Dt(i)
         }
    }
  • Update,
    for i = 1, ..., m {
         if ht(xi) != yi {
              Dt+1(i) = Dt(i) * sqrt(Wgoodt/Wbadt)
         } else {
              Dt+1(i) = Dt(i) * sqrt(Wbadt/Wgoodt)
         }
    }
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (ln(sqrt(Wgoodt/Wbadt))*ht(x)) ).



I've been wondering, what's the meaning of ln() in the formula for H(x). Here is what it is:

First of all, let's squeeze everything into under the logarithm. The first step would be to put ht(x) there.

ln(sqrt(Wgoodt/Wbadt))*ht(x) = ln(sqrt(Wgoodt/Wbadt)^ht(x))

This happens by the rule of ln(a)*b = ln(a^b).

Since ht(x) can be only +1 or -1, raising the value to this power basically means that depending on the result of ht(x) the value is either taken as-is or gets 1 divided by it. Which is the exact same thing that is happening in the computation of Dt+1(i). The two formulas are getting closer.

The next step, let's stick the whole sum under the logarithm using the rule ln(a)+ln(b) = ln(a*b):

H(x) = sign(ln( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) ))

The expression under the logarithm becomes very similar to the formula for Dt+1(i) as traced through all the steps of the algorithm:

Dt+1(i) = product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^(-yi*ht(xi)) )

So yeah, the cuteness of expressing the condition as a power comes in handy. And now the final formula for H(x) makes sense, the terms in it are connected with the terms in the computation of D.

The next question, what is the meaning of the logarithm? Note that its result is fed into the sign function. So the exact value of the logarithm doesn't matter in the end result, what matters is only whether it's positive or negative. The value of the logarithm is positive if its argument is > 1, and negative if it's < 1. So we can get rid of the logarithm and write the computation of H(x) as:

if ( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1

Okay, if it's exactly = 1 then H(x) = 0 but we can just as well push it to +1 or -1 in this case. Or we can write that

H(x) = sign( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) - 1 )

The next thing, we can pull the square root out of the product:

if ( sqrt( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) ) > 1 ) then H(x) = +1 else H(x) = -1

But since the only operation on its result is the comparison with 1, taking the square root doesn't change the result of this comparison. If the argument of square root was > 1, the result will still be >1, and the same for < 1. So we can get rid of the square root altogether:

if ( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1

The downside of course is that the computation becomes unlike the one for Dt+1(i). Not sure yet if this is important or not.

Either way, we can do one more thing to make the algorithm more readable, we can write the product as a normal loop:



chance = 1;
for t=1,...,T {
     if (ht(x) > 0) {
          chance *= Wgoodt/Wbadt;
     } else {
          chance *= Wbadt/Wgoodt;
     }
}
H(x) = sign(chance - 1)



Note that this code runs not at the training time but later, at the run time, with the actual input data set x. When the model runs, it computes the actual values ht(x) for the actual x and computes the result H(x).

I've named the variable "chance" for a good reason: it represents the chance that H(x) is positive. The chance can be expressed as a ratio of two numbers A/B. The number A represents the positive "bid", and the number B the negative "bid". The chance and probability are connected and can be expressed through each other:

chance = p / (1-p)
p = chance / (1+chance)

The chance of 1 matches the probability of 0.5. Initially we have no knowledge about the result, so we start with the chance of 1, and with each t the chance gets updated according to the hypothesis picked on that round of boosting.
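
A tiny sketch of that connection in Python, for converting the final "chance" from the loop above into a probability:

def chance_to_prob(chance):
    return chance / (1.0 + chance)

def prob_to_chance(p):
    return p / (1.0 - p)

# a chance of 1 is the probability of 0.5, i.e. no knowledge either way
print(chance_to_prob(1.0))    # 0.5
print(prob_to_chance(0.5))    # 1.0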

The final thing to notice is that in the Bayesian approach we do a very similar thing: we start with the prior probabilities (here there are two possible outcomes, with the probability 0.5 each), and then look at the events and multiply the probability accordingly. At the end we see which hypothesis wins. Thus I get the feeling that there should be a way to express the boosting in the Bayesian terms, for a certain somewhat strange definition of events. Freund and Schapire describe a lot of various ways to express the boosting, so why not one more. I can't quite formulate it yet, it needs more thinking. But the concept of "margins" maps very nicely to the Bayesian approach.

In case you wonder what the margins are: as the rounds of boosting go, after each round H(x) can be computed for each training case xi. At some point they all start matching the training results yi; however, the boosting can be run further, and more rounds can still improve the results on the test data set. This happens because the sign() function in H(x) collapses the details, and the further improvements are not visible on the training data. But if we look at the argument of sign(), such as the result of the logarithm in the original formula, we'll notice that it keeps moving away from the 0 boundary, representing more confidence. This extra confidence then helps make better decisions on the test data. This distance between the result of the logarithm and 0 is called the margin. Well, in the Bayesian systems we also have the boundary of the probability (in the simplest case for two outcomes, 0.5), and when a Bayesian system has more confidence, it drives the resulting probabilities farther from 0.5 and closer to 1. Very similar.

Saturday, April 9, 2016

career advice 3

As I've mentioned before, I've read the book "Friend & Foe" by Adam Galinsky and Maurice Schweitzer, and I went to see their talk too. It's an interesting book on the psychology of competition and cooperation. But some examples from it and their interpretation struck me as odd.

In particular, they have a story of Hannah Riley Bowles who got an offer for a tenured position from the Nazareth College, decided to negotiate for some better conditions, and had the offer retracted (to her surprise). Perhaps she'd followed advice similar to Tarah Van Vlack's: "never accept the first offer". What went wrong? According to Galinsky and Schweitzer, there must be some evil afoot, it's all because she's a woman.

But is it? The whole story as described in the book strikes me as a sequence of bad decisions. I'm not the greatest expert on negotiations by far but I know a thing or two about them. For example, I've negotiated for a year about one of my jobs. For another job, I've negotiated through 3 different engagements over 6 years. And basically in the story about Hannah I see both the glaring mistakes on her part and the mistaken pre-suppositions in the narrative.

The major mistake in the narrative is the assumption that you cannot lose by negotiating ("never accept the first offer, it's always lower by 10-15% than the final offer"). It's not true. If you decide not to accept the first offer but start the negotiations, you have to be prepared that the other side will turn around and go away. It has no relation to gender, happens to everyone, and is nothing unusual. It's something you've got to be prepared for. I've had it happen on multiple occasions in my career.

It doesn't mean that you shouldn't negotiate. The negotiation of the future salary at a new place is the best way and time to raise your income, after that a 10% raise will be considered a huge one. A 10% raise did happen to me once, but again, it was a very unusual thing, and much easier achieved by negotiating at the hiring time. But you've got to place a realistic goal in front of you, what kind of raise would be worth changing jobs for, and go from there. If the offer is way below this goal, there is no point in taking it. If it's way above, you've achieved your goal, there's not much more to desire. Extra negotiation can bring extra money but can also break the whole thing. If someone offers you double the current money, it's probably reasonable to just take the offer and not risk it.

And yes, the offer of double money did happen to me but it came with a catch: it was for a 6-month contract, with a long commute and not a particularly exciting job. So it wasn't a no-brainer, it took me some thinking. In the end I decided that if I got double the money for 6 months and then spent another 6 months looking for another job like my current one, I'd still be ahead, and I took it (and as a result things worked out better than expected).

To give an example of when things didn't work out, let's look at that 6-year negotiation story. When I talked to them for the first time, I went there for an interview, talked to HR about what kind of money I wanted, and a week later got a form letter saying that they're not interested. Well, overall fine with me (except for one point that I'll return to later), they didn't look particularly interesting anyway. When their recruiter contacted me next time, I asked them: you people didn't like me once already, why are you calling me again? And he said, no, the technical interviews actually are marked pretty good, so it's got to be some other reason. From which I could only conclude that the money was the problem. And then I tried a slightly different approach to find out what kind of money they had in mind, and it turned out that yes, there was a major disagreement. But the important point for Hannah's story is that they didn't make me an offer for half the money I was asking, they just turned around and went away.

Making another digression, this kind of confirms Tarah Van Vlack's advice "never name your number first". Or does it? Remember, in this case our expectations have been off by a factor of 2. If they made me an offer for half the money I thought reasonable, I wouldn't have taken it anyway, just as I didn't take it when I found it out during the second engagement. By the way, yes, there are disadvantages to naming your number first but there are also other issues, and there are some advantages too: if you overshoot their expectations by a reasonable amount, you'll have a much easier time defending this number in the further negotiations. If they name a number and you say "I want 10% more", they'll figure out that you're just trying to stretch it a little, and they might either stay firm or maybe settle at something like 5% more. If you name a number 20% more than they were expecting to offer, you'll probably get if not all 20% then at least 15%. And it's not just me, I've also read it in some book (Galinsky&Schweitzer's? Cialdini's? Karrass's?) that the first number named sets the tone for the negotiation, which is difficult to move afterwards. It can be moved but not by 15%, if you want to make progress you've got to start with something like "this is laughable! my reasonable estimation is 50% more!" and get maybe an extra 30-45%. And of course bear the risk that the other side would go away, so I'd recommend doing this only if you really do see the initial offer as laughable.

If the other side thinks that your demands are unreasonably high (or low, for the other side, and yes, I've done things like that from my side as well), they'll just go away. But of course from my standpoint the requests have been perfectly reasonable, I would not have agreed to their low-ball offer anyway, so I haven't lost anything. This is a problem only if you're bluffing.

Now turning to Hannah's mistakes. Sorry, but she led the negotiations in a very offensive way, as offensive as it can get without calling the prospective employer names.

The first major mistake was that she responded by writing a letter with a list of requests, and in such a formal tone. Negotiation in the written form is bad, it's highly prone to cause very negative feelings in the counterparty. The good way to negotiate is over the phone.

The use of the formal tone is even worse. It's guaranteed to offend. Returning to that example above, receiving that form letter had pissed me off very much. If they simply said "No, we're not interested" or "No, we're not interested, we don't think you're good enough for us", it would have been OK. But receiving a page-long form letter in legalese created a major grudge. For a few years after that I wouldn't even talk to their recruiters.

The right way to negotiate is on the phone, and try to keep as friendly a tone as possible. The point of negotiations is to convince the other party that your viewpoint is more reasonable, not to fight them.

This brings us to the next error, but here Hannah had no control: she had to negotiate directly with the college because the college had contacted her directly. The negotiations go much, much better when conducted through an intermediary. An independent recruiting agent is the best intermediary, the company recruiter is the second best one. Negotiating directly with the hiring manager, as Hannah essentially did, is fraught with peril. The recruiters are professional negotiators, they understand how the negotiations work, and transfer the information between the two parties while maintaining friendliness on both sides. You can talk more bluntly to them, and when the message reaches the other side, it will become formatted in a friendly way. On the other hand, the hiring managers tend to take offense easily. Many of them are technical specialists but not really people persons, and for quite a few of them the feeling of self-importance goes strongly to their head. It might be even worse in academia than in the industry, at least judging by what I read. The even worse part is that she had to deal with a committee. The problem with committees is that there is a higher probability that at least one member will be a self-important moron who will take offense.

Ironically, this went so bad because from the tone of the letter Hannah doesn't appear to be a people person either, but one with the self-importance gone to her head. It's hard enough to negotiate when one side has this attitude, and much harder when both sides do. For all I understand, the tenure positions are coveted in academia, so when the committee made an offer to Hannah, they likely felt that they were doing her an honor. Which is expected to be accepted humbly. Responding to the offer with the words "Granting some of the following provisions will make my decision easier" is the opposite of humility. It's negotiation from a position of power, implying that they've come to her with a humble supplication, and she is considering whether to grant their wish. I hope you can see by now how they felt offended.

As you can see, a great many things went wrong with Hannah's negotiation, and none of them have anything to do with her gender. All of them had to do with the communication mistakes, the character of the people involved, pride and prejudice of academic nature, and the lack of an experienced intermediary to calm down the tempers.

What could Hannah have done better? I'd recommend, first thing, going there, looking at the place, and meeting the people. A personal contact always makes the following remote communications much more personable. And then making her requests either in a face-to-face meeting or over the phone. Making them in a personable tone of requests, not demands. Like "hey, and how does such a thing run at your college? would it be OK if I do it like this?". Perhaps, making some of the requests through the HR department people. And what could the college have done better? After the hiring committee had made the decision, they could have used a professional recruiter from HR to communicate between the committee and Hannah.

Of course, yet another way to look at it is "do you want to work with people like this?". The point of the interview is not only that the candidate is a good fit for the company but also that the company is a good fit for the candidate. If you think that the company behaves unreasonably in response to your reasonable requests, it's probably best not to work there: obviously, your ideas of what is reasonable differ widely.

And this also brings the point about whether the women are undeservedly seen as too aggressive. I'd say that Hannah's example demonstrates exactly the kind of over-aggressiveness. It's not that she tried to negotiate for the better conditions, it's HOW she tried to do it. Instead of building the mutual rapport and convincing the counterparty of her goals in a friendly way, she saw it as a fight. It's not the perception of the aggression that is the problem, the problem is in the aggression that is actually present.

I wonder if it might also be connected to another effect about negotiations. As described in the book "The Negotiating Game" by Karrass, and as I can anecdotally confirm from my experience, when a good negotiator gets the major thing he wants, he goes soft on the opponent and doesn't mind giving up some minor points, to keep the relationship happier. On the other hand, the poor negotiators keep hammering non-stop even if they've got the negotiating power and have already secured the good conditions, they still keep trying to squeeze everything possible out of the opponent. Perhaps the second case is the same character trait that is seen as high aggression, the irony being that the higher aggression brings less success.

career advice 2

For an example of a good career advice, I want to point to James Whittaker.  I've attended his classes "Career Superpowers" and "The Art of Stage Presence", which also are available as books. He is a great presenter (and hopefully a good writer too, I didn't read the book versions).

I wouldn't say that I've learned a whole lot from the class on Career Superpowers but that's because it largely matches what I already know from my experience. But I really liked it being presented in a systematic fashion, and I did learn some things (and perhaps I need to exercise more of following them).

Or you might say it the other way: maybe I've liked it because it matched my own thoughts so much. Either way, he's done quite a career, much better than me.

And that's one of his points: it makes sense to follow the advice of someone who knows what he's doing and draws this advice from a success. Well, there are lots of caveats too. One, the career recipes differ by the times and by the industries. Following the advice of a CEO from the auto industry in the 70-80s won't help you in the software industry now. James's recipes are for the software industry, so if you're in the auto or financial industry, they might be bad advice for your situation. The second caveat, you want to follow the advice of someone who is analytical. Succeeding by following a good instinct is one thing but breaking this experience down into advice that can be transferred to other people is a whole separate feat.

Friday, April 8, 2016

career advice 1

I went to see a book presentation by Tarah Wheeler Van Vlack on career advice for women in technology. And I think it's all bad advice. It's not that everything she said is wrong but even when she starts with something that makes sense (like "always negotiate"), the details end up as bad advice. If women really follow advice like this, no wonder it would create the "gender gap".

I also have another story on the subject with another book, "Friend & Foe" which I think was also giving weird career advice. I wrote a letter to its authors, and after some editing of the personal details, it might be entertaining to publish at some point, as an example of what I see as the good career advice.

But getting back to this one, I've got a couple of interesting points that are completely separate from the advice as such.

The first one is that she was talking a lot about the "nerd culture": board games, videogames, cosplay, comic books and what not. She seems to imply a strong connection between the nerd culture and the engineering. Which I don't think is true. If someone is into the nerd culture, it doesn't necessarily mean that they would be good at the technical jobs, and if someone is good at technical jobs, it doesn't mean that they would be into the nerd culture. For the sake of the point, I don't think I'm into the nerd culture as described. I'm more into the real books, cooking, racecars, and lots of other small hobbies. Well, there are different opinions about this: one of the girls I dated said that I'm so nerdy, but her reference point was different, she was really from a redneck background. I sometimes like to play at being a redneck (you could say that it's my cosplay) but I really am not one by a wide margin and I know it.

Let me tell you an old Russian parable by Kozma Prutkov (a collective pen name of a group of writers): Once upon a time there was a gunsmith who had built a wonderful new rifle with a great many features. With it you could clean the game, cook it, unfold the rifle into a table and utensils and have a very nice dinner. An absolutely great experience. This rifle had only one drawback to it: it couldn't shoot.

I really like applying this parable to everything. If we call something a rifle, the extra features might be nice to have with other things being equal, but first and foremost it must shoot. And to be a good rifle, it must shoot well.

Let's now apply this parable to the nerd culture. There seems to be a lot of pressure saying that if you're into the nerd culture, you should look for a job in engineering. And that to look for a job in engineering, you have to subscribe to the nerd culture. But in reality, to look for a job in engineering and make a good career out of it, first and foremost you should be good at engineering. The nerd culture doesn't really matter: you can like it or dislike it or whatever, it won't affect anything. (And that, by the way, is the real meaning of diversity in the good sense: as long as you do the job well, what kind of culture you belong to doesn't matter). This pressure causes the people who are into the nerd culture to go into the engineering. And some turn out to be no good at it and fail. And become very frustrated, they feel that they've done everything right, as they were told they should, and still failed (or succeeded only very moderately) for no explainable reason. There must be some evil forces at work!

But the real reason is that they've been given bad advice and the wrong kind of pressure. If you're good at drawing the comic books, or playing video games, it doesn't mean that you're any good at engineering. You might be good at both but there is no guarantee whatsoever. I'm fairly decent at engineering but I'm not any good at drawing comic books. If you enjoy drawing comic books and don't feel any particular attachment to actual engineering, perhaps a better career choice would be to go into something connected with the drawing of comic books, or with the drawing in general. If you're good at action videogames, this skill might be connected to being good at driving a racecar or flying a fighter plane but not much with the engineering. And the same bad advice works the other way around: some people would feel that if they're not into the nerd culture, they shouldn't even try engineering.

Another related thing is that there is a big difference between being interested in something and being good at something. I like racing cars but I understand that I'm no Michael Schumacher, so I don't even try to make it into a career. It's a hobby where I waste money, not make money. You don't get paid for being interested, you get paid for being good at something that is useful for someone. Being interested is something that gives you a push towards trying something, and stimulates you to spend time and effort on improving, but by itself it doesn't make you good. In reality when people get good at something they become less interested: after the skill gets learned it becomes "same old, same old", and the time comes to find some next interesting thing (or at least the higher, more difficult level of the previous thing). And, well, the people who keep being interested but never become good tend to try being around the things they're interested in. And there is a big difference between doing things and just being around them. I think I've read a good rant by Steve Yegge about it a few years ago. But people who are around things don't really understand the difference, in their view they're in the very midst of activity. And when they're not appreciated that much compared to the people who do things, they don't understand why and feel frustrated. There must be some evil forces at work! I've learned from Tarah the new phrase "officewife", which describes someone who contributes to the social dynamics of a workgroup and, say, brings donuts but is not taken seriously for the work contribution. Which I think is a salient example for this paragraph. Of course, people can be labeled unjustly and incorrectly by social inertia, but this I think is where the phrase grows from. It's not limited to any gender, I've seen plenty of men in the same role.

The second point is that Tarah said that she is not afraid to be a bitch, and well, she really comes across as one. Or, better, I like the gender-neutral term she also used: an asshole. No big deal by itself but there is an interesting connection: there is this widespread moaning (including in the book "Friend & Foe") that "if women try to be assertive, they're seen as bitches, while the men aren't". Or I'd rather say assholes because I don't see any gender difference. I mean, assholes are assholes, and there are plenty among men. There is a difference between being assertive and being an asshole.

What's this difference between "assertive" and "asshole"? I think a lot of it is about being analytical versus being blind to the world. One thing is when someone builds the plans from the experience, then pushes to implement these plans, recognizes the feedback when something doesn't go according to the plan, and does the corrections. Another thing is when someone comes up with a load of crap and unloads it onto the world, often being mean for no good reason along the way.


This is by the way the communist tradition of management by assoholism (for all I can gather, still popular in the former USSR, and also in the Arab world): if you're the boss, scream, make tantrums and abuse the underlings until they do something.


All kinds of prophets and proselytizers also tend to be major assholes. If someone has come up with The Great Truth and decides to blindly spread this Great Truth throughout the world, he is an asshole. But if he (or she) pauses sometimes to listen to the reasonable comments and objections, and thinks about this Great Truth, and gives a reasonable response, and maybe sometimes modifies the Great Truth, then perhaps he or she becomes merely assertive.

To give another example, "I want this done because I said so" is an asshole, "I want this done because I see such and such good reason" is assertive. Or as yet another example, when you see a blog where the commenters get banned simply for disagreeing with the author, you can say for sure that the author is an asshole.

Circling back to the second point, it could be that the reason for "if women try to be assertive, they're seen as bitches" might be that they try to follow the bad examples, and there seem to be quite a few bad examples around, spreading bad advice. If someone follows the advice on how to become an asshole, they can successfully become an asshole and not even know it.

Friday, March 18, 2016

AdaBoost in simpler formulas

I'm reading a book about the "boosting": the machine learning idea that a better algorithm can be built by combining multiple instances of a simple algorithm, each instance trained on the same set of training data but with different weights assigned to each case.

The very basic concept itself goes along the same lines as what I've been doing manually for the Bayesian system: after a system is trained on a set of training cases, re-test it on the same set of data and find what cases got diagnosed wrong (the proper word for it, as it turns out, is "training error"). Then try to improve the recognition of these cases. But unlike me, the boosting does it in an automated way: it doesn't try to change the underlying algorithm, instead it simply produces the second training table, and the end result is produced by running the expert system with both tables and then combining their results in a weighted fashion. Boosting can be done for hundreds of rounds, producing the systems that run the basic algorithm hundreds of times. For Bayesian algorithms, I think it might be possible to combine these many tables into one instead of running them separately, although I'm not quite sure about that.

There is also a limitation on the boosting: looks like it's mostly sorted out for the cases where there are only two mutually exclusive hypotheses. Apparently it starts having problems even with multiple mutually exclusive hypotheses, let alone the overlapping ones.

Well, getting back to the book, it's "Boosting", written by the inventors of the concept in general and of the AdaBoost (Adaptive Boosting) algorithm in particular, Schapire and Freund.

I've spent some time understanding the AdaBoost algorithm, and I feel that it can be expressed in a simpler form. The authors' definition of the algorithm is like this (minus the mathematical characters that I don't know how to write in HTML and wrote in words or similar Latin characters):



Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the distribution Dt.
  • Get weak hypothesis ht: X -> {-1, +1}.
  • Aim: select ht to minimize the weighted error:
    et = Pri~Dt[ht(xi) != yi].
  • Choose Alphat = 1/2 * ln((1-et)/et).
  • Update, for i = 1, ..., m:
    Dt+1(i) = Dt(i)/Zt * {
         exp(-Alphat) if ht(xi)=yi;
         exp(Alphat) if ht(xi) != yi
    }
     = Dt(i)*exp(-Alphat*yi*ht(xi)) / Zt
    where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (Alphat*ht(x)) ).



Some explanations are in order. The pairs (xi, yi) are the training cases. xi represents the whole set of symptoms, yi represents the outcome (in Bayesian terms, we could also say "hypotheses" but here the word hypothesis has a different meaning). Only two outcomes are possible, and unlike the Bayesian tradition of representing them as 0 and 1, here they are represented as -1 and +1. This representation allows the clever trick in the formula for Dt+1(i), replacing the conditional choice of the sign for Alphat with multiplication by (yi*ht(xi)). If these values don't match, the result will be -1, if they do match then 1. The concept of hypothesis here is different from the Bayesian one, here the "hypothesis" means the whole basic algorithm with a computed training table. ht(xi) means the computation of the outcome by this trained algorithm by feeding the symptoms of a training case into it.

The epic expression Pri~Dt[ht(xi) != yi] means the weighted probability of a training case outcome being computed incorrectly when it's fed back into the basic algorithm with the current round's training table (the training table is computed independently in each round based only on the set of training cases and on the distribution of their weights for this round). Basically, it's the ratio of the weight of incorrectly computed cases to the total weight of all cases. The underlying basic algorithm tries to make up its table to minimize this number to the best of its ability. But since it's not perfect and might be far from perfect, the number of erroneous cases is unlikely to become 0.

There are T rounds of boosting. Each round trains the same basic algorithm on the same set of training cases but assigns different weights to the importance of these cases, according to Dt. The weights in Dt are adjusted for each round such as to sum up to 1. Zt is thus the "normalization factor": the sum of values in Dt+1 before this adjustment. From the algorithmic standpoint, there is no good reason to do this normalization: weights are weights, who cares if they sum up to 1 or not, this algorithm doesn't depend on it in any way. There is a reason why the authors use the Zt as it is, because it turns up in the proofs about the properties of the algorithm. But here it's easier to place it into the calculation of et.

Note also that Alphat is first computed as a logarithm, and then an exponent is taken on it. This logarithm and exponent cancel each other out except for the sign. This and the manipulation of the sign by (yi*ht(xi)) look mathematically cute but confuse the hell out of it. Well, they have a reason for using a logarithm because they use this Alphat to prove some properties, but like Zt in this place it only complicates things.

Getting rid of all these complications, here is the simplified version:



Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the weights Dt.
  • Get weak hypothesis ht: X -> {-1, +1}.
  • Aim: select ht to minimize the weighted error et:
    Wgoodt = 0; Wbadt = 0;
    for i = 1, ..., m {
         if ht(xi) = yi {
              Wgoodt += Dt(i)
         } else {
              Wbadt += Dt(i)
         }
    }
    et = Wbadt / (Wgoodt + Wbadt)
  • Update,
    for i = 1, ..., m {
         if ht(xi) != yi {
              Dt+1(i) = Dt(i) * sqrt((1-et)/et)
         } else {
              Dt+1(i) = Dt(i) * sqrt(et/(1-et))
         }
    }
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (ln(sqrt((1-et)/et))*ht(x)) ).



I think it becomes much easier to understand in this form. In particular, it becomes much easier to see an important property: each round adjusts the weights such that if the last table were used on them, it would produce a weighted error of exactly 50%, which would mean that it has no idea what is going on. The following round is then forced to come up with a new solution and scrape up a new way to do some differentiation. This is where the "Adaptive" in the algorithm's name comes from. But they say that this is not how they've come up with this formula, they say that they've come up with this particular value of Alphat in order to minimize Zt. I don't understand yet how exactly they've done it. I've tried to compute it and I get a different answer. I must be making an error somewhere, need to think more about it. And they try to minimize Zt because it represents the upper bound on the training error (again, I can follow the explanation but I can't tell on my own yet exactly why, will need to read this section a couple more times).

I don't understand yet the motivation for choosing the per-round multipliers in H(x) either. I can see some things about it. For example, the logarithm nicely handles the case of et > 0.5. If more than half the cases get misclassified by the basic algorithm, this situation is easy to improve: just flip the sign of the outcomes, and suddenly less than half of them will be misclassified. In this case the value under the logarithm will be less than 1, and the value of the logarithm will be negative, thus automatically flipping the sign! But I don't know why the logarithm is the right choice the rest of the time.

We could also replace (1-et)/et with Wgoodt/Wbadt, and et/(1-et) with Wbadt/Wgoodt but this might be detracting too much from the properties of et.
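
To see the simplified algorithm end to end, here is a minimal sketch of it in Python. The weak algorithm here is a one-feature threshold "stump" that I picked just to have something runnable; it is not the basic algorithm from the book, and the data layout (lists of numeric feature vectors) is my own assumption:

import math

# xs: list of numeric feature vectors, ys: list of -1/+1 outcomes.
def train_stump(xs, ys, d):
    # the weak algorithm: pick the single-feature threshold rule
    # with the smallest weighted error under the weights d
    best = None
    for f in range(len(xs[0])):
        for thr in set(x[f] for x in xs):
            for sign in (1, -1):
                h = lambda x, f=f, thr=thr, sign=sign: sign if x[f] > thr else -sign
                wbad = sum(w for x, y, w in zip(xs, ys, d) if h(x) != y)
                if best is None or wbad < best[0]:
                    best = (wbad, h)
    return best[1]

def adaboost(xs, ys, rounds):
    d = [1.0] * len(xs)                      # D1(i) = 1
    hs = []                                  # (multiplier, hypothesis) pairs
    for t in range(rounds):
        h = train_stump(xs, ys, d)           # train with the weights Dt
        wgood = sum(w for x, y, w in zip(xs, ys, d) if h(x) == y)
        wbad = sum(w for x, y, w in zip(xs, ys, d) if h(x) != y)
        if wbad == 0:                        # a perfect weak hypothesis decides alone
            return [(1.0, h)]
        hs.append((math.log(math.sqrt(wgood / wbad)), h))
        # re-weight: the misclassified cases grow, the correct ones shrink
        d = [w * math.sqrt(wgood / wbad) if h(x) != y else w * math.sqrt(wbad / wgood)
             for x, y, w in zip(xs, ys, d)]
    return hs

def classify(hs, x):
    return 1 if sum(a * h(x) for a, h in hs) > 0 else -1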

Sunday, December 13, 2015

Bayes 20: Testing

I've already mentioned how to test a model but I want to go into it in more detail and collect all the items on this subject in one place.

The first thing you need to define is some cost metric for comparison. The result of the testing will not be binary "pass or fail", it will be the value of that cost metric, and you'll be trying to minimize that cost.

In the perfect world, you would have a table of costs for replacing every spare part (including both the part itself and the labor). You'd also know the labor cost of performing the diagnosis by a human diagnostician. This cost might vary by the type of problem, and in an even more perfect world you'd have this information too. Moreover, the human diagnostician can make mistakes, so you'd have the information about the frequency of mistakes and the cost of labor and parts spent on them factored into the information about the cost of the human diagnosis. Having these tables, you'd be ready to do the comparisons.

If you don't have the exact tables, you can do some kind of estimations. For a toy model, you can do some very rough estimations: for example, say that the cost of the human diagnosis is 1 unit, and the cost of any particular repair is 3 units.

Then you take the body of the training data. The meaning of the words "training data" is kind of loose here, it's not necessarily the data that you will be using to train the model, you might put at least some of it aside to use in the testing. It all can also be called the "ground truth": the information about the previous cases, diagnosed by humans, and confirmed that the diagnosis is correct. For each test, you'd want to measure how good each version of your model does against the other versions of the model, and also against the humans. After all, if your diagnostic automation results in higher costs than the human diagnosis, there is no point in using it. And if it does better, you'd have some exciting numbers to show to your bosses and/or customers.

There is an absolute minimum cost needed to do the repairs on a set of training data: you take the cost of fixing every confirmed problem in it and add them up.

Cperfect = sum( cost(every_problem) )

To estimate what the human diagnosis would cost, you'd take the cost of diagnosing one case, multiply it by the number of cases in the training data, and add it to Cperfect to get the total cost.

Chuman = Cperfect + cost(diagnosis)*n_of_cases

Of course, if you have the more detailed information about the cost of diagnosis by each type of problem, you can use this detailed information to add them up. Generally there still will be the fixed cost of diagnosing one case, plus the sum of diagnosing each problem in this case:

Chuman = Cperfect 
    + cost(diagnose_case)*n_of_cases 
    + sum( cost(diagnose_every_problem) )

Note that even if you build a perfect diagnoser, the best savings you can hope for are (Chuman - Cperfect). Or if you prefer to see it as a percentage, 100%*(Chuman - Cperfect)/Chuman.

In reality when you run your automatic diagnoser, you'll have some problems misdiagnosed. There will be some false negatives (when your model failed to notice some problem) and some false positives (when your model diagnosed a problem that is not present). If your model has produced a wrong diagnosis, that's obviously a combination of a false negative (for the real problem that got missed) and a false positive (for the wrong problem that got diagnosed).

The cost of the false negatives is the cost of a human diagnosis, because the human diagnostician would have to go and look at this case. The cost of post-repair testing might need to be added as well, because that's what would be detecting that the problem is not fixed before sending it to the human. In many cases the cost of this testing might be negligible compared to the cost of human diagnosis.

The cost of the false positives is the cost of the parts and labor spent on replacing the parts that aren't broken.

With all this, the cost of the repairs per the expert system will be:

C = Cperfect 
    + cost(diagnose_case)*n_of_false_negative_cases
    + sum( cost(diagnose_every_false_negative_problem) )
    + sum( cost(repair_every_false_positive_problem) )

You'd compare the output of your model with the perfect diagnosis, notice the false positives and false negatives, and add their costs.
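
Here is a rough sketch of this comparison in Python, using the toy costs from above (1 unit per human diagnosis of a case, 3 units per repair); the representation of a case as a set of problem names is my own placeholder, and with the detailed cost tables you'd sum the per-problem costs instead:

# A sketch of scoring a model against the ground truth. Each case is a set
# of confirmed problems; the model output is a set of diagnosed problems.
COST_DIAGNOSE_CASE = 1.0
COST_REPAIR = 3.0

def cost_of_model(true_cases, model_cases):
    c_perfect = sum(COST_REPAIR * len(truth) for truth in true_cases)
    cost = c_perfect
    for truth, diagnosed in zip(true_cases, model_cases):
        false_neg = truth - diagnosed          # missed problems
        false_pos = diagnosed - truth          # wrongly diagnosed problems
        if false_neg:
            cost += COST_DIAGNOSE_CASE         # a human has to look at this case
        cost += COST_REPAIR * len(false_pos)   # parts replaced for nothing
    return cost

def cost_of_human(true_cases):
    c_perfect = sum(COST_REPAIR * len(truth) for truth in true_cases)
    return c_perfect + COST_DIAGNOSE_CASE * len(true_cases)

# example: two cases, the model misses one problem and invents another
truth = [{"pump"}, {"valve", "sensor"}]
model = [{"pump"}, {"valve", "relay"}]
print(cost_of_human(truth), cost_of_model(truth, model))   # 11.0 13.0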

Now you're able to compare two models: run them on the same data, find the resulting cost, and see which cost is lower and by how much. You can try different algorithms and different constants for these algorithms and see the changes of the cost. And sometimes the results would surprise you, you'll discover that you went for that fancier algorithm only to make things worse.

If you're wondering what kind of boundary constant value should be used for accepting the hypotheses, the answer is to try multiple values and see which one works better. If all goes well, you should be able to build a graph of the total cost by the boundary value and see a nice smooth-ish curve with a distinct minimum on it, something like this:

|
|                   *
|                   
|   *           *
|      *    *
|         
|
|
+-----------------------

If you have two interdependent constants (such as in the algorithm that computes probabilities for both all hypotheses and independent hypotheses, and has different acceptance boundaries for these sub-algorithms), you may want to try taking a couple values of one constant, and for each one of them go through the graphing of the cost by the changing of the other constant. That might give you the hint of how they are interdependent. And then with this knowledge you'd be able to choose a smaller interesting area and go through every combination of both constants in it, compute the cost, and find the minimum on the 3-dimensional graph.

You might be even able to analyze the dependencies, build a formula, and find a good approximation of the minimum analytically.
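
For example, a sweep over a single boundary constant could look like the following sketch (run_model here is a placeholder for however your model gets invoked with a given boundary, and cost_of_model is from the sketch above):

# A sketch of sweeping the acceptance boundary and picking the cheapest one.
def sweep_boundary(true_cases, input_cases, run_model, lo=0.0, hi=1.0, steps=20):
    best = None
    for i in range(steps + 1):
        b = lo + (hi - lo) * i / steps
        c = cost_of_model(true_cases, run_model(input_cases, b))
        print(b, c)                      # one point of the cost-vs-boundary graph
        if best is None or c < best[1]:
            best = (b, c)
    return best                          # (boundary, cost) with the minimal cost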

These boundary constants generally adjust the balance between the false positives and false negatives. If you set the boundary too low, a lot of false positives will be accepted. If you set the boundary too high, the model will reject a lot of diagnoses and thus generate many false negatives. And around the optimum, you'll be trading some false positives for false negatives. In the perfect world there would be some area with neither false positives nor false negatives, but in reality you'll still have both, and it will come to giving up some in one area to win in the other.

The exact win or loss around the optimum area will depend on the mix of cases in the training data. If you move the boundary up, and lose in lots of cases, and win in a few, your cost will go up. If you move the boundary up, lose in a few cases but win in many, your cost will go down. The proportions of different cases mixed in your training data will determine the balance for the minimal total loss.

This has two consequences: First, there is no point in the very precise picking of the boundary constants. The small changes of these constants one way or the other won't matter, they will depend on which mix of cases we got in the last time period. And the mix will fluctuate a little over time, there is no way to predict it precisely. Second, it's important to preserve the mix of the cases in the training data. If you select a subset of data for the training, the mix of cases in it should match the full real set. Don't use approaches like "we'll only include into the training data the cases where we've been able to diagnose the problem on the first go". If you do that, you'll be changing the mix, throwing away the more complex data, and your cost estimations won't match the reality.

The common advice is "split your case data in two, use one half for training, another half for testing". As you can see, how exactly you split the data will matter. You don't want to change the mix too much. If you take the first half of the data for training and the second half for testing, and the data happens to be sorted on some parameter, you'll get very different mixes in the two halves. It can get as bad as the training on the first half being totally wrong for diagnosing the second half. You need to do the splitting in some more random way. Either pick the cases with some random number generator, or, another good approach, split them by the timestamp: order the cases by the timestamp and then divide them into the first and second half.

But this whole advice of splitting the data in two is not very good. It's good for the follow-up testing but not good for the initial testing. For the initial testing, you want to use the SAME data both for training and for testing. If you trained your model on some data and still can't diagnose everything right when testing with the exact same data, that will give you a good opportunity to analyze what went wrong. Some people might say "oh, but you'll be over-fitting your model to the particular data". Over-fitting is rarely a problem for the Bayesian models, the typical problem is the opposite. And if the mix of cases in your training data matches the typical real mix, you'll be fitting the expectations close enough to the reality.

Thus start with testing on the same data that you used for training. Identify all the cases of false positives and false negatives. See if there is something common between them. Look in depth at some of the cases: how did the model come up with this result? What caused it to miss the correct result? In my small examples, I've been showing the step-by-step intermediate results of applying every event. This is the kind of data you want to look at. Did the steps match what you expected? If they didn't, then why? What values in the tables caused the computation to veer this way?

This detailed data becomes large fairly quickly, so some automatic pre-processing can help with finding the source of trouble. In particular, I've been using the automated code that would pick which events caused the biggest changes in the probabilities of the hypotheses, both the computed ones and the expected ones. Then it can be printed out in a more compact form that is easier to analyze at a glance. To get this kind of printout automatically, it helps to build the diagnostic support into the program itself. A good plan is to run the test data through the model once, pick the strange cases, and then run them through the model a second time, along with the information about the correct diagnosis and the diagnosis produced by the model. These known diagnoses can then be used to drive the debugging printouts during the second computation.
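
As an illustration, here is a minimal sketch of one such pre-processing step: given the probability of one hypothesis recorded after applying each event, rank the events by the size of the change they caused. The recorded values here are made up.

#!/usr/bin/perl
# A sketch of ranking the events by how much they changed the probability
# of one hypothesis. The recorded per-step probabilities are made up.
use strict;

my $prior = 0.25; # probability of the hypothesis before any events
my @steps = ( # event name and the probability after applying it
  [ "evA", 0.33 ], [ "evB", 0.60 ], [ "evC", 0.58 ], [ "evD", 0.12 ],
);

my @changes;
my $prev = $prior;
foreach my $s (@steps) {
  push @changes, [ $s->[0], abs($s->[1] - $prev) ];
  $prev = $s->[1];
}
# print the events, the largest change first
foreach my $c (sort { $b->[1] <=> $a->[1] } @changes) {
  printf("%s changed the probability by %.2f\n", $c->[0], $c->[1]);
}
# prints evD first (0.46), then evB (0.27), evA (0.08), evC (0.02)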

The different algorithms will give different results. Running multiple algorithms on the same set of training data (with the same data used for training and testing) will let you see the upsides and downsides of each algorithm. That's how I've been coming up with my refinements. Look at what can be adjusted to make the algorithm work better. If one algorithm handles some cases well, and another one handles the other cases well, could we perhaps differentiate these cases somehow and then run them through the different algorithms? Some refinements will work well, some will crash and burn, try the different ones and pick the ones that work.

And only after you've got a system that can do reasonably well on processing the data that was used for training, it's time to test it on the other data. Again, identify the cases that go wrong. In some cases the results will be better or worse than before simply because of a different mix of cases. If you see the same kinds of mistakes as when you tested with the training data but in different proportions, it's probably the mix issue. If you see different mistakes, well, it means that there is something in this set of data that you haven't seen before and that throws your model off.

A useful experiment is to split your data (nicely and randomly) in two, then use each half to train and test in turn. If we call these parts A and B, then you can do:

  • train with part A, test with part A
  • train with part B, test with part B
  • train with part A, test with part B
  • train with part B, test with part A

If you test each algorithm change on both halves of the data, you'll be able to see whether it affects them in the same way or differently. Then test the training table you produced from one half of the data on the other half. Did it do substantially differently from how the same algorithm did when trained on the same data as used in the test? If yes, then perhaps the mix in the two halves of the data is different.
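
A minimal sketch of driving these four runs, with train() and test() as made-up placeholders for building the training table from a half and measuring the cost of the diagnosis on a half:

#!/usr/bin/perl
# A sketch of running the four train/test combinations on the halves A and B.
# train() and test() are made-up placeholders for building the training table
# and for computing the cost of the diagnosis on a set of cases.
use strict;

sub train($) { my $half = shift; return "table built from half $half"; }
sub test($$) { my ($table, $half) = @_; return rand(); }

my %table = ( A => &train("A"), B => &train("B") );
foreach my $tr ("A", "B") {
  foreach my $te ("A", "B") {
    printf("train with %s, test with %s: cost %.3f\n",
      $tr, $te, &test($table{$tr}, $te));
  }
}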

After your system is in production, you should still keep collecting the training data, and keep re-training the model. As the devices age, some of their parts will be coming to the end of life (by design or by mis-design), and this will be showing as the increasing frequency of their failures. This is something that you'd want to capture for the future diagnosis.

And the diagnostic data that was useful for testing is useful in production too. It's useful in two ways: First, when the diagnosis turns out to be wrong, and the case goes to the human diagnostician, the human may benefit from knowing why the machine made this diagnosis. He might be able to pinpoint the error quickly and cheaply, without repeating the difficult steps. Second, you should always look back at what is going wrong in production: it will never be perfect, but can we do better? For that, it helps to take the misdiagnosed cases and analyze them further. The diagnostic information is what lets you figure out what went wrong. You might also be able to use the feedback from the human diagnosticians.

Oh, and a little tangent on the subject of the "ground truth". Recently I went to an internal conference on the subject of machine learning, and there they've been saying that there are systems with the "ground truth" and systems without the "ground truth" (i.e. no known perfect solution, such as finding the topic of a text message). I disagree with that. I think that there are no systems without the "ground truth". There is always at least the "ground truth" benchmark of how a human would do at this task. Can the automated system match the human? Can the automated system beat the human? In that example of finding the topic of a text message, there obviously must be a body of training messages that the humans would read and formulate the topics for. Then these messages can be used for testing of the automated models. And the good automated models must come reasonably close to what the humans did. Only after that can we use these automated models to process the real messages in the wild. Otherwise there is no way to tell if the results of these models are any good or if they are complete garbage.

There is always, ALWAYS a way to test your system and estimate how well it is doing. If you didn't test it, you're not following the best practices, and you probably have a whole lot of bugs in your program. Testing the large volumes of data manually is impossible, but the solution is to pick some samples and check carefully that these samples produce the correct results. And of course there are other indications: for example, if you ever see a probability value of 15, this means that there is something very wrong with your computation.

***

That concludes what I had to say on the subject of the Bayesian expert systems. Unless I recall something that I forgot :-) I wrote up all the ideas I had, and that's actually more ideas than I started with: I've had some interesting realizations as I went along. At some point I'll probably provide the code examples for the techniques discussed in the last few installments. We'll see how it goes with my time.

Saturday, December 12, 2015

Bayes 19: model information, and enumerated events

Sometimes the diagnosis depends on the features of the device being diagnosed. For example, you wouldn't want to diagnose the power steering fluid leak in a car with the electric power steering (or no power steering at all).

The typical thing would be to at least ask the make and model of the car up front and enter it as the events into the diagnosis. How would it be entered as events? The straightforward way would be to define one event per model. Then when diagnosing a car, the event for its model will be true, and for the rest of the models false.

This would also mean that every single training case will be model-specific. And even if we fold multiple training cases into the per-hypothesis entry, it still will include the information about what models experienced this failure, and at what prior probability.

In one way, this is good. It means that we will diagnose this hypothesis only for the models where this failure can actually occur. In another way, it's bad. For some rarer models we may not have any training cases showing that yes, this problem can happen for this model. There might be plenty of Volvos in the training data to show all their possible problems but not enough Aston Martins to show any but the most frequent problems. Yet the Aston Martins are mechanically similar to the other cars, and much of knowledge about the other cars can be applied to the Aston Martins too. So should we take the model into account or not?

It's hard to tell for sure for each particular case. But the approach of specifying the model information as events actually works fairly well together with the confidence capping. The confidence capping means that even if we can't find an exact match in the training data, we'll still keep trying to find the best match, first with the mismatch of one event, then with the mismatch of two events, and so on. Finding a similar training case for another model means the mismatch of two events: the event for the model of that training case will be false instead of true in the input, and the event for the model of the actual car being examined will be true instead of false. So in essence it will mean that a fully matching training case for the same model will be preferred, but if there isn't one, it will become a choice between an otherwise fully matching case from another model and a partially matching case for this model.

So far so good but there are more caveats to keep in mind. This would not work well if we try aggressively to find the relevant symptoms/events of the training cases. We might end up with a training case for this particular model that has only the model information and one other event marked as relevant. Well, this event will be a mismatch but it's a mismatch on only one event (since the others are marked as irrelevant), so this case will take priority over the cases for another model that have a matching symptom (since those cases will have two mismatches in the model information events). A possible workaround for this is to not fold the model information into the events but just partition the training data into the separate per-model tables and a common table without the model information. Then look in the per-model table first, and if nothing useful is found there, look in the common table. This workaround has its own problems of course, such as: what if there are multiple failures, one of which shows in the per-model table but the second one only in the common table? The second computation might never happen because the first failure would be found in the per-model table and the processing would stop.

Another complication is that with the large number of events the uncertainty introduced by the confidence capping will accumulate fast. With a hundred events, and using 0.01 for the cap value, by the end of the processing the result will be completely uncertain. That's why the smaller values of the cap, such as 1e-8, are typically more useful.

You might think that if we do the computation directly by the training cases, we preserve the correlation between the events, so instead of having N events for N models, we can encode them as a binary number that can be represented with only log2(N) events. But that's a bad idea. It means that the code distance between the representations of the models will vary. Instead of always differing by 2 events, they could differ by as few as 1 and as many as log2(N) events. That will seriously mess up the logic.

A possible workaround for that is to allow some events to have not 2 possible values (true or false) but a number of possible values, as many as hundreds or even thousands. The fuzzy logic can still be used with such events: they would be represented by two values, the confidence value in the range [0..1] and the enumeration value. We'd check if the enumeration value of the event in the input matches the enumeration value of the event in the training case, and then apply the confidence value appropriately, for or against this training case. It also means that the mismatching value of this event will be a mismatch of only one event, not two, and thus improve the standing of the training cases from other models with the capping logic.
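
To make this a bit more concrete, here is a minimal sketch of how applying one such enumerated event to one training case might look, by analogy with the weight update formula used for the yes/no events in the code from the earlier installments; the exact handling is my assumption, I haven't tested it.

#!/usr/bin/perl
# A sketch of applying one enumerated event to one training case, by analogy
# with the weight update formula for the yes/no events,
#   w*(1-r) + w*r*(tc*conf + (1-tc)*(1-conf)).
# The match or mismatch of the enumeration value plays the role of tc here.
# This is an untested illustration of the idea, not production code.
use strict;

sub apply_enum_event($$$$$) # (weight, relevance, case_value, input_value, conf)
{
  my ($w, $r, $caseval, $inval, $conf) = @_;
  my $match = ($caseval eq $inval) ? 1. : 0.;
  return $w * (1. - $r)
    + $w * $r * ( $match*$conf + (1. - $match)*(1. - $conf) );
}

# a training case for a Mustang, the input says "Mustang" with confidence 0.9
printf("%f\n", &apply_enum_event(1., 1., "Mustang", "Mustang", 0.9)); # 0.9
# the same input applied against a training case for a Volvo
printf("%f\n", &apply_enum_event(1., 1., "Volvo", "Mustang", 0.9)); # 0.1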

A completely different take on this problem is to preprocess the information about the model and split it into the separate events that describe the mechanical features of the car. There might be a separate event for the type of power steering (none, or hydraulic, or electric), the type of the front brakes (disc or drum), the type of the rear brakes, and so on. The approach with the enumerated multiple-valued events can be used here as well, with for example the power steering type becoming one event with 3 possible values. However, to use this approach effectively, we really need to handle the event relevance properly: the training cases for the problems in each subsystem must mark only the identifying information about this subsystem as relevant. You wouldn't want to fail to recognize a brake problem just because the car in the training case had a different type of power steering than the one being diagnosed. But this kind of relevance might be difficult to discover automatically; it would require the information about the subsystems and their relations to be added by the humans.

The good part about this splitting is that it would allow diagnosing the cars that have been modified by the owners. If someone installed the Lincoln disc brakes on the rear axle of a Fox-body Ford Mustang that had the drum brakes from the factory, this information can be entered as an override after entering the car model, changing the event for the type of the rear brakes. And the system will still be capable of diagnosing the problems. On the other hand, it might not help much: the aftermarket parts often have their own problems stemming from the mismatch of their characteristics with the original parts. For example, the said Lincoln brake conversion fits mechanically but causes the very wrong proportioning of the brake forces between the front and the rear. On the third hand, a sufficiently smart diagnostics system might still be useful. Even though the misproportioned brake forces do not happen in the cars from the factory, a smart enough system might still be able to point to a faulty master brake cylinder. It wouldn't be faulty as such but it would be faulty in its function for the modified brakes, distributing the brake forces in the wrong proportion.

Tuesday, December 8, 2015

Bayes 18: finding the relevance

I've previously promised to describe the ideas I have for computing the relevance values, and here we go. I haven't tested these ideas with a set of production data. They're just ideas. But they look like a reasonable start.

The general idea is to tell whether this value of this event differentiates this hypothesis/case from the other hypotheses/cases or not. If it does differentiate, it's relevant; otherwise it's irrelevant.

Taking two cases, we should be able to differentiate them based only on the relevant events in them both (and obviously a relevant event in one case can only be used to make a decision for this case, not for the other one).

For example, suppose that we have two cases with outcome of two different hypotheses where the event values are represented as a bit string:

hyp1 0 0 0 1 0 0
hyp2 0 1 0 0 0 0

We can mark some events irrelevant by "-" (how we do this is a separate question, for now let's suppose that we've done it somehow) and still be able to differentiate these cases:

hyp1 - - - 1 - -
hyp2 - 1 - - - -

The bit string "0 0 0 1 0 0" will still match only the case for hyp1, and "0 1 0 0 0 0" will still match only the case for hyp2. Obviously, when working with the full training set, we should be able to differentiate between any two cases in the training set.

As it turns out, this problem of finding the smallest set of differentiating bits is already known. It's the same problem as the optimization of a set of boolean functions in the hardware engineering. The very classic way to approach it is with the Karnaugh maps, though they don't scale so easily to the large number of bits. But there's got to be some newer and better ways for this optimization.

This approach has its pitfalls too. For another example, suppose that we have the cases:

hyp3 0 0 0 1 0 0
hyp4 0 1 0 1 0 0

Then we can differentiate them based purely on one event:

hyp3 - 0 - - - -
hyp4 - 1 - - - -

Of course, this works only as long as we have only these two cases. And it would likely result in the spurious false positives: if all the events in the input are at 0, they would still be taken to mean hyp3.

This can be resolved or at least improved by having the training set include the set of the cases where there is nothing wrong. What is known as the "null hypothesis", and what I've been marking as the "ok" hypothesis in the previous examples. The events that differ in value from the null hypothesis should always be relevant. If we arrange the event polarity so that the null hypothesis has all the events at 0, this rule translates into "if an event is at 1, it must be relevant", and only the events at 0 need to be considered for the possible irrelevance.
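
Here is a minimal sketch of this polarity rule; the case values are made up, and "?" marks the events that remain candidates for the irrelevance:

#!/usr/bin/perl
# A sketch of the polarity rule: with the event values arranged so that the
# null hypothesis has all of them at 0, any event at 1 in a training case is
# marked as definitely relevant ("1"), and only the events at 0 remain
# candidates for being marked irrelevant ("?").
use strict;

my @case = (0, 1, 0, 0, 1, 0); # event values of one training case

my @marks = map { $_ == 1 ? "1" : "?" } @case;
print join(" ", @marks), "\n"; # prints: ? 1 ? ? 1 ?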

Things get more complicated when we have some cases resulting in multiple hypotheses. Splitting their symptoms into the single-hypothesis cases is not obvious. It can be attempted by subtracting the other single-hypothesis cases and looking at what is left. If some hypothesis has no single-hypothesis cases at all, that becomes even more strange. Then perhaps these cases should be left as-is because they express the special interactions between multiple hypotheses. Or we can try to find the commonality between all the cases that include this particular hypothesis and discard the differences stemming from the other hypotheses included in these cases.

Let's look at the highly pathological example I've shown before and try to do these manipulations with it. The example was:

# tab09_01.txt
!             evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

Previously I've calculated the probabilities by hypothesis:

# tab09_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.5,0.5
hyp2,0.66667,0.5,1,0.5
hyp3,0.66667,0.5,0.5,1

Those probabilities P(E|H) can be used as a guidance for relevance. If they're close to 0 or 1 (say, below 0.1 or above 0.9), we'll say that this event is relevant for this hypothesis, and if it's somewhere in the middle, we'll say that it's irrelevant. Then we can use this knowledge to split the cases:

!             evA evB evC
1 * hyp1      1   -   -
1 *      hyp2 -   1   -
1 * hyp2      -   1   -
1 *      hyp3 -   -   1
1 * hyp1      1   -   -
1 *      hyp3 -   -   1

Then we can merge the cases that are the same:

!        evA evB evC
2 * hyp1 1   -   -
2 * hyp2 -   1   -
2 * hyp3 -   -   1
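
To make the threshold rule concrete, here is a minimal sketch of that marking step, using the P(E|H) table computed above and the same made-up 0.1/0.9 thresholds:

#!/usr/bin/perl
# A sketch of marking the relevance from P(E|H): a probability above 0.9 is a
# relevant event expected at 1, below 0.1 a relevant event expected at 0, and
# anything in the middle is marked irrelevant ("-").
use strict;

my %peh = ( # P(E|H) for evA, evB, evC, from the table above
  hyp1 => [ 1, 0.5, 0.5 ],
  hyp2 => [ 0.5, 1, 0.5 ],
  hyp3 => [ 0.5, 0.5, 1 ],
);

foreach my $h (sort keys %peh) {
  my @marks = map { $_ >= 0.9 ? "1" : ($_ <= 0.1 ? "0" : "-") } @{$peh{$h}};
  printf("%s %s\n", $h, join(" ", @marks));
}
# prints:
# hyp1 1 - -
# hyp2 - 1 -
# hyp3 - - 1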

It worked well for this special example but might not be so great in the bigger world. It needs more thinking and experimentation.

Tuesday, November 17, 2015

Bayes 17: confidence subtraction

One of the ways I've come up with for differentiating the valid hypotheses from noise is the idea of subtracting the competing hypotheses.

To remind what the problem is, consider a simple training table:

# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1

One event indicates one hypothesis. Let's ignore the idea of the relevance for now, because the relevance is not yet obvious to compute. I actually plan to work up to the computation of the relevance from the subtraction logic.

If we feed the input with all events present (and with capping), that results in all hypotheses getting equal probability:

# in17_01_01.txt 
evA,1
evB,1
evC,1

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_01.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

If we feed the input with all the events absent (and with capping), that also results in all hypotheses getting equal probability:

# in17_01_02.txt 
evA,0
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_02.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.99000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.99000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.98010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00980,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00980,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00980,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

How do we know that for the first input the right answer is "all three hypotheses are true" and in the second input the right answer is "all three hypotheses are false"? Note that if we look at the weights, the weights are much higher for the second input.

The idea I've come up with is that we can take the set of the highly probable hypotheses (all three hypotheses in the examples above) and try to subtract the effects of all but one hypothesis in the set from the input. Then run the modified input through the table again and see if that one remaining hypothesis will pop up above all the others. If it will, it should be accepted. If it won't, it should be refused. Repeat the computation for every hypothesis in the set.

To do that, we need to decide what exactly "subtract" means.

It seems reasonable to make the decision based on what probability this event has for this one hypothesis and what probability it has for all the other top hypotheses.

This can be interpreted in two ways, depending on which case weights we're using for this computation: those from the original table or those from the result of the first computation. Using the weights from the result of the first computation seems to make more sense, since it favors the cases that have actually matched the input.

OK, suppose we get these two probability values; how do we subtract the effects? Let's look at some examples of what results would make sense.

Let's name Pone the probability of the event in the situation where the one chosen hypothesis is true (it can also be called P(E|H), fairly consistently with what we designated by it before, or TC(E|H)), and Prest its probability in the situation where all the other top hypotheses are true. Let's call the actual event confidence C, and the confidence after subtraction Csub.

Some obvious cases would be if Pone and Prest are opposite:

Pone=0, Prest=1, C=1 => Csub=0
Pone=1, Prest=0, C=1 => Csub=1
Pone=0, Prest=1, C=0 => Csub=0
Pone=1, Prest=0, C=0 => Csub=1

Basically, if Prest and C are opposite, C stays as it was, if Prest and C are the same, C flips. The other way to say it is that Csub ends up matching Pone.

The less obvious cases are where both Pone and Prest point the same way. Should C stay? Should it move towards 0.5? One thing that can be said for sure is that C shouldn't flip in this situation. There are arguments for both staying and for moving towards 0.5. This situation means that all the remaining cases match the state of this event, so staying means that there is no reason to penalize one case just because the other cases match it. Moving towards 0.5 means that we say that the rest of the hypotheses can account well for this event by themselves, so let's try to eliminate the "also ran" hypothesis. Staying seems to make more sense to me.

The goal of the subtraction is that with the subtracted confidences applied, a valid hypothesis should be strongly boosted above all others. If it doesn't get boosted (i.e. if it still gets tangled with other hypotheses), it's probably not a valid hypothesis but just some random noise.

The only time I've used the subtraction approach with the real data, I did it in a simple way, and it still worked quite well. That implementation can be expressed as:

Csub = C * TCone/(TCone + TCrest)

Here TCone and TCrest are similar to Pone and Prest but represent the sums of weighted training confidences instead of probabilities:

TCone(E) = sum(TC(E|Hone)*W(Hone))
TCrest(E) = sum(TC(E|Hrest)*W(Hrest))

That implementation was asymmetric: if C is 1, Csub may become less than C, but if C is 0, Csub will stay at 0. It handles reasonably well the situations where the event is mostly positive for the top hypotheses but not the situations where the event is mostly negative for the top hypotheses.
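
Here is a minimal sketch of this computation for one chosen hypothesis; for simplicity it treats the weights of all the top cases as equal (as they end up after the first run in these examples), which is an assumption of the illustration rather than the full algorithm:

#!/usr/bin/perl
# A sketch of Csub = C * TCone/(TCone + TCrest) for one chosen hypothesis.
# For simplicity all the case weights are taken as 1; the table is tab17_01
# from above and the input has all the events at confidence 1.
use strict;

my @evname = ("evA", "evB", "evC");
my %tc = ( # TC(E|H) per hypothesis, in the order evA, evB, evC
  hyp1 => [ 1, 0, 0 ],
  hyp2 => [ 0, 1, 0 ],
  hyp3 => [ 0, 0, 1 ],
);
my @top = ("hyp1", "hyp2", "hyp3"); # the top hypotheses after the first run
my @conf = (1, 1, 1); # the input confidences C

my $one = "hyp1"; # the hypothesis being checked
for (my $e = 0; $e <= $#conf; $e++) {
  my $tcone = $tc{$one}[$e];
  my $tcrest = 0.;
  foreach my $h (@top) {
    $tcrest += $tc{$h}[$e] unless $h eq $one;
  }
  # guard against dividing by 0 when no top case has this event; keep C then
  my $csub = ($tcone + $tcrest == 0.) ? $conf[$e]
    : $conf[$e] * $tcone / ($tcone + $tcrest);
  printf("Csub(%s) = %g\n", $evname[$e], $csub);
}
# prints Csub(evA) = 1, Csub(evB) = 0, Csub(evC) = 0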

If we compute the values of Csub in this way for the first example above (C(evA)=1, C(evB)=1, C(evC)=1), we will get:

  • For hyp1: Csub(evA)=1, Csub(evB)=0, Csub(evC)=0
  • For hyp2: Csub(evA)=0, Csub(evB)=1, Csub(evC)=0
  • For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=1

These exactly match the training cases, so all three hypotheses will be accepted, with each hypothesis going to the probability 1 on its run.

If we compute the values of Csub in this way for the second example above (C(evA)=0, C(evB)=0, C(evC)=0), we will get:

  • For hyp1: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
  • For hyp2: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
  • For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0

They stay the same as the original input, thus the results won't change, the probabilities of all hypotheses will stay at 0.33 for each run, and all the hypotheses will be rejected.

The defect of this formula shows itself when the events are negative, their absence pointing towards the hypotheses, as in the following table:

# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0

In this case the all-0 input should produce the result saying that all the hypotheses are true, and all-1 input should have all the hypotheses as false.

For all-one C(evA)=1, C(evB)=1, C(evC)=1, we will get:

  • For hyp1: Csub(evA)=0, Csub(evB)=0.5, Csub(evC)=0.5
  • For hyp2: Csub(evA)=0.5, Csub(evB)=0, Csub(evC)=0.5
  • For hyp3: Csub(evA)=0.5, Csub(evB)=0.5, Csub(evC)=0

Let's try applying the computed values for hyp1:

# in17_02_03.txt
evA,0
evB,0.5
evC,0.5

$ perl ex16_01run.pl -c 0.01 tab17_02.txt in17_02_03.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.01000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.01000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evB, c=0.500000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.49500,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.00500,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.00500,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evC, c=0.500000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.24750,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.00250,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.00250,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1    0.98020
hyp2    0.00990
hyp3    0.00990

The hyp1 still comes out as true! This is the problem caused by the asymmetry of the formula.

My next idea for a better formula was this: instead of subtractions, just either leave Csub=C or "flip" it: Csub=(1-C). The idea of the flipping is that the direction of C, whether it's less or greater than 0.5, shows the "value" of the event, while the distance between C and 0.5 shows the confidence as such. The operation of flipping keeps the confidence (i.e. the distance between C and 0.5) the same while changing the direction. And if C was 0.5, the flipping will have no effect.

C would be flipped in the situation where it points against this hypothesis but for another top hypothesis. This situation likely means that this event is a symptom of another hypothesis but not really relevant for this one.

The logic will be like this:

If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
then Csub = C;
else if there exists another top hypothesis with TCother pointing in the
 same direction as C
then Csub = (1-C);
else Csub = C;

And instead of hypotheses, we can work with the individual training cases. Instead of picking the top hypotheses, pick the top cases. And then do the subtraction/flipping logic case-by-case. Except perhaps exclude the other cases of the same hypothesis from consideration of "exists another" for flipping.

Let's work this logic through our examples.

The first example is the table

# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1

and the all-one input: C(evA)=1, C(evB)=1, C(evC)=1. From the previous computation we know that all three hypotheses are the top probable hypotheses.

Let's check hyp1:

  • For evA, TC=1 and C=1, point the same way, so Csub=1.

  • For evB, TC=0 and C=1, point opposite, and there is TC(evB|hyp2)=1, so flip to Csub=0.

  • For evC, TC=0 and C=1, point opposite, and there is TC(evC|hyp3)=1, so flip to Csub=0.

We've got the subtracted values, let's run them through the processing:

# in17_01_03.txt
evA,1
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_03.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.98010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00990,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.97030,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.99980
hyp2    0.00010
hyp3    0.00010

It says that hyp1 is quite true. Similarly, hyp2 and hyp3 would show themselves as true.

For the second example, let's look at the same table and the inputs of all-0: C(evA)=0, C(evB)=0, C(evC)=0. From the previous computation we know that all three hypotheses are again the top probable hypotheses.

Let's check hyp1:

  • For evA, TC=1 and C=0, point opposite, and there are two other hypotheses with TC=0, so flip to Csub=1.
  • For evB, TC=0 and C=0, point the same, so Csub=0.
  • For evC, TC=0 and C=0, point the same, so Csub=0.

It again produced (1, 0, 0), and hyp1 would also show as true! But that's not a good result, it's a false positive. This idea didn't work out well.

The problem is that we need to differentiate between the states of the event that say "there is nothing wrong" and "there is something wrong", and flip the event only if it was pointing in the direction of "there is something wrong". That's what my first asymmetric logic did, it always assumed that C=0 meant "there is nothing wrong".

If we have the additional information about which state of the event is "normal" and which is "wrong", that would solve this problem. If we don't have this information, we can try to deduce it. A simple assumption could be that if a symptom is specific to some cases, then in most of the training cases it will be in the "normal" state, and only in a few of these specific cases it will be in the "wrong" state.

Of course, there will be exceptions, for example if a medical diagnostic system has an event with the question "is the patient feeling unwell?" then the answer for this question in most cases will be true, even though this is not the "normal" state. But it doesn't seem to cause problems: for most patients and most hypotheses TC and C on this event will be pointing the same way, and there will be no need for flipping anyway.

So, let's update the logic rules:

If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
then Csub = C;
else if there exists another top hypothesis with TCother pointing in the
 same direction as C and most cases in the original training data were
 pointing opposite C
then Csub = (1-C);
else Csub = C;

With this revised logic the revisited case of the all-0 inputs (C(evA)=0, C(evB)=0, C(evC)=0) for hyp1 will be:

  • For evA, TC=1 and C=0, point opposite, but most training cases (2 of 3) also point to C=0, so leave Csub=0.
  • For evB, TC=0 and C=0, point the same, so Csub=0.
  • For evC, TC=0 and C=0, point the same, so Csub=0.

With this unchanged input, hyp1 will still finish with the probability of 0.33, and it won't make the cut. Neither will hyp2 nor hyp3 when processed in the same way.

Let's look at the example with the opposite table

# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0

and again the all-0 input. In it the handling of hyp1 will be:

  • For evA, TC=0 and C=0, point the same, so Csub=0.
  • For evB, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp2 with TC(evB|hyp2) = 0, so flip to Csub=1.
  • For evC, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp3 with TC(evC|hyp3) = 0, so flip to Csub=1.

This result of (C(evA)=0, C(evB)=1, C(evC)=1) will match the training case for hyp1 exactly, and drive its probability all the way up, just as we wanted to.
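
Here is a minimal sketch of this revised rule, checked for one chosen hypothesis; representing each hypothesis by a single training case of weight 1, as in these examples, is a simplification:

#!/usr/bin/perl
# A sketch of the revised flipping rule for one chosen hypothesis, with each
# hypothesis represented by a single training case of weight 1 (as in the
# examples above). The table is tab17_02 and the input is all-0.
use strict;

sub same_dir($$) { my ($x, $y) = @_; return (($x > 0.5) == ($y > 0.5)); }

sub csub($$$$) # (one_hyp, event_index, conf, table_ref)
{
  my ($one, $e, $c, $tab) = @_;
  return $c if &same_dir($tab->{$one}[$e], $c);
  # is there another top hypothesis whose TC points the same way as C?
  my $other = grep { $_ ne $one && &same_dir($tab->{$_}[$e], $c) } keys %$tab;
  # do most of the training cases point opposite to C?
  my $opposite = grep { !&same_dir($tab->{$_}[$e], $c) } keys %$tab;
  return ($other && $opposite > scalar(keys %$tab) / 2.) ? (1. - $c) : $c;
}

my %tab17_02 = ( # TC(E|H) for evA, evB, evC
  hyp1 => [ 0, 1, 1 ],
  hyp2 => [ 1, 0, 1 ],
  hyp3 => [ 1, 1, 0 ],
);
# the all-0 input checked for hyp1: prints "0 1 1", matching the result above
printf("%g %g %g\n", map { &csub("hyp1", $_, 0., \%tab17_02) } (0, 1, 2));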

This last logic has managed to handle all the examples fairly decently.

Monday, November 9, 2015

Bayes 16: code for hypothesis composition

This time I want to show the code that composes the separate cases into the merged hypotheses. I've shown before how to do it manually, and now here is the code to do it.

The short summary of changes is:

  • The %hyphash went away since it turned out to be unnecessary.
  • Option "-compose N" added. N is the fraction in range [0..1] of what part of the weights needs to be composed into the per-hypothesis merged cases. 0 is the default and means that nothing will be composed. 1 means that all the data will be composed per hypothesis.
  • Option "-msplit" added. It enables the splitting of the multi-hypothesis cases into multiple single-hypothesis cases (just as shown before, the same data gets copied to every hypothesis separately).
  • The table loading code has been extended to allow the reading of the multi-hypothesis cases that are formatted on multiple lines. This allows saving the result of the composition and then loading it for the diagnosis runs.
  • The new logic is in the function compose_hypotheses().

Initially I wanted to show only the changed code but then decided that since I'm not uploading the files anywhere yet, it would be easier for the people who want to play with the code to copy-and-paste it than to assemble it from bits and pieces. So here we go, the whole new code:

# ex16_01run.pl
#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.
# Includes the code to compose the training cases by hypothesis.
# And can parse back the multi-hypothesis cases printed in multi-line.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our @case; # the table of training cases
  # each case is represented as a hash with elements:
  # "hyp" - array of hypotheis names that were diagnosed in this case
  # "wt" - weight of the case
  # "origwt" - original weight of the case as loaded form the table
  # "tc" - array of training confidence of events TC(E|I)
  # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
  # for overfitting in the training data;
  # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
our $composite = 0.; # fraction [0..1] of the cases to compose by hypothesis
our $mhyp_split = 0; # split the multi-hypothesis cases when computing the composites

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields field
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  @case = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);

  my $prepend;
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements get skipped
      shift @evname;
      shift @evname;
    } elsif (/\+\s*$/) {
      # a split-line of hypothesis names
      $prepend .= $_;
    } else {
      $_ = $prepend . $_;
      $prepend = undef;
      my @s = split(/,/); # CSV format for a training case
      # Each line contains:
      # - list of hypotheses, separated by "+"
      # - weight (in this position it's compatible with the format of probability tables)
      # - list of event data that might be either of:
      #   - one number - the event's training confidence TC(E|I), implying R(E|I)=1
      #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
      #   - two numbers separated by a "/": TC(E|I)/R(E|I)

      my $c = { };
      
      my @hyps = split(/\+/, shift @s);
      for (my $i = 0; $i <= $#hyps; $i++) {
        $hyps[$i] =~ s/^\s+//;
        $hyps[$i] =~ s/\s+$//;
      }

      $c->{hyp} = \@hyps;
      $c->{origwt} = $c->{wt} = shift(@s) + 0.;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }

      # the rest of fields are the events in this case
      foreach my $e (@s) {
        if ($e =~ /^\s*-\s*$/) {
          push @{$c->{r}}, 0.;
          push @{$c->{tc}}, 0.;
        } else {
          my @edata = split(/\//, $e);
          push @{$c->{tc}}, ($edata[0] + 0.);
          if ($#edata <= 0) {
            push @{$c->{r}}, 1.;
          } else {
            push @{$c->{r}}, ($edata[1] + 0.);
          }
        }
      }

      push @case, $c;
    }
  }
  close(INPUT);
  if ($prepend) {
    confess "The input contained the hanging hypothese names: $prepend\n";
  }

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
}

sub print_table()
{
  # the title row
  printf($pf_tfmt . ",", "!");
  printf($pf_tfmt . ",", "");
  foreach my $e (@evname) {
    printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
  }
  print("\n");
  # the cases
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    # if more than one hypothesis, print each of them on a separate line
    for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
      printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
    }

    printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
    printf($pf_nfmt . ",", $c->{wt});
    for (my $j = 0; $j <= $#evname; $j++) {
      printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
    }
    print("\n");
  }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
  %phyp = ();
  my $h;

  # start by getting the weights
  my $sum = 0.;
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    $sum += $w;

    foreach $h (@{$case[$i]->{hyp}}) {
      $phyp{$h} += $w;
    }
  }

  if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
    for $h (keys %phyp) {
      $phyp{$h} /= $sum;
    }
  }
}


# Print the probabilities of the hypotheses
sub print_phyp()
{
  printf("--- Probabilities\n");
  for my $h (sort keys %phyp) {
    printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # update the weights
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    my $r = $case[$i]->{r}[$evi];
    my $tc = $case[$i]->{tc}[$evi];

    $case[$i]->{wt} = $w * (1. - $r)
      + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
  }
}


# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# compose the training cases by hypothesis
sub compose_hypotheses()
{
  if ($mhyp_split) {
    # the number of cases will grow, remember the index of last original case
    my $lastcase = $#case;
    for (my $i = 0; $i <= $lastcase; $i++) {
      my $c = $case[$i];
      while ($#{$c->{hyp}} > 0) {
        # split a copy of the case for each resulting hypothesis in it
        my $hname = pop @{$c->{hyp}};
        push @case, {
          hyp => [$hname],
          wt => $c->{wt},
          origwt => $c->{origwt},
          tc => $c->{tc},
          r => $c->{r},
        };
      }
    }
  }

  if ($composite <= 0.) {
    return; # nothing to do
  }

  # newly-generated composite hypotheses
  my %hyps; # keyed by the hypothesis names
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    my $key = join("+", sort @{$c->{hyp}});
    if (!exists $hyps{$key}) {
      $hyps{$key} = {
        hyp => $c->{hyp},
        wt => 0.,
        wtrel => [], # weight of relevant part, by event
        tc => [], # initially will contain the sum
        r => [], # will be filled later
      };
    }
    my $hyp = $hyps{$key};
    my $wt = $c->{wt};

    $hyp->{wt} += $wt;

    for (my $e = 0; $e <= $#evname; $e++) {
      my $r = $c->{r}[$e];
      $hyp->{wtrel}[$e] += $wt * $r;
      $hyp->{tc}[$e] += $wt * $r * $c->{tc}[$e];
    }
  }
  if ($composite >= 1.) {
    # throw away the raw cases, since the hypotheses will replace them
    @case = ();
  } else {
    # 2nd pass: adjust the weight of the raw cases,
    # unless it's the only case of the hypothesis (then throw
    # away that hypothesis)
    for (my $i = 0; $i <= $#case; $i++) {
      my $c = $case[$i];
      my $key = join("+", sort @{$c->{hyp}});
      if ($hyps{$key}->{wt} == $c->{wt}) {
        delete $hyps{$key}; # hypothesis is a copy of the case
      } else {
        $c->{wt} *= (1. - $composite);
      }
    }
  }

  foreach my $h (sort keys %hyps) {
    my $hyp = $hyps{$h};
    my $wt = $hyp->{wt};

    for (my $e = 0; $e <= $#evname; $e++) {
      $hyp->{r}[$e] = $hyp->{wtrel}[$e] / $wt;
      if ($hyp->{wtrel}[$e] == 0.) {
        $hyp->{tc}[$e] = 0.;
      } else {
        $hyp->{tc}[$e] /= $hyp->{wtrel}[$e];
      }
    }

    # scale down the weight
    $hyp->{wt} *= $composite;
    $hyp->{origwt} = $hyp->{wt};
    delete $hyp->{wtrel};
    push @case, $hyp;
  }
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } elsif ($1 eq "b") {
    shift @ARGV;
    $boundary = $ARGV[0]+0.;
  } elsif ($1 eq "compose") {
    shift @ARGV;
    $composite = $ARGV[0]+0.;
  } elsif ($1 eq "msplit") {
    $mhyp_split = 1;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
if ($composite > 0. || $mhyp_split) {
  &compose_hypotheses;
  printf "--- Composite ${pf_nfmt}:\n", $composite;
  &print_table;
}
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
  &compute_phyp;
  &print_phyp;
}

The basic example, loading of a multi-line case and splitting of the multi-hypothesis cases:

#tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,

$ perl ex16_01run.pl -msplit tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same but also fully composed by hypotheses:

$ perl ex16_01run.pl -msplit -compose 1 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,2.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,2.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same with only 0.5 of the weights composed:

$ perl ex16_01run.pl -msplit -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,1.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,1.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

It's a combination of the two examples above. The original split cases are left with half their weight, and the other half of the weight has been composed.

What if we try the composition without splitting?

$ perl ex16_01run.pl -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667

The result is exactly the same as the original. This is because the original had only one case for each hypothesis combination, so there is nothing to compose and there is no point in splitting them into two.

Now a demo of the relevance computation:

# tab16_02.txt
!,,evA,evB,evC
hyp1,1,1,0,-
hyp1,1,0,1,-
hyp1,6,-,-,1

$ perl ex16_01run.pl -compose 1 tab16_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/0.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/0.00,
hyp1   ,6.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,8.00000,0.50000/0.25,0.50000/0.25,1.00000/0.75,
--- Probabilities
hyp1    1.00000

According to the logic from the previous installment, the relevance of evA and evB gets set to 0.25 because 2/8=1/4 of the cases for them have them relevant, and for evC the relevance is set to 0.75 because 6/8=3/4 cases for it are relevant.