tag:blogger.com,1999:blog-31953525109246115042017-04-24T06:41:33.211-04:00Sergey Babkin on CEPMy thoughts on the field of Complex Event Processing. Since I'm a technical guy, they go mostly on the technical side. And mostly about my OpenSource project Triceps. To read the Triceps documentation, please follow from the oldest posts to the newest ones.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.comBlogger348125tag:blogger.com,1999:blog-3195352510924611504.post-72102313079948741002017-03-08T15:54:00.003-05:002017-03-08T15:54:59.979-05:00OS support for thread stoppingI've recently had done some thinking about how can an OS make the shutdown of a multithreaded application or its part easier, to stop the threads quickly but relatively nicely on request. I think a very helpful support can be pretty easy to implement:<br /><br /><ul><li>Have system call to request the shutdown of an OS-level thread.</li><li>This call would make any system calls that could block non-blocking (except for the thread synchronization calls, like mutex or condition variable operations), would interrupt any current such blocking calls, and return an error. This includes the timed sleeps, and even the thread synchronization calls with timeouts would return an error immediately instead of sleeping. Optionally, it might also just fail any file operations altogether, possibly except for close(). Note that the file descriptors as such wouldn't be revoked altogether, they would just kind of stop working in one thread, but the other threads would still work.</li><li>A problem with this might be in such things as logging: when your thread is stopping, it might still write the useful log files. Revoking these log files would be Well, one solution for this would be to buffer the logs in memory and write from a separate thread. Another might be to keep some kind of exception list.</li><li>The threads might also beget threads, and keeping track of all of them is complicated. So there is a need for grouping them, like the process groups in the OS, or Triceps Triead fragments, and then stopping the whole group.</li></ul>Yet another separate approach might be to say "but if you want the ability to stop something, why don't you run it as a separate process and kill if when needed?". This would make a lot of sense for such things as PAM authentication modules.<br /><br />There are problems with the separate processes too. But after some thinking, they seem to be fairly easy to solve.<br /><br />When you think about processes, usually a separate binary comes to mind but it doesn't have to. You can just do a fork() and run from the same binary, and you even get the whole address space copy-on-write out of the box. Though I guess it might be interesting to NOT copy-on-write the whole address space. Instead just allocate some communication area for passing the arguments and results, and get rd of the rest, maybe even re-running the run-time loader from scratch in the new context (that would be kind of like the Java class loader spaces).<br /><br />Communication with a separate process is kind of painful. But if you map an anonymous memory segment, that segment can be inherited by the child process. Indeed, should be inherited to act as the communication region (truly shared, not COW) even if the rest of address space is dropped. And then you can happily create the mutexes and condition variables in it for synchronization. The arguments and results would have to be serialized to pass through this region but it shouldn't be such a big deal. It would be also possible to keep some large data dictionaries in this communication space.<br /><br />Of course the callbacks wouldn't work across the process boundary. But there is a solution for this too: create a local thread pool (or simply a thread) for callbacks that would get the requests from the remote process though a portion of the same communications region, execute the callback, and return the results back there.<br /><br />I think this looks feasible even without any new technology needed, just needs a helper library to implement the interface. This isn't even a new concept, it's a classic microkernel architecture without the common memory space. Yeah, copying the data in the serialized form adds an overhead but it's not that much overhead compared to say JVM, and people use Java and .NET all over the place with no regrets.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-24482752653729421082017-03-05T21:41:00.000-05:002017-03-05T21:41:27.421-05:00Bayes 25 & AdaBoost: the partial confidence problemThe code in the previous example works only with the full-confidence events, those that have the confidence of 0 or 1. Unfortunately I haven't found a good way to handle the partial confidence yet. I've found <i>a</i> way but it has problems. Let me show you.<br /><br /><a href="http://babkin-cep.blogspot.com/2015/10/bayes-4-fuzzy-logic.html">Previously</a> I was treating the partial confidence as a superposition of 0 and 1. This approach cannot be properly used here because it doesn't cover the connections between the events. To give an example, in the <a href="http://babkin-cep.blogspot.com/2017/03/bayes-24-adaboost-as-solution-to.html">previous post</a> the code had correctly found that in tab24_01.txt the events E1and E2 are the same, so after E1 is applied, E2 becomes ignored (it's AdaBoost probabilities become 0.5). So now if we then do a computation and find that E1 is unknown (i.e. its confidence is 0.5) but E2 is known, E2 will still be ignored because its table entry says to always ignore it.<br /><br />The correct way should handle this situation properly: if E1 is unknown, this should bring up the table values for E2 to compensate for it and treat E2 as E1 was formerly.<br /><br />The obvious way to do it is to treat the partial confidence as a superposition of "the event is unknown" and "the event is known". So if the event is known, we process it as before. If the event is unknown, we would follow the unknown branch and then apply all the following events as if this events wasn't there. And if the event is partially known, we would split the weights of all the outcomes proportionally and handle one part with the known branch and another one with the unknown branch.<br /><br />So the confidence of 0.7 would mean a superposition of 0.4 of unknown and 0.6 of the known value "1". The confidence 0.1 would mean a superposition of 0.8 of unknown and 0.2 of the known value "0". So if we denote C(E) as the event confidence, K(E) as the known weight in the superposition, and U(E) as the unknown weight in the superposition, we can write them as:<br /><br />K(E) = abs(C(E) - 0.5) * 2<br />U(E) = 1 - K(E)<br /><br />The downside of this obvious approach is that the table and computation becomes more complex. As we go through more events, two branches become four, then eight, then sixteen, and so on. The formerly quadratic complexity O(Noutcomes * Nevents) becomes exponential O(Noutcomes*Nevents*2^Nevents). I don't know yet if there is a better way.<br /><br />Maybe it would be possible to build a table of "gradients" between the events: i.e. if the confidence of the event A changes, how are the probabilities of another event B affected. The problem though is that this gradient depends on all the events in the sequence between A and B, so tis idea doesn't seem to really fly.<br /><br />Maybe another idea is workable, that would reduce the exponential cost to a cubic one: after splitting the superposition, instead of following two branches, just take the probabilities for each following event from both branches and mix them by the weight of the branches:<br /><br />P(H|Ei) = K(Ej) * P(H|KjEi) + U(Ej) * P(H|UjEi)<br /><br />It's better than exponential but still not great, with the run time O(Noutcomes*Nevents*Nevents). And the table of these pre-computed values would still seem to be exponential, each event depending on all the possible combinations of the previous events. Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-88190786169127206932017-03-01T02:37:00.000-05:002017-03-01T02:37:32.916-05:00Bayes 24: AdaBoost as a solution to the correlated eventsWhen writing the <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/whitepapers/AdaboostBayes.pdf?format=raw">whitepaper</a>, I've realized that AdaBoost is very good at finding the cross-correlation of its partial hypotheses (that map to the Bayesian events), and it always picks the next step with the minimal cross-correlation. The Bayesian computations do have an issue with cross-correlation, as described in the whitepaper and near the end of <a href="http://babkin-cep.blogspot.com/2015/10/bayes-8-impossible.html">Part 8 of my notes</a>: if there are multiple events that duplicate or mirror each other, all of them are treated as independent and drive the probabilities off. This looked like AdaBoost could give a hint of how to solve this problem, so I've started thinking about it, and I think I've found a solution.<br /><br />In retrospect the solution is pretty obvious: instead of computing the Bayesian model directly, treat it as an AdaBoost model whose set of the partial hypotheses equals to the set of the Bayesian events, with an extra limitation that each one of them must be used exactly once. If there are multiple equivalent events and they happen to go one after another, AdaBoost will treat the first one as real, and the following ones will get their error rate of 0.5 and will be effectively no-ops.<br /><br />Well, if there are other events interspersed in between these equivalent copies, they won't be no-ops. These other events will shift the weights of the training cases, so that the second an further copies will find themselves drifted away from 0.5. The good news here is that AdaBoost is pretty good at removing the partial correlation too but the bad news is that the order matters, to fully remove the duplication the equivalent events should go one after another. And in general, it looks like the best removal of the cross-correlation happens when the closely related events follow closely one after another. This kind of order can be implemented if we reverse the AdaBoost preference of the partial hypotheses and make it always pick the one that has its training error rate closest to 0.5. Since (unlike the classic AdaBoost) the set of partial hypotheses is limited and each element from it must be used only once, the better ones will eventually be used as well.<br /><br />To experiment with these ideas, I wrote the code example <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/ex24_01run.pl">ex24_01run.pl</a>. It doesn't reorder the events, just computes the AdaBoost values for whatever order it's given, allowing to experiment with the various orders. But it has a couple of advanced features from the later chapters of the book on AdaBoost: it does the multi-class (but single-label) version that also partitions the error rate computation for the sign of the value of the partial hypothesis.<br /><br />The multi-class means that it can work not only with two outcomes (a yes/no decision) but with multiple possible outcomes (i.e. Bayesian hypotheses), just like the normal Bayesian table. And just like the normal Bayesian table, these outcomes must be mutually exclusive, so each case may have only one outcome (single-label). The multi-class logic works in the way that is similar to the one described in <a href="http://babkin-cep.blogspot.com/2015/10/bayes-10-independence-and-relevance.html">Part 10</a>: the resulting table is just a combination of the individual per-outcome tables. From the standpoint of each outcome, the single-label property means that the rest of the rows of the table taken together represent the opposite of this outcome, so there is no need to compute that opposite explicitly.<br /><br />The partitioning of the error rate computation not only makes the result better but also comes very handy to prove that the opposite of one outcome is equivalent to the rest of the rows of the table. The AdaBoost training error <i>e</i> is the ratio of the total weights of the training cases where the value of the event E (that is, the Bayesian event, in AdaBoost terms it's a partial hypothesis) is not equal to the value of the outcome (Bayesian hypothesis) H. And we'll compute it separately for E (i.e. E=1) and ~E (i.e. E=0). It ends up expressable as probabilities that can be computed by summing and dividing weights W of the training cases:<br /><br />e(H,E) = P(~H|E) = (sum for cases i where H(i)=0 and E=1 (Wi))/(sum for cases j where E=1 (Wj))<br />e(H,~E) = P(H|~E) = (sum for cases i where H(i)=1 and E=0 (Wi))/(sum for cases j where E=0 (Wj))<br /><br />So if we pick a particular outcome and denote it H1, we can show that its opposite ~H1 is equivalent to the sum of the rest of hypotheses H2...HN by showing that<br /><br />P(~H1|E) = sum for i in [2, N] (P(Hi|E))<br /><br />which after removing the common denominator means <br /><br />(sum for cases j where H1(E)=0 and E=1 (Wj)) = (sum for i in [2, N] (sum for cases k where Hi(E)=1 and E=1 (Wk))<br /><br />The single-label property means that each case is marked with exactly one outcome, and all the cases where the outcome H1 is absent must have some other outcome Hi present, so these sums are indeed equal.<br /><br />So it is all self-consistent. The key part of the computation is:<br /><br /><pre>for my $hyp (values %abhyp) {<br /> if ($conf) {<br /> $hyp->{wt} *= $hyp->{ppos}[$evi];<br /> } else {<br /> $hyp->{wt} *= (1. - $hyp->{pneg}[$evi]);<br /> }<br /> }<br /></pre><br />And the matching part of the table building is:<br /><br /><pre>foreach my $c (@case) {<br /> my $hyp = $c->{hyp}[0];<br /> # this code really handles only the TC values of 0 and 1,<br /> # for the arbitrary values it would have to do an interpolation<br /> if ($c->{tc}[$i] >= 0.5) {<br /> $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];<br /> } else {<br /> $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);<br /> } <br /> } <br /></pre><br />The training undoes the weight change that will be done in the computation.<br /><br />Now let's try some examples. Here is the example that I used in the whitepaper:<br /><br /><pre>$ perl <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/ex24_01run.pl">ex24_01run.pl</a> -n <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/tab24_01.txt">tab24_01.txt</a> <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/in24_01_01.txt">in24_01_01.txt</a><br />Original training cases:<br />! , ,E1 ,E2 ,E3 ,<br />hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp1 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,<br />AdaBoostian table:<br />! , ,E1 ,E2 ,E3 ,<br />hyp1 ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,<br />hyp2 ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,<br />--- Applying event E1, c=1.000000<br />hyp1 w=2.00000 p=0.66667<br />hyp2 w=1.00000 p=0.33333<br />--- Applying event E2, c=1.000000<br />hyp1 w=1.00000 p=0.66667<br />hyp2 w=0.50000 p=0.33333<br />--- Applying event E3, c=1.000000<br />hyp1 w=0.40000 p=0.57143<br />hyp2 w=0.30000 p=0.42857<br /></pre><br />Here the option "-n" means "don't normalize the weights" (or when normalized to the sum of 1 they will be the same as probabilities). The two values printed in the AdaBoostian table separated by a "^" are the P(H|E) and P(~H|~E).<br /><br />As you can see, the result is not correct either (because of the XOR problem) but it's better than with the plain Bayes version that gave hyp1 the probability 0.667. And since E1 and E2 are the same, E2 ended up with the values of 0.5 in the table, and applying it didn't change the weight.<br /><br />If we change the order of events, E2 gains some use back:<br /><br /><pre>$ perl <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/ex24_01run.pl">ex24_01run.pl</a> -n <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/tab24_02.txt">tab24_02.txt</a> <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/in24_01_01.txt">in24_01_01.txt</a> <br />Original training cases:<br />! , ,E1 ,E3 ,E2 ,<br />hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp1 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,<br />AdaBoostian table:<br />! , ,E1 ,E3 ,E2 ,<br />hyp1 ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,<br />hyp2 ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,<br />--- Applying event E1, c=1.000000<br />hyp1 w=2.00000 p=0.66667<br />hyp2 w=1.00000 p=0.33333<br />--- Applying event E3, c=1.000000<br />hyp1 w=0.80000 p=0.57143<br />hyp2 w=0.60000 p=0.42857<br />--- Applying event E2, c=1.000000<br />hyp1 w=0.37895 p=0.54545<br />hyp2 w=0.31579 p=0.45455<br /></pre><br />Rather surprisingly, the result ended up more correct. Or maybe it's not that surprising: after all, AdaBoost is designed to gain more information from available data.<br /><br />Yet another interesting result get produced from repeating the same pair of events multiple times:<br /><br /><pre>$ perl <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/ex24_01run.pl">ex24_01run.pl</a> -n <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/tab24_03.txt">tab24_03.txt</a> <a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/in24_03_01.txt">in24_03_01.txt</a> <br />Original training cases:<br />! , ,E01 ,E02 ,E11 ,E12 ,E21 ,E22 ,E31 ,E32 ,E41 ,E42 ,E51 ,E52 ,E61 ,E62 ,E71 ,E72 ,E81 ,E82 ,E91 ,E92 ,<br />hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,<br />hyp1 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,<br />hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,<br />AdaBoostian table:<br />! , ,E01 ,E02 ,E11 ,E12 ,E21 ,E22 ,E31 ,E32 ,E41 ,E42 ,E51 ,E52 ,E61 ,E62 ,E71 ,E72 ,E81 ,E82 ,E91 ,E92 ,<br />hyp1 ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,0.49694^0.49526,0.49916^0.49946,0.49990^0.49985,0.49997^0.49998,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,<br />hyp2 ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,0.50306^0.50474,0.50084^0.50054,0.50010^0.50015,0.50003^0.50002,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,<br />--- Applying event E01, c=1.000000<br />hyp1 w=2.00000 p=0.66667<br />hyp2 w=1.00000 p=0.33333<br />--- Applying event E02, c=1.000000<br />hyp1 w=0.80000 p=0.57143<br />hyp2 w=0.60000 p=0.42857<br />--- Applying event E11, c=1.000000<br />hyp1 w=0.37895 p=0.54545<br />hyp2 w=0.31579 p=0.45455<br />--- Applying event E12, c=1.000000<br />hyp1 w=0.18831 p=0.54242<br />hyp2 w=0.15886 p=0.45758<br />--- Applying event E21, c=1.000000<br />hyp1 w=0.09400 p=0.54159<br />hyp2 w=0.07956 p=0.45841<br />--- Applying event E22, c=1.000000<br />hyp1 w=0.04699 p=0.54149<br />hyp2 w=0.03979 p=0.45851<br />--- Applying event E31, c=1.000000<br />hyp1 w=0.02349 p=0.54147<br />hyp2 w=0.01990 p=0.45853<br />--- Applying event E32, c=1.000000<br />hyp1 w=0.01175 p=0.54146<br />hyp2 w=0.00995 p=0.45854<br />...<br /><br />--- Applying event E92, c=1.000000<br />hyp1 w=0.00000 p=0.54146<br />hyp2 w=0.00000 p=0.45854<br /></pre><br />AdaBoost converges pretty quickly to the weights that make both events get the probability 0.5 and stop the further changes.<br /><br />This seems to work, although it needs more testing. One more limitation of this implementation is that it works only with the events that have the value of either 0 or 1, not anything in between. I'll talk more about this in the next post.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-11383816463495882842017-02-22T09:51:00.001-05:002017-02-22T09:51:22.265-05:00a better explanation of Bayes-AdaBoost connectionAfter some re-reading, my previous posts on the Bayesian interpretation of AdaBoost have been kind of cryptic. It's no wonder because I've been writing them as I was understanding things, so they came out as more of a personal reminder rather than the general explanation. But I've got around and wrote another version as a whitepaper, that goes all the way from scratch and hopefully is much more understandable:<br /><br /><a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/whitepapers/AdaboostBayes.pdf?format=raw">https://sourceforge.net/p/exbayes/code/HEAD/tree/whitepapers/AdaboostBayes.pdf?format=raw</a><br /><br />Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-36199880819210511012017-02-20T13:21:00.003-05:002017-02-21T01:37:50.507-05:00Bayesian examples now on SourceForgeI've been working on one more post on the Bayes-AdaBoost cross-over, and I've finally got tired of pasting the substantial examples right in the post. So I've created a SourceForge project for them and uploaded all the examples there:<br /><br /><a href="https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/">https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/</a><br /><br />The proper SVN checkout recipe from there is:<br /><br />svn checkout svn://svn.code.sf.net/p/exbayes/code/ exbayesSergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-59640407711855512572017-02-04T20:08:00.001-05:002017-02-28T13:45:27.569-05:00The mythical 10x programmer, or the silver bullet explainedRecently I've been talking about some of the experiences from Aleri times when my interlocutor responded "so I guess you're one of those 10x programmers". This made me pause. Like any programmer per Larry Wall's recipe, I have a good amount of hubris but can I really say that I'm a 10x programmer? After some thinking, I guess sometimes I am and sometimes I'm not. This got me thinking further, about what was different between the situations when I was and when I wasn't. And I think the difference is very much about the organization and methodology, and thus can probably be applied to any programmers to improve their productivity. I've tried to distill the knowledge, and here it goes. By the way, Fred Brooks also called the goal of 10x productivity increase the "silver bullet", so here is maybe your silver bullet. Maybe. I'm not entirely sure that the recipe will work out of the box, it will probably need at least some debugging. Or maybe it won't work but it's still worth a try.<br /><br />Let me start with describing a couple of cases when I could fairly be said to have been a 10x programmer. There were some other good cases too but on a couple of occasions I've actually had a benchmark.<br /><br />One case was at Aleri, working on the CEP system. Here is the real short version of the events: At the time there were 3 main players in the CEP world: StreamBase, Aleri and Coral8, all having an approximate parity in their products. Around 2009 Coral8 apparently ran out of money, so Aleri used it this opportunity to consume it. And then it had two approximately equal products and set about merging them. The merging didn't go that great and Aleri ran out of money too, eventually consumed in 2010 by Sybase. See, even though the Coral8 product was about the same from the functional standpoint, it had the code base about 10x the size, and had maybe not quite 10x but a lot more people working on it. Since the Coral8 code base was so much larger and there were many more people familiar with it, it was taken as the foundation of the merged product. And this merging then crashed and burned. Another merging attempt was then done under Sybase, this time doing it the other way around, and it worked much better. So it's an interesting example where the 10x was there, then it wasn't, and then it was again.<br /><br />Another case was the UnitedLinux. The full history of UnitedLinux is an interesting story of dog-eat-dog in the world of the Linux vendors but I won't go into the full funny details, just touch up on the important parts. One of conditions of the UnitedLinux consortium set by SuSE was that SuSE does all the general development. Everyone else does just their branding customizations and additional packages. So Caldera where I worked at the time (soon to be renamed SCO, with the SCO Linux brand of UnitedLinux) had closed the Linux development office in Germany, with most of the people heading to SuSE. When UnitedLinux was released, the various consortium members (I remember Connectiva, and whoever else) held their release parties and posted pictures with great many people. We didn't have any party at SCO: see, the SCO Linux was done by only two developers and half a tester (and a bit of a graphical artist), and Ron was in California while I was in New Jersey, so having a party would have been difficult. And we've had it ready earlier: the final image was sent to testing about 20 minutes after downloading the drop from SuSE. I think I still have the commemorative t-shirt somewhere, one thing Caldera was big on was the t-shirts.<br /><br />So what was special in these cases? In a very short summary, I'd say that it was the compact code done with the familiar technologies, incrementally, with low interdependencies, and with high level of script automation. Oh, and you need to have the good engineers as well. But just the good engineers aren't enough: I wouldn't say that the engineers at Coral8 were any worse than at Aleri, just they've got mired in the slow methodologies.<br /><br />Let's look at the details.<br /><br />The compact code pretty much means the straightforward and to-the-point code. There is a classic book "The practice of programming" (and I liked it so much that I've patterned the name of my book after it) that shows the code like this: when you have a problem, solve it in the most straightforward way. There are lots of published systems of various code decorations, and all of them are wrong. The decorations don't make the code better, they make it worse, more difficult to read and modify. The extra layers of abstractions don't help anything either, they make the code more difficult to both understand and modify (not to mention writing it in the first place). There is an article about "worse is better" written by Richard Gabriel from the MIT AI labs when he discovered that the simpler Unix code worked better than his "better" over-abstracted code. But this phrase is misleading, better is better, it's just that the over-abstractions are worse, not better. This absolutely doesn't mean that the code should be flat abstraction-wise. The code that is too flat is also an unmaintainable spaghetti mess. The abstraction layers are needed. But they have to be introduced for a good reason, not on a whim and not by the bureaucratic rules. The good functions are usually pretty large. The good classes are usually pretty large. The good rule of thumb for making a chunk of code into a new abstraction is to notice when you do the same thing over and over again. Then it's time to put this code into one place and call it from the other places. A good abstraction reduces the size of code, does not increase it.<br /><br />The problem with the overly-abstracted code is really the same as with the under-abstracted flat code: there are only so many things you can keep in your head at a time. In the code that is too flat you need to keep too many interdependencies of that flat level in your head. In the code that is too abstracted you need to keep too many abstractions and their APIs and calling patterns in your head. The optimal spot is in the middle.<br /><br />A lot of things are easier to just do than to use a library. If you do the logging from a shell script, a 10-line shell function will not only work much better than a shell version of log4j but will also be a lot easier to call (not to mention avoiding an extra dependency). Or an iteration over a container with a plain for-loop is not only easier to write and understand but also much easier to remember than the STL templates for that.<br /><br />An important thing is to keep the documentation close to the code. This means both the good comments in the code and the user documentation.<br /><br />The basic rule behind the good comments is very simple: the comments must be one (or sometimes more) step more abstract than the code. A comment should be explaining either why something is done (or not done), especially if the reasoning is not obvious, or telling what is done as a short summary in the higher-level terms. As long as this rule is followed, the more comments is pretty much the better (but there is only so much you can write without breaking this rule).<br /><br />Keeping the user documentation together with the code, in the same versioning system and in the same tree hierarchy is very important. This way the documentation can be updated in parallel with the code and will never become obsolete. And at the end of the release cycle all the changes can be found and re-edited. This principle implies that the documentation must be kept in a text-based format that is easy to diff, just as the code is, and that the tools used on it must not change that exact text format at random. Which generally means that the editing should be done with a normal text editor, and the WYSIWYG tools should go into the garbage bucket. The pain inflicted by them is much greater than the gain. The ease of versioning and diffing is much more important than the ease of visual formatting. While on the subject of documentation, the technical writers need to either really be technical or be kept in check. The precise meaning of documentation is important, and I've seen too often an edit to make a sentence smoother also making it ambiguous or entirely meaningless. And "smoother" is not always "better" either. I've bought a book of Richard Feynman's letters, more for the collection purposes since the letters are usually fairly boring but unexpectedly it also turned out to be a great reading. Except for the part where it included the magazine articles about his Nobel prize. The style of these articles was completely different from Feynman's, it was the usual unreadable journalist drivel. Don't turn your documentation into the journalist drivel.<br /><br />To give another example of where I think smooth is bad, the sentence above "An important thing is to keep the documentation close to the code" is better in that place than "Keeping the documentation close to the code is important". The complex sentence slows you down and gives you a breather before the text switches to a new subject, emphasizing that switch. Also, it brings the word "important" closer to the start of the sentence where it's more noticeable.<br /><br />Next principle, the familiar technologies, affects the speed of development a lot. It's kind of obvious but the counter-point is "if we use that latest and greatest new tool, we'll be done much faster". And sometimes the latest and greatest tool really does that. But most of the time the familiarity is much more important. If you have an idea of how the goal can be straightforwardly achieved with the familiar tools, that's usually the best bet by a wide margin. If not, learn the new tool. Sometimes this would also give you an idea of how the goal can be achieved easier with your old tools. And a lot of technologies are just crap anyway. So the overall priority should be not "the goal for the tool" but "the tool for the goal".<br /><br />This doesn't mean that the new tools shouldn't be learned or that everyone has to be familiar with every tool to be used. Usually one person having the familiarity with a tool is enough to frame the code around it, giving everyone else a big head start in learning it. It's much easier to add code to an existing unfamiliar framework than to build a new framework with an unfamiliar tool. And there are the times to use the familiar tools and times to learn the new ones. There is the cadence of development when the things start amorphous and slow, and then as the solutions for various items start falling together the project starts crystallizing and picking up the pace until it rolls at full steam to the goal. Then the next cycle starts. The slow part of the cycle is a good time to learn the new tools. But things tend to go a lot faster when only one thing is new: a new tool for an old purpose or an old tool for a new purpose. Of course sometimes a bullet has to be bitten and a new tool learned together with the new purpose but that goes slower.<br /><br /><br />On to the incrementality. It's nothing new, Eric Raymond wrote about it in "The Cathedral and the Bazaar". But it still gets lost surprisingly often, even in the "agile methodologies". Instead of "we'll make this great new thing to replace that dingy old thing" the right approach is "we'll transform this dingy old thing into this great new thing". The programs are not physical objects, they're more like blueprints for the physical objects. It's not like you have to raze a bridge to build a new bridge in its place. Instead a program's <i>run</i> is more like a physical object. Every time you restart a program an old bridge gets razed and a new one gets built for you according to the new plan. Or I guess another good example from the world of the physical things is the automobile manufacturing: the car models might stay but their internal mechanics is updated pretty regularly. At some point the cars would look the same but the new cars from that point on would get say a better alternator. The Chevy small block V8 engine has been around for more than 50 years. The only item that stayed for all this time is the dimensions of the cylinders. Everything else has been replaced. But it hasn't been replaced in one go, it has been a product of a long evolution with many gradual small changes.<br /><br />Every time you hear about an order of magnitude budget and time overruns, it's when someone tried to re-do a program from scratch. Fred Brooks talks about this in his "Mythical man-month", so the effect has been known since at least 1970s, and yet it keeps being repeated. And guess what, that's not limited to software either. Have you ever heard of Tucker? It was a new automobile company in 1940s that decided to design its car completely from scratch, with all the newest and greatest technology. They've never got the cars working, they kept falling apart until the company ran out of money, and its founder got sued by the government for all the dubious financial activities he'd done to delay the running out of money. Here someone might ask "but what about the iPhone, what about Tesla?" They are actually the examples of the evolutionary development. iPhone was built around the clever idea of a big-screen phone used as an application platform but it wasn't developed all at once, and Apple didn't try to reinvent the cell communications chipsets at the same time. Some of the evolutionary moments of iPhone are widely known: such as, they started with a stylus as was then typical, and then Steve Jobs said "screw this, use the fingers instead". And the last-moment decision of replacing the plastic screens with glass had been a subject of many moralizing articles. Tesla is best seen in the contrast to GM's EV1. GM tried to develop the whole technology chain from scratch, spent a huge amount of time and money, and failed. While Tesla piggybacked on the already existing technologies and packaged them together into a car (that they also took off the Lotus's shelf). Even then Tesla's path wasn't easy, they've had their tuckeresque moments with the failing transmissions (which they eventually simply got rid of) and with the battery cooling for a few years before they were able to release their first car.<br /><br />One of the worst statements I've heard about why a system needed to be rewritten from scratch went like this: "it has accumulated all these little features that make making changes without breaking them difficult, so we have to re-write everything to get rid of them and make the changes simple again". It completely misses the point: it's these little features that make the product worthwhile. Doing the 80% of the functionality is easy but nobody needs only 80% of the functionality (and in reality that particular project overran by years). The code is a living document that preserves the knowledge of the subject area. Discarding the code means discarding this knowledge which then has to be re-learned. This doesn't mean that the best solution isn't sometimes to throw away the old code and re-write it from scratch. But it's usually done by replacing the code in small chunks and checking that the functionality is preserved after the move of every chunk. To give an analogy from biology, the human (and animal) organs develop by growing the new cells within a scaffolding of the other neighboring cells that shapes them and makes them properly differentiate for the function. The old code provides such a scaffolding for the new chunks.<br /><br />I want to illustrate this last point with a couple of examples. I'll leave the company and all the details in the first example unnamed, since the example is kind of embarassing. This company decided to make an appliance with its software. It had all the software bits and pieces that it has been selling to the customers, so the only thing that needed to be done was put it all together, like the customers do. Simple, right? Turns out, no, the bits and pieces would not work together, and no customer had actually put them all together because they didn't work. Each piece worked in a way, so each responsible team was happy, but not together. And getting them to work together took a large amount of time and work. Another similar example comes from Microsoft where I've seen a very similar issue with the <a href="https://blogs.msdn.microsoft.com/sergey_babkins_blog/2016/02/26/powershell-windows-clusters/">PowerShell support in the Windows clusters</a>: there are many components involved and each one has its own PowerShell commandlets for management but they don't mesh well together with each other, need to be combined in the very non-obvious ways with no big goal-oriented commandlets, and some things turn out to be not really doable at all.<br /><br />The obvious objection is "but how do we test all these changes"? This is where the automated testing comes in, and it's such a big subject that I wrote a <a href="http://babkin-cep.blogspot.com/2016/12/on-testing.html">whole separate post about that</a>. But it's really a bogus question. When you write a completely new system,you still test each chunk of code as you write it, and if you don't have the automated testing then you just test it manually. The difference is that in an incomplete system your chunk will have only the limited interactions, so you would be able to test only some of its effects. In a complete system you get a much more complete coverage. And if the old system doesn't have any automated tests then it's a good way to start them: for each chunk write the tests before replacing it, and then use them to check that the the replaced chunk still works in the same way. Another important point, a chunk doesn't have to be all in one place, it might be distributed, such as an one-line change in a thousand files.<br /><br />The small changes don't have to go together with the frequent releases. But every time you do a release one way or another, you get the issue "we have to complete all the changes before the release!" and it gets worse if you want to do the more frequent releases. Well, you definitely do want to make sure that the things don't get accidentally broken at the last moment, so the commits have to slow down just before the release. But it doesn't mean that you can't include the changes at all. Any change is fine to include as long as it doesn't break the existing functionality. The new functionality doesn't have to work, as long as it doesn't get called in the production environment nobody cares about that, it only has to not break the old functionality. This way you can integrate the changes a lot earlier, and then spend a lot less effort merging them together.<br /><br />In the <a href="http://babkin-cep.blogspot.com/2016/12/on-testing.html">post on testing</a> I've already said that an important part of the incrementality is the incremental tests, being ability to run easily the immediate subset of tests affected by a change. But a prerequisite for it is the incrementality of the build: the ability to build quickly all the components dependent on the change, all the way to the final package, while skipping the build of the unchanged parts of the tree. "Quickly" here means right away, not in some kind of a nightly build. This does wonders for the turnaround time. Obviously, in some cases, like a fundamental library, pretty much all of the tree will depend on it. But then the incrementality should be interpreted as the ability to build a few of the representative projects based on it to test that they didn't break. If the full nightly test shows that some dependent components got broken, the ability to rebuild them quickly also makes wonders for the turnaround of the investigation.<br /><br />Another thing about the build is that I think the purely rules-based builds are bad. It's not that having the common rules is bad, they're actually quite convenient. But the build system must also allow to do the quick one-offs. These one-offs are very important for the automation of the build. They allow to write quickly say a small Perl script that would do some kind of transformation or code generation. A build system without such one-offs turns the hour-long tasks into the week-long ones. So just please do support the full proper make and the scripting environment, not some crap like Cmake. The counter-argument is of course "but we need to build on multiple platforms including Windows". Yeah, building on Windows sucks, and the only decent solution I see it to bring a decent environment of the portable tools to each of your build platforms.<br /><br />Another very simple but important thing about the build is that it must detect the errors. The Java-based tools in particular are notorious for ignoring the status codes, so when something breaks, the only way you know it is by the result files being absent. Or not even by them being absent but them failing in the tests. This is bad. All the errors have to be detected as early as possible.<br /><br />This looks like a good place for another related rant, on reliability. Basically, when something works it should work reliably. Einstein said "The definition of insanity is doing the same thing over and over again and expecting different results." But in reality we don't control everything, and when these hidden changes cause a different result of the same actions, it's just maddening and time-consuming. The good code, including the tests and build, should work reliably. If the underlying systems are known to have outages, it should retry and still work reliably on top of them and not propagate this unreliability up. Every step the unreliability propagates makes it more difficult to fix, so it's best nipped in the bud. There of course are the reasonable limits too, the calls should not get stuck retrying forever, and probably not ever for an hour. Although retrying for an hour might be fine for a scheduled nightly run, but for a manual run it's way too long. And when a program is retrying it should not be silent, it should be telling what its is doing.<br /><br />Next item is the low interdependencies. Basically, it means that the people should be able to work without stepping too much on each other. You might ask: but doesn't the principle of the small code base go against it? More code should allow more people to work on its part, right? No, it doesn't work like this. The catch is that if you have more code solving the same problem, any changes to it mean that you have to change about the same <i>percentage</i> of the code. This is why a sprawling code base is slow in development, because any change requires changing a lot of code. But this is also why it doesn't allow more people to work on it in parallel. So the corollary is that if you add too many people to a project, they will start stepping on each other a lot and the productivity will drop.<br /><br />The other major kind of interaction is between the people in general. If you need to do a 5-minute task, it takes you 5 minutes. But if you need someone else to do a 5-minute task, it's likely to take you a day of interaction. Which still might be faster than learning how to do it yourself but if the task is repetitive then this becomes a major drag. To work fast, this drag has to be reduced and eliminated whenever possible. One of the consequences is that the posessive code ownership is bad. Now, having someone knowledgeable available to talk about a part of code is good, and it also helps to discuss the general directions of the code to avoid breaking things for other people. But there should not be a single-person or single-team bottleneck for changes to a certain part of code, nor any single person controlling what is allowed to go in. There are other reasons to avoid these things too but even the speed of development should be a sufficient reason. Another corollary is that the strict code style requirements are bad. The small style variations just don't matter, and the enforcement is a lot of pain with no gain whatsoever.<br /><br />Another case where the excessive interaction shows is planning. Have you seen these beautiful Gantt charts and critical path diagrams, with the carefully planned moments of interdependencies? Well, they're all garbage. They might be an interesting exercise on the way to the better understanding of the project but things never work out like this in reality. In reality all the interdependency points end up as a mess. So there is no point in overplanning. Instead I really like the saying of my past manager: "We can plan an aggressive schedule because we don't mind it slipping once in a while". When combined with the idea of the incrementality, this points to the following concept: the detail level of the plans must be tied to their distance. A detailed planning of the far-off items doesn't make sense because they are sure to change before we get there. These items should be on the list, so that we can keep track of the direction, but only in the general terms. Only the very near-term items need to be planned in detail, and the farther in the future the less detail is needed. Though the very near-term items have their Heisenbergian uncertainty as well. In my experience the minimal unit of time for planning is a week. If you have five items that you expect to take one day each, they will never work out like this. Some of them will take an hour and some will take three days. But if you lump all five together, they will take together a week with a decent amount of certainty.<br /><br />Since I've already talked about the scripted automation throughout the text, this is a wrap for this post. The things I've described look kind of small but they really are important, and I believe that taken together the do amount to the 10x difference in productivity. What do you think?<br /><br />P.S. Someone else's thoughts on the subject of 10x: <a href="http://antirez.com/news/112">http://antirez.com/news/112</a><br /><br /><br />Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com3tag:blogger.com,1999:blog-3195352510924611504.post-55452373423872998842017-01-06T15:52:00.002-05:002017-01-13T17:08:14.950-05:00Bayesian networks & neural networksI've been recently reading about the neural networks. By the way, here are some introductory links:<br /><br />I can most highly recommend this document that explains in a very clear and simple way how the training of the neural networks through the backpropagation works:<br /><a href="http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf">http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf</a><br /><br />A simple introductory series of 6 (and maybe growing to more) articles starting with this one:<br /><a href="https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2">https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2</a><br /><br />Some other links:<br /><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">http://karpathy.github.io/2015/05/21/rnn-effectiveness/</a><br /><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">http://colah.github.io/posts/2015-08-Understanding-LSTMs/</a><br /><a href="https://colah.github.io/posts/2015-01-Visualizing-Representations/">https://colah.github.io/posts/2015-01-Visualizing-Representations/</a> <br /><a href="http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/">http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/</a><br /><br />Also, the apparently the father of the deep neural networks is G.E. Hinton, and you may also want to search for the articles by Harry Shum. Hinton's home page is:<br /><a href="https://www.cs.toronto.edu/~hinton/">https://www.cs.toronto.edu/~hinton/</a><br />that seems to have a bunch of the links to his courses but I haven't looked at them yet. <br /><br /><br />As you can see from the introductory reading, each neuron in a neural network is a pretty simple machine: it takes some input values, multiples them by some coefficients, adds the results up, and then passes the result through a nonlinear (usually some kind of a sigmoid) function. The whole thing can be written as an expression:<br /><br />result = nonlinear( sum [for inputs i] (input<sub>i</sub> * C<sub>i</sub>) )<br /><br />The nonlinear part is pretty much what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value.<br /><br />The sum part is pretty much the same as the sum done in AdaBoost. AdaBoost does the sum of logarithms. And I've shown in <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-3-and-bayesian-logic.html">the previous posts</a> that this sum can be converted to a logarithm of a product, and then the product can be seen as a Bayesian computation expressed as chances, and the logarithm being a part of the decision-making process that converts the resulting chance value to a positive-or-negative value. So we can apply the same approach to the sum in the neuron, say that the values it sums up are logarithms, convert it to the logarithm of a product, make the logarithm a part of the final nonlinear function, and then the remaining product can be seen as a Bayesian computation on chances.<br /><br />This pretty much means that a neuron can be seen as a Bayesian machine.<br /><br />And guess what, apparently there is also such a thing as a Bayesian network. There people take multiple Bayesian machines, and connect the results of one set of machines as the input events to the Bayesian machines of the next level. For all I can tell, the major benefit of the handling of the problem when some hypotheses are indicated by a XOR of the input events, similarly to the splitting of the hypotheses into two and then merging them afterwards <a href="http://babkin-cep.blogspot.com/2015/10/bayes-11-what-it-all-really-means.html">like I've shown before</a> but instead of the external arithmetics the logic being folded into the Bayesian computation of the second level.<br /><br />But if a neuron can be seen as a Bayesian machine then the neural networks can also be seen as the Bayesian networks! The only difference being that in the Bayesian networks the first-level (and in general intermediate-level) hypotheses are hand-picked while the neural networks find these intermediate hypotheses on their own during the training.<br /><br />I wonder if this is something widely known, or not known, or known to be wrong?Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-88681204547838303342016-12-26T18:51:00.001-05:002017-01-04T15:02:53.902-05:00On testingI've started writing another post (hope to complete it somewhere soon), and as one of the offshoots it brought my thoughts to the automated testing. The automated testing had been a great new thing, then not so new and kind of a humbug thing, then "instead of testing we'll just run a canary in production", then maybe it's doing a bit of a come-back. I think the fundamental reason for automated testing dropping from the great new thing is (besides the natural cycle of newer greater things) is that there are multiple ways to do it and a lot of these ways being wrong. Let me elaborate.<br /><br />I think that the most important and most often missed point about the automated testing is this: <b>Tests are a development tool</b>. The right tests do increase the quality of the product but first and foremost they increase the productivity of the developers while doing so.<br /><br />A popular analogy of the programs is that a program is like a house, and writing a program is like building a house. I'd say not, this analogy is wrong and misleading. A program is not like a house, it's like <b>a drawing or a blueprint</b> of a house. A house is an analogy of what a program produces.<br /><br />The engineering drawings are not done much with the pencil and paper nowadays, they're usually done in a CAD system. A CAD system not only records the drawing but also allows to model the items on the drawing and test that they will perform adequately in reality before they're brought to reality. The automated tests are the CAD for programs. <br /><br />The CADs for drawings require entering the extra information to be able to do their modeling. So do the automated tests. It's an overhead but if done right this overhead is an acceptable one, bringing more benefits than overhead.<br /><br />The first obvious benefit of the automated tests is that they make the programs robust. To bring the house analogy again, like a house of bricks, not a house of cards. You might remember the saying about "If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization". Not with the automated tests, not any more. With the automated tests you don't get into the situation "I've just changed one little thing and it all suddenly fell apart" any more.<br /><br />But that's again just the reliability. Where is the productivity coming from? The major benefit for the productivity is that you don't need to do so much analysis any more. You can always change things quickly and see what happens in a test environment. This is also very useful for tracing the dependencies: if you change something and the product breaks, you see where it breaks. This is a major, major item that makes the programming much more tractable and turns the devilishly complicated modifications into the reasonably straightforward ones.<br /><br />The programming tasks tend to come in two varieties: The first variety is the straightforward one: you just sit and write what the program needs to be doing, or even better copy a chunk of the existing code and modify it to do the new thing you need. The second variety happens when the thing you need to do hits the limitations of the infrastructure you have. It usually has some kind of a paradox in it, when you need to do <i>this</i> and <i>that</i> but if you do <i>this</i> then <i>that</i> falls apart and vice versa. <br /><br />When you have this second kind of a task on your hands, you need to stretch your infrastructure to support both <i>this</i> and <i>that</i> at the same time. And this is the situation when things tend to get broken. I really like to solve such problems in two steps:<br /><br />1. Stretch the infrastructure to support the new functionality but don't add the new functionality.<br />2. Add the new functionality.<br /><br />At the end of step 1 the program works differently inside but in exactly the same way on the outside. Having the automated tests, they all still pass in exactly the same way as before. And only after the second step there will be the new functionality that will require the new tests and/or modification of the old tests.<br /><br />But as I've said before, not all the tests are useful. I have formulated the following principles of the useful tests:<br /><br /><ul><li>Easily discoverable.</li><li>Fast running. </li><li>Fast starting.</li><li>Easy to run both by the developer and in an automated scheduler.</li><li>As much end-to-end as possible.</li><li>Test support built into the product and visible as an external API.</li></ul><br />What do these things mean? Let's look in detail.<br /><br />Easily discoverable means that the tests must live somewhere next to the code. If I change a line of code, I must be able to find the most relevant tests easily and run them. It's an easy mantra: changed something, run the tests on it. And doing so must be easy, even if the code is not familiar to the developer.<br /><br />Fast running means that the tests should try not to take too much time. This comes partially to trying to make the tests fast and partially to the ability to run only the relevant subsets of the tests. The way of work here is: changed a few lines, run the basic tests, change a few more lines, run the tests again. The basic subset for some lines should be easy to specify and take seconds, or at worst a few minutes to run. This is sometimes quite complicated when testing the handling of the time-based events. For time-based events it's very good to have a concept of the application time that can be consistently accelerated at will. Another typical pitfall is the situation "I've requested to do this, how do I find out if it's done yet, before running the next step of the test?". This is something that is easy to resolve by adding the proper asynchronous notifications to the code, and making everyone's life easier along the way. The fast running concept is also useful for debugging of the failed tests: how long does it take to re-run the test and reproduce the problem or verify the fix?<br /><br />Fast starting is related to the fast running but is a bit different. It has to do with the initialization. The initialization should be fast. If you're running a large test suite of ten thousand tests, you can afford to have a lengthy initialization up front. If you run one small test, you can't have this lengthy initialization every time. If you really need this initialization, there must be a way to do it once and then repeatedly re-run the individual test.<br /><br />The other aspect of the fast starting is that the programmer must be able to run the tests directly from the development environment. The repeating cycle is write-compile-test. The tests must directly and automatically consume the newly compiled code. There must not be any need to commit the code nor to run it through some kind of an official build.<br /><br />This partially overlaps with the next requirement: the tests being easy to run both directly by the programmer and in an automated system that goes together with the automated official build. The need to run the tests directly comes from the need for the efficiency of development and from the use of the tests as a CAD. There shouldn't be any special hoops to jump through for including the tests into the official builds either, it should <i>just work</i> once checked in. This is another place where keeping the tests next to the code comes in: both the tests and the code must be a part of the same versioning system and must come in as a single commit. Well, not necessary literally as a single commit, personally I like to do a lot of small partial commits on my personal branch as I write the code and tests, but when it comes up into the official branch, the code of the program and of the tests must come together.<br /><br />The end-to-end part is very important from both the standpoint of the cost of the tests and of their usefulness. It needs a good amount of elaboration. There had been much widespread love professed for the unit tests, and then also as much disillusionment. I think the reason for this is that the unit tests as they're often understood are basically crap. People try to test each function, and this tends to both require a lot of code and be fairly useless, as many functions just don't have enough substance to test. Mind you, there are exceptions: if a function does something complicated or is a part of a public API, it really should have some tests. But if a function is straightforward, there is no point in looking at it in isolation. Any issues will come up anyway as a part of a bigger test of the code that uses this function.<br /><br />The worst possible thing is the use of the mocks: those result in just the tautologies when the same thing is written twice and compared. These "tests" test only that the function goes through the supposedly-right motions, not that these motions are actually right and produce the proper result. This produces many horrible failures on deployment to production. A real test must check that the result is correct. If some API you use changes under it and creaks the result, a proper test will detect it. If you really have to use mocks, you must use them at least one level down. I.e. if you're testing a function in API1 that calls API2 that calls API3, you might sometimes get something useful by mocking the functions in API3 but never mock the functions in API2.<br /><br />Another benefit of the integrated testing is that it often turns up bugs where you didn't expect to. The real programs are full of complex interactions, and the more you exercise these interactions, the better are your tests.<br /><br />So my concept of an unit test is an "integrated unit test": a test for one feature of an externally visible API. Note that there might be some complex internal components that should be seen as a layer of API and tested accordingly, especially if they are reused in multiple places. But the general rule is the same. This both increases the quality and cuts down on the amount of the tautology code.And it also facilitates the stretch-then-extend model I've described above: if a test tests what the API call does, not how it does that then you can change the underlying implementation without any change to the tests, and verify that all the tests still pass, your external API hasn't changed. This is a very important and fundamental ability of tests-as-CAD, if your tests don't have it then they are outright crap.<br /><br />Some readers might now say: but doesn't that end-to-end approach contradict the requirement of the tests being fast? Doesn't this integration include the lengthy initialization? When done right, no, it doesn't. The answer comes two-fold: First, when the tests are more integrated, there are fewer of them, and due to the integration they cover more of the internal interactions. Second, this forces you to make the initialization not lengthy. It might require a little extra work but it's worth it, and your customers will thank you for that.<br /><br />This brings us to the last point: when you've built the support of the testing into your product, keep it there for the public releases and make it publicly available. This will make the life of your customers a lot easier, allowing them to easily build their own tests on top of your infrastructure. Instead of writing the stupid mocks, it's much better to have a side-door in the underlying APIs that would make them return the values you need to test your corner cases, and do it internally consistent with the rest of their state. I.e. if you mock some kind of an underlying error, how do you know that you mock it right, in the same way as the underlying API would manifest it? You don't. But if an underlying API has a way to ask it to simulate this error, it would simulate properly, as it does in the "real life".<br /><br />Some people might now ask: But what about security? But what about performance?<br /><br />As far as the security goes, security-through-obscurity is a bad idea anyway. Obviously, don't give the test access to the random entities, have a separate "access panel" in your API for the tests that would only allow the authorized connections. And if your product is a library then the security-through-obscurity is a REAL bad idea, and there are no random entities to start with. Make things accessible to your user. There is no good reason for a class to have any <i>private</i> members other than for the non-existing components. The most protection a class might ever need is <i>protected</i>. The same, there is no good reason for a class to have any <i>final</i> members. And if you're worried that someone will use the testing part of the API to do something that the main API doesn't allow then the solution is simple: make the main API allow it, otherwise you're crippling it intentionally.<br /><br />As far as the performance goes, the overhead is usually pretty low. Note that you don't need to embed the test support all over the place, only at the substantial API layers. These API layers usually deal with the larger-scale concepts, not the ones you'd find at the bottom of a triple-nested loop. Sometimes there are exceptions to this rule but nothing that can't be resolved in some way, and the result after resolution is always better than before.<br /><br />Hope that I've convinced you that the right testing infrastructure makes a world of difference in the software development, not only improves the quality but also makes it faster and cheaper. With the right tests you'll never get stuck with "it's working somehow, don't touch it" or "we will need to study for a year before making that change". Or to use again the building analogy (yeah, I've just decried it as wrong but I'll use it differently), the tests are not like the scaffolding on a building that you discard after the construction is completed, they are like the maintenance passages and hatches that keep the building in an inhabitable condition throughout its life.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com2tag:blogger.com,1999:blog-3195352510924611504.post-52537327909016171232016-08-27T01:26:00.002-04:002016-08-27T01:26:49.532-04:00AdaBoost 9: Boost by majority afterthoughtAfter some time I've realized that all this <a href="http://babkin-cep.blogspot.com/2016/08/adaboost-8-boost-by-majority.html">monkeying</a> with the conditional probabilities in the Bayesian table is not necessary. You can just throw away a whole training case or a part of it and continue like nothing happened, the probabilities should stay consistent anyway. After all, the point of adjusting the weights after each round opposite to how the run-time weights would be changed is to give each training case an equal chance. But if we don't want to give some training case an equal chance then there is no point in treating it equally, an ignored training case can be simply ignored.<br /><br />Another thought is that it looks like the two-pass approach can be used to find what training cases to throw away in a dynamic way. We can do it by splitting the set of available cases randomly in half. Then use one half for the first pass of training of N rounds and remember the weights throughout it. Then test the effectiveness of this training on the second half of the cases. But not just run the N rounds in the test. Instead, keep the test results for using only 1 round, 2 rounds, 3 rounds, and so on all the way to the N rounds. Then see the number of rounds on which the test did best, say K rounds. Going back to the training weights, we can find, what training cases were not getting guessed well at K rounds. We can mark them as outliers. Then repeat the same thing swapping the two halves, and find the outliers in the second half. Then throw away the outliers and do the second pass of training on the rest of the cases.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-83909188227911025772016-08-20T03:01:00.000-04:002016-08-20T03:01:10.252-04:00AdaBoost 8: Boost by majorityWhen I <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-3-and-bayesian-logic.html">wrote before</a><br /><br /><i>The premise of boosting is that we're able to find a number of methods (what they call "hypotheses" in AdaBoost) to predict the correct outcomes of the training cases, each method correctly predicting more than 50% of the training cases. Then if we collect a large-ish number of these methods, we can predict the correct outcomes of all the training cases simply by averaging the predictions of these methods. And the other cases will follow the training cases (unless an overfitting happens). Since more than 50% of the cases are correctly predicted by each method, after the averaging more than 50% of the votes for each training case will be correct, and thus the result will be correct too. Of course this depends on the correct predictions being distributed pretty evenly among the cases. If we have a thousand methods that predict correctly the same cases and incorrectly the other cases, obviously after averaging these other cases will still be predicted incorrectly. So the selection of methods must somehow shuffle the preference for the cases, so that the next picked method will predict well the cases that have been predicted poorly by the previous picked methods. That's it, that's the whole basic idea. It's a bit of an oversimplification but it's easy to understand.</i><br /><br />I really did mean it as an oversimplification, since AdaBoost uses the Bayesian decisions to do much better than the simple majority counting. Little did I know that there actually is the method of Boost By Majority (BBM) that does just the counting. It has some other differences but more about that later.<br /><br />The simple averaging can be simulated through the Bayesian means too. Just use the same confidence for each event. Incidentally, that's what the NonAdaBoost algorithm, also known as epsilon-Boost does: it looks for the weak hypotheses that have at least a fixed "edge" gamma (i.e. the probability of the right guess being at least 0.5+gamma) and then always sets the confidence C=0.5+gamma, and uses the same value to adjust the weights of the training cases.<br /><br />The NonAdaBoost is essentially a version of AdaBoost with a fixed confidence, and suffering from this defect. But Boost By Majority has another big difference: the way it adjusts the weight of the training cases. The formula it uses is pretty complicated, so I won't even try to reproduce it here. But here is the gist: it keeps track of how many rounds are left to the end of the boosting and what is the balance of the votes collected by each training case. If the modulo of the balance is higher than the number of rounds left, it means that the fate of this case can't be changed any more: it's either guaranteed to be guessed right if the balance is positive or guaranteed to be guessed wrong if the balance is negative, so the algorithm gives up on these cases. It gives the most weight to the most undecided cases, and spreads the rest of the weights in the bell shape of the binomial distribution. The result is that unlike AdaBoost and NonAdaBoost, BBM doesn't concentrate on the hardest cases that are likely to be the noise in the data anyway, and thus reduces the overfitting.<br /><br />The last chapter of the book is about a combination of AdaBoost and BBM called BrownBoost (from the Brownian motion), or also "boosting in continuous time". It starts with the idea that if the returned partial hypothesis has a higher edge than minimally needed, it might still have enough edge after the re-weighting the training cases, then it can be directly reused on the next round too without searching for a new one, and so on until its edge wears off. This gets developed into an algorithm that uses a real number in the range [0,1) instead of the round count, with the actual rounds moving the current point on it by the varying amounts. The speed of the movement is determined by the pre-set desired training error. This training error gets reached when the end of the range is reached. If the target error is set to 0, the algorithm behaves in the same way as AdaBoost.<br /><br />The downside is that the algorithm is complex, there isn't even a formula for determining the confidence values for each partial hypothesis. Instead you get a system of two equations that connect this confidence value and advancement through the range to be solved numerically. In the great scheme of things it's not a big deal, after all, compared to the finding of the partial hypotheses this shouldn't be such a big overhead. But it's not easy to think of. And the proof of the validity of this algorithm is quite complicated.<br /><br />I can't help thinking of a couple of simpler ideas. <br /><br />The first idea, or should I say guess, is that when we do AdaBoost, it fits into a Bayesian model. So if we keep this Bayesian model consistent, the boosting should still be valid. Obviously, I have no proper proof of that but it looks like a reasonable assumption. There is an easy way to take some training case (or a part of its weight) out of rotation and still keep the Bayesian model consistent. <br /><br />Remember, <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-3-and-bayesian-logic.html">previously I've described</a> that we start with a Bayesian table that contains each individual training case<br /><br /><pre>CaseId Weight Outcome Ev1 ... EvT<br />1 1 * true 1 ... 1<br />...<br />M 1 * false 0 ... 0</pre><br />Which then gets conflated into a two-line table, all the cases with the true outcome combined into one line, and the false ones into another line. The conditional probabilities in the table get averaged during conflation but since it's an averaging of all ones (or all zeroes), the result is still one (or zero).<br /><br /><pre>Weight Outcome Ev1 ... EvT<br />1 * true 1 ... 1<br />1 * false 0 ... 0</pre><br />To take a training case out of rotation, we just change its conditional probability in all the following events (that match the AdaBoost's partial hypotheses) to 0.5. This says that no future events will affect it. And accordingly we set the AdaBoost weight of it to 0. Such decisions can be made according to any algorithm, matching any desired curve. <br /><br />For example, suppose that we decide to take a case out of rotation when its weight relative to the sum of all weights reaches 0.1 (this is probably not a very good rule, since it allows at most 10 cases to be excluded, but simple for the sake of demonstration). Suppose that its a case with the true ("1") outcome. And suppose that all the weights of all the true cases are totaling to the same value as of all the false cases, each of them having the relative weight of 0.5 (not very likely in reality but just as good a number as any other).<br /><br />After the disablement, the conditional probability of the true cases will become ((0.5*0.1) + (1 * 0.4))/0.5 = 0.9.<br /><br /><pre>Weight Outcome Ev1 ... EvN-1 EvN ...<br />1 * true 1 ... 1 0.9 ...<br />1 * false 0 ... 0 0 ...</pre><br />Once a training case gets disabled, it stays disabled for all the future rounds, and the AdaBoost keeps acting only on the weights of those training cases that still have 1 or 0 for the conditional probability. Obviously, the more you disable, the less will be the effect of the following rounds when the Bayesian computation runs.<br /><br />Interestingly though the edges of the partial hypotheses will be higher. Remember, the training cases that get thrown away get it for being difficult to predict. So suppose the partial hypothesis EvN would have returned the confidence of 0.8 if that case wasn't thrown away and had guessed that case wrong. When we throw away the stubborn case, that would become 0.8 out of former 0.9, so the confidence becomes 0.8/0.9 = 0.89, an improvement! <br /><br />However all this throwing-away has no effect on the previous rounds, these are already set in stone. Which begs an easy solution which is my second idea: why not do two passes of AdaBoost? After the first pass look at the final weights of the training cases to determine the most stubborn ones. Throw them away and do the second pass from scratch. After all, BrownBoost requires an adjustment of the target training error which gets done by running it multiple times with the different values and then selecting the best one. Doing two passes of AdaBoost isn't any worse than that.<br />Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-84267594398101350542016-08-06T16:01:00.005-04:002016-08-06T16:01:56.092-04:00Bayes 23: enumerated events revisitedAnother idea that I've glimpsed from the book on boosting is the handling of the enumerated events, previously described in the <a href="http://babkin-cep.blogspot.com/2015/12/bayes-19-model-information-and.html">part 19</a>. The <a href="http://babkin-cep.blogspot.com/2016/07/adaboost-6.html">part 6 of my notes on boosting</a> describes how the decision stumps can be treated as only "half-stumps": actionable if the answer is "yes" and non-actionable if the answer is "no" (or vice versa). This is actually the same thing as <a href="http://babkin-cep.blogspot.com/2015/10/bayes-2-hypotheses-and-events.html">I complained of before</a> as the mistreatment of the Bayesian formula where the negative answer is treated the same as the lack of answer. But taken in the right context, it makes sense.<br /><br />If we take a complementary pair of such half-symptoms (asking the same question, and one of them equaling the negative answers with "don't know", another one equaling the positive answer with "don't know"), their combined effect on the probability of the hypotheses will be exactly the same as of one full symptom. In the weight-based model, the weights of the hypotheses after the complementary pair will be only half of those after one full symptom but they will all be scaled proportionally, making no difference. Alternatively, if we equal the negative answers not with "don't know" but with "irrelevant", even the weights will stay the same.<br /><br />The interesting thing is that these half-symptoms can be straightforwardly extended to the multiple-choice questions. Each choice can be equaled with one half-symptom. So if the answer to this choice is "yes" then it takes effect, if "no" then it gets skipped. In the end exactly one choice takes effect. Or potentially the answer can also be "none of the above" and then this symptom will be simply skipped. It should also be relatively straightforward to accommodate the answers like "it's probably one of these two", taking both answers at half-weight. I didn't work through the exact formulas yet but I think it shouldn't be difficult.<br /><br />The approach of taking the answer at a partial weight also provides a possible answer to "should we treat this problem as model-specific or generic?": it allows to mix both together, taking say the model-specific approach at the weight 0.99 and the generic at 0.01. Then if the model-specific approach finds a matching hypothesis, great, if not then the answer found with the generic approach will outweigh it. This weight of the generic approach should be higher than the confidence cap of the "can't happen" answer: the generic weight of 0.01 would probably work decently well together with the capping probability of 0.001.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-53371409125000807632016-08-06T14:53:00.005-04:002016-08-06T14:53:56.624-04:00Bayes 22: overfitting revisitedPreviously I've been saying that I didn't experience overfitting in the Bayesian models, and pretty much discounted it. Now I've read a model of overfitting in the book on AdaBoost, and I understand why. Here is the gist, with some of my thoughts included.<br /><br />The overfitting happens when the model starts picking the peculiarities of the particular training set rather than the general properties. It's down to the noise in the data: if the data contains random noise, only the cases without the noise can be predicted well on the general principles, and the noisy ones are bound to be mispredicted. The training data also contains noise. Since the noise is random, the noise in the test data (an in the future real-world cases) won't follow the noise in the training data closely. If the model starts following the noise in the training data too closely, it will mispredict the well-behaved cases in the test data, in addition to the noisy test cases. For all I can tell, this means that the overfitting magnifies the noise in the quadratic proportion, with probabilities:<br /><br />P(good prediction) = P(no noise in the training data) * P(no noise in the test data)<br /><br />If the model makes the decisions based on the "yes-no" questions (AKA binary symptoms), picking up the general trends takes a relatively small number of yes-no questions, because their effects are systematic. The effects of the noise are random, so each noisy training case is likely to require at least one extra yes-no question to tell it apart. If there is a substantial number of noisy cases in the training data, a lot of extra questions would be needed to tell them all apart. So the rule of thumb is, if there are few questions in the model compared to the number of the training cases, not much overfitting will happen.<br /><br />In the models I was working with, there were tens of thousands of the training cases and only hundreds of symptoms. So there wasn't such a big chance of overfitting in general. Even if you say "but we should count the symptoms per outcome", there still were only low hundreds of outcomes, and if we multiply 100 symptoms by 100 outcomes, it's still only 10K decision points in the table, the same order of magnitude as the number of the training cases. <br /><br />There also was very little noise as such in the data I've dealt with. If you do diagnostics, you get the natural testing: if the fix doesn't work, the client will come back. There is of course the problem of whether you've changed too many parts. It can be controlled to a degree by looking for training only at the cases where the fix was done at the first attempt. Though you still can't get the complete confidence for the cases where more than one part was changed. And of course if you don't look at the cases that required multiple attempts, it means that you're not learning to diagnose the more difficult cases.<br /><br />But there was a particular kind of noise even in this kind of fairly clean data: the noise of multiple problems occurring or not occurring together in various combinations. If the model is sensitive to whether it had seen a particular combination or not, the randomness of the combinations means that they represent a random noise. And I've spent quite a bit of effort on reducing this dependence in the logic and on reducing this noise by preprocessing the training data. Which all amounts to reducing the overfitting. So I was wrong, there was an overfitting, just I didn't recognize it.<br /><br />Actually, I think this can be used as a demonstration of the relation between the number of symptoms and amount of overfitting. If we're looking to pick only one correct outcome, the number of questions is just the number of questions, which was in hundreds for me. Much lower than the number of the training cases, and very little overfitting had a chance to happen. Yes, there were hundreds of possible outcomes but only one of them gets picked, and the number of questions that are used is the number of questions that affect it. But if we're looking at picking correctly all the outcomes, the number of questions gets multiplied by the number of outcomes. In the system I worked on, the total was comparable to the number of training cases, and the overfitting became noticeable. It would probably become even worse if the Bayesian table contained the rows not just for the outcomes but for the different ways to achieve these outcomes, like I've described in this series of posts. So with extra complexity you win on precision but the same precision magnifies the effects of overfitting. The sweet spot should be somewhere in the middle and depend a lot on the amount of noise in the data.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-43763720990717536832016-08-06T13:15:00.000-04:002016-08-06T13:15:23.173-04:00AdaBoost 7: multi-class & unseen combinationsThe basic idea behind the multi-class (and also multi-label, i.e. where each case may have more than one outcome) AdaBoost can be described as boosting the recognition of all the outcomes in parallel. It takes the "yes or no" dichotomies for all the outcomes, and on each round it tries to find such a partial hypothesis where the sum of Z for all of them is minimal. This is very similar to what was described in the <a href="http://babkin-cep.blogspot.com/2016/07/adaboost-6.html">part 6</a>, where multiple ranges were added, each with its own confidence. The difference is in the formula for the final computation: in the multi-class version there is a separate formula for each class that uses only the confidence values that all the training rounds computed for this particular class.<br /><br />There is also a possible optimization for the situation where there may be only one outcome per case (i.e. single-label), saying that the mapping between the dichotomies used in the logic above and the outcomes doesn't have to be one-to-one. Instead each outcomes can be mapped to a unique combination (or multiple combinations) of the dichotomies. The dichotomies can be selected on some smart way, say if we're trying to recognize the handwritten digits, they can be "pointy vs roundy", "a loop on the bottom vs no loop at the bottom" etc. Or just by just dividing the outcomes in half in some blind way, like "0,1,2,3,4 vs 5,6,7,8,9", "0,2,4,6,8 vs 1,3,5,7,9" etc. <br /><br />Returning to the multi-label situation, one of the problems I've experienced with it is the ability to recognize the combinations of outcomes that weren't present in the training data. That is, the outcomes were present but none of the training cases had exactly this combination. For the basic diagnostics, this can be discounted by saying "but what's the percentage of such cases" but when you start pushing the quality of diagnosis around 95%, it turns out that great many of the remaining misdiagnosed cases fall into this category.<br /><br />AdaBoost doesn't have any built-in solution for this problem. The solution it produces is only as good as the underlying algorithm. There is nothing in AdaBoost that puts the pressure on the underlying algorithm to recognize the combinations that aren't present in the training data. If the underlying algorithm can do it anyway (and perhaps despite the pressure from AdaBoost), the resulting combined formula will be able to do it too. If it can't then the combined formula won't either. The simple algorithms like the decision stumps can't. <br /><br />But maybe some multi-pass schemes can be devised. Run the boosting once, get a set of the candidate symptoms (i.e. partial hypotheses). Use these symptoms on the training cases to try to differentiate, which symptom is related to which outcome. Then run the boosting the second time from scratch, only this time with the relevance knowledge mixed in: whenever a symptom that is close to an irrelevant one is tested on a training case, make it return "I don't know", i.e. the confidence 0.5. This will shift the choice of symptoms. Obviously, if using the resulting formula, the same pruning of irrelevance has to be applied there in the same way. The symptoms from the second pass can be re-tested for relevance, and if any change is found, the third pass can be made, and so on. <br /><br />Or even better, perhaps this logic can be merged directly into each round of the underlying algorithm in one pass of AdaBoost: when a candidate partial hypothesis (i.e. symptom) is found, measure its relevance right there and change its Z-value accordingly. Pick the candidate that has the lowest Z-value even after it has been corrected for the relevance. Include the relevance information into the partial hypothesis.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-57880451011481928442016-08-02T02:39:00.002-04:002016-08-02T02:39:42.930-04:00Bayes 21: mutual exclusivity and independence revisitedI've been thinking about the use of AdaBoost on the multi-class problems, and I've accidentally realized what is going on when I've used the combination of the mutually exclusive and of independent computation of hypotheses, as described in the <a href="http://babkin-cep.blogspot.com/2015/10/bayes-10-independence-and-relevance.html">part 10</a>.<br /><br />Basically, the question boils down to the following: if the mutually exclusive computation shows that multiple hypotheses have risen to the top in about equal proportions (and consequently every one of them getting the probability below 0.5), how could it happen that in the independent computation each one of them would be above 0.5? If the training data had each case resulting in only one true hypothesis, the probabilities computed both ways would be exactly the same, in the independent version the probability of ~H being exactly equal to the sum of probabilities of the other hypotheses.<br /><br />The answer lies in the training cases where multiple hypotheses were true. If we use the weight-based approach, it becomes easy to see that if a new case matches such a training case, it would bring all the hypotheses in this case to the top simultaneously. So the independent approach simply emphasizes the handling of these training cases. Equivalently, such training cases can be labeled with pseudo-hypotheses, and then the weight of these pseudo-hypotheses be added to the "pure" hypotheses. For example, let's consider re-labeling of the example from the part 10:<br /><br /># tab09_01.txt and tab10_01.txt<br /> evA evB evC<br />1 * hyp1,hyp2 1 1 0<br />1 * hyp2,hyp3 0 1 1<br />1 * hyp1,hyp3 1 0 1<br /><br />Let's relabel it as:<br /><br /> evA evB evC<br />1 * hyp12 1 1 0<br />1 * hyp23 0 1 1<br />1 * hyp13 1 0 1<br /><br />Then the probabilities of the original hypotheses can then be postprocessed as:<br /><br />P(hyp1) = P(hyp12)+P(hyp13)<br />P(hyp2) = P(hyp12)+P(hyp23)<br />P(hyp3) = P(hyp23)+P(hyp13)<br /><br />And this would give the same result as the independent computations for every hypothesis. <br /><br />So this approach works well for when the combined set of symptoms for multiple hypotheses had been seen in training, and not much good for the combinations that haven't been seen in the training. The combined use of the independent and mutually-exclusive combinations with a low acceptance threshold for the independent computations tempers this effect only slightly.<br />Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-63579584459293924342016-07-02T03:14:00.003-04:002016-08-06T12:13:39.586-04:00AdaBoost 6: multiple confidence ranges & other thoughtsOne of the chapters I quickly read through was on using a better estimation of probability of the success of the partial hypotheses (in the AdaBoost sense), with the following example:<br /><br />Suppose, the underlying algorithm found a decision stump, and finds that it can classify the cases having a value above some limit pretty well (say, 90% of them are "+1") while the cases with the values under this limit are undifferentiated (say, 49% of them are "+1"). In this case if we allow the underlying algorithm to return the values not only +1 and -1 but also 0 for when it can't differentiate well, the boosting goes much better. <br /><br />Since the part 3 I've been wondering if the algorithm can be improved in an obvious way by using the separate confidence values for the situations when the underlying algorithm returns +1 or -1 (or in equivalent Bayesian terms, 1 or 0), and maybe even do better than that by forming multiple "bands" of training cases by some rule and assigning a separate confidence values to each band, and maybe even further by computing an individual confidence value for each case. The idea above shows that it can, and the book then goes on to the exact same idea of multiple bands, each with its own confidence value, and on to the partial hypotheses returning the individual confidence values for each training case. So that idea really was obvious, and there is not much point in writing more about it. The only tricky part is the minimalization criteria, but that becomes straightforward from the description in the part 4.<br /><br />One more idea would be to make the decision stumps not discrete ("1" on this side of this value, "0" on the other side) but graduated, with the confidence being 0.5 at the value and growing towards 1 on one side and towards 0 on the other side. Possibly saturating at some distance from the center point value. <br /><br />An easy example of saturation would be this: If we're classifying a set of points in a 2-dimensional space with the decision stumps based on the coordinate values, this means that on each step we find some value of a coordinate (X or Y) and say that the points on one side have mostly the value "1" and on the other side they have mostly the value "0". Then the classic AdaBoost approach (without the square root in the confidence) takes the confidence C as the fraction of the points whose values got guessed right. Or, translating from the pair (value, confidence) to the just the confidence, the pair (1, C) becomes C, and the pair (0, C) becomes (1-C). But the breaking point is usually not one point, it's a range between the coordinate values closest to this break. The classic approach is to pick the centerpoint of this range as the breaking point. But we could say that the confidence changes linearly between these two coordinate values and beyond this range saturates to C and (1-C). This smooth transition could help with the training on the data with errors in it. The book contains an example of training on the data that contained 20% of errors which led to overfitting over many rounds of boosting. The smooth transitions could provide the averaging that would counteract this overfitting. Maybe.<br /><br />Another thought that occurred to me is what if we're boosting while we're boosting? Or in other words, what if the partial algorithm runs a few rounds of boosting on its own? How can it be equivalently represented as a single round of boosting?<br /><br />This looks like a good example where an individual value of the confidence for each training case would be handy. As we go through the execution stage, the weight of each example would be multiplied on each round of "nested boosting" in the same way as if would be on each round of normal boosting. I.e. if a round of boosting had the confidence C, and guessed this particular training case right, the weight of this training case will be multiplied by C, and if it guessed this training case wrong, the weight would be multiplied by (1-C).<br /><br />So if we define the effective confidence Ce as:<br /><br />(if the round guessed right, its Ce=C, otherwise its Ce=1-C)<br /><br />the we can see that the multiplier will be:<br /><br />product for rounds t(Ce<sub>t</sub>)<br /><br />Except that it's not properly scaled to be in the range [0..1]. To adjust the scaling, it has to become<br /><br />product for rounds t(Ce<sub>t</sub>) / (product for rounds t(Ce<sub>t</sub>) + product for rounds t(1-Ce<sub>t</sub>))<br /><br />There won't be a single confidence value for all the training cases of such a composite round, each training case will have its own confidence value, with the probability of pointing towards 0 or 1 built into it.<br /><br />One more interesting thing I've read is that after some number of rounds AdaBoost tends to start going in a circle through the same set of distributions (or sometimes not quite the same but very close). To me this looks like an indication that it's time to stop. Boosting the margins by going through this loop repeatedly looks like cheating, because if we look at its Bayesian meaning, this meaning implies that the events that get examined must be essentially independent of each other. But if we repeatedly examine the same events, they're not independent, they're extra copies of the events that have already been seen. Thus re-applying them repeatedly doesn't seem right. On the other hand, the events in such a loop do form this loop because they represent each event in a balanced way. So maybe the right thing to do is to throw away everything before this loop too, and just leave one iteration of this loop as the only events worth examining. This would be interesting to test if/when I get around to do it.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-21414210430745252452016-06-28T02:30:00.000-04:002016-06-28T12:33:11.851-04:00AdaBoost 5: the square root and the confidence valueThe question of what should be used as the confidence value for Bayesian computations is a thorny one. In my <a href="https://www.blogger.com/blogger.g?blogID=3195352510924611504#editor/target=post;postID=1744569943656630096;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=3;src=postname">previous take on it</a> I've used the value without a square root, while the classic AdaBoost formula uses the one with the square root. Which one is more right?<br /><br />Both of them produce the same decision for the classification of the cases. The difference is in the absolute value of the chances, or in the AdaBoost terms, the margin (with the classic AdaBoost also applying the logarithm on top).<br /><br />I still think that using the value without the square root has the more correct "physical meaning". But either can be used with the same result.<br /><br />Following the classic AdaBoost, the version of the confidence with the square root follows from the expression<br /><br />Z = sum for all good cases i (Wi / S) + sum for all bad cases j (Wj * S)<br />Z = sum for all good cases i (Wi * (C-1) / C) + sum for all bad cases j (Wj * C / (C-1))<br /><br />Just like the last post, W is the weight of a particular training case, same as D(i) in the classic AdaBoost terms, just since I've been using weights rather than a distribution, the letter W comes easier to my mind.<br /><br />The "physical meaning" here is that the weights of the training cases for the next round of boosting are adjusted opposite to how the chances of these training cases get affected by this round during the final computation. If a training case gets predicted correctly by a partial hypothesis, its weight will get multiplied by C and the weight of the other cases will be multiplied by (1-C), so the changes will get multiplied by C/(1-C), and for the next round of boosting the good cases get divided by that to compensate. For the cases that get predicted incorrectly, the opposite happens.<br /><br />This translates to<br /><br />C / (1-C) = S = sqrt((1-e) / e) = sqrt(1-e) / sqrt(e)<br /><br />The last expression does NOT mean<br /><br />C = sqrt(1-e)<br /><br />Instead, it means<br /><br />C = sqrt(1-e) / (sqrt(1-e) + sqrt(e))<br /><br />Alternatively, the approach without the square root starts with the premise<br /><br />Z = sum for all good cases i (Wi / C) + sum for all bad cases j (Wj / (1-C))<br /><br />The "physical meaning" is as <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-3-and-bayesian-logic.html">described before</a>, the weights of the training cases for the next round of boosting are adjusted opposite to how the weights of these training cases get affected by this round during the final computation. It seems to me that compensating the weights for the changes in weights is the more correct "physical meaning" than compensating the weights for the changes in chances.<br /><br />This translates to<br /><br />C / (1-C) = S<sup>2</sup> = sqrt( (1-e) / e )<sup>2</sup> = (1-e) / e<br /><br />and <br /><br />C = (1-e)<br /><br />By the way, chasing this version through the derivatives as shown in the previous post was interesting. I've appreciated why the authors of AdaBoost involved the exponents into the formulas: doing the derivatives and integrals with the exponents is much easier than without them. Then I've realized that with the derivatives where S = exp(Alpha) I'm computing dZ/dAlpha, not dZ/dC. And that is the correct approach. So without the exponents I should be computing the dZ/dS, and that gets easy again.<br /><br />So, the <i>real</i> version of AdaBoost described in the <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-3-and-bayesian-logic.html">third part</a> is this:<br /><br /><hr /><br />Given: (x<sub>1</sub>, y<sub>1</sub>), ..., (x<sub>m</sub>, y<sub>m</sub>) where x<sub>i</sub> belongs to X, y<sub>i</sub> belongs to {-1, +1}.<br />Initialize: D<sub>1</sub>(i) = 1 for i = 1, ..., m.<br />For t = 1, ..., T:<br /><ul><li>Train the basic algorithm using the weights D<sub>t</sub>.</li><li>Get weak hypothesis h<sub>t</sub>: X -> {-1, +1}.</li><li>Aim: select h<sub>t</sub> to minimalize the boundary on the training error Z<sub>t</sub> where:<br />Wgood<sub>t</sub> = 0; Wbad<sub>t</sub> = 0;<br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) = y<sub>i</sub> {<br /> Wgood<sub>t</sub> += D<sub>t</sub>(i)<br /> } else {<br /> Wbad<sub>t</sub> += D<sub>t</sub>(i)<br /> }<br />}<br />C<sub>t</sub> = Wgood<sub>t</sub> / (Wgood<sub>t</sub> + Wbad<sub>t</sub>)<br />Z<sub>t</sub> = Wgood<sub>t</sub>/C<sub>t</sub> + Wbad<sub>t</sub>/(1-C<sub>t</sub>)<br />which can also be represented symmetrically through S<sub>t</sub>:<br />Z<sub>t</sub> = Wgood<sub>t</sub>/S<sub>t</sub> + Wbad<sub>t</sub>*S<sub>t</sub><br />S<sub>t</sub> = sqrt(C<sub>t</sub> / (1-C<sub>t</sub>)) = sqrt(Wgood<sub>t</sub> / Wbad<sub>t</sub>)<br />and substituting S<sub>t</sub>:<br />Z<sub>t</sub> = Wgood<sub>t</sub> / sqrt(Wgood<sub>t</sub> / Wbad<sub>t</sub>) + Wbad<sub>t</sub> * sqrt(Wgood<sub>t</sub> / Wbad<sub>t</sub>)<br />= 2 * sqrt(Wgood<sub>t</sub> * Wbad<sub>t</sub>)<br />Which gets minimalized when either of Wgood<sub>t</sub> or Wbad<sub>t</sub> gets minimalized, but to be definitive we prefer to minimalize Wbad<sub>t</sub>.<br /></li><li>Update, <br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) / (1-C<sub>t</sub>)<br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) / C<sub>t</sub><br /> }<br />}</li></ul>Produce the function for computing the value of the final hypothesis:<br />H(x) {<br /> chance = 1;<br /> for t=1,...,T {<br /> if (h<sub>t</sub>(x) > 0) {<br /> chance *= C<sub>t</sub>/(1-C<sub>t</sub>);<br /> } else <br /> chance *= (1-C<sub>t</sub>)/C<sub>t</sub>;<br /> }<br /> }<br /> return sign(chance - 1)<br />}<br /><br /><hr /><br />Having this sorted out, I can move on to more creative versions of the algorithm where on a step of boosting the different training cases may get stamped with the different confidence values C.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-27777845184745775712016-06-24T11:04:00.000-04:002016-06-28T12:32:40.204-04:00AdaBoost 4 or the square root returnsI've had to return the library book on boosting. I've ordered my own copy, but on the last day I've paged through the more interesting chapters. This gave me some ideas on how the Bayesian model can be fit to those too, and I started writing it up, but I've got stopped at the question: what should get minimized? So I went back to the book (my own copy now) and I think that I finally understand it. Here is how it works:<br /><br />The real measure that gets minimized on each round of boosting is Z. In the book the authors prove that it's not the training error as such but the upper bound on the training error. They write it (with dropped subscript <i>t</i> of the boosting round) as<br /><br />Z = sum for all good cases i (W<sub>i</sub> * exp(-Alpha)) + sum for all bad cases j (W<sub>j</sub> * exp(Alpha))<br /><br />Where W is the weight of a particular training case (same as D(i) in the classic AdaBoost terms, just since I've been using weights rather than a distribution, the letter W comes easier to my mind). They have a use for this exponent in the proofs but here it doesn't matter. Let's create a notation with a new variable S (the name doesn't mean anything, it's just a letter that hasn't been used yet) that will sweep the exponent under the carpet:<br /><br />S = exp(Alpha)<br />Z = sum for all good cases i (W<sub>i</sub> / S) + sum for all bad cases j (W<sub>j</sub> * S)<br /><br />S has a "physical meaning": it represents how the chances of a particular case change. The weights of the "good" cases (i.e. those that got predicted correctly on a particular round) get divided by S because we're trying to undo the effects of the changes from this round of boosting for the next round. Getting back to the Bayesian computation, <br /><br />S = C / (1-C)<br /><br />Where C is the confidence of the round's prediction acting on this case. When the produced boosting algorithm runs, if the round's prediction votes for this case (which we know got predicted by it correctly in training), its weight will be multiplied by C. If the prediction votes against this case, its weight will be multiplied by (1-C). Thus S measures how strongly this prediction can differentiate this case. For the cases that get mispredicted by this round, the relation is opposite, so their weights get multiplied by S instead of division.<br /><br />The mathematical way to minimize Z is by finding the point(s) where its derivative is 0.<br /><br />dZ/dS = sum for all good cases i (W<sub>i</sub> / -S<sup>2</sup>) + sum for all bad cases j (W<sub>j</sub>) = 0<br />sum for all good cases i (W<sub>i</sub> / S<sup>2</sup>) = sum for all bad cases j (W<sub>j</sub>)<br /><br />And since for now we assume that S is the same for all the cases,<br /><br />sum for all good cases i (W<sub>i</sub>) = Wgood<br />sum for all bad cases j (W<sub>j</sub>) = Wbad<br /><br />then <br /><br />Wgood / S<sup>2</sup> = Wbad<br />Wgood / Wbad = S<sup>2</sup><br />S = sqrt(Wgood / Wbad)<br /><br />And since we know that the training error e is proportional to Wbad and (1-e) is proportional to Wgood:<br /><br />e = Wbad / (Wgood + Wbad)<br />1-e = Wgood / (Wgood + Wbad)<br /><br />then we can rewrite S as<br /><br />S = sqrt( (1-e) / e )<br /><br /><strike>Or substituting the expression of S through C,<br /><br />C / (1-C) = sqrt( (1-e) / e )<br />C = sqrt(1-e)<br /><br />Which means that the Bayesian confidence in AdaBoost really is measured as a square root of the "non-error". This square root can be put away in the basic case but it becomes important for the more complex variations. </strike> [I think now that this part is not right, will update later after more thinking]<br /><br />Returning to the minimization, this found value of S gets substituted into the original formula of Z, and this is what we aim to minimize on each round. For the classic AdaBoost it is:<br /><br />Z = Wgood / sqrt((1-e) / e) + Wbad * sqrt((1-e) / e)<br /><br />Since Wgood is proportional to (1-e) and Wbad is proportional to e (and in case if the sum of all the weights is 1, Wgood=1-e, and Wbad=e), we can substitute them:<br /><br />Z = (1-e) / sqrt((1-e) / e) + e * sqrt((1-e) / e)<br />= sqrt(e * (1-e)) + sqrt(e * (1-e))<br />= 2 * sqrt(e * (1-e))<br /><br />This value gets minimized when e gets closer to either 0 or 1 (because the formula is symmetric and automatically compensates for the bad predictions by "reverting" them). The classic AdaBoost algorithm then says "we'll rely on the ability of the underlying algorithm to do the same kind of reverting" and just picks the minimization of e towards 0. This makes the computation more efficient for a particular way of computation for C but the real formula is above. If we want to handle the more generic ways, we've got to start with this full formula and only then maybe simplify it for a particular case.<br /><br />This formula can be also written in a more generic way, where each training case may have its own different value of confidence C and thus of S:<br /><br />Z = sum for all good cases i(W<sub>i</sub> / S<sub>i</sub>) + sum for all bad cases j(W<sub>j</sub> * S<sub>j</sub>)<br /><br />And now I'll be able to talk about these more generic cases.<br /><br />By the way, the difference of how the variation of AdaBoost for logistic regression works from how the classic AdaBoost works is in a different choice of measure for what gets minimized, a different formula for Z.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-17445699436566300962016-05-29T23:03:00.000-04:002016-05-31T02:22:09.302-04:00AdaBoost 3 and Bayesian logicI think I've figured out a way to express the workings of AdaBoost in terms of the Bayesian processing, and I think it becomes simpler this way. Now, I'm not the first one to do that, the book "Boosting" describes a way to do this with what it calls the logistical regression. It also derives an approach to probabilities in AdaBoost from this way. But the logistical regression uses a sigmoid curve (I think like the one generally used for the normal distribution) and differs from the plain AdaBoost. I think my explanation works for the plain AdaBoost.<br /><br />Before I start on that explanation, I want to talk about why the boosting works. Yeah, I've followed the proofs (more or less) but it took me a while to get a bit of an intuitive feeling in my head, the "physical meaning" of the formulas. I want to share this understanding which is much simpler than these proofs (and I hope that it's correct): <br /><br />The premise of boosting is that we're able to find a number of methods (what they call "hypotheses" in AdaBoost) to predict the correct outcomes of the training cases, each method correctly predicting more than 50% of the training cases. Then if we collect a large-ish number of these methods, we can predict the correct outcomes of all the training cases simply by averaging the predictions of these methods. And the other cases will follow the training cases (unless an overfitting happens). Since more than 50% of the cases are correctly predicted by each method, after the averaging more than 50% of the votes for each training case will be correct, and thus the result will be correct too. Of course this depends on the correct predictions being distributed pretty evenly among the cases. If we have a thousand methods that predict correctly the same cases and incorrectly the other cases, obviously after averaging these other cases will still be predicted incorrectly. So the selection of methods must somehow shuffle the preference for the cases, so that the next picked method will predict well the cases that have been predicted poorly by the previous picked methods. That's it, that's the whole basic idea. It's a bit of an oversimplification but it's easy to understand.<br /><br />Now on to the explanation of the connection between AdaBoost and the Bayesian logic.<br /><br />To start with, I want to show one more rewrite of the AdaBoost algorithm. It builds further on the version I've shown in the <a href="http://babkin-cep.blogspot.com/2016/05/adaboost-in-simpler-formulas-2.html">last installment</a>. Now I'm returning back to the notation of e<sub>t</sub> and (1-e<sub>t</sub>). These values are proportional to Wbad<sub>t</sub> and Wgood<sub>t</sub> but are confined to the range [0, 1] which fits well into the Bayesian computations. More importantly, this new version gets rid of the square root in the computation of D<sub>t+1</sub>(i). It wasn't the first thing I realized for the connection between the AdaBoost and Bayesian logic, it was actually the last thing I realized, but after that all the parts of the puzzle fell into place. So I want to show this key piece of the puzzle first.<br /><br />The point of this computation is to readjust the weights of the training cases for the next round, so that the total weight of all the cases successfully predicted by the current round equals to the total weight of the unsuccessfully predicted rounds. There is more than one way to make the weights satisfy this condition, they're all proportional to each other and all just as good. They way the weight are modified in the classic for of AdaBoost algorithm has been selected to make the modification symmetric, and allow it to write the adjustment for both correct and incorrect cases as a single formula:<br /><br />D<sub>t+1</sub>(i) = D<sub>t</sub>(i)*exp(-Alpha<sub>t</sub>*y<sub>i</sub>*h<sub>t</sub>(x<sub>i</sub>)) / Z<sub>t</sub><br /><br />But if we're writing an honest if/else statement, it doesn't have to be symmetric. We can as well write either of:<br /><br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) / e<sub>t</sub><br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) / (1-e<sub>t</sub>)<br /> }<br /><br />or with a degenerate "else" part:<br /><br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * (1-e<sub>t</sub>) / e<sub>t</sub><br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i)<br /> }<br /><br />Either way the result is the same. And there is no need to involve the square roots, the formula becomes simpler. After making this simplification, here is my latest and simplest version of the algorithm:<br /><br /><hr /><br />Given: (x<sub>1</sub>, y<sub>1</sub>), ..., (x<sub>m</sub>, y<sub>m</sub>) where x<sub>i</sub> belongs to X, y<sub>i</sub> belongs to {-1, +1}.<br />Initialize: D<sub>1</sub>(i) = 1 for i = 1, ..., m.<br />For t = 1, ..., T:<br /><ul><li>Train the basic algorithm using the weights D<sub>t</sub>.</li><li>Get weak hypothesis h<sub>t</sub>: X -> {-1, +1}.</li><li>Aim: select h<sub>t</sub> to minimalize the weighted error e<sub>t</sub>: <br />Wgood<sub>t</sub> = 0; Wbad<sub>t</sub> = 0;<br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) = y<sub>i</sub> {<br /> Wgood<sub>t</sub> += D<sub>t</sub>(i)<br /> } else {<br /> Wbad<sub>t</sub> += D<sub>t</sub>(i)<br /> }<br />}<br />e<sub>t</sub> = Wbad<sub>t</sub> / (Wgood<sub>t</sub> + Wbad<sub>t</sub>)<br /></li><li>Update, <br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * (1-e<sub>t</sub>) / e<sub>t</sub><br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i)<br /> }<br />}</li></ul>Produce the function for computing the value of the final hypothesis:<br />H(x) {<br /> chance = 1;<br /> for t=1,...,T {<br /> if (h<sub>t</sub>(x) > 0) {<br /> chance *= (1-e<sub>t</sub>)/e<sub>t</sub>;<br /> } else <br /> chance *= e<sub>t</sub>/(1-e<sub>t</sub>);<br /> }<br /> }<br /> return sign(chance - 1)<br />}<br /><br /><hr /><br />Now this form can be translated to the Bayesian approach. I'll be using the form of Bayesian computations that works with <a href="http://babkin-cep.blogspot.com/2016/05/summary-of-bayes-by-weight.html">weights rather than probabilities</a>. The translation to his form is fairly straightforward:<br /><br />There are two Bayesian mutually-exclusive hypotheses (but to avoid confusion with the AdaBoost term "hypothesis", let's call them "outcomes" here): one that the result is true (1), another one that the result false (0).<br /><br />Each "hypothesis" h<sub>t</sub>(x) in AdaBoost terms becomes an "event" Ev<sub>t</sub> in the Bayesian form (to avoid the terminological confusion I'll call it an event henceforth). Each event is positive: it being true predicts that the true outcome is correct, and vice versa. The training table looks like this:<br /><br /><pre>Weight Outcome Ev<sub>1</sub> ... Ev<sub>T</sub><br />1 * true 1 ... 1<br />1 * false 0 ... 0<br /></pre><br />Aside from this, we remember e<sub>t</sub> from all the boosting rounds, and of course the computations for all the "events" chosen by the boosting.<br /><br />Then when we run the model, we get the set of arguments x as the input, and start computing the values of the events. When we compute the event Ev<sub>t</sub>, we apply it with the confidence C(Ev<sub>t</sub>) value of (1-e<sub>t</sub>), as described in the <a href="http://babkin-cep.blogspot.com/2015/10/bayes-12-fuzzy-training.html">fuzzy training logic</a>. In result if the Ev<sub>t</sub> is true, the weight of the true outcome gets multiplied by (1-e<sub>t</sub>) and the weight of the false outcome gets multiplied by e<sub>t</sub>. If the event is false, the multiplication goes the opposite way.<br /><br />In the end we compute the probability of the true outcome from the final weights: W(true)/(W(true)+ W(false)). If it's over 0.5, the true outcome wins. This logic is exactly the same as in the function H(x) of the AdaBoost algorithm.<br /><br />Why is the (1-e<sub>t</sub>) used as the confidence value? Since it's the fraction of the training cases that got predicted correctly by Ev<sub>t</sub>, it makes sense that we're only so much confident in this event. Well, for the first event it is, but for the following events (1-e<sub>t</sub>) is computed from a modified distribution of the events, with the adjusted weights. Does it still make sense?<br /><br />As it turns out, it does. To find out why, we need to look at the weights of the training cases after they pass the filter of the first event. Instead of looking at the training table with two composite rows we need to step back and look at the table with the M original uncombined training cases:<br /><br /><pre>CaseId Weight Outcome Ev<sub>1</sub> ... Ev<sub>T</sub><br />1 1 * true 1 ... 1<br />...<br />M 1 * false 0 ... 0<br /></pre><br />The cases that have the outcome of true still have 1 for all the events, and the one with the false outcome have 0 for all the events. In case if we run the algorithm for an input that matches a particular training case, Ev<sub>1</sub> will predict the outcome correctly for some of the training cases and incorrectly for the others. When it predicts correctly, the weight of this training case will be multiplied by the confidence C(Ev<sub>1</sub>) and will become (1-e<sub>1</sub>). But when it predicts incorrectly, the weight will be multiplied by e<sub>1</sub>. The distribution will become skewed. We want to unskew it for computing the confidence of Ev<sub>2</sub>, so we compensate by multiplying the weight of the cases that got predicted incorrectly by (1-e<sub>1</sub>)/e<sub>1</sub>. Lo and behold, it's the same thing that is done in AdaBoost:<br /><br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * (1-e<sub>t</sub>) / e<sub>t</sub><br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i)<br /> }<br /><br />So it turns out that on each step of AdaBoost the distribution represents the compensation of the weight adjustment for application of the previous events, and the value (1-e<sub>t</sub>) is the proper adjusted fraction of the training cases that got predicted correctly. Amazing, huh? I couldn't have come up with this idea from scratch, but given an existing formula, it all fits together.<br /><br />What is the benefit of looking at AdaBoost from the Bayesian standpoint? For one, we get the probability value for the result. For another one, it looks like AdaBoost can be slightly improved by changing the initial weights of true and false outcomes from 1:1 to their actual weights in the M training cases. That's an interesting prediction to test. And for the third one, maybe it can be used to improve the use of AdaBoost for the multi-class classifications. I haven't read the book that far yet, I want to understand the current state of affairs before trying to improve on it.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-6824791878760633372016-05-28T18:09:00.000-04:002016-05-28T18:09:41.943-04:00Summary of Bayes by weightThis is a copy of the post from my MSDN blog. If you've been reading this blog, you've already seen all the ideas described here. But I've realized that the repost can be useful on this blog for the references, as a short summary of the most interesting parts. So, here it goes.<br /><br /><p>I recently wrote a series of posts in my other blog on the Bayes expert system. Some time ago I wrote an expert system, and I've been surprised by how little of the information is available on the Internet and how much of it is misleading, so I wanted to make a write-up of my experience, and recently I've finally found time to do it. It turned out larger than I expected, and as I was writing it I myself have gained a deeper understanding of my previous experience and came up with further better ideas. The text starts from the very basics of the Bayesian formula and goes through its use in the expert systems, the typical pitfalls when building the expert systems and the solutions to these pitfalls, the ways to handle the uncertainties, a deeper dive into how and why these formulas actually work, more of the practical solutions based on that knowledge, and the aspects of the testing.</p><p>The whole series is here (in the reverse order): <a href="http://babkin-cep.blogspot.com/search/label/bayes">http://babkin-cep.blogspot.com/search/label/bayes</a> <br /> The first post is here: <a href="http://babkin-cep.blogspot.com/2015/10/bayes-1-basic-terms.html">http://babkin-cep.blogspot.com/2015/10/bayes-1-basic-terms.html</a> <br /> They are not interspersed with any other posts, so you can read sequentially.</p><p>Right in the first post I refer to the Yudkowsky's explanation, so why would you want to read my explanation rather than his? By all means, read his explanation too, it's a good one. But he explains it in a different context. He takes one application of the Bayesian formula and works through it. So do many other texts on the artificial intelligence. The expert systems don't apply the formula once. They apply the formula thousands of times to make a single diagnosis. Which brings its own problems of scale that I describe. And there are many other practical issues. Your input data might be "impossible" from the standpoint of probabilities in your knowledge base. The evidence of absence is not the same as the absence of evidence. The events are rarely completely independent. Multiple hypotheses might be valid at the same time. And so on.</p><p>The particularly surprising realization for me was that the Bayesian systems and the decision trees are fundamentally the same thing. You can represent either one through the other one. They are traditionally used in somewhat different ways but that's just the question of tuning the parameters. They can be easily tuned either way or anywhere in between. This stuff is in the part 11 of my notes. There it requires the context from the previous parts, but here is the short version as a spoiler. The short version might be not very comprehensible due to its shortness but at least it points to what to look for in the long version.</p><p>So, it starts with the basic Bayes formula where the probability of some hypothesis H after taking into account the information about the event E is:</p><pre>P(H|E) = P(H) * P(E|H) / P(E)<br /></pre><p>The hypotheses can also be thought of as diagnoses, and the events as symptoms.</p><p>In a typical expert system there can be hundreds of hypotheses and hundreds of events to consider. The values of P(E|H) are taken from the table that is computed from the training data. The values of P(H) and P(E) change as the events are applied one after another (P(H|E) for one event becomes P(H) for the next event, and P(E) gets computed from the complete set of values for P(H) and P(E|H)) but their initial values are also sourced from the same table. At the end one or multiple most probable hypotheses are chosen as the diagnosis.</p><p>What is the training data? It's the list of the previously diagnosed cases, where both the correct winning hypotheses and the values of the events are known. In the simplest way it can be thought of as a bitmap where each row has a hypothesis name and the bits for every event, showing if it was true or false:</p><pre> E1 E2 E3 E4<br />H1 1 0 0 0<br />H1 1 0 0 0<br />H1 0 0 0 1<br />H2 0 1 1 0<br />H2 0 1 1 0<br /></pre><p>In reality one case may have a diagnosis of multiple hypotheses, and the values of the events might be not just 0 and 1 but a number in the range between 0 and 1: we might not be completely confident in some symptom but have say a 0.7 confidence that it's true.</p><p>How is the table of probabilities built from the training data? For P(H) we take the proportion of the number of cases with this hypothesis to the total number of cases. For P(E|H) we take all the cases for the hypothesis H and average all the values for the event E in them. For P(E) we average all the values for E in all the cases in the whole training data.</p><p>As it turns out, this approach doesn't always work so well. Consider the following training data:</p><pre> E1 E2<br />H1 0 0<br />H1 1 1<br />H2 1 0<br />H2 0 1<br /></pre><p>The meaning of the data is intuitively clear, it's H1 if E1 and E2 are the same and H2 if E1 and E2 are different. But when we start computing P(E|H), all of them end up at 0.5 and the resulting expert system can't tell anything apart.</p><p>There is a solution: for the duration of the computation, split each hypothesis into two, say H1A and H1B:</p><pre> E1 E2<br />H1A 0 0<br />H1B 1 1<br />H2A 1 0<br />H2B 0 1<br /></pre><p>Before making the final decision, add up the probabilities P(H1)=P(H1A)+P(H1B) and use them for the decision. Now the logic works. The hypotheses can be split down to the point where each case in the training data becomes its own sub-hypothesis. Indeed the current example had done exactly this. If there are multiple cases that are exactly the same, we could keep them together by assigning a weight instead of splitting them into the separate sub-hypotheses. For example, if there are 5 cases that are exactly the same, we can make them one sub-hypothesis with the weight of 5.</p><p>And with such a fine splitting the computation of probabilities can be thought of as striking out the sub-hypotheses that don't match the incoming events. If we're computing the diagnosis and receive E1=1, we can throw away H1A and H2B, leaving only H1B and H2A. If then we receive E2=0, we can throw away H2A, and the only hypothesis left, H1B, becomes the final diagnosis that we'll round up to H1.</p><p>We're essentially trying to find a match between the current case and one of the training cases. But that's exactly what the decision trees do! We can represent the same table as a decision tree:</p><pre> |<br /> V<br /> 0 E1 1<br /> +----+----+<br /> | |<br /> V V<br /> 0 E2 1 0 E2 1<br /> +--+--+ +--+--+<br /> | | | |<br /> V V V V<br /> H1A H2B H2A H1B<br /></pre><p>As you can see, it's equivalent, produces the exact same result on the same input.</p><p>But what if we're not entirely confident in the input data? What if we get E1 with the confidence C(E1)=0.7? The way the Bayesian formula</p><pre>P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )<br /> / ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )<br /></pre><p>treats it amounts to "multiply the weight of the cases where E1=1 by 0.7 and multiply the weight of the cases where E1=0 by 0.3". So in this example we won't throw away the cases H1A and H2B but will multiply their weights by 0.3 (getting 0.3 in the result). H1B and H2A don't escape unscathed either, their weights get multiplied by 0.7. Suppose we then get E2=0.8. Now the weights of H1A and H2A get multiplied by 0.2, and of H1B and H2B get multiplied by 0.8. We end up with the weights:</p><pre>W(H1A) = 1*0.3*0.2 = 0.06<br />W(H1B) = 1*0.7*0.8 = 0.56<br />W(H2A) = 1*0.7*0.2 = 0.14<br />W(H2B) = 1*0.3*0.8 = 0.24<br /></pre><p>When we add up the weights to full hypotheses, W(H1)=0.62 and W(H2)=0.38, so H1 wins (although whether it actually wins or we consider the data inconclusive depends on the boundaries we set for the final decision).</p><p>Can this be represented as a decision tree? Sure! It just means that when we make a decision at each event node we don't choose one way out. We choose BOTH ways out but with different weights. If the weight of some branch comes down to 0, we can stop following it, but we faithfully follow all the other branches until the come down to the leaves, and keep track of the weights along the way.</p><p>Now, what if the table contains not just 0s and 1s but the arbitrary values between 0 and 1? This might be because some training cases had only a partial confidence in their events. Or it might be because we averaged the events of multiple training cases together, building the classic Bayesian table with one line per hypothesis.</p><p>We can still compute it with weights. We can just logically split this case into two cases with the partial weights. If we have a value of 0.7 for the hypothesis H and event E, we can split this line into two, one with 1 in this spot and the weight of 0.7, another one with 0 in this spot and the weight of 0.3. And we can keep splitting this case on every event. For example, if we start with</p><pre> E1 E2<br />H1 0.7 0.1<br /></pre><p>we can first split it into 2 cases on E1:</p><pre> E1 E2 W<br />H1A 1 0.1 0.7<br />H1B 0 0.1 0.3<br /></pre><p>And then further split on E2:</p><pre> E1 E2 W<br />H1AA 1 1 0.07<br />H1AB 1 0 0.63<br />H1BA 0 1 0.03<br />H1BB 0 0 0.27<br /></pre><p>Or we can modify the table as we apply the events: right before applying the event, split each row in the table in twain based on the values in it for this event, apply the event by multiplying the weights appropriately, and then collapse the split rows back into one by adding up their weights. This makes the intermediate values more short-lived and saves on their upkeep. And that's exactly the logic used in the classic Bayesian formula.</p><p>Since once the table rows get split, the operations on the table stay exactly the same, they still can be represented with the decision tree. Just the splitting of the rows in the table would be transformed to the splitting of nodes in the decision tree.</p><p>In the end the Bayesian logic and the decision trees are equivalent. They do the same things. Traditionally the Bayesian logic uses the training cases that have been averaged out while the decision trees try to match to the individual training cases. But both can be used in either way. And it's possible to get a mix of both by partitioning the weights between the original training cases and their combined averaged versions. It's a smooth scale of 0 to 1: if you transfer all the weight of the training cases into averaging, you get the traditional Bayesian logic, if you transfer none of it, you get the traditional decision trees, and if you transfer say 0.3 of the weight, you get the logic that is a combination of 0.3 of the traditional Bayesian logic and of 0.7 of the traditional decision trees. Note I've been saying "traditional" because they are really equivalent and can be transformed into each other.</p>Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-39518544437661248472016-05-22T01:15:00.000-04:002016-05-22T01:15:38.144-04:00AdaBoost in simpler formulas 2I've been reading along the book on boosting, and I'm up to about 1/3 of it :-) I've finally realized an important thing about how the H(x) is built.<br /><br />For easy reference, here is another copy of the AdaBoost algorithm from the <a href="http://babkin-cep.blogspot.com/2016/03/adaboost-in-simpler-formulas.html">previous installment</a>, simplified slightly further by replacing (1-e<sub>t</sub>)/e<sub>t</sub> with Wgood<sub>t</sub>/Wbad<sub>t</sub>, and e<sub>t</sub>/(1-e<sub>t</sub>) with Wbad<sub>t</sub>/Wgood<sub>t</sub> as mentioned at the end of it and getting rid of e<sub>t</sub> altogether:<br /><br /><hr /><br />Given: (x<sub>1</sub>, y<sub>1</sub>), ..., (x<sub>m</sub>, y<sub>m</sub>) where x<sub>i</sub> belongs to X, y<sub>i</sub> belongs to {-1, +1}.<br />Initialize: D<sub>1</sub>(i) = 1 for i = 1, ..., m.<br />For t = 1, ..., T:<br /><ul><li>Train the basic algorithm using the weights D<sub>t</sub>.</li><li>Get weak hypothesis h<sub>t</sub>: X -> {-1, +1}.</li><li>Aim: select h<sub>t</sub> to minimalize the weighted error Wbad<sub>t</sub>/Wgood<sub>t</sub>: <br />Wgood<sub>t</sub> = 0; Wbad<sub>t</sub> = 0;<br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) = y<sub>i</sub> {<br /> Wgood<sub>t</sub> += D<sub>t</sub>(i)<br /> } else {<br /> Wbad<sub>t</sub> += D<sub>t</sub>(i)<br /> }<br />}<br /></li><li>Update, <br />for i = 1, ..., m {<br /> if h<sub>t</sub>() != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * sqrt(Wbad<sub>t</sub>/Wgood<sub>t</sub>) <br /> }<br />}</li></ul>Output the final hypothesis: <br />H(x) = sign(sum for t=1,...,T (ln(sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>))*h<sub>t</sub>(x)) ). <br /><br /><hr /><br />I've been wondering, what's the meaning of ln() in the formula for H(x). Here is what it is:<br /><br />First of all, let's squeeze everything into under the logarithm. The first step would be to put h<sub>t</sub>(x) there.<br /><br />ln(sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>))*h<sub>t</sub>(x) = ln(sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<sup>h<sub>t</sub>(x)</sup>)<br /><br />This happens by the rule of ln(a)*b = ln(a<sup>b</sup>).<br /><br />Since h<sub>t</sub>(x) can be only +1 or -1, taking the value to the power of it basically means that depending on the result of the h<sub>t</sub>(x) the value be either taken as-is or 1 divided by it. Which is the exact same thing that is happening in the computation of D<sub>t+1</sub>(i). The two formulas are getting closer.<br /><br />The next step, let's stick the whole sum under the logarithm using the rule ln(a)+ln(b) = ln(a*b):<br /><br />H(x) = sign(ln( product for t=1,...,T ( sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<sup>h<sub>t</sub>(x)</sup> ) ))<br /><br />The expression under the logarithm becomes very similar to the formula for D<sub>t+1</sub>(i) as traced through all the steps of the algorithm:<br /><br />D<sub>t+1</sub>(i) = product for t=1,...,T ( sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<sup>-y<sub>t</sub>*h<sub>t</sub>(x)</sup> )<br /><br />So yeah, the cuteness of expressing the condition as a power comes handy. And now the final formula for H(x) makes sense, the terms in it are connected with the terms in the computation of D.<br /><br />The next question, what is the meaning of the logarithm? Note that its result is fed into the sign function. So the exact value of the logarithm doesn't matter in the end result, what matters is only if it's positive or negative. The value of logarithm is positive if its agrument is > 1, and negative if it's < 1. So we can get rid of the logarithm and write the computation of H(x) as:<br /><br />if ( product for t=1,...,T ( sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<sup>h<sub>t</sub>(x)</sup> ) > 1 ) then H(x) = +1 else H(x) = -1<br /><br />Okay, if it's exactly = 1 then H(x) = 0 but we can as well push it to +1 or -1 in this case. Or we can write that <br /><br />H(x) = sign( (product for t=1,...,T ( sqrt(Wgood<sub>t</sub>/Wbad<sub>t</sub>)<sup>h<sub>t</sub>(x)</sup> ) - 1 )<br /><br />The next thing, we can pull the square root out of the product:<br /><br />if (sqrt( product for t=1,...,T (Wgood<sub>t</sub>/Wbad<sub>t</sub><sup>h<sub>t</sub>(x)</sup>) ) > 1 ) then H(x) = +1 else H(x) = -1<br /><br />But since the only operation on its result is the comparison with 1, taking the square root doesn't change the result of this comparison. If the argument of square root was > 1, the result will still be >1, and the same for < 1. So we can get rid of the square root altogether:<br /><br />if ( product for t=1,...,T (Wgood<sub>t</sub>/Wbad<sub>t</sub><sup>h<sub>t</sub>(x)</sup> ) > 1 ) then H(x) = +1 else H(x) = -1<br /><br />The downside of course is that the computation becomes unlike the one for D<sub>t+1</sub>(i). Not sure yet if this is important or not.<br /><br />Either way, we can do one more thing to make the algorithm more readable, we can write the product as a normal loop:<br /><br /><hr /><br />chance = 1;<br />for t=1,...,T {<br /> if (h<sub>t</sub>(x) > 0) {<br /> chance *= Wgood<sub>t</sub>/Wbad<sub>t</sub>;<br /> } else <br /> chance *= Wbad<sub>t</sub>/Wgood<sub>t</sub>;<br /> }<br />}<br />H(x) = sign(chance - 1)<br /><br /><hr /><br />Note that this code runs not at the training time but later, at the run time, with the actual input data set x. When the model runs, it computes the actual values h<sub>t</sub>(x) for the actual x and computes the result H(x).<br /><br />I've named the variable "chance" for a good reason: it represents the chance that H(x) is positive. The chance can be expressed as a relation of two numbers A/B. The number A represents the positive "bid", and the number B the negative "bid". The chance and probability are connected and can be expressed though each other:<br /><br />chance = p / (1-p)<br />p = chance / (1+chance)<br /><br />The chance of 1 matches the probability of 0.5. Initially we have no knowledge about the result, so we start with the chance of 1, and with each t the chance gets updated according to the hypothesis picked on that round of boosting. <br /><br />The final thing to notice is that in the Bayesian approach we do a very similar thing: we start with the prior probabilities (here there are two possible outcomes, with the probability 0.5 each), and then look at the events and multiply the probability accordingly. At the end we see which hypothesis wins. Thus I get the feeling that there should be a way to express the boosting in the Bayesian terms, for a certain somewhat strange definition of events. Freund and Shapire describe a lot of various ways to express the boosting, so why not one more. I can't quite formulate it yet, it needs more thinking. But the concept of "margins" maps very nicely to the Bayesian approach.<br /><br />In case if you wonder what the margins are: as the rounds of boosting go, after each round of boosting H(x) can be computed for each set of the training data x<sub>i</sub>. At some point they all start matching the training results y<sub>i</sub>, however the boosting can be run further, and more rounds can still improve the results on the test data set. This happens because the sign() function in H(x) collapses the details, and the further improvements are not visible on the training data. But if we look at the argument of sign(), such as the result of the logarithm in the original formula, we'll notice that they keep moving away from the 0 boundary, representing more confidence. This extra confidence then helps make better decisions on the test data. This distance between the result of the logarithm and 0 is called the margin. Well, in the Bayesian systems we also have the boundary of the probability (in the simplest case for two outcomes, 0.5), and when a Bayesian system has more confidence, it drives the resulting probabilities farther from 0.5 and closer to 1. Very similar.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-6434271114746493392016-04-09T19:07:00.000-04:002016-04-09T19:07:47.908-04:00career advice 3As I've mentioned before, I've read the book "Friend & Foe" by Adam Galinsky and Maurice Schweitzer, and I went to see their talk too. It's an interesting book on the psychology of competition and cooperation. But some examples from it and their interpretation struck me as odd.<br /><br />In particular, they have a story of Hannah Riley Bowles who got an offer for a tenured position from the Nazareth College, decided to negotiate for some better conditions, and had the offer retracted (to her surprise). Perhaps she'd followed the advice similar to Tarah Van Vleck's: "never accept the first offer". What went wrong? According to Galinsky and Schweitzer, there must be some evil afoot, it's all because she's a woman.<br /><br />But is it? The whole story as described in the book strikes me as a sequence of bad decisions. I'm not the greatest expert on negotiations by far but I know a thing or two about them. For example, I've negotiated for a year about one of my jobs. For another job, I've negotiated through 3 different engagements over 6 years. And basically in the story about Hannah I see both the glaring mistakes on her part and the mistaken pre-suppositions in the narrative.<br /><br />The major mistake in the narrative is that by negotiations you cannot lose ("never accept the first offer, it's always lower by 10-15% than the final offer"). It's not true. If you decide not to accept the first offer but start the negotiations, you have to be prepared that the other side would turn around and go away. It has no relation to gender, happens to everyone, and is nothing unusual. It's something you've got to be prepared to. I've had it happen on multiple occasions in my career.<br /><br />If doesn't mean that you shouldn't negotiate. The negotiation of the future salary at a new place is the best way and time to raise your income, after that a 10% raise will be considered a huge one. A 10% raise did happen to me once, but again, it was a very unusual thing, and much easier achieved by negotiating at the hiring time. But you've got to place a realistic goal in front of you, what kind of raise would be worth changing the jobs, and go from there. If the offer is way below this goal, there is no point in taking it. If it's way above, you've achieved your goal, there's not much more to desire. Extra negotiation can bring extra money but can also break the whole thing. If someone offers you double the current money, it's probably reasonable to just take the offer and not risk it.<br /><br />And yes, the offer of double money did happen to me but it came with a catch: it was for a 6-month contract, with a long commute and not a particularly exciting job. So it wasn't a no-brainer, it took me some thinking. In the end I've decided that if I get double the money for 6 months and then spend another 6 months looking for another job like my current one, I'd be still ahead, and I took it (and in result the things have worked out better than expected).<br /><br />To give an example of when things didn't work out, let's look at that 6-year negotiation story. When I've talked to them for the first time, I went there for an interview, I've talked to the HR about what kind of money I want, and a week later I get a form letter saying that they're not interested. Well, overall fine with me (except for one point that I'll return to later), they didn't look particularly interesting anyway. When their recruiter contacted me next time, I've asked them: you people didn't like me once already, why are you calling me again? And he said, no, the technical interviews actually are marked pretty good, so it's got to be some other reason. From which I could only conclude that the money was the problem. And then I've tried a bit different approach to find out what kind of money they had in mind, and it turned out that yes, there was a major disagreement. But the important point for Hannah's story is that they didn't make me an offer for half the money I was asking, they just turned around and went away.<br /><br />Making another digression, this kind of confirms Tarah Van Vleck's advice "never name your number first". Or does it? Remember, in this case our expectations have been off by a factor of 2. If they made me an offer for half the money I thought reasonable, I wouldn't have taken it anyway, just as I didn't take when I found it out during the second engagement. By the way, yes, there are disadvantages of naming your number first but there are also are other issues, and there are some advantages too: if you overshoot their expectations by a reasonable amount, you'll have a lot easier time in defending this number in the further negotiations. If they name a number and you say "I want 10% more", they'll figure out that you're just trying to stretch it a little, and they might either stay firm or maybe settle at something like 5% more. If you name a number 20% more than they were expecting to offer, you'll probably get if not all 20% then at least 15%. And it's not just me, I've also read it in some book (Galinsky&Schweitzer's? Cialdini's? Karras's?) that the first number named sets the tone for the negotiation, which is difficult to move afterwards. It can be moved but not by 15%, if you want to make progress you've got to start with something like "this is laughable! my reasonable estimation is 50% more!" and get maybe extra 30-45%. And of course bear the risk that the other side would go away, so I'd recommend doing this only if you really do see the initial offer as laughable.<br /><br />If the other side thinks that your demands are unreasonably high (or low, for the other side, and yes, I've done things like that from my side as well), they'll just go away. But of course from my standpoint the requests have been perfectly reasonable, I would not have agreed to their low-ball offer anyway, so I haven't lost anything. This is a problem only if you're bluffing.<br />Now turning to Hannah's mistakes. Sorry, but she led the negotiations in a very offensive way, as offensive as it can get without calling the prospective employer names.<br /><br />The first major mistake was that she responded by writing of a letter with the list of requests, and in such a formal tone. Negotiation in the written form is bad, it's highly prone to cause very negative feelings in the counterparty. The good way to negotiate is over the phone.<br /><br />The use of the formal tone is even worse. It's guaranteed to offend. Returning to that example above, receiving that form letter had pissed me off very much. If they simply said "No, we're not interested" or "No, we're not interested, we don't think you're good enough for us", it would have been OK. But receiving a page-long form letter in legalese created a major grudge. For a few years after that I wouldn't even talk to the their recruiters. <br /><br />The right way to negotiate is on the phone, and try to keep as friendly a tone as possible. The point of negotiations is to convince the other party that your viewpoint is more reasonable, not to fight them.<br /><br />This brings us to the next error, but here Hannah had no control: she had to negotiate directly with the college because the college had contacted her directly. The negotiations go much, much better when conducted through an intermediary. An independent recruiting agent is the best intermediary, the company recruiter is the second best one. Negotiating directly with the hiring manager, as Hannah essentially did, is fraught with peril. The recruiters are the professional negotiators, they understand how the negotiations work, and transfer the information between two parties while maintaining friendliness on both sides. You can talk more bluntly to them, and when the message reaches the other side, it will become formatted in a friendly way. On the other hand, the hiring managers tend to take offense easily. Many of them are technical specialists but not really people persons, and for quite a few of them the feeling self-importance goes strongly to their head. Might be even worse in academia than in the industry, at least judging by what I read. The even worse part is that she had to deal with a committee. The problem with committees is that there is a higher probability that at least one member will be a self-important moron who will take offense.<br /><br />Ironically, this went so bad because from the tone of the letter Hannah doesn't appear to be a people person either, but one with the self-importance gone to her head. It's hard enough to negotiate when one side has this attitude, and much harder when both sides do. For all I understand, the tenure positions are coveted in academia, so when the committee made an offer to Hannah, they likely felt that they're making her an honor. Which is expected to be accepted humbly. Responding to the offer with the words "Granting some of the following provisions will make my decision easier" is the opposite of humility. It's the negotiation from the position of power, implying that they've made a humble supplication of her, and she is considering whether to grant their wish. I hope you can see by now how they felt offended.<br /><br />As you can see, great many things went wrong with Hannah's negotiation, and none of them have anything to do with her gender. All of them had to do with the communication mistakes, character of the people involved, pride and prejudice of academic nature, and lack of an experienced intermediary to calm down the tempers.<br /><br />What could Hannah had done better? I'd recommend first thing going there, looking at the place, and meeting the people. A personal contact always makes the following remote communications much more personable. And then making her requests either in a face-to-face meeting or over the phone. Making them in a personable tone of requests, not demands. Like "hey, and how does such a thing run at your college? would it be OK if I do it like this?". Perhaps, making some of the requests through the HR department people. And what could have the college done better? After the hiring committee had made the decision, they could have used a professional recruiter from HR to communicate between the committee and Hannah.<br /><br />Of course, yet another way to look at it is "do you want to work with people like this?". The point of the interview is that not only candidate is a good fit for the company but also that the company is a good fit for the candidate. If you think that the company behaves unreasonably in response to your reasonable requests, it's probably best not to work there: obviously, your ideas of what is reasonable differ widely.<br /><br />And this also brings the point about whether the women are undeservedly seen as too aggressive. I'd say that Hannah's example demonstrates exactly the kind of over-aggressiveness. It's not that she tried to negotiate for the better conditions, it's HOW she tried to do it. Instead of building the mutual rapport and convincing the counterparty of her goals in a friendly way, she saw it as a fight. It's not the perception of the aggression that is the problem, the problem is in the aggression that is actually present.<br /><br />I wonder if it might also be connected to another effect about negotiations. As described in the book "The negotiating game" by Karrass, and as I can anecdotically confirm from my experience, when a good negotiator gets the major thing he wants, he goes soft on the opponent and doesn't mind giving up some minor points, to keep the relationship happier. On the other hand, the poor negotiators keep hammering non-stop even if they've got the negotiating power and already managed the good conditions, they still keep trying to squeeze everything possible out of the opponent. Perhaps the second case is the same character trait that is seen as high aggression, the irony being that the higher aggression brings less success.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-61147718451675276252016-04-09T00:33:00.000-04:002016-04-09T00:33:02.096-04:00career advice 2For an example of a good career advice, I want to point to James Whittaker. I've attended his classes "Career Superpowers" and "The Art of Stage Presence", which also are available as books. He is a great presenter (and hopefully a good writer too, I didn't read the book versions).<br /><br />I wouldn't say that I've learned a whole lot from the class on Career Superpowers but that's because it largely matches what I already know from my experience. But I really liked it being presented in a systematic fashion, and I did learn some things (and perhaps I need to exercise more of following them).<br /><br />Or you might say it the other way: maybe I've liked it because it matched my own thoughts so much. Either way, he's done quite a career, much better than me.<br /><br />And that's one of his points: it makes sense to follow the advice of someone who know what he's doing and draws this advice from a success. Well, there are lots of caveats too. One, the career recipes are different by the times and by the industries. Following the advice of a CEO from the auto industry in the 70-80s won't help you in the software industry now. James's recipes are for the software industry, so if you're in the auto or financial industry, they might be a bad advice for your situation. The second caveat, you want to follow the advice of someone of who is analytical. Succeeding by following a good instinct is one thing but breaking this experience down into an advice that can be transferred to other people is a whole separate feat.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-74574797215097898862016-04-08T23:59:00.001-04:002016-04-08T23:59:29.239-04:00career advice 1I went to see a book presentation by Tarah Wheeler Van Vlack on career advice for women in technology. And I think it's all bad advice. It's not that everything she said is wrong but even when she starts with something that makes sense (like "always negotiate"), the details end up as a bad advice. If women really follow the advice like this, no wonder that it would create the "gender gap".<br /><br />I also have another story on the subject with another book, "Friend & Foe" which I think was also giving weird career advice. I wrote a letter to its authors, and after some editing of the personal details, it might be entertaining to publish at some point, as an example of what I see as the good career advice.<br /><br />But getting back to this one, I've got a couple of interesting points that are completely separate from the advice as such.<br /><br />The first one is that she was talking a lot about the "nerd culture": board games, videogames, cosplay, comics books and what not. She seems to imply a strong connection between the nerd culture and the engineering. Which I don't think is true. If someone is into the nerd culture, it doesn't necessarily mean that they would be good at the technical jobs, and if someone is good at technical jobs, it doesn't mean that they would be into the nerd culture. For the sake of the point, I don't think I'm into the nerd culture as described. I'm more into the real books, cooking, racecars, and lots of other small hobbies. Well, there are different opinions about this, as one of the girls I dated said that I'm so nerdy, but her reference point was different, she was really from a redneck background. I like to play sometimes at pretending being a redneck (you could say that it's my cosplay) but I really am not one by a wide margin and I know it.<br /><br />Let me tell you an old Russian parable by Kozma Prutkov (a collective pen name of a group of writers): Once upon a time there was a gunsmith who had built a wonderful new rifle with great many features. With it you could clean a game, cook it, unfold the rifle into a table and utensils and have a very nice dinner. An absolutely great experience. This rifle had only one drawback to it: it couldn't shoot.<br /><br />I really like applying this parable to everything. If we call something a rifle, the extra features might be nice to have with other things being equal, but first and foremost it must shoot. And to be a good rifle, it must shoot well.<br /><br />Let's now apply this parable to the nerd culture. There seems to be a lot of pressure saying that if you're into the nerd culture, you should look for a job in engineering. And that to look for a job in engineering, you have to subscribe to the nerd culture. But in reality to look for a job in engineering and make a good career out of it, first and foremost you should be good at engineering. The nerd culture doesn't really matter: you can like it or dislike it or whatever, it won't affect anything. (And that's by the way is the real meaning of diversity in the good sense: as long as you do the job well, what kind of culture you belong to doesn't matter). This pressure causes the people who are into the nerd culture to go into the engineering. And some turn out to be no good at it and fail. And become very frustrated, they feel that they've done everything right, as told that they should do, and still failed (or succeeded only very moderately) for no explainable reason. There must be some evil forces at work!<br /><br />But the real reason is that they've been given bad advice and the wrong kind of pressure. If you're good at drawing the comic books, or playing video games, it doesn't mean that you're any good at engineering. You might be good at both but there is no guarantee whatsoever. I'm fairly decent at engineering but I'm not any good at drawing comic books. If you enjoy drawing comic books and don't feel any particular attachment to actual engineering, perhaps a better career choice would be to go into something connected with the drawing of comic books, or with the drawing in general. If you're good at action videogames, this skill might be connected to being good at driving a racecar or flying a fighter plane but not much with the engineering. And the same bad advice works the other way around: some people would feel that if they're not into the nerd culture, they shouldn't even try engineering.<br /><br />Another related thing is that there is a big difference between being interested in something and being good at something. I like racing cars but I understand than I'm no Michael Schumacher, so I don't even try to make it into a career. It's a hobby where I waste money, not make money. You don't get paid for being interested, you get paid for being good at something that is useful for someone. Being interested is something that gives you a push towards trying something, and stimulates you to spend time and effort on improving, but by itself it doesn't make you good. In reality when people get good at something they become less interested: after the skill gets learned it becomes "same old, same old", and the time comes to find some next interesting thing (or at least the higher more difficult level of the previous thing). And, well, the people who keep being interested but never become good, tend to try being around things they're interested in. And there is a big difference between doing things and just being around them. I think I've read a good rant by Steve Yegge about it a few years ago. But people who are around things don't really understand the difference, in their view they're in the very midst of activity. And when they're not appreciated that much compared to the people who do things, they don't understand why and feel frustrated. There must be some evil forces at work! I've learned from Tarah the new phrase "officewife", which describes someone who contributes to the social dynamics of a workgroup and say brings donuts but is not taken seriously for the work contribution. Which I think is a salient example for this paragraph. Of course, people can be labeled unjustly and incorrectly by social inertia, but this I think is where the phrase grows from. It's not limited to any gender, I've seen plenty of men in the same role.<br /><br />The second point is that Tarah said that she is not afraid to be a bitch, and well, she really comes across as one. Or I better like a gender-neutral term she also used, as an asshole. No big deal by itself but there is an interesting connection: there is this widespread moaning (including in the book "Friend & Foe") that "if women try to be assertive, they're seen as bitches, while the men aren't". Or I'd rather say assholes because I don't see any gender difference. I mean, assholes are assholes, and there are plenty among men. There is a difference between being assertive and being an asshole.<br /><br />What's this difference between "assertive" and "asshole"? I think a lot of it is about being analytical versus being blind to the world. One thing is when someone builds the plans from the experience, then pushes to implement these plans, recognizes the feedback when something goes not up to the plan, and does the corrections. Another thing is when someone comes up with a load of crap and unloads it unto the wold, often being mean for no good reason along the way.<br /><br /><br />This is by the way the communist tradition of management by assoholism (for all I can gather, still popular in the former USSR, and also in the Arab world): if you're the boss, scream, make tantrums and abuse the underlings until they do something.<br /><br /><br />All kinds of prophets and proselytizers also tend to be major assholes. If someone had come up with The Great Truth and decides to blindly spread this Great Truth throughout the world, he is an asshole. But if he (or she) pauses sometimes to listen to the reasonable comments and objections, and think about this Great Truth, and give a reasonable response, and maybe sometimes modify the Great Truth, then perhaps this he or she becomes merely assertive.<br /><br />To give another example, "I want this done because I said so" is an asshole, "I want this done because I see such and such good reason" is assertive. Or as yet another example, when you see a blog where the commenters get banned simply for disagreeing with the author, you can say for sure that the author is an asshole.<br /><br />Circling back, to the second point, it could be that the reson for "if women try to be assertive, they're seen as bitches" might be because they try to follow the bad examples, and there seem to be quite a few bad examples abound spreading bad advice. If someone follows the advice on how to become an asshole, they can successfully become an asshole and not even know it.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-27273078402519509382016-03-18T02:11:00.000-04:002016-06-24T11:08:05.506-04:00AdaBoost in simpler formulasI'm reading a book about the "boosting": the machine learning idea that a better algorithm can be built by combining multiple instances of a simple algorithm, each instance trained on the same set of training data but with different weights assigned to each case.<br /><br />The very basic concept itself goes along the same way as I've been doing manually for the Bayesian system: after a system is trained on a set of training cases, re-test it on the same set of data and find what cases got diagnosed wrong (the proper word for it, as it turns out, is "training error"). Then try to improve the recognition of these cases. But unlike me the boosting does it in an automated way: it doesn't try to change the underlying algorithm, instead it simply produces the second training table, and the end result is produced by running the expert system with both tables then combining their results in a weighted fashion. Boosting can be done for hundreds of rounds, producing the systems that run the basic algorithm hundreds of times. For Bayesian algorithms, I think it might be possible to combine these many tables into one instead of running them separately, although I'm not quite sure about that.<br /><br />There is also a limitation on the boosting: looks like it's mostly sorted out for the cases where there are only two mutually exclusive hypotheses. Apparently it starts having problems even with multiple mutually exclusive hypotheses, let alone the overlapping ones.<br /><br />Well, getting back to the book, it's "Boosting", written by the inventors of the concept in general and of the AdaBoost (Adaptive Boost) algorithm in particular, Shapire and Freund. <br /><br />I've spent some time understanding the AdaBoost algorithm, and I feel that it can be expressed in a simpler form. The authors' definition of the algorithm is like this (minus the mathematical characters that I don't know how to write in HTML and wrote in words or similar Latin characters):<br /><br /><hr /><br />Given: (x<sub>1</sub>, y<sub>1</sub>), ..., (x<sub>m</sub>, y<sub>m</sub>) where x<sub>i</sub> belongs to X, y<sub>i</sub> belongs to {-1, +1}.<br />Initialize: D<sub>1</sub>(i) = 1/m for i = 1, ..., m.<br />For t = 1, ..., T:<br /><ul><li>Train the basic algorithm using the distribution D<sub>t</sub>.</li><li>Get weak hypothesis h<sub>t</sub>: X -> {-1, +1}.</li><li>Aim: select h<sub>t</sub> to minimalize the weighted error:<br />e<sub>t</sub> = <b>Pr</b><sub>i~D<sub>t</sub></sub>[h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>].</li><li>Choose Alpha<sub>t</sub> = 1/2 * ln((1-e<sub>t</sub>)/e<sub>t</sub>).</li><li>Update, for i = 1, ..., m:<br />D<sub>t+1</sub>(i) = D<sub>t</sub>(i)/Z<sub>t</sub> * {<br /> exp(-Alpha<sub>t</sub>) if h<sub>t</sub>(x<sub>i</sub>)=y<sub>i</sub>;<br /> exp(Alpha<sub>t</sub>) if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub><br />}<br /> = D<sub>t</sub>(i)*exp(-Alpha<sub>t</sub>*y<sub>i</sub>*h<sub>t</sub>(x<sub>i</sub>)) / Z<sub>t</sub><br />where Z<sub>t</sub> is a normalization factor (chosen so that D<sub>t+1</sub> will be a distribution).<br /></li></ul>Output the final hypothesis:<br />H(x) = sign(sum for t=1,...,T (Alpha<sub>t</sub>*h<sub>t</sub>(x)) ).<br /><br /><hr /><br />Some explanations are in order. The pairs (x<sub>i</sub>, y<sub>i</sub>) are the training cases. x<sub>i</sub> represents the whole set of symptoms, y<sub>i</sub> represents the outcome (in Bayesian terms, we could also say "hypotheses" but here the word hypothesis has a different meaning). Only two outcomes are possible, and unlike the Bayesian tradition of representing them as 0 and 1, here they are represented as -1 and +1. This representation allows the clever trick in the formula for D<sub>t+1</sub>(i), replacing the conditional choice of the sign for Alpha<sub>t</sub> with multiplication of (y<sub>i</sub>*h<sub>t</sub>(x<sub>i</sub>)). If these values don't match, the result will be -1, if they do match then 1. The concept of hypothesis here is different from the Bayesian one, here the "hypothesis" means the whole basic algorithm with a computed training table. h<sub>t</sub>(x<sub>i</sub>) means the computation of the outcome by this trained algorithm by feeding the symptoms of a training case into it.<br /><br />The epic expression <b>Pr</b><sub>i~D<sub>t</sub></sub>[h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>] means the weighted probability of a training case outcome being computed incorrectly when it's fed back into the basic algorithm with the current round's training table (the training table is computed independently in each round based only on the set of training cases and on the distribution of their weights for this round). Basically, it's the relation of the weight of incorrectly computed cases to the total weight of all cases. The underlying basic algorithm tries to make up its table to minimize this number to its best ability. But since it's not perfect and might be far from perfect, the number of erroneous cases is unlikely to become 0.<br /><br />There are T rounds of boosting. Each round trains the same basic algorithm on the same set of training cases but assigns different weights to the importance of these cases, according to D<sub>t</sub>. The weights in D<sub>t</sub> are adjusted for each round such as to sum up to 1. Z<sub>t</sub> is thus the "normalization factor": the sum of values in D<sub>t+1</sub> before this adjustment. From the algorithmic standpoint, there is no good reason to do this normalization: weights are weights, who cares if they sum up to 1 or not, this algorithm doesn't depend on it in any way. There is a reason why the authors use the Z<sub>t</sub> as it is, because it turns up in the proofs about the properties of the algorithm. But here it's easier to place it into the calculation of e<sub>t</sub>.<br /><br />Note also that Alpha<sub>t</sub> is first computed as a logarithm, and then an exponent is taken on it. This logarithm and exponent cancel each other out except for the sign. This and the manipulation of the sign by (y<sub>i</sub>*h<sub>t</sub>(x<sub>i</sub>)) look mathematically cute but confuse the hell out of it. Well, they have a reason for using a logarithm because they use this Alpha<sub>t</sub> to prove some properties, but like Z<sub>t</sub> in this place it only complicates things.<br /><br />Getting rid of all these complications, here is the simplified version:<br /><br /><hr /><br />Given: (x<sub>1</sub>, y<sub>1</sub>), ..., (x<sub>m</sub>, y<sub>m</sub>) where x<sub>i</sub> belongs to X, y<sub>i</sub> belongs to {-1, +1}.<br />Initialize: D<sub>1</sub>(i) = 1 for i = 1, ..., m.<br />For t = 1, ..., T:<br /><ul><li>Train the basic algorithm using the weights D<sub>t</sub>.</li><li>Get weak hypothesis h<sub>t</sub>: X -> {-1, +1}.</li><li>Aim: select h<sub>t</sub> to minimalize the weighted error e<sub>t</sub>: <br />Wgood<sub>t</sub> = 0; Wbad<sub>t</sub> = 0;<br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) = y<sub>i</sub> {<br /> Wgood<sub>t</sub> += D<sub>t</sub>(i)<br /> } else {<br /> Wbad<sub>t</sub> += D<sub>t</sub>(i)<br /> }<br />}<br />e<sub>t</sub> = Wbad<sub>t</sub> / (Wgood<sub>t</sub> + Wbad<sub>t</sub>)<br /></li><li>Update, <br />for i = 1, ..., m {<br /> if h<sub>t</sub>(x<sub>i</sub>) != y<sub>i</sub>; {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * sqrt((1-e<sub>t</sub>)/e<sub>t</sub>)<br /> } else {<br /> D<sub>t+1</sub>(i) = D<sub>t</sub>(i) * sqrt(e<sub>t</sub>/(1-e<sub>t</sub>)) <br /> }<br />}</li></ul>Output the final hypothesis: <br />H(x) = sign(sum for t=1,...,T (ln(sqrt((1-e<sub>t</sub>)/e<sub>t</sub>))*h<sub>t</sub>(x)) ). <br /><br /><hr /><br />I think it becomes much easier to understand in this form. In particular, it becomes much easier to see an important property: each round adjusts the weights such as if the last table were used on them, it would produce the weighted error of exactly 50%, which would mean that it has no idea what is going on. The following round is then forced to come up with the new solution and scrape up a new way to do some differentiation. This is where the "Adaptive" in the algorithm's name comes from. But they say that this is not how they've come up with this formula, they say that they've come up with this particular value of Alpha<sub>t</sub> in order to minimize Z<sub>t</sub>. I don't understand yet, how exactly they've done it. I've tried to compute it and I get a different answer. I must be making an error somewhere, need to think more about it. And they try to minimize Z<sub>t</sub> because it represents the upper bound on the training error (again, I can follow the explanation but I can't tell on my own yet, exactly why, will need to read this section a couple more times).<br /><br />I don't understand yet the motivation for choosing the per-round multipliers in H(x) either. I can see some things about it. For example, the logarithm nicely handles the case of e<sub>t</sub> > 0.5. If more than half the cases get misclassified by the basic algorithm, this situation is easy to improve: just flip the sign of the outcomes, and suddenly less than half of them will be misclassified. In this case the value under the logarithm will be less than 1, and the value of the logarithm will be negative, thus automatically flipping the sign! But I don't know why the logarithm is the right choice the rest of the time.<br /><br />We could also replace (1-e<sub>t</sub>)/e<sub>t</sub> with Wgood<sub>t</sub>/Wbad<sub>t</sub>, and e<sub>t</sub>/(1-e<sub>t</sub>) with Wbad<sub>t</sub>/Wgood<sub>t</sub> but this might be detracting too much from the properties of e<sub>t</sub>.<br /><br />P.S. Read more by the tag <a href="http://babkin-cep.blogspot.com/search/label/learning_boosting">learning_boosting</a>.Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0tag:blogger.com,1999:blog-3195352510924611504.post-30763707870649728222015-12-13T19:41:00.000-05:002015-12-13T19:41:20.819-05:00Bayes 20: TestingI've already mentioned how to test a model but I want to go into it in more detail and collect all the items on this subject in one place.<br /><br />The first thing you need to define is some cost metric for comparison. The result of the testing will not be binary "pass or fail", it will be the value of that cost metric, and you'll be trying to minimize that cost.<br /><br />In the perfect world, you would have the table of cost for replacing every spare part (including both the part itself and the labor). You'd also know the labor cost of performing the diagnosis by a human diagnostician. This cost might vary by the type of problem, and in the even more perfect world you'd have this information too. Moreover, the human diagnostician can make mistakes, so you'd have the information about the frequency of mistakes and the cost of labor and parts spend on them factored into the information about the cost of the human diagnosis. Having these tables, you'd be ready to do the comparisons.<br /><br />If you don't have the exact tables, you can do some kind of estimations. For a toy model, you can do some very rough estimations: for example, say that the cost of the human diagnosis is 1 unit, and the cost of any particular repair is 3 units.<br /><br />Then you take the body of the training data. The meaning of the words "training data" is kind of loose here, it's not necessarily the data that you will be using to train the model, you might put at least some of it aside to use in the testing. It all can also be called the "ground truth": the information about the previous cases, diagnosed by humans, and confirmed that the diagnosis is correct. For each test, you'd want to measure how good each version of your model does against the other versions of the model, and also against the humans. After all, if your diagnostic automation results in higher costs than the human diagnosis, there is no point in using it. And if it does better, you'd have some exciting numbers to show to your bosses and/or customers.<br /><br />There is an absolute minimum cost needed to do the repairs on a set of training data: you take the cost of fixing every confirmed problem in it and add them up.<br /><br /><pre>Cperfect = sum( cost(every_problem) )</pre><br />To estimate, what the human diagnosis would cost, you'd take the cost for diagnosing one case, multiply it my the number of the cases in the training data and add to Cperfect to get the total cost.<br /><br /><pre>Chuman = Cperfect + cost(diagnosis)*n_of_cases</pre><br />Of course, if you have the more detailed information about the cost of diagnosis by each type of problem, you can use this detailed information to add them up. Generally there still will be the fixed cost of diagnosing one case, plus the sum of diagnosing each problem in this case:<br /><br /><pre>Chuman = Cperfect <br /> + cost(diagnose_case)*n_of_cases <br /> + sum( cost(diagnose_every_problem) )</pre><br />Note that even if you build a perfect diagnoser, the best savings you can hope for are <tt>(Chuman - Cperfect)</tt>. Or if you prefer to see it as a percentage, <tt>100%*(Chuman - Cperfect)/Chuman</tt>.<br /><br />In reality when you run your automatic diagnoser, you'll have some problems misdiagnosed. There will be some false negatives (when your model failed to notice some problem) and some false positives (when your model diagnosed a problem that is not present). If your model has produced a wrong diagnosis, that's obviously a combination of a false negative (for the real problem that got missed) and a false positive (for the wrong problem that got diagnosed).<br /><br />The cost of the false negatives is the cost of a human diagnosis, because the human diagnostician would have to go and look at this case. The cost of post-repair testing might need to be added as well, because that's what would be detecting that the problem is not fixed before sending it to the human. In many cases the cost of this testing might be negligible compared to the cost of human diagnosis. <br /><br />The cost of the false positives is the cost of the parts and labor spent on replacing the parts that aren't broken.<br /><br />With all this, the cost of the repairs per the expert system will be:<br /><br /><pre>C = Cperfect <br /> + cost(diagnose_case)*n_of_false_negative_cases<br /> + sum( cost(diagnose_every_false_negative_problem) )<br /> + sum( cost(repair_every_false_positive_problem) )</pre><br />You'd compare the output of your model with the perfect diagnosis, notice the false positives and false negatives, and add their costs.<br /><br />Now you're able to compare two models: run them on the same data, find the resulting cost, and see which cost is lower and by how much. You can try different algorithms and different constants for these algorithms and see the changes of the cost. And sometimes the results would surprise you, you'll discover that you went for that fancier algorithm only to make things worse.<br /><br />If you're wondering, what kind of boundary constant value should be used for accepting the hypotheses, the answer is to try multiple values and see, which one works better. If all goes well, you should be able to build a graph of the total cost by the boundary value and see a nice smooth-ish curve with a distinct minimum on it, something like this:<br /><br /><pre>|<br />| *<br />| <br />| * *<br />| * *<br />| <br />|<br />|<br />+-----------------------</pre><br />If you have two interdependent constants (such as in the algorithm that computes probabilities for both all hypotheses and independent hypotheses, and has different acceptance boundaries for these sub-algorithms), you may want to try taking a couple values of one constant, and for each one of them go through the graphing of the cost by the changing of the other constant. That might give you the hint of how they are interdependent. And then with this knowledge you'd be able to choose a smaller interesting area and go through every combination of both constants in it, compute the cost, and find the minimum on the 3-dimensional graph.<br /><br />You might be even able to analyze the dependencies, build a formula, and find a good approximation of the minimum analytically.<br /><br />These boundary constants generally adjust the balance between the false positives and false negatives. If you set the boundary too low, a lot of false positives will be accepted. If you set the boundary too high, the model will reject a lot of diagnoses and thus generate many false negatives. And around the optimum, you'll be trading some false positives for false negatives. In the perfect world there would be some area with no false positives nor false negatives but in reality you'll still have both, and it will come to giving up some in one area to win in the other.<br /><br />The exact win or loss around the optimum area will depend on the mix of cases in the training data. If you move the boundary up, and lose in lots of cases, and win in a few, your cost will go up. If you move the boundary up, lose in a few cases but win in many, your cost will go down. The proportions of different cases mixed in your training data will determine the balance for the minimal total loss.<br /><br />This has two consequences: First, there is no point in the very precise picking of the boundary constants. The small changes of these constants one way or the other won't matter, they will depend on which mix of cases did we get in the last time period. And the mix will fluctuate a little over time, there is no way to predict it precisely. Second, it's important to preserve the mix of the cases in the training. If you select a subset of data for the training, the mix of cases in it should match the full real set. Don't use the approaches like "we'll only include into the training data the cases where we've been able to diagnose the problem on the first go". If you do that, you'll be changing the mix, throwing away the more complex data, and your cost estimations won't match the<br />reality.<br /><br />The common advice is "split your case data in two, use one half for training, another half for testing". As you can see, how exactly your split the data, will matter. You don't want to change the mix too much. If you take the first half of the data for training and the second half for testing, and the data happens to be sorted on some parameter, you'll get the very different mixes in two halves. It can get as bad as the training in the first half being totally wrong for diagnosing the second half. You need to do the splitting in some more random way. Either by picking the cases by some random number generator, or another good approach is to split them by the timestamp: order the cases by the timestamp and then you can divide them into the first and second half.<br /><br />But this whole advice of splitting the data in two is not very good. It's good for the follow-up testing but not good for the initial testing. For the initial testing, you want to use the SAME data both for training and for testing. If you trained your model on some data and still can't diagnose everything right when testing with the exact same data, that will give you a good opportunity to analyze, what went wrong. Some people might say "oh, but you'll be over-fitting your model to the particular data". Over-fitting is rarely a problem for the Bayesian models, the typical problem is the opposite. And if the mix of cases in your training data matches the typical real mix, you'll be fitting the expectations close enough to the reality.<br /><br />Thus start with testing on the same data that you used for training. Identify all the cases of false positives and false negatives. See if there is something common with them. Look in depth at some of the cases, how did the model come up with this result? What caused it to miss the correct result? In my small examples, I've been showing the step-by-step intermediate results of applying every event. This is the kind of data you want to look at. Did the steps match what you expected? If they didn't then why, what values in the tables cause the computation to veer this way?<br /><br />This detailed data becomes large fairly quickly, so some automatic pre-processing can help with finding the source of trouble. In particular, I've been using the automated code that would pick, which events caused the biggest changes in the probabilities of the hypotheses, both the computed ones and the expected ones. Then it can be printed out in the more compact form that is easier to analyze at a glance. To get this kind of printout automatically, it helps to build the diagnostic support into the program itself. The good plan is to run the test data through the model once, pick the strange cases, and then run them through the model the second time, along with the information about the correct diagnosis and the diagnosis produced by the model. These known diagnoses can then be used to drive the debugging printouts during the second computation.<br /><br />The different algorithms will give different results. Running multiple algorithms on the same set of training data (with the same data used for training and testing) will let you see the upsides and downsides of each algorithm. That's how I've been coming up with my refinements. Look at what can be adjusted to make the algorithm work better. If one algorithm handles some cases well, and another one handles the other cases well, could we perhaps differentiate these cases somehow and then run them through the different algorithms? Some refinements will work well, some will crash and burn, try the different ones and pick the ones that work.<br /><br />And only after you've got a system that can do reasonably well on processing the data that were used for training, it's time to test it on the other data. Again, identify the cases that go wrong. In some cases the results will be better or worse than before simply because of a different mix on cases. If you see the same kinds of mistakes as when you tested with the training data but in different proportions, it's probably the mix issue. If you see the different mistakes, well, it means that there is something in this set of data that you haven't seen before and that throws your model off.<br /><br />A useful experiment is to split your data (nicely and randomly) in two, then use each half to train and test in turn. If we call these parts A and B, then you can do:<br /><br /><ul><li>train with part A, test with part A</li><li>train with part B, test with part B</li><li>train with part A, test with part B</li><li>train with part B, test with part A</li></ul><br />If you test each algorithm change on both halves of the data, you'll be able to see if it affects them in the same way or differently. Then test the training table you produced from one half of data on the other half of data. Did it do substantially differently than the same algorithm did when trained by the same data as used in the test? If yes then perhaps the mix in two halves of data is different.<br /><br />After your system is in production, you should still keep collecting the training data, and keep re-training the model. As the devices age, some of their parts will be coming to the end of life (by design or by mis-design), and this will be showing as the increasing frequency of their failures. This is something that you'd want to capture for the future diagnosis.<br /><br />And the diagnostic data that was useful for testing is useful in production too. It's useful in two ways: First, when the diagnosis turns out to be wrong, and the case goes to the human diagnostician, the human may benefit from knowing, why did the machine make this diagnosis. He might be able to pinpoint the error quickly and cheaply, without repeating the difficult steps. Second, you should always look back at what is going wrong in production? It would never be perfect but can we do it better? For that, it would help to take the misdiagnosed cases and analyze them further. Try to figure out, what went wrong, and the diagnostic information is what lets you find out what went wrong. You might be also able to use the feedback from the human diagnosticians.<br /><br />Oh, and a little tangent on the subject of the "ground truth". Recently I went to an internal conference on the subject of machine learning, and there they've been saying that there are the systems with the "ground truth" and systems without the "ground truth" (i.e. no known perfect solution, such as finding the topic of a text message). I disagree with that. I think that there are no systems without the "ground truth". There is always at least the "ground truth" benchmark of how would a human do at this task? Can the automated system match the human? Can the automated system beat the human? In that example of finding the topic of a test message, there obviously must be a body of training messages that the humans would read and formulate the topics. Then these messages can be used for testing of the automated models. And the good automated models must come reasonably close to what the humans did. Only after that can we use these automated models to process the real messages in the wild. Otherwise there is no way to tell if the results of these models are any good or if they are some complete garbage.<br /><br />There is always, ALWAYS a way to test your system and estimate, how good is it doing. If you didn't test it, you're not following the best practices, and you probably have a whole lot of bugs in your program. Testing the large volumes of data manually is impossible but the solution is to pick some samples and check carefully that these samples are producing the correct results. And of course there are other indications: for example, if you ever see a probability value of 15, this means that there is something very wrong with your computation.<br /><br />***<br /><br />That concludes what I had to say on the subject of the Bayesian expert systems. Unless I recall something that I forgot :-) I wrote up all the ideas I had, and that's actually more ideas than I've started with, I've had some interesting realizations as I went along. At some point I'll probably provide the code examples for the techniques discussed in the last few installments. We'll see how it will go with my time.<br /><br />Sergey Babkinhttp://www.blogger.com/profile/06686909700446011233noreply@blogger.com0