Saturday, September 30, 2017

removing the duplication in Bayesian & AdaBoost

I've been thinking some more on how to remove the duplication between the Bayesian events/AdaBoost partial hypotheses. Here are some notes. They're not some final results, just some thinking aloud.

I think the problem of removing the commonality can be boiled down to the following: suppose we have an event A whose prediction has some effect a on the training cases, and an event B whose prediction has the effect b. These predictions have some part ab (that's a 2-letter identifier, not multiplication of a and b) in common. Then the effects can be written as:

A: a = a' + ab
B: b = b' + ab

Where a' and b' are the remainders of a and b after taking out ab. Here the operator "+" means not the arithmetic addition but application of the events in the deduction one after another: "a + b" means "apply a, then apply b". The reason why I chose the sign "+" is that this combined application should act like addition, i.e. have the commutative and associative properties like

x + y = y + x
(x + y) + z = x + (y + z)

No, I don't have a proof that they hold. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of the partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutativity requirement. The exact operation that satisfies these rules is yet to be found, but for now we can follow through the consequences of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part ab gets applied twice. To remove this duplication we've got to get rid of one copy of ab. We can do this by splitting the original two events into three events A', B' and AB (again, that's a 2-letter identifier, not a multiplication):

A': a'
B': b'
AB: ab
a = a' + ab; b = b' + ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' + ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' and c
b'c would be the overlap between b' and c
abc would be the overlap between ab and c

And the double-primes would be the "remainders":

a' = a'' + a'c
b' = b'' + b'c
ab = ab' + abc
c = c' + a'c + b'c + abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (as a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does "overlap" mean and how do we compute the value of ab from the values of a and b?

The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split a and b into a', b', and ab, we would be splitting the whole set of training cases into 3 parts: one part gets predicted well by both A and B, one only by A, and one only by B.
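
Here is a little sketch in Python of how this set interpretation might look. The concrete training case ids and which event predicts them are made up for illustration; the point is only that the splitting from the formulas above produces disjoint parts that cover the union exactly once:

# A toy model of the set interpretation: the "effect" of an event is the set
# of training cases it predicts well. The case ids here are made up.
from itertools import combinations

a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7}
c = {2, 5, 7, 8}

# Split into the disjoint parts from the formulas above.
ab = a & b                       # the common part of A and B
a1 = a - ab                      # a'
b1 = b - ab                      # b'
abc = ab & c                     # abc
ab1 = ab - abc                   # ab'
a1c = a1 & c                     # a'c
b1c = b1 & c                     # b'c
a2 = a1 - a1c                    # a''
b2 = b1 - b1c                    # b''
c1 = c - (a1c | b1c | abc)       # c'

parts = [a2, a1c, b2, b1c, ab1, abc, c1]

# Every training case is now covered by exactly one part: no duplication.
assert all(p & q == set() for p, q in combinations(parts, 2))
assert set().union(*parts) == a | b | c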

Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.
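
As a rough sketch of that capping, with a made-up mapping of split events to the training cases they predict and an arbitrary Emax:

# Keep only the Emax split events that predict the most training cases,
# sweep the rest under the carpet. The mapping here is made up.
split_events = {
    "a''": {1, 3}, "a'c": {2}, "b''": {6}, "b'c": {7},
    "ab'": {4}, "abc": {5}, "c'": {8},
}
EMAX = 4   # an arbitrary cap on the number of events we're willing to use

kept = sorted(split_events, key=lambda e: len(split_events[e]), reverse=True)[:EMAX]
dropped = [e for e in split_events if e not in kept]
print("kept:", kept)
print("dropped:", dropped)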

The last sentence also suggests a third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues with the partial confidences during the execution of the model, which can be handled better with the combined events.

Returning to the definition of "+", I couldn't figure out yet how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms; that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.
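
To show what I mean, here is a tiny sketch of such a transformed table of per-case confidences. The ownership of the cases and the confidence values are made up; the only real point is the 0.5 for the cases that an event doesn't affect:

# After the splitting, each training case is "owned" by exactly one event;
# an event votes with its confidence on its own cases and with the neutral
# 0.5 on everyone else's cases. The numbers are made up.
cases = [1, 2, 3, 4]
owner = {1: "A'", 2: "A'", 3: "B'", 4: "AB"}     # which event predicts which case
confidence = {"A'": 0.9, "B'": 0.8, "AB": 0.7}   # per-event confidence on its own cases

table = {
    event: [confidence[event] if owner[c] == event else 0.5 for c in cases]
    for event in confidence
}
for event, row in table.items():
    print(event, row)
# A' [0.9, 0.9, 0.5, 0.5]
# B' [0.5, 0.5, 0.8, 0.5]
# AB [0.5, 0.5, 0.5, 0.7]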

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of the votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of number of events being limited by the number of training cases is pretty much equivalent to the computation directly on the training cases by weight.

AdaBoost conjunctions & XOR

I've been thinking recently about the better ways to extract the commonality from two Bayesian events (hope to find time to write about that soon), and that brought me back to the XOR problem: how to handle the situation when two events indicate a certain hypothesis when they are the same and another hypothesis when they are different.

BTW, I took a class on machine learning recently, and it turns out that even with the neural networks this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.

So, thinking about it, I've remembered again about the concept of conjunctions that I've read about in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in chapter 9.

The idea there is that the quality of the AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses. The way to do it is to take the two best choices and try a conjunction of them. If it gets better, try adding a third one.

For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:

.
              ^y
          0   |   1
              |
       -------+------>
              |      x
          0   |   0

The two partial hypotheses (x > 0) and (y > 0) taken together but separately will identify the upper-right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right quarters will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right quarter like (x > 1), (y > 1), (x > 2), (y > 2) and so on, so that the upper-left and lower-right quarters would get better but the parts of the upper-right quarter close to the boundary would get worse.

If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and that would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.

The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.

They can be used to handle the XOR situation as well. Let's look at the situation with 1s in upper-right and lower-left corners:

.
              ^y
          0   |   1
              |
       -------+------>
              |      x
          1   |   0

The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get the correct votes from both hypotheses but the 1s would get one vote for and one vote against. But similar hypotheses can be added for the other two quarters: !((x > 0) & (y < 0)), !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite votes canceling each other for the 0s, and taken together these 4 hypotheses can identify everything correctly.
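
This voting is easy to check numerically on a toy grid (the grid itself is made up, the 4 hypotheses are the ones above):

# Toy check of the 4 conjunction-based hypotheses on the XOR-like pattern:
# the label is 1 in the upper-right and lower-left quadrants.
import itertools

points = [(x, y) for x, y in itertools.product(range(-2, 3), repeat=2) if x and y]
label = {(x, y): 1 if (x > 0) == (y > 0) else 0 for x, y in points}

hypotheses = [
    lambda x, y: 1 if x > 0 and y > 0 else 0,      # ((x > 0) & (y > 0))
    lambda x, y: 1 if x < 0 and y < 0 else 0,      # ((x < 0) & (y < 0))
    lambda x, y: 0 if x > 0 and y < 0 else 1,      # !((x > 0) & (y < 0))
    lambda x, y: 0 if x < 0 and y > 0 else 1,      # !((x < 0) & (y > 0))
]

def vote(x, y):
    return 1 if sum(h(x, y) for h in hypotheses) > 2 else 0   # simple majority of 4

print(all(vote(x, y) == label[(x, y)] for x, y in points))    # True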

An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example. They're actually the worst ones, having exactly 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for something slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction do they become optimizable.

The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:

.
              ^y
          0   |   1
              |
       -------+------>
              |      x
          0   |   0

If we use one XORed hypothesis, !((x > 0) XOR (y > 0)), and two simple ones, (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right quadrant will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.
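
Again checking on a toy grid (a made-up grid, the 3 hypotheses from above):

# Toy check: one XORed hypothesis and two simple ones on the data with 1s
# only in the upper-right quadrant.
import itertools

points = [(x, y) for x, y in itertools.product(range(-2, 3), repeat=2) if x and y]
label = {(x, y): 1 if x > 0 and y > 0 else 0 for x, y in points}

hypotheses = [
    lambda x, y: 0 if (x > 0) != (y > 0) else 1,   # !((x > 0) XOR (y > 0))
    lambda x, y: 1 if x > 0 else 0,                # (x > 0)
    lambda x, y: 1 if y > 0 else 0,                # (y > 0)
]

def accuracy(h):
    return sum(h(x, y) == label[(x, y)] for x, y in points) / len(points)

def vote(x, y):
    return 1 if sum(h(x, y) for h in hypotheses) >= 2 else 0  # majority of 3

print([accuracy(h) for h in hypotheses])                      # [0.75, 0.75, 0.75]
print(all(vote(x, y) == label[(x, y)] for x, y in points))    # True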

The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.

Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.

Monday, September 25, 2017

Boosting book reference

I've accidentally found Freund and Schapire's book on AdaBoost online:

https://mitpress.mit.edu/sites/default/files/titles/content/boosting_foundations_algorithms/toc.html

Great for searching.

Tuesday, September 19, 2017

Reversible computing

I've read an article in the IEEE magazine about the reversible computing: https://spectrum.ieee.org/computing/hardware/the-future-of-computing-depends-on-making-it-reversible

I've never thought about it before: the normal switching in the circuits dissipates the energy continuously, and the way to fix it would be to make sure that the energy is always shuffled around and never dissipated (although of course there will always be the "friction losses"). I haven't found that much other information on the Internet. So I've started thinking about how such reversible computing could be implemented, based on the example from the article. And I think I've come up with a decently simple implementation on the digital level, not sure why the article says that it's complicated. Of course, it might end up extremely complicated in the implementation, I know nothing about that. So, here is what I've come up with:

The first obvious problem was how do the values ("billiard balls" from the mechanical analogy) get returned back? That has a kind of obvious fundamental solution: in all the digital automata the information fundamentally goes in circles like this:

.
          +-----+ 
          |     |---> output   
          |     |   
          |     |    +------+
          |     |    |      |
input --->|logic|--->|memory|--+
          |     |    |      |  |
      +-->|     |    +------+  |
      |   +-----+              |              
      +------------------------+

Well, mostly goes in circles, there are inputs and outputs. So the next problem is how to make the inputs and outputs work. Note that the inputs don't come from nowhere, and outputs don't go into nowhere, instead they connect to the other circuits. So the inputs and outputs can be also made in the form of cycles, with the interaction between the output of one circuit and input of another circuit:


.
output --->-+ | +-<-- recycle
            | > |
recycle <---+ | +---> input



The output/input interaction can be represented as a function with the inputs:

x - the output value
1 - the constant 1

and outputs:

rx - recycling of the output value
i - the input sent to the circuit
ri - recycling of the input value

The truth table for this function will be:

x  1  |  rx i  ri
==================
0  1  |  0  0  1
1  1  |  1  1  0

As you can see, the number of 1s in the output always matches the number of 1s in the inputs. If x=1, the constant 1 gets shipped to the input, otherwise it goes to the recycling and back into the constant.
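
This function is trivial to model and check for the conservation of the 1s (the naming below is mine):

# The output/input interaction from the truth table above:
# inputs (x, 1), outputs (rx, i, ri); the count of 1s is always conserved.
def io_interaction(x):
    # rx = recycled output value, i = input to the next circuit, ri = recycled 1
    return (1, 1, 0) if x == 1 else (0, 0, 1)

for x in (0, 1):
    rx, i, ri = io_interaction(x)
    assert x + 1 == rx + i + ri   # 1s in == 1s out: nothing gets dissipated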

The next problem is the memory: it has to store a value that might be 0 or 1, which is obviously not symmetric. But it looks like the basic solution for that is easy: for each memory cell keep 2 sub-cells, one of them storing the value, and another one its complement. So there will always be one packet of energy stored, and the value would be determined by the sub-cell where it's stored. Then the contents of one sub-cell would be sent out for the next step of computations, and the other sub-cell will stay until the new value gets stored. I guess this could also be expressed equivalently as a constant 1 that gets connected to one or another output of the memory unit.
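
In other words, the memory cell could be modeled like this (just a sketch of the idea, the class and its names are mine):

# A memory cell as two complementary sub-cells: exactly one of them holds
# the single packet of energy, and the stored value is which one it is.
class ReversibleCell:
    def __init__(self, value=0):
        self.sub = [1 - value, value]   # sub[0]: complement, sub[1]: value
    def read(self):
        assert sum(self.sub) == 1       # always exactly one packet stored
        return self.sub[1]
    def store(self, value):
        self.sub = [1 - value, value]   # the displaced packet goes to recycling

cell = ReversibleCell(1)
print(cell.read())   # 1
cell.store(0)
print(cell.read())   # 0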

How would the recycling be done? For that we've got to connect the 1s in the results back to where they came from. And ultimately they all come from the constant 1s, either introduced in the computations or used in the logic of the inputs. For all I can tell, by the time the bits get recycled, they get really mixed up, and there is no easy way to put them back. The only universal way I see to put them back would be to order all the original "constant 1s" and all the recycled values (the specific order doesn't matter, just there has to be some order) and then do the logic like this:

If the 1st recycled value is 1, put it into the 1st "constant 1".
If the 2nd recycled value is 1 {
  If the 1st "constant 1" is still 0, put the 2nd recycled value into it,
  else put the 2nd recycled value into the 2nd "constant 1".
}
If the 3rd recycled value is 1 {
  If the 1st "constant 1" is still 0, put the 3rd recycled value into it,
  else if the 2nd "constant 1" is still 0, put the 3rd recycled value into it,
  else put the 3rd recycled value into the 3rd "constant 1".
}

and so on. As you can see, the logic can be parallelized to some extent but still will require as many steps as there are input values. So for the large circuits it becomes quite slow. Maybe this can be optimized in some way for some specific applications but that would be quite difficult. Maybe that's what they mean when they say that the design of the reversible circuits is difficult.
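
In code form this routing amounts to something like the following sketch, sequential in the number of recycled values, which is exactly what makes it slow:

# The recycling routing from above: pack all the recycled 1s back into the
# ordered slots of "constant 1s", filling the slots from the start.
def recycle(recycled, n_constants):
    constants = [0] * n_constants
    for value in recycled:                     # one step per recycled value
        if value == 1:
            constants[constants.index(0)] = 1  # the first "constant 1" slot still at 0
    return constants

print(recycle([1, 0, 1, 1], 4))   # [1, 1, 1, 0]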

But I think that there is a simple solution: make each individual circuit small. Maybe limit it to 2 inputs and possibly one additional constant 1. That would limit the recycling to 3 steps, which is reasonably fast. Then use the input/output method described above to connect these small circuits into the large circuits. Then as the inputs propagate through the layers of these simple circuits, the early layers would compute their recycling in parallel with that propagation, so overall there will be very little penalty.

Note that there would be no memory between these layers of small circuits: the whole construction would be mostly layers of the small circuits, with one memory layer at the end.

The classic digital electronics has the implementation of functions like AND and OR with a large number of inputs, not just 2, that get computed in the same time as a function with 2 inputs, which comes in quite handy. This won't be that easy for the reversible circuits. But there probably is no point in trying to do this for the reversible circuits in the first place: even if a function of N arguments were computed in one step, it would still have to do the recycling in N steps. On the other hand, such a function of N arguments can be built as a tree hierarchy of 2-argument functions, of the depth log2(N). This tree would be computed in log2(N) steps, and the recycling of the last element will take 5 steps (the recycling of the previous elements could be done in parallel). The number 5 comes from a logical operation on 2 inputs producing 3 outputs to stay reversible, as given in an example in the article (and I suspect that it's a universal property of such operations), plus the 2 inputs having 1 recycling output each. So the total time would be (log2(N)+5), which is better than N. Even if we include the I/O overhead, that would still be (2*log2(N)+5), which is still better than N.

Thus I think that the basic design of the reversible circuits on the digital level with the "billiard ball analogs" looks not that difficult. It might be that I've missed something and got everything wrong. And of course, the implementation on the level of the underlying electronic elements like transistors might be very difficult. I'm still trying to figure out how it could be done, and since my knowledge about the transistors is fairly limited, it's difficult for me. I have some ideas but I'd need to think some more about them before writing them down (and they might be completely wrong anyway).

A bit on a side subject, about the logic. In the article they use the example of the function [(NOT A) AND B], which looks kind of unconventional (but maybe they have some good reasons, one of them being the convenience of producing both the result and its complement). Well, there is more than one way to create a set of operations that can represent all the boolean functions. One is the set of (AND, OR, NOT), then (AND-NOT) and (OR-NOT) each do everything in one operation (and perhaps [(NOT A) AND B] does that too, I forgot how to check that), and yet another set is (XOR, constant 1). The XOR gate looks more straightforward to me, it was the first one I've thought of. And it turns out that the reversible XOR is quite popular in the reversible computing, where it's known as the CNOT (controlled-NOT) gate, with the Toffoli gate being its 3-input relative. But in the end it probably doesn't matter much. Or maybe the best way is to use the set of (AND, OR, NOT). The reason for that being that I think OR of 2 arguments can get by with only 2 outputs, and so does NOT, only AND requiring 3 outputs. Which means one fewer recycling step after the OR, which means a slightly higher performance and about 10% (half of the saved 20%) simpler recycling circuitry compared to the situation when all the circuitry is built of one universal function like (AND-NOT) or (OR-NOT) or (XOR).

Wednesday, September 6, 2017

AdaBoost 10: a better metric?

Here is one more consequence I've noticed from the application of AdaBoost approach to the Bayesian one: it can be applied the other way around too.

I.e. AdaBoost provides the list of partial hypotheses and weights for them. But at some point it tends to start repeating the same hypotheses over and over again, because a set of hypotheses taken together gives some gain, and applying it multiple times seemingly increases this gain. This gain is subject to the same problems as the Bayesian computations with the duplicate events: it increases the confidence only seemingly. So the same solution can be applied to it:

After the set of partial AdaBoost hypotheses has been computed, they can be re-ordered in such a way as to remove the duplication, by selecting on every step the partial hypothesis with the smallest effect. This "run order" can be kept and updated on every step, instead of the AdaBoost's "selection order". So when the sequence starts repeating, a duplicate hypothesis would add nothing to the cumulative effect of the "run order", and at this point the selection algorithm should try a different hypothesis (or simply stop).
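
A rough sketch of this re-ordering, with the effect of a hypothesis crudely represented as the set of training cases it gets right (just a stand-in for the real AdaBoost computation, to show the shape of the logic):

# Build the "run order": on every step pick the hypothesis that adds the
# least to what the sequence already covers; a duplicate adds nothing.
def run_order(hypotheses):
    order, covered = [], set()
    remaining = dict(hypotheses)   # name -> set of training cases it gets right
    while remaining:
        name = min(remaining, key=lambda n: len(remaining[n] - covered))
        if covered and not (remaining[name] - covered):
            print("adds nothing, stop or try another:", name)
        order.append(name)
        covered |= remaining.pop(name)
    return order

hyps = [("h1", {1, 2, 3}), ("h2", {3, 4}), ("h3", {1, 2, 3})]   # h3 repeats h1
print(run_order(hyps))   # ['h2', 'h1', 'h3'], with a note about h3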

I guess, in case the selection algorithm doesn't stop but tries to do the next best pick (and then maybe another one if the second choice is no good either), it becomes fundamentally more complicated: on each iteration of AdaBoost it should select not simply the hypothesis that maximizes the effect of the whole sequence when appended to its end, but the one that maximizes the effect of the whole sequence when inserted at the position where it adds the least effect. Each such insert could also cause a re-ordering of the part of the sequence that follows it.

Which sounds even more heavy-weight than the original AdaBoost selection of the hypotheses. On the other hand, since the AdaBoost selection is already so heavy-weight, maybe this would add little extra overhead. I don't know how to analyze it yet.