BTW, I took a class on machine learning recently, and it turns out that even with the neural network this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by the human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.
So, thinking about it, I've remembered again about the concept of conjunctions that I've read in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in the chapter 9.
The idea there is that the quality of AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses. The idea is to take the two best choices and try a conjunction on them. If it gets better, try adding the third one.
For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:
. ^y 0 | 1 | -------+------> | x 0 | 0
The two partial hypotheses (x > 0) and (y > 0) taken together but separately will identify the upper right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right corner will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right corner like (x > 1), (y > 1), (x > 2), (y > 2) and so on so that the upper-left and lower-right quarters would get better but the parts of the upper-right close to the boundary would get worse. If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and it would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.
The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.
They can be used to handle the XOR situation as well. Let's look at the situation with 1s in upper-right and lower-left corners:
. ^y 0 | 1 | -------+------> | x 1 | 0
The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get correctly voted by both hypotheses but 1s would get one vote for and one vote against. But the similar hypotheses can be added for the other two quarter: !((x > 0) & (y < 0)), !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite voted canceling each other for the 0s, and taken together these 4 hypotheses can identify everything correctly.
An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example. They're actually the worst ones, having the exact 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction they become optimizable.
The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:
. ^y 0 | 1 | -------+------> | x 0 | 0
If we use one XORed hypotheses, !((x > 0) XOR (y > 0)) and two simple ones (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right corner will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.
The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.
Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.