I've found out recently how the people in the data science compute the Bayesian inference for values from an arbitrary numeric range (as opposed to yes/no events). As it turns out, they do it using a computation by weights. I've been wondering for a while after discovering the computation by weight on my own, why nobody else uses it, it's so much simpler than by probabilities. So the answer is similar to what I've discovered before for the connection between the Bayes and neural networks: the people in the field do know about it and do use it, only the simplified popular explanations don't.
I wanted to write down the explanation of how they do it (at least, for myself in the future), and so I went back to read what I wrote before about the Bayesian computation by weight, and found that what I wrote before is quite complicated. I wrote it down as I discovered it when experimenting with all those extra parameters that make a Bayesian system work in reality, and so that explanation is also full of discussion of those parameters. Also, I've learned some things since then.
Before going into th enew ground, let me try a new take on the same thing: discuss the Bayesian inference by weight, only now in its most basic form.
Let's start again with an event E and hypothesis H. In the modern terms, they call this machine a Bayesian classifier, and call the mutually-exclusive hypotheses classes. The events are the inputs, and based on the values of the inputs the machine is trying to decide, which hypothesis is most likely to be true. The classic Bayesian formula for a binary event E is:
P(H|E) = P(H) * P(E|H) / P(E)
Here P(H) is the prior (i.e. previous) probability that the hypothesis is true, P(E) is the prior probability that the event E will be found true (after we learn the actual value of the event this probability collapses), P(E|H) is the conditional probability that the event would be true for the cases where hypothesis H is true, and finally P(H|E) is the new (posterior) probability of H after we learn that E is true. If we learn that E is false, we would compute P(H|~E) instead, for the complementary event ~E.
Where did that formula come from? Well, when talking about probabilities, it really helps to draw a diagram that starts with the range of all the possible cases and then splits it into parts with the appropriate weights. Let's draw this field as a square, in beautiful ASCII art. And then split it with a horizontal line so that the area above the line matches the number of the cases resulting in the hypothesis H1 being true, and below the line with H1 being false (i.e. the complementary hypothesis ~H1, which we'll also name H2, being true). This is for a very simple classification with two complementary hypotheses: H1 = ~H2 and H2 = ~H1.
+---------------+ | | | | H1 | | +---------------+ | | | | H2 = ~H1 | | +---------------+
Now let's do the same, but split the same square vertically. The left side will match the event E1, and the right side will be its complementary ~E1:
+---------+-----+ | | | | | | | | | | | | | | | | | | | | | +---------+-----+ E1 ~E1
Now let's take the second picture and split each vertical part horizontally, into two parts, that again match H1 on top and H2 on the bottom. I've marked these parts with the letters a, b, c, d.
+---------+-----+ | a | b | | | | |---------| | H1 | c | | | | | | |-----| | | d | H2 +---------+-----+ E1 ~E1
Here the parts a and b correspond to H1, and their total area is equal to the original area (within the ASCII-art limitations) of the upper part of the split in the first picture. The parts c and d correspond to H2, and again the sum of their areas is equal to the original area of the lower part. But obviously they're split differently on the left and right sides, meaning that if we know that E is true, H1 has a lower probability than H2, but if E is false, H1 has a higher probability. And if we don't differntiate based on E1, the left and right parts average out.
The areas of these parts a...d are the weights of four sub-divisions:
a: E & H1
b: ~E & H1
c: E & H2 (or equivalently E & ~H1)
d: ~E & H2 (or equivalently E & ~H1)
The reason why I prefer to use H2 instead of ~H1 is that this notation allows to generalize more obviously to more than two hypotheses: H3, H4, and so on. Each additional hypothesis would add two areas to the picture, one on the left and one on the right.
Now we can express the probabilities through relations of these areas (and areas can also be called weights):
P(H1) = (a + b) / (a + b + c + d) P(H2) = (c + d) / (a + b + c + d) P(E) = (a + c) / (a + b + c + d) P(~E) = 1 - P(E) = (b + d) / (a + b + c + d) P(E|H1) = a / (a + b) P(E|H2) = c / (c + d) P(H1|E) = a / (a + c) P(H2|E) = c / (a + c) P(H1|~E) = b / (b + d) P(H2|~E) = d / (b + d)
The general principle for P(x|y) is that the area that satisfies both conditions x and y becomes the numerator, and the area that satisfies y becomes the denominator.
Alright, let's substitute these formulas into the Bayesian formula:
P(H1) * P(E|H1) / P(E) = P(H1) * (1/P(E)) * P(E|H1) = ((a + b) / (a + b + c + d)) * ((a + b + c + d) / (a + c)) * (a / (a + b)) = a / (a + c) = P(H1|E) P(H2) * P(E|H2) / P(E) = P(H2) * (1/P(E)) * P(E|H2) = ((c + d) / (a + b + c + d)) * ((a + b + c + d) / (a + c)) * (c / (c + d)) = c / (a + c) = P(H2|E)
So that's where that formula comes from and how it gets proven. Note that the computation of both probabilities involves the final division by the same value (a + c). If we just want to compare them, this division by the same value doesn't matter and we can skip it. Instead we'll just get a and c, which are also the weights W(H1|E) and W(H2|E), and if we want to get the probabilities, we can normalize by dividing by their sum. Or I guess rather than W(H1|E) we could say W(H1 & E) but that's the beauty of workiing with the weights instead of probabilities: the probabilities require normalization at each step, dividing by the sum of all possible weights, while with weights the sum is kept implicit, and W(H1|E) = W(H1 & E) = W(E|H1). When expressed in weights, the formulas become simpler:
W(H1) = a + b W(H1|E) = W(H) * P(E|H1) = (a + b) * (a / (a + b)) = a
That's pretty much it. But there is one more question to consider: what if we have more than one event? We usually do have multiple events. After we compute P(Hi|E1) (or W(Hi|E1)), now we have a smaller rectangle left, with the level of the horizontal divider shifted from where it was in the previous square (or rectangle). What do we do with the next event? There are multiple ways to look at it.
One way is to hope that the events are completely independent from each other. This basically means that as we look at each part produced on splitting by E1 (E1 and ~E1), and further split each of these parts by E2, the horizontal lines in each vertical quarter shift according to the current level of H in the part being split, with the result that P(E2|E1,Hi) = P(E2|Hi). It would look something like this:
+----+----+--+--+ | e | f |g |h | |----| | | | | | | | | H1 | |----| | | | | |--| | | | | | | | | | |--| H2 +----+----+--+--+ E1 E1 ~E1 ~E1 E2 ~E2 E2 ~E2
The equality P(E2|E1,H1) = P(E2|H1) = P(E2|~E1,H1) would imply (even though it doesn't quite look like it in ASCII art):
e / (e + f) = a / (a + b) = g / (g+h)
That's a simple-minded implementation (in the scienific circles they call it naive, sometimes even spelled with two "i"s). The problem there is that when the assumption is not true and the events are strongly dependent, this can drive probabilities in weird ways.
Another way would be to build the tree of exact splits: split the slices produced by E1 into slices for E2, then for E3, and so on, and for each vertical slice after each split find the exact proportion of cases. This is obviously much more complex, and complexity grows exponentially with the number of events.
The third way (I think I've just realized it from reading the materials about data science) would be to track the splits by events pair-wise, do the vertical splits by each pair: (E1, E2), (E1, E3), ..., (E1, En), (E2, E3), (E2, E4), ..., (E2, En), and so on. I haven't quite thought through yet the adjustments that would need to be computed for each split. I guess, fundamentally it would be a representation of the covariance matrix (another word I've learned). But I guess there are two potential ways to do it: one would be to adjust the positions of the horizontal lines after the second split, another would be to adjust the positions of the vertical lines for the second split. I'm not sure which one is more correct. Maybe either way would adjust the ratio of the areas on the left and right to the same value. But either way, after you apply E1, adjust the probabilities of E2, E3, and the rest of the events according to this matrix. Then after applying E2, adjust E3, E4, and so on. It won't be a perfect match that can be produced with the exact splits but it would easily catch the events that are duplicates and near-duplicates of each other.
No comments:
Post a Comment