Saturday, April 15, 2023

Bayes 29: computation for numeric values (and Python)

As promised in the previous installment, here is how data science handles Bayesian inference directly on numeric values. Remember, from that installment, the formula for inference by weight is:

W(H[i]|E) = W(H[i]) * P(E|H[i])

So what they do is, instead of using the probability P(E|H[i]), they use the probability density p(E|H[i]) (note the lowercase p) from a previously computed distribution, E this time meaning not just a specific event variable but a specific numeric value of a specific variable. Which really makes sense, since that's what the probability density expresses: the relative likelihood that this specific value would come out of the distribution for this variable. Since it's a conditional distribution, only the training cases for the hypothesis H[i] are used to compute the distribution.

The typical distribution used here is the normal distribution. From what I understand, it's not only because it's the typical one occurring in reality, but also because for any original distribution, once you start repeatedly pulling samples from it and summing or averaging them, the results become described by the normal distribution (this comes out of the Central Limit Theorem). The normal distribution has two parameters: mu (the mean value) and sigma (the standard deviation, equal to the square root of the variance). The probability density function then is:

p(x) = (1/sqrt(2*pi*sigma^2)) * exp( -(x-mu)^2 / (2*sigma^2) )

Or, since sigma here is always squared, we can also replace the squared sigma with the variance var:

p(x) = (1/sqrt(2*pi*var)) * exp( -(x-mu)^2 / (2*var) )
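For example, with mu = 0 and var = 1, this gives p(0) = 1/sqrt(2*pi) ≈ 0.399 at the peak of the bell curve, and p(1) = 0.399 * exp(-1/2) ≈ 0.242 one standard deviation away from it.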

The values of mu and var would be different for each hypothesis H[i]. So we'd split all the training cases into subsets by mutually exclusive hypotheses, and then compute:

mu[i][E] = sum(cases[i][E]) / N[i]
var[i][E] = sum((cases[i][E] - mu[i][E])^2) / (N[i] - 1)

Here N[i] is the count of the training cases with the outcome H[i], and cases[i] are all the cases with that outcome, so effectively N[i] = count(cases[i]). The reason why we use (N[i] - 1) instead of plain N[i] is that this computes not the variance but the sample variance. This comes out of a rather interesting distinction between probability theory and statistics: probability theory describes random distributions whose parameters are known in advance, while statistics describes how to deduce these parameters from looking at samples of the data. Here we obviously don't know the distribution in advance but have to deduce it from looking at the training cases, i.e. the data samples, so we have to use statistics and its sample variance. Technically speaking, mu here is also not the true mean but the sample mean, although the formula for it is the same anyway. The need to divide by (N[i] - 1) comes from that sample mean: since it's computed from the same samples, the deviations from it come out on average slightly smaller than the deviations from the true mean would be, and dividing by (N[i] - 1) counter-adjusts the variance for this offset (this is known as Bessel's correction). If we magically knew the true mean, we'd divide by the plain N[i].
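For example, for the three sample values 1, 2, 4, the sample mean is (1 + 2 + 4)/3 = 7/3, the squared deviations from it add up to (4/3)^2 + (1/3)^2 + (5/3)^2 = 14/3, and the sample variance is (14/3)/2 = 7/3, while dividing by the plain N = 3 would produce the smaller value 14/9.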

And here is a good spot to talk about Python. As much as I detest the language, I can see why it's popular for data science: it has some very good libraries for this purpose: numpy, sklearn, matplotlib. They've even added the operator of matrix multiplication @ into the language for the convenience of these libraries. And the point of the matrix operations is not only that they allow writing things down in a form that is short and comfortable for the data scientists (although confusing as hell for the rest of us), but also that they move all the tight loops into the C++ code that is way faster than the Python code, and also allow the library to internally parallelize the computations as needed. So the computations in a slow interpreted language with questionable parallelism become quite fast and quite parallel. Aside from that, the libraries provide other conveniences. Sklearn even has a bunch of popular data sets built right into it. So in Python the code to compute mu and var looks like this:

import numpy as np

# X is a numpy matrix, where each row corresponds to a training case
# and each column to an input variable (i.e. a numeric event);
# Y is a numpy vector containing the outcomes for all the training cases

# Find the subset of the training cases that have the outcome i.
# This produces a boolean vector of the same length as Y, with True for
# each included case and False for each excluded case.
selector = (Y == i)

# Select the subset of training cases for i, keeping all the columns.
subset = X[selector, :]

# Count the included cases, which is the number of rows in the subset.
n = subset.shape[0]

# Compute the sample mean, by averaging across all rows (axis=0).
# The result is a row with a column per input.
mu[i] = subset.mean(axis=0)

# Compute the sample variance, again producing a row with a
# column per input. The tricky part is that mu[i] is a single row,
# so the subtraction operator automatically broadcasts it into a matrix
# that has as many rows as subset, with all the rows being the same.
# Sum adds up the columns across all the rows (axis=0) into one row.
# And division by a scalar divides each column by the same scalar.
# The result is a row with a column per input.
var[i] = np.square(subset - mu[i]).sum(axis=0) / (n - 1)
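Incidentally, that last line can also be written in one call as subset.var(axis=0, ddof=1), where ddof=1 ("delta degrees of freedom") tells numpy to divide by (n - 1) instead of n.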

But wait, there is more. Note how the probability density function has an exponential function in it. Instead of computing the weights by taking the exponentials and multiplying them, we could compute the logarithms of the weights by adding up the logarithms (and the logarithm of the exponential is just its argument, saving the computation of this function). So the formula becomes:

logW(H[i]|E) = logW(H[i]) + log(p(E|H[i])) =
  = logW(H[i]) + log(1/sqrt(2*pi*var[i][E]))
             - (x[E] - mu[i][E])^2 / (2*var[i][E])

The part log(1/sqrt(2*pi*var[i][E])) is a constant that can be pre-computed in advance, so the whole computation comes down to a few additions and multiplications, and is quite efficient.
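To illustrate, here is a minimal sketch of this scoring in numpy (the function and argument names here are my own invention):

import numpy as np

# logw_prior: vector with the prior log-weight of each hypothesis;
# mu, var: matrices with a row per hypothesis and a column per input;
# x: one row of observed input values.
# Returns the vector of log-weights of the hypotheses given x.
def log_weights(x, logw_prior, mu, var):
    const = np.log(1.0 / np.sqrt(2 * np.pi * var))  # the pre-computable part
    return logw_prior + np.sum(const - np.square(x - mu) / (2 * var), axis=1)

The most likely hypothesis is then simply the np.argmax() of the result. This is essentially what the GaussianNB classifier in sklearn does.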

Monday, April 10, 2023

Bayes 28: computation by weight revisited

I've found out recently how people in data science compute the Bayesian inference for values from an arbitrary numeric range (as opposed to yes/no events). As it turns out, they do it using a computation by weights. Ever since discovering the computation by weight on my own, I've been wondering why nobody else uses it, it's so much simpler than by probabilities. So the answer is similar to what I've discovered before for the connection between Bayes and neural networks: the people in the field do know about it and do use it, only the simplified popular explanations don't.

I wanted to write down the explanation of how they do it (at least, for myself in the future), and so I went back to read what I wrote before about the Bayesian computation by weight, and found that it's quite complicated. I wrote it down as I discovered it while experimenting with all those extra parameters that make a Bayesian system work in reality, and so that explanation is also full of discussion of those parameters. Also, I've learned some things since then.

Before going into the new ground, let me try a new take on the same thing: discuss the Bayesian inference by weight, only now in its most basic form.

Let's start again with an event E and a hypothesis H. In modern terms, they call this machine a Bayesian classifier, and call the mutually-exclusive hypotheses classes. The events are the inputs, and based on the values of the inputs the machine is trying to decide which hypothesis is most likely to be true. The classic Bayesian formula for a binary event E is:

P(H|E) = P(H) * P(E|H) / P(E)

Here P(H) is the prior (i.e. previous) probability that the hypothesis is true, P(E) is the prior probability that the event E will be found true (after we learn the actual value of the event, this probability collapses), P(E|H) is the conditional probability that the event would be true for the cases where the hypothesis H is true, and finally P(H|E) is the new (posterior) probability of H after we learn that E is true. If we learn that E is false, we would compute P(H|~E) instead, for the complementary event ~E.

Where did that formula come from? Well, when talking about probabilities, it really helps to draw a diagram that starts with the range of all the possible cases and then splits it into parts with the appropriate weights. Let's draw this field as a square, in beautiful ASCII art. And then split it with a horizontal line so that the area above the line matches the number of the cases resulting in the hypothesis H1 being true, and below the line with H1 being false (i.e. the complementary hypothesis ~H1, which we'll also name H2, being true). This is for a very simple classification with two complementary hypotheses: H1 = ~H2 and H2 = ~H1.

+---------------+
|               |
|               | H1
|               |
+---------------+
|               |
|               | H2 = ~H1
|               |
+---------------+

Now let's do the same, but split the same square vertically. The left side will match the event E1, and the right side will be its complementary ~E1:

+---------+-----+
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
+---------+-----+
    E1      ~E1

Now let's take the second picture and split each vertical part horizontally into two parts that again match H1 on top and H2 on the bottom. I've marked these parts with the letters a, b, c, d.

+---------+-----+
| a       | b   |
|         |     |
|---------|     | H1
| c       |     |
|         |     |
|         |-----|
|         | d   | H2
+---------+-----+
    E1      ~E1

Here the parts a and b correspond to H1, and their total area is equal to the original area (within the ASCII-art limitations) of the upper part of the split in the first picture. The parts c and d correspond to H2, and again the sum of their areas is equal to the original area of the lower part. But obviously they're split differently on the left and right sides, meaning that if we know that E is true, H1 has a lower probability than H2, but if E is false, H1 has a higher probability. And if we don't differentiate based on E1, the left and right parts average out.

The areas of these parts a...d are the weights of four sub-divisions:

a: E & H1
b: ~E & H1
c: E & H2 (or equivalently E & ~H1)
d: ~E & H2 (or equivalently ~E & ~H1)

The reason why I prefer to use H2 instead of ~H1 is that this notation generalizes more obviously to more than two hypotheses: H3, H4, and so on. Each additional hypothesis would add two areas to the picture, one on the left and one on the right.

Now we can express the probabilities through the ratios of these areas (and the areas can also be called weights):

P(H1) = (a + b) / (a + b + c + d)
P(H2) = (c + d) / (a + b + c + d)
P(E) = (a + c) / (a + b + c + d)
P(~E) = 1 - P(E) = (b + d) / (a + b + c + d)
P(E|H1) = a / (a + b)
P(E|H2) = c / (c + d)
P(H1|E) = a / (a + c)
P(H2|E) = c / (a + c)
P(H1|~E) = b / (b + d)
P(H2|~E) = d / (b + d)

The general principle for P(x|y) is that the area that satisfies both conditions x and y becomes the numerator, and the area that satisfies y becomes the denominator.

Alright, let's substitute these formulas into the Bayesian formula:

P(H1) * P(E|H1) / P(E) = P(H1) * (1/P(E)) * P(E|H1)
  = ((a + b) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (a / (a + b))
  = a / (a + c)
  = P(H1|E)
  
P(H2) * P(E|H2) / P(E) = P(H2) * (1/P(E)) * P(E|H2)
  = ((c + d) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (c / (c + d))
  = c / (a + c)
  = P(H2|E)

So that's where that formula comes from and how it gets proven. Note that the computation of both probabilities involves the final division by the same value (a + c). If we just want to compare them, this division by the same value doesn't matter and we can skip it. Instead we'll just get a and c, which are also the weights W(H1|E) and W(H2|E), and if we want the probabilities, we can normalize by dividing by their sum. Or I guess rather than W(H1|E) we could say W(H1 & E), but that's the beauty of working with the weights instead of probabilities: the probabilities require a normalization at each step, dividing by the sum of all possible weights, while with weights the sum is kept implicit, and W(H1|E) = W(H1 & E) = W(E|H1). When expressed in weights, the formulas become simpler:

W(H1) = a + b
W(H1|E) = W(H1) * P(E|H1) = (a + b) * (a / (a + b)) = a
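Here is a tiny numeric check of this update in Python (the area values are made up):

# Made-up areas a, b, c, d from the diagram.
a, b, c, d = 0.2, 0.3, 0.4, 0.1
w_h1 = (a + b) * (a / (a + b))  # W(H1|E) = W(H1) * P(E|H1) = a
w_h2 = (c + d) * (c / (c + d))  # W(H2|E) = W(H2) * P(E|H2) = c
total = w_h1 + w_h2
print(w_h1 / total)  # P(H1|E) = a / (a + c) = 0.333...
print(w_h2 / total)  # P(H2|E) = c / (a + c) = 0.666...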

That's pretty much it. But there is one more question to consider: what if we have more than one event? We usually do have multiple events. After we compute P(Hi|E1) (or W(Hi|E1)), now we have a smaller rectangle left, with the level of the horizontal divider shifted from where it was in the previous square (or rectangle). What do we do with the next event? There are multiple ways to look at it.

One way is to hope that the events are completely independent of each other. This basically means that as we take each part produced by splitting on E1 (E1 and ~E1) and further split each of these parts by E2, the horizontal lines in each vertical quarter shift according to the current level of H in the part being split, with the result that P(E2|E1,Hi) = P(E2|Hi). It would look something like this:

+----+----+--+--+
| e  | f  |g |h |
|----|    |  |  |
|    |    |  |  | H1
|    |----|  |  |
|    |    |--|  |
|    |    |  |  |
|    |    |  |--| H2
+----+----+--+--+
 E1   E1   ~E1 ~E1
 E2   ~E2  E2  ~E2

The equality P(E2|E1,H1) = P(E2|H1) = P(E2|~E1,H1) would imply (even though it doesn't quite look like it in ASCII art):

e / (e + f) = (e + g) / (a + b) = g / (g + h)

That's a simple-minded implementation (in the scientific circles they call it naive, sometimes even spelled with two "i"s). The problem is that when this assumption is not true and the events are strongly dependent, it can drive the probabilities in weird ways.
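For the record, here is a minimal sketch of this naive update by weight (all the names in it are made up for illustration):

# weights: dict hypothesis -> current weight W(H);
# cond_prob: dict (hypothesis, event) -> P(event is true | hypothesis);
# events: dict event -> the observed boolean value.
def naive_update(weights, cond_prob, events):
    for h in weights:
        for e, val in events.items():
            p = cond_prob[(h, e)]
            # Multiply by P(E|H) for a true event, by P(~E|H) for a false
            # one, treating every event as independent of the others.
            weights[h] *= p if val else (1 - p)
    return weights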

Another way would be to build the tree of exact splits: split the slices produced by E1 into slices for E2, then for E3, and so on, and for each vertical slice after each split find the exact proportion of cases. This is obviously much more complex, and complexity grows exponentially with the number of events.

The third way (I think I've just realized it from reading the materials about data science) would be to track the splits pairwise by events, doing the vertical splits for each pair: (E1, E2), (E1, E3), ..., (E1, En), (E2, E3), (E2, E4), ..., (E2, En), and so on. I haven't quite thought through yet the adjustments that would need to be computed for each split. I guess, fundamentally it would be a representation of the covariance matrix (another word I've learned). But I guess there are two potential ways to do it: one would be to adjust the positions of the horizontal lines after the second split, another would be to adjust the positions of the vertical lines for the second split. I'm not sure which one is more correct. Maybe either way would adjust the ratio of the areas on the left and right to the same value. But either way, after you apply E1, adjust the probabilities of E2, E3, and the rest of the events according to this matrix. Then after applying E2, adjust E3, E4, and so on. It won't be a perfect match to what can be produced with the exact splits, but it would easily catch the events that are duplicates and near-duplicates of each other.
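For the record, numpy can compute that covariance matrix directly (the data here is made up; rowvar=False tells it that the variables are in the columns, as in the earlier examples):

import numpy as np

# Rows are training cases, columns are the input variables (events).
# The second column is a near-duplicate of the first, doubled.
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, 6.0]])
print(np.cov(X, rowvar=False))  # the large off-diagonal element exposes
                                # the near-duplicate pair of columns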