As promised in the previous installment, here is the how the data science handles the Bayesian inference directly on the numeric values. Remember from that last installment, the formula for inference by weight is:
W(H[i]|E) = W(H[i]) * P(E|H[i])
So what they do is instead of using the probability P(E|H[i]), they use the probability density p(E|H[i]) (note the lowercase p) from a previously computed distribution, this time E meaning not just a specific event variable but a specific numeric value of a specific variable. Which really makes sense, since that's what the probability density means: the probability that this specific value would come out in the distribution for this variable. Since it's a conditional distribution, only the training cases for the hypothesis H[i] are used to compute the distribution.
The typical distribution used here is the normal distribution. For what I understand, it's not only because it's the typical one occurring in reality, but also because for any original distribution, once you start repeatedly pulling the examples from it, the values produced by these examples become described by the normal distribution (if I remember right, this comes out of the Central Limit Theorem). The normal distribution has two parameters: mu (the mean value) and sigma (standard deviation, equal to the square root of the variance). The probability density function then is:
p(x) = (1/sqrt(2*pi*sigma^2)) * exp( -(x-mu)^2 / (2*sigma^2) )
Or, since sigma here is always squared, we can also replace the squared sigma with the variance var:
p(x) = (1/sqrt(2*pi*var)) * exp( -(x-mu)^2 / (2*var) )
The values of mu and var would be different for each hypothesis H[i]. So we'd split all the training cases into subsets by mutually exclusive hypotheses, and then compute:
mu[E][i] = sum(cases[i][E]) / N[i] var[E][i] = sum((cases[i][E] - mu[E][i])^2) / (N[i] - 1)
Here N[i] is the count of the training cases with the outcome H[i], and cases[i] are all the cases with that outcome, effectively N[i] = count(cases[i]). The reason why we use (N[i] - 1) instead of plain N[i] is that it computes not a variance but a sample variance. This comes out of a rather interesting distinction between the probability theory and statistics: the probability theory describes the random distributions where the parameters of these distributions are known in advance, while statistics describes how to deduce these parameters from looking at the samples of the data. Here we obviously don't somehow know the distribution in advance but have to deduce it from looking at the training cases, i.e. the data samples, so we have to use the statistics and its sample variance. Technically speaking, mu here is also not the true mean but the sample mean, but the formula for it is the same anyway. However the need to divide by (N[i] - 1) comes from the sample mean producing a different value that is offset from the true mean, sample variance counter-adjusting for this offset. And the reason for this offset is that we do the sampling without replacement (i.e. each training case is used at most once in the computation). If we did the sampling with replacement, for example select 1000 values at random from the training cases (each time from the whole set of the training cases, not excluding the cases that have already been selected but allowing them to be selected again), then when computing the sample variance we'd use the plain N[i], i.e. the same 1000 rather than 999. Or if we magically knew the true mean, we'd use N[i] for the sample variance too.
And here is a good spot to talk about Python. As much as I detest the language, I see why it's popular for the data science: because it has some very good libraries in it for this purpose: numpy, sklearn, matplotlib. They've even added an operator of matrix multiplication @ into the language for the convenience of these libraries. And the point of the matrix operations is not only that they allow to write down things in a form that is short and comfortable for the data scientists (although confusing as hell for the rest of us), but also that they allow to move all the tight loops into the C++ code that is way faster than the Python code, and also allows the library to internally parallelize the computations as needed. So the computations in a slow interpreted language with questionable parallelism become quite fast and quite parallel. Aside from that, the libraries provide other conveniences. Sklearn even has a bunch of popular data sets built right into it. So in Python the code to compute mu and var looks like this:
# X is a numpy matrix, where each row corresponds to a training case # and each column to an input variable (i.e. a numeric event); # Y is a numpy vector containing the outcomes for all the training cases # Find the subset of the training case that have the outcome i, # it produces a vector of the same size with True for # each included case and False for each excluded case. selector = (Y == i) # Select the subset of training cases for i, keeping all the columns subset = X[selector, :] # Count the included cases, which is the number of rows in the subset n = subset.shape[0] # Compute the sample mean, by averaging across all rows (axis=0). # The result is a row with a column per input. mu[i] = subset.mean(axis = 0) # Compute the sample variance, again producing a row with a # column per input. The tricky part is that mu[i] is a single row, # so the subtraction operator automatically considers it as a matrix # that has as many rows as subset, with all the rows being the same. # Sum adds up the columns across all the rows (axis=0) into one row. # And division by a scalar divides each column by the same scalar. # The result is a row with a column per input. var[i] = np.square(subset - mu[i]).sum(axis=0) / (n - 1)
But wait, there is more. Note how the probability density function has an exponent function in it. Instead of computing the weights by taking the exponents and multiplying, we could compute the logarithms of the weights, by adding up the logarithms (and the logarithm of the exponent is the argument of the exponent, saving this function computation). So the formula becomes:
logW(H[i]|E) = logW(H[i]) + log(p(E|H[i])) = = logW(H[i]) + log(1/sqrt(2*pi*var[i, E])) - (x[E] - mu[i, E])^2 / (2*var[i, E])
The part log(1/sqrt(2*pi*var[i, E])) is a constant that can be pre-computed in advance, so the computation with a few multiplications and additions is quite efficient.
No comments:
Post a Comment