
Saturday, April 15, 2023

Bayes 29: computation for numeric values (and Python)

As promised in the previous installment, here is how data science handles the Bayesian inference directly on the numeric values. As a reminder from that installment, the formula for inference by weight is:

W(H[i]|E) = W(H[i]) * P(E|H[i])

So what they do is, instead of using the probability P(E|H[i]), they use the probability density p(E|H[i]) (note the lowercase p) from a previously computed distribution, with E this time meaning not just a specific event variable but a specific numeric value of a specific variable. Which really makes sense, since that's what the probability density expresses: the relative likelihood that this specific value would come out of the distribution for this variable. Since it's a conditional distribution, only the training cases for the hypothesis H[i] are used to compute the distribution.

The typical distribution used here is the normal distribution. From what I understand, it's not only because it's the typical one occurring in reality, but also because for any original distribution, once you start repeatedly pulling samples from it, the averages of these samples become described by the normal distribution (this comes out of the Central Limit Theorem). The normal distribution has two parameters: mu (the mean value) and sigma (the standard deviation, equal to the square root of the variance). The probability density function then is:

p(x) = (1/sqrt(2*pi*sigma^2)) * exp( -(x-mu)^2 / (2*sigma^2) )

Or, since sigma here is always squared, we can also replace the squared sigma with the variance var:

p(x) = (1/sqrt(2*pi*var)) * exp( -(x-mu)^2 / (2*var) )
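
For illustration, here is this density function as a quick Python sketch of mine (not from any particular library):

import math

# Probability density of the normal distribution with mean mu
# and variance var, evaluated at x.
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2. * var)) / math.sqrt(2. * math.pi * var)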

The values of mu and var would be different for each hypothesis H[i]. So we'd split all the training cases into subsets by mutually exclusive hypotheses, and then compute:

mu[E][i] = sum(cases[i][E]) / N[i]
var[E][i] = sum((cases[i][E] - mu[E][i])^2) / (N[i] - 1)

Here N[i] is the count of the training cases with the outcome H[i], and cases[i] are all the cases with that outcome, effectively N[i] = count(cases[i]). The reason why we use (N[i] - 1) instead of plain N[i] is that this computes not a variance but a sample variance. This comes out of a rather interesting distinction between probability theory and statistics: probability theory describes the random distributions where the parameters of these distributions are known in advance, while statistics describes how to deduce these parameters from looking at the samples of the data. Here we obviously don't somehow know the distribution in advance but have to deduce it from looking at the training cases, i.e. the data samples, so we have to use statistics and its sample variance. Technically speaking, mu here is also not the true mean but the sample mean, but the formula for it is the same anyway. However the need to divide by (N[i] - 1) comes from the sample mean being computed from the same samples: it gets pulled toward the values at hand, so the squared deviations from it come out systematically a bit smaller than the deviations from the true mean would be, and dividing by (N[i] - 1) counter-adjusts for this offset. If we magically knew the true mean, we'd use plain N[i] for the sample variance too.

And here is a good spot to talk about Python. As much as I detest the language, I see why it's popular for data science: it has some very good libraries for this purpose: numpy, sklearn, matplotlib. They've even added an operator of matrix multiplication @ into the language for the convenience of these libraries. And the point of the matrix operations is not only that they allow writing things down in a form that is short and comfortable for the data scientists (although confusing as hell for the rest of us), but also that they move all the tight loops into the C++ code that is way faster than the Python code, and also allow the library to internally parallelize the computations as needed. So the computations in a slow interpreted language with questionable parallelism become quite fast and quite parallel. Aside from that, the libraries provide other conveniences. Sklearn even has a bunch of popular data sets built right into it. So in Python the code to compute mu and var looks like this:

import numpy as np

# X is a numpy matrix, where each row corresponds to a training case
# and each column to an input variable (i.e. a numeric event);
# Y is a numpy vector containing the outcomes for all the training cases

mu = {}   # per-outcome sample means
var = {}  # per-outcome sample variances

# Find the subset of the training cases that have the outcome i,
# it produces a vector of the same size with True for
# each included case and False for each excluded case.
selector = (Y == i)

# Select the subset of training cases for i, keeping all the columns
subset = X[selector, :]

# Count the included cases, which is the number of rows in the subset
n = subset.shape[0]

# Compute the sample mean, by averaging across all rows (axis=0).
# The result is a row with a column per input.
mu[i] = subset.mean(axis=0)

# Compute the sample variance, again producing a row with a
# column per input. The tricky part is that mu[i] is a single row,
# so the subtraction operator automatically considers it as a matrix
# that has as many rows as subset, with all the rows being the same.
# Sum adds up the columns across all the rows (axis=0) into one row.
# And division by a scalar divides each column by the same scalar.
# The result is a row with a column per input.
var[i] = np.square(subset - mu[i]).sum(axis=0) / (n - 1)
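
As a side note, numpy can compute the sample variance in one call, with the parameter ddof=1 selecting the division by (n - 1) instead of n; the explicit form above just mirrors the formula:

# Equivalent to the explicit computation above.
var[i] = subset.var(axis=0, ddof=1)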

But wait, there is more. Note how the probability density function has an exponential function in it. Instead of computing the weights by taking the exponentials and multiplying, we can compute the logarithms of the weights, by adding up the logarithms (and the logarithm of the exponential is just its argument, saving that function computation altogether). So the formula becomes:

logW(H[i]|E) = logW(H[i]) + log(p(E|H[i])) =
  = logW(H[i]) + log(1/sqrt(2*pi*var[i, E])) 
             - (x[E] - mu[i, E])^2 / (2*var[i, E])

The part log(1/sqrt(2*pi*var[i, E])) is a constant that can be pre-computed in advance, so the whole computation comes down to a few multiplications and additions, which is quite efficient.
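
Putting it all together, here is a sketch of mine (with the hypothetical name logw0 for the per-hypothesis prior log-weights, continuing the mu and var computed above) of scoring one input row by the logarithms of weights:

import numpy as np

# x is a row of input values; mu[i] and var[i] are the rows computed
# above; logw0[i] is the logarithm of the prior weight of hypothesis i.
def log_weights(x, mu, var, logw0):
    logw = {}
    for i in mu:
        # The constant part log(1/sqrt(2*pi*var)), pre-computable per hypothesis.
        const = -0.5 * np.log(2. * np.pi * var[i])
        # Add up the log-densities of all the inputs and the prior log-weight.
        logw[i] = logw0[i] + np.sum(const - np.square(x - mu[i]) / (2. * var[i]))
    return logw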

Monday, April 10, 2023

Bayes 28: computation by weight revisited

I've recently found out how the people in data science compute the Bayesian inference for values from an arbitrary numeric range (as opposed to yes/no events). As it turns out, they do it using a computation by weights. Ever since discovering the computation by weight on my own, I've been wondering why nobody else seems to use it, since it's so much simpler than by probabilities. The answer turns out to be similar to what I've discovered before for the connection between the Bayes and neural networks: the people in the field do know about it and do use it, only the simplified popular explanations don't.

I wanted to write down the explanation of how they do it (at least, for myself in the future), and so I went back to read what I wrote before about the Bayesian computation by weight, and found that what I wrote before is quite complicated. I wrote it down as I discovered it when experimenting with all those extra parameters that make a Bayesian system work in reality, and so that explanation is also full of discussion of those parameters. Also, I've learned some things since then.

Before going into the new ground, let me try a new take on the same thing: discuss the Bayesian inference by weight, only now in its most basic form.

Let's start again with an event E and hypothesis H. In the modern terms, they call this machine a Bayesian classifier, and call the mutually-exclusive hypotheses classes. The events are the inputs, and based on the values of the inputs the machine is trying to decide which hypothesis is most likely to be true. The classic Bayesian formula for a binary event E is:

P(H|E) = P(H) * P(E|H) / P(E)

Here P(H) is the prior (i.e. previous) probability that the hypothesis is true, P(E) is the prior probability that the event E will be found true (after we learn the actual value of the event this probability collapses), P(E|H) is the conditional probability that the event would be true for the cases where hypothesis H is true, and finally P(H|E) is the new (posterior) probability of H after we learn that E is true. If we learn that E is false, we would compute P(H|~E) instead, for the complementary event ~E.

Where did that formula come from? Well, when talking about probabilities, it really helps to draw a diagram that starts with the range of all the possible cases and then splits it into parts with the appropriate weights. Let's draw this field as a square, in beautiful ASCII art. And then split it with a horizontal line so that the area above the line matches the number of the cases resulting in the hypothesis H1 being true, and below the line with H1 being false (i.e. the complementary hypothesis ~H1, which we'll also name H2, being true). This is for a very simple classification with two complementary hypotheses: H1 = ~H2 and H2 = ~H1.

+---------------+
|               |
|               | H1
|               |
+---------------+
|               |
|               | H2 = ~H1
|               |
+---------------+

Now let's do the same, but split the same square vertically. The left side will match the event E1, and the right side will be its complementary ~E1:

+---------+-----+
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
+---------+-----+
    E1      ~E1

Now let's take the second picture and split each vertical part horizontally, into two parts, that again match H1 on top and H2 on the bottom. I've marked these parts with the letters a, b, c, d.

+---------+-----+
| a       | b   |
|         |     |
|---------|     | H1
| c       |     |
|         |     |
|         |-----|
|         | d   | H2
+---------+-----+
    E1      ~E1

Here the parts a and b correspond to H1, and their total area is equal to the original area (within the ASCII-art limitations) of the upper part of the split in the first picture. The parts c and d correspond to H2, and again the sum of their areas is equal to the original area of the lower part. But obviously they're split differently on the left and right sides, meaning that if we know that E is true, H1 has a lower probability than H2, but if E is false, H1 has a higher probability. And if we don't differentiate based on E1, the left and right parts average out.

The areas of these parts a...d are the weights of four sub-divisions:

a: E & H1
b: ~E & H1
c: E & H2 (or equivalently E & ~H1)
d: ~E & H2 (or equivalently ~E & ~H1)

The reason why I prefer to use H2 instead of ~H1 is that this notation generalizes more obviously to more than two hypotheses: H3, H4, and so on. Each additional hypothesis would add two areas to the picture, one on the left and one on the right.

Now we can express the probabilities through relations of these areas (and areas can also be called weights):

P(H1) = (a + b) / (a + b + c + d)
P(H2) = (c + d) / (a + b + c + d)
P(E) = (a + c) / (a + b + c + d)
P(~E) = 1 - P(E) = (b + d) / (a + b + c + d)
P(E|H1) = a / (a + b)
P(E|H2) = c / (c + d)
P(H1|E) = a / (a + c)
P(H2|E) = c / (a + c)
P(H1|~E) = b / (b + d)
P(H2|~E) = d / (b + d)

The general principle for P(x|y) is that the area that satisfies both conditions x and y becomes the numerator, and the area that satisfies y becomes the denominator.

Alright, let's substitute these formulas into the Bayesian formula:

P(H1) * P(E|H1) / P(E) = P(H1) * (1/P(E)) * P(E|H1)
  = ((a + b) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (a / (a + b))
  = a / (a + c)
  = P(H1|E)
  
P(H2) * P(E|H2) / P(E) = P(H2) * (1/P(E)) * P(E|H2)
  = ((c + d) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (c / (c + d))
  = c / (a + c)
  = P(H2|E)

So that's where that formula comes from and how it gets proven. Note that the computation of both probabilities involves the final division by the same value (a + c). If we just want to compare them, this division by the same value doesn't matter and we can skip it. Instead we'll just get a and c, which are also the weights W(H1|E) and W(H2|E), and if we want to get the probabilities, we can normalize by dividing by their sum. Or I guess rather than W(H1|E) we could say W(H1 & E), but that's the beauty of working with the weights instead of probabilities: the probabilities require normalization at each step, dividing by the sum of all possible weights, while with weights the sum is kept implicit, and W(H1|E) = W(H1 & E) = W(E|H1). When expressed in weights, the formulas become simpler:

W(H1) = a + b
W(H1|E) = W(H1) * P(E|H1) = (a + b) * (a / (a + b)) = a
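
For a quick numeric check (with some arbitrary areas picked by me):

# Arbitrary areas of the four parts of the square.
a, b, c, d = 2., 1., 3., 2.

# The weight updates on a true event E land exactly on a and c.
w_h1 = (a + b) * (a / (a + b))  # = a = 2.0
w_h2 = (c + d) * (c / (c + d))  # = c = 3.0

# Normalizing the weights recovers the posterior probabilities.
p_h1 = w_h1 / (w_h1 + w_h2)     # = a / (a + c) = 0.4 = P(H1|E)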

That's pretty much it. But there is one more question to consider: what if we have more than one event? We usually do have multiple events. After we compute P(Hi|E1) (or W(Hi|E1)), now we have a smaller rectangle left, with the level of the horizontal divider shifted from where it was in the previous square (or rectangle). What do we do with the next event? There are multiple ways to look at it.

One way is to hope that the events are completely independent from each other. This basically means that as we look at each part produced on splitting by E1 (E1 and ~E1), and further split each of these parts by E2, the horizontal lines in each vertical quarter shift according to the current level of H in the part being split, with the result that P(E2|E1,Hi) = P(E2|Hi). It would look something like this:

+----+----+--+--+
| e  | f  |g |h |
|----|    |  |  |
|    |    |  |  | H1
|    |----|  |  |
|    |    |--|  |
|    |    |  |  |
|    |    |  |--| H2
+----+----+--+--+
 E1   E1   ~E1 ~E1
 E2   ~E2  E2  ~E2

The equality P(E2|E1,H1) = P(E2|H1) = P(E2|~E1,H1) would imply (even though it doesn't quite look like it in ASCII art):

e / (e + f) = a / (a + b) = g / (g + h)

That's a simple-minded implementation (in the scientific circles they call it naive, sometimes even spelled with two "i"s). The problem there is that when the assumption is not true and the events are strongly dependent, this can drive the probabilities in weird ways.
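
In code, this naive computation by weight is very simple. Here is a sketch of mine, with the conditional probabilities assumed to be already computed from the training cases:

# w: hypothesis -> prior weight (e.g. the count of its training cases);
# p_e_h: hypothesis -> list of P(E[j]|H), one entry per event;
# events: the observed True/False values, one per event.
def apply_events_naive(w, p_e_h, events):
    for h in w:
        for j, e in enumerate(events):
            # Multiply by P(E|H) for a true event, by P(~E|H) for a false one.
            w[h] *= p_e_h[h][j] if e else (1. - p_e_h[h][j])
    return w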

Another way would be to build the tree of exact splits: split the slices produced by E1 into slices for E2, then for E3, and so on, and for each vertical slice after each split find the exact proportion of cases. This is obviously much more complex, and complexity grows exponentially with the number of events.

The third way (I think I've just realized it from reading the materials about data science) would be to track the splits by events pair-wise, do the vertical splits by each pair: (E1, E2), (E1, E3), ..., (E1, En), (E2, E3), (E2, E4), ..., (E2, En), and so on. I haven't quite thought through yet the adjustments that would need to be computed for each split. I guess, fundamentally it would be a representation of the covariance matrix (another word I've learned). But I guess there are two potential ways to do it: one would be to adjust the positions of the horizontal lines after the second split, another would be to adjust the positions of the vertical lines for the second split. I'm not sure which one is more correct. Maybe either way would adjust the ratio of the areas on the left and right to the same value. But either way, after you apply E1, adjust the probabilities of E2, E3, and the rest of the events according to this matrix. Then after applying E2, adjust E3, E4, and so on. It won't be a perfect match that can be produced with the exact splits but it would easily catch the events that are duplicates and near-duplicates of each other.


Saturday, December 18, 2021

sentiments

It's pretty amazing how the development of AI research is now intertwining with psychology. Not really surprising: if the current ML models emulate the human thinking, this is what should be happening; the amazing part is that the AI research has finally reached this point.

The IEEE Computing Edge magazine tends to be rather boring but usually contains one gem per issue. In the November 2021 issue that gem was the history of Ikonas graphics processing (I didn't realize that the raster video started to appear only in the mid-1970s; before then the memory costs were prohibitively expensive, and the displays did vector graphics in hardware). In the December 2021 issue that gem is an article about the human emotions, "The hourglass model revisited" (https://www.sentic.net/hourglass-model-revisited.pdf, 10.1109/MIS.2020.2992799).

They define the human emotions as a combination of 6 dimensions, 4 of them that can be positive or negative (hence forming an "hourglass", thin in the middle, growing towards the extremes): sensitivity, attitude, temper, introspection, and 2 neutral ones that can be present or not present: expectation and surprise. Additionally they split the attitude into "towards self" and "towards others".

The gradations of 4 "main" emotions from positive to negative are defined there as:

Introspection: ecstasy-joy-contentment-melancholy-sadness-grief

Temper: bliss-calmness-serenity-annoyance-anger-rage

Attitude: delight-pleasantness-acceptance-dislike-disgust-loathing

Sensitivity: enthusiasm-eagerness-responsiveness-anxiety-fear-terror

And then they define the compounds, for example "hate = anger + fear", or "pessimism = expectation + sadness". Though strangely they define despair in the same way as pessimism and not as "expectation + grief", so I guess this model has quite a bit of subjectivity.

They also define the absolute strength of an emotion (in the range of -1 to +1) as the sum of the strengths of its components (or we could say "of its present components") divided by the count of its present (i.e. non-0) components.

This is an improvement on always dividing by 4, but I wonder if it could be improved further by just taking the sum of the highest negative and highest positive components. Their formula already shifts the weights in that direction compared to the previous one, and arguably the presence of a little bit of a side emotion would not reduce the effect of the strongest one, and maybe even add to it if both emotions are in the same direction. So maybe it should be, for each direction, not even

  max(abs(Ei))

but

  1-Product(1-abs(Ei))

Then I've remembered where I've seen the same problem: in the Bayesian deduction! Just rescale every sentiment from [-1, 1] to [0, 1] and consider it to be a Bayesian probability of an evidence that the sentiment is positive. And then just compose them in the Bayesian deduction, in a classic formula or with formula by odds/chances as described in https://babkin-cep.blogspot.com/2017/02/a-better-explanation-of-bayes-adaboost.html. Then the sentiments that are at 0 would naturally translate to 0.5 and won't affect the computation.
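
As a sketch of mine (not from the article), this composition through the Bayesian odds would look like:

# Compose the sentiments in [-1, 1] by treating each one as Bayesian
# evidence: rescale to a probability, multiply the odds, rescale back.
def compose_sentiments(sentiments):
    odds = 1.
    for s in sentiments:
        p = (s + 1.) / 2.
        # Keep the probability away from 0 and 1, so that the odds stay finite.
        p = min(max(p, 1e-6), 1. - 1e-6)
        odds *= p / (1. - p)
    p = odds / (1. + odds)
    return 2. * p - 1.

A neutral sentiment of 0 maps to p=0.5, giving the odds factor of 1, and so has no effect, just as described above.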

Sunday, September 27, 2020

a book on probability

I've stumbled upon the book "Introduction to probability" by Anderson, Seppalainen, and Valko. It's a college textbook. 

Right in the second chapter it does a very good introduction into the Bayesian formula, deriving it from the descriptions of weights of events, in the same way as I struggled to reinvent on my own. And right there it also provides the "General version of Bayes' formula" for the enumerated events, that took me so much time to reinvent on my own. Well, another case of "everything is invented before us" :-) If only we knew, what books to read at the right time :-) On the other hand, I'm not sure if I would have recognized the importance of that formula if I didn't have a use for it in mind first.

In case you wonder, here is the formula:

If B1...Bn are events that partition the sample space, then for any event A we have:

P(A) = sum from i=1 to n of (P(A & Bi)) = sum from i=1 to n of (P(A|Bi)*P(Bi))

Then 

P(Bk|A) = P(A & Bk) / P(A) = P(A|Bk)*P(Bk) / (sum from i=1 to n of (P(A|Bi)*P(Bi)))
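
Or as a Python sketch of mine:

# p_b: the prior probabilities P(B[i]) of the partitioning events;
# p_a_b: the conditional probabilities P(A|B[i]).
def bayes_general(p_b, p_a_b):
    # The total probability P(A) over the partition.
    p_a = sum(pb * pab for pb, pab in zip(p_b, p_a_b))
    # The posteriors P(B[k]|A) for every k.
    return [pb * pab / p_a for pb, pab in zip(p_b, p_a_b)]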

Friday, December 7, 2018

Principal component analysis

Today I've learned about the existence of the Principal component analysis. It's supposed to find the orthogonal dimensions that best describe the variance of the data set. Perhaps this is the real answer for removing the duplication in the dimensions that I've been thinking about some time ago. And it had been invented in 1901! I haven't understood it yet, will need to read up on it. For now just making a note so that I won't forget about it.

Saturday, September 30, 2017

removing the duplication in Bayesian & AdaBoost

I've been thinking some more on how to remove the duplication between the Bayesian events/AdaBoost partial hypotheses. Here are some notes. They're not some final results, just some thinking aloud.

I think the problem of removing the commonality can be boiled down to the following: Suppose we have an event A that gives the prediction with some effect a on the training cases, and an event B that gives the prediction b on the training cases. These predictions have some part ab (that's a 2-letter identifier, not multiplication of a and b) in common. Then the effects can be written as:

A: a = a' + ab
B: b = b' + ab

Where a' and b' are the differences between the original a and ab, and b and ab. Here the operator "+" means not the addition but application of the events in the deduction one after another. So "a + b" means "apply a, then apply b". The reason why I chose the sign "+" is that this combined application should act like addition. I.e. have the commutative and associative properties like

x + y = y + x
(x + y) + z = x + (y + z)

No, I don't have a proof why they do. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutative requirement. The exact operation that satisfies these rules will need to be found but for now we can follow through the results of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part ab gets applied twice. To remove this duplication we've got to get rid of one copy of ab. We can do this by splitting the original two events into three events A', B' and AB (again, that's a 2-letter identifier, not a multiplication):

A': a'
B': b'
AB: ab
a = a' + ab; b = b' + ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' + ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' and c
b'c would be the overlap between b' and c
abc would be the overlap between ab and c

And the double-primes would be the "remainders":

a' = a'' + a'c
b' = b'' + b'c
ab = ab' + abc
c = c' + a'c + b'c + abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does the "overlap" mean and how do we compute the value of ab from the values of a and b?

The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split a and b into a', b', and ab, we would be splitting the whole set of training cases into 3 parts: one part gets predicted well by both A and B, one only by A, and one only by B.

Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.

The last sentence also suggests the third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues in case of the partial confidence during the execution of the model that can be handled better with the combined events.

Returning to the definition of "+", I haven't yet figured out how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms; that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of number of events being limited by the number of training cases is pretty much equivalent to the computation directly on the training cases by weight.

AdaBoost conjunctions & XOR

I've been thinking recently about the better ways to extract the commonality from two Bayesian events (hope to find time to write about that soon), and that brought me back to the XOR problem: how to handle the situation when two events indicate a certain hypothesis when they are the same and another hypothesis when they are different.

BTW, I took a class on machine learning recently, and it turns out that even with the neural network this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by the human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.

So, thinking about it, I've remembered again about the concept of conjunctions that I've read about in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in chapter 9.

The idea there is that the quality of the AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses. The way to do it is to take the two best choices and try a conjunction on them. If it gets better, try adding the third one.

For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:

              ^y
          0   |   1
              |
       -------+------>
              |      x
          0   |   0

The two partial hypotheses (x > 0) and (y > 0) taken together but separately will identify the upper right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right corner will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right corner like (x > 1), (y > 1), (x > 2), (y > 2) and so on so that the upper-left and lower-right quarters would get better but the parts of the upper-right close to the boundary would get worse. If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and it would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.

The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.

They can be used to handle the XOR situation as well. Let's look at the situation with the 1s in the upper-right and lower-left quadrants:

              ^y
          0   |   1
              |
       -------+------>
              |      x
          1   |   0

The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get correctly voted by both hypotheses but the 1s would get one vote for and one vote against. But similar hypotheses can be added for the other two quarters: !((x > 0) & (y < 0)), !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite votes canceling each other for the 0s, and taken together these 4 hypotheses can identify everything correctly.

An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example.  They're actually the worst ones, having the exact 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction they become optimizable.

The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:

              ^y
          0   |   1
              |
       -------+------>
              |      x
          0   |   0

If we use one XORed hypothesis, !((x > 0) XOR (y > 0)), and two simple ones, (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right quadrant will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.
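
This is easy to verify mechanically; here is a little Python check of mine that the majority of these three votes comes out right in every quadrant:

import itertools

def target(x, y):  # 1 only in the upper-right quadrant
    return 1 if x > 0 and y > 0 else 0

def votes(x, y):   # the XORed hypothesis and the two simple ones
    h_xor = 0 if (x > 0) != (y > 0) else 1
    return [h_xor, 1 if x > 0 else 0, 1 if y > 0 else 0]

for x, y in itertools.product([-1., 1.], repeat=2):
    majority = 1 if sum(votes(x, y)) >= 2 else 0
    assert majority == target(x, y)  # holds for all four quadrants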

The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.

Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.

Wednesday, September 6, 2017

AdaBoost 10: a better metric?

Here is one more consequence I've noticed from the application of AdaBoost approach to the Bayesian one: it can be applied the other way around too.

I.e. AdaBoost provides the list of partial hypotheses and weights for them. But at some point it tends to start repeating the same hypotheses over and over again, because a set of hypotheses taken together gives some gain, and applying it multiple times seemingly increases this gain. This gain is subject to the same problems as the Bayesian computations with the duplicate events: it increases the confidence only seemingly. So the same solution can be applied to it:

After the set of partial AdaBoost hypotheses has been computed, they can be re-ordered in such a way as to remove the duplication, by selecting on every step the partial hypotheses with the smallest effect. This "run order" can be kept and updated on every step, instead of the AdaBoost's "selection order". So when the sequence will start repeating, a duplicate hypothesis would add nothing to the cumulative effect of the "run order", and at this point the selection algorithm should try a different hypothesis (or simply stop).

I guess, in case the selection algorithm doesn't stop but tries to do the next best pick (and then maybe another one if the second choice is no good either), it becomes fundamentally more complicated: on each iteration of AdaBoost it should select not simply the hypothesis that maximizes the effect of the whole sequence when appended to the end of it but one that maximizes the effect of the whole sequence when inserted at the position in the sequence where it adds the least effect. Each such insert could also cause a re-ordering of the part of the sequence that follows it.

Which sounds even more heavy-weight than the original AdaBoost selection of the hypothesis. On the other hand, since the AdaBoost selection is already so heavy-weight, maybe this would make little extra overhead. I don't know how to analyze it yet.


Tuesday, August 1, 2017

Bayes 27 & AdaBoost: another problem with partial confidence

The solution from the previous installment can handle the partial confidence but it has its own issue. To demonstrate it, let's look at a run with the table from the part 26 but where both of the fully-correlated events E1 and E2 get a confidence of 0.7:

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421

The probabilities move after applying E1, after E2 they move again. The interesting question is, should have they moved after E2? It really depends on how E1 and E2 have been measured: are their confidences derived independently or tracking back to the same measurement?

The code assumes that they are independent, so if both are independently measured at the confidence of 0.7, together they provide a higher confidence. But if they both come from the same test, this assumption is wrong.

If they came from the same test then E2 carries no additional information after E1 has been processed. In this case we should have treated the value C(E2)=0.7 (i.e. C(E2)=C(E1)) as having the meaning "I don't know", and then the differences from it would carry the new information.

But this approach has its own problems. First of all, it's fine if the confidence values C(E) represent the measured probabilities. But if they represent some expert estimation then an expert saying "I don't know" and giving the confidence of 0.5 would shift the probabilities instead of leaving them unchanged. Well, it could be re-scaled automatically, but then how do we know that the expert means "I don't know" and not "I've carefully considered everything and I think that it's a 50-50 chance"? So it sounds like getting it right would require using two separate numbers: the absolute confidence that tells how sure we are of the estimation, and separately the estimation itself. Which would create all kinds of messes.

The second problem: if we want to do such a re-scaling, to which value do we re-scale? It's reasonably easy if two events are copies of each other, but what if the correlation is only partial? There actually is a rather easy answer to the problem: the needed value would be P(E), the probability that the event is true at that point (after the previous events have been considered). I actually went and modified the code to compute the P(E) based on the weights, only to find out that I've forgotten an important property of the AdaBoost logic. Let me show its output:

$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1                     ,E2                     ,E3                     ,
hyp1   ,3.00000,0.66667^0.66667@0.66667,0.66667^0.66667@0.66667,0.40000^0.33333@0.50000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.40000^0.33333@0.50000,
hyp2   ,3.00000,0.33333^0.33333@0.33333,0.33333^0.33333@0.33333,0.60000^0.66667@0.75000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.60000^0.66667@0.75000,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
Expected P(E)=0.500000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
Expected P(E)=0.513333
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000
Expected P(E)=0.598615
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421

The computation table now has one more element printed after "@", the estimation of the event's probability for the particular branch based on the previous events. During the computation these values get mixed together based on the weight of the branches. But notice that when applying E2, the message says:

Expected P(E)=0.513333

It's 0.513333, not 0.7! The reason is two-fold. First, AdaBoost works by adjusting the weights of the training cases such as to bring the probability of the last seen event to 0.5. It just does the correction opposite to what I was trying to achieve. Second, these are the wrong probabilities in the table. These probabilities are computed per-branch based on the absolute confidence. They don't care if the actual confidence was 0.7 or 0.3, these values are the same from the standpoint of the absolute confidence. But this difference matters a lot when trying to predict P(E)!

Instead we need to honestly compute the P(E) based on the previous events, not per the AdaBoostian table but per the original training cases. But if we do that, we end up again with a table of exponential size. The trick with ignoring all the events but the last few might still work though. The idea would be similar: since we try to order the closely-correlated events close to each other, the previous few events would be the ones with the strongest effects. And if the farther events aren't closely correlated, they won't have a big effect anyway, so they can be ignored. Maybe I'll try this next. Or maybe it's simply not worth the trouble.

Monday, July 10, 2017

Bayes 26 & AdaBoost: more on the partial confidence

I think I've found at least a partial solution to the problem of partial confidence that avoids going exponential.

It grows right out of the problem definition: the problem really exists only for the closely related events. If two events are unrelated, then the confidence of one wouldn't affect the other in any noticeable way. But for the related events, the first one takes up the bulk of the effect, and then the effect of the following events becomes greatly diminished. If two events are actually the same (like E1 and E2 in tab24_01.txt in the part 24), the second event gets stripped of any effect whatsoever. But if at the execution time we find that we have no information about the first event and good information about the second event, it's too late to make use of the second event because its probabilities are already hardcoded in the table and condemn it to having no effect. But if we could change the probability at run time then we could use the second event instead. Better yet, the probabilities of the following events would be unaffected because they don't care which one of the duplicate events provided the information. For a sequence of not equal but closely correlated events, if the first one is excluded from the table preparation, the others still seem to converge in their effects to something close to the original sequence, and the following events don't get affected that much.

Since the logic in the part 24 assumes that the closely-related events would be placed in the sequence next to each other, the solution might be to compute an event's table entry in multiple versions, depending on whether the values of a limited number of the previous events are known or not. It would still be exponential for this number of previous events. But since we expect a fast convergence and hope that not too many event values will be unknown, this number of previous events can be small, and the exponent of it would be reasonably small too.

By the way, I've come up with a name for the K(E) described in the previous part: the "absolute confidence". The confidence C(E) has the range from 0 showing that we're completely sure that the event is false, through 0.5 showing that we have no idea, to 1 showing that the event is true. The absolute confidence

K(E) = abs(2*C(E) - 1)

has the range from 0 showing that we have no idea to 1 showing that we're completely sure that the event is true or false.

I had to draw a picture to figure it out how it would work. Suppose that we want to take 3 previous events into the consideration. The datapoints with the probability values for these events can then be seen as nodes of a tree, with the nodes for each event aligned vertically under the event name:

E1         E2       E3       E4        E5
                      +------o
                    0/
              +-----o
            0/      1\
            /         +------o
      +----o
     /      \         +------o
   0/       1\      0/
   /          +-----o
  /                 1\
 /                    +------o 
o                              
 \                            0+-------o
  \                   +------o<
   \                0/        1+-------o
   1\          +----o
     \        /     1\        0+-------o
      \     0/        +------o<
       \    /                 1+-------o
        +--o                  
            \                 0+-------o
            1\        +------o<
              \     0/        1+-------o
               +----o
                    1\        0+-------o
                      +------o<
                              1+-------o

E1 still has only one probability value because it has no preceding events. But E2 has two possible probabilities, depending on whether E1 was unknown (K(E1)=0) or known (K(E1)=1). E3 has four probabilities, depending on the possible combinations of E1 and E2. And E4 has eight probabilities.

When we come to E5, it can't just continue branching exponentially. It should depend on only three previous events, so we drop the dependency on E1. This is done by abandoning the subtree with K(E1)=0 and continuing branching the subtree with K(E1)=1. This is a reasonable thing to do because we assume that the combination {0, 0, 0} "should never happen" (i.e. no more than 3 events in a row would be unknown, otherwise use a higher value of depth), and the other mirroring pairs of branches {0, b, c} and {1, b, c} would converge to about the same probability value. Thus the events starting from E5 would also have 8 probability values each.

Then when we do the computation on the built table, we can have any K(E) in range [0, 1], so we do an interpolation between the branches. Suppose we're computing E4, and have 8 branches {E1, E2, E3} to interpolate. We do it by adding them up with the following weights:

W{0, 0, 0} = (1-K(E1))*(1-K(E2))*(1-K(E3))
W{0, 0, 1} = (1-K(E1))*(1-K(E2))*K(E3)
W{0, 1, 0} = (1-K(E1))*K(E2)*(1-K(E3))
W{0, 1, 1} = (1-K(E1))*K(E2)*K(E3)
W{1, 0, 0} = K(E1)*(1-K(E2))*(1-K(E3))
W{1, 0, 1} = K(E1)*(1-K(E2))*K(E3)
W{1, 1, 0} = K(E1)*K(E2)*(1-K(E3))
W{1, 1, 1} = K(E1)*K(E2)*K(E3)

Or the other way to look at it is that we start with the weight of 1 for every branch, then for E1 multiply the weight of every branch where K(E1)=0 by the actual 1-K(E1), and where K(E1)=1 by the actual K(E1). Then similarly for E2, multiply the weight of every branch where K(E2)=0 by the actual 1-K(E2), and where K(E2)=1 by the actual K(E2). And so on. The sum of all the weights will always be 1.

The physical meaning of this interpolation is that we split the computation into multiple branches and finally add them up together. Since the computation is additive, we can put them back together right away rather than waiting for the end.

And since I was implementing the partial confidences in the final computation, I've also implemented the support for the partial confidences in training: it's done by logically splitting the training case into two weighted sub-cases with the event values of 0 and 1.

All this is implemented in the script ex26_01run.pl, as a further modification of ex24_01run.pl. Without the extra parameters it works just as ex24_01 does. The new parameter -p sets the number of previous events whose absolute confidence affect the current event. Here is an example of a run with the same training table as before and the same input:

$ perl ex26_01run.pl -n tab24_01.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

The option -n disables the normalization of the weight of the hypotheses, just as before. The output is the same as ex24_01 produced. In the training data E1 and E2 are duplicates, so the probabilities computed for E2 in the adaboostian table (P(H|E) and P(~H|~E), they are separated by "^") are at 0.5, meaning that it has no effect.

And here is the same training data but with support for one previous unknown event, and the input data having E1 unknown (i.e. C(E1) = 0.5):

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in26_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.500000
hyp1    w=1.50000 p=0.50000
hyp2    w=1.50000 p=0.50000
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here E2 and E3 have not one pair of probabilities in the adaboostian table but two pairs, as shown in the branching picture above. The extra branches are shown on the continuation lines that start with a "+". The branches for E2 show that if E1 is known (the second branch), E2 will have no effect as in the previous example, but if E1 is unknown (the first branch), E2's probabilities would be the same as for E1, so E2 would replace it and converge to the same values.

As you can see in the run log, after E1 with confidence 0.5 (i.e. absolute confidence 0) is applied, both hypotheses still have the equal weight. But then after E2 gets applied, the weights and probabilities of hypotheses become the same as in the previous example, thus the knowledge of E2 compensating for the absent knowledge of E1.

An interesting thing to notice is that the probabilities for E3 are the same in either case. This is because this table can handle only one unknown event in a row, which means that at least one of E1 or E2 should be known, and E2 is able to fully compensate for the unknown E1, so both ways converge to the same probabilities of E3.

Now some information about how it all is implemented:

Some of the arrays have gained an extra dimension, to be able to keep track of the multiple branches. The weight of the training case {wt} has become an array instead of a scalar, and the probabilities {ppos} and {pneg} of the hypotheses became 2-dimensional arrays.

The function compute_abtable() that computes the adaboostian table has gained an extra nested loop that goes through all the branches. As the final step of processing every event it prunes the branches coming from the previous event and then splits each of these branches in two. The support of the partial confidence in the training data is done by replacing the statement

if ($c->{tc}[$i] >= 0.5) {
    $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
} else {
    $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
} 

with

push @{$c->{wt}}, 
    $c->{tc}[$i] * $oldwt->[$br] / $abhyp{$hyp}->{ppos}[$i][$br]
    + (1. - $c->{tc}[$i]) * $oldwt->[$br] / (1. - $abhyp{$hyp}->{pneg}[$i][$br]);

The function apply_abevent() that does the computation based on the table has been updated to manage the branches during this step. It has gained the extra argument of an array that keeps the history of the absolute confidences of the few previous events. This history is used to compute the weights of the branches:

for (my $br = 0; $br <= $maxbr; $br++) {
    my $w = 1.; 
    for (my $bit = 0; $bit <= $#$evhist; $bit++) {
        if ($br & (1 << $bit)) {
            $w *= $evhist->[$bit];
        } else {
            $w *= 1. - $evhist->[$bit];
        }   
    }   
    push @brwt, $w; 
}   

Then an update based on the current event is computed for each branch, and the weight of the hypothesis is updated per these weighted branches:

my $p = 0.; 

# each branch is weighed per the observed history
for (my $br = 0; $br <= $maxbr; $br++) {
    $p += $brwt[$br] * ( 
        $conf * $hyp->{ppos}[$evi][$br]
        + (1. - $conf) * (1. - $hyp->{pneg}[$evi][$br])
    );  
}   

$hyp->{cwt} *= $p; 

And of course the functions that print the tables and normalize the data have been updated too.

All this stuff works on the small examples. I'm not sure yet if it would work well in the real world, I'd need a bigger example for that, and I don't have one yet. Also, I've noticed one potential issue even on the small examples but more about it in the next installment.

Sunday, June 4, 2017

Neuron in the Bayesian terms, part 3

I want to elaborate on why the second trick is so important. Basically, because in general the chances do not change on an event in the exactly opposite ways (by dividing or multiplying by the same number depending on whether the event is false or true). It takes a specifically defined event to make that true.

In general if an event is true, the chance of a hypothesis after applying this event is:

Chance(H|E) = P(H|E) / P(~H|E)

The probabilities change on an event as follows:

P(H|E) = P(H) * P(E|H) / P(E)
P(~H|E) = P(~H) * P(E|~H) / P(E)

So then the chance changes (after substituting and reducing the formula):

Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)

If an event is false, the chance changes in a similar but different way:

Chance(H|~E) = P(H|~E) / P(~H|~E)
= Chance(H) * P(~E|H) / P(~E|~H)

Being able to do the "opposites" requires that

Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )

Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)


Which then means after doing the substitution:

P(E|H) / P(E|~H) = P(~E|~H) / P(~E|H)

Which is not always true. But if we take a page from the AdaBoost's book and define an event such that 

P(E|H) = P(~E|~H)
and
P(E|~H) = P(~E|H)

Then it all magically works. Again, this magic is not needed for the Bayesian computation in general, it's needed only to fit it to the formula used in the neural networks and in AdaBoost.
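
A quick numeric check of this magic (my sketch): pick any P(E|H), derive the rest from the symmetric definition, and the two multipliers come out as exact opposites:

p_e_h = 0.8          # P(E|H), picked freely
p_e_nh = 1. - p_e_h  # P(E|~H) = P(~E|H) = 0.2, per the definition above

mult_true = p_e_h / p_e_nh                 # applied when E is true: 4.0
mult_false = (1. - p_e_h) / (1. - p_e_nh)  # P(~E|H) / P(~E|~H): 0.25

print(mult_true * mult_false)  # 1.0: one divides exactly as the other multiplies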

There are versions of AdaBoost that use the different multipliers for when x[i] < 0 and x[i] > 0. Which means that

l[i]^x[i]

gets replaced with

if (x[i] > 0) {
  lpos[i]^x[i]
} else {
  lneg[i]^x[i]
}

AdaBoost in this situation still uses the same definition of E=(x == y) for the reasons of its internal logic, but if we don't require this logic then we can define E=(x > 0) and easily implement the chance computation for it, since the requirement of both multipliers being the opposites would disappear.

This opens two possibilities for variations of the neurons:

(1) One with still the definition E=(x==y) but different multipliers for x[i] of different sign.
(2) One with the definition E=(x > 0)

A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure if it would be trainable at all, especially the version (2), since AdaBoost has good reasons to keep its definition of E for the convergence. But maybe it would, and considering that AdaBoost in this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.

Saturday, June 3, 2017

Neuron in the Bayesian terms, part 2

I've been trying to describe in proper formulas my insight that a neuron in neural networks is a kind of a Bayesian machine (OK, this is apparently known to the people who do the neural networks professionally but it was an insight for me), and I think that I've finally succeeded. Let's start from the beginning:

A neuron implements the function

y = nonlinear(sum(w[i] * x[i]))

Here y is the output of the neuron, x[i] is the array of its inputs, w[i] is the array of some weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in the other ranges, they can be trivially scaled to the range [-1, 1]  by adjusting the weights w[i]).

The nonlinear function is generally some kind of a sigmoid that matches what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. pull it up to 1; if it's below some equal or lower level, we reject it, i.e. pull it down to 0; and if these levels are not the same and the probability is in the middle, then we leave it as some value in the middle, probably modified in some way from the original value. Except of course that this nonlinear function deals not with the probabilities in the range [0, 1] but with the values in the range [-1, 1], -1 meaning "false" and 1 meaning "true".

The Bayesian formula uses the multiplication, not addition, so the first trick here is that one can be converted into another using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:


l[i] = exp(w[i])
w[i] = ln(l[i])


This can be substituted into the expression for the sum:

sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )

Using the rule ln(a)*b = ln(a^b), the sum can be converted further:

sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )

Then using the rule ln(a) +ln(b) = ln(a * b) the sum can be converted into a product:

sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )
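
This chain of identities is easy to check numerically; here is a quick Perl sketch with arbitrary weights and inputs:

use strict; use warnings;
use List::Util qw(sum);

my @w = (0.4, -1.1, 0.7);  # arbitrary weights
my @x = (1, -1, 1);        # arbitrary inputs

# The weighted sum computed directly.
my $s = sum(map { $w[$_] * $x[$_] } 0 .. $#w);

# The same through l[i] = exp(w[i]) and the product.
my $prod = 1.;
$prod *= exp($w[$_]) ** $x[$_] for 0 .. $#w;

printf "sum = %f, ln(product) = %f\n", $s, log($prod);  # the same value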

Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:

 P(H|E) = P(H) * P(E|H) / P(E)

So as each Bayesian event E[i] gets applied, we can say that the final probability will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

Well, technically I maybe should have used another letter for the index, say j, but since the i in both formulas will soon be connected to each other, I'll just use it in both places from the start.

Instead of computing with probabilities we will be computing with chances. In case you haven't used chances before nor seen my mention of them in the previous posts, a chance is the ratio of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:

Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))

So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )
P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )


And the chances will be:


Chance(H) = P(H) / P(~H) 
= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )
= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )


If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their relation will be 1 and can be thrown out of the formula:
 
Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )

Well, almost. The consideration above glossed over the question of what we do if the event is false. The answer is that in this case the factor in the product should use the probabilities of ~E instead of E:

Chance(H) = product( 
  if (E is true) {
    P(E[i]|H) / P(E[i]|~H)
  } else {
    P(~E[i]|H) / P(~E[i]|~H)
  }
)

Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probability will be:

P(E[i]|H) = P(y == x[i])
P(~E[i]|~H) = P(y == x[i])

An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that

P(E|H) = P(~E|~H)
P(E|~H) = P(~E|H)

We then use the general property that

P(E|H) = 1 - P(~E|H)
P(E|~H) = 1 - P(~E|~H)

and get

P(E|~H) = P(~E|H) = 1 - P(E|H)

So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:

Chance(H) = product( 
  if (E is true) {
    P(E[i]|H) / (1 - P(E[i]|H))
  } else {
    (1 - P(E[i]|H)) / P(E[i]|H)
  }
)

When x[i] is in the range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by raising to the power of x[i]: when x[i]==1, it will leave the expression as-is, and when x[i]==-1, the expression will be "turned over":

Chance(H) = product(  ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )

We can now interpret the original formula in terms of the chances. That formula was:

y = nonlinear(sum(w[i] * x[i]))
= nonlinear( ln( product( l[i]^x[i] ) ) )

Here we can substitute the product:

y = nonlinear( ln( Chance(y > 0) ) )

A logarithm is a very convenient thing to apply to the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].

This substitution of the product will mean:

l[i] = P(E[i]|H) / (1 - P(E[i]|H))
w[i] =  ln( P(E[i]|H) / (1 - P(E[i]|H)) )

So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.
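
To tie the pieces together, here is a minimal Perl sketch (with made-up conditional probabilities) checking that the neuron's weighted sum is indeed the logarithm of the Bayesian chance product when the weights are built this way:

use strict; use warnings;

# Made-up conditional probabilities P(E[i]|H) for three inputs.
my @p = (0.8, 0.6, 0.3);

# The weights as logarithms of the chance multipliers.
my @w = map { log($_ / (1. - $_)) } @p;

# Inputs from the set {-1, 1}.
my @x = (1, -1, 1);

# The neuron's weighted sum.
my $sum = 0.;
$sum += $w[$_] * $x[$_] for 0 .. $#w;

# The same value as ln(Chance(H)) through the Bayesian product.
my $chance = 1.;
$chance *= ($p[$_] / (1. - $p[$_])) ** $x[$_] for 0 .. $#w;

printf "sum = %f, ln(chance) = %f\n", $sum, log($chance);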

Sunday, March 5, 2017

Bayes 25 & AdaBoost: the partial confidence problem

The code in the previous example works only with the full-confidence events, those that have the confidence of 0 or 1. Unfortunately I haven't found a good way to handle the partial confidence yet. I've found a way but it has problems. Let me show you.

Previously I was treating the partial confidence as a superposition of 0 and 1. This approach cannot be properly used here because it doesn't cover the connections between the events. To give an example, in the previous post the code had correctly found that in tab24_01.txt the events E1 and E2 are the same, so after E1 is applied, E2 becomes ignored (its AdaBoost probabilities become 0.5). So now if we do a computation and find that E1 is unknown (i.e. its confidence is 0.5) but E2 is known, E2 will still be ignored because its table entry says to always ignore it.

The correct way should handle this situation properly: if E1 is unknown, this should bring up the table values for E2 to compensate for it and treat E2 as E1 was formerly.

The obvious way to do it is to treat the partial confidence as a superposition of "the event is unknown" and "the event is known". So if the event is known, we process it as before. If the event is unknown, we would follow the unknown branch and then apply all the following events as if this event wasn't there. And if the event is partially known, we would split the weights of all the outcomes proportionally and handle one part with the known branch and the other with the unknown branch.

So the confidence of 0.7 would mean a superposition of 0.6 of unknown and 0.4 of the known value "1". The confidence 0.1 would mean a superposition of 0.2 of unknown and 0.8 of the known value "0". So if we denote C(E) as the event confidence, K(E) as the known weight in the superposition, and U(E) as the unknown weight in the superposition, we can write them as:

K(E) = abs(C(E) - 0.5) * 2
U(E) = 1 - K(E)
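
In code this split could look like the following trivial sketch:

# Split a confidence C(E) into the known and unknown weights of the
# superposition; the known part carries the value 1 if C(E) > 0.5,
# and 0 otherwise.
sub split_conf {
    my $c = shift;
    my $k = abs($c - 0.5) * 2.;
    return ($k, 1. - $k);
}

my ($k, $u) = split_conf(0.7);  # known 0.4, unknown 0.6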

The downside of this obvious approach is that the table and the computation become more complex. As we go through more events, two branches become four, then eight, then sixteen, and so on. The formerly quadratic complexity O(Noutcomes * Nevents) becomes exponential O(Noutcomes * Nevents * 2^Nevents). I don't know yet if there is a better way.
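
To illustrate the exponential growth, here is a minimal sketch of how the branches would multiply; the list of event confidences is made up for the example:

use strict; use warnings;

# Each partially-known event splits every existing branch in two:
# a "known" part and an "unknown" part, so the number of branches
# doubles with each event.
my @conf = (0.7, 0.1, 0.9);  # made-up confidences of the events
my @branch = ( 1. );         # weights of the branches, initially one branch

for my $c (@conf) {
    my $k = abs($c - 0.5) * 2.;  # known weight in the superposition
    my $u = 1. - $k;             # unknown weight
    @branch = map { ($_ * $k, $_ * $u) } @branch;
}

printf "%d branches\n", scalar(@branch);  # prints 8 = 2^3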

Maybe it would be possible to build a table of "gradients" between the events: i.e. if the confidence of the event A changes, how are the probabilities of another event B affected. The problem though is that this gradient depends on all the events in the sequence between A and B, so this idea doesn't seem to really fly.

Maybe another idea is workable, one that would reduce the exponential cost to a cubic one: after splitting the superposition, instead of following two branches, just take the probabilities for each following event from both branches and mix them by the weights of the branches:

P(H|Ei) = K(Ej) * P(H|KjEi) + U(Ej) * P(H|UjEi)

It's better than exponential but still not great, with the run time O(Noutcomes*Nevents*Nevents). And the table of these pre-computed values would still seem to be exponential, each event depending on all the possible combinations of the previous events.

Wednesday, March 1, 2017

Bayes 24: AdaBoost as a solution to the correlated events

When writing the whitepaper, I've realized that AdaBoost is very good at finding the cross-correlation of its partial hypotheses (that map to the Bayesian events), and it always picks the next step with the minimal cross-correlation. The Bayesian computations do have an issue with cross-correlation, as described in the whitepaper and near the end of Part 8 of my notes: if there are multiple events that duplicate or mirror each other, all of them are treated as independent and drive the probabilities off. This looked like AdaBoost could give a hint of how to solve this problem, so I've started thinking about it, and I think I've found a solution.

In retrospect the solution is pretty obvious: instead of computing the Bayesian model directly, treat it as an AdaBoost model whose set of partial hypotheses equals the set of the Bayesian events, with the extra limitation that each one of them must be used exactly once. If there are multiple equivalent events and they happen to go one after another, AdaBoost will treat the first one as real, and the following ones will get their error rate of 0.5 and will be effectively no-ops.

Well, if there are other events interspersed in between these equivalent copies, they won't be no-ops. These other events will shift the weights of the training cases, so that the second and further copies will find themselves drifted away from 0.5. The good news here is that AdaBoost is pretty good at removing the partial correlation too, but the bad news is that the order matters: to fully remove the duplication, the equivalent events should go one after another. And in general, it looks like the best removal of the cross-correlation happens when the closely related events follow closely one after another. This kind of order can be implemented if we reverse the AdaBoost preference of the partial hypotheses and make it always pick the one that has its training error rate closest to 0.5. Since (unlike the classic AdaBoost) the set of partial hypotheses is limited and each element from it must be used only once, the better ones will eventually be used as well.

To experiment with these ideas, I wrote the code example ex24_01run.pl. It doesn't reorder the events, just computes the AdaBoost values for whatever order it's given, which allows experimenting with the various orders. But it has a couple of advanced features from the later chapters of the book on AdaBoost: it does the multi-class (but single-label) version that also partitions the error rate computation by the sign of the value of the partial hypothesis.

The multi-class means that it can work not only with two outcomes (a yes/no decision) but with multiple possible outcomes (i.e. Bayesian hypotheses), just like the normal Bayesian table. And just like the normal Bayesian table, these outcomes must be mutually exclusive, so each case may have only one outcome (single-label). The multi-class logic works in the way that is similar to the one described in Part 10: the resulting table is just a combination of the individual per-outcome tables. From the standpoint of each outcome, the single-label property means that the rest of the rows of the table taken together represent the opposite of this outcome, so there is no need to compute that opposite explicitly.

The partitioning of the error rate computation not only makes the result better but also comes in very handy to prove that the opposite of one outcome is equivalent to the rest of the rows of the table. The AdaBoost training error e is the ratio of the total weights of the training cases where the value of the event E (that is, the Bayesian event, in AdaBoost terms a partial hypothesis) is not equal to the value of the outcome (Bayesian hypothesis) H. And we'll compute it separately for E (i.e. E=1) and ~E (i.e. E=0). It ends up expressible as probabilities that can be computed by summing and dividing the weights W of the training cases:

e(H,E) = P(~H|E) = (sum for cases i where H(i)=0 and E=1 (W[i])) / (sum for cases j where E=1 (W[j]))
e(H,~E) = P(H|~E) = (sum for cases i where H(i)=1 and E=0 (W[i])) / (sum for cases j where E=0 (W[j]))
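
For illustration, here is a small Perl sketch of how these two error rates could be computed from the weighted training cases; the case structure and field names are made up for the example:

use strict; use warnings;

# Compute the partitioned training error rates e(H,E) and e(H,~E)
# for one event from a list of weighted training cases. Each case is
# a hash with the made-up fields: {h} is 1 if the case's outcome is H,
# {e} is 1 if the event E is true in it, {wt} is the case weight.
sub error_rates {
    my @case = @_;
    my ($wrong_e, $tot_e, $wrong_ne, $tot_ne) = (0., 0., 0., 0.);
    for my $c (@case) {
        if ($c->{e}) {
            $tot_e += $c->{wt};
            $wrong_e += $c->{wt} if !$c->{h};  # H=0 and E=1
        } else {
            $tot_ne += $c->{wt};
            $wrong_ne += $c->{wt} if $c->{h};  # H=1 and E=0
        }
    }
    return ($wrong_e / $tot_e, $wrong_ne / $tot_ne);
}

my ($epos, $eneg) = error_rates(
    { h => 1, e => 1, wt => 1. },
    { h => 0, e => 1, wt => 1. },
    { h => 0, e => 0, wt => 1. },
);
printf "e(H,E) = %f, e(H,~E) = %f\n", $epos, $eneg;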

So if we pick a particular outcome and denote it H1, we can show that its opposite ~H1 is equivalent to the sum of the rest of hypotheses H2...HN by showing that

P(~H1|E) = sum for i in [2, N] (P(Hi|E))

which after removing the common denominator means

(sum for cases j where H1(j)=0 and E=1 (W[j])) = (sum for i in [2, N] (sum for cases k where Hi(k)=1 and E=1 (W[k])))

The single-label property means that each case is marked with exactly one outcome, and all the cases where the outcome H1 is absent must have some other outcome Hi present, so these sums are indeed equal.

So it is all self-consistent. The key part of the computation is:

for my $hyp (values %abhyp) {
    if ($conf) {
        # the event is true: multiply by P(H|E)
        $hyp->{wt} *= $hyp->{ppos}[$evi];
    } else {
        # the event is false: multiply by P(H|~E) = 1 - P(~H|~E)
        $hyp->{wt} *= (1. - $hyp->{pneg}[$evi]);
    }
}

And the matching part of the table building is:

foreach my $c (@case) {
    my $hyp = $c->{hyp}[0];
    # this code really handles only the TC values of 0 and 1,
    # for the arbitrary values it would have to do an interpolation
    if ($c->{tc}[$i] >= 0.5) {
        $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
    } else {
        $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
    }
}

The training undoes the weight change that will be done in the computation.

Now let's try some examples. Here is the example that I used in the whitepaper:

$ perl ex24_01run.pl -n tab24_01.txt in24_01_01.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here the option "-n" means "don't normalize the weights" (when normalized to sum to 1, the weights become the same as the probabilities). The two values printed in the AdaBoostian table, separated by a "^", are P(H|E) and P(~H|~E).

As you can see, the result is not correct either (because of the XOR problem) but it's better than with the plain Bayes version that gave hyp1 the probability 0.667. And since E1 and E2 are the same, E2 ended up with the values of 0.5 in the table, and applying it didn't change the weight.

If we change the order of events, E2 gains some use back:

$ perl ex24_01run.pl -n tab24_02.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E3          ,E2          ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E3             ,E2             ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E2, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455

Rather surprisingly, the result ended up more correct. Or maybe it's not that surprising: after all, AdaBoost is designed to gain more information from available data.

Yet another interesting result gets produced by repeating the same pair of events multiple times:

$ perl ex24_01run.pl -n tab24_03.txt in24_03_01.txt 
Original training cases:
!      ,       ,E01         ,E02         ,E11         ,E12         ,E21         ,E22         ,E31         ,E32         ,E41         ,E42         ,E51         ,E52         ,E61         ,E62         ,E71         ,E72         ,E81         ,E82         ,E91         ,E92         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E01            ,E02            ,E11            ,E12            ,E21            ,E22            ,E31            ,E32            ,E41            ,E42            ,E51            ,E52            ,E61            ,E62            ,E71            ,E72            ,E81            ,E82            ,E91            ,E92            ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,0.49694^0.49526,0.49916^0.49946,0.49990^0.49985,0.49997^0.49998,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,0.50306^0.50474,0.50084^0.50054,0.50010^0.50015,0.50003^0.50002,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
--- Applying event E01, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E02, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E11, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455
--- Applying event E12, c=1.000000
hyp1    w=0.18831 p=0.54242
hyp2    w=0.15886 p=0.45758
--- Applying event E21, c=1.000000
hyp1    w=0.09400 p=0.54159
hyp2    w=0.07956 p=0.45841
--- Applying event E22, c=1.000000
hyp1    w=0.04699 p=0.54149
hyp2    w=0.03979 p=0.45851
--- Applying event E31, c=1.000000
hyp1    w=0.02349 p=0.54147
hyp2    w=0.01990 p=0.45853
--- Applying event E32, c=1.000000
hyp1    w=0.01175 p=0.54146
hyp2    w=0.00995 p=0.45854
...

--- Applying event E92, c=1.000000
hyp1    w=0.00000 p=0.54146
hyp2    w=0.00000 p=0.45854

AdaBoost converges pretty quickly to the weights that make both events of the pair get the probability 0.5 and stop any further changes.

This seems to work, although it needs more testing. One more limitation of this implementation is that it works only with the events that have the value of either 0 or 1, not anything in between. I'll talk more about this in the next post.

Wednesday, February 22, 2017

A better explanation of the Bayes-AdaBoost connection

After some re-reading, my previous posts on the Bayesian interpretation of AdaBoost have been kind of cryptic. It's no wonder because I've been writing them as I was understanding things, so they came out as more of a personal reminder rather than a general explanation. But I've finally gotten around to writing another version as a whitepaper, one that goes all the way from scratch and hopefully is much more understandable:

https://sourceforge.net/p/exbayes/code/HEAD/tree/whitepapers/AdaboostBayes.pdf?format=raw

Monday, February 20, 2017

Bayesian examples now on SourceForge

I've been working on one more post on the Bayes-AdaBoost cross-over, and I've finally got tired of pasting the substantial examples right in the post. So I've created a SourceForge project for them and uploaded all the examples there:

https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/

The proper SVN checkout recipe from there is:

svn checkout svn://svn.code.sf.net/p/exbayes/code/ exbayes

Friday, January 6, 2017

Bayesian networks & neural networks

I've been recently reading about the neural networks. By the way, here are some introductory links:

I can most highly recommend this document that explains in a very clear and simple way how the training of the neural networks through the backpropagation works:
http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf

A simple introductory series of 6 (and maybe growing to more) articles starting with this one:
https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2

Some other links:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://colah.github.io/posts/2015-01-Visualizing-Representations/
http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/

Also, apparently the father of the deep neural networks is G.E. Hinton, and you may also want to search for the articles by Harry Shum. Hinton's home page is:
https://www.cs.toronto.edu/~hinton/
It seems to have a bunch of links to his courses, but I haven't looked at them yet.


As you can see from the introductory reading, each neuron in a neural network is a pretty simple machine: it takes some input values, multiplies them by some coefficients, adds the results up, and then passes the result through a nonlinear (usually some kind of a sigmoid) function. The whole thing can be written as an expression:

result = nonlinear( sum [for inputs i] (input[i] * C[i]) )

The nonlinear part is pretty much what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. pull it up to 1; if it's below some equal or lower level, we reject it, i.e. pull it down to 0; and if these levels are not the same and the probability is in the middle, then we leave it as some value in the middle, probably modified in some way from the original value.

The sum part is pretty much the same as the sum done in AdaBoost. AdaBoost does the sum of logarithms. And I've shown in the previous posts that this sum can be converted to a logarithm of a product, and then the product can be seen as a Bayesian computation expressed as chances, and the logarithm being a part of the decision-making process that converts the resulting chance value to a positive-or-negative value. So we can apply the same approach to the sum in the neuron, say that the values it sums up are logarithms, convert it to the logarithm of a product, make the logarithm a part of the final nonlinear function, and then the remaining product can be seen as a Bayesian computation on chances.

This pretty much means that a neuron can be seen as a Bayesian machine.

And guess what, apparently there is also such a thing as a Bayesian network. There people take multiple Bayesian machines, and connect the results of one set of machines as the input events to the Bayesian machines of the next level. For all I can tell, the major benefit is the handling of the problem where some hypotheses are indicated by a XOR of the input events: similarly to the splitting of the hypotheses into two and then merging them afterwards, like I've shown before, but with the logic folded into the Bayesian computation of the second level instead of the external arithmetic.

But if a neuron can be seen as a Bayesian machine then the neural networks can also be seen as the Bayesian networks! The only difference being that in the Bayesian networks the first-level (and in general intermediate-level) hypotheses are hand-picked while the neural networks find these intermediate hypotheses on their own during the training.

I wonder if this is something widely known, or not known, or known to be wrong?