I highly recommend this document, which explains very clearly and simply how neural networks are trained through backpropagation:

http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf

A simple introductory series of 6 (and maybe growing to more) articles starting with this one:

https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2

Some other links:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://colah.github.io/posts/2015-01-Visualizing-Representations/

http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/

Also, apparently the father of the deep neural networks is G.E. Hinton, and you may also want to search for the articles by Harry Shum. Hinton's home page is:

https://www.cs.toronto.edu/~hinton/

that seems to have a bunch of links to his courses, but I haven't looked at them yet.

As you can see from the introductory reading, each neuron in a neural network is a pretty simple machine: it takes some input values, multiplies them by some coefficients, adds the results up, and then passes the sum through a nonlinear function (usually some kind of sigmoid). The whole thing can be written as an expression:

result = nonlinear( sum [for inputs i] (input_i * C_i) )
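The neuron expression above can be sketched in a few lines of Python. This is only an illustration: the logistic sigmoid is assumed as the nonlinear function, and the input and coefficient values are made up.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, coefficients, nonlinear=sigmoid):
    """result = nonlinear( sum over i of (input_i * C_i) )"""
    if len(inputs) != len(coefficients):
        raise ValueError("inputs and coefficients must have the same length")
    total = sum(x * c for x, c in zip(inputs, coefficients))
    return nonlinear(total)

# Hypothetical values, chosen only to exercise the function.
result = neuron([1.0, 0.0, 1.0], [0.5, -1.0, 2.0])
print(result)
```

The nonlinear function is passed as a parameter so that an arctangent or a ReLU could be swapped in without touching the weighted sum.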

The nonlinear part is pretty much what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. pull it up to 1; if it's below some equal or lower level, we reject it, i.e. pull it down to 0; and if these levels are not the same and the probability falls between them, we leave it as some value in the middle, probably modified in some way from the original value.

The sum part is pretty much the same as the sum done in AdaBoost. AdaBoost computes a sum of logarithms. I've shown in the previous posts that this sum can be converted to a logarithm of a product, that the product can be seen as a Bayesian computation expressed as chances, and that the logarithm is a part of the decision-making process that converts the resulting chance value to a positive-or-negative value. So we can apply the same approach to the sum in the neuron: say that the values it sums up are logarithms, convert the sum to the logarithm of a product, make the logarithm a part of the final nonlinear function, and then the remaining product can be seen as a Bayesian computation on chances.
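The conversion from a sum of logarithms to a product of chances can be checked numerically. The sketch below assumes that "chance" means odds, p / (1 - p), and uses made-up chance multipliers. It also assumes the logistic sigmoid as the final nonlinear function, in which case the sigmoid of the sum of logarithms is exactly the probability that corresponds to the product of the chances.

```python
import math

def chance(p):
    """Convert a probability to a chance (odds): p / (1 - p)."""
    return p / (1.0 - p)

def probability(c):
    """Convert a chance (odds) back to a probability: c / (1 + c)."""
    return c / (1.0 + c)

# Hypothetical chance multipliers contributed by three input events.
chances = [3.0, 0.5, 4.0]

# What the neuron does: add up the logarithms.
sum_of_logs = sum(math.log(c) for c in chances)

# The Bayesian view: multiply the chances, take the logarithm at the end.
product = math.prod(chances)
log_of_product = math.log(product)

# Folding the logarithm into the nonlinear function:
# sigmoid(log(c)) = 1 / (1 + 1/c) = c / (1 + c), the probability for chance c.
via_sigmoid = 1.0 / (1.0 + math.exp(-sum_of_logs))
as_probability = probability(product)

print(sum_of_logs, log_of_product)
print(via_sigmoid, as_probability)
```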

This pretty much means that a neuron can be seen as a Bayesian machine.

And guess what, apparently there is also such a thing as a Bayesian network. There people take multiple Bayesian machines and connect the results of one set of machines as the input events to the Bayesian machines of the next level. As far as I can tell, the major benefit is the handling of the case when some hypotheses are indicated by a XOR of the input events, similar to splitting the hypotheses into two and then merging them afterwards like I've shown before, but with the logic folded into the Bayesian computation of the second level instead of being done in external arithmetic.

But if a neuron can be seen as a Bayesian machine then the neural networks can also be seen as the Bayesian networks! The only difference being that in the Bayesian networks the first-level (and in general intermediate-level) hypotheses are hand-picked while the neural networks find these intermediate hypotheses on their own during the training.

I wonder if this is something widely known, or not known, or known to be wrong?

Sergey, long time no see! The problem with the above is that neurons do not have meanings in themselves. If you look at the weights of a particular neuron, they don't tell you anything - the same is not true of a Bayesian network. Bayesian networks impart meaning in their structure and node-to-node relationships (dependence). Neural networks do not (the weight matrix of hidden layer X has no obvious relationship to hidden layer Y).

I suppose you could think of the output layer of a fully connected NN using a softmax activation function as kinda Bayesian since the final values are indeed probabilities (beliefs) of classifications. But there is no way to walk those values back, which is why it is sometimes hard to explain why a NN behaves the way it does. Again, the same is not true for Bayesian networks.

Alex (remember SmallFoot?)

Hi Alex! What have you been up to?

The weights in a particular neuron do represent the Bayesian probabilities of its inputs, only slightly indirectly, as logarithms of the chances. But they can be converted to the probability values, as described in http://babkin-cep.blogspot.com/2017/02/a-better-explanation-of-bayes-adaboost.html and http://babkin-cep.blogspot.com/2017/06/neuron-in-bayesian-terms-part-2.html . That's kind of the point I'm making, that you can just convert them back and forth, that they are equivalent. It's kind of funny that whatever AI approach you take (decision trees, Bayesian, boosting, neural networks), they all seem to be pretty much variations of each other.

In the simple NNs (like those in the Tensorflow examples) you can actually get interesting insights by looking at the heatmap graphs of the functions of the individual neurons. You can see that this neuron in the first layer matches this feature, and that neuron matches that feature, and then they get combined in a second-layer neuron. But in a large NN I think its sheer size would make it difficult to look at each single neuron.

Well a couple of things:

1) FYI: Sigmoid isn't really used anymore in modern-day NNs (ReLUs or leaky ReLUs seem more popular).

2) I don't think you can convert them back and forth as you describe. That's my point. With NN you can arrive at different weights depending on your objective function and type of gradient descent (you can arrive at different local minima depending on how you initialize your weights and your learning rate/step size, etc.). Put simply, you're climbing down a slope using partial derivatives. What is the Bayesian equivalent of that? That's what I don't quite understand in your analysis. You are looking at one node out of context and assigning chances. Fine, but once you add multiple layers, I think this all falls apart very fast.

I think this paper would be of interest to you (using Bayesian Backprop):

https://arxiv.org/pdf/1505.05424.pdf

3) Yeah, exactly. Simple NNs are easier to follow. But once your parameters expand ("curse of dimensionality" and all that), as I said, it falls apart.

As you can tell, ML/DL has been a hobby of mine for the past few months. I took the graduate classes at Columbia via edX last year and just finished the Udacity courses on AI and DL. I love this field and am looking to break into it (even though my background is as you know in kernel and system programming). Have any advice? (you seem to be deeply involved with it).

Sergey, here is a post that talks about it:

https://towardsdatascience.com/understanding-objective-functions-in-neural-networks-d217cb068138

Maybe that is what you are saying?

Oops, I wanted to answer and forgot.

1) From my experiments with Tensorflow, Relu is actually not a very good function; the arctangent or sigmoid work better. Relu is good for linear regressions but not as much for decisions. But there are a couple of reasons why Relu can be used to the same effect:

(a) NNs have the offset in each node, so with the cut-off of the negative arguments in Relu, you can add together multiple Relu functions to get a close approximation of a sigmoid. I.e. if you have two Relu functions, one with the offset of 0 and slope of 2, and another one with offset of 0.4 and slope of -1.67, you'd get a broken line that goes from (0, 0) to (0.4, 0.8) to (1, 1), because the second function would kick in in the range [0.4, 1] and the slope in this range will become the sum 2-1.67=0.33. Such addition would come naturally because that's what the second layer does: adds up the outputs of the first layer with coefficients.

(b) The other reason why the cutting-off of the negative side in Relu is useful: basically, it naturally supports the multiple-choice values as described in http://babkin-cep.blogspot.com/2016/08/bayes-23-enumerated-events-revisited.html and its references (although I'm not sure if I wrote it clearly).
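The two-Relu construction from point (a) above can be checked with a short sketch. It assumes that "offset" means the input value at which each Relu starts to rise, and it uses the slopes quoted in the comment (2 and -1.67), so the end point comes out close to, not exactly, 1.

```python
def relu(x):
    """Rectified linear unit: zero for negative arguments, identity otherwise."""
    return max(0.0, x)

def broken_line(x):
    """Sum of two Relu terms, as a second-layer neuron would compute it.

    First term:  offset 0,   slope 2.
    Second term: offset 0.4, slope -1.67 (kicks in only for x > 0.4).
    """
    return 2.0 * relu(x) - 1.67 * relu(x - 0.4)

# Sample the line at the corner points and in between.
for x in (0.0, 0.2, 0.4, 0.7, 1.0):
    print(f"x={x:.1f} -> {broken_line(x):.3f}")
```

Beyond x = 1 the line keeps rising with the slope of 0.33, so a third Relu term with a negative slope would be needed to flatten it the way a sigmoid flattens.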

2) I think the Bayesian equivalent is computing the conditional probabilities. If you take a one-level NN, the result of the gradient descent would be the same as computing the averages that give you probabilities. For example, if you have 70 training cases that pull the result up and 30 cases that pull it down, you'd end up with the coefficient of 0.7, which also happens to be the probability. AdaBoost does a variation of the same thing: it computes the chances from the gradient descent. But when you have 2 or more layers, the gradient descent does two things at once: it decides what the intermediate-level events should be and computes the conditional probabilities.
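The 70/30 example from point 2 above can be reproduced with a tiny gradient descent. This is a sketch under assumptions that are not in the comment: a single coefficient with a constant input of 1, no nonlinear function, a mean-squared-error loss, and an arbitrary learning rate and step count.

```python
# 70 training cases that pull the result up (target 1)
# and 30 cases that pull it down (target 0).
targets = [1.0] * 70 + [0.0] * 30

def train(targets, learning_rate=0.1, steps=200):
    """Fit one coefficient c by gradient descent on the mean of (c - y)^2."""
    c = 0.0
    n = len(targets)
    for _ in range(steps):
        # Derivative of the mean squared error with respect to c.
        gradient = sum(2.0 * (c - y) for y in targets) / n
        c -= learning_rate * gradient
    return c

coefficient = train(targets)
average = sum(targets) / len(targets)
print(coefficient, average)
```

With this loss the gradient is zero exactly when the coefficient equals the average of the targets, which is why the descent settles on the same value as the plain frequency count.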

Sorry, I don't have any advice, I'm in kind of the same position only worse: I haven't taken any official grad classes :-) I've been thinking that taking such classes might provide some connections, but it looks like they didn't? I have some links to subreddits on this subject that might be a good place to ask around, but I haven't used Reddit much yet at all :-)

Thanks for the links, I'll read them!

I'd like to make a correction:

Technically speaking, provided you assume your error is Gaussian with zero mean and a common variance (iid etc.), Bayesian learning will show that minimizing the sum of the squared errors is equivalent to finding the maximum likelihood - and that's what gradient descent is doing. So I suppose you're right that there is a connection between NN and Bayesian models.
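The equivalence mentioned in the correction above can be written out in a few lines. The symbols are assumptions introduced here for illustration: y_i are the training targets, f(x_i; w) is the network output for input x_i with weights w, and sigma^2 is the common variance of the error.

```latex
% Each target is the model output plus iid zero-mean Gaussian noise:
p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{\bigl(y_i - f(x_i; w)\bigr)^2}{2\sigma^2} \right)

% Log-likelihood of n independent training cases:
\log L(w) = -\frac{n}{2}\log\bigl(2\pi\sigma^2\bigr)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i; w)\bigr)^2

% The first term and the factor 1/(2\sigma^2) do not depend on w, so:
\arg\max_w \log L(w) = \arg\min_w \sum_{i=1}^{n} \bigl(y_i - f(x_i; w)\bigr)^2
```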

As far as I can tell, sigmoid is just not used that much anymore in current NN models. Maybe they should? LOL.