I can most highly recommend this document that explains in a very clear and simple way how the training of the neural networks through the backpropagation works:

http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf

A simple introductory series of 6 (and maybe growing to more) articles starting with this one:

https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2

Some other links:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://colah.github.io/posts/2015-01-Visualizing-Representations/

http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/

Also, the apparently the father of the deep neural networks is G.E. Hinton, and you may also want to search for the articles by Harry Shum. Hinton's home page is:

https://www.cs.toronto.edu/~hinton/

that seems to have a bunch of the links to his courses but I haven't looked at them yet.

As you can see from the introductory reading, each neuron in a neural network is a pretty simple machine: it takes some input values, multiples them by some coefficients, adds the results up, and then passes the result through a nonlinear (usually some kind of a sigmoid) function. The whole thing can be written as an expression:

result = nonlinear( sum [for inputs i] (input

_{i}* C

_{i}) )

The nonlinear part is pretty much what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value.

The sum part is pretty much the same as the sum done in AdaBoost. AdaBoost does the sum of logarithms. And I've shown in the previous posts that this sum can be converted to a logarithm of a product, and then the product can be seen as a Bayesian computation expressed as chances, and the logarithm being a part of the decision-making process that converts the resulting chance value to a positive-or-negative value. So we can apply the same approach to the sum in the neuron, say that the values it sums up are logarithms, convert it to the logarithm of a product, make the logarithm a part of the final nonlinear function, and then the remaining product can be seen as a Bayesian computation on chances.

This pretty much means that a neuron can be seen as a Bayesian machine.

And guess what, apparently there is also such a thing as a Bayesian network. There people take multiple Bayesian machines, and connect the results of one set of machines as the input events to the Bayesian machines of the next level. For all I can tell, the major benefit of the handling of the problem when some hypotheses are indicated by a XOR of the input events, similarly to the splitting of the hypotheses into two and then merging them afterwards like I've shown before but instead of the external arithmetics the logic being folded into the Bayesian computation of the second level.

But if a neuron can be seen as a Bayesian machine then the neural networks can also be seen as the Bayesian networks! The only difference being that in the Bayesian networks the first-level (and in general intermediate-level) hypotheses are hand-picked while the neural networks find these intermediate hypotheses on their own during the training.

I wonder if this is something widely known, or not known, or known to be wrong?