After experimenting with the "intelligent design", I've tried training the same network from a random initial state to do the same thing: compute the square function. Yes, it's not a typical usage, but I think it makes for a surprising example: automatically approximating an analog function is quite impressive.
And it worked. Not every time, depending on the initial random state, but when it worked, it worked quite decently. It usually worked better with more neurons. My guess is that with more neurons there is a higher chance that some of them will land in a good state. Some of them end up effectively dead, but if there were more to start with, it's more likely that a useful number of them will still be good. The ReLU function I've used has the nasty habit of having a zero derivative in its lower half, and if a neuron gets there on all inputs, it becomes dead.
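Just to make the dead-neuron mechanics concrete, here is a tiny illustration in Python/NumPy (not my actual code, just a sketch of the effect):

    import numpy as np

    def relu(x):
        # ReLU passes positive values through and clamps negatives to 0.
        return np.maximum(0.0, x)

    def relu_deriv(x):
        # The derivative is 1 for positive inputs and 0 otherwise, so a
        # neuron whose pre-activation is negative on every training
        # example gets no gradient at all -- it's "dead".
        return (x > 0.0).astype(x.dtype)

    # A dead neuron: negative pre-activation on all inputs means both
    # the output and the gradient are zero everywhere.
    pre_activation = np.array([-0.5, -1.2, -0.1])
    print(relu(pre_activation))        # [0. 0. 0.]
    print(relu_deriv(pre_activation))  # [0. 0. 0.]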
Which brought up the question: how do we do a better initialization? Yes, I know that bad training from a bad random initialization is normal, but I wondered whether it could be done better. I've found this article: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/. Initialization-wise it boils down to initializing the weights between the neurons in the range +-1/sqrt(num_of_inputs), and initializing the bias to 0.1. There is also an idea of treating the ReLU derivative on the negative side as a small non-zero value, but I don't want to touch that yet. The square root probably comes from the variance of the standard distribution. The bias of 0.1 "pulls" the value up, so that even if the weights are mostly or all negative, there is still a chance of the neuron producing a positive output that would pass through ReLU. This did improve things, but only slightly, and I would still get a bad training pretty regularly.
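In code, that recipe looks roughly like this (a sketch; the layer sizes here and the use of a uniform distribution are my assumptions):

    import numpy as np

    def init_layer(num_inputs, num_neurons, rng):
        # Weights uniform in +-1/sqrt(num_inputs); biases fixed at 0.1 so
        # that a neuron has a chance of a positive output even if most of
        # its weights happen to come out negative.
        limit = 1.0 / np.sqrt(num_inputs)
        weights = rng.uniform(-limit, limit, size=(num_neurons, num_inputs))
        biases = np.full(num_neurons, 0.1)
        return weights, biases

    rng = np.random.default_rng()
    w1, b1 = init_layer(num_inputs=1, num_neurons=8, rng=rng)  # hidden layer
    w2, b2 = init_layer(num_inputs=8, num_neurons=1, rng=rng)  # output layer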
The next thing I noticed was that things tend to go better if I do the first few rounds of training (where a round is one presentation of the whole set of input data) at a higher rate. The recommendation from the article on backpropagation was to use a training rate of 1%. I've found that if I do the first few rounds with a rate of 10%, I get fewer bad trainings. I've eventually settled on using the high rate for the first 50 rounds out of 1000.
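The schedule itself is trivial, something like this (train_one_round() is just a placeholder for whatever does one pass of backpropagation over the training set):

    HIGH_RATE = 0.10   # 10% for the first rounds
    LOW_RATE = 0.01    # 1% afterwards, as the article recommends
    HIGH_RATE_ROUNDS = 50
    TOTAL_ROUNDS = 1000

    def train_one_round(rate):
        ...  # placeholder: one pass of backpropagation over the training set

    for round_no in range(TOTAL_ROUNDS):
        rate = HIGH_RATE if round_no < HIGH_RATE_ROUNDS else LOW_RATE
        train_one_round(rate)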
BTW, I've been worrying that when a weight gets initialized with a certain sign, it would always keep that sign, because of ReLU having a zero derivative in its lower part. The experiment has shown that this is not the case: the weights do change signs, and not all that rarely. I still need to look at the formulas to understand what I missed in them.
What next? It would be good to reclaim the dead neurons and put them to good use. Suppose we do a round of training with the whole training set. If a neuron never produces a positive value on the whole training set, it's effectively dead, and can be reinitialized by randomization. So I've done this. It does add some overhead, but most of it happens infrequently, at most once per round; the only overhead added to normal training is incrementing the usage counters. Observing the effects, the weird part is that it's not like "once you get all the neurons into a live state, they stay alive". Yes, the dead neurons tend to be found near the start of training, but not necessarily on consecutive rounds, and once in a while they pop up out of nowhere on a round like 300 or 500. I've added the reclamation of dead neurons after each round, and this made a big improvement. I was even able to shrink the network to the same size as in the "intelligent design" version and get good results most of the time, normally better than from my handmade weights (although not as good as handmade + training). But not all the time.
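A sketch of how the reclamation works, give or take the details (continuing with the NumPy names from the sketches above; the per-neuron activation counters are the only thing touched during normal training):

    def reclaim_dead_neurons(weights, biases, activation_counts, rng):
        # A neuron that never produced a positive output during the whole
        # round is effectively dead: reinitialize it with random weights.
        limit = 1.0 / np.sqrt(weights.shape[1])
        for i, count in enumerate(activation_counts):
            if count == 0:
                weights[i] = rng.uniform(-limit, limit, size=weights.shape[1])
                biases[i] = 0.1

    # During the forward pass only the counters get incremented:
    #     activation_counts += (layer_output > 0.0)
    # and after each round:
    #     reclaim_dead_neurons(w1, b1, activation_counts, rng)
    #     activation_counts[:] = 0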
Looking at the weights in the unsuccessful cases, the problem seems to be that all the neurons in this small model get initialized too similarly to each other, and then the training pushes them all in the same direction. This also seems to be the reason why the larger training rate in the early rounds made an improvement: it helps the neurons diverge. What could I do to make the neurons more different? Well, they all got the fixed bias weight of 0.1. How about changing that, now that I have another way of reclaiming the dead neurons? I've made the bias weights random, and this reduced the frequency of the bad cases but didn't eliminate them.
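The change itself is tiny; the range of the random biases below is my guess rather than a carefully tuned value (again reusing rng from the earlier sketch):

    num_neurons = 8  # same hidden layer size as before
    # Instead of the fixed 0.1, draw each bias from a small random range,
    # so the neurons start out less similar to each other.
    biases = rng.uniform(-0.1, 0.1, size=num_neurons)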
Another thing commonly seen in the bad trainings is that the weights in the first layer tend to be pushed to a low absolute value. Essentially, the NN tries to approximate the result with almost a constant, and this approximation is not a very good one. This drives the weights in the second layer up to absolute values greater than 1, to compensate for that constness. What if I don't let the weights get out of the [-1, 1] range, will that push the lower layer to compensate? It did, the weights there became not so small, but the overall problem of the weights being driven towards 0 is not solved yet in my example. In an article I've read in IEEE Spectrum, they re-train networks on new input data by reinitializing the neurons that have weights of low absolute value. So maybe the same thing can be done during normal training.
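The clamping itself is just one extra step after each weight update (a sketch, using NumPy's in-place clip on the weights from the earlier sketches):

    # After applying the gradient step, keep the weights within [-1, 1].
    np.clip(weights, -1.0, 1.0, out=weights)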
P.S. Forgot to mention, I've also changed the initialization of the weights to keep them not too close to 0. With the upper limit on the absolute value given by the formula with the square root, I've also added a lower absolute limit at 0.1 of that.
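With the lower limit added, the weight initialization becomes something like this (a sketch; picking the magnitude and the sign separately is just my way of doing it):

    num_inputs, num_neurons = 1, 8  # hidden layer dimensions from the earlier sketch
    limit = 1.0 / np.sqrt(num_inputs)
    # Keep each weight's absolute value between 0.1*limit and limit,
    # so that no weight starts out too close to 0.
    magnitude = rng.uniform(0.1 * limit, limit, size=(num_neurons, num_inputs))
    sign = rng.choice([-1.0, 1.0], size=(num_neurons, num_inputs))
    weights = magnitude * sign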