Mishka -- Regularization in intrinsically sparse networks -- March 10, 2019


In 2017, Mocanu et al. published a remarkable neuroevolutionary scheme for training sparse neural nets (arXiv:1707.04780, subsequently appearing in Nature Communications 9, 19 June 2018; open-source repository: https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks).

One starts by initializing the sparse layers with random connectivity and then trains by repeating the following 2-step cycle a number of times:

1. Train the network as usual (e.g., for one epoch of SGD) with the current sparse connectivity.
2. In each sparse layer, remove a fraction of the connections whose weights are closest to zero, and add the same number of new connections at randomly chosen empty positions.
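
As a rough illustration (this is my own minimal PyTorch-style sketch, not the authors' code; the names set_rewire, weight, mask, and zeta are hypothetical, and the initialization of newly grown weights is just one possible choice), the prune-and-regrow step for a single sparse layer could look like this:

    import torch

    def set_rewire(weight, mask, zeta=0.3):
        # One rewiring step: prune the fraction `zeta` of surviving connections
        # with the smallest magnitudes, then grow the same number of new
        # connections at random empty positions, initialized to small values.
        with torch.no_grad():
            active = mask.bool()
            n_active = int(active.sum())
            n_prune = int(zeta * n_active)

            # prune: zero out the n_prune active weights closest to zero
            mags = weight.abs().masked_fill(~active, float('inf'))
            prune_idx = torch.topk(mags.flatten(), n_prune, largest=False).indices
            mask.view(-1)[prune_idx] = 0.0
            weight.view(-1)[prune_idx] = 0.0

            # grow: activate n_prune randomly chosen currently-empty positions
            empty_idx = (mask.view(-1) == 0).nonzero(as_tuple=True)[0]
            grow_idx = empty_idx[torch.randperm(len(empty_idx))[:n_prune]]
            mask.view(-1)[grow_idx] = 1.0
            weight.view(-1)[grow_idx] = torch.randn(n_prune, device=weight.device) * 0.01
        return weight, mask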

The work was done for feedforward neural nets and for restricted Boltzmann machines. For restricted Boltzmann machines, the authors also demonstrated the system's ability to learn an advantageous network topology, forming a higher density of connections in the active zone of the input and a lower density of connections in the uninformative margins.


In November 2018, Michael Klear implemented this scheme in PyTorch for feedforward neural nets and wrote a related blog post at https://towardsdatascience.com/the-sparse-future-of-deep-learning-bce05e8e094a

He demonstrated that, for feedforward neural nets, the system likewise learns a network topology that distinguishes the active zone from the margin.


I became interested in this implementation in February 2019, because I wanted to experiment with this neuroevolutionary scheme, and because PyTorch is my favorite machine learning platform at the moment.

When I looked more closely, I noticed the following strange effect: the results reported by Michael Klear for feedforward neural nets showed the inverse pattern of network topology learning compared to the results reported by Mocanu et al. for restricted Boltzmann machines. Namely, in this case the density of connections was lower in the active zone and higher in the margin.

We will call the pattern of network topology learning demonstrated by Mocanu et al. positive learning, and we will call the inverse pattern emerging in the runs performed by Michael Klear negative learning. ("Negative" here does not a priori imply "bad", although, as we shall see below, in this series of experiments negative learning is usually associated with some overfitting/failure to generalize.)
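
One way to make this distinction concrete (a hypothetical diagnostic of my own, not code from either implementation) is to count, for the first sparse layer of an MNIST network, how many connections attach to each input pixel, and to compare the central region of the 28x28 image with its border:

    import torch

    def zone_densities(mask, img_side=28, border=6):
        # `mask` is the 0/1 connectivity mask of the first sparse layer,
        # with shape (n_hidden, img_side * img_side).
        per_pixel = mask.sum(dim=0).view(img_side, img_side)  # connections per input pixel
        center = per_pixel[border:-border, border:-border]
        center_density = center.sum() / center.numel()
        margin_density = (per_pixel.sum() - center.sum()) / (per_pixel.numel() - center.numel())
        return center_density.item(), margin_density.item()

Under these (assumed) definitions, positive learning corresponds to a higher density in the center than in the margin, and negative learning to the opposite.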


The main conjecture I made was that the negative learning effect was related to the absence of regularization in the original code. The logic I followed was that, in the absence of regularization, once connections originating in the outlying areas are created, their weights tend to remain unchanged by training (the inputs there are nearly constant, so the corresponding gradients are close to zero). Meaningful connections, in contrast, are changed by training, occasionally become small, and therefore get eliminated more frequently.

On the other hand, if one adds a sufficiently strong regularization encouraging smaller weights, one would expect the connections which are not informative for the result to decrease, on average, more rapidly than the connections which are informative.
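
For example (a sketch under my own assumptions, not the code actually used in the experiments), one simple way to add such a regularization in PyTorch is an explicit L1 penalty over the weights of the sparse layers; an L2 penalty via the optimizer's weight_decay parameter would be another option:

    import torch

    def loss_with_l1(criterion, output, target, sparse_layers, l1_coeff=1e-5):
        # Add an L1 penalty over the sparse layers' weights, nudging
        # uninformative connections toward zero so that the prune step
        # tends to remove them first.
        base_loss = criterion(output, target)
        l1 = sum(layer.weight.abs().sum() for layer in sparse_layers)
        return base_loss + l1_coeff * l1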

The experiments I have performed seem to confirm this conjecture. Namely, when one adds sufficiently strong regularization, negative learning is replaced by positive learning, and the stronger the regularization, the more pronounced this effect becomes.

For further details see https://github.com/anhinga/synapses/blob/master/regularization.md.


Mishka --- March 10, 2019
