When training a large neural network, overfitting is common, and regularization techniques such as L2 help. There is a classical argument that the best possible prediction would be the average of the predictions made by many different NN architectures, each trained with its own set of parameters. Training them all is not feasible: a network with n neurons has on the order of 2^n possible sub-architectures. Dropout is a practical way to get close to this ideal averaging.

In this technique, each neuron during forward propagation is kept with some “keep_probability” (i.e., dropped with probability 1 − keep_probability). A single training example is therefore passed through a different sampled sub-network every time it goes through forward propagation. Normally, a neuron can learn something very specific about an example and come to depend heavily on particular activation values from the previous layer. Because the layers are now unstable, each neuron is forced to learn the training example more holistically rather than latching onto one specific feature, which keeps the weights spread out with low values. For each example, a sampled sub-network ends up almost but not completely trained, depending on the number of epochs.

Suppose layer 3 of the network has 3 neurons and “keep_probability” is 0.8. Then each unit in the 3rd layer is kept with probability 0.8 and eliminated with probability 0.2. The code looks like this:

# This creates a boolean mask the same shape as a3; each entry is True with probability keep_probability

d3 = np.random.rand( a3.shape[0], a3.shape[1] ) < keep_probability

a3 = a3 * d3 # This zeroes out the units whose random draw was >= keep_probability (about 20% of them)

a3 = a3 / keep_probability # This scales the surviving units back up (the “inverted” part)
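Putting those lines together, here is a minimal runnable sketch of the whole dropout step for one layer (the activations a3 are made up for illustration):

```python
import numpy as np

np.random.seed(0)                    # for reproducibility
keep_probability = 0.8

a3 = np.random.randn(3, 5)           # pretend activations: 3 units, 5 examples

# Boolean mask: each entry is True with probability keep_probability
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_probability

a3_drop = a3 * d3                    # dropped units become 0
a3_inv = a3_drop / keep_probability  # inverted-dropout scaling of the survivors

print(a3_inv)
```

Each run keeps roughly 80% of the entries; the kept ones are scaled up by 1/0.8 = 1.25 so the expected value of each unit is unchanged.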

The intuition behind the last line is harder to grasp. During training, each example is effectively put through a different model due to dropout, so each model makes a different prediction for the same example. Ideally we would average all those predictions, but doing that explicitly at test time would be expensive. Instead, there is a clever shortcut using the “Expected Value” (average). Consider a NN with three inputs X1, X2, X3, two hidden layers of three units each, and a single output F:

First consider dropout without inverting. For layer 1, the value of its first unit A1 is

A1 = W1*X1 + W2*X2 + W3*X3 # The linear combination, without any activation. Call this val1. Similarly A2 = val2 and A3 = val3.

Now, since each unit is kept with keep_probability, its value is either the actual linear combination shown above or zero. So the expected value of A1 is (keep_probability × actual value) + ((1 − keep_probability) × 0). With keep_probability = 0.8,

E(A1) = 0.8*( val1 ) + 0.2*( 0 ), i.e. E(A1) = 0.8*val1

Likewise E(A2) = 0.8*val2 and E(A3) = 0.8*val3.
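This expected-value claim is easy to check numerically. A sketch with a made-up value for val1: sample the unit many times under dropout and average.

```python
import numpy as np

np.random.seed(1)
keep_probability = 0.8
val1 = 2.5                      # pretend value of W1*X1 + W2*X2 + W3*X3

# Each sample: the unit is val1 when kept, 0 when dropped
samples = val1 * (np.random.rand(100_000) < keep_probability)
print(samples.mean())           # close to 0.8 * 2.5 = 2.0
```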

Now for the 2nd layer, call its units B1, B2, B3 (re-using the names A1, A2, A3 for both layers gets confusing), and note that in this toy example dropout is applied only to the layer-1 units. For the first of them,

E(B1) = E( W1*A1 + W2*A2 + W3*A3 )

= E(W1*A1) + E(W2*A2) + E(W3*A3) # by linearity of expectation

= W1*E(A1) + W2*E(A2) + W3*E(A3) # the weights are constants, so E(W*A) = W*E(A)

= W1*0.8*( val1 ) + W2*0.8*( val2 ) + W3*0.8*( val3 ) # these Ws are the weights of the 2nd layer

The same holds for B2 and B3, so every term of every E(B) carries the factor 0.8. Now for the final layer,

E(F) = E( W1*B1 + W2*B2 + W3*B3 ) # these Ws are the weights of the final layer

= W1*E(B1) + W2*E(B2) + W3*E(B3)

and since each E(B) is exactly 0.8 times the value that unit would have without dropout, E(F) is 0.8 times the output of the plain, dropout-free network.
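This bookkeeping can also be checked numerically. A sketch with made-up layer-1 values and 2nd-layer weights: drop the three units independently with keep probability 0.8 many times, and the average 2nd-layer pre-activation comes out as W1*0.8*val1 + W2*0.8*val2 + W3*0.8*val3.

```python
import numpy as np

np.random.seed(2)
keep_probability = 0.8
vals = np.array([1.0, -2.0, 3.0])    # pretend val1, val2, val3
W = np.array([0.5, 1.5, -1.0])       # pretend 2nd-layer weights

# 200,000 independent dropout masks over the three layer-1 units
masks = np.random.rand(200_000, 3) < keep_probability
outputs = (masks * vals) @ W         # one 2nd-layer pre-activation per sample

print(outputs.mean())                # close to W @ (0.8 * vals) = -4.4
```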

So the expected value of the final output, which is the average over all the sampled models, is just keep_probability times the plain network's output. To match it at test time (without inverting), we run the full network and multiply the activations by the keep probabilities. This is dropout without inverting. With inverted dropout we avoid that extra multiplication at test time by instead dividing by keep_probability during training, as in the code above. So instead of E(A1) = 0.8*( val1 ) in the first layer, we now get E(A1) = val1, because we divided by 0.8. The 0.8 disappears from every later layer in the same way, and the expected final output becomes

E(F) = W1*( B1 ) + W2*( B2 ) + W3*( B3 ) # where B1, B2, B3 are the plain, dropout-free values of the 2nd-layer units

This is still the average over all the sampled models, even though no extra factors appear in the equation, and it is exactly what a plain forward pass computes at test time. That means we don't need to do anything extra during test time. This is called “Inverted Dropout”.
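As a final check, here is a hypothetical two-layer linear toy net (the weights are random, made up for this sketch): the average training-time output under inverted dropout matches the plain forward pass used at test time, with no rescaling anywhere.

```python
import numpy as np

np.random.seed(3)
keep_probability = 0.8
x = np.array([1.0, 2.0, -1.0])
W1 = np.random.randn(3, 3)           # layer-1 weights (made up)
W2 = np.random.randn(3)              # output weights (made up)

def forward_train(x):
    a1 = W1 @ x
    d1 = np.random.rand(a1.shape[0]) < keep_probability
    a1 = a1 * d1 / keep_probability  # inverted dropout on layer 1
    return W2 @ a1

def forward_test(x):
    return W2 @ (W1 @ x)             # plain pass: no dropout, no rescaling

avg_train = np.mean([forward_train(x) for _ in range(200_000)])
print(avg_train, forward_test(x))    # the two numbers should be close
```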