Intuition behind dropout and inverted dropout as a regularization technique

When training a large NN, overfitting is common, and we can use regularization techniques such as L2 to fight it. In theory, the best possible prediction would be the average of the predictions made by every possible NN architecture, each trained with its own set of parameters. This is not feasible: a network with 'n' neurons has on the order of 2^n possible sub-architectures. Dropout is a practical way of getting closer to this ideal.

In this technique, each neuron is kept during forward propagation with some "keep_probability" (and dropped otherwise). So a single training example is passed through a different sampled sub-network each time it goes through forward propagation. Normally, during gradient descent, a neuron tends to latch onto something specific about the examples and to rely heavily on particular activation values from the previous layer. With dropout, any of those previous activations can disappear at random, so each neuron is forced to learn the training example more holistically instead of depending on one specific feature. As a result the weights stay small and spread out, which has a regularizing effect similar to L2. Each sampled sub-network is therefore only partially trained, depending on the number of epochs.

Suppose layer 3 of the NN has 3 neurons and "keep_probability" is 0.8. Then each unit in the 3rd layer has a 0.8 chance of being kept and a 0.2 chance of being eliminated. The code will look like this:

#d3 is a boolean mask with the same shape as a3; each entry is True with probability keep_probability

d3 = np.random.rand( a3.shape[0], a3.shape[1] ) < keep_probability

a3 = a3 * d3 #This zeroes out the dropped units (the entries where the mask is False, roughly 20% of them)

a3 = a3 / keep_probability #This scales up the surviving units so the expected value of a3 is unchanged
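Putting those three lines together, here is a minimal self-contained sketch of how the same idea might sit inside a training-time forward pass (the function name, shapes and values are my own and purely illustrative):

import numpy as np

def inverted_dropout(a, keep_probability):
    # Boolean mask, same shape as a: True (keep the unit) with probability keep_probability
    d = np.random.rand(a.shape[0], a.shape[1]) < keep_probability
    a = a * d                   # zero out the dropped units
    a = a / keep_probability    # scale up so the expected value of a is unchanged
    return a

# Hypothetical layer-3 activations: 3 units across 5 training examples
a3 = np.random.randn(3, 5)
a3 = inverted_dropout(a3, keep_probability=0.8)
# At test time the activations are used as-is: no mask and no scaling.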

The intuition behind that final division by keep_probability is the hardest part to grasp. During training, each example is effectively put through a different sampled model because of dropout, so each of these models makes a different prediction for the same example. Ideally we would average all of those predictions, but doing that explicitly at test time would be expensive. Instead we can use a clever argument based on the "Expected Value" (the average). Consider a small NN with inputs X1, X2, X3, two hidden layers of 3 units each, and a single output F.

First, consider dropout without the inversion (i.e. without the final division). For layer 1, the value of A1 is

A1 = W1*X1 + W2*X2 + W3*X3 #This is the linear combination without any activation function. Call this value val1; similarly A2 will be val2 and A3 will be val3.

Now, since each unit is kept with the given keep probability, its value is either the actual linear combination shown above (if it is kept) or zero (if it is dropped). So the expected value of A1 is (keep_probability * actual unit value) + ( [1 – keep_probability] * 0 ). Suppose keep_probability = 0.8,

So, E(A1) = 0.8*(val1) + 0.2*(0), i.e. E(A1) = 0.8*(val1)

Similarly, E(A2) = 0.8*(val2) and E(A3) = 0.8*(val3).
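A quick way to convince yourself of this is a Monte Carlo check: mask a fixed value many times (without any division) and compare the sample mean against 0.8 times that value. The number below is made up purely for illustration:

import numpy as np

np.random.seed(0)
keep_probability = 0.8
val1 = 2.5    # pretend this is the actual linear-combination value of A1

# Draw many independent dropout masks (no inversion) and average the masked values
samples = (np.random.rand(100000) < keep_probability) * val1
print(samples.mean())    # prints roughly 2.0, i.e. about 0.8 * val1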

Now move to the 2nd layer. Call its units B1, B2 and B3, so they are not confused with layer 1's A1, A2, A3. The expected value of B1 is the expected value of what it receives from the 1st layer:

E(B1) = E( W1*A1 + W2*A2 + W3*A3 ) #these Ws are the 2nd layer's weights

= E(W1*A1) + E(W2*A2) + E(W3*A3)

= W1*E(A1) + W2*E(A2) + W3*E(A3) #because each W is a constant, E(W*A) = W*E(A)

= W1*0.8*( val1 ) + W2*0.8*( val2 ) + W3*0.8*( val3 )

The same holds for B2 and B3. Now for the final layer, E(F) = E( output from the 2nd layer ):

E(F) = E( W1*B1 + W2*B2 + W3*B3 ) #these Ws are the final layer's weights

= W1*E(B1) + W2*E(B2) + W3*E(B3)

= W1[ W1*0.8*( val1 ) + … ] + W2[ W2*0.8*( val2 ) + … ] + W3[ W3*0.8*( val3 ) + … ] #the Ws inside the brackets are the 2nd layer's weights; only the first term of each bracket is written out
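Since the algebra above is easy to trip over, here is a quick Monte Carlo sanity check on this tiny linear network (all weights and inputs below are made up; dropout is applied to layer 1 only, as in the derivation):

import numpy as np

np.random.seed(1)
keep_prob = 0.8
x = np.array([1.0, 2.0, 3.0])       # inputs X1, X2, X3 (made up)
W1 = np.random.randn(3, 3)          # layer-1 weights
W2 = np.random.randn(3, 3)          # layer-2 weights
w3 = np.random.randn(3)             # final-layer weights

def forward(drop_layer1):
    a1 = W1 @ x                                    # val1, val2, val3
    if drop_layer1:
        a1 = a1 * (np.random.rand(3) < keep_prob)  # plain dropout on layer 1, no inversion
    b = W2 @ a1                                    # 2nd-layer units B1, B2, B3
    return w3 @ b                                  # final output F

# Average F over many sampled sub-networks ...
average_F = np.mean([forward(True) for _ in range(100000)])
# ... versus one full pass with layer 1's activations scaled by 0.8
expected_F = w3 @ (W2 @ (keep_prob * (W1 @ x)))
print(average_F, expected_F)   # the two numbers come out very close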

So this is the expected value, i.e. the average over all the sampled models, of the final output. With plain dropout we therefore multiply the activations by keep_probability at test time so that they match the expected values in the equations above. This is dropout without inverting. With inverted dropout we can avoid that extra multiplication at test time: during training we simply divide by keep_probability every time we apply the mask. So instead of E(A1) = 0.8*(val1) in the first layer, we get E(A1) = val1, because we divided by 0.8. The 0.8 disappears from the 2nd layer's expectation in the same way, and the expected value of the final output becomes

E(F) = W1[ W1*( val1 ) + … ] + W2[ W2*( val2 ) + … ] + W3[ W3*( val3 ) + … ]

So this is still the average over all the sampled models, even though no extra scaling factors are left in the equation. The test-time forward pass (run without any dropout) produces exactly this same equation, so nothing extra needs to be done at test time. This is called "Inverted Dropout".
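Side by side, a minimal sketch of where the scaling happens in each variant (the helper names are my own, not any particular library's API):

import numpy as np

keep_prob = 0.8

# Plain dropout: no scaling during training ...
def plain_dropout_train(a):
    return a * (np.random.rand(*a.shape) < keep_prob)

# ... so activations must be multiplied by keep_prob at test time to match the expected values
def plain_dropout_test(a):
    return a * keep_prob

# Inverted dropout: divide by keep_prob during training ...
def inverted_dropout_train(a):
    mask = np.random.rand(*a.shape) < keep_prob
    return (a * mask) / keep_prob

# ... so the test-time forward pass is left completely untouched
def inverted_dropout_test(a):
    return a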

Machine Learning basics

Traditional approaches to software development cannot always solve the problems that today's AI-driven applications pose. Conventional programming focuses on explicitly coded rules and conditions: we first design the algorithm for a problem and then implement it. Machine learning is the branch of AI that enables systems to learn and find patterns by being fed large amounts of data; it is all about finding patterns and making predictions. With more data being generated and more computing power available, researchers have realized the potential of building applications that are not possible with traditional approaches.

There are mainly two types of machine learning models: supervised learning and unsupervised learning.

1.) Supervised Learning: Here we already have a data set for a given problem along with labels giving the right answers. The data can be structured, like tables in a database, or unstructured, like images; the important thing is that the correct answers (labels) are available. This labelled data is known as the training set. Once we train our ML model on it, the model should predict the right result for new, unseen test examples that come without labels. The predicted label values can be either continuous or discrete. A graph for a continuous predicted value is shown below: we already have the prices for some house sizes (the training set), and for a new size the best-fit line predicts the price.
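As a toy version of that house-price example, a best-fit line can be computed with plain numpy; the sizes and prices below are invented purely for illustration:

import numpy as np

# Hypothetical training set: house sizes (sq. ft) with known prices (labels, in thousands)
sizes  = np.array([650, 800, 1000, 1200, 1500, 1800])
prices = np.array([ 70,  85,  105,  125,  155,  180])

# Fit a straight line (degree-1 polynomial) to the labelled data
slope, intercept = np.polyfit(sizes, prices, deg=1)

# Predict the price of an unseen house from the best-fit line
new_size = 1100
predicted_price = slope * new_size + intercept
print(round(predicted_price, 1))   # continuous prediction for the new example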