Inside the black box 2

Following the original observations of neural networks in action, I decided a follow-up was needed.  In the original blog, the smallest neural net (NN) that learnt the data set was 2-6-3-1, but the details were not saved; a second NN with the same configuration came close, although its approach was different.  The first and second hidden layer nodes (and outputs) are modelled, but it is the second layer nodes that really show how the clusters of blue circles were isolated by the neural networks.

Using a new 2-6-3-1 NN (2 inputs, 6 nodes in the first hidden layer, 3 nodes in the second hidden layer and a single output), the cluster of blue circles was isolated by three curved, overlapping lines.  The images below are from the second-layer video; they show that the network does not fully learn the dataset, because the boundaries that close the cluster of blue circles were not learnt before convergence.  The output of the trained network is always shown in red.

2-6-3-1 (2nd hidden layer and output nodes)

The data set plotted
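For readers who want to poke at a network of the same shape, a 2-6-3-1 forward pass with logistic activations can be sketched as below.  This is a minimal NumPy illustration, not the visualisation tool used for the animations, and the random weights are only stand-ins for a real initialisation.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))           # the sigmoid used throughout this blog

    rng = np.random.default_rng(0)
    # weight matrices for a 2-6-3-1 network, with the bias folded in as an extra column
    W1 = rng.normal(scale=0.5, size=(6, 3))       # 1st hidden layer: 6 nodes, 2 inputs + bias
    W2 = rng.normal(scale=0.5, size=(3, 7))       # 2nd hidden layer: 3 nodes, 6 inputs + bias
    W3 = rng.normal(scale=0.5, size=(1, 4))       # output node: 3 inputs + bias

    def forward(x):
        """Return the activations of every layer for one 2-D input point."""
        a0 = np.append(x, 1.0)                    # input plus bias term
        a1 = logistic(W1 @ a0)                    # 1st hidden layer outputs
        a2 = logistic(W2 @ np.append(a1, 1.0))    # 2nd hidden layer outputs
        y  = logistic(W3 @ np.append(a2, 1.0))    # network output (the red mesh)
        return a1, a2, y

    print(forward(np.array([0.1, 0.1])))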

Pushing the boundaries (2-5-3-1)

After trying several different networks, I decided to see whether a smaller network could learn the data set: one with one less neuron in the first hidden layer.  The results were surprising.  This time, I have combined the first and second layer animations into a single plot, added the data set as a point of reference and included the weights of the entire network so that the changes and activity can be monitored in real time.

The actual weight values are not displayed because they would be of little use unless an observer could sum them and calculate the logistic function on the fly.  Instead, the direction of change of each weight is displayed: red for increasing, yellow for unchanged and blue for decreasing weights.  We can now observe how the weights behave during the learning process as well as how the nodes are shaped in the pattern space.
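The colour coding amounts to nothing more than the small sketch below; this is my own illustration of the idea, and the tolerance value is an assumption rather than a figure taken from the tool.

    def weight_colour(previous, current, tol=1e-4):
        """Map the change in one weight between two animation frames to a colour."""
        delta = current - previous
        if delta > tol:
            return "red"        # weight increasing
        if delta < -tol:
            return "blue"       # weight decreasing
        return "yellow"         # effectively unchanged

    print(weight_colour(0.42, 0.45))    # -> red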

Observations

At about five minutes into the animation, a node regressed to zero as was observed in the first blog, where several nodes in the 2-15-1 neural network regressed to zero and contributed nothing to the final solution.

             2-5-3-1 (@ 5min), 2nd hidden layer                                             2-8-6-1, 2nd hidden layer

On closer inspection, the 2-8-6-1 network also had nodes that reduced their contribution, as well as nodes that duplicated features: the orange and blue neurons have clearly learnt the same thing, and the yellow neuron is also fairly flat, but suspended.  Clearly the 2-8-6-1 has too much capacity for the data set (see above right).
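One rough way to confirm that two neurons have learnt the same thing is to correlate their outputs over the pattern space.  The helper below is a hypothetical sketch of that check, not part of the visualisation tool, and the threshold is an arbitrary choice.

    import numpy as np

    def duplicated_pairs(hidden_outputs, threshold=0.98):
        """hidden_outputs: array of shape (n_points, n_nodes) holding each node's
        activation over a grid of input points.  Returns the index pairs whose
        outputs are almost perfectly correlated, i.e. redundant nodes."""
        corr = np.corrcoef(hidden_outputs.T)
        n = corr.shape[0]
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if corr[i, j] > threshold]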

 

Although the 2-5-3-1 neural network (below left) was chosen in the hope of improved performance, it failed to deliver it.  I wanted to see whether the weights of the dormant neuron still contributed to the final solution, so I removed the weights of the dormant node from the converged solution and re-ran the visualisation tool.  The solution remained the same (below right); this data set can therefore be learnt by a 2-5-2-1 neural net.

                     2-5-3-1                                                                        2-5-2-1(dormant node removed)
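Removing a dormant node amounts to deleting its row in that layer's weight matrix and the matching column in the next layer's matrix.  The sketch below assumes the row-per-node layout used in the earlier forward-pass sketch; it is an illustration, not the procedure used by the tool.

    import numpy as np

    def prune_hidden_node(W_layer, W_next, node_index):
        """Drop one hidden node: remove its row of incoming weights and the
        corresponding column of outgoing weights in the following layer."""
        W_layer_pruned = np.delete(W_layer, node_index, axis=0)
        W_next_pruned = np.delete(W_next, node_index, axis=1)
        return W_layer_pruned, W_next_pruned

    # e.g. removing the dormant third node of the second hidden layer:
    # W2_new, W3_new = prune_hidden_node(W2, W3, node_index=2)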

Solution 1 (2-5-2-1)

One of the best solutions using a 2-5-2-1 neural network is shown below.  The network has a similar structure to the one at the start of this blog, except that some regions are inverted.  The network clearly struggled to hold its shape before it converged on a sub-optimal solution and clearly has distortions.  To be honest, it looks like a mess.

 

2-5-2-1 (2nd hidden layer + Output) 

This could be due to the lime green node lying flat on the underside of the first hidden layer, shown below.

 

2-5-2-1 (1st hidden layer + Output) 

Solution 2

2-5-2-1 (Training in Progress)

 

A new network was trained with the learning rate reduced and kept low for the duration of training, with no momentum added, and it converged perfectly.
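The update rule behind this run is just plain gradient descent with a small, constant step and no momentum term; the sketch below shows the idea, with the learning-rate value chosen only for illustration.

    def update_weights(weights, gradients, learning_rate=0.01):
        """One plain gradient-descent step: no momentum, no adaptive scaling.
        Keeping the learning rate small for the whole run is what allowed this
        network to converge cleanly, without the distortions seen in Solution 1."""
        return [w - learning_rate * g for w, g in zip(weights, gradients)]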


Inside the black box

Introduction

This post is designed to show the internal changes of an artificial neural network (ANN/NN); it shows the outputs of the neurons from the start of the backpropagation algorithm through to convergence.

The hope is to gain a better understanding of why we use a second hidden layer, of local minima, of how many hidden nodes are required and of their impact on the final solution.

The Dataset

 #   x1    x2    target       #   x1    x2    target       #   x1    x2     target
 1   0.1   0.1   1           13   0.9   0.5   1           25   0.7   0.4    0
 2   0.3   0.1   1           14   0.8   0.7   1           26   0.3   0.5    0
 3   0.5   0.1   1           15   0.9   0.7   1           27   0.2   0.55   0
 4   0.7   0.1   1           16   0.1   0.9   1           28   0.4   0.55   0
 5   0.9   0.1   1           17   0.3   0.9   1           29   0.1   0.65   0
 6   0.1   0.3   1           18   0.5   0.9   1           30   0.2   0.65   0
 7   0.3   0.3   1           19   0.7   0.9   1           31   0.3   0.65   0
 8   0.5   0.3   1           20   0.9   0.9   1           32   0.4   0.65   0
 9   0.9   0.3   1           21   0.7   0.2   0           33   0.5   0.65   0
10   0.1   0.5   1           22   0.6   0.3   0           34   0.2   0.7    0
11   0.5   0.5   1           23   0.7   0.3   0           35   0.3   0.7    0
12   0.7   0.5   1           24   0.8   0.3   0           36   0.4   0.7    0
                                                          37   0.3   0.8    0

The dataset is designed to look like holes in the ground and was chosen specifically to challenge the network to surround the blue circles and separate them.
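To see the "holes", the table above can be plotted directly.  The snippet below is a minimal matplotlib sketch; the class-0 points are drawn as blue circles to match the blog's figures, while the marker and colour for class 1 are my own choices.

    import matplotlib.pyplot as plt

    # (x1, x2, target) triples copied from the table above
    data = [
        (0.1, 0.1, 1), (0.3, 0.1, 1), (0.5, 0.1, 1), (0.7, 0.1, 1), (0.9, 0.1, 1),
        (0.1, 0.3, 1), (0.3, 0.3, 1), (0.5, 0.3, 1), (0.9, 0.3, 1),
        (0.1, 0.5, 1), (0.5, 0.5, 1), (0.7, 0.5, 1), (0.9, 0.5, 1),
        (0.8, 0.7, 1), (0.9, 0.7, 1),
        (0.1, 0.9, 1), (0.3, 0.9, 1), (0.5, 0.9, 1), (0.7, 0.9, 1), (0.9, 0.9, 1),
        (0.7, 0.2, 0), (0.6, 0.3, 0), (0.7, 0.3, 0), (0.8, 0.3, 0), (0.7, 0.4, 0),
        (0.3, 0.5, 0), (0.2, 0.55, 0), (0.4, 0.55, 0),
        (0.1, 0.65, 0), (0.2, 0.65, 0), (0.3, 0.65, 0), (0.4, 0.65, 0), (0.5, 0.65, 0),
        (0.2, 0.7, 0), (0.3, 0.7, 0), (0.4, 0.7, 0), (0.3, 0.8, 0),
    ]

    ones = [(x, y) for x, y, t in data if t == 1]
    zeros = [(x, y) for x, y, t in data if t == 0]
    plt.scatter(*zip(*ones), marker="s", color="red", label="target 1")
    plt.scatter(*zip(*zeros), marker="o", color="blue", label="target 0 (the holes)")
    plt.legend()
    plt.show()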

Ideal solution

The ideal solution is for the network to surround the blue circles perfectly in a class of their own, avoiding a connected solution like the purple enclosure.

After surrounding the holes, the main challenges are:

  • Clean separation between the holes
  • Commitment of resources to seal Region 1 cleanly
  • Whether the grid position in Filler 1 will be a 1 or a 0
  • How the network will represent Box 1: will it be closed or open?  It is not designed to be a boundary and was left deliberately ambiguous

Animations

The following animations show the output as a red mesh and the hidden layers in various colours; some colours may be used more than once.  The animations show the nodes of a hidden layer together with the output node, but not both hidden layers together.  They run from the start of the algorithm, from the random weights up to convergence; the bias nodes are not shown.

 One hidden layer (2-15-1) 

A 2-10-1 NN and two 2-15-1 NNs were tried before a solution was found.  Although the network learnt the data set to the desired convergence threshold of 10% total error, three hidden nodes were found to be practically unused.

 

Unused in the sense that, after all the backpropagation of errors had taken place, their Z output value did not rise above 11%; in fact, if the convergence threshold had been set at 1% total error, I am convinced that their contribution would have been even less.  These are the yellow, dark orange and blue meshes.

 

They can be seen when the animation has an elevation of 0°; although these colours were used twice, the other instances had Z output values that were much higher.  This suggests that a network with a single hidden layer of 12 nodes may be sufficient to learn the dataset.
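The same "practically unused" test can be made numerically by looking at the largest activation each hidden node produces over the data set.  The helper below is a sketch of that idea; the 0.11 cut-off simply mirrors the 11% figure quoted above.

    import numpy as np

    def unused_nodes(hidden_outputs, cutoff=0.11):
        """hidden_outputs: (n_points, n_nodes) activations over the data set.
        A node whose output never rises above the cut-off contributes almost
        nothing to the weighted sum feeding the output node."""
        peaks = hidden_outputs.max(axis=0)
        return [i for i, peak in enumerate(peaks) if peak < cutoff]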

Hidden and output layer

Two hidden layers (2-6-3-1)

A network using two hidden layers, with 6 nodes in the first and 3 in the second.

2nd hidden layer + output node

1st Hidden layer + Output node

Two hidden layers (2-8-6-1)

A network using two hidden layers, with 8 nodes in the first and 6 in the second.  The output node is animated separately this time, but inverted.

Output node

2nd hidden layer + output node

1st Hidden layer + Output node

Observations (so far)

The initial thinking was that a single hidden layer would not be able to learn the data set because the regions of zero output were isolated rather than continuous.  The smallest network observed to have met the challenges and learnt the dataset contained two hidden layers of six and three nodes (2-6-3-1); unfortunately, the animation was not captured.

 

The sigmoid curve, the “S” shape, is clearly visible in the layer closer to the input nodes, but when a second hidden layer is used it is no longer noticeable there.  This shows that a function of functions develops in the second layer, allowing more complex models to be learned.
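In symbols, with the logistic activation used throughout, the second hidden layer computes a function of the first layer's functions, which is where the extra modelling power comes from (standard notation, weights W and biases b per layer):

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
    y = \sigma\big(W^{(3)} \, \sigma\big(W^{(2)} \, \sigma(W^{(1)} x + b^{(1)}) + b^{(2)}\big) + b^{(3)}\big)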

 

Convergence seems to happen in three observable ways (a sketch of the stopping loop follows this list):

  • The convergence threshold is met and the data set is fully learnt; this seldom happens with minimal resources.
  • Convergence on a local minimum.  This can happen near the global minimum or quite early, such as when the network gets stuck in a local loop and oscillates between values; it can happen high up on the error surface if there is insufficient momentum to roll out of the local minimum, leaving the network simply trapped.
  • Convergence can also happen because of the path taken by the network: unused resources are cut off and the chosen path simply does not have enough resources to complete the data set; it runs out of road.
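For concreteness, the stopping behaviour described above boils down to a loop like the sketch below; the error measure, threshold, learning rate and momentum value are illustrative assumptions rather than figures from these experiments.

    def train(weights, grad_fn, total_error_fn, lr=0.05, momentum=0.9,
              threshold=0.10, max_epochs=100_000):
        """Gradient descent with momentum, stopping when the total error over
        the data set drops below the convergence threshold (the first case
        above) or the epoch budget runs out (the other two cases in practice)."""
        velocity = [0.0 * w for w in weights]
        for epoch in range(max_epochs):
            grads = grad_fn(weights)
            velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
            weights = [w + v for w, v in zip(weights, velocity)]
            if total_error_fn(weights) < threshold:
                return weights, epoch       # converged on the threshold
        return weights, max_epochs          # stopped without meeting it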

 

Other animations will be uploaded with observations, such as the variations in the number and types of local minima that occurred.  The networks with a single hidden layer, even with sufficient hidden nodes, tended to get stuck in local minima more often and took much longer to train than the NNs with two hidden layers.

