You are reading the article **A Quick History Of Neural Networks** updated in February 2024 on the website Eastwest.edu.vn. We hope that the information we have shared is helpful to you. If you find the content interesting and meaningful, please share it with your friends and continue to follow and support us for the latest updates. *Suggested March 2024 A Quick History Of Neural Networks*

This article is part of the Data Science Blogathon.

Introduction

Neural networks are ubiquitous right now. Organizations are splurging money on hardware and talent to ensure they can build the most complex neural networks and bring out the best deep learning solutions.

Although Deep Learning is a fairly old subset of machine learning, it didn’t get its due recognition until the early 2010s. Today, it has taken the world by storm and captured public attention in a way that very few algorithms have managed to accomplish.

In this article, I wanted to take a slightly different approach to neural networks and understand how they came to be. This is the story of the origin of neural networks!

The earliest reported work in the field of Neural Networks began in the 1940s, with Warren McCulloch and Walter Pitts attempting a simple neural network with electrical circuits.

The below image shows an MCP Neuron. If you studied High School physics, you’ll recognize that this looks quite similar to a simple NOR Gate.

The paper demonstrated basic thought with the help of signals, and how decisions were made by transforming the inputs provided.

McCulloch-Pitts Neuron

McCulloch and Pitts’ paper provided a way to describe brain functions in abstract terms, and showed that simple elements connected in a neural network can have immense computational power.

Despite its groundbreaking implications, the paper went virtually unnoticed till about 6 years later, when Donald Hebb (image below) published a paper that reinforced that neural pathways strengthen each time they are used.

Donald Hebb (Father of Neuropsychology) Photo Credit: researchgate.net

Keep in mind that computing was still in its nascent stage at that point, with IBM coming out with its first PC (The IBM 5150) in 1981.

Fast forward to the ’90s, a lot of research into artificial neural networks had already been published. Rosenblatt had created the first perceptron in the 1950s. The backpropagation algorithm was successfully implemented at the Bell Labs in 1989 by Yann LeCun. By the 1990s, the US Postal Service had already deployed LeCun’s model for reading ZIP Codes on envelopes.

The LSTM (Long Short-Term Memory) architecture as we know it today was introduced back in 1997.

If so much groundwork had already been laid down by the ’90s, why did it take until 2012 to leverage neural networks for deep learning tasks?

Hardware and the Rise of the Internet

The last few decades have seen rapid strides in hardware and the Internet. When the IBM PC launched in 1981, its base configuration had 16KB of RAM. By the 2010s, the average PC had around 4GB!

Nowadays, we can train a small-sized model on our computers, which would have been unfathomable in the ’90s.

The Gaming market also played a significant role in this revolution, with companies like NVIDIA and AMD investing heavily in supercomputing to deliver a high-end virtual experience.

With the growth of the internet, creating and distributing datasets for machine learning tasks became that much easier.

It has become rather easy to collect images from Google or mine text from Wikipedia to train and build Deep Learning Models.

The 2010’s: Our Era of Deep Learning

ImageNet: In 2009, at the beginning of the modern deep learning era, Stanford’s Fei-Fei Li created ImageNet, a large visual dataset that has been lauded as the project that spawned the AI revolution.

Back in 2006, Li was a new professor at the University of Illinois – Urbana Champaign. Her colleagues would continuously talk about coming up with new algorithms that would make better decisions. She, however, saw the flaw in their plan.

Even the best algorithm wouldn’t work well if it was trained on a dataset that did not reflect the real world. ImageNet consists of more than 14 million images across more than 20,000 categories and, to date, remains a cornerstone of object recognition technology.

Public Competitions: In 2006, Netflix launched an open competition called the Netflix Prize to predict user ratings for films. On September 21, 2009, a prize of 1 million USD was awarded to the BellKor’s Pragmatic Chaos team, which beat Netflix’s own algorithm by 10.06%.

Started in 2010, Kaggle is a platform that hosts machine learning competitions open to everyone across the globe. It has allowed researchers, engineers, and homegrown coders to push the envelope in solving complex data tasks.

Prior to the AI Boom, the investment in artificial intelligence was around 20 million USD. By 2014, this investment had grown twenty-fold, with market leaders like Google, Facebook, and Amazon allocating funds to further research into AI products of the future. This new wave of investments led to increased hiring in deep learning from a few hundred to tens of thousands.

Despite its slow beginnings, Deep Learning has become an inescapable part of our lives. From Netflix and YouTube recommendations to language translation engines, from facial recognition and medical diagnosing to self-driving cars, there is no sphere that Deep Learning has not touched.

AI is not our future, it is our present, and it’s just getting started!

## Biologically Inspired: How Neural Networks Are Finally Maturing

More than two decades ago, neural networks were widely seen as the next generation of computing, one that would finally allow computers to think for themselves.

Computers still can’t think for themselves, of course, but the latest innovations in neural networks allow computers to sift through vast realms of data and draw basic conclusions without the help of human operators.

“Neural networks allow you to solve problems you don’t know how to solve,” said Leon Reznik, a professor of computer science at the Rochester Institute of Technology.

On the software side, neural networks are slowly moving into production settings as well. Google has applied various neural network algorithms to improve its voice recognition application, Google Voice. For mobile devices, Google Voice translates human voice input to text, allowing users to dictate short messages, voice search queries and user commands even in the kind of noisy ambient conditions that would flummox traditional voice recognition software.

Neural networks could also be used to analyze vast amounts of data. In 2009, a group of researchers used neural network techniques to win the Netflix Grand Prize.

Neural networking vs. computing

As originally conceived, neural networking differs from traditional computing in that, with conventional computing, the computer is given a specific algorithm, or program, to execute. With neural networking, the job of solving a specific problem is largely left in the hands of the computer itself, Reznik said.

An artificial neural network (ANN) also uses this approach of modifying the strength of connections among different layers of neurons, or nodes in the parlance of the ANN. ANNs, however, usually deploy a training algorithm of some form, which adjusts the nodes to extract the desired features from the source data. Much like humans do, a neural network can generalize, slowly building up the ability to recognize, for instance, different types of dogs, using a single image of a dog.

Evolution of neural networking

Although investigated since the 1940s, research into ANNs, which can be thought of as a form of artificial intelligence (AI), hit a peak of popularity in the late 1980s.

“There was a lot of great things done as part of the neural network resurgence in the late 1980s,” said Dharmendra Modha, an IBM Research senior manager who is involved in a company project to build a neuromorphic processor. Throughout the next decade, however, other forms of closely related AI started getting more attention, such as machine learning and expert systems, thanks to a more immediate applicability to industry usage.

Nonetheless, the state-of-the-art in neural networks continued to evolve, with the introduction of powerful new learning models that could be layered to sharpen performance in pattern recognition and other capabilities.

“That means that now our artificial computer models will be much closer to the way natural neural networks process information,” Reznik said.

The continuing march of Moore’s Law has also lent a helping hand. Over the past decade, the microprocessor fabrication process has provided the density needed to run large clusters of nodes even on a single slice of silicon, a density that would not have been possible even a decade ago.

“We’re now at a point where the silicon has matured and technology nodes have gotten dense enough where it can deliver unbelievable scale at really low power,” Modha said.

Harnessing processors

Today’s intrusion detection systems work in one of two ways, Reznik explained. They either use signature detection, in which they recognize a pattern based on a pre-existing library of patterns, or they look for anomalies in a typically static backdrop, which can be difficult to do in scenarios with lots of activity. Neural networking could combine the two approaches to strengthen the ability of the system to detect unusual deviations from the norm, Reznik said.

Micron Automata

One hardware company investigating the possibilities of neural networking is Micron. The company has just released a prototype of a DDR memory module with a built-in processor, called Automata.

While not a replacement for standard CPUs, a set of Automata modules could be used to watch over a live stream of incoming data, seeking anomalies or patterns of interest. In addition to these spatial characteristics, they can also watch for changes over time, said Paul Dlugosch, director of Automata processor development in the architecture development group of Micron’s DRAM division.

Nonetheless, because they can be run in parallel, multiple Automata modules, each serving as a node, could be run together in a cluster for doing neural network-like computations. The output of one module can be piped into another module, providing the multiple layers of nodes needed for neural networking. Programming the Automata can be done through a compiler that Micron developed that uses either an extension of the regular expression language or its own Automata Network Markup Language (ANML).

Another company investigating this area is IBM. In 2013, IBM announced it had developed a programming model for some cognitive processors it built as part of the U.S. Defense Advanced Research Projects Agency (DARPA) SyNAPSE (Systems of Neuromorphic Adaptive Plastic Scalable Electronics) program.

IBM’s programming model for these processors is based on reusable and stackable building blocks, called corelets. Each corelet is in fact a tiny neural network itself and can be combined with other corelets to build functionality. “One can compose complex algorithms and applications by combining boxes hierarchically,” Modha said.

In early tests, IBM taught one chip how to play the primitive computer game Pong, to recognize digits, to do some olfactory processing, and to navigate a robot through a simple environment.

While it is doubtful that neural networks would ever replace standard CPUs, they may very well end up tackling certain types of jobs difficult for CPUs alone to handle.

“Instead of bringing sensory data to computation, we are bringing computation to sensors,” Modha said. “This is not trying to replace computers, but it is a complementary paradigm to further enhance civilization’s capability for automation.”

## Understanding And Coding Neural Networks From Scratch In Python And R

Note: This article was originally published on May 29, 2023, and updated on July 24, 2023

Overview

Neural Networks is one of the most popular machine learning algorithms

Gradient Descent forms the basis of Neural networks

Neural networks can be implemented in both R and Python using certain libraries and packages

Introduction

You can learn and practice a concept in two ways:

Option 1: You can learn the entire theory on a particular subject and then look for ways to apply those concepts. So, you read up how an entire algorithm works, the maths behind it, its assumptions, limitations, and then you apply it. Robust but time-taking approach.

Option 2: Start with simple basics and develop an intuition on the subject. Then, pick a problem and start solving it. Learn the concepts while you are solving the problem. Then, keep tweaking and improving your understanding. So, you read up how to apply an algorithm – go out and apply it. Once you know how to apply it, try it around with different parameters, values, limits, and develop an understanding of the algorithm.

I prefer Option 2 and take that approach to learn any new topic. I might not be able to tell you the entire math behind an algorithm, but I can tell you the intuition. I can tell you the best scenarios to apply an algorithm based on my experiments and understanding.

In my interactions with people, I find that people don’t take time to develop this intuition and hence they struggle to apply things in the right manner.

In this article, I will discuss the building blocks of neural networks from scratch and focus more on developing this intuition to apply neural networks. We will code in both “Python” and “R”. By the end of this article, you will understand how neural networks work, how we initialize weights, and how we update them using back-propagation.

Let’s start.

In case you want to learn this in a course format, check out our course Fundamentals of Deep Learning

Table of Contents:

Simple intuition behind Neural networks

Multi-Layer Perceptron and its basics

Steps involved in Neural Network methodology

Visualizing steps for Neural Network working methodology

Implementing NN using Numpy (Python)

Implementing NN using R

Understanding the implementation of Neural Networks from scratch in detail

[Optional] Mathematical Perspective of Back Propagation Algorithm

Simple intuition behind neural networks

In case you have been a developer or seen one work – you know how it is to search for bugs in code. You would fire various test cases by varying the inputs or circumstances and look for the output. Further, the change in output provides you a hint on where to look for the bug – which module to check, which lines to read. Once you find it, you make the changes and the exercise continues until you have the right code/application.

Neural networks work in a very similar manner. A network takes several inputs, processes them through multiple neurons in multiple hidden layers, and returns the result using an output layer. This result estimation process is technically known as “Forward Propagation“.

Next, we compare the result with the actual output. The task is to make the output of the neural network as close as possible to the actual (desired) output. Each of these neurons contributes some error to the final output. How do you reduce the error?

We try to minimize the value/weight of the neurons that contribute more to the error, and this happens while traveling back through the neurons of the network to find where the error lies. This process is known as “Backward Propagation“.

To reduce the number of iterations needed to minimize the error, neural networks use a common algorithm known as “Gradient Descent”, which helps to optimize the task quickly and efficiently.

That’s it – this is how Neural networks work! I know this is a very simple representation, but it would help you understand things in a simple manner.

Multi-Layer Perceptron and its basics

Just like atoms form the basics of any material on earth – the basic forming unit of a neural network is a perceptron. So, what is a perceptron?

A perceptron can be understood as anything that takes multiple inputs and produces one output. For example, look at the image below.

The above structure takes three inputs and produces one output. The next logical question is what is the relationship between input and output? Let us start with basic ways and build on to find more complex ways.

Below, I have discussed three ways of creating input-output relationships:

Next, let us add bias: Each perceptron also has a bias, which can be thought of as how flexible the perceptron is. It is similar to the constant b of a linear function y = ax + b. It allows us to move the line up and down to fit the prediction to the data better. Without b, the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron with two inputs requires three weights: one for each input and one for the bias. With the three inputs shown above, the linear representation of the input looks like w1*x1 + w2*x2 + w3*x3 + 1*b.

But all of this is still linear, which is what perceptrons used to be. That was not as much fun, so people evolved the perceptron into what is now called an artificial neuron. A neuron applies a non-linear transformation (activation function) to the inputs and biases.

What is an activation function?

The activation function takes the weighted sum of the inputs (w1*x1 + w2*x2 + w3*x3 + 1*b) as an argument and returns the output of the neuron. In the above equation, we have represented 1 as x0 and b as w0.

Moreover, the activation function is mostly used to make a non-linear transformation that allows us to fit non-linear hypotheses or to estimate complex functions. There are multiple activation functions, like “Sigmoid”, “Tanh”, “ReLU”, and many others.
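To make the weighted sum and the three activation functions named above concrete, here is a minimal NumPy sketch. The input and weight values are made up purely for illustration, and the bias is folded in as w0 with x0 = 1, following the convention described above.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squash any real value into the open interval (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Pass positive values through unchanged and clip negatives to 0."""
    return np.maximum(0.0, z)

# Illustrative values only (assumptions, not taken from the article).
x = np.array([1.0, 0.5, -1.0, 2.0])   # [x0 = 1, x1, x2, x3]
w = np.array([0.1, 0.4, 0.3, -0.2])   # [w0 = b, w1, w2, w3]

weighted_sum = np.dot(w, x)           # w0*x0 + w1*x1 + w2*x2 + w3*x3
outputs = {f.__name__: float(f(weighted_sum)) for f in (sigmoid, tanh, relu)}
print(weighted_sum, outputs)
```

The same weighted sum gives a different neuron output depending on which activation is chosen, which is why the choice of activation is treated as a design decision.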

Forward Propagation, Back Propagation, and Epochs

Till now, we have computed the output, and this process is known as “Forward Propagation“. But what if the estimated output is far away from the actual output (high error)? In that case, we update the biases and weights of the neural network based on the error. This weight and bias updating process is known as “Back Propagation“.

Back-propagation (BP) algorithms work by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is to determine the gradient (derivative) of each node w.r.t. the final output. To get a mathematical perspective of the Backward propagation, refer to the below section.

One round of forward and back propagation is known as one training iteration, aka “Epoch“.

Multi-layer perceptron

Now, let’s move on to the next part: the Multi-Layer Perceptron. So far, we have seen just a single layer consisting of 3 input nodes, i.e. x1, x2, and x3, and an output layer consisting of a single neuron. But, for practical purposes, the single-layer network can do only so much. An MLP consists of multiple layers called Hidden Layers stacked in between the Input Layer and the Output Layer as shown below.

The image above shows just a single hidden layer in green, but in practice an MLP can contain multiple hidden layers. Another point to remember is that all the layers of an MLP are fully connected, i.e. every node in a hidden layer is connected to every node in the previous layer and the following layer.

Let’s move on to the next topic, which is a training algorithm for neural networks (to minimize the error). Here, we will look at the most common training algorithm, known as gradient descent.

Full Batch Gradient Descent and Stochastic Gradient Descent

Both variants of Gradient Descent perform the same work of updating the weights of the MLP using the same update rule; the difference lies in the number of training samples used to update the weights and biases.

The Full Batch Gradient Descent algorithm, as the name implies, uses all the training data points for each weight update, whereas Stochastic Gradient Descent uses one or more samples, but never the entire training data, for each update.

Let us understand this with a simple example of a dataset of 10 data points with two weights w1 and w2.

Full Batch: You use 10 data points (entire training data) and calculate the change in w1 (Δw1) and change in w2(Δw2) and update w1 and w2.

SGD: You use 1st data point and calculate the change in w1 (Δw1) and change in w2(Δw2) and update w1 and w2. Next, when you use 2nd data point, you will work on the updated weights
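The 10-point, two-weight example above can be sketched as follows. This is only an illustration under stated assumptions: a plain linear model with a squared-error loss stands in for the network, and the data, target weights, and learning rate are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))      # 10 data points, 2 features (one per weight)
true_w = np.array([2.0, -1.0])    # hypothetical weights that generated the data
y = X @ true_w

def loss(w, X_part, y_part):
    """Half of the mean squared error of the linear model X_part @ w."""
    return 0.5 * np.mean((X_part @ w - y_part) ** 2)

def gradient(w, X_part, y_part):
    """Gradient of loss() with respect to w."""
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)

lr = 0.1

# Full batch: a single update of (w1, w2) computed from all 10 points.
w_full = np.zeros(2)
w_full -= lr * gradient(w_full, X, y)
full_batch_updates = 1

# SGD: one update per data point, each starting from the already-updated weights.
w_sgd = np.zeros(2)
sgd_updates = 0
for i in range(len(X)):
    w_sgd -= lr * gradient(w_sgd, X[i:i + 1], y[i:i + 1])
    sgd_updates += 1

print(full_batch_updates, w_full)
print(sgd_updates, w_sgd)
```

One pass over the data therefore produces one weight update in the full-batch case and ten in the stochastic case.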

For a more in-depth explanation of both the methods, you can have a look at this article.

Steps involved in Neural Network methodology

Let’s look at the step-by-step methodology for building a neural network (an MLP with one hidden layer, similar to the architecture shown above). At the output layer, we have only one neuron as we are solving a binary classification problem (predict 0 or 1). We could also have two neurons, one for each of the two classes.

Firstly look at the broad steps:

0.) We take input and output

X as an input matrix

y as an output matrix

1.) Then we initialize weights and biases with random values (This is one-time initiation. In the next iteration, we will use updated weights, and biases). Let us define:

wh as a weight matrix to the hidden layer

bh as bias matrix to the hidden layer

wout as a weight matrix to the output layer

bout as bias matrix to the output layer

2.) Then we take matrix dot product of input and weights assigned to edges between the input and hidden layer then add biases of the hidden layer neurons to respective inputs, this is known as linear transformation:

hidden_layer_input= matrix_dot_product(X,wh) + bh

3) Perform non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)).

hiddenlayer_activations = sigmoid(hidden_layer_input)

4.) Then perform a linear transformation on the hidden layer activations (take the matrix dot product with the weights and add the bias of the output layer neuron), then apply an activation function (again sigmoid, but you can use any other activation function depending upon your task) to predict the output

output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout

output = sigmoid(output_layer_input)

All the above steps are known as “Forward Propagation“

5.) Compare the prediction with the actual output and calculate the gradient of the error (Actual – Predicted). The error is the squared loss ((y – output)^2)/2, so its gradient with respect to the output is, up to sign, (y – output).

E = y – output

6.) Compute the slope/gradient of the hidden and output layer neurons (to compute the slope, we calculate the derivative of the non-linear activation at each layer for each neuron). The gradient of the sigmoid can be returned as x * (1 – x), where x is the output of the sigmoid.

slope_output_layer = derivatives_sigmoid(output)

slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)

7.) Then compute change factor(delta) at the output layer, dependent on the gradient of error multiplied by the slope of output layer activation

d_output = E * slope_output_layer

8.) At this step, the error will propagate back into the network which means error at the hidden layer. For this, we will take the dot product of the output layer delta with the weight parameters of edges between the hidden and output layer (wout.T).

Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)

9.) Compute change factor(delta) at hidden layer, multiply the error at hidden layer with slope of hidden layer activation

d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

10.) Then update the weights at the output and hidden layers: the weights in the network can be updated from the errors calculated for the training example(s).

wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)*learning_rate

wh = wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate

learning_rate: The amount by which the weights are updated is controlled by a configuration parameter called the learning rate.

11.) Finally, update biases at the output and hidden layer: The biases in the network can be updated from the aggregated errors at that neuron.

bias at output_layer = bias at output_layer + sum of delta of output_layer at row-wise * learning_rate

bias at hidden_layer = bias at hidden_layer + sum of delta of hidden_layer at row-wise * learning_rate

bh = bh + sum(d_hiddenlayer, axis=0)*learning_rate

bout = bout + sum(d_output, axis=0)*learning_rate

Steps from 5 to 11 are known as “Backward Propagation“

One forward and backward propagation iteration is considered one training cycle. As I mentioned earlier, when we train a second time, the updated weights and biases are used for forward propagation.

Above, we have updated the weight and biases for the hidden and output layer and we have used a full batch gradient descent algorithm.
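Step 6 relies on the fact that the slope of the sigmoid can be written in terms of its own output. A quick numerical check of that identity is sketched below; the test points are arbitrary, and `derivatives_sigmoid` is named after the pseudocode above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def derivatives_sigmoid(a):
    """Slope of the sigmoid, given a = sigmoid(z) (the activation, not z)."""
    return a * (1.0 - a)

z = np.array([-2.0, 0.0, 3.0])   # arbitrary pre-activation values
a = sigmoid(z)

# Central finite difference as an independent check of the formula.
eps = 1e-6
numeric_slope = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic_slope = derivatives_sigmoid(a)

print(analytic_slope)
print(numeric_slope)
```

Note that the function must be fed the activations, not the raw layer input; passing the pre-activation values would give a wrong slope.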

Visualization of steps for Neural Network methodology

We will repeat the above steps and visualize the input, weights, biases, output, error matrix to understand the working methodology of Neural Network (MLP).

Note:

For good visualization images, I have rounded decimal positions at 2 or 3 positions.

Yellow filled cells represent current active cell

Orange cell represents the input used to populate the values of the current cell

Step 0: Read input and output

Step 1: Initialize weights and biases with random values (There are methods to initialize weights and biases but for now initialize with random values)

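Step 2: Calculate the hidden layer input

hidden_layer_input= matrix_dot_product(X,wh) + bh

Step 3: Perform non-linear transformation on the hidden layer input

hiddenlayer_activations = sigmoid(hidden_layer_input)

Step 4: Perform linear and non-linear transformation of hidden layer activation at output layer

output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout

output = sigmoid(output_layer_input)

Step 5: Calculate the gradient of the error (E) at the output layer

E = y-output

Step 6: Compute the slope at the output and hidden layers

slope_output_layer = derivatives_sigmoid(output)

slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)

Step 7: Compute delta at output layer

d_output = E * slope_output_layer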

Step 8: Calculate Error at the hidden layer

Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)

Step 9: Compute delta at hidden layer

d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

Step 10: Update weight at both output and hidden layer

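wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)*learning_rate

wh = wh+ matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate

Step 11: Update biases at both output and hidden layer

bh = bh + sum(d_hiddenlayer, axis=0)*learning_rate

bout = bout + sum(d_output, axis=0)*learning_rate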

Above, you can see that there is still a considerable error and the output is not close to the actual target value, because we have completed only one training iteration. If we train the model multiple times, the output will come very close to the actual outcome. I completed thousands of iterations and my result is close to the actual target values ([[ 0.98032096] [ 0.96845624] [ 0.04532167]]).

Implementing NN using Numpy (Python)
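The original listing is not reproduced here, so what follows is a minimal sketch of steps 0 to 11 in NumPy rather than the author's exact code. The input and output matrices are taken from the R snippet in this article; the epoch count, learning rate, and random seed are assumptions chosen for the demo.

```python
import numpy as np

# Step 0: input and output (same values as the R snippet in this article)
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[1], [1], [0]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def derivatives_sigmoid(x):
    # x is expected to be a sigmoid output, not the raw layer input
    return x * (1.0 - x)

# Assumed hyperparameters and seed (not given in the text)
epoch = 5000
learning_rate = 0.1
rng = np.random.default_rng(0)

inputlayer_neurons = X.shape[1]
hiddenlayer_neurons = 3
output_neurons = 1

# Step 1: random weights and biases
wh = rng.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = rng.uniform(size=(1, hiddenlayer_neurons))
wout = rng.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = rng.uniform(size=(1, output_neurons))

losses = []
for _ in range(epoch):
    # Steps 2-4: forward propagation
    hidden_layer_input = np.dot(X, wh) + bh
    hiddenlayer_activations = sigmoid(hidden_layer_input)
    output_layer_input = np.dot(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)

    # Steps 5-9: backward propagation
    E = y - output
    losses.append(float(np.sum(E ** 2) / 2))
    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
    d_output = E * slope_output_layer
    Error_at_hidden_layer = np.dot(d_output, wout.T)
    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

    # Steps 10-11: update weights and biases (full batch)
    wout += np.dot(hiddenlayer_activations.T, d_output) * learning_rate
    wh += np.dot(X.T, d_hiddenlayer) * learning_rate
    bout += np.sum(d_output, axis=0, keepdims=True) * learning_rate
    bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * learning_rate

print(output)
```

The biases are kept as single rows and rely on NumPy broadcasting to be added to every sample, which is why the bias updates sum the deltas over `axis=0`.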

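Implementing NN in R

    # Input and output matrices
    X=matrix(c(1,0,1,0,1,0,1,1,0,1,0,1),nrow = 3, ncol=4,byrow = TRUE)
    Y=matrix(c(1,1,0),byrow=FALSE)

    # Sigmoid function and its derivative (expressed via the sigmoid output)
    sigmoid=function(x){
      1/(1+exp(-x))
    }
    derivatives_sigmoid=function(x){
      x*(1-x)
    }

    # epoch and lr are assumed values; they did not survive in this listing
    epoch=5000
    lr=0.1
    inputlayer_neurons=ncol(X)
    hiddenlayer_neurons=3
    output_neurons=1

    # Random weights; each bias is repeated on every row so it applies to every sample
    wh=matrix( rnorm(inputlayer_neurons*hiddenlayer_neurons,mean=0,sd=1), inputlayer_neurons, hiddenlayer_neurons)
    bh=matrix( runif(hiddenlayer_neurons), nrow(X), hiddenlayer_neurons, byrow = TRUE)
    wout=matrix( rnorm(hiddenlayer_neurons*output_neurons,mean=0,sd=1), hiddenlayer_neurons, output_neurons)
    bout=matrix( runif(output_neurons), nrow(X), output_neurons)

    for(i in 1:epoch){
      # Forward Propagation
      hidden_layer_activations= sigmoid(X%*%wh + bh)
      output_layer_input= hidden_layer_activations%*%wout + bout
      output= sigmoid(output_layer_input)

      # Back Propagation
      E= Y-output
      d_output= E*derivatives_sigmoid(output)
      d_hiddenlayer= (d_output%*%t(wout))*derivatives_sigmoid(hidden_layer_activations)
      wout= wout + (t(hidden_layer_activations)%*%d_output)*lr
      wh = wh + (t(X)%*%d_hiddenlayer)*lr
      # Sum the deltas over samples (per neuron), matching sum(..., axis=0) above
      bout= bout + sum(d_output)*lr
      bh = bh + matrix(colSums(d_hiddenlayer), nrow(X), hiddenlayer_neurons, byrow = TRUE)*lr
    }

    output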

Understanding the implementation of Neural Networks from scratch in detail

Now that you have gone through a basic implementation of a neural network from scratch in both Python and R, we will dive deep into understanding each code block and try to apply the same code on a different dataset. We will also visualize how our model is working, by “debugging” it step by step using the interactive environment of a jupyter notebook and using basic data science tools such as numpy and matplotlib. So let’s get started!

The first thing we will do is to import the libraries mentioned before, namely numpy and matplotlib. Also, as we will be working with the jupyter notebook IDE, we will set inline plotting of graphs using the magic function %matplotlib inline

Let’s check the versions of the libraries we are using

Version of numpy: 1.18.1

and the same for matplotlib

Version of matplotlib: 3.1.3

Also, let’s set the random seed parameter to a specific number (let’s say 42, as we already know that is the answer to everything!) so that the code gives us the same output every time we run it (hopefully!)

Now the next step is to create our input. Firstly, let’s take a dummy dataset, where only the first column is a useful column, whereas the rest may or may not be useful and can be a potential noise.

This is the output we get from running the above code

Input: [[1 0 0 0] [1 0 1 1] [0 1 0 1]]

Shape of Input: (3, 4)

Now as you might remember, we have to take the transpose of input so that we can train our network. Let’s do that quickly

Input in matrix form: [[1 1 0] [0 0 1] [0 1 0] [0 1 1]]

Shape of Input Matrix: (4, 3)

Now let’s create our output array and transpose that too

Actual Output: [[1] [1] [0]]

Output in matrix form: [[1 1 0]]

Shape of Output: (1, 3)

Now that our input and output data is ready, let’s define our neural network. We will define a very simple architecture, having one hidden layer with just three neurons
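The embedded code for these steps is not shown on this page, so here is a small sketch that would produce arrays with the values and shapes printed above. The variable names `X` and `y` are assumptions.

```python
import numpy as np

# Dummy dataset: one sample per row, four features per sample.
X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
print("Input:\n", X)
print("Shape of Input:", X.shape)

# Transpose so that each column is one sample.
X = X.T
print("Input in matrix form:\n", X)
print("Shape of Input Matrix:", X.shape)

# Target values, one per sample, transposed the same way.
y = np.array([[1], [1], [0]])
print("Actual Output:\n", y)
y = y.T
print("Output in matrix form:\n", y)
print("Shape of Output:", y.shape)
```

After the transpose, samples run along the columns, which is why the later weight matrices are transposed before being multiplied with the input.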

Then, we will initialize the weights for each neuron in the network. The weights we create have values ranging from 0 to 1, which we initialize randomly at the start.

For simplicity, we will not include bias in the calculations, but you can check the simple implementation we did before to see how it works for the bias term

Let’s print the shapes of these numpy arrays for clarity

After this, we will define our activation function as sigmoid, which we will use in both the hidden layer and output layer of the network

And then, we will implement our forward pass, first to get the hidden layer activations and then for the output layer. Our forward pass would look something like this

Let’s see what our untrained model gives as an output.

We get an output for each sample of the input data. In this case, let’s calculate the error for each sample using the squared error loss
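The weight initialization, sigmoid activation, forward pass, and squared error described above can be sketched together as follows. The variable names are assumptions, and because the exact order of random draws in the original notebook is unknown, the numbers this produces are not guaranteed to match the ones quoted in the article.

```python
import numpy as np

np.random.seed(42)

# Inputs and targets with samples along the columns, as set up earlier.
X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]]).T   # (4, 3)
y = np.array([[1], [1], [0]]).T                              # (1, 3)

input_neurons, hidden_neurons, output_neurons = X.shape[0], 3, 1

# Random weights in [0, 1); no bias terms, as stated in the text.
weights_input_hidden = np.random.uniform(size=(input_neurons, hidden_neurons))
weights_hidden_output = np.random.uniform(size=(hidden_neurons, output_neurons))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: hidden layer first, then the output layer.
Z1 = weights_input_hidden.T @ X                      # (3, 3)
hidden_activations = sigmoid(Z1)
Z2 = weights_hidden_output.T @ hidden_activations    # (1, 3)
output = sigmoid(Z2)

# Squared error for each sample.
error = np.square(y - output) / 2                    # (1, 3)
print(output)
print(error)
```

Each column of `output` and `error` corresponds to one of the three training samples.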

We get an output like this

array([[0.05013458, 0.03727248, 0.25388062]])

We have completed our forward propagation step and got the error. Now let’s do a backward propagation to calculate the gradient of the error with respect to each weight of the network and then update these weights using simple gradient descent.

Firstly we will calculate the error with respect to weights between the hidden and output layers. Essentially, we will do an operation such as this

where to calculate this, the following would be our intermediate steps using the chain rule

Rate of change of error w.r.t output

Rate of change of output w.r.t Z2

Rate of change of Z2 w.r.t weights between hidden and output layer
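The three rates of change listed above multiply together by the chain rule. Written out in symbols, with the notation being an assumption made here ($E = \tfrac{1}{2}(y - O)^2$ for the error, $O = \sigma(Z_2)$ for the output, $h$ for the hidden layer activations, and $W_{ho}$ for the hidden-to-output weights, so that $Z_2 = W_{ho}^{\top} h$), this might read:

```latex
\frac{\partial E}{\partial W_{ho}}
  = \frac{\partial E}{\partial O}\cdot
    \frac{\partial O}{\partial Z_2}\cdot
    \frac{\partial Z_2}{\partial W_{ho}},
\qquad
\frac{\partial E}{\partial O} = -(y - O),\quad
\frac{\partial O}{\partial Z_2} = O\,(1 - O),\quad
\frac{\partial Z_2}{\partial W_{ho}} = h
```

The middle factor is the sigmoid slope expressed through the sigmoid's own output.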

Let’s perform the operations

Now, let’s check the shapes of the intermediate operations.

What we want is an output shape like this

Now as we saw before, we can define this operation formally using this equation

Let’s perform the steps

We get the output as expected.

Further, let’s perform the same steps for calculating the error with respect to weights between input and hidden – like this

So by chain rule, we will calculate the following intermediate steps,

Rate of change of error w.r.t output

Rate of change of output w.r.t Z2

Rate of change of Z2 w.r.t hidden layer activations

Rate of change of hidden layer activations w.r.t Z1

Rate of change of Z1 w.r.t weights between input and hidden layer

Let’s print the shapes of these intermediate arrays

View the code on Gist.

(1, 3) (1, 3) (3, 1) (3, 3) (4, 3)But what we want is an array of shape this

View the code on Gist.

(4, 3)

So we will combine them using the equation

View the code on Gist.

So that is the output we want. Let’s quickly check the shape of the resultant array

View the code on Gist.
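A sketch of how the five intermediate arrays can be combined into a (4, 3) gradient is shown below. The data, weights and variable names are assumptions made for illustration; only the shapes are taken from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X = rng.random((4, 3))             # assumed layout: 4 features x 3 samples
y = np.array([[1.0, 1.0, 0.0]])    # assumed targets, shape (1, 3)
w_ih = rng.random((4, 3))          # input -> hidden weights
w_ho = rng.random((3, 1))          # hidden -> output weights

# Forward pass
hidden = sigmoid(w_ih.T @ X)       # (3, 3)
output = sigmoid(w_ho.T @ hidden)  # (1, 3)

# The five chain rule pieces
error_wrt_output = -(y - output)       # (1, 3)
output_wrt_Z2 = output * (1 - output)  # (1, 3)
Z2_wrt_h = w_ho                        # (3, 1)
h_wrt_Z1 = hidden * (1 - hidden)       # (3, 3)
Z1_wrt_wih = X                         # (4, 3)

# Push the output delta back through w_ho, scale by the hidden slope,
# then sum over samples with the final matrix product.
delta_output = error_wrt_output * output_wrt_Z2   # (1, 3)
delta_hidden = h_wrt_Z1 * (Z2_wrt_h @ delta_output)  # (3, 3)
error_wrt_wih = Z1_wrt_wih @ delta_hidden.T       # (4, 3)
print(error_wrt_wih.shape)
```

Note that `delta_output` is exactly the quantity already computed for the hidden-to-output gradient, so it can be reused rather than recomputed.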

Now the next step is to update the parameters. For this, we will use vanilla gradient descent update function, which is as follows

Firstly define our alpha parameter, i.e. the learning rate as 0.01

View the code on Gist.

We also print the initial weights before the update

View the code on Gist.

View the code on Gist.

View the code on Gist.

and update the weights

View the code on Gist.

Then, we check the weights again to see if they have been updated

View the code on Gist.

View the code on Gist.
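The vanilla gradient descent update can be sketched in a few lines. The weight and gradient values here are hypothetical placeholders; in the walkthrough they would come from the backward pass.

```python
import numpy as np

lr = 0.01  # learning rate (alpha)

# Hypothetical weights and gradient, for illustration only.
w_ho = np.array([[0.5], [-0.2], [0.1]])
error_wrt_who = np.array([[0.3], [-0.1], [0.0]])

w_ho_before = w_ho.copy()

# Step against the gradient to reduce the error.
w_ho = w_ho - lr * error_wrt_who

print(w_ho_before.ravel())
print(w_ho.ravel())
```

A weight whose gradient is zero is left unchanged, and the size of every other change scales with the learning rate.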

Now, this is just one iteration (or epoch) of the forward and backward pass. We have to do it multiple times to make our model perform better. Let’s perform the steps above again for 1000 epochs

View the code on Gist.

View the code on Gist.
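Putting the forward pass, backward pass and update together, a training loop along these lines could look like the sketch below. The toy dataset, the seed, the helper name `train` and the weight initialisation are assumptions, so the printed errors will not match the article's numbers exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden_neurons=3, lr=0.01, epochs=1000, seed=0):
    """Full-batch gradient descent on a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    w_ih = rng.random((X.shape[0], hidden_neurons))
    w_ho = rng.random((hidden_neurons, 1))
    history = []
    for epoch in range(epochs):
        # Forward pass
        hidden = sigmoid(w_ih.T @ X)
        output = sigmoid(w_ho.T @ hidden)
        history.append(float(np.mean(np.square(y - output) / 2)))

        # Backward pass
        delta_output = -(y - output) * output * (1 - output)
        grad_who = hidden @ delta_output.T
        delta_hidden = (w_ho @ delta_output) * hidden * (1 - hidden)
        grad_wih = X @ delta_hidden.T

        # Parameter update
        w_ho = w_ho - lr * grad_who
        w_ih = w_ih - lr * grad_wih

        if epoch % 100 == 0:
            print(f"Error at epoch {epoch} is {history[-1]:.5f}")
    return w_ih, w_ho, history

# Assumed toy data: 4 features x 3 samples, one target per sample.
X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float).T
y = np.array([[1.0, 1.0, 0.0]])

w_ih, w_ho, history = train(X, y)
```

Recording the error at every epoch in `history` also makes it straightforward to plot the training curve afterwards.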

We get an output like this, which is a debugging step we did to check error at every hundredth epoch

Error at epoch 0 is 0.11553
Error at epoch 100 is 0.11082
Error at epoch 200 is 0.10606
Error at epoch 300 is 0.09845
Error at epoch 400 is 0.08483
Error at epoch 500 is 0.06396
Error at epoch 600 is 0.04206
Error at epoch 700 is 0.02641
Error at epoch 800 is 0.01719
Error at epoch 900 is 0.01190

Our model seems to be performing better and better as the training continues. Let’s check the weights after the training is done

View the code on Gist.

View the code on Gist.

And also plot a graph to visualize how the training went

View the code on Gist.

One final thing we will do is to check how close the predictions are to our actual output

View the code on Gist.

View the code on Gist.

Pretty close!

Further, the next thing we will do is to train our model on a different dataset, and visualize the performance by plotting a decision boundary after training.

Let’s get on to it!

View the code on Gist.

View the code on Gist.

We get an output like this

View the code on Gist.

We will normalize the input so that our model trains faster

View the code on Gist.
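One common way to normalize inputs is min-max scaling, sketched below. Whether the original Gist used this exact scheme is an assumption, and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) x 2 features (columns).
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])

# Min-max scaling: map each feature (column) to the range [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)
print(X_norm)
```

Bringing all features onto a similar scale keeps the sigmoid inputs away from the saturated regions, which tends to help gradient descent converge faster.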

View the code on Gist.

View the code on Gist.

View the code on Gist.

View the code on Gist.

View the code on Gist.

Now we will define our network. We will update the following three hyperparameters, namely

Change hidden layer neurons to be 10

Change the learning rate to be 0.1

and train for more epochs

View the code on Gist.

This is the error we get after every thousand epochs

Error at epoch 0 is 0.23478
Error at epoch 1000 is 0.25000
Error at epoch 2000 is 0.25000
Error at epoch 3000 is 0.25000
Error at epoch 4000 is 0.05129
Error at epoch 5000 is 0.02163
Error at epoch 6000 is 0.01157
Error at epoch 7000 is 0.00775
Error at epoch 8000 is 0.00689
Error at epoch 9000 is 0.07556

And plotting it gives an output like this

View the code on Gist.

Now, if we check the predictions and output manually, they seem pretty close

View the code on Gist.

which gives us an output like this

which lets us know how adept our neural network is at trying to find the pattern in the data and then classifying them accordingly.

Here’s an exercise for you – try to take the same implementation we did, and apply it to a “blobs” dataset generated using scikit-learn. The data would look similar to this

Do share your results with us!

[Optional] Mathematical Perspective of Back Propagation Algorithm

Let Wi be the weights between the input layer and the hidden layer, and let Wh be the weights between the hidden layer and the output layer.

Now, h = σ(u) = σ(WiX), i.e. h is a function of u, and u is a function of Wi and X. Here σ denotes our activation function.

Y = σ(u’) = σ(Whh), i.e. Y is a function of u’, and u’ is a function of Wh and h.

We will be constantly referencing the above equations to calculate partial derivatives.

We are primarily interested in finding two terms, ∂E/∂Wi and ∂E/∂Wh i.e change in Error on changing the weights between the input and the hidden layer and change in error on changing the weights between the hidden layer and the output layer.

But to calculate both these partial derivatives, we will need to use the chain rule of partial differentiation, since E is a function of Y, Y is a function of u’, and u’ is a function of Wh and h (which in turn depends on Wi).

Let’s put this property to good use and calculate the gradients.

∂E/∂Wh = (∂E/∂Y).( ∂Y/∂u’).( ∂u’/∂Wh), ……..(1)

We know E is of the form E = (Y-t)²/2.

So, (∂E/∂Y)= (Y-t)

Now, σ is a sigmoid function and has an interesting differentiation of the form σ(1- σ). I urge the readers to work this out on their side for verification.

So, (∂Y/∂u’) = ∂(σ(u’))/∂u’ = σ(u’)(1 - σ(u’)).

But, σ(u’)=Y, So,

(∂Y/∂u’)=Y(1-Y)
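For readers who want to verify the sigmoid derivative used above, here is one way to work it out, assuming the standard definition of the sigmoid:

$$
\sigma(u) = \frac{1}{1 + e^{-u}}
\quad\Longrightarrow\quad
\frac{d\sigma}{du} = \frac{e^{-u}}{\left(1 + e^{-u}\right)^{2}}
= \frac{1}{1 + e^{-u}} \cdot \frac{e^{-u}}{1 + e^{-u}}
= \sigma(u)\,\bigl(1 - \sigma(u)\bigr)
$$

The last step uses the identity $1 - \sigma(u) = \dfrac{e^{-u}}{1 + e^{-u}}$.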

Now, ( ∂u’/∂Wh)= ∂( Whh)/ ∂Wh = h

Replacing the values in equation (1) we get,

∂E/∂Wh = (Y-t). Y(1-Y).h

So, now we have computed the gradient between the hidden layer and the output layer. It is time we calculate the gradient between the input layer and the hidden layer.

∂E/∂Wi =(∂ E/∂ h). (∂h/∂u).( ∂u/∂Wi)

But, (∂ E/∂ h) = (∂E/∂Y).( ∂Y/∂u’).( ∂u’/∂h). Replacing this value in the above equation we get,

∂E/∂Wi =[(∂E/∂Y).( ∂Y/∂u’).( ∂u’/∂h)]. (∂h/∂u).( ∂u/∂Wi)……………(2)

So, What was the benefit of first calculating the gradient between the hidden layer and the output layer?

As you can see in equation (2), we have already computed ∂E/∂Y and ∂Y/∂u’, saving us space and computation time. We will see in a while why this algorithm is called the backpropagation algorithm.

Let us compute the unknown derivatives in equation (2).

∂u’/∂h = ∂(Whh)/ ∂h = Wh

∂h/∂u = ∂(σ(u))/∂u = σ(u)(1 - σ(u))

But, σ(u)=h, So,

(∂h/∂u)=h(1-h)

Now, ∂u/∂Wi = ∂(WiX)/ ∂Wi = X

Replacing all these values in equation (2) we get,

∂E/∂Wi = [(Y-t). Y(1-Y).Wh].h(1-h).X
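The two closed-form gradients can be checked numerically. The sketch below uses a scalar network (one input, one hidden unit, one output) with arbitrary assumed values, and compares each formula against a central finite difference of E.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def error(Wi, Wh, X, t):
    """E = (Y - t)^2 / 2 for a scalar one-hidden-unit network."""
    h = sigmoid(Wi * X)
    Y = sigmoid(Wh * h)
    return (Y - t) ** 2 / 2

# Arbitrary illustrative values (assumptions, not from the article).
X, t, Wi, Wh = 0.5, 1.0, 0.8, -0.4

h = sigmoid(Wi * X)
Y = sigmoid(Wh * h)

# Closed-form gradients from the derivation
dE_dWh = (Y - t) * Y * (1 - Y) * h
dE_dWi = (Y - t) * Y * (1 - Y) * Wh * h * (1 - h) * X

# Central finite differences for comparison
eps = 1e-6
num_dWh = (error(Wi, Wh + eps, X, t) - error(Wi, Wh - eps, X, t)) / (2 * eps)
num_dWi = (error(Wi + eps, Wh, X, t) - error(Wi - eps, Wh, X, t)) / (2 * eps)

print(dE_dWh, num_dWh)
print(dE_dWi, num_dWi)
```

If the derivation is right, each pair of printed numbers should agree to many decimal places.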

So, now that we have calculated both the gradients, the weights can be updated by stepping against the gradient, which is the direction that reduces E:

Wh = Wh - η . ∂E/∂Wh

Wi = Wi - η . ∂E/∂Wi

Where η is the learning rate.

So coming back to the question: Why is this algorithm called Back Propagation Algorithm?

The reason is: if you look at the final form of ∂E/∂Wh and ∂E/∂Wi, you will see the term (Y-t), i.e. the output error, which is what we started with and then propagated back towards the input layer to update the weights.

So, where does this mathematics fit into the code?

hiddenlayer_activations=h

E= Y-t

Slope_output_layer = Y(1-Y)

lr = η

slope_hidden_layer = h(1-h)

wout = Wh

Now, you can easily relate the code to the mathematics.

End Notes:

To summarize, this article is focused on building neural networks from scratch and understanding their basic concepts. I hope you now understand how neural networks work: how forward and backward propagation work, optimization algorithms (full batch and stochastic gradient descent), how to update weights and biases, visualization of each step in Excel, and on top of that, code in Python and R.

Therefore, in my upcoming article, I’ll explain the applications of using Neural Networks in Python and solving real-life challenges related to:

Computer Vision

Speech

Natural Language Processing


## A Brief History Of The Apollo Hoax

When Neil Armstrong pressed the first bootprint into the Sea of Tranquility, most of humanity watched the televised low-res blob and felt pride welling up in their chests. But a few watchers felt something entirely different—an unconfirmed, squinty-eyed skepticism that something about the whole deal smelled fishy. How could the United States, which could barely put a chimp into space in 1961, get two full-grown men on the surface of the moon eight years later? How could anyone confirm that men actually made it to the moon? And, how, exactly, had that $25 billion Apollo budget been spent?

Five years, and five lunar landings later, the nebulous idea that the government faked the whole moon shot on a soundstage somewhere in the Southwest finally coalesced when, in 1974, Bill Kaysing, a former technical writer for Rocketdyne, a company that worked on the Saturn V launch vehicle, self-published a book, _We Never Went to the Moon: America’s $30 Billion Swindle_. Kaysing claimed that the Apollo program was faked to allow the U.S. to secretly militarize space, and that the astronauts, who were put through sessions of “guilt therapy” to help deal with the deception, were actually at a strip club in Nevada the night of the moon landing.

Far from being the work of an exhaustive investigative journalist, its notable lack of evidence, sources, and logical reasoning kept the tome from hitting the bestseller list (or any list). But mistrust of the government (1974 was the height of frustration with Vietnam and the Watergate scandal) gave Kaysing’s semi-formed ideas enough to nudge the Apollo Hoax out of the ether and into the near fringe of pseudo-science. The seed was slow to germinate, but Capricorn One, a popular 1978 film starring OJ Simpson (whom later theorists have implicated in the Apollo coverup) in which the government fakes a manned Mars landing, kept Kaysing’s ideas alive and helped spawn a cottage industry of Moon hoaxers who gathered and presented evidence to one another throughout the 1980s and 1990s.

Despite this, the Apollo Hoax remained fringe, and was on the verge of likely evaporation when the nexus of the Internet and a February 2001 special on the Fox network called Conspiracy Theory: Did We Ever Land on the Moon? put the theory on public display for the first time. In Fox’s shockumentary era (see When Animals Attack and Temptation Idol), the Moon Hoax documentary and a replay a month later were ratings successes, and became water cooler fodder across the country, with people asking “Why weren’t there stars in the photos?” and “How could the astronauts have survived the radiation of the Van Allen Belts?” Aided by a blossoming of Internet conspiracy sites, the Apollo Hoax gained its first true toehold in the mainstream press.

Join PopSci as we celebrate NASA’s 50th anniversary!

Footprint on the Moon

Moondust? Or just plaster?

At the same time Fox was giving credence to Kaysing’s ideas, astronomer Phil Plait was preparing the defense. On his Web site Bad Astronomy and in a later book of the same name, the professional astronomer refuted the claims of the Fox show and Kaysing (who passed away in 2005) point by point. Plait’s refutations spawned dozens of other debunking sites, setting off a veritable Internet war between hoax believers and their critics. There have been a few notable events since then. In 2002, Bart Sibrel, who appeared in the Fox special, was punched in the face by astronaut Buzz Aldrin after poking Aldrin with a Bible and asking him to swear to the moon landings’ authenticity. More recently, Fear Factor host Joe Rogan has gone to bat for the Moon Hoax Theory, debating Plait on Penn Jillette’s radio show. But as for hard evidence? The stories haven’t changed much since 1974.

“In ten years I think this conspiracy theory will be gone,” says Plait, who points out that in 2009 NASA’s Lunar Reconnaissance Orbiter will give us clear photos of the moon landing sites, and says the U.S. goal of returning to the moon by 2023 will refocus us on the triumph of the Apollo mission. “These guys are not professional journalists, they have no credentials, and their arguments are tissue thin. They have a track record of 100 percent errors.”

Though the hoaxers’ claims usually disappear when held up to the light, there is one question that sticks in one’s craw: what happened to the official videotapes of the Apollo 11 landing? To save space in the broadcast spectrum so they could transmit telemetry and other data, cameras on the lunar lander transmitted images in a special slow-scan format. That data was received by stations in Australia and the Mojave, formatted for television broadcast and sent to Houston. The images seen on television were fuzzy and indistinct. The actual slow-scan footage before conversion was crisp and full of detail.

But those priceless historical images weren’t put in a vault at the Smithsonian like they should have been. According to NASA records, the official video images of the moon landing were stored in 2,612 boxes at a government warehouse. Between 1975 and 1979, the Goddard Space Center requested all but two boxes of tapes and never returned them to the National Archives. Now, the 13,000 reels of data are nowhere to be found. In 2006, NASA began a dedicated agency-wide hunt, but to date, the images haven’t shown up. “Despite the challenges of the search,” a NASA release states, “NASA does not consider the tapes to be lost.” But the hoaxers and moon doubters do. And it’s unlikely their questions will be put to rest till we put another footprint on the moon.

View the scant remnants here.

Workings

The original film of the Apollo 11 moon landing is recorded on reels of specially formatted magnetic tape like this one. There is only one working machine at the Goddard Space Flight Center that can still read the magnetic tape.

Have You Seen?

These images were distributed to NASA employees in 2006 to help them identify any boxes of Apollo-era tapes they might come across. Goddard believes the missing tapes may still be somewhere at Goddard or sent off to other facilities for storage or research.

Tape It

Over 2,600 boxes of magnetic tapes full of Apollo data like this one were shipped from the National Archives to the Goddard Space Center between 1970 and 1975. Only two boxes of tapes remain at the Archives. The rest are missing.

Knowledge is Power

The tapes include biomedical data on astronauts, telemetry and engineering data from the Apollo missions, as well as video footage of the Apollo 11 landing. As NASA begins thinking about future missions to the moon, these 13,000 reels of missing tape would be a critical source of information.

## A Quick Glance Of Matlab Scripts With Examples

Introduction to Matlab Scripts

A Matlab script is a sequence of commands, typically representing a program, that is executed in the same way as a single command typed in the Matlab command window. A script is created using the ‘edit’ command. Variables created in a script can be accessed from the Matlab command window until we clear them or terminate the session. To run a script, we must save it in the current directory or in a directory on the Matlab path. Matlab scripts must be saved with the ‘.m’ extension, which is why they are referred to as “M-files”.

Examples of Matlab Scripts

Given below are the examples:

Example #1

In this example we will create a script that will generate 5000 random numbers between 0 and 100. We will also create a histogram for all these numbers.

Below are the steps that we will follow for this example:

Use ‘edit’ command to create new script.

Write the code for generating 5000 random numbers and drawing a histogram.

Code:

edit myHisto

[Using the ‘edit’ command to create the script ‘myHisto’]

Col = 5000;

[Initializing the number of columns]

Row = 1;

[Initializing the number of rows]

Bins = Col/100;

[Defining the Bins for the histogram]

rng('shuffle');

[Seeding the random number generator from the current time, so that each run produces different random values]

A = 100*rand(Row, Col);

histogram(A, Bins)

[Drawing the histogram using above values] [Save this file as .m extension. Please keep in mind that the name of the file must be same as the name of the script, which is ‘myHisto’ in our example]

Next, we need to call this script. This is done by typing the name of the script in the command window as below:

myHisto

[Calling the script created in the command window]

Input:

histogram(A, Bins)

[Please notice that in the above figure, the name of the file is same as the name of the script]

Output:

As we can see in the output, we have obtained a histogram of random values as expected by us.

Example #2

In this example we will create a script that will be used to find the integration of a function.

Below are the steps that we will follow for this example:

Use ‘edit’ command to create the new script.

Write the code for computing the integration using ‘integral’ function.

Code:

edit myIntegral

[Using the ‘edit’ command to create the script ‘myIntegral’]

syms x

[Initializing the local variable ‘x’]

Fx = @(x) 5*x.^3

[Creating the polynomial function of degree 3]

A = integral (Fx, 0, 3)

[Passing the input function & the required limits] [Save this file as .m extension and keep the name as ‘myIntegral’] [Mathematically, the integral of 5*x^3, between the limits 0 to 3 is 101.25]

Next, we need to call this script. This is done by typing the name of the script in the command window as below:

myIntegral

[Calling the script created in the command window]

Input:

A = integral (Fx, 0, 3)

As we can see in the output, we have obtained integration of our function by calling the script.
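The value 101.25 quoted for this integral can be confirmed by hand:

$$
\int_{0}^{3} 5x^{3}\,dx = \left[\frac{5}{4}x^{4}\right]_{0}^{3} = \frac{5}{4}\cdot 81 - 0 = 101.25
$$

The numerical result returned by the script should agree with this value up to floating-point precision.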

Example #3

In this example we will create a script that will be used to draw a sphere.

Below are the steps that we will follow for this example:

Use ‘edit’ command to create the new script.

Write the code for drawing a sphere of radius ‘Rad’.

Code:

edit drawSphere

[Using the ‘edit’ command to create the script ‘drawSphere’]

[a, b, c] = sphere;

[Creating unit sphere]

Rad = 2;

[Initializing the radius]

surf(a * Rad, b * Rad, c * Rad)

[Adjusting the dimensions & creating the plot]

axis equal

[Making the scale common for each axis] [Save this file as .m extension and keep the name as ‘drawSphere’]

Next, we will call this script. This is done by typing the name of the script in the command window as below:

drawSphere

[Calling the script created in the command window]

Input:

axis equal

Output:

Example #4

In this example we will plot a sine wave and a cos wave in the same plot using a script.

Below are the steps that we will follow for this example:

Use ‘edit’ command to create the new script.

Write the code for creating the waves and plot them.

Code:

edit drawWaves

[Using the ‘edit’ command to create the script ‘drawWaves’]

A = linspace (-pi, 2*pi);

[Initializing the interval]

Y = sin(A);

[Creating the sine wave]

Z = cos(A);

[Creating the cos wave]

T = plot(A, Y, A, Z);

[Creating the plot] [Save this file as .m extension and keep the name as ‘drawWaves’]

Next, we will call this script. This is done by typing the name of the script in the command window as below:

drawWaves

[Calling the script created in the command window]

Input:

T = plot(A, Y, A, Z);

Output:

As we can see in the output, we have obtained a plot containing sine and cos waves by calling the script.

Conclusion

Scripts in Matlab consist of a sequence of commands that we run as a program by calling them from the command window. While creating a script, we must save it with the .m extension and keep the file name the same as the name of the script.


## A Quick Glance Of Postgresql Count With Examples

Introduction to Postgresql Count

There are many aggregate functions present in the PostgreSQL database. One of the aggregate functions, used to find the row count, is the COUNT() aggregate function. This function counts the total number of rows according to the query statement and clauses. When it is used on a particular column, only non-NULL values are considered. In this article, we will see how the COUNT() function works with *, with a particular column for non-NULL values, and with the DISTINCT keyword, GROUP BY clause, and HAVING clause, with the help of examples. We will begin studying the working of the COUNT() function by learning its syntax.

Syntax

SELECT COUNT(* | [DISTINCT | ALL] columnName) FROM tableName [WHERE conditionalStatements];

The COUNT function accepts different parameters. It can be passed either “*”, to count all the rows in the result set, or a column name optionally preceded by the DISTINCT or ALL keyword, to count the distinct or all non-NULL values in that column. When neither keyword is given with a column name, ALL is the default. Using the DISTINCT keyword limits the count to unique values within the specified column. tableName specifies the table from which we want to retrieve the rows to count. conditionalStatements are the optional conditions you wish to apply in the WHERE clause.

Example of Postgresql Count

Let us begin by connecting to our PostgreSQL database and opening the psql terminal command prompt using the following statements –

sudo su - postgres
psql

The above commands give access to the Postgres command prompt as follows –

Now let us create one table and insert values in it.

CREATE TABLE educba (
technical_id serial PRIMARY KEY,
technology_name VARCHAR (255) NOT NULL,
course_Duration INTEGER,
starting_date DATE NOT NULL DEFAULT CURRENT_DATE,
department VARCHAR(100));

Firing the above query in our psql terminal command prompt will result in the following output –

Let us insert a row into the educba table, giving an explicit value for the starting_date column.

INSERT INTO educba(technology_name, course_duration, starting_date, department) VALUES ('psql',35,'2024-04-07','Database');

This gives the following output –

Let’s insert some more entries, this time without mentioning starting_date, so that it takes its default value of CURRENT_DATE –

INSERT INTO educba(technology_name, course_duration, department) VALUES ('mysql',40,'Database');
INSERT INTO educba(technology_name, course_duration, department) VALUES ('javascript',30,'scripting language');
INSERT INTO educba(technology_name, course_duration, department) VALUES ('java',35,'server- side language');
INSERT INTO educba(technology_name, course_duration, department) VALUES ('Angular',35,'Client-side language');

That results in the following output –

Let us now check the contents of our table educba by firing the following SELECT command –

SELECT * FROM educba;

That gives the following output –

Let us retrieve the row count of the educba table using the COUNT() function. The query statement will be as follows –

SELECT COUNT(*) FROM educba;

That results in the following output –

Now, let us count the rows with 35 days of course_duration using the following query statement –

SELECT COUNT(*) FROM educba WHERE course_duration=35;

That results in the following output –

As there are three rows with psql, java, and angular as technology_name that have a course duration of 35 days, we got the row count as 3.

Using DISTINCT keyword

You can use the DISTINCT keyword inside COUNT() whenever you want to get the number of unique values in a particular column. For example, suppose that we want to find out how many different departments are used in the educba table. We can then mention DISTINCT(department) inside COUNT() using the following query statement –

SELECT COUNT(DISTINCT(department)) FROM educba;

That results in the following output –

Using GROUP BY clause

Now, let us retrieve the count of rows grouped according to the course_duration. The following query statement gets the count of records grouped on the course_duration column –

SELECT COUNT(*),course_duration FROM educba GROUP BY course_duration;

The output will be as follows –

As three technologies have a course duration of 35 days, and one technology each has a duration of 40 and 30 days, the above output is correct. But we cannot know which technologies are included in each count. To find out, we can use the string_agg() function.

Using string_agg function

The above query just retrieved the count of technologies grouped on course_duration used in the educba table. If we want the list of those technologies, then we can use the string_agg() function to get the comma-separated list of those technologies in the following way –

SELECT COUNT(technology_name) as technology_count, course_duration as duration_in_days, string_agg(technology_name,',') as list_of_technologies FROM educba GROUP BY course_duration;

The output of the above query statement is as follows –

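Retrieving column count

When COUNT() is applied to a column, only the non-NULL values in that column are counted. To see this, let us add a column whose default value is NULL –

alter table educba add column temp_null_col varchar default null;

And for verifying the records of educba, we will fire the following command –

SELECT * from educba;

Whose output is as follows –

Now let us set the new column to a non-NULL value for the rows in the Database department –

update educba set temp_null_col='temp' where department='Database';

Whose output is as follows –

SELECT * from educba;

That results in the following output –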

Now, let us get the count of the column temp_null_col using the following query –

select count(temp_null_col) from educba;

Whose output is as follows –

Considering only non-null values, the count of rows in the column temp_null_col is 2.
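The counting behaviour described in this article can be reproduced in a few lines of Python. Note the assumptions: this sketch uses SQLite through the standard-library `sqlite3` module as a stand-in for PostgreSQL, a simplified version of the educba table, and a hypothetical helper named `scalar`. The COUNT() forms shown behave the same way in both databases, but string_agg() is not included here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE educba ("
    "technology_name TEXT NOT NULL, course_duration INTEGER, "
    "department TEXT, temp_null_col TEXT)"
)
rows = [
    ("psql", 35, "Database", "temp"),
    ("mysql", 40, "Database", "temp"),
    ("javascript", 30, "scripting language", None),
    ("java", 35, "server-side language", None),
    ("Angular", 35, "Client-side language", None),
]
conn.executemany("INSERT INTO educba VALUES (?, ?, ?, ?)", rows)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

total = scalar("SELECT COUNT(*) FROM educba")
with_35 = scalar("SELECT COUNT(*) FROM educba WHERE course_duration = 35")
departments = scalar("SELECT COUNT(DISTINCT department) FROM educba")
non_null = scalar("SELECT COUNT(temp_null_col) FROM educba")
per_duration = dict(
    conn.execute(
        "SELECT course_duration, COUNT(*) FROM educba GROUP BY course_duration"
    ).fetchall()
)

print(total, with_35, departments, non_null, per_duration)
```

COUNT(*) counts every row, while COUNT(temp_null_col) skips the three rows where that column is NULL, which is the distinction the last example in this article illustrates.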

Conclusion

We can use the COUNT() aggregate function in PostgreSQL to get the number of rows returned by a particular query statement. Internally, the query fires to obtain the result set containing all the rows that meet the condition, and the count is then computed over that result set. Additionally, you can apply the COUNT() function to a specific column to retrieve the count of non-NULL values within that column.

It can also be used with the GROUP BY clause to get the count of grouped results. To fetch the count of unique values, the DISTINCT keyword can be used inside COUNT(). Additionally, the string_agg() function can be employed to obtain a list of the values from another column that are included in each count.
