All right, the moment we've all been waiting for is finally here. In this section, we'll begin our discussion of neural networks and highlight what makes neural networks, and by extension deep learning, different from the more traditional machine learning methods we've already discussed. The neural paradigm underpins the subfield of machine learning known as deep learning. We've analyzed machine learning models whose parameters interact with the features directly in order to produce predictions. But deep neural networks often comprise millions, and sometimes billions, of parameters organized into hierarchical layers. The features are multiplied and added together repeatedly, with the outputs from one layer of parameters being fed into the next layer before a prediction can finally be made. This increased interaction between features and model parameters increases the complexity of the functions that the model can learn. Compare such a model architecture, as we've just described, with the linear regression model we discussed earlier, which can only sum together features, each multiplied by a single parameter.

>> Deep learning relies on only a few additional concepts beyond what we've covered so far, and most of these have to do with the extra algorithmic complexity that comes with layered model architectures, in other words, layered function structures. The rest of this section will cover the following. First, we're going to dive a little deeper into the differences between neural network models and the other machine learning models we've covered. Then we're going to review the machine learning training loop and revisit the concept of loss, and we'll contextualize this with respect to neural networks. Then we'll discuss the strategy the model applies to minimize the loss. The overarching strategy for loss minimization is an iterative optimization technique known as gradient descent. We'll follow by discussing the actual process through which gradient descent is carried out and model parameters are updated, which is known as backpropagation.

One reason that neural networks are so powerful is the increased number of parameters that are used to interpret the data. Remember that parameters are the set of numbers within a model that define the model function. You can think of them as the coefficients of a function that are adjusted, or trained, during the machine learning process to accurately transform inputs into desired outputs. They are just numbers that are multiplied and added in various ways. In most traditional machine learning methods, such as SVMs and linear regression, the number of parameters is limited to the number of input features. In linear regression, for example, the features are simply multiplied with a set of parameters and then summed together. This is known as a linear combination, and the name comes from the fact that in cases where we have one feature and one parameter, the resulting function is a line, and this extends naturally to higher dimensions. The machine learning methods we have discussed adjust the parameters, or weights, to fit the function to a training data set. Earlier, we discussed an example of a model for classifying whether a nodule in the lung on a CT is malignant or benign based on size, which was implemented using logistic regression. Remember that we use logistic regression when the output label is categorical. In this case, it's a 0 if the nodule is benign or a 1 if the nodule is malignant.
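To make the idea of a linear combination concrete, here is a minimal sketch in Python with NumPy. The feature values, weights, and bias below are made-up numbers purely for illustration, not values from the lecture's nodule example.

```python
import numpy as np

# Hypothetical input features for one example (made-up values).
x = np.array([2.5, 1.0])

# Hypothetical learned parameters: one weight per feature, plus a bias term.
w = np.array([0.8, -0.3])
b = 0.1

# A linear combination: multiply each feature by its weight, sum them, and add the bias.
prediction = np.dot(w, x) + b
print(prediction)  # 0.8*2.5 + (-0.3)*1.0 + 0.1 = 1.8
```

With one feature and one weight, this same expression reduces to w*x + b, which is exactly the equation of a line, hence the name.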
In this case, we use the sigmoid function to transform the function from a line into an S shape in order to fit our categorically labeled data. Here again, the parameters can be adjusted to fit the function to our training data set. The sigmoid function, as we've mentioned earlier, is known as a nonlinear transformation: in comes a line and out comes something else. We'll see that nonlinear transformations are a key part of deep learning. Since we're about to start talking about deep learning, let's also throw in a sneak peek of deep learning lingo here. In deep learning terminology, we often use the term activation function to refer to the nonlinear transformations we use, and we call the result of a linear combination followed by an activation function the activations associated with the parameters involved. By the way, there are other activation functions besides the sigmoid that are used in deep learning models. Many recent models use what is called the ReLU activation function, which stands for rectified linear unit. This function passes input values through as is if they're positive, but it turns all negative inputs into zero. It's called a rectified linear unit because it rectifies, or passes through, only the positive side of the input range, and what it does pass through is linear. That rectification makes the function not a line, so it's a nonlinear transformation. While the ReLU doesn't have the same nice probabilistic interpretation as the sigmoid, it's been shown to work really well in deep learning models in practice, in particular allowing deep neural networks to train to better performance levels and more quickly, so you'll commonly see it used.

>> So as a quick recap, traditional machine learning methods often involve a linear combination of the input features, optionally followed by a nonlinear transformation. The result of these two steps is a set of activations, and these activations are synonymous with the predictions of a traditional machine learning model. Now this is where it gets exciting. Let us envision the above model as a single neuron. Just as the name suggests, neural networks are inspired by the brain's computation mechanism, which consists of units called neurons. In this physiologic metaphor, a neuron is a unit that takes an input signal and produces an output signal.

>> So in the earlier example, the features are x1 and x2, the parameters of the model are w1, w2, and b, and the nonlinear transformation is a function we can call f. The activation in this case is a single number, y. The neurons of a neural network are, for all intents and purposes, just miniature logistic regression models. A layer of a neural network consists of a set of these neurons that each take the same input. Each of these neurons has its own set of parameters that it learns, and each neuron produces a different output, or activation. So you might be asking yourself, why produce multiple outputs for the same input? Well, each neuron and its corresponding function can be thought of as looking for a different pattern in the input data to the layer. The different outputs from the multiple neurons then essentially represent higher-level features extracted by each neuron's function. We concatenate the multiple outputs from a layer of a neural network to form the input to the next layer. As each layer in a neural network performs its computation, the total computation performed by the model represents an increasingly complex function.
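As a concrete illustration of a single neuron, here is a minimal sketch that follows the notation above (features x1 and x2, parameters w1, w2, and b, activation function f, output y). The numeric values are made up, and both the sigmoid and ReLU activations are shown so you can compare them.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1), giving the S-shaped curve.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive inputs through unchanged and turns negative inputs into zero.
    return np.maximum(0.0, z)

# Hypothetical inputs and parameters for one neuron (made-up values).
x1, x2 = 0.5, -1.2          # input features
w1, w2, b = 1.3, 0.4, -0.1  # learned weights and bias

# Linear combination, followed by a nonlinear activation function f.
z = w1 * x1 + w2 * x2 + b
y_sigmoid = sigmoid(z)  # interpretable as a probability, as in logistic regression
y_relu = relu(z)        # the ReLU alternative commonly used inside deep networks

print(z, y_sigmoid, y_relu)
```

A layer is just many copies of this computation, each with its own w1, w2, and b, all reading the same inputs and each producing its own activation y.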
This is why what are known as deep neural networks are so powerful. The "deep" means that there are many layers of neurons in the network, which can represent a very complex function, which hopefully is also a very accurate mapping from input to output. In the basic form of a neural network we've just described, where all the neurons in a layer take as input the full input to the layer, we call these layers fully connected layers. Other names also exist; you might hear them called dense layers or linear layers, but these all mean the same thing. If we stack two fully connected layers in a neural network, where the second layer corresponds to the model output, for example a single neuron that represents the binary classification probability we want as output, then we have what's known as a two-layer fully connected neural network. In other words, this neural network consists of a first layer that transforms any given data example via a linear combination followed by a nonlinear transformation. The outputs of this first layer, or as we call them, the first-layer activations, are then passed into the second layer as input. The second layer performs another linear combination and nonlinear transformation on these inputs, and finally out comes a number that acts as our prediction for the given data example.

>> To recap, the neural network we have just described is essentially layers of logistic regression models known as neurons. There can be dozens, if not hundreds, of layers, each consisting of thousands of parameters, in a single neural network. The way these neurons are organized is known as the architecture of the neural network. The repeated combining and recombining of features is what gives neural networks their ability to find complex relationships within data in a way that many other traditional machine learning methods can't. In the next section, we'll review the training loop, adding in concepts that are particular to the neural network model type.
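To tie this together, here is a minimal sketch of the two-layer fully connected network just described. The layer sizes, weights, and input values are arbitrary placeholders chosen for illustration, with ReLU used in the first layer and a sigmoid on the single output neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 input features, 3 neurons in the first fully connected layer.
n_features, n_hidden = 4, 3

# First layer: one weight vector and one bias per neuron (random here, for illustration).
W1 = rng.normal(size=(n_hidden, n_features))
b1 = np.zeros(n_hidden)

# Second layer: a single output neuron that takes the first-layer activations as input.
W2 = rng.normal(size=(1, n_hidden))
b2 = np.zeros(1)

# One hypothetical data example.
x = np.array([0.2, -1.0, 0.7, 1.5])

# First layer: linear combination followed by a nonlinear transformation (ReLU).
h = relu(W1 @ x + b1)        # the first-layer activations

# Second layer: another linear combination, then a sigmoid to produce a probability.
p = sigmoid(W2 @ h + b2)[0]  # prediction for the binary classification

print(p)
```

In a real model, W1, b1, W2, and b2 would be learned from data during training rather than drawn at random, which is exactly what the training loop in the next section is about.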