
Parameter estimation
The output of each layer of the network depends on the parameters of the model, which are estimated by training the neural network to minimize the loss with respect to the weights, as we described earlier. This is a general principle in machine learning, in which a learning procedure, for example backpropagation, uses the gradients of the error of a model to update its parameters and minimize that error. Consider estimating the parameters of a linear regression model such that the output of the model minimizes the mean squared error (MSE). Mathematically speaking, the point-wise error between the predictions, $\hat{y}_i$, and the target values, $y_i$, is computed as follows:

$$e_i = y_i - \hat{y}_i$$
The MSE is computed as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$

where $\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$ represents the sum of the squared errors (SSE) and $\frac{1}{n}$ normalizes the SSE by the number of samples, $n$, to arrive at the mean squared error (MSE).
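As a quick illustration (a minimal sketch using NumPy, with made-up prediction and target values), the point-wise errors, the SSE, and the MSE can be computed as follows:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    errors = y_true - y_pred           # point-wise errors e_i = y_i - y_hat_i
    sse = np.sum(errors ** 2)          # sum of squared errors (SSE)
    return sse / len(y_true)           # divide by n to obtain the MSE

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical target values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # hypothetical predictions
print(mean_squared_error(y_true, y_pred))  # 0.375
```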
In the case of linear regression, the problem is convex and the MSE has a simple closed-form solution, given in the following equation:

$$\hat{\beta} = \left(X^{T}X\right)^{-1}X^{T}y$$

Here, $\hat{\beta}$ is the vector of coefficients of the linear model, $X$ is the matrix containing the observations and their respective features, and $y$ is the vector of response values associated with each observation. Note that this closed-form solution requires the $X^{T}X$ matrix to be invertible and, hence, to have a determinant larger than 0.
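The following sketch illustrates this closed-form estimate on synthetic data (the variable names and the simulated dataset are only for illustration); `np.linalg.solve` is used rather than an explicit matrix inverse, which is numerically preferable but equivalent to the equation above when $X^{T}X$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # intercept + 3 features
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)             # noisy responses

# Normal equation: beta_hat = (X^T X)^{-1} X^T y,
# solved here as a linear system for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 2.0, -3.0, 0.5]
```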
In the case of models where a closed-form solution to the loss does not exist, we can estimate the parameters that minimize the MSE by computing the partial derivative of the MSE loss with respect to each weight, $\frac{\partial\,\mathrm{MSE}}{\partial w_{j}}$, and using the negative of that value, scaled by a learning rate, $\eta$, to update the parameters of the model being evaluated:

$$w_{j} \leftarrow w_{j} - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial w_{j}}$$
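A minimal gradient-descent sketch for the linear-regression case (the function and variable names are illustrative, and the learning rate and number of steps are arbitrary choices) might look like this:

```python
import numpy as np

def gradient_descent_mse(X, y, learning_rate=0.1, n_steps=2000):
    """Estimate linear-model weights by gradient descent on the MSE loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        residuals = X @ w - y
        grad = (2.0 / n_samples) * (X.T @ residuals)  # dMSE/dw_j for each weight
        w -= learning_rate * grad                     # step against the gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.05, size=200)
print(gradient_descent_mse(X, y))  # approximately [1.5, -2.0]
```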
A model in which many of the coefficients in $\beta$ are 0 is said to be sparse. Given the large number of parameters, or coefficients, in deep learning models, producing sparse models is valuable because it can reduce computational requirements and yield faster inference.