1 What is deep learning?

1.1 Artificial intelligence, machine learning, and deep learning

The layer’s function in processing input data is determined by its weights, which are numerical values. Technically, the layer’s operation is parameterized by these weights, which can also be referred to as the layer’s parameters.

Learning in this context involves identifying the proper values for the weights across all layers in a network so that the network accurately associates input examples with their corresponding targets.

To control a neural network’s output, you first need to measure how far that output is from what was expected; this is the job of the loss function. The loss function compares the network’s predictions against the desired targets and computes a distance score, indicating how well the network performs on a given example. This function is also known as the objective function or cost function.

1.2 Before deep learning - A brief history of machine learning

Probabilistic modeling uses statistical principles for data analysis and is foundational in machine learning. The Naive Bayes algorithm is a well-known example that applies Bayes' theorem with the assumption that all input features are independent. This technique has been in use since before the advent of computers, with its principles dating back to the 18th century. Logistic regression is another related model which, despite its name, is used for classification tasks. It is also an old but still relevant technique, commonly used as an initial approach by data scientists for its simplicity and versatility.

Deep learning has gained popularity not only because of its superior performance on many tasks but also because it simplifies problem-solving by eliminating the need for feature engineering. In traditional machine learning methods, called shallow learning, humans had to manually create and fine-tune features from the data for the algorithms to work effectively. Deep learning, however, automates this process by learning all necessary features in a single pass through a multi-layered network structure.

Shallow learning methods like SVMs or decision trees only transform data into one or two layers of representation, which often isn’t sufficient for complex problems. Stacking shallow methods greedily, one after another, is a poor substitute for deep learning, because the representation that is optimal for a layer trained in isolation is not the representation that would be optimal for that same layer inside a deeper, jointly trained model. What is transformative about deep learning is that it learns multiple layers of representation jointly: all layers adjust together in response to a single feedback signal, allowing the model to develop increasingly complex and abstract representations through a series of simple transformations across layers.

The two key aspects of how deep learning learns from data are the incremental development of more complex representations through multiple layers and the joint learning of these representations, which enables each layer to be refined in coordination with the others. These properties have made deep learning much more effective than previous machine learning methods.

From 2016 to 2020, the machine learning and data science field was primarily focused on two methodologies: deep learning for perceptual tasks like image classification, and gradient boosted trees for structured data problems. Practitioners use Python libraries such as Scikit-learn, XGBoost, and LightGBM for gradient boosted trees and Keras, often with TensorFlow, for deep learning. Proficiency in these techniques and tools is considered crucial for success in applied machine learning, and they are frequently used in Kaggle competitions.

1.3 Why deep learning - Why now

Deep learning is considered a significant advancement in AI with several properties that make it a robust and long-lasting approach, likely to influence future technologies. These properties include:

  1. Simplicity: Deep learning simplifies model building by eliminating the need for feature engineering, allowing for end-to-end training with a limited set of operations.

  2. Scalability: It benefits from GPU or TPU parallelization and can handle vast datasets thanks to batch-based training, making it compatible with the ongoing improvements in computational power.

  3. Versatility and Reusability: Deep learning models can be continuously updated with new data and repurposed for different tasks, making them adaptable and efficient for various applications, even with smaller datasets.

While deep learning has been prominent for only a few years, its potential is still unfolding with new use cases and improvements. The field saw rapid progress initially, which is now stabilizing into more incremental advancements. Innovations like Transformer-based models in natural language processing have marked significant milestones. As of 2021, deep learning may be entering a phase of more steady and less explosive growth, but significant developments are still anticipated in the coming years. The deployment of deep learning across numerous applications is expected to continue, as its full potential has yet to be fully realized.

2 The mathematical building blocks of neural networks

2.1 A first look at a neural network

This section gives an overview of a first neural network example that uses Keras, a Python deep learning library, to classify handwritten digits from the MNIST dataset. The dataset includes 60,000 training images and 10,000 test images of grayscale digits, each 28x28 pixels. The classification task involves assigning each image to one of ten classes (0-9).

The neural network architecture consists of two densely connected layers. The first layer has 512 units with ReLU activation, and the second is a softmax layer with 10 units, corresponding to the 10 digit classes. The softmax layer outputs a probability distribution across the classes.

To prepare the model for training, the following steps are taken:

  1. Compiling the model with an optimizer (RMSprop), a loss function (sparse_categorical_crossentropy), and a metric (accuracy).

  2. Preprocessing the image data by reshaping it into a flat array and normalizing pixel values to the range [0, 1].

The model is trained using the fit method on the preprocessed training data for 5 epochs with a batch size of 128. The performance is measured in terms of loss and accuracy on the training data, achieving a high accuracy of 98.9%.
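A minimal sketch of this workflow in Keras, using the layer sizes, optimizer, loss, and training settings described above (illustrative rather than a verbatim listing from the chapter):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Load the MNIST data: 60,000 training and 10,000 test images, 28x28 grayscale.
    (train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

    # Preprocess: flatten each image into a 784-element vector and scale pixels to [0, 1].
    train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
    test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

    # Two densely connected layers: 512 ReLU units, then a 10-way softmax.
    model = keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

    # Compile with RMSprop, sparse categorical crossentropy, and accuracy as the metric.
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Train for 5 epochs with mini-batches of 128 samples.
    model.fit(train_images, train_labels, epochs=5, batch_size=128)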

After training, the model can predict class probabilities for new images. For example, when the model predicts the class for a test digit, it outputs a probability distribution, with the highest probability indicating the predicted class.

Evaluating the model on the test set shows an accuracy of 97.8%, which is slightly lower than the training accuracy due to overfitting—a common issue in machine learning where models perform better on training data than on new, unseen data.
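Continuing the same sketch, prediction on new images and evaluation on the test set might look like this:

    # Predict class probabilities for the first ten test digits.
    predictions = model.predict(test_images[:10])
    print(predictions[0].argmax())   # index of the highest probability = predicted class
    print(predictions[0].max())      # confidence assigned to that class

    # Measure generalization on the held-out test set.
    test_loss, test_acc = model.evaluate(test_images, test_labels)
    print(f"test accuracy: {test_acc}")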

The example serves as an introduction to building and training a neural network for digit classification and sets the stage for deeper explanations in subsequent chapters about tensors, tensor operations, and gradient descent. The reader is not expected to understand all details at this point, as the forthcoming chapters will provide a more thorough explanation.

2.2 Data representations for neural networks

A tensor is a container for numerical data that generalizes matrices to an arbitrary number of dimensions (in tensor terminology, a dimension is called an axis). The following points summarize the key ideas and examples provided about tensors:

  • Scalars (0D tensors): A single number, having zero axes (ndim = 0).

  • Vectors (1D tensors): An array of numbers, having one axis.

  • Matrices (2D tensors): An array of vectors, having two axes (rows and columns).

  • Higher-rank tensors: Arrays of 2D tensors are 3D tensors, and so on. Deep learning typically involves tensors with 0 to 4 axes, occasionally 5 for video data.

Tensors are characterized by three main attributes:

  • Number of axes (rank): The dimensionality or number of axes in a tensor.

  • Shape: The dimensions the tensor has along each axis.

  • Data type (dtype): The type of data stored in the tensor (e.g., float32, uint8).
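A quick NumPy illustration of these attributes (the array contents are arbitrary):

    import numpy as np

    x = np.array([[5, 78, 2, 34, 0],
                  [6, 79, 3, 35, 1],
                  [7, 80, 4, 36, 2]])
    print(x.ndim)    # 2 -> a rank-2 tensor (matrix)
    print(x.shape)   # (3, 5) -> 3 rows, 5 columns
    print(x.dtype)   # int64 (platform dependent)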

In the context of machine learning, you often deal with tensors that have specific shapes and meanings, such as:

  • Data batches: The first axis (axis 0) is usually the samples axis, used for mini-batches in training.

  • Real-world data tensors: These can include vector data (2D tensors), time series or sequence data (3D), images (4D), and videos (5D).

Specific examples include:

  • Vector data: Rank-2 tensors with shape (samples, features).

  • Time series data: Rank-3 tensors with shape (samples, timesteps, features).

  • Image data: Rank-4 tensors with shape (samples, height, width, channels) or (samples, channels, height, width).

  • Video data: Rank-5 tensors with shape (samples, frames, height, width, channels).

In practice, slicing tensors allows you to select specific elements, sequences, or regions from these arrays, for instance to access a portion of an image or a sequence of data points within a time series.

Understanding these tensor operations is crucial for preprocessing and manipulating the data used to train machine learning models, like the example shown where a digit from the MNIST dataset is displayed using Matplotlib, or when taking a specific batch of images from a larger tensor for processing.
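For instance, assuming train_images is the MNIST rank-3 tensor of shape (60000, 28, 28) as loaded from keras.datasets.mnist (before any reshaping), displaying a digit and slicing batches might look like this:

    import matplotlib.pyplot as plt

    digit = train_images[4]                  # one sample, shape (28, 28)
    plt.imshow(digit, cmap=plt.cm.binary)    # display the digit with Matplotlib
    plt.show()

    my_slice = train_images[10:100]          # 90 samples, shape (90, 28, 28)
    corner = train_images[:, :14, :14]       # top-left 14x14 pixels of every image
    batch = train_images[128 * 3:128 * 4]    # the fourth mini-batch of 128 samples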

2.3 The gears of neural networks - Tensor operations

relu (rectified linear unit) and addition are element-wise operations that can be applied to each element of tensors independently.

Two naive Python implementations of the relu and addition operations, which loop over the elements of rank-2 NumPy tensors (2D arrays) and apply the operations in a non-vectorized manner, are sketched below. The relu operation sets each element to the maximum of its current value and zero, while the addition operation sums corresponding elements from two tensors.
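A sketch of such naive implementations:

    import numpy as np

    def naive_relu(x):
        assert len(x.shape) == 2              # x is a rank-2 NumPy tensor
        x = x.copy()                          # avoid mutating the input
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] = max(x[i, j], 0)
        return x

    def naive_add(x, y):
        assert len(x.shape) == 2
        assert x.shape == y.shape
        x = x.copy()
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] += y[i, j]
        return x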

In practice, one would use optimized NumPy functions for these operations; they are significantly faster because they delegate the work to Basic Linear Algebra Subprograms (BLAS), which are low-level, highly optimized, parallel routines.
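The vectorized NumPy equivalents are one-liners (same-shape arrays used for illustration):

    import numpy as np

    x = np.random.random((20, 100))
    y = np.random.random((20, 100))
    z = x + y                 # element-wise addition, handled by fast BLAS-backed routines
    z = np.maximum(z, 0.)     # element-wise relu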

Broadcasting allows element-wise operations, such as addition, between tensors of different shapes by virtually extending the smaller tensor to match the shape of the larger one, without actually copying any data.

Here’s a summary of the steps and principles involved in broadcasting:

  1. If one tensor is smaller, it will be virtually "broadcast" to match the shape of the larger tensor for addition.

  2. The process involves adding broadcast axes to the smaller tensor so that its number of dimensions (ndim) matches that of the larger tensor.

  3. The smaller tensor is then virtually replicated along these new axes to have the same shape as the larger tensor.

For example, to add a matrix X with shape (32, 10) and a vector y with shape (10,), y is first reshaped to (1, 10). It is then virtually repeated 32 times to form a matrix Y with shape (32, 10), allowing for the addition with X.

A naive Python implementation of matrix-vector addition, which manually implements broadcasting, is shown below. Broadcasting can also be used when the smaller tensor’s shape matches the trailing dimensions of the larger tensor’s shape. An example with the np.maximum function: a tensor x of shape (64, 3, 32, 10) is compared element-wise with a tensor y of shape (32, 10), producing a tensor z of shape (64, 3, 32, 10).
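A sketch of the naive matrix-vector addition and of the np.maximum broadcasting example just described:

    import numpy as np

    def naive_add_matrix_and_vector(x, y):
        assert len(x.shape) == 2              # x is a matrix
        assert len(y.shape) == 1              # y is a vector
        assert x.shape[1] == y.shape[0]
        x = x.copy()
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] += y[j]               # y is reused ("broadcast") for every row of x
        return x

    x = np.random.random((64, 3, 32, 10))
    y = np.random.random((32, 10))
    z = np.maximum(x, y)                      # y is broadcast against the last two axes of x
    assert z.shape == (64, 3, 32, 10)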

The tensor product, commonly known as the dot product, is a crucial operation in tensor algebra. In NumPy, this operation is performed using the np.dot function. When two vectors are involved, the dot product is a scalar that sums the products of the corresponding elements of equal-length vectors. For a matrix and a vector, the result is a vector whose elements are the dot products of the vector with each row of the matrix. The dot product can also be generalized to two matrices, where it results in a new matrix with elements formed from the dot products of rows of the first matrix with columns of the second matrix. The shapes of the tensors must be compatible, meaning that the length of the row vector in the first matrix must match the length of the column vector in the second matrix. The concept extends to higher-dimensional tensors, where the rule of matching the last dimension of the first tensor with the second-to-last dimension of the second tensor still applies.
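In NumPy, with the shapes made explicit (arbitrary random tensors used for illustration):

    import numpy as np

    a = np.random.random((32,))
    b = np.random.random((32,))
    scalar = np.dot(a, b)                 # vector . vector -> scalar

    W = np.random.random((64, 32))
    v = np.dot(W, a)                      # (64, 32) . (32,) -> vector of shape (64,)

    A = np.random.random((64, 32))
    B = np.random.random((32, 16))
    C = np.dot(A, B)                      # (64, 32) . (32, 16) -> matrix of shape (64, 16)
    assert C.shape == (64, 16)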

Tensor reshaping is a fundamental operation when working with neural networks. This operation changes the shape of a tensor without altering the total number of elements it contains.
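For example:

    import numpy as np

    x = np.array([[0., 1.],
                  [2., 3.],
                  [4., 5.]])              # shape (3, 2): six elements
    print(x.reshape((6, 1)).shape)        # (6, 1) -- same six elements, new layout
    print(x.reshape((2, 3)).shape)        # (2, 3)
    print(np.transpose(x).shape)          # (2, 3) -- transposition exchanges rows and columns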

Tensor operations can be understood as geometric transformations in space. The addition of two vectors, such as A = [0.5, 1] and B = [1, 0.25], results in a translation, where vector B moves point A to a new location. Other basic tensor operations that have geometric interpretations include:

  • Translation: Moving an object in space without distortion, represented by vector addition.

  • Rotation: A counterclockwise rotation around the origin in a 2D space can be achieved by multiplying a vector by a rotation matrix R consisting of sine and cosine values of the rotation angle.

  • Scaling: Changing the size of an object in space by a certain factor along various axes is achieved by multiplying the object’s vector representation by a scaling matrix S, which is a diagonal matrix with scaling factors.

  • Linear Transform: A dot product with any matrix represents a linear transformation, which includes scaling and rotation.

  • Affine Transform: Combining a linear transform with a translation, similar to the operation done by a Dense layer in neural networks without an activation function.
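A small NumPy sketch of these geometric interpretations (the rotation angle and scaling factors are arbitrary):

    import numpy as np

    A = np.array([0.5, 1.0])
    B = np.array([1.0, 0.25])
    translated = A + B                        # translation: vector addition

    theta = np.pi / 4                         # 45-degree counterclockwise rotation
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    rotated = R @ A                           # rotation: dot product with a rotation matrix

    S = np.diag([2.0, 0.5])                   # scale x by 2 and y by 0.5
    scaled = S @ A                            # scaling: dot product with a diagonal matrix

    W = np.random.random((2, 2))
    b = np.random.random((2,))
    affine = W @ A + b                        # affine transform: what a Dense layer computes
                                              # before any activation function is applied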

In the context of neural networks, a Dense layer with a relu activation function enables the network to learn non-linear transformations, which is crucial since a sequence of affine transformations without non-linearity would collapse into a single affine transformation. Activation functions like relu introduce non-linearity, allowing deep neural networks to represent complex transformations and hypotheses.

Neural networks operate through a series of tensor operations that can be seen as geometric transformations in high-dimensional space, simplifying these transformations into manageable steps. A useful analogy for understanding this concept is to visualize crumpled paper representing mixed classes of data in a 3D space. The goal of a neural network is to 'uncrumple' this ball of paper in order to separate the classes again, making them distinct and easily identifiable. This process is akin to finding simpler representations for complex data structures, known as manifolds, within these high-dimensional spaces. Deep learning is particularly adept at this task because it breaks down the complex disentanglement into a succession of simpler transformations, much like how a person would methodically unfold a crumpled ball of paper. Each layer of the neural network contributes to gradually unraveling the data until the classes are clearly separated.

2.4 The engine of neural networks - Gradient-based optimization

The model transforms its input using a layer that applies a dot product with a weight matrix (W) and adds a bias vector (b), followed by a ReLU activation function. The weights are initially set to small random values and are adjusted during training to minimize the loss, which measures the mismatch between the model’s predictions and the actual target values.

The training process follows these steps:

  1. Select a batch of training data and corresponding targets.

  2. Perform a forward pass to generate predictions.

  3. Calculate the loss to measure how well the model’s predictions match the targets.

  4. Update the weights to reduce the loss on this batch.

Updating the weights is the key challenge in training, and doing this efficiently is crucial. A naive approach of individually tweaking each weight is impractical due to computational expense. Instead, gradient descent is used as an optimization technique. Functions in the model are differentiable, meaning small changes in weights lead to small and predictable changes in the loss. By computing the gradient of the loss with respect to the weights, it’s possible to adjust all the weights in a direction that reduces the loss.

The concept of a derivative extends to functions that take tensors as inputs; the resulting derivative is called a gradient. A gradient describes how the output of a tensor function changes in response to small variations in its inputs.

The example provided is in the context of machine learning, where an input vector x, a weight matrix W, and a target y_true are used to compute a loss function, which measures the discrepancy between the predicted output y_pred and the actual target y_true. The goal is to adjust W to minimize the loss.

The loss function can be seen as mapping the weights W to loss values, and the gradient of the loss function with respect to W at a point W0 indicates how changes in W’s coefficients affect the loss. This gradient is a tensor with the same shape as W, and each coefficient shows the direction and magnitude of the impact on the loss value when that specific coefficient in W is tweaked.

In practice, the gradient is used to update the weights by moving them in the direction opposite to the gradient (the direction of steepest ascent), which reduces the loss value. This is done by subtracting a fraction of the gradient, determined by a scaling factor step, from the current weight values. The scaling factor is needed because the gradient only approximates the local shape of the loss surface near the current point W0, so the update should not stray too far from it.
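Schematically (pseudocode, with grad denoting the gradient of the loss at W0 and step the small scaling factor):

    # One gradient-descent step: move W against the gradient of the loss at W0.
    W1 = W0 - step * grad(loss_value, W0)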

In principle, the minimum of a differentiable function can be found analytically by setting its derivative to zero and solving for the function’s variables. However, this approach is not feasible for neural networks due to the large number of parameters involved.

Instead, the text outlines a practical four-step algorithm known as mini-batch stochastic gradient descent (SGD) for optimizing neural networks:

  1. Select a random batch of training samples and corresponding targets.

  2. Perform a forward pass to generate predictions from the input samples.

  3. Calculate the loss, which measures how well the predictions match the targets.

  4. Compute the gradient of the loss with respect to the model parameters and adjust the parameters in the opposite direction of the gradient to reduce the loss.

The learning rate is a factor that determines the step size taken when adjusting the parameters. Choosing an appropriate learning rate is important to ensure convergence and avoid being trapped in local minima or making erratic updates.

SGD can be performed with individual samples (true SGD) or the entire dataset (batch gradient descent), but mini-batch SGD is a compromise between these methods.

There are variants of SGD that incorporate previous updates into the current update, such as momentum, which helps to accelerate convergence and avoid local minima. Momentum is likened to a ball rolling down a loss curve: it uses both the current slope and its accumulated velocity to move through parameter space, as sketched below.
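A naive pseudocode sketch of momentum in that spirit; get_current_parameters and update_parameter are hypothetical helpers, not a real API:

    past_velocity = 0.
    momentum = 0.1                        # constant momentum factor
    learning_rate = 0.01
    while loss > 0.01:                    # naive optimization loop
        w, loss, gradient = get_current_parameters()
        velocity = past_velocity * momentum - learning_rate * gradient
        w = w + momentum * velocity - learning_rate * gradient
        past_velocity = velocity
        update_parameter(w)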

Backpropagation is a method for computing the gradient of the loss function of a neural network with respect to its parameters, which are typically the weights and biases. It utilizes the derivatives of simple operations, such as addition, ReLU, or tensor product, which are the building blocks of neural networks. These operations are easily differentiable, allowing for the calculation of gradients in complex networks.

A neural network can be represented as a function with parameters like W1, b1, W2, and b2. These parameters are used in operations like dot product, relu, softmax, and addition, combined with a loss function. The chain of operations can be expressed as follows:

loss_value = loss(y_true, softmax(dot(relu(dot(inputs, W1) + b1), W2) + b2))

The chain rule from calculus is used to compute the gradient of this composed function. It states that the derivative of a composition fg(x) = f(g(x)) can be obtained by multiplying the derivatives of the individual functions involved (writing x1 = g(x) and y = f(x1)):

grad(y, x) == grad(y, x1) * grad(x1, x)

When more functions are composed together, the chain rule extends like a chain:

grad(y, x) == grad(y, x3) * grad(x3, x2) * grad(x2, x1) * grad(x1, x)

By applying the chain rule to each layer of the network in reverse order (from the output to the input), backpropagation efficiently computes the gradients needed to update the network’s parameters during training.

The GradientTape API in TensorFlow is utilized for automatic differentiation by recording tensor operations within its scope to create a computation graph. Variables, which are mutable tensors, are often used for model parameters. To use the GradientTape:

  1. Create a tf.Variable.

  2. Open a tf.GradientTape context.

  3. Perform tensor computations inside this context.

  4. Calculate gradients of an output with respect to variables using tape.gradient().

The GradientTape can handle various tensor operations and can work with both single variables and lists of variables. For example, it can be used to calculate the gradient of a simple linear equation, or more complex operations like a matrix multiplication followed by an addition. The resulting gradients will have the same shape as the variables they are derived from.
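Two small sketches in that vein:

    import tensorflow as tf

    # Gradient of a simple scalar expression with respect to a single variable.
    x = tf.Variable(0.)
    with tf.GradientTape() as tape:
        y = 2 * x + 3
    grad_of_y_wrt_x = tape.gradient(y, x)               # equals 2

    # Gradient of a matmul-plus-addition with respect to a list of variables.
    W = tf.Variable(tf.random.uniform((2, 2)))
    b = tf.Variable(tf.zeros((2,)))
    x = tf.random.uniform((2, 2))
    with tf.GradientTape() as tape:
        y = tf.matmul(x, W) + b
    grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])    # gradients match the shapes of W and b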

2.5 Looking back at our first example

A model is described as a series of layers that transform input data into predictions. These predictions are then evaluated by a loss function against the expected outcomes, producing a loss value that indicates the model’s performance. An optimizer uses this loss value to adjust the model’s weights.

The author revisits the first example from the chapter, explaining the process step by step. First, the input data is preprocessed and formatted into a specific shape and data type suitable for the model. The model itself is a sequential composition of two dense layers, each performing operations on the input data with weight tensors, which are essentially the learned parameters of the model.

The model compilation step involves defining the loss function and optimizer. The loss function used here is sparse_categorical_crossentropy, providing a signal to guide the learning process by minimizing loss. The rmsprop optimizer sets the specific rules for the gradient descent process during training.

Finally, the training loop is discussed, illustrating how the model iterates over the training data in mini-batches, updating the model weights in each epoch through backpropagation. After several epochs, the model’s loss decreases, and its accuracy in classifying handwritten digits improves.

The chapter then implements a simple Python class, NaiveDense, which simulates a dense (fully connected) layer in a neural network using TensorFlow. The class is initialized with an input size, an output size, and an activation function. It creates two TensorFlow variables: W (the weight matrix) and b (the bias vector). The weights are initialized with random values within a specified range, while the biases are initialized with zeros.

The class defines a __call__ method that performs the forward pass by computing the dot product of the inputs and the weights, adding the biases, and then applying the activation function to the result.

Additionally, the class includes a weights property that provides convenient access to the layer’s parameters (weights and biases).
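A sketch of NaiveDense along those lines (treating the exact initialization range as an assumption):

    import tensorflow as tf

    class NaiveDense:
        def __init__(self, input_size, output_size, activation):
            self.activation = activation
            # Weight matrix W, initialized with small random values.
            w_shape = (input_size, output_size)
            w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
            self.W = tf.Variable(w_initial_value)
            # Bias vector b, initialized with zeros.
            b_shape = (output_size,)
            self.b = tf.Variable(tf.zeros(b_shape))

        def __call__(self, inputs):
            # Forward pass: dot product, bias addition, then activation.
            return self.activation(tf.matmul(inputs, self.W) + self.b)

        @property
        def weights(self):
            # Convenient access to the layer's parameters.
            return [self.W, self.b]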

The NaiveSequential class is used to create a neural network by chaining together multiple layers. It takes a list of layers as an argument and applies each layer to the input data in sequence. Its __call__ method processes the inputs through the layers in the order they were added. The class also includes a weights property that aggregates the parameters of all the individual layers.

Following the description, an example usage of the NaiveSequential class is given, where a simple neural network model is created with two layers: a NaiveDense layer with ReLU activation followed by another NaiveDense layer with softmax activation. The model is expected to have four sets of weights, indicating that each NaiveDense layer has both weights and biases.
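A matching sketch of NaiveSequential and of the two-layer model described above (tf.nn.relu and tf.nn.softmax are assumed as the activation functions):

    class NaiveSequential:
        def __init__(self, layers):
            self.layers = layers

        def __call__(self, inputs):
            # Apply each layer to the output of the previous one, in order.
            x = inputs
            for layer in self.layers:
                x = layer(x)
            return x

        @property
        def weights(self):
            # Aggregate the parameters of all layers.
            weights = []
            for layer in self.layers:
                weights += layer.weights
            return weights

    model = NaiveSequential([
        NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
        NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax),
    ])
    assert len(model.weights) == 4        # a W and a b for each of the two layers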

The chapter then walks through the steps involved in the training process and provides code snippets for implementing them with TensorFlow. The main steps in the training process are:

  1. Run the model on a batch of images to generate predictions.

  2. Calculate the loss using the model’s predictions and the actual labels.

  3. Compute the gradient of the loss with respect to the model’s weights.

  4. Adjust the weights slightly in the opposite direction of the gradient to minimize loss.

The code uses TensorFlow’s GradientTape to record operations for automatic differentiation. Within the one_training_step function, a forward pass is performed to compute predictions and loss, and then the gradient is calculated. This gradient is used by the update_weights function to adjust the model weights, with a defined learning rate as a scaling factor.

Two approaches for the weight update step are provided: a manual approach that directly modifies the weights using assign_sub, and a more practical approach that employs Keras' Optimizer class, specifically the SGD (Stochastic Gradient Descent) optimizer, to handle the weight adjustment.
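A sketch of those snippets, assuming the NaiveSequential model above and a sparse categorical crossentropy loss; the two update_weights versions are alternatives, only one of which would be kept:

    def one_training_step(model, images_batch, labels_batch):
        # Record the forward pass so the gradients can be computed.
        with tf.GradientTape() as tape:
            predictions = model(images_batch)
            per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
                labels_batch, predictions)
            average_loss = tf.reduce_mean(per_sample_losses)
        # Gradient of the loss with respect to every weight tensor of the model.
        gradients = tape.gradient(average_loss, model.weights)
        update_weights(gradients, model.weights)
        return average_loss

    learning_rate = 1e-3

    # Manual update: move each weight against its gradient, scaled by the learning rate.
    def update_weights(gradients, weights):
        for g, w in zip(gradients, weights):
            w.assign_sub(g * learning_rate)

    # Alternative update using a Keras optimizer instead of doing it by hand.
    from tensorflow.keras import optimizers
    optimizer = optimizers.SGD(learning_rate=1e-3)

    def update_weights(gradients, weights):
        optimizer.apply_gradients(zip(gradients, weights))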

These steps are part of the training loop, which iterates over the training data for multiple epochs to improve the model’s performance.

Summary

  • Tensors form the foundation of modern machine learning systems. They come in various flavors of dtype, rank, and shape.

  • You can manipulate numerical tensors via tensor operations (such as addition, tensor product, or element-wise multiplication), which can be interpreted as encoding geometric transformations. In general, everything in deep learning is amenable to a geometric interpretation.

  • Deep learning models consist of chains of simple tensor operations, parameterized by weights, which are themselves tensors. The weights of a model are where its “knowledge” is stored.

  • Learning means finding a set of values for the model’s weights that minimizes a loss function for a given set of training data samples and their corresponding targets.

  • Learning happens by drawing random batches of data samples and their targets, and computing the gradient of the loss on the batch with respect to the model parameters. The model parameters are then moved a bit (the magnitude of the move is defined by the learning rate) in the opposite direction from the gradient. This is called mini-batch stochastic gradient descent.

  • The entire learning process is made possible by the fact that all tensor operations in neural networks are differentiable, and thus it’s possible to apply the chain rule of derivation to find the gradient function mapping the current parameters and current batch of data to a gradient value. This is called backpropagation.

  • Two key concepts you’ll see frequently in future chapters are loss and optimizers. These are the two things you need to define before you begin feeding data into a model.

  • The loss is the quantity you’ll attempt to minimize during training, so it should represent a measure of success for the task you’re trying to solve.

  • The optimizer specifies the exact way in which the gradient of the loss will be used to update parameters: for instance, it could be the RMSProp optimizer, SGD with momentum, and so on.