Index
A
Adam optimizer algorithm 328–329
AlexNet 391
arrays 53
asymptotically approaching zero 174
autoencoder decoder, code listing 481
autoencoder encoder, code listing 480
autoencoder training, code listing 481
autoencoders
decoder, definition of 478
embedding an image 305
encoder and decoder neural networks, discussion of 479–480
encoder, definition of 478
hyperparameter, definition of 478
PCA and 481
reconstruction loss, definition of 478
representation learning, definition of 478
schematic representation of an autoencoder 479
as unsupervised 479
See also variational autoencoders (VAEs)
average pooling, definition of 381
B
backpropagation algorithm 286–294
algorithm for training a neural network 294–295
backpropagation algorithm on an arbitrary network of linear layers 290–294
definition of 286
evaluating on a simple MLP with a single neuron per layer 286
forward and backward propagation, code listing 289–290
forward pass, definition of 289
Hadamard product, definition of 292
performing min-max normalization in PyTorch, code listing 296
training a neural network in PyTorch 295–298
using an optimizer to update weights 298
Bayes’ theorem 194, 196–198, 448
Bernoulli distribution 188–189
bounding box, definition of 385
C
categorical variables, definition of 242
centroid, definition of 163
classification loss, definition of 423
classifiers
binary classifiers 88
charted examples of good and bad decision boundaries 250
charts of cat-brain threat-model decision boundaries 247–249
as decision boundaries 84–85, 246–247
decision boundary as a hypersurface 249
estimating a decision boundary 251
feature space 246
forming mental pictures of hyperspaces with 3D analogs 249
geometric depiction of a classification problem 85
as a hypersurface 85
input space 246
modeling the classifier, definition of 86
continuous random variable 152
continuous variables, definition of 242
convex and nonconvex functions
convex curves and surfaces, three definitions of 110–112
convexity and the Taylor series 112–113
examples of convex functions 113
convolution
convolution layers 344, 380–381
convolution output size 356
expressing convolution layers as matrix-vector multiplications 344
convolution, one-dimensional 345–356
1D edge detection, code listing 355
1D local averaging convolution, code listing 354–355
convolution output size 356
curve edge detection via 1D convolution 350–351
curve smoothing via 1D convolution 350
detecting edges as a way to understand images 351
directly invoking the convolution function, code listing 356
edge, definition of 351
formula for generating a single output value in 1D convolution 349
graphical and algebraical view 345
how to visualize a 1D convolution 345
input, definition of 345
kernel, definition of 345
as matrix multiplication 351–354
output, definition of 345
padding, definition of 346–347
setting the weights of a 1D kernel 354
convolution, three-dimensional 368–374
3D convolution with custom weights, code listing 373–374
diagrams illustrating the spatio-temporal view of 3D convolution 370
generating a single output value in 3D convolution 370
how a kernel extracts motion information from video frames 371
how to visualize a 3D convolution 369
illustration of a 3D convolution motion detector 372
video as a 3D entity extending over a spatio-temporal volume 368
video motion detection via 3D convolution 370–371
convolution, transposed 374–379
2D convolution and its transpose 377
autoencoder, definition of 377
decoder, definition of 376
descriptor vector, definition of 375
embedding as an effective compression technique 377
embedding, definition of 375
encoder, definition of 375
end-to-end learning, definition of 377
fractionally strided convolution 375
illustration of a 1D convolution and its transpose 376
illustration of a 2D convolution and its transpose 377
output size 377
upsampling using transpose convolutions, code listing 378–379
why autoencoders need transposed convolution 375
convolution, two-dimensional 356–368
2D convolution as matrix multiplication 366–368
2D convolution with custom weights 363–365
2D edge detection, code listing 366
2D local averaging convolution, code listing 365–366
comparing Euclidean distance to Manhattan distance 358
generating a single output value in 2D convolution 361–362
graphical and algebraic view 356–358
how to visualize a 2D convolution 358
image edge detection via 2D convolution 362–363
image smoothing via 2D convolution 362
input, definition of 359
kernel, definition of 359
output, definition of 359
padding, definition of 361
same (zero) padding 361
two-dimensional neighborhoods not preserved by rasterization 358
valid padding 361
convolutional neural networks (CNNs)
AlexNet 391
benefits of neural networks with multiple convolutional layers 388–391
bounding box, definition of 385
feature map, definition of 388
GoogLeNet 398
image classification, definition of 386
Inceptionv1 architecture 397–401
LeNet architecture, components of 387–389
MNIST data set, sample images from 387
object detection, definition of 386
VGG (Visual Geometry Group) Net 391–397
covariance
covariance as the multivariate analog of variance 165–167
covariance of a multivariate Gaussian distribution 178–180
variance, covariance, and standard deviation 164–165
zero-mean, unit-covariance Gaussian for the known prior 490–492
See also variance
cross-entropy loss
binary cross-entropy loss, code listing 305
definition of 303
Cybenko’s universal approximation theorem 261–262
D
data imbalance, definition of 310–311
decoder, definition of 376, 478
deep learning, overview of 1–17
deep neural networks, definition of 260
descriptor vector, definition of 375
determinants 499
differentiable step-like functions 273–276
graph of the derivatives of 1D sigmoid and tanh functions 276
Heaviside step function as not differentiable 273
sigmoid function and its properties 273–275
dimensionality reduction, definition of 469
discriminative functions 252
document descriptor space 116–118
document retrieval problem 141–147
dot product and cosine of the angle between two vectors 497–498
downsampling, definition of 381
E
eigenvalues and eigenvectors 62–65, 67–69, 72–73
encoder, definition of 375, 478
entropy
applying to continuous and multidimensional random variables 201
chain rule of conditional entropy 212
charts of entropies of peaked and flat distributions 202
charts of KLD between example distributions 211
computing the cross-entropy of a Gaussian, code listing 202
computing the entropy of a Gaussian distribution, code listing 204
computing the KLD, code listing 210
examples of 199
geometrical intuition for entropy 201–203
Huffman encoding 200
Kullback–Leibler divergence (KLD) 207–210
prefix coding 200
quantifying the uncertainty associated with a chancy event 199
variable bit-rate coding 200
epoch, definition of 302
error. See loss function (error)
evidence lower bound (ELBO) 488–490
F
Fast R-CNN architecture 429
Faster R-CNN, high-level architecture 414
feature space, definition of 10
first-order approximation 101
fixed point, definition of 231
frequentist paradigm 151
Frobenius norms, definition of 122–123
Fully Bayes estimation 448–453
Bayesian estimation with unknown mean, known variance, code listing 452–453
Bayesian estimation with unknown mean, unknown variance, code listing 459
Bayesian estimation with unknown variance, known mean, code listing 456
computing posterior probability using Bayesian inference, code listing 461
conjugate priors 454
estimating the mean and precision parameters 457–458
estimating the precision parameter when the mean is known 455–456
Fully Bayesian inferencing 459–461
Gaussian, unknown mean, known precision 450–453
Gaussian, unknown precision, known mean 454
maximum a posteriori (MAP) estimation 448
maximum likelihood estimation 460
maximum likelihood parameter estimation (MLE) 448
MLE for Gaussian parameter values, recap of 449–450
multivariate Bayesian inferencing, unknown mean 463
multivariate Gaussian, unknown mean, known precision 461–463
multivariate, unknown precision, known mean 463–466
normal-gamma distribution 457–459
parameter estimation and belief injection 448–449
prior probability density 448
Wishart distribution 454, 463–464
fully connected layer 277
function family 86
G
Gamma distribution 454–455, 503–505
Gamma function, overview of 502–503
Gaussian (normal) distribution 173–180
asymptotically approaching zero 174
bell-shaped curve 174
Bernoulli distribution 188–189
categorical distribution and one-hot vectors 189–190
chart of a univariate Gaussian random probability density function 177
computing the variance of 499–501
covariance of a multivariate Gaussian distribution 178–180
expected value of a Bernoulli distribution 188–189
expected value of a binomial distribution 184–185
expected value of a categorical distribution 190–191
expected value of a Gaussian distribution 176–177
Gaussian probability density function 174
geometry of sampled point clouds 180
log probability of a Bernoulli distribution, code listing 188
log probability of a binomial distribution, code listing 183–184
log probability of a multinomial distribution, code listing 186
log probability of a univariate normal distribution, code listing 175
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a univariate Gaussian, code listing 178
multinomial distribution 185–187
multivariate Gaussian point clouds and hyper-ellipses 180
outlier values 174
probability of a categorical distribution 190
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
Gaussian mixture models (GMM) 215–237
algorithm of GMM fit (MLE of GMM parameters) 236
charts of two-dimensional GMMs with circular and elliptical bases 226
classification via GMM 230
fixed point, definition of 231
Gaussian mixture model distribution 229
latent variables for class selection 227–229
maximum likelihood estimation of GMM parameters (GMM fit) 230–237
probability density function of the GMM 223–227
progression of maximum likelihood estimation for GMM parameters 232
generative functions 252
generative modeling, definition of 447–448
global minimum 303
GoogLeNet 398
gradient
becoming zero at the optimum 96
gradient vectors and minimizing loss functions 89–90
as the vector of all the partial derivatives 95
H
Hadamard product, definition of 292
Hausdorff property, definition of 441–442
Heaviside step function 252–253, 273
hidden layers 260
Huffman encoding 200
human labeling (human curation), definition of 6
hyperparameter, definition of 478
I
Inceptionv1 architecture 397–401
description of its network-in-network paradigm 397–399
diagram of 398
GoogLeNet 398
implementing a dimensionality reduced Inception block 400–401
implementing a naive Inception block, code listing 399
input variables 242
inputs, normalizing 7
J
Jensen’s inequality theorem 501–502
joint probability, definition of 155
K
Kullback–Leibler divergence (KLD) 207–210
L
L’Hospital’s rule 500
latent or hidden variables/parameters 216, 228
latent semantic analysis (LSA) 118, 142–147
latent spaces
comparing discriminative and generative models 471–472
considering the space of natural and digital images 469
dimensionality reduction 469, 475
dimensionality reduction using PCA, code listing 477–478
discriminative classifiers, definition of 471
generative classifiers 471–472
generative models, properties of 471–472
illustration of good and bad discriminative classifiers 472
latent space modeling briefly explained 470–471
latent vector, definition of 468
latent-space modeling, benefits and applications 472–473
linear latent space manifolds and PCA 474–478
manifold as capturing the essence of a common property 469
mapping from a 2D input space to a 1D latent space 482
observed vector, definition of 468
PCA as a special case of latent space representation 471
regularization as creating a more compact latent space 483
smoothness, continuity, and regularization of 481–483
steps involved in a PCA-based dimensionality reduction 475–477
two examples of latent subspaces, with planar and curved manifolds 470
LeNet architecture
implementing LeNet for image classification on MNIST, code listing 388–389
output feature map passed through two fully connected (FC) layers 388
subsampling (pooling) layers 388
tanh activation layer 388
three convolutional layers of 5×5 kernels 387–388
level contours 97
linear layers
algorithm for training a neural network 294–295
backpropagation algorithm 286–294
diagram of a complete multilayered neural network 278
forward and backward propagation, code listing 289–290
forward propagation of a single linear layer 280–281
forward propagation, code listing 281
fully connected layer 277
gradient descent and local minima 285–286
Hadamard product, definition of 292
Hessian matrix 284
learning rate 285
loss and its minimization 282–283
loss surface and gradient descent 283–286
as matrix-vector multiplication 277–280
mean squared error (MSE) function 282
MSE loss, code listing 283
never using test data for training 281
performing min-max normalization in PyTorch, code listing 296
training a neural network in PyTorch 298
training and backpropagation 281–282
tunable hyperparameter 285
using an optimizer to update weights 298
linear vs. nonlinear models 12–14
local minimum 303
local response normalization (LRN) layers 391
local translation invariance, definition of 381
logical functions
definition of 242
m-out-of-n trigger 244
multi-input logical AND 244
multi-input logical OR 244
log-sum inequality theorem 502
loss function (error)
autoencoders, definition of 305
binary cross-entropy loss for image and vector mismatches 305–306
binary cross-entropy loss, code listing 305
computing the gradient of the loss function 316
creating a custom neural network model, code listing 318
creating a custom PyTorch data set, code listing 317
cross-entropy loss, code listing 304
cross-entropy loss, definition of 303
data imbalance, definition of 310–311
epoch, definition of 302
equation for describing a full neural network 301
generating one training loop, code listing 319
global minimum 303
gradient vectors and minimizing loss functions 89–90
GT vector, definition of 301–302
local approximation for the loss function 99–100
local minimum 303
loss function and SGD optimizer, code listing 318
loss surfaces and their minimization 303
loss surfaces, description of 302–303
multi-dimensional loss functions 93–94
one-dimensional loss functions 91–93
output vector 303
prediction vector 303
regression loss, code listing 303
running the training loop num epochs times, code listing 319
total error, definition of 89
total training loss, definition of 301
using a squared error function 89
visualizing loss surfaces 97
M
machine learning
analogy to the human brain 5
cat brain model 7
chart of the 2D input point space for the cat brain model 11
classifier, definition of 12
collinearity as implying linear dependence 47
computing eigenvectors and eigenvalues 67
computing the simplified cat brain threat score model 11
defining the span of a set of vectors 47–48
dot product and the difference between two unit vectors 38–39
dot product of two vectors 29–30
eigenvalues and eigenvectors 62–65
eigenvectors and linear independence 65–66
equations for describing a multilayered neural network 16
estimating a threat score 7
example cat-brain dataset matrix 24
example training dataset 24
feature space, definition of 10
finding the axes of a hyperellipse 78–79
formula for transforming an arbitrary input value to a normalized value 7
from arbitrary input to the desired output during inferencing 5
generating the right outcome on never-before-seen data 5
generic multidimensional definition of linear transforms 51
geometric intuitions for dot product and vector length 36–37
introduction to vectors via PyTorch 23
Kullback–Leibler divergence (KLD) 207–210
latent semantic analysis (LSA) 118
learning, definition of 5
linear systems with zero or near-zero determinants 55–57
linear vs. nonlinear models 12–14
list of problem solving stages 4
machine learning model error 34–36
matrix powers using diagonalization 76–77
matrix-matrix multiplication 32–33
matrix-vector multiplication 31–32
matrix-vector multiplications as linear transforms 52–53
measuring the component of a vector along a coordinate axis 37–38
minimizing a quadratic form in machine learning problems 121
model estimation 8
multidimensional line and plane equations 42–46
multidimensional line equation 42–43
multilayered neural network, diagram of 15
natural language processing (NLP) 118
over-determined and under-determined linear systems 57–59
as a paradigm shift in computing 3
performing basic vector and matrix operations 26–28
principal component analysis (PCA) 118
producing Python code using Jupyter Notebook 22
quadratic form, definition of 118
regressor, definition of 12
retrieving documents that match a query phrase 140–147
role of matrices in 23
sigmoid function, definition of 14
singular value decomposition (SVD) 130–140
solving linear systems without inversion via diagonalization 74–75
spectral decomposition of a symmetric matrix 77
squared error 34
sticking to any fixed coordinate system 22
supervised machine learning 194
supervised vs. unsupervised learning 4
symmetric matrices and orthogonal eigenvectors 66
target output 4
training data, definition of 4
training, definition of 5
transpose of matrix products 33
trying to model the unknown transformation function 6
unsupervised machine learning 193–194
using 3D analogues for higher dimensional spaces 22
using PyTorch code for vector manipulations 22
See also neural networks
manifolds
applying calculus to a locally Euclidean property 440–441
bounded, compact, and precompact sets 443
definition of 438
d-manifold, definition of 440
example manifolds and non-manifolds in 1D and 2D 440
Hausdorff property, definition of 441–442
manifolds as locally Euclidean 440
mapping points from one manifold to another 439
neural networks and 438
open sets, closed sets, and boundaries 442
second countable property of manifolds 442–443
mathematical notations used throughout the text 506
matrices
applying rotation matrices 69
basic vector and matrix operations in machine learning 26–28
converting a matrix into a vector via rasterization 84
data matrix columns as dimensions in the feature space 115
data matrix rows as representing feature vectors 115
example cat-brain dataset matrix 24
Frobenius norms, definition of 122–123
full-rank matrices, definition of 137
introducing matrices via PyTorch 25–26
inverting a matrix and computing its determinant 57
linear systems and matrix inverse 53–55
matrix and vector transpose 28–29
matrix powers using diagonalization 76–77
matrix, definition of 23
matrix-matrix multiplication 32–33
matrix-vector multiplication 31–32
Moore–Penrose pseudo-inverse of a matrix 59–62
orthogonal (rotation) matrices and their eigenvalues and eigenvectors 67–69
orthogonality and length-preservation 71
orthogonality of rotation matrices 71–72
rank of a matrix, definition of 137
representing digital images as matrices 25
role in machine learning 23
slicing and dicing matrices 26
solving an overdetermined system using the pseudo-inverse 62
spectral norms, definition of 122
symmetric positive semidefinite matrices 121–122
transpose of matrix products 33
using linear algebraic tools to analyze matrix structures 115–116
max pooling, definition of 381
maximum a posteriori (MAP) estimation 448
maximum likelihood parameter estimation (MLE) 448
mean squared error (MSE) function 282
model parameter estimation 213–222
estimating the model parameters from the unlabeled training data 213
examining the likelihood term 213
examining the prior probability term 213
Gaussian mixture models (GMMs) 215
Gaussian negative log-likelihood for training data, code listing 219–220
Gaussian negative log-likelihood with regularization, code listing 221–222
latent variables and evidence maximization 215–216
likelihood, evidence, and posterior and prior probabilities 213–214
maximum a posteriori (MAP) parameter estimation and regularization 215
maximum likelihood estimate for a Gaussian, code listing 219
maximum likelihood parameter estimation (MLE) 214–215
maximum likelihood parameter estimation for Gaussians 216–218
minimizing MLE loss via gradient descent, code listing 220
using the log-likelihood trick 214
modeling
inferencing 4
linear vs. nonlinear models 12–14
model architecture selection 86
overall algorithm for training a supervised model 90
training error, definition of 86
trying to model the unknown transformation function 6
Adam optimizer algorithm with bias correction 328–329
chart showing an overfitting of data points in a binary classifier 331
momentum-based gradient descent 322
Nesterov accelerated gradients 322–325
overfitting and underfitting 330
regularization 330
root mean squared propagation (RMSProp) 327–328
viewing regularization as minimizing descriptor length 332
Multibox Single-Shot Detector (SSD) 436
multidimensional functions 93–95
multidimensional integral 161–162
N
natural language processing (NLP) 118
Nesterov accelerated gradients 322–325
neural networks
adjusting its architecture and parameter values 240
algorithm for training a neural network 294–295
backpropagation algorithm 286–294
categorical variables, definition of 242
charts of cat-brain threat-model decision boundaries 247–249
charts of good and bad decision boundaries 250
choosing an architecture 240
classifying into supervised and unsupervised neural networks 241
continuous variables, definition of 242
decision boundaries, definition of 246–247
decision boundary as a hypersurface 249
determining parameter values through training 240
diagram of a complete multilayered neural network 278
diagram of a multilayered neural network 15
differentiable step-like functions 273–276
discriminative functions 252
equations for describing a multilayered neural network 16
estimating a decision boundary 251
expressing real-world problems in target functions 240–242
expressive power, definition of 240
feature space 246
forming mental pictures of hyperspaces with 3D analogs 249
forward and backward propagation, code listing 289–290
fully connected layer 277
generative functions 252
gradient descent and local minima 285–286
ground truth 241
Hadamard product, definition of 292
Heaviside step function 252–253, 273
Hessian matrix 284
inferencing 241
input space 246
input variables 242
learning rate 285
logical functions, definition of 242–245
making a probabilistic statement of output correctness 241
manual annotation 241
mean squared error (MSE) function 282
m-out-of-n trigger function 244
multi-input logical AND function 244
multi-input logical OR function 244
multilayer perceptrons (MLPs) 259–269
neuron, basic description of 240, 252
output variables 242
performing min-max normalization in PyTorch, code listing 296
sigmoid function and its properties 275
supervised neural networks 273
supervised training data 241
target output 241
training a neural network in PyTorch 295–298
training data, definition of 272
tunable hyperparameter 285
using an optimizer to update weights 298
weights 241
See also machine learning
neuron, description of 240, 252
non-maxima suppression (NMS) algorithm 425–426
O
object detection
anchors and their configurations, description of 415
assigning GT labels for each anchor box, code listing 420
assigning targets to anchor boxes 421–422
classification loss, definition of 423
classifier predicting an objectness value 417
contributions and improvements of Fast R-CNN 412–413
dealing with the imbalance between negative and positive anchors 421
Fast R-CNN architecture 429
Fast R-CNN loss function 431–432
Fast R-CNN RoI head, code listing 430–431
Faster R-CNN and its two core modules 412–413
Faster R-CNN, high-level architecture 414
FCN of the RPN, code listing 418–419
Feature Pyramid Network (FPN) 436
FRCNN guidelines for assigning labels to anchor boxes 420
fully convolutional network (FCN) architecture 417–418
generating a target (GT) for an RPN 421
generating all anchors for a given image 417
generating anchors at a particular grid point, code listing 416
generating region proposals 424–425
Multibox Single-Shot Detector (SSD) 435–436
NMS of RoIs, code listing 427
non-maxima suppression (NMS) algorithm 425–426
other object-detection paradigms 435–436
Region proposal network (RPN) 413–415
regression loss, definition of 423
three stages in the R-CNN approach to object detection 411–412
training the Fast R-CNN 431
training the Faster R-CNN 434–435
You Only Look Once (YOLO) 435
observed vector, definition of 468
one-dimensional loss functions 91–93
optimization
Adam optimizer algorithm with bias correction 328–329
Bayes’ theorem and the stochastic view of optimization 334–335
creating a custom neural network model, code listing 318
creating a custom PyTorch data set, code listing 317
generating one training loop, code listing 319
learning rate (LR) 315
loss function and SGD optimizer, code listing 318
MLE-based optimization 335
overfitting and underfitting 330
overfitting of data points in a binary classifier 331
random shuffling of training data after every epoch 315
regularization 330
root mean squared propagation (RMSProp) 327–328
running the training loop num epochs times, code listing 319
stochastic gradient descent (SGD) 315–316
viewing regularization as minimizing descriptor length 332
output variables 242
output vector 303
P
parameterized function, threat score 4
partial derivatives, definition of 94
perceptrons
classification and 254
code listing for 256
Cybenko’s universal approximation theorem 261–262
deep neural networks, definition of 260
definition of 254
generating 2D steps and waves with perceptrons 264–267
generating a 1D tower with perceptrons 262–264
hidden layers 260
introduction to modeling common logic gates with perceptrons 256
layering for organizing perceptrons into a neural network 260
MLP for a logical XOR function 259–260
MLPs for polygonal decision boundaries 268–269
modeling logical gates, code listing 258
multilayer perceptrons (MLPs) 259–269
multiple perceptrons 256
partitioning with a planar decision surface
perceptron for a logical AND function 257
perceptron for a logical NOT function 258
perceptron for a logical OR function 258
perceptrons and MLPs in 1D, code listing 267
perceptrons and MLPs in 2D, code listing 267–268
truth table for two-variable logical functions 261
pixel, definition of 83
prediction vector 303
prefix coding 200
principal component analysis (PCA) 118, 123–130
applying PCA on correlated and uncorrelated datasets 128
calculating the direction of maximum spread 125–127
dimensionality reduction via PCA 127–128
linear latent space manifolds and PCA 478
PCA and data compression 130
PCA computation, code listing 128–129
PCA on synthetic correlated data, code listing 129
PCA on synthetic nonlinearly correlated data, code listing 130
PCA on synthetic uncorrelated data 129
as a special case of latent space representation 471
use in JPEG 98 image compression techniques 130
probability density function (PDF) 152, 198
probability distributions
continuous random variable 152
definition of 153
discrete random variable 151
emphasizing the geometrical view of multivariate statistics 150
example graph for the weights of adults in Statsville 154
fitting probability distributions to specific groups of people 150
frequentist paradigm 151
loosely structured point distributions in high-dimensional spaces 149
probabilities as always less than or equal to 1 151
probability density 152
PyTorch distributions package 150, 162
random variable, definition of 151
semantic segmentation 150
using histograms to visualize discrete random variables 152–153
using probabilistic models in unsupervised and minimally supervised learning 150
using uppercase letters to denote random variables 152
variational autoencoders (VAEs) 150
See also probability theory
probability theory
asymptotically approaching zero 174
bell-shaped curve 174
Bernoulli distribution 188–189
Cartesian product 157
categorical distribution and one-hot vectors 189–190
centroid, definition of 163
chart of a univariate Gaussian random probability density function 177
chart of bivariate uniform random probability density function 173
conditional probability 196
continuous random variables and probability density 160–162
covariance as the multivariate analog of variance 165–167
covariance of a multivariate Gaussian distribution 178–180
dependent events and their joint probability distribution 157–159
dependent vs. independent variables 196
entropy, definition of 200–201
exhaustive and mutually exclusive events 154–155
expected value of a Bernoulli distribution 188–189
expected value of a binomial distribution 184–185
expected value of a categorical distribution 190–191
expected value of a function of a random variable 163
expected value of a Gaussian distribution 176–177
expected value of a linear combination of random variables 164
expected value of a uniform distribution 171
Gaussian (normal) distribution 173–180
Gaussian probability density function 174
geometry of sampled point clouds 180
graphical visualization of joint probability distributions 160
independent events 155
joint and marginal probability 194–196
joint probabilities and their distributions 157
Kullback–Leibler divergence (KLD) 207–210
log probability of a Bernoulli distribution, code listing 188
log probability of a binomial distribution, code listing 183–184
log probability of a multinomial distribution, code listing 186
log probability of a univariate normal distribution, code listing 175
log probability of a univariate uniform random distribution, code listing 171
marginal probabilities 157
marginal probability for a variable 195
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a uniform random distribution, code listing 172
mean and variance of a univariate Gaussian, code listing 178
multidimensional integral 161–162
multinomial distribution 185–187
multivariate Gaussian point clouds and hyper-ellipses 180
outlier values 174
probabilities of impossible and certain events 154
probability density function (PDF) 152, 198
probability of a categorical distribution 190
product rule, definition of 155
properties of distributions 162–167
sample point distributions for dependent and independent variables 159–160
sampling from a distribution 167–169
sum rule 195
uniform distributions as multivariate 173
uniform random distributions 170–171
variance and expected value 167
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
variance of a uniform distribution 172
variance, covariance, and standard deviation 164–165
See also probability distributions
product rule, definition of 155
applying PCA on correlated and uncorrelated datasets 128
computing LSA and SVD on a large dataset 146–147
computing PCA directly using SVD 139
dot product of two vectors 40
eigenvalues and eigenvectors of a rotation matrix 72
examining linear models 101–105
examining nonlinear models 105–107
finding the axes of a hyperellipse 79–80
introducing matrices via PyTorch 25–26
matrix diagonalization 74
matrix vector multiplication 40–41
orthogonality of rotation matrices 71–72
PCA computation, code listing 128–129
PCA on synthetic correlated data, code listing 129
PCA on synthetic nonlinearly correlated data, code listing 130
PCA on synthetic uncorrelated data 129
performing a matrix transpose 39–40
slicing and dicing matrices 26
solving linear systems via diagonalization 76
solving linear systems with SVD 137–139
spectral decomposition of a symmetric matrix 77–78
tensors and images in PyTorch 26
training a linear model for the cat brain 108
transpose of a matrix product 42
using for vector manipulations 22
PyTorch
creating a custom PyTorch data set, code listing 317
inferencing a model, PyTorch Trainer code listing 411
introducing matrices via PyTorch 25–26
introducing vectors via PyTorch 23
MNIST data module, PyTorch DataModule code listing 407–408
performing min-max normalization in PyTorch, code listing 296
PyTorch code for solving linear systems with SVD 137–139
PyTorch distributions package 150, 162
tensors and images in PyTorch 26
training a neural network in PyTorch 295–298
using PyTorch code for vector manipulations 22
PyTorch Autograd 103
PyTorch DataLoader 316
PyTorch Lightning
implementing LeNet as a PyTorch Lightning module, code listing 408–410
inferencing a model, PyTorch Trainer code listing 411
LightningModule component 410
MNIST data module, PyTorch DataModule code listing 407–408
Q
quadratic forms, minimizing 118–121
quantitative inputs 4
R
random variable 151–153, 160–162
rasterized vector, creating 84
reconstruction loss, definition of 478
rectified linear unit (ReLU) 392–394
regression loss, definition of 303, 423
regressors, definition of 2, 12
regularization 330
representation learning, definition of 478
ResNet
components of the core architecture 403
examining how to solve the degradation problem 401–403
identity shortcut connection 402
implementing a basic skip connection block (BasicBlock) 403–404
ResidualConvBlock, code listing 405
ResNet-34, code listing 405–406
root mean squared propagation (RMSProp) 327–328
S
semantic segmentation 150
singular value decomposition (SVD) 130–140
applying SVD by solving arbitrary linear systems 135–136
applying SVD to find the best low-rank approximation of a matrix 139–140
applying SVD via PCA computation 135
computing PCA directly using SVD 139
full-rank matrices, definition of 137
linear system as degenerate 137
PyTorch code for solving linear systems with SVD 137–139
rank of a matrix, definition of 137
spectral norms, definition of 122
stochastic gradient descent (SGD) 315, 317
stochastic mapping, definition of 484
supervised neural networks 273
supervised training data 241
T
Taylor series 100–101, 112–113
term frequency (TF) 141
torchvision package 397
total training loss, definition of 301
training data 4–6, 86, 272, 315
transposed convolution. See convolution, transposed
tunable hyperparameter 285
U
unit vector 36
unsupervised machine learning 193–194
V
vanishing gradient problem 394
variable bit-rate coding 200
variance
Bayesian estimation with unknown mean, known variance, code listing 452–453
Bayesian estimation with unknown mean, unknown variance, code listing 459
Bayesian estimation with unknown variance, known mean, code listing 456
computing the variance of a Gaussian distribution 499–501
covariance as the multivariate analog of variance 165–167
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a univariate Gaussian, code listing 178
variance and expected value 167
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
variance of a uniform distribution 171–172
variance, covariance, and standard deviation 164–165
See also covariance
variational autoencoders (VAEs) 150, 241, 483–495
autoencoders vs. VAEs 494
comparing autoencoder- and VAE-reconstructed images on the MNIST data set 494
computing the reconstruction loss and KL divergence loss 485
differences between the learned latent spaces of the autoencoder and VAE 495
evidence lower bound (ELBO) 488–490
examples of high and low KL divergence loss 486
KLD loss as regularizing the latent space 486
minimizing reconstruction loss as leading to ELBO maximization 490
physical significance of ELBO maximization 489
reparameterization trick, code listing 492
stochastic mapping as leading to latent-space smoothness 487
stochastic mapping, definition of 484
VAE decoder, code listing 493
VAE loss, code listing 494
VAE training, code listing 494
VAE training, losses, and inferencing 485–486
VAE, code listing 493
VAEs and Bayes’ theorem 487
zero-mean, unit-covariance Gaussian for the known prior 490–492
See also autoencoders
vectors
basic vector and matrix operations in machine learning 26–28
basis vectors 48
creating feature vectors that describe a document 20
defining the span of a set of vectors 47–48
definition of 19
describing a point’s position in a coordinate system 21
document feature vectors 141
dot product of two vectors 29–30
feature vector, definition of 19
geometric intuitions for vector length 36
geometric view of 21
as inputs to a machine learning system 84
introduction to vectors via PyTorch 23
mapping input points to output points in a high-dimensional space 21
matrix and vector transpose 28–29
minimal and complete basis 48–49
orthogonality of vectors 39
representing both inputs and outputs 19
representing the parameters of the model function 19
role of in machine learning 19–21
unit vector 36
VGG (Visual Geometry Group) Net 391–397
common structural elements of the VGG family of networks 391–392
convolutional backbone, code listing 395–396
graph of a 1D sigmoid function and its derivative 394
instantiating a VGG network from a specific config 397
rectified linear unit (ReLU) 392–394
removal of the local response normalization (LRN) layers 391
single convolutional block, code listing 395
torchvision package 397
use of smaller (3×3) convolution filters 391
vanishing gradient problem 394
VGG network, code listing 396
VGG-11 architecture diagram 393
W