Index
A
Adam optimizer algorithm 328–329
AlexNet 391
arrays 53
asymptotically approaching zero 174
autoencoder decoder, code listing 481
autoencoder encoder, code listing 480
autoencoder training, code listing 481
autoencoders
decoder, definition of 478
embedding an image 305
encoder and decoder neural networks, discussion of 479–480
encoder, definition of 478
hyperparameter, definition of 478
PCA and 481
reconstruction loss, definition of 478
representation learning, definition of 478
schematic representation of an autoencoder 479
as unsupervised 479
See also variational autoencoders (VAEs)
average pooling, definition of 381
B
backpropagation algorithm 286–294
algorithm for training a neural network 294–295
backpropagation algorithm on an arbitrary network of linear layers 290–294
definition of 286
evaluating on a simple MLP with a single neuron per layer 286
forward and backward propagation, code listing 289–290
forward pass, definition of 289
Hadamard product, definition of 292
performing min-max normalization in PyTorch, code listing 296
training a neural network in PyTorch 295–298
using an optimizer to update weights 298
Bayes’ theorem 194, 196–198, 448
Bernoulli distribution 188–189
bounding box, definition of 385
C
categorical variables, definition of 242
centroid, definition of 163
classification loss, definition of 423
classifiers
binary classifiers 88
charted examples of good and bad decision boundaries 250
charts of cat-brain threat-model decision boundaries 247–249
as decision boundaries 84–85, 246–247
decision boundary as a hypersurface 249
estimating a decision boundary 251
feature space 246
forming mental pictures of hyperspaces with 3D analogs 249
geometric depiction of a classification problem 85
as a hypersurface 85
input space 246
modeling the classifier, definition of 86
continuous random variable 152
continuous variables, definition of 242
convex and nonconvex functions
convex curves and surfaces, three definitions of 110–112
convexity and the Taylor series 112–113
examples of convex functions 113
convolution
convolution layers 344, 380–381
convolution output size 356
expressing convolution layers as matrix-vector multiplications 344
convolution, one-dimensional 345–356
1D edge detection, code listing 355
1D local averaging convolution, code listing 354–355
convolution output size 356
curve edge detection via 1D convolution 350–351
curve smoothing via 1D convolution 350
detecting edges as a way to understand images 351
directly invoking the convolution function, code listing 356
edge, definition of 351
formula for generating a single output value in 1D convolution 349
graphical and algebraical view 345
how to visualize a 1D convolution 345
input, definition of 345
kernel, definition of 345
as matrix multiplication 351–354
output, definition of 345
padding, definition of 346–347
setting the weights of a 1D kernel 354
convolution, three-dimensional 368–374
3D convolution with custom weights, code listing 373–374
diagrams illustrating the spatio-temporal view of 3D convolution 370
generating a single output value in 3D convolution 370
how a kernel extracts motion information from video frames 371
how to visualize a 3D convolution 369
illustration of a 3D convolution motion detector 372
video as a 3D entity extending over a spatio-temporal volume 368
video motion detection via 3D convolution 370–371
convolution, transposed 374–379
2D convolution and its transpose 377
autoencoder, definition of 377
decoder, definition of 376
descriptor vector, definition of 375
embedding as an effective compression technique 377
embedding, definition of 375
encoder, definition of 375
end-to-end learning, definition of 377
fractionally strided convolution 375
illustration of a 1D convolution and its transpose 376
illustration of a 2D convolution and its transpose 377
output size 377
upsampling using transpose convolutions, code listing 378–379
why autoencoders need transposed convolution 375
convolution, two-dimensional 356–368
2D convolution as matrix multiplication 366–368
2D convolution with custom weights 363–365
2D edge detection, code listing 366
2D local averaging convolution, code listing 365–366
comparing Euclidean distance to Manhattan distance 358
generating a single output value in 2D convolution 361–362
graphical and algebraic view 356–358
how to visualize a 2D convolution 358
image edge detection via 2D convolution 362–363
image smoothing via 2D convolution 362
input, definition of 359
kernel, definition of 359
output, definition of 359
padding, definition of 361
same (zero) padding 361
two-dimensional neighborhoods not preserved by rasterization 358
valid padding 361
convolutional neural networks (CNNs)
AlexNet 391
benefits of neural networks with multiple convolutional layers 388–391
bounding box, definition of 385
feature map, definition of 388
GoogLeNet 398
image classification, definition of 386
Inceptionv1 architecture 397–401
LeNet architecture, components of 387–389
MNIST data set, sample images from 387
object detection, definition of 386
VGG (Visual Geometry Group) Net 391–397
covariance
covariance as the multivariate analog of variance 165–167
covariance of a multivariate Gaussian distribution 178–180
variance, covariance, and standard deviation 164–165
zero-mean, unit-covariance Gaussian for the known prior 490–492
See also variance
cross-entropy loss
binary cross-entropy loss, code listing 305
definition of 303
Cybenko’s universal approximation theorem 261–262
D
data imbalance, definition of 310–311
decoder, definition of 376, 478
deep learning, overview of 1–17
deep neural networks, definition of 260
descriptor vector, definition of 375
determinants 499
differentiable step-like functions 273–276
graph of the derivatives of 1D sigmoid and tanh functions 276
Heaviside step function as not differentiable 273
sigmoid function and its properties 273–275
dimensionality reduction, definition of 469
discriminative functions 252
document descriptor space 116–118
document retrieval problem 141–147
dot product and cosine of the angle between two vectors 497–498
downsampling, definition of 381
E
eigenvalues and eigenvectors 62–65, 67–69, 72–73
encoder, definition of 375, 478
entropy
applying to continuous and multidimensional random variables 201
chain rule of conditional entropy 212
charts of entropies of peaked and flat distributions 202
charts of KLD between example distributions 211
computing the cross-entropy of a Gaussian, code listing 202
computing the entropy of a Gaussian distribution, code listing 204
computing the KLD, code listing 210
examples of 199
geometrical intuition for entropy 201–203
Huffman encoding 200
Kullback–Leibler divergence (KLD) 207–210
prefix coding 200
quantifying the uncertainty associated with a chancy event 199
variable bit-rate coding 200
epoch, definition of 302
error. See loss function (error)
evidence lower bound (ELBO) 488–490
F
Fast R-CNN architecture 429
Faster R-CNN, high-level architecture 414
feature space, definition of 10
first-order approximation 101
fixed point, definition of 231
frequentist paradigm 151
Frobenius norms, definition of 122–123
Fully Bayes estimation 448–453
Bayesian estimation with unknown mean, known variance, code listing 452–453
Bayesian estimation with unknown mean, unknown variance, code listing 459
Bayesian estimation with unknown variance, known mean, code listing 456
computing posterior probability using Bayesian inference, code listing 461
conjugate priors 454
estimating the mean and precision parameters 457–458
estimating the precision parameter when the mean is known 455–456
Fully Bayesian inferencing 459–461
Gaussian, unknown mean, known precision 450–453
Gaussian, unknown precision, known mean 454
maximum a posteriori (MAP) estimation 448
maximum likelihood estimation 460
maximum likelihood parameter estimation (MLE) 448
MLE for Gaussian parameter values, recap of 449–450
multivariate Bayesian inferencing, unknown mean 463
multivariate Gaussian, unknown mean, known precision 461–463
multivariate, unknown precision, known mean 463–466
normal-gamma distribution 457–459
parameter estimation and belief injection 448–449
prior probability density 448
Wishart distribution 454, 463–464
fully connected layer 277
function family 86
G
Gamma distribution 454–455, 503–505
Gamma function, overview of 502–503
Gaussian (normal) distribution 173–180
asymptotically approaching zero 174
bell-shaped curve 174
Bernoulli distribution 188–189
categorical distribution and one-hot vectors 189–190
chart of a univariate Gaussian random probability density function 177
computing the variance of 499–501
covariance of a multivariate Gaussian distribution 178–180
expected value of a Bernoulli distribution 188–189
expected value of a binomial distribution 184–185
expected value of a categorical distribution 190–191
expected value of a Gaussian distribution 176–177
Gaussian probability density function 174
geometry of sampled point clouds 180
log probability of a Bernoulli distribution, code listing 188
log probability of a binomial distribution, code listing 183–184
log probability of a multinomial distribution, code listing 186
log probability of a univariate normal distribution, code listing 175
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a univariate Gaussian, code listing 178
multinomial distribution 185–187
multivariate Gaussian point clouds and hyper-ellipses 180
outlier values 174
probability of a categorical distribution 190
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
Gaussian mixture models (GMM) 215–237
algorithm of GMM fit (MLE of GMM parameters) 236
charts of two-dimensional GMMs with circular and elliptical bases 226
classification via GMM 230
fixed point, definition of 231
Gaussian mixture model distribution 229
latent variables for class selection 227–229
maximum likelihood estimation of GMM parameters (GMM fit) 230–237
probability density function of the GMM 223–227
progression of maximum likelihood estimation for GMM parameters 232
generative functions 252
generative modeling, definition of 447–448
global minimum 303
GoogLeNet 398
gradient
becoming zero at the optimum 96
gradient vectors and minimizing loss functions 89–90
as the vector of all the partial derivatives 95
H
Hadamard product, definition of 292
Hausdorff property, definition of 441–442
Heaviside step function 252–253, 273
hidden layers 260
Huffman encoding 200
human labeling (human curation), definition of 6
hyperparameter, definition of 478
I
Inceptionv1 architecture 397–401
description of its network-in-network paradigm 397–399
diagram of 398
GoogLeNet 398
implementing a dimensionality reduced Inception block 400–401
implementing a naive Inception block, code listing 399
input variables 242
inputs, normalizing 7
J
Jensen’s inequality theorem 501–502
joint probability, definition of 155
K
Kullback–Leibler divergence (KLD) 207–210
L
L’Hospital’s rule 500
latent or hidden variables/parameters 216, 228
latent semantic analysis (LSA) 118, 142–147
latent spaces
comparing discriminative and generative models 471–472
considering the space of natural and digital images 469
dimensionality reduction 469, 475
dimensionality reduction using PCA, code listing 477–478
discriminative classifiers, definition of 471
generative classifiers 471–472
generative models, properties of 471–472
illustration of good and bad discriminative classifiers 472
latent space modeling briefly explained 470–471
latent vector, definition of 468
latent-space modeling, benefits and applications 472–473
linear latent space manifolds and PCA 474–478
manifold as capturing the essence of a common property 469
mapping from a 2D input space to a 1D latent space 482
observed vector, definition of 468
PCA as a special case of latent space representation 471
regularization as creating a more compact latent space 483
smoothness, continuity, and regularization of 481–483
steps involved in a PCA-based dimensionality reduction 475–477
two examples of latent subspaces, with planar and curved manifolds 470
LeNet architecture
implementing LeNet for image classification on MNIST, code listing 388–389
output feature map passed through two fully connected (FC) layers 388
subsampling (pooling) layers 388
tanh activation layer 388
three convolutional layers of 5×5 kernels 387–388
level contours 97
linear layers
algorithm for training a neural network 294–295
backpropagation algorithm 286–294
diagram of a complete multilayered neural network 278
forward and backward propagation, code listing 289–290
forward propagation of a single linear layer 280–281
forward propagation, code listing 281
fully connected layer 277
gradient descent and local minima 285–286
Hadamard product, definition of 292
Hessian matrix 284
learning rate 285
loss and its minimization 282–283
loss surface and gradient descent 283–286
as matrix-vector multiplication 277–280
mean squared error (MSE) function 282
MSE loss, code listing 283
never using test data for training 281
performing min-max normalization in PyTorch, code listing 296
training a neural network in PyTorch 298
training and backpropagation 281–282
tunable hyperparameter 285
using an optimizer to update weights 298
linear vs. nonlinear models 12–14
local minimum 303
local response normalization (LRN) layers 391
local translation invariance, definition of 381
logical functions
definition of 242
m-out-of-n trigger 244
multi-input logical AND 244
multi-input logical OR 244
log-sum inequality theorem 502
loss function (error)
autoencoders, definition of 305
binary cross-entropy loss for image and vector mismatches 305–306
binary cross-entropy loss, code listing 305
computing the gradient of the loss function 316
creating a custom neural network model, code listing 318
creating a custom PyTorch data set, code listing 317
cross-entropy loss, code listing 304
cross-entropy loss, definition of 303
data imbalance, definition of 310–311
epoch, definition of 302
equation for describing a full neural network 301
generating one training loop, code listing 319
global minimum 303
gradient vectors and minimizing loss functions 89–90
GT vector, definition of 301–302
local approximation for the loss function 99–100
local minimum 303
loss function and SGD optimizer, code listing 318
loss surfaces and their minimization 303
loss surfaces, description of 302–303
multi-dimensional loss functions 93–94
one-dimensional loss functions 91–93
output vector 303
prediction vector 303
regression loss, code listing 303
running the training loop num epochs times, code listing 319
total error, definition of 89
total training loss, definition of 301
using a squared error function 89
visualizing loss surfaces 97
M
machine learning
analogy to the human brain 5
cat brain model 7
chart of the 2D input point space for the cat brain model 11
classifier, definition of 12
collinearity as implying linear dependence 47
computing eigenvectors and eigenvalues 67
computing the simplified cat brain threat score model 11
defining the span of a set of vectors 47–48
dot product and the difference between two unit vectors 38–39
dot product of two vectors 29–30
eigenvalues and eigenvectors 62–65
eigenvectors and linear independence 65–66
equations for describing a multilayered neural network 16
estimating a threat score 7
example cat-brain dataset matrix 24
example training dataset 24
feature space, definition of 10
finding the axes of a hyperellipse 78–79
formula for transforming an arbitrary input value to a normalized value 7
from arbitrary input to the desired output during inferencing 5
generating the right outcome on never-before-seen data 5
generic multidimensional definition of linear transforms 51
geometric intuitions for dot product and vector length 36–37
introduction to vectors via PyTorch 23
Kullback–Leibler divergence (KLD) 207–210
latent semantic analysis (LSA) 118
learning, definition of 5
linear systems with zero or near-zero determinants 55–57
linear vs. nonlinear models 12–14
list of problem solving stages 4
machine learning model error 34–36
matrix powers using diagonalization 76–77
matrix-matrix multiplication 32–33
matrix-vector multiplication 31–32
matrix-vector multiplications as linear transforms 52–53
measuring the component of a vector along a coordinate axis 37–38
minimizing a quadratic form in machine learning problems 121
model estimation 8
multidimensional line and plane equations 42–46
multidimensional line equation 42–43
multilayered neural network, diagram of 15
natural language processing (NLP) 118
over-determined and under-determined linear systems 57–59
as a paradigm shift in computing 3
performing basic vector and matrix operations 26–28
principal component analysis (PCA) 118
producing Python code using Jupyter Notebook 22
quadratic form, definition of 118
regressor, definition of 12
retrieving documents that match a query phrase 140–147
role of matrices in 23
sigmoid function, definition of 14
singular value decomposition (SVD) 130–140
solving linear systems without inversion via diagonalization 74–75
spectral decomposition of a symmetric matrix 77
squared error 34
sticking to any fixed coordinate system 22
supervised machine learning 194
supervised vs. unsupervised learning 4
symmetric matrices and orthogonal eigenvectors 66
target output 4
training data, definition of 4
training, definition of 5
transpose of matrix products 33
trying to model the unknown transformation function 6
unsupervised machine learning 193–194
using 3D analogues for higher dimensional spaces 22
using PyTorch code for vector manipulations 22
See also neural networks
manifolds
applying calculus to a locally Euclidean property 440–441
bounded, compact, and precompact sets 443
definition of 438
d-manifold, definition of 440
example manifolds and non-manifolds in 1D and 2D 440
Hausdorff property, definition of 441–442
manifolds as locally Euclidean 440
mapping points from one manifold to another 439
neural networks and 438
open sets, closed sets, and boundaries 442
second countable property of manifolds 442–443
mathematical notations used throughout the text 506
matrices
applying rotation matrices 69
basic vector and matrix operations in machine learning 26–28
converting a matrix into a vector via rasterization 84
data matrix columns as dimensions in the feature space 115
data matrix rows as representing feature vectors 115
example cat-brain dataset matrix 24
Frobenius norms, definition of 122–123
full-rank matrices, definition of 137
introducing matrices via PyTorch 25–26
inverting a matrix and computing its determinant 57
linear systems and matrix inverse 53–55
matrix and vector transpose 28–29
matrix powers using diagonalization 76–77
matrix, definition of 23
matrix-matrix multiplication 32–33
matrix-vector multiplication 31–32
Moore–Penrose pseudo-inverse of a matrix 59–62
orthogonal (rotation) matrices and their eigenvalues and eigenvectors 67–69
orthogonality and length-preservation 71
orthogonality of rotation matrices 71–72
rank of a matrix, definition of 137
representing digital images as matrices 25
role in machine learning 23
slicing and dicing matrices 26
solving an overdetermined system using the pseudo-inverse 62
spectral norms, definition of 122
symmetric positive semidefinite matrices 121–122
transpose of matrix products 33
using linear algebraic tools to analyze matrix structures 115–116
max pooling, definition of 381
maximum a posteriori (MAP) estimation 448
maximum likelihood parameter estimation (MLE) 448
mean squared error (MSE) function 282
model parameter estimation 213–222
estimating the model parameters from the unlabeled training data 213
examining the likelihood term 213
examining the prior probability term 213
Gaussian mixture models (GMMs) 215
Gaussian negative log-likelihood for training data, code listing 219–220
Gaussian negative log-likelihood with regularization, code listing 221–222
latent variables and evidence maximization 215–216
likelihood, evidence, and posterior and prior probabilities 213–214
maximum a posteriori (MAP) parameter estimation and regularization 215
maximum likelihood estimate for a Gaussian, code listing 219
maximum likelihood parameter estimation (MLE) 214–215
maximum likelihood parameter estimation for Gaussians 216–218
minimizing MLE loss via gradient descent, code listing 220
using the log-likelihood trick 214
modeling
inferencing 4
linear vs. nonlinear models 12–14
model architecture selection 86
overall algorithm for training a supervised model 90
training error, definition of 86
trying to model the unknown transformation function 6
Adam optimizer algorithm with bias correction 328–329
chart showing an overfitting of data points in a binary classifier 331
momentum-based gradient descent 322
Nesterov accelerated gradients 322–325
overfitting and underfitting 330
regularization 330
root mean squared propagation (RMSProp) 327–328
viewing regularization as minimizing descriptor length 332
Multibox Single-Shot Detector (SSD) 436
multidimensional functions 93–95
multidimensional integral 161–162
N
natural language processing (NLP) 118
Nesterov accelerated gradients 322–325
neural networks
adjusting its architecture and parameter values 240
algorithm for training a neural network 294–295
backpropagation algorithm 286–294
categorical variables, definition of 242
charts of cat-brain threat-model decision boundaries 247–249
charts of good and bad decision boundaries 250
choosing an architecture 240
classifying into supervised and unsupervised neural networks 241
continuous variables, definition of 242
decision boundaries, definition of 246–247
decision boundary as a hypersurface 249
determining parameter values through training 240
diagram of a complete multilayered neural network 278
diagram of a multilayered neural network 15
differentiable step-like functions 273–276
discriminative functions 252
equations for describing a multilayered neural network 16
estimating a decision boundary 251
expressing real-world problems in target functions 240–242
expressive power, definition of 240
feature space 246
forming mental pictures of hyperspaces with 3D analogs 249
forward and backward propagation, code listing 289–290
fully connected layer 277
generative functions 252
gradient descent and local minima 285–286
ground truth 241
Hadamard product, definition of 292
Heaviside step function 252–253, 273
Hessian matrix 284
inferencing 241
input space 246
input variables 242
learning rate 285
logical functions, definition of 242–245
making a probabilistic statement of output correctness 241
manual annotation 241
mean squared error (MSE) function 282
m-out-of-n trigger function 244
multi-input logical AND function 244
multi-input logical OR function 244
multilayer perceptrons (MLPs) 259–269
neuron, basic description of 240, 252
output variables 242
performing min-max normalization in PyTorch, code listing 296
sigmoid function and its properties 275
supervised neural networks 273
supervised training data 241
target output 241
training a neural network in PyTorch 295–298
training data, definition of 272
tunable hyperparameter 285
using an optimizer to update weights 298
weights 241
See also machine learning
neuron, description of 240, 252
non-maxima suppression (NMS) algorithm 425–426
O
object detection
anchors and their configurations, description of 415
assigning GT labels for each anchor box, code listing 420
assigning targets to anchor boxes 421–422
classification loss, definition of 423
classifier predicting an objectness value 417
contributions and improvements of Fast R-CNN 412–413
dealing with the imbalance between negative and positive anchors 421
Fast R-CNN architecture 429
Fast R-CNN loss function 431–432
Fast R-CNN RoI head, code listing 430–431
Faster R-CNN and its two core modules 412–413
Faster R-CNN, high-level architecture 414
FCN of the RPN, code listing 418–419
Feature Pyramid Network (FPN) 436
FRCNN guidelines for assigning labels to anchor boxes 420
fully convolutional network (FCN) architecture 417–418
generating a target (GT) for an RPN 421
generating all anchors for a given image 417
generating anchors at a particular grid point, code listing 416
generating region proposals 424–425
Multibox Single-Shot Detector (SSD) 435–436
NMS of RoIs, code listing 427
non-maxima suppression (NMS) algorithm 425–426
other object-detection paradigms 435–436
Region proposal network (RPN) 413–415
regression loss, definition of 423
three stages in the R-CNN approach to object detection 411–412
training the Fast R-CNN 431
training the Faster R-CNN 434–435
You Only Look Once (YOLO) 435
observed vector, definition of 468
one-dimensional loss functions 91–93
optimization
Adam optimizer algorithm with bias correction 328–329
Bayes’ theorem and the stochastic view of optimization 334–335
creating a custom neural network model, code listing 318
creating a custom PyTorch data set, code listing 317
generating one training loop, code listing 319
learning rate (LR) 315
loss function and SGD optimizer, code listing 318
MLE-based optimization 335
overfitting and underfitting 330
overfitting of data points in a binary classifier 331
random shuffling of training data after every epoch 315
regularization 330
root mean squared propagation (RMSProp) 327–328
running the training loop num epochs times, code listing 319
stochastic gradient descent (SGD) 315–316
viewing regularization as minimizing descriptor length 332
output variables 242
output vector 303
P
parameterized function, threat score 4
partial derivatives, definition of 94
perceptrons
classification and 254
code listing for 256
Cybenko’s universal approximation theorem 261–262
deep neural networks, definition of 260
definition of 254
generating 2D steps and waves with perceptrons 264–267
generating a 1D tower with perceptrons 262–264
hidden layers 260
introduction to modeling common logic gates with perceptrons 256
layering for organizing perceptrons into a neural network 260
MLP for a logical XOR function 259–260
MLPs for polygonal decision boundaries 268–269
modeling logical gates, code listing 258
multilayer perceptrons (MLPs) 259–269
multiple perceptrons 256
partitioning with a planar decision surface
perceptron for a logical AND function 257
perceptron for a logical NOT function 258
perceptron for a logical OR function 258
perceptrons and MLPs in 1D, code listing 267
perceptrons and MLPs in 2D, code listing 267–268
truth table for two-variable logical functions 261
pixel, definition of 83
prediction vector 303
prefix coding 200
principal component analysis (PCA) 118, 123–130
applying PCA on correlated and uncorrelated datasets 128
calculating the direction of maximum spread 125–127
dimensionality reduction via PCA 127–128
linear latent space manifolds and PCA 478
PCA and data compression 130
PCA computation, code listing 128–129
PCA on synthetic correlated data, code listing 129
PCA on synthetic nonlinearly correlated data, code listing 130
PCA on synthetic uncorrelated data 129
as a special case of latent space representation 471
use in JPEG 98 image compression techniques 130
probability density function (PDF) 152, 198
probability distributions
continuous random variable 152
definition of 153
discrete random variable 151
emphasizing the geometrical view of multivariate statistics 150
example graph for the weights of adults in Statsville 154
fitting probability distributions to specific groups of people 150
frequentist paradigm 151
loosely structured point distributions in high-dimensional spaces 149
probabilities as always less than or equal to 1 151
probability density 152
PyTorch distributions package 150, 162
random variable, definition of 151
semantic segmentation 150
using histograms to visualize discrete random variables 152–153
using probabilistic models in unsupervised and minimally supervised learning 150
using uppercase letters to denote random variables 152
variational autoencoders (VAEs) 150
See also probability theory
probability theory
asymptotically approaching zero 174
bell-shaped curve 174
Bernoulli distribution 188–189
Cartesian product 157
categorical distribution and one-hot vectors 189–190
centroid, definition of 163
chart of a univariate Gaussian random probability density function 177
chart of bivariate uniform random probability density function 173
conditional probability 196
continuous random variables and probability density 160–162
covariance as the multivariate analog of variance 165–167
covariance of a multivariate Gaussian distribution 178–180
dependent events and their joint probability distribution 157–159
dependent vs. independent variables 196
entropy, definition of 200–201
exhaustive and mutually exclusive events 154–155
expected value of a Bernoulli distribution 188–189
expected value of a binomial distribution 184–185
expected value of a categorical distribution 190–191
expected value of a function of a random variable 163
expected value of a Gaussian distribution 176–177
expected value of a linear combination of random variables 164
expected value of a uniform distribution 171
Gaussian (normal) distribution 173–180
Gaussian probability density function 174
geometry of sampled point clouds 180
graphical visualization of joint probability distributions 160
independent events 155
joint and marginal probability 194–196
joint probabilities and their distributions 157
Kullback–Leibler divergence (KLD) 207–210
log probability of a Bernoulli distribution, code listing 188
log probability of a binomial distribution, code listing 183–184
log probability of a multinomial distribution, code listing 186
log probability of a univariate normal distribution, code listing 175
log probability of a univariate uniform random distribution, code listing 171
marginal probabilities 157
marginal probability for a variable 195
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a uniform random distribution, code listing 172
mean and variance of a univariate Gaussian, code listing 178
multidimensional integral 161–162
multinomial distribution 185–187
multivariate Gaussian point clouds and hyper-ellipses 180
outlier values 174
probabilities of impossible and certain events 154
probability density function (PDF) 152, 198
probability of a categorical distribution 190
product rule, definition of 155
properties of distributions 162–167
sample point distributions for dependent and independent variables 159–160
sampling from a distribution 167–169
sum rule 195
uniform distributions as multivariate 173
uniform random distributions 170–171
variance and expected value 167
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
variance of a uniform distribution 172
variance, covariance, and standard deviation 164–165
See also probability distributions
product rule, definition of 155
applying PCA on correlated and uncorrelated datasets 128
computing LSA and SVD on a large dataset 146–147
computing PCA directly using SVD 139
dot product of two vectors 40
eigenvalues and eigenvectors of a rotation matrix 72
examining linear models 101–105
examining nonlinear models 105–107
finding the axes of a hyperellipse 79–80
introducing matrices via PyTorch 25–26
matrix diagonalization 74
matrix vector multiplication 40–41
orthogonality of rotation matrices 71–72
PCA computation, code listing 128–129
PCA on synthetic correlated data, code listing 129
PCA on synthetic nonlinearly correlated data, code listing 130
PCA on synthetic uncorrelated data 129
performing a matrix transpose 39–40
slicing and dicing matrices 26
solving linear systems via diagonalization 76
solving linear systems with SVD 137–139
spectral decomposition of a symmetric matrix 77–78
tensors and images in PyTorch 26
training a linear model for the cat brain 108
transpose of a matrix product 42
using for vector manipulations 22
PyTorch
creating a custom PyTorch data set, code listing 317
inferencing a model, PyTorch Trainer code listing 411
introducing matrices via PyTorch 25–26
introducing vectors via PyTorch 23
MNIST data module, PyTorch DataModule code listing 407–408
performing min-max normalization in PyTorch, code listing 296
PyTorch code for solving linear systems with SVD 137–139
PyTorch distributions package 150, 162
tensors and images in PyTorch 26
training a neural network in PyTorch 295–298
using PyTorch code for vector manipulations 22
PyTorch Autograd 103
PyTorch DataLoader 316
PyTorch Lightning
implementing LeNet as a PyTorch Lightning module, code listing 408–410
inferencing a model, PyTorch Trainer code listing 411
LightningModule component 410
MNIST data module, PyTorch DataModule code listing 407–408
Q
quadratic forms, minimizing 118–121
quantitative inputs 4
R
random variable 151–153, 160–162
rasterized vector, creating 84
reconstruction loss, definition of 478
rectified linear unit (ReLU) 392–394
regression loss, definition of 303, 423
regressors, definition of 2, 12
regularization 330
representation learning, definition of 478
ResNet
components of the core architecture 403
examining how to solve the degradation problem 401–403
identity shortcut connection 402
implementing a basic skip connection block (BasicBlock) 403–404
ResidualConvBlock, code listing 405
ResNet-34, code listing 405–406
root mean squared propagation (RMSProp) 327–328
S
semantic segmentation 150
singular value decomposition (SVD) 130–140
applying SVD by solving arbitrary linear systems 135–136
applying SVD to find the best low-rank approximation of a matrix 139–140
applying SVD via PCA computation 135
computing PCA directly using SVD 139
full-rank matrices, definition of 137
linear system as degenerate 137
PyTorch code for solving linear systems with SVD 137–139
rank of a matrix, definition of 137
spectral norms, definition of 122
stochastic gradient descent (SGD) 315, 317
stochastic mapping, definition of 484
supervised neural networks 273
supervised training data 241
T
Taylor series 100–101, 112–113
term frequency (TF) 141
torchvision package 397
total training loss, definition of 301
training data 4–6, 86, 272, 315
transposed convolution. See convolution, transposed
tunable hyperparameter 285
U
unit vector 36
unsupervised machine learning 193–194
V
vanishing gradient problem 394
variable bit-rate coding 200
variance
Bayesian estimation with unknown mean, known variance, code listing 452–453
Bayesian estimation with unknown mean, unknown variance, code listing 459
Bayesian estimation with unknown variance, known mean, code listing 456
computing the variance of a Gaussian distribution 499–501
covariance as the multivariate analog of variance 165–167
mean and variance of a Bernoulli distribution, code listing 189
mean and variance of a multinomial distribution, code listing 187–188
mean and variance of a multivariate normal distribution, code listing 179
mean and variance of a univariate Gaussian, code listing 178
variance and expected value 167
variance of a Bernoulli distribution 189
variance of a binomial distribution 185
variance of a uniform distribution 171–172
variance, covariance, and standard deviation 164–165
See also covariance
variational autoencoders (VAEs) 150, 241, 483–495
autoencoders vs. VAEs 494
comparing autoencoder- and VAE-reconstructed images on the MNIST data set 494
computing the reconstruction loss and KL divergence loss 485
differences between the learned latent spaces of the autoencoder and VAE 495
evidence lower bound (ELBO) 488–490
examples of high and low KL divergence loss 486
KLD loss as regularizing the latent space 486
minimizing reconstruction loss as leading to ELBO maximization 490
physical significance of ELBO maximization 489
reparameterization trick, code listing 492
stochastic mapping as leading to latent-space smoothness 487
stochastic mapping, definition of 484
VAE decoder, code listing 493
VAE loss, code listing 494
VAE training, code listing 494
VAE training, losses, and inferencing 485–486
VAE, code listing 493
VAEs and Bayes’ theorem 487
zero-mean, unit-covariance Gaussian for the known prior 490–492
See also autoencoders
vectors
basic vector and matrix operations in machine learning 26–28
basis vectors 48
creating feature vectors that describe a document 20
defining the span of a set of vectors 47–48
definition of 19
describing a point’s position in a coordinate system 21
document feature vectors 141
dot product of two vectors 29–30
feature vector, definition of 19
geometric intuitions for vector length 36
geometric view of 21
as inputs to a machine learning system 84
introduction to vectors via PyTorch 23
mapping input points to output points in a high-dimensional space 21
matrix and vector transpose 28–29
minimal and complete basis 48–49
orthogonality of vectors 39
representing both inputs and outputs 19
representing the parameters of the model function 19
role of in machine learning 19–21
unit vector 36
VGG (Visual Geometry Group) Net 391–397
common structural elements of the VGG family of networks 391–392
convolutional backbone, code listing 395–396
graph of a 1D sigmoid function and its derivative 394
instantiating a VGG network from a specific config 397
rectified linear unit (ReLU) 392–394
removal of the local response normalization (LRN) layers 391
single convolutional block, code listing 395
torchvision package 397
use of smaller (3×3) convolution filters 391
vanishing gradient problem 394
VGG network, code listing 396
VGG-11 architecture diagram 393
W