Training Courses for Graduate Students

Basic Mathematics for Machine Learning





Date: 2021-08-07
Presented by Hao-Ting Li (李皓庭)

Outline

  • Linear Algebra
  • Calculus
  • Probability and Information Theory
  • Optimization

Linear Algebra

  • Scalars, vectors, matrices and tensors
  • Identity, Transpose, and Inverse
  • Inner product, element-wise product, and matrix product
  • Transformations
  • Norms
  • References

Scalars, Vectors, Matrices, and Tensors

  • A scalar is just a single number.
    • s \in \mathbb{R}
  • A vector is an array of numbers.
    • \mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}, \mathbf{v} \in \mathbb{R}^{n}
    • shape: (n,)
  • A matrix is a 2-D array of numbers.
    • \mathbf{M} = \begin{bmatrix} M_{1,1} & \cdots & M_{1, n} \\ \vdots & \ddots & \vdots \\ M_{m, 1} & \cdots & M_{m, n} \end{bmatrix}, \mathbf{M} \in \mathbb{R}^{m \times n}
    • shape: (m, n)
  • A tensor is a multi-dimensional array, denoted e.g. \mathsf{T} (a minimal NumPy sketch follows this list).
    • Scalars are 0th-order tensors.
    • Vectors are 1st-order tensors.
    • Matrices are 2nd-order tensors.
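
As a quick illustration, here is a minimal NumPy sketch (assuming NumPy is available; the values and shapes are arbitrary examples) of these objects and their shapes:

    import numpy as np

    s = 3.0                                # scalar: a single number
    v = np.array([1.0, 2.0, 3.0])          # vector, shape (3,)
    M = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])        # matrix, shape (2, 3)
    T = np.zeros((2, 3, 4))                # 3rd-order tensor, shape (2, 3, 4)

    print(v.shape, M.shape, T.shape)       # (3,) (2, 3) (2, 3, 4)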

Identity, Transpose, and Inverse

  • The identity matrix of size n is the n \times n square matrix with 1 on the main diagonal and 0 elsewhere.

    \mathbf{I}_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}

    • \mathbf{I}_n \mathbf{A}_n = \mathbf{A}_n \mathbf{I}_n = \mathbf{A}_n
  • The transpose of a matrix is an operator which flips a matrix over its diagonal.

(\mathbf{A}^\top)_{i,j} = \mathbf{A}_{j,i}

  • The inverse of a matrix \mathbf{A} \in \mathbb{R}^{n \times n}, denoted \mathbf{A}^{-1}, is defined as the matrix such that

    \mathbf{A} \mathbf{A}^{-1} = \mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_n
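
A minimal NumPy sketch of these three operations (assuming NumPy; the matrix A below is an arbitrary invertible example):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])

    I = np.eye(2)                     # identity matrix I_2
    print(np.allclose(I @ A, A))      # I A = A  -> True
    print(A.T)                        # transpose: (A^T)_{i,j} = A_{j,i}

    A_inv = np.linalg.inv(A)          # inverse matrix
    print(np.allclose(A @ A_inv, I))  # A A^{-1} = I  -> True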

Inner Product, Element-wise Product, and Matrix Product

  • The inner product (or dot product) of two vectors \mathbf{a} \in \mathbb{R}^{n} and \mathbf{b} \in \mathbb{R}^{n} is defined as:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

  • The element-wise product (or Hadamard product) of two matrices (or vectors) is defined as:

    \mathbf{C} = \mathbf{A} \odot \mathbf{B}

    • C_{i,j} = A_{i,j} B_{i,j}
  • The matrix product of two matrices \mathbf{A} \in \mathbb{R}^{m \times p} and \mathbf{B} \in \mathbb{R}^{p \times n} is defined as:

    \mathbf{C} = \mathbf{A} \mathbf{B}

    • \mathbf{C} \in \mathbb{R}^{m \times n}
    • C_{i,j} = \sum_{k} A_{i,k} B_{k,j}
      • each entry C_{i,j} can be seen as the dot product of the i-th row of \mathbf{A} and the j-th column of \mathbf{B} (see the NumPy sketch after this list)
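
The three products map directly onto NumPy operations; a minimal sketch (assuming NumPy, with small arbitrary example arrays):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])
    print(np.dot(a, b))        # inner product: 1*4 + 2*5 + 3*6 = 32.0

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[5.0, 6.0],
                  [7.0, 8.0]])
    print(A * B)               # element-wise (Hadamard) product
    print(A @ B)               # matrix product, C_{i,j} = sum_k A_{i,k} B_{k,j}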

Linear Transformation

Linear transformation (or linear mapping) is a mapping T: \mathbf{V} \rightarrow \mathbf{W} between two vector spaces that preserves the operations of vector addition and scalar multiplication.

  • Additivity: T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})
  • Scalar multiplication: T(c \mathbf{u}) = c T(\mathbf{u})

Linear transformations can be represented by matrices. If T: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} is a linear mapping and \mathbf{x} \in \mathbb{R}^{n} is a column vector, then

T(\mathbf{x}) = \mathbf{A}\mathbf{x}

for some matrix \mathbf{A} \in \mathbb{R}^{m \times n}, called the transformation matrix of T.
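
For instance, a 2-D rotation is a linear transformation. The sketch below (assuming NumPy; the angle is an arbitrary example) builds its transformation matrix and applies it as T(x) = Ax:

    import numpy as np

    theta = np.pi / 2                                 # rotate by 90 degrees
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # transformation matrix

    x = np.array([1.0, 0.0])
    y = A @ x                                         # T(x) = A x
    print(np.round(y, 6))                             # [0. 1.]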

Affine Transformation

Affine transformation is the composition of a linear transformation (with matrix \mathbf{A}) and a translation (by vector \mathbf{b}). It can be represented as:

\mathbf{y} = T(\mathbf{x}) = \mathbf{A} \mathbf{x} + \mathbf{b}

It can also be represented in homogeneous coordinates as:

\begin{bmatrix} \mathbf{y} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{A} & \mathbf{b} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}
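
A minimal sketch (assuming NumPy; A and b are arbitrary examples) of the same affine map computed both directly and via homogeneous coordinates:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [0.0, 1.0]])          # linear part (an arbitrary shear)
    b = np.array([3.0, 4.0])            # translation vector
    x = np.array([1.0, 1.0])

    y_direct = A @ x + b                # y = A x + b

    H = np.block([[A, b[:, None]],                       # [[A, b],
                  [np.zeros((1, 2)), np.ones((1, 1))]])  #  [0, 1]]
    y_homog = (H @ np.append(x, 1.0))[:2]

    print(y_direct, y_homog)            # both give [6. 5.]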

Affine Transformation Examples in 2-D

  • Stretching
  • Squeezing
  • Rotation
  • Shearing
  • Reflection
  • Orthogonal projection

Illustration of the effect of applying various 2D affine transformation matrices on a unit square. Image by Cmglee, licensed under CC BY-SA 3.0.

Norms

p-norm: let p \ge 1 be a real number. The p-norm (also called \ell_p-norm) of a vector \mathbf{x} = [x_1, \ldots, x_n] is defined as:

\| \mathbf{x} \|_p \coloneqq \left( \sum_{i=1}^{n} | x_i |^p \right)^{1/p}

  • Taxicab norm or Manhattan norm: also called \ell_{1}-norm.

    \| \mathbf{x} \|_1 = \sum_{i=1}^{n} | x_i |

  • Euclidean norm: also called \ell_{2}-norm.

    \| \mathbf{x} \|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}

Norms

  • Maximum norm (a special case of the infinity norm, also called uniform norm or supremum norm):

\| \mathbf{x} \|_{\infty} = \max \left( |x_1|, \ldots, |x_n| \right)

  • \ell_0 "norm": defining 0^0 = 0, the zero "norm" counts the number of non-zero entries of a vector (see the NumPy sketch below).

\| \mathbf{x} \|_0 = |x_1|^0 + |x_2|^0 + \cdots + |x_n|^0
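
These norms are available through np.linalg.norm; a minimal sketch (assuming NumPy, with an arbitrary example vector):

    import numpy as np

    x = np.array([3.0, -4.0, 0.0])

    print(np.linalg.norm(x, 1))       # l1 norm: |3| + |-4| + |0| = 7.0
    print(np.linalg.norm(x, 2))       # l2 (Euclidean) norm: 5.0
    print(np.linalg.norm(x, np.inf))  # maximum norm: 4.0
    print(np.count_nonzero(x))        # l0 "norm": number of non-zero entries = 2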

References

Calculus

  • Derivative and differentiation
  • Chain rule
  • Gradient
  • Automatic differentiation

Derivative and Differentiation

Crowell and Slesnick's Calculus with Analytic Geometry. Copyright (C) 2008 Peter G. Doyle

Derivative and Differentiation

The slope of the tangent line to the graph of f at P can be expressed as the limit of the slope of the secant line L_t:

\lim_{t \rightarrow 0} m(P, Q(t)) = \lim_{t \rightarrow 0} \frac{f(a+t)-f(a)}{t}

If the limit exists, then f is differentiable at a.

The derivative of f at a:

f'(a) \coloneqq \lim_{t \rightarrow 0} \frac{f(a+t)-f(a)}{t}

Derivative rules

Notations

  • Lagrange's notation: f'
  • Leibniz's notation: \frac{df}{dx}
  • Newton's notation: \dot{f}
  • Euler's notation: D_x f(x)

Chain Rule

If a variable y depends on the variable u, which itself depends on the variable x, then y depends on x as well, via the intermediate variable u. The derivative is:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

e.g. Consider the function:

y = e^{-x}

It can be decomposed as:

y = e^{u}, \quad u = -x

And the derivatives are:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = e^{u} \cdot (-1) = -e^{-x}
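
The same decomposition can be checked symbolically; a minimal SymPy sketch (assuming SymPy is installed; the symbol names are arbitrary):

    import sympy as sp

    x, u = sp.symbols('x u')

    y_outer = sp.exp(u)                      # y = e^u
    u_inner = -x                             # u = -x

    dy_du = sp.diff(y_outer, u)              # e^u
    du_dx = sp.diff(u_inner, x)              # -1
    dy_dx = dy_du.subs(u, u_inner) * du_dx   # chain rule: -e^{-x}

    print(dy_dx)                             # -exp(-x)
    print(sp.diff(sp.exp(-x), x))            # same result computed directly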

Gradient

For a scalar-valued differentiable function f: \mathbb{R}^{n} \rightarrow \mathbb{R}, its gradient
\nabla f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n} is defined at the point p = (x_1, \ldots, x_n) in n-dimensional space as the vector:

\nabla f(p) = \left\langle \frac{\partial f}{\partial x_1}(p), \ldots, \frac{\partial f}{\partial x_n}(p) \right\rangle.

e.g. Consider the function f(x_1, x_2) = x_1^2 + 2x_2; the gradient function is:

\nabla f(p) = \left\langle \frac{\partial f}{\partial x_1}(p), \frac{\partial f}{\partial x_2}(p) \right\rangle = \langle 2x_1, 2 \rangle

At the point p = (0, 1), the gradient is:

\nabla f(p) = \langle 2 \cdot 0, 2 \rangle = \langle 0, 2 \rangle
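
A minimal sketch (assuming NumPy; the helper names are arbitrary) that evaluates this gradient and checks it against a finite-difference approximation at p = (0, 1):

    import numpy as np

    def f(x):
        return x[0] ** 2 + 2 * x[1]           # f(x1, x2) = x1^2 + 2 x2

    def grad_f(x):
        return np.array([2 * x[0], 2.0])      # analytic gradient <2 x1, 2>

    p = np.array([0.0, 1.0])
    print(grad_f(p))                          # [0. 2.]

    # finite-difference check (forward differences with a small step t)
    t = 1e-6
    approx = np.array([(f(p + t * e) - f(p)) / t for e in np.eye(2)])
    print(np.round(approx, 4))                # approximately [0. 2.]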

Computing Derivatives

To compute derivatives, there are some ways:

  • Numerical differentiation
  • Symbolic differentiation
  • Automatic differentiation

Numerical Differentiation

The derivative of f: \mathbb{R}^n \rightarrow \mathbb{R}:

\begin{aligned} \frac{\partial f(\mathbf{x})}{\partial x_i} & \coloneqq \lim_{t \rightarrow 0} \frac{f(\mathbf{x} + t \mathbf{e}_i) - f(\mathbf{x})}{t} \\ & \approx \frac{f(\mathbf{x} + t \mathbf{e}_i) - f(\mathbf{x})}{t} \quad \text{(choose a small } t \text{)} \end{aligned}

  • Approximate solution
  • Drawbacks
    • O(n) evaluations
    • Truncation error (if t is too large)
    • Rounding error (if t is too small)
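
A minimal sketch (assuming NumPy) of the forward-difference formula above, illustrating how the choice of t trades off the two error sources, using sin as an arbitrary test function:

    import numpy as np

    def forward_diff(f, x, t):
        return (f(x + t) - f(x)) / t          # finite-difference approximation

    x0 = 1.0
    exact = np.cos(x0)                        # true derivative of sin at x0

    for t in (1e-1, 1e-6, 1e-12):
        approx = forward_diff(np.sin, x0, t)
        print(f"t={t:.0e}  error={abs(approx - exact):.2e}")
    # large t -> truncation error dominates
    # tiny t  -> rounding (cancellation) error dominates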

Symbolic Differentiation

  • Computer Algebra Systems (CAS)
  • Exact solution
  • Drawbacks
    • Need to implement in CAS
    • Cannot handle functions that include control-flow statements
    • Expression swell, e.g.
      ([(x0, x**2),
        (x1, (3*x + x0 + 4)**2),
        (x2, (9*x + 3*x0 + x1 + 16)**2),
        (x3, (27*x + 9*x0 + 3*x1 + x2 + 52)**2),
        (x4, (81*x + 27*x0 + 9*x1 + 3*x2 + x3 + 160)**2)],
      [729*x + 243*x0 + 81*x1 + 27*x2 + 9*x3 + 3*x4 + (243*x + 81*x0 + 27*x1 + 9*x2 + 3*x3 + x4 + 484)**2 + 1456])
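
The nested listing above is typical of such swell. As a rough illustration, here is a hedged SymPy sketch (assuming SymPy; the repeatedly composed quadratic is an arbitrary choice, not the exact expression behind the listing) showing how a symbolic derivative blows up, with sympy.cse used to expose shared subexpressions:

    import sympy as sp

    x = sp.symbols('x')

    expr = x
    for _ in range(6):
        expr = expr ** 2 + 3 * expr + 4     # compose a simple quadratic repeatedly

    d = sp.diff(expr, x)                    # symbolic derivative: very large expression
    print(sp.count_ops(d))                  # operation count grows quickly

    replacements, reduced = sp.cse(d)       # common-subexpression elimination
    print(len(replacements))                # number of shared subexpressions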
      

Automatic Differentiation

Two modes:

  • Forward mode
  • Reverse mode

Three-part Notation

A function f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} is constructed using intermediate variables v_i such that

\begin{cases} v_{i-n} = x_i, & i = 1, \ldots, n & \text{are the input variables} \\ v_i, & i = 1, \ldots, l & \text{are the working (intermediate) variables} \\ y_{m-i} = v_{l-i}, & i = m-1, \ldots, 0 & \text{are the output variables} \end{cases}

Forward Mode

For computing the derivative of f with respect to x_1, we start by associating with each intermediate variable v_i a derivative

\dot{v}_i = \frac{\partial v_i}{\partial x_1}

Example

Consider the evaluation trace of the function

y = f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18, 1-43.
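
A minimal Python sketch of this forward-mode trace (the variable numbering is simplified relative to the v_{-1}, v_0, ... convention in Baydin et al., and the evaluation point (x1, x2) = (2, 5) is an assumed example), computing y and the tangent dy/dx1 in one pass:

    import math

    x1, x2 = 2.0, 5.0
    dx1, dx2 = 1.0, 0.0                 # tangents: differentiate w.r.t. x1

    # primal trace                      # tangent trace (dot variables)
    v1 = math.log(x1);   dv1 = dx1 / x1
    v2 = x1 * x2;        dv2 = dx1 * x2 + x1 * dx2
    v3 = math.sin(x2);   dv3 = math.cos(x2) * dx2
    v4 = v1 + v2;        dv4 = dv1 + dv2
    v5 = v4 - v3;        dv5 = dv4 - dv3

    y, dy_dx1 = v5, dv5
    print(y)                            # ln(2) + 10 - sin(5)
    print(dy_dx1)                       # 1/x1 + x2 = 5.5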

Computation Graph

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18, 1-43.

Efficiency

For a function f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}, the forward mode costs n evaluations (one pass per input variable).

  • It is efficient when n \ll m.

Reverse Mode

The reverse mode propagates derivatives backward from a given output by complementing each intermediate variable v_i with an adjoint

\bar{v}_i = \frac{\partial y_j}{\partial v_i}

e.g., the variable v_0 can affect y through affecting v_2 and v_3, so its contribution to the change in y is given by

\frac{\partial y}{\partial v_{0}} = \frac{\partial y}{\partial v_{2}} \frac{\partial v_{2}}{\partial v_{0}} + \frac{\partial y}{\partial v_{3}} \frac{\partial v_{3}}{\partial v_{0}} \quad \text{or} \quad \bar{v}_{0} = \bar{v}_{2} \frac{\partial v_{2}}{\partial v_{0}} + \bar{v}_{3} \frac{\partial v_{3}}{\partial v_{0}}.

Example

Consider the evaluation trace of the function

y = f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)
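
A minimal Python sketch of the reverse-mode computation for this function (again at the assumed point (x1, x2) = (2, 5), with simplified variable numbering): a forward sweep records the intermediate values, then a backward sweep accumulates the adjoints of both inputs in a single pass:

    import math

    x1, x2 = 2.0, 5.0

    # forward sweep: record intermediate values
    v1 = math.log(x1)
    v2 = x1 * x2
    v3 = math.sin(x2)
    v4 = v1 + v2
    v5 = v4 - v3        # y

    # backward sweep: adjoints bar(v) = dy/dv
    bv5 = 1.0
    bv4 = bv5           # v5 = v4 - v3
    bv3 = -bv5
    bv1 = bv4           # v4 = v1 + v2
    bv2 = bv4
    bx2 = bv2 * x1 + bv3 * math.cos(x2)   # v2 = x1*x2, v3 = sin(x2)
    bx1 = bv1 / x1 + bv2 * x2             # v1 = ln(x1), v2 = x1*x2

    print(bx1, bx2)     # dy/dx1 = 1/x1 + x2 = 5.5, dy/dx2 = x1 - cos(x2)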

Efficiency

For a function f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}, the reverse mode costs m evaluations (one backward pass per output variable).

  • It is efficient when m \ll n.
  • The storage cost grows in proportion to the number of operations in the evaluated function.

References