DictionaryStatistics, Information Theory and MLEntropy, conditional entropy, and mutual informationI(X;Y)≡H(X)−H(X∣Y)≡H(Y)−H(Y∣X)≡H(X)+H(Y)−H(X,Y)≡H(X,Y)−H(X∣Y)−H(Y∣X) Expected score in MLE is zerof(z;θ) is the density function, for data z, and parameter vector θ, so ∫f(z;θ)dz=1 for any θ. ∫∂f(z;θ)∂θdz=0⇔∫f(z;θ)f(z;θ)∂f(z;θ)∂θdz=0⇔∫f(z;θ)∂logf(z;θ)∂θdz=0 Fisher informationThe Fisher information is defined to be the variance of the score: I(θ)=E[(∂∂θlogf(X;θ))2|θ]=∫R(∂∂θlogf(x;θ))2f(x;θ)dx=−E[∂2∂θ2logf(X;θ)|θ], The Fisher information may be seen as the curvature of the support curve (the graph of the log-likelihood). Near the maximum likelihood estimate, low Fisher information therefore indicates that the maximum appears “blunt”, that is, the maximum is shallow and there are many nearby values with a similar log-likelihood. Conversely, high Fisher information indicates that the maximum is sharp. Relation of Fisher information to relative entropyFisher information is related to relative entropy. The KL-divergence between two distributions in the family can be written as D(θ,θ′)=KL(p(⋅;θ):p(⋅;θ′))=∫f(x;θ)logf(x;θ)f(x;θ′)dx. If θ is fixed, then the relative entropy between two distributions of the same family is minimized at θ′=θ. For θ′ close to θ, one may expand the previous expression in a series up to second order: D(θ,θ′)=12(θ′−θ)T(∂2∂θ′i∂θ′jD(θ,θ′))θ′=θ(θ′−θ)+o((θ′−θ)2) But the second order derivative can be written as (∂2∂θ′i∂θ′jD(θ,θ′))θ′=θ=−∫f(x;θ)(∂2∂θ′i∂θ′jlog(f(x;θ′)))θ′=θdx=[I(θ)]i,j Thus the Fisher information represents the curvature of the relative entropy of a conditional distribution with respect to its parameters. Entropy, cross entropy and KL divergenceAssume two distributions p and q
H(p)=−Ep[logp]=−∫p(y)logp(y)dy
H(p,q)=−Ep[logq]=−∫p(y)logq(y)dy
DKL(p‖
H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q) Logistic regressionThe predicted probability of the output h_{w}(x)={\displaystyle \widehat{P}(y=1|x)\equiv{\frac{1}{1+e^{-w^{\top}x}}}} {\displaystyle 1-h_{w}(x)}=1-\widehat{P}(y=1|x)=\widehat{P}(y=0|x) The logistic loss is L(y,h_{w})=\begin{cases} -\log(h_{w}) & \text{if }y=1\\ -\log(1-h_{w}) & \text{if }y=0 \end{cases} which is the negative log probability that y is classified as its correct class. Since y\in\{0,1\} L(y,h_{w}(x))=-y\log(h_{w}(x))-(1-y)\log(1-h_{w}(x)) Sum over n samples L(w)=\sum_{i=1}^{n}\left[-y_{i}\log(h_{w}(x_{i}))-(1-y_{i})\log(1-h_{w}(x_{i}))\right] AnalysisHilbert spaceEuclidean space \subset Hilbert space \subset Banach \subset Metric space log-sum-exponentialThe log-sum-exponential (LSE) function is commonly used as a smooth approximation of the maximum function. The LSE function is defined as: \text{LSE}(x_1, x_2, \dots, x_n) = \log\left( \sum_{i=1}^n e^{x_i} \right) This function has properties that make it a smooth and differentiable approximation of the maximum of the inputs x_i. Specifically, as the inputs become larger, the exponentials of the smaller x_i become negligible compared to the exponential of the maximum x_{\text{max}} , and the LSE converges to x_{\text{max}}: \lim_{k \to \infty} \frac{1}{k} \log\left( \sum_{i=1}^n e^{k x_i} \right) = \max(x_1, x_2, \dots, x_n) This property is particularly useful in optimization and machine learning because it allows for gradient-based optimization methods. The LSE function is convex and differentiable everywhere, which facilitates the computation of gradients. Example: Suppose you have two numbers, x and y . The maximum function is not differentiable at points where x = y , but the LSE function provides a smooth approximation: \text{LSE}(x, y) = \log\left( e^x + e^y \right) This function is smooth and has continuous derivatives with respect to x and y. Connection between log-sum-exponential and softmaxConsider the LSE function: f(x) = \log\left( \sum_{i=1}^{n} e^{x_i} \right) The partial derivative with respect to x_k is: \frac{\partial f}{\partial x_k} = \frac{e^{x_k}}{\sum_{j=1}^{n} e^{x_j}} = \text{Softmax}(x_k) This shows that the softmax function emerges naturally as the gradient of the LSE function. Computing the softmax function directly can lead to numerical instability due to large exponentials. By leveraging the LSE function, we can rewrite the softmax function to improve numerical stability: \text{Softmax}(x_i) = e^{x_i - \text{LSE}(x_1, x_2, \dots, x_n)} Stirling's approximationMore precise formula n! \sim \sqrt{2 \pi n}\left(\frac{n}{e}\right)^n Tangent coneDefine the tangent cone of a convex set \mathcal{C} at a point z to be the set \mathcal{K}_{\mathcal{C}}(z) = \{ \Delta \in \mathbb{R}^n: \Delta = t(x - z), \; \text{for some } t \geq 0, x \in \mathcal{C} \} For example, If the set is \mathcal{C} = \{ \thet\overline{{A}^{\top}} \in \mathbb{R}^n : \| \thet\overline{{A}^{\top}} \|_1 \leq 1 \}
Let K be a closed convex subset of a real vector space V and \partial K be the boundary of K. The solid tangent cone to K at a point x \in \partial K is the closure of the cone formed by all half-lines (or rays) emanating from x and intersecting K in at least one point y distinct from x. It is a convex cone in V and can also be defined as the intersection of the closed half-spaces of V containing K and bounded by the supporting hyperplanes of K at x. Sum of the supremumSee that f(\theta) \le \sup_{\Theta}f(\theta) and g(\theta) \le \sup_{\Theta}g(\theta) for all \theta \in \Theta f(\theta)+g(\theta) \le \sup_{\Theta} \left ( f(\theta)+g(\theta) \right ) \le \sup_{\Theta}f(\theta)+\sup_{\Theta}g(\theta) AlgebraHolder's inequality\|fg\|_1 \le \|f\|_p \|g\|_q Special case, \|fg\|_1 \le \|f\|_2 \|g\|_2 Triangular inequalityThe sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. \|x\|+\|y\| \geq \|x+y\|
Reverse triangle inequalityAny side of a triangle is greater than the difference between the other two sides. \bigg|\|x\|-\|y\|\bigg| \leq \|x-y\| The proof for the reverse triangle uses the regular triangle inequality, \|x\| = \|(x-y) + y\| \leq \|x-y\| + \|y\| \Rightarrow \|x\| - \|y\| \leq \|x-y\| \|y\| = \|(y-x) + x\| \leq \|y-x\| + \|x\| \Rightarrow \|x\| - \|y\| \geq -\|x-y\| Vector norm inequality\left\| x \right\|_\infty \le \left\| x \right\|_2 \le \left\| x \right\|_1 \le \sqrt{n} \left\| x \right\|_2 \le n \left\| x \right \|_\infty Frobenius inner productThe Frobenius inner product is defined as, \langle {A}, {B} \rangle_\mathrm{F} =\sum_{i,j}\overline{A_{ij}} B_{ij} \, = \mathrm{Tr}\left(\overline{{A}^{\top}} {B}\right) = \mathrm{Tr}\left((\overline{A})^{\top} {B}\right) When A is real \langle {A}, {B} \rangle_\mathrm{F} =\sum_{i,j}A_{ij} B_{ij} \, = \mathrm{Tr}\left({A}^{\top} {B}\right) Also we have \|A\|^{2}_{F} = \langle {A}, {A} \rangle_\mathrm{F} = \mathrm{Tr}\left({A}^{\top} {A}\right) = \mathrm{Tr}\left({A} {A}^{\top}\right) TraceTrace of a productIf A and B are two m \times n real matrices \operatorname{tr}\left({A}^\mathsf{T}{B}\right) = \operatorname{tr}\left({A}{B}^\mathsf{T}\right) = \operatorname{tr}\left({B}^\mathsf{T}{A}\right) = \operatorname{tr}\left({B}{A}^\mathsf{T}\right) = \sum_{i=1}^m \sum_{j=1}^n a_{ij}b_{ij} \; . Additionally, for real column vectors {a}\in\mathbb{R}^n and {b}\in\mathbb{R}^n, the trace of the outer product is equivalent to the inner product: \operatorname{tr}\left({b}{a}^\textsf{T}\right) = {a}^\textsf{T}{b} The symmetry of the Frobenius inner product may be phrased more directly as follows: the matrices in the trace of a product can be switched without changing the result. If A and B are m \times n and n \times m real or complex matrices, respectively, then \operatorname{tr}({A}{B}) = \operatorname{tr}({B}{A}) Cyclic propertyMore generally, the trace is invariant under circular shifts, that is, \operatorname{tr}({A}{B}{C}{D}) = \operatorname{tr}({B}{C}{D}{A}) = \operatorname{tr}({C}{D}{A}{B}) = \operatorname{tr}({D}{A}{B}{C}). Arbitrary permutations are not allowed: in general, \operatorname{tr}({A}{B}{C}) \ne \operatorname{tr}({A}{C}{B}). However, if products of ’'three’’ symmetric matrix matrices are considered, any permutation is allowed, since: \operatorname{tr}({A}{B}{C}) = \operatorname{tr}\left(\left({A}{B}{C}\right)^{\mathsf T}\right) = \operatorname{tr}({C}{B}{A}) = \operatorname{tr}({A}{C}{B}), where the first equality is because the traces of a matrix and its transpose are equal. Note that this is not true in general for more than three factors. Trace as the sum of eigenvaluesGiven any n \times n real or complex matrix A, there is \operatorname{tr}({A}) = \sum_{i=1}^n \lambda_i where \lambda_1,\ldots,\lambda_n are the eigenvalues of A counted with multiplicity. This holds true even if A is a real matrix and some (or all) of the eigenvalues are complex numbers. Trace of the block matrixFor a block matrix A {A} = \begin{bmatrix} {A}_{11} & {A}_{12} & \cdots & {A}_{1q} \\ {A}_{21} & {A}_{22} & \cdots & {A}_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ {A}_{q1} & {A}_{q2} & \cdots & {A}_{qq} \end{bmatrix} We have \operatorname{tr}(A)=\operatorname{tr}(A_{11})+\operatorname{tr}(A_{22})+\cdots+\operatorname{tr}(A_{qq}) Projection matrixA square matrix P is called a projection matrix iff P^2 = P. A square matrix P is called an orthogonal projection matrix if P^2 = P = P^{\top} for a real matrix, and respectively P^2 = P = \overline{{P}^{\top}} for a complex matrix. If P is a projection, so is I-P. Induced norm or operator norm\|A\| _p = \sup_{x \ne 0} \frac{\| A x\| _p}{\|x\|_p} The induced operator norm shows by how much the “length” vector is stretched after applying the linear transformation A, since \|Ax\|_{p} \leq \|A\|_{p}\|x\|_{p} Any induced operator norm is a sub-multiplicative matrix norm
i.e. it satisfies the following property \|AB\|_p \leq \|A\|_p \|B\|_p Since one can readily check \|Ax\|_{p}\leq\|A\|_{p}\|x\|_{p} Apply this property twice, we get \|ABx\|_{p}\leq\|A\|_{p}\|Bx\|_{p}\leq\|A\|_{p}\|B\|_{p}\|x\|_{p} Since this holds for any x \|AB\|_{p} = \sup_{x\neq 0} \frac{\|ABx\|_{p}}{\|x\|_{p}} \leq\|A\|_{p}\|B\|_{p} In the special cases of {\displaystyle p=1,2,\infty ,} the induced matrix norms can be computed or estimated by Column sum normp=1 Column sum \|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^m | a_{ij} | Row sum normp=\infty Row sum norm \|A\|_\infty = \max_{1 \leq i \leq m} \sum _{j=1}^n | a_{ij} | Spectral normp=2 Spectral norm \|A\|_2 = \sqrt{\lambda_{\max}\left(\overline{{A}^{\top}} A\right)} = \sigma_{\max}(A) “Entry-wise” matrix normsThese norms treat an m\times n matrix as a vector of size mn, and use one of the familiar vector norms. \| A \|_{p,p} = \| \mathrm{vec}(A) \|_p = \left( \sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^p \right)^{1/p} Frobenius normThe special case p = 2 is the Frobenius norm, \| A \|_{2,2} = \| \mathrm{vec}(A) \|_2 = \left( \sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2 \right)^{1/2} = \|A\|_{F} Max normThe max norm is the elementwise norm in the limit as p \rightarrow \infty . \| A \|_{\infty,\infty} = \| \mathrm{vec}(A) \|_\infty = \left( \sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^\infty \right)^{1/\infty} = \max_{ij} |a_{ij}| = \|A\|_{\max} The max norm is NOT sub-multiplicative.
\|AB\|_{\max} \nleq \|A\|_{\max} \|B\|_{\max} L_{p,q} normFor p,q\geq 1, the L_{2,1} norm can be generalized to the L_{p,q} norm as follows: \| A \|_{p,q} = \left(\sum_{j=1}^n \left( \sum_{i=1}^m |a_{ij}|^p \right)^{\frac{q}{p}}\right)^{\frac{1}{q}} Schatten normsThe Schatten p-norms arise when applying the p-norm to the vector of singular values of a matrix \|A\|^{\text{sh}}_p = \left( \sum_{i=1}^{\min\{m,n\}} \sigma_{i}^p(A) \right)^{\frac{1}{p}} Trace (Nuclear) normp = 1 Nuclear norm, Trace norm \|A\|^{\text{sh}}_1 = \|A\|_{*} = \operatorname{trace} \left(\sqrt{\overline{{A}^{\top}}A}\right) = \sum_{i=1}^{\min\{m,n\}} \sigma_{i}(A) Frobenius normp = 2 Frobenius norm \|A\|^{\text{sh}}_2 = \|A\|_\text{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2} = \sqrt{\operatorname{trace}\left(\overline{{A}^{\top}} A\right)} = \sqrt{\sum_{i=1}^{\min\{m, n\}} \sigma_i^2(A)} Spectral normp = \infty spectral norm \|A\|^{\text{sh}}_{\infty} = \|A\|_2 = \sqrt{\lambda_{\max}\left(\overline{{A}^{\top}} A\right)} = \sigma_{\max}(A) For matrix {\displaystyle A\in \mathbb {R} ^{m\times n}} of rank {\displaystyle r}, the following inequalities hold {\displaystyle \|A\|_{2}\leq \|A\|_{F}\leq {\sqrt {r}}\|A\|_{2}} {\displaystyle \|A\|_{F}\leq \|A\|_{*}\leq {\sqrt {r}}\|A\|_{F}} The above to can be written as {\displaystyle \|A\|_{2}\leq \|A\|_{F}\leq \|A\|_{*} \leq {\sqrt {r}}\|A\|_{F}} \leq {r\|A\|_{2}} Other matrix norm inequalitiesWe have {\displaystyle {\begin{aligned} \|AB\|_{\max} & \leq\max_{i,k}|\sum_{j}a_{ij}b_{jk}| \\ & \leq\max_{i,k}\sum_{j}|a_{ij}||b_{jk}| \\ & \leq\max_{i,k}\sum_{j}|a_{ij}|(\max_{j}|b_{jk}|)\\ & =\max_{i}\sum_{j}|a_{ij}|\max_{j,k}|b_{jk}|\\ & =\|A\|_{\infty}\|B\|_{\max}\\ & =\|A^{\top}\|_{1}\|B\|_{\max} \end{aligned}}} since \|A\|_{\infty} = \|A^{\top}\|_{1}. We also have {\displaystyle {\begin{aligned} \|AB\|_{\max} & \leq\max_{i,k}|\sum_{j}a_{ij}b_{jk}| \\ & \leq\max_{i,k}\sum_{j}|a_{ij}||b_{jk}| \\ & \leq\max_{i,k}\sum_{j}(\max_{j}|a_{ij}|)|b_{jk}|\\ & =\max_{k}\sum_{j}(\max_{i,j}|a_{ij}|)|b_{jk}|\\ & =(\max_{i,j}|a_{ij}|) \max_{k}\sum_{j}|b_{jk}|\\ & =\|A\|_{\max}\|B\|_{1}\\ & =\|A\|_{\max}\|B^{\top}\|_{\infty} \end{aligned}}} since \|A\|_{\infty} = \|A^{\top}\|_{1}. We also have {\displaystyle \|A\|_{\max }\leq \|A\|_{2}\leq {\sqrt {mn}}\|A\|_{\max }} {\displaystyle {\frac {1}{\sqrt {n}}}\|A\|_{\infty }\leq \|A\|_{2}\leq {\sqrt {m}}\|A\|_{\infty }} {\displaystyle {\frac {1}{\sqrt {m}}}\|A\|_{1}\leq \|A\|_{2}\leq {\sqrt {n}}\|A\|_{1}} \|A\|_2\le\sqrt{\|A\|_1\|A\|_\infty} The gradient of log det matrix inverse\log \det X^{-1} = \log (\det X)^{-1} = -\log \det X \frac{\partial}{\partial X_{ij}} \log \det X^{-1} = -\frac{\partial}{\partial X_{ij}} \log \det X = - \frac{1}{\det X} \frac{\partial \det X}{\partial X_{ij}} = - \frac{1}{\det X} \mathrm{adj}(X)_{ji} = - (X^{-1})_{ji} ProbabilityBoole's inequalityAlso known as the union bound \mathbb{P}(\bigcup_{i} A_i) \le \sum_i {\mathbb P}(A_i) Conditional expectation and CovarianceBinomial and multinomial coefficientMultinomial coefficientThe combinatorial interpretation of multinomial coefficients is distribution of n distinguishable elements over m (distinguishable) containers, each containing exactly k_i elements, where i is the index of the container. {n \choose k_1, k_2, \ldots, k_m} = \frac{n!}{k_1!\, k_2! \cdots k_m!}, Binomial coefficientThe binomial coefficient is the special case where k items go into the chosen bin and the remaining n-k items go into the unchosen bin: \binom nk = {n \choose k, n-k} = \frac{n!}{k!(n-k)!}. Expected value of quadratic formLet X = (X_i) be an n \times 1 random vector and let and let A be an n\times n symmetric matrix. If \mathbb{E}[X] = \mu and \operatorname{Var}[X] = \Sigma = (\sigma_{ij}) then \mathbb{E}[X^{\top} AX] = \operatorname{tr}(A\Sigma) + \mu^{\top} A\mu StatisticsPrecision and RecallRecall = Power = Sensitivity = True positive ratePrecision = 1 - False discovery rateIndependence, mean independence and uncorrelatednessStatistical Independence \Longrightarrow Mean Independence \Longrightarrow Uncorrelatedness P(Y|X) = P(X) \Longrightarrow E(Y|X) = E(Y) \Longrightarrow cov(X,Y)=0 The arrows can not be reversed. Statistical genetics |