\(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\DeclareMathOperator*{\argmin}{arg\,min}\)

Learning Theory


Lecture notes and solved exercises for CS-526 Learning Theory 2023, together with attempts to understand some theoretical results from the literature.

1. PAC Learning

The idea of PAC learning is to describe which hypothesis classes are statistically learnable, in the sense that an algorithm can, with enough data, propose a hypothesis that comes arbitrarily close to the best achievable accuracy on a labeling task, with arbitrarily high probability.

Definition (Instance and label domains) Data comes to you from a set \(\mathcal{X}\) and the task is to assign a label from a set \(\mathcal{Y}\). The sample set is \(\mathcal{Z}=\mathcal{X}\times\mathcal{Y}\).

You are presented with a list of samples \(S=(z_i)_{i=1}^m\), where each sample \(z_i=(x_i,y_i)\) is drawn independently and identically from an unknown distribution \(\mathcal{D}\) on \(\mathcal{Z}\). We denote this relationship as \(S\sim\mathcal{D}^m\). We don't necessarily assume that there is one true label for any particular instance, only that the distribution \(\mathcal{D}\) assigns some probability to a label \(y\) given an instance \(x\).

$$\mathcal{D}(x,y)=\mathcal{D}(x)\mathcal{D}(y|x)$$

The learning task is to find an algorithm \(\mathcal{A}\) that trains on \(S\) and outputs a hypothesis that best fits the data, in the sense that it picks the most likely labels. If you could choose any hypothesis at all, you would almost certainly overfit the data you are given. We therefore typically restrict the hypotheses you can choose from.

Definition (Hypothesis class) A learner chooses in advance a set of predictors called the hypothesis class \(\mathcal{H}\). In the deterministic setting, \(h \in \mathcal{H}\) is a mapping from \(\mathcal{X}\) to \(\mathcal{Y}\). In the non-deterministic setting, \(h\) maps \(\mathcal{X}\) to \(\Delta\mathcal{Y}\), the set of distributions over the label set.
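
For example, in binary classification on \(\mathcal{X}=\mathbb{R}^d\) with \(\mathcal{Y}=\{-1,+1\}\), a standard choice is the class of halfspaces $$\mathcal{H}=\{x\mapsto\operatorname{sign}(\langle w,x\rangle+b) : w\in\mathbb{R}^d,\ b\in\mathbb{R}\}.$$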

In the case where there are only two possible labels, some authors prefer the following equivalent notion.

Definition (Concept class) A concept class over \(\mathcal{X}\) is a nonempty set \(\mathcal{C}\subseteq 2^{\mathcal{X}}\).

Indeed, an element of the power set \(C\in 2^{\mathcal{X}}\) can be seen as a choice of labels "included" and "not included" on \(\mathcal{X}\). In other words, \(2^{\mathcal{X}}\) is the set of binary hypotheses.
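
For instance, with \(\mathcal{X}=\{a,b,c\}\), the subset \(C=\{a,c\}\) corresponds to the binary hypothesis \(h_C=\mathbb{1}_C\) with \(h_C(a)=1\), \(h_C(b)=0\), \(h_C(c)=1\).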

Definition (Loss function) In order to measure the quality of a hypothesis, we define a loss function \(l\) that takes a hypothesis \(h\) and a sample \(z=(x,y)\), and outputs a nonnegative real number that measures how close the prediction \(h(x)\) is to the label \(y\).

Examples of loss functions include square loss \(l(h, z)=(y-h(x))^2\) and 0-1 loss \(l(h, z)=\mathbb{1}(h(x)\neq y)\).

The loss on a single sample does not by itself measure the quality of the hypothesis: the prediction \(h(x)\) need not be close to every label \(y\) that could be drawn, only to the labels that are likely under the unknown distribution \(\mathcal{D}\). The best hypothesis minimizes the expected loss over the random variable \(Z\sim\mathcal{D}\), which is the so-called true loss \(L_\mathcal{D}\).

$$h^*=\argmin_{h\in\mathcal{H}} \mathbb{E}_{Z\sim\mathcal{D}}[l(h,Z)]:=\argmin_{h\in\mathcal{H}} L_\mathcal{D}(h)$$
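
Under the 0-1 loss and with no restriction on the predictors (i.e. \(\mathcal{H}\) contains every function from \(\mathcal{X}\) to \(\mathcal{Y}\)), for example, this minimizer picks the most probable label for each instance, $$h^*(x)=\argmax_{y\in\mathcal{Y}}\mathcal{D}(y|x),$$ and its true loss \(L_\mathcal{D}(h^*)=\mathbb{E}_X[1-\max_{y}\mathcal{D}(y|X)]\) is the irreducible error.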

We call \(h^*\) the Bayes-optimal predictor; it is described in more detail in exercise (?). Of course, in reality, \(\mathcal{D}\) is unknown, so we employ a proxy called the empirical loss \(L_S\), which is nothing other than the average loss over the set \(S\).

$$L_S(h)=\frac{1}{m}\sum_{j=1}^m l(h, z_j)$$

Finally, we come to the most important definition of learning theory, which formalizes what it means for a hypothesis class to be learnable. In this setting, the algorithm \(\mathcal{A}\) and the hypothesis class \(\mathcal{H}\) are both deterministic.

Definition (PAC learnable) A hypothesis class \(\mathcal{H}\) is probably-approximately-correct (PAC) learnable with respect to a sample set \(\mathcal{Z}\) and a loss function \(l\) iff there exist an algorithm \(\mathcal{A}\) and a minimum sample size function \(m_\mathcal{H}:]0,1[^2\rightarrow\mathbb{N}\) such that for every \((\epsilon, \delta)\in]0,1[^2\), any distribution \(\mathcal{D}\) on \(\mathcal{Z}\), and every \(m\geq m_\mathcal{H}(\epsilon,\delta)\), the true loss of the hypothesis \(\mathcal{A}(S)\), where \(S\sim\mathcal{D}^m\), is within a tolerance \(\epsilon\) of the true loss of the best hypothesis, with probability at least \(1-\delta\).

$$\mathbb{P}_S\{L_\mathcal{D}(\mathcal{A}(S))\leq L_\mathcal{D}(h^*)+\epsilon\}\geq 1-\delta$$

In plain English: fix your architecture (the hypothesis class), your loss function, and your data domain. We say the architecture is learnable if there is an algorithm and some finite amount of training data with which the algorithm outputs a model whose score is arbitrarily close to the best model's score, with arbitrarily high probability.

In binary classification with linear separators, for example, if the two classes can be cleanly separated by a straight line, then that separator achieves a true loss of zero. The assumption that such a perfect hypothesis exists is called realizability.

Definition (Agnostic / Realizable) The realizability assumption is that there indeed exists a hypothesis in \(\mathcal{H}\) that achieves a true loss of zero. Learning without this assumption is called agnostic learning.

The size of the hypothesis class is important for whether or not it is PAC learnable. For example, the hypothesis class with one hypothesis \(\mathcal{H}=\{h_0\}\) is trivially PAC learnable. We will see that finite classes are always PAC learnable, whereas the situation is more complicated for infinite classes.

Exercise 1.1 (Monotonicity of Sample Complexity) [Shalev-Shwartz, Ben-David] Let \(\mathcal{H}\) be a hypothesis class for a binary classification task. Suppose that \(\mathcal{H}\) is PAC learnable and its sample complexity is given by \(m_\mathcal{H}(\cdot,\cdot)\). (i) Show that \(m_\mathcal{H}\) is monotonically nonincreasing in each of its parameters. That is, show that given \(\delta\in]0,1[\), and given \(0< \epsilon_1\leq\epsilon_2< 1\), we have that \(m_\mathcal{H}(\epsilon_1,\delta)\geq m_\mathcal{H}(\epsilon_2,\delta)\). (ii) Similarly, show that given \(\epsilon\in]0,1[\), and given \(0< \delta_1\leq\delta_2< 1\), we have that \(m_\mathcal{H}(\epsilon,\delta_1)\geq m_\mathcal{H}(\epsilon,\delta_2)\).

Show solution

2. Finite Hypothesis Classes & Uniform Convergence

By the end of this section, we'll have a way to roughly estimate the number of training samples you need to get a model that's as accurate as you like. We mentioned that since \(\mathcal{D}\) is unknown, we employ the empirical loss \(L_S\) as a proxy measure of the quality of a hypothesis. An algorithm that minimizes this quantity over \(\mathcal{H}\) is an Empirical Risk Minimization (\(ERM_\mathcal{H}\)) algorithm.

Definition (Empirical risk minimizer) Denote the result of an \(ERM_\mathcal{H}\) algorithm with respect to a hypothesis class \(\mathcal{H}\) and a training set \(S\) as \(h_S:=ERM_\mathcal{H}(S):=\argmin_{h\in\mathcal{H}}L_S(h)\).
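
As an illustration, here is a minimal sketch of the \(ERM_\mathcal{H}\) rule for a finite class under the 0-1 loss. The threshold hypothesis class and the synthetic data are assumptions made purely for this example.

```python
import numpy as np

def zero_one_loss(h, x, y):
    """0-1 loss: 1 if the hypothesis mislabels the point, 0 otherwise."""
    return float(h(x) != y)

def erm(hypothesis_class, samples, loss=zero_one_loss):
    """Return a hypothesis of the (finite) class minimizing the empirical loss L_S."""
    def empirical_loss(h):
        return sum(loss(h, x, y) for x, y in samples) / len(samples)
    return min(hypothesis_class, key=empirical_loss)

# Toy example: threshold classifiers h_t(x) = 1{x >= t} on [0, 1].
thresholds = np.linspace(0.0, 1.0, 21)
H = [lambda x, t=t: int(x >= t) for t in thresholds]

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=200)
ys = (xs >= 0.3).astype(int)   # labels generated by a "true" threshold at 0.3
S = list(zip(xs, ys))

h_S = erm(H, S)                # ERM picks a threshold with minimal empirical loss
```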

We are going to work towards a corollary that says that any finite \(\mathcal{H}\) is agnostic PAC learnable with respect to a loss function bounded in \([0,1]\), with sample complexity \(m_\mathcal{H}(\epsilon,\delta)\leq\lceil\frac{2}{\epsilon^2}\ln(\frac{2|\mathcal{H}|}{\delta})\rceil\). We won't construct a specific algorithm that learns any finite \(\mathcal{H}\), but we will show that as long as the algorithm is an empirical risk minimizer, the result follows.
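
To get a feel for this bound, plug in some illustrative numbers: with \(|\mathcal{H}|=1000\), \(\epsilon=0.1\) and \(\delta=0.01\), $$m_\mathcal{H}(\epsilon,\delta)\leq\left\lceil\frac{2}{0.1^2}\ln\frac{2\cdot 1000}{0.01}\right\rceil=\lceil 200\cdot\ln(2\times 10^5)\rceil=\lceil 2441.2\rceil=2442,$$ so roughly 2,500 samples suffice for the ERM hypothesis to be within \(0.1\) of the best hypothesis's true loss, with \(99\%\) confidence.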

But first, note the difference between \(h_S\) and \(h^*\). If it so happens that \(S\) is unrepresentative of the distribution \(\mathcal{D}\) as a whole, it may be that \(h_S\) is completely different from \(h^*\). We therefore introduce the notion of an \(\epsilon\)-representative sample \(S\).

Definition (\(\epsilon\)-representative) We say that \(S\) is \(\epsilon\)-representative iff $$\forall h \in \mathcal{H}, |L_S(h)-L_\mathcal{D}(h)|\leq\epsilon.$$

This definition is closely related to the tolerance inequality found in the definition of PAC learning; the following lemma makes the connection precise once we fix the algorithm \(\mathcal{A}=ERM_\mathcal{H}\).

Lemma If \(S\) is \(\epsilon / 2\)-representative, then \(L_\mathcal{D}(h_S)\leq L_\mathcal{D}(h^*)+\epsilon\).

Show proof

In fact, once we know that we want to perform empirical risk minimization, some hypothesis classes have a convenient property: by increasing the sample size enough, we can guarantee that \(S\) is \(\epsilon\)-representative with probability at least \(1-\delta\).

Definition (Uniform convergence property) \(\mathcal{H}\) has the UC property with respect to a sample set \(\mathcal{Z}\) and loss function \(l\) iff \(\exists m_\mathcal{H}^{UC}:]0,1[^2\rightarrow\mathbb{N}\) such that \(\forall\epsilon,\delta\in]0,1[\) and \(\forall \mathcal{D}\) over \(\mathcal{Z}\), any \(S\sim\mathcal{D}^m\) with \(m\geq m_\mathcal{H}^{UC}(\epsilon,\delta)\) is \(\epsilon\)-representative with probability at least \(1-\delta\). $$\mathbb{P}_S\{\forall h \in \mathcal{H}, |L_S(h)-L_\mathcal{D}(h)|\leq\epsilon\}\geq 1-\delta$$

What's surprising is that any finite hypothesis class has the UC property as long as \(l\) is bounded. Here's the lemma and proof.

Lemma If \(|\mathcal{H}|<\infty\) and \(\forall h\in\mathcal{H}, z\in\mathcal{Z}\), the loss function is bounded as \(a\leq l(h, z) \leq b\), then \(\mathcal{H}\) has the uniform convergence property.

Show proof

The main result now follows from the UC property together with the earlier lemma: an \(\epsilon/2\)-representative sample is enough for ERM to be \(\epsilon\)-close to the best hypothesis, so \(m_\mathcal{H}(\epsilon,\delta)\leq m_\mathcal{H}^{UC}(\epsilon/2,\delta)\), and choosing the 0-1 loss (bounded in \([0,1]\)) makes the bound explicit.

Corollary (Finite classes are PAC learnable) If \(\mathcal{H}\) is finite, then it is agnostic PAC learnable with respect to any sample set and the 0-1 loss, using the ERM rule, with sample complexity \(m_\mathcal{H}(\epsilon, \delta)\leq\lceil\frac{2}{\epsilon^2}\ln{\frac{2|\mathcal{H}|}{\delta}}\rceil\).

Show proof

So how good is this prediction of the number of samples you need for a required performance? Say, for example, we are working with LeNet on the MNIST dataset with the 0-1 loss, so \(\epsilon\) represents the difference in accuracy from the best possible network. LeNet's weights are real numbers, but in practice each is represented with 64 bits, so the set of all such networks is a finite hypothesis class.
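
As a rough back-of-the-envelope sketch (the parameter count is only approximate): LeNet has on the order of \(6\times 10^4\) weights, so with 64 bits per weight \(|\mathcal{H}|\leq 2^{64\cdot 6\times 10^4}\) and \(\ln|\mathcal{H}|\approx 2.7\times 10^6\). Plugging \(\epsilon=\delta=0.01\) into the corollary gives $$m_\mathcal{H}(0.01,0.01)\leq\left\lceil\frac{2}{0.01^2}\left(\ln|\mathcal{H}|+\ln\frac{2}{0.01}\right)\right\rceil\approx 5\times 10^{10},$$ tens of billions of samples, vastly more than the 60,000 training images in MNIST, which suggests this naive bound is far too pessimistic to explain what such networks achieve in practice.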

WIP

Tensor Decomposition

Sources


  • [link] Rüdiger Urbanke, Nicolas Macris. CS-526 Learning Theory.
  • [link] Shai Shalev-Shwartz, Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms.
  • [link] Leslie Gabriel Valiant. A Theory of the Learnable.
  • [link] Gyora M. Benedek, Alon Itai. Learnability with Respect to Fixed Distributions.
  • [link] Zhiyuan Li, Yi Zhang, Sanjeev Arora. Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

WIP | Last edited 04/06/2023