October 18, 2023

Total Training Error: 0.00%

TP %: The percent of the samples seen in training that are correctly enclosed by the dragged square. 0.00%

FP %: The percent of the samples seen in training that are in the dragged square, but shouldn't be. 0.00%

FN %: The percent of the samples seen in training that are not in the dragged square, but should be. 0.00%

TN %: The percent of the samples seen in training that are correctly not in the dragged square. 0.00%

Total Testing Error: 0.00%

TP %: The percent of the full space that is true positives (the dark green area). 0.00%

FP %: The percent of the full space that is false positives (the gray area). 0.00%

FN %: The percent of the full space that is false negatives (the light green area). 0.00%

TN %: The percent of the full space that is true negatives (the area out of either squar). 0.00%

I first heard of this game from Tufts University professor emeritus Anselm Blumer, who along with his coauthors Ehrenfeucht, Haussler, and Warmuth used this setting to demonstrate the concept of learnability in their article in 1989. That paper became known as the “Four Germans Paper”, and in honor of their work, I call this game the *Four Germans Game*. This game is a useful abstraction that can help illustrate the task that machine learning algorithms have to solve. Machine learning models see some training data generated by an underlying phenomenon (i.e. the hidden box), and based on that limited information must fit a model that approximates the underlying phenomonen. A successful algorithm will pick a concept with low error on its training data, and will hopefully have similarly low error when it is tested (here error defined as the total area that is only inside one box, i.e. false positives (in our proposed concept, not in the ground truth concept), or false negatives (in ground truth concept, not in our proposed concept)). A lot can go wrong, however, even in this simple game:

- What if the training data is not representative of the actual phenomenon?
- What if there is not enough training data?
- What if the training data is noisy?
- What if the wrong model is used (i.e. you guess a rectangular region, but the actual ground truth is a circle)?

This game can be used to illustrate many concepts of one of the underlying theories of machine learning: probably approximately correct (PAC) learning, proposed by Leslie Valiant in 1984. PAC learning is a framework to help understand why we are able to trust machine learning algorithms. It also provides some insight into why machine learning models can make nonsensical and erroneous precictions. In this article, we’ll illustrate the basic assumptions of most machine learning algorithms, and demonstrate why fully-automated approaches will never be able to be error-free if these assumptions are validated.

Machine Learning algorithms are becoming ubiquitous tools in the age of big data to aid in decision making and data analysis. They are actively being deployed in scenarios where their impact directly effects humans,including self-driving cars, patient triaging software at hospitals, and even crime prediction software, a la Minority Report. Automation from deep learning models such as facial recognition technology, image and scene classification, and text generation through large language models are rapidly eclipsing levels of human competency and changing the way we live our lives.

When they are successful, these models can result in new industries and capabilities that were science fiction only years ago. But when they fail, they can have drastic consequences for the humans effected by them. Consider a scenario where making a mistake can have drastic consequences: for example, a model might be used to prescribe a medication that has drastic side effects if applied to the wrong patient. Would you feel comfortable using a machine learning model to decide on that medication?
When we deploy machine learning algorithms out in the wild, how do we *know* that they will work at their intended goal? How do we know that the probability of their error is small *enough*?

The answer, of course, is that we don’t know for sure the models will work as intended. We train models on an available dataset, and then hope that the model’s performance on a portion of that dataset is representative on however that model will be used out in the wild. This can fail spectacularly, though: A recent project called called Gender Shades by Joy Buolamwini et al. at the Massachusetts Institute of Technology discovered that commercial gender classification systems have less than one percent error on lighter-skinned males, and up to 35% error on darker-skinned females, presumably because they were trained on datasets that were biased towards light-skinned males. Because the training sets were imbalanced and not representative of the general population, the tool performed much worse in practice than we can imagine its creators assumed it would perform.

It is important for the general public to understand that machine learning algorithms make mistakes. And it is *crucial* for decision makers to understand *why* machine learning models can perform worse when deployed in the real world, so that they can make informed decisions about deploying these models in high-risk environments. In hindsight, it might seem obvious that the training data of facial recognition algorithms could lead to a bias in a deployed system. PAC learning, however, will provide us foresight to anticipate these types of risks before building a model. And this article will illustrate what that means.

Let’s return to the game, and consider some machine learning concepts. First, try to play the game again (in the smaller version to the right here), but without getting any training data.

...

Not very fun, or interesting, right?

This is an illustration of there being No Free Lunch theorems for optimization. In the absence of any information, we can do no better than random chance, and thus we can make no statements about how “correct” we are. You might have 10% error, you might have 90% error, you might fit the rectangle perfectly and have 0 error.

However, suppose we are given some information. Here, we are given 10 points, with 5 inside the target rectangle and 5 outside. If you guess a rectangle, can you make any guarantees about how close you are to the ground truth?

Consider the various strategies that you might use to minimize your error. Do you tightly wrap the green examples, or do you leave some space around them to allow for data you haven’t seen yet? Which strategy generally works better? Which strategy works better in the unlucky case, where the sampled data doesn’t provide much information due to bad luck?

Up to this point, you have been playing the machine learning algorithm. You have all the magnificent faculties of the human mind, letting you change and adapt your strategy as you encounter new examples. But machine learning algorithms tend to be much simpler, and typically more stuck-in-their ways, because they have to be well-defined.

We present three simple machine learning algorithms to solve this problem.

**Tightest Fit**: In this algorithm, the tightest possible rectangle is chosen that still contains all of the positive examples seen so far. This algorithm is the smallest possible region that is still consistent with the samples seen. It is related to the principle of Risk Minimization, which is a strategy applicable to many machine learning problems. This will under-estimate the rectangular region.

**Loosest Fit**: In contrast to the Tightest Fit algorithm, this algorithm chooses the loosest possible rectangle. This rectangle will be internal to all negative examples, and will contain all positive examples. This will over-estimate the rectangular region.

**Maximum Margin**: This is a midpoint between the two previous algorithms. It chooses a rectangle that is as far as possible from both the positive and negative examples, while containing all examples.

How do we determine if any of these algorithms are good? Or reliable? What are the properties and assumptions that are necessary to make these judgements?

When we use classical statistical methods, such as logistic regression, to analyze data, we can *trust* the model works because formalisms such as the p value and confidence intervals give us bounds on the possible errors of the model. The rectangle game proposed by Blumer et. al. that is featured in this post is designed to demonstrate how such formalisms can be defined for machine learning techniques.

Computational Learning Theorists attempt to find theoretically sound definitions of concepts found in machine learning,
such as training data, validation accuracy, and modelling. These definitions are then used to prove bounds on various
metrics like error and runtime. The paradigm of **probably approximately correct learning**, or **PAC learning**, has the
explicit goal of determining under what conditions a machine learning algorithm will most likely perform about as well as it does
on a training set, when deployed in the wild.

We provide the following treatment based on chapter 1 from Kearns and Vazirani, and we direct the reader to that text for a more thorough discussion. Loosely, a concept is PAC-learnable if there exists a machine learning algorithm such that, after viewing a sufficient amount of labeled samples, we can prove a bound on the error (let’s say that error is smaller than some epsilon $\epsilon$ ), assuming we aren’t too unlucky with which samples we receive (so, with probability at least $(1 - \delta)$ ). This type of analysis mirrors analysis done using p values for logistic regression or mathematical definitions of limits, and allows us to bound error on machine learning models. We define it formally below, then prove that our game is PAC-learnable. The assumptions in the definition and its proof lead us to a better understanding of potential vulnerabilities of machine learning algorithms, which will be demonstrated below.

Let $X$ be a data space. This could be as simple as our 2-D space, or it could be the space of all images or the space of all snippets of text. A *concept* is a subset $c \in X$ of the data space - in our case a rectangular region. A *concept class* $C$ is the collection of all concepts over our data space - in our case, all possible rectangular regions in the 2-D space.

Let $EX(c, D)$ be a procedure that draws an example, $x$ , using a probability distribution $D$ and gives the correct label $c(x)$ . In our case, $EX(c, D)$ is the process that happens when you hit the button. It draws a random location in the 2-D space using $D$ (the uniform distribution), then provides a label for that location as green (if it is part of the rectangular region we want to guess), or red (if it is outside that region).

An algorithm $A$ is a **PAC learning algorithm** for $C$ (equivalently, the concept space $C$ is **PAC learnable** if the following is true: For any $\epsilon$ and $\delta$ that we choose between 0 and 1 (i.e. some small fraction of a number), there exists a finite sample size $p$ such that if $EX(c, D)$ draws and labels $p$ examples, the algorithm $A$ will output a hypothesis concept, $h \in C$ , that will have less than $\epsilon$ error, with probability $1-\delta$ .

In other words, a type of concept (which could be rectangular regions on a plane, or the set of all pictures of cats and dogs within a group of images - the thing we are trying to see in the data space) is considered learnable if a learning algorithm exists that can usually reach a level of guaranteed performance if it sees enough training data. In the definition, choosing $\epsilon$ chooses how “good” an algorithm we want. If $\epsilon = 0.05$ , then we are saying a “good” algorithm will have less than 5% error, however we want to define error. The $\delta$ instead compensates for bad luck. If $\delta = 0.1$ , then we are saying that we will be able to produce a “good” algorithm at least 90% of the time. The other 10% of the time, we might be unable to produce a “good” enough algorithm because of getting unlucky with a sample size. In general, the number of samples increases as the amount of error ( $\epsilon$ ) and bad luck ( $\delta$ ) we tolerate goes down. These two parameters in the definition explain the redundancy in the term: probably ( $\delta$ ) approximately ( $\epsilon$ ) correct.

It’s useful to consider when a concept is *not* PAC learnable. It would seem pretty universal that the more samples a learning algorithm sees, the more confident it would get about its predictions, and the better those predictions would be. But in a lot of problems, you hit a plateau. If you try to fit a very simple model, like a logistic regression classifier, to a very complicated concept, like classifying photos of dogs, you probably can’t guarantee any arbitrary level of performance no matter how many samples you see.

The most popular learning paradigm used in machine learning use cases is *Empirical Risk Minimization*, which says that when we want to minimize error (and thus risk) when putting a machine learning model out in the world, we should first tune it to minimize error on the *empirical* data that we have, i.e. the training data. By minimizing the *empirical risk*, we are hopefully minimizing the *actual risk* the model will try to avoid when it is used out in the world.

This only makes sense, though, if there is some theoretical basis for why seeing labeled trainign data would guarantee performance on unseen data. If a concept class wasn’t PAC learnable, then no matter how good our performance on a training set was, we wouldn’t be able to make any guarantees on an algorithm’s performance out in the world.

Understanding how to prove PAC learnability can help us see what stipulations about a machine learning use case need to hold in order to trust empirical risk minimization. The definition of PAC learnability can also provide hints into how applied machine learning can go wrong.

In this article, we will use our game to demonstrate first how a learning algorithm can be proven to be a PAC learning algorithm. Then, we will break down different ways where PAC learning can break down due to issues with the data or the complexity of the concept that we are trying to model.

We can use the PAC learning paradigm to analyze our algorithms. In particular, we can look at the tightest fit algorithm, and try to bound its error, where error is defined as the probability that our algorithm’s chosen region will get the label incorrect. We would like to prove that the error is less than $\epsilon$ , with probability at least $1-\delta$ . We do this by showing that, under the uniform distribution that gives us our labeled points, we have a pretty good chance of having low error.

First, we note that the tightest-fit rectangle is always contained in the target rectangle. We can express the total error region as four strips: the strips above, below, to the left, and to the right of the tightest-fit rectangle. This is the only region where our algorithm will label incorrectly. To proceed, we determine the probability that the size of each region is $\leq \frac{\epsilon}{4}$ , so that the total error is less than $\epsilon$ . This will give us a relationship between $\epsilon$ and $\delta$ in terms of the nubmer of samples drawn, $M$ .

Consider the top strip, which we’ll call T’. Then, consider the strip T that starts from the top of the proposal region, and sweeps down in height until it has area $\frac{\epsilon}{4}$ . **If T’ is contained in T, then it has area less than $\frac{\epsilon}{4}$ , and we are done**. By our algorithm, we know that our T’ can only be greater in area than T if we haven’t seen any samples in T, since if we had, then our proposal region would contain that point. The probability that M independent draws from the distribution (i.e. new points that we see) all miss the region T (and thus that the area is less than $\frac{\epsilon}{4}$ ) is exactly $(1 - \frac{\epsilon}{4})^M$ .

The same analysis holds for the other three regions, so by the union bound, the error is at most $4(1 - \frac{\epsilon}{4})^M$ . Then, we choose M to satisfy $4(1 - \frac{\epsilon}{4})^M \leq \delta$ , i.e. the probability that the area is **not** bounded by $\epsilon$ is smaller than $\delta$ , so the probability that it **is** bounded by $\epsilon$ is greater than $1 - \delta$ .

The final step of the proof is to determine how many samples we’d need - solving for M. Using the inequality $(1 - x) \leq e^{-x}$ , we can simplify the bound to $4e^{\epsilon M / 4} \leq \delta$ , and solve for $M \geq (4/\epsilon)ln(4/\delta)$ . This is a very common strategy for establishing bounds in these types of proofs. The key takeaway is that M is not infinity. Since M is a real, finite number, we’ve demonstrated that **the problem is PAC-learnable**, i.e. we have bounded the error in terms of the number of samples seen.

As the old saying goes: Don’t Assume. It makes an ass out of you and me. And our proof involved many assumptions that may not be so realistic, if we imagine proving something similar for a more applied case.

While the proof contains some complicated logic, and our conclusion can be hard to interpret, it was actually a significant result. For this very simple problem (much simpler than the types of problems machine learning is being used to solve in the real world), we were able to show that the error on unseen examples would be bounded.

But not so fast - there’s a major issue that we didn’t mention. The proof we went through, as well as the PAC-learning paradigm in general, relies on several assumptions.

That the sampled data in independent and identically distributed (typically referred to as i.i.d). In real world data, this may not be the case; if the algorithm is being trained on streamed data, it typically cannot be considered independent draws. This assumption was used in our probability bounds.

In this version of the game, the i.i.d. assumption is violated by placing some linear drift on the data. As we get new points, they move linearly in a certain direction.

In this version of the game, the training distribution and the testing distribution are independent of one another. This means that the true labels (green points) that you get in training may not be representative of where the true labels would appear in the testing phase. It makes it impossible to guarantee any performance: it’s back to the No Free Lunch situation described earlier.

Our proof assumed that we already knew that the type of region we were looking for was a rectangle. But in practice, we rarely know what kind of machine learning model will match the phenomenon being modeled in the data. Suppose, instead of a rectangular region, the region we were looking for turned out to be an ellipse, or a parallellogram, or even an arbitrary, amorphous disconnected region. A geometric proof would then be much harder, or even impossible.

In this version of the game, the actual region we are trying to guess is an ellipse, but we can only choose a rectangle. Play it a few times and notice that no matter how good your guess is, you will always have some nonzero lower bound on your testing error.

There are other considerations for failure scenarios for machine learning algorithms. Varshney presented an interesting treatment in 2017. In practice, there is active research into proving PAC learnability for more complex learning spaces than in this article (PAC Learnability of Node Functions in Networked Dynamical Systems by Adiga et al. in ICML from 2019, or A Theory of PAC Learnability under Transformation Invariances by Shao et al in NeurIPS from 2022). However, it is likely that PAC learnability wouldn’t hold in real world cases because we typically can’t *prove* (or *know*) that the assumptions are going to be held. But the paradigm of PAC learnability illustrates what types of problems we have to look out for in applied machine learning.

The game played in this blog post was hopefully a relevant illustration of some of the topics introduced. If it’s helpful, it may be because it is a visual explanation. In general, visualizing the right properties of the data and the algorithm can help indicate to a data scientist whether any of the necessary assumptions are broken. The data scientist can then address them by cleaning or preprocessing the data, or choosing a different machine learning algorithm.

In this blog post, we looked at an incredibly simple machine learning problem, and the algorithms we considered were easily explainable in a sentence or two. Even for this simple problem, it was difficult to prove any limitations on error. Machine learning is typically used to model much more complex problem domains, with much more complex data and intricate, high-dimensional processes. In these cases, it is very unlikely that error bounds are proveable, and even if they are, it is unlikely that the assumptions are upheld.

Instead, a human in the loop, armed with appropriate visualizations and analytical tools, can act as a safeguard against the most endemic cases. This additional line of defense is more and more necessary as machine learning models are deployed in scenarios that directly affect humans. More discussion on this topiccan be found in chapters 1 and 7 of my thesis, Bridging the Human-Machine Gap in Applied Machine Learning with Visual Analytics, which offers additional perspective on the role of visual analytics systems in empirical risk minimization.

In this interactive article, we used the Four Germans Game to illustrate the concept of probably approximately correct (PAC) learning. We then demonstrated how several assumptions are used in learning theory to build theoretical bounds on performance for machine learning models in the empirical risk minimization learning paradigm. We presented some examples, using the game, of how these assumptions might be broken.

*An Introduction to Computational Learning Theory*by Kearns and Vazirani- The Myth of the Impartial Machine by Feng and Wu
*Principles of Risk Minimization for Learning Theory*by Vapnik*On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and Data Products*by Varshney and Alemzadeh*Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World*by Leslie Valiant

Thank you to Anselm Blumer and the reviewers from VisXAI.

Dylan Cashman is an assistant professor in the Michtom School of Computer Science at Brandeis University in Waltham, MA. He previously worked as a Senior Expert in Data Science and Advanced Visual Analytics within the Data Science and Artificial Intelligence division of Novartis Pharmaceuticals in Cambridge, MA. Dylan holds a PhD in Computer Science from Tufts University and an Sc. B in Mathematics from Brown University. His research interests include the development and evaluation of visual affordances that improve usability of artificial intelligence models and data science processes. His research has won best paper awards at the Eurovis conference, the Symposium on Visualization for Data Science (VDS), and the Workshop on Visual Analytics for Deep Learning at IEEEVIS.