Vapnik’s Principle: What was actually said?

If you google Vapnik’s Principle, this is the first search result:

When solving a problem of interest, do not solve a more general problem as an intermediate step

We will come back to this “principle” later. Allow me to take a rather elaborate detour for now.

Over the years, I’ve delved into several self-help books. They were engaging and pleasant reads, for sure. Yet, in retrospect, I’ve come to realize they were somewhat of a scenic route to nowhere for me. The true essence of a self-help book shines when you’re truly ready to help yourself—which has never been the case for me.

However, I discovered a self-help principle that I could actually apply in my life (found it myself, not in any book). That principle is:

To excel at anything, streamline your path to focus solely on the absolute essentials required to do it consistently.

Admittedly, my attempt at paraphrasing might not do it justice, so let me give an example. I’m an enthusiastic bass player (passion over prowess). For the longest time, my progress hit a plateau. Back in Dhaka, I owned a Warwick, a badge of quality in the bass world, proving that neither money nor equipment was my obstacle. The game-changer came with two simple adjustments after moving to Australia. I bought an incredibly cheap bass (AUD 170) and made sure it was always within arm’s reach of my workspace. I didn’t do it intentionally, though; there was simply not enough space in my then tiny townhouse. I also bought a cheap amp and kept it close to my footrest.

Doing so enabled me to practice without the formalities of practicing. Whenever I was stuck on a problem, I played scales while thinking about it. Whenever I was watching some Netflix, I practiced my 16th notes. Eventually, it helped me reach a level that was previously unreachable for me.

Why couldn’t I do this earlier? Because I had phrased the problem inefficiently, requiring me to solve a more general problem that stretched far beyond playing bass: the formalities of practicing. Calling up mates, limping through the horrendous Dhaka traffic, and not having a place for myself all created barriers. Solving these problems would address ten other issues because they were more general by definition. Once I rephrased the problem to focus on accessing the absolute essentials with ease, it took me further. The same goes for my book reading. Once I let go of the romanticism of the tactile experience of reading a physical book and embraced Kindle, my diminishing reading habit took a U-turn.

In hindsight, this was Vapnik’s principle.


I am familiar with some of the works of pre-Soviet and Soviet/post-Soviet mathematicians (mainly Kolmogorov and, of course, Vladimir Vapnik). What distinguishes their work is its uncompromising rigor. So Vapnik probably would not have offered such a loosely interpreted, populist, self-help-style quote without proper context.

Now the question is: what was the context?

Actually, “Vapnik’s Principle” is quite possibly more of a hipster term. I have not found it in his seminal work “The Nature of Statistical Learning Theory” (I might be wrong; my memory is dangerously unreliable). The theory where the idea is actually discussed at length is Structural Risk Minimization. But what is the structure here? Also, what risk are we talking about?

The “structure” in this context is basically the framework one uses to solve the problem. In the realm of machine learning, it is a hierarchy of model classes of increasing capacity, from which we select a model against a training set with a finite amount of data. The risk one runs here is the risk of overfitting, i.e., the lack of generalization capacity of the selected model. Let’s look at the following list of learning tasks, in increasing order of generality, according to Vapnik’s principle (a toy sketch follows the list).

  • If you need to estimate a rank ordering of a set of variables, you don’t need their point estimates, because that’s a more general problem than rank ordering.
  • If you need the point estimate of a variable, you don’t need the entire distribution, because that’s a more general problem than estimating a point.
  • If you need to find the distribution of the variable, you don’t need to find the underlying stochastic process (now widely known, with a lot of abuse of terminology, wink wink, as a generative model).
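Here is a toy sketch of the first rungs of that ladder (my own illustration, not Vapnik’s; the data and numbers are made up): deciding which of two noisy quantities is larger needs less machinery than estimating either of them, and a point estimate needs less machinery than the full distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two quantities we only observe through noisy samples.
a = rng.normal(loc=1.0, scale=3.0, size=50)
b = rng.normal(loc=1.4, scale=3.0, size=50)

# Task 1 (rank ordering): "is b larger than a?" A rank-style statistic --
# the fraction of pairs where a b-sample beats an a-sample -- answers this
# without ever committing to a point estimate of either quantity.
b_beats_a = np.mean(b[:, None] > a[None, :]) > 0.5

# Task 2 (point estimate): the sample mean of b, no distributional model needed.
point_estimate = b.mean()

# Task 3 (full distribution): now we must commit to a model family (here,
# Gaussian) and estimate all of its parameters -- the most general task.
mu_hat, sigma_hat = b.mean(), b.std(ddof=1)

print(b_beats_a, point_estimate, (mu_hat, sigma_hat))
```

Each step down the list commits to strictly more structure than the original question requires, and that extra structure is exactly where the risk of overfitting lives.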

Here is an example of Vapnik’s-Principle-aided thinking in pop culture, courtesy of Walter White in Breaking Bad (first 40 seconds). If you don’t know the full context, it won’t hit you with an undue spoiler.

I am intentionally avoiding the discussion of inductive vs. transductive learning here because that’s another huge can of worms. My goal here is to show that Vapnik’s Principle is relevant not only in the realm of statistical learning but also in strategic decision-making within the confines of limited resources.


Is Vapnik’s Principle Limiting and Anti-Innovation?

Answering this question involves invoking the debate between empiricism and rationalism. Indeed, some problems are easier to solve if we find their general form first. Many problems are hard in the time domain but become straightforward once we move to the frequency domain via a Fourier or Laplace transform. However, most such problems do not face the elephant in the room that learning tasks do: a finite amount of training data.
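The convolution theorem is the textbook instance of this: convolving two signals directly is awkward, but it becomes plain multiplication in the frequency domain. A minimal NumPy sketch, just to make the “easier after a transform” point concrete:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # a signal
h = np.array([0.5, 0.5])             # a simple smoothing filter

# Direct, time-domain convolution.
direct = np.convolve(x, h)

# Frequency-domain route: zero-pad, multiply the spectra, transform back.
n = len(x) + len(h) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(direct, via_fft))  # True: both routes agree
```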

Vapnik’s Principle, together with the machinery around it, provides sufficient conditions under which a task can be “learned” with guarantees, but the converse is not true: violating those conditions does not guarantee failure. Over-parameterized deep learning has shown absolute disregard for Vapnik’s Principle and has achieved tremendous success. Yet, Vapnik’s Principle still influences this field in some way.

For example, the Lottery Ticket Hypothesis suggests that there exists a small subnetwork within a much larger neural network that works just as well when initialized in the “right way.” Coined in 2018, this hypothesis has seen subsequent evolution, refutation, and alteration. You can check out this dense post on LessWrong for more details. The possibility of a sparser network can be seen as a connection to the structure in structural risk minimization. Perhaps we are running low on the risk side in deep learning because a good part of the network is not learning much, so the threat to generalization is smaller than the raw parameter count suggests; that is speculation on my part. But this works in our favor, right?
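To make the lottery-ticket idea concrete, here is a minimal sketch of one round of the usual recipe on a toy linear model: train, prune by weight magnitude, rewind the surviving weights to their original initialization, and retrain only the sparse subnetwork. This is my own NumPy toy under those assumptions, not the original paper’s setup (which prunes deep networks, often iteratively).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a sparse ground truth: only 3 of 20 features matter.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

def train(w, mask, steps=500, lr=0.05):
    """Gradient descent on squared error; weights outside the mask stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w_init = rng.normal(scale=0.1, size=20)       # the "ticket" initialization
w_dense = train(w_init, np.ones(20))          # 1) train the dense model

keep = np.abs(w_dense) >= np.quantile(np.abs(w_dense), 0.8)
mask = keep.astype(float)                     # 2) prune 80% by magnitude

w_ticket = train(w_init, mask)                # 3) rewind survivors, retrain
print("dense MSE :", np.mean((X @ w_dense - y) ** 2))
print("ticket MSE:", np.mean((X @ w_ticket - y) ** 2))
```

The mask plays the role of the “structure” here: after pruning, we are effectively searching a much smaller hypothesis class than the dense parameter count suggests.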
