I find it rather weird that mathematics is usually taught to people on a historical basis. It would likely be possible for a 4th-century Alexandrian mathematician to teach geometry in a modern primary school and get decent results. An 18th-century Venetian tutor could carry the curriculum all the way up to high school, though I’ll grant they might need a week or two to brush up on sets and some modern notation.

On the other hand, Darwin or Huxley would most likely get fired within the week if they tried to teach high school biology. The great Aristotle would be unable to introduce a 1st grader to any of the sciences, despite his works in the precursors of modern science remaining foundational more than a millennium after his death.

To a great extent, mathematical knowledge retains validity over time. Even when that knowledge is poorly formulated or has become irrelevant, it is nonetheless still valid.

The obvious hypothesis for why this is true is that mathematics is rather “pure” compared to many/all other fields.

That’s boring. I’ve come up with a second, more interesting and much more scandalous hypothesis.

## Mathematical Abstractions are Very Leaky

In case the term is unfamiliar to you, a **leaky abstraction** is an abstraction that leaks the details that it’s supposed to abstract away.

Take something like sigma summation (∑). At first glance it looks like an abstraction, but it’s too leaky to be a good one. Why is it leaky? Because all the concepts it “abstracts” over need to be understood in order to use the sigma summation. You need to “understand” operations, operators, numbers, ranges, limits and series.

For another example take integrals (∫), where the argument for leakiness can be succinctly stated. To master the abstraction, one needs to understand all of calculus and all that is foundational to calculus (also known as “all of math” until 60 years ago or so).

A leaky abstraction is the rule rather than the exception in mathematics. The only popular counterexample that comes to mind is the integral transform (think Laplace and Fourier transforms). Indeed, any mathematician in the audience might scoff at me for using the word “abstraction” for what they think of as shorthand notations.

Why? Arguably because mathematics has no such thing as an “observation” to abstract over.

Science has the advantage of making observations: it sees that X correlates with Y in some way, then it tries to find a causal relationship between the two via theory (which is essentially an abstraction).

Darwin’s observations about finches were not by any means “wrong”. Heck, most of what Jabir ibn Hayyan found out about the world is probably still correct. What is incorrect or insufficient are the theoretical frameworks they used to explain them.

Alchemy might have described the chemical properties and interactions of certain types of matter quite well. We’ve replaced it with the Bohr model of chemistry simply because it abstracts away more observations than alchemy.

Thinking with competition, sexual reproduction, DNA, RNA and mutations explains Darwin’s observations. This doesn’t mean his original “survival of the fittest” evolutionary framework was “wrong”. It just makes it a tool that outlived its value.

So, mathematics ends up being “leaky” because it has no such thing as an “observation”. The fact that 2 * 3 = 6 is not observed; it’s simply “known”. Or… is it?

The statement 95233345745213 * 4353614555235239 = 414609280180109235394973160907 is just as true as 2 * 3 = 6.

Tell a well-educated ancient Greek, “95233345745213 times 4353614555235239 is always equal to 414609280180109235394973160907, this must be fundamentally true within our system of mathematics,” and he will look at you like you’re a complete nut.

Tell that same ancient Greek, “2 times 3 is always equal to 6, this must be fundamentally true within our system of mathematics,” and he might think you are a bit pompous, but overall he will nod and agree with your rather banal statement.

To a modern person there is little difference in how “obviously true” the two statements seem, because a modern person has a calculator.

But before the advent of more modern techniques for working with numbers, such calculations were beyond the reach of most if not all. There would be no “obvious” way of checking the truth of that first statement… people would be short 414609280180109235394973160897 fingers.
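Today that check is a one-liner, since Python integers have arbitrary precision. A minimal sketch (the helper name `check_product` is made up for illustration):

```python
def check_product(a: int, b: int, claimed: int) -> bool:
    """Return True if a * b equals the claimed product (exact integer arithmetic)."""
    return a * b == claimed


# The "banal" statement and the "complete nut" statement are checked the same way.
print(check_product(2, 3, 6))
print(check_product(95233345745213, 4353614555235239,
                    414609280180109235394973160907))
```

Both calls print `True`; no fingers required.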

So really, there is something that serves a similar role to observations in mathematics: that which most can agree on as being “obviously true”. Obviously there are cases where this concept breaks down a bit (e.g. Ramanujan summation), but these are the exception rather than the rule.

Computers basically allow us to raise the bar for “obviously true” really high. So high that “this is obviously false” brute force approaches can be used to disprove sophisticated conjectures. As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain.
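As a toy stand-in for that kind of brute-force disproof, take Euler’s polynomial n² + n + 41, which produces primes for so many small n that one is tempted to conjecture it always does. The sketch below (function names are made up for illustration, and the “conjecture” is far less sophisticated than the ones I have in mind) simply tries every n in a finite range until one fails:

```python
def is_prime(k: int) -> bool:
    """Trial division; slow but easy to trust for small k."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def first_counterexample(limit: int):
    """Smallest n in [0, limit) where n*n + n + 41 is not prime, or None."""
    for n in range(limit):
        if not is_prime(n * n + n + 41):
            return n
    return None


print(first_counterexample(1000))  # 40, since 40*40 + 40 + 41 == 1681 == 41 * 41
```

Nobody needs to understand anything about prime-generating polynomials to run this; they only need to trust the loop and the primality check.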

Computers also allow us to build functional mathematical abstractions, because suddenly we can “use” those abstractions without understanding them. Being able to use a function and understanding what a function does are different things in the modern age, but this is a very recent development.

For the most part computers have been used to run leaky mathematical abstractions. Leaky by design, made for a world where one had to “build up” their knowledge to use them from the most obvious of truths.

However, I think that non-leaky abstractions are slowly showing up to the party, and they have a lot of potential. In my view, the best specimen of such an abstraction is the neural network.

## Neural Networks as Mathematical Abstractions

As far as dimensionality reduction (DR) methods go, an autoencoder (AE) is basically one of the easiest ones for me to explain, implement and understand, despite being quite sophisticated compared to most other nonlinear DR methods.

The way I’d explain it (maybe a bit more rigorous than necessary) is:

1. Build a network that gets the data as an input and has to predict the same data as an output.

2. Somewhere in that network, preferably close to the output layer, add a layer E of size n, where n is the number of dimensions you want to reduce the data to.

3. Train the network.

4. Run your data through the network and extract the output from the layer E as the reduced dimension representation of said data.

5. Since we know from training the network that the values generated by E are good enough to reconstruct the input values, they must be a good representation.

This is a very simple explanation compared to that of most “classic” nonlinear DR algorithms. Even more importantly, it uses no mathematics whatsoever. Or rather, it uses a heap of mathematics, but it’s all abstracted away behind the word “train”.
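To make the five steps concrete, here is a deliberately tiny sketch under loud assumptions: the data is a made-up set of 2-D points on a line, both layers are linear with no biases, layer E has size n = 1, and `train` is plain gradient descent written out by hand only so that the example runs without any framework. In practice that one function is exactly the part a library like Keras or PyTorch would hide. All names here are hypothetical.

```python
# Toy data: 2-D points that all lie on the line y = 2x, so one number is enough
# to describe each of them.
DATA = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(w, x):
    """Layer E with n = 1: squeeze a 2-D point into a single number."""
    return w[0] * x[0] + w[1] * x[1]


def decode(v, z):
    """Output layer: rebuild a 2-D point from the single number z."""
    return (v[0] * z, v[1] * z)


def reconstruction_error(w, v, data):
    """Mean squared difference between each point and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(v, encode(w, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(data, steps=2000, lr=0.05):
    """Step 3: plain gradient descent, spelled out for this tiny model."""
    w, v = [0.1, 0.1], [0.1, 0.1]  # small fixed starting weights
    for _ in range(steps):
        grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            x_hat = decode(v, z)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            dz = 2.0 * (err[0] * v[0] + err[1] * v[1])  # d(error)/dz
            for i in range(2):
                grad_v[i] += 2.0 * err[i] * z / len(data)
                grad_w[i] += dz * x[i] / len(data)
        for i in range(2):
            w[i] -= lr * grad_w[i]
            v[i] -= lr * grad_v[i]
    return w, v


w, v = train(DATA)                        # step 3
codes = [encode(w, x) for x in DATA]      # step 4: the 1-D representation
print(reconstruction_error(w, v, DATA))   # step 5: close to zero after training
print(codes)
```

Everything outside `train` reads like the plain-English recipe; everything inside it is the heap of mathematics the recipe gets to ignore.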

Consider for a moment the specific case of DR. It could be argued that a subset of AEs are basically performing PCA (1) (2).

In the case of most “real” AEs however, they are nonlinear. We could see them as doing the same thing as a kernel PCA, but removing the need for us to choose the kernel.

AEs are also impressive because they “work well”. Most people working on a hard DR problem will probably use some brand of AE (e.g. a VAE).

Even more than that, AEs are fairly “generic” compared to other nonlinear DR algorithms, and there are a lot of these algorithms. So, one can strive to understand when and why to apply various DR algorithms, or one can strive to understand AEs. The end result will be similar in efficacy when applied to various problems, but understanding AEs is much quicker than understanding a few dozen other DR algorithms.

Even better, in order to understand AEs the only “hard” part is understanding neural networks, but neural networks are a very broad tool so many people may already possess some understanding of them.

To understand neural networks, you need three other concepts: automatic differentiation as used to compute the gradients, loss computation and optimization.

Automatic differentiation is hard, to the extent that I assume most people otherwise skilled in coding and ML wouldn’t be able to replicate Google’s JAX or PyTorch’s autograd. Luckily for us, automatic differentiation is very easy to abstract away and explain: “Based on a loss function computed between the real values and your outputs (the error), we use {magic} to estimate how much every weight and bias was responsible for that error and whether its influence increased or decreased the error. Furthermore, we can use {magic} to do this efficiently in batches”.

This sort of explanation is not that satisfactory, but I doubt most people go through life acquiring a much deeper understanding than this. I have no strong evidence that this is true, but empirically I notice new papers about new optimizers and loss functions pop up every day on /r/MachineLearning. I can’t remember the last time I saw a paper proposing significant improvements to common autograd methods or scrapping them altogether for something new.
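For anyone who does want a peek behind the {magic}, here is a minimal sketch of reverse-mode automatic differentiation on plain scalars. It is nowhere near what JAX or PyTorch actually do (no tensors, no batching, only three operations), and the `Value` class with its fields is made up for illustration, not any library’s API.

```python
class Value:
    """A number that remembers how it was computed, so blame can flow backwards."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the nodes so each one is handled before the nodes it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule


# One "neuron": prediction = w * x + b, loss = (prediction - y)^2
w, b = Value(3.0), Value(1.0)
x, y = Value(2.0), Value(10.0)
err = w * x + b - y
loss = err * err
loss.backward()
print(loss.data, w.grad, b.grad)  # 9.0 -12.0 -6.0
```

The negative gradients say that nudging `w` or `b` upwards would shrink the error, which is the “how much was each weight responsible, and in which direction” part of the story.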

Optimizers are… not that hard. Explaining something like Adam or stochastic gradient descent to someone is fairly straightforward. Since the optimization itself is applied to rather trivial 2D functions and the process is just repeated a lot of times, it’s not that hard to conceptualize how it works. Failing that, you can probably construct a “just so” story for why optimizers work, similar to the one for automatic differentiation above, tell people to use AdamW and the world will be just fine.
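The core loop really is that small. Below is a sketch of vanilla gradient descent on a single-variable function. It is not Adam or AdamW, which add momentum and per-parameter scaling on top of this idea, and the function name and defaults are my own choices for illustration.

```python
def gradient_descent(grad, start, lr=0.1, steps=100):
    """Repeatedly step against the gradient; return the final position."""
    x = start
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# f(x) = (x - 3)^2 has its minimum at x = 3 and gradient 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(x_min)  # very close to 3
```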

Loss functions are certainly a trivial concept and if you stick to the heuristic behind basic loss functions and don’t try to factor them into crazy inferences, I’d argue anyone could probably write their own loss functions the same way they write their own training loops.
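As a sketch of what “writing your own” can look like, here is mean squared error next to a made-up variant that punishes under-prediction more heavily. The name `asymmetric_loss` and its `under_weight` parameter are hypothetical, chosen only for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def asymmetric_loss(y_true, y_pred, under_weight=3.0):
    """Like MSE, but predictions that fall short count under_weight times as much."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = p - t
        total += (under_weight if diff < 0 else 1.0) * diff ** 2
    return total / len(y_true)


targets = [1.0, 2.0, 3.0]
print(mse(targets, [1.0, 2.0, 5.0]))              # overshoot by 2 on the last item
print(asymmetric_loss(targets, [1.0, 2.0, 5.0]))  # same value as MSE: nothing falls short
print(asymmetric_loss(targets, [1.0, 2.0, 1.0]))  # shortfall of 2 is penalised 3x
```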

It’s really tough to argue how much “harder” it is to understand these three concepts to the degree where you can understand how to design neural networks, versus how hard it is to understand the kernel trick. I would certainly say they are probably in the same difficulty and time ballpark.

However, the advantages of neural networks are:

a) You don’t actually have to deeply understand the concept of automatic differentiation and optimization in order to design a network that does roughly what you want.

b) Once you learn the three basic concepts, you have the foundation on which to understand anything neural network related. Be it AEs, RNNs, CNNs, Residual blocks/nets or transformers.

So essentially, neural networks become a sort of mathematical abstraction that isn’t very leaky. Sure, there’s some need for an understanding of the underlying concepts in order to figure out how to use it, but it’s rather minimal. You can train a programmer with basic algebra knowledge to use Keras in a few days. The same cannot be said about using 4-dimensional geometry or complex calculus.

And the neural network abstraction gives us a higher-level language to talk about ML algorithms. As was the case with the AE, some of the resulting algorithms are arguably not that far off from the “classical” ML world, but instead of intellectual monstrosities, the neural network based designs are conceptually simple: all of the complexity is in finding the weights and biases, hidden behind the abstraction.

For example, I’ve heard it argued that a typical transformer network (think BERT) is basically similar in structure to a “classical” NLP pipeline (3).

About three years ago, it was really popular to argue that a CNN’s layers could be thought of as feature detectors that followed a simple-to-complex hierarchy the closer you got to the output layer (e.g. the first layer handles edges, the second layer handles simple geometric shapes… the 12th layer handles facial features specific to mullet fishes and seagulls).

The fact that they sometimes resemble these algorithms, however, is more of a cherry on top rather than the crux of the matter. At its core, it’s important that models created with this abstraction should work and the last five years have answered that question with a resounding “Yes”.

## So Why is This Important?

Well, it’s important because non-leaky mathematical abstractions are rather rare, especially ones that have such a low barrier to entry and are used so widely. I wouldn’t compare neural networks to integral transforms just yet, but I think we are getting there.

I’d also argue it’s important because it explains why neural networks have not only taken over the field of “hard” ML problems, but are now making their way into all facets of ML where an SVM or DT or GB classifier might have worked just fine. It’s not necessarily because they are “better”, but it’s because people have more confidence in using them as an abstraction.

Lastly, it’s important because it’s a way to conceptualize why neural networks *are*, in a way, better than classical ML algorithms: this lack of leaks means that anyone can play around with them without breaking the whole thing and being thrown one level down. Want to change the shape? Sure. Want to change the activation function? Sure. Want to add a state function to certain elements? Go ahead. Want to add random connections between various elements? Don’t see why not… etc. They have a lot of tweakable hyperparameters, and they are modifiable not just in principle but in practice.

Of course, every ML algorithm has tweakable parameters, but as soon as you start changing the kernel function of your SVM you realize that for the tweaks to be useful, the abstraction must break down and you need to learn the concepts underneath (and so on, and so on).
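By contrast, here is roughly what changing the shape or the activation function of a network looks like. This is a bare sketch with untrained random weights and no biases, and every name in it is made up for illustration; the point is only that each tweak is a change to one argument, with nothing underneath needing to be reopened.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def make_network(sizes, seed=0):
    """One weight matrix per pair of consecutive layer sizes, randomly filled."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(network, inputs, activation=relu):
    """Run inputs through every layer; the activation is applied on hidden layers only."""
    values = list(inputs)
    for index, layer in enumerate(network):
        values = [sum(w * x for w, x in zip(row, values)) for row in layer]
        if index < len(network) - 1:
            values = [activation(x) for x in values]
    return values


x = [0.5, -1.0, 2.0]
print(forward(make_network([3, 4, 2]), x))                        # original shape
print(forward(make_network([3, 8, 8, 2]), x))                     # deeper and wider
print(forward(make_network([3, 4, 2]), x, activation=math.tanh))  # different activation
```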

It’s rare for me to argue that a relatively popular and hyped up thing is “even better” than people think. But in the case of neural networks, I truly think that they are among the first of a “new” type of mathematical abstraction. They allow people who don’t have a dozen-plus years of background in applied mathematics to do applied mathematics.