Let’s start with a little experiment that will give you a deeper understanding of what a neural network is. Even if you are an expert in AI, please try.
Write down on a piece of paper what you see here (up to two words):
We will be back to this experiment later, but it is very important that you keep that physical piece of paper near you.
This post was born after an interesting discussion on Nexa’s mailing list.
The Nexa Center for Internet & Society is an independent research center, focusing on interdisciplinary analysis of the force of the Internet and of its impact on society.
The thread was about a report from the French Commission nationale de l'informatique et des libertés (CNIL), and I tried to give my technical contribution by sharing obvious considerations about machine learning in general and neural networks in particular.
Such simple considerations come from hands-on experience of the
matter, since I’m a programmer.
Turns out, they were not that obvious.
So here I try, in the style of Nexa, to provide useful insights to both the layman who uses AI for fun or profit, and the AI practitioner who is too deep in the field to see it from a broader, interdisciplinary perspective.
It’s lengthy, to be both clear and technically correct, but it should be worth a read.
Also,
while we focus on artificial neural networks here, most of what is
said applies, mutatis
mutandis, to other machine learning techniques.
Just, with different trade-offs.
We call artificial neural networks a class of algorithms that can statistically approximate any function.
Currently, they constitute the most exciting research field in Statistics.
The first keyword in our definition is “function”.
A function, in mathematics, is a rule that assigns to each element in a set exactly one element in another.
Counting is a function from the set of “things” to the set of numbers.
Addition is a function from (the set of) pairs of numbers to (the set of) numbers. So is multiplication. And exponentiation.
The point here is that, if you have two sets and a rule that maps each element of one set to exactly one element of the other, you have a function.
Neural networks are statistical applications.
This
should explain why they need tons of data to be calibrated.
They are just like any other statistical application (or ML
technique).
Don’t agree? Try to apply them to domains with low cardinality. ;-)
If you think of neural networks (and ML in general) as statistical applications, it suddenly becomes clear that any discriminatory behavior is not due to an obscure “bias” in an inscrutable computer brain.
It’s simply a problem in the data. Or in the code. Or both.
Just like any other statistical application, the results provided by the output layer of a neural network are subject to an error.
Consider the function f : ℕ → ℙ that maps each natural N to the Nth prime.
We have f(0) = 2; f(1) = 3; f(2) = 5; f(3) = 7 … and so on.
You literally have infinite samples to use, so you might want to use a neural network to approximate this function.
Or, you
could try to calibrate a neural network to identify prime numbers.
Again you have a function and literally infinite samples to
use.
But, if
you try to use their outputs for cryptography, you are
doomed.
Still you can’t blame the AI, just your poor understanding of
statistics.
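To make the point concrete, here is a minimal Python sketch. It does not use a neural network: as a deliberately simple stand-in, it fits a straight line by least squares to the first 50 values of f. The names `nth_prime`, `fit_line` and `approx` are mine, chosen for illustration. The approximation is plausible on average and wrong on individual values, which is exactly what you cannot afford in cryptography.

```python
def nth_prime(n):
    """Return the n-th prime, counting from zero: f(0) = 2, f(1) = 3, ..."""
    count, candidate = -1, 1
    while count < n:
        candidate += 1
        # Trial division: candidate is prime if no d in [2, sqrt(candidate)] divides it.
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    return a, mean_y - a * mean_x


# "Calibrate" the stand-in model on the first 50 samples of f.
xs = list(range(50))
ys = [nth_prime(x) for x in xs]
a, b = fit_line(xs, ys)


def approx(n):
    """Statistical approximation of f: close on average, not exact."""
    return round(a * n + b)


# Absolute error on the very samples used for calibration.
errors = [abs(approx(x) - y) for x, y in zip(xs, ys)]
print("largest error on the calibration samples:", max(errors))
print("f(1000) =", nth_prime(1000), "approximation:", approx(1000))
```

A larger model would reduce the error on the samples, but it would still be an approximation with an error, not the function itself.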
The formal definition of algorithms is still a deep research topic.
In 1972, however, Harold Stone provided a useful informal definition that most programmers will agree with: “we define an algorithm to be a set of rules that precisely defines a sequence of operations such that each rule is effective and definite and such that the sequence terminates in a finite time.”
Since computer systems are deterministic in nature (and will continue to be, until the widespread adoption of quantum computing), all algorithms executed by computers are deterministic too.
When a
race condition makes a concurrent algorithm non-deterministic, programmers call it “a
bug”. We
just add time to the equation and fix it.
And when true entropy is fed to a deterministic algorithm to make its results hard to predict, we can still replicate them by recording the random bits fed to the algorithm and replaying its execution with those same bits.
We randomize the input, not the algorithm.
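A small Python sketch of this idea, under the assumption that all randomness enters the algorithm as an explicit argument. The function `noisy_sum` is a hypothetical example, not taken from any real system: it reads its "random" bytes from the caller, so recording those bytes is enough to replay the run exactly.

```python
import os


def noisy_sum(values, random_bytes):
    """A deterministic algorithm: all randomness comes in through random_bytes."""
    # Each byte perturbs the matching value by -1, 0 or +1.
    return sum(v + (b % 3 - 1) for v, b in zip(values, random_bytes))


values = [10, 20, 30, 40]

# Entropy from the operating system, recorded before use.
recorded = os.urandom(len(values))

first_run = noisy_sum(values, recorded)
replay = noisy_sum(values, recorded)  # same bits in, same result out

print(first_run, replay, first_run == replay)
```

The result of a single run is hard to predict in advance, yet every run can be reproduced afterwards from the recorded bytes.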
The last word in our definition is what makes neural networks “magical”.
Neural networks can statistically approximate any function.
Even unknown ones.
If you
suspect
that a function exists, you can try to statistically approximate it
with a neural network, even if you do not know the rule that it
follows.
You just need two sets. And tons of data.
This is the greatest strength of neural networks. And their weakness, too.
Back to our little experiment. Take your piece of paper. What did you write?
Can you see the cat? Me too.
This is what we call pattern recognition: we match a stimulus with information retrieved from memory.
We, as humans, are very good at this. Very, very good.
Still, there is no cat there!
Really, there is just a screen connected to a computer. ;-)
Humans are so good at pattern recognition that we can be fooled by it.
We take two sets, such as the (set of) measures of Iris and the botanic classification of Iris (a set of words).
We “suspect” that a function exists between these two sets. :-D
We look for a large data set, classified by an expert, such as a Botanist.
We calibrate a neural network to approximate that hypothetical function.
Finally we run the program, and we see that it classifies Iris “like a pro”.
And, just like a mother looking at her beloved son, we say: “how smart it is!”
We see
a program doing the work of a Botanist and we recognize a
pattern.
We are matching the program with experiences from our own
memories.
We look at the computer and we see a Botanist. We see an intelligence.
But it’s like with the cat.
If you
know the technology in depth, you might have noticed that I’ve
never used the term “training” for the calibration phase of a
neural network.
In the same way, I do not like to use the term “learning” for
machines.
The words we use to describe the reality forge our understanding of it.
Talking about “deep learning”, “intelligence” and “training” is evocative, attracts investments and turns programmers into demigods.
It’s
funny, but
plain wrong.
And dangerous, as we will see later.
All techniques known as “machine learning” require tons of data.
We don’t have better algorithms. We just have more data.
— Peter Norvig, Chief Scientist, Google
It’s obvious, once you understand that they are just statistical applications.
Still, the amount of data required to calibrate a neural network is so large that, despite being a 70-year-old technology, it became practical only recently.
Today, anybody can easily collect, buy or sell tons of data.
Why? Simply because we leak data. Precious data. Data about ourselves.
We
could apply these tools to any system for which we have enough
data.
But we have tons of cheap data about people.
With
enough data, we could try to calibrate a
neural network to select the resumes to consider for an interview.
Or to decide the perfect salary for an employee. Or to select the
best match for a transplant. Or for a love story.
A company could try to approximate the “right” cost for your
insurance.
Why do we need so much data?
Fine, it’s just statistics, but concretely why?
Let’s recap: a neural network can statistically approximate any function.
How many curves pass through a point in a multidimensional space?
And through two points? And through three? And through N?
Turns out that the answer to all of these questions is “infinitely many”.
Can you see the problem?
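A minimal Python sketch of the problem, in one dimension for simplicity. Given any finite set of sample points, it builds a whole family of curves that all pass through every sample and still disagree everywhere else. The construction (a Lagrange interpolating polynomial plus a multiple of a term that vanishes on the samples) and the names `make_family`, `lagrange` and `curve` are my own illustration.

```python
def make_family(points):
    """Return a builder of curves that all pass through every (x, y) in points."""
    xs = [x for x, _ in points]

    def lagrange(x):
        # The unique lowest-degree polynomial through all the points.
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    def curve(c):
        def g(x):
            # bump is zero at every sample, so it never disturbs the fit.
            bump = 1.0
            for xj in xs:
                bump *= x - xj
            return lagrange(x) + c * bump
        return g

    return curve


samples = [(0, 1), (1, 3), (2, 2)]
curve = make_family(samples)
g0, g1, g5 = curve(0), curve(1), curve(5)

for x, y in samples:
    print(x, y, g0(x), g1(x), g5(x))       # all three agree with the samples
print("at x = 3:", g0(3), g1(3), g5(3))    # and disagree outside them
```

Every real value of `c` gives a different curve through the same samples, so the samples alone can never single out one function.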
Let’s explain this in a cooler way, using the common AI parlance.
After training a neural network, we do not know which knowledge it will deduce from the training samples and we do not know what reasoning it will use for its computation. It’s like a black box.
It
approximates the desired output in the range covered by our
samples.
That’s all we can say.
There
are infinitely many
functions that a given network might approximate.
And, currently, we
cannot say which function a given node from the output
layer of a neural
network is actually approximating.
Now, there is some interesting research on this issue, but I’m not very optimistic about it. My insight is that being able to deduce the target function from a generic calibrated neural network is equivalent to solving the halting problem.
After all, DGNs are neural networks too!
BTW, we need a big data set to filter out unwanted functions.
All the
headaches you get from overfitting or underfitting are just side effects of
this heroic
challenge: whenever you feed a sample to the network, you
exclude an infinite number of functions from being approximated by
it.
Nevertheless an infinite number of functions still fit your data
set.
So you cannot know which function your network will approximate.
It’s sad, but you can’t really win.
Headache apart, this fact has deep legal implications.
How can you prove that a neural network does not discriminate against a minority?
How can you prove that it’s not calibrated to be racist? Or sexist?
Theoretically, you cannot.
What math can’t prove, engineers can check.
When in doubt, use brute force.
— Ken Thompson, Unix inventor
To show that your neural network is not “trained to discriminate” you simply have to declare the function you tried to approximate and
The
experts will try to falsify a theory, a predicate about the network.
Experimentally. They will at least
If they do not find a programming error (which, trust me, is almost certain) it might take years, since the cost of debugging always grows with complexity.
Still,
while very
expensive,
this approach is always technically
possible.
Remember? Neural networks are deterministic programs.
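As a toy illustration of what such an experimental check can look like, here is a Python sketch. Everything in it is hypothetical: `score` and `biased_score` stand in for a calibrated model, and the attribute names and value ranges are invented. Because the model is deterministic, we can enumerate a finite input domain and look for any case where the output changes when only the protected attribute changes.

```python
from itertools import product


def score(age, income_band, group):
    """Hypothetical stand-in for a calibrated model; it ignores `group`."""
    return 2 * income_band + (1 if age >= 30 else 0)


def biased_score(age, income_band, group):
    """Same model, with a hidden penalty for group "B" in the lowest band."""
    penalty = 1 if group == "B" and income_band == 0 else 0
    return score(age, income_band, group) - penalty


def find_counterexamples(model, ages, income_bands, groups):
    """Exhaustively search for inputs whose output depends on `group` alone."""
    found = []
    for age, band in product(ages, income_bands):
        outputs = {g: model(age, band, g) for g in groups}
        if len(set(outputs.values())) > 1:
            found.append((age, band, outputs))
    return found


ages, bands, groups = range(18, 66), range(5), ["A", "B"]
print(len(find_counterexamples(score, ages, bands, groups)))         # no counterexample
print(len(find_counterexamples(biased_score, ages, bands, groups)))  # one per age in band 0
```

This checks a single predicate over a tiny domain. With a real network the input space is vastly larger and discrimination can hide behind correlated attributes, which is where the cost explodes.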
But when is it worth such effort?
When do you have to pay such a huge cost?
If you calibrate a neural network to play Go, nobody will ask you to prove that it is not discriminating white stones. Nobody cares about stones’ rights.
But if you delegate to a neural network a decision about people, the decision is still your own responsibility.
You are stating that you trust whoever implemented, configured and calibrated the network and you are accepting liability for its outputs.
It’s just a statistical application, after all!
The word “responsibility” comes from the Latin responsum abilem, which basically means “able to explain”.
This was pretty clear to the people who wrote the European GDPR.
Indeed, on Nexa’s mailing list, Marco Ciurcina pointed out that Article 13 and Article 14 of the GDPR were relevant to the discourse.
In particular, point (f) of Article 13(2) states that
…the controller shall (…) provide the data subject with the following further information necessary to ensure fair and transparent processing: (…) the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.
Point (g) of Article 14(2) is equivalent, but relates to data obtained from third parties rather than from the data subject.
So if you operate in Europe and you apply AI to people, you should be able to explain the logic that led to each of its outputs in a court.
What information about the logic involved in automated decision-making is “meaningful” for the data subject?
To be “meaningful” the information about a decision-making process (automated or not) must be
But to prove that it is pertinent, the information must be complete.
And to prove that it is complete, you have to be able to replicate that specific decision-making process using it.
So, to recap, the meaningful information about a decision-making process is all the information that is relevant to the process itself and that the data subject (or the Magistrate) can use to replicate the process itself.
And obviously, to be pertinent, such information must be up to date.
Programmers can see how Magistrates debug. :-)
Of all machine learning techniques, neural networks are the most expensive to debug/explain. Several times more expensive than the others.
But you can’t simply state “we didn’t train the network to be racist”.
Or “the neural network was simply trained so and so”.
You could be lying.
Fortunately, you can prove your statement. With brute force and debugging.
It’s just another cost in the budget. Probably a huge cost, but a
cost.
Still, if you can’t afford to show that your neural networks (or your ML models in general) are approximating legal functions, it’s wise to replace them.
Or your business model.
If you are not a programmer, it is safe (and wise) to mentally replace terms like AI and ML with “statistical application” in any discourse you listen to.
You will have a deeper understanding of the matter if you do so.
Artificial neural networks are simply deterministic algorithms that statistically approximate functions. It’s just not possible to exactly say which function they approximate.
This is
not a big issue, until you apply them to people.
And it’s just expensive when you do. At least in Europe.
Incidentally, the biggest data sets are about people.
And artificial intelligence is not allowed to discriminate on your behalf.