Machine Learning Testing through Continuous Assurance

1 February 2022

The Need for a Continuous Assurance Pipeline to Overcome the Limitations of Machine Learning Algorithms

The Unreasonable Effectiveness of Data

In their 2009 expert opinion article “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig and Fernando Pereira of Google’s research team wrote:

“… A trillion-word corpus … could serve as the basis of a complete model for certain tasks – if only we knew how to extract the model from the data”.

That statement touches on much of what we want to discuss below. First, what does it mean to have a complete model for a task, and how do we ascertain its completeness? Second, the question of how to extract the model from the data at hand is not independent of the actual task, nor of the assumptions we will always have to make.

Classical Software Development and the Philosophy of Agility

Modern software development, as it has evolved over the last decades, brought with it the practice and philosophy of working in an agile manner. Agility is what made fast and reliable iterative development of software products possible: a prototype leads to feedback, which leads to a new prototype, to new feedback, and so forth. Through the successive phases of the digitization era, from the onset of the Internet through web development to the age of mobile development, the process of professional software development was itself developed and refined iteratively. This led to best practices such as continuous development, continuous testing and continuous integration. These practices allow collaborative development of large and highly complex software products in complex and uncertain domains, while ensuring running code throughout the development cycle and delivering ever more valuable features to customers in less time.

The Era of Deep Learning – A Big Paradigm Shift

The beginning of the 2010s marked the transition to the era of Deep Learning (DL) with Artificial Neural Networks (ANNs) at industry scale. ANNs are one manifestation of the broader area of Machine Learning (ML), and the mathematical concept was not invented at that time; it was first studied in the early 1940s. Then and in the decades that followed, learning based on ANNs remained a rather academic endeavor. The first stepping stone for the later success of DL was laid by David Rumelhart, Geoffrey Hinton and Ronald Williams in 1986, when they published their seminal paper on the backpropagation algorithm. This algorithm was the long-sought theoretical key for training deep ANNs. The second stepping stone was the accelerated development of ever more powerful computing hardware over the last three decades. The third and, to date, last big stepping stone was the arrival of the web and internet era, which brought an unprecedented explosion of data generation. These three parts (the learning algorithm, the hardware power, and the ever-growing amount of raw data) propelled us into the era of ML-based software development.

To really grasp the paradigm shift between classical software development and the new ML-based software development, we must consider the major conceptual distinction between a classical software function or algorithm and an ML algorithm. A software function in the classical sense is explicitly designed and implemented by a human programmer. It takes an input, transforms it in a predefined manner into an output, and passes that output on to some other software function. For example, we can imagine a function that takes an image as input, transforms it in some way (e.g., scaling or color manipulation) and then passes it to a display function or to a function that saves the resulting image to a storage device. The processing of a classical software function can in principle be designed, and as such it can also be tested. We could, for example, test our image manipulation function on a set of test images, and that test will in principle be valid for the infinite set of images that comply with the requirements defined for the function. The classical software function is thus a paradigm in which the input and the algorithm are given, while the output is generated. In the ML paradigm, by contrast, the input and the output data are the given artifacts, while the function or algorithm that transforms the input into the output is the unknown.
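The contrast between the two paradigms can be made concrete with a minimal Python sketch. The names `brighten` and `fit_linear` and the toy data are assumptions chosen purely for illustration: in the classical case the rule is written by hand, while in the ML case a rule of the same form is estimated from input/output pairs, here with ordinary least squares as a stand-in for a far more complex learning algorithm.

```python
from typing import List, Tuple


def brighten(pixel: float) -> float:
    """Classical paradigm: the rule is written down by the programmer."""
    return 2.0 * pixel + 1.0


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """ML paradigm: estimate slope and intercept from example pairs
    using the closed-form least-squares solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Inputs and outputs are given; the function is the unknown.
inputs = [0.0, 1.0, 2.0, 3.0]
outputs = [brighten(x) for x in inputs]  # stand-in for observed data
slope, intercept = fit_linear(inputs, outputs)


def learned(pixel: float) -> float:
    """The rule recovered from data rather than written by hand."""
    return slope * pixel + intercept
```

In this toy setting the fitted rule coincides with the hand-written one because the data are noise-free and the model family is exactly right; neither holds for a real ANN, which is why the learned function cannot simply be trusted.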

As pointed out above, continuous testing of classical software functions was developed over the last decades and is quite straightforward. When we want to test a function such as our image manipulation function, we can write a companion test function that runs every time the function is changed during development, thus ensuring that it behaves correctly for given input images and deterministic expected outputs. Put even more simply, if I write a function to add two numbers, I can test that it yields 2 + 2 = 4.
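As a rough sketch of such companion tests, assuming a grayscale image represented as nested lists and a hypothetical `invert` function standing in for the image manipulation function:

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of 0..255 values


def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


def invert(image: Image) -> Image:
    """Return the negative of a grayscale image (illustrative example)."""
    return [[255 - value for value in row] for row in image]


def test_add() -> None:
    assert add(2, 2) == 4


def test_invert() -> None:
    image = [[0, 128], [255, 64]]
    # Deterministic expected output for a known input.
    assert invert(image) == [[255, 127], [0, 191]]
    # Inverting twice must give back the original image.
    assert invert(invert(image)) == image


if __name__ == "__main__":
    test_add()
    test_invert()
    print("all tests passed")
```

Tests of this kind can be run automatically on every change, for example by a test runner inside a continuous integration pipeline.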

In contrast, for functions or algorithms that are extracted from data, the testing strategy is not as straightforward. Assuring that an ML model functions correctly in the real world, once deployed in an automated decision system, involves the whole cycle of data gathering, data pre- and post-processing, data quality assurance, model training, model validation and model deployment. When we then come to the ML model itself, most of the successful models of the last decade were ANNs, which are in essence black-box systems. We cannot inspect these black-box systems directly, as we can a classical software function. The tremendous success of modern ML has somewhat overshadowed that deficiency. As long as we consider ML-based automated decision systems for fun use cases such as cat detectors or funny face transformations in mobile photo apps, we might not be worried. But when it comes to safety-relevant domains (health care, automotive, manufacturing, etc.), slightly wrong decisions may have severe, even life-threatening, consequences.

Two Approaches for Software Development in Safety-critical Domains

This puts us between two conflicting approaches to developing software systems in potentially safety-critical domains: on the one hand we must abide by safety standards, and on the other hand we must build agile and iterative development pipelines in order to ensure innovation in these domains in the coming decades. That brings us to the continuous assurance of ML systems. The model for it is the set of well-established concepts from classical software development of the last decades: continuous development, continuous testing and continuous integration. All of these are automatic pipelines that assure the correct functioning of single software units on the one hand and their proper integration into the whole system on the other. Whenever a software developer makes even a slight change or extension in one small part of the software system, the change triggers the automatic pipeline, which either passes all defined unit and integration tests or does not. By analogy, we need similar test and assurance pipelines for software systems with ML modules.

The best way to establish a continuous assurance pipeline for a software system with ML modules is to devise a second software system that accompanies the system to be assured. The assurance pipeline then implements a multitude of distinct ML assurance tools and functions, which assess the inner workings and the input-dependent decision characteristics of the ML modules. To make this concrete, let us go through an example. Consider a problem from computer vision in which we want to detect or classify objects in a given input image. This task is at the heart of many problems in health care and autonomous driving. Both domains are prime examples of safety relevance and are hence heavily regulated. In the last decade, Convolutional Neural Networks (CNNs) took the lead as the single most successful ANN type for solving computer vision tasks. However, these success stories have to a large extent outshone the existing deficiencies of CNNs.

The Issue of Short-Cut Learning – From Natural Sciences to Data-based Learning

Researchers from Tübingen, Germany, recently wrote an intriguing paper titled “Shortcut learning in deep neural networks”. In it they outline how many seemingly distinct failure cases of ANNs can be connected by the concept of a shortcut decision strategy. Shortcut decision strategies are not only present in deep learning but also prevalent in biological learning, including in us humans. Nor is taking shortcuts a bad strategy per se: finding shortcuts for explaining phenomena that we observe in the world is at the heart of science. As the researchers write in that paper:

“If science was a journey, then its destination would be the discovery of simple explanations to complex phenomena”.

Any (mathematical) theory extracted from seemingly diverse and unconnected experiments and observations of the physical world is in essence a shortcut representation of these observations. One can say that the power of a (mathematical) theory is measured by its ability to explain as many distinct natural phenomena as possible. There is a nice piece on “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” by the famous theoretical physicist Eugene P. Wigner, in which he states (section “The Uniqueness of the Theories of Physics”, 2nd paragraph, p. 8):

“Every empirical law has the disquieting quality that one does not know its limitations. We have seen that there are regularities in the events in the world around us which can be formulated in terms of mathematical concepts with an uncanny accuracy. … The question which presents itself is whether the different regularities … will fuse into a single consistent unit …”.

Again, that “consistent unit” into which “the different regularities” may fuse represents an intended shortcut. Some examples of intended shortcuts in physics are: classical Newtonian mechanics, describing phenomena ranging from mechanical machines to the motions of planets; the Maxwell equations, explaining the world of electromagnetism; and Einstein’s theory of General Relativity, predicting gravitational phenomena from our solar system up to the existence of gravitational waves and black holes in the universe.

So, what about shortcuts extracted from data by ML algorithms? The paper “The Unreasonable Effectiveness of Data” essentially argued for the effective extraction of shortcut decision models from big data. Written in 2009, it appeared at what in hindsight was the starting point of the deep learning success era. Yet now that the 2010s have ended and the 2020s have begun, the answer to that question may be the reasonable effectiveness of extracting any possible shortcut from the given data, whether intended or unintended. There is a multitude of examples of unintended shortcuts in deep learning, including classifiers that decide by background context (cows and sheep are mostly expected on grassland), object detection driven by texture instead of shape (example 1, example 2), face recognition algorithms with high error rates for minority groups, and many more. Two very visual examples of unintended shortcut learning are shown in Figure 1.
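The background-context case can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the two binary cues, the hand-made data sets, and the one-feature "decision stump" learner, which merely stands in for a real classifier. The learner picks whichever single cue best fits the training data, and that turns out to be the background rather than the shape.

```python
from typing import List, Tuple

# Each sample: ((shape_looks_like_cow, background_is_grass), is_cow)
Sample = Tuple[Tuple[int, int], int]

# Training data: the background matches the label in every sample,
# while the noisier shape cue is wrong in two of ten samples.
train: List[Sample] = [
    ((1, 1), 1), ((1, 1), 1), ((1, 1), 1), ((1, 1), 1), ((0, 1), 1),
    ((0, 0), 0), ((0, 0), 0), ((0, 0), 0), ((0, 0), 0), ((1, 0), 0),
]

# Deployment data: cows on a beach, other objects on grass.
shifted: List[Sample] = [
    ((1, 0), 1), ((1, 0), 1), ((1, 0), 1),
    ((0, 1), 0), ((0, 1), 0), ((0, 1), 0),
]


def accuracy(feature: int, data: List[Sample]) -> float:
    """Accuracy of the rule 'predict the value of one feature'."""
    hits = sum(1 for features, label in data if features[feature] == label)
    return hits / len(data)


def fit_stump(data: List[Sample]) -> int:
    """Pick the single feature with the highest training accuracy."""
    return max(range(2), key=lambda feature: accuracy(feature, data))


chosen = fit_stump(train)                # selects the background cue
train_acc = accuracy(chosen, train)      # looks perfect during training
shifted_acc = accuracy(chosen, shifted)  # breaks once the context changes
```

Nothing in the training accuracy reveals the problem; it only surfaces when the model is evaluated on data in which the shortcut cue and the intended cue disagree.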

Embracing the Limitations and Developing the Cure

ANNs are amazingly good at extracting information from data and successively compressing it into ever smaller representations of the data set. There is also a very interesting theory explaining how the deep learning process achieves this information retrieval and compression into shortcut representations (see part 1 and part 2 of a blog series which I wrote earlier). Unfortunately, as we have seen, the learned representations are not guaranteed to be the intended ones. Worse, we can never be completely sure whether they are. What Wigner wrote about empirical laws in the quotation above can be adapted into an analogous statement about these data-extracted shortcuts:

Every data-extracted shortcut representation has the disquieting quality that one does not know its limitations. We have seen that there are regularities in the data which can be formulated in terms of abstract black-box concepts with an uncanny accuracy. The question which presents itself is whether the different regularities will converge into a single consistent shortcut representation.

Hence, we need to research and develop novel tools for assessing the representations learned within an ANN. These tools must then be integrated into a continuous assurance pipeline that serves as our guide through the otherwise inaccessible wilderness of learned representations. In analogy to continuous development and continuous integration in classical software development, each (small) change in our overall ML development pipeline (data extraction, model training, model evaluation, model testing) will pass through the continuous assurance pipeline, which tells us whether or not we are heading toward the intended shortcut representation. Otherwise, we will always be doomed to stumble like a blind man through the jungle.
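What such a gate could look like in its simplest form is sketched below. This is only an illustration under stated assumptions, not a description of any existing tool: the function names, the 0.9 thresholds, the two toy models and the tiny data sets are all hypothetical, and a real pipeline would plug in much richer assessment tools than plain accuracy checks.

```python
from typing import Callable, Dict, List, Tuple

Features = Tuple[int, int]  # (shape cue, background cue), hypothetical
Sample = Tuple[Features, int]
Model = Callable[[Features], int]
Check = Callable[[Model], bool]


def accuracy(model: Model, data: List[Sample]) -> float:
    """Fraction of samples on which the model predicts the label."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def make_accuracy_check(data: List[Sample], threshold: float) -> Check:
    """Build a check that passes when accuracy on `data` reaches `threshold`."""
    return lambda model: accuracy(model, data) >= threshold


def run_assurance(model: Model, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every named check and report pass/fail per check."""
    return {name: check(model) for name, check in checks.items()}


# Hypothetical evaluation sets: one like the training data, one in
# which the background no longer agrees with the object shape.
in_distribution: List[Sample] = [((1, 1), 1), ((0, 0), 0)]
shifted_background: List[Sample] = [((1, 0), 1), ((0, 1), 0)]

checks: Dict[str, Check] = {
    "in_distribution_accuracy": make_accuracy_check(in_distribution, 0.9),
    "background_shift_accuracy": make_accuracy_check(shifted_background, 0.9),
}


def background_model(x: Features) -> int:
    """Toy model that decides by the background cue (unintended shortcut)."""
    return x[1]


def shape_model(x: Features) -> int:
    """Toy model that decides by the shape cue (intended shortcut)."""
    return x[0]


report_shortcut = run_assurance(background_model, checks)
report_intended = run_assurance(shape_model, checks)

# A change is only released when every check in the report passes.
release_ok = all(report_intended.values())
```

Running the same set of checks on every change to data, training or model mirrors how unit and integration tests gate changes in classical continuous integration.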


Here we have provided an introductory piece on the limitations of ML and on how we might overcome them with the aid of a continuous assurance pipeline for ML-based software development. In a follow-up blog article, we intend to go into more detail on existing tools for assessing ANNs with respect to their learned shortcut representations and their overall stability and safety in the face of adversarial perturbations.
