Every day I receive a steady stream of pitches from companies and universities large and small touting their latest AI breakthrough. Yet once one looks past the marketing hype and hyperbole and actually tests the tool itself, the results are rarely as glowing as one might hope. In fact, for those on the front lines of applying AI to complex real-world problems, working with today’s AI solutions is akin to asking toddlers to operate a spacecraft. When putting a new AI tool through its paces, one of the most common outcomes is discovering that the algorithm has latched onto an extraordinarily fragile and inaccurate representation of its training data. As “explainable AI” approaches become steadily more robust, what if companies were asked to subject their AI creations to algorithmic audits and report the results?
To the press and public, today’s AI solutions are nothing short of magic. They are living silicon intelligences that can absorb the world around them, learn its patterns and wield them with superhuman precision and accuracy. To those who actually use them on complex real-world problems each day, they are brittle and temperamental toddlers that oscillate without warning between remarkable accuracy and gibberish meltdown, and whose mathematical veneer masks a chaotic mess of alchemy and manual intervention.
In fact, even some major driverless car systems still rely upon hand-coded rules for some of their most mission-critical tasks, reminding us that even the most vocal proponents of deep learning still acknowledge its grave limitations.
In many domains, the accuracy of deep learning solutions still pales in comparison to traditional approaches like Naïve Bayes classifiers and hand-coded rulesets. In our rush to crown deep learning as the de facto technology of the modern era, we too often forget the cardinal rule of checking whether something simpler does the job just as well or even better.
Few AI startups publicly compare their deep learning solutions to existing non-neural alternatives. Even the academic literature typically benchmarks new deep learning approaches against previous neural architectures rather than against established non-neural baselines.
When benchmarks are provided, they typically focus on extreme edge cases that showcase the new technology at its best, rather than on the ordinary real-world content that will make up 99% of the algorithm’s workload and on which it may actually perform far worse than classical approaches.
Yet even in cases where deep learning solutions perform markedly better than classical approaches, or in domains like image understanding where few non-neural alternatives exist, the lack of visibility into what the algorithm has actually learned from its training data clouds any understanding of how robust it may be.
Explainable AI, coupled with more stringent benchmarking, offers a solution to these challenges.
Before investing in a new AI technology or purchasing a product for use, companies should focus more heavily on benchmarking it against classical approaches rather than only against its neural peers, and should emphasize evaluation on their own data rather than on standard benchmarking datasets.
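Such a benchmark need not be elaborate. The sketch below is one minimal way it could look, assuming a generic tabular dataset: the file name "your_data.csv" and the "label" column are placeholders, and a small scikit-learn neural network stands in for a vendor's model rather than representing any particular product.

```python
# Sketch: benchmark a neural model against simple classical baselines
# on a company's own data rather than a standard benchmark set.
# Assumes a tabular CSV ("your_data.csv") with numeric feature columns
# and a categorical "label" column -- both names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("your_data.csv")              # placeholder path
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "neural_net (stand-in for vendor model)": MLPClassifier(
        hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

# Train each model on the same split and compare held-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:40s} held-out accuracy: {acc:.3f}")
```

If the simple baselines land within a point or two of the neural model on the company's own data, the case for the more complex and more opaque system becomes much harder to make.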
Most importantly, companies should require AI developers to provide the results of standard explainable AI tests for their algorithms, documenting which variables the models rely upon most heavily and how stable and genuinely predictive those variables are. Companies should also provide small sample or synthetic datasets that mimic the characteristics of their own data and request per-record stability metrics showing how much slight changes to the attributes of each input would alter its result, giving a concrete measure of how robust the algorithm is on data like their own.
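Neither of these checks requires access to the model's internals. The sketch below pairs a standard permutation importance test with a simple per-record perturbation test, assuming the model and held-out data from the previous snippet; the 1% noise scale and repeat counts are illustrative choices, not prescriptions.

```python
# Sketch: two simple "explainable AI" style checks a buyer could request.
# Assumes `model`, `X_test`, `y_test` from the previous benchmarking sketch,
# with purely numeric features.
import numpy as np
from sklearn.inspection import permutation_importance

# 1. Which variables does the model actually rely upon?
result = permutation_importance(model, X_test, y_test,
                                n_repeats=20, random_state=0)
for name, mean, std in sorted(
        zip(X_test.columns, result.importances_mean, result.importances_std),
        key=lambda t: -t[1]):
    print(f"{name:30s} importance: {mean:+.4f} +/- {std:.4f}")

# 2. Per-record stability: do slightly perturbed copies of each input
#    still receive the same prediction?
rng = np.random.default_rng(0)
base_pred = model.predict(X_test)
stability = np.zeros(len(X_test))
for _ in range(25):
    noise = rng.normal(0, 0.01 * X_test.std().values, size=X_test.shape)
    stability += (model.predict(X_test + noise) == base_pred)
stability /= 25
print(f"Mean prediction stability under 1% noise: {stability.mean():.3f}")
print(f"Least stable records: {np.argsort(stability)[:10]}")
```

A vendor unwilling or unable to produce numbers like these on a buyer's sample data is telling that buyer something important about how well it understands its own model.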
Understanding what an algorithm sees is especially important in regulated industries, where an algorithm that focuses on the wrong variable could have substantial legal and societal implications. An AI company building a mortgage or rental evaluation system might go to great lengths to ensure its algorithm has no inputs related to race. Despite its best efforts, the system may eventually learn to infer race from an unexpected combination of inputs that were not previously known to encode it. In turn, a major apartment rental company that applies the algorithm and unknowingly but systematically rejects applicants of a particular race can face enormous legal liability, despite having conducted what it believed to be due diligence in certifying that race was not used as a factor in its evaluation process. Had the company subjected the algorithm to a series of explainable AI tests, it could have uncovered this bias from the start.
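One way such an audit might look in practice is sketched below. It assumes an audit dataset (`audit_df`) that mirrors the model's inputs but also carries a self-reported "race" column withheld from the deployed system, with the model emitting 0/1 approve-or-deny decisions; every column name here is hypothetical, and the logistic regression probe is just one of many possible audit designs.

```python
# Sketch: probe whether the model's inputs or decisions act as a proxy for
# a protected attribute. Assumes an audit DataFrame `audit_df` whose "race"
# column was withheld from the deployed model, plus the trained `model`
# from the earlier sketches; all column names are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

features = audit_df.drop(columns=["race", "label"])
race = audit_df["race"]

# 1. Can the model's own input features reconstruct race?  If this probe
#    scores well above chance, the inputs jointly encode race even though
#    no single column names it.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, features, race, cv=5)
print(f"Race recoverable from model inputs: {scores.mean():.2f} accuracy "
      f"(chance ~ {race.value_counts(normalize=True).max():.2f})")

# 2. Do approval rates differ by group once the model is applied?
#    Assumes the model outputs 0/1 approve-or-deny decisions.
decisions = pd.Series(model.predict(features), index=audit_df.index)
print(audit_df.assign(approved=decisions)
              .groupby("race")["approved"].mean())
```

If the probe recovers race well above chance, the absence of an explicit race column offers little real protection, and the approval-rate breakdown shows how that hidden encoding translates into outcomes.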
In the end, explainable AI is beginning to shine a bit of light into the opaque black-box workings of the deep learning revolution. Companies should take advantage of these new insights to more thoroughly evaluate the technologies they invest in and apply to their businesses.