Strengthening trust in machine-learning models

A "taxonomy of trust" to identify confidence in data analysis.


Machine learning (ML) is increasingly used to make major decisions in science, social science, and engineering, with the potential to impact people’s lives profoundly. It is therefore important to ensure that probabilistic ML outputs are actually useful for their users’ stated purposes.

Probabilistic machine learning methods are becoming more powerful data analysis tools. However, math is only one piece of the puzzle in determining their accuracy and effectiveness. 

To address this issue, a team of researchers created a classification system known as a “taxonomy of trust,” which defines where trust may break down in data analysis and identifies strategies to strengthen trust at each step. 

Trust can be lost at several points in the data analysis process: analysts decide which data to collect; they choose which models, or mathematical representations, best capture the real-world issue or question they are trying to answer; they select algorithms to fit the model; and they use code to run those algorithms. Each of these steps poses its own challenges for building trust.

Some of these components can be checked in measurable ways. A question such as “Does my code have bugs?” can be examined against objective standards. Other problems are more subjective and lack obvious solutions: analysts must weigh competing strategies for gathering data and judge whether a model accurately represents the real world.
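
To make the contrast concrete, a question like “Does my code have bugs?” can often be probed with a unit test whose answer is known in advance. The sketch below uses hypothetical function and test names, not anything from the paper:

```python
import numpy as np

def fit_line(x, y):
    """Hypothetical analysis code: least-squares slope and intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def test_fit_line_recovers_exact_line():
    # Objective check: data generated from y = 2x + 1 must be fit (almost) exactly.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = 2.0 * x + 1.0
    slope, intercept = fit_line(x, y)
    assert abs(slope - 2.0) < 1e-9
    assert abs(intercept - 1.0) < 1e-9

if __name__ == "__main__":
    test_fit_line_recovers_exact_line()
    print("code check passed")
```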

The team aims to highlight issues that have already been thoroughly researched and those that require additional attention. 

MIT computer scientist Tamara Broderick said, “What I think is nice about making this taxonomy is that it really highlights where people are focusing. Much research naturally focuses on this level of ‘are my algorithms solving a particular mathematical problem?’ in part because it’s very objective, even if it’s a hard problem. I think it’s really hard to answer ‘is it reasonable to mathematize an important applied problem in a certain way?’ because it’s somehow getting into a harder space; it’s not just a mathematical problem anymore.”

Although the researchers’ categorization of trust breakdowns may appear abstract, it is rooted in a real-world application. Meager, a co-author of the paper, examined whether microfinance can benefit a community. The project served as a case study of how to lower the risk of trust breaking down at each step.

To measure the impact of microfinance, analysts must first define the real-world problem they hope to solve and decide which data can speak to it. Contextualizing the available data is critical: analysts must assess whether specific case studies can reflect broader trends and must account for local circumstances; in rural Mexico, for example, owning goats may be considered an investment.

They must also define what counts as a positive outcome. In economics, for example, measuring the average financial gain per business in communities where a microfinance program is introduced is standard practice. However, reporting an average may imply a net positive effect even if only a few people benefited rather than the entire community.
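
A minimal numerical sketch, with made-up figures, of how an average can mask who actually benefits:

```python
import numpy as np

# Hypothetical per-business gains in one community (made-up numbers):
# two businesses gain a lot, the other 48 gain nothing.
gains = np.array([500.0, 300.0] + [0.0] * 48)

print("mean gain:           ", gains.mean())        # 16.0 -- looks like a broad benefit
print("median gain:         ", np.median(gains))    # 0.0  -- the typical business saw nothing
print("share that benefited:", (gains > 0).mean())  # 0.04 -- only 4% of businesses gained
```

Reporting only the mean here would suggest a community-wide benefit that the underlying data do not support.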

The researcher said, “It’s hard to measure the quality of life of an individual. People measure things like, ‘What’s the business profit of the small business?’ Or ‘What’s the consumption level of a household?’ There’s this potential for a mismatch between what you ultimately really care about and what you’re measuring. Before we get to the mathematical level, what data and what assumptions are we leaning on?”

The researcher said, “What you wanted was that a lot of people are benefiting. It sounds simple. Why didn’t we measure the thing that we cared about? But I think it’s common for practitioners to use standard machine learning tools for many reasons. And these tools might report a proxy that doesn’t always agree with the quantity of interest.”

The researcher added, “Someone might be hesitant to try a nonstandard method because they might be less certain they will use it correctly. Or peer review might favor certain familiar methods, even if a researcher might like to use nonstandard methods. There are a lot of reasons, sociologically. But this can be a concern for trust.”

While transforming a real-world problem into a model can be a big-picture, amorphous challenge, checking the code that runs an algorithm can feel “prosaic.” Yet it is another often overlooked area where trust can be strengthened.

In some cases, checking a coding pipeline that executes an algorithm may be considered outside the scope of an analyst’s job, especially when standard software packages are available.

Testing whether code is reproducible is one way to catch bugs. However, depending on the field, sharing code alongside published work is not always required or the norm, and as models grow more complex over time, recreating the code from scratch and replicating the model become increasingly difficult.
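
One lightweight form of such a check is to run the same pipeline twice, with the data and random seed fixed, and confirm the outputs match. A minimal sketch, assuming a hypothetical analysis function with a randomized step:

```python
import numpy as np

def run_analysis(data, seed):
    """Hypothetical pipeline with a randomized step (a bootstrap standard error)."""
    rng = np.random.default_rng(seed)
    resamples = rng.choice(data, size=(1000, len(data)), replace=True)
    return resamples.mean(axis=1).std()

data = np.array([1.2, 0.0, 3.4, 0.5, 2.1])

# Reproducibility check: the same code, data, and seed should give the same answer.
first = run_analysis(data, seed=0)
second = run_analysis(data, seed=0)
assert first == second, "results differ across runs -- the pipeline is not reproducible"
print("reproducible result:", first)
```

Pinning seeds and dependency versions in shared code makes this kind of check possible for anyone who downloads it.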

As one step toward building trust, the researcher said, “Let’s just start with every journal requiring you to release your code. Maybe it doesn’t get totally double-checked, and everything isn’t absolutely perfect, but let’s start there.”

The researchers have taken that step themselves: Broderick and Gelman collaborated on an analysis forecasting the 2020 U.S. presidential election using real-time state and national polls.

The team published daily updates in The Economist magazine while making their code available online for anyone to download and run. There is no single solution for creating a perfect model, the researchers acknowledge, but analysts can build trust by testing code for reproducibility and sharing it alongside published work.

Broderick said, “I don’t think we expect any of these things to be perfect, but I think we can expect them to be better or to be …”

Journal Reference:

  1. Broderick, T., Zheng, T., et al. Toward a taxonomy of trust for probabilistic machine learning. Science Advances (2023). DOI: 10.1126/sciadv.abn3999
