As artificial intelligence (AI) and machine learning gain momentum, an increasing number of government agencies are considering or starting to use them to improve decision making. Some examples of compelling applications include those that identify tax-evasion patterns, sort through infrastructure data to target bridge inspections, or sift through health and social-service data to prioritize cases for child welfare and support. They enable governments to perform more efficiently, both improving outcomes and keeping costs down.
The most pressing aspects of adopting such solutions are generally well known: algorithms should be accurate and consciously checked for unintended bias.1 Other requirements are less widely recognized. Algorithms must be stable, meaning that small changes to their input don’t meaningfully change their output. They should be explainable, especially in the public sector, where myriad stakeholders will review every step. And to ensure successful adoption, public-sector users should pay particular attention to how AI solutions are deployed, given that public-sector managers generally have less authority and operational control to compel adoption than their private-sector counterparts. While all these factors are relevant to every public-sector entity, they aren’t necessarily relevant in the same way.
Getting the right balance is essential not only to minimize the risks but also to build a proper business case for the investment, and to ensure that taxpayer dollars are well spent. Below, we’ll explore each of these five dimensions—accuracy, fairness, explainability, stability, and adoption—as they apply to the public sector.
When it comes to algorithms, public-sector users could measure performance in terms of better decision making. Since there are typically many possible measures and probabilistic outcomes, it’s unlikely that an algorithm will forecast every one of them precisely. Users could start by identifying which ones are most likely to lead to the best decisions for the situation. We recommend focusing on two or three measures that truly matter for the specific use case. Consider the following examples:
- Deciding which individuals receive rehabilitation treatment. Correctional officers or social workers at prisons may prefer the algorithms to reduce the number of false negatives (high-risk individuals falsely classified as low risk) relative to false positives (low-risk individuals falsely classified as high risk). That’s because the potential impact of missing a high-risk individual could be a higher likelihood of recidivism, while that of misclassifying a low-risk individual would be additional programming.
- Deciding where to focus tax audits. Tax officials may want to optimize for focusing on only the most likely tax evaders—given the potential consequences of falsely tagging someone as a high risk for evasion.
- Deciding which students get scholarship money based on probability to graduate. When the rank order of students determines scaled scholarship amounts, the order in which students rank could matter more than the absolute probabilistic score that the individual student receives from the model (in this instance, the likelihood of graduation). In such cases, school administrators would care more about predicting the correct rank order of the students than about the accuracy of the probabilistic outcome by itself.
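The three examples above each point to a different measure. As a minimal sketch, assuming binary labels where 1 means high risk (or, for the scholarship case, graduated), the functions below compute a false-negative rate, a precision figure for audit targeting, and a simple pairwise rank-agreement score. The function names and the toy data are hypothetical and only illustrate how the measures differ; they are not taken from the original article.

```python
from itertools import combinations

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def false_negative_rate(y_true, y_pred):
    """Share of truly high-risk cases that the model labeled low risk."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return fn / (fn + tp) if (fn + tp) else 0.0

def precision(y_true, y_pred):
    """Share of flagged cases that were truly positive (useful for audit targeting)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

def pairwise_rank_agreement(outcomes, scores):
    """Fraction of pairs with different outcomes that the scores order correctly.

    Tied scores count as half correct. Pairs with equal outcomes are skipped.
    """
    agree, total = 0.0, 0
    for (o1, s1), (o2, s2) in combinations(zip(outcomes, scores), 2):
        if o1 == o2:
            continue
        total += 1
        if s1 == s2:
            agree += 0.5
        elif (s1 > s2) == (o1 > o2):
            agree += 1.0
    return agree / total if total else 0.0

# Toy data, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
fnr = false_negative_rate(y_true, y_pred)
prec = precision(y_true, y_pred)

graduated = [1, 0, 1, 0]
model_scores = [0.9, 0.8, 0.7, 0.1]
rank_score = pairwise_rank_agreement(graduated, model_scores)

print(fnr, prec, rank_score)
```

A corrections team would watch `fnr`, a tax agency would watch `prec`, and a scholarship office would watch `rank_score`. The same model can look strong on one of these and weak on another, which is why choosing two or three measures up front matters.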
One word of caution: ensure that a clear baseline accuracy for decision making exists before implementing an algorithm, whether based on historical human decisions, rudimentary scoring, or criteria-based approaches already in use. Knowing when the algorithm performs well and when it does not, relative to the baseline, is helpful both for making the case to use it and for establishing incentives for continued improvement of the algorithm.
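One way to make the baseline comparison concrete is to score the algorithm and the existing process on the same historical cases, broken out by segment. The sketch below assumes each record carries the observed outcome, the baseline decision, and the model decision. The field names (`segment`, `outcome`, `baseline`, `model`) and the data are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(records, decision_key):
    """Accuracy of one decision source (for example 'model') within each segment."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in records:
        counts[r["segment"]] += 1
        hits[r["segment"]] += int(r[decision_key] == r["outcome"])
    return {seg: hits[seg] / counts[seg] for seg in counts}

def lift_over_baseline(records):
    """Model accuracy minus baseline accuracy per segment (negative means worse)."""
    model_acc = accuracy_by_segment(records, "model")
    base_acc = accuracy_by_segment(records, "baseline")
    return {seg: model_acc[seg] - base_acc[seg] for seg in model_acc}

# Toy historical cases, invented for illustration only.
records = [
    {"segment": "urban", "outcome": 1, "baseline": 1, "model": 1},
    {"segment": "urban", "outcome": 0, "baseline": 1, "model": 0},
    {"segment": "urban", "outcome": 1, "baseline": 0, "model": 1},
    {"segment": "urban", "outcome": 0, "baseline": 0, "model": 1},
    {"segment": "rural", "outcome": 1, "baseline": 1, "model": 1},
    {"segment": "rural", "outcome": 0, "baseline": 0, "model": 1},
]

lift = lift_over_baseline(records)
print(lift)
```

In this toy data the model beats the baseline in one segment and trails it in the other, which is the kind of pattern the comparison is meant to surface.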
In our experience, machine learning can significantly improve accuracy relative to most traditional decision-making processes or systems. Its value can come from better resource-allocation decisions, such as matching the right types of rehabilitation programs in a corrections facility to the prisoners most likely to benefit from them. But it can also be valuable for improving efficiency, such as helping public-health case workers prioritize the right cases, as well as effectiveness, such as knowing which school programs are most effective at minimizing drop-outs.
There are many ways to define a fair algorithm, or “algorithmic fairness.”2 The notion reflects an interest in bias-free decision making or, when protected classes of individuals are involved, in avoiding disparate impact to legally protected classes.3 There is extensive literature on bias in algorithms and how it can manifest. Common issues include bias in the data sets and distortions in the algorithm’s analytical technique or in how humans interpret the data.
A critical first step is to establish what fairness means in the specific context of the use case—that is, what are the protected classes and what are the metrics for fairness. There are a few ways to measure and address fairness, not all of which may be equally effective in each instance:
- Willful blindness. One commonly used approach is to build a kind of blindness into the algorithm, so that it treats subgroups the same regardless of traditional distinctions between them, such as race, gender, or other socioeconomic factors. For example, if a school uses an algorithm to identify students at risk of dropping out, educators could deploy a model that uses gender-masked or gender-neutral records to identify those at the greatest risk. Yet even that kind of approach can be naive if it doesn’t account for cross-correlated variables, such as postal codes that could imply race, education level, or gender. Such an approach could lead to unfair outcomes or cause issues with the sample data used to train the model itself. It ends up creating an algorithm that is merely unaware, without any consideration of fairness.
- Demographic or statistical parity. Another way to address fairness is to ensure statistical parity in the decisions being enabled or in the outcomes, for example by selecting an equal share of people from both protected and nonprotected groups. One way to achieve this would be to set different thresholds for different groups to ensure parity in the outcomes for each group. An example would be an algorithm written to apply different credit-score thresholds for different demographic groups, in order to select the same proportion of applicants from each. However, this approach requires someone to constantly verify and modify the thresholds, and it often may not account for underlying differences in the subgroups. It is usually effective only when someone cares about a single measure of fairness, in this case, an equal share of loan-approval outcomes across genders.
- Predictive equality. Possibly the most balanced approach to address fairness is to not force it in the decision outcome, but rather in the algorithm’s performance (or accuracy) across different groups. In this definition, fairness means that the algorithm does not perform disproportionately better or worse for specific subgroups. That means, for example, that the error rates, or the prevalence of false positives or false negatives, are the same for each group, while accounting for variations in the underlying population. In our loan-applicant example, this means that we may not approve an equal share of loan applicants across genders, but the percent of approved applicants who end up defaulting (that is, the false positives) would be the same across genders. In other words, we are not disproportionately favoring or affecting either gender, as we are making the same rate of mistakes or errors in our selection. Fairness through predictive equality can be achieved through a set of nuanced debiasing practices used in the field of data science. Read the source post at McKinsey.
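The difference between statistical parity and predictive equality in the loan example can be checked with a few lines of code. The sketch below is an illustration under stated assumptions: each record has a hypothetical `group` label, an `approved` flag, and a `defaulted` flag that is only observed for approved applicants. Following the wording above, the error measure is the share of approved applicants who default. Other definitions of predictive equality, such as equal false-positive rates over all true negatives, would need a different denominator.

```python
from collections import defaultdict

def group_fairness_report(records):
    """Per-group selection rate and default rate among approved applicants."""
    stats = defaultdict(lambda: {"n": 0, "approved": 0, "approved_defaults": 0})
    for r in records:
        s = stats[r["group"]]
        s["n"] += 1
        if r["approved"]:
            s["approved"] += 1
            s["approved_defaults"] += int(r["defaulted"])
    report = {}
    for group, s in stats.items():
        report[group] = {
            # Statistical parity compares this value across groups.
            "selection_rate": s["approved"] / s["n"],
            # Predictive equality, as described above, compares this value.
            "default_rate_among_approved": (
                s["approved_defaults"] / s["approved"] if s["approved"] else None
            ),
        }
    return report

def max_gap(report, metric):
    """Largest difference in a metric between any two groups (None values skipped)."""
    values = [v[metric] for v in report.values() if v[metric] is not None]
    return max(values) - min(values) if values else 0.0

# Toy applicants, invented for illustration. 'defaulted' is None when not approved,
# because repayment is never observed for a denied loan.
records = (
    [{"group": "A", "approved": 1, "defaulted": 1},
     {"group": "A", "approved": 1, "defaulted": 0},
     {"group": "A", "approved": 0, "defaulted": None},
     {"group": "A", "approved": 0, "defaulted": None}]
    + [{"group": "B", "approved": 1, "defaulted": 1},
       {"group": "B", "approved": 1, "defaulted": 0}]
    + [{"group": "B", "approved": 0, "defaulted": None}] * 6
)

report = group_fairness_report(records)
parity_gap = max_gap(report, "selection_rate")
error_gap = max_gap(report, "default_rate_among_approved")
print(report, parity_gap, error_gap)
```

In this toy data the two groups are approved at different rates, so statistical parity does not hold, yet the default rate among approved applicants is identical, which is what the predictive-equality view asks for.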