Judges let algorithms help them make decisions, except when they don’t

When Northwestern University graduate student Sino Esthappan began researching how algorithms decide who stays in jail, he expected “a story about humans versus technology.” On one side would be human judges, whom Esthappan interviewed extensively. On the other would be risk assessment algorithms, which are used in hundreds of US counties to assess the danger of granting bail to accused criminals. What he found was more complicated, and it suggests these tools could obscure bigger problems with the bail system itself.

Algorithmic risk assessments are intended to calculate the risk of a criminal defendant not returning to court — or, worse, harming others — if they’re released. By comparing criminal defendants’ backgrounds to a vast database of past cases, they’re supposed to help judges gauge how risky releasing someone from jail would be. Along with other algorithm-driven tools, they play an increasingly large role in a frequently overburdened criminal justice system. And in theory, they’re supposed to help reduce bias from human judges.

But Esthappan’s work, published in the journal Social Problems, found that judges aren’t wholesale adopting or rejecting the advice of these algorithms. Instead, they report using them selectively, motivated by deeply human factors to accept or disregard their scores.

Pretrial risk assessment tools estimate the likelihood that accused criminals will return for court dates if they’re released from jail. The tools take in details fed to them by pretrial officers, including things like criminal history and family profiles. They compare this information with a database that holds hundreds of thousands of previous case records, looking at how defendants with similar histories behaved. Then they deliver an assessment that could take the form of a “low,” “medium,” or “high” risk label or a number on a scale. Judges are given the scores for use in pretrial hearings: short meetings, held soon after a defendant is arrested, that determine whether (and on what conditions) they’ll be released.
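To make that pipeline concrete, here is a minimal sketch in Python of the general idea: look up past cases with a similar history, measure how often those defendants missed court, and turn that rate into a label. Everything in it is an assumption made for illustration, including the `PastCase` fields, the `similar` matching rule, and the 20 and 50 percent cutoffs. It is not how any tool named in this article actually works, and real tools use more inputs and more elaborate statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastCase:
    """One hypothetical record from a database of earlier cases."""
    prior_arrests: int
    prior_missed_court_dates: int
    missed_court: bool  # outcome observed after that defendant was released


def similar(case, prior_arrests, prior_missed):
    # Hypothetical matching rule: two histories count as "similar" when they
    # fall in the same coarse bucket (arrests capped at 3, missed dates at 2).
    return (
        min(case.prior_arrests, 3) == min(prior_arrests, 3)
        and min(case.prior_missed_court_dates, 2) == min(prior_missed, 2)
    )


def risk_label(history, prior_arrests, prior_missed, low=0.2, high=0.5):
    """Return (rate, label) based on how similar past defendants behaved."""
    matches = [c for c in history if similar(c, prior_arrests, prior_missed)]
    if not matches:
        return None, "unknown"  # no comparable cases in the database
    rate = sum(c.missed_court for c in matches) / len(matches)
    if rate < low:
        label = "low"
    elif rate < high:
        label = "medium"
    else:
        label = "high"
    return rate, label


# Made-up records, not real data.
history = (
    [PastCase(0, 0, False)] * 9
    + [PastCase(0, 0, True)]
    + [PastCase(5, 2, True)] * 3
    + [PastCase(4, 3, False)]
)

print(risk_label(history, 0, 0))
print(risk_label(history, 6, 2))
print(risk_label(history, 1, 0))
```

Even in this toy version, the label depends entirely on which past cases count as "similar" and where the cutoffs sit, which are design choices made by people rather than facts about the defendant.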

As with other algorithmic criminal justice tools, supporters position them as neutral, data-driven correctives to human capriciousness and bias. Opponents raise issues like the risk of racial profiling. “Because a lot of these tools rely on criminal history, the argument is that criminal history is also racially encoded based on law enforcement surveillance practices,” Esthappan says. “So there already is an argument that these tools are reproducing biases from the past, and they’re encoding them into the future.”

It’s also not clear how well they work. A 2016 ProPublica investigation found that a risk score algorithm used in Broward County, Florida, was “remarkably unreliable in forecasting violent crime.” Just 20 percent of those the algorithm predicted would commit violent crimes actually did so within two years of their arrest. The program was also more likely to label Black defendants as future criminals or as higher risk than white defendants, ProPublica found.

Still, University of Pennsylvania criminology professor Richard Berk argues that human decision-makers can be just as flawed. “These criminal justice systems are made with human institutions and human beings, all of which are imperfect, and not surprisingly, they don’t do a very good job in identifying or forecasting people’s behaviors,” Berk says. “So the bar is really pretty low, and the question is, can algorithms raise the bar? And the answer is yes, if proper information is provided.”

Both the fears and promises around algorithms in the courtroom, however, assume judges are consistently using them. Esthappan’s study shows that’s a flawed assumption at best.

Esthappan interviewed 27 judges across four criminal courts in different regions of the country over one year between 2022 and 2023, asking questions like, “When do you find risk scores more or less useful?” and “How and with whom do you discuss risk scores in pretrial hearings?” He also analyzed local news coverage and case files, observed 50 hours of bond court, and interviewed others who work in the judicial system to help contextualize the findings.

Judges told Esthappan that they used algorithmic tools to process lower-stakes cases quickly, leaning on automated scores even when they weren’t confident in their legitimacy. Overall, they were leery of following low risk scores for defendants accused of offenses like sexual assault and intimate partner violence — sometimes because they believed the algorithms under- or over-weighted various risk factors, but also because their own reputations were on the line. And conversely, some described using the systems to explain why they’d made an unpopular decision — believing the risk scores added authoritative weight.

The interviews revealed recurring patterns in judges’ decisions to use risk assessment scores, frequently based on defendants’ criminal history or social background. Some judges believed the systems underestimated the importance of certain red flags — like extensive juvenile records or certain kinds of gun charges — or overemphasized factors like an old criminal record or low education level. “Many judges deployed their own moral views about specific charges as yardsticks to decide when risk scores were and were not legitimate in the eyes of the law,” Esthappan writes.

Some judges also said they used the scores as a matter of efficiency. These pretrial hearings are short — often less than five minutes — and require snap decisions based on limited information. The algorithmic score at least provides one more factor to consider.

However, judges were also keenly aware of how a decision would reflect on them, and according to Esthappan, this was a huge factor in whether they trusted risk scores. When judges saw a charge they believed to be less of a public safety issue and more of a result of poverty or addiction, they would often defer to risk scores. They saw little risk to their own reputation if they got it wrong, and they viewed their role, as one judge described it, as calling “balls and strikes,” rather than becoming a “social engineer.”

For high-level charges that involved some sort of moral weight, like rape or domestic violence, judges said they were more likely to be skeptical. This was partly because they identified problems with how the system weighted information for specific crimes — in intimate partner violence cases, for instance, they believed even defendants without a long criminal history could be dangerous. But they also recognized that the stakes — for themselves and others — were higher. “Your worst nightmare is you let someone out on a lower bond and then they go and hurt someone. I mean, all of us, when I see those stories on the news, I think that could have been any of us,” said one judge quoted in the study.

Keeping a truly low-risk defendant in jail has costs, too. It keeps someone who’s unlikely to harm anyone away from their job, their school, or their family before they’ve been convicted of a crime. But there’s little reputational risk for judges — and adding a risk score doesn’t change that calculus.

The deciding factor for judges often wasn’t whether the algorithm seemed trustworthy, but whether it would help them justify a decision they wanted to make. Judges who released a defendant based on a low risk score, for instance, could “shift some of that accountability away from themselves and towards the score,” Esthappan said. If an alleged victim “wants someone locked up,” one subject said, “what you’ll do as the judge is say ‘We’re guided by a risk assessment that scores for success in the defendant’s likelihood to appear and rearrest. And, based on the statute and this score, my job is to set a bond that protects others in the community.’”

Esthappan’s study pokes holes in the idea that algorithmic tools result in fairer, more consistent decisions. If judges are picking when to rely on scores based on factors like reputational risk, Esthappan notes, they may not be reducing human-driven bias — they could actually be legitimizing that bias and making it hard to spot. “Whereas policymakers tout their ability to curb judicial discretion, in practice, risk scores expand the uses of discretion among judges who strategically use them to justify punitive sanctions,” Esthappan writes in the study.

Megan Stevenson, an economist and criminal justice scholar at the University of Virginia School of Law, says risk assessments are something of “a technocratic toy of policymakers and academics.” She says they have seemed like an attractive way to try to “take the randomness and the uncertainty out of this process,” but studies of their impact suggest they often don’t have a major effect on outcomes either way.

A larger problem is that judges are forced to work with highly limited time and information. Berk, the University of Pennsylvania professor, says collecting more and better information could help the algorithms make better assessments. But that would require time and resources court systems may not have.

But when Esthappan interviewed public defenders, they raised an even more fundamental question: should pretrial detention, in its current form, exist at all? Judges aren’t just working with spotty data. They’re determining someone’s freedom before that person even gets a chance to fight their charges, often based on predictions that are largely guesswork. “Within this context, I think it makes sense that judges would rely on a risk assessment tool because they have so limited information,” Esthappan tells The Verge. “But on the other hand, I sort of see it as a bit of a distraction.”

Algorithmic tools are aiming to address a real issue with imperfect human decision-making. “The question that I have is, is that really the problem?” Esthappan tells The Verge. “Is it that judges are acting in a biased way, or is there something more structurally problematic about the way that we’re hearing people at pretrial?” The answer, he says, is that “there’s an issue that can’t necessarily be fixed with risk assessments, but that it goes into a deeper cultural issue within criminal courts.”