June 5, 2016
Looking into the Wisconsin case looking into the use of risk-assessment tools at sentencing
The Wall Street Journal has this effective new article discussing the case now before the Wisconsin Supreme Court considering a defendant's challenge to the use of a risk assessment tool in the state's sentencing process. The article's full headline notes the essentials: "Wisconsin Supreme Court to Rule on Predictive Algorithms Used in Sentencing: Ruling would be among first to speak to legality of risk assessments as aid in meting out punishments." And here is more from the body of the article:
Algorithms used by authorities to predict the likelihood of criminal conduct are facing a major legal test in Wisconsin. The state’s highest court is set to rule on whether such algorithms, known as risk assessments, violate due process and discriminate against men when judges rely on them in sentencing. The ruling, which could come any time, would be among the first to speak to the legality of risk assessments as an aid in meting out punishments.
Criminal justice experts skeptical of such tools say they are inherently biased, treating poor people as riskier than those who are well off. Proponents of risk assessments say they have elevated sentencing to something closer to a science. “Evidence has a better track record for assessing risks and needs than intuition alone,” wrote Christine Remington, an assistant attorney general in Wisconsin, in a legal brief filed in January defending the state’s use of the evaluations.
Risk-evaluation tools have gained in popularity amid efforts around the country to curb the number of repeat offenders. They help authorities sort prisoners, set bail and weigh parole decisions. But their use in sentencing is more controversial.
Before the sentencing of 34-year-old Eric Loomis, whose case is before the state’s high court, Wisconsin authorities evaluated his criminal risk with a widely used tool called COMPAS, or Correctional Offender Management Profiling for Alternative Sanctions, a 137-question test that covers criminal and parole history, age, employment status, social life, education level, community ties, drug use and beliefs. The assessment includes queries like, “Did a parent figure who raised you ever have a drug or alcohol problem?” and “Do you feel that the things you do are boring or dull?” Scores are generated by comparing an offender’s characteristics to a representative criminal population of the same sex.
Prosecutors said Mr. Loomis was the driver of a car involved in a drive-by shooting in La Crosse, Wis., on Feb. 11, 2013. Mr. Loomis denied any involvement in the shooting, saying he drove the car only after it had occurred. He pleaded guilty in 2013 to attempting to flee police in a car and operating a vehicle without the owner’s consent and was sentenced to six years in prison and five years of supervision. “The risk assessment tools that have been utilized suggest that you’re extremely high risk to reoffend,” Judge Scott Horne in La Crosse County said at Mr. Loomis’s sentencing.
Mr. Loomis said in his appeal that Judge Horne’s reliance on COMPAS violated his right to due process, because the company that makes the test, Northpointe, doesn’t reveal how it weighs the answers to arrive at a risk score. Northpointe General Manager Jeffrey Harmon declined to comment on Mr. Loomis’s case but said algorithms that perform the risk assessments are proprietary. The outcome, he said, is all that is needed to validate the tools. Northpointe says its studies have shown COMPAS’s recidivism risk score to have an accuracy rate of 68% to 70%. Independent evaluations have produced mixed findings.
Mr. Loomis also challenged COMPAS on the grounds that the evaluation treats men as higher risk than women. COMPAS compares women only to other women because they “commit violent acts at a much lower rate than men,” wrote Ms. Remington, the state’s lawyer, in her response brief filed earlier this year in the Wisconsin Supreme Court. Having two scales — one for men and one for women — is good science, not gender bias, she said.
The parties appeared to find common ground on at least one issue. “A court cannot decide to place a defendant in prison solely because of his score on COMPAS,” Ms. Remington acknowledged, describing it as “one of many factors a court can consider at sentencing.” Her comments echoed a 2010 ruling by the Indiana Supreme Court holding that risk assessments “do not replace but may inform a trial court’s sentencing determinations.”
June 5, 2016 at 12:17 PM
Welcome to modern-day haruspicy. Now we use computer models instead of sheep livers, but the principle is the same.
The witch doctor rolls the bones to show the gods favor this fellow, and curse that one. Anyone with a lick of sense would see this is rigged to benefit nice people (like us), and ensure the horrors of the carceral state only fall on those other nasty folks.
Posted by: Boffin | Jun 5, 2016 6:13:23 PM
While different metrics have been used at different times, the concept of an "objective" risk assessment tool is not new. Way back in the pre-sentencing guidelines days, the federal parole agency used a "salient factor" score as part of parole consideration in an attempt to quantify which offenders were likely to re-offend.
There are three major concerns with any "objective" scale for quantifying risk. First, we are talking about probabilities. Some offenders who are categorized as a high risk to re-offend will not, but some who are categorized as low risk will. Second, it is difficult to determine the appropriate weight for factors. For example, many would give some weight to drug abuse, but should drug abuse be worth 2% of the score or 4% of the score? Additionally, some factors may be correlated (e.g., mental disease and drug abuse) rather than truly independent. Third, there are some potential factors that would be controversial (particularly those with a genetic component). Do we use those to get a more accurate tool (recognizing that those factors may actually be correlated with the real factors), or do we try to find and isolate the underlying factors that are truly relevant?
Statistics is not the same as the old-fashioned reading of sacrificial omens. But statistics are only as good as the science, modeling, and mathematics behind them. Time and time again in the social sciences, statistical modeling has been shown to be somewhat subjective, with minor differences in weighting leading to vastly different conclusions. Unfortunately, policy makers tend not to be experts in statistics who are able to make appropriate calls in structuring and choosing between tools, leaving too much influence to the men (and women) behind the curtain who do not want anybody looking too closely at their models.
Posted by: tmm | Jun 6, 2016 2:04:34 PM
Reading sheep livers is "evidence-based" too! And the haruspex had as much formal training as any contemporary statistician.
The problem is taking the wrong null hypothesis. What we have now is proof by incredulity: "I can't believe these computer models aren't useful for something or other."
I dare you - just as a thought experiment - to consider a proper null hypothesis: That these computer widgets don't predict anything at all. Then ask, What sort of experiments could prove this hypothesis false? If you are honest, you'll admit that it could never be done. The complexities of human behavior and societal changes over time make any model useless.
Posted by: Boffin | Jun 6, 2016 3:11:07 PM
The thought experiment for testing is simple. The problem is that the thought experiment only disproves the particular model.
As an initial point, the validity of a statistical model requires distinguishing between probability and absolute certainty. Using the Elo chess ratings as a good example of a proven predictive statistical probability model, the gap in the ratings of two chess players (the ratings being based on past results) is a very good predictive tool for who will win. The bigger the gap between the two players, the less likely that the lower-rated player will win. Over time (approximately 80 years of use), statisticians can put a real probability on the chances that the lower-rated player will win. History has shown that the ranking system measures something that is real and significant. Yes, sometimes the model will predict the wrong result, but at any significant gap (say 100 points on a scale in which the top players are at around 2700), the number of wrong results is both relatively small and close to what the system projects.
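To make the rating-gap point concrete, here is a minimal sketch of the standard Elo expected-score formula. The comment above does not state the formula; the function name is illustrative, and the 400-point scale constant is the conventional one. Note that the value is an expected score (win = 1, draw = 0.5), not a pure win probability.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A against player B under the standard Elo model.

    A win counts as 1, a draw as 0.5, and a loss as 0, so this is an
    expected score rather than a pure win probability.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    top_rating = 2700  # illustrative rating for a top player
    for gap in (0, 100, 200, 400):
        lower_rated = elo_expected_score(top_rating - gap, top_rating)
        print(f"gap {gap:>3}: lower-rated player's expected score = {lower_rated:.3f}")
```

Under this formula, a 100-point gap gives the lower-rated player an expected score of roughly 0.36 and the higher-rated player roughly 0.64. The model is considered validated because observed results over many games track these projections closely, not because the favorite always wins.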
As to the merits of any assessment of criminal behavior, testing begins with designing the model/scoring system and defining what it signifies. Then, you apply it to past cases (outside of the sample used to design the model). Say, for example, 2,000 inmates who received parole ten to fifteen years ago. The test is whether the model accurately distinguishes, in a statistically meaningful way, those who will re-offend from those who will not. (E.g., if the going recidivism rate is that 60% will re-offend if randomly selected, can the model pull out a group that only has a 20% recidivism rate and another group that has a 90% recidivism rate?) Take a second sample from a different state: does the model work in that state as well? If the model is unable to work with a random sample of past offenders, then it is flawed.
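The hold-out test described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the score cutoffs, and the ten-person sample are all invented for the example and are not real recidivism data or the method used by any actual tool. It groups past offenders by the score the model assigned them and compares each group's observed re-offense rate with the overall base rate.

```python
from collections import defaultdict


def rate_by_risk_group(scores, reoffended, low_cut, high_cut):
    """Return the observed re-offense rate for low, medium, and high score bands.

    scores     -- risk scores the model assigned to past offenders (hold-out sample)
    reoffended -- 1 if that person later re-offended, 0 otherwise
    low_cut    -- scores below this are "low" (hypothetical cutoff)
    high_cut   -- scores at or above this are "high" (hypothetical cutoff)
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [re-offenders, total]
    for score, outcome in zip(scores, reoffended):
        if score < low_cut:
            group = "low"
        elif score < high_cut:
            group = "medium"
        else:
            group = "high"
        counts[group][0] += int(outcome)
        counts[group][1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}


if __name__ == "__main__":
    # Invented toy sample: ten past parolees with model scores and known outcomes.
    scores = [1, 2, 2, 3, 5, 5, 6, 8, 9, 9]
    reoffended = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

    base_rate = sum(reoffended) / len(reoffended)
    rates = rate_by_risk_group(scores, reoffended, low_cut=4, high_cut=8)

    print(f"base rate: {base_rate:.2f}")
    for group in ("low", "medium", "high"):
        print(f"{group:>6}: {rates[group]:.2f}")
```

A model that separates groups this cleanly on a toy sample proves nothing by itself. A real validation would use a sample of realistic size, a test of statistical significance, and a second sample from a different jurisdiction, as the comment describes.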
The distribution of crime (at least in "the law in its majesty equally forbids the rich and the poor to sleep on the banks of the River Seine" sense) does not appear to be purely random. The wealthy businessman might commit a large-scale fraud scheme, but is unlikely to commit a bank robbery. The street-level drug dealer is likely to commit additional drug offenses, but is no more likely than anybody else to commit a sexual offense. In short, we know that at least some factors correlate with crime (failure to complete high school, committing juvenile offenses, having untreated mental illness, using controlled substances). While human beings and their behavior are complex on a micro level, they are less so on a macro level. At the scale of large numbers, individual idiosyncrasies cancel each other out. The problem is not with the concept of modeling; it is with the actual models.
Posted by: tmm | Jun 6, 2016 3:59:57 PM