January 18, 2018
New research findings by computer scientists "cast significant doubt on the entire effort of algorithmic recidivism prediction"
This new research article in the latest issue of Science Advances provides a notable perspective on the debate over risk assessment instruments. The article is authored by computer scientists Julia Dressel and Hany Farid and is titled "The accuracy, fairness, and limits of predicting recidivism." Here are parts of its introduction:
In the criminal justice system, predictive algorithms have been used to predict where crimes will most likely occur, who is most likely to commit a violent crime, who is likely to fail to appear at their court hearing, and who is likely to reoffend at some point in the future.
One widely used criminal risk assessment tool, Correctional Offender Management Profiling for Alternative Sanctions (COMPAS; Northpointe, which rebranded itself to “equivant” in January 2017), has been used to assess more than 1 million offenders since it was developed in 1998. The recidivism prediction component of COMPAS — the recidivism risk scale — has been in use since 2000. This software predicts a defendant’s risk of committing a misdemeanor or felony within 2 years of assessment from 137 features about an individual and the individual’s past criminal record.
Although the data used by COMPAS do not include an individual’s race, other aspects of the data may be correlated to race that can lead to racial disparities in the predictions. In May 2016, writing for ProPublica, Angwin et al. analyzed the efficacy of COMPAS on more than 7000 individuals arrested in Broward County, Florida between 2013 and 2014. This analysis indicated that the predictions were unreliable and racially biased. COMPAS’s overall accuracy for white defendants is 67.0%, only slightly higher than its accuracy of 63.8% for black defendants. The mistakes made by COMPAS, however, affected black and white defendants differently: Black defendants who did not recidivate were incorrectly predicted to reoffend at a rate of 44.9%, nearly twice as high as their white counterparts at 23.5%; and white defendants who did recidivate were incorrectly predicted to not reoffend at a rate of 47.7%, nearly twice as high as their black counterparts at 28.0%. In other words, COMPAS scores appeared to favor white defendants over black defendants by underpredicting recidivism for white and overpredicting recidivism for black defendants....
While the debate over algorithmic fairness continues, we consider the more fundamental question of whether these algorithms are any better than untrained humans at predicting recidivism in a fair and accurate way. We describe the results of a study that shows that people from a popular online crowdsourcing marketplace — who, it can reasonably be assumed, have little to no expertise in criminal justice — are as accurate and fair as COMPAS at predicting recidivism. In addition, although Northpointe has not revealed the inner workings of their recidivism prediction algorithm, we show that the accuracy of COMPAS on one data set can be explained with a simple linear classifier. We also show that although COMPAS uses 137 features to make a prediction, the same predictive accuracy can be achieved with only two features. We further show that more sophisticated classifiers do not improve prediction accuracy or fairness. Collectively, these results cast significant doubt on the entire effort of algorithmic recidivism prediction.
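The group-wise error rates quoted in the excerpt (false positives among those who did not recidivate, false negatives among those who did) can be computed from per-defendant predictions and outcomes. What follows is a minimal Python sketch, not the authors' or ProPublica's code. The function name `error_rates_by_group` and the toy records are hypothetical and exist only to illustrate the arithmetic.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    Each record is (group, predicted_reoffend, actually_reoffended),
    with the last two given as booleans.
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, predicted, actual in records:
        if actual:
            counts[group]["tp" if predicted else "fn"] += 1
        else:
            counts[group]["fp" if predicted else "tn"] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who did not reoffend
        positives = c["fn"] + c["tp"]  # people who did reoffend
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates


# Toy, made-up records (NOT the Broward County data).
toy_records = [
    ("A", True, False), ("A", False, False), ("A", True, True), ("A", True, True),
    ("B", False, False), ("B", False, False), ("B", False, True), ("B", True, True),
]
rates = error_rates_by_group(toy_records)
```

In this toy data both groups are classified correctly 75% of the time, yet group A has a 50% false positive rate and no false negatives, while group B has the reverse. That is the same kind of pattern the excerpt describes: similar overall accuracy, but mistakes that fall differently on each group.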
A few (of many) prior related posts on risk assessment tools:
- ProPublica takes deep dive to identify statistical biases in risk assessment software
- "Assessing Risk Assessment in Action"
- Thoughtful account of what to think about risk assessment tools
- "The Use of Risk Assessment at Sentencing: Implications for Research and Policy"
- Wisconsin Supreme Court rejects due process challenge to use of risk-assessment instrument at sentencing
- "In Defense of Risk-Assessment Tools"
- Parole precogs: computerized risk assessments impacting state parole decision-making
- Thoughtful look into fairness/bias concerns with risk-assessment instruments like COMPAS
- "Gender, Risk Assessment, and Sanctioning: The Cost of Treating Women Like Men"
- Expressing concerns about how risk assessment algorithms learn
- "Under the Cloak of Brain Science: Risk Assessments, Parole, and the Powerful Guise of Objectivity"
January 18, 2018 at 10:28 PM
Cast significant doubt? No.
The differences are meaningless in the real world. They have a large sample. That will make tiny differences in random directions get statistical significance. These differences are so small, they may be automatically dismissed as not meaningful.
More propaganda from Ivy League pro-criminal assholes. Dismissed.
Posted by: David Behar | Jan 18, 2018 11:42:44 PM
I am not sure that is the proper way to measure accuracy.
The scores given by these tools identify the likelihood that any person would re-offend. That is sort of like predicting the likelihood of which team is going to win a sports event. After the event, the percentage is necessarily wrong -- the accurate percentages are 100 and 0 for the outcomes.
A better measurement of accuracy is whether the tool reflects what we know and do not know about why somebody re-offends. Because of the unknowns, any tool can only place an offender into a risk group (e.g. 35% chance of re-offending). The question for accuracy is whether it places an offender in the right "risk" group. If you broke the score into deciles (0-9%, 10-19%, and so on) by the predicted chance of re-offending and then measured the actual results for each decile, then you would know if the result given by the tool accurately reflected the risk that a given person would re-offend. In other words, does the group that is supposed to re-offend 30-39% of the time re-offend 30-39% of the time?
Posted by: tmm | Jan 19, 2018 10:54:51 AM
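The decile check tmm describes is usually called a calibration test, and it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the tool outputs a probability between 0 and 1 for each person, and the function name `calibration_by_decile` and the toy numbers are hypothetical, not taken from COMPAS or the study.

```python
def calibration_by_decile(scores, outcomes):
    """Group predicted probabilities into deciles and compare with observed rates.

    scores: predicted probabilities of re-offending, each in [0, 1].
    outcomes: 1 if the person re-offended, 0 otherwise.
    Returns a list of (decile_label, count, observed_rate) for non-empty bins.
    """
    bins = [[0, 0] for _ in range(10)]  # [count, reoffended] per decile
    for score, outcome in zip(scores, outcomes):
        index = min(int(score * 10), 9)  # a score of 1.0 falls in the top bin
        bins[index][0] += 1
        bins[index][1] += outcome

    table = []
    for i, (count, reoffended) in enumerate(bins):
        if count:
            label = f"{i * 10}-{i * 10 + 9}%"
            table.append((label, count, reoffended / count))
    return table


# Toy, made-up scores and outcomes for illustration only.
toy_scores = [0.05, 0.08, 0.35, 0.32, 0.38, 0.31, 0.95, 0.91]
toy_outcomes = [0, 0, 1, 0, 0, 0, 1, 1]
table = calibration_by_decile(toy_scores, toy_outcomes)
```

Each row of `table` pairs a predicted risk band with the share of people in that band who actually re-offended. Under this measure a tool is accurate when the observed rate for each band falls inside, or close to, the band itself. With real data each band would need far more people than the toy example has for the comparison to mean anything.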
Define proper....it seems to me that you and the researchers simply have two different ways of measuring validity. They would define valid results as results that are free from invidious bias. You would define a valid result as one that measures what it says it measures. Neither is per se wrong, indeed they both may be correct. In short, I do think their way of defining validity is correct but I wouldn't dispute the claim that it is not the only way to define validity.
Posted by: Selfie Man | Jan 19, 2018 3:33:28 PM
TMM. You raise a devastating problem. It makes evidence based practice a form of utter nonsense and quackery.
A drug is superior to placebo in a group of people. More people respond to it than to placebo. However, you are a doctor facing a single patient and making a decision about that person. This is the same scenario as a judge facing a defendant, and not addressing a group of defendants.
Evidence based legal practice faces the same insurmountable problems.
Problems: 1) delineation by academic professors with half the clinical experience and therefore half the insider knowledge of practitioners; 2) obsolescence; 3) based on wrong statistical application; 4) violation of the rules of statistical testing by exclusion criteria in all studies; 5) misapplication to individual patients (a treatment killed 99% of patients who had it, this patient has done well on it, follow guidelines and stop this effective treatment?); 6) ignorance of the individualized dose-response curve.
In a legal context of any kind, the suborning of quackery is a violation of the Fifth Amendment procedural due process rights of the defendant to a fair hearing. Evidence based medicine is itself a constitutional tort. I would urge all defendants to sue the plaintiff lawyer, the plaintiff, the plaintiff experts, guideline writers, as individuals, their universities, their chairman that failed to supervise them, their association. The association should be charged with civil RICO. That is subject to punitive damages (triple) because it is an intentional act, not just negligence. If it can be shown to have been financially self serving, it can be converted to criminal RICO, and the guideline writers should be arrested, tried, and sentenced to prison. To deter.
More detail here.
Posted by: David Behar | Jan 19, 2018 4:40:35 PM