
August 9, 2021

Guest post: another critical look at provocative paper claiming to identify the "most discriminatory" federal sentencing judges

I expressed concerns in this recent post about a new empirical paper making claims regarding the "most discriminatory" federal sentencing judges.  Upon seeing this Twitter thread by Prof. Jonah Gelbach about the work, I asked the good professor if he might turn his thread into a guest post. He obliged with this impressive essay:


This post will comment on the preprint of The Most Discriminatory Federal Judges Give Black and Hispanic Defendants At Least Double the Sentences of White Defendants, by Christian Michael Smith, Nicholas Goldrosen, Maria-Veronica Ciocanel, Rebecca Santorella, Chad M. Topaz, and Shilad Sen.  Doug Berman blogged about it here, and I’m grateful to him for the opportunity to publish this post here.

As I explained in a Twitter thread over the weekend, I have serious concerns about the study.  The most important concerns I raised in that thread fall into the following categories:

  1. Incomplete data
  2. Endogeneity of included regressors
  3. Small numbers of observations per judge
  4. Use of most extreme judge-specific disparity estimates

I’ll take these in turn.

(1) Incomplete Data.  It’s complicated to explain the data, whose construction involves merging multiple large data sets.  In fact, a subset of the authors have a whole separate paper about data construction.  In brief, the data are constructed by linking US Sentencing Commission data files to those in the Federal Judicial Center’s Integrated Data Base, which gives them enough information to form docket numbers. They then use the Free Law Project’s Juriscraper tool (https://free.law/projects/juriscraper/) to query PACER, which yields dockets with attached judges’ initials for most cases that were matched earlier in the authors’ pipeline.  The authors use those initials to identify the judge they believe handled sentencing, using public lists of judges by district.

As involved as the data construction is, my primary concern is simple: the share of cases included in the data set the authors use is very low.  For 2001-2018, there were 1.27 million sentences in USSC data and 1.46 million in FJC data (these figures come from the data-construction paper, which is why they apply to the 2001-2018 period rather than the 2006-2019 period used in the estimation of “Most Discriminatory” judges).  Of these records, the authors were able to match 860k sentences, of which they matched 809k to dockets via Juriscraper.  After using initials to match judges, they have 596k cases they think are matched.  That’s a match rate of less than 50% based on the USSC data and barely 40% based on the FJC data.  The authors can’t tell us much about the characteristics of missing cases, and it’s clear to me from reading the newer paper that the match rate varies substantially across districts.

I think this much alone is enough to make it irresponsible to report estimates that purport to measure individually named judges’ degrees of discrimination.  As a thought experiment, suppose that (i) the authors have half the data, and (ii) if they were able to include the other half of the data they would find that there was no meaningful judge-level variation in estimated racial disparities in sentencing.  By construction, that would render any discussion of the “Most Discriminatory” judges pointless.  Because the authors can’t explain why cases are missed, they have no way to rule out even such an extreme possibility.  Nor do they determine what share of cases they miss for any judge in the data, because they have no measure of the denominator (perhaps they could do this with Westlaw or similar searches for some individual judges).  Their approach to the issue of missing data is to simply assume that missing cases are missing at random:

“One unknown potential source of error is that we cannot determine what percentage of each judge’s cases were matched in the JUSTFAIR database. If this missingness is as-if random with respect to sentencing variables of interest, that should not bias our results, but we have little way of determining this.” (Pages 18-19, emphasis added.)

I believe it is irresponsible to name individual judges as “The Most Discriminatory” on the basis of data as incomplete as these.

(2) Endogeneity.  The authors include as controls in their model each defendant’s guideline-minimum sentence, variables accounting for the type of charge, and various defendant characteristics.  They argue that these variables are enough to deal not only with the enormous amount of missing data (with unknown selection mechanism; see above) but also with any concerns that would arise even if all cases were available.  As Doug Berman previously noted here, if prosecutors offer plea deals of differing generosity to defendants of different races, then the guideline minimum doesn’t account for heterogeneity in cases.  And note that if that happens in general, it’s a problem for all the model’s estimates. In other words, even if the particular mechanism Doug hypothesized (sweet plea deals for Black defendants in the EDPA) doesn’t hold, the whole model is suspect if the guidelines variable is substantially endogenous.

There are other endogeneity concerns, e.g., the study includes as regressors variables that capture reasons why a sentence departed from the guidelines — an outcome that is itself partly a function of the sentence whose (transformed) value is on the left hand side of the model.  And as a friend suggested to me after I posted my Twitter thread, the listed charges are often the result of plea bargains, whose consummation can be expected to depend on the expected sentence.  So the guideline minimum variable, too, is potentially endogenous.

(3) Small numbers of observations per judge. The primary estimates on which the claims about particular judges’ putatively discriminatory sentencing are based are what are known as random-effect coefficients on race dummies.  It would be lengthy to explain all the machinery here, but I’ll take a crack at a simplified description.

The key model outputs on which the authors base their “Most Discriminatory” designations are judge-level estimated Black-White disparities (the same type of analysis applies for Hispanic-White disparity).  Very roughly speaking, you can think of the estimated disparity for Judge J as an average of two things: (i) the overall observed Black-White disparity across all judges — call this the “overall disparity”, and (ii) the average disparity in the subset of cases in which Judge J did the sentencing — call this the “judge-specific raw disparity”.

For example, suppose that over all defendants, the average (transformed) sentence is 9% longer among Black defendants than among White ones; then the overall disparity would be 9%.  Now suppose that among defendants assigned to Judge J, average sentences were 20% longer for Black than White defendants; then the judge-specific raw disparity would be 20%.

The judge-level estimated disparity that results from the kind of model the authors use is a weighted average of the overall disparity and the judge-specific raw disparity. So in our example, the estimated disparity for Judge J would be a weighted average of 9% (overall disparity) and 20% (judge-specific raw disparity).  What are the weights used to form this average?  They depend on the variance across judges in the true judge-specific disparity and the “residual” variance of individual sentences — the variance that is unassociated with factors that the model indicates help explain variation in sentences.

The greater the residual variance, the less weight will be put on the judge-specific raw disparity.  This is what’s known as the “shrinkage” property of mixed models — they shrink the weight placed on judge-specific raw disparities in order to reduce the noisiness of the model’s estimated disparity for each judge. (I noted this property in a follow-up tweet to part of my thread.)
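To make the weighted-average idea concrete, here is a minimal Python sketch of the shrinkage calculation, using the hypothetical 9%/20% numbers from the example above. The weight formula here is the standard empirical-Bayes form for a judge-level effect; the function name and the exact formula are my simplification for illustration, not the authors' actual mixed model.

```python
def shrunk_disparity(raw_disparity, overall_disparity, tau2, sigma2, n_black, n_white):
    """Weighted average of a judge's raw disparity and the overall disparity.

    tau2   : variance across judges in true judge-level disparities
    sigma2 : residual variance of an individual (transformed) sentence
    The sampling variance of this judge's raw disparity is roughly
    sigma2 * (1/n_black + 1/n_white).
    """
    sampling_var = sigma2 * (1.0 / n_black + 1.0 / n_white)
    w = tau2 / (tau2 + sampling_var)  # weight on the judge-specific raw disparity
    return w * raw_disparity + (1.0 - w) * overall_disparity

# Hypothetical judge: raw disparity 20%, overall disparity 9%,
# 40 Black and 40 White defendants
est = shrunk_disparity(0.20, 0.09, tau2=0.055, sigma2=1.59, n_black=40, n_white=40)
print(round(est, 3))  # ~0.135: shrunk well away from the raw 20% toward the overall 9%
```

Note that the noisier the individual sentences (larger sigma2) or the fewer cases the judge heard, the smaller the weight w, and the more the model pulls the judge's estimate toward the overall disparity.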

However, all else equal, greater residual variance also means that variation in judge-specific raw disparities will be driven more by randomness in the composition of judges’ caseloads. Because these raw disparities contribute to the model-estimated disparity, residual variance creates a luck-of-the-draw effect in the model estimates: a judge who happens to have been assigned 40 Black defendants convicted of very serious offenses and 40 White defendants convicted of less serious ones will have a high raw disparity due to this luck factor, and that will be transmitted to the model’s estimated disparity.

How important this effect of residual variance is will be context-sensitive. The key relevant factors are likely to be the number of cases assigned to each judge for each racial group and the size of the residual variance relative to the size of the variance across judges in true judge-level disparities.

As I wrote in my Twitter thread, I used the authors’ posted code and data to determine that Hon. C. Darnell Jones II, the judge named by the authors as the “Most Discriminatory”, had a total of 103 cases with Black (non-Hispanic) defendants, 37 cases with Hispanic defendants, and 67 with White defendants. Hon. Timothy J. Savage, the judge named as the second “Most Discriminatory”, sentenced 155 Black (non-Hispanic) defendants included in the estimation, 58 Hispanic defendants, and 93 White defendants.  These don’t strike me as very large numbers of observations, which is another way of saying that I’m concerned residual variance may play a substantial role in driving the model-estimated disparities for these judges.

My replication of the authors’ model shows that true judge-specific disparities in the treatment of Black and White defendants have an estimated variance of 0.055, whereas the estimated residual variance is nearly 30 times higher — 1.59 for a single defendant.  For a judge who sentenced 40 Black and 40 White defendants, this would mean that the residual variance component would be 2(1.59)/40 ≈ 0.08 — which is larger than the 0.055 estimated variance in true judge-level disparity.  It’s more complicated to assess the pattern for judges with different numbers of defendants by race, but I would not be surprised if the residual variance component is roughly the same size as the variance in judge-level effects.
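As a quick check on the arithmetic in the paragraph above (the two variance estimates are the ones quoted from my replication; the variable names are mine):

```python
tau2 = 0.055    # estimated variance of true judge-level Black-White disparities
sigma2 = 1.59   # estimated residual variance for a single defendant's (transformed) sentence
n_black = n_white = 40

# Sampling variance of this judge's raw disparity:
# sigma2 * (1/n_black + 1/n_white) = 2(1.59)/40
sampling_var = sigma2 * (1 / n_black + 1 / n_white)

# The noise component (~0.08) exceeds the true cross-judge variance (0.055)
assert sampling_var > tau2
```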

In other words, even given the effect of shrinkage, I suspect that “bad luck” in terms of the draw of defendants might well be quite important in driving the judge-specific estimates the authors provide. Even leaving aside the missing-data problem, I think that makes the authors’ choice to name individual judges as “Most Discriminatory” problematic.

Another issue is that the judge-specific estimated disparity (remember, this is the model’s output, formed by taking the weighted average of overall and judge-specific raw disparities) is itself only an estimate, and thus a random variable.  So if one picked a judge at random from the authors’ data, it would be inappropriate to assume that the estimated disparity for that judge was the true value. To compare the judge-specific estimated disparity to other judges’ estimated disparities, or to some absolute standard, would require one to take into account the randomness in the estimated disparity.  The authors do not report any such uncertainty estimates.  Nor does the replication code they posted along with their data indicate that they calculated standard errors of the judge-specific estimated disparities.  There is no indication that I can find in either the code or the paper that they investigated this issue before posting their preprint.

(4) The many-draws problem.  Consider a simple coin toss experiment.  We take a fair coin and flip it 150 times. Roughly 98% of the time, this experiment will yield a heads share of 41.6% or greater (in other words, 41.6% is the approximate 2nd percentile for a fair coin flipped 150 times).  So if we flipped a fair coin once, it would be quite surprising to observe a heads share of 41.6% or lower.  But now imagine we take 760 fair coins and flip each of them 150 times. Common sense suggests it would be a lot less surprising to observe some really low heads shares, because we’re repeating the experiment many times.

To illustrate this point, I used a computer to do a simulation of exactly the just-described experiment — 760 fair coins each flipped 150 times. In this single meta-experiment I found that there were 13 “coins” with heads shares of less than 41.6%, just under two percent of the 760 “coins”, roughly as expected.  Given that we know all 760 “coins” are fair, it would make no sense to say that “the most biased coin is coin number 561”, even though in my meta-experiment it had the lowest heads share (36.7%, more than 3 standard deviations below the mean).  We know the coin is fair; it’s just that we did 760 multi-toss experiments, and with that much randomness we’re going to see some things that would be very unlikely with only one experiment.
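For readers who want to reproduce this kind of meta-experiment, here is a short Python sketch. This is my own illustrative code, not the simulation described above, so exact counts will vary with the random seed.

```python
import random
from math import comb

random.seed(0)

N_COINS, N_FLIPS = 760, 150
THRESHOLD = 0.416  # approximate 2nd percentile of the heads share for a fair coin

# Exact probability that a fair coin flipped 150 times lands heads on at most
# 41.6% of flips (i.e., 62 or fewer heads, since 0.416 * 150 = 62.4)
p_low = sum(comb(N_FLIPS, k) for k in range(63)) / 2 ** N_FLIPS  # about 0.02

# Flip 760 fair "coins" 150 times each and record each coin's heads share
shares = [sum(random.random() < 0.5 for _ in range(N_FLIPS)) / N_FLIPS
          for _ in range(N_COINS)]

n_low = sum(s < THRESHOLD for s in shares)  # expect roughly 760 * 0.02 ≈ 15
print(n_low, min(shares))
```

Even though every simulated coin is fair by construction, a handful of them will typically fall below the 2nd-percentile threshold, and the single lowest share will look "biased" if examined in isolation.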

Leaving aside differences across judges in the number of cases heard, this is not that different from what the authors’ approach entails.  If all judges had the same number of sentences, then they’d all have the same weights on their raw disparities, and so differences across judges would be entirely due to variation in those raw disparities.  If the residual variance component of these raw disparities is substantial (see above), then computing judge-specific model-estimated disparities for each of 760 judges would involve an important component related to idiosyncratic variation.  Taking the most extreme out of 760 model-estimated disparities is a lot like focusing on “coin” number 561 in my illustrative experiment above.

Another way to say this is that even if there were zero judge-specific disparity — even if all judges were perfectly fair — we might not be surprised to see substantial variation in the authors’ model-estimated disparities.

Now, it’s not really the case that all judges gave the same number of sentences, so there’s definitely some heterogeneity due to shrinkage as discussed above, which complicates the simpler picture I just painted for illustrative purposes.  But I suspect there is still a nontrivial “many-draws problem” here.  Note that this is really an instance of a problem sometimes referred to as “multiple testing” in various statistics literatures; as responders to my Twitter thread noted, one place it comes up is in attempts to measure teachers’ value added in education research, and another is in ranking hospitals and/or physicians.  In other words, this isn’t a problem I’ve made up or newly discovered.

* * *

In sum, I think the paper has several serious problems.  I do not think anyone should use its reported findings as a basis for deciding which judges are discriminatory, or how much.  This is as true for people who lack confidence in the fairness of the system as for any people who doubt there is discrimination.  In other words, the criticisms I offer do not require one to believe federal criminal sentencing is pure and fair.  These criticisms are about the quality of the data and the analysis.

I want to make one final point, as I did in my Twitter thread.  Like the authors of the study, I believe that PACER should be made available to researchers.  Indeed, I recently have written a whole paper taking that position.  But I am very concerned about the impact of their work on that prospect.  The work involves problematic methods and choices and then calls out individual judges for shaming.  In my experience there’s nontrivial opposition to data openness within the federal judiciary, and I fear this paper will only harden it.

August 9, 2021 at 09:50 PM


Apologies that I accidentally initially posted this on a closeby entry in this blog. Admitting that I am no econometrician, I am very confused about all this. So far as I could tell from the online discussion, the two counterfactuals various people have mentioned are (a) there's no bias in sentencing, and (b) all judges are equally biased. Option (a) seems to have been dispensed with already in the literature? So the "worst" case scenario (in the sense of this new paper being wrong) is that people want to argue that all judges are equally racially discriminatory? If that's the case, why are we not doing something about it? Or why isn't someone using a better method on this data to identify the discriminatory judges?

Posted by: Idont Thinkso | Aug 9, 2021 10:20:12 PM

No response yet I guess. I'll check back regularly to see when White Academia decides it wants to answer this one.

Posted by: Idont Thinkso | Aug 9, 2021 11:39:38 PM

Just checking in one more time to see if a white academic, especially any white academic whining about methodology, can reply with either (1) your unequivocal claim that there are not racially discriminatory judges, or (2) your admission that there might be or ARE, along with your accounting of what you are doing to address it. Yes, you.

Posted by: Idont Thinkso | Aug 10, 2021 12:55:19 AM

Ok my last word on this. Your stance, Mr Gelbart, is that there’s not enough data to determine which judges are discriminatory and since people of color have waited forever and still don’t have that data, we should keep waiting because any second judges will for sure make the data public and then everything will be fine. That sounds like the helpful suggestion of a white man with a nearly $400k public salary.

Posted by: Idont Thinkso | Aug 10, 2021 1:51:17 AM

What a lot of dense gabble.

Better: A Hall of Shame on the site. As the Marines say, "kick a-- and take names..." (per the late, brilliant F. Lee Bailey): Federal Judge Linda Reade of Iowa. The late Milton Shadur before whom, for some inexplicable reason, landed the case of his fellow Chicago committeeman from the distant past, Fast Eddy Vrdolyak, and who warranted a pat on the wrist and a sentence of next-to-nothing. Nice way to make millions and keep millions.

Surely each of our district courts has a share of these incandescently power- or sentence-mad judges, applying the "Guidelines" (how rich!) irrespective of color...Jews, of course, will suffer harsh punishment; they've got that Shylock legacy to live down. Label someone a "kingpin" (one who in reality resides in a shack in some pauperized rural community), or a black "gangbanger" and they're done for. Judges like labels to justify imposing near-lifelong sentences then go home, get soused and contribute a few nickels to charitable institutions.

Who are these judges meting out sentences anyway? They're people in glass houses appointed via connections. Some family member had connection and made substantial donations. The appointer has little or no interest in whether the candidate is just plain dumb, meds-addicted, or crazy. For these latter three, I'll forego the naming.

Posted by: Brenda Rossini | Aug 10, 2021 9:05:43 AM

Idont Thinkso: I am not able to speak for Prof Gelbach, but I am able to say that lots of academics (of all colors) have been researching and writing about racial biases in sentencing for many, many decades. Many modern sentencing reform efforts --- from the creation of sentencing guidelines to the development and use of risk algorithms --- have been prompted or justified as a means to try to reduce or eliminate racial and other biases in sentencing. (Other biases of concern range from gender to socio-economic status to geography to other factors.) I could do literally thousands of blog posts on all the papers written about racial biases in sentencing.

The concern with this new paper is not that it seeks to explore which judges might sentence with the most racial disparity --- it is that the data and methods used to explore this question are opaque and questionable. By definition, there must be a set of legal academics among 10,000 lawprofs who are the "most stupidist" or "most racist in how they grade." If I were to announce I used a new data formula to determine that Prof. Thinkso and Prof Smyth and Prof Alexander were the "most stupidist" and "most racist," wouldn't you have some questions about my data formula? Should economists be trying to list the most discriminatory law professors without concern for how the list is created?

One reason this paper is getting attention is because there rightly is persistent concern about racial disparity in sentencing, but calling out certain persons for being "most discriminatory" using unclear data risks complicating efforts to address this persistent problem wisely. Indeed, I fear this kind of work risks reinforcing the view that this is a "bad apples" problem far more than a structural one --- e.g., this study uses the sentencing guidelines as a key factor in the analysis, but the racist crack-powder disparity (and many others) is baked into the guidelines. Similarly, racially-skewed work by prosecutors and defense attorneys is also a huge problem in our CJ system, but a focus on judges risks distorting our understanding of all the sentencing decisions made before a judge gets to actually impose a sentence.

I appreciated your engagement, Idont Thinkso, but I would be interested to know if you think potentially statistically inaccurate identification of the "most discriminatory" judges helpfully advances the conversation. Perhaps problematic discussion of racial disparity is worse than no discussion at all, but getting this done accurately seems really important to me.

Posted by: Doug Berman | Aug 10, 2021 9:55:42 AM

Glad we now have it in writing that you are more comfortable accepting sentencing disparities than a potential false positive. (cf the multiple speculatory “I suspect”s and whatnot in the post.)

Posted by: Idont Thinkso | Aug 10, 2021 10:10:14 AM

And now it seems we’re back to your “judges aren’t the problem.” I’ve got news for you: they are certainly part of the problem (the literature exists) but you are more willing to brush it aside because it threatens your comfort and your ivory tower. Would love to hear you say “fair point” but I suspect I won’t get the satisfaction. I’m done here.

Posted by: Idont Thinkso | Aug 10, 2021 10:16:02 AM

Last thing: thank you SO much for the lesson on the history of and loci of sentencing disparities. No possibly way that it could be a subject I also know something about.

Posted by: Idont Thinkso | Aug 10, 2021 10:26:44 AM

Working backward in response to your three latest comments, Idont Thinkso:

1. I do not know what you know because you have not indicated who you are. I welcome hearing more about your background on this important topic and about whatever work you have done to identify the sources of sentencing disparities.

2. I am not saying in any way that judges are not "part of the problem"; I am saying that we need to be clear and accurate when seeking to figure out which ones are the biggest problem AND also not forget all the other parts of the problem. Notably, I am hosting another professor (Christopher Slobogin) blogging about his new book based on his view that judges are such a big part of the problem that we ought to be comfortable relying a lot more on "just algorithms." Yet others say these algorithms are even more racially biased than judges. This is a critical topic worthy of extended debate with nothing brushed aside --- but we need to try to have our data (and semantics) clear and accurate along the way.

3. If Judges A, B and C are contributing to the very worst racialized sentencing disparities, and a report comes out wrongly identifying Judges X, Y and Z as the "most discriminatory judges," it will be even harder to identify and correct the racial problems that Judges A, B and C are producing (and that X, Y and Z might also have some lesser role in). In other words, false positives can directly CONTRIBUTE to allowing key people to be "more comfortable accepting sentencing disparities." I get your eagerness to rail against anyone you think may be defending the "most discriminatory" federal sentencing judges, but I am first eager to make sure that we are accurately identifying the "most discriminatory" federal sentencing judges.

Posted by: Doug Berman | Aug 10, 2021 12:32:05 PM

Yup, got it. God forbid society tags the 50th most racist, or even the LEAST racist (but still biased). What really matters is if judges get their accurate racism rating. To hell with the people getting sentenced. Your null assumption seems to be "a judge is not racist" but why on God's Green Earth would that be? Thank you for demonstrating exactly how structural racism works.

Posted by: Idont Thinkso | Aug 10, 2021 12:38:43 PM

Well, Idont Thinkso, we could require that judges sentence everyone convicted of a felony to death with no discretion at all --- or abolish sentencing altogether --- if we think it inevitable that any and all exercises of judicial discretion will always be so racist that any and all other sentencing values and interests should be secondary to trying to identify and eliminate racially biased judges. Certainly some people seem to view policing and prisons and maybe all of criminal law as inherently so racist that we should abolish the enterprise altogether. If one views the sentencing enterprise as unavoidably coursing with racist judges whose racism eclipses all other sentencing concerns, I suppose I understand why you would not worry too much about the least racist judge being wrongly labeled the most racist (and/or the most racist being wrongly labeled the least racist). Are you making an abolish-judicial-sentencing pitch? Do you embrace Prof Slobogin's arguments that we'd do much better on this front with sentencing by algorithm?

I tend to think all discretionary decision-makers are subject to an array of conscious and subconscious biases -- that is, my null assumption is that everyone is somewhat racist (and somewhat sexist and nativist and anti-Semitic and anti-LGBTQIA and so on). But even with that perspective, I think INACCURATE assessments of racism can readily undercut efforts to do better both for criminal defendants and the entire system. In this context, for example, defendants might look to recuse certain senior judges named as "most racist" by the paper --- if effective (and I do think senior judges may be likely to recuse if asked), the replacement active judge could prove to be, in fact, "more racist" than the senior judge pushed aside. And, of course, any claim that the new judge was actually more racist would be defeated by this possibly flawed study.

As a related follow-up question, I wonder if you think the 1000+ sentencing judges NOT singled out by the study as among the "most discriminatory" federal sentencing judges are likely to feel more or less confident about their current sentencing practices after seeing this study? I suspect they generally feel more confident thinking they are "the good guys" since they are not named on this list. But I doubt you would be eager to endorse nearly all federal judges now feeling emboldened about their existing sentencing practices if you think they are all somewhat racist and all should be working to do a lot better. It is one thing to say we need to pay more attention to racial bias in sentencing, but strange to say we should do so by embracing and championing potentially inaccurate approaches to identifying the most biased judges (while reifying structural and personal racism elsewhere).

Posted by: Douglas Berman | Aug 10, 2021 2:51:56 PM

In light of Dr. Gelbach's helpful comments, the paper no longer reports judge-specific estimates. See a summary of the changes here: https://twitter.com/SmithChristianM/status/1427784600869171207

Posted by: Christian S | Aug 18, 2021 11:35:22 AM

I am puzzled that the authors of the study (none of whom appears to have any actual knowledge of or experience in federal sentencing practices) do not acknowledge in their retraction the unanimous reaction of practicing lawyers in the Districts of the (formerly) named "most discriminatory" judges, that these three judges are not by any means more racist in their decisions than other district judges. Such reactions by experienced private practitioners were quoted in the ALM stories, and were elaborated a couple of days later in a detailed statement by the public defender's office. I realize that subjective reactions could be disproven and shown to be fallacious by valid statistical analysis. But questionable statistics can be just as legitimately called into doubt by the unanimous reactions of those with actual, individualized experience appearing before those judges. I am no statistician, but I knew the minute I read about the study and its results that there had to be something deeply wrong with its methodology, given the plainly inapt names that rose to the "top" in its Hall of Shame.

Posted by: Peter Goldberger | Aug 18, 2021 8:52:13 PM
