Sunday, August 6, 2017

Testing for discrimination in college admissions

Recently, the Trump administration’s investigation into racial discrimination in college admissions has brought the topic back into the news. But the claim that some races need higher GPAs or SAT scores to be admitted to colleges is, of course, an old one. This post discusses the statistical subtleties involved in proving such a claim: specifically, I examine some of the arguments that Asian applicants need higher SAT scores than white applicants. To be open about my beliefs at the outset, I think that colleges probably do discriminate against Asians, as they once discriminated against Jews, but the statistical arguments made to prove discrimination are often flawed. This also describes my beliefs about discrimination more broadly: while it is pervasive, quantifying it statistically is hard.

We’re going to use a hypothetical example where only whites and Asians apply for admission, Asians tend to have higher SAT scores than whites, and the only thing that actually affects whether you get admitted is your SAT score. So in this hypothetical example, there is no discrimination; your race does not affect your chances of admission.
On the left, I show the scores for Asian applicants and white applicants. On the right, I show how your probability of admission depends on your SAT score. So someone with an SAT score of 1400 has about a 50% chance of admission, regardless of whether they’re white or Asian. Given that there’s no discrimination in our hypothetical example, if a statistical argument implies there is discrimination, that argument is flawed. So let’s take a look at some arguments.

The most common argument I’ve seen that Asians are discriminated against is that the SAT scores of admitted Asians are higher than SAT scores of admitted whites. But Kirabo Jackson, an economist at Northwestern University, points out the flaw in this argument. In our hypothetical example, where there is no discrimination, admitted Asians will have an average score of about 1460, and admitted whites will have an average score of about 1310. This happens because the Asian distribution is shifted to the right: even though a kid with a 1500 is equally likely to get in regardless of whether they’re white or Asian, there are more Asians with 1500s.
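To make this concrete, here's a minimal simulation sketch in Python. The score distributions and the logistic admission rule are my own illustrative assumptions, not the exact parameters behind the plots in this post; the qualitative effect is the same either way: admitted Asians end up with higher average scores even though the admission rule never looks at race.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical race-blind admissions. The means, spreads, and logistic
# admission rule are illustrative assumptions, chosen so that a 1400
# has about a 50% chance of admission.
n = 100_000
white_scores = rng.normal(1300, 80, n)
asian_scores = rng.normal(1380, 80, n)  # right-shifted relative to whites

def p_admit(score):
    # Admission depends only on the score, never on race.
    return 1 / (1 + np.exp(-(score - 1400) / 40))

white_admitted = white_scores[rng.random(n) < p_admit(white_scores)]
asian_admitted = asian_scores[rng.random(n) < p_admit(asian_scores)]

# Admitted Asians average meaningfully higher scores than admitted whites,
# even though race never enters the admission rule.
print(asian_admitted.mean(), white_admitted.mean())
```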

When I ran this argument by a friend, he said that the study which people often cite when claiming Asians are discriminated against is considerably more sophisticated. So I read the study, and it is more sophisticated; it’s worth reading. They fit a model where they simultaneously control for someone’s race and SAT score, which lets you see whether people of some races need higher scores to get in.

Here’s the subtlety. The paper doesn’t actually look at SAT scores, but at SAT scores divided into bins: 1200 - 1300, 1300 - 1400, and so on. Within those bins, the paper’s model assumes all applicants should have an equal chance of admission (all else being equal). But that isn’t quite right: an applicant with a 1290 will have a higher chance of admission than an applicant with a 1210. And because Asians are right-shifted in our example, Asians in the 1200 - 1300 bin will have higher scores, and a higher chance of admission, than whites in the 1200 - 1300 bin, even though the paper’s model assumes that applicants in that bin should be equal if there is no discrimination. Below is a plot which illustrates the idea. Within each score bin, Asians (red line) have a higher average SAT score (left plot), and thus a higher chance of admission (right plot), than whites in the same bin (blue line).


So what happens when we fit the paper’s model on our hypothetical data? Now we find discrimination against whites. This happens because the blue lines are below the red lines: whites in a bin have a lower chance of admission than Asians in a bin because they have lower average scores. So the paper’s model will incorrectly conclude that, controlling for SAT score, whites have about 20% lower odds of admission, a significant amount of discrimination. I should note it’s entirely possible that the authors fit other models that don’t bin SAT scores, although I couldn’t find those models mentioned in the paper [1]; please point me to anything I’ve missed.
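Here's a sketch of how fitting a binned model on the simulated data plays out, using statsmodels. The 100-point bins and all the simulation parameters are assumptions on my part (the paper's actual model includes more covariates), and the exact odds ratio will depend on those assumptions; the point is the sign of the race coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Re-simulate race-blind admissions as in the sketch above.
n = 100_000
asian = rng.random(n) < 0.5
scores = np.where(asian, rng.normal(1380, 80, n), rng.normal(1300, 80, n))
admitted = (rng.random(n) < 1 / (1 + np.exp(-(scores - 1400) / 40))).astype(int)

df = pd.DataFrame({
    "admitted": admitted,
    "white": (~asian).astype(int),
    "score_bin": (scores // 100).astype(int),  # 100-point bins
})
# Keep bins where both outcomes actually occur, so the logit is well-behaved.
df = df[df.score_bin.between(12, 15)]

# Admission regressed on race, "controlling" for the binned score.
fit = smf.logit("admitted ~ C(score_bin) + white", data=df).fit(disp=0)
print(np.exp(fit.params["white"]))  # odds ratio < 1: spurious "bias" against whites
```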

Okay. So we took hypothetical data that had no discrimination. One widely repeated statistical argument shows discrimination against Asians. Another widely repeated statistical argument shows discrimination against whites. This isn’t good. The basic mathematical takeaway is that when races have different distributions over a variable (like SAT score) and you divide that variable into bins, you can get misleading results. (See the literature on infra-marginality for interesting discussions of related phenomena in tests for police discrimination).

The broader takeaway is that testing for discrimination is really hard. Which isn’t to say you should discount all evidence that it occurs; you should just be mindful of the caveats. Also, these statistical problems are tricky and fun to think about, so you should come work with me on them.

Footnotes:

[1] One of the authors went on to write a book on the topic, the one cited in the lawsuit against Harvard; I took a look at the relevant chapter, and it seems to use a similar binning strategy for SAT scores. To be clear, just because a model has caveats worth discussing doesn’t mean the work is bad or the conclusions are wrong; indeed, the book appears to be impressively comprehensive. Also, our hypothetical example actually suggests that this model might underestimate the amount of discrimination against Asians.

Monday, April 17, 2017

Proving discrimination from personal experience


Here’s an interaction you might’ve participated in:

Member of minority group: I just had [negative interaction] with John. I don’t think he would’ve done that if I hadn’t been a minority.
Listener: That sucks. But...how do you know it was because you were a minority? Maybe he was just having a bad day or he was really busy or …

The negative interaction might be, say, that John talked down to them or didn’t include them on a project.  The listener’s reaction is totally reasonable and well-intentioned (at least, I hope it is, because I’ve had it myself). Sometimes it isn’t even said out loud; the listener just thinks it. Here I argue that this reaction is not the most useful one. I explain why, both in English and in math, and then I suggest four more useful reactions.

The problem with this reaction is not that it’s false. It’s that it’s obvious. If a minority tells you about something bad that happened to them, you can almost always attribute it to factors other than their minority status. (Throughout this essay, I’ll refer to negative behavior that’s due to someone’s minority status as “discrimination”.) Worse, this uncertainty will persist even if the discrimination occurs repeatedly and is quite significant. The core reason for this is that human behavior is complicated, there are lots of things that could explain a given interaction, and in our lives we observe only a small number of interactions. Because it is so hard to rule out other factors, individual discrimination suits have notoriously low success rates.

Let’s be clear: I’m not saying you can never prove discrimination from someone’s individual experience. Obviously, there are some experiences which are so blatant that discrimination is the only explanation: if someone drops a racial slur or grabs their female coworker by the whatever, we know they’re a bigot. But, in today’s workplaces, problematic discrimination is rarely so overt -- hence the term “second generation” discrimination.
Here’s a simple mathematical model that formalizes this idea. If you don’t like math, feel free to skip to the “What should we do instead” section. Let’s say the result of an interaction, Y, depends on a number of observable factors, X, one of which is whether someone’s a minority. Specifically, let:

Y = X * beta + noise
where beta is a set of coefficients describing how much each factor matters, and noise is due to random things we don’t observe. So, for example, Y might be your grade on a computer science assignment, X might include factors like “does your code produce the correct output” and “are you a minority” and noise might be due to stuff like how quickly the TA is grading [1].

If we want to know whether there’s discrimination, we need to figure out the value of beta_minority: this will tell us whether minorities get worse outcomes just for being minorities. We can infer this value using linear regression, and importantly, we can also infer the uncertainty on the value.

Here’s the problem. When you do linear regression on a small number of datapoints (which is all a person has, given that they don’t observe that many interactions) you’re going to have huge uncertainty in the inferred values. To illustrate this, I ran a simulation using the model above with two groups, call them A and B, each half the population. I set the parameters so there was a strong discrimination effect against B. Specifically, even though A and B are equal along other dimensions, the average person in A will be ranked higher than about two thirds of people in B, due solely to discrimination; if you look at people in the top 5%, less than a third will be B. So this is enough discrimination to produce substantial underrepresentation. But when we try to infer the value of the discrimination coefficient, we can’t be sure there’s discrimination. In the plot below, the horizontal axis is how many interactions we observe; the blue area shows the 95% confidence interval for the discrimination coefficient (with negative values showing discrimination against B); the black line shows a world with no discrimination.


The important point is that the blue shaded area overlaps 0 -- meaning the data are consistent with no discrimination at all -- even when you have literally dozens of interactions, which is more than you often have. (For fewer than about 5 interactions, the errorbars just blow up and you can’t even graph them.) You can alter the simulation parameters or simulate things slightly differently, but I don’t think you’ll change the basic point: you can’t infer effect sizes from sample sizes this small with any confidence.
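If you want to play with this yourself, here's a stripped-down version of the simulation in Python. The effect size and noise scale are my own choices rather than the exact settings behind the plot, so treat it as a sketch of the phenomenon, not a reproduction.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def minority_ci(n_interactions, beta_minority=-0.6):
    # Outcome = a legitimate factor + a real penalty for being a minority + noise.
    # The effect size and noise scale are illustrative assumptions.
    minority = (rng.random(n_interactions) < 0.5).astype(float)
    merit = rng.normal(size=n_interactions)
    y = merit + beta_minority * minority + rng.normal(size=n_interactions)
    X = sm.add_constant(np.column_stack([merit, minority]))
    return sm.OLS(y, X).fit().conf_int()[2]  # 95% CI for the minority coefficient

for n in [10, 25, 50, 100]:
    print(n, minority_ci(n))
# At the couple dozen interactions a single person actually observes,
# the interval usually straddles zero despite the strong true effect.
```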

This model also illustrates some features which make concluding discrimination harder. For example, our errorbars will be larger if other features in X are correlated with being a minority. (“No no, I didn’t promote him because he’s a man. I promoted him because we work well together because we always go out to dinner together / play basketball together / he sounds so much more confident. Well, yes, my wife says I can’t go out to dinner with women…”) Also, your errorbars will be larger if you’re observing repeated interactions from the same person. (If you’re trying to compare your treatment to that of a single coworker, it’s even harder to be sure if it’s because you’re a minority or because of one of the innumerable other ways in which you’ll inevitably differ.) Last, you’re going to be in even more trouble if your minority is a very small fraction of the population whose interactions you observe (say, computer scientists) -- I don’t know if most computer scientists are prejudiced against African-American students because I’ve literally never seen them interact with one.

It’s worth noting that there are a lot of other subtleties in detecting discrimination which have nothing to do with small sample size and which this model doesn’t capture (see the intro to this paper for a brief, clear introduction), but I think small sample size is probably the biggest challenge in the individual-experience setting, so it’s what I focused on here.

What should we do instead?  

So it isn’t useful to tell someone that they can’t be sure their experience is due to discrimination, because even in cases when a large amount of discrimination is occurring, people often won’t observe the data to conclusively rule out other factors. What should we do instead?

Here’s one thing I don’t think we should do: assume that discrimination is occurring every time a minority says they think it might be. (I do think we should assume they’re telling the truth about what occurred). The solution to uncertainty and bad data is not to always rule in favor of one party, since it creates perverse incentives and people’s lives get wrecked both by discrimination and by allegations of discrimination. Instead:

  1. Recognize the severity of the problem that minorities deal with. It’s not that they hallucinate discrimination everywhere or are incapable of logical thinking or rigorous standards of proof. It’s that proving discrimination from anecdotal experience is frequently an extremely difficult statistical task. Also, it’s exhausting to continually deal with the unprovable possibility of discrimination: to wonder, every time something doesn’t work out, if some subtle injustice was at play.
  2. Use common sense. Statisticians call this “using a prior”: i.e., you let your prior knowledge about how the world works inform how you interpret the data. So, for example, if you hear someone refer to a black student as “articulate” or a female professor as “aggressive”, you don’t need to hear one hundred more examples to suspect prejudice may be at play. Your prior knowledge about how those adjectives are used helps you conclude discrimination more quickly. (I suspect that one reason female judges are more inclined to rule in favor of discrimination suits is that they have different prior beliefs about how common discrimination is.)
  3. Aggregate data. If one person’s experience doesn’t give you enough data to rule out other factors, aggregate experiences. Class-action lawsuits are an essential means of going after discriminatory employers for this reason. Climate surveys within departments are another example, as is publishing systematic salary gap data (as Britain now does). The sexual assault reporting system Callisto, which aggregates accusations of assault against the same perpetrator, is based on a related idea, as I’ve discussed.
  4. Conduct workplace audit studies. This idea is kind of crazy and might get you fired, but here it is: if it’s hard to prove discrimination because there are too many other factors at play, keep the other factors constant. Here are some examples:
    1. When a female employee says something in a meeting and people ignore it and then a male employee says the exact same thing and gets a more positive response, we’re more convinced that’s discrimination. (There are a hilarious number of Google results for that phenomenon, by the way.)
    2. A few years ago, I spent a few weeks emailing the NYT’s technical team and getting no response; finally I asked my boyfriend to send them the exact same question, and they immediately responded.
    3. Or take this recent case, where a male and female employee switched their email accounts and were treated dramatically differently.

All these examples feel like compelling evidence of discrimination because it’s hard to pin the different outcome on extraneous factors; everything except minority status remains the same.

So, could you do this in your workplace? More and more interactions occur online, making it easier to switch identities: for example, you could imagine switching Slack accounts for a week. Obviously there are 14 million ways this could go wrong, but drop me a line if you try it.

Footnotes:

[1]  This is easily extended to binary outcomes: Y ~ Bernoulli(sigmoid(X * beta + noise))

Tuesday, August 9, 2016

How much does losing the first round of a tournament hurt your final result?

This is the second half of a two-part piece about arguments I’ve had with Shengwu Li. Part 1 gives a statistical argument for why in some states it is rational to vote.

Today’s question: can you sleep through the first round of a 9-round debate tournament, like the World or European championships, without hurting your final result? You might think this question is only interesting to debaters, but the mathematical way of answering it is potentially applicable to any tournament with some randomness in pairings in which the winners play the winners and the losers play the losers; this also happens in chess and lots of other competitions.

(Brief note for non-debaters: in each round of the debate world championships, you compete against three other teams, earn up to three points, and want to accumulate as many points as possible over the tournament. In each round, you’re paired against teams with roughly the same number of points, so if you win, you face better teams.)
The argument that the first round doesn’t matter is that, if you lose, you face worse teams in the next eight rounds, an advantage you can capitalize on. Perhaps this evens out.

A first glance at the data would imply this argument is wrong. In the three debate world championships from 2013 - 2015, teams who earned three points in their first round ended up with about six more points by the end of the tournament than teams who earned zero points in their first round. This seems pretty amazing, because you only get three points from winning your first round, and then you have to face better teams; how do you end up six points better? Perhaps winning your first round gives you a confidence boost that improves your performance?

This reasoning is wrong. Winning your first round doesn’t necessarily cause you to do better in later rounds; it’s just a sign you’re a better team. (Similarly, getting in an ambulance doesn’t cause you to die; it’s just a sign that you’re sick.) The teams who win their first rounds would’ve done better whether they won their first rounds or not.

If we want to figure out whether winning your first round causes you to do better overall, we need to control for the fact that teams that win their first rounds are better. In an ideal world, we’d just do an experiment: instead of running the first round, we’d divide teams into four random, equally-sized groups, give one group 3 points, one group 2 points, one group 1 point, and one group 0 points, run the rest of the tournament normally, and see whether the initial advantage ended up mattering. But we can’t run a real random experiment, so we need to find a random factor that affects who wins the first round. We call the random factor an instrumental variable. The math behind how exactly this works is beyond the scope of the post (here’s a less mathy reference, here’s a more mathy one), but the basic recipe is straightforward. If you want to know how cause X affects outcome Y, you need to find a random factor Z which affects X (and is uncorrelated with Y when controlling for X). Then there’s some math that lets you put those three ingredients together to figure out how X affects Y. That’s a lot of symbols, so here are some examples.
How does X... | Affect Y... | Random factor Z which affects X
Serving in the Vietnam War | Future earnings | Draft lottery number
Tea party protests | Election outcomes | Rain on tax day
Iron metabolism | Risk of Parkinson’s | Genetic variants affecting iron metabolism
Height and BMI | Socioeconomic status | Genetic variants affecting height and BMI


To return to our debate problem: our X is how well you do in round 1, our outcome Y is how well you do in the tournament overall, and now we need a random factor Z that affects how well you do in round 1. One random factor that affects a team’s probability of winning round 1 is how good the other teams in the round are, since teams are randomly matched in the first round. We estimate the likelihood that each team will win the round using a number of other factors, including their record in past tournaments, their school’s record in past tournaments, and their EFL / ESL status [1], and use these as our instruments. Then we use these instruments to estimate the true causal effect of winning round 1.  (Another source of randomness you could potentially use is which position you’re assigned to; we discuss this further here [2]).
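For intuition about the recipe, here's a toy two-stage least squares sketch in Python. This is emphatically not our actual analysis: every variable and number below is invented for illustration, and the true causal effect is set to zero so you can see the naive estimate mislead.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Toy version of the setup. Team "ability" confounds both the
# round-1 result and the final score.
n = 5000
ability = rng.normal(size=n)
opp_strength = rng.normal(size=n)  # Z: randomly assigned, affects round 1 only

round1 = ability - opp_strength + rng.normal(size=n)   # X
final = 3 * ability + 0 * round1 + rng.normal(size=n)  # Y: true effect of X is zero

# Naive regression of final score on round-1 result is confounded by ability.
naive = sm.OLS(final, sm.add_constant(round1)).fit()

# Manual two-stage least squares: regress X on Z, then Y on the fitted X.
# (Point estimate only; use a proper IV package for correct standard errors.)
stage1 = sm.OLS(round1, sm.add_constant(opp_strength)).fit()
stage2 = sm.OLS(final, sm.add_constant(stage1.fittedvalues)).fit()

print(naive.params[1], stage2.params[1])  # naive is biased upward; IV is near zero
```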

Here are the results. The red line shows the estimated effect of your round one result on your score after each round (x-axis) using the naive, incorrect method we describe above; the black line shows the estimated effect using the more sophisticated instrumental variable method.
A couple of interesting things here. First, the red line and the black line are very different; in particular, the naive approach (red line) suggests that winning round 1 has a much larger effect, one that grows as the tournament progresses, which is wrong for the reasons discussed above. The more sophisticated analysis implies that while winning round 1 affects your performance for a couple of rounds, by the end of the tournament (round 9) it doesn’t make much difference. (The 95% confidence interval on the estimate is unfortunately large [3], because the instruments aren’t that strong.)
Takeaways:

  1. Don’t freak out too much if you do badly in an early round in a long tournament like Worlds or Euros -- our estimates suggest it doesn’t affect your final performance very much, if at all. On the other hand, in shorter 4 - 6 round tournaments, losing an early round can have a larger effect.
  2. If early rounds have a smaller effect on your final performance, it might be better to run controversial debates in earlier rounds; that way if teams had to opt out, they could do so without taking as large a hit to their final result.
  3. Storing comprehensive data from all Worlds in a single centralized repository, with a standard format, would make these analyses easier and is worth doing.
  4. Naive correlational estimates of effects are often different from those estimated using causal methods. For the love of God, bear this in mind when reading popular social science coverage. Even if the authors claim to have “controlled for other factors”, this phrase does not work magic. It is very difficult to control completely for other factors.

Notes:
[1] This isn’t, of course, a perfect measure of team strength -- since Oxford A is usually composed of different people from year to year, and many teams do not appear in other Worlds.
[2] In each round, every team will debate the same statement -- for example, “We should ban abortions at all stages of pregnancy” -- but because there are four teams in the same room, two teams will be randomly assigned to argue for the statement, and two teams will be assigned to argue against the statement. It’s hard to come up with statements which are perfectly balanced, so it’s often better to be on one side or the other. We found that a team’s score after the first four rounds was significantly associated with the positions they had been randomly assigned in those rounds, so position in the first four rounds could be used as an instrument. You have to use the first four rounds, not just the first round, to satisfy the exclusion restriction, since most debate teams cycle through all four positions in the first four rounds.
[3] I think you should actually compute standard errors clustering by round, since results within round are highly correlated; when you do this, the errorbars are slightly larger but the conclusions are the same.
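For anyone curious about the mechanics, here's a toy illustration of cluster-robust standard errors in statsmodels. The data are invented, not our debate results; the point is just that when both the regressor and the noise share a round-level component, the clustered errorbars come out larger.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Invented data: both the regressor and the noise share a round-level
# component, so observations within a round are correlated.
n_rounds, per_round = 9, 200
round_id = np.repeat(np.arange(n_rounds), per_round)
n = n_rounds * per_round
x = rng.normal(size=n_rounds)[round_id] + rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n_rounds)[round_id] + rng.normal(size=n)

X = sm.add_constant(x)
plain = sm.OLS(y, X).fit()
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": round_id})
print(plain.bse[1], clustered.bse[1])  # the clustered errorbars are larger
```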