Sunday, August 6, 2017

Testing for discrimination in college admissions

Recently, the Trump administration’s investigation into racial discrimination in college admissions has brought the topic back into the news. But the claim that some races need higher GPAs or SAT scores to be admitted to colleges is, of course, an old one. This post discusses the statistical subtleties involved in proving such a claim: specifically, I examine some of the arguments that Asian applicants need higher SAT scores than white applicants. To be open about my beliefs at the outset, I think that colleges probably do discriminate against Asians, as they once discriminated against Jews, but the statistical arguments made to prove discrimination are often flawed. This also describes my beliefs about discrimination more broadly: while it is pervasive, quantifying it statistically is hard.

We’re going to use a hypothetical example where only whites and Asians apply for admission, Asians tend to have higher SAT scores than whites, and the only thing that actually affects whether you get admitted is your SAT score. So in this hypothetical example, there is no discrimination; your race does not affect your chances of admission.
On the left, I show the scores for Asian applicants and white applicants. On the right, I show how your probability of admission depends on your SAT score. So someone with an SAT score of 1400 has about a 50% chance of admission, regardless of whether they’re white or Asian. Given that there’s no discrimination in our hypothetical example, if a statistical argument implies there is discrimination, that argument is flawed. So let’s take a look at some arguments.

The most common argument I’ve seen that Asians are discriminated against is that the SAT scores of admitted Asians are higher than SAT scores of admitted whites. But Kirabo Jackson, an economist at Northwestern University, points out the flaw in this argument. In our hypothetical example, where there is no discrimination, admitted Asians will have an average score of about 1460, and admitted whites will have an average score of about 1310. This happens because the Asian distribution is shifted to the right: even though a kid with a 1500 is equally likely to get in regardless of whether they’re white or Asian, there are more Asians with 1500s.
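To make this concrete, here's a minimal simulation sketch in Python. The score distributions and the logistic admission rule are my own illustrative assumptions, not the exact parameters behind the plots in this post; the qualitative effect is the same either way: admitted Asians end up with higher average scores even though the admission rule never looks at race.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical race-blind admissions. The means, spreads, and logistic
# admission rule are illustrative assumptions, chosen so that a 1400
# has about a 50% chance of admission.
n = 100_000
white_scores = rng.normal(1300, 80, n)
asian_scores = rng.normal(1380, 80, n)  # right-shifted relative to whites

def p_admit(score):
    # Admission depends only on the score, never on race.
    return 1 / (1 + np.exp(-(score - 1400) / 40))

white_admitted = white_scores[rng.random(n) < p_admit(white_scores)]
asian_admitted = asian_scores[rng.random(n) < p_admit(asian_scores)]

# Admitted Asians average meaningfully higher scores than admitted whites,
# even though race never enters the admission rule.
print(asian_admitted.mean(), white_admitted.mean())
```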

When I ran this argument by a friend, he said that the study which people often cite when claiming Asians are discriminated against is considerably more sophisticated. So I read the study, and it is more sophisticated; it’s worth reading. They fit a model where they simultaneously control for someone’s race and SAT score, which lets you see whether people of some races need higher scores to get in.

Here’s the subtlety. The paper doesn’t actually look at SAT scores, but at SAT scores divided into bins: 1200 - 1300, 1300 - 1400, and so on. Within those bins, the paper’s model assumes all applicants should have an equal chance of admission (all else being equal). But that isn’t quite right: an applicant with a 1290 will have a higher chance of admission than an applicant with a 1210. And because Asians are right-shifted in our example, Asians in the 1200 - 1300 bin will have higher scores, and a higher chance of admission, than whites in the 1200 - 1300 bin, even though the paper’s model assumes that applicants in that bin should be equal if there is no discrimination. Below is a plot which illustrates the idea. Within each score bin, Asians (red line) have a higher average SAT score (left plot), and thus a higher chance of admission (right plot), than whites in the same bin (blue line).


So what happens when we fit the paper’s model on our hypothetical data? Now we find discrimination against whites. This happens because the blue lines are below the red lines: whites in a bin have a lower chance of admission than Asians in a bin because they have lower average scores. So the paper’s model will incorrectly conclude that, controlling for SAT score, whites have about 20% lower odds of admission, a significant amount of discrimination. I should note it’s entirely possible that the authors fit other models that don’t bin SAT scores, although I couldn’t find those models mentioned in the paper [1]; please point me to anything I’ve missed.
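Here's a sketch of how fitting a binned model on the simulated data plays out, using statsmodels. The 100-point bins and all the simulation parameters are assumptions on my part (the paper's actual model includes more covariates), and the exact odds ratio will depend on those assumptions; the point is the sign of the race coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Re-simulate race-blind admissions as in the sketch above.
n = 100_000
asian = rng.random(n) < 0.5
scores = np.where(asian, rng.normal(1380, 80, n), rng.normal(1300, 80, n))
admitted = (rng.random(n) < 1 / (1 + np.exp(-(scores - 1400) / 40))).astype(int)

df = pd.DataFrame({
    "admitted": admitted,
    "white": (~asian).astype(int),
    "score_bin": (scores // 100).astype(int),  # 100-point bins
})
# Keep bins where both outcomes actually occur, so the logit is well-behaved.
df = df[df.score_bin.between(12, 15)]

# Admission regressed on race, "controlling" for the binned score.
fit = smf.logit("admitted ~ C(score_bin) + white", data=df).fit(disp=0)
print(np.exp(fit.params["white"]))  # odds ratio < 1: spurious "bias" against whites
```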

Okay. So we took hypothetical data that had no discrimination. One widely repeated statistical argument shows discrimination against Asians. Another widely repeated statistical argument shows discrimination against whites. This isn’t good. The basic mathematical takeaway is that when races have different distributions over a variable (like SAT score) and you divide that variable into bins, you can get misleading results. (See the literature on infra-marginality for interesting discussions of related phenomena in tests for police discrimination).

The broader takeaway is that testing for discrimination is really hard. Which isn’t to say you should discount all evidence that it occurs; you should just be mindful of the caveats. Also, these statistical problems are tricky and fun to think about, so you should come work with me on them.

Footnotes:

[1] One of the authors went on to write a book on the topic, the one cited in the lawsuit against Harvard; I took a look at the relevant chapter, and it seems to use a similar binning strategy for SAT scores. To be clear, just because a model has caveats worth discussing doesn’t mean the work is bad or the conclusions are wrong; indeed, the book appears to be impressively comprehensive. Also, our hypothetical example actually suggests that this model might underestimate the amount of discrimination against Asians.

Monday, April 17, 2017

Proving discrimination from personal experience


Here’s an interaction you might’ve participated in:

Member of minority group: I just had [negative interaction] with John. I don’t think he would’ve done that if I hadn’t been a minority.
Listener: That sucks. But...how do you know it was because you were a minority? Maybe he was just having a bad day or he was really busy or …

The negative interaction might be, say, that John talked down to them or didn’t include them on a project.  The listener’s reaction is totally reasonable and well-intentioned (at least, I hope it is, because I’ve had it myself). Sometimes it isn’t even said out loud; the listener just thinks it. Here I argue that this reaction is not the most useful one. I explain why, both in English and in math, and then I suggest four more useful reactions.

The problem with this reaction is not that it’s false. It’s that it’s obvious. If a minority tells you about something bad that happened to them, you can almost always attribute it to factors other than their minority status. (Throughout this essay, I’ll refer to negative behavior that’s due to someone’s minority status as “discrimination”.) Worse, this uncertainty will persist even if the discrimination occurs repeatedly and is quite significant. The core reason for this is that human behavior is complicated, there are lots of things that could explain a given interaction, and in our lives we observe only a small number of interactions. Because it is so hard to rule out other factors, individual discrimination suits have notoriously low success rates.

Let’s be clear: I’m not saying you can never prove discrimination from someone’s individual experience. Obviously, there are some experiences which are so blatant that discrimination is the only explanation: if someone drops a racial slur or grabs their female coworker by the whatever, we know they’re a bigot. But, in today’s workplaces, problematic discrimination is rarely so overt -- hence the term “second generation” discrimination.
Here’s a simple mathematical model that formalizes this idea. If you don’t like math, feel free to skip to the “What should we do instead” section. Let’s say the result of an interaction, Y, depends on a number of observable factors, X, one of which is whether someone’s a minority. Specifically, let:

Y = X * beta + noise
where beta is a set of coefficients describing how much each factor matters, and noise is due to random things we don’t observe. So, for example, Y might be your grade on a computer science assignment, X might include factors like “does your code produce the correct output” and “are you a minority” and noise might be due to stuff like how quickly the TA is grading [1].

If we want to know whether there’s discrimination, we need to figure out the value of beta_minority: this will tell us whether minorities get worse outcomes just for being minorities. We can infer this value using linear regression, and importantly, we can also infer the uncertainty on the value.

Here’s the problem. When you do linear regression on a small number of datapoints (which is all a person has, given that they don’t observe that many interactions) you’re going to have huge uncertainty in the inferred values. To illustrate this, I ran a simulation using the model above with two groups, call them A and B, each half the population. I set the parameters so there was a strong discrimination effect against B. Specifically, even though A and B are equal along other dimensions, the average person in A will be ranked higher than about two thirds of people in B, due solely to discrimination; if you look at people in the top 5%, less than a third will be B. So this is enough discrimination to produce substantial underrepresentation. But when we try to infer the value of the discrimination coefficient, we can’t be sure there’s discrimination. In the plot below, the horizontal axis is how many interactions we observe; the blue area shows the 95% confidence interval for the discrimination coefficient (with negative values showing discrimination against B); the black line shows a world with no discrimination.


The important point is that the blue shaded area overlaps 0 -- meaning the data are consistent with no discrimination at all -- even when you have literally dozens of interactions, which is more than you often have. (For fewer than about 5 interactions, the errorbars just blow up and you can’t even graph them.) You can alter the simulation parameters or simulate things slightly differently, but I don’t think you’ll change the basic point: you can’t infer effect sizes from sample sizes this small with any confidence.
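If you want to play with this yourself, here's a stripped-down version of the simulation in Python. The effect size and noise scale are my own choices rather than the exact settings behind the plot, so treat it as a sketch of the phenomenon, not a reproduction.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def minority_ci(n_interactions, beta_minority=-0.6):
    # Outcome = a legitimate factor + a real penalty for being a minority + noise.
    # The effect size and noise scale are illustrative assumptions.
    minority = (rng.random(n_interactions) < 0.5).astype(float)
    merit = rng.normal(size=n_interactions)
    y = merit + beta_minority * minority + rng.normal(size=n_interactions)
    X = sm.add_constant(np.column_stack([merit, minority]))
    return sm.OLS(y, X).fit().conf_int()[2]  # 95% CI for the minority coefficient

for n in [10, 25, 50, 100]:
    print(n, minority_ci(n))
# At the couple dozen interactions a single person actually observes,
# the interval usually straddles zero despite the strong true effect.
```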

This model also illustrates some features which make concluding discrimination harder. For example, our errorbars will be larger if other features in X are correlated with being a minority. (“No no, I didn’t promote him because he’s a man. I promoted him because we work well together because we always go out to dinner together / play basketball together / he sounds so much more confident. Well, yes, my wife says I can’t go out to dinner with women…”) Also, your errorbars will be larger if you’re observing repeated interactions from the same person. (If you’re trying to compare your treatment to that of a single coworker, it’s even harder to be sure if it’s because you’re a minority or because of one of the innumerable other ways in which you’ll inevitably differ.) Last, you’re going to be in even more trouble if your minority is a very small fraction of the population whose interactions you observe (say, computer scientists) -- I don’t know if most computer scientists are prejudiced against African-American students because I’ve literally never seen them interact with one.

It’s worth noting that there are a lot of other subtleties in detecting discrimination which have nothing to do with small sample size and which this model doesn’t capture (see the intro to this paper for a brief, clear introduction), but I think small sample size is probably the biggest challenge in the individual-experience setting, so it’s what I focused on here.

What should we do instead?  

So it isn’t useful to tell someone that they can’t be sure their experience is due to discrimination, because even in cases when a large amount of discrimination is occurring, people often won’t observe the data to conclusively rule out other factors. What should we do instead?

Here’s one thing I don’t think we should do: assume that discrimination is occurring every time a minority says they think it might be. (I do think we should assume they’re telling the truth about what occurred). The solution to uncertainty and bad data is not to always rule in favor of one party, since it creates perverse incentives and people’s lives get wrecked both by discrimination and by allegations of discrimination. Instead:

  1. Recognize the severity of the problem that minorities deal with. It’s not that they hallucinate discrimination everywhere or are incapable of logical thinking or rigorous standards of proof. It’s that proving discrimination from anecdotal experience is frequently an extremely difficult statistical task. Also, it’s exhausting to continually deal with the unprovable possibility of discrimination: to wonder, every time something doesn’t work out, if some subtle injustice was at play.
  2. Use common sense. Statisticians call this “using a prior”: i.e., you let your prior knowledge about how the world works inform how you interpret the data. So, for example, if you hear someone refer to a black student as “articulate” or a female professor as “aggressive”, you don’t need to hear one hundred more examples to suspect prejudice may be at play. Your prior knowledge about how those adjectives are used helps you conclude discrimination more quickly. (I suspect that one reason female judges are more inclined to rule in favor of discrimination suits is that they have different prior beliefs about how common discrimination is.)
  3. Aggregate data. If one person’s experience doesn’t give you enough data to rule out other factors, aggregate experiences. Class-action lawsuits are an essential means of going after discriminatory employers for this reason. Climate surveys within departments are another example, as is publishing systematic salary gap data (as Britain now does). The sexual assault reporting system Callisto, which aggregates accusations of assault against the same perpetrator, is based on a related idea, as I’ve discussed.
  4. Conduct workplace audit studies. This idea is kind of crazy and might get you fired, but here it is: if it’s hard to prove discrimination because there are too many other factors at play, keep the other factors constant. Here are some examples:
    1. When a female employee says something in a meeting and people ignore it and then a male employee says the exact same thing and gets a more positive response, we’re more convinced that’s discrimination. (There are a hilarious number of Google results for that phenomenon, by the way.)
    2. A few years ago, I spent a few weeks emailing the NYT’s technical team and getting no response; finally I asked my boyfriend to send them the exact same question, and they immediately responded.
    3. Or take this recent case, where a male and female employee switched their email accounts and were treated dramatically differently.

All these examples feel like compelling evidence of discrimination because it’s hard to pin the different outcome on extraneous factors; everything except minority status remains the same.

So, could you do this in your workplace? More and more interactions occur online, making it easier to switch identities: for example, you could imagine switching Slack accounts for a week. Obviously there are 14 million ways this could go wrong, but drop me a line if you try it.

Footnotes:

[1]  This is easily extended to binary outcomes: Y ~ Bernoulli(sigmoid(X * beta + noise))

Tuesday, August 9, 2016

How much does losing the first round of a tournament hurt your final result?

This is the second half of a two-part piece about arguments I’ve had with Shengwu Li. Part 1 gives a statistical argument for why in some states it is rational to vote.

Today’s question: can you sleep through the first round of a 9-round debate tournament, like the World or European championships, without hurting your final result? You might think this question is only interesting to debaters, but the mathematical way of answering it is potentially applicable to any tournament with some randomness in pairings in which the winners play the winners and the losers play the losers; this also happens in chess and lots of other competitions.

(Brief note for non-debaters: in each round of the debate world championships, you compete against three other teams, earn up to three points, and want to accumulate as many points as possible over the tournament. In each round, you’re paired against teams with roughly the same number of points, so if you win, you face better teams.)
The argument that the first round doesn’t matter is that, if you lose, you face worse teams in the next eight rounds, an advantage you can capitalize on. Perhaps this evens out.

A first glance at the data would imply this argument is wrong. In the three debate world championships from 2013 - 2015, teams who earned three points in their first round ended up with about six more points by the end of the tournament than teams who earned zero points in their first round. This seems pretty amazing, because you only get three points from winning your first round, and then you have to face better teams; how do you end up six points better? Perhaps winning your first round gives you a confidence boost that improves your performance?

This reasoning is wrong. Winning your first round doesn’t necessarily cause you to do better in later rounds; it’s just a sign you’re a better team. (Similarly, getting in an ambulance doesn’t cause you to die; it’s just a sign that you’re sick.) The teams who win their first rounds would’ve done better whether they won their first rounds or not.

If we want to figure out whether winning your first round causes you to do better overall, we need to control for the fact that teams that win their first rounds are better. In an ideal world, we’d just do an experiment: instead of running the first round, we’d divide teams into four random, equally-sized groups, give one group 3 points, one group 2 points, one group 1 point, and one group 0 points, run the rest of the tournament normally, and see whether the initial advantage ended up mattering. But we can’t run a real random experiment, so we need to find a random factor that affects who wins the first round. We call the random factor an instrumental variable. The math behind how exactly this works is beyond the scope of the post (here’s a less mathy reference, here’s a more mathy one), but the basic recipe is straightforward. If you want to know how cause X affects outcome Y, you need to find a random factor Z which affects X (and is uncorrelated with Y when controlling for X). Then there’s some math that lets you put those three ingredients together to figure out how X affects Y. That’s a lot of symbols, so here are some examples.
How does X... | Affect Y... | Random factor Z which affects X
Serving in the Vietnam War | Future earnings | Draft lottery number
Tea party protests | Election outcomes | Rain on tax day
Iron metabolism | Risk of Parkinson’s | Genetic variants affecting iron metabolism
Height and BMI | Socioeconomic status | Genetic variants affecting height and BMI


To return to our debate problem: our X is how well you do in round 1, our outcome Y is how well you do in the tournament overall, and now we need a random factor Z that affects how well you do in round 1. One random factor that affects a team’s probability of winning round 1 is how good the other teams in the round are, since teams are randomly matched in the first round. We estimate the likelihood that each team will win the round using a number of other factors, including their record in past tournaments, their school’s record in past tournaments, and their EFL / ESL status [1], and use these as our instruments. Then we use these instruments to estimate the true causal effect of winning round 1.  (Another source of randomness you could potentially use is which position you’re assigned to; we discuss this further here [2]).
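For intuition about the recipe, here's a toy two-stage least squares sketch in Python. This is emphatically not our actual analysis: every variable and number below is invented for illustration, and the true causal effect is set to zero so you can see the naive estimate mislead.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Toy version of the setup. Team "ability" confounds both the
# round-1 result and the final score.
n = 5000
ability = rng.normal(size=n)
opp_strength = rng.normal(size=n)  # Z: randomly assigned, affects round 1 only

round1 = ability - opp_strength + rng.normal(size=n)   # X
final = 3 * ability + 0 * round1 + rng.normal(size=n)  # Y: true effect of X is zero

# Naive regression of final score on round-1 result is confounded by ability.
naive = sm.OLS(final, sm.add_constant(round1)).fit()

# Manual two-stage least squares: regress X on Z, then Y on the fitted X.
# (Point estimate only; use a proper IV package for correct standard errors.)
stage1 = sm.OLS(round1, sm.add_constant(opp_strength)).fit()
stage2 = sm.OLS(final, sm.add_constant(stage1.fittedvalues)).fit()

print(naive.params[1], stage2.params[1])  # naive is biased upward; IV is near zero
```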

Here are the results. The red line shows the estimated effect of your round one result on your score after each round (x-axis) using the naive, incorrect method we describe above; the black line shows the estimated effect using the more sophisticated instrumental variable method.
A couple of interesting things here. First, the red line and the black line are very different; in particular, the naive approach (red line) suggests that winning round 1 has a much larger effect, one that grows as the tournament progresses, which is wrong for the reasons discussed above. The more sophisticated analysis implies that while winning round 1 affects your performance for a couple of rounds, by the end of the tournament (round 9) it doesn’t make much difference. (The 95% confidence interval on the estimate is unfortunately large [3], because the instruments aren’t that strong.)
Takeaways:

  1. Don’t freak out too much if you do badly in an early round in a long tournament like Worlds or Euros -- our estimates suggest it doesn’t affect your final performance very much, if at all. On the other hand, in shorter 4 - 6 round tournaments, losing an early round can have a larger effect.
  2. If early rounds have a smaller effect on your final performance, it might be better to run controversial debates in earlier rounds; that way if teams had to opt out, they could do so without taking as large a hit to their final result.
  3. Storing comprehensive data from all Worlds in a single centralized repository, with a standard format, would make these analyses easier and is worth doing.
  4. Naive correlational estimates of effects are often different from those estimated using causal methods. For the love of God, bear this in mind when reading popular social science coverage. Even if the authors claim to have “controlled for other factors”, this phrase does not work magic. It is very difficult to control completely for other factors.

Notes:
[1] This isn’t, of course, a perfect measure of team strength -- since Oxford A is usually composed of different people from year to year, and many teams do not appear in other Worlds.
[2] In each round, every team will debate the same statement -- for example, “We should ban abortions at all stages of pregnancy” -- but because there are four teams in the same room, two teams will be randomly assigned to argue for the statement, and two teams will be assigned to argue against the statement. It’s hard to come up with statements which are perfectly balanced, so it’s often better to be on one side or the other. We found that a team’s score after the first four rounds was significantly associated with the positions they had been randomly assigned in those rounds, so position in the first four rounds could be used as an instrument. You have to use the first four rounds, not just the first round, to satisfy the exclusion restriction, since most debate teams cycle through all four positions in the first four rounds.
[3] I think you should actually compute standard errors clustering by round, since results within round are highly correlated; when you do this, the errorbars are slightly larger but the conclusions are the same.
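For anyone curious about the mechanics, here's a toy illustration of cluster-robust standard errors in statsmodels. The data are invented, not our debate results; the point is just that when both the regressor and the noise share a round-level component, the clustered errorbars come out larger.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Invented data: both the regressor and the noise share a round-level
# component, so observations within a round are correlated.
n_rounds, per_round = 9, 200
round_id = np.repeat(np.arange(n_rounds), per_round)
n = n_rounds * per_round
x = rng.normal(size=n_rounds)[round_id] + rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n_rounds)[round_id] + rng.normal(size=n)

X = sm.add_constant(x)
plain = sm.OLS(y, X).fit()
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": round_id})
print(plain.bse[1], clustered.bse[1])  # the clustered errorbars are larger
```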