Somewhere, in a cubicle in Washington, a data analyst is panicking. She has just been asked by the Trump administration to show how 3 million people (or, preferably, more) voted illegally. Deep down, she knows that this is a ridiculous request. But she’s a team player.
She will first try to identify specific cases of clear voter fraud. The goal will be to collate these clear cases into a list of names and addresses. A list with three million entries.
She'll start with the low-hanging fruit. She'll cross-reference voting lists (not registration lists) to the National Death Index. She needs to look at real voting lists since dead people may still be on the registration lists without actually voting. She’ll find a few matches but, unfortunately, they will all prove to be false positives. People with the same name, people who didn't really die, people who didn't actually vote.
She'll then try to figure out if illegal immigrants voted by cross-referencing voter lists with E-Verify, the government's electronic employment verification tool. She'll get a few hits, but, again, they will all prove to be false positives. People with the same name, people who actually are citizens but aren't in the system, and so on.
And that’s where the investigation will end, right? Wrong. President Trump has shown us that he is never wrong, and must prove it despite incontrovertible evidence to the contrary. Fortunately for him, there are still a few ways to find 3 million illegal votes. That they all depend on egregious misapplication of statistics and data manipulation can be spun into obscurity by the Press Secretary and the other dynamos in the Trump team. Here are the options.
Option 1: Use bad survey data.
When Sean Spicer was pressed to defend the President's voter fraud claim, he cited this paper, by Jesse Richman, Gulshan Chattha and David Earnest. They used an internet survey and found that a small fraction of people said they voted, but also clicked that they weren't citizens. Of 32,800 surveyed in 2008, a total of 339 people labeled themselves as non-citizens. Of those, 38 people also claimed that they voted, about 11%. Since there are 11 million illegal immigrants in this country (according to Pew Research), we can take 11% of 11 million people and calculate around 1 million fraudulent votes.
Damn, not quite close enough to 3 million. But enough to get you some voter ID laws passed (and, after all, isn’t that really the point?).
Fact Box: The paper has been roundly criticized for its reliance on incredibly small numbers (one tenth of one percent of respondents) to make broad national claims. Also, the simple error rate from misclicks on citizenship and voting history could easily account for these numbers. Finally, as the survey was targeted to voters (party affiliation and voter registration were part of the survey weighting strategy), it inflates the chances that you will find people who, you know, report that they voted.
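For anyone who wants to check the arithmetic, here is a minimal Python sketch. The figures are the ones quoted above; the variable names are mine, and `required_misclick_rate` is a hypothetical helper that assumes misclicks on the citizenship question are the only source of error. Its `share_who_report_voting` parameter is also an assumption: the default of 1.0 means every misclicking citizen also reported voting, which gives the smallest misclick rate that could explain the result by itself.

```python
# Figures quoted in the article.
surveyed = 32_800
self_reported_noncitizens = 339
noncitizen_voters = 38
undocumented_population = 11_000_000  # Pew Research estimate cited above

# Step 1: share of self-reported non-citizens who also said they voted.
rate = noncitizen_voters / self_reported_noncitizens

# Step 2: the extrapolation to the whole undocumented population.
extrapolated = rate * undocumented_population

# How much of the full sample the whole claim rests on.
share_of_sample = noncitizen_voters / surveyed


def required_misclick_rate(share_who_report_voting=1.0):
    """Misclick rate among citizen respondents that would, by itself,
    produce all 38 apparent non-citizen voters (hypothetical helper)."""
    citizens = surveyed - self_reported_noncitizens
    return noncitizen_voters / (citizens * share_who_report_voting)


print(f"Reported voting rate among 'non-citizens': {rate:.1%}")
print(f"Extrapolated 'fraudulent' votes: {extrapolated:,.0f}")
print(f"Share of all respondents involved: {share_of_sample:.2%}")
print(f"Misclick rate needed to explain it all: {required_misclick_rate():.2%}")
```

The last number is the one to look at: it shows how small an error rate on a single checkbox would be enough to manufacture the entire finding.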
Option 2: Use numerology
The administration has been uniquely willing to support and disseminate conspiracy theories. One could argue that the President's real foray into politics was his vocal support of the conspiracy theory that Barack Obama was not born in the US. His refrain when confronted with facts: "I'm just asking questions". Remember that, Mr. President. You'll be using it a lot.
There are some interesting things that happen with long lists of numbers. Certain digits come up more than others. This phenomenon is known as Benford's law and it basically says that in many long lists of real-world numbers, the first digit will be "1" more often than any other digit. This is not an alt-fact, this is a true fact.
If humans are going to misreport voting numbers, they tend to do it in biased ways. No human, planning to subvert the democratic process, would falsely report votes in his or her district using nice round numbers. That's too obvious. They would make up a number that seems real, something like 1,357. The important thing here is the digits. If we look at the second digit in vote tallies, so the theory goes, we may find evidence of election fraud because they won't be distributed according to a set mathematical formula.
If you look at enough states, you are guaranteed to find a place where the second digits of vote totals seem off.
In fact, I started with Connecticut to help you out.
I took 682 polling stations in Connecticut and looked at the second digit of the vote tally reported. Here's what I found:
That second digit seems to be spread pretty evenly across all digits from 0 to 9. But we can still look for possible conspiracies.
Some cool math tells us that the average value of the second digit in a long list of numbers should be 4.187. The average of the second digit of the 682 polling place tallies in Connecticut is 4.38. That is not statistically different than 4.187. Okay, fine. Let's look at the second digit of only the Clinton totals: average of 4.04. Still not statistically different than 4.187.
How about the Trump totals? The average of the second digit for his tallies is 3.77. That is statistically different. This would only happen by chance 1 in 1000 times.
Now take that number and run with it. Can it tell you that voter fraud favored Clinton? No. But who cares? You're just asking questions, remember? One in 1000! At the very least, this data should allow you to pass some voter ID laws (and, after all, isn't that really the point?).
Fact Box: These mathematical approximations are just that. They apply only in idealized scenarios, with lists of numbers of infinite size, spanning multiple orders of magnitude. This is not the case in CT, or elections in general. Lack of conformity to the expected distribution can be due to the fact, for example, that Connecticut has an outsized proportion of polling places that have three-digit totals. And, obviously, I cherry-picked the data. When my narrative wasn’t borne out in the vote totals, I checked the Clinton votes. Then I checked the Trump votes. I could have done this all down the ticket.
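If you want to see where the 4.187 comes from, and why cherry-picking works so well, here is a rough Python sketch. The first function is the standard Benford formula for the second significant digit. The second, `chance_of_a_fluke`, is a hypothetical helper of mine that assumes each "look" at the data is an independent test, which real vote tallies are not, so treat its output as a ballpark only.

```python
import math


def second_digit_probability(d):
    """Benford probability that the second significant digit equals d.

    Sums over every possible first digit (1 through 9).
    """
    return sum(math.log10(1 + 1 / (10 * first + d)) for first in range(1, 10))


# Distribution of the second digit, 0 through 9, and its average value.
probs = [second_digit_probability(d) for d in range(10)]
expected_mean = sum(d * p for d, p in enumerate(probs))


def chance_of_a_fluke(looks, alpha):
    """Chance that at least one of `looks` independent tests comes up
    'significant' at level `alpha` when nothing is actually going on.
    Independence is an assumption made for illustration."""
    return 1 - (1 - alpha) ** looks


for d, p in enumerate(probs):
    print(f"second digit {d}: {p:.4f}")
print(f"expected average second digit: {expected_mean:.3f}")

# Checking more and more races until something looks odd.
for looks in (1, 3, 10, 50):
    print(f"{looks:>2} looks at alpha=0.05: {chance_of_a_fluke(looks, 0.05):.0%}")
```

Note how flat the second-digit distribution is compared with the famous first-digit one, and how quickly the odds of a "suspicious" result climb once you allow yourself a few extra looks.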
Option 3: Use Modeling
If it's good enough for Nate Silver, it should be good enough for us, right?
Models are fun. We can build them using statistical software to make predictions. Sometimes they work really well. Sometimes they don't. But as a means to an end, they are perfect.
First, we'll make a model to predict how many votes Trump should have received in each polling place.
I took voter registration data from each of my 682 Connecticut polling places, and tried to predict the number of Trump votes. As you might expect, the more Republicans there were, the more votes Trump got. He also did well with "unaffiliated" voters. Here’s a graph:
Not too shabby. Our model is pretty good at predicting Trump votes. But there is a dot that seems out of place (I circled it in red for you). An outlier. Or, in alt-speak: a highly suspect polling station. Trump got WAY fewer votes here than we expected. About 1500 "missing" votes, you might say.
We can actually measure how far off this polling place is from the expected value statistically, and report just how unlikely it is.
We predicted about 2500 votes for Trump from that polling place, but only got a bit above 1000. How unlikely was that? Get this – it’s a 1 in 3.6 million chance. This is pure propaganda gold, a baby-blue boxed gift to Kellyanne Conway. By the way, I would mention which polling place this was if I weren’t worried that someone would show up there to do some vigilante-style “investigation”. Clear evidence of voter fraud, and in Connecticut of all places. Hillary central. Case closed. Extrapolate this to the rest of the country, and boom – 3 million votes, no problem. At the very least, this data should allow you to pass some voter ID laws (and, after all, isn't that really the point?).
Fact Box: The result is a fabrication. It is based not on the alt-fact that about 1500 Trump votes went missing. It’s based on the true fact that my model to predict Trump votes isn’t nearly as good as it looks. I put three things into the model: the percent of Republicans, the percent of Democrats, and the percent of Unaffiliated voters. In that outlier polling place, 12.5% of the voters were Republicans, 42% Democrats, and 44% Unaffiliated. The model expects Trump to do better here because unaffiliated people broke for Trump in the rest of Connecticut. But there are lots of reasons to be unaffiliated. You might be a fiercely independent New Englander who won’t commit to a party. But maybe you just couldn’t be bothered to check the box on the registration form. These two types of unaffiliated individuals may vote very differently. But the model can’t capture that. The model only works if all of the other factors that influence voting are distributed randomly across all polling places. And that, my friends, is the real statistical impossibility.
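Here is a toy version of the trick in Python, so you can watch it happen. Everything in it is made up for illustration: the data are synthetic, not the Connecticut numbers, the coefficients are arbitrary, and I use registration counts plus a plain least-squares fit, which may not match the model behind the graph above. One simulated polling place has unaffiliated voters who behave differently from everyone else's, and that alone is enough to produce an astronomically "unlikely" result.

```python
import math
import random
import statistics


def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic polling places: (republicans, democrats, unaffiliated, votes).
places = []
for _ in range(200):
    rep, dem, una = (rng.uniform(200, 2000) for _ in range(3))
    votes = 0.8 * rep + 0.05 * dem + 0.4 * una + rng.gauss(0, 60)
    places.append((rep, dem, una, votes))

# One more place where unaffiliated voters mostly did NOT break for the
# candidate. No fraud involved, just a factor the model cannot see.
rep, dem, una = 500.0, 1700.0, 1800.0
places.append((rep, dem, una, 0.8 * rep + 0.05 * dem + 0.05 * una))

X = [[1.0, r, d, u] for r, d, u, _ in places]  # leading 1.0 is the intercept
y = [v for *_, v in places]
beta = fit_ols(X, y)

# Residual = actual minus predicted, scaled by the residual spread.
resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
sd = statistics.stdev(resid)
z = [e / sd for e in resid]

# The "most suspicious" polling place and its normal-tail probability.
worst = min(range(len(z)), key=z.__getitem__)
tail = 0.5 * math.erfc(-z[worst] / math.sqrt(2))

print(f"most 'suspicious' place: index {worst}, z = {z[worst]:.1f}")
print(f"chance under the model: about 1 in {1 / tail:,.0f}")
```

The tiny probability is real in the sense that the model really does find that place surprising. It just measures how wrong the model is about unaffiliated voters there, not how many votes were stolen.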
Modeling can be used for more than predicting expected votes, though. A particularly nefarious strategy would be to use a model to predict whether or not someone is an illegal immigrant (based on factors like ethnicity, duration of residence, zip code, etc.). Models always have false positives. Worse models have more false positives. Using such a model you could show not that illegal immigrants voted, but that people who seem like illegal immigrants voted. But those little details will be left out of the discussion.
Remember this when the results of the “investigation” are in. There will be no evidence of voter fraud.
What there will be are minor statistical aberrations combined with bad methods, which will lead to “serious questions about the safety of our voting system”. Don’t believe it when they say it. And don’t allow these tricks to disenfranchise real American citizens via draconian voter ID laws.
There will be no real proof of voter fraud, there will only be “questions”. There will be no names, only numbers. There will be no evidence, only conspiracy theories.