Assigning Probability to Election Outcomes

25 Sep 2024

Is it a good idea? I’m not sure but that’s not going to stop us. The goal of this post is to build a model for the 2024 United States presidential election.

Disclaimer: My recommendation is to treat any probabilistic election model as entertainment only. Such models are not decision-making tools, and their predictions cannot be properly validated.

How the Elections Work

There are 538 electoral votes divided between 50 states, 5 congressional districts and a single federal district (from now on, I will use the term “state” for all of them). For example, California has 54 votes. The candidate who wins the popular vote in a given state takes all its electoral votes. The candidate who gets at least 270 votes is elected president.

Polling Data = $\boldsymbol{\mu}$

The polls can be downloaded from the fivethirtyeight website. The file contains many columns, but we will use just a few of them. We also keep just the state-level polls and throw away the national polling. After reshaping and aggregation, the data look like this:

start_date	sample_size	state	dem	rep
2021-04-21	933	Missouri	0.38	0.53
2021-05-07	1267	New Hampshire	0.51	0.43

Each row is a single poll. The dem and rep columns represent the relative numbers of respondents choosing the Democratic and Republican candidates respectively. In cases where there is more than one candidate per party in a poll, we take the most popular candidate.

The next step is to calculate the weight of each poll. Polls with larger weights will have a higher influence on the model prediction. The weight $w_i$ of the $i$-th poll is

\[w_i = N_i e^{-0.01 d_i}\]

where $N_i$ is the sample size, and $d_i$ is the age, defined as the number of days between the start date of a poll and the start date of the latest poll in the same state.
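As a rough illustration, the weighting rule could be sketched in plain Python as follows. The list-of-dicts layout, the function name `poll_weights` and the second Missouri poll are assumptions made for this example; they are not taken from the notebook.

```python
import math
from datetime import date


def poll_weights(polls):
    """Return w_i = N_i * exp(-0.01 * d_i) for each poll, in input order.

    `polls` is a list of dicts with keys "state", "start_date" (datetime.date)
    and "sample_size". This layout is an assumption made for the example.
    """
    # Start date of the latest poll in each state: the reference point for age.
    latest = {}
    for poll in polls:
        state = poll["state"]
        if state not in latest or poll["start_date"] > latest[state]:
            latest[state] = poll["start_date"]

    weights = []
    for poll in polls:
        age_days = (latest[poll["state"]] - poll["start_date"]).days
        weights.append(poll["sample_size"] * math.exp(-0.01 * age_days))
    return weights


polls = [
    {"state": "Missouri", "start_date": date(2021, 4, 21), "sample_size": 933},
    # Hypothetical recent poll, included only so the first one has a nonzero age.
    {"state": "Missouri", "start_date": date(2024, 9, 1), "sample_size": 800},
]
weights = poll_weights(polls)
```

The most recent poll in a state has age zero, so its weight equals its sample size. A poll that is 100 days older is discounted by a factor of $e^{-1} \approx 0.37$.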

Because the model assumes the election is between just two candidates, we want the dem and rep proportions to add up to one. We can achieve this by applying the softmax function row-wise to the dem and rep columns.

start_date	sample_size	state	dem	rep	w	softmax_dem	softmax_rep
2021-04-21	933	Missouri	0.38	0.53	0.0038	0.46	0.54
2021-05-07	1267	New Hampshire	0.51	0.43	0.0061	0.52	0.48
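For two candidates, the row-wise softmax reduces to a few lines. The sketch below uses a hypothetical helper named `two_way_softmax` and checks it against the Missouri row from the table.

```python
import math


def two_way_softmax(dem, rep):
    """Softmax over the pair (dem, rep); the two outputs sum to one."""
    # Subtracting the maximum does not change the result and avoids overflow.
    m = max(dem, rep)
    e_dem = math.exp(dem - m)
    e_rep = math.exp(rep - m)
    total = e_dem + e_rep
    return e_dem / total, e_rep / total


softmax_dem, softmax_rep = two_way_softmax(0.38, 0.53)
print(round(softmax_dem, 2), round(softmax_rep, 2))  # 0.46 0.54
```

Note that softmax is not the same as dividing each share by dem + rep. For the Missouri row, that simple renormalization would give about 0.42 for the Democratic share instead of 0.46.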

At this point, we can create a vector of polling averages $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots)$, where $\mu_j$ is the weighted polling average for the Democratic Party in the $j$-th state, calculated using the w and softmax_dem columns.
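Written out, the weighted average for state $j$ is $\mu_j = \sum_i w_i s_i / \sum_i w_i$, where the sums run over the polls in that state and $s_i$ is the softmax_dem value of the $i$-th poll. A minimal sketch, assuming the polls are available as (state, w, softmax_dem) tuples; the function name and the second Missouri row are hypothetical:

```python
from collections import defaultdict


def polling_averages(rows):
    """Return {state: weighted average of softmax_dem} using the weights w.

    `rows` is an iterable of (state, w, softmax_dem) tuples; this layout is
    an assumption made for the example.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for state, w, softmax_dem in rows:
        weighted_sum[state] += w * softmax_dem
        weight_total[state] += w
    return {state: weighted_sum[state] / weight_total[state] for state in weighted_sum}


rows = [
    ("Missouri", 0.0038, 0.46),       # row from the table above
    ("Missouri", 800.0, 0.45),        # hypothetical recent poll
    ("New Hampshire", 0.0061, 0.52),  # row from the table above
]
mu = polling_averages(rows)
```

Because of the exponential decay, the old Missouri poll has almost no influence here: the average is dominated by the recent poll.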

Election Data = $\boldsymbol{\Sigma}$

The other dataset we need for the model to work is the historical election results. We can get the data from Kaggle. Again, we apply the softmax function to the Democratic and Republican results to get the following dataframe:

year	state	dem	rep	softmax_dem	softmax_rep
2008	Alaska	0.37	0.59	0.44	0.56
2012	Alaska	0.40	0.54	0.46	0.54
2016	Alaska	0.36	0.51	0.46	0.54
2020	Alaska	0.42	0.52	0.47	0.53

We use the softmax_dem values from the last four elections to compute a covariance matrix $\boldsymbol{\Sigma}$ where $\Sigma_{jk}$ is the covariance between the $j$-th and $k$-th states. The matrix tells the model how strongly the outcomes in different states move together.
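One way to compute such a matrix is with NumPy's `np.cov`. In the sketch below, the first column holds the Alaska softmax_dem values from the table, while the second column is made up for illustration. Whether the notebook uses the same normalization (`np.cov` divides by the number of elections minus one by default) is an assumption.

```python
import numpy as np

# Rows are elections (2008, 2012, 2016, 2020), columns are states.
# Column 0: Alaska softmax_dem values from the table above.
# Column 1: a made-up second state, for illustration only.
results = np.array([
    [0.44, 0.55],
    [0.46, 0.56],
    [0.46, 0.54],
    [0.47, 0.57],
])

# rowvar=False tells NumPy that each column (not each row) is one variable.
sigma = np.cov(results, rowvar=False)
print(sigma.shape)  # (2, 2)
```

With only four elections, the sample covariance has rank at most three, so for the full set of states the matrix is singular. This matters when choosing how to sample from the distribution.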

How the Model Works

With $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ ready, model inference is just drawing from a distribution

\[\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\]

where $\boldsymbol{y}$ is the model output and $y_j$ is the predicted Democratic result in the $j$-th state (both $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are computed using the softmax_dem values). If $y_j > 0.5$, we assign the electoral votes from the $j$-th state to the Democratic candidate. Summing the electoral votes over all states where $y_j > 0.5$ gives the number of electoral votes for the Democratic Party. If the sum is at least 270, the Democrats win.

We simulate many repetitions by drawing from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ many times. If there are 10,000 elections in the simulation and Democrats get at least 270 votes in 5,000 of them, the model predicts that Democrats win with a 50% probability.
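The simulation loop can be sketched with NumPy as below. The function name, the three-"state" toy inputs and the fixed seed are assumptions made for this example, not the notebook's actual code.

```python
import numpy as np


def simulate_dem_win_probability(mu, sigma, electoral_votes, n_sims=10_000, seed=0):
    """Estimate the share of simulated elections with at least 270 Democratic votes."""
    rng = np.random.default_rng(seed)
    # One row per simulated election, one column per state.
    # check_valid="ignore" skips the positive-semidefinite check, which a
    # singular sample covariance can trip because of rounding error.
    y = rng.multivariate_normal(mu, sigma, size=n_sims, check_valid="ignore")
    # A state goes to the Democrats when the simulated share exceeds 0.5.
    dem_votes = (y > 0.5).astype(int) @ np.asarray(electoral_votes)
    return float(np.mean(dem_votes >= 270))


# Toy inputs (made up): one safe Democratic state, one safe Republican state
# and one toss-up that decides the election.
mu = np.array([0.60, 0.40, 0.50])
sigma = np.diag([0.02**2, 0.02**2, 0.02**2])
electoral_votes = np.array([200, 200, 138])

p_dem = simulate_dem_win_probability(mu, sigma, electoral_votes)
```

In this toy setup the outcome hinges on the toss-up state, so the estimate should land close to 50%.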

It is important to remember that the model estimates the probability as if the election were held today; it doesn’t account for possible polling changes in the future.

Results

As of October 5th, the model assigns a 55% probability to Democrats winning the election. This is exactly the same as the prediction published by fivethirtyeight.

Posterior

The notebook with all the code is available on my GitHub.