The last time I tried to make a prediction, in the Democratic Senate primaries, I was way off (I had Capuano winning 37-34).
It was a humbling introduction to the online prediction world, but I noticed that I wasn’t alone. The BMG predictions were also pretty far off, and in the same direction: they still had Coakley winning, but by a significantly smaller margin than she actually won by (37-32 rather than the actual 47-28).
At the time I had also noticed from straw polling that BMG overwhelmingly supported Capuano, so I made a note of it in my write-up as a potential source of bias. Estimating roughly what effect that would have, I knocked a few points off Capuano and added them to Pags, Khazei, and especially Coakley to account for the mismatch between BMG and the rest of the voters. Polls showed that Coakley was certainly not down 66-33, so it was clear that BMG was not a representative sample.
The adjusted estimate I came up with back then was:
I mentioned those adjusted numbers, but then went back to my own reasoning. In retrospect, however, they turned out to be very close to the final numbers. If I had done the exact same thing but shifted a few more of Capuano’s points to Coakley instead of to Khazei and Pagliuca, it would have almost exactly matched the final results:
That reinforced the idea that a personal bias towards a candidate, in this case Capuano, was strongly affecting the results of not only my own prediction, but the collective predictions of the Capuano-friendly BMG.
If this bias could be identified and quantified, then maybe it could be adjusted for to produce genuinely accurate predictions, in the same way that polls can be adjusted to be more accurate using known partisan and demographic data.
To test that theory, I decided to crunch some of the numbers to try to find patterns. As I saw it, I would need to find out how much predictive error the bias towards a candidate introduces. Put another way, I wanted to find the strength of the correlation between the collective support for a candidate (as measured by the straw poll) and the amount of error in the prediction for that candidate. The information I would need would be:
1) Candidate support (from BMG Straw Poll)
2) BMG Predictions for each candidate (as aggregated here)
3) Actual election results to get the real support levels (from the state election division)
4) The predictive error: the difference between the BMG predictions and the actual results for each candidate (prediction minus actual). This tells us how many percentage points each prediction was off by.
Capuano: 32.20 – 27.68 = +4.52%
Coakley: 36.77 – 46.57 = -9.80%
Khazei: 18.81 – 13.35 = +5.46%
Pagliuca: 11.92 – 11.99 = -0.07%
5) The potential bias: the difference between BMG’s straw-poll support and the actual electorate’s vote for each candidate (straw poll minus actual).
Capuano: 67.12 – 27.68 = +39.44%
Coakley: 23.29 – 46.57 = -23.28%
Khazei: 9.59 – 13.35 = -3.76%
Pagliuca: 0 – 11.99 = -11.99%
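For anyone who wants to follow along, the two subtraction steps above can be sketched in a few lines of Python (the numbers are the ones quoted in this post, all in percent):

```python
# Steps 4 and 5: predictive error and potential bias.
# predictions = BMG composite predictions, actual = primary results,
# straw_poll = BMG straw-poll support, all taken from the figures above.
predictions = {"Capuano": 32.20, "Coakley": 36.77, "Khazei": 18.81, "Pagliuca": 11.92}
actual      = {"Capuano": 27.68, "Coakley": 46.57, "Khazei": 13.35, "Pagliuca": 11.99}
straw_poll  = {"Capuano": 67.12, "Coakley": 23.29, "Khazei": 9.59,  "Pagliuca": 0.0}

# Error: how far each prediction was from the result (prediction minus actual).
error = {c: round(predictions[c] - actual[c], 2) for c in predictions}
# Bias: how far BMG's straw poll was from the actual electorate.
bias = {c: round(straw_poll[c] - actual[c], 2) for c in predictions}

print(error)  # Capuano +4.52, Coakley -9.8, Khazei +5.46, Pagliuca -0.07
print(bias)   # Capuano +39.44, Coakley -23.28, Khazei -3.76, Pagliuca -11.99
```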
Now, the thing we are most interested in is the correlation between #4 (the predictive error) and #5 (the potential bias). If the error is correlated with the bias, then it could be very useful for analyzing future predictions: if we know a sample has a heavy bias, we can assume its predictions have an error in the same direction. Our data:
I included the 0,0 point because this data should be centered around 0 — that is, if there is 0 bias, then there should be 0 error introduced by bias. This is an assumption I am making for the model, but I believe it is a valid one.
To find the correlation, we want to graph them against each other, with the error on one axis and the bias on the other. I used OpenOffice Calc (basically a free version of Excel) to create this graph:
Predictive Error (X) vs. Potential Bias (Y)
The line in there is the “best fit line” you can add to scatter plots, produced by linear regression (I believe those terms refer to the same thing here). The chart also gives a formula for that line, which is:
f(x) = 2.65x + 0
f(x) = bias
x = error
This means that the bias (y) can be estimated as 2.65 times the error (x), or conversely that the error is the bias divided by 2.65. So if we had a potential bias of 20 points, we would expect an error of 20/2.65, or about 7.5 points.
The other number it gives:
r(squared) = 0.46
is the coefficient of determination (the square of the correlation coefficient). It indicates how closely the points match the line, on a scale from 0 to 1: a value of 0 would mean the points are scattered with no linear relationship to the line, and 1 would mean they fall directly on it.
0.46 is a reasonably strong value for data this noisy, so we can have some confidence that the relationship is real. An r(squared) of 0.46 means that 46% of the variation along one axis can be explained by its relationship with the other.
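As a sanity check on the spreadsheet, the same regression can be run by hand. A minimal sketch, using the (error, bias) pairs above plus the assumed (0, 0) point; because of small rounding in the inputs, the slope and r(squared) come out near, though not exactly at, the chart’s 2.65 and 0.46:

```python
# (error, bias) pairs for the four candidates, plus the assumed (0, 0) point.
points = [(4.52, 39.44), (-9.8, -23.28), (5.46, -3.76), (-0.07, -11.99), (0.0, 0.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x, _ in points)
syy = sum((y - mean_y) ** 2 for _, y in points)

slope = sxy / sxx                      # close to the chart's 2.65
intercept = mean_y - slope * mean_x    # essentially 0
r_squared = sxy ** 2 / (sxx * syy)     # close to the chart's 0.46

print(round(slope, 2), round(intercept, 2), round(r_squared, 2))
```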
OK, because this entry is now one-point-three gazillion times longer than I expected it to be, I will abbreviate the rest. Basically, I wanted to apply the same formula to this election to see if I could get more accurate predictions than the composite numbers. To get those I needed:
1) candidate support numbers (BMG and RMG straw polls),
2) the composite prediction numbers (listed above)
3) an estimate of the actual electorate (it doesn’t need to be exact; as long as it is close, it only slightly affects the numbers. For this I will use the average of all the predictions from BMG and RMG, shown above)
Using these we can get #4, the predictive error, and #5, the potential bias, by doing the same subtractions as before.
For instance, the BMG error towards Coakley, based on the numbers of this model, would be:
error = bias/2.65
error = (Coakley support – Actual electorate)/2.65
error = (86.11 – 49)/2.65
error = 37.11/2.65
error = 14.00
This would be the expected error if the r(squared) value had been exactly 1, meaning a perfect correlation. Since it is actually 0.46, we need to multiply the error by 0.46 to get the amount the bias should actually be influencing the error.
adjusted error = error * r(squared)
adjusted error = 14 * 0.46
adjusted error = 6.44%
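The whole adjustment collapses into one small function. A sketch, with the slope and r(squared) from the primary model baked in as defaults:

```python
def adjusted_error(straw_support, actual_estimate, slope=2.65, r_squared=0.46):
    """Expected prediction error (in points) for a sample with a known bias."""
    bias = straw_support - actual_estimate   # how unrepresentative the sample is
    return bias / slope * r_squared          # scale down by the model's fit

# BMG's Coakley numbers from above: 86.11% straw-poll support vs. an
# estimated 49% of the actual electorate.
print(round(adjusted_error(86.11, 49.0), 2))  # 6.44
```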
Sparing the details of the rest of the calculations, here are the numbers for the predictive model:
BMG adjusted predictions
Brown: 46.74 + 6.00 = 52.74%
Coakley: 50.39 – 6.34 = 44.05%
!!!!! This basically predicts a massive nine-point blowout for Brown. That seems a bit extreme, but it is what the model built on the primary results says. We could expect this result if BMG posters’ predictions are biased the same way in a two-way general election as they were in the four-way primary.
But what if we applied the model to the numbers from the RMG predictions? In a perfect model, we might expect both adjusted examples to show the same results.
RMG adjusted predictions
Brown: 50.15 – 8.04 = 42.11%
Coakley: 47.56 + 7.59 = 55.15%
!!!!?? Uh oh, applying the model to this set of data shows the opposite blowout, where Coakley wins by 13 points.
What this says to me is that the model created using data for the primaries is too sensitive to large swings. I think this is the product of having a four-way race with very little polling, which leads to the numbers in predictions being all over the place with a very high variability. In this current race, there is an abundance of polling and only two options, so the swings and prediction variance should be much less pronounced.
As another quick and dirty trick to even out this swing, we can take the average of the two model results (one from the left and one from the right) and hope that the middle ground will account for the built-in high variability of the model.
Average BMG/RMG adjusted predictions
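The averaging itself is just arithmetic; a short sketch, built from the two sets of sums above:

```python
# Adjusted predictions from the two communities (base prediction plus or
# minus the model's adjustment, both quoted above), in percent.
bmg = {"Brown": 46.74 + 6.00, "Coakley": 50.39 - 6.34}
rmg = {"Brown": 50.15 - 8.04, "Coakley": 47.56 + 7.59}

# Split the difference between the two opposite-direction swings.
average = {c: (bmg[c] + rmg[c]) / 2 for c in bmg}
print(average)  # Brown ~47.4, Coakley ~49.6
```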
So that is my model’s prediction for now! I’m sure it will need plenty of adjustment once all the new numbers for this election come in. I apologize for the excessively long post; writing it all down as I went helped me sort it out. If there is a next time, I will skip many more steps.