Let’s take a look at variable importance for the 2012 US Census data (on a per-county basis). Note the high importance of the attribute measuring the proportion of each county’s population identified as ‘Asian.’
We can perform the same importance analysis for 2016.
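As a sketch of how such importance scores might be computed (the exact chunk is not shown here), we could fit a conditional random forest and use permutation importance; the data frame census_2012 and the ObamaWin outcome column are placeholder names, not objects defined in this document.

library(party)

# Placeholder data frame of per-county census attributes plus the winner label
census_2012$ObamaWin <- as.factor(census_2012$ObamaWin)   # 1 = Obama win, 0 = Romney win

set.seed(123)
cf_2012 <- cforest(ObamaWin ~ ., data = census_2012,
                   controls = cforest_unbiased(ntree = 500))

# Permutation-based variable importance; larger values indicate more influential attributes
vi <- sort(varimp(cf_2012), decreasing = TRUE)
vi
dotchart(rev(vi), xlab = "Permutation importance")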
We can build a conditional inference tree to predict county-level outcomes from the 2012 data. In this case, our accuracy was 86.4% on our test data set (held out from the training data). As a note, accuracy is calculated by summing the proportions in the top-left and bottom-right cells, i.e., along the main diagonal of the table.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 624
##
##
## | Predicted Type
## Actual Type | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 460 | 16 | 476 |
## | 0.737 | 0.026 | |
## -------------|-----------|-----------|-----------|
## 1 | 69 | 79 | 148 |
## | 0.111 | 0.127 | |
## -------------|-----------|-----------|-----------|
## Column Total | 529 | 95 | 624 |
## -------------|-----------|-----------|-----------|
##
##
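The cross-tabulation above follows the format of gmodels::CrossTable. As a sketch (not the exact chunk used here), the tree and the table could be produced along these lines; census_2012, the ObamaWin column, and the 80/20 split are assumed placeholders, with ObamaWin converted to a factor as in the sketch above.

library(party)     # ctree()
library(gmodels)   # CrossTable()

set.seed(123)
train_idx  <- sample(nrow(census_2012), round(0.8 * nrow(census_2012)))
train_2012 <- census_2012[train_idx, ]
test_2012  <- census_2012[-train_idx, ]

# Conditional inference tree predicting the county winner from the census attributes
ct_2012 <- ctree(ObamaWin ~ ., data = train_2012)

# Confusion matrix with counts and proportions of the table total, as above
pred_ct <- predict(ct_2012, newdata = test_2012)
CrossTable(test_2012$ObamaWin, pred_ct,
           prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE,
           dnn = c("Actual Type", "Predicted Type"))

# Accuracy: main-diagonal counts divided by the total number of test counties
mean(as.character(pred_ct) == as.character(test_2012$ObamaWin))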
We can apply this model to 2016 data as well. Here, our accuracy on the test data decreased to 71.65%.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 624
##
##
## | Predicted Type
## Actual Type | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## Donald Trump | 437 | 85 | 522 |
## | 0.700 | 0.136 | |
## ----------------|-----------|-----------|-----------|
## Hillary Clinton | 92 | 10 | 102 |
## | 0.147 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 529 | 95 | 624 |
## ----------------|-----------|-----------|-----------|
##
##
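Applying the 2012-trained tree to 2016 data, as in the table above, might look like the following sketch; test_2016 and its lead column (the actual 2016 winner) are placeholders, and we assume the 1 coding corresponds to a Democratic win, matching the 2012 labels.

# Predict 2016 county winners using the tree fit on 2012 data
pred_2016 <- predict(ct_2012, newdata = test_2016)

CrossTable(test_2016$lead, pred_2016,
           prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE,
           dnn = c("Actual Type", "Predicted Type"))

# Accuracy, mapping the actual winner onto the model's 0/1 coding
mean(as.character(pred_2016) ==
       ifelse(test_2016$lead == "Hillary Clinton", "1", "0"))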
We used a rule-based method (JRip) to model the 2012 data. Here, our accuracy in predicting county-level outcomes was 85.3%. We’ve also listed the rules below. Our success with rule-based sequential covering methods (compared to conditional inference trees) suggests a lower level of complexity in individual voting decisions.
For reference, the general syntax of the classification rules is: rule => prediction (number of instances correctly classified by the rule / number of instances misclassified by the rule).
## JRIP rules:
## ===========
##
## (p_white <= 53.5) and (p_hisp <= 29.3) and (p_black >= 46.6) => ObamaWin=1 (88.0/1.0)
## (p_asian >= 1.3) and (pop_sqr_mile2010 >= 446.7) and (p_white <= 68.6) => ObamaWin=1 (106.0/18.0)
## (p_asian >= 1.1) and (pop_sqr_mile2010 >= 282.7) and (p_female >= 51.4) => ObamaWin=1 (56.0/18.0)
## (p_bachelors >= 25.4) and (p_asian >= 3.4) and (pop_65+ >= 13.9) => ObamaWin=1 (36.0/8.0)
## (p_bachelors >= 28) and (p_hisp <= 2.2) => ObamaWin=1 (33.0/12.0)
## (p_asian >= 0.9) and (p_white <= 54.9) and (p_bachelors >= 15.9) => ObamaWin=1 (47.0/13.0)
## (p_bachelors >= 28) and (p_bachelors >= 39.4) and (pop_total_2010 <= 23509) => ObamaWin=1 (16.0/2.0)
## (p_white <= 31.5) => ObamaWin=1 (33.0/7.0)
## => ObamaWin=0 (2074.0/209.0)
##
## Number of Rules : 9
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 624
##
##
## | Predicted Type
## Actual Type | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 448 | 28 | 476 |
## | 0.718 | 0.045 | |
## -------------|-----------|-----------|-----------|
## 1 | 64 | 84 | 148 |
## | 0.103 | 0.135 | |
## -------------|-----------|-----------|-----------|
## Column Total | 512 | 112 | 624 |
## -------------|-----------|-----------|-----------|
##
##
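The rule listing above matches Weka’s JRip (RIPPER) output, which is available in R through the RWeka package. A sketch, reusing the placeholder 2012 split from above:

library(RWeka)     # JRip(), Weka's RIPPER rule learner

jrip_2012 <- JRip(ObamaWin ~ ., data = train_2012)
jrip_2012          # prints the rule list in the format shown above

# Held-out confusion matrix for the rule-based model
pred_jrip <- predict(jrip_2012, newdata = test_2012)
CrossTable(test_2012$ObamaWin, pred_jrip,
           prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE,
           dnn = c("Actual Type", "Predicted Type"))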
Interestingly, when we apply this model, trained on 2012 data, to the 2016 data, our accuracy on the test data increases to 93.5%; in other words, we correctly predict the county-level outcome 93.5% of the time.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 624
##
##
## | Predicted Type
## Actual Type | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## Donald Trump | 504 | 18 | 522 |
## | 0.808 | 0.029 | |
## ----------------|-----------|-----------|-----------|
## Hillary Clinton | 23 | 79 | 102 |
## | 0.037 | 0.127 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 527 | 97 | 624 |
## ----------------|-----------|-----------|-----------|
##
##
## JRIP rules:
## ===========
##
## (p_white <= 47.1) and (p_indian <= 1) => lead=Hillary Clinton (132.0/15.0)
## (p_bachelors >= 27.3) and (p_asian >= 3.5) => lead=Hillary Clinton (135.0/36.0)
## (p_white <= 54.9) and (pop_sqr_mile2010 >= 48.9) and (pop_sqr_mile2010 >= 303.8) => lead=Hillary Clinton (18.0/0.0)
## (p_white <= 54.1) and (p_white <= 33.3) and (p_white <= 23.3) => lead=Hillary Clinton (15.0/1.0)
## (p_bachelors >= 25.6) and (pop_sqr_mile2010 >= 619.7) and (pop_sqr_mile2010 >= 1115.3) => lead=Hillary Clinton (19.0/4.0)
## (p_bachelors >= 28) and (p_bachelors >= 39.4) => lead=Hillary Clinton (32.0/11.0)
## (p_bachelors >= 27.3) and (p_hisp <= 2.2) and (p_black <= 0.7) => lead=Hillary Clinton (14.0/4.0)
## (p_white <= 54.9) and (pop_sqr_mile2010 >= 48.9) and (median_Income <= 34285) => lead=Hillary Clinton (9.0/0.0)
## (p_bachelors >= 21.3) and (p_hisp >= 23.5) and (p_indian >= 2.1) and (pop_sqr_mile2010 >= 4.7) => lead=Hillary Clinton (10.0/0.0)
## (p_white <= 52.3) and (p_hisp <= 10.1) and (median_Income >= 36022) and (pop_total_2010 >= 7065) => lead=Hillary Clinton (12.0/3.0)
## (p_bachelors >= 18.2) and (p_white <= 36) => lead=Hillary Clinton (6.0/0.0)
## => lead=Donald Trump (2087.0/57.0)
##
## Number of Rules : 12
Interestingly, if we train a model purely on 2016 data (its rules are listed above) and test it on held-out 2016 counties, our accuracy decreases to 92%.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 624
##
##
## | Predicted Type
## Actual Type | Donald Trump | Hillary Clinton | Row Total |
## ----------------|-----------------|-----------------|-----------------|
## Donald Trump | 498 | 24 | 522 |
## | 0.798 | 0.038 | |
## ----------------|-----------------|-----------------|-----------------|
## Hillary Clinton | 26 | 76 | 102 |
## | 0.042 | 0.122 | |
## ----------------|-----------------|-----------------|-----------------|
## Column Total | 524 | 100 | 624 |
## ----------------|-----------------|-----------------|-----------------|
##
##
We also used a k-nearest-neighbors algorithm on the 2012 and 2016 data. With the 2012 data, we achieved an accuracy of 86.8%.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 623
##
##
## | prc_test_pred
## prc_test_labels | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## 0 | 465 | 11 | 476 |
## | 0.977 | 0.023 | 0.764 |
## | 0.868 | 0.126 | |
## | 0.746 | 0.018 | |
## ----------------|-----------|-----------|-----------|
## 1 | 71 | 76 | 147 |
## | 0.483 | 0.517 | 0.236 |
## | 0.132 | 0.874 | |
## | 0.114 | 0.122 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 536 | 87 | 623 |
## | 0.860 | 0.140 | |
## ----------------|-----------|-----------|-----------|
##
##
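A sketch of the k-NN step with the class package; the min-max normalization, the object names, and k = 21 are assumptions rather than the exact values used here (the prc_test_labels and prc_test_pred names simply mirror the table above).

library(class)     # knn()
library(gmodels)

# Min-max normalize the numeric census attributes so no single scale dominates the distance
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
feat_cols <- setdiff(names(census_2012), "ObamaWin")        # placeholder column names
census_n  <- as.data.frame(lapply(census_2012[, feat_cols], normalize))

prc_train        <- census_n[train_idx, ]
prc_test         <- census_n[-train_idx, ]
prc_train_labels <- census_2012$ObamaWin[train_idx]
prc_test_labels  <- census_2012$ObamaWin[-train_idx]

# Classify each test county by a majority vote of its k nearest training counties
prc_test_pred <- knn(train = prc_train, test = prc_test,
                     cl = prc_train_labels, k = 21)

CrossTable(prc_test_labels, prc_test_pred, prop.chisq = FALSE)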
Using 2016 data for both training and testing, our county-level accuracy increased to 93.57%.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 623
##
##
## | prc_test_pred
## prc_test_labels | Donald Trump | Hillary Clinton | Row Total |
## ----------------|-----------------|-----------------|-----------------|
## Donald Trump | 506 | 15 | 521 |
## | 0.971 | 0.029 | 0.836 |
## | 0.953 | 0.163 | |
## | 0.812 | 0.024 | |
## ----------------|-----------------|-----------------|-----------------|
## Hillary Clinton | 25 | 77 | 102 |
## | 0.245 | 0.755 | 0.164 |
## | 0.047 | 0.837 | |
## | 0.040 | 0.124 | |
## ----------------|-----------------|-----------------|-----------------|
## Column Total | 531 | 92 | 623 |
## | 0.852 | 0.148 | |
## ----------------|-----------------|-----------------|-----------------|
##
##
We can also apply our JRip model to all of the 2016 data, not just the test data, with an accuracy of 93% (note the potential for bias, since the training data is being recycled for prediction).
## JRIP rules:
## ===========
##
## (p_white <= 52.3) and (p_hisp <= 28) => ObamaWin=1 (230.0/31.0)
## (p_asian >= 1.3) and (pop_total_2010 >= 266929) and (pop_total_2010 >= 735332) => ObamaWin=1 (54.0/6.0)
## (p_asian >= 1.3) and (p_bachelors >= 27.7) and (pop_65+ >= 13.5) and (p_asian >= 2.7) => ObamaWin=1 (69.0/18.0)
## (p_bachelors >= 25.1) and (p_bachelors >= 38.7) and (median_Income <= 72745) => ObamaWin=1 (64.0/18.0)
## (p_asian >= 1.1) and (pop_sqr_mile2010 >= 492.6) and (p_HS <= 88.8) and (pop_sqr_mile2010 >= 1115.3) => ObamaWin=1 (16.0/2.0)
## (p_asian >= 1.1) and (pop_total_2010 >= 153920) and (p_white <= 59.5) => ObamaWin=1 (33.0/12.0)
## (p_white <= 55.7) and (p_white <= 32) => ObamaWin=1 (33.0/7.0)
## (p_HS >= 86.5) and (pop_sqr_mile2010 >= 16.6) and (p_HS >= 89.6) and (p_hisp <= 2.2) and (p_asian >= 0.6) and (median_Income <= 57565) and (p_bachelors >= 25.9) and (pop_sqr_mile2010 <= 199) => ObamaWin=1 (22.0/1.0)
## => ObamaWin=0 (2591.0/266.0)
##
## Number of Rules : 9
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3112
##
##
## | Predicted Type
## Actual Type | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## Donald Trump | 2500 | 125 | 2625 |
## | 0.803 | 0.040 | |
## ----------------|-----------|-----------|-----------|
## Hillary Clinton | 91 | 396 | 487 |
## | 0.029 | 0.127 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 2591 | 521 | 3112 |
## ----------------|-----------|-----------|-----------|
##
##
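A compact sketch of that full-data evaluation; all_2012 and all_2016 are placeholders for the complete (unsplit) county tables, and because the rules are learned from the same counties’ census attributes, the resulting figure is optimistic relative to the held-out estimates above.

library(RWeka)

# Fit the rules on every 2012 county (placeholder all_2012), then score every 2016 county
jrip_full <- JRip(ObamaWin ~ ., data = all_2012)
pred_full <- predict(jrip_full, newdata = all_2016)

# Overall agreement with the actual 2016 winners (1 = Democratic win)
mean(as.character(pred_full) == ifelse(all_2016$lead == "Hillary Clinton", "1", "0"))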
We can map the results of our JRip model and compare them to the actual 2016 results.
We can also visualize our model’s prediction error. Counties in grey were predicted correctly, counties highlighted in blue were predicted for Trump but voted for Clinton, and counties highlighted in red were predicted for Clinton but voted for Trump.
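One way to draw such an error map (a sketch, assuming a hypothetical results_2016 data frame with county FIPS codes, the model’s prediction, and the actual winner; the usmap package is one convenient option, not necessarily the one used here):

library(usmap)
library(ggplot2)
library(dplyr)

# Placeholder results_2016: columns fips, predicted, actual
map_df <- results_2016 %>%
  mutate(error_type = case_when(
    predicted == actual         ~ "Correct",
    predicted == "Donald Trump" ~ "Predicted Trump, voted Clinton",
    TRUE                        ~ "Predicted Clinton, voted Trump"
  ))

plot_usmap(regions = "counties", data = map_df, values = "error_type") +
  scale_fill_manual(name = NULL,
                    values = c("Correct" = "grey85",
                               "Predicted Trump, voted Clinton" = "blue",
                               "Predicted Clinton, voted Trump" = "red")) +
  theme(legend.position = "bottom")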
Additionally, while these models can predict county-level results on a win/loss basis, they cannot predict the actual number of Clinton and Trump votes within each county.