Attempt to Predict Primary/Caucus Results using Google Trends

Google Trends is an application freely available for use at http://www.google.com/trends. For any given keyword, it shows you how often that keyword was searched for over time. This frequency of searches over time is referred to as search volume. The actual numbers are relative to the total number of searches done on Google. Google only analyzes a portion of its actual searches, but that snapshot is a large enough sample to give some nice insights into Internet users.

Here's an example: How many people search for "facebook" on Google?

The horizontal scale at the bottom shows time (in this case, split into years and then divided into quarters). While there is no vertical scale, the graph is linearly plotted, meaning no transformations (besides dividing by the total number of searches performed) are done on the data. Thus, it is easy to see when the number of times a word is searched doubles or triples.
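
To make the "dividing by the total number of searches" step concrete, here is a minimal sketch in Python. The counts are made-up numbers and `relative_volume` is a hypothetical helper; Google does not publish its raw figures or its exact computation, so treat this only as an illustration of the idea.

```python
def relative_volume(keyword_counts, total_counts):
    """Divide each period's keyword searches by that period's total searches."""
    return [k / t for k, t in zip(keyword_counts, total_counts)]

# Hypothetical quarterly counts (not real Google data).
keyword_counts = [100, 220, 480]
total_counts = [10_000, 11_000, 12_000]

shares = relative_volume(keyword_counts, total_counts)

# On a linear plot, a share that is twice as large is drawn twice as high,
# so the ratio between consecutive periods can be read straight off the graph.
growth = [later / earlier for earlier, later in zip(shares, shares[1:])]
print(shares)
print(growth)
```

In this made-up series the keyword's share of all searches doubles each quarter, even though the raw counts more than double, because the total number of searches is growing too.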

This graph shows that Facebook has been increasing in popularity, with approximately exponential growth. As you can see, Google provides news articles from points in time when major jumps occur in the popularity of a keyword, to loosely explain what may have precipitated the sudden jump. In addition, the graph underneath the main graph shows how often a keyword is referenced in news articles.

Drilling Down

Using the dropdown menus at the bottom, we can choose a different time frame. The Facebook graph got really interesting over the past year, so let's look at that:

We can also look at different regions. Google logs the IP address associated with each search, which can be used to provide a fairly accurate, but not perfect, physical location of the user. Look at how Facebook has only recently infected Tunisia:

Within the US, we can analyze different states:

Looks like people in Illinois have been searching for (and probably using) Facebook for much longer than users from Idaho:

Again, there is no vertical axis. While both graphs seem to approach the same point, it's more likely that Illinois actually has many more users searching for "facebook" than Idaho.

What's with the sudden spike up from 0 in the middle of 2006 on that graph? Well, in order to protect user privacy, Google only displays results if a search term has been used enough times. That is, prior to 2006, "facebook" was searched so little within Idaho that it was below Google's search volume threshold. But sometime in 2006, enough users searched "facebook" per day that the number of times the keyword is searched per day exceeded the threshold, so data is available for those days.
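
The effect of such a threshold can be sketched in a few lines of Python. The threshold value and the daily counts below are invented for illustration (Google does not disclose the real cutoff), and `apply_threshold` is a hypothetical name.

```python
def apply_threshold(daily_counts, threshold):
    """Hide days whose count is below the threshold (None means no data shown)."""
    return [count if count >= threshold else None for count in daily_counts]

# Hypothetical daily search counts for one keyword in one state.
idaho_counts = [3, 8, 12, 40, 75]
visible = apply_threshold(idaho_counts, threshold=10)
print(visible)  # [None, None, 12, 40, 75]
```

The first two days are hidden rather than shown as small values, which is why the graph appears to jump up from nothing.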

Making Comparisons

Google Trends becomes much more interesting when we compare different keywords. What happens when we look at Facebook and MySpace on the same graph?

As you probably knew, MySpace has long been much more popular than Facebook. But lately MySpace's growth has been stagnating, allowing Facebook to actually become more searched for on Google than MySpace. They actually intersect several times throughout March and April. But this is for all regions -- everywhere anybody uses Google. When we compare the two only in the United States, we see that MySpace still has its grip:

This implies that, in other countries, MySpace is simply less popular, whereas Facebook has been launching campaigns in many countries and different languages.

Drawing Conclusions

Before you try to make any conclusions based on Google Trends results, keep in mind that it's still in Google's "Labs" stage. Google places the following disclaimer at the bottom of the Trends pages:

Google Trends aims to provide insights into broad search patterns. Several approximations are used when computing your results. Please keep this in mind when using it.

However, Google Trends still offers valuable data about what the typical Internet user is actually looking for, especially considering that Google is the world's most popular search engine. This can, in turn, suggest what users are interested in. Companies can see how popular their products are; producers can find out how popular their movies are. Advertisers can see how well they're doing or learn that they're heading in the wrong direction and need to change.

And candidates can determine how successful their campaigns have been.

Case Study: Predicting Election Results

In order to determine what kind of applications Google Trends may have to politics, I decided to compare the Republican candidates' last names for the 2008 primary season, and see how the Google search volume corresponds to voting behavior. After all, if someone is going to vote for a candidate, it's very likely that he or she will Google the candidate beforehand to learn more about the candidate's platform.

For many states, data is either not available at all or available only for some candidates.

Overall, the Democratic primaries were much, much easier to predict than the Republican ones. We hypothesized that this had something to do with the Democratic race having consistently been a much tighter, more competitive one. Demographic differences between Republican and Democratic voters may also play a role, but the extent to which they affect Google Trends results has yet to be determined.

Update: I am in NO way claiming that this is a valid, scientific statistical analysis. Obviously, the sample is of Internet users, and specifically those who use Google for campaign information, which leads to all sorts of biases. Using Obama's and Clinton's first names (which was the best option as far as predicting correctly) is unfair, but it works better than anything else. The issues here are huge, so like I said, this is NOT statistically valid. Really, I just want to demonstrate that the potential may be out there.

Democratic Primaries

Look what happens when we try to compare Obama and Hillary in the US across all subregions for the current year:

The term "obama" has been hugely popular. However, I attribute this to his very high popularity among younger, and by extension more Internet-savvy, voters. Obama has had pretty much the most effective online campaign the world has ever seen, and this shows in Google Trends and skews the data too much to be useful.

Now see the changes when we try to use first names, instead of last names:

It appears that Obama and Clinton are much more evenly matched when we use their first names, Barack and Hillary, respectively. Our model, then, will analyze the popularity of the keywords "barack", "hillary" and "edwards" (because John is too common a first name to work).

Let's go ahead and analyze some states. For each state, we display the state's name, the Google Trends search volume graph for the month in which the primary (or caucus) was held, the date of the election, and the predicted and actual votes.

Iowa

1/3

Predicted   Actual        Votes       %
Edwards     Obama         940         38%
Clinton     Edwards       744         30%
Obama       Clinton       737         29%

Using "barack" instead of "obama" led Edwards to clearly beat Obama in terms of search volume. Searching for last names correctly predicts the Obama, Edwards, Clinton order, but using these names in this model seems to do better overall, as you will see.

New Hampshire

1/8

Predicted   Actual        Votes       %
Clinton     Clinton       112,251     39%
Obama       Obama         104,772     37%
Edwards     Edwards       48,681      17%

In New Hampshire, the prediction works well using first names.

Michigan

1/15

Predicted   Actual        Votes       %
Clinton     Clinton       328,151     55%
Obama       Uncommitted   237,762     40%

Obama was not on Michigan's ballot. To rectify the situation, the DNC awarded Obama the delegates from the Uncommitted voters. We do not consider Michigan in our results, but it's shown here just to be thorough.

Nevada

1/19

Predicted   Actual        Votes       %
Clinton     Clinton       5,355       51%
Obama       Obama         4,773       45%
Edwards     Edwards       396         4%

Edwards did not even appear in the Nevada search volume, implying that searches for Edwards did not surpass the threshold until the day of the primary, putting him in 3rd place.

South Carolina

1/26

Predicted   Actual        Votes       %
Obama       Obama         295,091     55%
Clinton     Clinton       141,128     27%
Edwards     Edwards       93,552      18%

Again, this model works well. In this case, the candidates were closer, but the outcome was still clear days before the primary.

Florida

1/29

Predicted   Actual        Votes       %
Clinton     Clinton       857,208     50%
Obama       Obama         569,041     33%
Edwards     Edwards       248,604     14%

Note that this prediction is subjective. Barack and Hillary tied on the day before the election, and Barack beat Hillary the day before that, but the overall trend for the two weeks before the primary clearly shows that Hillary was more popular. Of course, Hillary is more searched for on the day of the primary, but that information wouldn't be available to analysts trying to predict the outcome.
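
One way to take the subjectivity out of a call like this would be to fix the rule in advance. The sketch below is not how the predictions in this article were made (those were read off the graphs by eye); it is one possible mechanical rule, assuming a two-week window, that ranks keywords by their average search volume over the days before the election and ignores election day itself. The function name, the window length, and all numbers are hypothetical.

```python
from datetime import date, timedelta

def predict_order(daily_volume, election_day, window_days=14):
    """Rank keywords by mean search volume over the days before the election.

    daily_volume maps keyword -> {date: relative volume}. Days with no data
    count as zero, and the election day itself is excluded.
    """
    window = [election_day - timedelta(days=n) for n in range(1, window_days + 1)]
    averages = {
        keyword: sum(series.get(day, 0.0) for day in window) / window_days
        for keyword, series in daily_volume.items()
    }
    return sorted(averages, key=averages.get, reverse=True)

# Hypothetical numbers, not real Trends data.
primary = date(2008, 1, 29)
volume = {
    "hillary": {primary - timedelta(days=n): 1.0 for n in range(1, 15)},
    "barack": {primary - timedelta(days=n): 0.8 for n in range(1, 15)},
}
volume["barack"][primary - timedelta(days=2)] = 1.5  # a one-day lead for "barack"
volume["hillary"][primary] = 9.0                     # election-day spike, ignored

print(predict_order(volume, primary))  # ['hillary', 'barack']
```

With a rule like this, a single day on which one keyword pulls ahead does not flip the prediction unless it outweighs the rest of the window. A shorter window would weight last-minute swings more heavily.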

Alabama

2/5

Predicted   Actual        Votes       %
Obama       Obama         302,814     56%
Clinton     Clinton       226,504     42%

The prediction fit the actual results.

Alaska

2/5

There isn't enough search volume to show any results for "barack", "hillary", "clinton" or "edwards". The term "obama" did show search volume, and Obama won by 50%, but as explained, we can't conclude much from that.

Arizona

2/5

Predicted   Actual        Votes       %
Clinton     Clinton       228,158     51%
Obama       Obama         191,681     42%

Edwards did not appear on the ballot.

Arkansas

2/5

Predicted   Actual        Votes       %
N/A         Clinton       217,313     70%
-           Obama         80,774      26%

Here, there isn't any search volume until the day of the primary, so no predictions can be made.

California

2/5

Predicted   Actual        Votes       %
Clinton     Clinton       2,306,361   52%
Obama       Obama         1,890,026   43%

Yet again, the prediction holds true. Edwards was not on the ballot.

Colorado

2/5

Predicted   Actual        Votes       %
Obama       Obama         79,344      67%
Clinton     Clinton       38,587      32%

The model works for caucuses, too.

Connecticut

2/5

Predicted   Actual        Votes       %
Obama       Obama         179,349     51%
Clinton     Clinton       164,831     47%

And again.

Delaware

2/5

Not enough search volume. Again, only "Obama" showed up on the trend graph, but that's not allowable per our model. He did win by 10%.

Georgia

2/5

Predicted   Actual        Votes       %
Obama       Obama         700,366     67%
Clinton     Clinton       328,129     31%

Edwards was not on the ballot.

Idaho

2/5

Predicted   Actual        Votes       %
Obama       Obama         16,880      79%
-           Clinton       3,655       17%

This one is strange. Clinton did not appear in the Google search graph, and Edwards did not appear on the ballot.

Illinois

2/5

Predicted   Actual        Votes       %
Obama       Obama         1,301,954   65%
Clinton     Clinton       662,845     33%

This worked too, but any high school student could have predicted this one.

Kansas

2/5

Predicted   Actual        Votes       %
Obama       Obama         27,172      74%
Clinton     Clinton       9,462       26%

In this caucus, we didn't have much search volume to go on. But Barack beat Hillary for most of the beginning of the month, up until the primary, only dropping down a bit on the day before.

Massachusetts

2/5

Predicted   Actual        Votes       %
Clinton     Clinton       704,591     56%
Obama       Obama         511,887     41%

I'm becoming pretty proud of my model.

Minnesota

2/5

Predicted   Actual        Votes       %
Obama       Obama         141,725     66%
Clinton     Clinton       68,607      32%

Although technically a success, Minnesota doesn't work as well. Obama won by 34 percentage points, but he and Clinton were neck and neck in search volume before the caucus. Obama was ahead on the 4th, the day before the caucus, and had also been ahead before the 3rd; Clinton overtook him only on the 3rd, so the prediction goes to Obama. In addition, the January chart shows Obama clearly ahead.

Missouri

2/5

Predicted   Actual        Votes       %
Obama       Obama         405,284     49%
Clinton     Clinton       395,287     48%

Obama and Clinton were very close, both in the search volume graph and in the primary, but the prediction was nevertheless correct.

New Jersey

2/5

Predicted   Actual        Votes       %
Obama       Clinton       602,576     54%
Clinton     Obama         492,186     44%

Whoops. Note that Obama and Clinton were very close in terms of search volume, but not at all in the actual results.

New Mexico

2/5

Predicted   Actual        Votes       %
Obama       Clinton       73,105      49%
Clinton     Obama         71,396      48%

Again, the prediction was incorrect. Note, though, that the primary results were quite close. And the voter volume was fairly low.

New York

2/5

Predicted   Actual        Votes       %
Clinton     Clinton       1,006,623   57%
Obama       Obama         697,914     40%

No surprise here. Well, maybe that the Trends results were so close.

North Dakota

2/5

In this caucus in a sparsely populated state, there was not enough search volume to produce a graph. Yet again, "Obama" showed up and "Clinton" didn't, with Obama being the actual winner.

Oklahoma

2/5

Predicted   Actual        Votes       %
Clinton     Clinton       228,425     55%
Obama       Obama         130,087     31%

Obama had almost no searches in the week before the primary.

Tennessee

2/5

Predicted   Actual        Votes       %
Obama       Clinton       332,599     54%
Clinton     Obama         250,730     41%

For some reason, the prediction method failed miserably here.

Utah

2/5

Predicted   Actual        Votes       %
Clinton     Obama         70,373      57%
Obama       Clinton       48,719      39%

This primary was very difficult. First of all, there's a low number of voters. Secondly, the graph is difficult to interpret. Obama does not show up until the day of the primary, but in the week before (in January), neither Clinton nor Obama appears on the graph, except for Obama on the 28th.

Louisiana

2/9

Predicted   Actual        Votes       %
Obama       Obama         220,588     57%
Clinton     Clinton       136,959     36%

The model performed flawlessly.

Nebraska

2/9

Predicted   Actual        Votes       %
Obama       Obama         26,126      68%
Clinton     Clinton       12,445      32%

Although this was a caucus in a low-population state, Google had enough searches to display a search volume graph, which correctly predicted the outcome.

Washington

2/9

Predicted   Actual        Votes       %
Obama       Obama         21,629      68%
Clinton     Clinton       9,992       31%

Very similar to the Nebraska results.

Maine

2/10

Predicted   Actual        Votes       %
Obama       Obama         2,079       59%
Clinton     Clinton       1,396       40%

Again, there is little data and Obama and Clinton were very close for the time period just before the primary. Historically, though, Obama had done better (Feb. 6-8), so with nothing but that to go on, we made the only prediction we could.

District of Columbia

2/12

Predicted   Actual        Votes       %
Obama       Obama         85,534      75%
Clinton     Clinton       27,326      24%

Google easily predicted this one.

Maryland

2/12

Predicted   Actual        Votes       %
Obama       Obama         464,474     60%
Clinton     Clinton       285,440     37%

And this one.

Virginia

2/12

Predicted   Actual        Votes       %
Obama       Obama         623,141     64%
Clinton     Clinton       347,252     35%

And, being so close geographically, this one.

Hawaii

2/19

Predicted   Actual        Votes       %
Obama       Obama         28,347      76%
-           Clinton       8,835       24%

Well. Clearly, we don't have much data. But Obama appeared often enough to show up on the graph on the day before the caucus, while Hillary did not. As the actual results show, there were few voters, but Obama was indeed much more popular.

Wisconsin

2/19

Predicted   Actual        Votes       %
Obama       Obama         646,007     58%
Clinton     Clinton       452,795     41%

Although there are more primaries after this, we'll stop here instead of delving into March.

Conclusions

Wow. The power of Google Trends to predict the winners in this year's Democratic primaries was astonishing.

Of the 37 states studied (which were simply chosen by chronological order), 5 did not have enough data to draw conclusions, meaning we were able to attempt a prediction in over 86% of states.

Of the remaining 32 states, 27 were predicted completely correctly, for a success rate of 84.375%! That's certainly not bad, especially considering that no political knowledge was used for the predictions; they could be done completely by machines.
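
For anyone who wants to check the arithmetic, the two percentages follow directly from the counts given above. This is only a small worked calculation in Python; the variable names are mine.

```python
states_studied = 37      # states examined, in chronological order
insufficient_data = 5    # states without enough search volume
correct = 27             # states predicted correctly

attempted = states_studied - insufficient_data  # 32 states with a prediction
coverage = attempted / states_studied
success_rate = correct / attempted

print(f"{coverage:.1%}")      # 86.5%
print(f"{success_rate:.3%}")  # 84.375%
```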

Republican Primaries

For the Republican Primaries, last names could easily be used. Ron Paul was excluded. His last name is too common. Using his full name is not a good solution either, because he had massive popularity on the Internet, becoming a meme of sorts, which did not at all correspond with his actual successes (or lack thereof) in the primaries.

Iowa

1/3

Predicted   Actual        Votes       %
Huckabee    Huckabee      40,841      34%
Romney      Romney        29,949      25%
McCain      Thompson      15,904      13%
-           McCain        15,559      13%
-           Giuliani      4,097       4%

Huckabee and Romney were correctly predicted. Thompson did not have enough search volume to appear on the graph, so McCain was incorrectly predicted as 3rd; note that McCain and Thompson were incredibly close.

Wyoming

1/5

No data were available in Google Trends. This was a caucus, and in a less populated state, so not enough users searched for candidates to pass Google's search volume threshold.

New Hampshire

1/8

Predicted   Actual        Votes       %
McCain      McCain        88,466      37%
Romney      Romney        75,343      32%
Huckabee    Huckabee      26,768      11%
-           Giuliani      20,395      9%
-           Thompson      2,886       1%

All candidate positions were predicted correctly.

Michigan

1/15

Predicted   Actual        Votes       %
Romney      Romney        337,847     39%
McCain      McCain        257,521     30%
Huckabee    Huckabee      139,699     16%
Thompson    Thompson      32,135      4%
-           Giuliani      24,706      3%

All candidate positions were predicted correctly. Note the large number of voters and the large amount of available data. With a large population and a large tech-savvy sub-population, Michigan's Google Trends results are more accurate.

Nevada

1/19

No data were available in Google Trends. This was a caucus; not enough users searched for candidates to pass Google's search volume threshold.

South Carolina

1/19

Predicted   Actual        Votes       %
Huckabee    McCain        147,283     33%
McCain      Huckabee      132,440     30%
-           Thompson      69,467      16%
-           Romney        67,132      15%
-           Giuliani      9,494       2%

Although there were plenty of voters, there was insufficient data to predict more than two results, and those two were out of order. However, Thompson's place as 3rd could be extrapolated by his search volume in the week before the primary.

Florida

1/29

Predicted   Actual        Votes       %
Romney      McCain        693,508     36%
McCain      Romney        598,188     31%
Huckabee    Giuliani      281,781     15%
Giuliani    Huckabee      259,735     14%

McCain and Romney were very close in the search volume chart, although Romney had a slight advantage over McCain, leading to an incorrect prediction. Huckabee and Giuliani were close in the actual primary, although Huckabee had a clear advantage in search volume over Giuliani. Thompson was not on the ballot.

Maine

2/2

No data were available in Google Trends. This was a caucus; not enough users searched for candidates to pass Google's search volume threshold.

Alabama

2/5

Predicted   Actual        Votes       %
McCain      Huckabee      230,695     41%
Huckabee    McCain        211,071     37%
Romney      Romney        103,318     18%

Huckabee's victory over McCain was close, but Huckabee did have a lower search volume by a not insignificant margin.

Alaska

2/5

No data were available in Google Trends. This was a caucus, and in a less populated state, so not enough users searched for candidates to pass Google's search volume threshold.

Arizona

2/5

Predicted   Actual        Votes       %
McCain      McCain        254,333     47%
Romney      Romney        186,075     35%
Huckabee    Huckabee      48,619      9%

Arizona was predicted very accurately.

Arkansas

2/5

Predicted   Actual        Votes       %
McCain      Huckabee      136,216     61%
Huckabee    McCain        45,563      20%
-           Romney        30,453      14%

While Huckabee swept this election, McCain barely beat him in a tangled Trend history.

California

2/5

Predicted   Actual        Votes       %
McCain      McCain        1,093,560   42%
Romney      Romney        890,855     34%
Huckabee    Huckabee      298,914     12%

California, with its massive population, works very well in Google Trends.

Colorado

2/5

Predicted   Actual        Votes       %
McCain      Romney        33,288      60%
Romney      McCain        10,621      19%
Huckabee    Huckabee      7,266       13%

Caucuses are riskier. McCain seemed to be a bit more popular than Romney, but Romney swept the caucus.

Connecticut

2/5

Predicted   Actual        Votes       %
McCain      McCain        78,741      52%
Romney      Romney        49,851      33%
Huckabee    Huckabee      10,591      7%

In Connecticut, with a good population size, the search volume correlated well with the candidates' votes.

Delaware

2/5

No data were available in Google Trends. Very few people voted; McCain took first with 22,626 votes.

Georgia

2/5

Predicted   Actual        Votes       %
McCain      Huckabee      326,069     34%
Romney      McCain        303,639     32%
Huckabee    Romney        289,737     30%

As in Alabama and Arkansas, Huckabee performed much better at the polls than he did in search volume.

Illinois

2/5

Predicted   Actual        Votes       %
McCain      McCain        424,071     47%
Romney      Romney        256,805     29%
Huckabee    Huckabee      147,626     17%

Not surprisingly considering Illinois' huge population, the search volume results match the actual outcome.

Massachusetts

2/5

Predicted   Actual        Votes       %
McCain      Romney        255,248     51%
Romney      McCain        204,027     41%
Huckabee    Huckabee      19,168      4%

Like in Colorado, Romney won in the Massachusetts primary, although McCain was more popular in Google searches. The opposite happened in Florida.

Minnesota

2/5

Predicted   Actual        Votes       %
McCain      Romney        25,990      41%
Romney      McCain        13,826      22%
Huckabee    Huckabee      12,493      20%

Again, McCain did better on Google than he actually performed. This caucus had the same discrepancy as the Colorado caucus.

Missouri

2/5

Predicted   Actual        Votes       %
McCain      McCain        194,304     33%
Romney      Huckabee      185,627     32%
Huckabee    Romney        172,564     29%

McCain was far superior in the Google search volume, but only barely won the primary. Huckabee beat Romney by 3 percentage points, but Romney scored higher on Google.

Montana

2/5

Not enough data was available in Google Trends. This was a caucus.

Kansas

2/9

Predicted   Actual        Votes       %
Huckabee    Huckabee      11,627      60%
McCain      McCain        4,587       24%

McCain's search volume decreased as Huckabee's increased just days before the caucus.

Louisiana

2/9

Predicted   Actual        Votes       %
McCain      Huckabee      69,665      43%
-           McCain        67,609      42%

Huckabee and McCain were almost tied. However, only McCain showed up on Google Trends.

District of Columbia

2/12

Predicted   Actual        Votes       %
McCain      McCain        3,929       68%
Huckabee    Huckabee      961         17%

McCain slaughtered Huckabee in the primary and had double the search volume on Google.

Maryland

2/12

Predicted   Actual        Votes       %
McCain      McCain        163,677     55%
Huckabee    Huckabee      86,573      29%

Like in D.C., Huckabee lost by quite a lot both in the primary and in the Google trend history.

Virginia

2/12

Predicted   Actual        Votes       %
McCain      McCain        244,135     50%
Huckabee    Huckabee      198,247     41%

Not surprisingly, Virginia worked much like Maryland and D.C., although with many more voters.

Washington

2/19

Predicted   Actual        Votes       %
McCain      McCain        262,295     50%
Huckabee    Huckabee      127,657     24%

Again, Huckabee is decimated both in Google and in the real world.

Wisconsin

2/19

Predicted   Actual        Votes       %
McCain      McCain        224,226     55%
Huckabee    Huckabee      151,201     37%

Yet again. From this point forward, McCain continues to slaughter Huckabee in the remaining primaries and caucuses and on Google.

Conclusions

In reality, only about half of the "predictions" before the 2/12 primaries were actually accurate. This is caused by several factors. First of all, lack of data in several states leads to skewed, unreliable trends, so they really should be discarded. This problem is especially prevalent in states with a lower number of voters, due to low voter turnout (in caucuses) or simply a small state population. A second problem occurs when the search volume is greater for the first-place candidate before and after the actual primary, but greater for the second-place candidate in the day or two immediately preceding the primary. For example, in Florida, McCain and Romney were neck and neck from Jan. 26 to Jan. 29, with Romney barely surpassing McCain on the 29th but losing that day's actual primary by 5 percentage points. Of course, after this, McCain surges in terms of Google search volume, but this is to be expected; the results of the primary will affect the searches done on candidates.

Overall, this method shows a great deal of promise. The simple model performed exceptionally well for this year's Democratic primaries. More primaries will have to be studied before determining if Democratic primaries are usually easier to predict, or if closer primaries (Obama-Clinton) are easier to predict, or if a combination of the two affects success rate.

Even in the Republican primaries, the model has its strong points; it's just not robust. In many cases, where there is enough data in Google Trends and the data shows a clearly more popular candidate, the actual primary shows the same results.

Even disregarding actual "predictions," Google Trends can be a very useful tool for candidates. With Google Trends, campaigns can see how often their candidates are searched for, indicating how their popularity and name recognition are doing. While this doesn't correspond one-to-one with votes, it's fairly intuitive that the candidates searched for more often enjoy more interest among the voters and are more likely to be supported.

Furthermore, campaigns can see what issues are being searched for by potential voters, thereby determining which issues are on the voters' minds, or which issues are really important to them.

Contact Info

My name is Michael Giuffrida, and I developed this project as part of a senior seminar at the Maggie L. Walker Governor's School for Government and International Studies in Richmond, VA. I plan on majoring in Computer Science at Yale.

Email me: michaelg puttheatsignhere michaelg.us