Do you have some data with geolocation information that you want to map? You may not think of R when you're looking for a GIS platform, but new packages and standards have helped make the R programming language a surprisingly robust platform for some geospatial analysis.
These examples will demonstrate how to map election results, but the concepts can easily be used for any other kind of color-coded choropleth map. I'll show how to handle a straightforward two-person race and a more complex race with three or more candidates.
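We'll be using two sets of mapping packages in this tutorial: tmap and tmaptools for quick static maps, and leaflet for interactive maps. You can install and load them now with:
install.packages("tmap")
install.packages("tmaptools")
install.packages("leaflet")
library("tmap")
library("tmaptools")
library("leaflet")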
(Skip the install.packages lines for any R packages that are already on your system.)
Step 1: Get election results data
I'll start with the 2016 New Hampshire Democratic primary results, which are available from the New Hampshire secretary of state's office as a downloadable Excel spreadsheet.
Getting election data into the proper format for mapping is one of this project's biggest challenges — more so than actually creating the map. For simplicity, let's stick to results by county instead of drilling down to individual towns and precincts.
One common problem: Results data need to have one column with all election district names — whether counties, precincts or states — and candidate names as column headers. Many election returns, though, are reported with each election district in its own column and candidate results by row.
That's the case with the official New Hampshire results. I transposed the data to fix that and otherwise cleaned up the spreadsheet a bit before importing it into R (such as removing ", d" after each candidate's name). The first column now has county names, while every additional column is a candidate name; each row is a county result. I also got rid of the "total" row at the bottom, which can interfere with data sorting.
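If you'd rather do the transposing in R than in the spreadsheet, here is a minimal sketch using base R's t(). The data frame raw and its vote counts are invented for illustration; they are not the real New Hampshire figures, and your own file will likely need extra cleanup.

```r
# Hypothetical results laid out the "wrong" way: one row per candidate,
# one column per county. The vote counts are made up for illustration.
raw <- data.frame(
  Candidate = c("Clinton", "Sanders"),
  Belknap = c(100, 150),
  Carroll = c(120, 180),
  stringsAsFactors = FALSE
)

# Transpose only the numeric columns so t() returns a numeric matrix
# with counties as rows and candidates as columns.
votes <- t(raw[, -1])
colnames(votes) <- raw$Candidate

# Rebuild a data frame with county names in the first column.
tidy <- data.frame(County = rownames(votes), votes,
                   row.names = NULL, stringsAsFactors = FALSE)
print(tidy)
```

Dropping the Candidate column before calling t() matters: transposing a mix of text and numbers would turn every value into a character string.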
You can do the same or, if you'd like to download the data file and all the other files I'm using, including R code, head to the "Mapping with R" file download page. (Free Insider registration needed. Bonus: You'll be helping me convince my boss that I ought to write more of these types of tutorials.) If you download and unzip the Mapping with R file, look for NHD2016.xlsx inside it.
To make your R mapping script as reusable as possible, I suggest putting data file names at the top of the script — that makes it easy to swap in different data files without having to hunt through code to find where a file name appears. You can put this toward the top of your R script:
nhdatafile <- "data/NHD2016.xlsx"
Note: My data file isn't in the same working directory as my R script; I have it in a data subdirectory. Make sure to include the appropriate file path for your system, using forward slashes even on Windows.
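One way to avoid path typos is to build the path with base R's file.path(), which joins its pieces with forward slashes on every platform, and then confirm the file exists before importing. This is only a sketch that assumes the same data subdirectory and file name used above.

```r
# Build the path from its parts; file.path() uses "/" as the separator.
nhdatafile <- file.path("data", "NHD2016.xlsx")
print(nhdatafile)

# Warn early if the file isn't where the script expects it.
if (!file.exists(nhdatafile)) {
  message("File not found: ", nhdatafile,
          " (current working directory: ", getwd(), ")")
}
```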
There are several packages for importing Excel files into R, but for ease of use, you can't beat rio. Install it with:
install.packages("rio")
if it's not already on your system, and then run:
nhdata <- rio::import(nhdatafile)
to store data from the election results spreadsheet into a variable called nhdata.
There were actually 28 candidates in the results; but to focus on mapping instead of data wrangling, let's not worry about the many minor candidates and pretend there were just two: Hillary Clinton and Bernie Sanders. Select just the County, Clinton and Sanders columns with:
nhdata <- nhdata[,c("County", "Clinton", "Sanders")]
Step 2: Decide what data to map
Now we need to think about what exactly we'd like to color-code on the map. We need to pick one column of data for the map's county colors, but all we have so far is raw vote totals. We probably want to calculate either the winner's overall percent of the vote, the winner's percentage-point margin of victory or, less common, the winner's margin expressed by number of votes (after all, winning by 5 points in a heavily populated county might be more useful than winning by 10 points in a place with way fewer people if the goal is to win the entire state).
It turns out that Sanders won every county; but if he didn't, we could still map the Sanders "margin of victory" and use negative values for counties he lost.
Let's add columns for candidates' margins of victory (or loss) and percent of the vote, again for now pretending there were votes cast only for the two main candidates.
# Add columns for percents and margins
nhdata$SandersMarginVotes <- nhdata$Sanders - nhdata$Clinton
# Each candidate's share of the two-candidate vote; will use formatting later to multiply by a hundred
nhdata$SandersPct <- nhdata$Sanders / (nhdata$Sanders + nhdata$Clinton)
nhdata$ClintonPct <- nhdata$Clinton / (nhdata$Sanders + nhdata$Clinton)
nhdata$SandersMarginPctgPoints <- nhdata$SandersPct - nhdata$ClintonPct
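To see how vote share, vote margin and percentage-point margin relate, here is a small worked sketch. The data frame toy, its county labels and its vote counts are all hypothetical; the second county is one the candidate loses, so its margins come out negative.

```r
# Invented two-candidate results for two made-up counties.
toy <- data.frame(
  County = c("A", "B"),
  Clinton = c(400, 550),
  Sanders = c(600, 450),
  stringsAsFactors = FALSE
)

total <- toy$Clinton + toy$Sanders

# Margin in raw votes: positive where Sanders leads, negative where he trails.
toy$SandersMarginVotes <- toy$Sanders - toy$Clinton

# Share of the two-candidate vote, as a fraction between 0 and 1.
toy$SandersPct <- toy$Sanders / total
toy$ClintonPct <- toy$Clinton / total

# Margin in percentage points (still as a fraction; multiply by 100 to display).
toy$SandersMarginPctgPoints <- toy$SandersPct - toy$ClintonPct

print(toy)
```

County A works out to a 60 percent share and a 20-point margin; county B to a 45 percent share and a margin of minus 10 points.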
Step 3: Get your geographic data
Whether you're mapping results for your city, your state or the nation, you need geographic data for the area you'll be mapping in addition to election results. There are several common formats for such geospatial data; but for this tutorial, we'll focus on just one: shapefiles, a widely used format developed by Esri.
If you want to map results down to your city or town's precinct level, you'll probably need to get files from a local or state GIS office. For mapping by larger areas like cities, counties or states, the U.S. Census Bureau is a good place to find shapefiles.
For this New Hampshire mapping project by county, I downloaded files from the Census Bureau's Cartographic Boundary shapefiles page — these are smaller, simplified files designed for mapping projects where extraordinarily precise boundaries aren't needed. (Files for engineering projects or redistricting tend to be considerably larger).
I chose the national county file at http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_5m.zip and unzipped it within my data subdirectory. With R, it's easy to create a subset for just one state, or more; and now I've got a file I can reuse for other state maps by county as well.
There are a lot of files in that newly unzipped subdirectory; the one you want has the .shp extension. I'll store the name of this file in a variable called usshapefile:
usshapefile <- "data/cb_2014_us_county_5m/cb_2014_us_county_5m.shp"
Several R packages have functions for importing shapefiles into R. I'll use tmaptools's read_shape(), which I find quite intuitive:
usgeo <- read_shape(file=usshapefile, as.sf = TRUE)
as.sf = TRUE means I want usgeo to be a simple features object. The simple features standards were recently implemented in R with the sf package, and that's made GIS work in R a lot, well, simpler. Now, geospatial objects look similar to "regular" R data frames, with a special complex column for geography. If as.sf is set to FALSE, usgeo will have an overall more complicated structure.
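If your installed version of tmaptools doesn't provide read_shape(), the sf package's st_read() can read a shapefile straight into a simple features object. The sketch below assumes sf is installed and that the shapefile sits at the path used above; it sets stringsAsFactors = TRUE only so that text columns come in as factors, as described in this tutorial.

```r
# Same path as in the tutorial; adjust for your own system.
usshapefile <- "data/cb_2014_us_county_5m/cb_2014_us_county_5m.shp"

# Only attempt the import if sf is available and the file is present.
if (requireNamespace("sf", quietly = TRUE) && file.exists(usshapefile)) {
  usgeo <- sf::st_read(usshapefile, stringsAsFactors = TRUE)
  print(class(usgeo))
} else {
  message("Install sf and check the shapefile path before running this sketch.")
}
```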
If you want to check to see if the usgeo object looks like geography of the U.S., run tmap's quick thematic map command:
qtm(usgeo)
This may take a while to load, and the result may appear small and rather boring; but if you've got a map of the U.S. with divisions, you're probably on the right track.
Run str(usgeo) to see the usgeo data structure. It will look like a conventional data frame except for the final geometry column with sfc_MULTIPOLYGON information.
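To get a feel for that structure without loading the full national file, you can build a tiny simple features object by hand. This sketch assumes the sf package is installed; the two points and their labels are made up and stand in for real county polygons.

```r
if (requireNamespace("sf", quietly = TRUE)) {
  # Two invented point geometries in longitude/latitude (EPSG:4326).
  pts <- sf::st_sfc(
    sf::st_point(c(-71.5, 43.2)),
    sf::st_point(c(-71.1, 42.4)),
    crs = 4326
  )

  # Attach ordinary attribute columns to the geometry column.
  toy_sf <- sf::st_sf(data.frame(NAME = c("A", "B")), geometry = pts)

  print(class(toy_sf))   # an sf object is also a data.frame
  print(names(toy_sf))   # attribute columns plus the geometry column
  str(toy_sf)
}
```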
Extracting geodata just for New Hampshire is similar to subsetting any other type of data in R. We just need the state FIPS code for New Hampshire, which turns out to be 33, or in this case "33", since the codes in usgeo are stored as factors, not integers.
Here's the base R command to extract New Hampshire data using FIPS code 33:
nhgeo <- usgeo[usgeo$STATEFP=="33",]
Or, if you have dplyr version 0.6 or later installed, you can use dplyr::filter() on the sf object as you would on a conventional data frame:
nhgeo <- filter(usgeo, STATEFP=="33")
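The same row-subsetting pattern can be tried on a plain data frame first. In this sketch, toygeo is a hypothetical stand-in for usgeo with no geometry column; it only illustrates comparing a factor column of FIPS codes against the string "33".

```r
# Stand-in for usgeo: FIPS codes stored as a factor, as in the tutorial.
toygeo <- data.frame(
  STATEFP = factor(c("33", "25", "33")),
  NAME = c("Belknap", "Essex", "Carroll"),
  stringsAsFactors = FALSE
)

# Keep only rows whose state code is "33" (New Hampshire).
toynh <- toygeo[toygeo$STATEFP == "33", ]
print(toynh)

# Drop factor levels that no longer appear after subsetting.
toynh$STATEFP <- droplevels(toynh$STATEFP)
print(levels(toynh$STATEFP))
```

Quoting the code is the safer habit: FIPS codes can carry leading zeros, which a numeric comparison would not match.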
If you want to check that nhgeo looks correct, run the quick thematic map function again:
qtm(nhgeo)
and you should see something like the image shown at left.
Still somewhat boring, but it looks like the Granite State with county-sized divisions, so it appears we've got the correct file subset.
Step 4: Merge spatial and results data
Like any database join or merge, this has two requirements: 1) a column shared by each data set, and 2) records that refer to the same entity in exactly the same way. (Having a county listed as "Hillsborough" in one file and FIPS code "011" in another wouldn't give R any idea how to match them up without some sort of translation table.)
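Both requirements can be checked before attempting a join. The sketch below uses base R's setdiff() and merge() on invented data: geo_names stands in for the county names in the geographic file, and results for the election data, with one deliberately mismatched spelling.

```r
# Hypothetical county names from the geographic data.
geo_names <- c("Belknap", "Carroll", "Hillsborough")

# Hypothetical results with one county spelled differently.
results <- data.frame(
  County = c("Belknap", "Carroll", "Hillsboro"),
  Sanders = c(150, 180, 900),
  stringsAsFactors = FALSE
)

# Names present in one data set but not the other; both should be empty
# before merging.
missing_from_results <- setdiff(geo_names, results$County)
missing_from_geo <- setdiff(results$County, geo_names)
print(missing_from_results)
print(missing_from_geo)

# Fix the mismatch, then merge on the shared county name.
results$County[results$County == "Hillsboro"] <- "Hillsborough"
geo <- data.frame(NAME = geo_names, stringsAsFactors = FALSE)
merged <- merge(geo, results, by.x = "NAME", by.y = "County")
print(merged)
```

Without the correction, merge() would silently drop the unmatched county, since it keeps only rows found in both data sets by default.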