Chapter 2 Data sources

The primary data sources for this project are Cook Political and the Federal Election Commission (FEC). Cook Political gathered and reported the state-level popular vote share of the two major candidates in the 2020 presidential election, Joe Biden and Donald Trump. The FEC gathers and processes campaign finance data and discloses information about the public funding of US presidential elections.

From these sources, we collected and used three major datasets: the 2020 Presidential Election Results dataset from Cook Political, and the Contribution Receipts and Campaign Disbursements datasets from the FEC. Each team member was responsible for downloading one dataset from its source. The datasets are described in detail below.

2.2 Contribution Receipts

Dataset: contribution_bind.csv

The contribution receipts data on fec.gov includes all receipts filed by election candidates and committees with records of contributions made by individuals, organizations, and committees.

Since contributions made in the election year seem to better reflect people’s support for their preferred candidate, we applied a date filter to select records with a receipt date between 01/01/2020 and 11/03/2020. Because of the large data volume, we limited the export to contributions of at least 200 dollars raised by the committees of the two final presidential candidates of the 2020 election. The filtered dataset was retrieved from fec.gov.
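The export filters described above were applied on fec.gov, but the same criteria can be re-checked locally on the exported file. The following is a minimal Python sketch using only the standard library. It assumes that contribution_receipt_date is stored as an ISO-style string (for example 2020-03-15 00:00:00), which is an assumption about the export format, and the helper names in_scope and count_out_of_scope are hypothetical.

```python
import csv
from datetime import date, datetime

START = date(2020, 1, 1)   # first day of the election year
END = date(2020, 11, 3)    # upper bound of the date filter used in the text
MIN_AMOUNT = 200.0         # minimum contribution amount kept in the export


def in_scope(row):
    """Return True if a receipt row satisfies the date and amount filters."""
    # Assumption: the date column starts with YYYY-MM-DD; any time part is ignored.
    day = datetime.strptime(row["contribution_receipt_date"][:10], "%Y-%m-%d").date()
    amount = float(row["contribution_receipt_amount"])
    return START <= day <= END and amount >= MIN_AMOUNT


def count_out_of_scope(path):
    """Count rows in a CSV export that fall outside the intended filters."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(not in_scope(row) for row in csv.DictReader(handle))
```

Calling count_out_of_scope("contribution_bind.csv") would report how many rows violate the filters. Rows with a missing date or amount would raise an error under this sketch and would need separate handling.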

The exported dataset has 78 columns and 984,824 records in total. There are 302,731 entries for the Donald J. Trump for President committee and 682,093 entries for the Biden for President committee.
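The per-committee counts can be cross-checked against the total. A minimal Python sketch follows, in which the short labels trump and biden are illustrative stand-ins rather than the literal values of committee_name, and count_by_committee is a hypothetical helper.

```python
from collections import Counter


def count_by_committee(rows):
    """Tally records per recipient committee."""
    return Counter(row["committee_name"] for row in rows)


# Counts reported in the text; the dictionary keys are illustrative labels.
reported = {"trump": 302_731, "biden": 682_093}
TOTAL_REPORTED = 984_824

# The two committee counts should add up to the total number of records.
counts_match = sum(reported.values()) == TOTAL_REPORTED
```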

Table 2.2: Major columns used
Column Name                  Description
committee_id                 Recipient presidential committee id (chr)
committee_name               Recipient presidential committee name (chr)
entity_type                  Contributor entity abbr. (chr)
entity_type_desc             Contributor entity (chr)
contributor_name             Full name of contributor (chr)
contributor_last_name        Last name of contributor (chr)
contributor_city             City of contributor’s address (chr)
contributor_state            State of contributor’s address (chr)
contributor_zip              Zip code of contributor’s address (chr)
contributor_occupation       Contributor’s occupation (chr)
contribution_receipt_date    Receipt date (date)
contribution_receipt_amount  Receipt amount (num)
contributor_aggregate_ytd    Contributor aggregated contribution of this year (num)

Issues with this dataset:

  1. The original dataset is extremely large, with over 200 million records, but many of them are very small contributions, so we filtered out records with amounts below 200 dollars. As a result, the retrieved dataset might not capture all contributors, but it largely eliminates noise so that we can focus on contributors who made significant contributions to support their preferred candidate.

  2. The dataset contains negative and zero contribution amounts, which are probably related to unsuccessful transfers.

  3. Many contributors did not provide all of the requested information for some reason, so there are missing values and meaningless records in some columns, such as name, occupation, employer, and even city and state.

  4. Since the information was gathered from the filings of committees and campaigns, the records do not follow a unified format. For example, the contributor suffix column has records for “JR”, “JR.”, and “JR.,” which all denote the same suffix. There are also spelling errors in state and city names.
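Issues 2 and 4 lend themselves to simple cleaning rules. The Python sketch below is one possible approach, not necessarily the one used in this project: it collapses suffix variants into a single form and flags non-positive amounts. The helper names normalize_suffix and is_non_positive are hypothetical.

```python
import string

# Characters trimmed from both ends of a suffix value.
_TRIM = string.punctuation + string.whitespace


def normalize_suffix(raw):
    """Map variants like 'JR', 'Jr.' and 'JR.,' to the single form 'JR'."""
    if raw is None:
        return ""  # treat a missing suffix as empty text
    return raw.strip(_TRIM).upper()


def is_non_positive(row):
    """Flag receipts whose amount is zero or negative."""
    return float(row["contribution_receipt_amount"]) <= 0


variants = ["JR", "JR.", "JR.,", " jr. "]
normalized = {normalize_suffix(v) for v in variants}  # collapses to one value
```

Trimming only the ends of the string leaves internal punctuation untouched, so a value such as "III" or "SR" passes through unchanged apart from case.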

2.3 The Disbursements data

Dataset: schedule_b-2020-11-08T16_10_30.csv

The Disbursements data from fec.gov includes the operating expenditure records of House and Senate election committees, Presidential election committees, and PAC and Party committees since 2003.

For this project, we filtered the data to focus on the spending of presidential election committees in the 2020 US election, with disbursement dates from 01/01/2020 to 11/03/2020, consistent with the scope of the contribution receipts dataset. The filtered dataset was exported from the FEC official database. The retrieved dataset has 260,587 records and 78 columns.

Table 2.3: Major columns used
Column Name                    Description
committee_name                 Presidential Committee Name (chr)
disbursement_description       Details of the disbursement (chr)
disbursement_date              Date that disbursement is made (date)
disbursement_amount            Amount of the disbursement (num)
disbursement_purpose_category  Purpose of the disbursement (chr)
recipient_city                 City of disbursement recipient’s address (chr)
recipient_zip                  Zip code of disbursement recipient’s address (chr)
recipient_state                State of disbursement recipient’s address (chr)
recipient_name                 Entity name of disbursement recipient (chr)

Issues with this dataset:

The disbursement data is still being updated as additional filings of earlier disbursements are archived. The total number of records is therefore subject to change, but the change will be trivial relative to the size of the dataset. Although this will not affect our analysis or overall conclusions, the incompleteness of the data is worth noting.

2.4 Other Minor Datasets Used

We also used other auxiliary datasets to develop more comprehensive analysis and visualizations.

1. U.S. Historical Voting Results from 1976 to 2016: 1976-2016-president.csv

The dataset, downloaded from Harvard Dataverse, contains 14 columns and 3,740 records of state-level voting results for presidential elections from 1976 to 2016. We used four columns from this dataset: year, state, party, and candidatevotes.

Table 2.4: Major columns used
Column Name     Description
year            Election Year (num)
state           State Name (chr)
party           Party Name (chr)
candidatevotes  Popular Votes (num)

Issues with this dataset: Among the four columns that we used, the party column contains a mixture of string and Boolean values, so we needed to convert it to a single data type.
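One way to give the party column a single type is to coerce every value to text. The following Python sketch assumes that Boolean entries should simply be kept in their text form; whether they should instead be treated as missing is a judgment the sketch does not make. Lower-casing and trimming are additional assumptions, and the function name party_as_text is hypothetical.

```python
def party_as_text(value):
    """Coerce a party value to a trimmed, lower-case string."""
    if value is None:
        return ""  # represent missing entries as empty text
    # str() turns booleans into 'True'/'False' and leaves strings unchanged.
    return str(value).strip().lower()


# Made-up values for illustration; these are not rows from the dataset.
mixed = ["democrat", "Republican ", True, None]
uniform = [party_as_text(v) for v in mixed]
```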

2. U.S. County Level Voting Results of 2020 Presidential Election: 2020_US_County_Level_Presidential_Results.csv

The dataset contains 10 columns and 3,153 records covering the results of the 2020 presidential election by US county. Since the county-level data was not downloadable from the website, we exported this dataset from the GitHub repo 2020_US_County_Level_Presidential_Results, which contains the latest voting data scraped by Tony McGovern from the Fox News website.

Table 2.5: Major columns used
Column Name     Description
state_name      State name (chr)
county_fips     The 5-digit FIPS code corresponding to the county (chr)
county_name     County name (chr)
votes_gop       Votes for Trump (num)
votes_dem       Votes for Biden (num)
total_votes     Total Votes (num)
diff            Difference between votes for Trump and Biden (num)
per_gop         Percentage of votes for Trump (num)
per_dem         Percentage of votes for Biden (num)
per_point_diff  Difference between percentages of votes for Trump and Biden (num)

Issues with this dataset: This dataset is scraped from results published by Fox News, Politico, and the New York Times, so it is updated continually and some columns may change. In addition, vote counting and post-election processing are still ongoing in several states, so the individual statistics we used here might change.
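Because the scraped file may change between downloads, the derived columns can be recomputed from the raw vote counts as a consistency check. The Python sketch below assumes the sign convention GOP minus Democratic for both diff and per_point_diff, and that percentages are stored as fractions of total_votes. Neither convention is confirmed by the source, and derive_columns is a hypothetical helper.

```python
def derive_columns(votes_gop, votes_dem, total_votes):
    """Recompute the derived county-level columns from raw vote counts."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    per_gop = votes_gop / total_votes
    per_dem = votes_dem / total_votes
    return {
        "diff": votes_gop - votes_dem,        # assumed sign: GOP minus Dem
        "per_gop": per_gop,
        "per_dem": per_dem,
        "per_point_diff": per_gop - per_dem,  # assumed sign: GOP minus Dem
    }


# Hypothetical county with made-up counts, for illustration only.
example = derive_columns(votes_gop=600, votes_dem=400, total_votes=1000)
```

Comparing these recomputed values with the stored columns, within a small floating-point tolerance, would reveal rows where the file was only partially updated.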

3. US Presidential Election vote shares: elections_historic

This dataset is included in the socviz library in R. It covers historic US presidential elections from 1824 to 2016, with information about the winner, the runner-up, and various measures of vote share.

Table 2.6: Major columns used
Column Name  Description
ec_pct       Winner’s share of electoral college vote, ranging from 0 to 1 (num)
popular_pct  Winner’s share of popular vote, ranging from 0 to 1 (num)

Issues with this dataset: As stated in the elections_historic {socviz} documentation, the data for 2016 are provisional as of early December 2016, so we would need to update the 2016 data to reflect the final results.

4. Number of Registered Voters: RegisteredVotersByState.csv

We collected this dataset from World Population Review. It contains the registered voter count and percentage for each state. All records were updated after June 2020.

Table 2.7: Major columns used
Column Name      Description
state            State Name (chr)
totalRegistered  Total number of people registered (num)
Pop              Total state population (num)
registeredPerc   Percent of people who are registered voters (num)
asOf             Last update date (date)

Issues with this dataset: The data records in this file were not all updated on the same day.
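To see how far apart the update dates are, the asOf column can be summarized by its earliest and latest values. This is a minimal Python sketch, assuming the dates are ISO strings (YYYY-MM-DD), which the source does not confirm. The helper name as_of_range is hypothetical.

```python
from datetime import date


def as_of_range(rows):
    """Return the earliest date, the latest date, and their gap in days."""
    # Assumption: asOf holds ISO dates such as "2020-10-01".
    days = [date.fromisoformat(row["asOf"]) for row in rows]
    earliest, latest = min(days), max(days)
    return earliest, latest, (latest - earliest).days


# Made-up rows for illustration; these are not values from the dataset.
sample = [{"asOf": "2020-07-15"}, {"asOf": "2020-10-01"}]
earliest, latest, spread = as_of_range(sample)
```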