Chapter 2 Data sources

The primary data sources of this project are Cook Political and the Federal Election Commission (FEC). Cook Political gathered and reported the state-level popular vote share of the two major candidates in the 2020 presidential election: Joe Biden and Donald Trump. The FEC gathers and processes campaign finance data and discloses information about the public funding of US presidential elections.

From these data sources, we collected and used three major datasets: the 2020 Presidential Election Results dataset from Cook Political, and the Contribution Receipts and Campaign Disbursements datasets from the FEC. Each member of our team was responsible for downloading one dataset from the corresponding data source. The datasets are described in more detail below:

2.1 U.S. State Level Popular Vote Results of 2020 Presidential Election

Dataset: Popular vote backend_1216.csv

The dataset downloaded from Cook Political contains the 2020 Presidential Election State-Level Popular Vote Share and Result Margin.

Table 2.1: Major columns used
Column Name	Description
stateid	State Postal Codes (chr)
state	State Name (chr)
dem_votes	Votes for Biden (num)
rep_votes	Votes for Trump (num)
dem_percent	Vote Percentage for Biden (num)
rep_percent	Vote Percentage for Trump (num)

Issues with this dataset:

Only the 2020 popular vote counts per candidate per state are treated as raw data. The remaining columns in this file, such as the vote margin and margin shift, are calculated by Cook Political Report from past voting results that are not published on their website, so we cannot verify these numbers using the dataset they provided. If necessary, we could collect historical popular vote outcomes from other data sources (such as Harvard Dataverse) and verify those numbers ourselves. However, since Cook Political is a reliable data source, we consider those values trustworthy and leave such verification for future investigation.
Cook Political updates their data report every day. We refreshed our copy of the dataset from their website on December 16th to get an up-to-date version. All of the results in the table are marked as certified and finalized, so there should be no further updates that would affect our analysis.
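
Because only the vote counts are treated as raw data, the percentage columns can be recomputed from them as a sanity check. The Python sketch below assumes that dem_percent and rep_percent are shares of the two-party vote expressed in percent; that is an assumption made for illustration, and the helper name two_party_percent is hypothetical. If the file instead reports shares of all votes cast, the denominator would need to be the total vote count.

```python
def two_party_percent(dem_votes, rep_votes):
    """Return (dem_percent, rep_percent) as shares of the two-party vote."""
    total = dem_votes + rep_votes
    if total <= 0:
        raise ValueError("two-party vote total must be positive")
    return 100.0 * dem_votes / total, 100.0 * rep_votes / total


# Made-up counts for illustration only, not real election results.
dem_pct, rep_pct = two_party_percent(600, 400)
```

Comparing the recomputed values with the published columns, within a small rounding tolerance, would flag any rows where the two disagree.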

2.2 Contribution Receipts

Dataset: contribution_bind.csv

The contribution receipts data on fec.gov includes all receipts filed by election candidates and committees with records of contributions made by individuals, organizations, and committees.

Since contributions made in the election year seem to better reflect people’s support for their favorite candidate, we applied a date filter to select records with a receipt date between 01/01/2020 and 11/03/2020. Because of the huge data volume, we limited the export to contributions of 200 dollars or more that were raised by the committees of the two final presidential candidates of the 2020 election. The filtered dataset was exported from fec.gov.

The exported dataset has 78 columns and 984,824 records in total. There are 302,731 entries for Donald J. Trump for President committee and 682,093 entries for Biden for President committee.
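
The filter described above was applied when exporting from fec.gov, but the same rule can be reproduced locally to double-check an export. The following is a minimal sketch, assuming the receipt date begins with a YYYY-MM-DD string (the actual layout in the export may differ); keep_receipt is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from datetime import date, datetime

START, END = date(2020, 1, 1), date(2020, 11, 3)
MIN_AMOUNT = 200.0


def keep_receipt(row):
    """True if the receipt falls in the date window and is at least 200 dollars."""
    # Assumption: the first 10 characters of the date field are YYYY-MM-DD.
    received = datetime.strptime(
        row["contribution_receipt_date"][:10], "%Y-%m-%d"
    ).date()
    amount = float(row["contribution_receipt_amount"])
    return START <= received <= END and amount >= MIN_AMOUNT


# Tiny made-up sample standing in for the exported CSV.
sample = io.StringIO(
    "contribution_receipt_date,contribution_receipt_amount\n"
    "2020-03-15,250.00\n"
    "2019-12-31,500.00\n"
    "2020-11-03,200.00\n"
    "2020-06-01,50.00\n"
)
kept = [row for row in csv.DictReader(sample) if keep_receipt(row)]
```

Both bounds are inclusive, so a 200-dollar receipt dated on election day is kept.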

Table 2.2: Major columns used
Column Name	Description
committee_id	Recipient presidential committee id (chr)
committee_name	Recipient presidential committee name (chr)
entity_type	Contributor entity abbr. (chr)
entity_type_desc	Contributor entity (chr)
contributor_name	Full name of contributor (chr)
contributor_last_name	Last name of contributor (chr)
contributor_city	City of contributor’s address (chr)
contributor_state	State of contributor’s address (chr)
contributor_zip	Zip code of contributor’s address (chr)
contributor_occupation	Contributor’s occupation (chr)
contribution_receipt_date	Receipt date (date)
contribution_receipt_amount	Receipt amount (num)
contributor_aggregate_ytd	Contributor aggregated contribution of this year (num)

Issues with this dataset:

The original dataset is extremely large, with over 200 million records, but many of them are very small contributions, so we filtered out records with amounts below 200 dollars. As a result, the retrieved dataset might not capture all contributors, but it largely eliminates noise so that we can focus on contributors who made significant contributions to support their favorite candidate.
The dataset contains negative and zero contribution amounts, which are probably related to unsuccessful transfers.
Many contributors did not provide all of the requested information, so there are missing values and meaningless records in some columns such as name, occupation, employer, and even city and state.
Since the information was gathered based on filings of committees and campaigns, the records do not comply with a unified format. For example, the contributor suffix column has records for “JR”, “JR.”, and “JR.,” which are actually the same thing. There are also errors in the spellings of states and cities.
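
Two of the issues above, inconsistent suffix spellings and non-positive amounts, lend themselves to simple cleaning rules. The sketch below shows one possible approach rather than the cleaning actually performed in this project; the helper names normalize_suffix and is_positive_amount are hypothetical.

```python
def normalize_suffix(raw):
    """Collapse variants such as 'JR', 'JR.' and 'JR.,' into one form."""
    # Trim whitespace, drop trailing periods and commas, then upper-case.
    return raw.strip().rstrip(".,").strip().upper()


def is_positive_amount(amount):
    """True for amounts that represent an actual contribution."""
    return amount > 0


variants = ["JR", "JR.", "JR.,"]
cleaned = {normalize_suffix(v) for v in variants}
```

After normalization, the three variants collapse to a single value, so grouping or counting by suffix no longer splits the same suffix across several categories.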

2.3 The Disbursements data

Dataset: schedule_b-2020-11-08T16_10_30.csv

The Disbursements data from fec.gov includes the operating expenditure records of House and Senate election committees, Presidential election committees, and PAC and Party committees since 2003.

For this project, we filtered the data to focus on the spending of presidential election committees in the 2020 US election, with disbursement dates from 01/01/2020 to 11/03/2020, consistent with the scope of the contribution receipts dataset. The filtered dataset was exported from the FEC official database. The retrieved dataset has 260,587 records and 78 columns.

Table 2.3: Major columns used
Column Name	Description
committee_name	Presidential Committee Name (chr)
disbursement_description	Details of the disbursement (chr)
disbursement_date	Date that disbursement is made (date)
disbursement_amount	Amount of the disbursement (num)
disbursement_purpose_category	Purpose of the disbursement (chr)
recipient_city	City of disbursement recipient’s address (chr)
recipient_zip	Zip code of disbursement recipient’s address (chr)
recipient_state	State of disbursement recipient’s address (chr)
recipient_name	Entity name of disbursement recipient (chr)

Issues with this dataset:

The disbursement data is still being updated as additional filings of previous disbursements are archived. The total number of records is subject to change, but the change will be trivial compared to the size of the entire dataset. Although this will not affect our analysis or overall conclusion, the incompleteness of the data is worth noting.

2.4 Other Minor Datasets Used

We also used other auxiliary datasets to develop more comprehensive analysis and visualizations.

1. U.S. Historical Voting Results from 1976 to 2016: 1976-2016-president.csv

The dataset downloaded from Harvard Dataverse contains 14 columns and 3740 records about the state level voting results of presidential elections from 1976 to 2016. We used 4 columns from this dataset: year, state, party and candidatevotes.

Table 2.4: Major columns used
Column Name	Description
year	Election Year (num)
state	State Name (chr)
party	Party Name (chr)
candidatevotes	Popular Votes (num)

Issues with this dataset: Among the four columns that we used, the party column contains a mixture of string and Boolean values, so we need to convert it to a single data type.

2. U.S. County Level Voting Results of 2020 Presidential Election: 2020_US_County_Level_Presidential_Results.csv

The dataset contains 10 columns and 3153 records covering the voting results of the 2020 presidential election by US county. Since the county level data was not downloadable from the website, we exported this dataset from the GitHub repository 2020_US_County_Level_Presidential_Results, which contains the latest voting data scraped by Tony McGovern from the Fox News website.

Table 2.5: Major columns used
Column Name	Description
state_name	State name (chr)
county_fips	The 5-digit FIPS code corresponding to the county (chr)
county_name	County name (chr)
votes_gop	Votes for Trump (num)
votes_dem	Votes for Biden (num)
total_votes	Total Votes (num)
diff	Difference between votes for Trump and Biden (num)
per_gop	Percentage of votes for Trump (num)
per_dem	Percentage of votes for Biden (num)
per_point_diff	Difference between percentages of votes for Trump and Biden (num)

Issues with this dataset: This dataset is scraped from results published by Fox News, Politico, and the New York Times, so it is still being updated and some columns may change. Also, vote counting and post-election processing are ongoing in several states, so the individual statistics we used here might change.

3. US Presidential Election vote shares: elections_historic

This dataset is contained in the socviz library in R. It is a dataset of historic US presidential elections from 1824 to 2016, with information about the winner, runner up, and various measures of vote share.

Table 2.6: Major columns used
Column Name	Description
ec_pct	Winner’s share of electoral college vote (range is 0 to 1) (num)
popular_pct	Winner’s share of popular vote (range is 0 to 1) (num)

Issues with this dataset: As stated in the elections_historic {socviz} documentation, the data for 2016 are provisional as of early December 2016, so we would need to update the 2016 data to reflect the final results.

4. Number of Registered Voters: RegisteredVotersByState.csv

We collected the dataset from World Population Review. This dataset contains the registered voter count and percentage for each state. All records were updated after June 2020.

Table 2.7: Major columns used
Column Name	Description
state	State Name (chr)
totalRegistered	Total number of people registered (num)
Pop	Total state population (num)
registeredPerc	Percent of people who are registered voters (num)
asOf	Last update date (date)

Issues with this dataset: The data records in this file were not all updated on the same day.
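
To gauge how much the update dates differ, the spread of the asOf column can be summarized. The following is a minimal sketch under the assumption that the asOf values have already been parsed into date objects; as_of_span is a hypothetical helper and the example dates are made up.

```python
from datetime import date


def as_of_span(as_of_dates):
    """Return (earliest, latest, days_between) for a list of dates."""
    earliest, latest = min(as_of_dates), max(as_of_dates)
    return earliest, latest, (latest - earliest).days


# Made-up update dates for illustration only.
example = [date(2020, 7, 1), date(2020, 10, 15), date(2020, 9, 1)]
earliest, latest, spread = as_of_span(example)
```

A large spread would suggest that registration figures for different states are not directly comparable without noting their update dates.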