Chapter 2 Data sources
The primary data sources for this project are the Cook Political Report and the Federal Election Commission (FEC). Cook Political gathered and reported the state-level popular vote share in the 2020 election for the two major presidential candidates, Joe Biden and Donald Trump. The FEC gathers and processes campaign finance data and discloses information about the public funding of US presidential elections.
From these sources, we collected and used three major datasets: the 2020 Presidential Election Results dataset from Cook Political, and the Contribution Receipts and Campaign Disbursements datasets from the FEC. Each team member was responsible for downloading one dataset from its source. Details of each dataset follow:
2.1 U.S. State Level Popular Vote Results of 2020 Presidential Election
Dataset: Popular vote backend_1216.csv
The dataset downloaded from Cook Political contains the 2020 Presidential Election State-Level Popular Vote Share and Result Margin.
Column Name | Description |
---|---|
stateid | State Postal Codes (chr) |
state | State Name (chr) |
dem_votes | Votes for Biden (num) |
rep_votes | Votes for Trump (num) |
dem_percent | Vote Percentage for Biden (num) |
rep_percent | Vote Percentage for Trump (num) |
Issues with this dataset:
- Only the 2020 popular vote counts per candidate per state are raw data. The other columns in this file, such as the vote margin and margin shift, were calculated by the Cook Political Report from past voting results that are not published on its website, so we cannot verify them exactly from the data provided. If necessary, we could collect historical popular vote outcomes from other sources (such as Harvard Dataverse) and verify those numbers ourselves. However, since Cook Political is a reliable source, we treat those values as trustworthy and leave that verification for future investigation.
- Cook Political updates its data report every day. We pulled the dataset from the website again on December 16th to get an up-to-date version. All results in the table are shown as certified and finalized, so no further updates should affect the outcome of our analysis.
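The raw vote counts are enough to recompute a share and a margin independently of the derived columns. The following Python sketch is illustrative only: the function name `two_party_share` is hypothetical, the input counts are made up, and it assumes shares are taken over the combined Biden and Trump vote. Cook Political may define its percentages differently (for example, over all votes cast), so a mismatch would not by itself indicate an error.

```python
def two_party_share(dem_votes, rep_votes):
    """Return (dem_pct, rep_pct, margin) in percentage points.

    Assumption: percentages are computed over the two-party total only.
    """
    total = dem_votes + rep_votes
    if total <= 0:
        raise ValueError("vote counts must sum to a positive number")
    dem_pct = 100.0 * dem_votes / total
    rep_pct = 100.0 * rep_votes / total
    return dem_pct, rep_pct, dem_pct - rep_pct


# Made-up counts for illustration; not taken from the Cook Political file.
dem_pct, rep_pct, margin = two_party_share(600, 400)
print(dem_pct, rep_pct, margin)
```

A positive margin here means the Democratic candidate leads; the sign convention is a choice of this sketch, not something stated by the data source.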
2.2 Contribution Receipts
Dataset: contribution_bind.csv
The contribution receipts data on fec.gov includes all receipts filed by election candidates and committees with records of contributions made by individuals, organizations, and committees.
Because contributions made in the election year seem to better reflect people’s support for their favorite candidate, we applied a date filter to select records with a receipt date between 01/01/2020 and 11/03/2020. Because of the huge data volume, we limited the export to contributions of at least 200 dollars raised by the committees of the two final presidential candidates of the 2020 election. The filtered dataset was retrieved from here.
The exported dataset has 78 columns and 984,824 records in total: 302,731 entries for the Donald J. Trump for President committee and 682,093 entries for the Biden for President committee.
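The date and amount criteria described above can also be re-applied locally as a sanity check on an export. The Python sketch below mirrors those two criteria only; it is not the filter mechanism used on fec.gov. The helper name `keep_receipt`, the sample rows, and the assumption that dates begin with an ISO `YYYY-MM-DD` prefix are all illustrative.

```python
import csv
import io
from datetime import date, datetime

START, END = date(2020, 1, 1), date(2020, 11, 3)
MIN_AMOUNT = 200.0


def keep_receipt(row):
    """True when the receipt is inside the date window and meets the amount floor."""
    # Assumption: the date column starts with an ISO date such as 2020-03-15.
    received = datetime.strptime(
        row["contribution_receipt_date"][:10], "%Y-%m-%d"
    ).date()
    amount = float(row["contribution_receipt_amount"])
    return START <= received <= END and amount >= MIN_AMOUNT


# Made-up rows for illustration only; not taken from the FEC export.
sample = io.StringIO(
    "committee_name,contribution_receipt_date,contribution_receipt_amount\n"
    "EXAMPLE COMMITTEE A,2020-03-15,250.00\n"
    "EXAMPLE COMMITTEE A,2019-12-31,500.00\n"
    "EXAMPLE COMMITTEE B,2020-11-03,200.00\n"
    "EXAMPLE COMMITTEE B,2020-06-01,199.99\n"
)
kept = [row for row in csv.DictReader(sample) if keep_receipt(row)]
print(len(kept))
```

Both window endpoints are treated as inclusive, which matches the "over or equal to 200 dollars" wording for the amount; whether the date bounds were inclusive in the original export is an assumption.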
Column Name | Description |
---|---|
committee_id | Recipient presidential committee id (chr) |
committee_name | Recipient presidential committee name (chr) |
entity_type | Contributor entity abbr. (chr) |
entity_type_desc | Contributor entity (chr) |
contributor_name | Full name of contributor (chr) |
contributor_last_name | Last name of contributor (chr) |
contributor_city | City of contributor’s address (chr) |
contributor_state | State of contributor’s address (chr) |
contributor_zip | Zip code of contributor’s address (chr) |
contributor_occupation | Contributor’s occupation (chr) |
contribution_receipt_date | Receipt date (date) |
contribution_receipt_amount | Receipt amount (num) |
contributor_aggregate_ytd | Contributor’s aggregate contributions, year to date (num) |
Issues with this dataset:
- The original dataset is extremely large, with over 200 million records, but many of them are very small contributions, so we filtered out records with amounts below 200 dollars. As a result, the retrieved dataset might not capture every contributor, but it largely eliminates noise so that we can focus on contributors who made significant contributions to support their favorite candidate.
- The dataset contains negative and zero contribution amounts, which are probably related to unsuccessful transfers.
- Many contributors did not provide all of their information for some reason, so there are missing values and meaningless records in columns such as name, occupation, employer, and even city and state.
- Because the information was gathered from filings by committees and campaigns, the records do not follow a unified format. For example, the contributor suffix column has records for “JR”, “JR.”, and “JR.,” which all mean the same thing. There are also spelling errors in state and city names.
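Format inconsistencies such as the suffix variants can be collapsed with a small normalization step. The Python sketch below is one possible approach, not the cleaning procedure used in this project; the function name `normalize_suffix` is hypothetical, and stripping every non-alphanumeric character is an assumption that may be too aggressive for other columns.

```python
import re


def normalize_suffix(raw):
    """Uppercase a suffix and drop punctuation and spaces, so 'JR.,' becomes 'JR'."""
    if raw is None:
        return ""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


# The three variants mentioned in the text collapse to a single value.
variants = ["JR", "JR.", "JR.,"]
cleaned = {normalize_suffix(v) for v in variants}
print(cleaned)
```

Misspelled state and city names need a different treatment, such as matching against a reference list, which this sketch does not attempt.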
2.3 Disbursements Data
Dataset: schedule_b-2020-11-08T16_10_30.csv
The Disbursements data from fec.gov includes the operating expenditure records of House and Senate election committees, Presidential election committees, and PAC and Party committees since 2003.
For this project, we filtered the data to focus on the spending of presidential election committees in the 2020 US election, with disbursement dates from 01/01/2020 to 11/03/2020, consistent with the scope of the contribution receipts dataset. The filtered dataset was exported from the FEC Official Database. The retrieved dataset has 260,587 records and 78 columns.
Column Name | Description |
---|---|
committee_name | Presidential Committee Name (chr) |
disbursement_description | Details of the disbursement (chr) |
disbursement_date | Date that disbursement is made (date) |
disbursement_amount | Amount of the disbursement (num) |
disbursement_purpose_category | Purpose of the disbursement (chr) |
recipient_city | City of disbursement recipient’s address (chr) |
recipient_zip | Zip code of disbursement recipient’s address (chr) |
recipient_state | State of disbursement recipient’s address (chr) |
recipient_name | Entity name of disbursement recipient (chr) |
Issues with this dataset:
The disbursement data is still being updated as additional filings of earlier disbursements are archived. The total number of records is subject to change, but the change will be trivial compared to the size of the entire dataset. Although this will not affect our analysis or overall conclusions, the incompleteness of the data is worth noting.
2.4 Other Minor Datasets Used
We also used other auxiliary datasets to develop more comprehensive analysis and visualizations.
1. U.S. Historical Voting Results from 1976 to 2016: 1976-2016-president.csv
The dataset downloaded from Harvard Dataverse contains 14 columns and 3740 records on the state-level voting results of presidential elections from 1976 to 2016. We used 4 columns from this dataset: year, state, party, and candidatevotes.
Column Name | Description |
---|---|
year | Election Year (num) |
state | State Name (chr) |
party | Party Name (chr) |
candidatevotes | Popular Votes (num) |
Issues with this dataset: Among the four columns that we used, the party column has a mixture of String and Boolean data, so we need to convert the column to a single data type.
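One way to give a mixed-type column a single type is to coerce every cell to text and treat cells that carry no party name as missing. The Python sketch below illustrates that idea under stated assumptions: the function name `party_as_text` is hypothetical, and treating Boolean cells as missing values is a guess about how they arose, not something confirmed by the dataset documentation.

```python
def party_as_text(value):
    """Coerce a party cell to a lowercase string, or None if it names no party.

    Assumption: Boolean cells and NA-like markers are parsing artifacts
    and carry no party information.
    """
    if value is None or isinstance(value, bool):
        return None
    text = str(value).strip().lower()
    if not text or text in {"na", "nan", "true", "false"}:
        return None
    return text


# Made-up cells for illustration only.
cells = ["democrat", "Republican ", True, None, "NA"]
print([party_as_text(c) for c in cells])
```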
2. U.S. County Level Voting Results of 2020 Presidential Election: 2020_US_County_Level_Presidential_Results.csv
The dataset contains 10 columns and 3153 records covering the results of the 2020 presidential election by US county. Since the county-level data was not downloadable from the website, we exported this dataset from the GitHub repo 2020_US_County_Level_Presidential_Results, which contains the latest voting data scraped by Tony McGovern from the Fox News website.
Column Name | Description |
---|---|
state_name | State name (chr) |
county_fips | The 5-digit FIPS code corresponding to the county (chr) |
county_name | County name (chr) |
votes_gop | Votes for Trump (num) |
votes_dem | Votes for Biden (num) |
total_votes | Total Votes (num) |
diff | Difference between votes for Trump and Biden (num) |
per_gop | Percentage of votes for Trump (num) |
per_dem | Percentage of votes for Biden (num) |
per_point_diff | Difference between percentages of votes for Trump and Biden (num) |
Issues with this dataset: This dataset is scraped from results published by Fox News, Politico, and the New York Times, so it keeps getting updated and some columns may change. Also, vote counting and post-election processing are ongoing in several states, so individual statistics we used here might change.
3. US Presidential Election vote shares: elections_historic
This dataset is contained in the socviz library in R. It is a dataset of historic US presidential elections from 1824 to 2016, with information about the winner, runner-up, and various measures of vote share.
Column Name | Description |
---|---|
ec_pct | Winner’s share of electoral college vote, ranging from 0 to 1 (num) |
popular_pct | Winner’s share of popular vote, ranging from 0 to 1 (num) |
Issues with this dataset: As stated in the elections_historic {socviz} documentation, the data for 2016 are provisional as of early December 2016, so we would need to update the 2016 data to reflect the final results.
4. Number of Registered Voters: RegisteredVotersByState.csv
We collected the dataset from World Population Review. This dataset contains the registered voter count and percentage for each state. All records were updated after June 2020.
Column Name | Description |
---|---|
state | State Name (chr) |
totalRegistered | Total number of people registered (num) |
Pop | Total state population (num) |
registeredPerc | Percent of people who are registered voters (num) |
asOf | Last update date (date) |
Issues with this dataset: The data records in this file were not all updated on the same day.
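How far apart the update dates are can be measured from the asOf column. The Python sketch below shows one way to do that; the function name `as_of_spread` is hypothetical, the dates are made up, and it assumes the column can be read as ISO `YYYY-MM-DD` strings, which may not match the actual file.

```python
from datetime import date


def as_of_spread(as_of_dates):
    """Return (earliest, latest, days_between) for ISO-formatted date strings."""
    parsed = sorted(date.fromisoformat(d) for d in as_of_dates)
    if not parsed:
        raise ValueError("no dates supplied")
    return parsed[0], parsed[-1], (parsed[-1] - parsed[0]).days


# Made-up dates for illustration only; not taken from RegisteredVotersByState.csv.
earliest, latest, gap_days = as_of_spread(["2020-07-01", "2020-10-15", "2020-08-20"])
print(earliest, latest, gap_days)
```

A large gap would suggest that state-to-state comparisons of registration counts mix snapshots from different points in the election cycle.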