Big Data Analysis

U.S. Voting Patterns Analysis

A big data analysis project that combines American Community Survey (ACS) socioeconomic data with U.S. presidential election results to examine county-level correlations between economic conditions and voting behavior.

Technologies

Python, Apache Spark, MapReduce, Pandas, NumPy, Matplotlib

The Problem

Understanding the relationship between socioeconomic factors and voting patterns requires processing massive datasets that exceed single-machine capabilities. Traditional analysis methods struggle with the scale and complexity of combining census data with election results across thousands of counties over multiple election cycles.

Approach

Built a distributed data pipeline using Apache Spark to process and join large-scale datasets. Implemented MapReduce patterns for efficient aggregation of county-level statistics. Created a modular ETL pipeline that handles data cleaning, feature engineering, and statistical analysis at scale.
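
The MapReduce-style aggregation described above can be sketched in plain Python, without Spark, so that it runs on a single machine. This is an illustration of the pattern only, not the project's actual code: the record layout (county FIPS code, year, median income), the sample values, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records: (county_fips, year, median_income).
# The values are illustrative, not taken from the project's data.
records = [
    ("01001", 2012, 50000.0),
    ("01001", 2016, 54000.0),
    ("01003", 2012, 48000.0),
    ("01003", 2016, 52000.0),
]


def map_phase(record):
    """Emit one (key, value) pair per record: county -> (income, count)."""
    fips, _year, income = record
    return fips, (income, 1)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Merge (sum, count) pairs. The merge is associative, so partial
    results from different partitions could be combined in any order."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)


def mean_income_by_county(rows):
    """Run map, shuffle, and reduce, then turn totals into means."""
    grouped = shuffle(map(map_phase, rows))
    totals = {key: reduce_phase(vals) for key, vals in grouped.items()}
    return {key: total / count for key, (total, count) in totals.items()}


result = mean_income_by_county(records)
```

Carrying a (sum, count) pair instead of a running mean is the design choice that matters here: means cannot be merged across partitions, but sums and counts can. In Spark, the same shape would roughly correspond to a `map` followed by `reduceByKey` on an RDD.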

Results

✓ Processed 47,000+ county-year observations efficiently using Spark
✓ Analyzed correlations across 500+ socioeconomic variables
✓ Achieved 3x performance improvement over single-node processing
✓ Identified key economic indicators correlated with voting shifts
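
As a rough illustration of the correlation step, the sketch below computes a Pearson correlation coefficient between one economic indicator and a voting shift. The source does not state which correlation measure the project used, so Pearson is an assumption, as are the variable names and the sample values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-county values, invented for the example:
# change in unemployment rate and change in a party's vote share,
# both in percentage points between two elections.
unemployment_change = [-1.0, 0.0, 1.0, 2.0]
vote_share_shift = [-2.0, 0.0, 2.0, 4.0]

r = pearson(unemployment_change, vote_share_shift)
```

With 500+ variables, the same function would be applied once per variable and the results ranked by absolute value. A strong coefficient indicates association only, not that the indicator causes the shift.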

Key Highlights

47,000+ county-year observations

500+ socioeconomic variables

Distributed Spark pipeline

Performance benchmarking
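
A speedup figure such as the 3x reported above is typically the ratio of baseline run time to candidate run time. The helper below is a minimal sketch of how such a comparison could be measured. It is not the project's benchmarking code, and the function names are hypothetical.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time; values above 1 mean faster."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Example: a baseline of 9 seconds against a candidate of 3 seconds.
ratio = speedup(9.0, 3.0)
```

Taking the best of several runs reduces noise from caching and scheduler jitter. For a distributed job, cluster startup and data loading should be either included in both measurements or excluded from both.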