Spark for Social Science

The Urban Institute's Approach to Big Data Statistics

Me

  • Alex, hi.
  • I Teach Data [Viz & Science] & Public Policy @UChicago
  • Contributing Data Scientist @ Urban Institute

  • Twitter: @alexcengler
  • alexcengler@uchicago.edu
  • alexcengler@urban.org


Prison Population Forecaster
A Matter of Time

Our Problem

  • Big Data
  • Small Budget
  • Advanced Statistics
  • Programming Limitations

'Big Data'

  • ~One Billion Rows (Growing to the Low Tens of Billions)
  • ~1 TB of Data
  • A Couple Hundred Columns

  • Ideally, data at this scale, or even several orders of magnitude larger, should be trivial to work with.

Small Budget


Affordability from Elasticity

Standing clusters, whether in the cloud or on premises, are unaffordable.

Advanced Statistics

Programming Limitations


  • Preference for SAS/STATA
  • Some R-Users, a few Python-Users
  • Java/Scala/C are off the table

Accessibility as a General Concern

Our Solution

  • Apache Spark
  • R/Python w/ IDEs in Browser
  • AWS Elastic MapReduce (EMR)


Let's Launch a Cluster

Apache Spark


More on Spark in a Minute

Relatively Familiar Languages




Development Environments in Browser

  • RStudio for R
  • Jupyter Notebooks for Python

Amazon Web Services (AWS) Elastic MapReduce (EMR)

  • Elastic - Only Pay for Clusters During Use
  • Fast - 10-12 Min Spin Up w/ Bootstrap
  • Free Data Transfer From S3 (AWS Storage)

Amazon Web Services (AWS) Elastic MapReduce (EMR)

Also Cheap:

Four r3.8xlarge EC2 Instances (Renting Four Computers in the Sky) get you:

  • 244 GB Memory Each (~1 TB Total)
  • 32 vCPUs Each (128 Total)
  • 1300 GB SSD Storage

All for ~$10/Hour

List of EC2 Instances


Our Solution (Theory)

Our Solution (Reality)

So, we wrote a bunch of bootstrap scripts.

Launch from Command Line
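
As a rough illustration, here is a minimal Python (boto3) sketch of what a scripted launch of such a cluster can look like. The key pair, bucket, and bootstrap script path are placeholders, and the actual Urban Institute scripts may pass different arguments.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-social-science",
        ReleaseLabel="emr-5.8.0",                  # assumed release label; use a current one
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "r3.8xlarge",
            "SlaveInstanceType": "r3.8xlarge",
            "InstanceCount": 4,
            "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up for interactive work
            "Ec2KeyName": "my-key-pair",           # placeholder
        },
        BootstrapActions=[{
            "Name": "install-rstudio-and-packages",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},  # placeholder
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])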

AWS CloudFormation

Data Security & Access

  • FedRAMP Authority to Operate (ATO)
  • Virtual Private Cloud
  • Centrally Controlled Data Access in S3

Use Cases

  • Computationally Expensive Analysis We Were Already Doing
  • New Stuff:
    • Police-Sentiment Twitter Analysis
    • Digital Segregation Analysis
    • Large Datasets from Government and Non-Governmental Sources
    • Microsimulation in the Cloud
    • Synthetic Data Generation

Next Steps

  • Course: Advanced Big Data Methods using Spark
  • Expanding Available Social Science Methods
  • Apache Toree for Integrating R/Python Kernels

What is Apache Spark?


A distributed in-memory framework for big data analysis.
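
For orientation, a minimal PySpark sketch of what using the framework looks like from the analyst's side; the S3 path is a placeholder.

    from pyspark.sql import SparkSession

    # A SparkSession is the entry point to the cluster; on EMR it is often
    # created for you in RStudio or Jupyter, but it can also be built by hand.
    spark = SparkSession.builder.appName("spark-for-social-science").getOrCreate()

    # Read directly from S3 storage into a distributed DataFrame.
    df = spark.read.csv("s3://my-bucket/admissions/", header=True, inferSchema=True)
    df.printSchema()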


Origins of Apache Spark


Michael Franklin now at UChicago!

Origins of Apache Spark


RDD: Partitioning


Spark's RDD works by implicitly executing code on partitions - logical splits of the data by observation (or row) - that are spread across a cluster of machines.


The DataFrame API allows you to run implicitly parallelized operations across these partitions.
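
A short, hypothetical sketch of that idea (the path and column names are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/admissions-parquet/")  # placeholder path

    # The data is split into partitions spread across the cluster's executors.
    print(df.rdd.getNumPartitions())

    # Column expressions run on every partition in parallel; the analyst never
    # writes an explicit loop over rows or machines.
    df = df.withColumn("age_at_admission",
                       F.col("admission_year") - F.col("birth_year"))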

Transformation: An operation on an RDD that defines a new RDD but does not return or write anything; transformations are lazy.

Action: An operation on an RDD that returns a result or writes one to external storage, executing all required prior transformations.

Actions Return Something or Write Something to External Storage
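
A sketch of the distinction, again with hypothetical data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/admissions-parquet/")  # placeholder

    # Transformations are lazy: these lines only build up an execution plan.
    drug_offenses = df.filter(F.col("offense_type") == "drug")
    by_state = drug_offenses.groupBy("state").count()

    # Actions trigger execution of every transformation the result depends on.
    by_state.show(10)                                    # returns rows to the driver
    by_state.write.mode("overwrite").parquet("s3://my-bucket/output/")  # writes to storage

    # The plan built from the chain of transformations is the DAG discussed next.
    by_state.explain()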

The Directed Acyclic Graph (DAG)

Fault Tolerance


The RDD solved an important problem: efficient fault tolerance in distributed memory. Whereas Hadoop relied on replication, which is untenable (too expensive) in distributed memory, the RDD uses the Directed Acyclic Graph model to recompute lost partitions from their lineage.

Shuffles are Expensive


Reshaping requires a shuffle
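
For example, a long-to-wide reshape in PySpark (hypothetical columns): the pivot has to gather all rows for a given person onto the same partition, so Spark inserts a shuffle.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("reshape-demo").getOrCreate()
    long_df = spark.read.parquet("s3://my-bucket/person-year/")  # placeholder

    # Long -> wide: one column per year of data.
    wide_df = (
        long_df.groupBy("person_id")
               .pivot("year")
               .agg(F.sum("months_incarcerated"))
    )
    wide_df.explain()  # the physical plan shows an Exchange (shuffle) step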

Statistical Methods for Regression

Optimization Methods: L-BFGS & SGD


These solvers fit the GLM, and thus also fixed-effects and difference-in-differences models, since both reduce to GLMs with dummy and interaction terms.


However, other methods are either not yet included (e.g. random effects) or require optimization approaches that have not yet been developed for distributed memory (e.g. quantile regression).
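
As a hedged sketch of what fitting a GLM looks like with Spark ML (the data path and column names are hypothetical; Spark's GeneralizedLinearRegression uses an IRLS-style solver, while its linear and logistic regressions expose L-BFGS):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GeneralizedLinearRegression

    spark = SparkSession.builder.appName("glm-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/sentencing/")  # placeholder

    # Spark ML expects predictors packed into a single vector column.
    assembler = VectorAssembler(
        inputCols=["age", "prior_convictions", "offense_severity"],  # hypothetical
        outputCol="features",
    )
    model_input = assembler.transform(df)

    # A Poisson GLM; fixed effects enter the same way, as dummy-coded predictors.
    glm = GeneralizedLinearRegression(
        family="poisson",
        link="log",
        labelCol="months_sentenced",
        featuresCol="features",
    )
    model = glm.fit(model_input)
    print(model.coefficients)
    print(model.summary.pValues)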

List of Spark Methods

Other Statistical Learning Methods

More Resources