Spark for Social Science

The Urban Institute's Approach to Big Data Statistics

Me

  • Alex, hi.
  • I Teach Data [Viz & Science] & Public Policy @UChicago
  • Contributing Data Scientist @ Urban Institute

  • Twitter: @alexcengler
  • alexcengler@uchicago.edu
  • alexcengler@urban.org


Prison Population Forecaster
A Matter of Time

Our Problem

  • Big Data
  • Small Budget
  • Advanced Statistics
  • Programming Limitations

'Big Data'

  • ~One Billion Rows (Growing to the Low Tens of Billions)
  • ~1 TB of Data
  • A Couple Hundred Columns

  • Ideally, data at this scale, or even several orders of magnitude larger, should be trivial to work with.

Small Budget


Affordability from Elasticity

Standing clusters, whether in the cloud or on premises, are unaffordable.

Advanced Statistics

Programming Limitations


  • Preference for SAS/STATA
  • Some R-Users, a few Python-Users
  • Java/Scala/C are off the table

Accessibility as a General Concern

Our Solution

  • Apache Spark
  • R/Python w/ IDEs in Browser
  • AWS Elastic MapReduce (EMR)


Let's Launch a Cluster

Apache Spark


More on Spark in a Minute

Relatively Familiar Languages




Development Environments in Browser

  • RStudio for R
  • Jupyter Notebooks for Python

Amazon Web Services (AWS) Elastic MapReduce (EMR)

  • Elastic - Only Pay for Clusters During Use
  • Fast - 10-12 Min Spin Up w/ Bootstrap
  • Free Data Transfer From S3 (AWS Storage)

Amazon Web Services (AWS) Elastic MapReduce (EMR)

Also Cheap:

Four r3.8xlarge EC2 Instances (Renting Four Computers in the Sky) get you:

  • 244 GB Memory Each (~1 TB Total)
  • 32 vCPUs Each (128 Total)
  • 1300 GB SSD Storage

All for ~$10/Hour

List of EC2 Instances


Our Solution (Theory)

Our Solution (Reality)

So, we wrote a bunch of bootstrap scripts.

Launch from Command Line
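
As a rough illustration, here is a minimal Python (boto3) sketch of what a scripted launch of such a cluster can look like. The key pair, bucket, and bootstrap script path are placeholders, and the actual Urban Institute scripts may pass different arguments.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-social-science",
        ReleaseLabel="emr-5.8.0",                  # assumed release label; use a current one
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "r3.8xlarge",
            "SlaveInstanceType": "r3.8xlarge",
            "InstanceCount": 4,
            "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up for interactive work
            "Ec2KeyName": "my-key-pair",           # placeholder
        },
        BootstrapActions=[{
            "Name": "install-rstudio-and-packages",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},  # placeholder
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])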

AWS CloudFormation

Data Security & Access

  • FedRAMP Authority to Operate (ATO)
  • Virtual Private Cloud
  • Centrally Controlled Data Access in S3

Use Cases

  • Computationally Expensive Analysis We Were Already Doing
  • New Stuff:
    • Police-Sentiment Twitter Analysis
    • Digital Segregation Analysis
    • Large Datasets from Government and Non-Governmental Sources
    • Microsimulation in the Cloud
    • Synthetic Data Generation

Next Steps

  • Course: Advanced Big Data Methods using Spark
  • Expanding Available Social Science Methods
  • Apache Toree for Integrating R/Python Kernels

What is Apache Spark?


A distributed in-memory framework for big data analysis.
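
For orientation, a minimal PySpark sketch of what using the framework looks like from the analyst's side; the S3 path is a placeholder.

    from pyspark.sql import SparkSession

    # A SparkSession is the entry point to the cluster; on EMR it is often
    # created for you in RStudio or Jupyter, but it can also be built by hand.
    spark = SparkSession.builder.appName("spark-for-social-science").getOrCreate()

    # Read directly from S3 storage into a distributed DataFrame.
    df = spark.read.csv("s3://my-bucket/admissions/", header=True, inferSchema=True)
    df.printSchema()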


Origins of Apache Spark


Michael Franklin now at UChicago!

Origins of Apache Spark


RDD: Partitioning


Spark's RDD works by implicitly executing code on partitions - logical splits of the data by observation (or row) - that are spread across a cluster of machines.


The DataFrame API allows you to run implicitly parallelized operations across these partitions.
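
A short, hypothetical sketch of that idea (the path and column names are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/admissions-parquet/")  # placeholder path

    # The data is split into partitions spread across the cluster's executors.
    print(df.rdd.getNumPartitions())

    # Column expressions run on every partition in parallel; the analyst never
    # writes an explicit loop over rows or machines.
    df = df.withColumn("age_at_admission",
                       F.col("admission_year") - F.col("birth_year"))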

Transformation: An operation on an RDD that defines a new RDD but does not return or write anything; transformations are lazy.

Action: An operation on an RDD that returns a result or writes one to external storage, executing all required prior transformations.

Actions Return Something or Write Something to External Storage
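
A sketch of the distinction, again with hypothetical data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/admissions-parquet/")  # placeholder

    # Transformations are lazy: these lines only build up an execution plan.
    drug_offenses = df.filter(F.col("offense_type") == "drug")
    by_state = drug_offenses.groupBy("state").count()

    # Actions trigger execution of every transformation the result depends on.
    by_state.show(10)                                    # returns rows to the driver
    by_state.write.mode("overwrite").parquet("s3://my-bucket/output/")  # writes to storage

    # The plan built from the chain of transformations is the DAG discussed next.
    by_state.explain()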

The Directed Acyclic Graph (DAG)

Fault Tolerance


The RDD solved an important problem: efficient fault tolerance in distributed memory. Whereas Hadoop relied on replication, which is untenable (too expensive) in distributed memory, the RDD uses the Directed Acyclic Graph model to recompute lost partitions from their lineage.

Shuffles are Expensive


Reshaping requires a shuffle
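
For example, a long-to-wide reshape in PySpark (hypothetical columns): the pivot has to gather all rows for a given person onto the same partition, so Spark inserts a shuffle.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("reshape-demo").getOrCreate()
    long_df = spark.read.parquet("s3://my-bucket/person-year/")  # placeholder

    # Long -> wide: one column per year of data.
    wide_df = (
        long_df.groupBy("person_id")
               .pivot("year")
               .agg(F.sum("months_incarcerated"))
    )
    wide_df.explain()  # the physical plan shows an Exchange (shuffle) step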

Statistical Methods for Regression

Optimization Methods: L-BFGS & SGD


These solvers fit the GLM, and thus also fixed-effects and difference-in-differences models, since both reduce to GLMs with dummy and interaction terms.


However, other methods are either not yet included (e.g. random effects) or require optimization approaches that have not yet been developed for distributed memory (e.g. quantile regression).
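
As a hedged sketch of what fitting a GLM looks like with Spark ML (the data path and column names are hypothetical; Spark's GeneralizedLinearRegression uses an IRLS-style solver, while its linear and logistic regressions expose L-BFGS):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GeneralizedLinearRegression

    spark = SparkSession.builder.appName("glm-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/sentencing/")  # placeholder

    # Spark ML expects predictors packed into a single vector column.
    assembler = VectorAssembler(
        inputCols=["age", "prior_convictions", "offense_severity"],  # hypothetical
        outputCol="features",
    )
    model_input = assembler.transform(df)

    # A Poisson GLM; fixed effects enter the same way, as dummy-coded predictors.
    glm = GeneralizedLinearRegression(
        family="poisson",
        link="log",
        labelCol="months_sentenced",
        featuresCol="features",
    )
    model = glm.fit(model_input)
    print(model.coefficients)
    print(model.summary.pValues)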

List of Spark Methods

Other Statistical Learning Methods

More Resources