Chapter 1 - Introduction

Most social scientists perform statistical analysis on personal computers or on servers dedicated to statistical processing. While these environments provide effective research platforms for the majority of projects, their inherent constraints on data size and processing speed limit the ability of social scientists to perform empirical analysis on large data sets. This is especially problematic given emerging data sources of massive scale (e.g., from social media, the “internet of things”, or satellites) that can be highly pertinent to social science.

We begin this guide by discussing (1) alternative frameworks for storing data and performing statistical analysis, (2) when social scientists may benefit from these approaches, and (3) the solution developed by the Urban Institute to address the massive-data needs of public policy researchers.

We then provide an overview of the framework that supports Urban’s solution and best practices for using it. While a detailed understanding of the mechanisms behind the framework is not necessary to leverage the big data approach described below, a basic understanding will help researchers perform their analyses much more efficiently.

1.1 Outgrowing the server: When should social scientists consider distributed computing?

When you work with data on a traditional platform, whether on a personal computer or on a server, the system pulls data from physical storage (the hard drive) and either loads it into memory while you work with it (e.g., Stata) or iteratively reads it into and out of memory (e.g., SAS). In either case, the size of the data you can analyze is constrained by the capacity of that system. For the majority of what is traditionally considered social science research, this framework is effective and convenient; Spark Social Science does not replace existing solutions in this range. Only when the data becomes large enough, or the processing intense enough, do researchers need to look beyond traditional frameworks for statistical analysis.

When data outgrows this capacity, relying on this framework becomes inefficient or prohibitively costly for most organizations. Acquiring a larger server is one solution, but aside from the often extraordinary expense, the amount of data that can be stored and analyzed on any given server is still finite.

This problem is compounded by the fact that an organization will always need a server large enough for its most computationally demanding project. Given the intermittent nature of social science research, this will often result in paying for long stretches of unused capacity. Additionally, if a new, larger project comes along, or if several researchers need computational resources at the same time, the server has no capacity to adapt.

Subsequent sections of this manual describe the Urban Institute’s approach, which we call Spark Social Science, to addressing the needs of policy researchers who work with massive data. The system described below does not require owning and maintaining a server capable of handling massive data. Our approach instead leverages scalable cloud computing hosted by Amazon Web Services (AWS) and the distributed computing platform Apache Spark (Spark). Because capacity is tied directly to researcher demand, the framework simultaneously provides researchers with greater flexibility and dramatically reduces organizational costs.

When should researchers make the jump from the desktop or server to a distributed computing platform like Spark? The answer depends on the size of the researcher’s data, the size of the server the researcher currently relies on, and the number of researchers sharing that server. Researchers should consider distributed computing if their data is larger than the server memory and storage available to them (after accounting for resources shared with other researchers). Statistical analyses that are especially demanding of processing power may also benefit greatly from distributed computing.
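As a rough heuristic, this comparison can be made concrete. The Python sketch below is illustrative only: the file name is hypothetical, and the threefold expansion factor is an assumption reflecting that data often grows when loaded into a statistical package, not a rule from the Spark Social Science project.

```python
# A rough, illustrative heuristic: compare a dataset's on-disk size with the
# memory available on the current machine. Requires the third-party psutil
# package (pip install psutil).
import os
import psutil

data_bytes = os.path.getsize("survey_extract.csv")  # hypothetical data file
memory_bytes = psutil.virtual_memory().total

# Assumption: data often expands roughly 2-5x when loaded into an analysis
# package, so we build in a 3x cushion here.
if data_bytes * 3 > memory_bytes:
    print("Data may exceed this machine's memory; consider distributed computing.")
else:
    print("A single machine is likely sufficient for this dataset.")
```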

1.2 Making distributed computing straightforward & cost-effective

The Spark Social Science approach relies on cloud-based, distributed computing rather than an on-site server. Spark distributes data storage and processing tasks across a cluster of machines, allowing many computers, with hundreds or thousands of processors and terabytes of memory among them, to work together to analyze data.
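For intuition, here is a minimal PySpark sketch of that model. The data and column names are hypothetical; on a cluster, the DataFrame below would be partitioned across machines and the aggregation executed in parallel.

```python
# Minimal PySpark sketch: Spark partitions the DataFrame across the cluster
# and runs the aggregation on all partitions in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Hypothetical micro-dataset; in practice this would be millions of rows.
df = spark.createDataFrame(
    [("NY", 52000), ("NY", 61000), ("CA", 58000), ("CA", 72000)],
    ["state", "income"],
)

# The groupBy/avg below is a distributed computation coordinated by Spark.
df.groupBy("state").avg("income").show()
```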

Specifically, our framework allows researchers to “spin up” and shut down clusters, or groups, of computers using AWS Elastic MapReduce (EMR).1
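To make the idea concrete, the sketch below spins up a small EMR cluster with Spark installed using boto3, the AWS SDK for Python, and then shuts it down. This is one possible way to do it, not the mechanism our scripts actually use; the cluster name, EMR release label, instance types, and counts are illustrative assumptions.

```python
# Hedged sketch: create and terminate an EMR cluster with Spark via boto3.
# All parameter values below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-cluster",            # hypothetical cluster name
    ReleaseLabel="emr-5.8.0",                # assumed EMR release that bundles Spark
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",   # illustrative EC2 instance types
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 4,                  # one master, three workers
        "KeepJobFlowAliveWhenNoSteps": True, # keep cluster alive for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])

# When the analysis is finished, shut the cluster down to stop incurring charges.
emr.terminate_job_flows(JobFlowIds=[response["JobFlowId"]])
```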

Because these clusters are only temporarily in use, data is stored permanently in the AWS data storage infrastructure, called S3. The individual machines that make up a cluster are called EC2 instances.2

This cloud-based, distributed framework effectively removes any computing constraint on the size of data that researchers can store and analyze. Researchers can rent any number of machines from AWS and then use Spark, which coordinates tasks between the machines, to carry out their data analysis and manipulation. Data is stored in S3 and then distributed in memory across the cluster’s machines when tasks need to be performed. This scales to a massive degree: the largest publicly known Spark cluster consists of 8,000 machines, and Spark has been shown to perform well processing several petabytes (a petabyte is one million gigabytes) of data.3
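From the researcher’s side, that storage pattern looks like the following PySpark sketch, run on an EMR cluster. The bucket, file, and output paths are hypothetical.

```python
# Sketch of the S3-to-cluster-memory pattern on EMR; all paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads directly from S3 and splits the data across the cluster's machines.
df = spark.read.csv("s3://my-research-bucket/acs_extract.csv", header=True)

df.cache()         # keep the distributed data in cluster memory for repeated use
print(df.count())  # an action that triggers the distributed read

# Write results back to permanent storage in S3 before shutting the cluster down.
df.write.parquet("s3://my-research-bucket/output/acs_extract.parquet")
```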

Spark is installed during AWS cluster configuration so that researchers can immediately begin working with their data in the distributed environment. In addition, the scripts in the Spark Social Science GitHub repository set up analytical environments (RStudio or Jupyter Notebooks) so researchers can work in R or Python as soon as a cluster finishes spinning up. We describe and compare the supported programming languages in subsequent sections of this manual.


  1. Google Cloud’s Dataproc would also likely work for this task, though we have not yet implemented that approach.

  2. A complete list of EC2 instance types can be found at https://aws.amazon.com/ec2/instance-types/.

  3. http://spark.apache.org/faq.html