Chapter 1 - Introduction
Most social scientists perform statistical analysis on personal computers or on servers dedicated to statistical processing. While these environments provide effective research platforms for the majority of projects, their inherent constraints on data size and processing speed limit the ability of social scientists to perform empirical analysis on large data sets. This is especially problematic given emerging data sources of massive scale (e.g., from social media, the “internet of things,” or satellites) that can be highly pertinent to social science.
We begin this guide by discussing: (1) alternative frameworks for storing data and performing statistical analysis, (2) when social scientists may benefit from these approaches, and (3) the solution developed by the Urban Institute to address the massive data needs of public policy researchers.
We subsequently provide an overview of the framework that supports Urban's solution, along with best practices for using it. While a detailed understanding of the framework's mechanisms is not necessary to leverage the big data approach described below, a basic understanding will help researchers perform their analyses much more efficiently.
1.2 Making distributed computing straightforward & cost-effective
The Spark Social Science approach relies on cloud-based, distributed computing rather than an on-site server. Spark is a distributed computing platform that stores and processes data across a cluster of machines, splitting storage and processing tasks among them. This allows many computers, with hundreds or thousands of processors and terabytes of memory in total, to work together to analyze data.
Specifically, our framework allows researchers to “spin up” and shut down clusters, or groups, of computers using AWS Elastic MapReduce (EMR)1.
Since these clusters are only temporarily in use, data is stored permanently in S3, AWS's data storage infrastructure. The individual machines that make up a cluster are called EC2 instances2.
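To illustrate what “spinning up” a cluster involves, the sketch below launches a small EMR cluster with Spark installed, using the AWS SDK for Python (boto3). This is a minimal illustration under stated assumptions, not the Spark Social Science launch procedure itself: the cluster name, region, instance types, count, and EMR release label are placeholders, and the actual launch scripts configure additional options (such as bootstrap actions) not shown here.

```python
import boto3

# Connect to the EMR service in a chosen AWS region (region is illustrative).
emr = boto3.client("emr", region_name="us-east-1")

# Request a small Spark cluster. Instance types, counts, and the release
# label below are placeholders, not the values used by the launch scripts.
response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Keep the cluster alive for interactive work rather than
        # shutting it down after a batch job completes.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # cluster ID, used later to shut the cluster down

# Shut the cluster down when the analysis is done to stop incurring charges:
# emr.terminate_job_flows(JobFlowIds=[response["JobFlowId"]])
```

Because the cluster is disposable, any results worth keeping should be written back to S3 before the cluster is terminated.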
This cloud-based, distributed framework effectively removes any computing constraint on the size of the data that researchers can store and analyze. Researchers can rent any number of machines from AWS and then use Spark, which coordinates tasks across the machines, to implement their data manipulation and analysis. Data is stored in S3 and then distributed in memory across the cluster's machines when tasks need to be performed. This scales to a massive degree: the largest publicly known Spark cluster comprises 8,000 machines, and Spark has been shown to perform well processing up to several petabytes (each 1 million gigabytes) of data.3
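To make this concrete, the following is a minimal PySpark sketch of reading a data set from S3 and running a distributed aggregation. The bucket, file, and column names are hypothetical placeholders; on an EMR cluster, Spark can read from S3 paths directly.

```python
from pyspark.sql import SparkSession

# On an EMR cluster, Spark comes preconfigured; this gets (or creates) a session.
spark = SparkSession.builder.appName("s3-example").getOrCreate()

# Read a CSV file directly from S3 into a distributed DataFrame.
# The bucket, key, and column names below are hypothetical.
df = spark.read.csv("s3://my-bucket/survey_data.csv",
                    header=True, inferSchema=True)

# The data is partitioned across the cluster's memory, so this aggregation
# runs in parallel on every machine and returns only the small result.
df.groupBy("state").count().show()
```

The key point is that the researcher writes a single expression, and Spark handles distributing the work across however many machines the cluster contains.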
Spark is installed during AWS cluster configuration, so researchers can immediately begin working with their data in the distributed environment. In addition, the scripts on the Spark Social Science GitHub page set up an analytical environment (RStudio or Jupyter Notebooks) so researchers can work in R or Python as soon as a cluster finishes spinning up. We describe and compare the supported programming languages in subsequent sections of this manual.
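As a quick sanity check once a cluster is up, a researcher working in a Jupyter notebook might confirm that the distributed environment is ready. This snippet is illustrative and not part of the launch scripts themselves.

```python
from pyspark.sql import SparkSession

# Attach to the cluster's Spark runtime from the notebook.
spark = SparkSession.builder.getOrCreate()

print(spark.version)                          # Spark release installed on the cluster
print(spark.sparkContext.defaultParallelism)  # total cores available for parallel tasks
```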
Google Cloud’s Dataproc would also likely work for this task, though we have not yet implemented that approach.↩
A complete list of EC2 instance types can be found at https://aws.amazon.com/ec2/instance-types/.↩