Chapter 2 - Getting Started with Spark

The open source Apache Spark project supports APIs in R and Python, languages far more accessible to most researchers than Scala, the language Spark itself is written in.

Below are links to tutorials written by the Urban Institute Research Programming team to help guide researchers through tasks typically performed by social scientists, carried out with Spark. As you read through this book, you will see hyperlinks for SparkR and PySpark that will take you to code relevant to the concept being discussed.

Note: The tutorials linked above assume that readers have some familiarity with R and Python, respectively.

2.1 What About Stata and SAS?

Unfortunately, neither Stata nor SAS is supported by Spark. Stata and SAS are proprietary tools, so it’s very unlikely they will ever be supported by the Spark project. The same goes for other similar programs, such as SPSS and EViews. While there is a framework for distributed computing in SAS, that software is very expensive and requires constantly running clusters, putting its cost out of reach for most social science organizations.

2.2 Choosing Between SparkR and PySpark

If you have a strong preference for either R or Python, you should let that preference guide your decision. SparkR and PySpark are not dramatically different in terms of speed or available functionality (though we have found generalized linear modeling to be more intuitive in SparkR).

  • Speed: Each of the Spark language APIs (R, Python, Java, and Scala) operates on the Spark DataFrame, a data type conceptually equivalent to a table in a relational database or a data.frame/DataFrame in R/Python. DataFrame operations are optimized and executed by core Spark itself, which is written in Scala and Java, so execution speeds for standard data processing tasks are functionally the same no matter which language they are written in (see the sketch following this list).
  • Available Functionality: In the older 1.X versions of Spark there was a notable gap between SparkR and PySpark, with PySpark clearly farther along in development; SparkR was missing a number of key features, including support for user-defined functions. That gap has largely been closed in the 2.X versions of Spark, although some asymmetry between the two will likely always exist given the nature of open-source development.
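To make the DataFrame abstraction concrete, here is a minimal PySpark sketch. The session name, column names, and toy values are our own illustrative choices rather than anything taken from the tutorials, and an equivalent SparkR script would be executed by the same Scala/JVM engine at essentially the same speed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# The SparkSession is the entry point to the DataFrame API.
spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# A Spark DataFrame: conceptually a relational table or an R/Python data frame,
# but partitioned across the machines in the cluster.
df = spark.createDataFrame(
    [(1, "treatment", 12.0), (2, "control", 7.5), (3, "treatment", 9.0)],
    ["id", "group", "income"],
)

# Standard operations are compiled down to the same optimized Scala/JVM
# execution plan whether they are written in Python, R, Java, or Scala.
df.groupBy("group").avg("income").show()

# A user-defined function (the kind of feature SparkR lacked in the 1.X releases).
sqrt_income = udf(lambda x: float(x) ** 0.5, DoubleType())
df.withColumn("sqrt_income", sqrt_income(df["income"])).show()
```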

That being said, there are some notable differences that might affect your decision:

  • Statistical Modeling: We have found that SparkR supports statistical modeling with a bit more ease than PySpark. R’s familiar formula syntax works well, whereas PySpark’s implementation adds complexity (see the sketch following this list). If easy and familiar statistical modeling is important to you, you might want to stick with SparkR.
  • Code Syntax: SparkR feels a bit more like normal R than PySpark feels like Python, although both have their peculiarities. This is probably because R is built around the data.frame, which is analogous to the Spark DataFrame. For this reason, you might prefer SparkR if you are equally comfortable in R and Python. Both of the tutorials linked above are oriented around the Spark DataFrame data type.
  • Open Source Development: Python appears to have the advantage here: more of the packages that extend Spark’s functionality are written for, or accessible through, PySpark than SparkR. This may be a product of the makeup of the current Spark community (more data scientists and engineers than traditional statisticians and social scientists) as well as the overall size of the Python community relative to R’s.
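To illustrate the difference in modeling ergonomics, here is a hedged PySpark sketch of a Gaussian GLM. The variable names and data are hypothetical; in SparkR, the same model would be a single formula call along the lines of spark.glm(df, income ~ age + educ, family = "gaussian").

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

spark = SparkSession.builder.appName("glm-sketch").getOrCreate()

# Hypothetical data: income modeled as a function of age and years of education.
df = spark.createDataFrame(
    [(35, 12, 42000.0), (50, 16, 61000.0), (28, 14, 39000.0), (44, 18, 70000.0)],
    ["age", "educ", "income"],
)

# PySpark's ML API expects the predictors packed into a single vector column;
# this assembly step has no analogue in SparkR's formula interface.
assembler = VectorAssembler(inputCols=["age", "educ"], outputCol="features")
assembled = assembler.transform(df)

glm = GeneralizedLinearRegression(
    family="gaussian",
    link="identity",
    featuresCol="features",
    labelCol="income",
)
model = glm.fit(assembled)

# Rough analogue of summary(model) in R.
print(model.coefficients, model.intercept)
print(model.summary.coefficientStandardErrors)
print(model.summary.pValues)
```

The VectorAssembler step is the added complexity referred to above: predictors must be packed into a single features column before fitting, whereas SparkR accepts an R formula directly.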