Chapter 5 - Data Visualization

5.1 Visualization Process

While Spark is an excellent framework for the analysis of big data, big data comes with inherent visualization issues that Spark does not surmount. Generally, it is best to use Spark purely as a data manipulation and analytical tool that enables visualization of aggregates on other platforms, rather than as a visualization tool in its own right. Since R and Python both have powerful and extensive libraries for data visualization, there is a substantial advantage in creating graphics from local data on the master node, or from data you reduce and export to a desktop.

It’s possible that these visualization tools will eventually be implemented directly in Spark, but unlikely that they will offer functionality as thorough as the non-distributed versions. For instance, the ggplot2.SparkR project is promising, and does offer a limited subset of R’s ggplot2 geoms directly in SparkR. However, the package has not been updated regularly, and its coverage of graph types is not sufficient on its own.

Instead of relying on these packages, it is better to invest in understanding the important graph types well enough to split the graphing process into a data processing step followed by a visualization step.