Chapter 4 - Sorting Distributed Data

4.1 Sorting in Distributed Computing

When working with datasets of sizes traditionally seen in social science research, sorting the data by some variable is an easy task. Since all of the data is in the memory of one computer, all of the shuffling can be done quickly and efficiently.
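
For concreteness, here is a minimal single-machine sketch using pandas; the dataset and column names are invented for illustration. Everything sits in one computer's memory, so the sort is a single quick call.

    import pandas as pd

    # A small survey-style dataset that fits comfortably in one machine's memory.
    survey = pd.DataFrame({
        "respondent_id": [3, 1, 2],
        "income": [42000, 58000, 35000],
    })

    # Sorting is a single local operation; no data ever leaves this machine.
    survey_sorted = survey.sort_values("income")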

However, when working with massive data in a distributed environment, in which many millions of observations may be distributed across hundreds of machines, sorting data becomes an enormously expensive operation. Intuitively, this is because your data is partitioned. This means that the rows of data are spread in groups across all the different machines in your cluster. In order to perform the sort, Spark must execute a shuffle operation, which repartitions the data, moving much of it between the machines of the cluster.
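
By contrast, a distributed sort in Spark always involves that repartitioning step. The minimal PySpark sketch below, with invented column names and toy data, shows that an ordinary orderBy produces a physical plan containing an Exchange step, which is how the shuffle appears in Spark's plan output.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A toy DataFrame; in practice this would be millions of rows spread
    # across the partitions of many machines.
    df = spark.createDataFrame(
        [(3, 42000), (1, 58000), (2, 35000)],
        ["respondent_id", "income"],
    )

    # A global sort forces Spark to repartition (shuffle) the rows so that
    # each partition ends up holding one contiguous range of the sort key.
    df_sorted = df.orderBy("income")

    # The printed physical plan includes an Exchange (rangepartitioning)
    # step; that step is the shuffle.
    df_sorted.explain()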

Notably, sorts (and therefore shuffles) occur far more often than you might realize, given that they are an essential part of other common operations. For instance, sorting and shuffling occur during explicit sorts of a dataset by one or more columns, during joins that match rows on equal values of a column, during grouped aggregations that must collect every row sharing a key onto the same machine, and when finding the unique values of a column, as illustrated in the sketch below.
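
A grouped count, for example, never mentions sorting, yet it triggers the same kind of data movement, because every row with a given key has to land on the same machine before it can be counted. This sketch reuses the SparkSession from the snippet above; the column names are again invented.

    # A second toy DataFrame with a grouping column.
    people = spark.createDataFrame(
        [(1, "OH"), (2, "PA"), (3, "OH"), (4, "NY")],
        ["respondent_id", "state"],
    )

    # Counting respondents per state moves all rows with the same state onto
    # one machine first, so the plan contains an Exchange (hashpartitioning).
    per_state = people.groupBy("state").count()
    per_state.explain()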

4.2 Shuffling Intuition

Spark needs to first consider your sorting criteria (e.g. dates, alphabetical, numerical, ascending, descending), then search across all of the machines in the cluster to figure out where the first and last values sit, how the remaining values are spread between them, and whether there are any duplicates. From there it has to work out where each piece of data should go, and where to move the values whose places that incoming data will take.
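
The bookkeeping described here amounts to range partitioning: sample the data, choose boundary values, and send each row to the machine responsible for its range, after which each machine can sort its own slice locally. The pure-Python sketch below imitates that idea on a single machine; it is only an illustration of the intuition, not Spark's actual implementation.

    import random

    # Pretend these values are scattered across many machines.
    values = random.sample(range(1, 10_001), 1_000)

    # Step 1: sample the data to estimate where the range boundaries should fall.
    sample = sorted(random.sample(values, 100))
    num_partitions = 4
    boundaries = [sample[i * len(sample) // num_partitions]
                  for i in range(1, num_partitions)]

    # Step 2: route each value to the partition (machine) that owns its range.
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        destination = sum(v > b for b in boundaries)  # index of the owning range
        partitions[destination].append(v)

    # Step 3: each partition sorts locally; reading the partitions in order
    # then yields a globally sorted dataset.
    globally_sorted = [v for part in partitions for v in sorted(part)]
    assert globally_sorted == sorted(values)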

Each observation takes up space in memory, and because the data has been spread across many nodes precisely because of its size, Spark cannot simply keep every piece of data where it is until it has found the correct place for it. During a shuffle, then, the data temporarily becomes even more spread out, since Spark parks observations elsewhere until it determines where they should be placed. Imagine, for example, six computers holding twelve observations, one for each month, with no machine able to hold more than two observations. If January and November sit on one machine, Spark might go find February and decide it belongs next to January; but in order to move it there, it must first put November somewhere else temporarily.

Every time something is moved between machines, Spark must both recalculate the partitions and move data over the network, which is much slower than moving data within a single machine. In addition, if there is not enough spare memory, data may be spilled onto the hard drives of the nodes, and hard drive traffic is slower than even network traffic.
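
One practical lever over these costs is the number of partitions that a shuffle produces: too many tiny partitions adds network and scheduling overhead, while too few can force spills to disk because each partition no longer fits in memory. The sketch below, reusing the objects from the earlier PySpark snippets, shows how to inspect and adjust this; spark.sql.shuffle.partitions is a standard Spark setting, but the value shown is purely illustrative.

    # How many partitions came out of the sort's shuffle?
    print(df_sorted.rdd.getNumPartitions())

    # spark.sql.shuffle.partitions (default 200) controls how many partitions
    # the shuffles behind sorts, joins, and aggregations produce. The right
    # value depends on the cluster's memory and the size of the data; 64 is
    # only an example.
    spark.conf.set("spark.sql.shuffle.partitions", 64)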

Any operation that causes a shuffling of data, or a "shuffle," can be very costly in terms of processing power, hard drive traffic and system memory. And since Spark clusters on AWS are paid for by the second of usage, excessive shuffling can be costly in dollars as well. Researchers should understand that several common Spark tasks require a shuffle, and that this shuffling may not be obvious from the code. Spark operations that do require a shuffle include, for example, joining datasets on equal values of some specified column and finding the unique values of a column, as sketched below. Some shuffling is often inevitable; just be aware that shuffles will be one of the largest drivers of how long your operations take to finish.
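
To make those hidden shuffles concrete, here is a hedged sketch reusing the toy DataFrames from the earlier snippets. With realistically sized inputs, both physical plans contain Exchange (shuffle) steps even though neither line of code mentions sorting; on toy data this small, Spark may instead broadcast the smaller side of the join.

    # Joining on equal values of a column: on data too large to broadcast,
    # Spark shuffles both sides so that rows with the same respondent_id
    # meet on the same machine.
    joined = df.join(people, on="respondent_id")
    joined.explain()

    # Finding the unique values of a column: duplicates can only be detected
    # once matching values have been brought together, which again requires
    # a shuffle.
    states = people.select("state").distinct()
    states.explain()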