Chapter 9 - Glossary of Terms

  • Caching -

  • Computer Memory - Storage (RAM) that a computer can read and write much faster than physical storage devices such as hard drives. However, it is also significantly smaller, more expensive per gigabyte, and requires constant power to hold its data.

  • Directed Acyclic Graph -

  • Distributed Computing - Using multiple smaller systems working together, generally to handle data or processes that are prohibitively large for a single system to handle in a cost-effective manner.

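  • L-BFGS - Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (sometimes written LM-BFGS). An optimization algorithm that approximates curvature information from a small number of recent gradients, which keeps its memory use low. A common way to fit models on big data.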

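  • MapReduce - The conceptual predecessor to Spark. A programming model developed and published by Google, in which data is processed in a "map" phase followed by a "reduce" phase. Its best-known open-source implementation is Apache Hadoop.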

  • Master Node - A system in a Spark cluster that manages some number of worker nodes, rather than holding and working on its own data.

  • Parallel Processing - The use of multiple CPUs or CPU cores to perform operations simultaneously.

  • Partition -

  • Persistence - Due to Spark’s “lazy” computing, commands are chained between the loaded data and the eventual output, and are only processed when that output is requested. In cases where the chain needs to branch to multiple outputs, Spark can be told to “persist” the intermediate stage where the branch happens. This allows it to avoid re-processing all the stages before the branch for each output. For example, if a chain looks like this: Load Data -> Step A -> Step B -> Step C -> Show Output, then steps A, B and C will only be computed when “show output” is reached. If a second branch is added: Load Data -> Step A -> Step B -> Step D -> Show Output, then calling “persist” after step B will allow Spark to calculate steps A and B only once.

  • Shuffle - An operation in Spark that forces data to be redistributed across partitions, which can move it between nodes and possibly between memory and physical storage.

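  • Stochastic Gradient Descent (SGD) - An iterative process for finding a minimum of a function by taking small steps based on the gradient computed from a random sample of the data at each iteration. It is only guaranteed to approach the global minimum for convex problems. Used in some large data processes.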

  • Worker Node - One of the systems in a Spark cluster that uses its resources to hold data, and performs operations on that data, under the direction of a master node.
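
To make the MapReduce entry more concrete, here is a minimal single-machine sketch of the model in plain Python. It is not Hadoop or Spark code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. The middle step also gives a rough picture of what the Shuffle entry describes, since it is where records with the same key are brought together.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["spark is fast", "Spark is lazy"])
```

In a real cluster the map and reduce phases would run on many worker nodes at once, and the shuffle would move data across the network between them. Here everything runs in one process to keep the idea visible.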
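
The branching example in the Persistence entry can be imitated with a toy model in plain Python. This is only a sketch of the idea and not Spark's API: the LazyStage class, its then, persist and collect methods, and the run_count counter are all hypothetical names invented for this illustration.

```python
class LazyStage:
    """Hypothetical toy model of one lazy pipeline stage (not Spark's API)."""

    def __init__(self, func, parent=None):
        self.func = func          # work this stage performs
        self.parent = parent      # previous stage in the chain, if any
        self.persisted = False    # whether to keep the result after computing
        self._has_cache = False
        self._cache = None
        self.run_count = 0        # how many times this stage actually ran

    def then(self, func):
        """Chain a new stage; nothing is computed yet."""
        return LazyStage(func, parent=self)

    def persist(self):
        """Mark this stage so its result is kept once it is computed."""
        self.persisted = True
        return self

    def collect(self):
        """Trigger computation of this stage and everything before it."""
        if self._has_cache:
            return self._cache
        data = self.parent.collect() if self.parent is not None else None
        result = self.func(data)
        self.run_count += 1
        if self.persisted:
            self._cache, self._has_cache = result, True
        return result


load = LazyStage(lambda _: [1, 2, 3, 4])                 # Load Data
step_a = load.then(lambda xs: [x * 2 for x in xs])       # Step A
step_b = step_a.then(lambda xs: [x + 1 for x in xs]).persist()  # Step B
step_c = step_b.then(sum)                                # Step C
step_d = step_b.then(max)                                # Step D

output_c = step_c.collect()  # runs Load, A, B and C
output_d = step_d.collect()  # reuses the persisted result of B, runs only D
```

After both outputs are collected, step_a.run_count and step_b.run_count are each 1. Removing the persist() call would cause steps A and B to run a second time for the second branch.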
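
As an illustration of the Stochastic Gradient Descent entry, the following sketch fits a straight line y = w * x + b to a handful of points, updating the parameters after each individual sample. The function name sgd_linear_fit, the learning rate, the number of epochs and the sample data are all assumptions made for this example. The squared-error objective used here is convex, so in this particular case the minimum being approached is the global one.

```python
import random


def sgd_linear_fit(points, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)       # visit the samples in a random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of error**2 with respect to w and b, for one sample.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return w, b


# Noise-free sample data generated from y = 3x + 1.
points = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear_fit(points)
```

Because each update looks at only one sample, the cost of a step does not grow with the size of the dataset, which is the property that makes this family of methods attractive for large data.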