Overview of Spark 2.0

As a general data processing framework, Apache Spark has become very popular in recent years, and its performance is a key concern for many developers as well as researchers. For example, some researchers from the database community have pointed out that MapReduce, NoSQL and Spark miss many important techniques developed in the database field (e.g., point 3 in http://www.hpts.ws/papers/2015/lightning/HPTS-stonebraker.pdf). However, as it develops, Spark has been integrating more and more ideas from modern compilers and MPP databases. This performance effort was launched as “Project Tungsten” in Spark 1.5 (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) and brought significant improvements over Spark 1.4 (more than a 2x speedup for aggregation, sorting and shuffle). Continuing through Spark 1.6, Spark 2.0 has now achieved a further 10x performance improvement (https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html).

In order to explain how these performance improvements are achieved, I plan to write a series of articles introducing the principles and details behind these numbers. The improvements were delivered mainly in two phases.

Phase 1: Foundation (Spark 1.5, 1.6)

  • Memory Management and Binary Processing (a sketch of this idea follows the list)
  • Cache-aware Computation
  • Code Generation
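
To give a flavor of the “Memory Management and Binary Processing” item before its dedicated article, here is a minimal Scala sketch of the underlying idea, assuming a hypothetical fixed-width schema (id, age, score). This is my own illustration, not Spark’s actual code; Tungsten’s real implementation is UnsafeRow together with Spark’s own memory manager.

import java.nio.ByteBuffer

// Sketch of the binary-processing idea: instead of keeping each record as a
// graph of Java objects (each with its own header and GC cost), lay the
// record out as raw bytes at fixed offsets, in memory the GC never scans.
object BinaryRowSketch {
  // Hypothetical fixed-width schema: (id: Long, age: Int, score: Double)
  val RowSize = 8 + 4 + 8

  def writeRow(buf: ByteBuffer, base: Int, id: Long, age: Int, score: Double): Unit = {
    buf.putLong(base, id)           // bytes [0, 8) of the row
    buf.putInt(base + 8, age)       // bytes [8, 12)
    buf.putDouble(base + 12, score) // bytes [12, 20)
  }

  def readId(buf: ByteBuffer, base: Int): Long = buf.getLong(base)

  def main(args: Array[String]): Unit = {
    // Direct (off-heap) allocation: this memory is invisible to the GC.
    val buf = ByteBuffer.allocateDirect(RowSize * 2)
    writeRow(buf, 0, 1L, 30, 95.5)
    writeRow(buf, RowSize, 2L, 25, 88.0)
    println(readId(buf, 0))        // prints 1
    println(readId(buf, RowSize))  // prints 2
  }
}

Reading a field then costs a load at a known offset instead of chasing object pointers, and keeping rows compact and contiguous is also what enables the cache-aware computation item above.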

Phase 2: Order-of-magnitude Faster (Spark 2.0)

  • Whole-stage Code Generation (see the sketch after this list)
  • Vectorization
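
To give a flavor of whole-stage code generation before its dedicated article, here is a minimal Scala sketch; the names volcanoSum and fusedSum are my own, and this is not the code Spark actually emits.

// In the classic Volcano iterator model, each operator pulls one row at a
// time from its child, paying iterator and virtual-call overhead per row.
object WholeStageSketch {
  val data: Array[Long] = Array.tabulate(1000000)(_.toLong)

  // Volcano-style pipeline: filter and aggregate as chained iterators.
  def volcanoSum(): Long =
    data.iterator.filter(_ % 2 == 0).foldLeft(0L)(_ + _)

  // What whole-stage codegen conceptually collapses the pipeline into:
  // a single fused loop with no per-row virtual calls, so intermediate
  // values can stay in CPU registers.
  def fusedSum(): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v // the filter is inlined into the loop body
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    assert(volcanoSum() == fusedSum())
    println(fusedSum())
  }
}

Vectorization takes a complementary route for operators where such fusion is not possible: it processes a batch of values from one column at a time in a tight loop, which modern CPUs execute very efficiently.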

In the following days, I will introduce the development of Spark in a series of articles, following this structure:

  • Basic concepts behind these improvements
  • Memory Management and Binary Processing
  • Cache-aware Computation
  • Code Generation
  • Whole-stage Code Generation
  • Vectorization