
Why I like Scala

Scala is a functional and an object-oriented (OO) language at the same time. Functional languages allow the programmer to perform complex operations in a standardized way regardless of the object type. OO languages include many convenient ideas from the object-oriented world, which can break the functional rules when used carelessly. Most languages fall into only one of these two categories, but Scala mixes them together. I will introduce Scala programming through these two parts.

  • Functional Part

I will mainly introduce some features provided by Scala that make Scala programmers more productive.

  1. Algebraic data types
  2. Pattern matching
  3. Higher-order functions and combinators (items 1-3 are sketched in the first example below)
  4. Ad hoc polymorphism with type classes using implicits
  5. Higher-kinded types (type constructor polymorphism)
  6. Monadic programming (items 4 and 6 are sketched in the second example below)
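
To give a taste of the functional side, here is a minimal sketch (the Shape type and its cases are invented for illustration) combining an algebraic data type, pattern matching, and higher-order functions:

    // An algebraic data type: a sealed trait with a fixed set of case classes.
    sealed trait Shape
    case class Circle(radius: Double) extends Shape
    case class Rectangle(width: Double, height: Double) extends Shape

    // Pattern matching deconstructs the ADT; the compiler warns about missing cases.
    def area(s: Shape): Double = s match {
      case Circle(r)       => math.Pi * r * r
      case Rectangle(w, h) => w * h
    }

    // Higher-order functions and combinators compose cleanly.
    val shapes = List(Circle(1.0), Rectangle(2.0, 3.0))
    val total = shapes.map(area).filter(_ > 1.0).sum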
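
And a second minimal sketch (again with invented names, runnable in the Scala REPL) showing ad hoc polymorphism via a type class encoded with implicits, plus monadic programming with a for-comprehension over Option:

    // A type class: behavior defined separately from the types it applies to.
    trait Show[A] { def show(a: A): String }

    // Instances are implicit values that the compiler resolves automatically.
    implicit val intShow: Show[Int] = new Show[Int] {
      def show(a: Int): String = s"Int($a)"
    }

    // Ad hoc polymorphism: works for any A with a Show instance in scope.
    def describe[A](a: A)(implicit ev: Show[A]): String = ev.show(a)
    describe(42) // "Int(42)"

    // Monadic programming: a for-comprehension desugars to flatMap/map,
    // so a None short-circuits the whole computation.
    def parse(s: String): Option[Int] = scala.util.Try(s.toInt).toOption
    val sum = for { a <- parse("1"); b <- parse("2") } yield a + b // Some(3)
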
  • OO Part

I assume all of you are familiar with object-oriented programming. Here, I introduce a specific set of features for developers coming from Java who want to become more productive.

  1. A more powerful (though more complicated) type system
  2. Type inference
  3. Mixin-based inheritance using traits (sketched below)
  4. Better treatment of type variance
  5. Abstract type members and abstract vals (also sketched below)
  6. A better module system
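
A small sketch (all names invented, runnable in the Scala REPL) of the mixin and abstract-member features from this list:

    // A trait can carry implementations and be mixed into any class.
    trait Logging {
      def log(msg: String): Unit = println(s"[log] $msg")
    }

    // Abstract type members and abstract vals defer decisions to implementors.
    trait Repository {
      type Entity      // abstract type member
      val name: String // abstract val
      def save(e: Entity): Unit
    }

    // Mixin-based inheritance: one class, several traits.
    class UserRepository extends Repository with Logging {
      type Entity = String
      val name = "users"
      def save(e: Entity): Unit = log(s"saving $e into $name")
    }

    new UserRepository().save("alice") // prints: [log] saving alice into users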

Overview of Spark 2.0

As a general data processing framework, Apache Spark has become very popular these days, and its performance is a key concern for many developers as well as researchers. For example, some researchers from the database community have pointed out that MapReduce, NoSQL and Spark miss many important features established in the database field (e.g., point 3 in http://www.hpts.ws/papers/2015/lightning/HPTS-stonebraker.pdf). However, as Spark has developed, it has begun to integrate more and more ideas from modern compilers and MPP databases. The performance improvement plan was launched as “Project Tungsten” in Spark 1.5 (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html). It achieved significant performance improvements compared with Spark 1.4 (more than 2x for aggregation, sorting and shuffle). With the continued development through Spark 1.6, Spark 2.0 has now achieved a further 10x performance improvement (https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html).

In order to explain how these performance improvements are achieved, I plan to write a series of articles introducing the principles and details behind these numbers. The improvements were mainly delivered in two phases.

Phase 1: Foundation (Spark 1.5, 1.6)

  • Memory Management and Binary Processing (a small config sketch follows this list)
  • Cache-aware Computation
  • Code Generation
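
As a preview of the first item: Tungsten's memory management can operate on raw off-heap (binary) memory instead of JVM objects. A minimal sketch of enabling it (the configuration keys exist since Spark 1.6; the application name and size are illustrative):

    import org.apache.spark.sql.SparkSession

    // Let Tungsten allocate and manage memory off the JVM heap,
    // avoiding Java object overhead and GC pressure.
    val spark = SparkSession.builder()
      .appName("tungsten-offheap-demo") // illustrative name
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "536870912") // 512 MB, in bytes
      .getOrCreate()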

Phase 2: Order-of-magnitude Faster (Spark 2.0)

  • Whole-stage Code Generation (a quick demo follows this list)
  • Vectorization
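
To make whole-stage code generation concrete before the detailed article: in Spark 2.0 the physical plan marks operators that were fused into a single generated function with a leading '*'. A minimal sketch (e.g., in spark-shell; the query is invented for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("wholestage-demo") // illustrative name
      .master("local[*]")
      .getOrCreate()

    // A simple aggregation over a generated range of rows.
    val df = spark.range(1000L * 1000)
      .selectExpr("id % 10 AS key")
      .groupBy("key")
      .count()

    // Fused operators appear with a '*' prefix in the physical plan,
    // e.g. *HashAggregate, *Project, *Range.
    df.explain()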

In the following days, I will introduce the development of Spark in a series of articles that follows this structure:

  • Basic concepts of these improvements
  • Memory Management and Binary Processing
  • Cache-aware Computation
  • Code Generation
  • Whole-stage Code Generation
  • Vectorization