Understanding Paxos in 10 Minutes

Paxos is an algorithm to solve the consensus problem in the distributed system. Consensus involves multiple servers in the distributed environment agreeing on values. These values we want to get consensus can be a real number (e.g., Master node ID in the leader election problem), a Log (log replications) or multiple logs (database transaction logs). Paxos is trying to make sure that the values are consensus among multiple machines even some machines fail.

At first, we introduce two terms before we go into the detail of the paxos algorithm.

Process: one of the machine in the system

Client: a machine who isn’t a part of the system, is the one who asking what the value is or asking to write a new value in the system.


Now, we illustrate the paxox in the following two aspects: Reading and Writing.

Reading part:

To read a value from the system, a client asks all processes in the system that what is the current value, then the client takes the value agreed by majority of the processes. We can see that the system work once more than half the machines still works and some machines own the out-of-the-date values will not impact the result.

Writing part:

Writing consists of 2 phases:

First phase, the client will contract one process in the system and tell the process one value. We call this process the proposer. The proposer will broadcast the value to all the other processes in the system. The proposal is labelled with a unique sequence number. Highest number indicates the latest value. Once each other process receives the proposal, it will compare the number with the largest number received so far. If it is the largest, it means the value is the latest one. This process will accept this value. Otherwise, it will reject the proposal and tell the proposer the latest value and its corresponding sequence number.

Second phase, after the proposer get the responses from majority of the processes, no matter the other node accept or reject the proposal of the proposer process, the proposer knows the latest value and the sequence number. It will send the accept message to those responding processes. The message contain the latest value and the sequence number. The process receives the accept message, if it find that the sequence number is the largest, it will accept it. Otherwise, it means the value proposed by the proposer is not the latest value. Therefore, the process will tell the proposer will restart the proposal process with the latest value.

The slides from stanford is a very good introduction on paxos. https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf

The basic process is shown in the Figure

Scala for impatient Reading notes

scala for impatient

Table of contents

2.Control structure and function
3.Array and operations
4.Map and Array
7.Package and import
9.File and regular expression
12.High-order function
14.Pattern matching and case class
17.Type parameters
18.Advanced type
21.Implicit conversion and implicit parameter

## Basic ##

Val defines a const, var defines a variable;
> basic numerical types, Boolean, String, classes;
> scala do not differenciate basic types and reference types like java;

Operators like (+-*/) are actually method in scala:

a + b <=> a.+(b)

function and method call:
> function call:

import scala.math._
> scala doesn’t have static method, its corresponding feature is singleton object;

> class in scala has a companion object, the methods in the companion object are similar to the static method in java;

> the brackets of the method can be omitted if there is no parameter;

Apply method
> a kind of grammar similar to method calling;

val hello = “Hello”
hello(1) // ‘e’
// equals to

> create the object of one class with the apply method in the companion object of the class;

val num = BigInt.apply(“999999”)

Some notes:
> String objective is usually converted implicitly to a StringOps object which own many string operators. Similar to Int, Double, Char -> RichInt, RichDouble, RichChar;


> O and C near the class name in scala doc, C is class and O is its companion object;

> function can be used as the parameter of the method;

def count(p: (char) => Boolean) : Int // p is a function accpet a char parameter and return True or False


## Control structure and function ##

2016 小结


技术方面自己研究了Akka和一些modern concurrency model。着手了在写一个基于actor的DB





2)继续完善我的actor db

3)新技术学习:code generation,container-based scheduling,新的scheduling相关的工作得跟进。


Get Started Scala-native

To get closer to bare metal, Scala-native is proposed. https://github.com/scala-native/scala-native

I build it in my own laptop with Mac osx system.

The environment setup can be referred here http://scala-native.readthedocs.io/en/latest/user/setup.html

I still met some problem when I run sbt in scala-native and I just record them for somebody also met these issues.

  • Cannot find sbt-cross plugin. You can download https://github.com/scala-native/sbt-cross and publish a local repository with sbt.
  • The path of the local publishing is ivy/local path. However, when building scala-native, the dependencies lookup path is ivy/cache. Once you want to use the local published repository, you can remove the folder of the same dependency under cache.

Building scala-native with:

Then you can run a native demo here by running:

In this program, it imports your published native-scala libraries like this:

import scalanative.native._, stdlib._, stdio._

It is a real native problem and you can check its dependencies with otool

otool -L demo/native/target/scala-2.11/demonative-out
/usr/local/opt/bdw-gc/lib/libgc.1.dylib (compatibility version 2.0.0, current version 2.3.0)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1226.10.1)

Akka-based Database for Graphql

I build a project called GraphqlAkka: https://github.com/hustnn/GraphqlAkka

It is an Akka implementation backend data store for Graphql.

Graphql is proposed by Facebook http://graphql.org/.

This project tries to utilize the Akka to implement an efficient, robust and large-scale backend data storage system for Graphql. The mobile applications from billions of devices can efficiently perform the query using the language provided by Graphql and get the answers from the unified GraphqlAkka store (huge amount of data) as faster as possible (our objective is minimizing this response time for each mobile request and also maximizing the data access throughput for all devices).

Based on the Akka implementation, each record in GraphqlAkka is an actor in Akka which means that the origin optimizations need to be revisted and re-optimized in this scenario. It is challenging but it is also very interesting.

Techniquies we used in the project:
Graphql Akka core, Akka sharding, Akka persistent(levelDB or Cassandra), Akka http.

I have already attempted to show how to support graphql by utilizing akka actor as the backend data store which is shown in the current codebase. Currently, the query is parsed in our akka http layer, then the data request is sent to the corresponding actor and get the responses. It shows that my prososal is possbile and I will continue to finish this interesting project.


After running it, you can input the query like this and get the result:

curl -X POST localhost:8080/graphql \ -H "Content-Type:application/json" \ -d '{"query": "query Test($humanId: String!){human(id: $humanId) {name, homePlanet, friends {name}}}", "variables": {"humanId": "1000"}}'

Stay tunned for a complete prototype.

I am implementing it use my leasure time. It is very interesting and challenging to if we want to get a highly efficient store for Graphql. Welcome to join me if you are interested in it.

Heterogeneous Computing with Actor Model

Actor-based model is a simple and high-level abstractions for parallel and concurrent programming. It is an asynchronous, non-blocking and efficient event-driven programming model. The actor is the very lightweight event-driven processes in the actor-based model which naturally fits the concurrency and parallelism (several million actors per GB of heap memory [1]). These actors can be distributed, deployed and scheduled into different computing devices, different machines and even different times slices of the physical devices without any impact on the correctness of the results. The actors are completely decoupled in terms of the space (distributed) and time (asynchronous).

Akka is an actor-model implementation in the JVM platform. Everything in Akka designed to work in a distributed environment. Each actor instance in Akka is the basic particle size and Akka guarantee each operation of the actor instance is atomic. All interactions of actors use pure message passing. Every behavior is encapsulated inside the one actor which makes everything inside Akka is asynchronous. One actor cannot block the behaviors of the other actors. You can benefit synchronicity and lockless concurrency without hassle from Akka.

Recently, modern many core hardware such as Intel Xeon Phi, graphics processing unit (GPUs) or many other coprocessors are increasingly available for general-purpose computation [2, 3, 4]. These heterogeneous designs are already proved to be very efficient for many applications, such as Deep Learning [5] and Database [6]. Due to actors is designed from scratch especially for the highly distributed and parallel environment, they tend to be a better fit for parallel processing units. We are developing a new heterogeneous computing platforms based on Akka. It can simplify the development on these heterogeneous architectures and can provide efficient computing ability with the help of the power Akka.

[1] http://akka.io/docs/
[2] Manyfold Actors: Extending the C++ Actor Framework to Heterogeneous Many-Core Machines using OpenCL
[3] Efficient Query Processing on Many-core Architectures: A Case Study with Intel Xeon Phi Processor
[4] Revisiting Actor Programming in C++
[5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
[6] http://blazingdb.com/

Remote Debug With SBT and Intellij Idea

In this Blog, I will show you how to leverage the SBT and IntelliJ Idea to debug scala applications remotely.

1, we open the run/debug configuration setting window and create a new Remote configuration as the below picture shows.

remote debug

2, we copy the the debug parameters.

debug vm options

3, we open the project and config the SBT tool.
We add the command we copied just now to the VM parameters of SBT command.

sbt vm parameter settings

4, we start to run the sbt by click the run button.

run sbt console

5. We run the debug and it shows the remote debug connection is successfully connected.

connect remote debug

6. At last, we input run command in the sbt console. Then the program will start to run and will stop in the break points where you set.

sbt run

Why I like Scala

Scala is a functional and OO language at the same time. Functional languages allows the programmer to perform complex operations in a standardized way regardless the object type. OO languages includes many  convenient ideas from object oriented area which can break the functional rules when used. Languages can either one of these two categories, but Scala mixes them together. I will introduce Scala programming from these two parts.

  • Functional Part

I mainly introduce some features provided by Scala that make Scala programmer more productive.

  1. Algebraic data types
  2. Pattern matching
  3. Higher order functions and combinators
  4. Ad hoc polymorphism with type classes using implicits
  5. Higher kinds of type constructor polymorphism
  6. Monadic programming
  • OO Part

I assume all of you are familiar with object oriented programming. Here, I introduce a specific set of features for developers coming from JAVA which want to become more productive.

  1. Better type system (Complicated type system)
  2. Type inference
  3. Mixin based inheritance using traits
  4. Better treatment of type variance
  5. Abstract type members and abstract vals
  6. Better module system

Overview of Spark 2.0

As a general data processing framework, Apache Spark becomes very popular these days. Its performance is the key concern for many developers as well as researchers. For examples, some researchers from database area pointed out that MapReduce, Nosql and Spark miss many important features included in the database areas (e.g., number 3 in http://www.hpts.ws/papers/2015/lightning/HPTS-stonebraker.pdf). However, with the development of Spark, it begins to integrate more and more ideas from modern compilers and MPP databases. The performance improving plan is lunched as “Project Tungsten” since Spark 1.5 (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html). It achieves significant performance improvements compared with spark 1.4 (more than 2x performance improvement for aggregation, sorting and shuffle). With the continuous development (Spark 1.6), till today, Spark 2.0 has further achieved 10x performance improvement (https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html).

In order to learn how these performance improvements are achieved, I plan to write a series of articles to introduce the principles and details behind these numbers. The performance improvements are mainly achieved in 2 phases.

Phase 1: Foundation (Spark 1.5, 1.6)

  • Memory Management and Binary Processing
  • Cache-aware Computation
  • Code Generation

Phase 2: Order-of-magnitude Faster (Spark 1.6)

  • Whole-stage Code Generation
  • Vectorization

In the following days, I will introduce the development of Spark with a series of articles by following these kind of structures.

  • Basic concepts of these improvements.
  • Memory Management and Binary Processing
  • Cache-aware Computation
  • Code Generation
  • Whole-stage Code Generation
  • Vectorization.