Sunday, February 05, 2017

Programming with Apache Spark and Cassandra -draft

Putting together the knowledge gained so far, here are frequent questions that many may ask, and that we have asked ourselves.
1.1         What is the need for using Spark?
Spark gives you horizontal scalability in a programmer-friendly way.
1.2         But what about other options ?
There are other options as well. I have listed them below with their level of granularity, which highlights Spark's place in the architecture.

Request-level scaling (usually HTTP requests)
Granularity: request. Works well for request-response, client-server protocols. Works also well in the context of microservices on the application program side. However, to scale the processing inside the application programs this is inadequate.
Task managers (Celery, other MQ-based)
Granularity: task. Helps to scale processing in the application program and takes care of task handling. However, the onus is on the developer to split the application logic into independent tasks, and usually only the simplest things are really split into tasks. Combining the outputs is an equally hard problem.
Cluster computing (Apache Spark, Hadoop)
Granularity: application/function. Helps to scale processing in the application layer across nodes. Takes care of all of the above. The onus is still on the developer to use this properly. However, if the few core API calls* (map, foreach, reduce and groupBy/partitionBy) are used, the program can be written as if it were running in a single node, in a single thread. The system manages shared RAM across multiple nodes, shared cores, task scheduling, multithreading etc. *P.S. - Spark has an extensive library for machine learning as well, which could be the gateway for the future.
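To make that concrete, the functional style behind those calls can be sketched with plain Python builtins (no Spark involved; Spark applies the same ideas, just spread across a cluster):

```python
from functools import reduce
from itertools import groupby

words = ["spark", "scales", "simply", "cassandra", "stores"]

# map: transform every element independently (trivially parallelizable)
lengths = list(map(len, words))

# reduce: combine elements pairwise into a single result
total_chars = reduce(lambda a, b: a + b, lengths)

# groupBy: bucket elements under a key, here the first letter
by_letter = {k: list(g) for k, g in groupby(sorted(words), key=lambda w: w[0])}
```

Because map and reduce take pure functions over independent elements, the runtime is free to run them on one thread or a thousand cores; that is what lets Spark hide the distribution from the programmer.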
Threads
Granularity: function. Helps to scale the processing inside a single node across cores. Usually has to be done with care to avoid the complexity of threading-related problems, which many programmers are unaware of.
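A minimal sketch of function-level scaling inside one node, using Python's stdlib thread pool (the pool hides thread lifecycle handling, but any shared mutable state would still need the care mentioned above, which is why the worker here is a pure function):

```python
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # a pure function: no shared state, so no locking headaches
    return n * n

# four worker threads inside one process, one node
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(8)))
```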
Green threads
Granularity: function/stack. Example: greenlets in Python. Good for switching stacks in IO-bound applications, for example a socket server. Not really parallel, but the wait time in one stack frame can be used by other stacks waiting to execute. Rather too specific for general-purpose usage.
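The greenlet idea (one OS thread, many cooperating stacks that yield while waiting on IO) can be sketched with stdlib asyncio instead of the greenlet library:

```python
import asyncio

async def fetch(name, delay, finished):
    # simulated IO wait; while this coroutine waits, others get to run
    await asyncio.sleep(delay)
    finished.append(name)

async def main():
    finished = []
    # both coroutines share one OS thread: cooperative, not parallel
    await asyncio.gather(fetch("slow", 0.05, finished),
                         fetch("fast", 0.01, finished))
    return finished

order = asyncio.run(main())
```

The "fast" task finishes first even though "slow" was started first, because the wait time of one stack is used by the other.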

How stable are Apache Spark and Apache Cassandra?
Speaking from our limited experience in running the prototype, all of the Spark and Cassandra JVMs survived 20 days of load runs, network problems and application exceptions we threw at them. And that too in a low-end cloud lab. They look to be well written.
1.5         What is the most important thing to take care when using Apache Cassandra ?
Data modelling and, connected to that, the primary key and partition key design. It is important to design your primary key and partition key so that writes are distributed and reads are faster. This is explained well by the Cassandra expert here ->
The hash of the partition key is used by Cassandra to identify the node in which to store the row. So choosing a partition key that distributes the load equally among nodes prevents write hotspots. An example can be seen in the performance run page.
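A toy model of that placement rule (the node names and the 3-node ring are made up, and md5 merely stands in for Cassandra's Murmur3 token function):

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c"]  # hypothetical ring of 3 nodes

def owner(partition_key: str) -> str:
    # hash the partition key and map the token onto a node, the way
    # Cassandra maps Murmur3 tokens onto its token ring
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return NODES[token % len(NODES)]

# a high-cardinality key (e.g. a user id) spreads writes across nodes...
spread = Counter(owner(f"user-{i}") for i in range(99))

# ...while a low-cardinality key (e.g. the current date) sends every
# write for the day to one node: a write hotspot
hotspot = {owner("2017-02-05") for _ in range(99)}
```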
P.S. - There are a few trivial but important things, like writing the commit log and the data (SSTables) to different disk partitions. This link gives basic info about the write path.
1.6         What is the most important thing to take care when using Apache Spark?
We have not come across a single most important thing as such, but here are a couple of pointers:
1.    Avoid doing any major work in the Spark driver; rdd.collect() or the somewhat better rdd.toLocalIterator() are not good ideas and don't scale; you will get an OOM error soon
2.    There is no way to share state like counters etc. between the driver and the workers, though in the code it may seem so. The only way is via accumulators, and those the workers cannot read
3.    The way you partition the RDD may be important for performance, especially for operations like groupBy etc.; need to test and understand this better
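Pointer 1 can be illustrated without a cluster: the scalable pattern is to aggregate inside each partition on the workers and ship only the small partial results to the driver, rather than collecting all raw data there. A pure-Python sketch, where the nested lists stand in for RDD partitions on different workers:

```python
from functools import reduce
from operator import add

# pretend each inner list lives in a partition on a different worker
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# anti-pattern: ship every element to the driver, then sum there
# (this is what rdd.collect() amounts to; driver RAM becomes the limit)
collected = [x for part in partitions for x in part]
driver_sum = sum(collected)

# scalable pattern: reduce within each partition on the workers, then
# ship one number per partition (the shape of rdd.reduce(add))
partial_sums = [reduce(add, part) for part in partitions]
total = reduce(add, partial_sums)
```

Both give the same answer, but the second moves only one value per partition to the driver, no matter how large the partitions grow.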
