Distributed Machine Learning with Apache Spark

We live in a time when new computer algorithms help us sense the world. When a cat appears ahead of us, an average human knows 0.5 seconds later that it is a black cat crossing their path. For a machine to achieve the same, thousands of operations need to be performed just to get close to the accuracy of a human being.

This post addresses a practical engineering problem: how do we get closer to human cognitive performance at an affordable cost? Several approaches are being actively developed in the machine learning field. The one I would like to introduce here is distributed machine learning on virtualized Apache Spark and Apache Hadoop infrastructure. Many compute nodes with cost-optimized configurations should together form a system that can learn from training data and perform inference on test data.

The most promising class of machine learning algorithms is deep learning. These algorithms are especially known for their results in image recognition, where complex neural networks are able to reach human performance.

From the most popular deep learning frameworks I selected Caffe and TensorFlow to see how they perform on the popular CIFAR10 dataset. Detailed information about the dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html

In addition, I compared performance on CPU and GPU to see whether we can confirm the general expectation that a GPU speeds up both the training and the testing phase of deep learning.

Please note that Caffe and TensorFlow cannot run directly on an Apache Spark cluster. Data needs to be passed between Caffe or TensorFlow and Spark's resilient distributed datasets (RDDs) residing in memory. These data transformations are necessary in order to benefit from the Apache Spark infrastructure. Two community projects that make these data transformations possible, CaffeOnSpark and TensorFlowOnSpark, are available on GitHub.
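To make the idea of passing data between a distributed dataset and a deep learning framework more concrete, here is a minimal sketch in plain Python. It is not the CaffeOnSpark or TensorFlowOnSpark API and it does not use Spark at all. The record layout and the names `split_into_partitions`, `to_batches`, `feed_partition` and `train_step` are assumptions made up for illustration. The sketch only shows the general pattern: records are spread over partitions, and each partition is turned into batches that a framework running on that node could consume.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical record layout: (raw image bytes, class label).
Record = Tuple[bytes, int]
Batch = Tuple[List[bytes], List[int]]


def split_into_partitions(records: Sequence[Record], num_partitions: int) -> List[List[Record]]:
    """Round-robin split, standing in for how a distributed dataset is spread over workers."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def to_batches(partition: Sequence[Record], batch_size: int) -> Iterator[Batch]:
    """Turn one partition into (images, labels) batches of at most batch_size records."""
    for start in range(0, len(partition), batch_size):
        chunk = partition[start:start + batch_size]
        yield [image for image, _ in chunk], [label for _, label in chunk]


def feed_partition(
    partition: Sequence[Record],
    batch_size: int,
    train_step: Callable[[List[bytes], List[int]], None],
) -> int:
    """Hypothetical per-worker loop: hand every batch to the local framework, return the step count."""
    steps = 0
    for images, labels in to_batches(partition, batch_size):
        train_step(images, labels)
        steps += 1
    return steps


# Ten fake CIFAR10-sized records (3072 bytes = 32 x 32 pixels x 3 color channels).
records = [(bytes(3072), i % 10) for i in range(10)]
partitions = split_into_partitions(records, num_partitions=4)

# The "framework" here is just a function that remembers which labels it received.
seen_labels: List[int] = []
steps_per_partition = [
    feed_partition(p, batch_size=2, train_step=lambda images, labels: seen_labels.extend(labels))
    for p in partitions
]
print(steps_per_partition)
```

In a real cluster the partitions live on different nodes and the training step is a call into Caffe or TensorFlow. Handling that part is what the two community projects are for.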



The infrastructure used for my experiments included VMware vSphere, Cloudera Hadoop and Apache Spark software components. The compute cluster consisted of one physical server hosting four worker nodes as virtual machines. Three additional virtual machines, hosted on different infrastructure, provided administrative services. Please refer to Figure 1.

Figure 1: Testing Apache Spark Cluster

Direct attached storage on spinning SAS hard drives was used to provide capacity for data. This way we get the best performance from virtualized storage. Please refer to Figure 2. The four GPUs of the two graphics cards were passed through to the virtual machines via vSphere PCIe passthrough.

Figure 2: Direct Attached Storage Layout

Complete hardware and software specifications are available for your reference in the following tables.


| Component | Description | Quantity |
|---|---|---|
| System | 2 CPU socket x86_64 server | 1 |
| CPU | E5-2667 v3 @ 3.2 GHz | 2 |
| GPU | M60 | 2 |
| HDD | SAS 600G 3.5” 15K.7 ST3600057SS | 12 |
| NIC | 1 10G LAN | 1 |

Software Versions

VMware vSphere 6.5d
–        VMware vCenter Server Appliance 6.5.0-5318154
–        VMware ESXi, 6.5.0, 5310538

Cloudera Hadoop
–        Cloudera Manager 5.11.2
–        Apache Hadoop 2.6.0
–        Apache Spark 2.1

CentOS 7.3, Linux 3.10.0-514.el7.x86_64

Deep Learning Frameworks
–        TensorFlow 1.4.0rc0
–        TensorFlowOnSpark, built from source, October 2017
–        CaffeOnSpark (Caffe included), built from source, October 2017



VM and Cloudera Spark/Hadoop Roles Mapping

VM and its CDH components:

CDH1 (master node)
–        HDFS NameNode
–        HDFS HttpFS
–        HDFS Balancer
–        Cloudera Management Service: Activity Monitor, Alert Publisher, Event Server, Host Monitor, Reports Manager, Service Monitor
–        Spark 2 Gateway
–        Spark 2 History Server
–        YARN NodeManager
–        ZooKeeper Server

CDH2 (master node)
–        HDFS SecondaryNameNode
–        HDFS NFS Gateway
–        Spark 2 Gateway
–        ZooKeeper Server

CDH3 (master node)
–        ZooKeeper Server

CDH4-7 (worker nodes)
–        HDFS DataNode
–        Spark 2 Gateway
–        YARN NodeManager

VM profile and related Hadoop/Spark settings

| Parameter | Value |
|---|---|
| vCPU | 8 |
| vRAM | 100GB |
| OS disk | 50GiB |
| Data disks | 3x300GiB |
| Cloudera Hadoop | 5.11 |
| dfs.blocksize | 256MiB |
| dfs.replication | 3 |
| yarn.nodemanager.resource.cpu-vcores | 8 |
| yarn.nodemanager.resource.memory-mb | 60GiB |
| Oracle JDK | 1.8 |
| Apache Spark | 2.1 |
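The settings above translate into a total resource budget for the cluster. The short calculation below is a back-of-the-envelope sketch, assuming the VM profile applies to each of the four worker nodes (CDH4-7) and counting only those four nodes. The function name `cluster_capacity` is made up for this example, and the usable HDFS figure ignores file system overhead and reserved space.

```python
def cluster_capacity(
    worker_nodes: int,
    vcores_per_node: int,
    yarn_mem_gib_per_node: int,
    data_disks_per_node: int,
    data_disk_gib: int,
    dfs_replication: int,
) -> dict:
    """Aggregate YARN and HDFS capacity over identical worker nodes (sizes in GiB)."""
    raw_hdfs_gib = worker_nodes * data_disks_per_node * data_disk_gib
    return {
        "yarn_vcores": worker_nodes * vcores_per_node,
        "yarn_memory_gib": worker_nodes * yarn_mem_gib_per_node,
        "hdfs_raw_gib": raw_hdfs_gib,
        # Every block is stored dfs.replication times, so usable space shrinks accordingly.
        "hdfs_usable_gib": raw_hdfs_gib / dfs_replication,
    }


capacity = cluster_capacity(
    worker_nodes=4,            # CDH4-7
    vcores_per_node=8,         # yarn.nodemanager.resource.cpu-vcores
    yarn_mem_gib_per_node=60,  # yarn.nodemanager.resource.memory-mb
    data_disks_per_node=3,     # data disks: 3x300GiB
    data_disk_gib=300,
    dfs_replication=3,         # dfs.replication
)
for name, value in capacity.items():
    print(f"{name}: {value}")
```

Under these assumptions YARN can schedule 32 vcores and 240 GiB of memory across the workers, and the 3600 GiB of raw data disk space holds about 1200 GiB of data once replicated three times.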

Test results

| Dataset | DL framework | Model/Solver | Acceleration type | Training + test time | GPU:CPU speedup | Accuracy |
|---|---|---|---|---|---|---|
| CIFAR10 | Caffe | cifar10_quick_solver | CPU | 4561 | | |
| CIFAR10 | Caffe | cifar10_quick_solver | GPU | 580 | 7.8x | 82% |
| CIFAR10 | TensorFlow | ImageNet | CPU | 343 | | |
| CIFAR10 | TensorFlow | ImageNet | GPU | 145 | 2.3x | 86% |
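The GPU:CPU ratios can be recomputed directly from the measured run times. The snippet below is a small sketch of that arithmetic. The `speedup` helper and the dictionary layout are my own naming for this example, and the numbers are the training + test times from the results above. The exact ratios come out slightly higher than the listed 7.8x and 2.3x, which suggests the listed values were truncated rather than rounded.

```python
def speedup(baseline_time: float, accelerated_time: float) -> float:
    """How many times faster the accelerated run is compared to the baseline run."""
    return baseline_time / accelerated_time


# Training + test run times from the results, same unit for every entry.
run_times = {
    "Caffe": {"CPU": 4561, "GPU": 580},
    "TensorFlow": {"CPU": 343, "GPU": 145},
}

# GPU over CPU speedup within each framework.
gpu_speedup = {
    framework: speedup(times["CPU"], times["GPU"])
    for framework, times in run_times.items()
}

# How much faster TensorFlow finished than Caffe on the same kind of hardware.
framework_ratio = {
    device: speedup(run_times["Caffe"][device], run_times["TensorFlow"][device])
    for device in ("CPU", "GPU")
}

for framework, ratio in gpu_speedup.items():
    print(f"{framework}: GPU is {ratio:.2f}x faster than CPU")
for device, ratio in framework_ratio.items():
    print(f"{device}: TensorFlow is {ratio:.1f}x faster than Caffe")
```

The same helper also shows the gap between the two frameworks: on CPU the TensorFlow run finished roughly 13 times faster than the Caffe run, and on GPU 4 times faster.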

You can see quite significant differences in run time between CaffeOnSpark and TensorFlowOnSpark. Peak prediction accuracy was similar for both frameworks. We can therefore declare TensorFlowOnSpark the winner of this test. Another interesting observation from the test results is how much the GPU accelerated the computation. A speedup of more than 7 times is probably worth it if we consider the hardware price. When we see “just” a 2 times speedup, we might doubt whether to purchase a GPU card at all. Modern CPU and GPU prices are quite comparable: both can range from several hundred USD to 10,000 USD.
