Muhammad Afzal
Cisco Employee

Preface

This technical blog implements Apache Submarine on the Cisco UCS Integrated Infrastructure for Big Data and Analytics (https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/Cisco_UCS_Integrated_Infrastructure_for_Big_Data_with_Hortonworks_28node.html) and the Cisco Data Intelligence Platform (https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/cisco_ucs_cdip_cloudera.html). At the time this PoC was implemented in our lab environment, Apache Submarine was in tech preview, meaning it was not intended for production-grade setups. This blog extends an existing Hadoop cluster, deployed as outlined in the CVDs linked above, with Apache Submarine.

 

Overview

Hadoop Submarine is the newest machine learning subproject in Apache Hadoop. It allows infrastructure engineers and data scientists to run deep learning frameworks such as TensorFlow and PyTorch on a resource management platform, in this case YARN.

Hadoop 2.x added Docker container support to YARN. Hadoop Submarine builds on this, using YARN to schedule and run distributed deep learning frameworks inside Docker containers.

To make distributed deep learning/machine learning applications easy to launch, manage, and monitor, the Hadoop community initiated the Submarine project along with other improvements such as first-class GPU support, Docker container support, container DNS support, and scheduling improvements.

These improvements make running distributed deep learning/machine learning applications on Apache Hadoop YARN as simple as running them locally, letting machine learning engineers focus on algorithms instead of the underlying infrastructure. By upgrading to the latest Hadoop, users can run deep learning workloads alongside ETL and streaming jobs on the same cluster, gaining easy access to the data on that cluster and better resource utilization.

Caveats and Limitations

  • The Submarine setup included in this document is for PoC purposes, and Submarine is in tech preview.
  • Zeppelin 0.9.0-SNAPSHOT is not part of the HDP Hadoop distribution. It has to be set up separately after the whole cluster is deployed via Ambari. Note also that Zeppelin 0.9.0 was not GA when this document was written.
  • It is only validated and tested with YARN.
  • It is tested with nvidia-docker v1. However, Submarine also supports nvidia-docker v2.
  • hadoop-yarn-submarine-3.2.0.jar is not part of HDP 3.0/3.1. You need to download it from an Apache mirror as described later in this document.

Prerequisites

  • All YARN service related dependencies must be properly set up.
  • Docker must be configured; if GPUs are to be used, Docker-related dependencies such as the NVIDIA driver, CUDA, and nvidia-docker must also be configured.
  • Container communication among hosts must be configured. In this reference design, Calico is used. Without it, containers distributed across nodes cannot communicate with each other; for example, a PS (parameter server) running on one host could not reach a worker running on another host.
  • Without proper YARN Registry DNS configuration, you cannot get distributed AI/ML or Submarine working via YARN. Masters, workers, and parameter servers communicate with each other using DNS.
  • etcd, a distributed, reliable key-value store, must be set up. The key-value store is required for setting up Calico networking. etcd is set up as a quorum of three nodes.
  • A Docker private registry must be set up if one does not already exist (a minimal example follows this list).
  • Use submarine-installer to install dependencies; they can also be set up manually. In this reference design, submarine-installer is used to install Calico and etcd.
  • Docker GPU scheduling and isolation must be configured in YARN for GPU-enabled nodes. Also ensure that Docker is using the cgroupfs cgroupdriver option if YARN cgroups support is enabled.
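If a private registry does not already exist, a minimal sketch of starting one with the standard registry:2 image is shown below. The host name and port only mirror the registry (linuxjh.hdp3.cisco.com:5000) referenced later in this document; adjust them for your environment. Each node must also trust the registry, for example via the insecure-registries entry in /etc/docker/daemon.json shown later.

# docker run -d -p 5000:5000 --restart=always --name registry registry:2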

Submarine Architecture

The Hadoop community initiated the Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor.

Submarine Components

  1. Submarine Computation Engine
    • Submits customized deep learning jobs to YARN.
  2. Submarine ecosystem integration
    • Submarine-Zeppelin integration – Allows data scientists to code inside Zeppelin notebooks and submit/manage training jobs from Zeppelin.
    • Submarine-Azkaban integration – Allows data scientists to submit a set of workflow tasks with dependencies directly to Azkaban from Zeppelin notebooks.
    • Submarine installer – Installs Submarine and its dependent components.

 

[Figure: Submarine architecture]

Software Versions 

The following software versions were used to validate the PoC.

Component             Version
Operating System      Red Hat Enterprise Linux 7.6
Docker                1.13.1
Nvidia-docker         v1
Calico                v1.11.7
Calicoctl             v3.2.3
etcd                  v3.3.9
Zeppelin              0.9.0-SNAPSHOT
HDP                   3.0.1
Hadoop                3.1.1
YARN                  3.1.1
Java                  1.8.0

Configuration Instructions   

Download Submarine-Installer

  1. Download submarine-installer by running the following command on all the nodes.
# git clone https://github.com/hadoopsubmarine/submarine-installer.git

Setup Submarine-Installer

Submarine-installer can install all dependencies such as etcd, docker, nvidia-docker, and so on directly from the network. However, in cases where servers are not connected to the internet, a download server can be set up.

Set up the download server for submarine-installer by making the following changes in /submarine-installer/install.conf:

DOWNLOAD_SERVER_IP="10.16.1.58"
DOWNLOAD_SERVER_PORT="19000"
LOCAL_DNS_HOST="10.16.16"
ETCD_HOSTS=(10.16.1.31 10.16.1.32 10.16.1.33)

Note: A minimum of three servers is required for setting up etcd. In this case, rhel1, rhel2, and rhel3 are selected for setting up etcd.

This document sets up Submarine in an already installed HDP environment. Therefore, do not install anything else, as we assume YARN, Docker, the NVIDIA driver, CUDA, and nvidia-docker are already installed and properly configured. Verify this by running the following command:

# nvidia-docker run --rm nvidia/cuda:9.2-base nvidia-smi 

Run the installer with the following command.

# ./submarine-installer/install.sh

Enter "y". The main menu will be launched. Enter "6" to start the download server, then enter "y" to start the download HTTP server.

Picture1.png

This will download all the dependencies and packages into the /submarine-installer/downloads folder and start the HTTP server, which is available on the download server IP and port specified in /submarine-installer/install.conf. Do not close the submarine-installer running on this server until all servers are configured.

Picture1.png

When submarine-installer/install.sh is executed on the other servers and components are selected for installation, they are automatically downloaded and installed from the download server.

Install Components

In this reference design, we will install only etcd and the Calico network using submarine-installer. This document deploys Submarine in an already configured and deployed HDP environment where it is assumed that Docker, the NVIDIA driver, nvidia-docker, and YARN services are already set up. Any Submarine-specific configuration of already installed components is covered in subsequent sections.

Install ETCD

etcd is a distributed, reliable key-value store for the most critical data of a distributed system; here it is used for registration and discovery of services running in containers. Alternatives such as ZooKeeper or Consul can also be used; however, they were NOT tested in this reference design.

Perform the following steps to install components.

Run ./submarine-installer/install.sh on rhel1, rhel2, and rhel3 to install and set up etcd as shown in the figure below. Type 2 to install components, then type 1 to install etcd and hit enter.

Picture1.png

 

Note: etcd must be installed on three servers to form a cluster. Here, we installed etcd on rhel1, rhel2, and rhel3.

After the install is completed, start the etcd service on all three servers by running the following commands.

# systemctl start etcd.service
# systemctl status etcd.service
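Optionally, enable the service on all three nodes so etcd starts automatically after a reboot (a standard systemd step, not part of the installer output):

# systemctl enable etcd.service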

Verify the etcd installation by running the following command.

# etcdctl cluster-health

Picture1.png

Verify cluster membership.

# etcdctl member list

Picture1.png

Configure Docker

Modify the /etc/docker/daemon.json file. If it does not exist, create it with the following contents on all the nodes.

{
    "live-restore" : true,
    "debug" : true,
    "insecure-registries": ["${image_registry_ip}:5000"],
    "cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
    "cluster-advertise":"${localhost_ip}:2375",
    "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip}"],
    "hosts": ["tcp://${localhost_ip}:2375", "unix:///var/run/docker.sock"]
}

Replace the variables in the daemon.json file according to your environment, as shown below for reference.

[root@rhel4 submarine-installer]# cat /etc/docker/daemon.json
{
    "live-restore" : true,
    "debug" : true,
    "insecure-registries" : ["linuxjh.hdp3.cisco.com:5000"],
    "cluster-store" : "etcd://10.16.1.31:2379,10.16.1.32:2379,10.16.1.33:2379",
    "cluster-advertise" : "10.16.1.34:2375",
    "hosts" : ["tcp://10.16.1.34:2375", "unix:///var/run/docker.sock"]

}

Reload the docker daemon and restart the docker service.

# systemctl daemon-reload
# systemctl restart docker

As previously noted in the prerequisites, ensure that Docker is using the cgroupfs cgroupdriver option if YARN cgroups support is enabled.

vi /usr/lib/systemd/system/docker.service

Find and set the cgroupdriver option:

--exec-opt native.cgroupdriver=cgroupfs \
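For reference, a sketch of how this option typically appears on the ExecStart line of the docker.service unit shipped with the RHEL 7 docker 1.13.1 package (the exact binary name and surrounding options may differ in your environment):

ExecStart=/usr/bin/dockerd-current \
          --exec-opt native.cgroupdriver=cgroupfs \
          (remaining options unchanged)

After editing the unit file, run systemctl daemon-reload and systemctl restart docker again for the change to take effect.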

Install Calico

Calico creates and manages a flat Layer 3 network, and each container is assigned a routable IP.

To install Calico on the specified servers, run ./submarine-installer/install.sh. Enter "3" to install the Calico network and press the Enter key as shown in the figure below.

Picture1.png

Start the calico-node service.

# systemctl start calico-node.service
# systemctl status calico-node.service

Verify the Calico network by running the following command. It will show the status of all hosts except the localhost.

[root@rhel17 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |              INFO              |
+--------------+-------------------+-------+------------+--------------------------------+
| 10.16.1.31   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.40   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.41   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.43   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.44   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.34   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.35   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.36   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.37   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.38   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.58   | node-to-node mesh | up    | 2019-08-28 | Established                    |
+--------------+-------------------+-------+------------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.

[root@rhel17 ~]#
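Optionally, the Calico IP pool that containers are allocated addresses from can be inspected with calicoctl (assuming calicoctl v3.x pointed at the same etcd datastore):

# calicoctl get ippool -o wide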

Test the Calico network. The Calico network install also contains a script that provisions two Docker containers on two different nodes and runs a ping between them. It can fail if the other node is not yet set up with Calico. You can perform the following manual steps after setting up Calico.

# docker network create --driver calico --ipam-driver calico-ipam calico-network
# docker network ls

Picture1.png

Create a container on node 1 on the new network.

docker run --net calico-network --name workload-A -tid busybox

Create a container on node 2 on the same network.

docker run --net calico-network --name workload-B -tid busybox

Ping the node 2 container from the node 1 container.

docker exec workload-A ping workload-B

Picture1.png

YARN Configuration for Calico Network

Add the Docker network created with the Calico driver to yarn.nodemanager.runtime.linux.docker.allowed-container-networks in YARN.

In Ambari, click YARN > CONFIGS > ADVANCED, filter for allowed-container-networks, and add calico-network as shown.

Picture1.png

Click SAVE and restart all affected services.
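For environments managed outside Ambari, the same change corresponds to the following property in yarn-site.xml. This is a sketch that assumes the default allowed networks (host, none, bridge) are kept alongside the new Calico network:

yarn.nodemanager.runtime.linux.docker.allowed-container-networks=host,none,bridge,calico-network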

Verify nvidia-docker

In this reference design, nvidia-docker v1 is used. It is assumed that the NVIDIA driver, CUDA, and nvidia-docker are properly configured. Before proceeding further, perform the following steps to verify that the environment is ready for running GPU-enabled distributed AI/ML.

Perform the following tests:

Test-1:

 

[root@rhel14 ~]#  nvidia-smi
[root@rhel14 ~]#  nvidia-docker run --rm nvidia/cuda:9.2-base nvidia-smi

Test-2:

 

 

[root@rhel14 ~]# nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
root@0b34d22fac56:/notebooks# python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
2019-08-30 14:17:41.424470: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-08-30 14:17:41.812712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:5e:00.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
2019-08-30 14:17:41.812760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2019-08-30 14:17:42.291711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-30 14:17:42.291763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0
2019-08-30 14:17:42.291775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N
2019-08-30 14:17:42.292189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/device:GPU:0 with 14879 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0)
True
>>>

Create Docker Image for Submarine

 

Create the following Dockerfile.

 

[root@linuxjh dockerfile]# cat tf-1.8.0-gpu.Dockerfile
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends  --allow-change-held-packages \
        build-essential \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        curl \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        python \
        python-dev \
        rsync \
        software-properties-common \
        unzip \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

RUN pip --no-cache-dir install \
        Pillow \
        h5py \
        ipykernel \
        jupyter \
        matplotlib \
        numpy \
        pandas \
        scipy \
        sklearn \
        && \
    python -m ipykernel.kernelspec

# Install TensorFlow GPU version.
RUN pip --no-cache-dir install \
    http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
RUN apt-get update && apt-get install git -y

RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# Download hadoop-3.1.1.tar.gz
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
RUN tar zxf hadoop-3.1.1.tar.gz
RUN mv hadoop-3.1.1 hadoop-3.1.0

# Download jdk which supports kerberos
#RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
#RUN tar xzf jdk8.tar.gz -C /opt
#RUN mv /opt/jdk* /opt/java
#RUN rm jdk8.tar.gz
#RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
#RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100

#ENV JAVA_HOME /opt/java
#ENV PATH $PATH:$JAVA_HOME/bin
[root@linuxjh dockerfile]#

Build Docker image

 

 

[root@linuxjh dockerfile]# docker build -t tf-1.8.0-gpu . -f tf-1.8.0-gpu.Dockerfile

Tag the image and push it to the private Docker registry.

 

 

# docker tag tf-1.8.0-gpu linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:0.0.1
# docker push linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:0.0.1
# docker image ls

Run Submarine

 

To test machine learning with Submarine, the CIFAR-10 TensorFlow estimator example has been used.

Perform the following steps before you begin running Submarine standalone or distributed jobs.

Download the Submarine jar file

Download hadoop-yarn-submarine-3.2.0.jar from https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz; it is included in the Hadoop 3.2.0 release tarball. Extract the downloaded file and copy hadoop-yarn-submarine-3.2.0.jar to the /usr/hdp/3.0.1.0-187/hadoop-yarn/ folder. The hadoop-yarn-submarine-3.2.0.jar file should be copied to all NodeManager nodes.
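A minimal sketch of these steps, assuming the Apache archive mirror and the HDP 3.0.1 path shown above (in the Hadoop 3.2.0 tarball the jar is typically under share/hadoop/yarn/; the host list in the last command is illustrative, so replace it with your NodeManager nodes):

# wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
# tar xzf hadoop-3.2.0.tar.gz
# cp hadoop-3.2.0/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0.jar /usr/hdp/3.0.1.0-187/hadoop-yarn/
# for host in rhel2 rhel3 rhel4; do scp /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar ${host}:/usr/hdp/3.0.1.0-187/hadoop-yarn/; done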

Prepare Test Data

CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.

Download the TensorFlow models repository by running the following command.

git clone https://github.com/tensorflow/models/

Go to models/tutorials/image/cifar10_estimator and generate the data with the following command (requires TensorFlow to be installed):

 

python generate_cifar10_tfrecords.py --data-dir=cifar-10-data

Upload data to HDFS

 

 

# hdfs dfs -put cifar-10-data/ /tmp/cifar-10-data
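The upload can be verified with a listing; the same path is referenced later as --input_path in the job scripts:

# hdfs dfs -ls /tmp/cifar-10-data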

Run Standalone TensorFlow Job

 

Run the following command for a standalone TensorFlow job. For standalone jobs, specifying a parameter server and the number of workers is not needed. The command below provisions one container for the worker and one for TensorBoard; however, TensorBoard is optional and can be omitted.

[root@rhel4 ~]# cat standalone-tf.sh
hdfs dfs -rm -r /tmp/cifar-10-jobdir
yarn app -destroy standalone-tf-01
yarn jar /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar job run \
--name standalone-tf-01 \
--verbose \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0  \
--input_path hdfs://CiscoHDP/tmp/cifar-10-data \
--checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
--ps_docker_image  linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt  \
--tensorboard \
--tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt

[root@rhel4 ~]#

Note: YARN service does not allow multiple services with the same name, so before resubmitting a job with the same name, run: yarn application -destroy <service-name>
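For example, before resubmitting the standalone job above under the same name, the existing service can be checked and removed with the YARN service CLI (Hadoop 3.1+):

# yarn app -status standalone-tf-01
# yarn app -destroy standalone-tf-01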

 

Run Distributed TensorFlow Job

The following is the command to submit a distributed TensorFlow job. For distributed TensorFlow, the number of workers should be greater than 1 and a parameter server is also required. In this example, we provision two workers and one parameter server. Each worker gets 8 GB of RAM, 2 vcores, and 1 GPU; the parameter server requests 2 GB of RAM and 2 vcores. More workers can be requested depending on your environment and the resources available.

 

[root@rhel4 ~]# cat distributed-tf.sh
hdfs dfs -rm -r /tmp/cifar-10-jobdir
yarn app -destroy dtf-job-01
yarn jar /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar job run \
--name dtf-job-01 \
--verbose \
--docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0  \
--input_path hdfs://CiscoHDP/tmp/cifar-10-data \
--checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
--ps_docker_image  linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt  \
--num_ps 1 \
--ps_resources memory=2G,vcores=2  \
--ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--tensorboard \
--tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt

[root@rhel4 ~]#

Picture1.png

 

Verify Distributed TensorFlow Job

After submitting the job to YARN in the step above, perform the following steps to verify that the job is running successfully.

Launch the YARN GUI and click Services. The submitted job will show up in the Applications and Services tabs as shown in the figure below.

Picture1.png

Click dtf-job-01 to see the details of this service. The figure below shows the components created as a result of this job: the 2 workers, 1 PS, and TensorBoard that were requested can be verified.

Picture1.png

Click Attempt List and then the Attempt ID. Grid View shows the details of the containers launched, their status, and the nodes where they are provisioned. If GPU resources are requested, as in this run, YARN schedules the worker Docker containers on nodes where GPUs exist. After the job completes successfully, resources are returned to the cluster automatically.

Picture1.png

       Note: If the resources requested for a job, such as RAM, CPU, and GPU, are greater than the available resources, the job will fail. Verify cluster resources in the YARN GUI to make sure advanced resources such as GPUs are not already occupied by other jobs or users.
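Per-node resources, including GPUs, can also be checked from the command line before submitting a job (the -showDetails option of the YARN node CLI prints resource utilization for each node):

# yarn node -list -showDetails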

 

Click one of the worker nodes in the Logs column to verify that the job is running. For example, click the rhel17 logs and then the stderr.txt file. Verify the output as shown in the figure below.

Picture1.png

We can also verify the resources utilized on the node where a worker is running, as shown in the figure below for the rhel17 server.

Picture1.png

Cluster-level resources can also be viewed by clicking "Cluster Overview" in the YARN GUI. We launched the Submarine job with 2 workers, each with 1 GPU; the cluster overview in the figure below shows the resource utilization at the cluster level.

Picture1.png

Since we requested YARN to launch TensorBoard for this job, TensorBoard can be launched by clicking Settings --> Tensorboard in the Services/Applications tab.

Picture1.png

TensorBoard can be viewed as shown below.

Picture1.png

SSH to one of the nodes where a worker is provisioned and run the nvidia-smi command as shown below.
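For example, assuming rhel17 is one of the nodes running a worker container (as in the earlier run):

# ssh rhel17 nvidia-smi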

Picture1.png

 

Zeppelin for Submarine

Zeppelin is a web-based notebook that supports interactive data analysis. You can use SQL, Scala, Python, etc. to make data-driven, interactive, collaborative documents.

Zeppelin has more than 20 interpreters (for example Spark, Hive, Cassandra, Elasticsearch, Kylin, and HBase) that can be used for data collection, data cleaning, feature extraction, and other data preprocessing steps on data in Hadoop before the machine learning model training is performed.

The Submarine interpreter is a newly developed interpreter that lets machine learning engineers and data scientists develop from a Zeppelin notebook, submit training jobs directly to YARN, and get the results back in the notebook.

Zeppelin Submarine Interpreter Properties

Launch Zeppelin. Click the username and then the Interpreter dropdown to configure the Submarine interpreter properties.

Picture1.png

Type submarine in the filter and click edit to configure the Submarine interpreter according to your environment.

The following properties were configured in our lab setup.

Name                                  Value
DOCKER_HADOOP_HDFS_HOME               /hadoop-3.1.0
DOCKER_JAVA_HOME                      /usr/lib/jvm/java-8-openjdk-amd64/jre/
HADOOP_YARN_SUBMARINE_JAR             /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar
INTERPRETER_LAUNCH_MODE               local
SUBMARINE_HADOOP_CONF_DIR             /etc/hadoop/conf
SUBMARINE_HADOOP_HOME                 /usr/hdp/3.0.1.0-187/hadoop
docker.container.network              calico-network
submarine.algorithm.hdfs.path         hdfs://CiscoHDP/user/root/algorithm
submarine.hadoop.home                 /usr/hdp/3.0.1.0-187/hadoop
tf.checkpoint.path                    hdfs://CiscoHDP/user/root/checkpoint/cifar-10-jobdir
tf.parameter.services.cpu             2
tf.parameter.services.docker.image    linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt
tf.parameter.services.gpu             0
tf.parameter.services.memory          2G
tf.parameter.services.num             1
tf.tensorboard.enable                 true
tf.worker.services.cpu                2
tf.worker.services.docker.image       linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0
tf.worker.services.gpu                1
tf.worker.services.memory             4G
tf.worker.services.num                2
yarn.webapp.http.address              http://rhel2.hdp3.cisco.com:8088
zeppelin.submarine.auth.type          simple

 

The figure below depicts setting up the Submarine interpreter properties.

Picture1.png

Run Distributed TensorFlow Using the Zeppelin Submarine Interpreter

Perform the following steps to submit a distributed TensorFlow job using the Zeppelin Submarine interpreter.

Click Notebook and select + Create new note. The Create new note window will pop up. Specify the Note Name and select submarine from the Default Interpreter drop-down as shown below.

Picture1.png

Type the following in the note and give the note a title by clicking the note settings icon on the right.

%submarine
dashboard

Click the Run icon. Select JOB RUN from the Command drop-down and enter the following.

Checkpoint Path

Submarine sets up a separate checkpoint path for each user's note for TensorFlow training. It stores the training history for the note and the output model data, and TensorBoard uses the data in this path for model presentation. Users cannot modify it. For example: hdfs://CiscoHDP/... . The environment variable name for the checkpoint path is %checkpoint_path%; you can use %checkpoint_path% in place of the literal value in the PS Launch Cmd and Worker Launch Cmd.

        You cannot modify the checkpoint path in the JOB RUN form, as it comes from the Submarine interpreter property tf.checkpoint.path.

 

Input Path

The user specifies the data directory for the TensorFlow algorithm. Only HDFS directories are supported. The environment variable name for the input path is %input_path%; you can use %input_path% in place of the literal value in the PS Launch Cmd and Worker Launch Cmd.

For example:

hdfs://CiscoHDP/tmp/cifar-10-data

PS Launch Cmd:

TensorFlow parameter server launch command:

cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0

Worker Launch Cmd:

TensorFlow worker launch command:

cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=30000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync

 

Picture1.png

Click Run Command to submit the Submarine job. The Zeppelin Submarine interpreter automatically merges the algorithm files into sections and submits them to the Submarine computation engine for execution. The figure below shows the Submarine execution log.

Picture1.png

Launch the YARN GUI. Click the Services tab and you will find the job running as shown below.

Picture1.png

Click the service name anonymous-2ejmbseuu for a detailed view as shown in the figure below.

Picture1.png

Click Attempt List, then click the appattempt in the next screen.

Picture1.png

The rest of the application attempts, logs, resource utilization, and so on can be explored in the YARN GUI as previously discussed in this document.

Click TENSORBOARD RUN in the Command drop-down and click Run Command as shown below.

Picture1.png

The YARN GUI Services tab will show that a new service has been launched, as shown below.

Picture1.png

Click the service (e.g., anonymous-tb). The service detail page will show up. Click Settings --> Tensorboard as shown in the figure below to launch TensorBoard.

Picture1.png

TensorBoard is launched as shown below.

Picture1.png


3 Comments
Martin L
VIP

Thanks for your hard work and for sharing!

Muhammad Afzal
Cisco Employee

Thanks Martin.

KevinSu13100
Level 1

@Muhammad Afzal Thanks for your great work and for sharing!

I'm a Submarine committer and PMC, I'd like to know more about Submarine use cases in Cisco

Could I have your email? or you can send a message to my email (pingsutw@apache.org)

I can also help you solve some problems when using Submarine.

 

 
