Preface
This technical blog describes how to implement Apache Submarine on Cisco UCS Integrated Infrastructure for Big Data and Analytics (https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/Cisco_UCS_Integrated_Infrastructure_for_Big_Data_with_Hortonworks_28node.html) and Cisco Data Intelligence Platform (https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/cisco_ucs_cdip_cloudera.html). At the time this PoC was implemented in our lab environment, Apache Submarine was in tech preview, meaning it was not intended for production-grade setups. This blog extends an existing Hadoop cluster, as outlined in the CVDs linked above, with Apache Submarine.
Overview
Hadoop Submarine is the latest machine learning framework subproject in Apache Hadoop. It allows infrastructure engineers and data scientists to run deep learning frameworks such as TensorFlow and PyTorch on a resource management platform, YARN in this case.
Hadoop 2.x enabled YARN to support Docker containers. Hadoop Submarine uses YARN to schedule and run distributed deep learning frameworks within Docker containers.
To make distributed deep learning/machine learning applications easy to launch, manage, and monitor, the Hadoop community initiated the Submarine project along with other improvements such as first-class GPU support, Docker container support, container-DNS support, and scheduling improvements.
These improvements make running distributed deep learning/machine learning applications on Apache Hadoop YARN as simple as running them locally, which lets machine learning engineers focus on algorithms instead of worrying about the underlying infrastructure. By upgrading to the latest Hadoop, users can run deep learning workloads alongside other ETL/streaming jobs on the same cluster. This provides easy access to data on the same cluster and better resource utilization.
Caveats and Limitations
Prerequisites
Submarine Architecture
The Hadoop community initiated the Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor.
Submarine Components
The following software versions were used to validate the PoC:
| Component | Version |
|---|---|
| Operating System | Red Hat Enterprise Linux 7.6 |
| Docker | 1.13.1 |
| Nvidia-docker | v1 |
| Calico | v1.11.7 |
| Calicoctl | v3.2.3 |
| etcd | v3.3.9 |
| Zeppelin | 0.9.0-SNAPSHOT |
| HDP | 3.0.1 |
| Hadoop | 3.1.1 |
| YARN | 3.1.1 |
| Java | 1.8.0 |
Configuration Instructions
Download Submarine-Installer
# git clone https://github.com/hadoopsubmarine/submarine-installer.git
Setup Submarine-Installer
Submarine-installer can install all dependencies, such as etcd, docker, and nvidia-docker, directly from the network. However, where servers are not connected to the internet, a download server can be set up.
Set up the download server for submarine-installer by making the following changes in /submarine-installer/install.conf:
DOWNLOAD_SERVER_IP="10.16.1.58"
DOWNLOAD_SERVER_PORT="19000"
LOCAL_DNS_HOST="10.16.16"
ETCD_HOSTS=(10.16.1.31 10.16.1.32 10.16.1.33)
Note: A minimum of three servers is required for setting up etcd. In this case, rhel1, rhel2, and rhel3 are selected for setting up etcd.
This document sets up Submarine in an already installed HDP environment. Therefore, do not install anything else; we assume that yarn, docker, nvidia-driver, cuda, and nvidia-docker are already installed and properly configured. Verify this by running the following command:
# nvidia-docker run --rm nvidia/cuda:9.2-base nvidia-smi
Run the installer with the following command.
# ./submarine-installer/install.sh
Enter “y” to launch the main menu. Enter “6” to start the download server as shown in Figure 2, then enter “y” to start the download HTTP server.
This downloads all the dependencies and packages into the /submarine-installer/downloads folder and starts the HTTP server, which is available on the download server IP and port specified in the /submarine-installer/install.conf file. Do not close the submarine-installer running on this server until all servers are configured.
When submarine-installer/install.sh is run on other servers and used to install components, it automatically downloads and installs them from the download server.
Install Components
In this reference design, we use submarine-installer to install only etcd and the Calico network. This document deploys and sets up Submarine in an already configured HDP environment, where docker, nvidia-driver, nvidia-docker, and the YARN services are assumed to be set up. Any Submarine-specific configuration of already installed components is covered in the subsequent sections.
Install ETCD
etcd is a distributed, reliable key-value store for the most critical data of a distributed system. Here it is used for registration and discovery of the services used in containers. Alternatives such as ZooKeeper or Consul can also be used; however, they are NOT tested in this reference design.
Perform the following steps to install the components.
Run ./submarine-installer/install.sh on rhel1, rhel2, and rhel3 to install and set up etcd as shown in the figure. Type 2 to install a component, then type 1 to install etcd, and press Enter.
Note: etcd must be installed on three servers to form a cluster. Here, we installed etcd on rhel1, rhel2, and rhel3.
After the install is complete, start the etcd service on all three servers by running the following commands.
# systemctl start etcd.service
# systemctl status etcd.service
Verify the etcd install by running the following command.
# etcdctl cluster-health
Verify cluster membership.
# etcdctl member list
Configure Docker
Modify the /etc/docker/daemon.json file on all the nodes. If it does not exist, create it with the following contents.
{
  "live-restore" : true,
  "debug" : true,
  "insecure-registries": ["${image_registry_ip}:5000"],
  "cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
  "cluster-advertise":"${localhost_ip}:2375",
  "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip}"],
  "hosts": ["tcp://${localhost_ip}:2375", "unix:///var/run/docker.sock"]
}
Replace the variables in the daemon.json file according to your environment. The following file from our lab is shown for reference.
[root@rhel4 submarine-installer]# cat /etc/docker/daemon.json
{
  "live-restore" : true,
  "debug" : true,
  "insecure-registries" : ["linuxjh.hdp3.cisco.com:5000"],
  "cluster-store" : "etcd://10.16.1.31:2379,10.16.1.32:2379,10.16.1.33:2379",
  "cluster-advertise" : "10.16.1.34:2375",
  "hosts" : ["tcp://10.16.1.34:2375", "unix:///var/run/docker.sock"]
}
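Because daemon.json has to be created on every node with only the local IP changing, it can help to generate it rather than edit it by hand. The following is a minimal Python sketch under that assumption; the function name build_daemon_config and its parameters are hypothetical helpers for illustration, not part of submarine-installer, and the ports 2379 and 2375 are simply the ones used in the file above.

```python
import json


def build_daemon_config(registry, etcd_hosts, local_ip, dns_servers=None):
    """Return the Docker daemon settings used in this setup as a dict.

    Hypothetical helper: it mirrors the keys of the daemon.json shown above.
    """
    config = {
        "live-restore": True,
        "debug": True,
        "insecure-registries": [registry],
        # etcd client port 2379, as in the reference file
        "cluster-store": "etcd://" + ",".join(f"{h}:2379" for h in etcd_hosts),
        "cluster-advertise": f"{local_ip}:2375",
        "hosts": [f"tcp://{local_ip}:2375", "unix:///var/run/docker.sock"],
    }
    if dns_servers:
        # Optional "dns" entry, only written when DNS servers are supplied
        config["dns"] = list(dns_servers)
    return config


config = build_daemon_config(
    registry="linuxjh.hdp3.cisco.com:5000",
    etcd_hosts=["10.16.1.31", "10.16.1.32", "10.16.1.33"],
    local_ip="10.16.1.34",
)
print(json.dumps(config, indent=2))
```

Writing the printed JSON to /etc/docker/daemon.json on each node, with local_ip set to that node's address, should give the same result as editing the file manually.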
Reload docker daemon and restart docker services.
# systemctl daemon-reload
# systemctl restart docker
As previously noted in the prerequisites, ensure that Docker uses the cgroupfs cgroup driver if YARN cgroups support is enabled.
vi /usr/lib/systemd/system/docker.service
Find and fix the cgroupdriver:
--exec-opt native.cgroupdriver=cgroupfs \
Install Calico
Calico creates and manages a flat Layer 3 network, and each container is assigned a routable IP.
To install Calico on the specified servers, run ./submarine-installer/install.sh. Enter “3” to install the Calico network, and press Enter as shown in the figure.
Start the calico-node service.
# systemctl start calico-node.service
# systemctl status calico-node.service
Verify the Calico network by running the following command. It shows the status of all hosts except the local host.
[root@rhel17 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+--------------------------------+
| PEER ADDRESS | PEER TYPE         | STATE | SINCE      | INFO                           |
+--------------+-------------------+-------+------------+--------------------------------+
| 10.16.1.31   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.40   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.41   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.43   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.44   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.34   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.35   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.36   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.37   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.38   | node-to-node mesh | up    | 2019-08-27 | Established                    |
| 10.16.1.58   | node-to-node mesh | up    | 2019-08-28 | Established                    |
+--------------+-------------------+-------+------------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
[root@rhel17 ~]#
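With many nodes it is easy to miss one peer that is not in the Established state. The following is a rough Python sketch that scans the text of `calicoctl node status` for peers that are not up; the function names are hypothetical, the parser assumes the five-column table layout shown above, and the second row of the sample input is made up to illustrate an unhealthy peer.

```python
def parse_peers(status_text):
    """Return (peer_address, state, info) tuples from the BGP peer table."""
    peers = []
    for line in status_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have five cells and start with an IPv4 address;
        # header and +---+ separator rows are skipped by this check.
        if len(cells) == 5 and cells[0].count(".") == 3:
            peers.append((cells[0], cells[2], cells[4]))
    return peers


def unhealthy_peers(status_text):
    """Return the peers whose state is not up/Established."""
    return [p for p in parse_peers(status_text)
            if p[1] != "up" or p[2] != "Established"]


# Illustrative input only: the second row is an invented failing peer.
sample = """\
| PEER ADDRESS | PEER TYPE         | STATE | SINCE      | INFO        |
+--------------+-------------------+-------+------------+-------------+
| 10.16.1.31   | node-to-node mesh | up    | 2019-08-27 | Established |
| 10.16.1.40   | node-to-node mesh | start | 2019-08-27 | Connect     |
"""
print(unhealthy_peers(sample))
```

In practice the text would come from running the command on each node, and an empty list would indicate that every listed peer is up.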
Test the Calico network. The Calico network install also contains a script that provisions two Docker containers on two different nodes and runs ping between them. It can fail if the other node is not yet set up with Calico. In that case, you can follow these manual steps after setting up Calico.
# docker network create --driver calico --ipam-driver calico-ipam calico-network
# docker network ls
Create a container on node 1 on the new network.
docker run --net calico-network --name workload-A -tid busybox
Create a container on node 2 on the same network.
docker run --net calico-network --name workload-B -tid busybox
Ping from the node 1 container to the node 2 container.
docker exec workload-A ping workload-B
YARN Configuration for Calico Network
Add the Docker network created with the Calico driver to the YARN property yarn.nodemanager.runtime.linux.docker.allowed-container-networks.
In Ambari, click YARN-->CONFIGS-->ADVANCED, filter for allowed-container-networks, and add calico-network as shown.
Click SAVE and restart all affected services.
Verify nvidia-docker
In this reference design, nvidia-docker v1 is used. It is assumed that nvidia-driver, cuda, and nvidia-docker are properly configured. Before proceeding further, verify that the environment is ready to run GPU-enabled distributed AI/ML.
Perform the following tests:
Test-1:
[root@rhel14 ~]# nvidia-smi
[root@rhel14 ~]# nvidia-docker run --rm nvidia/cuda:9.2-base nvidia-smi
Test-2:
[root@rhel14 ~]# nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
root@0b34d22fac56:/notebooks# python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
2019-08-30 14:17:41.424470: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-08-30 14:17:41.812712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:5e:00.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
2019-08-30 14:17:41.812760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2019-08-30 14:17:42.291711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-30 14:17:42.291763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2019-08-30 14:17:42.291775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2019-08-30 14:17:42.292189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/device:GPU:0 with 14879 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0)
True
>>>
Create Docker Image for Submarine
Create the following Dockerfile.
[root@linuxjh dockerfile]# cat tf-1.8.0-gpu.Dockerfile
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends --allow-change-held-packages \
        build-essential \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        curl \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        python \
        python-dev \
        rsync \
        software-properties-common \
        unzip \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

RUN pip --no-cache-dir install \
        Pillow \
        h5py \
        ipykernel \
        jupyter \
        matplotlib \
        numpy \
        pandas \
        scipy \
        sklearn \
        && \
    python -m ipykernel.kernelspec

# Install TensorFlow GPU version.
RUN pip --no-cache-dir install \
    http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl

RUN apt-get update && apt-get install git -y
RUN apt-get update && apt-get install -y openjdk-8-jdk wget

# Download hadoop-3.1.1.tar.gz
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
RUN tar zxf hadoop-3.1.1.tar.gz
RUN mv hadoop-3.1.1 hadoop-3.1.0

# Download jdk which supports kerberos
#RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
#RUN tar xzf jdk8.tar.gz -C /opt
#RUN mv /opt/jdk* /opt/java
#RUN rm jdk8.tar.gz
#RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
#RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100
#ENV JAVA_HOME /opt/java
#ENV PATH $PATH:$JAVA_HOME/bin
[root@linuxjh dockerfile]#
Build Docker image
[root@linuxjh dockerfile]# docker build -t tf-1.8.0-gpu . -f tf-1.8.0-gpu.Dockerfile
Tag the image and push it to the private Docker registry
# docker tag tf-1.8.0-gpu linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:0.0.1
# docker push linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:0.0.1
# docker image ls
Run Submarine
To test machine learning with Submarine, the CIFAR-10 TensorFlow estimator example was used.
Perform the following steps before you begin running Submarine standalone or distributed jobs.
Download submarine jar file
Download hadoop-3.2.0.tar.gz from https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz; the hadoop-yarn-submarine-3.2.0.jar file is included in the Hadoop 3.2 distribution. Extract the downloaded archive and copy hadoop-yarn-submarine-3.2.0.jar to the /usr/hdp/3.0.1.0-187/hadoop-yarn/ folder on all NodeManager nodes.
Prepare Test Data
CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
Download the TensorFlow models repository by running the following command.
git clone https://github.com/tensorflow/models/
Go to models/tutorials/image/cifar10_estimator and generate the data with the following command (this requires TensorFlow to be installed):
python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
Upload data to HDFS
# hdfs dfs -put cifar-10-data/ /tmp/cifar-10-data
Run Standalone TensorFlow Job
Run the following command to submit a standalone TensorFlow job. For a standalone job, you do not need to specify a parameter server or the number of workers. The command below provisions one container for the worker and one for TensorBoard. TensorBoard is optional and can be omitted from the command.
[root@rhel4 ~]# cat standalone-tf.sh
hdfs dfs -rm -r /tmp/cifar-10-jobdir
yarn app -destroy standalone-tf-01
yarn jar /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar job run \
 --name standalone-tf-01 \
 --verbose \
 --docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
 --input_path hdfs://CiscoHDP/tmp/cifar-10-data \
 --checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
 --worker_resources memory=8G,vcores=2,gpu=1 \
 --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
 --ps_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt \
 --tensorboard \
 --tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt
[root@rhel4 ~]#
Note: The YARN service framework does not allow multiple services with the same name, so destroy any existing service with the same name before resubmitting by running the following command: yarn application -destroy <service-name>
Run Distributed TensorFlow Job
The following command submits a distributed TensorFlow job. For distributed TensorFlow, the number of workers must be greater than 1 and a parameter server is also required. In this example, we provision two workers and one parameter server. Each worker gets 8G of RAM, 2 CPUs, and 1 GPU. The parameter server requests 2G of RAM and 2 CPUs. More workers can be requested depending on your environment and the available resources.
[root@rhel4 ~]# cat distributed-tf.sh
hdfs dfs -rm -r /tmp/cifar-10-jobdir
yarn app -destroy dtf-job-01
yarn jar /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar job run \
 --name dtf-job-01 \
 --verbose \
 --docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 \
 --input_path hdfs://CiscoHDP/tmp/cifar-10-data \
 --checkpoint_path hdfs://CiscoHDP/tmp/cifar-10-jobdir \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
 --num_workers 2 \
 --worker_resources memory=8G,vcores=2,gpu=1 \
 --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync" \
 --ps_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt \
 --num_ps 1 \
 --ps_resources memory=2G,vcores=2 \
 --ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
 --tensorboard \
 --tensorboard_docker_image linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt
[root@rhel4 ~]#
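The standalone and distributed scripts differ only in the worker count and the parameter server options, so the argument list can be assembled programmatically. The following Python sketch (3.8 or later, for shlex.join) does this under the rule stated above that a distributed job needs more than one worker and a parameter server. The helper build_job_run is hypothetical, it covers only a subset of the flags used in the scripts above, and it only prints the command rather than running it.

```python
import shlex

SUBMARINE_JAR = "/usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar"


def build_job_run(name, image, input_path, checkpoint_path, worker_cmd,
                  num_workers=1, ps_image=None, ps_cmd=None, num_ps=0):
    """Assemble the `yarn jar ... job run` argument list (hypothetical helper)."""
    args = [
        "yarn", "jar", SUBMARINE_JAR, "job", "run",
        "--name", name,
        "--docker_image", image,
        "--input_path", input_path,
        "--checkpoint_path", checkpoint_path,
        "--env", "YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network",
        "--worker_resources", "memory=8G,vcores=2,gpu=1",
        "--worker_launch_cmd", worker_cmd,
    ]
    if num_workers > 1:
        # Distributed TensorFlow: a parameter server is required as well.
        if num_ps < 1 or not ps_image or not ps_cmd:
            raise ValueError("distributed jobs need a parameter server")
        args += [
            "--num_workers", str(num_workers),
            "--ps_docker_image", ps_image,
            "--num_ps", str(num_ps),
            "--ps_resources", "memory=2G,vcores=2",
            "--ps_launch_cmd", ps_cmd,
        ]
    return args


base = "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py"
paths = "--data-dir=%input_path% --job-dir=%checkpoint_path%"
job = build_job_run(
    name="dtf-job-01",
    image="linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0",
    input_path="hdfs://CiscoHDP/tmp/cifar-10-data",
    checkpoint_path="hdfs://CiscoHDP/tmp/cifar-10-jobdir",
    worker_cmd=f"{base} {paths} --train-steps=10000 --num-gpus=1 --sync",
    num_workers=2,
    ps_image="linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt",
    ps_cmd=f"{base} {paths} --num-gpus=0",
    num_ps=1,
)
print(shlex.join(job))  # shell-quoted command line, ready to paste
```

Building the command as a list keeps the quoted launch commands intact, which is the part most easily broken when editing the shell script by hand.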
Verify Distributed TensorFlow Job
After submitting the job to YARN in the step above, perform the following steps to verify that the job is running successfully.
Launch the YARN (ResourceManager) GUI. Click Services. The submitted job shows up in the Applications and Services tab as shown in the figure below.
Click dtf-job-01 to see the details of this service. The figure below shows the components created as a result of this job: 2 workers, 1 PS, and TensorBoard, as requested.
Click Attempt List and then the Attempt ID. Grid View shows the details of the launched containers, their status, and the nodes where they are provisioned. If GPU resources are requested, as in this run, YARN schedules the worker Docker containers on nodes where GPUs exist. After the job completes successfully, resources are returned to the cluster automatically.
Note: If the resources requested for a job, such as RAM, CPU, and GPU, are greater than the available resources, the job will fail. Verify cluster resources in the YARN GUI to make sure that advanced resources such as GPUs are not already occupied by other jobs or users.
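To make the note above concrete, the total request of a job can be added up and compared with what the cluster has free. The following Python sketch does the arithmetic for the distributed job in this document (2 workers with 8G, 2 vcores, 1 GPU each, plus 1 parameter server with 2G and 2 vcores). The function names are hypothetical, the available figures are invented for illustration, and the TensorBoard container is not counted because its resource request is not stated here.

```python
def total_request(workers, worker_res, ps, ps_res):
    """Sum per-container requests across all workers and parameter servers."""
    keys = set(worker_res) | set(ps_res)
    return {k: workers * worker_res.get(k, 0) + ps * ps_res.get(k, 0)
            for k in keys}


def shortfall(requested, available):
    """Return how much of each resource is missing; empty if the job fits."""
    return {k: v - available.get(k, 0)
            for k, v in requested.items() if v > available.get(k, 0)}


requested = total_request(
    workers=2, worker_res={"memory_gb": 8, "vcores": 2, "gpu": 1},
    ps=1, ps_res={"memory_gb": 2, "vcores": 2},
)
# requested totals: 18 GB of memory, 6 vcores, 2 GPUs

# Hypothetical free capacity, e.g. one GPU already taken by another job
available = {"memory_gb": 64, "vcores": 16, "gpu": 1}
print(shortfall(requested, available))  # {'gpu': 1}
```

A non-empty result points at the resource to free up, or to request less of, before submitting the job.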
Click the Logs column entry for one of the worker nodes to verify that the job is running. For example, click the rhel17 logs, and then click the stderr.txt file. Verify the output as shown in the figure below.
We can also verify the resources utilized on the node where a worker is running, for example the rhel17 server as shown in the figure below.
Cluster-level resources can also be viewed by clicking “Cluster Overview” in the YARN GUI. We launched the Submarine job with 2 workers, each with 1 GPU. The cluster overview in the figure below shows resource utilization at the cluster level.
Because we requested YARN to launch TensorBoard for this job, TensorBoard can be launched by clicking Settings-->Tensorboard in the Services/Applications tab.
TensorBoard can be viewed as shown below.
SSH to one of the nodes where a worker is provisioned and run the nvidia-smi command as shown below.
Zeppelin for Submarine
Zeppelin is a web-based notebook that supports interactive data analysis. You can use SQL, Scala, Python, and other languages to create data-driven, interactive, collaborative documents.
Zeppelin has more than 20 interpreters (for example Spark, Hive, Cassandra, Elasticsearch, Kylin, and HBase) that can be used to collect data, clean data, and extract features from data in Hadoop, that is, to preprocess the data before machine learning model training.
The Submarine interpreter is a newly developed interpreter that lets machine learning engineers and data scientists develop in a Zeppelin notebook, submit training jobs directly to YARN, and get the results in the notebook.
Zeppelin Submarine Interpreter Properties
Launch Zeppelin. Click the username and then select Interpreter from the drop-down to configure the Submarine interpreter properties.
Type submarine in the filter and click Edit to configure the Submarine interpreter according to your environment.
The following properties were configured in our lab setup.
| Name | Value |
|---|---|
| DOCKER_HADOOP_HDFS_HOME | /hadoop-3.1.0 |
| DOCKER_JAVA_HOME | /usr/lib/jvm/java-8-openjdk-amd64/jre/ |
| HADOOP_YARN_SUBMARINE_JAR | /usr/hdp/3.0.1.0-187/hadoop-yarn/hadoop-yarn-submarine-3.2.0.jar |
| INTERPRETER_LAUNCH_MODE | local |
| SUBMARINE_HADOOP_CONF_DIR | /etc/hadoop/conf |
| SUBMARINE_HADOOP_HOME | /usr/hdp/3.0.1.0-187/hadoop |
| docker.container.network | calico-network |
| submarine.algorithm.hdfs.path | hdfs://CiscoHDP/user/root/algorithm |
| submarine.hadoop.home | /usr/hdp/3.0.1.0-187/hadoop |
| tf.checkpoint.path | hdfs://CiscoHDP/user/root/checkpoint/cifar-10-jobdir |
| tf.parameter.services.cpu | 2 |
| tf.parameter.services.docker.image | linuxjh.hdp3.cisco.com:5000/tf-1.8.0-cpu-wt |
| tf.parameter.services.gpu | 0 |
| tf.parameter.services.memory | 2G |
| tf.parameter.services.num | 1 |
| tf.tensorboard.enable | true |
| tf.worker.services.cpu | 2 |
| tf.worker.services.docker.image | linuxjh.hdp3.cisco.com:5000/tf-1.8.0-gpu:1.0 |
| tf.worker.services.gpu | 1 |
| tf.worker.services.memory | 4G |
| tf.worker.services.num | 2 |
| yarn.webapp.http.address |  |
| zeppelin.submarine.auth.type | simple |
The figure below depicts setting up the Submarine interpreter properties.
Run Distributed Tensorflow using Zeppelin Submarine Interpreter
Perform the following steps to submit a distributed TensorFlow job using the Zeppelin Submarine interpreter.
Click Notebook and select + Create new note. The Create new note window pops up. Specify a Note Name and select submarine from the Default Interpreter drop-down as shown below.
Type the following in the note, and give the note a title by clicking the note settings icon on the right.
%submarine dashboard
Click the Run icon. Select JOB RUN from the Command drop-down and enter the following.
| Checkpoint Path | Submarine sets up a separate checkpoint path for each user's note for TensorFlow training. This path stores the training history for the note and the output model data, and TensorBoard uses the data in this path for model presentation. For example: hdfs://CiscoHDP/... The environment variable name for Checkpoint Path is %checkpoint_path%; you can use %checkpoint_path% in place of the literal path in PS Launch Cmd and Worker Launch Cmd. You cannot modify the checkpoint path in Job Run because it comes from the Submarine interpreter property tf.checkpoint.path. |
| Input Path | The data directory of the TensorFlow algorithm, specified by the user. Only HDFS-enabled directories are supported. The environment variable name for Input Path is %input_path%; you can use %input_path% in place of the literal path in PS Launch Cmd and Worker Launch Cmd. For example: hdfs://CiscoHDP/tmp/cifar-10-data |
| PS Launch Cmd | TensorFlow parameter server launch command. cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 |
| Worker Launch Cmd | TensorFlow worker launch command. cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=30000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=1 --sync |
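The effect of the %input_path% and %checkpoint_path% placeholders in the launch commands can be pictured as plain text substitution. The following Python sketch illustrates only that effect; it is not Submarine's actual implementation, and the helper name expand_launch_cmd is hypothetical. The paths are the ones used in this lab setup.

```python
def expand_launch_cmd(cmd, input_path, checkpoint_path):
    """Replace the two placeholders with concrete HDFS paths (illustration only)."""
    return (cmd.replace("%input_path%", input_path)
               .replace("%checkpoint_path%", checkpoint_path))


ps_cmd = ("python cifar10_main.py --data-dir=%input_path% "
          "--job-dir=%checkpoint_path% --num-gpus=0")

expanded = expand_launch_cmd(
    ps_cmd,
    input_path="hdfs://CiscoHDP/tmp/cifar-10-data",
    checkpoint_path="hdfs://CiscoHDP/user/root/checkpoint/cifar-10-jobdir",
)
print(expanded)
```

Using the placeholders instead of literal paths keeps the PS and worker launch commands reusable when the input data or the interpreter's tf.checkpoint.path changes.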
Click Run Command to submit the Submarine job. The Zeppelin Submarine interpreter automatically merges the algorithm files into sections and submits them to the Submarine computation engine for execution. The figure below shows the Submarine execution log.
Launch the YARN GUI. Click the Services tab to find the running job as shown.
Click the service name anonymous-2ejmbseuu for a detailed view as shown in the figure below.
Click Attempt List. Click the appattempt entry on the next screen.
The remaining application attempts, logs, resource utilization, and so on can be explored in the YARN GUI as previously discussed in this document.
Select TENSORBOARD RUN from the Command drop-down and click Run Command as shown below.
The YARN GUI Services tab shows that a new service has been launched, as shown below.
Click the service (for example, anonymous-tb). The service detail page is displayed. Click Settings-->Tensorboard as shown in the figure below to launch TensorBoard.
Launch the Tensorboard