Cisco Cognitive Intelligence announces the open-source release of project ORaF. ORaF is a library for distributed training of random forest-based machine learning models on Apache Spark. The library is a fork of Apache Spark MLlib and contains various improvements that yield a more than 100-fold training speedup on our network telemetry datasets and allow training of deeper trees. We have used ORaF at Cognitive Intelligence for several months to train models on drastically larger datasets than we were able to before.
ORaF is 40x faster than MLlib on a dataset with 10 million samples and more than 100x faster on a dataset with 30 million samples.
To optimize the training process, ORaF introduces a local training phase with improved task scheduling. Once a tree node is sufficiently small, tree induction for that node is completed in memory on a single executor. Additionally, nodes selected for local training are grouped into larger and more balanced local training tasks using bin packing. These tasks are then scheduled effectively by estimating their expected training duration.
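The grouping and scheduling idea can be illustrated with a minimal Python sketch. It is not ORaF's implementation: the first-fit decreasing heuristic, the capacity measured in rows, the n log n cost estimate, the longest-first ordering, and all names (`Task`, `pack_nodes`, `schedule`) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from math import log2


@dataclass
class Task:
    """A group of tree nodes trained together on one executor (hypothetical)."""
    capacity: int
    node_sizes: list = field(default_factory=list)

    @property
    def load(self) -> int:
        # Total number of rows assigned to this task.
        return sum(self.node_sizes)

    @property
    def estimated_cost(self) -> float:
        # Assumed cost model: n * log2(n) per node with n rows.
        return sum(n * log2(n) for n in self.node_sizes if n > 1)


def pack_nodes(node_sizes, capacity):
    """Group nodes into tasks with first-fit decreasing bin packing."""
    tasks = []
    for size in sorted(node_sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"node with {size} rows exceeds capacity {capacity}")
        for task in tasks:
            if task.load + size <= capacity:
                task.node_sizes.append(size)
                break
        else:
            # No existing task has room, so open a new one.
            tasks.append(Task(capacity, [size]))
    return tasks


def schedule(tasks):
    """Order tasks so that the longest estimated task starts first."""
    return sorted(tasks, key=lambda t: t.estimated_cost, reverse=True)


if __name__ == "__main__":
    tasks = pack_nodes([500, 300, 200, 700, 100, 400], capacity=1000)
    for task in schedule(tasks):
        print(task.node_sizes, task.load, round(task.estimated_cost))
```

Starting the longest tasks first is a common way to reduce the chance that one large task finishes long after the others; whether ORaF orders tasks this way is not stated here.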
The library is available on GitHub: https://github.com/cisco/oraf
A thorough explanation of the methods used and of the benchmark experiments can be found in the authors' thesis.