Disclaimer: this article describes research activity performed within the BDE2020 project. The Docker images created here are intended for a development setup of the pipelines for the BDE platform and should by no means be used in a production environment.
In this article we will show how to create a scalable HDFS/Spark setup using Docker and Docker-Compose. We include the Hue HDFS FileBrowser as well as Spark Notebook in our setup, enabling users to work with both HDFS and Spark via graphical user interfaces.
Why do we need to pack Spark into Docker in the first place? In our infrastructure, we have a cluster of three servers, each with 256 GB RAM and 64 CPU cores. As we run some other applications on this cluster, we chose Docker to isolate applications from each other while still being able to connect them when necessary. Docker can also restrict RAM and set a CPU quota for each container, so it is possible to spawn several Spark workers isolated from each other on the same machine, for instance 60 Spark workers on one of the mentioned servers with 2 GB RAM and 1/70 of the CPU time each.
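To give a rough idea of the kind of limits we mean, a worker container can be started with a memory cap and a CPU quota directly via docker run. The image name and the exact numbers below are illustrative placeholders, not the settings from our compose files:

```bash
# Hypothetical example: cap a container at 2 GB RAM and roughly 0.9 of one CPU core.
# --cpu-quota is the CPU time (in microseconds) the container may use per
# --cpu-period (default 100000 µs), so 90000/100000 ≈ 0.9 cores.
$ docker run -d \
    -m 2g \
    --cpu-period=100000 --cpu-quota=90000 \
    some/spark-worker-image
```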
Docker has existed for a couple of years already and has an ecosystem where users can find dozens of images for almost all mainstream technologies, including HDFS/Spark. Looking for an out-of-the-box solution, we collected all the available images by searching for “spark” on Docker Hub, visualized them using imagelayers, and then shortlisted them by the following criteria:
- Less than 1 GB in size. The bigger images most likely do not have separation of components, which makes them hard to scale.
- Are HDFS and Spark separated? We want to scale HDFS nodes separately from Spark nodes.
- Is it easy to install? Is there documentation?
- Can we scale Spark with this image?
- Is it easy to connect Spark to HDFS?
After checking against criteria 1 and 2, which are relatively easy to verify without trying the images out, the remaining images were evaluated one by one. We found that none of them satisfied criterion 4 (scaling Spark), so we ended up building our own set of images. The commands below assume you have cloned the repository and are in its root folder.
```bash
$ ./buildall.sh
$ docker network create hadoop
$ docker-compose up
```
The buildall.sh script iterates through all the Dockerfiles inside the repo and builds the necessary images. As we are using version 1 of Docker-Compose, you have to create the Docker network manually with the docker network create command. After running docker-compose up you will have 1 namenode, 2 datanodes, 1 Spark master, 1 Spark worker, a Spark Notebook and a Hue HDFS FileBrowser running.
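As a quick sanity check (not part of the original walkthrough), you can list the running services together with their state and ports using Compose itself:

```bash
# Lists every container started by docker-compose up, its state and exposed ports
$ docker-compose ps
```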
Navigate to http://your.docker.host:50070 to see the status of your HDFS and to http://your.docker.host:8080 for Spark. On the Spark page you will only see 1 worker. Let's scale it up to 5.
$ docker-compose scale spark-worker=5
Refresh the page at http://your.docker.host:8080. You will now see 5 Spark workers registered.
Now let's load some data into HDFS. We download a sample csv file and push it into HDFS using a throwaway Hadoop client container:

```bash
## you should be inside the git repo folder
## docker-compose up should be running
## the data dir is created automatically
$ cd data
$ wget https://data.mattilsynet.no/vannverk/vannbehandlingsanlegg.csv
$ docker run -it --rm --env-file=../hadoop.env --net hadoop bde2020/hadoop hadoop fs -mkdir -p /user/root
$ docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root
```
Another way to load the data into HDFS is to use the Hue FileBrowser. To perform the same actions as in the listing above, navigate to http://your.docker.host:8088. Log in to the FileBrowser with the username “hue” and any password (the “hue” user is set up as a proxy user for HDFS; see hadoop.env for the configuration parameters). Click on “File Browser” in the upper right corner of the screen and use the GUI to create the /user/root folder and upload the csv file there.
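For reference, making a user a proxy user boils down to two standard Hadoop properties in core-site.xml; hadoop.env in the repo sets them through the image's environment-variable convention. The snippet below is a generic illustration of those properties, not a copy of the project's configuration:

```xml
<!-- core-site.xml: allow the "hue" user to impersonate other users
     from any host and for any group (illustrative, wide-open values) -->
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
```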
Go to http://your.docker.host:50070 and check that the file exists under the path /user/root/vannbehandlingsanlegg.csv.
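If you prefer the command line, the same check can be done with another throwaway client container, run from the data folder like the listing above (hence ../hadoop.env):

```bash
# List the contents of /user/root in HDFS; the csv file should show up here
$ docker run -it --rm --env-file=../hadoop.env --net hadoop bde2020/hadoop hadoop fs -ls /user/root
```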
Open the Spark Notebook at http://your.docker.host:9000 and choose the “core/simple spark” example. You can run the cells by clicking on them and pressing the play button. Add a new cell (using the plus button in the toolbar):
```scala
val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
textFile.count()
```
It will show you the execution time and the number of lines in the csv file.
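If you want to poke at the data a bit more, a follow-up cell along these lines works on the same RDD; this is just an illustrative extra step, not part of the original example:

```scala
// Peek at the first few records and count the non-empty lines
textFile.take(5).foreach(println)
val nonEmptyLines = textFile.filter(_.trim.nonEmpty).count()
```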
If you want to submit an application to the Spark cluster as a jar file, you can run a Spark container with the current directory ($(pwd)) mounted and run your app from inside the container:
```bash
# on the host: start an interactive Spark container with the current dir mounted under /data
$ docker run -it --rm --net hadoop --name spark-submit-application --volume $(pwd):/data/ bde2020/hadoop-spark bash

# inside the container: submit your jar to the Spark master
$ cd /data
$ /opt/spark-1.6.1/bin/spark-submit \
    --master spark://spark-master:7077 \
    --class org.example.your.class \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    my.jar my_cli_option1 my_cli_option2
```
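For completeness, here is a minimal sketch of what such an application could look like. The package and object names are hypothetical placeholders standing in for the --class placeholder above (you would pass --class org.example.your.LineCount for this sketch); it simply counts the lines of the csv file we uploaded earlier:

```scala
package org.example.your

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example application; build it into my.jar and hand it to spark-submit
object LineCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("line-count")
    val sc = new SparkContext(conf)

    // Read the csv file from HDFS and print the number of lines
    val lines = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
    println(s"Line count: ${lines.count()}")

    sc.stop()
  }
}
```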
Here is an example of how to submit SparkPi:
```bash
$ docker run -it --rm --net hadoop --name spark-submit-application --volume $(pwd):/data/ bde2020/hadoop-spark \
    ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://spark-master:7077 \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar 10
```
That’s it! If you have any questions, feel free to open an issue on the BDE GitHub.