How to deploy Apache Zeppelin in K8s with S3 support

Roman Krivtsov
Thinkport Technology Blog
6 min read · Mar 8, 2022


TL;DR

We have built Apache Zeppelin v0.10.0 Docker images that can be used in Kubernetes with Spark. After building the images with this script, you can deploy them right away in Kubernetes with our yaml file. As a bonus, you can set up S3 connections in your Spark jobs if required.

Hi, I am Chris Ulpinnis, a certified Kubernetes administrator with about 3 years of experience, working as a cloud architect for Thinkport since this year. Before that, I worked for about 5 years as a bioinformatician on different research projects, during which I shifted from data science projects to scientific infrastructure projects.
In my spare time I am interested in microcontrollers and craft beer. I like to wear clothes with cat motifs and to watch really bad movies with my friends.

Introduction

As a former Data Scientist in Bioinformatics, I know that not only the engineering part of data analytics is important, but also the accessibility and comprehensibility of your results and pipelines. Often you will work with people from different fields or teams. For that reason, writing your analyses in notebooks can be a good idea. Notebooks let you structure, comment, embed and template your analytics and their results. For example, notebooks can be used by colleagues for their own analyses or by course participants learning from you. There are different notebook options available. Today we want to write about Apache Zeppelin.

Apache Zeppelin is a multi-purpose interactive notebook solution for data ingestion, discovery, analytics and visualisation. It also has multi-user support for collaborative work on the same notebooks. It features multiple interpreters to run your analytics in your desired language (for example R or Python). You can even mix different languages in one notebook.
One key benefit of Zeppelin is the integration of Apache Spark. Spark is an engine for data science, SQL analytics and machine learning supporting batch and streaming data. It offers different language bindings (for example R and Python again) to create analytics pipelines. Spark can utilise the power of cluster computing, e.g. when running in Kubernetes.
In this article, we will show you how to run Apache Zeppelin with Spark integration on Kubernetes. As a bonus, we will show you how to integrate and configure Simple Storage Service (S3) support into Spark running in Zeppelin for accessing files in the cloud. We will mark those parts so that you can skip them if you do not want to use S3 storage.

Building Apache Zeppelin — Server and Interpreters

At the moment there are no official Docker images from Apache to run Zeppelin containerised in Kubernetes. Images that can be found in public registries like Docker Hub are mostly outdated or not ready to use as they are.

The Kubernetes implementation of Zeppelin consists of a server image, an interpreter image and a Spark image. Server and interpreter are based on the distribution image, while the Spark image is normally independent. In our approach, the Spark image is based on the interpreter image (and thus also on the Zeppelin distribution image).

The important Dockerfiles are the ones for the Zeppelin distribution, the server image, the interpreter image and the Spark image.

We modified them so that you do not need to download them from the Zeppelin repository yourself. In general, you just need to clone my repository and run build_from_binary.sh.
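
A minimal sketch of the build steps; the repository URL and folder name below are placeholders for the GitHub repository linked above:

# clone the repository with the modified Dockerfiles and the build script
git clone <URL of my GitHub repository>    # placeholder for the repository linked above
cd <cloned repository folder>              # placeholder
# build the Zeppelin server, interpreter and Spark images from the official binary distribution
./build_from_binary.sh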


Kubernetes deployment

There are some parts in the zeppelin-server.yaml that can/must be adjusted when using your own images. You have to edit the configMap zeppelin-server-conf-map:

  • ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE: your Spark image
  • ZEPPELIN_K8S_CONTAINER_IMAGE: your Zeppelin interpreter image

Furthermore, in the deployment, you have to change the Container image to the name of your server image. You can find an example yaml file here.
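
As a rough sketch, the relevant parts could look like this; the structure follows the upstream zeppelin-server.yaml, the image names and tags are placeholders for your own builds, and the rest of the file is omitted:

apiVersion: v1
kind: ConfigMap
metadata:
  name: zeppelin-server-conf-map
data:
  # image used for the Zeppelin interpreter pods
  ZEPPELIN_K8S_CONTAINER_IMAGE: <your-registry>/zeppelin-interpreter:0.10.0      # placeholder
  # image used for the Spark driver and executor pods
  ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE: <your-registry>/zeppelin-spark:0.10.0      # placeholder
---
# excerpt of the Deployment: only the container image needs to change
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zeppelin-server
spec:
  template:
    spec:
      containers:
        - name: zeppelin-server
          image: <your-registry>/zeppelin-server:0.10.0    # placeholder for your server image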

If you are using our images, you can use this yaml directly. If you just want to play around without persistent storage, remove the volumeMounts for notebook and settings as well as the PVC entries.

S3 Spark configuration in Zeppelin

We will create a new Spark interpreter with the settings to connect to our S3 store. You can create multiple interpreters, one for each S3 connection. In our case we used rook-ceph with a RADOS gateway to create an S3 data store. Of course you can also use AWS S3 storage.

Interpreter Settings

Open your Zeppelin instance (we used an SSH tunnel) and click on your username in the upper right corner. Select “Interpreter” from the menu.

In the interpreter settings click on “Create” in the upper right corner and create a new interpreter in the spark interpreter group. We named the interpreter spark-s3.

You can copy most of the settings from the default spark interpreter, like SPARK_HOME and spark.master. To get S3 support, it is very important to add the additional Hadoop jar files to spark.jars. For our v0.10.0 build you have to add /spark/jars/hadoop-aws-3.2.0.jar,/spark/jars/aws-java-sdk-bundle-1.11.375.jar,/spark/jars/wildfly-openssl-1.0.7.Final.jar. This is necessary because the Spark distribution in the Zeppelin interpreter container is copied to /spark and not to /opt/spark (as it is in the Spark image).

Finally, scroll down to the bottom of the interpreter configuration and add the following S3 configuration parameters:
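
As a rough sketch, these are the standard Hadoop S3A properties you would typically add as name/value pairs; the endpoint and credentials are placeholders for your own S3 store, and the exact values depend on your setup:

spark.hadoop.fs.s3a.endpoint                 http://<your-s3-endpoint>                # placeholder: RADOS gateway or AWS endpoint
spark.hadoop.fs.s3a.access.key               <your-access-key>                        # placeholder
spark.hadoop.fs.s3a.secret.key               <your-secret-key>                        # placeholder
spark.hadoop.fs.s3a.path.style.access        true                                     # usually required for a RADOS gateway
spark.hadoop.fs.s3a.impl                     org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled   false                                    # set to true if your endpoint uses https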

Example for using S3 in PySpark

My colleague Alex created a small test script to evaluate whether the connection works. Do not miss his Medium article about how to use the results of Spark jobs in Trino by gluing them together with S3 and a Hive metastore.
Create a new notebook and select our newly created interpreter. We used the following code:

%spark-s3.pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import Row
import pyspark.sql.functions as f

print("################################################")
print("Test 1: create RDD")
# sanity check that the Spark interpreter itself works
rdd = sc.parallelize(range(100000000))
print(rdd.sum())

print("################################################")
print("Test 2: read single file from S3 (city_value_histogram)")
# read a simple csv from S3
df = (spark
    .read
    .option("delimiter", ";")
    .csv("s3a://import/city/city_value_histogram.csv", inferSchema=True, header=True)
)
df.show()

print("################################################")
print("Test 3: read large file from S3 (bike-sharing-dataset/raw/hour.csv)")
# read a large csv from S3
df_bikes1 = (spark
    .read
    .option("delimiter", ",")
    .csv("s3a://import/bike-sharing-dataset/raw/hour.csv", inferSchema=True, header=True)
)
# place for your own transformations; the test just passes the DataFrame through
df_bikes2 = df_bikes1
df_bikes2.show()
df_bikes2.printSchema()
df_bikes2.select("mnth").distinct().show()

You can use this code as a template for reading and testing your own files located in an S3 data store.
The result should look like this:

Great! Now you have a running Zeppelin instance with all common interpreters and S3 support for Spark.

Caveats

Our Spark interpreter image has one big drawback: its size. Building it on top of the Zeppelin interpreter image eliminates compatibility issues on the one hand, but on the other hand it creates a pretty big image (around 4.4 GB). If a node pulls the image for the first time when the interpreter is run, this can result in a timeout of the interpreter: Kubernetes downloads the image but then terminates the container, which means the next try on this node will work. For this reason it is wise to store the images on the worker nodes beforehand or to use a private image registry or registry cache.

Conclusion

If you’d like to learn more about Kubernetes and Spark, take a look at our Kubernetes and Spark workshops, where we share our experience of building large-scale applications and data lakes.

We are also looking forward to your feedback!
