Welcome back to Learning Journal. In this video, I am going to talk about the choice of a Spark programming language. You already know that Spark APIs are available in Scala, Java, and Python. Recently, Spark also started supporting the R programming language. So, Spark APIs are available in these four main languages. However, there are a few more language bindings that the open source community is working on. You can get a list of those languages in the Spark documentation. There are also people working on a few other language bindings that are not yet listed on that page.
Best language for Spark
Language selection is a common question among Spark learners. Should I use Scala or Python? Is there any downside
to choosing Python over Scala? I often get this question from many people. As the number of Spark
language bindings grows, this question is becoming more and more critical. I am not going to
recommend a language to you. Instead, I will talk about the various considerations that should form
the basis for your language selection. With that knowledge, you will be empowered to make an appropriate
decision.
Great! Let's start.
How to select a Spark programming language?
There are three main parameters that we can use to evaluate a language for our Spark applications.
- Skills and proficiency in the language.
- Design goals and the availability of features.
- Performance implications.
Availability of skills is one of the critical factors. If you don't have Scala programmers in your team, or if you are
facing difficulty in building a Scala team but you already have a bunch of Python developers, you
will give a plus to Python over Scala. Right?
Similarly, if your data science team is more comfortable with the R language, then you would want
to prefer R over Python.
Your design goals and the capability of the language to model your requirements are another deciding
factor. If you want to use a feature or an API that is not available in the R language, you
will give a minus to R. Similarly, if you need to work with a specific binary data format,
and you already have a Java library to read and manipulate that format, you would
want to give a plus to Java or Scala.
Why compromise on a single Spark language?
The most important point to note here is that you have a choice in your hands, and it is not necessary to do everything, or the entire project, in a single language. We often see people doing a few jobs in one language and other jobs in a different language. The idea is to break your project into smaller parts and do each piece in the most suitable language.
Apache Spark language types
Great! The last piece of this puzzle is the performance of the code. To understand the performance implications of a language, we need to take a deep dive into the Spark architecture. To start that discussion, let's classify the available Spark languages into two categories.
- JVM languages. This category includes Scala and Java.
- Non-JVM languages. This category is for Python and R.
Now the next thing that we need to understand is the answer to the following question.
How does a language binding work in Spark?
I mean, Spark core is written in Scala. Then how does it execute Python code or R code?
Once you learn the high-level architecture of the language bindings, you will have a good understanding
of the performance bottlenecks.
Spark performance for Scala
Let's start with a Spark architecture refresher.
When we submit a Spark application to the cluster, the framework creates a driver process. The Spark framework is Scala code,
and hence the driver process is a JVM process. Right? Now assume you wrote your Spark application
in a JVM language. There is no complication in loading your code into a JVM. Your application is in Java
or Scala, and the driver is a JVM process. Everything is in the same language. Right?
What happens next?
Since Spark is a distributed computing framework, the driver process does not execute all the
code locally. Instead, the driver will serialize and send the application code to the worker processes.
The worker processes are responsible for the actual execution. Right?
These worker processes are also JVM processes. Why? Because the Apache Spark framework
is a JVM application. Right? Okay.
We wrote our job in a JVM language, and hence it executes in the worker JVM. If our application
code is loading some data from distributed storage, that data also comes and sits in the JVM. So,
both the application code and the data are in the JVM at the worker end, and hence everything works smoothly
without any complication. That is a high-level overview of how things work when everything is in
Scala or Java.
Spark performance for Python
The complication starts when we write code using a non-JVM language like Python or R. For simplicity, we will talk about Python only. However, the architecture is almost the same for an R application as well.
Let's try to understand the process when you submit a Python application to the Spark cluster.
Same as earlier, the Spark framework starts a Spark driver process.
But this time, the driver is a Python process because you submitted Python code. We can't run
Python code in a JVM, so the framework starts a Python driver. However, the Spark core implementation
is in Scala. Those APIs can't execute in a Python process. So, the Spark framework also starts a
JVM driver. Now we have two driver processes. The first driver is a Python driver or an R driver,
depending upon your application code. The second driver is always a JVM driver. The Python driver
talks to the JVM driver using socket-based APIs. What does that mean?
Let me explain. You have written Python code, and you are making calls to the PySpark APIs,
but ultimately, those API calls invoke Java APIs on the JVM over a socket-based connection.
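To make that concrete, here is a minimal sketch of what the PySpark binding looks like from the Python side. The underscore-prefixed attributes used below (_jvm, _jdf) are internal Py4J handles, so treat this purely as an illustration; they may change between Spark versions.

```python
# A minimal sketch of how a PySpark call maps to a JVM object.
# The underscore-prefixed attributes (_jvm, _jdf) are internal Py4J
# handles shown only for illustration; they may change between releases.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binding-demo").getOrCreate()

df = spark.range(10)     # a Python DataFrame object ...
print(type(df._jdf))     # ... wrapping a Java Dataset through a Py4J proxy

# The same gateway lets the Python driver call into the JVM directly:
print(spark._jvm.java.lang.System.currentTimeMillis())

spark.stop()
```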
The JVM driver is responsible for creating all the necessary objects and garbage-collecting them
when they go out of scope in Python. That's a complicated architecture, and we don't need to
get into too many details of it. However, the point that I want to make is straightforward.
Your Python and R code ultimately invokes Java APIs. That inter-process communication happens
over socket-based APIs, and it adds some overhead. However, the Spark creators claim that the
cost is almost fixed and negligible, and they keep improving the architecture over
new releases.
However, if you are collecting the output data back to your driver from the workers, the cost
is variable, depending on the amount of data. It takes additional time to serialize the output and
ship it from the JVM to the Python driver. This serialization can take longer depending on
the amount of data that you want to retrieve. However, if you are not running interactive workloads
and you are saving the output to storage instead of bringing it back to the driver, you don't have to
worry about this overhead.
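As a rough illustration of that difference, here is a sketch contrasting the two patterns. The output path is just a placeholder; adjust it for your environment.

```python
# A sketch contrasting the two driver-side patterns described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-save").getOrCreate()
df = spark.range(0, 1_000_000)

# Variable cost: every row of the result is serialized in the JVM and
# shipped over the socket to the Python driver, so the time grows with
# the size of the result.
rows = df.collect()

# Fixed cost: the data is written out by the JVM executors and never
# travels back to the Python driver.  The path below is a placeholder.
df.write.mode("overwrite").parquet("/tmp/output/numbers")

spark.stop()
```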
So, there is not much of a problem on the driver side if you are not pulling data back to your driver.
The only cost is that you will have two different processes, and they will spend a few hundred
milliseconds passing control messages. That's it.
Now let's look at the worker side. We already know that the JVM driver is responsible for starting
the worker processes, and you will always have a JVM worker on the executor nodes. That's the Spark standard.
Since your PySpark APIs are internally invoking the Spark Java APIs on the driver, it is the Java
code that needs to be serialized and sent to the workers for execution. So, in a typical case, your
API calls are translated into Java APIs, and Spark can execute everything in the JVM worker.
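For example, a job like the following sketch stays on the JVM side at the workers, because every operation maps to a built-in Spark expression. The data and column names are made up for illustration.

```python
# A sketch of a job that stays entirely inside the JVM at the workers:
# every operation below maps to a built-in Spark expression, so no
# Python worker process is needed.  Data and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("jvm-only").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

result = (df
          .withColumn("name_upper", F.upper(F.col("name")))  # built-in expression
          .filter(F.col("age") > 40))                        # built-in expression

result.show()
spark.stop()
```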
However, the problem starts when you write pure Python code that doesn't have a corresponding
Spark Java API. One example is using a third-party Python library that doesn't have a similar Java
implementation. Another example is creating a UDF in Python that uses pure Python code and calling
that UDF in your application.
In these two scenarios, your Python code doesn't have a Java API. So, your driver can't invoke
a Java method using a socket-based API call. This scenario doesn't arise if you restrict your code
to the PySpark APIs. However, if you create a Python UDF or use a third-party Python library that
doesn't have a PySpark implementation, you will have to deal with the Python code problem.
What do you think? What will happen in that situation?
The driver will serialize the Python code and send it to the worker. The worker will start a
Python worker process and execute the Python code.
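Here is a small sketch of that scenario. The shout function below is a hypothetical piece of pure Python logic with no corresponding Spark Java API, so Spark has to run it in Python worker processes on the executors.

```python
# A sketch of the Python-worker scenario: the UDF body is plain Python,
# so Spark ships it to the executors and runs it in Python worker
# processes, moving each row between the JVM and Python.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("python-udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Hypothetical pure-Python logic with no built-in Spark equivalent.
def shout(name):
    return name.upper() + "!"

shout_udf = F.udf(shout, StringType())

# Each row is serialized from the JVM worker to a Python worker and back.
df.withColumn("greeting", shout_udf(F.col("name"))).show()

spark.stop()
```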
This Python worker process is an additional overhead for the worker node because you already have
a JVM worker as well.
Invoking one or more Python workers and coordinating work between the JVM and the Python processes
is additional overhead for Spark.
The Spark creators claim that this kind of coordination is not a big deal, and for a typical
Spark job, the communication effort has a negligible impact on performance. However, if data needs
to move between these worker processes, it may severely impact the performance. If you have a
Python function that needs to work on each row of a partition, then the JVM worker must serialize
the data and send it to the Python worker. The Python worker will transform each row and serialize
the output back to the JVM. The most critical point is that this problem scales with the volume of
the data. The JVM and the Python process also compete for memory on the same worker node. If
you don't have enough memory for the Python worker, you may even start seeing exceptions. That is
where the Python and R languages start bleeding on a Spark cluster.
However, if you have created your UDF in Scala or used a Java/Scala library, everything happens
in the JVM, and there is no need to move data from one process to another.
Great! I hope you now have some understanding of the language selection. You can use this knowledge
to make the right decision. However, the bottom line is that a Spark application is a set of jobs,
and you can't just settle for a single language for all the tasks. You should take that decision
on a job-by-job basis.
Thank you for watching Learning Journal. Keep learning and keep growing.