In this video, I want to cover fault tolerance and high availability features of Hadoop.
Let’s start with Fault Tolerance.
Hadoop Fault Tolerance
Let me ask you a simple question. If some data node fails, what will happen to your file? I mean, your data file was broken into blocks, and you stored them on different data nodes. If one of those data nodes is not working, how will you read your file? You can't read it, right? Because you lost some part of your file with that faulty machine.
Hadoop offers a straightforward solution to this problem. Create a backup copy of each block and keep it on some other data node. That's it. If one copy is not available, you can read it from the second copy. In Hadoop's terminology, this is called the Replication Factor. We can set the replication factor on a file-by-file basis, and we can change it even after creating a file in HDFS. So, if I configure the replication factor of a file as 2, HDFS will automatically make two copies of each block of this file. Hadoop will also ensure that it keeps these two copies on two different machines. We typically set the replication factor to 3. Making three copies is reasonably good. However, if your file is super critical, you can increase the replication factor to some higher value.
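As an illustration (not something shown in the video), here is a minimal Java sketch of changing the replication factor of an existing file through the HDFS FileSystem API. The file path is a hypothetical placeholder; the cluster settings are assumed to come from the usual core-site.xml and hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        // Connect to the default filesystem (HDFS) using the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/critical-report.csv"); // hypothetical existing file

        // Ask HDFS to keep three copies of every block of this file.
        // The name node will create or remove replicas to match this number.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        // Read back the replication factor currently recorded for the file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor is now: " + current);

        fs.close();
    }
}
```

You can do the same thing from the command line with hdfs dfs -setrep; the point is simply that replication is a per-file property that you can change at any time.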
Now, let’s come back to our cluster architecture. Let’s assume that we have three copies of
the
file on three different nodes as shown in the figure.
You may ask a question: what will happen if the entire rack fails? All three copies are gone, aren't they? Hadoop offers a solution to this problem as well. You can configure Hadoop to become rack-aware. Once you set up rack awareness, Hadoop will ensure that at least one copy is placed in a different rack to protect you from a rack failure.
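Rack awareness is an administrative setting rather than a developer API. As a rough sketch (the script path below is made up, and in a real cluster the property lives in core-site.xml rather than in code), you point Hadoop at a topology script that maps each data node to a rack id, and the name node then uses those rack ids when placing replicas.

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // The topology script receives data node IPs or hostnames and prints a rack id
        // such as /rack1 or /rack2 for each one. The path here is hypothetical; in
        // practice this property is set in core-site.xml on the name node.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");

        System.out.println("Topology script: " + conf.get("net.topology.script.file.name"));
    }
}
```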
HDFS takes this replication factor seriously. I mean, suppose you set the replication factor to 3 and HDFS created three copies as shown in the figure. Now, let's assume that one node fails. One failure leaves you with two copies, and another crash will leave you with just one copy, but you wanted HDFS to maintain three copies. We already know that each data node sends a periodic heartbeat to the Name Node. The Name Node will soon discover that this particular data node is not sending heartbeats, so it must have failed. In such situations, the Name Node will initiate replication of the block and bring it back to three replicas.
The point that I want to make is that the Name Node continually tracks the replication
factor
of each block and initiates replication whenever necessary. The necessity for re-replication
may
arise due to many reasons:
- A Data Node becomes unavailable
- A replica becomes corrupted
- A hard disk on a Data Node fails
- You increase the replication factor of a file
The sole purpose of replication is to give you protection against failures, but it costs you a lot of space. Making three copies of your file reduces the usable storage capacity of your cluster to one-third and increases cost. Hadoop 2.x offers storage policies to reduce this cost, and Hadoop 3.x provides erasure coding as an alternative to replication. I will cover both features in a later video.
However, replication is the traditional method for tolerating faults, and the cost concern is not too severe because disks are reasonably cheap.
High Availability in Hadoop
The next thing that I want to cover is High Availability. Let me ask you some questions.
- What is High Availability? If you don't know the answer, there is no point in learning the HA feature of Hadoop.
- If you know HA, can you explain how it is different from fault tolerance?
Puzzled? Let me explain.
You already know that HA refers to the uptime of the system. It shows the percentage of time the service is up and operational. Every enterprise desires 99.999% uptime for their critical systems, which amounts to roughly five minutes of downtime per year.
The faults that we discussed earlier, like a data node failure, a disk failure, or even a rack failure, don't bring the entire system down. Your Hadoop cluster remains available. Some files may not be available due to those faults, but we learned that replication is the solution for those faults. So, we already have protection against data node failures.
What happens when the Name Node fails?
We already learned that the name node maintains the filesystem namespace. The name node is the only machine in a Hadoop cluster that knows the list of directories and files. It also manages the file-to-block mapping. Every client interaction starts with the name node, so if the name node fails, we can't use the Hadoop cluster. We can't read or write anything to the cluster. So, the name node is the single point of failure in a Hadoop cluster.
To achieve high availability for a Hadoop cluster, we need to protect it against name node
failures.
The question is, how do we do it?
How to achieve High Availability in Hadoop?
The solution to protect yourself from any failure is a backup. That's it. In this case, we need to make a backup of two things.
- HDFS namespace information. All the information that the name node maintains should be continuously backed up at some other place, so that in the case of a failure, you have everything you need to start a new name node on a different machine.
- Standby name node machine. To minimize the time to start a new name node, we should already have a standby computer preconfigured and ready to take over the role of the name node.
Now, let's come to the namespace information backup. We already learned that the name node maintains the entire filesystem image in memory, and we call it the in-memory fsImage. The name node also maintains an edit log on its local disk. Every time the name node makes a change to the filesystem, it records that change in the editLog. The editLog is like a journal ledger of the name node. If we have the editLog, we can reconstruct the in-memory fsImage. So, we need to make a backup of the name node's editLog.
But the question is where and how?
Hadoop Quorum Journal Manager
There are many options, but the best solution offered by the Hadoop 2.x architecture is the QJM. We call it the Quorum Journal Manager. The QJM is a set of at least three machines, and each of these machines is configured to run a JournalNode daemon.
The JournalNode is very lightweight software, so you can pick any three computers from your existing cluster. We don't need a dedicated machine for the JournalNode. Once you have a QJM, the next thing is to configure the name node to write editLog entries to the QJM instead of the local disk.
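To make this concrete, here is a rough sketch of the properties that point the name node at a three-node QJM instead of its local disk. The host names and the nameservice id are hypothetical, and in a real cluster these properties live in hdfs-site.xml; they are expressed here as Java Configuration settings purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class QjmConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // The name node writes every editLog entry to all three JournalNodes
        // (8485 is the default JournalNode port) and waits for a majority of
        // them to acknowledge the write.
        conf.set("dfs.namenode.shared.edits.dir",
                "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

        // Local directory where each JournalNode daemon stores its copy of the edits.
        conf.set("dfs.journalnode.edits.dir", "/var/hadoop/journal");

        System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
    }
}
```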
You might be wondering: why do we have three Journal Nodes in a QJM? That gives us double protection. The editLog is so critical that we don't want a backup at only one other place. That's why we have three. There is also a quorum aspect: an editLog write is considered successful only when a majority of the Journal Nodes acknowledge it, so a three-node QJM keeps working even if one Journal Node fails. In case you need higher protection, you can have a QJM of five or seven nodes.
That's all about making a backup of namespace information.
Let's move to Standby name node.
Standby Hadoop Name Node
So, we add a new machine to the cluster and make it a standby name node. We also configure it to keep reading the editLog from the QJM and keep itself up to date. This configuration makes the standby ready to take over the active name node role in just a few seconds.
There are two other important things in an active-standby configuration. All the data nodes are configured to send their block reports to both name nodes. A block report is a kind of health report for the blocks maintained by the data node.
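For reference, an active-standby pair is usually described to HDFS as one logical nameservice with two name nodes behind it. The sketch below uses made-up host names and the nameservice id mycluster; in practice these properties sit in hdfs-site.xml on every node and on the clients, so Java code is used here only to show the property names.

```java
import org.apache.hadoop.conf.Configuration;

public class HaNameserviceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // One logical nameservice backed by two name nodes: nn1 and nn2.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Clients use this proxy provider to discover which of the two is currently active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        System.out.println("Name nodes: " + conf.get("dfs.ha.namenodes.mycluster"));
    }
}
```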
The final question for the HA configuration is: how does the standby know that the active name node has failed and that it should take over the active name node role?
Hadoop ZooKeeper Failover Controller
We achieve this by adding a ZooKeeper service and a failover controller (ZKFC) on each name node machine. The ZKFC of the active name node maintains a lock in ZooKeeper. The standby name node keeps trying to get the lock, but since the active name node already holds it, the standby never receives that lock. If the active name node fails or crashes, the lock held by the active name node expires, and the standby succeeds in acquiring the lock. That's it. As soon as the standby gets the lock, it starts to transition from standby to the active name node.
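The automatic part of this failover comes from two more settings, again shown as a hedged sketch with hypothetical ZooKeeper host names. In a real cluster, dfs.ha.automatic-failover.enabled goes into hdfs-site.xml and ha.zookeeper.quorum into core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class AutoFailoverSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Run a ZKFC process next to each name node and let it drive the failover.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);

        // The ZooKeeper ensemble where the active name node's ZKFC holds its lock.
        conf.set("ha.zookeeper.quorum",
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        System.out.println("ZooKeeper quorum: " + conf.get("ha.zookeeper.quorum"));
    }
}
```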
What is Secondary Hadoop Name Node
There is another component of HDFS that is worth mentioning at this point: the secondary name node.
The secondary name node is often confused with the standby name node. As I explained in this session, a standby is a backup for the name node. In the case of a name node failure, the standby should take over and perform the name node's responsibilities. However, the secondary name node takes care of a different responsibility.
Let me explain.
We already learned about the following two things.
- In-memory fsImage
- Edit log
The in-memory fsImage is the latest, up-to-date picture of the Hadoop filesystem namespace. The edit log is the persistent record of all the changes made to the filesystem. Correct?
Let me ask you a question.
What will happen if you restart your name node?
You may need to reboot the name node due to some maintenance activity. On a restart, the name node will lose the fsImage because it is an in-memory image. Right? That shouldn't be a problem because we also have the editLog. The name node will read the editLog and create a new fsImage. There isn't any problem with this approach except one.
The problem is the time it takes to read the editLog. The log keeps growing bigger and bigger every day, and the size of the log directly impacts the restart time of the name node. We don't want our name node to take an hour to start just because it is reading the edit log and rebuilding a picture of the latest filesystem state.
Hadoop Secondary Name Node Checkpoint
The secondary name node is deployed to solve this problem. The secondary name node performs a checkpoint activity every hour. During the checkpoint, the secondary name node will read the edit log, create the latest filesystem state, and save it to disk. This state is exactly the same as the in-memory fsImage, so we call it the on-disk fsImage. Once we have the on-disk fsImage, the secondary name node will truncate the edit log because all the changes are already applied. Next time, I mean after an hour, the secondary name node will read the on-disk fsImage and apply all the changes from the edit log accumulated during the last hour. It will then replace the old on-disk fsImage with the new one and truncate the edit log once again.
So, the checkpoint activity is nothing but a merge of the on-disk fsImage and the editLog. The checkpoint doesn't take much time because we have only an hour's worth of edit log to apply to the on-disk fsImage. The time to read the fsImage is short because the fsImage is small compared to the editLog, and I hope you understand the reason. The editLog is like a journal ledger that records every transaction. The fsImage is like a balance sheet that shows the final state. So, the fsImage is small.
Great! So, in the case of a restart, the name node also performs the same checkpoint activity, which it can finish in a short time.
I hope you understand this process. The whole purpose of the secondary name node and the checkpoint is to minimize the restart time of the name node.
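By the way, the "every hour" above is just the default. As a sketch (the values shown are the usual defaults), the checkpoint frequency is controlled by a time interval and a transaction count, and a checkpoint is triggered when either limit is reached. These properties normally live in hdfs-site.xml; Java code is used here only to show them.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Maximum time between checkpoints, in seconds (3600 = one hour, the default).
        conf.setLong("dfs.namenode.checkpoint.period", 3600);

        // Force an earlier checkpoint if this many uncheckpointed transactions
        // accumulate in the edit log, regardless of how much time has passed.
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000);

        System.out.println("Checkpoint every " + conf.get("dfs.namenode.checkpoint.period") + " seconds");
    }
}
```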
The secondary name node service is not required when we implement the Hadoop HA configuration. The standby name node also performs the checkpoint activity, and hence we don't need a secondary name node in an HA configuration.
I talked about the various HDFS architecture features and tried to explain how they work. A good understanding of the architecture and its functionality takes you a long way in designing and implementing a practical solution. The actual setup and configuration specifics are for admin and operations people. Since our focus is on developers and architects, we skip those details; however, I will cover some basics of installation and configuration in the following videos.
Thank you for watching Learning Journal. Keep learning and keep growing.