In the previous session, I introduced you to Hadoop and talked about some history. We learned about three core components of Hadoop.
- Hadoop Distributed File System (HDFS)
- MapReduce framework
- YARN
I intend to cover all of these in detail. So, let’s start with the first one.
What is HDFS?
The H stands for Hadoop, and DFS stands for distributed file system. If you have been using computers, you already know what a filesystem is. So, let me ask you a simple question. What can you do with a filesystem? You know the answer, right?
A filesystem allows you to create directories and files. Then you can perform some operations on them, like copying them from one location to another, renaming them, moving them, or deleting them. There are many other services that a filesystem supports.
HDFS is also a filesystem. So, the core purpose of HDFS is to provide you with file management services. All that you do with HDFS is create some directories and store your data in files. That’s it.
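If you are curious what that looks like in practice, here is a minimal sketch using Hadoop's Java FileSystem API (the hadoop-client library). The cluster address and the paths are hypothetical placeholders, not values from this course.

```java
// A minimal sketch of everyday HDFS file management using the Hadoop Java API.
// The cluster address and all paths below are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo/reports");

            fs.mkdirs(dir);                                       // create a directory
            fs.copyFromLocalFile(new Path("sales.csv"),           // upload a local file
                                 new Path(dir, "sales.csv"));
            fs.rename(new Path(dir, "sales.csv"),                 // rename it
                      new Path(dir, "sales-2019.csv"));
            fs.delete(new Path(dir, "sales-2019.csv"), false);    // delete it (non-recursive)
        }
    }
}
```

Notice that these are the same everyday operations you already know from any filesystem: create a directory, copy a file in, rename it, and delete it.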
Why is HDFS Great?
You might be wondering: if it is just a filesystem, then why is it so popular? What was the big thing that Google published in their paper? Why build a new filesystem when we already had so many of them? I mean, every operating system comes with a default filesystem.
Well, HDFS is not an ordinary filesystem. The creators of Hadoop designed it with some capabilities and features that make it unique.
So, what are those features? In this video, I will quickly touch upon the following features that make HDFS a powerful and unique filesystem.
- Distributed
- Scalable
- Cost-effective
- Fault-tolerant
- High throughput
Horizontal Scaling vs Vertical Scaling
What do these features mean, and why do we need them? Let’s try to understand that.
If your data grows larger than the storage capacity of your computer, how would you store it?
Well, if I fill my disk, I will buy a new one of higher capacity. If I have an extra slot, I will add another drive to my computer.
That’s a reasonable answer. This approach is known as vertical scaling. When you scale up the capacity of a single system, we call it vertical scaling.
Most enterprises took the same approach, and this method worked for years. But think of a search engine crawler. The job of a search engine crawler is to read the internet and save it. You buy the largest system available in the market, and the crawler fills it in less than a year. Now you are dependent on hardware manufacturers to build a bigger machine. Big data smashes all your capacity planning.
To handle internet-scale data, we need to take an approach similar to the web itself: the network approach. The solution is simple. Instead of relying on a single giant machine, use a network of several smaller machines. When you consume the combined storage capacity of your network, buy a few more cheap computers and add them to the cluster. This approach is called horizontal scaling. The idea is fantastic. But to make it workable, we need software that combines the storage capacity of the entire network into a single unit. As users, we just want to look at it as a single large disk.
HDFS is a distributed file system
We needed a network-based filesystem, and HDFS is precisely that: a network-based, or distributed, filesystem. You can create a file of 100 TB using HDFS, and you don’t have to care about how HDFS stores that data on the individual computers in the network. The entire storage capacity of the infrastructure looks like a single disk, and we use it as if we were using a single computer.
If you understand this idea, all other HDFS features are straightforward.
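If you want to see that single-disk illusion for yourself, here is a small sketch. It assumes your Hadoop configuration is on the classpath and that a file already exists at the hypothetical path shown; it asks HDFS where the blocks of that one logical file actually live.

```java
// A small sketch: one logical HDFS file, many physical blocks spread over datanodes.
// The file path is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/user/demo/big-dataset.csv");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // Ask HDFS which hosts store each block of this single logical file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " -> hosts " + String.join(", ", block.getHosts()));
            }
        }
    }
}
```

You work with one path, while the data behind it is scattered across the machines in the cluster.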
HDFS is Scalable
It is scalable. I already explained that. When you need more storage, buy some more computers and add them to the network. So HDFS is horizontally scalable. You will never run out of space.
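Here is a minimal sketch of that idea, assuming your Hadoop configuration is on the classpath. The same call reports the combined capacity of all the datanodes as if it were one disk, and the number simply grows as you add more computers to the cluster.

```java
// A minimal sketch: the combined capacity of all datanodes is reported as one filesystem.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FsStatus status = fs.getStatus();   // aggregate view of the whole cluster
            System.out.printf("capacity: %d bytes, used: %d bytes, remaining: %d bytes%n",
                    status.getCapacity(), status.getUsed(), status.getRemaining());
        }
    }
}
```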
HDFS is cost-effective
It is cost-effective because you can start as small as a single computer and then scale as and when you need more capacity. One more important point is that you don’t need to buy high-end, expensive server machines. You can get reasonable ones. People call this commodity hardware: affordable and readily available PC-grade systems.
HDFS is fault tolerant
The next is fault tolerance. When you create a network using hundreds of commodity machines, it is likely that something breaks every month or maybe every week. A computer crashes, a network switch fails, or a disk fails. Anything can happen. HDFS is capable of tolerating such failures. Even if a drive crashes or a computer breaks, your HDFS system will keep working without any problem. Your system will remain functional, and you won’t lose your data. As an end user, you may not even realize that there is a fault behind the scenes. We will learn more details about how HDFS achieves all of this in upcoming sessions.
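We will cover the mechanics in later sessions, but one fact worth knowing now is that HDFS keeps multiple copies (replicas) of each block of a file. Here is a minimal sketch, assuming a hypothetical file path, that inspects and changes a file's replication factor.

```java
// A minimal sketch that inspects and sets a file's replication factor.
// The path and the chosen replication count are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/user/demo/important.csv");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("current replication: " + status.getReplication());

            // Ask HDFS to keep three copies of this file's blocks (a hypothetical choice).
            fs.setReplication(file, (short) 3);
        }
    }
}
```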
HDFS offers high throughput
The last one is high throughput. If you have been using data systems, you may have already heard these two terms.
- Latency
- Throughput
Latency is the time to get the first record. Latency is critical for interactive systems, where a user clicks a button and waits to get the response back.
Throughput is different. Throughput is the number of records processed per unit of time. So, if your goal is to handle a large volume of data, your focus should be on getting the highest throughput rather than the lowest latency.
The focus of HDFS is to maximize throughput; it was a key design goal. That’s probably why Hadoop is not an excellent choice for interactive requirements: it doesn’t offer you the best possible latency. It gives you excellent throughput and hence minimizes the total time to process your large data set.
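If you want to put a number on throughput, here is a rough sketch that times a sequential read of a hypothetical HDFS file and reports megabytes per second. The actual figure you see depends entirely on your cluster and network.

```java
// A rough sketch that measures sequential read throughput from an HDFS file.
// The file path is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadThroughput {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration());
             FSDataInputStream in = fs.open(new Path("/user/demo/big-dataset.csv"))) {

            byte[] buffer = new byte[8 * 1024 * 1024];   // 8 MB read buffer
            long totalBytes = 0;
            long start = System.nanoTime();

            int read;
            while ((read = in.read(buffer)) > 0) {
                totalBytes += read;                      // count bytes, ignore content
            }

            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("read %d bytes in %.1f s -> %.1f MB/s%n",
                    totalBytes, seconds, totalBytes / seconds / (1024 * 1024));
        }
    }
}
```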
Great! In this session, I tried to answer the following questions.
- What is HDFS?
- What can we do with HDFS?
- What are the distinctive features of HDFS?
I hope you can answer these questions.
Thank you for watching Learning Journal. Keep learning and keep growing.