Understanding HDFS from the Perspective of Supply and Demand

4 min read · Feb 23, 2025

The Hadoop Distributed File System (HDFS) is designed to address the challenges of storing and processing large-scale data efficiently. From a supply-demand perspective, the demand arises from the need to handle massive data volumes, ensure fault tolerance, and automate storage management. The supply is provided by HDFS, which distributes data across multiple machines, ensures reliability through replication, and enables efficient access to large files.

HDFS Architecture and Solutions

Storage Challenges and Solutions

1. Storing Large Files That Exceed a Single Disk Capacity

Large files that cannot fit on a single disk must be stored efficiently.

Solution: HDFS splits large files into fixed-size blocks (default 128MB in Hadoop 2.x) and distributes these blocks across multiple machines (DataNodes).
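The block layout described above can be sketched in a few lines. This is a hypothetical illustration, not Hadoop's actual API; `split_into_blocks` is an assumed helper name.

```python
# Hypothetical sketch: how a large file maps onto fixed-size HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block only occupies what it needs
    return sizes

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the final block is not padded to 128 MB: a 300 MB file consumes 300 MB of raw storage (before replication), not 384 MB.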

2. Optimizing Disk Utilization and Reducing Wasted Space

Storing multiple small files individually can cause inefficiencies.

Solution: The block-based architecture of HDFS ensures better utilization of disk space by storing multiple blocks per disk, rather than requiring entire files to fit within a single storage unit.

3. Ensuring Fault Tolerance and Data Availability

If a single block stored on one drive is lost, the entire file could become unreadable.

Solution: HDFS replicates each block across multiple DataNodes. By default, each block has three copies stored on different machines. If one copy is lost due to a hardware failure, other copies remain accessible.
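A simplified placement policy can illustrate the idea. Real HDFS placement is rack-aware; this hypothetical sketch only guarantees that no two replicas of a block land on the same DataNode.

```python
import zlib

REPLICATION = 3  # HDFS default replication factor

def place_replicas(block_id: str, datanodes: list[str],
                   factor: int = REPLICATION) -> list[str]:
    """Pick `factor` distinct DataNodes for one block (illustrative, not rack-aware)."""
    if len(datanodes) < factor:
        raise ValueError("not enough DataNodes to satisfy the replication factor")
    # Rotate the node list per block so replicas spread across the cluster.
    start = zlib.crc32(block_id.encode()) % len(datanodes)
    rotated = datanodes[start:] + datanodes[:start]
    return rotated[:factor]

nodes = ["dn1", "dn2", "dn3", "dn4"]
replicas = place_replicas("blk_0001", nodes)
```

With three distinct replicas, losing any single machine still leaves two readable copies, which is exactly the fault-tolerance property described above.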

4. Efficiently Managing File Storage and Retrieval

A mechanism is required to track where file blocks are stored and how they are retrieved.

Solution: HDFS employs a master-slave architecture with a NameNode (master) that maintains the file system’s metadata, including file paths, block locations, and access permissions. DataNodes (slaves) store the actual blocks and periodically report their status to the NameNode.

Role of the NameNode and DataNodes

NameNode:

• Stores and manages metadata, such as file paths, file sizes, block locations, and permissions.

• Maintains a hierarchical file system structure (similar to a directory tree).

• Does not store actual file data, only metadata about stored blocks.

• Uses an in-memory metadata store for faster access.

DataNodes:

• Store actual data blocks.

• Periodically send “heartbeats” and block reports to the NameNode to confirm availability.

• Perform read and write operations requested by clients.
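The heartbeat mechanism above can be sketched as follows. The class, method names, and timeout value are illustrative assumptions; real HDFS derives its dead-node timeout from configuration settings.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not the HDFS default

class NameNodeSketch:
    """Hypothetical sketch: track DataNode liveness from heartbeat timestamps."""

    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}

    def receive_heartbeat(self, datanode: str, now: float) -> None:
        self.last_heartbeat[datanode] = now

    def live_datanodes(self, now: float) -> list[str]:
        # A DataNode is considered live only if its last heartbeat is recent.
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t <= HEARTBEAT_TIMEOUT]

nn = NameNodeSketch()
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=5.0)
# At t=12, dn1's heartbeat is stale (12 s ago) while dn2's is fresh (7 s ago).
live = nn.live_datanodes(12.0)
```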

Ensuring High Availability and Recovery

1. Protecting NameNode Data Loss

Since the NameNode holds critical metadata in memory, a failure can cause system-wide issues.

Solution: HDFS writes a full snapshot of metadata, called fsimage, to persistent storage.

2. Handling Metadata Changes in Real-Time

If a failure occurs right after writing fsimage to disk but before the next snapshot, recent changes could be lost.

Solution: HDFS maintains an EditLog, which records metadata updates. Upon restart, the NameNode loads fsimage and replays the EditLog to restore the most recent state.
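The recovery sequence — load the snapshot, then replay the log — can be sketched with a toy metadata model. The operation tuples and the `recover` helper are hypothetical simplifications of the real fsimage/EditLog formats.

```python
def recover(fsimage: dict[str, int],
            edit_log: list[tuple[str, str, int]]) -> dict[str, int]:
    """Rebuild metadata: start from the snapshot, then apply each logged op.

    Each log entry is (op, path, size) where op is 'create' or 'delete'.
    """
    state = dict(fsimage)  # start from the persisted snapshot
    for op, path, size in edit_log:
        if op == "create":
            state[path] = size
        elif op == "delete":
            state.pop(path, None)
    return state

snapshot = {"/data/a.txt": 100}
log = [("create", "/data/b.txt", 200), ("delete", "/data/a.txt", 0)]
state = recover(snapshot, log)  # changes made after the snapshot survive
```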

3. Ensuring NameNode High Availability (HA)

If the NameNode fails, the entire file system can become inaccessible.

Solution: Modern Hadoop deployments use Active and Standby NameNodes to ensure high availability. The active NameNode processes requests while the standby NameNode synchronizes metadata and can take over in case of failure.
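The failover hand-off can be reduced to a small state machine. This is a deliberately minimal sketch; real HA setups involve shared edit logs and fencing, which are omitted here.

```python
class HAClusterSketch:
    """Hypothetical sketch: active/standby NameNode roles with failover."""

    def __init__(self, active: str, standby: str) -> None:
        self.active, self.standby = active, standby

    def failover(self) -> str:
        # Promote the standby; the former active drops to standby.
        self.active, self.standby = self.standby, self.active
        return self.active

cluster = HAClusterSketch(active="nn1", standby="nn2")
new_active = cluster.failover()  # nn2 takes over serving requests
```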

4. Addressing NameNode Scalability

A single NameNode may not be sufficient for very large-scale clusters.

Solution: HDFS Federation allows multiple NameNodes, each managing a separate namespace and block pool, to improve scalability and eliminate bottlenecks.
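Federation routing can be pictured as a mount table that maps namespace prefixes to NameNodes, similar in spirit to ViewFs. The table contents and the `route` helper are illustrative assumptions.

```python
# Hypothetical mount table: each NameNode owns a slice of the namespace.
MOUNT_TABLE = {"/user": "nn1", "/logs": "nn2"}

def route(path: str, mounts: dict[str, str] = MOUNT_TABLE) -> str:
    """Resolve a path to the NameNode owning its namespace (longest prefix wins)."""
    for prefix, namenode in sorted(mounts.items(),
                                   key=lambda kv: len(kv[0]), reverse=True):
        if path.startswith(prefix):
            return namenode
    raise KeyError(f"no NameNode mounted for {path}")

owner = route("/logs/2025/02/23.log")  # handled by nn2's namespace
```

Because each NameNode holds metadata only for its own block pool, adding a NameNode adds metadata capacity instead of overloading a single master.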

How HDFS Handles Data Storage and Retrieval

Writing a Large File to HDFS

1. A client sends a request to the NameNode via the HDFS API.

2. The NameNode checks the file path, verifies user permissions, and updates metadata in the EditLog.

3. The NameNode selects suitable DataNodes for storing the file’s blocks based on system load and fault tolerance policies.

4. The client writes data directly to the assigned DataNodes. Each DataNode stores the data and replicates it according to the replication factor.

5. DataNodes report their block status to the NameNode periodically. The NameNode updates the EditLog and periodically merges it into the fsimage.
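Step 4's replication pipeline can be sketched as a chain of stores-and-forwards: the client sends a block to the first DataNode, which passes it along until every node in the pipeline holds a copy. The dictionary-backed "cluster" here is a stand-in for real DataNode disks.

```python
def pipeline_write(block: bytes, pipeline: list[str],
                   storage: dict[str, list[bytes]]) -> None:
    """Hypothetical sketch: each node stores the block, then forwards it on."""
    for node in pipeline:
        storage.setdefault(node, []).append(block)

cluster: dict[str, list[bytes]] = {}
# The NameNode picked these three nodes; the client streams to the first,
# and the data propagates down the chain.
pipeline_write(b"block-data", ["dn1", "dn2", "dn3"], cluster)
```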

Reading a Large File from HDFS

1. A client requests a file from the NameNode.

2. The NameNode returns a list of blocks and the addresses of DataNodes storing them.

3. The client establishes TCP connections with the nearest DataNodes holding each block, allowing different blocks to be read in parallel.

4. Data integrity is verified using checksums stored separately from the data blocks. If a block is corrupted, the client requests another copy from a different DataNode.

5. The client reassembles the blocks to reconstruct the original file.
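Steps 4 and 5 can be sketched together: verify each block against a stored checksum, fall back to another replica on corruption, then concatenate the blocks. CRC32 stands in here for HDFS's per-chunk checksums; the function name is an assumption.

```python
import zlib

def read_file(blocks: list[list[bytes]], checksums: list[int]) -> bytes:
    """Hypothetical sketch: blocks[i] lists the replicas available for block i."""
    out = []
    for replicas, expected in zip(blocks, checksums):
        for replica in replicas:
            if zlib.crc32(replica) == expected:  # integrity check passed
                out.append(replica)
                break  # first healthy replica wins
        else:
            raise IOError("all replicas of a block are corrupted")
    return b"".join(out)  # reassemble the original file

good = b"hello "
checksums = [zlib.crc32(good), zlib.crc32(b"world")]
# Block 2's first replica is corrupted; the client falls back to the second.
data = read_file([[good], [b"XXXXX", b"world"]], checksums)
```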

Bonus:

HDFS is not as dominant as it was in the early days of big data, and its decline in popularity can be attributed to several key factors:

  1. Cloud storage solutions: services such as Amazon S3 and Google Cloud Storage (GCS) offer automatic scaling, pay-as-you-go pricing, and easier integration with cloud streaming services like AWS Kinesis and Google Pub/Sub. As object stores, S3 and GCS provide lower storage costs, decoupled compute and storage, and a richer API ecosystem than a distributed file system. Much of this storage now underpins data lakes, which support schema-on-read, multi-format storage, and AI/ML workloads, and integrate with modern query engines such as Apache Spark, Presto, Trino, and Google BigQuery.
  2. Operational overhead: HDFS requires complex NameNode administration, along with substantial security and compliance management.

Written by Sean Zhang

Data Science | Machine Learning | Data Engineer