// file-systems-storage

HDFS

Hadoop Distributed File System

About HDFS

A distributed file system designed to run on commodity hardware as part of the Apache Hadoop ecosystem. HDFS provides high-throughput access to application data and is the foundation for storing massive datasets in Hadoop-based data platforms.

Key Features

1Distributed file system designed to store very large files across commodity hardware
2Block replication (default 3x) for fault tolerance across DataNodes
3NameNode maintains the filesystem namespace and block mappings
4High throughput reads optimized for sequential streaming access patterns
5Foundation for Hadoop ecosystem: Hive, Spark, HBase, and Sqoop all use HDFS

How Python Data Engineers Use HDFS

Python data engineers interact with HDFS using `pyarrow.fs.HadoopFileSystem` or the `hdfs` Python client. PySpark accesses HDFS transparently via `spark.read.parquet('hdfs:///path/')` — the cluster configuration points Spark to the NameNode. Python scripts that manage file operations (listing, deleting, moving files) use the `subprocess` module to call `hdfs dfs` commands or the WebHDFS REST API.

Frequently Asked Questions

What is HDFS used for?▾

Is HDFS free to use?▾

Yes, HDFS is free to use.

What category does HDFS belong to?▾

HDFS is listed under the File Systems & Storage category on Python Data Engineering.

Verified Listing

Visit Website

// contains affiliate links

Details

Similar File Systems & Storage Tools

3 tools

Tool	Pricing	Rating
AL Alluxio Memory-Centric Storage System	Freemium	★ 4.2	→
CE CEPH Unified Distributed Storage	Free	★ 4.4	→
GL GlusterFS Scalable Network File System	Free	★ 4.0	→