HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware as part of the Apache Hadoop ecosystem. It provides high-throughput access to application data and is the foundation for storing massive datasets in Hadoop-based data platforms.
Python data engineers interact with HDFS using `pyarrow.fs.HadoopFileSystem` or the `hdfs` Python client. PySpark reads HDFS paths directly, for example `spark.read.parquet('hdfs:///path/')`; because the URI omits the host, Spark resolves the NameNode from the cluster's Hadoop configuration. Python scripts that manage files (listing, deleting, moving) typically either call `hdfs dfs` commands through the `subprocess` module or send HTTP requests to the WebHDFS REST API.
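The two scripting routes mentioned above can be sketched roughly as follows. This is a minimal illustration, not a definitive client: the helper names (`hdfs_dfs_command`, `run_hdfs_dfs`, `webhdfs_url`) are hypothetical, the host name is a placeholder, and port 9870 is assumed to be the NameNode HTTP port (the Hadoop 3 default; your cluster may differ). Running the `hdfs dfs` helper requires the `hdfs` CLI on `PATH`.

```python
import subprocess
from urllib.parse import quote, urlencode


def hdfs_dfs_command(*args):
    """Build the argv list for an `hdfs dfs` call, e.g. ('-ls', '/data')."""
    return ["hdfs", "dfs", *args]


def run_hdfs_dfs(*args, runner=subprocess.run):
    """Run `hdfs dfs <args>` and return its stdout as a list of lines.

    `runner` is injectable so the function can be exercised without a cluster.
    With the default runner, a non-zero exit raises CalledProcessError.
    """
    result = runner(
        hdfs_dfs_command(*args),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL of the form http://<host>:<port>/webhdfs/v1<path>?op=<OP>.

    `path` must be an absolute HDFS path such as '/data/raw'.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Example usage (placeholder host; assumes a reachable cluster):
#   run_hdfs_dfs("-ls", "/data/raw")
#   webhdfs_url("namenode.example.com", 9870, "/data/raw", "LISTSTATUS")
```

Passing the command as an argument list rather than a shell string avoids quoting problems with unusual file names. The URL produced by `webhdfs_url` can be fetched with any HTTP client; authentication and redirects to DataNodes are deliberately left out of this sketch.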
HDFS is free to use: it is open-source software distributed as part of Apache Hadoop under the Apache License 2.0.
HDFS is listed under the File Systems & Storage category on Python Data Engineering.
Related
| Tool | Description | Pricing | Rating |
|---|---|---|---|
| Alluxio | Memory-Centric Storage System | Freemium | ★ 4.2 |
| Ceph | Unified Distributed Storage | Free | ★ 4.4 |
| GlusterFS | Scalable Network File System | Free | ★ 4.0 |