File Systems & Storage
Unified Distributed Storage
★ 4.4
Hadoop Distributed File System
★ 4.4
`pip install ceph`
`pip install hdfs`

Python data engineers in on-premise or private cloud environments use Ceph's S3-compatible RADOS Gateway as a drop-in replacement for AWS S3: boto3 and awswrangler keep working once they are pointed at the Ceph endpoint URL. CephFS can be mounted as a shared file system that multiple Python pipeline worker nodes read from and write to simultaneously.
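As a rough illustration of pointing boto3 at a Ceph endpoint, the sketch below builds the client arguments separately from the client itself. The helper names `ceph_s3_kwargs` and `make_ceph_s3_client` are hypothetical, and the endpoint host, port, and credentials used with them are placeholders, not values from any real cluster.

```python
from typing import Any, Dict


def ceph_s3_kwargs(endpoint_url: str, access_key: str, secret_key: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3.client("s3", ...) aimed at a Ceph RADOS Gateway.

    The only difference from a plain AWS S3 client is the explicit endpoint_url.
    """
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def make_ceph_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create a boto3 S3 client that talks to the gateway instead of AWS.

    boto3 is imported lazily so the helper above stays usable without it.
    """
    import boto3

    return boto3.client("s3", **ceph_s3_kwargs(endpoint_url, access_key, secret_key))
```

With a reachable gateway, calling `make_ceph_s3_client("http://rgw.example.internal:7480", "ACCESS_KEY", "SECRET_KEY")` (all three values are assumed placeholders) should return a client whose usual S3 methods, such as `list_buckets()` and `upload_file()`, are sent to Ceph rather than AWS.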
Python data engineers interact with HDFS using `pyarrow.fs.HadoopFileSystem` or the `hdfs` Python client. PySpark accesses HDFS transparently via `spark.read.parquet('hdfs:///path/')`, because the cluster configuration points Spark to the NameNode. Python scripts that manage file operations (listing, deleting, moving files) either call `hdfs dfs` commands through the `subprocess` module or use the WebHDFS REST API.
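The file-management routes mentioned above can be sketched roughly as follows. All function names here are hypothetical helpers, the host names are placeholders, and port 9870 is only the usual NameNode HTTP port on Hadoop 3 (an assumption; clusters may differ). The pyarrow variant additionally assumes a working libhdfs and Hadoop client configuration on the machine.

```python
import subprocess
from typing import List
from urllib.parse import quote, urlencode


def hdfs_dfs_command(*args: str) -> List[str]:
    """Build the argument list for an `hdfs dfs` invocation, e.g. ("-ls", "/data")."""
    return ["hdfs", "dfs", *args]


def run_hdfs_dfs(*args: str) -> str:
    """Run `hdfs dfs <args>` and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        hdfs_dfs_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


def webhdfs_url(namenode: str, path: str, op: str, port: int = 9870, **params: str) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path (must start with "/").

    The HTTP method still depends on the operation (GET for LISTSTATUS,
    DELETE for DELETE, PUT for RENAME); this helper only builds the URL.
    """
    query = urlencode({"op": op, **params})
    return f"http://{namenode}:{port}/webhdfs/v1{quote(path)}?{query}"


def list_with_pyarrow(host: str, port: int, path: str) -> List[str]:
    """List entries directly under `path` using pyarrow's HDFS binding."""
    from pyarrow import fs  # imported lazily; needs libhdfs at runtime

    hdfs = fs.HadoopFileSystem(host, port)
    return [info.path for info in hdfs.get_file_info(fs.FileSelector(path))]
```

For example, `run_hdfs_dfs("-ls", "/data")` shells out to the Hadoop CLI, while `webhdfs_url("namenode.example.internal", "/data", "LISTSTATUS")` yields a URL that any HTTP client can request. The subprocess route needs the `hdfs` binary on the worker's `PATH`; the REST route only needs network access to the NameNode.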
Individual Tool Pages