File Systems & Storage
Hadoop Distributed File System
★ 4.4
Cloud-Native File System
★ 4.3
Install (HDFS): `pip install hdfs`
Install (JuiceFS): N/A; CLI binary, see juicefs.com

Python data engineers interact with HDFS using `pyarrow.fs.HadoopFileSystem` or the `hdfs` Python client. PySpark accesses HDFS transparently via `spark.read.parquet('hdfs:///path/')`; the cluster configuration points Spark to the NameNode. Python scripts that manage file operations (listing, deleting, moving files) either call `hdfs dfs` commands through the `subprocess` module or send requests to the WebHDFS REST API.
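The two file-management routes mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a complete client: the helper names (`hdfs_dfs_argv`, `run_hdfs_dfs`, `webhdfs_url`) are hypothetical, the NameNode address is a placeholder, and `run_hdfs_dfs` assumes the `hdfs` binary is on `PATH` and configured for your cluster.

```python
from __future__ import annotations

import subprocess
from urllib.parse import quote, urlencode


def hdfs_dfs_argv(*args: str) -> list[str]:
    """Build the argument list for an `hdfs dfs` invocation (no shell involved)."""
    return ["hdfs", "dfs", *args]


def run_hdfs_dfs(*args: str) -> str:
    """Run `hdfs dfs <args>` and return its stdout.

    Assumes the Hadoop CLI is installed; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(
        hdfs_dfs_argv(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


def webhdfs_url(namenode: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL of the form http://<host:port>/webhdfs/v1<path>?op=<OP>.

    `namenode` is a placeholder "host:port" string; the actual port depends
    on how the cluster is configured.
    """
    query = urlencode({"op": op, **params})
    return f"http://{namenode}/webhdfs/v1{quote(path)}?{query}"
```

With a configured cluster, `run_hdfs_dfs("-ls", "/data/raw")` would list a directory through the CLI, while `webhdfs_url("namenode.example:9870", "/data/raw", "LISTSTATUS")` produces a URL that can be fetched with any HTTP client. Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.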
Python data engineers use JuiceFS to mount cloud object storage as a local POSIX file system. Pipeline code that reads and writes local files then works unchanged with S3 or GCS as the backing store, without boto3 or other cloud-specific SDKs. PySpark jobs on JuiceFS benefit from its Hadoop-compatible interface and from its local cache when the same datasets are read repeatedly.
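Because a mounted JuiceFS volume looks like an ordinary directory, pipeline code needs nothing beyond the standard file APIs. The sketch below assumes a volume has already been mounted (for example with `juicefs mount`) at a hypothetical path such as `/mnt/jfs`; the function names and file layout are illustrative only.

```python
import csv
from pathlib import Path


def write_rows(base_dir, name, rows):
    """Write rows as CSV under base_dir using plain file I/O.

    base_dir can be any directory, including a JuiceFS mount point;
    no object-storage SDK is involved.
    """
    path = Path(base_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def read_rows(path):
    """Read a CSV file back into a list of rows (each a list of strings)."""
    with Path(path).open(newline="") as f:
        return list(csv.reader(f))
```

Calling `write_rows("/mnt/jfs", "staging/events.csv", rows)` would write through the mount, and the same code runs against a local temporary directory in tests, which is one practical advantage of the POSIX interface.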
Individual Tool Pages