Serialization Formats
Optimized Row Columnar Format
★ 4.3
Columnar Storage Format
★ 4.8
Install: `pip install pyorc` (ORC), `pip install pyarrow` (Parquet).
Python data engineers use `pyorc` to read and write ORC files in Hive-based data lake environments, where ORC is the standard format. In PySpark pipelines, ORC is specified as the write format for tables that will be queried via HiveQL, particularly where Hive's ACID (transactional) tables are used for upserts, since those tables require ORC. Spark reads and writes ORC directly through the DataFrame API.
Parquet is the standard output format for Python data pipelines writing to a data lake. Engineers use `DataFrame.to_parquet()` in pandas or `pyarrow.parquet.write_table()` to write DataFrames as compressed columnar files. Reading is equally simple: `pd.read_parquet('s3://bucket/prefix/')` reads an entire partitioned dataset (S3 paths require an fsspec backend such as `s3fs`), and DuckDB and Athena can query Parquet files in place without first loading them into a separate store.
Individual Tool Pages