Serialization Formats
Schema-Based Data Serialization
★ 4.5
Optimized Row Columnar Format
★ 4.3
pip install avro-python3pip install pyorcpip install avro-python3pip install pyorcPython data engineers use `fastavro` to serialize and deserialize Avro records in Kafka-based pipelines. Schema Registry integration means Python producers validate records against the registered schema before publishing, and consumers deserialize binary Avro messages back to Python dicts automatically. Avro's compact binary encoding reduces Kafka topic storage costs compared to JSON.
Python data engineers use `pyorc` to read and write ORC files when working with Hive-based data lake environments where ORC is the standard format. In PySpark pipelines, ORC is specified as the write format for tables that will be queried via HiveQL with ACID upsert support — Spark handles ORC read/write transparently via the DataFrame API.
Individual Tool Pages