Serialization Formats
Columnar Storage Format
★ 4.8
Google's Data Interchange Format
★ 4.7
pip install pyarrowpip install protobufpip install pyarrowpip install protobufParquet is the standard output format for Python data pipelines writing to a data lake. Engineers use `pandas.to_parquet()` or `pyarrow.parquet.write_table()` to write DataFrames as efficiently compressed columnar files. Reading is equally simple — `pd.read_parquet('s3://bucket/prefix/')` reads an entire partitioned dataset, with DuckDB and Athena capable of querying Parquet files directly without loading.
Python data engineers use `protobuf` (the `google.protobuf` package) to serialize and deserialize structured messages in Kafka topics and gRPC services. Proto schemas define the contract between Python producers and consumers — `protoc` generates Python classes from `.proto` files, and engineers call `.SerializeToString()` and `ParseFromString()` to encode and decode messages efficiently.
Individual Tool Pages