ETL Frameworks
pandas: Data Manipulation & Analysis Library (★ 4.9)
PySpark: Python API for Apache Spark (★ 4.8)
pip install pandas
pip install pyspark

Pandas is the go-to tool for data wrangling in Python pipelines. Engineers use DataFrames to load raw data from CSVs or databases, clean and transform it (renaming columns, filtering rows, filling nulls), then write results to Parquet or a data warehouse. It is the standard intermediate layer between data ingestion and downstream processing.
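The load, clean, and write flow described above can be sketched roughly as follows. This is a minimal illustration, not a reference pipeline: the inline CSV data, the column names, and the `clean_orders` helper are all hypothetical, and the cleaning rules (treat a missing amount as 0, drop negative amounts) are assumptions chosen only to show each step.

```python
import io

import pandas as pd

# Hypothetical raw input; a real pipeline would pass a file path or query a database.
RAW_CSV = io.StringIO(
    "Order ID,Customer,Amount\n"
    "1,alice,10.5\n"
    "2,bob,\n"
    "3,carol,-4.0\n"
)


def clean_orders(source) -> pd.DataFrame:
    """Load raw order rows and apply the three cleaning steps named in the text."""
    df = pd.read_csv(source)
    # Rename columns to snake_case names (assumed naming convention).
    df = df.rename(
        columns={"Order ID": "order_id", "Customer": "customer", "Amount": "amount"}
    )
    # Fill nulls: treat a missing amount as 0.0 (an assumption for this sketch).
    df["amount"] = df["amount"].fillna(0.0)
    # Filter rows: drop negative amounts.
    df = df[df["amount"] >= 0]
    return df.reset_index(drop=True)


orders = clean_orders(RAW_CSV)
print(orders)

# Writing the result requires a Parquet engine such as pyarrow to be installed:
# orders.to_parquet("orders.parquet", index=False)
```

The Parquet write is left commented out so the sketch runs with pandas alone; in a real pipeline that line (or a warehouse loader) would be the final step.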
PySpark is the standard Python interface for large-scale ETL on Hadoop and cloud clusters. Data engineers write transformation logic using the DataFrame API — reading from S3 or Hive, applying joins and aggregations, then writing to Delta Lake or a data warehouse — with Spark distributing the work across hundreds of nodes.
Individual Tool Pages