Apache Parquet
Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval, especially for analytical workloads. It is open source and widely supported across big data tools.
TL;DR
- Parquet is optimized for analytics: columnar, compressed, self-describing, and schema-aware.
- It significantly reduces storage and speeds up columnar queries compared to CSV/JSON.
- Works best when data is large, structured, and queried by column subsets.
Core Characteristics
- Columnar Storage: Data is stored column by column, unlike row-based formats (e.g., CSV). Queries that read only a few columns can skip the rest, saving I/O and memory.
- Efficient Compression & Encoding: Each column can use a compression and encoding strategy suited to its data type (e.g., dictionary encoding for strings, run-length encoding for repeated values).
- Schema Evolution: The file includes the schema. You can add or remove columns later without rewriting the entire dataset.
- Cross-Language & Tool Support: Supported in Apache Spark, Hive, Presto/Trino, Drill, Flink, Dremio, BigQuery, and libraries like PyArrow and pandas.
- Self-Describing: Metadata (schema, column statistics, min/max values, compression type) is embedded in the file, helping query engines optimize access (see the sketch after this list).
Advantages Compared to Other Formats
| Feature | Parquet | CSV | JSON | Avro |
|---|---|---|---|---|
| Storage layout | Columnar | Row-based | Row-based | Row-based |
| Compression efficiency | High (per-column encoding) | Low | Moderate | Moderate |
| Schema support | Yes (embedded) | No | Weak (implicit) | Strong |
| Read performance | Very fast for column subsets | Slow (full row read) | Slow (full parse) | Moderate |
| Write performance | Moderate | Fast | Fast | Fast |
| Query optimization | Predicate pushdown, statistics | None | None | Limited |
| Data type support | Rich (nested, arrays, structs) | Strings only | Nested but inefficient | Rich |
Key Takeaways:
- Parquet vs CSV: Parquet compresses far better and reads only the needed columns; a CSV reader must parse every row in full (the size sketch below makes the difference concrete).
- Parquet vs JSON: Parquet is binary and structured; JSON is text-based and slower to parse.
- Parquet vs Avro: Avro is better for streaming and row-oriented use cases; Parquet excels in analytics and querying subsets of columns.
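The compression claim is easy to check locally. A minimal sketch (the column values and file names are illustrative; the exact ratio depends on the data):

import os
import pandas as pd

# A column with heavy repetition compresses very well under
# Parquet's dictionary and run-length encodings.
df = pd.DataFrame({
    "id": range(100_000),
    "status": ["active", "inactive"] * 50_000,
})
df.to_csv("demo.csv", index=False)
df.to_parquet("demo.parquet", compression="snappy")

print("csv bytes:", os.path.getsize("demo.csv"))
print("parquet bytes:", os.path.getsize("demo.parquet"))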
Practical Examples
Python (pandas + PyArrow)
import pandas as pd

# Write to Parquet
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
})
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy")

# Read Parquet
df2 = pd.read_parquet("data.parquet", engine="pyarrow")
print(df2)
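Column pruning and predicate pushdown are available from pandas as well: read_parquet accepts columns, and with the pyarrow engine a filters argument is forwarded to the Parquet reader. A minimal sketch reusing data.parquet from above:

import pandas as pd

# Only the listed column chunks are read from disk, and row groups
# whose min/max statistics rule out score > 90 can be skipped.
subset = pd.read_parquet(
    "data.parquet",
    engine="pyarrow",
    columns=["user_id", "score"],
    filters=[("score", ">", 90)],
)
print(subset)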
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Only user_id and score are read from storage, and the score > 90
# predicate can be pushed down to the Parquet scan.
df = spark.read.parquet("s3://my-bucket/data.parquet")
df.select("user_id", "score").filter(df.score > 90).show()
DuckDB (ad-hoc querying)
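DuckDB can run SQL directly against Parquet files without an import step. A minimal sketch using DuckDB's Python API (data.parquet reuses the pandas example above):

import duckdb

# DuckDB scans the Parquet file in place: only the referenced
# columns are read, and the WHERE clause is pushed into the scan.
con = duckdb.connect()
rows = con.execute(
    "SELECT user_id, score FROM 'data.parquet' WHERE score > 90"
).fetchall()
print(rows)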
When to Use Parquet
- Data Lakes / Warehouses (e.g., on S3, GCS, Azure Blob)
- Analytics with Spark, Trino, Presto
- ETL Pipelines needing schema evolution and compression
- Machine Learning Feature Storage
Avoid Parquet if:
- You need fast row-based writes (e.g., transactional logs).
- Data is very small and human-readable inspection matters (CSV/JSON may be simpler).
Primary / Authoritative Sources
- Apache Parquet Documentation (official): overview, file format specs, internals. https://parquet.apache.org/docs/
- Parquet format specification on GitHub (Thrift definitions, etc.): https://github.com/apache/parquet-format
- Apache Parquet, File Format section (metadata, column chunks, etc.): https://parquet.apache.org/docs/file-format/
- Apache Parquet, supported types: https://parquet.apache.org/docs/file-format/types/
Comparative / Research / Analysis Papers & Articles
- "An Empirical Evaluation of Columnar Storage Formats" (arXiv): benchmark comparing Parquet and ORC under modern workloads
- "High-Performance Data Storage: A Comparative Analysis of AVRO, Parquet, ORC" (espjeta.org): compares Parquet with Avro and ORC in big data systems
- "Big Data File Formats: Evolution, Performance, and the Rise of Columnar" (ijsat.org): history, tradeoffs, analysis
- "Mastering Big Data Formats: ORC, Parquet, Avro, Iceberg, …" (Seventh Sense Research Group): comparative analysis and decision criteria
Comparison / Tutorial Articles
- "Avro vs. Parquet: A Complete Comparison for Big Data Storage" (DataCamp blog)
- "CSV vs Parquet vs Avro: Choosing the Right Tool for the Right Job" (Medium)
- "Parquet, ORC, and Avro: The File Format Fundamentals of Big Data" (Upsolver blog)
- "Choosing the Right File Format for Big Data" (Ghost in the Data, ghostinthedata.info)
- Stack Overflow discussion: "Why Avro or Parquet faster than CSV?"
- Stack Overflow: pros and cons of Parquet vs other formats
- GeekLogBook: differences between Parquet, Avro, JSON, CSV (geeklogbook.com)