Apache Parquet
Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval, especially for analytical workloads. It is open source and widely supported across big data tools.
TL;DR
- Parquet is optimized for analytics: columnar, compressed, self-describing, and schema-aware.
- It significantly reduces storage and speeds up columnar queries compared to CSV/JSON.
- Works best when data is large, structured, and queried by column subsets.
Core Characteristics
- Columnar Storage: Data is stored column by column, unlike row-based formats (e.g., CSV). Queries that read only a few columns can skip the rest, saving I/O and memory.
- Efficient Compression & Encoding: Each column can use a compression and encoding strategy suited to its data type (e.g., dictionary encoding for strings, run-length encoding for repeated values).
- Schema Evolution: The file includes the schema. You can add or remove columns later without rewriting the entire dataset.
- Cross-Language & Tool Support: Supported in Apache Spark, Hive, Presto/Trino, Drill, Flink, Dremio, BigQuery, and libraries like PyArrow and pandas.
- Self-Describing: Metadata (schema, column statistics, min/max values, compression type) is embedded in the file, helping query engines optimize access (see the sketch after this list).
Advantages Compared to Other Formats
| Feature | Parquet | CSV | JSON | Avro |
|---|---|---|---|---|
| Storage layout | Columnar | Row-based | Row-based | Row-based |
| Compression efficiency | High (per-column encoding) | Low | Moderate | Moderate |
| Schema support | Yes (embedded) | No | Weak (implicit) | Strong |
| Read performance | Very fast for column subsets | Slow (full row read) | Slow (full parse) | Moderate |
| Write performance | Moderate | Fast | Fast | Fast |
| Query optimization | Predicate pushdown, statistics | None | None | Limited |
| Data type support | Rich (nested, arrays, structs) | Strings only | Nested but inefficient | Rich |
Key Takeaways:
- Parquet vs CSV: Parquet compresses far better and reads only the needed columns; a CSV reader must parse every row in full (the size sketch below makes the difference concrete).
- Parquet vs JSON: Parquet is binary and structured; JSON is text-based and slower to parse.
- Parquet vs Avro: Avro is better for streaming and row-oriented use cases; Parquet excels in analytics and querying subsets of columns.
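The compression claim is easy to check locally. A minimal sketch (the column values and file names are illustrative; the exact ratio depends on the data):

import os
import pandas as pd

# A column with heavy repetition compresses very well under
# Parquet's dictionary and run-length encodings.
df = pd.DataFrame({
    "id": range(100_000),
    "status": ["active", "inactive"] * 50_000,
})
df.to_csv("demo.csv", index=False)
df.to_parquet("demo.parquet", compression="snappy")

print("csv bytes:", os.path.getsize("demo.csv"))
print("parquet bytes:", os.path.getsize("demo.parquet"))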
Practical Examples
Python (pandas + PyArrow)
import pandas as pd

# Write to Parquet
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 92.3],
})
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy")

# Read Parquet
df2 = pd.read_parquet("data.parquet", engine="pyarrow")
print(df2)
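Column pruning and predicate pushdown are available from pandas as well: read_parquet accepts columns, and with the pyarrow engine a filters argument is forwarded to the Parquet reader. A minimal sketch reusing data.parquet from above:

import pandas as pd

# Only the listed column chunks are read from disk, and row groups
# whose min/max statistics rule out score > 90 can be skipped.
subset = pd.read_parquet(
    "data.parquet",
    engine="pyarrow",
    columns=["user_id", "score"],
    filters=[("score", ">", 90)],
)
print(subset)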
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Only user_id and score are read from storage, and the score > 90
# predicate can be pushed down to the Parquet scan.
df = spark.read.parquet("s3://my-bucket/data.parquet")
df.select("user_id", "score").filter(df.score > 90).show()
DuckDB (ad-hoc querying)
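DuckDB can run SQL directly against Parquet files without an import step. A minimal sketch using DuckDB's Python API (data.parquet reuses the pandas example above):

import duckdb

# DuckDB scans the Parquet file in place: only the referenced
# columns are read, and the WHERE clause is pushed into the scan.
con = duckdb.connect()
rows = con.execute(
    "SELECT user_id, score FROM 'data.parquet' WHERE score > 90"
).fetchall()
print(rows)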
When to Use Parquet
- Data Lakes / Warehouses (e.g., on S3, GCS, Azure Blob)
- Analytics with Spark, Trino, Presto
- ETL Pipelines needing schema evolution and compression
- Machine Learning Feature Storage
Avoid Parquet if:
- You need fast row-based writes (e.g., transactional logs).
- Data is very small and human-readable inspection matters (CSV/JSON may be simpler).
Primary / Authoritative Sources
- Apache Parquet Documentation (official): overview, file format specs, internals. https://parquet.apache.org/docs/
- Parquet format specification on GitHub (Thrift definitions, etc.): https://github.com/apache/parquet-format
- Apache Parquet, File Format section (metadata, column chunks, etc.): https://parquet.apache.org/docs/file-format/
- Apache Parquet, supported types: https://parquet.apache.org/docs/file-format/types/
Comparative / Research / Analysis Papers & Articles
- "An Empirical Evaluation of Columnar Storage Formats" (arXiv): benchmark comparing Parquet and ORC under modern workloads
- "High-Performance Data Storage: A Comparative Analysis of AVRO, Parquet, ORC" (espjeta.org): compares Parquet with Avro and ORC in big data systems
- "Big Data File Formats: Evolution, Performance, and the Rise of Columnar" (ijsat.org): history, tradeoffs, analysis
- "Mastering Big Data Formats: ORC, Parquet, Avro, Iceberg, …" (Seventh Sense Research Group): comparative analysis and decision criteria
Comparison / Tutorial Articles
- "Avro vs. Parquet: A Complete Comparison for Big Data Storage" (DataCamp blog)
- "CSV vs Parquet vs Avro: Choosing the Right Tool for the Right Job" (Medium)
- "Parquet, ORC, and Avro: The File Format Fundamentals of Big Data" (Upsolver blog)
- "Choosing the Right File Format for Big Data" (Ghost in the Data, ghostinthedata.info)
- Stack Overflow discussion: "Why Avro or Parquet faster than CSV?"
- Stack Overflow: pros and cons of Parquet vs other formats
- GeekLogBook: differences between Parquet, Avro, JSON, CSV (geeklogbook.com)