Advanced Transformations (Hands On)

  • This notebook demonstrates advanced Spark SQL transformations using the bookstore dataset, which includes customers, orders, and books tables.
  • Key transformations include parsing JSON strings, handling arrays, using aggregations, applying set operations, and pivoting datasets.

Parsing JSON Strings in Columns

Accessing Nested JSON Values

Spark SQL allows direct access to nested JSON string values using colon syntax:

```sql
SELECT
  profile:firstname AS first_name,
  profile:address:country AS country
FROM customers
```

  • Useful for lightweight access without transformation.
  • Values are returned as strings: untyped and unstructured.
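Outside Spark, the same key-by-key lookup can be sketched in plain Python; the `json_path` helper and the sample profile string below are illustrative, not part of any Spark API:

```python
import json

# Plain-Python sketch of what `profile:address:country` does:
# parse the JSON string, then walk it one key at a time.
def json_path(json_str, *keys):
    value = json.loads(json_str)
    for key in keys:
        value = value[key]
    return value

# Hypothetical profile value, shaped like the customers table examples.
profile = '{"firstname": "Jane", "address": {"country": "USA"}}'
print(json_path(profile, "firstname"))           # Jane
print(json_path(profile, "address", "country"))  # USA
```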

Converting JSON Strings to Struct Types

Use `from_json` to convert JSON strings into native `struct` types:

```sql
SELECT
  from_json(profile, schema_of_json('{ "firstname": "Jane", "lastname": "Doe", "gender": "F", "address": { "street": "Main", "city": "NY", "country": "USA" } }')) AS profile_struct
FROM customers
```
  • Requires a valid JSON schema (can be derived via schema_of_json()).
  • Produces a typed struct column.
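To see what "deriving a schema from a sample document" means, here is a rough plain-Python analogue of the idea behind `schema_of_json()`; the `infer_schema` function is illustrative and does not produce Spark's DDL string format:

```python
import json

# Illustrative schema inference: map each JSON value to a type name,
# recursing into nested objects (compare schema_of_json in Spark SQL).
def infer_schema(value):
    if isinstance(value, dict):
        return {k: infer_schema(v) for k, v in value.items()}
    return type(value).__name__

sample = json.loads('{"firstname": "Jane", "address": {"city": "NY"}}')
print(infer_schema(sample))
# {'firstname': 'str', 'address': {'city': 'str'}}
```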

Flattening Struct Columns

Use dot (.) notation to extract subfields from struct types:

```sql
SELECT
  profile_struct.firstname,
  profile_struct.lastname,
  profile_struct.gender,
  profile_struct.address.city
FROM parsed_customers
```

Or use `*` to expand all struct fields into individual columns:

```sql
SELECT
  profile_struct.*
FROM parsed_customers
```

Working with Arrays

Exploding Arrays into Rows

Use the explode() function to turn array elements into separate rows:

```sql
SELECT
  customer_id,
  order_id,
  explode(books) AS book
FROM orders
```

Each array element is extracted as a new row. Other columns are repeated.
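The row-multiplying behavior of `explode()` can be sketched in plain Python as a nested loop; the sample order rows below are hypothetical:

```python
# Sketch of explode(): each array element becomes its own row,
# while the non-array columns are repeated for every element.
orders = [
    {"order_id": 1, "customer_id": "C1", "books": ["B01", "B02"]},
    {"order_id": 2, "customer_id": "C2", "books": ["B03"]},
]

exploded = [
    {"order_id": o["order_id"], "customer_id": o["customer_id"], "book": b}
    for o in orders
    for b in o["books"]
]
print(len(exploded))  # 3 rows: two for order 1, one for order 2
```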

Flattening Nested Arrays

Combine collect_set() with flatten() and array_distinct() to aggregate and deduplicate array values:

```sql
SELECT
  customer_id,
  array_distinct(flatten(collect_set(books.book_id))) AS unique_books
FROM orders
GROUP BY customer_id
```

  • collect_set: gathers the per-order arrays into a nested array of distinct arrays
  • flatten: collapses nested arrays into a single array
  • array_distinct: removes duplicate values
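For one customer, the flatten-then-deduplicate pipeline behaves like this plain-Python sketch (sample book_id arrays are hypothetical):

```python
from itertools import chain

# The per-order book_id arrays gathered for one customer
# (what collect_set(books.book_id) would produce, roughly).
order_books = [["B01", "B02"], ["B02", "B03"], ["B01"]]

# flatten: collapse the nested arrays into one flat list.
flattened = list(chain.from_iterable(order_books))

# array_distinct: drop duplicates while keeping first-seen order.
unique_books = list(dict.fromkeys(flattened))
print(unique_books)  # ['B01', 'B02', 'B03']
```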

Join Operations

Spark SQL supports standard join types:

  • INNER JOIN
  • LEFT JOIN
  • RIGHT JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • LEFT ANTI JOIN
  • CROSS JOIN

Example: Enriching Orders with Book Details

```sql
SELECT
  o.customer_id,
  o.order_id,
  b.title,
  b.author,
  b.category
FROM exploded_orders o
INNER JOIN books b
ON o.book.book_id = b.book_id
```
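The matching logic of that inner join can be sketched in plain Python with a lookup dictionary; all sample rows and the book details are hypothetical:

```python
# Sketch of the INNER JOIN above: match each exploded order row
# to its book record by book_id; unmatched rows are dropped.
exploded_orders = [
    {"order_id": 1, "customer_id": "C1", "book_id": "B01"},
    {"order_id": 2, "customer_id": "C2", "book_id": "B99"},  # no match
]
books = {"B01": {"title": "Spark Basics", "author": "Doe"}}

joined = [
    {**o, **books[o["book_id"]]}     # merge order and book columns
    for o in exploded_orders
    if o["book_id"] in books         # inner join keeps matches only
]
print(len(joined))  # 1
```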

Set Operations

UNION

Combines rows from two datasets, removing duplicates (use UNION ALL to keep them):

```sql
SELECT * FROM orders
UNION
SELECT * FROM orders_updates
```

INTERSECT

Returns only records present in both datasets:

```sql
SELECT * FROM orders
INTERSECT
SELECT * FROM orders_updates
```

EXCEPT

Returns records present in the first dataset but not in the second:

```sql
SELECT * FROM orders
EXCEPT
SELECT * FROM orders_updates
```
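All three set operations map directly onto Python's set operators, as in this sketch (the order ids are hypothetical):

```python
# UNION / INTERSECT / EXCEPT over two sets of order ids.
orders = {"O1", "O2", "O3"}
orders_updates = {"O3", "O4"}

print(sorted(orders | orders_updates))  # UNION: ['O1', 'O2', 'O3', 'O4']
print(orders & orders_updates)          # INTERSECT: {'O3'}
print(sorted(orders - orders_updates))  # EXCEPT: ['O1', 'O2']
```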

Pivoting Data

The PIVOT clause rotates rows into columns, aggregating values along the way.

Syntax

```sql
SELECT *
FROM (
  SELECT customer_id, book_id, price FROM orders
)
PIVOT (
  SUM(price) FOR book_id IN ('B01', 'B02', 'B03')
)
```

Explanation

  • The inner query defines the base dataset.
  • SUM(price) is the aggregation function.
  • book_id is the pivot key.
  • 'B01', 'B02', 'B03' become column headers.

Use Case: Flattening order data for each customer to prepare for visualization or machine learning pipelines.
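The rotation can be sketched in plain Python: one output row per customer, one column per pivot value, summing prices into each cell (all sample rows are hypothetical):

```python
# Sketch of PIVOT: rotate (customer_id, book_id, price) rows into one
# row per customer, with a column per book_id and SUM(price) as cells.
rows = [
    ("C1", "B01", 10.0), ("C1", "B01", 5.0), ("C1", "B02", 7.0),
    ("C2", "B03", 12.0),
]
book_ids = ["B01", "B02", "B03"]  # the IN (...) list of pivot columns

pivoted = {}
for customer, book, price in rows:
    # Start each customer with NULL (None) in every pivot column.
    cols = pivoted.setdefault(customer, {b: None for b in book_ids})
    cols[book] = (cols[book] or 0.0) + price

print(pivoted["C1"])  # {'B01': 15.0, 'B02': 7.0, 'B03': None}
```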


Summary

| Transformation | Function / Syntax | Use Case |
| --- | --- | --- |
| JSON traversal | `json_column:key1:key2` | Read values from JSON strings |
| JSON to struct | `from_json()` | Parse and type JSON content |
| Explode arrays | `explode()` | One row per array element |
| Flatten nested arrays | `flatten()` + `array_distinct()` | Remove nested array levels and duplicates |
| Joins | `JOIN` variants | Combine datasets |
| Set operations | `UNION`, `INTERSECT`, `EXCEPT` | Compare datasets |
| Pivoting | `PIVOT` | Transform rows into columns |

Spark SQL's advanced features allow deep data manipulation for analytics, ETL, and data science workflows.