Setting Up Delta Tables
This document provides detailed coverage of working with Delta Lake tables in Databricks. It focuses on creating Delta tables using CTAS statements, enforcing table constraints, and creating clones (deep or shallow) of Delta tables. It also highlights best practices and the comparative features of CTAS and regular table creation methods.
1. Creating Delta Tables with CTAS
1.1. What Is CTAS?
CTAS (Create Table As Select) is a SQL statement used to:
- Create a new Delta table.
- Automatically populate the table with data from a
SELECT
query.
1.2. Key Characteristics
- Schema inference: Schema is derived from the SELECT statement. You cannot manually declare column types.
- Immediate data population: Unlike regular
CREATE TABLE
, CTAS inserts data during creation. -
Supports transformations:
-
Rename columns.
- Select subset of columns.
1.3. Example
CREATE TABLE user_info
COMMENT 'Contains PII (name, email)'
PARTITIONED BY (city, birthdate)
LOCATION '/mnt/external_data/user_info'
AS
SELECT
id,
name AS full_name,
email,
city,
birthdate
FROM raw_users;
COMMENT
: Documents the table content.PARTITIONED BY
: Organizes data in subfolders by partition keys.LOCATION
: Stores table data at a specified external location.
2. CTAS vs. Regular CREATE TABLE
Feature | CTAS | Regular CREATE TABLE |
---|---|---|
Schema Declaration | Inferred from SELECT | Must be declared manually |
Data Insertion | Immediate (from SELECT) | Requires INSERT INTO afterward |
Partition Support | Supported | Supported |
External Storage | Via LOCATION |
Via LOCATION |
Column Transformations | Inline via SELECT | Not applicable at creation |
3. Partitioning Best Practices
3.1. When to Partition
- Recommended for large Delta tables to enhance query performance.
-
Avoid for small/medium tables due to:
-
File fragmentation.
- Reduced file compaction efficiency.
- Ineffective data skipping.
3.2. Syntax Example
4. Adding Table Constraints
Delta tables support data integrity constraints, which are enforced during data writes.
4.1. Supported Constraint Types
NOT NULL
CHECK
(logical condition enforcement)
4.2. Constraint Requirements
- Existing data must conform to the constraint before applying.
- Violations on insert/update operations will trigger write failure.
4.3. Example: Adding a CHECK Constraint
- Constraints are similar in syntax to
WHERE
clauses.
5. Copying Delta Tables (Cloning)
Delta Lake supports two cloning methods: deep and shallow.
5.1. Deep Clone
5.1.1. Behavior
-
Full copy of:
-
Data
- Metadata
-
Suitable for:
-
Backup
- Long-term duplication
- Full environment replication
5.1.2. Syntax
- Can be re-run to sync changes incrementally.
- Time-intensive for large tables.
5.2. Shallow Clone
5.2.1. Behavior
- Copies only the Delta transaction log.
- No data movement.
- Fast operation.
-
Ideal for:
-
Testing
- Development environments
- Sandboxing
5.2.2. Syntax
5.3. Isolation and Data Integrity
- All modifications to cloned tables are isolated.
- Source tables remain unaffected by changes to clones.
6. Summary
Feature | CTAS | Regular CREATE | Deep Clone | Shallow Clone |
---|---|---|---|---|
Data Insert | During creation | Manual insert | Full data + metadata | Metadata only |
Schema | Inferred | Manual | Cloned | Cloned |
Partitioning | Supported | Supported | Preserved | Preserved |
External Storage | Via LOCATION |
Supported | Optional | Optional |
Use Case | Transform+load | Schema setup | Backup/replica | Fast development/test copy |
Data Movement | Yes | N/A | Yes | No |
7. Recommendations
- Prefer CTAS for quick data ingestion from existing tables.
- Use partitioning only for large datasets.
- Apply constraints to enforce business rules at the storage level.
- Use deep clones for full-scale duplication and backup.
- Use shallow clones for non-invasive dev/test scenarios.