
Querying Files in Databricks Using Spark SQL

  • This guide outlines how to extract and query data from files in Databricks using Spark SQL.
  • It includes methods for inspecting raw file content, leveraging external data sources, and loading data into Delta Lake tables using CTAS and CREATE TABLE USING statements.

Basic File Querying

Syntax

To query a file directly using Spark SQL:

SELECT * FROM format.`/path/to/file`
  • Use backticks (`) around the file path.
  • Replace format with the appropriate file format (e.g., json, parquet).
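For example, assuming a JSON export under a hypothetical path `/mnt/data/events`, a direct query might look like:

```sql
-- Query a single JSON file directly (path is illustrative)
SELECT * FROM json.`/mnt/data/events/export_001.json`
```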

Supported Formats

Self-describing formats with embedded schema definitions:

  • json
  • parquet

Non-self-describing formats (e.g., csv) require additional configuration.

Multi-File Access

You can:

  • Query a single file
  • Use wildcards (*) to access multiple files
  • Query an entire directory

Note: All files must share the same schema and format.
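Assuming a directory of Parquet part files, the three access patterns above might look like this (paths are illustrative):

```sql
-- Single file
SELECT * FROM parquet.`/mnt/data/sales/part-00000.parquet`

-- Wildcard: all matching part files in the directory
SELECT * FROM parquet.`/mnt/data/sales/part-*.parquet`

-- Entire directory (all files must share schema and format)
SELECT * FROM parquet.`/mnt/data/sales/`
```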


Raw Text and Binary Extraction

Text Format

To load raw text content from files:

SELECT * FROM text.`/path/to/textfiles`

Applicable formats:

  • json
  • csv
  • tsv
  • txt

Useful for:

  • Inspecting corrupted input
  • Custom parsing with string functions
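As a sketch: rows loaded through the text format arrive in a single `value` column, one raw line per row, which can then be dissected with standard string functions (path and parsing logic below are illustrative):

```sql
-- Each row holds one raw line in the `value` column
SELECT value,
       split(value, ';')[0] AS first_field,  -- custom parsing example
       length(value)        AS line_length
FROM text.`/mnt/data/raw_logs/`
WHERE value LIKE '%ERROR%'                   -- e.g., isolate suspect lines for inspection
```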

Binary File Format

To load files as binary (e.g., images, unstructured data):

SELECT * FROM binaryFile.`/path/to/files`
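A binaryFile query returns file metadata alongside the raw bytes: Spark's binary file source produces `path`, `modificationTime`, `length`, and `content` columns. A sketch (the path and filter are illustrative):

```sql
SELECT path, modificationTime, length, content
FROM binaryFile.`/mnt/data/images/`
WHERE length < 1048576  -- e.g., only files under 1 MiB
```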

Creating Delta Tables from Files

Using CTAS Statements

Create Table As Select (CTAS) allows loading file data into a Delta table:

CREATE TABLE table_name
USING DELTA
AS SELECT * FROM parquet.`/path/to/files`
  • Schema is inferred from the SELECT clause.
  • Data is physically copied into a managed Delta table.

Limitations of CTAS

  • No support for schema declaration
  • No support for file options (e.g., delimiter, header)
  • Best used with well-defined formats (e.g., parquet)

Warning: Not suitable for CSV ingestion due to lack of configurable options.


Creating External Tables with Options

Using CREATE TABLE USING

To create a table that references files directly (external table):

CREATE TABLE table_name
USING format
OPTIONS (
  key1 = 'value1',
  key2 = 'value2'
)
LOCATION '/path/to/files'
  • Table is a pointer to external files.
  • No data movement occurs.
  • The table is non-Delta.
  • Original file format is preserved.

CSV External Table Example

CREATE TABLE external_csv_table
USING csv
OPTIONS (
  header = 'true',
  delimiter = ';'
)
LOCATION '/mnt/data/mycsvfiles'

JDBC External Table Example

CREATE TABLE external_jdbc_table
USING jdbc
OPTIONS (
  url = 'jdbc:mysql://hostname:3306/db',
  dbtable = 'my_table',
  user = 'username',
  password = 'password'
)

Note: This points to an external SQL database table.


Loading External Data into Delta Tables

To load data from an external source into a Delta table:

  1. Create a temporary view:
CREATE OR REPLACE TEMP VIEW temp_view_name
USING csv
OPTIONS (
  header = 'true',
  delimiter = ';'
)
LOCATION '/mnt/data/mycsvfiles'
  2. Use CTAS to load into Delta Lake:
CREATE TABLE delta_table
USING DELTA
AS SELECT * FROM temp_view_name

Benefits:

  • Data is loaded and stored in Delta format.
  • Full Delta Lake features are enabled (e.g., time travel, ACID).
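Once loaded, Delta Lake features can be exercised directly on the table; for instance, time travel via the table history (the version number below is illustrative):

```sql
-- Inspect the table's commit history (basis for time travel)
DESCRIBE HISTORY delta_table;

-- Query an earlier version of the table
SELECT * FROM delta_table VERSION AS OF 0
```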

Summary

| Method                        | Data Movement | Delta Support | File Options   | Use Case                        |
| ----------------------------- | ------------- | ------------- | -------------- | ------------------------------- |
| SELECT * FROM format.\`path\` | No            | No            | No             | Ad-hoc querying of files        |
| CREATE TABLE USING            | No            | No             | Yes            | Reference external sources      |
| CTAS                          | Yes           | Yes           | No             | Load structured data into Delta |
| Temp View + CTAS              | Yes           | Yes           | Yes (via view) | Load external unstructured data |

Warning: External tables (USING) do not support Delta Lake features and may perform poorly when querying large datasets.
