Catalog Integration
Lexega can operate fully offline with no database connection. When you provide a catalog snapshot, however, analysis becomes significantly more precise: it can detect issues that are impossible to find from the SQL text alone.
The catalog is an enhancer, not a perimeter. Tables missing from the snapshot are analyzed normally—just without catalog-enhanced signals. An incomplete snapshot makes analysis better where it covers, never worse where it doesn't.
Supported Platforms
| Platform | Schema | Row Counts | Constraints | Grants | Policies |
|---|---|---|---|---|---|
| Snowflake | ✅ INFORMATION_SCHEMA | ✅ SHOW TABLES | ✅ PK/FK/UNIQUE | ✅ ACCOUNT_USAGE | ✅ Masking + Row Access |
| Databricks | ✅ Unity Catalog API | ✅ Delta statistics | ✅ PK/FK | ✅ Unity Catalog permissions | ✅ Row filters + Column masks |
What the Catalog Enables
| Without Catalog | With Catalog |
|---|---|
| "This JOIN lacks a constraint" | "This JOIN on orders.customer_id = customers.id lacks FK backing—customers has 1.2M rows, potential fanout" |
| "DISTINCT may be unnecessary" | "DISTINCT is unnecessary—user_id is a PRIMARY KEY" |
| "Nullable column in JOIN" | "JOIN on email which is nullable—NULL values will be silently dropped" |
| "CROSS JOIN detected" | "CROSS JOIN between events (10M rows) and sessions (500K rows)—Cartesian product: 5 trillion rows" |
Catalog Contents
A catalog snapshot is a JSON file containing metadata extracted from your data platform:
{
  "schema_version": 2,
  "generated_at": "2025-01-15T10:30:00Z",
  "provider": "snowflake",
  "databases": [
    {
      "name": "ANALYTICS",
      "schemas": [
        {
          "name": "PUBLIC",
          "tables": [
            {
              "name": "CUSTOMERS",
              "kind": "Table",
              "row_count_estimate": 1200000,
              "columns": [
                { "name": "ID", "data_type": "NUMBER(38,0)", "nullable": false },
                { "name": "EMAIL", "data_type": "VARCHAR(256)", "nullable": true }
              ],
              "constraints": [
                { "kind": "PrimaryKey", "columns": ["ID"] }
              ]
            }
          ]
        }
      ]
    }
  ],
  "policies": [],
  "policy_references": [],
  "grants": { "object_privileges": [] }
}
Names can be plain strings (as above) or objects with an optional case_sensitive flag for mixed-case identifiers: { "name": "My Table", "case_sensitive": true }. The sidecar generates the appropriate format automatically.
The provider field enables automatic platform-aware behavior—identifier normalization, privilege canonicalization, and grant analysis all adapt accordingly.
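A consumer of the snapshot must handle both name shapes. A minimal sketch, assuming only the documented format (the `name_of` helper is hypothetical, not part of Lexega):

```python
def name_of(entry):
    """Extract (name, case_sensitive) from a catalog name entry, which may be
    a plain string or an object with an optional case_sensitive flag."""
    if isinstance(entry, dict):
        return entry["name"], entry.get("case_sensitive", False)
    return entry, False

print(name_of("CUSTOMERS"))                                   # ('CUSTOMERS', False)
print(name_of({"name": "My Table", "case_sensitive": True}))  # ('My Table', True)
```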
Key Metadata Fields
| Field | What It Enables |
|---|---|
| provider | Platform-aware identifier normalization and grant analysis |
| row_count_estimate | Detect large table scans, estimate JOIN fanout and CROSS JOIN products |
| bytes_estimate | Track data volume for cost estimation |
| columns[].nullable | Detect nullable JOIN columns (Q-JOIN-NULL-CENH) |
| columns[].data_type | Detect type mismatches in JOINs (Q-JOIN-TYPEMIS-CENH) |
| constraints | Detect missing FK backing (Q-JOIN-FANOUT), unnecessary DISTINCT |
| tags | Enable tag-based custom rules (e.g., block queries on PII tables without audit) |
| policies | Track masking/row-access policy coverage |
| policy_references | Know which columns are protected |
| grants.object_privileges | Compute effective access changes from GRANT/REVOKE |
Generating a Catalog Snapshot
Snowflake
# With external browser auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser

# With password auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --password "$SNOWFLAKE_PASSWORD"

# Include grant graph (role hierarchy + object privileges)
lexega-catalog pull \
  --provider snowflake \
  --include-grants \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser
Databricks
# Basic catalog pull (Unity Catalog)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --out .lexega/catalog.json

# With grants (Unity Catalog permissions)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --include-grants \
  --out .lexega/catalog.json

# Filter to specific catalogs
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --database my_catalog \
  --out .lexega/catalog.json
Databricks authentication: Set DATABRICKS_TOKEN to a personal access token (PAT) from your workspace settings. The sidecar uses the Unity Catalog REST API (/api/2.1/unity-catalog/) — no SQL warehouse or cluster required.
Row counts: For Delta tables with statistics collected (via ANALYZE TABLE or auto-stats), row counts and byte sizes are automatically extracted from spark.sql.statistics.numRows and spark.sql.statistics.totalSize in table properties.
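The property extraction described above can be sketched as follows. The two property keys are the documented Spark statistics keys; the `delta_stats` helper and the sample values are hypothetical:

```python
def delta_stats(table_properties):
    """Pull row/byte estimates from Delta table properties, if ANALYZE TABLE
    (or auto-stats) has populated them. Returns (rows, bytes), with None for
    any statistic that has not been collected."""
    rows = table_properties.get("spark.sql.statistics.numRows")
    size = table_properties.get("spark.sql.statistics.totalSize")
    return (int(rows) if rows is not None else None,
            int(size) if size is not None else None)

props = {"spark.sql.statistics.numRows": "1200000",
         "spark.sql.statistics.totalSize": "734003200"}
print(delta_stats(props))   # (1200000, 734003200)
```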
Provider Auto-Detection
When analyzing SQL with a catalog snapshot, Lexega automatically detects the provider from the snapshot's provider field:
# No --provider flag needed — auto-detected from snapshot
lexega-sql analyze model.sql --catalog .lexega/catalog.json
You can also explicitly override the provider:
lexega-sql analyze model.sql --catalog .lexega/catalog.json --provider databricks
Sidecar Data Sources
Snowflake Queries
The sidecar runs read-only queries against Snowflake:
Schema Metadata (INFORMATION_SCHEMA)
| Query | Purpose |
|---|---|
| SHOW DATABASES | List all accessible databases |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLES | Table metadata per database |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.COLUMNS | Column types and nullability |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLE_CONSTRAINTS | Primary keys, unique constraints |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS | FK relationships |
| SHOW TABLES IN SCHEMA {db}.{schema} | Row count and byte estimates |
| SHOW IMPORTED KEYS IN DATABASE {db} | FK column membership |
Grant Graph (ACCOUNT_USAGE) — requires --include-grants
| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON = 'ROLE' | Role hierarchy |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON IN ('TABLE', 'VIEW', ...) | Object privileges |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS | User-to-role assignments |
Governance Policies (ACCOUNT_USAGE)
| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.MASKING_POLICIES | Masking policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.ROW_ACCESS_POLICIES | Row access policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.POLICY_REFERENCES | Policy-to-table mappings |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES | Tags on tables and columns |
Databricks API Endpoints
The sidecar uses the Unity Catalog REST API (no SQL warehouse needed):
| Endpoint | Purpose |
|---|---|
| GET /api/2.1/unity-catalog/catalogs | List all catalogs |
| GET /api/2.1/unity-catalog/schemas?catalog_name=... | List schemas per catalog |
| GET /api/2.1/unity-catalog/tables?catalog_name=...&schema_name=... | List tables and columns |
| GET /api/2.1/unity-catalog/tables/{full_name} | Table detail (constraints, masks, row filters, Delta stats) |
| GET /api/2.1/unity-catalog/permissions/{type}/{name} | Object permissions (with --include-grants) |
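A raw call against these endpoints can be sketched with the standard library, assuming standard Databricks bearer-token authentication. The `uc_request` helper and the token value are hypothetical:

```python
import urllib.parse
import urllib.request

def uc_request(workspace_url, path, token, **params):
    """Build an authenticated GET request against a Unity Catalog REST
    endpoint. Databricks PATs go in a standard Bearer authorization header."""
    url = workspace_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

req = uc_request("https://your-workspace.cloud.databricks.com",
                 "/api/2.1/unity-catalog/tables",
                 token="dapi-example",   # hypothetical PAT
                 catalog_name="my_catalog", schema_name="public")
print(req.full_url)
# https://your-workspace.cloud.databricks.com/api/2.1/unity-catalog/tables?catalog_name=my_catalog&schema_name=public
```

Calling `urllib.request.urlopen(req)` would then fetch the table list as JSON.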
Manual Generation
You can also generate the JSON yourself. The schema is documented in the JSON Schema file:
# Get the JSON Schema for validation
lexega-sql catalog schema > catalog.schema.json
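For hand-built snapshots, a cheap structural check before invoking the CLI can be sketched as follows. The required fields here are inferred from the example snapshot above; the authoritative contract is the emitted JSON Schema:

```python
import json

# Inferred from the example snapshot; the JSON Schema is authoritative.
REQUIRED_TOP_LEVEL = ("schema_version", "provider", "databases")

def sanity_check(snapshot):
    """Raise if obviously required top-level fields are absent."""
    missing = [f for f in REQUIRED_TOP_LEVEL if f not in snapshot]
    if missing:
        raise ValueError(f"catalog snapshot missing fields: {missing}")
    return snapshot

snap = sanity_check(json.loads(
    '{"schema_version": 2, "provider": "snowflake", "databases": []}'))
print(snap["provider"])   # snowflake
```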
Catalog Commands
# Inspect a catalog snapshot (summary)
lexega-sql catalog inspect .lexega/catalog.json
# Diff two catalog snapshots
lexega-sql catalog diff old.json new.json
Using the Catalog in Analysis
Pass the catalog to any analysis command with --catalog:
# Semantic diff with catalog
lexega-sql diff main..HEAD models/ -r --catalog .lexega/catalog.json
# Review with catalog
lexega-sql review main..HEAD models/ -r --catalog .lexega/catalog.json
# Static analysis with catalog
lexega-sql analyze model.sql --catalog .lexega/catalog.json
The provider is auto-detected from the snapshot. No additional flags needed.
Catalog-Enhanced Signals
Q-JOIN-NULL-CENH: Nullable JOIN Column
Without catalog: Cannot detect (no column metadata)
With catalog:
[HIGH] Q-JOIN-NULL-CENH: Silent Data Loss Risk
JOIN on `orders.user_id = users.id` where `user_id` is nullable
↳ NULL values in user_id will be silently dropped from results
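The check behind this signal can be sketched from the catalog's column metadata. The `nullable_join_keys` helper is hypothetical, not Lexega's implementation:

```python
def nullable_join_keys(join_columns, catalog_columns):
    """Return join-key columns whose catalog metadata says nullable=True.
    Rows with NULL keys never satisfy an equality predicate, so an inner
    join drops them silently."""
    nullable = {c["name"].upper() for c in catalog_columns if c["nullable"]}
    return [col for col in join_columns if col.upper() in nullable]

cols = [{"name": "ID", "nullable": False}, {"name": "EMAIL", "nullable": True}]
print(nullable_join_keys(["email"], cols))   # ['email']
```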
Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN
Without catalog: Cannot detect
With catalog:
[MEDIUM] Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN
JOIN comparing `orders.customer_id` (VARCHAR) to `customers.id` (NUMBER)
↳ Implicit cast may cause performance issues or unexpected matches
Cross-Join Product Estimation
Without catalog: "CROSS JOIN detected" (no size info)
With catalog:
[MEDIUM] Explicit CROSS JOIN between system.lakeflow.zerobus_ingest and
system.lakeflow.zerobus_stream. Cartesian product (199,439,331,565,824
rows) may cause performance issues on large tables.
↳ left_rows=33,059,072 right_rows=6,032,817
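The product estimate is simple arithmetic over the catalog's row counts. A sketch reproducing the numbers above (the threshold value is hypothetical):

```python
def cross_join_estimate(left_rows, right_rows, threshold=1_000_000_000):
    """Estimate a Cartesian product from catalog row counts and flag it
    when it crosses a severity threshold (threshold value is illustrative)."""
    product = left_rows * right_rows
    return product, product >= threshold

product, flagged = cross_join_estimate(33_059_072, 6_032_817)
print(f"{product:,}")   # 199,439,331,565,824
```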
Large Unbounded Table Scan
Without catalog: Cannot detect
With catalog:
[HIGH] Unbounded query on large table without WHERE clause
↳ tables=system.access.audit, total_rows=229,743,971,240
Q-JOIN-DISTINCT-MASK: DISTINCT on JOINs with Non-Unique Keys
Without catalog: Warns about potential duplicates
With catalog:
[HIGH] Q-JOIN-DISTINCT-MASK: DISTINCT with Non-Unique JOIN Keys
DISTINCT applied after JOIN on `orders.product_id`
↳ `product_id` is not unique (no PK/UNIQUE constraint)
↳ products has 50,000 rows—potential fanout before DISTINCT
Q-JOIN-FANOUT: Unconstrained JOIN Fanout
Without catalog: Cannot detect
With catalog:
[HIGH] Q-JOIN-FANOUT: Unconstrained JOIN Fanout
JOIN on `events.session_id = sessions.id` lacks FK or UNIQUE constraint
↳ events has 10M rows, sessions has 500K rows
↳ Result set may be unexpectedly large
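The underlying test is whether the join key is backed by a uniqueness guarantee in the snapshot's constraints. A single-column sketch (composite keys would need set comparison; the "Unique" kind is an assumption alongside the documented "PrimaryKey"):

```python
def join_key_is_unique(table, column):
    """True when the join key is covered by a PK or UNIQUE constraint, so a
    join on it cannot fan out the other side."""
    for c in table.get("constraints", []):
        if c["kind"] in ("PrimaryKey", "Unique") and c["columns"] == [column]:
            return True
    return False

sessions = {"constraints": [{"kind": "PrimaryKey", "columns": ["ID"]}]}
events = {"constraints": []}
print(join_key_is_unique(sessions, "ID"))        # True
print(join_key_is_unique(events, "SESSION_ID"))  # False
```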
Grant Graph Analysis (GRT-ACCESS-EXP, GRT-ACCESS-EXP-HI, GRT-SYSROLE-EXP, GRT-BROAD-PRIV, GRT-BROAD-PRIV-HI)
With the grants section populated, Lexega can compute effective access changes:
[CRITICAL] GRT-SYSROLE-EXP: Critical System Role Grant
GRANT ROLE ACCOUNTADMIN TO ROLE data_team
↳ data_team inherits all ACCOUNTADMIN privileges
↳ Affects 15 users via role inheritance
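Counting affected users amounts to walking the role hierarchy downward from the granted role. A sketch over a hypothetical in-memory shape of the grant graph (the real snapshot stores this under grants):

```python
from collections import deque

def users_inheriting(role, role_grants, user_grants):
    """BFS over 'granted to' edges: role_grants maps a role to the roles it
    has been granted to, user_grants maps a role to the users holding it."""
    seen, queue, users = {role}, deque([role]), set()
    while queue:
        r = queue.popleft()
        users |= user_grants.get(r, set())
        for child in role_grants.get(r, set()) - seen:
            seen.add(child)
            queue.append(child)
    return users

role_grants = {"data_team": {"analysts"}}
user_grants = {"data_team": {"alice"}, "analysts": {"bob", "carol"}}
print(sorted(users_inheriting("data_team", role_grants, user_grants)))
# ['alice', 'bob', 'carol']
```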
Identifier Normalization
Lexega automatically normalizes identifiers based on the catalog provider:
| Platform | Unquoted Identifiers | Quoted Identifiers |
|---|---|---|
| Snowflake | Uppercased (my_table → MY_TABLE) | Case-preserved ("My Table" → My Table) |
| Databricks | Lowercased (MY_TABLE → my_table) | Case-preserved |
This means SQL like SELECT * FROM MY_CATALOG.MY_SCHEMA.MY_TABLE correctly resolves against a Databricks catalog where names are stored in lowercase—and vice versa for Snowflake uppercase conventions.
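The folding convention in the table can be sketched as follows; `fold_identifier` and `resolve` are hypothetical helpers illustrating the rule, not Lexega's implementation (quoted identifiers would bypass folding entirely):

```python
def fold_identifier(ident, provider):
    """Fold an unquoted SQL identifier per platform convention."""
    return ident.upper() if provider == "snowflake" else ident.lower()

def resolve(qualified, provider):
    """Fold each part of a dotted qualified name for catalog lookup."""
    return ".".join(fold_identifier(p, provider) for p in qualified.split("."))

print(resolve("MY_CATALOG.MY_SCHEMA.MY_TABLE", "databricks"))
# my_catalog.my_schema.my_table
print(resolve("my_table", "snowflake"))   # MY_TABLE
```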
CI/CD Integration
Refresh Catalog in CI
For accurate analysis, refresh the catalog periodically:
Snowflake
# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider snowflake \
            --include-grants \
            --out .lexega/catalog.json \
            -- --account ${{ secrets.SNOWFLAKE_ACCOUNT }} \
               --user ${{ secrets.SNOWFLAKE_USER }} \
               --password "${{ secrets.SNOWFLAKE_PASSWORD }}"
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push
Databricks
# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider databricks \
            --workspace-url ${{ secrets.DATABRICKS_WORKSPACE_URL }} \
            --token-env DATABRICKS_TOKEN \
            --include-grants \
            --out .lexega/catalog.json
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push
Use Cached Catalog in PR Checks
# .github/workflows/sql-review.yml
on: [pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - run: |
          lexega-sql review ${{ github.event.pull_request.base.sha }}..${{ github.sha }} \
            models/ -r \
            --catalog .lexega/catalog.json \
            --policy .lexega/policy.yaml \
            --env prod
Catalog Staleness
Catalog snapshots can become stale as your schema evolves. Lexega uses row_count_estimate_as_of and bytes_estimate_as_of timestamps to track freshness when available (Snowflake populates these automatically).
Recommendations:
- Refresh daily for actively changing schemas
- Refresh weekly for stable schemas
- For Databricks, run ANALYZE TABLE periodically to keep Delta statistics current
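A staleness check can be driven from the snapshot's own generated_at timestamp. A minimal sketch (the `snapshot_age_days` helper and the 7-day cutoff are hypothetical):

```python
from datetime import datetime, timezone

def snapshot_age_days(snapshot, now=None):
    """Age of a catalog snapshot in whole days, from its generated_at field."""
    generated = datetime.fromisoformat(
        snapshot["generated_at"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - generated).days

snap = {"generated_at": "2025-01-15T10:30:00Z"}
ref = datetime(2025, 1, 22, 10, 30, tzinfo=timezone.utc)
print(snapshot_age_days(snap, ref))   # 7
```

A CI job could fail (or trigger a refresh) when the age exceeds the team's chosen cutoff.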
License Requirements
Catalog operations require a Team, Pro, or Enterprise license. The core CLI works without a catalog on all license tiers.
Need Help?
Can't find what you're looking for? Check out our GitHub or reach out to support.