Catalog Integration

Lexega can operate fully offline with no database connection. However, when you provide a catalog snapshot, analysis becomes significantly more precise, detecting issues that are impossible to find from the SQL text alone.

The catalog is an enhancer, not a perimeter. Tables missing from the snapshot are analyzed normally—just without catalog-enhanced signals. An incomplete snapshot makes analysis better where it covers, never worse where it doesn't.

Supported Platforms

| Platform | Schema | Row Counts | Constraints | Grants | Policies |
|---|---|---|---|---|---|
| Snowflake | ✅ INFORMATION_SCHEMA | ✅ SHOW TABLES | ✅ PK/FK/UNIQUE | ✅ ACCOUNT_USAGE | ✅ Masking + Row Access |
| Databricks | ✅ Unity Catalog API | ✅ Delta statistics | ✅ PK/FK | ✅ Unity Catalog permissions | ✅ Row filters + Column masks |

What the Catalog Enables

| Without Catalog | With Catalog |
|---|---|
| "This JOIN lacks a constraint" | "This JOIN on orders.customer_id = customers.id lacks FK backing—customers has 1.2M rows, potential fanout" |
| "DISTINCT may be unnecessary" | "DISTINCT is unnecessary—user_id is a PRIMARY KEY" |
| "Nullable column in JOIN" | "JOIN on email which is nullable—NULL values will be silently dropped" |
| "CROSS JOIN detected" | "CROSS JOIN between events (10M rows) and sessions (500K rows)—Cartesian product: 5 trillion rows" |

Catalog Contents

A catalog snapshot is a JSON file containing metadata extracted from your data platform:

{
  "schema_version": 2,
  "generated_at": "2025-01-15T10:30:00Z",
  "provider": "snowflake",
  "databases": [
    {
      "name": "ANALYTICS",
      "schemas": [
        {
          "name": "PUBLIC",
          "tables": [
            {
              "name": "CUSTOMERS",
              "kind": "Table",
              "row_count_estimate": 1200000,
              "columns": [
                { "name": "ID", "data_type": "NUMBER(38,0)", "nullable": false },
                { "name": "EMAIL", "data_type": "VARCHAR(256)", "nullable": true }
              ],
              "constraints": [
                { "kind": "PrimaryKey", "columns": ["ID"] }
              ]
            }
          ]
        }
      ]
    }
  ],
  "policies": [],
  "policy_references": [],
  "grants": { "object_privileges": [] }
}

Names can be plain strings (as above) or objects with an optional case_sensitive flag for mixed-case identifiers: { "name": "My Table", "case_sensitive": true }. The sidecar generates the appropriate format automatically.
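A consumer of the snapshot can accept both name encodings with a small helper. This is an illustrative sketch (the `parse_name` function is hypothetical, not part of Lexega), showing the two forms described above:

```python
# Hypothetical helper illustrating the two accepted name encodings:
# a bare string, or an object with an optional case_sensitive flag.
def parse_name(raw):
    """Return (name, case_sensitive) for either name encoding."""
    if isinstance(raw, str):
        # Plain strings follow the provider's default casing rules.
        return raw, False
    return raw["name"], raw.get("case_sensitive", False)

assert parse_name("CUSTOMERS") == ("CUSTOMERS", False)
assert parse_name({"name": "My Table", "case_sensitive": True}) == ("My Table", True)
```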

The provider field enables automatic platform-aware behavior—identifier normalization, privilege canonicalization, and grant analysis all adapt accordingly.

Key Metadata Fields

| Field | What It Enables |
|---|---|
| provider | Platform-aware identifier normalization and grant analysis |
| row_count_estimate | Detect large table scans, estimate JOIN fanout and CROSS JOIN products |
| bytes_estimate | Track data volume for cost estimation |
| columns[].nullable | Detect nullable JOIN columns (Q-JOIN-NULL-CENH) |
| columns[].data_type | Detect type mismatches in JOINs (Q-JOIN-TYPEMIS-CENH) |
| constraints | Detect missing FK backing (Q-JOIN-FANOUT), unnecessary DISTINCT |
| tags | Enable tag-based custom rules (e.g., block queries on PII tables without audit) |
| policies | Track masking/row-access policy coverage |
| policy_references | Know which columns are protected |
| grants.object_privileges | Compute effective access changes from GRANT/REVOKE |

Generating a Catalog Snapshot

Snowflake

# With external browser auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser

# With password auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --password "$SNOWFLAKE_PASSWORD"

# Include grant graph (role hierarchy + object privileges)
lexega-catalog pull \
  --provider snowflake \
  --include-grants \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser

Databricks

# Basic catalog pull (Unity Catalog)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --out .lexega/catalog.json

# With grants (Unity Catalog permissions)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --include-grants \
  --out .lexega/catalog.json

# Filter to specific catalogs
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --database my_catalog \
  --out .lexega/catalog.json

Databricks authentication: Set DATABRICKS_TOKEN to a personal access token (PAT) from your workspace settings. The sidecar uses the Unity Catalog REST API (/api/2.1/unity-catalog/) — no SQL warehouse or cluster required.

Row counts: For Delta tables with statistics collected (via ANALYZE TABLE or auto-stats), row counts and byte sizes are automatically extracted from spark.sql.statistics.numRows and spark.sql.statistics.totalSize in table properties.
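Extracting those estimates is simple string-to-integer parsing of the table properties map. A minimal sketch, assuming the properties arrive as a flat string-valued dict (as the Unity Catalog tables endpoint returns them) and using the two property keys named above:

```python
# Sketch: pull row/byte estimates out of a Delta table's properties map.
# Assumes a flat dict of string values, keyed as described in the docs.
def delta_stats(properties):
    """Return (row_count, total_bytes); None for stats not yet collected."""
    rows = properties.get("spark.sql.statistics.numRows")
    size = properties.get("spark.sql.statistics.totalSize")
    return (int(rows) if rows else None, int(size) if size else None)

props = {"spark.sql.statistics.numRows": "1200000",
         "spark.sql.statistics.totalSize": "734003200"}
assert delta_stats(props) == (1200000, 734003200)
assert delta_stats({}) == (None, None)  # no ANALYZE TABLE run yet
```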

Provider Auto-Detection

When analyzing SQL with a catalog snapshot, Lexega automatically detects the provider from the snapshot's provider field:

# No --provider flag needed — auto-detected from snapshot
lexega-sql analyze model.sql --catalog .lexega/catalog.json

You can also explicitly override the provider:

lexega-sql analyze model.sql --catalog .lexega/catalog.json --provider databricks

Sidecar Data Sources

Snowflake Queries

The sidecar runs read-only queries against Snowflake:

Schema Metadata (INFORMATION_SCHEMA)

| Query | Purpose |
|---|---|
| SHOW DATABASES | List all accessible databases |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLES | Table metadata per database |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.COLUMNS | Column types and nullability |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLE_CONSTRAINTS | Primary keys, unique constraints |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS | FK relationships |
| SHOW TABLES IN SCHEMA {db}.{schema} | Row count and byte estimates |
| SHOW IMPORTED KEYS IN DATABASE {db} | FK column membership |

Grant Graph (ACCOUNT_USAGE, requires --include-grants)

| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON = 'ROLE' | Role hierarchy |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON IN ('TABLE', 'VIEW', ...) | Object privileges |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS | User-to-role assignments |

Governance Policies (ACCOUNT_USAGE)

| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.MASKING_POLICIES | Masking policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.ROW_ACCESS_POLICIES | Row access policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.POLICY_REFERENCES | Policy-to-table mappings |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES | Tags on tables and columns |

Databricks API Endpoints

The sidecar uses the Unity Catalog REST API (no SQL warehouse needed):

| Endpoint | Purpose |
|---|---|
| GET /api/2.1/unity-catalog/catalogs | List all catalogs |
| GET /api/2.1/unity-catalog/schemas?catalog_name=... | List schemas per catalog |
| GET /api/2.1/unity-catalog/tables?catalog_name=...&schema_name=... | List tables and columns |
| GET /api/2.1/unity-catalog/tables/{full_name} | Table detail (constraints, masks, row filters, Delta stats) |
| GET /api/2.1/unity-catalog/permissions/{type}/{name} | Object permissions (with --include-grants) |

Manual Generation

You can also generate the JSON yourself. The schema is documented in the JSON Schema file:

# Get the JSON Schema for validation
lexega-sql catalog schema > catalog.schema.json
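Before wiring a hand-written snapshot into CI, a quick structural check catches the most common mistakes. This is a lightweight sanity check, not full schema validation (use the exported JSON Schema with a validator library for that); it only asserts the top-level fields shown in the example snapshot above:

```python
import json

# Lightweight sanity check for a hand-written catalog snapshot.
# Checks only the top-level fields from the documented example; for full
# validation, run the exported JSON Schema through a validator library.
def sanity_check(text):
    snap = json.loads(text)
    assert snap.get("schema_version") == 2, "unexpected schema_version"
    assert snap.get("provider") in ("snowflake", "databricks"), "unknown provider"
    assert isinstance(snap.get("databases"), list), "databases must be a list"
    return snap

snap = sanity_check('{"schema_version": 2, "provider": "snowflake", "databases": []}')
assert snap["provider"] == "snowflake"
```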

Catalog Commands

# Inspect a catalog snapshot (summary)
lexega-sql catalog inspect .lexega/catalog.json

# Diff two catalog snapshots
lexega-sql catalog diff old.json new.json

Using the Catalog in Analysis

Pass the catalog to any analysis command with --catalog:

# Semantic diff with catalog
lexega-sql diff main..HEAD models/ -r --catalog .lexega/catalog.json

# Review with catalog
lexega-sql review main..HEAD models/ -r --catalog .lexega/catalog.json

# Static analysis with catalog
lexega-sql analyze model.sql --catalog .lexega/catalog.json

The provider is auto-detected from the snapshot. No additional flags needed.

Catalog-Enhanced Signals

Q-JOIN-NULL-CENH: Nullable JOIN Column

Without catalog: Cannot detect (no column metadata)

With catalog:

[HIGH] Q-JOIN-NULL-CENH: Silent Data Loss Risk
  JOIN on `orders.user_id = users.id` where `user_id` is nullable
  ↳ NULL values in user_id will be silently dropped from results

Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN

Without catalog: Cannot detect

With catalog:

[MEDIUM] Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN
  JOIN comparing `orders.customer_id` (VARCHAR) to `customers.id` (NUMBER)
  ↳ Implicit cast may cause performance issues or unexpected matches

Cross-Join Product Estimation

Without catalog: "CROSS JOIN detected" (no size info)

With catalog:

[MEDIUM] Explicit CROSS JOIN between system.lakeflow.zerobus_ingest and
  system.lakeflow.zerobus_stream. Cartesian product (199,439,331,565,824
  rows) may cause performance issues on large tables.
  ↳ left_rows=33,059,072 right_rows=6,032,817
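The estimate in that finding is just the product of the two row-count estimates from the snapshot:

```python
# Cartesian product estimate = left row count x right row count.
left_rows = 33_059_072
right_rows = 6_032_817
assert left_rows * right_rows == 199_439_331_565_824
```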

Large Unbounded Table Scan

Without catalog: Cannot detect

With catalog:

[HIGH] Unbounded query on large table without WHERE clause
  ↳ tables=system.access.audit, total_rows=229,743,971,240

Q-JOIN-DISTINCT-MASK: DISTINCT on JOINs with Non-Unique Keys

Without catalog: Warns about potential duplicates

With catalog:

[HIGH] Q-JOIN-DISTINCT-MASK: DISTINCT with Non-Unique JOIN Keys
  DISTINCT applied after JOIN on `orders.product_id`
  ↳ `product_id` is not unique (no PK/UNIQUE constraint)
  ↳ products has 50,000 rows—potential fanout before DISTINCT

Q-JOIN-FANOUT: Unconstrained JOIN Fanout

Without catalog: Cannot detect

With catalog:

[HIGH] Q-JOIN-FANOUT: Unconstrained JOIN Fanout
  JOIN on `events.session_id = sessions.id` lacks FK or UNIQUE constraint
  ↳ events has 10M rows, sessions has 500K rows
  ↳ Result set may be unexpectedly large

Grant Graph Analysis (GRT-ACCESS-EXP, GRT-ACCESS-EXP-HI, GRT-SYSROLE-EXP, GRT-BROAD-PRIV, GRT-BROAD-PRIV-HI)

With the grants section populated, Lexega can compute effective access changes:

[CRITICAL] GRT-SYSROLE-EXP: Critical System Role Grant
  GRANT ROLE ACCOUNTADMIN TO ROLE data_team
  ↳ data_team inherits all ACCOUNTADMIN privileges
  ↳ Affects 15 users via role inheritance
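Computing who is affected by a role grant is a reachability question over the role hierarchy. The sketch below is illustrative only (not Lexega's implementation): given (role, granted_to) edges, a privilege granted to one role reaches every role that transitively inherits it.

```python
# Illustrative sketch (not Lexega's implementation): a privilege granted
# to a role reaches every role that transitively inherits that role.
def roles_affected(granted_to, edges):
    """edges: (child, parent) pairs meaning `child` is granted to `parent`,
    so `parent` inherits `child`'s privileges."""
    reached, frontier = {granted_to}, [granted_to]
    while frontier:
        role = frontier.pop()
        for child, parent in edges:
            if child == role and parent not in reached:
                reached.add(parent)
                frontier.append(parent)
    return reached

# Hypothetical hierarchy: data_team -> analysts -> bi_tool.
edges = [("data_team", "analysts"), ("analysts", "bi_tool")]
assert roles_affected("data_team", edges) == {"data_team", "analysts", "bi_tool"}
```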

Identifier Normalization

Lexega automatically normalizes identifiers based on the catalog provider:

| Platform | Unquoted Identifiers | Quoted Identifiers |
|---|---|---|
| Snowflake | Uppercased (my_table → MY_TABLE) | Case-preserved ("My Table" → My Table) |
| Databricks | Lowercased (MY_TABLE → my_table) | Case-preserved |

This means SQL like SELECT * FROM MY_CATALOG.MY_SCHEMA.MY_TABLE correctly resolves against a Databricks catalog where names are stored in lowercase—and vice versa for Snowflake uppercase conventions.
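As a minimal sketch of those rules (a simplified model, not Lexega's resolver): unquoted identifiers fold per provider, quoted ones keep their case.

```python
# Simplified model of provider-aware identifier normalization:
# unquoted identifiers fold to the provider's default case,
# quoted identifiers are preserved as written.
def normalize(identifier, provider, quoted=False):
    if quoted:
        return identifier
    return identifier.upper() if provider == "snowflake" else identifier.lower()

assert normalize("my_table", "snowflake") == "MY_TABLE"
assert normalize("MY_TABLE", "databricks") == "my_table"
assert normalize("My Table", "snowflake", quoted=True) == "My Table"
```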

CI/CD Integration

Refresh Catalog in CI

For accurate analysis, refresh the catalog periodically:

Snowflake

# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider snowflake \
            --include-grants \
            --out .lexega/catalog.json \
            -- --account ${{ secrets.SNOWFLAKE_ACCOUNT }} \
               --user ${{ secrets.SNOWFLAKE_USER }} \
               --password "${{ secrets.SNOWFLAKE_PASSWORD }}"
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push

Databricks

# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider databricks \
            --workspace-url ${{ secrets.DATABRICKS_WORKSPACE_URL }} \
            --token-env DATABRICKS_TOKEN \
            --include-grants \
            --out .lexega/catalog.json
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push

Use Cached Catalog in PR Checks

# .github/workflows/sql-review.yml
on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - run: |
          lexega-sql review ${{ github.event.pull_request.base.sha }}..${{ github.sha }} \
            models/ -r \
            --catalog .lexega/catalog.json \
            --policy .lexega/policy.yaml \
            --env prod

Catalog Staleness

Catalog snapshots can become stale as your schema evolves. Lexega uses row_count_estimate_as_of and bytes_estimate_as_of timestamps to track freshness when available (Snowflake populates these automatically).

Recommendations:

  • Refresh daily for actively changing schemas
  • Refresh weekly for stable schemas
  • For Databricks, run ANALYZE TABLE periodically to keep Delta statistics current
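A refresh job can decide whether a snapshot is overdue from the generated_at timestamp in the snapshot header. A minimal sketch, assuming ISO-8601 UTC timestamps as in the example snapshot:

```python
from datetime import datetime, timedelta, timezone

# Sketch: flag a snapshot older than a chosen refresh window, using the
# ISO-8601 `generated_at` timestamp from the snapshot header.
def is_stale(generated_at, max_age=timedelta(days=7), now=None):
    ts = datetime.fromisoformat(generated_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - ts > max_age

now = datetime(2025, 1, 30, tzinfo=timezone.utc)
assert is_stale("2025-01-15T10:30:00Z", now=now)       # ~15 days old
assert not is_stale("2025-01-28T00:00:00Z", now=now)   # 2 days old
```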

License Requirements

Catalog operations require a Team, Pro, or Enterprise license. The core CLI works without a catalog on all license tiers.

Need Help?

Can't find what you're looking for? Check out our GitHub or reach out to support.