Catalog Integration

Lexega can operate fully offline with no database connection. However, when you provide a catalog snapshot, analysis becomes significantly more precise, detecting issues that are impossible to find from the SQL text alone.

The catalog is an enhancer, not a perimeter. Tables missing from the snapshot are analyzed normally—just without catalog-enhanced signals. An incomplete snapshot makes analysis better where it covers, never worse where it doesn't.

Supported Platforms

| Platform | Schema | Row Counts | Constraints | Grants | Policies |
|---|---|---|---|---|---|
| Snowflake | ✅ INFORMATION_SCHEMA | ✅ SHOW TABLES | ✅ PK/FK/UNIQUE | ✅ ACCOUNT_USAGE | ✅ Masking + Row Access |
| Databricks | ✅ Unity Catalog API | ✅ Delta statistics | ✅ PK/FK | ✅ Unity Catalog permissions | ✅ Row filters + Column masks |

What the Catalog Enables

| Without Catalog | With Catalog |
|---|---|
| "This JOIN lacks a constraint" | "This JOIN on orders.customer_id = customers.id lacks FK backing—customers has 1.2M rows, potential fanout" |
| "DISTINCT may be unnecessary" | "DISTINCT is unnecessary—user_id is a PRIMARY KEY" |
| "Nullable column in JOIN" | "JOIN on email which is nullable—NULL values will be silently dropped" |
| "CROSS JOIN detected" | "CROSS JOIN between events (10M rows) and sessions (500K rows)—Cartesian product: 5 trillion rows" |

Catalog Contents

A catalog snapshot is a JSON file containing metadata extracted from your data platform:

{
  "schema_version": 2,
  "generated_at": "2025-01-15T10:30:00Z",
  "provider": "snowflake",
  "databases": [
    {
      "name": "ANALYTICS",
      "schemas": [
        {
          "name": "PUBLIC",
          "tables": [
            {
              "name": "CUSTOMERS",
              "kind": "Table",
              "row_count_estimate": 1200000,
              "columns": [
                { "name": "ID", "data_type": "NUMBER(38,0)", "nullable": false },
                { "name": "EMAIL", "data_type": "VARCHAR(256)", "nullable": true }
              ],
              "constraints": [
                { "kind": "PrimaryKey", "columns": ["ID"] }
              ]
            }
          ]
        }
      ]
    }
  ],
  "policies": [],
  "policy_references": [],
  "grants": { "object_privileges": [] }
}

Names can be plain strings (as above) or objects with an optional case_sensitive flag for mixed-case identifiers: { "name": "My Table", "case_sensitive": true }. The sidecar generates the appropriate format automatically.
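A consumer of the snapshot can accept both name encodings with a small helper. This is an illustrative sketch (the `parse_name` function is hypothetical, not part of Lexega), showing the two forms described above:

```python
# Hypothetical helper illustrating the two accepted name encodings:
# a bare string, or an object with an optional case_sensitive flag.
def parse_name(raw):
    """Return (name, case_sensitive) for either name encoding."""
    if isinstance(raw, str):
        # Plain strings follow the provider's default casing rules.
        return raw, False
    return raw["name"], raw.get("case_sensitive", False)

assert parse_name("CUSTOMERS") == ("CUSTOMERS", False)
assert parse_name({"name": "My Table", "case_sensitive": True}) == ("My Table", True)
```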

The provider field enables automatic platform-aware behavior—identifier normalization, privilege canonicalization, and grant analysis all adapt accordingly.

Key Metadata Fields

| Field | What It Enables |
|---|---|
| provider | Platform-aware identifier normalization and grant analysis |
| row_count_estimate | Detect large table scans, estimate JOIN fanout and CROSS JOIN products |
| bytes_estimate | Track data volume for cost estimation |
| columns[].nullable | Detect nullable JOIN columns (Q-JOIN-NULL-CENH) |
| columns[].data_type | Detect type mismatches in JOINs (Q-JOIN-TYPEMIS-CENH) |
| constraints | Detect missing FK backing (Q-JOIN-FANOUT), unnecessary DISTINCT |
| tags | Enable tag-based custom rules (e.g., block queries on PII tables without audit) |
| policies | Track masking/row-access policy coverage |
| policy_references | Know which columns are protected |
| grants.object_privileges | Compute effective access changes from GRANT/REVOKE |

Generating a Catalog Snapshot

Snowflake

# With external browser auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser

# With password auth
lexega-catalog pull \
  --provider snowflake \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --password "$SNOWFLAKE_PASSWORD"

# Include grant graph (role hierarchy + object privileges)
lexega-catalog pull \
  --provider snowflake \
  --include-grants \
  --out .lexega/catalog.json \
  -- --account ACCT --user USER --auth externalbrowser

Databricks

# Basic catalog pull (Unity Catalog)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --out .lexega/catalog.json

# With grants (Unity Catalog permissions)
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --include-grants \
  --out .lexega/catalog.json

# Filter to specific catalogs
lexega-catalog pull \
  --provider databricks \
  --workspace-url https://your-workspace.cloud.databricks.com \
  --token-env DATABRICKS_TOKEN \
  --database my_catalog \
  --out .lexega/catalog.json

Databricks authentication: Set DATABRICKS_TOKEN to a personal access token (PAT) from your workspace settings. The sidecar uses the Unity Catalog REST API (/api/2.1/unity-catalog/) — no SQL warehouse or cluster required.

Row counts: For Delta tables with statistics collected (via ANALYZE TABLE or auto-stats), row counts and byte sizes are automatically extracted from spark.sql.statistics.numRows and spark.sql.statistics.totalSize in table properties.
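Extracting those estimates is simple string-to-integer parsing of the table properties map. A minimal sketch, assuming the properties arrive as a flat string-valued dict (as the Unity Catalog tables endpoint returns them) and using the two property keys named above:

```python
# Sketch: pull row/byte estimates out of a Delta table's properties map.
# Assumes a flat dict of string values, keyed as described in the docs.
def delta_stats(properties):
    """Return (row_count, total_bytes); None for stats not yet collected."""
    rows = properties.get("spark.sql.statistics.numRows")
    size = properties.get("spark.sql.statistics.totalSize")
    return (int(rows) if rows else None, int(size) if size else None)

props = {"spark.sql.statistics.numRows": "1200000",
         "spark.sql.statistics.totalSize": "734003200"}
assert delta_stats(props) == (1200000, 734003200)
assert delta_stats({}) == (None, None)  # no ANALYZE TABLE run yet
```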

Provider Auto-Detection

When analyzing SQL with a catalog snapshot, Lexega automatically detects the provider from the snapshot's provider field:

# No --provider flag needed — auto-detected from snapshot
lexega-sql analyze model.sql --catalog .lexega/catalog.json

You can also explicitly override the provider:

lexega-sql analyze model.sql --catalog .lexega/catalog.json --provider databricks

Sidecar Data Sources

Snowflake Queries

The sidecar runs read-only queries against Snowflake:

Schema Metadata (INFORMATION_SCHEMA)

| Query | Purpose |
|---|---|
| SHOW DATABASES | List all accessible databases |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLES | Table metadata per database |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.COLUMNS | Column types and nullability |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.TABLE_CONSTRAINTS | Primary keys, unique constraints |
| SELECT ... FROM {db}.INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS | FK relationships |
| SHOW TABLES IN SCHEMA {db}.{schema} | Row count and byte estimates |
| SHOW IMPORTED KEYS IN DATABASE {db} | FK column membership |

Grant Graph (ACCOUNT_USAGE, requires --include-grants)

| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON = 'ROLE' | Role hierarchy |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES WHERE GRANTED_ON IN ('TABLE', 'VIEW', ...) | Object privileges |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS | User-to-role assignments |

Governance Policies (ACCOUNT_USAGE)

| Query | Purpose |
|---|---|
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.MASKING_POLICIES | Masking policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.ROW_ACCESS_POLICIES | Row access policy definitions |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.POLICY_REFERENCES | Policy-to-table mappings |
| SELECT ... FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES | Tags on tables and columns |

Databricks API Endpoints

The sidecar uses the Unity Catalog REST API (no SQL warehouse needed):

| Endpoint | Purpose |
|---|---|
| GET /api/2.1/unity-catalog/catalogs | List all catalogs |
| GET /api/2.1/unity-catalog/schemas?catalog_name=... | List schemas per catalog |
| GET /api/2.1/unity-catalog/tables?catalog_name=...&schema_name=... | List tables and columns |
| GET /api/2.1/unity-catalog/tables/{full_name} | Table detail (constraints, masks, row filters, Delta stats) |
| GET /api/2.1/unity-catalog/permissions/{type}/{name} | Object permissions (with --include-grants) |

Manual Generation

You can also generate the JSON yourself. The schema is documented in the JSON Schema file:

# Get the JSON Schema for validation
lexega-sql catalog schema > catalog.schema.json
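Before wiring a hand-written snapshot into CI, a quick structural check catches the most common mistakes. This is a lightweight sanity check, not full schema validation (use the exported JSON Schema with a validator library for that); it only asserts the top-level fields shown in the example snapshot above:

```python
import json

# Lightweight sanity check for a hand-written catalog snapshot.
# Checks only the top-level fields from the documented example; for full
# validation, run the exported JSON Schema through a validator library.
def sanity_check(text):
    snap = json.loads(text)
    assert snap.get("schema_version") == 2, "unexpected schema_version"
    assert snap.get("provider") in ("snowflake", "databricks"), "unknown provider"
    assert isinstance(snap.get("databases"), list), "databases must be a list"
    return snap

snap = sanity_check('{"schema_version": 2, "provider": "snowflake", "databases": []}')
assert snap["provider"] == "snowflake"
```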

Catalog Commands

# Inspect a catalog snapshot (summary)
lexega-sql catalog inspect .lexega/catalog.json

# Diff two catalog snapshots
lexega-sql catalog diff old.json new.json

Using the Catalog in Analysis

Pass the catalog to any analysis command with --catalog:

# Semantic diff with catalog
lexega-sql diff main..HEAD models/ -r --catalog .lexega/catalog.json

# Review with catalog
lexega-sql review main..HEAD models/ -r --catalog .lexega/catalog.json

# Static analysis with catalog
lexega-sql analyze model.sql --catalog .lexega/catalog.json

The provider is auto-detected from the snapshot. No additional flags needed.

Catalog-Enhanced Signals

Q-JOIN-NULL-CENH: Nullable JOIN Column

Without catalog: Cannot detect (no column metadata)

With catalog:

[HIGH] Q-JOIN-NULL-CENH: Silent Data Loss Risk
  JOIN on `orders.user_id = users.id` where `user_id` is nullable
  ↳ NULL values in user_id will be silently dropped from results

Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN

Without catalog: Cannot detect

With catalog:

[MEDIUM] Q-JOIN-TYPEMIS-CENH: Type Mismatch in JOIN
  JOIN comparing `orders.customer_id` (VARCHAR) to `customers.id` (NUMBER)
  ↳ Implicit cast may cause performance issues or unexpected matches

Cross-Join Product Estimation

Without catalog: "CROSS JOIN detected" (no size info)

With catalog:

[MEDIUM] Explicit CROSS JOIN between system.lakeflow.zerobus_ingest and
  system.lakeflow.zerobus_stream. Cartesian product (199,439,331,565,824
  rows) may cause performance issues on large tables.
  ↳ left_rows=33,059,072 right_rows=6,032,817
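The estimate in that finding is just the product of the two row-count estimates from the snapshot:

```python
# Cartesian product estimate = left row count x right row count.
left_rows = 33_059_072
right_rows = 6_032_817
assert left_rows * right_rows == 199_439_331_565_824
```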

Large Unbounded Table Scan

Without catalog: Cannot detect

With catalog:

[HIGH] Unbounded query on large table without WHERE clause
  ↳ tables=system.access.audit, total_rows=229,743,971,240

Q-JOIN-DISTINCT-MASK: DISTINCT on JOINs with Non-Unique Keys

Without catalog: Warns about potential duplicates

With catalog:

[HIGH] Q-JOIN-DISTINCT-MASK: DISTINCT with Non-Unique JOIN Keys
  DISTINCT applied after JOIN on `orders.product_id`
  ↳ `product_id` is not unique (no PK/UNIQUE constraint)
  ↳ products has 50,000 rows—potential fanout before DISTINCT

Q-JOIN-FANOUT: Unconstrained JOIN Fanout

Without catalog: Cannot detect

With catalog:

[HIGH] Q-JOIN-FANOUT: Unconstrained JOIN Fanout
  JOIN on `events.session_id = sessions.id` lacks FK or UNIQUE constraint
  ↳ events has 10M rows, sessions has 500K rows
  ↳ Result set may be unexpectedly large

Grant Graph Analysis (GRT-ACCESS-EXP, GRT-ACCESS-EXP-HI, GRT-SYSROLE-EXP, GRT-BROAD-PRIV, GRT-BROAD-PRIV-HI)

With the grants section populated, Lexega can compute effective access changes:

[CRITICAL] GRT-SYSROLE-EXP: Critical System Role Grant
  GRANT ROLE ACCOUNTADMIN TO ROLE data_team
  ↳ data_team inherits all ACCOUNTADMIN privileges
  ↳ Affects 15 users via role inheritance
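Computing who is affected by a role grant is a reachability question over the role hierarchy. The sketch below is illustrative only (not Lexega's implementation): given (role, granted_to) edges, a privilege granted to one role reaches every role that transitively inherits it.

```python
# Illustrative sketch (not Lexega's implementation): a privilege granted
# to a role reaches every role that transitively inherits that role.
def roles_affected(granted_to, edges):
    """edges: (child, parent) pairs meaning `child` is granted to `parent`,
    so `parent` inherits `child`'s privileges."""
    reached, frontier = {granted_to}, [granted_to]
    while frontier:
        role = frontier.pop()
        for child, parent in edges:
            if child == role and parent not in reached:
                reached.add(parent)
                frontier.append(parent)
    return reached

# Hypothetical hierarchy: data_team -> analysts -> bi_tool.
edges = [("data_team", "analysts"), ("analysts", "bi_tool")]
assert roles_affected("data_team", edges) == {"data_team", "analysts", "bi_tool"}
```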

Identifier Normalization

Lexega automatically normalizes identifiers based on the catalog provider:

| Platform | Unquoted Identifiers | Quoted Identifiers |
|---|---|---|
| Snowflake | Uppercased (my_table → MY_TABLE) | Case-preserved ("My Table" → My Table) |
| Databricks | Lowercased (MY_TABLE → my_table) | Case-preserved |

This means SQL like SELECT * FROM MY_CATALOG.MY_SCHEMA.MY_TABLE correctly resolves against a Databricks catalog where names are stored in lowercase—and vice versa for Snowflake uppercase conventions.
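As a minimal sketch of those rules (a simplified model, not Lexega's resolver): unquoted identifiers fold per provider, quoted ones keep their case.

```python
# Simplified model of provider-aware identifier normalization:
# unquoted identifiers fold to the provider's default case,
# quoted identifiers are preserved as written.
def normalize(identifier, provider, quoted=False):
    if quoted:
        return identifier
    return identifier.upper() if provider == "snowflake" else identifier.lower()

assert normalize("my_table", "snowflake") == "MY_TABLE"
assert normalize("MY_TABLE", "databricks") == "my_table"
assert normalize("My Table", "snowflake", quoted=True) == "My Table"
```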

CI/CD Integration

Refresh Catalog in CI

For accurate analysis, refresh the catalog periodically:

Snowflake

# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider snowflake \
            --include-grants \
            --out .lexega/catalog.json \
            -- --account ${{ secrets.SNOWFLAKE_ACCOUNT }} \
               --user ${{ secrets.SNOWFLAKE_USER }} \
               --password "${{ secrets.SNOWFLAKE_PASSWORD }}"
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push

Databricks

# .github/workflows/catalog-refresh.yml
on:
  schedule:
    - cron: '0 6 * * *'

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          lexega-catalog pull \
            --provider databricks \
            --workspace-url ${{ secrets.DATABRICKS_WORKSPACE_URL }} \
            --token-env DATABRICKS_TOKEN \
            --include-grants \
            --out .lexega/catalog.json
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .lexega/catalog.json
          git commit -m "chore: refresh catalog snapshot" || exit 0
          git push

Use Cached Catalog in PR Checks

# .github/workflows/sql-review.yml
on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - run: |
          lexega-sql review ${{ github.event.pull_request.base.sha }}..${{ github.sha }} \
            models/ -r \
            --catalog .lexega/catalog.json \
            --policy .lexega/policy.yaml \
            --env prod

Catalog Staleness

Catalog snapshots can become stale as your schema evolves. Lexega uses row_count_estimate_as_of and bytes_estimate_as_of timestamps to track freshness when available (Snowflake populates these automatically).

Recommendations:

  • Refresh daily for actively changing schemas
  • Refresh weekly for stable schemas
  • For Databricks, run ANALYZE TABLE periodically to keep Delta statistics current
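A refresh job can decide whether a snapshot is overdue from the generated_at timestamp in the snapshot header. A minimal sketch, assuming ISO-8601 UTC timestamps as in the example snapshot:

```python
from datetime import datetime, timedelta, timezone

# Sketch: flag a snapshot older than a chosen refresh window, using the
# ISO-8601 `generated_at` timestamp from the snapshot header.
def is_stale(generated_at, max_age=timedelta(days=7), now=None):
    ts = datetime.fromisoformat(generated_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - ts > max_age

now = datetime(2025, 1, 30, tzinfo=timezone.utc)
assert is_stale("2025-01-15T10:30:00Z", now=now)       # ~15 days old
assert not is_stale("2025-01-28T00:00:00Z", now=now)   # 2 days old
```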

License Requirements

Catalog operations require a Team, Pro, or Enterprise license. The core CLI works without a catalog on all license tiers.

Need Help?

Can't find what you're looking for? Check out our GitHub or reach out to support.