Data Lineage from SQL Alone
Most lineage tools ask you to install an agent that intercepts query execution, to parse warehouse logs, or to connect to a dedicated metadata service. That is significant infrastructure for a problem that is mostly solvable by parsing SQL.
If you store your DDL and transformation SQL (views, CTEs, stored procedures), you already have everything needed to derive table-to-table and column-to-column lineage statically.
Why Agent-Installed Lineage Is Overkill for Most Teams
Runtime lineage agents track every query that executes. This produces complete lineage — including ad-hoc analyst queries — at the cost of:
- A sidecar or warehouse extension running continuously.
- Query interception latency.
- Metadata storage proportional to query volume.
For the 80% of lineage questions that concern defined transformations (views, dbt models, pipelines), static SQL parsing is sufficient, free, and instant.
sqlglot's lineage()
sqlglot is a Python SQL parser and transpiler. Its lineage() function takes a SQL expression and produces a directed graph of source-to-target column relationships.
```python
import sqlglot
from sqlglot.lineage import lineage

# The SELECT body of the revenue_summary view. lineage() traces columns
# through SELECT statements, so pass the query rather than the CREATE VIEW.
sql = """
SELECT
  o.customer_id,
  SUM(oi.unit_price * oi.quantity) AS total_revenue,
  COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.customer_id
"""

node = lineage("total_revenue", sql)
for n in node.walk():
    # n.downstream holds the nodes n was derived from
    print(n.name, "←", [d.name for d in n.downstream])
```
Output:

```
total_revenue ← ['unit_price', 'quantity']
unit_price ← ['order_items.unit_price']
quantity ← ['order_items.quantity']
```
Table-to-Table and Column-to-Column Edges
For a view, sqlglot produces two layers of graph edges:
- Table-level: revenue_summary depends on orders and order_items.
- Column-level: revenue_summary.total_revenue is derived from order_items.unit_price and order_items.quantity. revenue_summary.customer_id passes through from orders.customer_id.
Column-level derivation is traceable wherever SQL is deterministic (direct references, arithmetic, COALESCE). It is not traceable for Python UDFs or opaque function calls — sqlglot marks these as UNKNOWN sources rather than guessing.
Building a Lineage Graph Across Many Definitions
```python
import sqlglot
import networkx as nx
from sqlglot.lineage import lineage as sql_lineage

G = nx.DiGraph()

# all_view_definitions maps view name -> SELECT body of that view.
for view_name, view_sql in all_view_definitions.items():
    try:
        # lineage() traces one column at a time, so walk each output column.
        for column in sqlglot.parse_one(view_sql).named_selects:
            node = sql_lineage(column, view_sql)
            for n in node.walk():
                for source in n.downstream:
                    G.add_edge(source.name, n.name)
    except sqlglot.errors.ParseError:
        pass  # Log and skip unparseable SQL dialects

print(nx.dag_longest_path(G))  # Longest dependency chain
```
DQ runs this graph construction on every profiling run and exposes the result in /docs/lineage. Impact analysis ("which downstream views break if I rename this column?") is available without any runtime agents.
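Impact analysis on such a graph is a one-liner with NetworkX. The edges below are a toy subset of the revenue_summary example, plus one hypothetical downstream view (churn_report) to show that the impact is transitive:

```python
import networkx as nx

# Toy lineage edges; churn_report is a hypothetical second-hop consumer.
G = nx.DiGraph()
G.add_edge("orders.customer_id", "revenue_summary.customer_id")
G.add_edge("revenue_summary.customer_id", "churn_report.customer_id")
G.add_edge("order_items.unit_price", "revenue_summary.total_revenue")

# Renaming orders.customer_id affects everything transitively downstream of it.
impacted = nx.descendants(G, "orders.customer_id")
print(sorted(impacted))  # ['churn_report.customer_id', 'revenue_summary.customer_id']
```

Note that `nx.descendants` returns the full transitive closure, whereas `G.successors` would stop at direct dependents.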
Limitations
- sqlglot does not parse all SQL dialects perfectly. BigQuery-specific functions or Snowflake QUALIFY clauses may require dialect hints.
- Dynamic SQL (EXECUTE format, stored procedures with string-building) cannot be statically analyzed.
- Lineage from external data loads (S3, Fivetran syncs) requires metadata enrichment from those sources.
For catalog-level column grouping on top of lineage, see /docs/catalog.
FAQ
Q: Does DQ need access to query history to build lineage?
A: No. DQ parses the SQL definitions stored in your database catalog (information_schema.views, dbt manifest files, or uploaded SQL files). No query log access required.
Q: Which SQL dialects does sqlglot support?
A: As of 2026, sqlglot supports BigQuery, Snowflake, Spark SQL, Trino, DuckDB, PostgreSQL, MySQL, and more. Pass dialect="bigquery" to handle dialect-specific syntax.
Q: Can lineage be used for impact analysis before a schema change?
A: Yes. Query nx.descendants(G, "orders.customer_id") on the NetworkX graph to find every column and view transitively downstream of it (G.successors returns only direct dependents).
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.