
Best Practice

Before You Start Wren AI: Define Your “First 20 Questions.”

Best practice: don’t start by modeling everything. Start by modeling what people actually ask.

  1. Sit with your stakeholders and list:
    • 10–20 questions they ask every week (dashboards, reports, daily checks).
    • The core metrics they care about (MRR, AOV, LTV, churn, CAC, retention, etc.).
  2. Mark each question as:
    • Simple retrieval (counts, sums, simple group-by),
    • Business metric (has a specific formula),
    • Complex analysis (window functions, multi-table joins, intermediate steps).

You’ll use this list to decide where each question’s logic should live: simple retrieval questions are mostly covered by good semantics (Step 1), business metrics usually need Instructions (Step 2), and complex analyses are the best candidates for gold-standard Question-SQL Pairs (Step 3).


Step 1 – Build a Semantic Layer That Speaks Your Business Language

Wren AI’s semantic layer is where you tell the AI what the data represents: entities, metrics, relationships, and business meaning.

1.1 Use business-first descriptions

When defining models and columns (manually or via the Modeling AI Assistant):

  • Describe tables as real-world objects, not technical storage:
    • Instead of: “daily_report – production table”
    • Use: “Daily actual production status per machine per day.”
  • Call out purpose/role in the description:
    • “Target production volume provided by management” vs “daily actual production status” so the AI knows which table is “Goals” vs “Actuals”.

This helps Wren AI choose the right tables when users ask for targets, KPIs, or actuals.

1.2 Mark measures, dimensions, and entities clearly

From your schema:

  • Measures/metrics
    • Numeric fields that should be summed or averaged (e.g., run_minutes, revenue, quantity).
    • In descriptions, say “Total X” / “Amount of Y” / “Metric used for …”.
  • Dimensions
    • Dates, categories, and IDs used for grouping and filtering (e.g., state_date, region, plan_type).
    • Explicitly mention “Date for …” or “Category of …”.
  • Entity identifiers
    • Mark IDs as entity keys with descriptions like “Unique identifier for customer/employee/order”.
    • Wren AI can then infer COUNT(DISTINCT employee_id) for “headcount” instead of counting rows.
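
For example, with IDs marked as entity keys, a headcount question should translate into a distinct count rather than a row count. A minimal sketch of the SQL you'd expect, assuming a hypothetical employees table with an employee_id key and an is_active flag:

```sql
-- Hypothetical schema: employees(employee_id, is_active, ...)
-- "What is our current headcount?"
SELECT COUNT(DISTINCT employee_id) AS headcount
FROM employees
WHERE is_active = TRUE;  -- count people, not rows
```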

1.3 Define relationships, not just joins

Use the modeling UI / AI assistant to:

  • Declare one-to-one, one-to-many, and many-to-one relationships between models (e.g., “One organization → many projects”).
  • Make sure key joins (orders→order_items, customers→subscriptions, etc.) are correctly defined so users can ask “by customer”, “by product”, “by month” without thinking about join logic.

Anti-pattern: only connecting tables at the SQL level and hoping the LLM discovers joins. In Wren AI, you should encode the real relationships explicitly.
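
Once those relationships are declared, a question like “revenue by customer” can resolve to the correct join path without the user spelling it out. A rough sketch of the SQL the relationships should enable, assuming hypothetical customers, orders, and order_items tables:

```sql
-- Hypothetical schema: customers(customer_id, name),
--                      orders(order_id, customer_id),
--                      order_items(order_id, price)
-- "Revenue by customer"
SELECT c.name,
       SUM(oi.price) AS revenue
FROM customers c
JOIN orders o       ON o.customer_id = c.customer_id  -- one customer → many orders
JOIN order_items oi ON oi.order_id   = o.order_id     -- one order → many items
GROUP BY c.name;
```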

1.4 Use semantics to resolve jargon & ambiguity

Add column/table descriptions that map internal jargon to business meaning:

  • If you have multiple LTV definitions (raw_ltv, calculated_ltv), describe each precisely and note in which contexts they’re used.

Rule of thumb: if a junior PM would confuse two columns, add more semantic description.
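
For instance, if “LTV” should map to calculated_ltv while “life time value” maps to raw_ltv (as in the instruction example in Step 2 below), the generated SQL differs only in which column is aggregated. A minimal sketch, assuming a hypothetical customers table that holds both columns:

```sql
-- Hypothetical schema: customers(customer_id, raw_ltv, calculated_ltv)
-- "What is the average LTV?"             -> calculated_ltv
SELECT ROUND(AVG(calculated_ltv), 2) AS avg_ltv FROM customers;

-- "What is the average life time value?" -> raw_ltv
SELECT ROUND(AVG(raw_ltv), 2) AS avg_ltv FROM customers;
```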


Step 2 – Add Instructions as Your “Universal Rules.”

Instructions are reusable rules that shape SQL generation, chart behavior, and summaries. They can be:

  • Global – always applied to all queries.
  • Question-Matching – only applied when questions match certain patterns/keywords.

2.1 Start with these core Global Instructions

For a first project, we recommend you configure at least:

  1. Data quality/exclusion rules
    • Example (from docs):

      “Exclude orders with order_status IN ('canceled', 'unavailable') from any sales or revenue-related calculations.”

    • Similarly define:

      • statuses to ignore (test users, internal orders),
      • flags to exclude (fraud, sandbox).
  2. Date and time defaults
    • For example:
      • “When a question mentions ‘orders last week’, use order_created_at as the default date field.”
      • “If no date range is specified, default to the last 30 days.”
  3. Numeric & percentage formatting
    • “Round all monetary metrics to 2 decimal places.”
    • “Format rates as percentages with 1 decimal place (e.g., 12.5%).”
  4. Chart & visualization rules
    • Examples:
      • “When a question implies a time trend (trend, over time, monthly), default to a line chart with time on the X-axis.”
      • “For ‘top N’ questions, default to a horizontal bar chart, sorted descending.”
  5. Summary style guidelines
    • Example from docs:
      • Start summaries with: “Here is the analysis summary: …”
      • Always include the total first, and if comparing periods, include the % change.

Examples:

| Instruction Type | Configuration Content | Wren AI Behavior |
|---|---|---|
| Default Date Handling | When a user mentions only 'orders', default to using the order_created_at field for time filtering, not shipped_date. | When asked "What is the number of orders last week," the AI automatically uses order_created_at for the filter. |
| Decimal Precision Standard | For all monetary metrics (e.g., Revenue, AOV), always ROUND the result to 2 decimal places. For count metrics (e.g., Orders, Users), cast results as integers. | Ensures clean reporting, displaying $1,250.50 instead of $1,250.4999 and preventing decimal values for headcounts. |
| Percentage Formatting | When calculating rates or ratios (e.g., Conversion Rate), output the result as a percentage with 1 decimal place (e.g., ROUND(value * 100, 1)). | Improves readability by displaying values like 12.5% rather than raw decimals like 0.12543. |
| Chart Palette | Use the following chart palette for all visualizations. Primary colors (in order): Brand Primary #0000FF, Brand Secondary #FF0000, Accent 1 #FFFF00, Accent 2 #FF6600. | Automatically applies the expected chart palette instead of using system-default colors. |
| Ranking Visualization Rule | For ranking or comparison questions (e.g., "Top 5 products", "Sales by region"), default to a bar chart sorted in descending order. | Helps users instantly identify top performers without needing to manually adjust chart settings. |
| Summary Style Guide | When summarizing data, always explicitly mention the total value first. If comparing time periods, include the percentage change in brackets. | Generates concise, executive-style summaries like: "Total revenue is $50k (+12% vs last month)." |
| Universal Exclusion Rule | When querying any product-related data, automatically exclude records where category_id is '999' (internal test products). | Ensures all product-related analyses only include production data. |
| Naming Ambiguity Resolution | If a user mentions 'LTV', use the calculated_ltv column. If they mention 'life time value', use the raw_ltv column. | Resolves situations where business terminology maps to multiple database fields. |
| Naming Ambiguity Resolution (real-world use case) | When the question involves product categories in general (e.g., grouping by product or listing categories), always include product_category_name_english in the SELECT clause. Only use product_category_name if the question specifically asks for the category name in Portuguese. | Addresses situations where the AI needs to follow updated data-transformation results in real-world use cases. |

These immediately make answers feel “on-brand” and consistent across your team.
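
To see how several Global Instructions compose, here is roughly the SQL you would want Wren AI to generate for “What was revenue last week?” once the date-default, exclusion, and rounding rules above are in place. This is a sketch, not the exact output; the total_amount column is a hypothetical stand-in for your revenue field:

```sql
-- "What was revenue last week?"
-- Assumed schema: orders(order_id, order_created_at, order_status, total_amount)
SELECT ROUND(SUM(total_amount), 2) AS revenue               -- monetary metric: 2 decimals
FROM orders
WHERE order_status NOT IN ('canceled', 'unavailable')       -- data-quality exclusion rule
  AND order_created_at >= CURRENT_DATE - INTERVAL '7' DAY   -- default date field + range
  AND order_created_at <  CURRENT_DATE;
```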

2.2 Use Question-Matching Instructions for tricky concepts

Use Question-Matching Instructions when:

  • A specific domain concept has a precise definition (late deliveries, churn, VIP customers, etc.).
  • You want all questions that mention certain phrases to use the same logic.

Example from docs:

For “late delivery” questions, calculate lateness as order_delivered_customer_date > order_estimated_delivery_date in the olist_orders_dataset.
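
A question that matches this instruction (e.g., “How many orders were delivered late, and what is our on-time rate?”) should then produce SQL along these lines; the two date columns come straight from the instruction, and the percentage formatting follows the global rules from Step 2.1:

```sql
-- "How many orders were delivered late, and what is our on-time rate?"
SELECT SUM(CASE WHEN order_delivered_customer_date > order_estimated_delivery_date
                THEN 1 ELSE 0 END) AS late_deliveries,
       ROUND(SUM(CASE WHEN order_delivered_customer_date <= order_estimated_delivery_date
                      THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS on_time_rate_pct
FROM olist_orders_dataset
WHERE order_delivered_customer_date IS NOT NULL;
```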

Patterns you should consider:

  • “late delivery”, “delayed”, “on-time rate”
  • “churned customer”, “active customer”
  • “trial to paid conversion”
  • “retention”, “cohort”

Step 3 – Capture Gold-Standard Queries as Question-SQL Pairs

Question-SQL Pairs are your Key Case Handbook: they encode the exact SQL that should be used for high-stakes or complex questions.

3.1 When to create a Question-SQL Pair

Create a Pair when:

  • The metric is core to the business (MRR, AOV, LTV, churn, CAC, yield rate, utilization).
  • The SQL is non-trivial:
    • Multiple joins across many tables,
    • Window functions,
    • Complex filters/exclusions,
    • Weighted averages over time.
  • Your existing BI tool already has an approved query for it.

Examples:

| Operation Method | Natural Language Question (Q) | Gold-Standard SQL (S), Example Logic | Applicable Scenario (Purpose) |
|---|---|---|---|
| Save to Knowledge, after a successful query | "List the top 5 machines with the lowest efficiency." | RANK() OVER (ORDER BY utilization_rate ASC) ... LIMIT 5 | Stores a standard ranking pattern to consistently identify underperformers without manual sorting configuration. |
| Manual addition for complex business logic | "Show the difference between actual production and target goals by month." | SELECT COALESCE(a.val, 0) - COALESCE(t.val, 0) FROM actual a FULL OUTER JOIN target t ON ... | Handles complex, multi-step logic involving data merging (full outer joins) to ensure dates missing in one dataset (e.g., a day with a target but no production) are still calculated correctly. |
| Manual addition for complex business logic | "Which specific machine parts caused the most downtime based on the logs?" | CASE WHEN POSITION('oven' IN note) > 0 THEN 'Oven' ELSE 'Other' END | Parses unstructured text data into structured categories, allowing the AI to answer analytical questions based on raw log notes. |
| Manual addition for complex business logic | "Identify VIP customers whose revenue exceeded $100K for three consecutive months." | (Complex SQL involving window functions) | Handles complex, multi-step logic the AI cannot easily infer, ensuring the accuracy of advanced analysis. |
| Manual addition / optimization | "What is the AOV (average order value)?" | ROUND(SUM(price + freight_value) / COUNT(DISTINCT order_id), 2) | Ensures the calculation logic for core metrics (e.g., AOV) remains consistent and accurate across the organization. |
| Manual addition / optimization | "What is the overall Yield Rate for the last quarter?" | ROUND(SUM(good_quantity) * 100.0 / SUM(total_quantity), 2) | Ensures the calculation uses weighted averages for aggregated periods instead of averaging daily percentages, which prevents mathematical errors. |
| Manual addition / optimization | "Calculate the Machine Utilization Rate for Plant A." | ROUND(SUM(run_minutes) * 100.0 / SUM(total_minutes), 2) ... WHERE total_minutes > 0 | Standardizes efficiency metrics by aggregating total runtime vs. available time, preventing division-by-zero errors in the analysis. |
| Manual addition / optimization | "What is the material breakage rate?" | ROUND(SUM(abnormal_count) * 100000.0 / SUM(total_meter), 2) | Enforces specific industry coefficients (per 100,000 units) to ensure the metric matches standard reporting formats. |
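
For the ratio-style metrics above, the essential pattern is to aggregate the numerator and denominator first and divide once, rather than averaging per-day percentages. A sketch of the Yield Rate pair, assuming a hypothetical production_daily table (the date filter is illustrative):

```sql
-- Hypothetical schema: production_daily(state_date, good_quantity, total_quantity)
-- "What is the overall Yield Rate for the last quarter?"
-- Correct: aggregate first, divide once (a volume-weighted average)
SELECT ROUND(SUM(good_quantity) * 100.0 / SUM(total_quantity), 2) AS yield_rate_pct
FROM production_daily
WHERE state_date >= DATE '2024-01-01'   -- replace with your last-quarter bounds
  AND state_date <  DATE '2024-04-01';

-- Anti-pattern: averaging daily percentages over-weights low-volume days
-- SELECT AVG(good_quantity * 100.0 / total_quantity) FROM production_daily;
```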

3.2 How to build them efficiently

  1. Run a question in Wren AI, review the SQL, and if correct:
    • Click Save to Knowledge to store it as a Question-SQL Pair.
  2. Or copy existing SQL from your warehouse / BI tool and:
    • Go to Knowledge → Question-SQL Pairs, add a new item,
    • Paste the question and the “gold” SQL.

3.3 Best practices

  • Write questions in natural stakeholder language:
    • “What is our AOV?” instead of “AOV metric definition”.
  • Include variants if necessary:
    • “average order value”
    • “AOV”
    • “avg spend per order.”
  • Keep the SQL fully explicit:
    • Include all filters, joins, and aggregations,
    • Use rounding/formatting logic that matches your reporting.

Over time, this becomes a “training set” for your own data – but without needing to fine-tune models.


Step 4 – Iterate with Real Questions (Human-in-the-Loop)

Once semantics, initial instructions, and a few Question-SQL Pairs are in place:

  1. Use Wren AI’s Ask flow to ask realistic questions.
  2. For each answer:
    • Review the SQL breakdown.
    • If it’s correct and valuable → Save as a new Question-SQL Pair.
    • If it’s almost right but missing a rule → add or refine an Instruction.
  3. Pay attention to Suggested Instructions:
    • Wren AI may propose grouping, filtering, or aggregation rules based on your questions.
    • Decide whether to save, edit, or ignore them.

This feedback loop turns everyday usage into continuous context-building.


Step 5 – Maintain Your Context as the Schema Evolves

As your data changes:

  • Use schema change detection to see when tables/columns are removed, renamed, or type-changed.
  • After resolving schema changes, review:
    • Affected Semantics (descriptions, relationships),
    • Any Instructions that reference changed fields,
    • Question-SQL Pairs that might break.

Lightweight process to adopt:

  • Once per sprint:
    • Check schema change notifications.
    • Run a quick sanity set of your top 10 Questions (smoke test).
    • Update any broken Question-SQL Pairs.