Best Practices

Storage

Lakehouse & Storage Strategy

Choose where and how to store data — the foundation that determines cost, performance, and accessibility of everything downstream.

Lakehouse vs. Warehouse

Fabric offers two primary storage artifacts. Choose based on your workload:

Feature	Lakehouse	Warehouse
Primary language	PySpark, Spark SQL	T-SQL
Data format	Delta (open format in OneLake)	Proprietary columnar
Schema	Schema-on-read + schema-on-write	Schema-on-write only
Best for	Data engineering, ML, unstructured data	BI, structured queries, stored procedures
SQL access	Read-only SQL endpoint (auto-generated)	Full T-SQL read/write
Multi-engine	Spark + SQL + Power BI	SQL + Power BI

✅ Recommendation

Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.

File Format Guidance

Recommended formats in Fabric

✅ Delta (Parquet + transaction log)  — Default for all Fabric tables
✅ Parquet                             — Good for external/read-only data
⚠️  CSV                                — Only for Bronze ingestion, convert to Delta ASAP
❌ JSON (nested)                       — Flatten before storing in tables
❌ Excel/XML                           — Convert during ingestion pipeline

V-Order Optimization

V-Order is a Fabric-native write-time optimization that sorts data within Parquet files for maximum read performance across all Fabric engines (SQL, Spark, Power BI Direct Lake).

Why (Performance): Queries scan less data thanks to better predicate pushdown and column-chunk elimination — up to 50% faster reads.
Why (Cost): Faster reads = shorter query sessions = fewer CUs. The write-time overhead (10–15% slower writes) pays for itself many times over for frequently-read tables.
When to enable: Always for Gold/Silver layer tables. Skip for Bronze/staging tables that are written once and immediately overwritten.

File Size & Partitioning

Optimize file sizes: Target 128 MB – 1 GB files. Use OPTIMIZE (bin-compaction) to merge small files. Why (Performance): Too many small files causes excessive metadata overhead and slow reads across all engines.
Partition wisely: Partition by low-cardinality columns used in filters (e.g., year, region). Avoid over-partitioning. Why (Cost): Over-partitioning creates thousands of tiny files that hurt read performance and inflate storage transactions.
Use Delta caching: SSD caching for repeated reads on the same data. Why (Performance): Avoids repeated network I/O to OneLake for hot data.

Shortcuts Strategy

Shortcuts are pointers to data in external locations — they appear as folders in your lakehouse but don't copy data. Use them to:

Access data in other lakehouses or workspaces without duplication
Connect to external ADLS Gen2, AWS S3, or Google Cloud Storage
Enable cross-domain data sharing without moving data
Create a virtual data mesh where each domain manages its data but shares via shortcuts

📚 Learn More

Lakehouse Overview ↗ Delta Optimization & V-Order ↗ Data Warehousing ↗

Kernels

Notebook Kernel Selection

Choose the right notebook kernel — Python, Spark, or T-SQL — based on compute cost, data scale, Delta Lake requirements, and operational needs.

Fabric notebooks support three kernel types: Python, Spark, and T-SQL. The Spark kernel supports PySpark, SparkSQL, Scala, and SparkR. This guide focuses on the Python vs. Spark decision — the most common choice for data engineering workloads.

⚠️ It's Not Just About Cost or Data Size

Compute configuration, Delta Lake feature requirements, engine maturity, and expected data growth all play important roles. A common misconception is that the Python kernel is always cheaper — in reality, cost depends on how you configure compute for each kernel.

Compute Options Compared

Configuration	vCores	Startup Time	CUs Consumed
Python kernel (default)	2 vCores (scalable to 64)	~5 seconds	1 CU (minimum)
Spark — Starter pool (default)	8-core worker, autoscale	~5 seconds	8 CUs (after proactive scale-up)
Spark — Single-node 8-vCore	8 cores shared (driver + executor)	~5 seconds	4 CUs
Spark — Single-node 4-vCore (custom pool)	4 cores shared	3–5 minutes	2 CUs
Spark — Multi-node (custom pool)	Scales with cluster	3–5 minutes	Varies

💡 Single-Node Spark ≈ Python Kernel Cost

A single-node 8-vCore Spark session using the starter pool starts in ~5 seconds — comparable to the Python kernel — while giving you full access to Spark-native capabilities including the Native Execution Engine (NEE).

Performance by Data Scale

Data Scale (compressed)	Engine Advantage
< ~140 MB	Python engines (DuckDB, Polars) are faster
~1–2 GB	Python still has an edge, but Spark + NEE becomes competitive
~10–13 GB	Spark + NEE matches or outperforms single-machine engines; Python engines risk OOM
100 GB+	Spark + NEE is the fastest and most reliable option

Delta Lake Compatibility

This is a critical differentiator. Fabric Spark has native, full-featured Delta Lake support. Python engines (delta-rs, DuckDB, Polars) have meaningful gaps:

Delta Feature	Fabric Spark	delta-rs (Python)	DuckDB	Polars
Read Delta tables	✅	✅	✅	✅
Full write support (MERGE, UPDATE, DELETE)	✅	✅	⚠️ INSERT only	⚠️ No UPDATE/DELETE
Deletion vectors (write)	✅	❌	❌	❌
Column mapping	✅	❌	✅	❌
Liquid Clustering	✅	❌	❌	❌
Type widening	✅	❌	⚠️ Read only	❌
Change Data Feed (write)	✅	❌	❌	❌
OPTIMIZE	✅	✅	❌	❌
VACUUM	✅	❌	❌	❌
Auto compaction	✅	❌	❌	❌
V-Order	✅	❌	❌	❌

⚠️ Deletion Vectors

Deletion vectors are enabled by default in Fabric Spark Runtime 2.0. No Python engine supports writing deletion vectors — if you use a Python engine to write to DV-enabled tables, you'll encounter compatibility errors.

Engine Maturity & Support

Fabric Spark: Microsoft-maintained fork — full end-to-end support, including NEE (built on Velox + Gluten). You can open support tickets for engine bugs, query plans, and memory issues. Code gets faster with runtime updates without changes.
Python engines (DuckDB, Polars): Microsoft supports only OneLake integration (auth, file system access). Engine bugs, API changes between versions, and performance issues require engaging OSS communities directly.
Operational visibility: Spark has the Spark UI + Fabric telemetry for live job monitoring. DuckDB/Polars have no equivalent — when a job hangs, there's no way to inspect what's happening.

Decision Guidance

✅ Use the Python Kernel When

Data is small (< ~1 GB compressed) and single-machine speed matters most
Building lightweight API orchestration, REST/gRPC integrations
Rapid interactive exploration of small datasets
Workload requires a specific Python version not in the Spark runtime
You accept the Delta Lake feature limitations

✅ Use the Spark Kernel When

Data is ≥ 1 GB compressed or will grow to that scale
You need full Delta Lake compatibility (deletion vectors, OPTIMIZE, VACUUM, ACID)
You need live monitoring and production-grade features (env vars, High Concurrency, FAIR/FIFO scheduling)
You use Spark-native APIs (MLlib, Spark SQL, Streaming)
You want Microsoft end-to-end engine support
You need to scale from single-node to multi-node without rewriting code

Key Differences at a Glance

Category	Python Kernel	Spark Kernel
Default compute	2-vCore single-node (up to 64)	8-vCore starter pool with autoscale
Distributed execution	❌ No	✅ Yes
Languages	Python	PySpark, SparkSQL, Scala, SparkR
Full Delta Lake support	❌ No	✅ Yes
Live monitoring	Limited	Full (Spark UI + Job monitoring)
Microsoft engine support	OneLake integrations only	Full runtime support
High Concurrency	❌ No	✅ Yes
V-Order for Direct Lake	❌ No	✅ Yes
Scales to multi-node	❌ No	✅ Yes

💡 Recommended Starting Point

For workloads ≥ 1 GB compressed, start with a single-node 8-vCore Spark cluster using the starter pool. You get near-instant startup (~5s), full Fabric Spark capabilities including NEE, and the ability to scale out to multi-node when needed — all on a single node like the Python kernel.

Compute

Notebook & Spark Optimization

Reduce CU consumption, speed up jobs, and avoid expensive anti-patterns in Fabric notebooks.

Fabric notebooks consume Capacity Units (CUs) based on compute type, session duration, and cluster size. Every decision here directly impacts your monthly Fabric bill.

Python vs. PySpark — Choose the Right Engine

Not every job needs distributed computing. Using PySpark for trivial workloads wastes resources because Spark must start a session (~20–30s), allocate executors, and maintain cluster overhead — all of which burn CUs even if the actual work is minimal.

Scenario	Recommended Engine	Why
Data < 1 GB, simple transforms	Python + pandas	No Spark session overhead — runs on notebook VM only. ~70% cheaper for small jobs.
Exploratory analysis / prototyping	Python + pandas	Faster iteration; no cluster startup wait. You can switch to Spark later if data grows.
Data 1–10 GB with complex joins	PySpark	Distributed processing avoids OOM errors; Spark optimizer handles join strategies.
Data > 10 GB or heavy aggregations	PySpark	Only viable option — single-node memory can't handle the volume.
ML feature engineering at scale	PySpark + MLlib	Distributed training/feature computation is orders of magnitude faster.
Small ML model training (< 1 GB)	Python + scikit-learn	Lower overhead, richer library ecosystem, no serialization cost.

💡 Cost Impact

A PySpark notebook that processes 500 MB of data uses Spark Compute CUs (cluster startup + executor time). The same job in plain Python uses only the notebook VM CUs — typically 2–4× cheaper for datasets under 1 GB. Over hundreds of daily pipeline runs, this adds up to significant savings.

Native Execution Engine

The Native Execution Engine (generally available) runs Spark SQL and DataFrame operations on optimized C++/vectorized code instead of the JVM. It delivers up to 4× faster performance with zero code changes.

Why (Cost): Faster execution = shorter session duration = fewer CUs consumed. No infrastructure changes needed — it's a configuration toggle.
How: Enable at workspace or session level: spark.conf.set("spark.native.enabled", "true")
Limitation: Not all operations are supported yet (UDFs, some complex types fall back to JVM). Monitor the Spark UI for fallback warnings.

High Concurrency Mode

When you orchestrate multiple notebooks in a pipeline, each notebook normally starts its own Spark session — paying the startup cost every time.

Why (Cost): Session startup costs 20–30s of CU burn per notebook. With 10 notebooks in a pipeline, that's 3–5 minutes of wasted compute.
Solution: Enable High Concurrency Mode and assign the same Session Tag to related notebooks. They share one Spark session, reducing startup to ~5s total.
Savings: Up to 30% reduction in pipeline execution time and CU cost.

⚠️ Constraint

Notebooks sharing a session must use the same lakehouse context and Spark compute properties. Plan your pipeline grouping accordingly.

Session Management — Stop Paying for Idle

Spark sessions remain alive (and consuming CUs) even when no code is running. This is the most common source of wasted capacity.

Set aggressive timeouts: Configure session idle timeout to 5–10 minutes (default is 20 min). Every idle minute is CUs burned with no work done.
Explicit session stop: In pipelines, add mssparkutils.session.stop() at notebook end to release resources immediately.
Monitor zombie sessions: Use the Fabric Capacity Metrics app to identify long-running or abandoned sessions draining your capacity.

Common Anti-Patterns

Anti-Pattern	Problem	Fix
`df.collect()` on large data	Infrastructure: Moves all data to driver node — OOM crash risk and network bottleneck.	Process in Spark; only `collect()` small aggregated results.
Python UDFs instead of built-in functions	Performance: Data serializes between JVM→Python→JVM per row. 10–100× slower than native Spark functions.	Use `pyspark.sql.functions` (vectorized). If UDF needed, use Pandas UDFs (`@pandas_udf`).
Row-by-row loops	Performance: Defeats Spark's parallel execution — processes one record at a time on the driver.	Use vectorized DataFrame operations or `withColumn()` transformations.
`toPandas()` on large DataFrames	Infrastructure + Cost: Collects entire dataset to driver memory. Same risk as `collect()`.	Only convert small/aggregated results. For large data, use `pandas API on Spark` (`pyspark.pandas`).
Default cluster size for all jobs	Cost: Large clusters for small jobs waste 60–80% of allocated CUs sitting idle.	Right-size: use Small (4 vCores) for light jobs, scale up only when Spark UI shows resource contention.
No caching of reused DataFrames	Cost + Performance: Re-reads and re-computes data multiple times, multiplying I/O and CU usage.	Use `.cache()` or `.persist()` for DataFrames used more than once. Call `.unpersist()` when done.

Notebook Design for Production

Keep notebooks small and focused: One notebook = one logical unit of work. Monolithic notebooks are harder to debug, can't be parallelized in pipelines, and waste CUs when you re-run everything to fix one step.
Parameterize everything: File paths, date ranges, environment names — use Fabric notebook parameters so the same notebook works across dev/test/prod without code changes.
Separate compute-heavy from light steps: Put your heavy PySpark ETL in one notebook (needs large cluster) and your validation/metadata updates in another (runs on tiny VM with Python). This prevents over-provisioning for the light steps.
Error handling: Use try/catch blocks and log errors to a monitoring table — don't let silent failures cascade through your pipeline.
Version control: Connect notebooks to Git for versioning and code review. Fabric's Git integration tracks notebook changes alongside pipeline configs.

📊 Quick Decision Flowchart

Data < 1 GB? → Use Python/pandas (no Spark needed)
Data 1–10 GB? → PySpark with Small cluster, enable Native Execution Engine
Data > 10 GB? → PySpark with Medium/Large cluster, partition wisely
Running multiple notebooks in pipeline? → Enable High Concurrency Mode
Table read frequently by Power BI/SQL? → Enable V-Order on write

📚 Learn More

Native Execution Engine Overview ↗ High Concurrency Mode ↗ Spark Performance & Cost Best Practices ↗ Notebooks with VS Code ↗

dbt

dbt Integration Best Practices

SQL-first transformation framework with built-in testing, documentation, and lineage for Fabric Warehouse.

Why dbt in Fabric?

dbt (data build tool) is an open-source transformation framework that lets analytics engineers write modular SQL models with built-in testing, documentation, and lineage tracking. It connects to Fabric via the dbt-fabric adapter targeting the Warehouse SQL endpoint.

🤔 Should you use dbt or Spark notebooks?

This isn't either/or — many teams use both. The choice depends on your transformation type, team skills, and existing investments.

dbt vs. Spark Notebooks — When to Use Each

Criteria	dbt (SQL models)	Spark Notebooks	Why It Matters
Transformation type	SQL-based (SELECT, JOIN, GROUP BY)	Python/PySpark, complex logic, ML	Architecture: dbt excels at declarative SQL transformations; notebooks handle imperative, multi-step logic.
Testing & quality	Built-in (unique, not_null, relationships, custom tests)	Manual (assert statements, Great Expectations)	Quality: dbt's test framework catches data issues automatically on every run — no extra setup needed.
Lineage & documentation	Auto-generated DAG, column-level lineage, docs site	Manual documentation, no native lineage	Governance: dbt provides instant visibility into data dependencies — critical for impact analysis and compliance.
Incremental loads	Built-in materializations (append, merge, microbatch)	Custom logic with Delta merge	Cost: dbt's incremental models process only new/changed data, reducing Warehouse CU consumption.
Target engine	Fabric Warehouse (T-SQL)	Fabric Lakehouse (Spark)	Infrastructure: dbt uses the Warehouse compute engine (no Spark cluster needed). Lower cost for SQL workloads.
Version control & CI	Git-native, standard PR reviews, CI testing	Git via Fabric integration (notebook JSON)	Quality: dbt models are plain SQL files — easier to diff, review, and test in CI pipelines.
Team profile	Analytics engineers, SQL-first teams	Data engineers, Python/Spark teams	Adoption: Use the tool that matches your team's primary language to maximize productivity.

✅ Recommendation

Use dbt for Silver → Gold transformations in Fabric Warehouse when your logic is primarily SQL (joins, aggregations, business rules). Use Spark notebooks for Bronze → Silver where you need Python for parsing, complex cleansing, or handling semi-structured data. Many production architectures combine both.

Architecture: dbt in the Medallion Stack

Where dbt fits in a typical Fabric architecture

┌─────────────────────────────────────────────────────────────────┐
│  SOURCES                                                         │
│  (Databases, SaaS, Files, APIs)                                  │
└────────────────────────┬────────────────────────────────────────┘
                         │  Ingestion: Pipelines, Mirroring, Dataflows
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  BRONZE (Lakehouse)                                              │
│  Raw data — Spark notebooks for parsing & cleansing              │
└────────────────────────┬────────────────────────────────────────┘
                         │  Spark notebooks (Python/PySpark)
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  SILVER (Warehouse)                                              │
│  Cleansed, normalized — dbt models for conforming & dedup        │
└────────────────────────┬────────────────────────────────────────┘
                         │  dbt models (SQL)
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  GOLD (Warehouse)                                                │
│  Business-ready — dbt marts, dimensional models, KPIs            │
│  → Power BI Direct Lake / SQL analytics endpoint                 │
└─────────────────────────────────────────────────────────────────┘

Setup & Configuration

profiles.yml — Fabric Warehouse connection

my_fabric_project:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"
      server: "your-workspace.datawarehouse.fabric.microsoft.com"
      database: "your_warehouse"
      schema: "dbt_dev"
      authentication: ServicePrincipal   # Recommended for CI/CD
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
    prod:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"
      server: "your-workspace.datawarehouse.fabric.microsoft.com"
      database: "your_warehouse"
      schema: "dbt_prod"
      authentication: ServicePrincipal
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"

⚠️ Important: Warehouse, Not Lakehouse

The dbt-fabric adapter targets the Warehouse (full DML/DDL support). The Lakehouse SQL endpoint is read-only and cannot execute dbt's CREATE/INSERT/MERGE statements. If you need dbt on Lakehouse data, create shortcuts from Lakehouse to Warehouse, or use the dbt-fabricspark adapter (limited, via Livy API).

Project Structure

Organize by layer: /models/staging/ (Bronze refs), /models/intermediate/ (Silver), /models/marts/ (Gold) — mirrors your medallion architecture.
One model = one transformation: Keep models focused. Don't combine 5 joins and 3 aggregations in a single model — break them into testable steps.
Use sources & refs: Define sources.yml for raw tables and use {{ ref('model_name') }} between models. This enables automatic lineage and dependency resolution.

Cost Optimization

Incremental over full refresh: Use materialized='incremental' with a reliable timestamp column. Full refreshes reprocess all data and burn unnecessary Warehouse CUs.
No Spark overhead: dbt runs on the Warehouse SQL engine — no Spark session startup, no cluster CUs. For SQL-centric transforms, this is inherently cheaper.
Schema-per-environment: Use different schemas (dbt_dev, dbt_prod) instead of separate warehouses. Avoids duplicating infrastructure costs.

Quality & Testing

Test everything: Add unique and not_null tests to all primary keys. Add relationships tests for foreign keys. These run automatically and catch issues before bad data reaches Power BI.
Custom tests for business rules: Write SQL tests like "revenue should never be negative" or "order_date must be ≤ ship_date". dbt fails the pipeline if assertions break.
Freshness checks: Use source freshness to alert when upstream data stops arriving — prevents stale dashboards going unnoticed.

CI/CD Integration

PR validation: Run dbt build --select state:modified+ on pull requests to test only changed models and their downstream dependencies.
Service principal auth: Use Entra service principals (not personal credentials) in CI/CD pipelines for automated, auditable deployments.
Slim CI: Compare against production manifest (--defer --state prod-manifest/) to avoid rebuilding unchanged models in CI.

Limitations to Know

Limitation	Impact	Workaround
Lakehouse SQL endpoint is read-only	Cannot use dbt materializations against Lakehouse	Use Warehouse as dbt target; reference Lakehouse data via shortcuts or cross-database queries
Adapter maturity	Some advanced dbt features may have edge cases	Pin adapter version, test upgrades in dev first, contribute issues to GitHub
No Python models in dbt-fabric	Python-based dbt models not supported (Warehouse is T-SQL only)	Use Spark notebooks for Python transforms; feed results to dbt via Warehouse tables
Snapshot limitations	SCD Type 2 snapshots work but may have performance limits on very large tables	Partition snapshot queries by date range; consider incremental snapshots

📚 Learn More

Set Up dbt for Fabric Warehouse ↗ dbt-fabric Adapter (GitHub) ↗ dbt Docs — Fabric Setup ↗

Real-Time

Real-Time Intelligence

Streaming data ingestion and time-series analytics with Real-Time Intelligence.

Overview

Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.

Key Components

Eventhouse

The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.

KQL Database

Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) — a powerful language for log and time-series analytics.

Eventstreams

Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.

Real-Time Dashboards

Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.

Streaming Ingestion Patterns

Source	Ingestion Method	Latency
Azure Event Hubs	Eventstream → KQL Database	Seconds
Azure IoT Hub	Eventstream → KQL Database	Seconds
Apache Kafka	Kafka connector → Eventstream	Seconds
Custom apps	REST API / SDK ingestion	Seconds
Databases (CDC)	Fabric Mirroring → OneLake	Minutes

When to Use Real-Time vs. Batch

Real-Time Intelligence: IoT telemetry, application logs, clickstream analytics, fraud detection, operational monitoring
Batch (Lakehouse/Warehouse): Historical reporting, data warehousing, regulatory reporting, large-scale transformations
Hybrid: Ingest in real-time to Eventhouse for operational use, then copy to Lakehouse for long-term analytics

✅ Best Practice

Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.

📚 Learn More

Real-Time Intelligence ↗ Eventstreams ↗

Analytics

Power BI Integration

Maximizing the reporting and analytics experience with Fabric-native Power BI features.

Direct Lake Mode

Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery — without copying data into the semantic model.

Reads data directly from Delta tables in OneLake
No data refresh needed — reports always show the latest data
Performance comparable to Import mode for most scenarios
Falls back to DirectQuery if data exceeds memory (configurable behavior)

💡 Key Insight

Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.

Connectivity Modes — Direct Lake vs. DirectQuery vs. Import

Choosing the right connectivity mode is one of the highest-impact decisions for Power BI performance and cost. Each mode trades off freshness, speed, and resource consumption differently.

Criteria	Import	DirectQuery	Direct Lake
How it works	Copies data into Power BI's in-memory engine (VertiPaq)	Sends queries live to the source at report render time	Reads Delta/Parquet files directly from OneLake into memory on demand
Data freshness	Stale until next scheduled refresh	Always real-time (queries source live)	Always fresh — reads latest committed Delta files
Query performance	🟢 Fastest (pre-compressed in-memory columnar)	🔴 Slowest (depends on source query speed + network)	🟢 Near-Import speed (columnar read from OneLake)
Capacity cost	🟡 High memory usage (entire dataset cached). Refresh jobs consume CUs.	🟢 Low memory (no cache). But high CU per query if source is slow.	🟢 Low-to-moderate memory (loads columns on demand, evicts when idle). No refresh CUs.
Data size limits	Bounded by capacity memory (1 GB – 400 GB depending on SKU)	Unlimited (data stays at source)	Bounded by capacity memory, but with automatic fallback to DirectQuery
Source requirements	Any Power Query-supported source	Any DirectQuery-compatible source (SQL, Fabric, Dataverse, etc.)	Fabric only — Delta tables in OneLake (Lakehouse or Warehouse)
Refresh needed?	Yes — scheduled or on-demand refresh	No — live queries	No — reads latest data automatically
DAX calculation support	Full (all DAX functions)	Limited (some DAX patterns perform poorly or are unsupported)	Full (same engine as Import)
RLS enforcement	In the semantic model	At source + semantic model	In the semantic model (respects OneLake security roles)

When to Use Each Mode

Scenario	Recommended Mode	Why
Gold-layer tables in Fabric Lakehouse/Warehouse	Direct Lake	Cost + Freshness: No refresh schedule to manage, no duplicate storage, Import-class speed. The default choice for Fabric-native BI.
Data in external systems (SQL Server, Dataverse, SAP)	DirectQuery	Architecture: Data can't move to OneLake (regulatory, real-time, or volume constraints). Accept slower queries for real-time freshness.
Small, highly curated datasets (< 1 GB) refreshed daily	Import	Performance: Maximum DAX flexibility and fastest queries. Refresh cost is negligible for small data.
Very large tables (100+ GB) with infrequent access	Direct Lake (with fallback)	Cost: Only loads accessed columns into memory. Falls back to DirectQuery for cold data — avoids paying for 100 GB of memory.
Mixed: some Fabric tables + some external sources	Composite model (Direct Lake + DirectQuery)	Architecture: Combine both in one model — Direct Lake for Fabric tables, DirectQuery for external sources that can't be moved.
Near real-time dashboards (< 1 min latency requirement)	DirectQuery or Direct Lake	Freshness: Both deliver real-time. Choose Direct Lake if data is in OneLake; DirectQuery if it's external.
Legacy reports migrating to Fabric	Start with Import → migrate to Direct Lake	Risk: Import works identically to before. Once Gold tables are in Lakehouse/Warehouse, switch to Direct Lake for cost savings.

✅ Default recommendation

If your data is in Fabric, use Direct Lake. It gives you Import-level performance with zero refresh overhead and always-fresh data. Only fall back to Import for edge cases requiring unsupported DAX patterns, or to DirectQuery when data must stay outside OneLake.

⚠️ Direct Lake Fallback Behavior

When a Direct Lake model exceeds the memory guardrails for your capacity SKU, it falls back to DirectQuery mode automatically. This means queries still work but may be slower. Monitor the DirectLakeFallback event in the Capacity Metrics app — frequent fallbacks indicate your tables need OPTIMIZE, V-Order, or a larger SKU.

Semantic Model Best Practices

Build star schemas in your Gold layer — separate facts and dimensions
Use surrogate keys (integer) instead of natural keys for relationships
Minimize calculated columns — push computations to the Lakehouse/Warehouse layer
Use calculation groups for reusable time intelligence (YTD, MTD, YoY)
Enable automatic aggregations for large tables to speed up high-level queries

Report Governance

Centralize semantic models: Create shared models that multiple reports connect to (avoid dataset proliferation)
Use endorsement: Certify production-ready models and promote development models
Apply RLS: Configure row-level security on semantic models for multi-tenant reports
Monitor usage: Use the usage metrics to track report adoption

Performance Optimization

Technique	Impact	Effort
V-Order on Delta tables	🟢 High	Low
Star schema design	🟢 High	Medium
Reduce cardinality of columns	🟢 High	Low
Use aggregation tables	🟡 Medium	Medium
Optimize DAX measures	🟡 Medium	Varies
Composite models	🟡 Medium	Medium
Paginated reports for large exports	🟡 Medium	Low

📚 Learn More

Direct Lake Mode ↗ Star Schema Design ↗

AI Readiness

Preparing Data for AI

Best practices for preparing your semantic models and data sources for Copilot for Power BI and Fabric Data Agents.

🤖 Why This Matters

Without proper data preparation, AI features can struggle to interpret your data correctly — leading to generic, inaccurate, or misleading outputs. Investing in data prep helps AI understand your unique business context, prioritize the right information, and deliver responses that are consistent, reliable, and aligned with your goals.

📖 Full AI & Copilot Guide

For a comprehensive overview of all AI capabilities in Fabric — Copilot across workloads, Data Agents, AI Skill, and governance — see the dedicated AI & Copilot page. This section focuses specifically on preparing your data for optimal AI results.

Copilot for Power BI — Prep Data for AI Toolkit

Power BI provides a unified "Prep data for AI" experience (available in both Desktop and the Power BI service) with three key features to optimize your semantic model for Copilot interactions.

📋 Prerequisites

Requires a paid Fabric capacity (F2+) or Power BI Premium (P1+) and the Copilot admin setting enabled. Power BI Q&A should be enabled on your semantic model for natural language queries. Copilot is now available on all paid SKUs and in sovereign cloud regions.

1. AI Data Schema — Simplify Your Model

Define a focused subset of your model's fields for Copilot to prioritize. By selecting only the most relevant fields, you reduce ambiguity and help Copilot deliver clearer, more accurate responses.

Open Prep data for AI → Simplify data schema tab
Select fields you want Copilot to reason over — prioritize clean columns with limited ambiguity
Remove any fields that could be confusing (duplicate names, internal IDs, staging columns)
Hidden fields in the semantic model are automatically excluded in the initial schema
Relationships between fields are still respected even if one side is excluded

⚠️ Scope Limitation

The AI data schema only applies to Copilot capabilities that use the schema (e.g., answering data questions). It does not affect report page creation, DAX queries, or dataset summaries — those still use the full model.

2. AI Instructions — Add Business Context

Provide context, business logic, and specific guidance directly on the semantic model. Copilot uses these instructions to better interpret user questions by incorporating organizational language, terminology, and analytical priorities.

📊 Business Context

"Busy season is October to February"
"Frame insights with a focus on risk assessment"
"When a user mentions ABCD, they mean the total invoice field"
"A lower attrition percent is more positive"

📐 Analysis Rules

"Always analyze sales on a quarterly basis"
"For retail insights, prioritize the customsegmentationtable"
"Use the sales_fact table as the primary source for all sales questions"
"When a user asks about product sales, always ask for clarification on location"

💡 Prompt Engineering Tips for AI Instructions

Be explicit and specific — Assume Copilot has no understanding of your business
Use analogies and examples — Concrete field/measure references help the model
Avoid ambiguity — Spell out which measure or table to use (and which to avoid)
Order matters — Test different orderings; the sequence of instructions can affect output quality
Break down complex instructions — Split multi-step logic into simple sequential steps
Keep focused — Conflicting or overly complex instructions can confuse the LLM. Less is sometimes more.

Good vs. Bad: AI Instructions

❌ Bad — Vague or Ambiguous	✅ Good — Specific and Actionable
"You're a seasoned BI Analyst who is detail oriented."	"You're a BI Analyst for a food distributor. Responses should be detail oriented and focused on revenue and profitability."
"Use the right sales measure."	"For product-specific sales, use the measure `Total_Sales_Product`. Filter on the Product column from the Sales table. Examples of products: Word, PowerPoint, Excel."
"Show active partners count."	"For Total Active Partners, use the measure `Monthly Active Partner Count_ID`. Do NOT filter on the Customers table."
"Analyze sales by time."	"Always analyze sales on a quarterly basis. When showcasing revenue, break it down by quarter and compare to the `industry` field."
"Find top customers."	"Define top customers by first looking at the revenue table, then returning only customers with the highest order values. Some partners are also customers — don't remove duplicates; indicate whether each is a partner or customer based on the ID value."

Good vs. Bad: Semantic Model Design for AI

❌ Bad — AI-Unfriendly Model	✅ Good — AI-Ready Model
Column named `col_amt_1` with no description	Column named `Total Sales Amount` with description: "Net revenue after discounts in USD"
Flat table with 80 columns mixing facts and dimensions	Star schema with separate `fact_sales`, `dim_product`, `dim_date` tables
All 80 columns exposed to AI data schema	AI data schema includes only the 25 most relevant, well-named fields
Staging tables (`stg_raw_import`) visible in the model	Staging/internal tables hidden; only Gold-layer tables exposed
Duplicate columns: `Revenue`, `Revenue_v2`, `Rev_Final`	Single canonical `Revenue` column; deprecated variants removed
No Q&A enabled, no descriptions, no AI instructions	Q&A enabled, measures have descriptions, AI instructions explain business context and analysis rules

3. Verified Answers — Curated Responses

Verified answers are human-approved, visual responses triggered by predefined phrases. They provide consistent, reliable answers to common business questions and help Copilot learn from effective responses over time.

Select a visual → "Set up a verified answer" from the … menu
Add 5–7 trigger phrases per answer that reflect how users naturally ask about the data
Use Copilot-generated suggestions for additional trigger phrases
Add up to 3 filters that users can adjust via natural language (e.g., "for the Northeast region")
Copilot uses both exact and semantic matching — users don't need to match trigger phrases word-for-word
Stored at the semantic model level — works across all reports using that model

Limit	Value
Verified answers per model	250
Trigger phrases per answer	15
Character limit per trigger	500
Filter permutations per answer	10

4. Mark Model "Approved for Copilot"

After preparing your semantic model, mark it as Approved for Copilot in the model's settings in the Power BI service. This removes the friction treatment (confidence warnings) from answers in the standalone Copilot experience. Changes can take up to 24 hours to propagate for models with many reports.

Fabric Data Agents — Data Preparation

Fabric Data Agents are configurable AI artifacts that enable conversational Q&A over your organizational data. They use NL-to-SQL, NL-to-DAX, and NL-to-KQL to query data across OneLake, semantic models, and KQL databases.

Selecting Data Sources & Tables

A Data Agent supports up to 5 data sources in any combination: Lakehouses, Warehouses, KQL databases, Power BI semantic models, Ontologies, and Microsoft Graph
For each source, select only the relevant tables — focus the agent on data that matters
If your data starts as files (CSV, JSON), ingest into tables first — agents query tables, not individual files
For Power BI semantic models, users only need Read permission to interact via the agent

Agent Instructions & Examples

📝 Agent Instructions

Provide custom rules, terminology, and data source routing guidance:

Direct financial metrics questions to a Power BI semantic model
Route raw data exploration queries to the Lakehouse
Send log analysis questions to the KQL database
Clarify organizational terminology and abbreviations

🔍 Example Queries

Provide question–query pairs to guide agent behavior:

Natural language question + corresponding SQL/KQL query
Helps the agent understand how to interpret similar questions
Not currently supported for Power BI semantic model sources
Include complex real-world examples (joins, aggregations, CTEs)

Good vs. Bad: Data Agent Setup

❌ Bad — Poor Agent Configuration	✅ Good — Well-Configured Agent
Add all lakehouse tables (including staging, temp, and system tables)	Select only curated Gold-layer tables relevant to the agent's purpose
No agent instructions — let the AI figure it out	Instructions specify: "Direct financial questions to the `finance_model` semantic model. Route raw event queries to the Lakehouse. Send log analysis to the KQL database."
No example queries provided	5–10 example question–SQL pairs covering joins, aggregations, and common business questions
Data agent exposes CSV files directly from OneLake	CSV/JSON files ingested into lakehouse tables before adding to agent
Published without testing — users report wrong answers	Tested iteratively with colleagues, instructions refined based on feedback, then published

Governance & Security Layers

Data Agents enforce a strict intent precedence model — higher layers always override lower ones:

Layer	Description	Precedence
Organizational Intent	Tenant-wide policies, compliance requirements	🔴 Highest
Role-based Intent	Workspace governance, permission boundaries	🟠 High
Developer Intent	Custom instructions, example queries, data source config	🟡 Medium
User Intent	Questions and prompts from end users	🟢 Lowest

Agents enforce read-only data access — no writes, no mutations
All queries run with the requesting user's credentials (least-privilege)
Microsoft Purview policies apply: DLP, access restrictions, risk discovery, and audit
Optional Azure AI Content Safety integration for content risk controls

AI Readiness Checklist

A combined checklist for preparing your data for both Copilot for Power BI and Fabric Data Agents:

✅ Semantic Model Prep (Copilot)

Use clean, descriptive column names — avoid abbreviations
Add descriptions to measures and important columns
Build a star schema with clear fact/dimension separation
Enable Q&A on the semantic model
Set an AI data schema — select only relevant fields
Write AI instructions with business context and analysis rules
Create verified answers for your top 10 business questions
Enable Copilot indexing for faster responses
Test changes iteratively in the Copilot pane
Mark model as "Approved for Copilot"

✅ Data Agent Prep

Choose the right data sources (max 5 per agent)
Select only relevant tables — don't expose staging/temp tables
Ingest file-based data into Lakehouse tables
Write agent instructions with routing rules and terminology
Provide example query pairs for complex question patterns
Configure user permissions (Read on semantic models)
Test with colleagues and iterate on instructions
Publish and share via Copilot in Power BI
Review Purview governance policies for compliance
Enable Azure AI Content Safety for production agents

📚 Learn More

Prep Data for AI — Overview ↗ AI Data Schema ↗ AI Instructions ↗ Verified Answers ↗ Fabric Data Agents — Overview ↗ Data Agent Scenario Walkthrough ↗

Migration

🧭 Migration Complexity Scorer

Answer 8 questions about your current environment to get a complexity estimate, risk assessment, and phased migration timeline.

Lakehouse & Storage Strategy

Lakehouse vs. Warehouse

File Format Guidance

V-Order Optimization

File Size & Partitioning

Shortcuts Strategy

📚 Learn More

📎 See Also

Notebook Kernel Selection

Compute Options Compared

Performance by Data Scale

Delta Lake Compatibility

Engine Maturity & Support

Decision Guidance

✅ Use the Python Kernel When

✅ Use the Spark Kernel When

Key Differences at a Glance

📚 Official Documentation

📎 See Also

Notebook & Spark Optimization

Python vs. PySpark — Choose the Right Engine

Native Execution Engine

High Concurrency Mode

Session Management — Stop Paying for Idle

Common Anti-Patterns

Notebook Design for Production

📚 Learn More

📎 See Also

dbt Integration Best Practices

Why dbt in Fabric?

dbt vs. Spark Notebooks — When to Use Each

Architecture: dbt in the Medallion Stack

Setup & Configuration

Best Practices

Project Structure

Cost Optimization

Quality & Testing

CI/CD Integration

Limitations to Know

📚 Learn More

📎 See Also

Real-Time Intelligence

Overview

Key Components

Eventhouse

KQL Database

Eventstreams

Real-Time Dashboards

Streaming Ingestion Patterns

When to Use Real-Time vs. Batch

📚 Learn More

Power BI Integration

Direct Lake Mode

Connectivity Modes — Direct Lake vs. DirectQuery vs. Import

When to Use Each Mode

Semantic Model Best Practices

Report Governance

Performance Optimization

📚 Learn More

Preparing Data for AI

Copilot for Power BI — Prep Data for AI Toolkit

1. AI Data Schema — Simplify Your Model

2. AI Instructions — Add Business Context

📊 Business Context

📐 Analysis Rules

Good vs. Bad: AI Instructions

Good vs. Bad: Semantic Model Design for AI

3. Verified Answers — Curated Responses

4. Mark Model "Approved for Copilot"

Fabric Data Agents — Data Preparation

Selecting Data Sources & Tables

Agent Instructions & Examples

📝 Agent Instructions

🔍 Example Queries

Good vs. Bad: Data Agent Setup

Governance & Security Layers

AI Readiness Checklist

✅ Semantic Model Prep (Copilot)

✅ Data Agent Prep

📚 Learn More