πŸ‘€ Who is this for?

Data Engineer BI Developer Analytics Engineer β€” This section covers storage decisions (Lakehouse vs. Warehouse), notebook & Spark optimization, dbt integration, real-time streaming with Eventhouse, and Power BI Direct Lake integration.

Storage

Lakehouse & Storage Strategy

Choose where and how to store data β€” the foundation that determines cost, performance, and accessibility of everything downstream.

Lakehouse vs. Warehouse

Fabric offers two primary storage artifacts. Choose based on your workload:

FeatureLakehouseWarehouse
Primary languagePySpark, Spark SQLT-SQL
Data formatDelta (open format in OneLake)Proprietary columnar
SchemaSchema-on-read + schema-on-writeSchema-on-write only
Best forData engineering, ML, unstructured dataBI, structured queries, stored procedures
SQL accessRead-only SQL endpoint (auto-generated)Full T-SQL read/write
Multi-engineSpark + SQL + Power BISQL + Power BI
βœ… Recommendation

Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.

File Format Guidance

Recommended formats in Fabric
βœ… Delta (Parquet + transaction log)  β€” Default for all Fabric tables
βœ… Parquet                             β€” Good for external/read-only data
⚠️  CSV                                β€” Only for Bronze ingestion, convert to Delta ASAP
❌ JSON (nested)                       β€” Flatten before storing in tables
❌ Excel/XML                           β€” Convert during ingestion pipeline

V-Order Optimization

V-Order is a Fabric-native write-time optimization that sorts data within Parquet files for maximum read performance across all Fabric engines (SQL, Spark, Power BI Direct Lake).

File Size & Partitioning

Shortcuts Strategy

Shortcuts are pointers to data in external locations β€” they appear as folders in your lakehouse but don't copy data. Use them to:

Kernels

Notebook Kernel Selection

Choose the right notebook kernel β€” Python, Spark, or T-SQL β€” based on compute cost, data scale, Delta Lake requirements, and operational needs.

Fabric notebooks support three kernel types: Python, Spark, and T-SQL. The Spark kernel supports PySpark, SparkSQL, Scala, and SparkR. This guide focuses on the Python vs. Spark decision β€” the most common choice for data engineering workloads.

⚠️ It's Not Just About Cost or Data Size

Compute configuration, Delta Lake feature requirements, engine maturity, and expected data growth all play important roles. A common misconception is that the Python kernel is always cheaper β€” in reality, cost depends on how you configure compute for each kernel.

Compute Options Compared

ConfigurationvCoresStartup TimeCUs Consumed
Python kernel (default)2 vCores (scalable to 64)~5 seconds1 CU (minimum)
Spark β€” Starter pool (default)8-core worker, autoscale~5 seconds8 CUs (after proactive scale-up)
Spark β€” Single-node 8-vCore8 cores shared (driver + executor)~5 seconds4 CUs
Spark β€” Single-node 4-vCore (custom pool)4 cores shared3–5 minutes2 CUs
Spark β€” Multi-node (custom pool)Scales with cluster3–5 minutesVaries
πŸ’‘ Single-Node Spark β‰ˆ Python Kernel Cost

A single-node 8-vCore Spark session using the starter pool starts in ~5 seconds β€” comparable to the Python kernel β€” while giving you full access to Spark-native capabilities including the Native Execution Engine (NEE).

Performance by Data Scale

Data Scale (compressed)Engine Advantage
< ~140 MBPython engines (DuckDB, Polars) are faster
~1–2 GBPython still has an edge, but Spark + NEE becomes competitive
~10–13 GBSpark + NEE matches or outperforms single-machine engines; Python engines risk OOM
100 GB+Spark + NEE is the fastest and most reliable option

Delta Lake Compatibility

This is a critical differentiator. Fabric Spark has native, full-featured Delta Lake support. Python engines (delta-rs, DuckDB, Polars) have meaningful gaps:

Delta FeatureFabric Sparkdelta-rs (Python)DuckDBPolars
Read Delta tablesβœ…βœ…βœ…βœ…
Full write support (MERGE, UPDATE, DELETE)βœ…βœ…βš οΈ INSERT only⚠️ No UPDATE/DELETE
Deletion vectors (write)βœ…βŒβŒβŒ
Column mappingβœ…βŒβœ…βŒ
Liquid Clusteringβœ…βŒβŒβŒ
Type wideningβœ…βŒβš οΈ Read only❌
Change Data Feed (write)βœ…βŒβŒβŒ
OPTIMIZEβœ…βœ…βŒβŒ
VACUUMβœ…βŒβŒβŒ
Auto compactionβœ…βŒβŒβŒ
V-Orderβœ…βŒβŒβŒ
⚠️ Deletion Vectors

Deletion vectors are enabled by default in Fabric Spark Runtime 2.0. No Python engine supports writing deletion vectors β€” if you use a Python engine to write to DV-enabled tables, you'll encounter compatibility errors.

Engine Maturity & Support

Decision Guidance

βœ… Use the Python Kernel When

  • Data is small (< ~1 GB compressed) and single-machine speed matters most
  • Building lightweight API orchestration, REST/gRPC integrations
  • Rapid interactive exploration of small datasets
  • Workload requires a specific Python version not in the Spark runtime
  • You accept the Delta Lake feature limitations

βœ… Use the Spark Kernel When

  • Data is β‰₯ 1 GB compressed or will grow to that scale
  • You need full Delta Lake compatibility (deletion vectors, OPTIMIZE, VACUUM, ACID)
  • You need live monitoring and production-grade features (env vars, High Concurrency, FAIR/FIFO scheduling)
  • You use Spark-native APIs (MLlib, Spark SQL, Streaming)
  • You want Microsoft end-to-end engine support
  • You need to scale from single-node to multi-node without rewriting code

Key Differences at a Glance

CategoryPython KernelSpark Kernel
Default compute2-vCore single-node (up to 64)8-vCore starter pool with autoscale
Distributed execution❌ Noβœ… Yes
LanguagesPythonPySpark, SparkSQL, Scala, SparkR
Full Delta Lake support❌ Noβœ… Yes
Live monitoringLimitedFull (Spark UI + Job monitoring)
Microsoft engine supportOneLake integrations onlyFull runtime support
High Concurrency❌ Noβœ… Yes
V-Order for Direct Lake❌ Noβœ… Yes
Scales to multi-node❌ Noβœ… Yes
πŸ’‘ Recommended Starting Point

For workloads β‰₯ 1 GB compressed, start with a single-node 8-vCore Spark cluster using the starter pool. You get near-instant startup (~5s), full Fabric Spark capabilities including NEE, and the ability to scale out to multi-node when needed β€” all on a single node like the Python kernel.

Compute

Notebook & Spark Optimization

Reduce CU consumption, speed up jobs, and avoid expensive anti-patterns in Fabric notebooks.

Fabric notebooks consume Capacity Units (CUs) based on compute type, session duration, and cluster size. Every decision here directly impacts your monthly Fabric bill.

Python vs. PySpark β€” Choose the Right Engine

Not every job needs distributed computing. Using PySpark for trivial workloads wastes resources because Spark must start a session (~20–30s), allocate executors, and maintain cluster overhead β€” all of which burn CUs even if the actual work is minimal.

ScenarioRecommended EngineWhy
Data < 1 GB, simple transformsPython + pandasNo Spark session overhead β€” runs on notebook VM only. ~70% cheaper for small jobs.
Exploratory analysis / prototypingPython + pandasFaster iteration; no cluster startup wait. You can switch to Spark later if data grows.
Data 1–10 GB with complex joinsPySparkDistributed processing avoids OOM errors; Spark optimizer handles join strategies.
Data > 10 GB or heavy aggregationsPySparkOnly viable option β€” single-node memory can't handle the volume.
ML feature engineering at scalePySpark + MLlibDistributed training/feature computation is orders of magnitude faster.
Small ML model training (< 1 GB)Python + scikit-learnLower overhead, richer library ecosystem, no serialization cost.
πŸ’‘ Cost Impact

A PySpark notebook that processes 500 MB of data uses Spark Compute CUs (cluster startup + executor time). The same job in plain Python uses only the notebook VM CUs β€” typically 2–4Γ— cheaper for datasets under 1 GB. Over hundreds of daily pipeline runs, this adds up to significant savings.

Native Execution Engine

The Native Execution Engine (generally available) runs Spark SQL and DataFrame operations on optimized C++/vectorized code instead of the JVM. It delivers up to 4Γ— faster performance with zero code changes.

High Concurrency Mode

When you orchestrate multiple notebooks in a pipeline, each notebook normally starts its own Spark session β€” paying the startup cost every time.

⚠️ Constraint

Notebooks sharing a session must use the same lakehouse context and Spark compute properties. Plan your pipeline grouping accordingly.

Session Management β€” Stop Paying for Idle

Spark sessions remain alive (and consuming CUs) even when no code is running. This is the most common source of wasted capacity.

Common Anti-Patterns

Anti-PatternProblemFix
df.collect() on large data Infrastructure: Moves all data to driver node β€” OOM crash risk and network bottleneck. Process in Spark; only collect() small aggregated results.
Python UDFs instead of built-in functions Performance: Data serializes between JVMβ†’Pythonβ†’JVM per row. 10–100Γ— slower than native Spark functions. Use pyspark.sql.functions (vectorized). If UDF needed, use Pandas UDFs (@pandas_udf).
Row-by-row loops Performance: Defeats Spark's parallel execution β€” processes one record at a time on the driver. Use vectorized DataFrame operations or withColumn() transformations.
toPandas() on large DataFrames Infrastructure + Cost: Collects entire dataset to driver memory. Same risk as collect(). Only convert small/aggregated results. For large data, use pandas API on Spark (pyspark.pandas).
Default cluster size for all jobs Cost: Large clusters for small jobs waste 60–80% of allocated CUs sitting idle. Right-size: use Small (4 vCores) for light jobs, scale up only when Spark UI shows resource contention.
No caching of reused DataFrames Cost + Performance: Re-reads and re-computes data multiple times, multiplying I/O and CU usage. Use .cache() or .persist() for DataFrames used more than once. Call .unpersist() when done.

Notebook Design for Production

πŸ“Š Quick Decision Flowchart

Data < 1 GB? β†’ Use Python/pandas (no Spark needed)
Data 1–10 GB? β†’ PySpark with Small cluster, enable Native Execution Engine
Data > 10 GB? β†’ PySpark with Medium/Large cluster, partition wisely
Running multiple notebooks in pipeline? β†’ Enable High Concurrency Mode
Table read frequently by Power BI/SQL? β†’ Enable V-Order on write

dbt

dbt Integration Best Practices

SQL-first transformation framework with built-in testing, documentation, and lineage for Fabric Warehouse.

Why dbt in Fabric?

dbt (data build tool) is an open-source transformation framework that lets analytics engineers write modular SQL models with built-in testing, documentation, and lineage tracking. It connects to Fabric via the dbt-fabric adapter targeting the Warehouse SQL endpoint.

πŸ€” Should you use dbt or Spark notebooks?

This isn't either/or β€” many teams use both. The choice depends on your transformation type, team skills, and existing investments.

dbt vs. Spark Notebooks β€” When to Use Each

Criteriadbt (SQL models)Spark NotebooksWhy It Matters
Transformation type SQL-based (SELECT, JOIN, GROUP BY) Python/PySpark, complex logic, ML Architecture: dbt excels at declarative SQL transformations; notebooks handle imperative, multi-step logic.
Testing & quality Built-in (unique, not_null, relationships, custom tests) Manual (assert statements, Great Expectations) Quality: dbt's test framework catches data issues automatically on every run β€” no extra setup needed.
Lineage & documentation Auto-generated DAG, column-level lineage, docs site Manual documentation, no native lineage Governance: dbt provides instant visibility into data dependencies β€” critical for impact analysis and compliance.
Incremental loads Built-in materializations (append, merge, microbatch) Custom logic with Delta merge Cost: dbt's incremental models process only new/changed data, reducing Warehouse CU consumption.
Target engine Fabric Warehouse (T-SQL) Fabric Lakehouse (Spark) Infrastructure: dbt uses the Warehouse compute engine (no Spark cluster needed). Lower cost for SQL workloads.
Version control & CI Git-native, standard PR reviews, CI testing Git via Fabric integration (notebook JSON) Quality: dbt models are plain SQL files β€” easier to diff, review, and test in CI pipelines.
Team profile Analytics engineers, SQL-first teams Data engineers, Python/Spark teams Adoption: Use the tool that matches your team's primary language to maximize productivity.
βœ… Recommendation

Use dbt for Silver β†’ Gold transformations in Fabric Warehouse when your logic is primarily SQL (joins, aggregations, business rules). Use Spark notebooks for Bronze β†’ Silver where you need Python for parsing, complex cleansing, or handling semi-structured data. Many production architectures combine both.

Architecture: dbt in the Medallion Stack

Where dbt fits in a typical Fabric architecture
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SOURCES                                                         β”‚
β”‚  (Databases, SaaS, Files, APIs)                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚  Ingestion: Pipelines, Mirroring, Dataflows
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  BRONZE (Lakehouse)                                              β”‚
β”‚  Raw data β€” Spark notebooks for parsing & cleansing              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚  Spark notebooks (Python/PySpark)
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SILVER (Warehouse)                                              β”‚
β”‚  Cleansed, normalized β€” dbt models for conforming & dedup        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚  dbt models (SQL)
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GOLD (Warehouse)                                                β”‚
β”‚  Business-ready β€” dbt marts, dimensional models, KPIs            β”‚
β”‚  β†’ Power BI Direct Lake / SQL analytics endpoint                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Setup & Configuration

profiles.yml β€” Fabric Warehouse connection
my_fabric_project:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"
      server: "your-workspace.datawarehouse.fabric.microsoft.com"
      database: "your_warehouse"
      schema: "dbt_dev"
      authentication: ServicePrincipal   # Recommended for CI/CD
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
    prod:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"
      server: "your-workspace.datawarehouse.fabric.microsoft.com"
      database: "your_warehouse"
      schema: "dbt_prod"
      authentication: ServicePrincipal
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
⚠️ Important: Warehouse, Not Lakehouse

The dbt-fabric adapter targets the Warehouse (full DML/DDL support). The Lakehouse SQL endpoint is read-only and cannot execute dbt's CREATE/INSERT/MERGE statements. If you need dbt on Lakehouse data, create shortcuts from Lakehouse to Warehouse, or use the dbt-fabricspark adapter (limited, via Livy API).

Best Practices

Project Structure

Cost Optimization

Quality & Testing

CI/CD Integration

Limitations to Know

LimitationImpactWorkaround
Lakehouse SQL endpoint is read-only Cannot use dbt materializations against Lakehouse Use Warehouse as dbt target; reference Lakehouse data via shortcuts or cross-database queries
Adapter maturity Some advanced dbt features may have edge cases Pin adapter version, test upgrades in dev first, contribute issues to GitHub
No Python models in dbt-fabric Python-based dbt models not supported (Warehouse is T-SQL only) Use Spark notebooks for Python transforms; feed results to dbt via Warehouse tables
Snapshot limitations SCD Type 2 snapshots work but may have performance limits on very large tables Partition snapshot queries by date range; consider incremental snapshots
Real-Time

Real-Time Intelligence

Streaming data ingestion and time-series analytics with Real-Time Intelligence.

Overview

Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.

Key Components

Eventhouse

The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.

KQL Database

Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) β€” a powerful language for log and time-series analytics.

Eventstreams

Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.

Real-Time Dashboards

Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.

Streaming Ingestion Patterns

SourceIngestion MethodLatency
Azure Event HubsEventstream β†’ KQL DatabaseSeconds
Azure IoT HubEventstream β†’ KQL DatabaseSeconds
Apache KafkaKafka connector β†’ EventstreamSeconds
Custom appsREST API / SDK ingestionSeconds
Databases (CDC)Fabric Mirroring β†’ OneLakeMinutes

When to Use Real-Time vs. Batch

βœ… Best Practice

Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.

Analytics

Power BI Integration

Maximizing the reporting and analytics experience with Fabric-native Power BI features.

Direct Lake Mode

Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery β€” without copying data into the semantic model.

πŸ’‘ Key Insight

Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.

Connectivity Modes β€” Direct Lake vs. DirectQuery vs. Import

Choosing the right connectivity mode is one of the highest-impact decisions for Power BI performance and cost. Each mode trades off freshness, speed, and resource consumption differently.

CriteriaImportDirectQueryDirect Lake
How it works Copies data into Power BI's in-memory engine (VertiPaq) Sends queries live to the source at report render time Reads Delta/Parquet files directly from OneLake into memory on demand
Data freshness Stale until next scheduled refresh Always real-time (queries source live) Always fresh β€” reads latest committed Delta files
Query performance 🟒 Fastest (pre-compressed in-memory columnar) πŸ”΄ Slowest (depends on source query speed + network) 🟒 Near-Import speed (columnar read from OneLake)
Capacity cost 🟑 High memory usage (entire dataset cached). Refresh jobs consume CUs. 🟒 Low memory (no cache). But high CU per query if source is slow. 🟒 Low-to-moderate memory (loads columns on demand, evicts when idle). No refresh CUs.
Data size limits Bounded by capacity memory (1 GB – 400 GB depending on SKU) Unlimited (data stays at source) Bounded by capacity memory, but with automatic fallback to DirectQuery
Source requirements Any Power Query-supported source Any DirectQuery-compatible source (SQL, Fabric, Dataverse, etc.) Fabric only β€” Delta tables in OneLake (Lakehouse or Warehouse)
Refresh needed? Yes β€” scheduled or on-demand refresh No β€” live queries No β€” reads latest data automatically
DAX calculation support Full (all DAX functions) Limited (some DAX patterns perform poorly or are unsupported) Full (same engine as Import)
RLS enforcement In the semantic model At source + semantic model In the semantic model (respects OneLake security roles)

When to Use Each Mode

ScenarioRecommended ModeWhy
Gold-layer tables in Fabric Lakehouse/Warehouse Direct Lake Cost + Freshness: No refresh schedule to manage, no duplicate storage, Import-class speed. The default choice for Fabric-native BI.
Data in external systems (SQL Server, Dataverse, SAP) DirectQuery Architecture: Data can't move to OneLake (regulatory, real-time, or volume constraints). Accept slower queries for real-time freshness.
Small, highly curated datasets (< 1 GB) refreshed daily Import Performance: Maximum DAX flexibility and fastest queries. Refresh cost is negligible for small data.
Very large tables (100+ GB) with infrequent access Direct Lake (with fallback) Cost: Only loads accessed columns into memory. Falls back to DirectQuery for cold data β€” avoids paying for 100 GB of memory.
Mixed: some Fabric tables + some external sources Composite model (Direct Lake + DirectQuery) Architecture: Combine both in one model β€” Direct Lake for Fabric tables, DirectQuery for external sources that can't be moved.
Near real-time dashboards (< 1 min latency requirement) DirectQuery or Direct Lake Freshness: Both deliver real-time. Choose Direct Lake if data is in OneLake; DirectQuery if it's external.
Legacy reports migrating to Fabric Start with Import β†’ migrate to Direct Lake Risk: Import works identically to before. Once Gold tables are in Lakehouse/Warehouse, switch to Direct Lake for cost savings.
βœ… Default recommendation

If your data is in Fabric, use Direct Lake. It gives you Import-level performance with zero refresh overhead and always-fresh data. Only fall back to Import for edge cases requiring unsupported DAX patterns, or to DirectQuery when data must stay outside OneLake.

⚠️ Direct Lake Fallback Behavior

When a Direct Lake model exceeds the memory guardrails for your capacity SKU, it falls back to DirectQuery mode automatically. This means queries still work but may be slower. Monitor the DirectLakeFallback event in the Capacity Metrics app β€” frequent fallbacks indicate your tables need OPTIMIZE, V-Order, or a larger SKU.

Semantic Model Best Practices

Report Governance

Performance Optimization

TechniqueImpactEffort
V-Order on Delta tables🟒 HighLow
Star schema design🟒 HighMedium
Reduce cardinality of columns🟒 HighLow
Use aggregation tables🟑 MediumMedium
Optimize DAX measures🟑 MediumVaries
Composite models🟑 MediumMedium
Paginated reports for large exports🟑 MediumLow
AI Readiness

Preparing Data for AI

Best practices for preparing your semantic models and data sources for Copilot for Power BI and Fabric Data Agents.

πŸ€– Why This Matters

Without proper data preparation, AI features can struggle to interpret your data correctly β€” leading to generic, inaccurate, or misleading outputs. Investing in data prep helps AI understand your unique business context, prioritize the right information, and deliver responses that are consistent, reliable, and aligned with your goals.

πŸ“– Full AI & Copilot Guide

For a comprehensive overview of all AI capabilities in Fabric β€” Copilot across workloads, Data Agents, AI Skill, and governance β€” see the dedicated AI & Copilot page. This section focuses specifically on preparing your data for optimal AI results.

Copilot for Power BI β€” Prep Data for AI Toolkit

Power BI provides a unified "Prep data for AI" experience (available in both Desktop and the Power BI service) with three key features to optimize your semantic model for Copilot interactions.

πŸ“‹ Prerequisites

Requires a paid Fabric capacity (F2+) or Power BI Premium (P1+) and the Copilot admin setting enabled. Power BI Q&A should be enabled on your semantic model for natural language queries. Copilot is now available on all paid SKUs and in sovereign cloud regions.

1. AI Data Schema β€” Simplify Your Model

Define a focused subset of your model's fields for Copilot to prioritize. By selecting only the most relevant fields, you reduce ambiguity and help Copilot deliver clearer, more accurate responses.

⚠️ Scope Limitation

The AI data schema only applies to Copilot capabilities that use the schema (e.g., answering data questions). It does not affect report page creation, DAX queries, or dataset summaries β€” those still use the full model.

2. AI Instructions β€” Add Business Context

Provide context, business logic, and specific guidance directly on the semantic model. Copilot uses these instructions to better interpret user questions by incorporating organizational language, terminology, and analytical priorities.

πŸ“Š Business Context

  • "Busy season is October to February"
  • "Frame insights with a focus on risk assessment"
  • "When a user mentions ABCD, they mean the total invoice field"
  • "A lower attrition percent is more positive"

πŸ“ Analysis Rules

  • "Always analyze sales on a quarterly basis"
  • "For retail insights, prioritize the customsegmentationtable"
  • "Use the sales_fact table as the primary source for all sales questions"
  • "When a user asks about product sales, always ask for clarification on location"
πŸ’‘ Prompt Engineering Tips for AI Instructions
  • Be explicit and specific β€” Assume Copilot has no understanding of your business
  • Use analogies and examples β€” Concrete field/measure references help the model
  • Avoid ambiguity β€” Spell out which measure or table to use (and which to avoid)
  • Order matters β€” Test different orderings; the sequence of instructions can affect output quality
  • Break down complex instructions β€” Split multi-step logic into simple sequential steps
  • Keep focused β€” Conflicting or overly complex instructions can confuse the LLM. Less is sometimes more.

Good vs. Bad: AI Instructions

❌ Bad β€” Vague or Ambiguousβœ… Good β€” Specific and Actionable
"You're a seasoned BI Analyst who is detail oriented." "You're a BI Analyst for a food distributor. Responses should be detail oriented and focused on revenue and profitability."
"Use the right sales measure." "For product-specific sales, use the measure Total_Sales_Product. Filter on the Product column from the Sales table. Examples of products: Word, PowerPoint, Excel."
"Show active partners count." "For Total Active Partners, use the measure Monthly Active Partner Count_ID. Do NOT filter on the Customers table."
"Analyze sales by time." "Always analyze sales on a quarterly basis. When showcasing revenue, break it down by quarter and compare to the industry field."
"Find top customers." "Define top customers by first looking at the revenue table, then returning only customers with the highest order values. Some partners are also customers β€” don't remove duplicates; indicate whether each is a partner or customer based on the ID value."

Good vs. Bad: Semantic Model Design for AI

❌ Bad β€” AI-Unfriendly Modelβœ… Good β€” AI-Ready Model
Column named col_amt_1 with no description Column named Total Sales Amount with description: "Net revenue after discounts in USD"
Flat table with 80 columns mixing facts and dimensions Star schema with separate fact_sales, dim_product, dim_date tables
All 80 columns exposed to AI data schema AI data schema includes only the 25 most relevant, well-named fields
Staging tables (stg_raw_import) visible in the model Staging/internal tables hidden; only Gold-layer tables exposed
Duplicate columns: Revenue, Revenue_v2, Rev_Final Single canonical Revenue column; deprecated variants removed
No Q&A enabled, no descriptions, no AI instructions Q&A enabled, measures have descriptions, AI instructions explain business context and analysis rules

3. Verified Answers β€” Curated Responses

Verified answers are human-approved, visual responses triggered by predefined phrases. They provide consistent, reliable answers to common business questions and help Copilot learn from effective responses over time.

LimitValue
Verified answers per model250
Trigger phrases per answer15
Character limit per trigger500
Filter permutations per answer10

4. Mark Model "Approved for Copilot"

After preparing your semantic model, mark it as Approved for Copilot in the model's settings in the Power BI service. This removes the friction treatment (confidence warnings) from answers in the standalone Copilot experience. Changes can take up to 24 hours to propagate for models with many reports.

Fabric Data Agents β€” Data Preparation

Fabric Data Agents are configurable AI artifacts that enable conversational Q&A over your organizational data. They use NL-to-SQL, NL-to-DAX, and NL-to-KQL to query data across OneLake, semantic models, and KQL databases.

Selecting Data Sources & Tables

Agent Instructions & Examples

πŸ“ Agent Instructions

Provide custom rules, terminology, and data source routing guidance:

  • Direct financial metrics questions to a Power BI semantic model
  • Route raw data exploration queries to the Lakehouse
  • Send log analysis questions to the KQL database
  • Clarify organizational terminology and abbreviations

πŸ” Example Queries

Provide question–query pairs to guide agent behavior:

  • Natural language question + corresponding SQL/KQL query
  • Helps the agent understand how to interpret similar questions
  • Not currently supported for Power BI semantic model sources
  • Include complex real-world examples (joins, aggregations, CTEs)

Good vs. Bad: Data Agent Setup

❌ Bad β€” Poor Agent Configurationβœ… Good β€” Well-Configured Agent
Add all lakehouse tables (including staging, temp, and system tables) Select only curated Gold-layer tables relevant to the agent's purpose
No agent instructions β€” let the AI figure it out Instructions specify: "Direct financial questions to the finance_model semantic model. Route raw event queries to the Lakehouse. Send log analysis to the KQL database."
No example queries provided 5–10 example question–SQL pairs covering joins, aggregations, and common business questions
Data agent exposes CSV files directly from OneLake CSV/JSON files ingested into lakehouse tables before adding to agent
Published without testing β€” users report wrong answers Tested iteratively with colleagues, instructions refined based on feedback, then published

Governance & Security Layers

Data Agents enforce a strict intent precedence model β€” higher layers always override lower ones:

LayerDescriptionPrecedence
Organizational IntentTenant-wide policies, compliance requirementsπŸ”΄ Highest
Role-based IntentWorkspace governance, permission boundaries🟠 High
Developer IntentCustom instructions, example queries, data source config🟑 Medium
User IntentQuestions and prompts from end users🟒 Lowest

AI Readiness Checklist

A combined checklist for preparing your data for both Copilot for Power BI and Fabric Data Agents:

βœ… Semantic Model Prep (Copilot)

  • Use clean, descriptive column names β€” avoid abbreviations
  • Add descriptions to measures and important columns
  • Build a star schema with clear fact/dimension separation
  • Enable Q&A on the semantic model
  • Set an AI data schema β€” select only relevant fields
  • Write AI instructions with business context and analysis rules
  • Create verified answers for your top 10 business questions
  • Enable Copilot indexing for faster responses
  • Test changes iteratively in the Copilot pane
  • Mark model as "Approved for Copilot"

βœ… Data Agent Prep

  • Choose the right data sources (max 5 per agent)
  • Select only relevant tables β€” don't expose staging/temp tables
  • Ingest file-based data into Lakehouse tables
  • Write agent instructions with routing rules and terminology
  • Provide example query pairs for complex question patterns
  • Configure user permissions (Read on semantic models)
  • Test with colleagues and iterate on instructions
  • Publish and share via Copilot in Power BI
  • Review Purview governance policies for compliance
  • Enable Azure AI Content Safety for production agents
Migration

🧭 Migration Complexity Scorer

Answer 8 questions about your current environment to get a complexity estimate, risk assessment, and phased migration timeline.