Data Engineer BI Developer Analytics Engineer β This section covers storage decisions (Lakehouse vs. Warehouse), notebook & Spark optimization, dbt integration, real-time streaming with Eventhouse, and Power BI Direct Lake integration.
Lakehouse & Storage Strategy
Choose where and how to store data β the foundation that determines cost, performance, and accessibility of everything downstream.
Lakehouse vs. Warehouse
Fabric offers two primary storage artifacts. Choose based on your workload:
| Feature | Lakehouse | Warehouse |
|---|---|---|
| Primary language | PySpark, Spark SQL | T-SQL |
| Data format | Delta (open format in OneLake) | Proprietary columnar |
| Schema | Schema-on-read + schema-on-write | Schema-on-write only |
| Best for | Data engineering, ML, unstructured data | BI, structured queries, stored procedures |
| SQL access | Read-only SQL endpoint (auto-generated) | Full T-SQL read/write |
| Multi-engine | Spark + SQL + Power BI | SQL + Power BI |
Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.
File Format Guidance
β Delta (Parquet + transaction log) β Default for all Fabric tables β Parquet β Good for external/read-only data β οΈ CSV β Only for Bronze ingestion, convert to Delta ASAP β JSON (nested) β Flatten before storing in tables β Excel/XML β Convert during ingestion pipeline
V-Order Optimization
V-Order is a Fabric-native write-time optimization that sorts data within Parquet files for maximum read performance across all Fabric engines (SQL, Spark, Power BI Direct Lake).
- Why (Performance): Queries scan less data thanks to better predicate pushdown and column-chunk elimination β up to 50% faster reads.
- Why (Cost): Faster reads = shorter query sessions = fewer CUs. The write-time overhead (10β15% slower writes) pays for itself many times over for frequently-read tables.
- When to enable: Always for Gold/Silver layer tables. Skip for Bronze/staging tables that are written once and immediately overwritten.
File Size & Partitioning
- Optimize file sizes: Target 128 MB β 1 GB files. Use
OPTIMIZE(bin-compaction) to merge small files. Why (Performance): Too many small files causes excessive metadata overhead and slow reads across all engines. - Partition wisely: Partition by low-cardinality columns used in filters (e.g.,
year,region). Avoid over-partitioning. Why (Cost): Over-partitioning creates thousands of tiny files that hurt read performance and inflate storage transactions. - Use Delta caching: SSD caching for repeated reads on the same data. Why (Performance): Avoids repeated network I/O to OneLake for hot data.
Shortcuts Strategy
Shortcuts are pointers to data in external locations β they appear as folders in your lakehouse but don't copy data. Use them to:
- Access data in other lakehouses or workspaces without duplication
- Connect to external ADLS Gen2, AWS S3, or Google Cloud Storage
- Enable cross-domain data sharing without moving data
- Create a virtual data mesh where each domain manages its data but shares via shortcuts
Notebook Kernel Selection
Choose the right notebook kernel β Python, Spark, or T-SQL β based on compute cost, data scale, Delta Lake requirements, and operational needs.
Fabric notebooks support three kernel types: Python, Spark, and T-SQL. The Spark kernel supports PySpark, SparkSQL, Scala, and SparkR. This guide focuses on the Python vs. Spark decision β the most common choice for data engineering workloads.
Compute configuration, Delta Lake feature requirements, engine maturity, and expected data growth all play important roles. A common misconception is that the Python kernel is always cheaper β in reality, cost depends on how you configure compute for each kernel.
Compute Options Compared
| Configuration | vCores | Startup Time | CUs Consumed |
|---|---|---|---|
| Python kernel (default) | 2 vCores (scalable to 64) | ~5 seconds | 1 CU (minimum) |
| Spark β Starter pool (default) | 8-core worker, autoscale | ~5 seconds | 8 CUs (after proactive scale-up) |
| Spark β Single-node 8-vCore | 8 cores shared (driver + executor) | ~5 seconds | 4 CUs |
| Spark β Single-node 4-vCore (custom pool) | 4 cores shared | 3β5 minutes | 2 CUs |
| Spark β Multi-node (custom pool) | Scales with cluster | 3β5 minutes | Varies |
A single-node 8-vCore Spark session using the starter pool starts in ~5 seconds β comparable to the Python kernel β while giving you full access to Spark-native capabilities including the Native Execution Engine (NEE).
Performance by Data Scale
| Data Scale (compressed) | Engine Advantage |
|---|---|
| < ~140 MB | Python engines (DuckDB, Polars) are faster |
| ~1β2 GB | Python still has an edge, but Spark + NEE becomes competitive |
| ~10β13 GB | Spark + NEE matches or outperforms single-machine engines; Python engines risk OOM |
| 100 GB+ | Spark + NEE is the fastest and most reliable option |
Delta Lake Compatibility
This is a critical differentiator. Fabric Spark has native, full-featured Delta Lake support. Python engines (delta-rs, DuckDB, Polars) have meaningful gaps:
| Delta Feature | Fabric Spark | delta-rs (Python) | DuckDB | Polars |
|---|---|---|---|---|
| Read Delta tables | β | β | β | β |
| Full write support (MERGE, UPDATE, DELETE) | β | β | β οΈ INSERT only | β οΈ No UPDATE/DELETE |
| Deletion vectors (write) | β | β | β | β |
| Column mapping | β | β | β | β |
| Liquid Clustering | β | β | β | β |
| Type widening | β | β | β οΈ Read only | β |
| Change Data Feed (write) | β | β | β | β |
| OPTIMIZE | β | β | β | β |
| VACUUM | β | β | β | β |
| Auto compaction | β | β | β | β |
| V-Order | β | β | β | β |
Deletion vectors are enabled by default in Fabric Spark Runtime 2.0. No Python engine supports writing deletion vectors β if you use a Python engine to write to DV-enabled tables, you'll encounter compatibility errors.
Engine Maturity & Support
- Fabric Spark: Microsoft-maintained fork β full end-to-end support, including NEE (built on Velox + Gluten). You can open support tickets for engine bugs, query plans, and memory issues. Code gets faster with runtime updates without changes.
- Python engines (DuckDB, Polars): Microsoft supports only OneLake integration (auth, file system access). Engine bugs, API changes between versions, and performance issues require engaging OSS communities directly.
- Operational visibility: Spark has the Spark UI + Fabric telemetry for live job monitoring. DuckDB/Polars have no equivalent β when a job hangs, there's no way to inspect what's happening.
Decision Guidance
β Use the Python Kernel When
- Data is small (< ~1 GB compressed) and single-machine speed matters most
- Building lightweight API orchestration, REST/gRPC integrations
- Rapid interactive exploration of small datasets
- Workload requires a specific Python version not in the Spark runtime
- You accept the Delta Lake feature limitations
β Use the Spark Kernel When
- Data is β₯ 1 GB compressed or will grow to that scale
- You need full Delta Lake compatibility (deletion vectors, OPTIMIZE, VACUUM, ACID)
- You need live monitoring and production-grade features (env vars, High Concurrency, FAIR/FIFO scheduling)
- You use Spark-native APIs (MLlib, Spark SQL, Streaming)
- You want Microsoft end-to-end engine support
- You need to scale from single-node to multi-node without rewriting code
Key Differences at a Glance
| Category | Python Kernel | Spark Kernel |
|---|---|---|
| Default compute | 2-vCore single-node (up to 64) | 8-vCore starter pool with autoscale |
| Distributed execution | β No | β Yes |
| Languages | Python | PySpark, SparkSQL, Scala, SparkR |
| Full Delta Lake support | β No | β Yes |
| Live monitoring | Limited | Full (Spark UI + Job monitoring) |
| Microsoft engine support | OneLake integrations only | Full runtime support |
| High Concurrency | β No | β Yes |
| V-Order for Direct Lake | β No | β Yes |
| Scales to multi-node | β No | β Yes |
For workloads β₯ 1 GB compressed, start with a single-node 8-vCore Spark cluster using the starter pool. You get near-instant startup (~5s), full Fabric Spark capabilities including NEE, and the ability to scale out to multi-node when needed β all on a single node like the Python kernel.
Notebook & Spark Optimization
Reduce CU consumption, speed up jobs, and avoid expensive anti-patterns in Fabric notebooks.
Fabric notebooks consume Capacity Units (CUs) based on compute type, session duration, and cluster size. Every decision here directly impacts your monthly Fabric bill.
Python vs. PySpark β Choose the Right Engine
Not every job needs distributed computing. Using PySpark for trivial workloads wastes resources because Spark must start a session (~20β30s), allocate executors, and maintain cluster overhead β all of which burn CUs even if the actual work is minimal.
| Scenario | Recommended Engine | Why |
|---|---|---|
| Data < 1 GB, simple transforms | Python + pandas | No Spark session overhead β runs on notebook VM only. ~70% cheaper for small jobs. |
| Exploratory analysis / prototyping | Python + pandas | Faster iteration; no cluster startup wait. You can switch to Spark later if data grows. |
| Data 1β10 GB with complex joins | PySpark | Distributed processing avoids OOM errors; Spark optimizer handles join strategies. |
| Data > 10 GB or heavy aggregations | PySpark | Only viable option β single-node memory can't handle the volume. |
| ML feature engineering at scale | PySpark + MLlib | Distributed training/feature computation is orders of magnitude faster. |
| Small ML model training (< 1 GB) | Python + scikit-learn | Lower overhead, richer library ecosystem, no serialization cost. |
A PySpark notebook that processes 500 MB of data uses Spark Compute CUs (cluster startup + executor time). The same job in plain Python uses only the notebook VM CUs β typically 2β4Γ cheaper for datasets under 1 GB. Over hundreds of daily pipeline runs, this adds up to significant savings.
Native Execution Engine
The Native Execution Engine (generally available) runs Spark SQL and DataFrame operations on optimized C++/vectorized code instead of the JVM. It delivers up to 4Γ faster performance with zero code changes.
- Why (Cost): Faster execution = shorter session duration = fewer CUs consumed. No infrastructure changes needed β it's a configuration toggle.
- How: Enable at workspace or session level:
spark.conf.set("spark.native.enabled", "true") - Limitation: Not all operations are supported yet (UDFs, some complex types fall back to JVM). Monitor the Spark UI for fallback warnings.
High Concurrency Mode
When you orchestrate multiple notebooks in a pipeline, each notebook normally starts its own Spark session β paying the startup cost every time.
- Why (Cost): Session startup costs 20β30s of CU burn per notebook. With 10 notebooks in a pipeline, that's 3β5 minutes of wasted compute.
- Solution: Enable High Concurrency Mode and assign the same
Session Tagto related notebooks. They share one Spark session, reducing startup to ~5s total. - Savings: Up to 30% reduction in pipeline execution time and CU cost.
Notebooks sharing a session must use the same lakehouse context and Spark compute properties. Plan your pipeline grouping accordingly.
Session Management β Stop Paying for Idle
Spark sessions remain alive (and consuming CUs) even when no code is running. This is the most common source of wasted capacity.
- Set aggressive timeouts: Configure session idle timeout to 5β10 minutes (default is 20 min). Every idle minute is CUs burned with no work done.
- Explicit session stop: In pipelines, add
mssparkutils.session.stop()at notebook end to release resources immediately. - Monitor zombie sessions: Use the Fabric Capacity Metrics app to identify long-running or abandoned sessions draining your capacity.
Common Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
df.collect() on large data |
Infrastructure: Moves all data to driver node β OOM crash risk and network bottleneck. | Process in Spark; only collect() small aggregated results. |
| Python UDFs instead of built-in functions | Performance: Data serializes between JVMβPythonβJVM per row. 10β100Γ slower than native Spark functions. | Use pyspark.sql.functions (vectorized). If UDF needed, use Pandas UDFs (@pandas_udf). |
| Row-by-row loops | Performance: Defeats Spark's parallel execution β processes one record at a time on the driver. | Use vectorized DataFrame operations or withColumn() transformations. |
toPandas() on large DataFrames |
Infrastructure + Cost: Collects entire dataset to driver memory. Same risk as collect(). |
Only convert small/aggregated results. For large data, use pandas API on Spark (pyspark.pandas). |
| Default cluster size for all jobs | Cost: Large clusters for small jobs waste 60β80% of allocated CUs sitting idle. | Right-size: use Small (4 vCores) for light jobs, scale up only when Spark UI shows resource contention. |
| No caching of reused DataFrames | Cost + Performance: Re-reads and re-computes data multiple times, multiplying I/O and CU usage. | Use .cache() or .persist() for DataFrames used more than once. Call .unpersist() when done. |
Notebook Design for Production
- Keep notebooks small and focused: One notebook = one logical unit of work. Monolithic notebooks are harder to debug, can't be parallelized in pipelines, and waste CUs when you re-run everything to fix one step.
- Parameterize everything: File paths, date ranges, environment names β use Fabric notebook parameters so the same notebook works across dev/test/prod without code changes.
- Separate compute-heavy from light steps: Put your heavy PySpark ETL in one notebook (needs large cluster) and your validation/metadata updates in another (runs on tiny VM with Python). This prevents over-provisioning for the light steps.
- Error handling: Use try/catch blocks and log errors to a monitoring table β don't let silent failures cascade through your pipeline.
- Version control: Connect notebooks to Git for versioning and code review. Fabric's Git integration tracks notebook changes alongside pipeline configs.
Data < 1 GB? β Use Python/pandas (no Spark needed)
Data 1β10 GB? β PySpark with Small cluster, enable Native Execution Engine
Data > 10 GB? β PySpark with Medium/Large cluster, partition wisely
Running multiple notebooks in pipeline? β Enable High Concurrency Mode
Table read frequently by Power BI/SQL? β Enable V-Order on write
dbt Integration Best Practices
SQL-first transformation framework with built-in testing, documentation, and lineage for Fabric Warehouse.
Why dbt in Fabric?
dbt (data build tool) is an open-source transformation framework that lets analytics engineers write modular SQL models with built-in testing, documentation, and lineage tracking. It connects to Fabric via the dbt-fabric adapter targeting the Warehouse SQL endpoint.
This isn't either/or β many teams use both. The choice depends on your transformation type, team skills, and existing investments.
dbt vs. Spark Notebooks β When to Use Each
| Criteria | dbt (SQL models) | Spark Notebooks | Why It Matters |
|---|---|---|---|
| Transformation type | SQL-based (SELECT, JOIN, GROUP BY) | Python/PySpark, complex logic, ML | Architecture: dbt excels at declarative SQL transformations; notebooks handle imperative, multi-step logic. |
| Testing & quality | Built-in (unique, not_null, relationships, custom tests) | Manual (assert statements, Great Expectations) | Quality: dbt's test framework catches data issues automatically on every run β no extra setup needed. |
| Lineage & documentation | Auto-generated DAG, column-level lineage, docs site | Manual documentation, no native lineage | Governance: dbt provides instant visibility into data dependencies β critical for impact analysis and compliance. |
| Incremental loads | Built-in materializations (append, merge, microbatch) | Custom logic with Delta merge | Cost: dbt's incremental models process only new/changed data, reducing Warehouse CU consumption. |
| Target engine | Fabric Warehouse (T-SQL) | Fabric Lakehouse (Spark) | Infrastructure: dbt uses the Warehouse compute engine (no Spark cluster needed). Lower cost for SQL workloads. |
| Version control & CI | Git-native, standard PR reviews, CI testing | Git via Fabric integration (notebook JSON) | Quality: dbt models are plain SQL files β easier to diff, review, and test in CI pipelines. |
| Team profile | Analytics engineers, SQL-first teams | Data engineers, Python/Spark teams | Adoption: Use the tool that matches your team's primary language to maximize productivity. |
Use dbt for Silver β Gold transformations in Fabric Warehouse when your logic is primarily SQL (joins, aggregations, business rules). Use Spark notebooks for Bronze β Silver where you need Python for parsing, complex cleansing, or handling semi-structured data. Many production architectures combine both.
Architecture: dbt in the Medallion Stack
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SOURCES β
β (Databases, SaaS, Files, APIs) β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β Ingestion: Pipelines, Mirroring, Dataflows
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BRONZE (Lakehouse) β
β Raw data β Spark notebooks for parsing & cleansing β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β Spark notebooks (Python/PySpark)
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SILVER (Warehouse) β
β Cleansed, normalized β dbt models for conforming & dedup β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β dbt models (SQL)
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β GOLD (Warehouse) β
β Business-ready β dbt marts, dimensional models, KPIs β
β β Power BI Direct Lake / SQL analytics endpoint β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Setup & Configuration
my_fabric_project:
target: dev
outputs:
dev:
type: fabric
driver: "ODBC Driver 18 for SQL Server"
server: "your-workspace.datawarehouse.fabric.microsoft.com"
database: "your_warehouse"
schema: "dbt_dev"
authentication: ServicePrincipal # Recommended for CI/CD
tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
prod:
type: fabric
driver: "ODBC Driver 18 for SQL Server"
server: "your-workspace.datawarehouse.fabric.microsoft.com"
database: "your_warehouse"
schema: "dbt_prod"
authentication: ServicePrincipal
tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
The dbt-fabric adapter targets the Warehouse (full DML/DDL support). The Lakehouse SQL endpoint is read-only and cannot execute dbt's CREATE/INSERT/MERGE statements. If you need dbt on Lakehouse data, create shortcuts from Lakehouse to Warehouse, or use the dbt-fabricspark adapter (limited, via Livy API).
Best Practices
Project Structure
- Organize by layer:
/models/staging/(Bronze refs),/models/intermediate/(Silver),/models/marts/(Gold) β mirrors your medallion architecture. - One model = one transformation: Keep models focused. Don't combine 5 joins and 3 aggregations in a single model β break them into testable steps.
- Use sources & refs: Define
sources.ymlfor raw tables and use{{ ref('model_name') }}between models. This enables automatic lineage and dependency resolution.
Cost Optimization
- Incremental over full refresh: Use
materialized='incremental'with a reliable timestamp column. Full refreshes reprocess all data and burn unnecessary Warehouse CUs. - No Spark overhead: dbt runs on the Warehouse SQL engine β no Spark session startup, no cluster CUs. For SQL-centric transforms, this is inherently cheaper.
- Schema-per-environment: Use different schemas (dbt_dev, dbt_prod) instead of separate warehouses. Avoids duplicating infrastructure costs.
Quality & Testing
- Test everything: Add
uniqueandnot_nulltests to all primary keys. Addrelationshipstests for foreign keys. These run automatically and catch issues before bad data reaches Power BI. - Custom tests for business rules: Write SQL tests like "revenue should never be negative" or "order_date must be β€ ship_date". dbt fails the pipeline if assertions break.
- Freshness checks: Use
source freshnessto alert when upstream data stops arriving β prevents stale dashboards going unnoticed.
CI/CD Integration
- PR validation: Run
dbt build --select state:modified+on pull requests to test only changed models and their downstream dependencies. - Service principal auth: Use Entra service principals (not personal credentials) in CI/CD pipelines for automated, auditable deployments.
- Slim CI: Compare against production manifest (
--defer --state prod-manifest/) to avoid rebuilding unchanged models in CI.
Limitations to Know
| Limitation | Impact | Workaround |
|---|---|---|
| Lakehouse SQL endpoint is read-only | Cannot use dbt materializations against Lakehouse | Use Warehouse as dbt target; reference Lakehouse data via shortcuts or cross-database queries |
| Adapter maturity | Some advanced dbt features may have edge cases | Pin adapter version, test upgrades in dev first, contribute issues to GitHub |
| No Python models in dbt-fabric | Python-based dbt models not supported (Warehouse is T-SQL only) | Use Spark notebooks for Python transforms; feed results to dbt via Warehouse tables |
| Snapshot limitations | SCD Type 2 snapshots work but may have performance limits on very large tables | Partition snapshot queries by date range; consider incremental snapshots |
Real-Time Intelligence
Streaming data ingestion and time-series analytics with Real-Time Intelligence.
Overview
Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.
Key Components
Eventhouse
The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.
KQL Database
Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) β a powerful language for log and time-series analytics.
Eventstreams
Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.
Real-Time Dashboards
Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.
Streaming Ingestion Patterns
| Source | Ingestion Method | Latency |
|---|---|---|
| Azure Event Hubs | Eventstream β KQL Database | Seconds |
| Azure IoT Hub | Eventstream β KQL Database | Seconds |
| Apache Kafka | Kafka connector β Eventstream | Seconds |
| Custom apps | REST API / SDK ingestion | Seconds |
| Databases (CDC) | Fabric Mirroring β OneLake | Minutes |
When to Use Real-Time vs. Batch
- Real-Time Intelligence: IoT telemetry, application logs, clickstream analytics, fraud detection, operational monitoring
- Batch (Lakehouse/Warehouse): Historical reporting, data warehousing, regulatory reporting, large-scale transformations
- Hybrid: Ingest in real-time to Eventhouse for operational use, then copy to Lakehouse for long-term analytics
Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.
Power BI Integration
Maximizing the reporting and analytics experience with Fabric-native Power BI features.
Direct Lake Mode
Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery β without copying data into the semantic model.
- Reads data directly from Delta tables in OneLake
- No data refresh needed β reports always show the latest data
- Performance comparable to Import mode for most scenarios
- Falls back to DirectQuery if data exceeds memory (configurable behavior)
Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.
Connectivity Modes β Direct Lake vs. DirectQuery vs. Import
Choosing the right connectivity mode is one of the highest-impact decisions for Power BI performance and cost. Each mode trades off freshness, speed, and resource consumption differently.
| Criteria | Import | DirectQuery | Direct Lake |
|---|---|---|---|
| How it works | Copies data into Power BI's in-memory engine (VertiPaq) | Sends queries live to the source at report render time | Reads Delta/Parquet files directly from OneLake into memory on demand |
| Data freshness | Stale until next scheduled refresh | Always real-time (queries source live) | Always fresh β reads latest committed Delta files |
| Query performance | π’ Fastest (pre-compressed in-memory columnar) | π΄ Slowest (depends on source query speed + network) | π’ Near-Import speed (columnar read from OneLake) |
| Capacity cost | π‘ High memory usage (entire dataset cached). Refresh jobs consume CUs. | π’ Low memory (no cache). But high CU per query if source is slow. | π’ Low-to-moderate memory (loads columns on demand, evicts when idle). No refresh CUs. |
| Data size limits | Bounded by capacity memory (1 GB β 400 GB depending on SKU) | Unlimited (data stays at source) | Bounded by capacity memory, but with automatic fallback to DirectQuery |
| Source requirements | Any Power Query-supported source | Any DirectQuery-compatible source (SQL, Fabric, Dataverse, etc.) | Fabric only β Delta tables in OneLake (Lakehouse or Warehouse) |
| Refresh needed? | Yes β scheduled or on-demand refresh | No β live queries | No β reads latest data automatically |
| DAX calculation support | Full (all DAX functions) | Limited (some DAX patterns perform poorly or are unsupported) | Full (same engine as Import) |
| RLS enforcement | In the semantic model | At source + semantic model | In the semantic model (respects OneLake security roles) |
When to Use Each Mode
| Scenario | Recommended Mode | Why |
|---|---|---|
| Gold-layer tables in Fabric Lakehouse/Warehouse | Direct Lake | Cost + Freshness: No refresh schedule to manage, no duplicate storage, Import-class speed. The default choice for Fabric-native BI. |
| Data in external systems (SQL Server, Dataverse, SAP) | DirectQuery | Architecture: Data can't move to OneLake (regulatory, real-time, or volume constraints). Accept slower queries for real-time freshness. |
| Small, highly curated datasets (< 1 GB) refreshed daily | Import | Performance: Maximum DAX flexibility and fastest queries. Refresh cost is negligible for small data. |
| Very large tables (100+ GB) with infrequent access | Direct Lake (with fallback) | Cost: Only loads accessed columns into memory. Falls back to DirectQuery for cold data β avoids paying for 100 GB of memory. |
| Mixed: some Fabric tables + some external sources | Composite model (Direct Lake + DirectQuery) | Architecture: Combine both in one model β Direct Lake for Fabric tables, DirectQuery for external sources that can't be moved. |
| Near real-time dashboards (< 1 min latency requirement) | DirectQuery or Direct Lake | Freshness: Both deliver real-time. Choose Direct Lake if data is in OneLake; DirectQuery if it's external. |
| Legacy reports migrating to Fabric | Start with Import β migrate to Direct Lake | Risk: Import works identically to before. Once Gold tables are in Lakehouse/Warehouse, switch to Direct Lake for cost savings. |
If your data is in Fabric, use Direct Lake. It gives you Import-level performance with zero refresh overhead and always-fresh data. Only fall back to Import for edge cases requiring unsupported DAX patterns, or to DirectQuery when data must stay outside OneLake.
When a Direct Lake model exceeds the memory guardrails for your capacity SKU, it falls back to DirectQuery mode automatically. This means queries still work but may be slower. Monitor the DirectLakeFallback event in the Capacity Metrics app β frequent fallbacks indicate your tables need OPTIMIZE, V-Order, or a larger SKU.
Semantic Model Best Practices
- Build star schemas in your Gold layer β separate facts and dimensions
- Use surrogate keys (integer) instead of natural keys for relationships
- Minimize calculated columns β push computations to the Lakehouse/Warehouse layer
- Use calculation groups for reusable time intelligence (YTD, MTD, YoY)
- Enable automatic aggregations for large tables to speed up high-level queries
Report Governance
- Centralize semantic models: Create shared models that multiple reports connect to (avoid dataset proliferation)
- Use endorsement: Certify production-ready models and promote development models
- Apply RLS: Configure row-level security on semantic models for multi-tenant reports
- Monitor usage: Use the usage metrics to track report adoption
Performance Optimization
| Technique | Impact | Effort |
|---|---|---|
| V-Order on Delta tables | π’ High | Low |
| Star schema design | π’ High | Medium |
| Reduce cardinality of columns | π’ High | Low |
| Use aggregation tables | π‘ Medium | Medium |
| Optimize DAX measures | π‘ Medium | Varies |
| Composite models | π‘ Medium | Medium |
| Paginated reports for large exports | π‘ Medium | Low |
Preparing Data for AI
Best practices for preparing your semantic models and data sources for Copilot for Power BI and Fabric Data Agents.
Without proper data preparation, AI features can struggle to interpret your data correctly β leading to generic, inaccurate, or misleading outputs. Investing in data prep helps AI understand your unique business context, prioritize the right information, and deliver responses that are consistent, reliable, and aligned with your goals.
For a comprehensive overview of all AI capabilities in Fabric β Copilot across workloads, Data Agents, AI Skill, and governance β see the dedicated AI & Copilot page. This section focuses specifically on preparing your data for optimal AI results.
Copilot for Power BI β Prep Data for AI Toolkit
Power BI provides a unified "Prep data for AI" experience (available in both Desktop and the Power BI service) with three key features to optimize your semantic model for Copilot interactions.
Requires a paid Fabric capacity (F2+) or Power BI Premium (P1+) and the Copilot admin setting enabled. Power BI Q&A should be enabled on your semantic model for natural language queries. Copilot is now available on all paid SKUs and in sovereign cloud regions.
1. AI Data Schema β Simplify Your Model
Define a focused subset of your model's fields for Copilot to prioritize. By selecting only the most relevant fields, you reduce ambiguity and help Copilot deliver clearer, more accurate responses.
- Open Prep data for AI β Simplify data schema tab
- Select fields you want Copilot to reason over β prioritize clean columns with limited ambiguity
- Remove any fields that could be confusing (duplicate names, internal IDs, staging columns)
- Hidden fields in the semantic model are automatically excluded in the initial schema
- Relationships between fields are still respected even if one side is excluded
The AI data schema only applies to Copilot capabilities that use the schema (e.g., answering data questions). It does not affect report page creation, DAX queries, or dataset summaries β those still use the full model.
2. AI Instructions β Add Business Context
Provide context, business logic, and specific guidance directly on the semantic model. Copilot uses these instructions to better interpret user questions by incorporating organizational language, terminology, and analytical priorities.
π Business Context
- "Busy season is October to February"
- "Frame insights with a focus on risk assessment"
- "When a user mentions ABCD, they mean the total invoice field"
- "A lower attrition percent is more positive"
π Analysis Rules
- "Always analyze sales on a quarterly basis"
- "For retail insights, prioritize the
customsegmentationtable" - "Use the
sales_facttable as the primary source for all sales questions" - "When a user asks about product sales, always ask for clarification on location"
- Be explicit and specific β Assume Copilot has no understanding of your business
- Use analogies and examples β Concrete field/measure references help the model
- Avoid ambiguity β Spell out which measure or table to use (and which to avoid)
- Order matters β Test different orderings; the sequence of instructions can affect output quality
- Break down complex instructions β Split multi-step logic into simple sequential steps
- Keep focused β Conflicting or overly complex instructions can confuse the LLM. Less is sometimes more.
Good vs. Bad: AI Instructions
| β Bad β Vague or Ambiguous | β Good β Specific and Actionable |
|---|---|
| "You're a seasoned BI Analyst who is detail oriented." | "You're a BI Analyst for a food distributor. Responses should be detail oriented and focused on revenue and profitability." |
| "Use the right sales measure." | "For product-specific sales, use the measure Total_Sales_Product. Filter on the Product column from the Sales table. Examples of products: Word, PowerPoint, Excel." |
| "Show active partners count." | "For Total Active Partners, use the measure Monthly Active Partner Count_ID. Do NOT filter on the Customers table." |
| "Analyze sales by time." | "Always analyze sales on a quarterly basis. When showcasing revenue, break it down by quarter and compare to the industry field." |
| "Find top customers." | "Define top customers by first looking at the revenue table, then returning only customers with the highest order values. Some partners are also customers β don't remove duplicates; indicate whether each is a partner or customer based on the ID value." |
Good vs. Bad: Semantic Model Design for AI
| β Bad β AI-Unfriendly Model | β Good β AI-Ready Model |
|---|---|
Column named col_amt_1 with no description |
Column named Total Sales Amount with description: "Net revenue after discounts in USD" |
| Flat table with 80 columns mixing facts and dimensions | Star schema with separate fact_sales, dim_product, dim_date tables |
| All 80 columns exposed to AI data schema | AI data schema includes only the 25 most relevant, well-named fields |
Staging tables (stg_raw_import) visible in the model |
Staging/internal tables hidden; only Gold-layer tables exposed |
Duplicate columns: Revenue, Revenue_v2, Rev_Final |
Single canonical Revenue column; deprecated variants removed |
| No Q&A enabled, no descriptions, no AI instructions | Q&A enabled, measures have descriptions, AI instructions explain business context and analysis rules |
3. Verified Answers β Curated Responses
Verified answers are human-approved, visual responses triggered by predefined phrases. They provide consistent, reliable answers to common business questions and help Copilot learn from effective responses over time.
- Select a visual β "Set up a verified answer" from the
β¦menu - Add 5β7 trigger phrases per answer that reflect how users naturally ask about the data
- Use Copilot-generated suggestions for additional trigger phrases
- Add up to 3 filters that users can adjust via natural language (e.g., "for the Northeast region")
- Copilot uses both exact and semantic matching β users don't need to match trigger phrases word-for-word
- Stored at the semantic model level β works across all reports using that model
| Limit | Value |
|---|---|
| Verified answers per model | 250 |
| Trigger phrases per answer | 15 |
| Character limit per trigger | 500 |
| Filter permutations per answer | 10 |
4. Mark Model "Approved for Copilot"
After preparing your semantic model, mark it as Approved for Copilot in the model's settings in the Power BI service. This removes the friction treatment (confidence warnings) from answers in the standalone Copilot experience. Changes can take up to 24 hours to propagate for models with many reports.
Fabric Data Agents β Data Preparation
Fabric Data Agents are configurable AI artifacts that enable conversational Q&A over your organizational data. They use NL-to-SQL, NL-to-DAX, and NL-to-KQL to query data across OneLake, semantic models, and KQL databases.
Selecting Data Sources & Tables
- A Data Agent supports up to 5 data sources in any combination: Lakehouses, Warehouses, KQL databases, Power BI semantic models, Ontologies, and Microsoft Graph
- For each source, select only the relevant tables β focus the agent on data that matters
- If your data starts as files (CSV, JSON), ingest into tables first β agents query tables, not individual files
- For Power BI semantic models, users only need Read permission to interact via the agent
Agent Instructions & Examples
π Agent Instructions
Provide custom rules, terminology, and data source routing guidance:
- Direct financial metrics questions to a Power BI semantic model
- Route raw data exploration queries to the Lakehouse
- Send log analysis questions to the KQL database
- Clarify organizational terminology and abbreviations
π Example Queries
Provide questionβquery pairs to guide agent behavior:
- Natural language question + corresponding SQL/KQL query
- Helps the agent understand how to interpret similar questions
- Not currently supported for Power BI semantic model sources
- Include complex real-world examples (joins, aggregations, CTEs)
Good vs. Bad: Data Agent Setup
| β Bad β Poor Agent Configuration | β Good β Well-Configured Agent |
|---|---|
| Add all lakehouse tables (including staging, temp, and system tables) | Select only curated Gold-layer tables relevant to the agent's purpose |
| No agent instructions β let the AI figure it out | Instructions specify: "Direct financial questions to the finance_model semantic model. Route raw event queries to the Lakehouse. Send log analysis to the KQL database." |
| No example queries provided | 5β10 example questionβSQL pairs covering joins, aggregations, and common business questions |
| Data agent exposes CSV files directly from OneLake | CSV/JSON files ingested into lakehouse tables before adding to agent |
| Published without testing β users report wrong answers | Tested iteratively with colleagues, instructions refined based on feedback, then published |
Governance & Security Layers
Data Agents enforce a strict intent precedence model β higher layers always override lower ones:
| Layer | Description | Precedence |
|---|---|---|
| Organizational Intent | Tenant-wide policies, compliance requirements | π΄ Highest |
| Role-based Intent | Workspace governance, permission boundaries | π High |
| Developer Intent | Custom instructions, example queries, data source config | π‘ Medium |
| User Intent | Questions and prompts from end users | π’ Lowest |
- Agents enforce read-only data access β no writes, no mutations
- All queries run with the requesting user's credentials (least-privilege)
- Microsoft Purview policies apply: DLP, access restrictions, risk discovery, and audit
- Optional Azure AI Content Safety integration for content risk controls
AI Readiness Checklist
A combined checklist for preparing your data for both Copilot for Power BI and Fabric Data Agents:
β Semantic Model Prep (Copilot)
- Use clean, descriptive column names β avoid abbreviations
- Add descriptions to measures and important columns
- Build a star schema with clear fact/dimension separation
- Enable Q&A on the semantic model
- Set an AI data schema β select only relevant fields
- Write AI instructions with business context and analysis rules
- Create verified answers for your top 10 business questions
- Enable Copilot indexing for faster responses
- Test changes iteratively in the Copilot pane
- Mark model as "Approved for Copilot"
β Data Agent Prep
- Choose the right data sources (max 5 per agent)
- Select only relevant tables β don't expose staging/temp tables
- Ingest file-based data into Lakehouse tables
- Write agent instructions with routing rules and terminology
- Provide example query pairs for complex question patterns
- Configure user permissions (Read on semantic models)
- Test with colleagues and iterate on instructions
- Publish and share via Copilot in Power BI
- Review Purview governance policies for compliance
- Enable Azure AI Content Safety for production agents
π§ Migration Complexity Scorer
Answer 8 questions about your current environment to get a complexity estimate, risk assessment, and phased migration timeline.