👀 Who is this for?

Data Engineer · BI Developer · Analytics Engineer — This section covers lakehouse vs. warehouse decisions, Spark optimization, notebook patterns, real-time streaming with Eventhouse, and Power BI Direct Lake integration.

Data Engineering Best Practices

Practical guidance for building efficient, maintainable data pipelines in Fabric.

Lakehouse vs. Warehouse

Fabric offers two primary storage artifacts. Choose based on your workload:

| Feature | Lakehouse | Warehouse |
|---|---|---|
| Primary language | PySpark, Spark SQL | T-SQL |
| Data format | Delta (open format in OneLake) | Proprietary columnar |
| Schema | Schema-on-read + schema-on-write | Schema-on-write only |
| Best for | Data engineering, ML, unstructured data | BI, structured queries, stored procedures |
| SQL access | Read-only SQL endpoint (auto-generated) | Full T-SQL read/write |
| Multi-engine | Spark + SQL + Power BI | SQL + Power BI |
✅ Recommendation

Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.
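As a sketch, the recommendation can be written down as a tiny decision helper; the function name and arguments are hypothetical, purely to make the rule concrete:

```python
# Hypothetical helper encoding the recommendation above; not a Fabric API.
def storage_for_layer(layer: str, spark_first: bool, consumers_prefer_tsql: bool) -> str:
    """Pick 'lakehouse' or 'warehouse' for a medallion layer."""
    if layer in ("bronze", "silver"):
        return "lakehouse"          # ingestion/transformation layers
    if layer == "gold":
        if spark_first:
            return "lakehouse"      # Spark-first teams keep all layers in Lakehouse
        return "warehouse" if consumers_prefer_tsql else "lakehouse"
    raise ValueError(f"unknown layer: {layer!r}")
```

For example, a T-SQL-oriented BI team gets "warehouse" back for its Gold layer, while a Spark-first team keeps every layer in the Lakehouse.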

Notebook Best Practices

Spark Optimization

File Format Guidance

Recommended formats in Fabric
✅ Delta (Parquet + transaction log) — Default for all Fabric tables
✅ Parquet                           — Good for external/read-only data
⚠️ CSV                              — Only for Bronze ingestion, convert to Delta ASAP
❌ JSON (nested)                     — Flatten before storing in tables
❌ Excel/XML                         — Convert during ingestion pipeline
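To make the guidance above actionable, an ingestion pipeline might branch on file extension; the mapping and function below are hypothetical illustrations, not Fabric APIs:

```python
from pathlib import Path

# Hypothetical mapping of file extension to the handling recommended above.
FORMAT_ACTIONS = {
    ".parquet": "keep",               # fine for external/read-only data
    ".csv": "convert_to_delta",       # Bronze only; convert ASAP
    ".json": "flatten_then_convert",  # flatten nested JSON first
    ".xlsx": "convert_on_ingest",     # convert during the ingestion pipeline
    ".xml": "convert_on_ingest",
}

def ingestion_action(filename: str) -> str:
    """Return the recommended handling for a file landed in Bronze."""
    return FORMAT_ACTIONS.get(Path(filename).suffix.lower(), "review_manually")
```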

Shortcuts Strategy

Shortcuts are pointers to data in external locations — they appear as folders in your lakehouse but don't copy data. Use them to reference data in other OneLake workspaces, ADLS Gen2, or Amazon S3 without duplicating it.

Real-Time Analytics

Streaming data ingestion and time-series analytics with Real-Time Intelligence.

Overview

Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.

Key Components

Eventhouse

The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.

KQL Database

Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) — a powerful language for log and time-series analytics.

Eventstreams

Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.

Real-Time Dashboards

Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.

Streaming Ingestion Patterns

| Source | Ingestion Method | Latency |
|---|---|---|
| Azure Event Hubs | Eventstream → KQL Database | Seconds |
| Azure IoT Hub | Eventstream → KQL Database | Seconds |
| Apache Kafka | Kafka connector → Eventstream | Seconds |
| Custom apps | REST API / SDK ingestion | Seconds |
| Databases (CDC) | Fabric Mirroring → OneLake | Minutes |

When to Use Real-Time vs. Batch

✅ Best Practice

Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.
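Retention is set with a KQL control command. The sketch below only assembles the command string; the `.alter-merge table ... policy retention` syntax is standard Kusto, while the helper and table name are hypothetical:

```python
def retention_command(table: str, hot_days: int = 30) -> str:
    """Build the KQL control command that soft-deletes rows older than
    hot_days in an Eventhouse table (run it against the KQL Database)."""
    return (
        f".alter-merge table {table} policy retention "
        f"softdelete = {hot_days}d recoverability = enabled"
    )

# Keep the last 30 days hot, per the practice above:
cmd = retention_command("DeviceTelemetry", hot_days=30)
```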

Power BI Integration

Maximizing the reporting and analytics experience with Fabric-native Power BI features.

Direct Lake Mode

Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery — without copying data into the semantic model.

💡 Key Insight

Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.
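In a notebook this maintenance is typically a couple of Spark SQL statements run on a schedule. The helper below just assembles them; `OPTIMIZE ... VORDER` is Fabric's Delta maintenance syntax, while the function and table name are hypothetical:

```python
def gold_maintenance_sql(table: str) -> list[str]:
    """SQL statements to compact small files and apply V-Order on a Gold
    table; in a Fabric notebook each would be passed to spark.sql(...)."""
    return [
        f"OPTIMIZE {table} VORDER",  # rewrite small files with V-Order applied
        f"VACUUM {table}",           # remove files no longer referenced by Delta
    ]

statements = gold_maintenance_sql("gold_sales")
```

Running these regularly keeps file counts low, which is exactly what Direct Lake needs.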

Semantic Model Best Practices

Report Governance

Performance Optimization

| Technique | Impact | Effort |
|---|---|---|
| V-Order on Delta tables | 🟢 High | Low |
| Star schema design | 🟢 High | Medium |
| Reduce cardinality of columns | 🟢 High | Low |
| Use aggregation tables | 🟡 Medium | Medium |
| Optimize DAX measures | 🟡 Medium | Varies |
| Composite models | 🟡 Medium | Medium |
| Paginated reports for large exports | 🟡 Medium | Low |

Preparing Data for AI

Best practices for preparing your semantic models and data sources for Copilot for Power BI and Fabric Data Agents.

🤖 Why This Matters

Without proper data preparation, AI features can struggle to interpret your data correctly — leading to generic, inaccurate, or misleading outputs. Investing in data prep helps AI understand your unique business context, prioritize the right information, and deliver responses that are consistent, reliable, and aligned with your goals.

Copilot for Power BI — Prep Data for AI Toolkit

Power BI provides a unified "Prep data for AI" experience (available in both Desktop and the Power BI service) with three key features to optimize your semantic model for Copilot interactions.

📋 Prerequisites

Requires a paid Fabric capacity (F2+) or Power BI Premium (P1+), the Copilot admin setting enabled, and Power BI Q&A enabled on your semantic model. Sovereign clouds are not yet supported.

1. AI Data Schema — Simplify Your Model

Define a focused subset of your model's fields for Copilot to prioritize. By selecting only the most relevant fields, you reduce ambiguity and help Copilot deliver clearer, more accurate responses.

⚠️ Scope Limitation

The AI data schema only applies to Copilot capabilities that use the schema (e.g., answering data questions). It does not affect report page creation, DAX queries, or dataset summaries β€” those still use the full model.

2. AI Instructions — Add Business Context

Provide context, business logic, and specific guidance directly on the semantic model. Copilot uses these instructions to better interpret user questions by incorporating organizational language, terminology, and analytical priorities.

📊 Business Context

  • "Busy season is October to February"
  • "Frame insights with a focus on risk assessment"
  • "When a user mentions ABCD, they mean the total invoice field"
  • "A lower attrition percent is more positive"

πŸ“ Analysis Rules

  • "Always analyze sales on a quarterly basis"
  • "For retail insights, prioritize the customsegmentationtable"
  • "Use the sales_fact table as the primary source for all sales questions"
  • "When a user asks about product sales, always ask for clarification on location"
πŸ’‘ Prompt Engineering Tips for AI Instructions
  • Be explicit and specific β€” Assume Copilot has no understanding of your business
  • Use analogies and examples β€” Concrete field/measure references help the model
  • Avoid ambiguity β€” Spell out which measure or table to use (and which to avoid)
  • Order matters β€” Test different orderings; the sequence of instructions can affect output quality
  • Break down complex instructions β€” Split multi-step logic into simple sequential steps
  • Keep focused β€” Conflicting or overly complex instructions can confuse the LLM. Less is sometimes more.

Good vs. Bad: AI Instructions

| ❌ Bad — Vague or Ambiguous | ✅ Good — Specific and Actionable |
|---|---|
| "You're a seasoned BI Analyst who is detail oriented." | "You're a BI Analyst for a food distributor. Responses should be detail oriented and focused on revenue and profitability." |
| "Use the right sales measure." | "For product-specific sales, use the measure Total_Sales_Product. Filter on the Product column from the Sales table. Examples of products: Word, PowerPoint, Excel." |
| "Show active partners count." | "For Total Active Partners, use the measure Monthly Active Partner Count_ID. Do NOT filter on the Customers table." |
| "Analyze sales by time." | "Always analyze sales on a quarterly basis. When showcasing revenue, break it down by quarter and compare to the industry field." |
| "Find top customers." | "Define top customers by first looking at the revenue table, then returning only customers with the highest order values. Some partners are also customers — don't remove duplicates; indicate whether each is a partner or customer based on the ID value." |

Good vs. Bad: Semantic Model Design for AI

| ❌ Bad — AI-Unfriendly Model | ✅ Good — AI-Ready Model |
|---|---|
| Column named col_amt_1 with no description | Column named Total Sales Amount with description: "Net revenue after discounts in USD" |
| Flat table with 80 columns mixing facts and dimensions | Star schema with separate fact_sales, dim_product, dim_date tables |
| All 80 columns exposed to AI data schema | AI data schema includes only the 25 most relevant, well-named fields |
| Staging tables (stg_raw_import) visible in the model | Staging/internal tables hidden; only Gold-layer tables exposed |
| Duplicate columns: Revenue, Revenue_v2, Rev_Final | Single canonical Revenue column; deprecated variants removed |
| No Q&A enabled, no descriptions, no AI instructions | Q&A enabled, measures have descriptions, AI instructions explain business context and analysis rules |

3. Verified Answers — Curated Responses

Verified answers are human-approved, visual responses triggered by predefined phrases. They provide consistent, reliable answers to common business questions and help Copilot learn from effective responses over time.

| Limit | Value |
|---|---|
| Verified answers per model | 250 |
| Trigger phrases per answer | 15 |
| Character limit per trigger | 500 |
| Filter permutations per answer | 10 |
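When scripting or bulk-loading verified answers, it is worth checking the limits above before publishing; the validator below is a hypothetical sketch, not a Power BI API:

```python
# Per-answer limits from the table above.
MAX_TRIGGERS_PER_ANSWER = 15
MAX_CHARS_PER_TRIGGER = 500
MAX_FILTER_PERMUTATIONS = 10

def verified_answer_problems(triggers: list[str], filter_permutations: int) -> list[str]:
    """Return limit violations for one verified answer (empty list = OK)."""
    problems = []
    if len(triggers) > MAX_TRIGGERS_PER_ANSWER:
        problems.append(f"{len(triggers)} trigger phrases (max {MAX_TRIGGERS_PER_ANSWER})")
    problems += [
        f"trigger over {MAX_CHARS_PER_TRIGGER} chars: {t[:40]}..."
        for t in triggers if len(t) > MAX_CHARS_PER_TRIGGER
    ]
    if filter_permutations > MAX_FILTER_PERMUTATIONS:
        problems.append(f"{filter_permutations} filter permutations (max {MAX_FILTER_PERMUTATIONS})")
    return problems
```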

4. Mark Model "Approved for Copilot"

After preparing your semantic model, mark it as Approved for Copilot in the model's settings in the Power BI service. This removes the friction treatment (confidence warnings) from answers in the standalone Copilot experience. Changes can take up to 24 hours to propagate for models with many reports.

Fabric Data Agents — Data Preparation

Fabric Data Agents are configurable AI artifacts that enable conversational Q&A over your organizational data. They use NL-to-SQL, NL-to-DAX, and NL-to-KQL to query data across OneLake, semantic models, and KQL databases.

Selecting Data Sources & Tables

Agent Instructions & Examples

πŸ“ Agent Instructions

Provide custom rules, terminology, and data source routing guidance:

  • Direct financial metrics questions to a Power BI semantic model
  • Route raw data exploration queries to the Lakehouse
  • Send log analysis questions to the KQL database
  • Clarify organizational terminology and abbreviations

🔍 Example Queries

Provide question–query pairs to guide agent behavior:

  • Natural language question + corresponding SQL/KQL query
  • Helps the agent understand how to interpret similar questions
  • Not currently supported for Power BI semantic model sources
  • Include complex real-world examples (joins, aggregations, CTEs)

Good vs. Bad: Data Agent Setup

| ❌ Bad — Poor Agent Configuration | ✅ Good — Well-Configured Agent |
|---|---|
| Add all lakehouse tables (including staging, temp, and system tables) | Select only curated Gold-layer tables relevant to the agent's purpose |
| No agent instructions — let the AI figure it out | Instructions specify: "Direct financial questions to the finance_model semantic model. Route raw event queries to the Lakehouse. Send log analysis to the KQL database." |
| No example queries provided | 5–10 example question–SQL pairs covering joins, aggregations, and common business questions |
| Data agent exposes CSV files directly from OneLake | CSV/JSON files ingested into lakehouse tables before adding to agent |
| Published without testing — users report wrong answers | Tested iteratively with colleagues, instructions refined based on feedback, then published |

Governance & Security Layers

Data Agents enforce a strict intent precedence model — higher layers always override lower ones:

| Layer | Description | Precedence |
|---|---|---|
| Organizational Intent | Tenant-wide policies, compliance requirements | 🔴 Highest |
| Role-based Intent | Workspace governance, permission boundaries | 🟠 High |
| Developer Intent | Custom instructions, example queries, data source config | 🟡 Medium |
| User Intent | Questions and prompts from end users | 🟢 Lowest |

AI Readiness Checklist

A combined checklist for preparing your data for both Copilot for Power BI and Fabric Data Agents:

✅ Semantic Model Prep (Copilot)

  • Use clean, descriptive column names — avoid abbreviations
  • Add descriptions to measures and important columns
  • Build a star schema with clear fact/dimension separation
  • Enable Q&A on the semantic model
  • Set an AI data schema — select only relevant fields
  • Write AI instructions with business context and analysis rules
  • Create verified answers for your top 10 business questions
  • Enable Copilot indexing for faster responses
  • Test changes iteratively in the Copilot pane
  • Mark model as "Approved for Copilot"

✅ Data Agent Prep

  • Choose the right data sources (max 5 per agent)
  • Select only relevant tables — don't expose staging/temp tables
  • Ingest file-based data into Lakehouse tables
  • Write agent instructions with routing rules and terminology
  • Provide example query pairs for complex question patterns
  • Configure user permissions (Read on semantic models)
  • Test with colleagues and iterate on instructions
  • Publish and share via Copilot in Power BI
  • Review Purview governance policies for compliance
  • Enable Azure AI Content Safety for production agents