Data Engineering Best Practices
Practical guidance for building efficient, maintainable data pipelines in Fabric. This section covers lakehouse vs. warehouse decisions, Spark optimization, notebook patterns, real-time streaming with Eventhouse, and Power BI Direct Lake integration.
Lakehouse vs. Warehouse
Fabric offers two primary storage artifacts. Choose based on your workload:
| Feature | Lakehouse | Warehouse |
|---|---|---|
| Primary language | PySpark, Spark SQL | T-SQL |
| Data format | Delta (open format in OneLake) | Proprietary columnar |
| Schema | Schema-on-read + schema-on-write | Schema-on-write only |
| Best for | Data engineering, ML, unstructured data | BI, structured queries, stored procedures |
| SQL access | Read-only SQL endpoint (auto-generated) | Full T-SQL read/write |
| Multi-engine | Spark + SQL + Power BI | SQL + Power BI |
Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.
Notebook Best Practices
- Parameterize notebooks: Use Fabric's built-in parameters for environment-specific values (file paths, dates, flags)
- Modular design: Break complex logic into reusable functions or separate notebooks using %run
- Error handling: Use try/except blocks and log errors to a monitoring table
- Avoid hardcoded paths: Use relative paths within OneLake or parameterized paths
- Version control: Connect notebooks to Git for versioning and code review
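The parameter-plus-error-handling pattern above can be sketched in plain Python. The helper name `run_step`, the error-record fields, and the `source_path` value are all illustrative, not Fabric APIs; in a real notebook the collected records would be appended to a monitoring Delta table.

```python
from datetime import datetime, timezone

# Illustrative helper: wrap a pipeline step in try/except and collect
# error records that a notebook could later write to a monitoring table.
def run_step(step_name, fn, error_log):
    try:
        fn()
        return True
    except Exception as exc:  # log and continue; re-raise instead if the step is critical
        error_log.append({
            "step": step_name,
            "error": str(exc),
            "logged_at": datetime.now(timezone.utc).isoformat(),
        })
        return False

# In Fabric, `source_path` would come from a notebook parameter cell
# rather than being hardcoded (hypothetical relative OneLake path).
source_path = "Files/bronze/sales"
errors = []
ok = run_step("load_bronze", lambda: 1 / 0, errors)  # deliberately failing step
```

A wrapper like this keeps the happy-path transformation code free of try/except clutter while still producing a uniform error record per failed step.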
Spark Optimization
- V-Order optimization: Enable V-Order (a Fabric-native columnar sort optimization) on write-optimized Delta tables for significantly faster reads from Power BI and SQL
- Optimize file sizes: Target 128 MB to 1 GB files. Use OPTIMIZE (bin compaction) to merge small files
- Partition wisely: Partition by low-cardinality columns used in filters (e.g., year, region). Avoid over-partitioning
- Use Delta caching: SSD caching speeds up repeated reads on the same data
- Right-size sessions: Don't default to large Spark clusters; start small and scale up based on Spark UI metrics
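A minimal sketch of the table-maintenance steps above, built as plain SQL strings so they can be reviewed before execution. The table name and retention value are illustrative; the commented lines assume a Fabric notebook's `spark` session and the commonly documented V-Order session setting.

```python
# Sketch: generate Delta maintenance statements for a table.
# "silver.sales" and the 168-hour retention are illustrative values.
def maintenance_sql(table, vacuum_retention_hours=168):
    return [
        f"OPTIMIZE {table}",  # bin compaction: merge small files toward 128 MB-1 GB
        f"VACUUM {table} RETAIN {vacuum_retention_hours} HOURS",  # remove stale files
    ]

stmts = maintenance_sql("silver.sales")

# In a Fabric notebook you would then run (not executed here):
#   spark.conf.set("spark.sql.parquet.vorder.enabled", "true")  # V-Order writes
#   for s in stmts:
#       spark.sql(s)
```

Generating the statements separately also makes it easy to loop the same maintenance over a list of Silver/Gold tables on a schedule.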
File Format Guidance
- Delta (Parquet + transaction log): the default for all Fabric tables
- Parquet: good for external/read-only data
- CSV: only for Bronze ingestion; convert to Delta as soon as possible
- JSON (nested): flatten before storing in tables
- Excel/XML: convert during the ingestion pipeline
Shortcuts Strategy
Shortcuts are pointers to data in external locations: they appear as folders in your lakehouse but don't copy data. Use them to:
- Access data in other lakehouses or workspaces without duplication
- Connect to external ADLS Gen2, AWS S3, or Google Cloud Storage
- Enable cross-domain data sharing without moving data
- Create a virtual data mesh where each domain manages its data but shares via shortcuts
Real-Time Analytics
Streaming data ingestion and time-series analytics with Real-Time Intelligence.
Overview
Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.
Key Components
Eventhouse
The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.
KQL Database
Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL), a powerful language for log and time-series analytics.
Eventstreams
Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.
Real-Time Dashboards
Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.
Streaming Ingestion Patterns
| Source | Ingestion Method | Latency |
|---|---|---|
| Azure Event Hubs | Eventstream → KQL Database | Seconds |
| Azure IoT Hub | Eventstream → KQL Database | Seconds |
| Apache Kafka | Kafka connector → Eventstream | Seconds |
| Custom apps | REST API / SDK ingestion | Seconds |
| Databases (CDC) | Fabric Mirroring → OneLake | Minutes |
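The "custom apps" row above can be sketched in Python. The event shape and names here are illustrative, and the commented send step assumes the Event Hubs-compatible endpoint that an Eventstream custom-app source exposes (via the azure-eventhub package).

```python
import json
from datetime import datetime, timezone

# Shape a telemetry event for an Eventstream custom endpoint.
# Field names ("deviceId", "temperature") are illustrative.
def make_event(device_id, temperature):
    return {
        "deviceId": device_id,
        "temperature": temperature,
        "enqueuedAt": datetime.now(timezone.utc).isoformat(),
    }

payload = json.dumps(make_event("sensor-042", 21.7))

# Sending (not run here) would typically use azure-eventhub against the
# connection string shown on the Eventstream's custom-app source:
#   from azure.eventhub import EventHubProducerClient, EventData
#   producer = EventHubProducerClient.from_connection_string("<conn-str>")
#   batch = producer.create_batch()
#   batch.add(EventData(payload))
#   producer.send_batch(batch)
```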
When to Use Real-Time vs. Batch
- Real-Time Intelligence: IoT telemetry, application logs, clickstream analytics, fraud detection, operational monitoring
- Batch (Lakehouse/Warehouse): Historical reporting, data warehousing, regulatory reporting, large-scale transformations
- Hybrid: Ingest in real-time to Eventhouse for operational use, then copy to Lakehouse for long-term analytics
Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.
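The hot/cold split above amounts to a simple routing rule by data age; a sketch, with the 30-day threshold and tier names taken from the guidance rather than any Fabric API:

```python
from datetime import date, timedelta

# Route a query or record to hot (Eventhouse) or warm/cold (OneLake)
# storage based on age; 30 days mirrors the guidance above.
def storage_tier(event_date, today, hot_days=30):
    return "eventhouse" if (today - event_date) <= timedelta(days=hot_days) else "onelake"
```

In practice this boundary is enforced by KQL Database retention/caching policies rather than application code, but the decision logic is the same.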
Power BI Integration
Maximizing the reporting and analytics experience with Fabric-native Power BI features.
Direct Lake Mode
Direct Lake is a connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery, without copying data into the semantic model.
- Reads data directly from Delta tables in OneLake
- No data refresh needed: reports always show the latest data
- Performance comparable to Import mode for most scenarios
- Falls back to DirectQuery if data exceeds memory (configurable behavior)
Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.
Semantic Model Best Practices
- Build star schemas in your Gold layer β separate facts and dimensions
- Use surrogate keys (integer) instead of natural keys for relationships
- Minimize calculated columns β push computations to the Lakehouse/Warehouse layer
- Use calculation groups for reusable time intelligence (YTD, MTD, YoY)
- Enable automatic aggregations for large tables to speed up high-level queries
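The surrogate-key practice above can be illustrated with a pure-Python sketch; in practice this would be a Spark or SQL transformation in the Gold layer, and the function and column names here are hypothetical.

```python
# Assign a stable integer surrogate key to each distinct natural key,
# then swap natural keys in the fact rows for those integers.
def build_dimension(natural_keys):
    return {key: i + 1 for i, key in enumerate(sorted(set(natural_keys)))}

def build_fact(rows, dim):
    return [{"product_sk": dim[r["product"]], "amount": r["amount"]} for r in rows]

dim_product = build_dimension(["Excel", "Word", "Excel"])
fact_sales = build_fact(
    [{"product": "Excel", "amount": 120.0}, {"product": "Word", "amount": 80.0}],
    dim_product,
)
```

Integer keys keep relationship columns compact and low-cardinality, which is exactly what the engine's compression and join performance favor.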
Report Governance
- Centralize semantic models: Create shared models that multiple reports connect to (avoid dataset proliferation)
- Use endorsement: Certify production-ready models and promote development models
- Apply RLS: Configure row-level security on semantic models for multi-tenant reports
- Monitor usage: Use the usage metrics to track report adoption
Performance Optimization
| Technique | Impact | Effort |
|---|---|---|
| V-Order on Delta tables | High | Low |
| Star schema design | High | Medium |
| Reduce cardinality of columns | High | Low |
| Use aggregation tables | Medium | Medium |
| Optimize DAX measures | Medium | Varies |
| Composite models | Medium | Medium |
| Paginated reports for large exports | Medium | Low |
Preparing Data for AI
Best practices for preparing your semantic models and data sources for Copilot for Power BI and Fabric Data Agents.
Without proper data preparation, AI features can struggle to interpret your data correctly, leading to generic, inaccurate, or misleading outputs. Investing in data prep helps AI understand your unique business context, prioritize the right information, and deliver responses that are consistent, reliable, and aligned with your goals.
Copilot for Power BI: Prep Data for AI Toolkit
Power BI provides a unified "Prep data for AI" experience (available in both Desktop and the Power BI service) with three key features to optimize your semantic model for Copilot interactions.
Requires a paid Fabric capacity (F2+) or Power BI Premium (P1+), the Copilot admin setting enabled, and Power BI Q&A enabled on your semantic model. Sovereign clouds are not yet supported.
1. AI Data Schema: Simplify Your Model
Define a focused subset of your model's fields for Copilot to prioritize. By selecting only the most relevant fields, you reduce ambiguity and help Copilot deliver clearer, more accurate responses.
- Open Prep data for AI → Simplify data schema tab
- Select fields you want Copilot to reason over; prioritize clean columns with limited ambiguity
- Remove any fields that could be confusing (duplicate names, internal IDs, staging columns)
- Hidden fields in the semantic model are automatically excluded in the initial schema
- Relationships between fields are still respected even if one side is excluded
The AI data schema only applies to Copilot capabilities that use the schema (e.g., answering data questions). It does not affect report page creation, DAX queries, or dataset summaries; those still use the full model.
2. AI Instructions: Add Business Context
Provide context, business logic, and specific guidance directly on the semantic model. Copilot uses these instructions to better interpret user questions by incorporating organizational language, terminology, and analytical priorities.
Business Context
- "Busy season is October to February"
- "Frame insights with a focus on risk assessment"
- "When a user mentions ABCD, they mean the total invoice field"
- "A lower attrition percent is more positive"
Analysis Rules
- "Always analyze sales on a quarterly basis"
- "For retail insights, prioritize the customsegmentation table"
- "Use the sales_fact table as the primary source for all sales questions"
- "When a user asks about product sales, always ask for clarification on location"
- Be explicit and specific: Assume Copilot has no understanding of your business
- Use analogies and examples: Concrete field/measure references help the model
- Avoid ambiguity: Spell out which measure or table to use (and which to avoid)
- Order matters: Test different orderings; the sequence of instructions can affect output quality
- Break down complex instructions: Split multi-step logic into simple sequential steps
- Keep focused: Conflicting or overly complex instructions can confuse the LLM. Less is sometimes more.
Good vs. Bad: AI Instructions
| Bad: Vague or Ambiguous | Good: Specific and Actionable |
|---|---|
| "You're a seasoned BI Analyst who is detail oriented." | "You're a BI Analyst for a food distributor. Responses should be detail oriented and focused on revenue and profitability." |
| "Use the right sales measure." | "For product-specific sales, use the measure Total_Sales_Product. Filter on the Product column from the Sales table. Examples of products: Word, PowerPoint, Excel." |
| "Show active partners count." | "For Total Active Partners, use the measure Monthly Active Partner Count_ID. Do NOT filter on the Customers table." |
| "Analyze sales by time." | "Always analyze sales on a quarterly basis. When showcasing revenue, break it down by quarter and compare to the industry field." |
| "Find top customers." | "Define top customers by first looking at the revenue table, then returning only customers with the highest order values. Some partners are also customers; don't remove duplicates. Indicate whether each is a partner or customer based on the ID value." |
Good vs. Bad: Semantic Model Design for AI
| Bad: AI-Unfriendly Model | Good: AI-Ready Model |
|---|---|
| Column named col_amt_1 with no description | Column named Total Sales Amount with description: "Net revenue after discounts in USD" |
| Flat table with 80 columns mixing facts and dimensions | Star schema with separate fact_sales, dim_product, dim_date tables |
| All 80 columns exposed to the AI data schema | AI data schema includes only the 25 most relevant, well-named fields |
| Staging tables (stg_raw_import) visible in the model | Staging/internal tables hidden; only Gold-layer tables exposed |
| Duplicate columns: Revenue, Revenue_v2, Rev_Final | Single canonical Revenue column; deprecated variants removed |
| No Q&A enabled, no descriptions, no AI instructions | Q&A enabled, measures have descriptions, AI instructions explain business context and analysis rules |
3. Verified Answers: Curated Responses
Verified answers are human-approved, visual responses triggered by predefined phrases. They provide consistent, reliable answers to common business questions and help Copilot learn from effective responses over time.
- Select a visual and choose "Set up a verified answer" from the … menu
- Add 5-7 trigger phrases per answer that reflect how users naturally ask about the data
- Use Copilot-generated suggestions for additional trigger phrases
- Add up to 3 filters that users can adjust via natural language (e.g., "for the Northeast region")
- Copilot uses both exact and semantic matching, so users don't need to match trigger phrases word-for-word
- Stored at the semantic model level, so they work across all reports using that model
| Limit | Value |
|---|---|
| Verified answers per model | 250 |
| Trigger phrases per answer | 15 |
| Character limit per trigger | 500 |
| Filter permutations per answer | 10 |
4. Mark Model "Approved for Copilot"
After preparing your semantic model, mark it as Approved for Copilot in the model's settings in the Power BI service. This removes the friction treatment (confidence warnings) from answers in the standalone Copilot experience. Changes can take up to 24 hours to propagate for models with many reports.
Fabric Data Agents: Data Preparation
Fabric Data Agents are configurable AI artifacts that enable conversational Q&A over your organizational data. They use NL-to-SQL, NL-to-DAX, and NL-to-KQL to query data across OneLake, semantic models, and KQL databases.
Selecting Data Sources & Tables
- A Data Agent supports up to 5 data sources in any combination: Lakehouses, Warehouses, KQL databases, Power BI semantic models, Ontologies, and Microsoft Graph
- For each source, select only the relevant tables to focus the agent on data that matters
- If your data starts as files (CSV, JSON), ingest it into tables first; agents query tables, not individual files
- For Power BI semantic models, users only need Read permission to interact via the agent
Agent Instructions & Examples
Agent Instructions
Provide custom rules, terminology, and data source routing guidance:
- Direct financial metrics questions to a Power BI semantic model
- Route raw data exploration queries to the Lakehouse
- Send log analysis questions to the KQL database
- Clarify organizational terminology and abbreviations
Example Queries
Provide question-query pairs to guide agent behavior:
- Natural language question + corresponding SQL/KQL query
- Helps the agent understand how to interpret similar questions
- Not currently supported for Power BI semantic model sources
- Include complex real-world examples (joins, aggregations, CTEs)
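One way to draft and review these pairs before entering them in the agent UI is as plain data. The table and column names in the SQL below are hypothetical examples, not part of any Fabric schema:

```python
# Illustrative question -> SQL pairs in the spirit of the list above.
# All table/column names (fact_sales, dim_customer, partner_activity...)
# are hypothetical placeholders for your own Gold-layer schema.
example_queries = [
    {
        "question": "What was total revenue by region last quarter?",
        "query": (
            "SELECT d.region, SUM(f.revenue) AS total_revenue "
            "FROM fact_sales f "
            "JOIN dim_customer d ON f.customer_sk = d.customer_sk "
            "WHERE f.order_date >= DATEADD(quarter, -1, GETDATE()) "
            "GROUP BY d.region"
        ),
    },
    {
        "question": "How many active partners did we have per month?",
        "query": (
            "SELECT MONTH(activity_date) AS month, "
            "COUNT(DISTINCT partner_id) AS active_partners "
            "FROM partner_activity GROUP BY MONTH(activity_date)"
        ),
    },
]
```

Keeping the pairs in source control alongside the agent's instructions makes it easier to review, test, and version them as the schema evolves.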
Good vs. Bad: Data Agent Setup
| Bad: Poor Agent Configuration | Good: Well-Configured Agent |
|---|---|
| Add all lakehouse tables (including staging, temp, and system tables) | Select only curated Gold-layer tables relevant to the agent's purpose |
| No agent instructions (let the AI figure it out) | Instructions specify: "Direct financial questions to the finance_model semantic model. Route raw event queries to the Lakehouse. Send log analysis to the KQL database." |
| No example queries provided | 5-10 example question-SQL pairs covering joins, aggregations, and common business questions |
| Data agent exposes CSV files directly from OneLake | CSV/JSON files ingested into lakehouse tables before adding to the agent |
| Published without testing; users report wrong answers | Tested iteratively with colleagues, instructions refined based on feedback, then published |
Governance & Security Layers
Data Agents enforce a strict intent precedence model: higher layers always override lower ones.
| Layer | Description | Precedence |
|---|---|---|
| Organizational Intent | Tenant-wide policies, compliance requirements | Highest |
| Role-based Intent | Workspace governance, permission boundaries | High |
| Developer Intent | Custom instructions, example queries, data source config | Medium |
| User Intent | Questions and prompts from end users | Lowest |
- Agents enforce read-only data access: no writes, no mutations
- All queries run with the requesting user's credentials (least-privilege)
- Microsoft Purview policies apply: DLP, access restrictions, risk discovery, and audit
- Optional Azure AI Content Safety integration for content risk controls
AI Readiness Checklist
A combined checklist for preparing your data for both Copilot for Power BI and Fabric Data Agents:
Semantic Model Prep (Copilot)
- Use clean, descriptive column names; avoid abbreviations
- Add descriptions to measures and important columns
- Build a star schema with clear fact/dimension separation
- Enable Q&A on the semantic model
- Set an AI data schema; select only relevant fields
- Write AI instructions with business context and analysis rules
- Create verified answers for your top 10 business questions
- Enable Copilot indexing for faster responses
- Test changes iteratively in the Copilot pane
- Mark model as "Approved for Copilot"
Data Agent Prep
- Choose the right data sources (max 5 per agent)
- Select only relevant tables; don't expose staging/temp tables
- Ingest file-based data into Lakehouse tables
- Write agent instructions with routing rules and terminology
- Provide example query pairs for complex question patterns
- Configure user permissions (Read on semantic models)
- Test with colleagues and iterate on instructions
- Publish and share via Copilot in Power BI
- Review Purview governance policies for compliance
- Enable Azure AI Content Safety for production agents