👤 Who is this for?

Data Engineers, BI Developers, and Analytics Engineers — This section covers lakehouse vs. warehouse decisions, Spark optimization, notebook patterns, real-time streaming with Eventhouse, and Power BI Direct Lake integration.

Section 09

Data Engineering Best Practices

Practical guidance for building efficient, maintainable data pipelines in Fabric.

Lakehouse vs. Warehouse

Fabric offers two primary storage artifacts. Choose based on your workload:

| Feature | Lakehouse | Warehouse |
|---|---|---|
| Primary language | PySpark, Spark SQL | T-SQL |
| Data format | Delta (open format in OneLake) | Proprietary columnar |
| Schema | Schema-on-read + schema-on-write | Schema-on-write only |
| Best for | Data engineering, ML, unstructured data | BI, structured queries, stored procedures |
| SQL access | Read-only SQL endpoint (auto-generated) | Full T-SQL read/write |
| Multi-engine | Spark + SQL + Power BI | SQL + Power BI |
✅ Recommendation

Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.
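For a Spark-first team, the medallion flow above can be sketched in PySpark. This is a minimal, hypothetical example: the paths, table names, and columns (`order_id`, `order_date`, `amount`) are illustrative, and it assumes a Fabric notebook where a Delta-enabled `spark` session is already provided.

```python
from pyspark.sql import functions as F

# Bronze: land raw files as-is (path and table names are illustrative).
raw = spark.read.option("header", "true").csv("Files/landing/sales/")
raw.write.format("delta").mode("append").saveAsTable("bronze_sales")

# Silver: deduplicate, conform types, and drop unusable rows.
silver = (
    spark.read.table("bronze_sales")
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_sales")

# Gold: aggregate for consumption -- or load this layer into a Warehouse
# instead if your consumers prefer T-SQL.
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_sales")
```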

Notebook Best Practices

Spark Optimization

File Format Guidance

Recommended formats in Fabric
✅ Delta (Parquet + transaction log)  — Default for all Fabric tables
✅ Parquet                             — Good for external/read-only data
⚠️  CSV                                — Only for Bronze ingestion, convert to Delta ASAP
❌ JSON (nested)                       — Flatten before storing in tables
❌ Excel/XML                           — Convert during ingestion pipeline
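Flattening nested JSON before it lands in a table can be done with a short recursive pass. The sketch below is pure Python for illustration (the separator and key-naming scheme are arbitrary choices); in Spark you would typically expand struct columns instead.

```python
def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts into a single-level dict.

    Nested keys are joined with `sep`; lists are left as-is (explode them
    in Spark if they need their own rows).
    """
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

event = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Oslo"}}}
print(flatten(event))
# {'id': 7, 'customer_name': 'Ada', 'customer_address_city': 'Oslo'}
```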

Shortcuts Strategy

Shortcuts are pointers to data in external locations — they appear as folders in your lakehouse but don't copy data. Use them to:

- Reference tables in another lakehouse or warehouse without duplicating the data
- Virtualize external storage such as ADLS Gen2, Amazon S3, or Dataverse into OneLake
- Share curated data across workspaces and domains without building copy pipelines

Section 10

Real-Time Analytics

Streaming data ingestion and time-series analytics with Real-Time Intelligence.

Overview

Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.

Key Components

Eventhouse

The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.

KQL Database

Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) — a powerful language for log and time-series analytics.

Eventstreams

Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.

Real-Time Dashboards

Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.

Streaming Ingestion Patterns

| Source | Ingestion Method | Latency |
|---|---|---|
| Azure Event Hubs | Eventstream → KQL Database | Seconds |
| Azure IoT Hub | Eventstream → KQL Database | Seconds |
| Apache Kafka | Kafka connector → Eventstream | Seconds |
| Custom apps | REST API / SDK ingestion | Seconds |
| Databases (CDC) | Fabric Mirroring → OneLake | Minutes |
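For the custom-app row, SDK-based ingestion might look like the following Python sketch. It is hypothetical and untested against a live cluster: the URI, database, and table names are placeholders, and it assumes the `azure-kusto-data` and `azure-kusto-ingest` packages with a service principal already provisioned.

```python
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

# Placeholder connection details -- substitute your Eventhouse ingestion URI
# and service-principal credentials.
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    "https://ingest-<your-eventhouse>.kusto.fabric.microsoft.com",
    "<client-id>", "<client-secret>", "<tenant-id>",
)

client = QueuedIngestClient(kcsb)
props = IngestionProperties(
    database="telemetry",    # hypothetical KQL database
    table="DeviceEvents",    # hypothetical target table
    data_format=DataFormat.JSON,
)

# Queue a local file for ingestion; queued ingestion is batched, so arrival
# time follows the database's ingestion batching policy.
client.ingest_from_file("events.json", ingestion_properties=props)
```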

When to Use Real-Time vs. Batch

✅ Best Practice

Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.

Section 11

Power BI Integration

Maximizing the reporting and analytics experience with Fabric-native Power BI features.

Direct Lake Mode

Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery — without copying data into the semantic model.

💡 Key Insight

Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.
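In a Fabric notebook, that maintenance might look like the sketch below. The table name is illustrative, and `spark.sql.parquet.vorder.enabled` is assumed as the session-level V-Order setting (the exact property name can vary by Fabric runtime version).

```python
# Enable V-Order for writes in this Spark session (Fabric-specific setting).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Compact small files and apply V-Order to an existing Gold table
# (table name is illustrative).
spark.sql("OPTIMIZE gold_daily_sales VORDER")

# Optionally persist the preference on the table itself so future writes
# stay V-Ordered.
spark.sql(
    "ALTER TABLE gold_daily_sales "
    "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
)
```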

Semantic Model Best Practices

Report Governance

Performance Optimization

| Technique | Impact | Effort |
|---|---|---|
| V-Order on Delta tables | 🟢 High | Low |
| Star schema design | 🟢 High | Medium |
| Reduce cardinality of columns | 🟢 High | Low |
| Use aggregation tables | 🟡 Medium | Medium |
| Optimize DAX measures | 🟡 Medium | Varies |
| Composite models | 🟡 Medium | Medium |
| Paginated reports for large exports | 🟡 Medium | Low |
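One way to see the cardinality point concretely: a raw timestamp column is near-unique per row, while splitting it into a date part and a minute-of-day part yields two columns with far fewer distinct values, which columnar engines compress much better. A stdlib sketch with synthetic data:

```python
from datetime import datetime, timedelta

# Synthetic fact data: one week of events at 1-second resolution.
start = datetime(2024, 1, 1)
stamps = [start + timedelta(seconds=s) for s in range(7 * 24 * 3600)]

full_cardinality = len(set(stamps))                              # raw timestamps
date_cardinality = len({ts.date() for ts in stamps})             # date part only
time_cardinality = len({ts.strftime("%H:%M") for ts in stamps})  # minute of day

print(full_cardinality, date_cardinality, time_cardinality)
# 604,800 distinct timestamps collapse to 7 dates + 1,440 minute values
```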