Audience: Data Engineer, BI Developer, Analytics Engineer. This section covers lakehouse vs. warehouse decisions, Spark optimization, notebook patterns, real-time streaming with Eventhouse, and Power BI Direct Lake integration.
Data Engineering Best Practices
Practical guidance for building efficient, maintainable data pipelines in Fabric.
Lakehouse vs. Warehouse
Fabric offers two primary storage artifacts. Choose based on your workload:
| Feature | Lakehouse | Warehouse |
|---|---|---|
| Primary language | PySpark, Spark SQL | T-SQL |
| Data format | Delta (open format in OneLake) | Proprietary columnar |
| Schema | Schema-on-read + schema-on-write | Schema-on-write only |
| Best for | Data engineering, ML, unstructured data | BI, structured queries, stored procedures |
| SQL access | Read-only SQL endpoint (auto-generated) | Full T-SQL read/write |
| Multi-engine | Spark + SQL + Power BI | SQL + Power BI |
Use Lakehouse for ingestion and transformation (Bronze/Silver layers) and Warehouse for the Gold/consumption layer if your consumers prefer T-SQL. Or use Lakehouse for all layers if your team is Spark-first.
Notebook Best Practices
- Parameterize notebooks: Use Fabric's built-in parameters for environment-specific values (file paths, dates, flags)
- Modular design: Break complex logic into reusable functions or separate notebooks invoked with `%run`
- Error handling: Use try/except blocks and log errors to a monitoring table
- Avoid hardcoded paths: Use relative paths within OneLake or parameterized paths
- Version control: Connect notebooks to Git for versioning and code review
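The parameterization and error-handling bullets above can be sketched in plain Python. Everything here is illustrative: in a real Fabric notebook the values in `params` would come from a parameter cell, and `error_log` stands in for a Delta monitoring table.

```python
from datetime import datetime, timezone

# Stand-in for a Fabric notebook parameter cell (values injected per environment).
params = {"source_path": "Files/bronze/sales", "run_date": "2024-01-15"}

error_log = []  # stand-in for a monitoring table you would append to in Delta

def run_step(name, fn, *args):
    """Run one pipeline step; record failures instead of crashing the whole run."""
    try:
        return fn(*args)
    except Exception as exc:
        error_log.append({
            "step": name,
            "error": str(exc),
            "logged_at": datetime.now(timezone.utc).isoformat(),
        })
        return None

# A failing step is captured in the log rather than raised.
result = run_step("load_bronze", lambda path: 1 / 0, params["source_path"])
```

In practice you would flush `error_log` to a monitoring table at the end of the notebook so downstream orchestration can alert on failures.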
Spark Optimization
- V-Order optimization: Enable V-Order for write-optimized Delta tables — significantly faster reads for Power BI and SQL
- Optimize file sizes: Target 128 MB – 1 GB files. Use OPTIMIZE (bin compaction) to merge small files
- Partition wisely: Partition by low-cardinality columns used in filters (e.g., year, region). Avoid over-partitioning
- Use Delta caching: SSD caching speeds up repeated reads of the same data
- Right-size sessions: Don't default to large Spark clusters — start small and scale up based on the Spark UI metrics
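One way to act on the 128 MB – 1 GB file-size target is to scan a table's file sizes and flag it for compaction. This is a pure-Python sketch with an illustrative threshold rule; in Fabric the actual compaction step is `OPTIMIZE <table>` run via Spark SQL.

```python
TARGET_MIN = 128 * 1024**2  # 128 MB lower bound from the guidance above

def needs_compaction(file_sizes_bytes, small_file_ratio=0.5):
    """Flag a Delta table for OPTIMIZE when most of its files are under 128 MB."""
    if not file_sizes_bytes:
        return False
    small = sum(1 for size in file_sizes_bytes if size < TARGET_MIN)
    return small / len(file_sizes_bytes) >= small_file_ratio

# A table dominated by 10 MB files should be compacted...
assert needs_compaction([10 * 1024**2] * 8 + [256 * 1024**2])
# ...while one made of healthy 512 MB files should not.
assert not needs_compaction([512 * 1024**2] * 4)
```

Running such a check in a nightly maintenance notebook lets you compact only the tables that need it instead of blindly optimizing everything.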
File Format Guidance
- ✅ Delta (Parquet + transaction log) — Default for all Fabric tables
- ✅ Parquet — Good for external/read-only data
- ⚠️ CSV — Only for Bronze ingestion; convert to Delta ASAP
- ❌ JSON (nested) — Flatten before storing in tables
- ❌ Excel/XML — Convert during the ingestion pipeline
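The format guidance above can be encoded as a small routing helper for an ingestion pipeline. The function and action strings are purely illustrative:

```python
# Map incoming file extensions to the ingestion action recommended above.
INGESTION_ACTION = {
    "delta":   "use as-is (default Fabric table format)",
    "parquet": "read directly or shortcut (external/read-only data)",
    "csv":     "land in Bronze, convert to Delta ASAP",
    "json":    "flatten nested structures, then write Delta",
    "xlsx":    "convert to Delta during the ingestion pipeline",
    "xml":     "convert to Delta during the ingestion pipeline",
}

def ingestion_action(extension: str) -> str:
    """Return the recommended handling for a file of the given extension."""
    return INGESTION_ACTION.get(extension.lower(), "review before ingesting")
```

A dispatcher like this keeps format policy in one place instead of scattered across pipeline notebooks.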
Shortcuts Strategy
Shortcuts are pointers to data in external locations — they appear as folders in your lakehouse but don't copy data. Use them to:
- Access data in other lakehouses or workspaces without duplication
- Connect to external ADLS Gen2, AWS S3, or Google Cloud Storage
- Enable cross-domain data sharing without moving data
- Create a virtual data mesh where each domain manages its data but shares via shortcuts
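Because a shortcut surfaces under Tables/ or Files/ like any local folder, Spark reads it through an ordinary OneLake path. The sketch below builds that path using OneLake's abfss addressing convention; the workspace, lakehouse, and shortcut names are placeholders.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"

def shortcut_path(workspace: str, lakehouse: str, shortcut: str) -> str:
    """Build the OneLake URI under which a Tables/ shortcut's data appears."""
    return (
        f"abfss://{workspace}@{ONELAKE_HOST}/"
        f"{lakehouse}.Lakehouse/Tables/{shortcut}"
    )

# In a notebook (names are examples):
#   df = spark.read.format("delta").load(shortcut_path("Sales", "Core", "orders"))
path = shortcut_path("Sales", "Core", "orders")
```

The key point is that nothing in the read path reveals the shortcut target: the consuming code is identical whether the data is local or lives in another workspace, ADLS Gen2, S3, or GCS.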
Real-Time Analytics
Streaming data ingestion and time-series analytics with Real-Time Intelligence.
Overview
Fabric's Real-Time Intelligence experience is designed for high-volume streaming and time-series data. It's powered by the same engine behind Azure Data Explorer (Kusto) and provides sub-second query latency on billions of records.
Key Components
Eventhouse
The database container for real-time data. Stores data in a highly compressed columnar format optimized for time-series queries.
KQL Database
Individual databases within an Eventhouse. Query data using Kusto Query Language (KQL) — a powerful language for log and time-series analytics.
Eventstreams
Capture, transform, and route real-time events from sources like Azure Event Hubs, IoT Hub, Kafka, and custom apps.
Real-Time Dashboards
Build live dashboards with auto-refresh that query Eventhouse data directly. Ideal for operational monitoring.
Streaming Ingestion Patterns
| Source | Ingestion Method | Latency |
|---|---|---|
| Azure Event Hubs | Eventstream → KQL Database | Seconds |
| Azure IoT Hub | Eventstream → KQL Database | Seconds |
| Apache Kafka | Kafka connector → Eventstream | Seconds |
| Custom apps | REST API / SDK ingestion | Seconds |
| Databases (CDC) | Fabric Mirroring → OneLake | Minutes |
When to Use Real-Time vs. Batch
- Real-Time Intelligence: IoT telemetry, application logs, clickstream analytics, fraud detection, operational monitoring
- Batch (Lakehouse/Warehouse): Historical reporting, data warehousing, regulatory reporting, large-scale transformations
- Hybrid: Ingest in real-time to Eventhouse for operational use, then copy to Lakehouse for long-term analytics
Use Eventhouse for hot data (last 30 days) and OneLake for warm/cold data (historical). Set up retention policies in your KQL Database to automatically move aging data.
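The hot/warm split above reduces to a simple age-based routing rule. This is a pure-Python sketch using the 30-day window from the guidance; in practice the aging is enforced by retention and caching policies on the KQL Database rather than by application code.

```python
from datetime import date, timedelta

HOT_DAYS = 30  # hot window from the guidance above; tune per workload

def storage_tier(event_date: date, today: date) -> str:
    """Route a record: Eventhouse for hot data, OneLake for warm/cold data."""
    age = today - event_date
    return "eventhouse" if age <= timedelta(days=HOT_DAYS) else "onelake"

today = date(2024, 6, 30)
assert storage_tier(date(2024, 6, 25), today) == "eventhouse"
assert storage_tier(date(2024, 1, 1), today) == "onelake"
```

Keeping the threshold in one constant makes it easy to align application-side expectations with the retention policy configured on the database.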
Power BI Integration
Maximizing the reporting and analytics experience with Fabric-native Power BI features.
Direct Lake Mode
Direct Lake is a game-changing connectivity mode exclusive to Fabric. It combines the performance of Import mode with the freshness of DirectQuery — without copying data into the semantic model.
- Reads data directly from Delta tables in OneLake
- No data refresh needed — reports always show the latest data
- Performance comparable to Import mode for most scenarios
- Falls back to DirectQuery if data exceeds memory (configurable behavior)
Direct Lake works best when your Gold-layer tables are V-Order optimized and properly sized (avoid millions of tiny files). Run OPTIMIZE and enable V-Order on your Gold tables for best Direct Lake performance.
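A Gold-layer maintenance job can apply both recommendations by emitting two Spark SQL statements per table. The V-Order table property name below (`delta.parquet.vorder.enabled`) is taken from Fabric's Spark settings but should be treated as an assumption to verify against your runtime version; the table name is an example.

```python
def gold_maintenance_sql(table: str) -> list:
    """Build the Spark SQL statements to V-Order-enable and compact a table."""
    return [
        # Property name assumed from Fabric's V-Order settings; verify per runtime.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')",
        # Bin compaction: merge small files so Direct Lake reads fewer, larger files.
        f"OPTIMIZE {table}",
    ]

# In a notebook: for stmt in gold_maintenance_sql("gold.sales"): spark.sql(stmt)
stmts = gold_maintenance_sql("gold.sales")
```

Scheduling this after each Gold-layer load keeps Direct Lake reports on the fast path instead of silently falling back to DirectQuery.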
Semantic Model Best Practices
- Build star schemas in your Gold layer — separate facts and dimensions
- Use surrogate keys (integer) instead of natural keys for relationships
- Minimize calculated columns — push computations to the Lakehouse/Warehouse layer
- Use calculation groups for reusable time intelligence (YTD, MTD, YoY)
- Enable automatic aggregations for large tables to speed up high-level queries
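To illustrate the surrogate-key bullet: integer keys compress better and join faster than string natural keys. A minimal sketch of the mapping step (the function name is illustrative; a real pipeline would persist and extend this mapping between runs rather than rebuild it):

```python
def build_surrogate_keys(natural_keys):
    """Assign deterministic integer surrogate keys to natural keys (e.g., SKUs).

    Sorting the distinct keys makes the assignment reproducible for a given
    input set; new keys appended later would need the persisted mapping.
    """
    return {nk: i + 1 for i, nk in enumerate(sorted(set(natural_keys)))}

keys = build_surrogate_keys(["SKU-B", "SKU-A", "SKU-B"])
assert keys == {"SKU-A": 1, "SKU-B": 2}
```

The dimension table keeps both columns (surrogate and natural key); fact tables store only the integer surrogate used in relationships.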
Report Governance
- Centralize semantic models: Create shared models that multiple reports connect to (avoid dataset proliferation)
- Use endorsement: Certify production-ready models and mark maturing models as Promoted
- Apply RLS: Configure row-level security on semantic models for multi-tenant reports
- Monitor usage: Use the usage metrics to track report adoption
Performance Optimization
| Technique | Impact | Effort |
|---|---|---|
| V-Order on Delta tables | 🟢 High | Low |
| Star schema design | 🟢 High | Medium |
| Reduce cardinality of columns | 🟢 High | Low |
| Use aggregation tables | 🟡 Medium | Medium |
| Optimize DAX measures | 🟡 Medium | Varies |
| Composite models | 🟡 Medium | Medium |
| Paginated reports for large exports | 🟡 Medium | Low |