👀 Who is this for?

Data Engineers, Data Architects, and Analytics Engineers. This section covers how to bring data into Fabric using shortcuts, mirroring, Dataflows Gen2, and data pipelines, including when to use each approach.

Shortcuts

OneLake Shortcuts

Zero-copy data virtualization: access data where it lives without moving or duplicating it.

What Are Shortcuts?

Shortcuts are pointers to data in external or internal locations. They appear as folders in your lakehouse but don't copy data; queries read directly from the source. This enables unified data access across clouds, storage accounts, and Fabric workspaces with no ETL overhead.

💡 Key Concept

Shortcuts are metadata-only references. There is no data movement, no duplication cost, and no sync lag. The data is always read from its original location at query time.
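As a mental model, a shortcut can be thought of as a named pointer that is resolved to the source location at query time. The sketch below is purely illustrative; the `Shortcut` class and `resolve` function are not a Fabric API, and the URI is a made-up example.

```python
from dataclasses import dataclass

# Illustrative model only: a shortcut stores metadata, never the data itself.
@dataclass(frozen=True)
class Shortcut:
    name: str          # folder name shown in the lakehouse
    source_uri: str    # where the bytes actually live

def resolve(shortcut: Shortcut, relative_path: str) -> str:
    """Translate a path under the shortcut folder into the source location
    that would be read at query time."""
    return f"{shortcut.source_uri.rstrip('/')}/{relative_path.lstrip('/')}"

sc = Shortcut("sc_sales_orders",
              "abfss://data@contoso.dfs.core.windows.net/gold/orders")
print(resolve(sc, "year=2024/part-0001.parquet"))
```

Because only the pointer lives in the lakehouse, deleting the shortcut removes nothing at the source, and every query sees the source's current state.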

Shortcut Types

| Shortcut Type | Source | Use Case |
| --- | --- | --- |
| Internal | Other Fabric lakehouses, warehouses, or KQL databases | Cross-workspace and cross-domain data sharing within Fabric |
| ADLS Gen2 | Azure Data Lake Storage Gen2 accounts | Connect to existing Azure data lakes without migration |
| Amazon S3 | AWS S3 buckets | Multi-cloud analytics: query AWS data from Fabric |
| Google Cloud Storage | GCS buckets | Multi-cloud analytics: query GCP data from Fabric |
| Dataverse | Microsoft Dataverse tables | Access Dynamics 365 / Power Platform data directly |
| Amazon S3 Compatible | S3-compatible stores (MinIO, etc.) | Connect to on-prem or third-party object storage |

Architecture Patterns with Shortcuts

🔗 Cross-Domain Data Mesh

Each domain publishes a Gold layer in its own workspace. Other domains consume via shortcuts, with no data copies and no pipelines to maintain. The producing domain controls access.

☁️ Multi-Cloud Lakehouse

Create shortcuts to ADLS Gen2, AWS S3, and GCS in a single lakehouse. Run Spark or SQL queries across all sources as if they were local Delta tables.

πŸ—οΈ Bronze Layer Virtualization

Instead of copying raw data into Fabric, create shortcuts to existing storage. Only materialize Silver and Gold layers β€” reducing storage cost and ingestion complexity.

📊 Shared Reference Data

Centralize reference tables (calendars, currencies, region mappings) in one workspace and share via shortcuts to all consumers. Update once, reflect everywhere.

Considerations

⚠️ Important Considerations

Performance: shortcut queries depend on source latency. For high-frequency analytics, consider materializing critical data instead of using shortcuts.

Format: Delta format is required for full feature support (time travel, ACID transactions). Non-Delta files (Parquet, CSV) are read-only through shortcuts.

Best Practices

Shortcuts: Do's and Don'ts
✅ Use shortcuts for read-heavy, cross-domain sharing (Gold layer consumption)
✅ Use shortcuts to avoid duplicating Bronze data already in ADLS Gen2
✅ Apply consistent naming: sc_<source>_<dataset> (e.g., sc_sales_orders)
✅ Document shortcuts in a central catalog; they're invisible in lineage views

⚠️ Monitor source latency: external shortcuts add network round-trip time
⚠️ Test Direct Lake mode compatibility: not all shortcut types support it

❌ Don't use shortcuts for write-back scenarios; they're read-only
❌ Don't chain shortcuts (shortcut → shortcut); performance degrades
❌ Don't shortcut to volatile or temporary storage locations
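The sc_<source>_<dataset> naming convention above is easy to enforce with a small check in a CI step or deployment script. This sketch assumes lowercase, underscore-separated source and dataset segments; adjust the pattern to your own convention.

```python
import re

# Assumed convention: sc_<source>_<dataset>, lowercase alphanumerics,
# e.g. sc_sales_orders. This pattern is an illustration, not a Fabric rule.
_SHORTCUT_NAME = re.compile(r"^sc_[a-z0-9]+_[a-z0-9_]+$")

def is_valid_shortcut_name(name: str) -> bool:
    """Return True if the shortcut name follows the sc_<source>_<dataset> convention."""
    return bool(_SHORTCUT_NAME.match(name))

print(is_valid_shortcut_name("sc_sales_orders"))   # True
print(is_valid_shortcut_name("SalesOrders"))       # False
```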
Mirroring

Fabric Mirroring

Near real-time data replication from operational databases into OneLake, with no ETL pipelines required.

What Is Mirroring?

Mirroring continuously replicates data from external databases into OneLake as Delta tables using Change Data Capture (CDC). Unlike traditional ETL, mirroring is managed by Fabric: you configure the source, and data flows automatically with near real-time latency.

💡 Key Concept

Mirrored data lands in OneLake as Delta tables, which means it is immediately queryable via Spark, SQL, KQL, and Power BI Direct Lake, with no additional transformation needed for basic analytics.

Supported Sources

| Source | CDC Method | Typical Latency | Notes |
| --- | --- | --- | --- |
| Azure SQL Database | Change tracking / CDC | Near real-time (minutes) | Most common source; full and incremental sync |
| Azure Cosmos DB | Change feed | Near real-time | NoSQL → Delta conversion; analytical workloads on transactional data |
| Snowflake | Streams & tasks | Near real-time | Cross-platform analytics; bring Snowflake data into the Fabric ecosystem |
| Azure Database for PostgreSQL | Logical replication | Near real-time | Flexible Server required; supports selected tables |
| Azure Database for MySQL | Binlog replication | Near real-time | Flexible Server required |
| Azure Databricks | Unity Catalog integration | Varies | Mirror Delta tables from a Databricks lakehouse |
| SQL Server (on-prem) | CDC via on-prem gateway | Minutes | Requires on-premises data gateway configuration |

How Mirroring Works

1. Initial Snapshot

When mirroring is first enabled, Fabric takes a full snapshot of the selected tables and writes them as Delta tables in OneLake.

2. Incremental Sync

After the snapshot, CDC captures inserts, updates, and deletes from the source and applies them to the Delta tables continuously.

3. Delta Conversion

Source data is automatically converted to Delta format with proper schema mapping. Complex types are flattened where needed.
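To make the idea of flattening complex types concrete, here is a generic recursive flattener. It is a sketch of the general technique only; Fabric's actual conversion and column-naming rules may differ.

```python
def flatten(record: dict, prefix: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested objects into flat columns,
    e.g. address.city becomes address_city (illustrative naming)."""
    flat = {}
    for key, value in record.items():
        col = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, col, sep))
        else:
            flat[col] = value
    return flat

# Example: a nested NoSQL-style document becomes a single flat row.
doc = {"id": 1, "address": {"city": "Oslo", "zip": "0150"}}
print(flatten(doc))  # {'id': 1, 'address_city': 'Oslo', 'address_zip': '0150'}
```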

4. Query Ready

Mirrored Delta tables are instantly available in the SQL analytics endpoint, Spark notebooks, and Power BI Direct Lake reports.
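The snapshot-then-sync flow above can be sketched as a toy model, with a Python dict keyed by primary key standing in for a Delta table. This is purely illustrative; the `op`/`row` change format is an assumption, not Fabric's CDC wire format.

```python
def take_snapshot(source_rows):
    """Step 1: full copy of the source, keyed by primary key."""
    return {row["id"]: dict(row) for row in source_rows}

def apply_changes(table, changes):
    """Step 2: apply CDC inserts, updates, and deletes in order."""
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            table[row["id"]] = dict(row)
        elif op == "delete":
            table.pop(row["id"], None)
    return table

table = take_snapshot([{"id": 1, "qty": 5}, {"id": 2, "qty": 3}])
apply_changes(table, [
    {"op": "update", "row": {"id": 1, "qty": 7}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "qty": 1}},
])
print(sorted(table))  # [1, 3]
```

The key property this models: after the initial snapshot, only the changes flow, so the mirrored table converges on the source without re-copying it.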

Architecture Patterns

📊 Operational Reporting

Mirror production databases to Fabric and build Power BI reports with Direct Lake mode: near real-time dashboards without impacting the transactional system.

🔄 Database Modernization

Use mirroring as the first step in migrating from SQL Server or Snowflake. Data flows into OneLake immediately while you build the target architecture.

🏭 Multi-Source Consolidation

Mirror multiple databases (Azure SQL, Cosmos DB, PostgreSQL) into a single Fabric workspace. Join across sources using the SQL analytics endpoint.

⚠️ Important Considerations

Capacity cost: mirroring consumes Capacity Units (CUs) for both initial snapshots and ongoing CDC. Monitor usage in the Capacity Metrics app.

Schema changes: some schema modifications at the source (e.g., column drops, type changes) may require re-initialization of the mirror.

Latency: while near real-time, mirroring is not sub-second; for true streaming, use Eventstreams instead.

Best Practices

Mirroring: Do's and Don'ts
✅ Start with high-value tables; mirror what's needed, not entire databases
✅ Use the SQL analytics endpoint for ad-hoc queries on mirrored data
✅ Combine mirroring with the medallion architecture: mirror → Bronze, transform → Silver/Gold
✅ Monitor replication lag in the Fabric portal and set up alerts

⚠️ Plan for initial snapshot time: large tables may take hours on first sync
⚠️ Test schema evolution scenarios before going to production

❌ Don't use mirroring for sub-second latency requirements; use Eventstreams instead
❌ Don't mirror tables with frequent schema changes without a re-initialization plan
❌ Don't skip capacity planning; continuous CDC consumes CUs 24/7
Dataflows

Dataflows Gen2

Low-code data transformation and ingestion powered by Power Query: ideal for citizen integrators and standardized ETL.

What Are Dataflows Gen2?

Dataflows Gen2 provide a Power Query-based, low-code interface for connecting to data sources, applying transformations, and loading data into Fabric destinations (lakehouses, warehouses, or KQL databases). They run on managed Fabric compute and support 150+ connectors.

Key Capabilities

🔌 150+ Connectors

Connect to databases, SaaS apps, files, APIs, and cloud services. Popular connectors: SQL Server, SharePoint, Salesforce, SAP, REST APIs.

🧩 Power Query Transforms

Use the visual Power Query editor for joins, pivots, aggregations, type conversions, conditional columns, and custom M functions.

📦 Multiple Destinations

Load data into Fabric lakehouses, warehouses, or KQL databases. Supports append, replace, and upsert (merge) load modes.
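The three load modes can be sketched with plain Python lists standing in for destination tables. This is an illustration of the semantics only, not the dataflow engine's implementation; the `key` parameter models the match column an upsert needs.

```python
def load(target, rows, mode="append", key="id"):
    """Apply a load mode to a destination table modeled as a list of dict rows."""
    if mode == "replace":
        return list(rows)                 # destination is overwritten
    if mode == "append":
        return list(target) + list(rows)  # rows are added unconditionally
    if mode == "upsert":
        by_key = {r[key]: r for r in target}
        for r in rows:
            by_key[r[key]] = r            # update existing key, insert new key
        return list(by_key.values())
    raise ValueError(f"unknown mode: {mode}")

existing = [{"id": 1, "status": "open"}]
incoming = [{"id": 1, "status": "closed"}, {"id": 2, "status": "open"}]
print(load(existing, incoming, mode="upsert"))
# [{'id': 1, 'status': 'closed'}, {'id': 2, 'status': 'open'}]
```

Upsert (merge) is the mode to reach for when re-running a dataflow must not duplicate rows that were already loaded.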

⚑ Staging Lakehouse

Dataflows Gen2 use an auto-provisioned staging lakehouse for intermediate processing, delivering significantly faster performance than Gen1.

When to Use Dataflows Gen2

| Scenario | Recommended? | Why |
| --- | --- | --- |
| SaaS data ingestion (Salesforce, SAP, SharePoint) | ✅ Yes | Built-in connectors handle auth, pagination, and schema mapping |
| Simple transformations (filter, rename, type cast) | ✅ Yes | Low-code Power Query is faster to build and maintain |
| Citizen integrator / business-user ETL | ✅ Yes | No Spark or SQL skills required: visual drag-and-drop interface |
| Complex multi-step transformations | ⚠️ Consider notebooks | Spark notebooks offer more control for complex logic |
| High-volume data (100M+ rows) | ⚠️ Consider pipelines + Spark | Dataflows may time out or consume excessive CUs on very large datasets |
| Real-time / streaming data | ❌ Use Eventstreams | Dataflows are batch-oriented, not designed for streaming |
✅ Best Practice

Use Dataflows Gen2 for Bronze layer ingestion from SaaS sources and simple transformations. For complex Silver/Gold transformations, use Spark notebooks orchestrated by Data Pipelines.

Pipelines

Data Pipelines

Orchestration and scheduling for data movement and transformation: the backbone of production data flows.

What Are Data Pipelines?

Data Pipelines in Fabric are the orchestration engine (based on Azure Data Factory) for scheduling and coordinating data movement and transformation activities. They connect all the other integration components, triggering Dataflows, Spark notebooks, stored procedures, and copy operations in sequence or in parallel.

Core Activities

📋 Copy Activity

High-performance data movement between 100+ sources and Fabric destinations. Supports full and incremental loads with configurable parallelism.

📓 Notebook Activity

Execute Spark notebooks for complex transformations. Pass parameters, capture outputs, and chain notebooks in sequence.

🔄 Dataflow Activity

Trigger Dataflows Gen2 as a step in a larger pipeline. Combine low-code and code-first transforms in one orchestration.

📦 Stored Procedure

Execute SQL stored procedures in Fabric warehouses for T-SQL transformations and data loading.

Orchestration Patterns

| Pattern | Description | When to Use |
| --- | --- | --- |
| Sequential Pipeline | Activities run one after another: Copy → Notebook → Stored Procedure | Standard ETL with dependencies between steps |
| Parallel Fan-Out | ForEach activity processes multiple tables or partitions in parallel | Ingesting many tables from the same source simultaneously |
| Master Pipeline | Parent pipeline calls child pipelines (Invoke Pipeline activity) | Large-scale orchestration with reusable sub-pipelines |
| Event-Driven | Pipeline triggered by storage events, a schedule, or an API call | Reacting to file arrivals or system events |
| Incremental Load | Watermark-based or timestamp-driven loads with Lookup + Copy | Efficient daily/hourly loads from large tables |
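The incremental-load pattern in the last row can be sketched as follows, with an in-memory list standing in for the source table. The `modified` column name and the return shape are assumptions for illustration; in a real pipeline, the Lookup activity reads the stored watermark and the Copy activity filters on it.

```python
from datetime import datetime

def incremental_load(source_rows, watermark):
    """Copy only rows changed since the last watermark,
    then advance the watermark to the newest row copied."""
    new_rows = [r for r in source_rows if r["modified"] > watermark]
    new_watermark = max((r["modified"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 3)},
]
copied, wm = incremental_load(rows, datetime(2024, 1, 2))
print(len(copied), wm)  # 1 2024-01-03 00:00:00
```

Persisting the returned watermark between runs is what makes each execution pick up only the delta instead of re-copying the whole table.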
✅ Best Practice

Use master/child pipeline patterns for production workloads. The master pipeline handles error handling and notifications, while child pipelines encapsulate individual domain or table-group logic for reusability.
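A minimal sketch of the master/child idea, with Python callables standing in for child pipelines and a callback standing in for notifications. The names (`run_master`, `load_sales`, `load_hr`) are made up for illustration; the point is that the parent centralizes error handling while children stay reusable.

```python
def run_master(children, notify):
    """Invoke each child pipeline; record outcomes and alert on failures."""
    results = {}
    for name, child in children.items():
        try:
            child()
            results[name] = "Succeeded"
        except Exception as exc:
            results[name] = "Failed"
            notify(f"{name} failed: {exc}")  # centralized notification
    return results

def load_sales():      # child pipeline that succeeds
    pass

def load_hr():         # child pipeline that fails
    raise RuntimeError("source offline")

alerts = []
results = run_master({"load_sales": load_sales, "load_hr": load_hr}, alerts.append)
print(results)  # {'load_sales': 'Succeeded', 'load_hr': 'Failed'}
```

Note that one child's failure does not stop the others: the master decides the failure policy in one place instead of each child re-implementing it.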

Decide

Integration Decision Matrix

Choose the right integration approach based on your data source, latency needs, and team skills.

Which Tool Should I Use?

| Criteria | Shortcuts | Mirroring | Dataflows Gen2 | Data Pipelines | Eventstreams |
| --- | --- | --- | --- | --- | --- |
| Data movement | None (zero-copy) | CDC replication | Batch copy + transform | Orchestrated copy | Streaming ingestion |
| Latency | Real-time (reads at source) | Near real-time (minutes) | Batch (minutes–hours) | Batch (minutes–hours) | Seconds |
| Transform capability | None | None (raw replication) | Power Query (visual) | Orchestration only | Inline transforms |
| Skill level | Low | Low | Low (Power Query) | Medium (ADF patterns) | Medium–High |
| Best for | Cross-domain sharing, multi-cloud | Database → lakehouse CDC | SaaS ingestion, simple ETL | Production orchestration | IoT, logs, streaming |
| CU consumption | At query time only | Continuous (24/7) | During execution | During execution | Continuous |
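The matrix can be condensed into a hypothetical first-pass chooser. The function, its parameters, and its rules are simplifications for illustration, not an official decision procedure; real choices also weigh cost, governance, and team skills.

```python
def pick_tool(needs_transform: bool, latency: str, source: str) -> str:
    """Rough first pass over the decision matrix (illustrative heuristics only)."""
    if latency == "streaming":
        return "Eventstreams"            # sub-second needs rule out batch tools
    if source == "database":
        return "Mirroring"               # CDC replication without pipelines
    if not needs_transform and source in ("lake", "fabric-workspace"):
        return "Shortcuts"               # zero-copy beats any copy
    if needs_transform and source == "saas":
        return "Dataflows Gen2"          # connectors + Power Query transforms
    return "Data Pipelines"              # default for orchestrated batch work

print(pick_tool(False, "batch", "lake"))   # Shortcuts
print(pick_tool(True, "batch", "saas"))    # Dataflows Gen2
```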

Common Combinations

πŸ—οΈ Full Medallion ETL

Mirroring (source β†’ Bronze) + Pipelines (orchestrate Spark notebooks for Silver/Gold) + Shortcuts (share Gold across domains).

☁️ Multi-Cloud Analytics

Shortcuts to ADLS/S3/GCS for data lake consolidation + Dataflows Gen2 for SaaS connectors (Salesforce, SAP) + Pipelines to schedule refresh.

📊 Real-Time + Historical

Eventstreams for hot-path streaming into Eventhouse + Mirroring for operational databases + Shortcuts to join streaming and batch data.

🚀 Quick Start / POC

Mirroring for instant database access + Dataflows Gen2 for quick SaaS loads + Direct Lake reports on mirrored data, with zero Spark code required.

🎯 Rule of Thumb

Don't move data unless you have to. Start with shortcuts for cross-domain sharing and mirroring for database replication. Add Dataflows Gen2 when you need transformations. Use Data Pipelines when you need production orchestration, scheduling, and error handling.