πŸ‘€ Who is this for?

Data Engineer Data Architect Analytics Engineer β€” This section covers how to bring data into Fabric using shortcuts, mirroring, Dataflows Gen2, and data pipelines β€” including when to use each approach.

Shortcuts

OneLake Shortcuts

Zero-copy data virtualization β€” access data where it lives without moving or duplicating it.

🌐 Multi-Cloud & Hybrid Patterns

This page covers the core integration mechanisms (Shortcuts, Mirroring, Dataflows, Pipelines). For strategic multi-cloud architecture patterns β€” AWS, GCP, Snowflake, Databricks integration, and hybrid landing zones β€” see the Multi-Cloud & Hybrid guide.

What Are Shortcuts?

Shortcuts are pointers to data in external or internal locations. They appear as folders in your lakehouse but don't copy data β€” queries read directly from the source. This enables unified data access across clouds, storage accounts, and Fabric workspaces with no ETL overhead.

πŸ’‘ Key Concept

Shortcuts are metadata-only references. There is no data movement, no duplication cost, and no sync lag. The data is always read from its original location at query time.

Shortcut Types

Shortcut TypeSourceUse Case
InternalOther Fabric lakehouses, warehouses, or KQL databasesCross-workspace and cross-domain data sharing within Fabric
ADLS Gen2Azure Data Lake Storage Gen2 accountsConnect to existing Azure data lakes without migration
Amazon S3AWS S3 bucketsMulti-cloud analytics β€” query AWS data from Fabric
Google Cloud StorageGCS bucketsMulti-cloud analytics β€” query GCP data from Fabric
DataverseMicrosoft Dataverse tablesAccess Dynamics 365 / Power Platform data directly
Amazon S3 CompatibleS3-compatible stores (MinIO, etc.)Connect to on-prem or third-party object storage

Architecture Patterns with Shortcuts

πŸ”— Cross-Domain Data Mesh

Each domain publishes a Gold layer in its own workspace. Other domains consume via shortcuts β€” no data copies, no pipelines to maintain. The producing domain controls access.

☁️ Multi-Cloud Lakehouse

Create shortcuts to ADLS Gen2, AWS S3, and GCS in a single lakehouse. Run Spark or SQL queries across all sources as if they were local Delta tables.

πŸ—οΈ Bronze Layer Virtualization

Instead of copying raw data into Fabric, create shortcuts to existing storage. Only materialize Silver and Gold layers β€” reducing storage cost and ingestion complexity.

πŸ“Š Shared Reference Data

Centralize reference tables (calendars, currencies, region mappings) in one workspace and share via shortcuts to all consumers. Update once, reflect everywhere.

Security & Access Control

⚠️ Important Considerations

Performance: Shortcut queries depend on source latency. For high-frequency analytics, consider materializing critical data instead of using shortcuts. Delta format is required for full feature support (time travel, ACID transactions). Non-Delta files (Parquet, CSV) are read-only through shortcuts.

Best Practices

Shortcuts β€” Do's and Don'ts
βœ… Use shortcuts for read-heavy, cross-domain sharing (Gold layer consumption)
βœ… Use shortcuts to avoid duplicating Bronze data already in ADLS Gen2
βœ… Apply consistent naming: sc_<source>_<dataset> (e.g., sc_sales_orders)
βœ… Document shortcuts in a central catalog β€” they're invisible in lineage views

⚠️  Monitor source latency β€” external shortcuts add network round-trip time
⚠️  Test Direct Lake mode compatibility β€” not all shortcut types support it

❌ Don't use shortcuts for write-back scenarios β€” they're read-only
❌ Don't chain shortcuts (shortcut β†’ shortcut) β€” performance degrades
❌ Don't shortcut to volatile/temporary storage locations
Mirroring

Fabric Mirroring

Near real-time data replication from operational databases into OneLake β€” no ETL pipelines required.

What Is Mirroring?

Mirroring continuously replicates data from external databases into OneLake as Delta tables using Change Data Capture (CDC). Unlike traditional ETL, mirroring is managed by Fabric β€” you configure the source, and data flows automatically with near real-time latency.

πŸ’‘ Key Concept

Mirrored data lands in OneLake as Delta tables, which means it's immediately queryable via Spark, SQL, KQL, and Power BI Direct Lake β€” no additional transformation needed for basic analytics.

Supported Sources

SourceCDC MethodTypical LatencyNotes
Azure SQL DatabaseChange tracking / CDCNear real-time (minutes)Most common source; full and incremental sync
Azure Cosmos DBChange feedNear real-timeNoSQL β†’ Delta conversion; analytical workloads on transactional data
SnowflakeStreams & tasksNear real-timeCross-platform analytics; bring Snowflake data into Fabric ecosystem
Azure Database for PostgreSQLLogical replicationNear real-timeFlexible server required; supports selected tables
Azure Database for MySQLBinlog replicationNear real-timeFlexible server required
Azure DatabricksUnity Catalog integrationVariesMirror Delta tables from Databricks lakehouse
SQL Server (on-prem)CDC via on-prem gatewayMinutesRequires on-premises data gateway configuration
Oracle DatabaseLogMiner CDCNear real-timeGA β€” supports Oracle on-prem and cloud (19c+); no gateway required for cloud instances
SAP DatasphereReplication flowNear real-timeGA β€” mirrors SAP analytical and planning data into OneLake
Google BigQueryChange trackingNear real-timeCross-cloud analytics; bring GCP warehouse data into Fabric
SharePoint ListsChange notificationsMinutesPreview β€” mirrors business lists for analytics without impacting SharePoint performance
DremioLakehouse federationNear real-timePreview β€” mirrors Dremio-managed datasets into OneLake Delta format

How Mirroring Works

1. Initial Snapshot

When mirroring is first enabled, Fabric takes a full snapshot of the selected tables and writes them as Delta tables in OneLake.

2. Incremental Sync

After the snapshot, CDC captures inserts, updates, and deletes from the source and applies them to the Delta tables continuously.

3. Delta Conversion

Source data is automatically converted to Delta format with proper schema mapping. Complex types are flattened where needed.

4. Query Ready

Mirrored Delta tables are instantly available in the SQL analytics endpoint, Spark notebooks, and Power BI Direct Lake reports.

Architecture Patterns

πŸ“Š Operational Reporting

Mirror production databases to Fabric and build Power BI reports with Direct Lake mode β€” get near real-time dashboards without impacting the transactional system.

πŸ”„ Database Modernization

Use mirroring as the first step in migrating from SQL Server or Snowflake. Data flows into OneLake immediately while you build the target architecture.

🏭 Multi-Source Consolidation

Mirror multiple databases (Azure SQL, Cosmos DB, PostgreSQL) into a single Fabric workspace. Join across sources using the SQL analytics endpoint.

⚠️ Important Considerations

Capacity cost: Mirroring consumes CU (Capacity Units) for both initial snapshots and ongoing CDC. Monitor usage in the Capacity Metrics app. Schema changes: Some schema modifications at the source (e.g., column drops, type changes) may require re-initialization of the mirror. Latency: While near real-time, mirroring is not sub-second β€” for true streaming, use Eventstreams instead.

Best Practices

Mirroring β€” Do's and Don'ts
βœ… Start with high-value tables β€” mirror what's needed, not entire databases
βœ… Use the SQL analytics endpoint for ad-hoc queries on mirrored data
βœ… Combine mirroring with medallion architecture: mirror β†’ Bronze, transform β†’ Silver/Gold
βœ… Monitor replication lag in the Fabric portal and set up alerts

⚠️  Plan for initial snapshot time β€” large tables may take hours on first sync
⚠️  Test schema evolution scenarios before going to production

❌ Don't use mirroring for sub-second latency requirements β€” use Eventstreams instead
❌ Don't mirror tables with frequent schema changes without a re-init plan
❌ Don't skip capacity planning β€” continuous CDC consumes CUs 24/7
Dataflows

Dataflows Gen2

Low-code data transformation and ingestion powered by Power Query β€” ideal for citizen integrators and standardized ETL.

What Are Dataflows Gen2?

Dataflows Gen2 provide a Power Query-based, low-code interface for connecting to data sources, applying transformations, and loading data into Fabric destinations (lakehouses, warehouses, or KQL databases). They run on a managed Spark backend and support 150+ connectors.

Key Capabilities

πŸ”Œ 150+ Connectors

Connect to databases, SaaS apps, files, APIs, and cloud services. Popular connectors: SQL Server, SharePoint, Salesforce, SAP, REST APIs.

🧩 Power Query Transforms

Use the visual Power Query editor for joins, pivots, aggregations, type conversions, conditional columns, and custom M functions.

πŸ“¦ Multiple Destinations

Load data into Fabric lakehouses, warehouses, or KQL databases. Supports append, replace, and upsert (merge) load modes.

⚑ Staging Lakehouse

Dataflows Gen2 use an auto-provisioned staging lakehouse for intermediate processing β€” delivering significantly faster performance than Gen1.

When to Use Dataflows Gen2

ScenarioRecommended?Why
SaaS data ingestion (Salesforce, SAP, SharePoint)βœ… YesBuilt-in connectors handle auth, pagination, and schema mapping
Simple transformations (filter, rename, type cast)βœ… YesLow-code Power Query is faster to build and maintain
Citizen integrator / business-user ETLβœ… YesNo Spark or SQL skills required β€” visual drag-and-drop interface
Complex multi-step transformations⚠️ Consider NotebooksSpark notebooks offer more control for complex logic
High-volume data (100M+ rows)⚠️ Consider Pipelines + SparkDataflows may time out or consume excessive CUs on very large datasets
Real-time / streaming data❌ Use EventstreamsDataflows are batch-oriented, not designed for streaming
βœ… Best Practice

Use Dataflows Gen2 for Bronze layer ingestion from SaaS sources and simple transformations. For complex Silver/Gold transformations, use Spark notebooks orchestrated by Data Pipelines.

Pipelines

Data Pipelines

Orchestration and scheduling for data movement and transformation β€” the backbone of production data flows.

What Are Data Pipelines?

Data Pipelines in Fabric are the orchestration engine (based on Azure Data Factory) for scheduling and coordinating data movement and transformation activities. They connect all other integration components β€” triggering Dataflows, Spark notebooks, stored procedures, and copy operations in sequence or parallel.

Core Activities

πŸ“‹ Copy Activity

High-performance data movement between 100+ sources and Fabric destinations. Supports full and incremental loads with configurable parallelism.

πŸ““ Notebook Activity

Execute Spark notebooks for complex transformations. Pass parameters, capture outputs, and chain notebooks in sequence.

πŸ”„ Dataflow Activity

Trigger Dataflows Gen2 as a step in a larger pipeline. Combine low-code and code-first transforms in one orchestration.

πŸ“¦ Stored Procedure

Execute SQL stored procedures in Fabric warehouses for T-SQL transformations and data loading.

Orchestration Patterns

PatternDescriptionWhen to Use
Sequential PipelineActivities run one after another: Copy β†’ Notebook β†’ Stored ProcedureStandard ETL with dependencies between steps
Parallel Fan-OutForEach activity processes multiple tables or partitions in parallelIngesting many tables from the same source simultaneously
Master PipelineParent pipeline calls child pipelines (Invoke Pipeline activity)Large-scale orchestration with reusable sub-pipelines
Event-DrivenPipeline triggered by storage events, schedule, or API callReact to file arrivals or system events
Incremental LoadWatermark-based or timestamp-driven loads with Lookup + CopyEfficient daily/hourly loads from large tables
βœ… Best Practice

Use master/child pipeline patterns for production workloads. The master pipeline handles error handling and notifications, while child pipelines encapsulate individual domain or table-group logic for reusability.

Airflow

Apache Airflow in Fabric

Managed DAG-based orchestration for complex, multi-step data workflows.

Why Airflow in Fabric?

Microsoft Fabric offers managed Apache Airflow jobs as a first-class Data Factory capability. This replaces the deprecated Azure Data Factory Airflow integration and provides a fully managed, auto-scaling orchestration service β€” no infrastructure to maintain.

πŸ€” When to choose Airflow over native Data Pipelines

Both tools orchestrate data workflows, but they solve different problems. Choose based on your team's needs and existing investments.

Airflow vs. Data Pipelines β€” Decision Matrix

CriteriaData Pipelines (Native)Apache Airflow (Managed)Why It Matters
Authoring model Visual drag-and-drop UI Python code (DAGs) Team skills: Low-code teams prefer Pipelines; Python-fluent teams move faster in Airflow with version control and testing.
Dependency complexity Linear/parallel with ForEach Arbitrary DAG with complex branching, retries, pools Infrastructure: Complex multi-hop workflows with conditional logic and backfills are Airflow's strength.
Multi-cloud orchestration Fabric-only (with Web activities for external) 1,000+ community operators (AWS, GCP, Databricks, etc.) Architecture: If you orchestrate across clouds or non-Microsoft services, Airflow's ecosystem is unmatched.
Testing & CI/CD Limited (JSON export, REST API) Full Python unit testing, Git-native, standard CI/CD Quality: Airflow DAGs can be unit-tested, linted, and peer-reviewed like application code.
Cost model Activity execution CUs only CUs while Airflow environment is running (auto-pause helps) Cost: Pipelines cost nothing when idle. Airflow has baseline cost but auto-pauses during inactivity.
Learning curve Low β€” ADF/Synapse familiarity Medium β€” requires Python + Airflow concepts Adoption: Teams already using Airflow elsewhere can reuse DAGs; greenfield teams may prefer native Pipelines.
Existing investment ADF/Synapse pipelines migrate directly Existing Airflow DAGs migrate with minimal changes Migration: Bring your existing orchestration code β€” don't rewrite from scratch.
βœ… Recommendation

Use native Data Pipelines when your orchestration is straightforward (copy β†’ transform β†’ load) and your team prefers low-code. Use Airflow when you need complex DAG dependencies, backfill capabilities, cross-cloud orchestration, or when your team already has Airflow expertise and existing DAGs to migrate.

Key Capabilities

πŸ”„ FabricRunItemOperator

Native Airflow operator to trigger Fabric items (pipelines, notebooks, warehouses) from within a DAG. Supports deferrable mode for efficient resource usage.

πŸ“ˆ Auto-Scale & Auto-Pause

Automatically scales workers for concurrent tasks and pauses when idle β€” you only pay CUs during active DAG execution.

πŸ” Entra ID Integration

SSO via Microsoft Entra, workspace-level RBAC, and service principal authentication for automated deployments.

πŸ“Š Built-in Monitoring

Real-time DAG run monitoring, task-level logging, and performance comparison across runs β€” all in the Fabric UI.

Example: Orchestrating a Medallion Pipeline

DAG triggering Fabric notebooks and pipelines
from airflow import DAG
from airflow.providers.microsoft.fabric.operators.run_item import FabricRunItemOperator
from datetime import datetime

with DAG(
    "medallion_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=True,          # enables backfill for historical dates
) as dag:

    bronze_ingest = FabricRunItemOperator(
        task_id="bronze_ingest",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="notebook-bronze-ingest",
        job_type="RunNotebook",
        wait_for_termination=True,
        deferrable=True,   # frees Airflow worker while waiting
    )

    silver_transform = FabricRunItemOperator(
        task_id="silver_transform",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="notebook-silver-transform",
        job_type="RunNotebook",
        wait_for_termination=True,
        deferrable=True,
    )

    gold_publish = FabricRunItemOperator(
        task_id="gold_publish",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="pipeline-gold-publish",
        job_type="Pipeline",
        wait_for_termination=True,
        deferrable=True,
    )

    # DAG dependency chain
    bronze_ingest >> silver_transform >> gold_publish

Migration from ADF Airflow

Azure Data Factory's Apache Airflow integration was deprecated in 2025. Microsoft recommends migrating to Fabric's managed Airflow:

⚠️ Consider Before Adopting

Cost: Unlike Data Pipelines (pay-per-execution only), Airflow environments consume CUs while running. Auto-pause mitigates this, but teams with simple, infrequent workflows may not justify the overhead.
Complexity: Airflow adds operational concepts (DAG parsing, scheduler, workers, connections) that simpler workflows don't need.

Wrangler

Data Wrangler

Interactive, notebook-embedded data exploration and transformation β€” generates production-ready code from visual operations.

What Is Data Wrangler?

Data Wrangler is a visual, code-generation tool embedded directly in Fabric notebooks. You open a DataFrame in Wrangler, explore column distributions, apply transformations through a point-and-click interface, and it generates clean PySpark or Pandas code that you paste back into your notebook. It's designed for interactive profiling and prototyping β€” not for scheduled execution.

πŸ” Data Profiling

Automatic column statistics, null counts, distribution charts, and outlier detection β€” without writing a single line of code.

πŸ”„ Visual Transforms

Filter, sort, group, pivot, merge, split columns, handle nulls, change types, and derive columns β€” all through a guided UI.

πŸ’» Code Generation

Every visual operation generates PySpark or Pandas code. Export the full transformation as a reusable function ready for production notebooks.

πŸ“Š Live Preview

See transformation results immediately on a data sample before committing to the full dataset β€” fast iteration without wasted compute.

When to Use Data Wrangler

ScenarioFitWhy
Exploring a new dataset's shape and qualityβœ… IdealVisual profiling is faster than writing df.describe() for every column
Prototyping transformations before writing Spark codeβœ… IdealGenerate correct syntax without memorizing API β€” then refine in notebook
Quick data cleansing (nulls, types, outliers)βœ… GoodPoint-and-click cleaning generates reproducible code
Scheduled production ETL❌ Use Pipelines + NotebooksWrangler is interactive-only β€” it cannot be scheduled or triggered
Loading from external sources (APIs, SaaS)❌ Use Dataflows Gen2Wrangler operates on DataFrames already loaded in memory
Orchestrating multi-step workflows❌ Use Data PipelinesWrangler has no orchestration or dependency management

Best Practices

πŸ’‘ Wrangler vs. Dataflows Gen2

Both are "low-code" β€” but they serve different stages. Data Wrangler is for data scientists and engineers exploring data interactively inside notebooks. Dataflows Gen2 is for scheduled, repeatable ETL that runs without human interaction. If you need a human to click "run," it's Wrangler territory. If it needs to run at 2 AM unattended, use Dataflows or Pipelines.

Decide

Integration Decision Matrix

Choose the right integration approach based on your data source, latency needs, and team skills.

Which Tool Should I Use?

CriteriaShortcutsMirroringDataflows Gen2Data WranglerData PipelinesEventstreams
Data movementNone (zero-copy)CDC replicationBatch copy + transformNone (in-memory)Orchestrated copyStreaming ingestion
LatencyReal-time (reads at source)Near real-time (minutes)Batch (minutes–hours)Interactive (seconds)Batch (minutes–hours)Seconds
Transform capabilityNoneNone (raw replication)Power Query (visual)PySpark/Pandas (generated)Orchestration onlyInline transforms
Execution modelOn-demand readsContinuous CDCScheduled / triggeredInteractive only (human-driven)Scheduled / triggeredContinuous streaming
Skill levelLowLowLow (Power Query)Low–Medium (visual + code)Medium (ADF patterns)Medium–High
Best forCross-domain sharing, multi-cloudDatabase β†’ Lakehouse CDCSaaS ingestion, simple ETLData profiling, transform prototypingProduction orchestrationIoT, logs, streaming
CU consumptionAt query time onlyContinuous (24/7)During executionDuring interactive sessionDuring executionContinuous

Common Combinations

πŸ—οΈ Full Medallion ETL

Mirroring (source β†’ Bronze) + Pipelines (orchestrate Spark notebooks for Silver/Gold) + Shortcuts (share Gold across domains).

☁️ Multi-Cloud Analytics

Shortcuts to ADLS/S3/GCS for data lake consolidation + Dataflows Gen2 for SaaS connectors (Salesforce, SAP) + Pipelines to schedule refresh.

πŸ“Š Real-Time + Historical

Eventstreams for hot-path streaming into Eventhouse + Mirroring for operational databases + Shortcuts to join streaming and batch data.

πŸš€ Quick Start / POC

Mirroring for instant database access + Dataflows Gen2 for quick SaaS loads + Direct Lake reports on mirrored data β€” zero Spark code required.

🎯 Rule of Thumb

Don't move data unless you have to. Start with shortcuts for cross-domain sharing and mirroring for database replication. Add Dataflows Gen2 when you need transformations. Use Data Pipelines when you need production orchestration, scheduling, and error handling.