Data Integration — Microsoft Fabric Guide

Shortcuts

OneLake Shortcuts

Zero-copy data virtualization — access data where it lives without moving or duplicating it.

🌐 Multi-Cloud & Hybrid Patterns

This page covers the core integration mechanisms (Shortcuts, Mirroring, Dataflows, Pipelines). For strategic multi-cloud architecture patterns — AWS, GCP, Snowflake, Databricks integration, and hybrid landing zones — see the Multi-Cloud & Hybrid guide.

What Are Shortcuts?

Shortcuts are pointers to data in external or internal locations. They appear as folders in your lakehouse but don't copy data — queries read directly from the source. This enables unified data access across clouds, storage accounts, and Fabric workspaces with no ETL overhead.

💡 Key Concept

Shortcuts are metadata-only references. There is no data movement, no duplication cost, and no sync lag. The data is always read from its original location at query time.

Shortcut Types

Shortcut Type	Source	Use Case
Internal	Other Fabric lakehouses, warehouses, or KQL databases	Cross-workspace and cross-domain data sharing within Fabric
ADLS Gen2	Azure Data Lake Storage Gen2 accounts	Connect to existing Azure data lakes without migration
Amazon S3	AWS S3 buckets	Multi-cloud analytics — query AWS data from Fabric
Google Cloud Storage	GCS buckets	Multi-cloud analytics — query GCP data from Fabric
Dataverse	Microsoft Dataverse tables	Access Dynamics 365 / Power Platform data directly
Amazon S3 Compatible	S3-compatible stores (MinIO, etc.)	Connect to on-prem or third-party object storage

Architecture Patterns with Shortcuts

🔗 Cross-Domain Data Mesh

Each domain publishes a Gold layer in its own workspace. Other domains consume via shortcuts — no data copies, no pipelines to maintain. The producing domain controls access.

☁️ Multi-Cloud Lakehouse

Create shortcuts to ADLS Gen2, AWS S3, and GCS in a single lakehouse. Run Spark or SQL queries across all sources as if they were local Delta tables.

🏗️ Bronze Layer Virtualization

Instead of copying raw data into Fabric, create shortcuts to existing storage. Only materialize Silver and Gold layers — reducing storage cost and ingestion complexity.

📊 Shared Reference Data

Centralize reference tables (calendars, currencies, region mappings) in one workspace and share via shortcuts to all consumers. Update once, reflect everywhere.

Security & Access Control

Authentication: Internal shortcuts use Fabric identity. External shortcuts (ADLS, S3, GCS) use connection credentials configured by an admin.
Authorization: Access to the shortcut is governed by workspace permissions. The user must also have access to the underlying data source.
Row-Level Security: RLS defined on the source data applies when queries traverse shortcuts.
Governance: Shortcuts respect Microsoft Purview sensitivity labels and Information Protection policies.

⚠️ Important Considerations

Performance: Shortcut queries depend on source latency. For high-frequency analytics, consider materializing critical data instead of using shortcuts. Delta format is required for full feature support (time travel, ACID transactions). Non-Delta files (Parquet, CSV) are read-only through shortcuts.

Best Practices

Shortcuts — Do's and Don'ts

✅ Use shortcuts for read-heavy, cross-domain sharing (Gold layer consumption)
✅ Use shortcuts to avoid duplicating Bronze data already in ADLS Gen2
✅ Apply consistent naming: sc_<source>_<dataset> (e.g., sc_sales_orders)
✅ Document shortcuts in a central catalog — they're invisible in lineage views

⚠️  Monitor source latency — external shortcuts add network round-trip time
⚠️  Test Direct Lake mode compatibility — not all shortcut types support it

❌ Don't use shortcuts for write-back scenarios — they're read-only
❌ Don't chain shortcuts (shortcut → shortcut) — performance degrades
❌ Don't shortcut to volatile/temporary storage locations

📚 Learn More

OneLake Shortcuts ↗ Create ADLS Gen2 Shortcut ↗ Create S3 Shortcut ↗ Create GCS Shortcut ↗

Mirroring

Fabric Mirroring

Near real-time data replication from operational databases into OneLake — no ETL pipelines required.

What Is Mirroring?

Mirroring continuously replicates data from external databases into OneLake as Delta tables using Change Data Capture (CDC). Unlike traditional ETL, mirroring is managed by Fabric — you configure the source, and data flows automatically with near real-time latency.

💡 Key Concept

Mirrored data lands in OneLake as Delta tables, which means it's immediately queryable via Spark, SQL, KQL, and Power BI Direct Lake — no additional transformation needed for basic analytics.

Supported Sources

Source	CDC Method	Typical Latency	Notes
Azure SQL Database	Change tracking / CDC	Near real-time (minutes)	Most common source; full and incremental sync
Azure Cosmos DB	Change feed	Near real-time	NoSQL → Delta conversion; analytical workloads on transactional data
Snowflake	Streams & tasks	Near real-time	Cross-platform analytics; bring Snowflake data into Fabric ecosystem
Azure Database for PostgreSQL	Logical replication	Near real-time	Flexible server required; supports selected tables
Azure Database for MySQL	Binlog replication	Near real-time	Flexible server required
Azure Databricks	Unity Catalog integration	Varies	Mirror Delta tables from Databricks lakehouse
SQL Server (on-prem)	CDC via on-prem gateway	Minutes	Requires on-premises data gateway configuration
Oracle Database	LogMiner CDC	Near real-time	GA — supports Oracle on-prem and cloud (19c+); no gateway required for cloud instances
SAP Datasphere	Replication flow	Near real-time	GA — mirrors SAP analytical and planning data into OneLake
Google BigQuery	Change tracking	Near real-time	Cross-cloud analytics; bring GCP warehouse data into Fabric
SharePoint Lists	Change notifications	Minutes	Preview — mirrors business lists for analytics without impacting SharePoint performance
Dremio	Lakehouse federation	Near real-time	Preview — mirrors Dremio-managed datasets into OneLake Delta format

How Mirroring Works

1. Initial Snapshot

When mirroring is first enabled, Fabric takes a full snapshot of the selected tables and writes them as Delta tables in OneLake.

2. Incremental Sync

After the snapshot, CDC captures inserts, updates, and deletes from the source and applies them to the Delta tables continuously.

3. Delta Conversion

Source data is automatically converted to Delta format with proper schema mapping. Complex types are flattened where needed.

4. Query Ready

Mirrored Delta tables are instantly available in the SQL analytics endpoint, Spark notebooks, and Power BI Direct Lake reports.

Architecture Patterns

📊 Operational Reporting

Mirror production databases to Fabric and build Power BI reports with Direct Lake mode — get near real-time dashboards without impacting the transactional system.

🔄 Database Modernization

Use mirroring as the first step in migrating from SQL Server or Snowflake. Data flows into OneLake immediately while you build the target architecture.

🏭 Multi-Source Consolidation

Mirror multiple databases (Azure SQL, Cosmos DB, PostgreSQL) into a single Fabric workspace. Join across sources using the SQL analytics endpoint.

⚠️ Important Considerations

Capacity cost: Mirroring consumes CU (Capacity Units) for both initial snapshots and ongoing CDC. Monitor usage in the Capacity Metrics app. Schema changes: Some schema modifications at the source (e.g., column drops, type changes) may require re-initialization of the mirror. Latency: While near real-time, mirroring is not sub-second — for true streaming, use Eventstreams instead.

Best Practices

Mirroring — Do's and Don'ts

✅ Start with high-value tables — mirror what's needed, not entire databases
✅ Use the SQL analytics endpoint for ad-hoc queries on mirrored data
✅ Combine mirroring with medallion architecture: mirror → Bronze, transform → Silver/Gold
✅ Monitor replication lag in the Fabric portal and set up alerts

⚠️  Plan for initial snapshot time — large tables may take hours on first sync
⚠️  Test schema evolution scenarios before going to production

❌ Don't use mirroring for sub-second latency requirements — use Eventstreams instead
❌ Don't mirror tables with frequent schema changes without a re-init plan
❌ Don't skip capacity planning — continuous CDC consumes CUs 24/7

📚 Learn More

Mirrored Databases Overview ↗ Mirror Azure SQL Database ↗ Mirror Cosmos DB ↗ Mirror Snowflake ↗

Dataflows

Dataflows Gen2

Low-code data transformation and ingestion powered by Power Query — ideal for citizen integrators and standardized ETL.

What Are Dataflows Gen2?

Dataflows Gen2 provide a Power Query-based, low-code interface for connecting to data sources, applying transformations, and loading data into Fabric destinations (lakehouses, warehouses, or KQL databases). They run on a managed Spark backend and support 150+ connectors.

Key Capabilities

🔌 150+ Connectors

Connect to databases, SaaS apps, files, APIs, and cloud services. Popular connectors: SQL Server, SharePoint, Salesforce, SAP, REST APIs.

🧩 Power Query Transforms

Use the visual Power Query editor for joins, pivots, aggregations, type conversions, conditional columns, and custom M functions.

📦 Multiple Destinations

Load data into Fabric lakehouses, warehouses, or KQL databases. Supports append, replace, and upsert (merge) load modes.

⚡ Staging Lakehouse

Dataflows Gen2 use an auto-provisioned staging lakehouse for intermediate processing — delivering significantly faster performance than Gen1.

When to Use Dataflows Gen2

Scenario	Recommended?	Why
SaaS data ingestion (Salesforce, SAP, SharePoint)	✅ Yes	Built-in connectors handle auth, pagination, and schema mapping
Simple transformations (filter, rename, type cast)	✅ Yes	Low-code Power Query is faster to build and maintain
Citizen integrator / business-user ETL	✅ Yes	No Spark or SQL skills required — visual drag-and-drop interface
Complex multi-step transformations	⚠️ Consider Notebooks	Spark notebooks offer more control for complex logic
High-volume data (100M+ rows)	⚠️ Consider Pipelines + Spark	Dataflows may time out or consume excessive CUs on very large datasets
Real-time / streaming data	❌ Use Eventstreams	Dataflows are batch-oriented, not designed for streaming

✅ Best Practice

Use Dataflows Gen2 for Bronze layer ingestion from SaaS sources and simple transformations. For complex Silver/Gold transformations, use Spark notebooks orchestrated by Data Pipelines.

📚 Learn More

Create Your First Dataflow Gen2 ↗ Dataflows Gen2 Overview ↗ Power Query Connectors ↗

📎 See Also

Best Practices — Data Engineering →

Pipelines

Data Pipelines

Orchestration and scheduling for data movement and transformation — the backbone of production data flows.

What Are Data Pipelines?

Data Pipelines in Fabric are the orchestration engine (based on Azure Data Factory) for scheduling and coordinating data movement and transformation activities. They connect all other integration components — triggering Dataflows, Spark notebooks, stored procedures, and copy operations in sequence or parallel.

Core Activities

📋 Copy Activity

High-performance data movement between 100+ sources and Fabric destinations. Supports full and incremental loads with configurable parallelism.

📓 Notebook Activity

Execute Spark notebooks for complex transformations. Pass parameters, capture outputs, and chain notebooks in sequence.

🔄 Dataflow Activity

Trigger Dataflows Gen2 as a step in a larger pipeline. Combine low-code and code-first transforms in one orchestration.

📦 Stored Procedure

Execute SQL stored procedures in Fabric warehouses for T-SQL transformations and data loading.

Orchestration Patterns

Pattern	Description	When to Use
Sequential Pipeline	Activities run one after another: Copy → Notebook → Stored Procedure	Standard ETL with dependencies between steps
Parallel Fan-Out	ForEach activity processes multiple tables or partitions in parallel	Ingesting many tables from the same source simultaneously
Master Pipeline	Parent pipeline calls child pipelines (Invoke Pipeline activity)	Large-scale orchestration with reusable sub-pipelines
Event-Driven	Pipeline triggered by storage events, schedule, or API call	React to file arrivals or system events
Incremental Load	Watermark-based or timestamp-driven loads with Lookup + Copy	Efficient daily/hourly loads from large tables

✅ Best Practice

Use master/child pipeline patterns for production workloads. The master pipeline handles error handling and notifications, while child pipelines encapsulate individual domain or table-group logic for reusability.

📚 Learn More

Pipeline Activities ↗ Create Your First Pipeline ↗ Monitor Pipeline Runs ↗

Airflow

Apache Airflow in Fabric

Managed DAG-based orchestration for complex, multi-step data workflows.

Why Airflow in Fabric?

Microsoft Fabric offers managed Apache Airflow jobs as a first-class Data Factory capability. This replaces the deprecated Azure Data Factory Airflow integration and provides a fully managed, auto-scaling orchestration service — no infrastructure to maintain.

🤔 When to choose Airflow over native Data Pipelines

Both tools orchestrate data workflows, but they solve different problems. Choose based on your team's needs and existing investments.

Airflow vs. Data Pipelines — Decision Matrix

Criteria	Data Pipelines (Native)	Apache Airflow (Managed)	Why It Matters
Authoring model	Visual drag-and-drop UI	Python code (DAGs)	Team skills: Low-code teams prefer Pipelines; Python-fluent teams move faster in Airflow with version control and testing.
Dependency complexity	Linear/parallel with ForEach	Arbitrary DAG with complex branching, retries, pools	Infrastructure: Complex multi-hop workflows with conditional logic and backfills are Airflow's strength.
Multi-cloud orchestration	Fabric-only (with Web activities for external)	1,000+ community operators (AWS, GCP, Databricks, etc.)	Architecture: If you orchestrate across clouds or non-Microsoft services, Airflow's ecosystem is unmatched.
Testing & CI/CD	Limited (JSON export, REST API)	Full Python unit testing, Git-native, standard CI/CD	Quality: Airflow DAGs can be unit-tested, linted, and peer-reviewed like application code.
Cost model	Activity execution CUs only	CUs while Airflow environment is running (auto-pause helps)	Cost: Pipelines cost nothing when idle. Airflow has baseline cost but auto-pauses during inactivity.
Learning curve	Low — ADF/Synapse familiarity	Medium — requires Python + Airflow concepts	Adoption: Teams already using Airflow elsewhere can reuse DAGs; greenfield teams may prefer native Pipelines.
Existing investment	ADF/Synapse pipelines migrate directly	Existing Airflow DAGs migrate with minimal changes	Migration: Bring your existing orchestration code — don't rewrite from scratch.

✅ Recommendation

Use native Data Pipelines when your orchestration is straightforward (copy → transform → load) and your team prefers low-code. Use Airflow when you need complex DAG dependencies, backfill capabilities, cross-cloud orchestration, or when your team already has Airflow expertise and existing DAGs to migrate.

Key Capabilities

🔄 FabricRunItemOperator

Native Airflow operator to trigger Fabric items (pipelines, notebooks, warehouses) from within a DAG. Supports deferrable mode for efficient resource usage.

📈 Auto-Scale & Auto-Pause

Automatically scales workers for concurrent tasks and pauses when idle — you only pay CUs during active DAG execution.

🔐 Entra ID Integration

SSO via Microsoft Entra, workspace-level RBAC, and service principal authentication for automated deployments.

📊 Built-in Monitoring

Real-time DAG run monitoring, task-level logging, and performance comparison across runs — all in the Fabric UI.

Example: Orchestrating a Medallion Pipeline

DAG triggering Fabric notebooks and pipelines

from airflow import DAG
from airflow.providers.microsoft.fabric.operators.run_item import FabricRunItemOperator
from datetime import datetime

with DAG(
    "medallion_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=True,          # enables backfill for historical dates
) as dag:

    bronze_ingest = FabricRunItemOperator(
        task_id="bronze_ingest",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="notebook-bronze-ingest",
        job_type="RunNotebook",
        wait_for_termination=True,
        deferrable=True,   # frees Airflow worker while waiting
    )

    silver_transform = FabricRunItemOperator(
        task_id="silver_transform",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="notebook-silver-transform",
        job_type="RunNotebook",
        wait_for_termination=True,
        deferrable=True,
    )

    gold_publish = FabricRunItemOperator(
        task_id="gold_publish",
        fabric_conn_id="fabric_prod",
        workspace_id="ws-abc-123",
        item_id="pipeline-gold-publish",
        job_type="Pipeline",
        wait_for_termination=True,
        deferrable=True,
    )

    # DAG dependency chain
    bronze_ingest >> silver_transform >> gold_publish

Migration from ADF Airflow

Azure Data Factory's Apache Airflow integration was deprecated in 2025. Microsoft recommends migrating to Fabric's managed Airflow:

DAGs are compatible: Move your existing Python DAG files with minimal changes (update connection IDs and operator imports)
Connections migrate: Reconfigure connections to use Fabric's Entra-based authentication
Cost benefit: Fabric's auto-pause reduces baseline costs compared to ADF's always-on Airflow environments

⚠️ Consider Before Adopting

Cost: Unlike Data Pipelines (pay-per-execution only), Airflow environments consume CUs while running. Auto-pause mitigates this, but teams with simple, infrequent workflows may not justify the overhead.
Complexity: Airflow adds operational concepts (DAG parsing, scheduler, workers, connections) that simpler workflows don't need.

📚 Learn More

Apache Airflow Jobs in Fabric ↗ Migrate from ADF Airflow ↗ Fabric Airflow Plugin (PyPI) ↗

Wrangler

Data Wrangler

Interactive, notebook-embedded data exploration and transformation — generates production-ready code from visual operations.

What Is Data Wrangler?

Data Wrangler is a visual, code-generation tool embedded directly in Fabric notebooks. You open a DataFrame in Wrangler, explore column distributions, apply transformations through a point-and-click interface, and it generates clean PySpark or Pandas code that you paste back into your notebook. It's designed for interactive profiling and prototyping — not for scheduled execution.

🔍 Data Profiling

Automatic column statistics, null counts, distribution charts, and outlier detection — without writing a single line of code.

🔄 Visual Transforms

Filter, sort, group, pivot, merge, split columns, handle nulls, change types, and derive columns — all through a guided UI.

💻 Code Generation

Every visual operation generates PySpark or Pandas code. Export the full transformation as a reusable function ready for production notebooks.

📊 Live Preview

See transformation results immediately on a data sample before committing to the full dataset — fast iteration without wasted compute.

When to Use Data Wrangler

Scenario	Fit	Why
Exploring a new dataset's shape and quality	✅ Ideal	Visual profiling is faster than writing df.describe() for every column
Prototyping transformations before writing Spark code	✅ Ideal	Generate correct syntax without memorizing API — then refine in notebook
Quick data cleansing (nulls, types, outliers)	✅ Good	Point-and-click cleaning generates reproducible code
Scheduled production ETL	❌ Use Pipelines + Notebooks	Wrangler is interactive-only — it cannot be scheduled or triggered
Loading from external sources (APIs, SaaS)	❌ Use Dataflows Gen2	Wrangler operates on DataFrames already loaded in memory
Orchestrating multi-step workflows	❌ Use Data Pipelines	Wrangler has no orchestration or dependency management

Best Practices

Use Wrangler for exploration, not execution. Generate code in Wrangler, copy it into your notebook cell, then run the notebook in a Pipeline for production.
Work on samples first. Wrangler loads data into memory — start with .limit(10000) or a filtered subset for large tables, then apply transforms to the full dataset in Spark.
Export as functions. Use "Export to code" to get a clean transformation function. Wrap it in a reusable module for your team.
Pair with Dataflows Gen2 for end-to-end flow: Wrangler prototypes the logic → you implement it in a notebook → Pipeline orchestrates the notebook on schedule.

💡 Wrangler vs. Dataflows Gen2

Both are "low-code" — but they serve different stages. Data Wrangler is for data scientists and engineers exploring data interactively inside notebooks. Dataflows Gen2 is for scheduled, repeatable ETL that runs without human interaction. If you need a human to click "run," it's Wrangler territory. If it needs to run at 2 AM unattended, use Dataflows or Pipelines.

📚 Learn More

Data Wrangler in Fabric ↗ Data Wrangler Quickstart ↗

Decide

Integration Decision Matrix

Choose the right integration approach based on your data source, latency needs, and team skills.

Which Tool Should I Use?

Criteria	Shortcuts	Mirroring	Dataflows Gen2	Data Wrangler	Data Pipelines	Eventstreams
Data movement	None (zero-copy)	CDC replication	Batch copy + transform	None (in-memory)	Orchestrated copy	Streaming ingestion
Latency	Real-time (reads at source)	Near real-time (minutes)	Batch (minutes–hours)	Interactive (seconds)	Batch (minutes–hours)	Seconds
Transform capability	None	None (raw replication)	Power Query (visual)	PySpark/Pandas (generated)	Orchestration only	Inline transforms
Execution model	On-demand reads	Continuous CDC	Scheduled / triggered	Interactive only (human-driven)	Scheduled / triggered	Continuous streaming
Skill level	Low	Low	Low (Power Query)	Low–Medium (visual + code)	Medium (ADF patterns)	Medium–High
Best for	Cross-domain sharing, multi-cloud	Database → Lakehouse CDC	SaaS ingestion, simple ETL	Data profiling, transform prototyping	Production orchestration	IoT, logs, streaming
CU consumption	At query time only	Continuous (24/7)	During execution	During interactive session	During execution	Continuous

Common Combinations

🏗️ Full Medallion ETL

Mirroring (source → Bronze) + Pipelines (orchestrate Spark notebooks for Silver/Gold) + Shortcuts (share Gold across domains).

☁️ Multi-Cloud Analytics

Shortcuts to ADLS/S3/GCS for data lake consolidation + Dataflows Gen2 for SaaS connectors (Salesforce, SAP) + Pipelines to schedule refresh.

📊 Real-Time + Historical

Eventstreams for hot-path streaming into Eventhouse + Mirroring for operational databases + Shortcuts to join streaming and batch data.

🚀 Quick Start / POC

Mirroring for instant database access + Dataflows Gen2 for quick SaaS loads + Direct Lake reports on mirrored data — zero Spark code required.

🎯 Rule of Thumb

Don't move data unless you have to. Start with shortcuts for cross-domain sharing and mirroring for database replication. Add Dataflows Gen2 when you need transformations. Use Data Pipelines when you need production orchestration, scheduling, and error handling.