Data Engineer Data Architect Analytics Engineer β This section covers how to bring data into Fabric using shortcuts, mirroring, Dataflows Gen2, and data pipelines β including when to use each approach.
OneLake Shortcuts
Zero-copy data virtualization β access data where it lives without moving or duplicating it.
This page covers the core integration mechanisms (Shortcuts, Mirroring, Dataflows, Pipelines). For strategic multi-cloud architecture patterns β AWS, GCP, Snowflake, Databricks integration, and hybrid landing zones β see the Multi-Cloud & Hybrid guide.
What Are Shortcuts?
Shortcuts are pointers to data in external or internal locations. They appear as folders in your lakehouse but don't copy data β queries read directly from the source. This enables unified data access across clouds, storage accounts, and Fabric workspaces with no ETL overhead.
Shortcuts are metadata-only references. There is no data movement, no duplication cost, and no sync lag. The data is always read from its original location at query time.
Shortcut Types
| Shortcut Type | Source | Use Case |
|---|---|---|
| Internal | Other Fabric lakehouses, warehouses, or KQL databases | Cross-workspace and cross-domain data sharing within Fabric |
| ADLS Gen2 | Azure Data Lake Storage Gen2 accounts | Connect to existing Azure data lakes without migration |
| Amazon S3 | AWS S3 buckets | Multi-cloud analytics β query AWS data from Fabric |
| Google Cloud Storage | GCS buckets | Multi-cloud analytics β query GCP data from Fabric |
| Dataverse | Microsoft Dataverse tables | Access Dynamics 365 / Power Platform data directly |
| Amazon S3 Compatible | S3-compatible stores (MinIO, etc.) | Connect to on-prem or third-party object storage |
Architecture Patterns with Shortcuts
π Cross-Domain Data Mesh
Each domain publishes a Gold layer in its own workspace. Other domains consume via shortcuts β no data copies, no pipelines to maintain. The producing domain controls access.
βοΈ Multi-Cloud Lakehouse
Create shortcuts to ADLS Gen2, AWS S3, and GCS in a single lakehouse. Run Spark or SQL queries across all sources as if they were local Delta tables.
ποΈ Bronze Layer Virtualization
Instead of copying raw data into Fabric, create shortcuts to existing storage. Only materialize Silver and Gold layers β reducing storage cost and ingestion complexity.
π Shared Reference Data
Centralize reference tables (calendars, currencies, region mappings) in one workspace and share via shortcuts to all consumers. Update once, reflect everywhere.
Security & Access Control
- Authentication: Internal shortcuts use Fabric identity. External shortcuts (ADLS, S3, GCS) use connection credentials configured by an admin.
- Authorization: Access to the shortcut is governed by workspace permissions. The user must also have access to the underlying data source.
- Row-Level Security: RLS defined on the source data applies when queries traverse shortcuts.
- Governance: Shortcuts respect Microsoft Purview sensitivity labels and Information Protection policies.
Performance: Shortcut queries depend on source latency. For high-frequency analytics, consider materializing critical data instead of using shortcuts. Delta format is required for full feature support (time travel, ACID transactions). Non-Delta files (Parquet, CSV) are read-only through shortcuts.
Best Practices
β Use shortcuts for read-heavy, cross-domain sharing (Gold layer consumption) β Use shortcuts to avoid duplicating Bronze data already in ADLS Gen2 β Apply consistent naming: sc_<source>_<dataset> (e.g., sc_sales_orders) β Document shortcuts in a central catalog β they're invisible in lineage views β οΈ Monitor source latency β external shortcuts add network round-trip time β οΈ Test Direct Lake mode compatibility β not all shortcut types support it β Don't use shortcuts for write-back scenarios β they're read-only β Don't chain shortcuts (shortcut β shortcut) β performance degrades β Don't shortcut to volatile/temporary storage locations
Fabric Mirroring
Near real-time data replication from operational databases into OneLake β no ETL pipelines required.
What Is Mirroring?
Mirroring continuously replicates data from external databases into OneLake as Delta tables using Change Data Capture (CDC). Unlike traditional ETL, mirroring is managed by Fabric β you configure the source, and data flows automatically with near real-time latency.
Mirrored data lands in OneLake as Delta tables, which means it's immediately queryable via Spark, SQL, KQL, and Power BI Direct Lake β no additional transformation needed for basic analytics.
Supported Sources
| Source | CDC Method | Typical Latency | Notes |
|---|---|---|---|
| Azure SQL Database | Change tracking / CDC | Near real-time (minutes) | Most common source; full and incremental sync |
| Azure Cosmos DB | Change feed | Near real-time | NoSQL β Delta conversion; analytical workloads on transactional data |
| Snowflake | Streams & tasks | Near real-time | Cross-platform analytics; bring Snowflake data into Fabric ecosystem |
| Azure Database for PostgreSQL | Logical replication | Near real-time | Flexible server required; supports selected tables |
| Azure Database for MySQL | Binlog replication | Near real-time | Flexible server required |
| Azure Databricks | Unity Catalog integration | Varies | Mirror Delta tables from Databricks lakehouse |
| SQL Server (on-prem) | CDC via on-prem gateway | Minutes | Requires on-premises data gateway configuration |
| Oracle Database | LogMiner CDC | Near real-time | GA β supports Oracle on-prem and cloud (19c+); no gateway required for cloud instances |
| SAP Datasphere | Replication flow | Near real-time | GA β mirrors SAP analytical and planning data into OneLake |
| Google BigQuery | Change tracking | Near real-time | Cross-cloud analytics; bring GCP warehouse data into Fabric |
| SharePoint Lists | Change notifications | Minutes | Preview β mirrors business lists for analytics without impacting SharePoint performance |
| Dremio | Lakehouse federation | Near real-time | Preview β mirrors Dremio-managed datasets into OneLake Delta format |
How Mirroring Works
1. Initial Snapshot
When mirroring is first enabled, Fabric takes a full snapshot of the selected tables and writes them as Delta tables in OneLake.
2. Incremental Sync
After the snapshot, CDC captures inserts, updates, and deletes from the source and applies them to the Delta tables continuously.
3. Delta Conversion
Source data is automatically converted to Delta format with proper schema mapping. Complex types are flattened where needed.
4. Query Ready
Mirrored Delta tables are instantly available in the SQL analytics endpoint, Spark notebooks, and Power BI Direct Lake reports.
Architecture Patterns
π Operational Reporting
Mirror production databases to Fabric and build Power BI reports with Direct Lake mode β get near real-time dashboards without impacting the transactional system.
π Database Modernization
Use mirroring as the first step in migrating from SQL Server or Snowflake. Data flows into OneLake immediately while you build the target architecture.
π Multi-Source Consolidation
Mirror multiple databases (Azure SQL, Cosmos DB, PostgreSQL) into a single Fabric workspace. Join across sources using the SQL analytics endpoint.
Capacity cost: Mirroring consumes CU (Capacity Units) for both initial snapshots and ongoing CDC. Monitor usage in the Capacity Metrics app. Schema changes: Some schema modifications at the source (e.g., column drops, type changes) may require re-initialization of the mirror. Latency: While near real-time, mirroring is not sub-second β for true streaming, use Eventstreams instead.
Best Practices
β Start with high-value tables β mirror what's needed, not entire databases β Use the SQL analytics endpoint for ad-hoc queries on mirrored data β Combine mirroring with medallion architecture: mirror β Bronze, transform β Silver/Gold β Monitor replication lag in the Fabric portal and set up alerts β οΈ Plan for initial snapshot time β large tables may take hours on first sync β οΈ Test schema evolution scenarios before going to production β Don't use mirroring for sub-second latency requirements β use Eventstreams instead β Don't mirror tables with frequent schema changes without a re-init plan β Don't skip capacity planning β continuous CDC consumes CUs 24/7
Dataflows Gen2
Low-code data transformation and ingestion powered by Power Query β ideal for citizen integrators and standardized ETL.
What Are Dataflows Gen2?
Dataflows Gen2 provide a Power Query-based, low-code interface for connecting to data sources, applying transformations, and loading data into Fabric destinations (lakehouses, warehouses, or KQL databases). They run on a managed Spark backend and support 150+ connectors.
Key Capabilities
π 150+ Connectors
Connect to databases, SaaS apps, files, APIs, and cloud services. Popular connectors: SQL Server, SharePoint, Salesforce, SAP, REST APIs.
π§© Power Query Transforms
Use the visual Power Query editor for joins, pivots, aggregations, type conversions, conditional columns, and custom M functions.
π¦ Multiple Destinations
Load data into Fabric lakehouses, warehouses, or KQL databases. Supports append, replace, and upsert (merge) load modes.
β‘ Staging Lakehouse
Dataflows Gen2 use an auto-provisioned staging lakehouse for intermediate processing β delivering significantly faster performance than Gen1.
When to Use Dataflows Gen2
| Scenario | Recommended? | Why |
|---|---|---|
| SaaS data ingestion (Salesforce, SAP, SharePoint) | β Yes | Built-in connectors handle auth, pagination, and schema mapping |
| Simple transformations (filter, rename, type cast) | β Yes | Low-code Power Query is faster to build and maintain |
| Citizen integrator / business-user ETL | β Yes | No Spark or SQL skills required β visual drag-and-drop interface |
| Complex multi-step transformations | β οΈ Consider Notebooks | Spark notebooks offer more control for complex logic |
| High-volume data (100M+ rows) | β οΈ Consider Pipelines + Spark | Dataflows may time out or consume excessive CUs on very large datasets |
| Real-time / streaming data | β Use Eventstreams | Dataflows are batch-oriented, not designed for streaming |
Use Dataflows Gen2 for Bronze layer ingestion from SaaS sources and simple transformations. For complex Silver/Gold transformations, use Spark notebooks orchestrated by Data Pipelines.
π Learn More
Create Your First Dataflow Gen2 β Dataflows Gen2 Overview β Power Query Connectors βπ See Also
Best Practices β Data Engineering βData Pipelines
Orchestration and scheduling for data movement and transformation β the backbone of production data flows.
What Are Data Pipelines?
Data Pipelines in Fabric are the orchestration engine (based on Azure Data Factory) for scheduling and coordinating data movement and transformation activities. They connect all other integration components β triggering Dataflows, Spark notebooks, stored procedures, and copy operations in sequence or parallel.
Core Activities
π Copy Activity
High-performance data movement between 100+ sources and Fabric destinations. Supports full and incremental loads with configurable parallelism.
π Notebook Activity
Execute Spark notebooks for complex transformations. Pass parameters, capture outputs, and chain notebooks in sequence.
π Dataflow Activity
Trigger Dataflows Gen2 as a step in a larger pipeline. Combine low-code and code-first transforms in one orchestration.
π¦ Stored Procedure
Execute SQL stored procedures in Fabric warehouses for T-SQL transformations and data loading.
Orchestration Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Sequential Pipeline | Activities run one after another: Copy β Notebook β Stored Procedure | Standard ETL with dependencies between steps |
| Parallel Fan-Out | ForEach activity processes multiple tables or partitions in parallel | Ingesting many tables from the same source simultaneously |
| Master Pipeline | Parent pipeline calls child pipelines (Invoke Pipeline activity) | Large-scale orchestration with reusable sub-pipelines |
| Event-Driven | Pipeline triggered by storage events, schedule, or API call | React to file arrivals or system events |
| Incremental Load | Watermark-based or timestamp-driven loads with Lookup + Copy | Efficient daily/hourly loads from large tables |
Use master/child pipeline patterns for production workloads. The master pipeline handles error handling and notifications, while child pipelines encapsulate individual domain or table-group logic for reusability.
Apache Airflow in Fabric
Managed DAG-based orchestration for complex, multi-step data workflows.
Why Airflow in Fabric?
Microsoft Fabric offers managed Apache Airflow jobs as a first-class Data Factory capability. This replaces the deprecated Azure Data Factory Airflow integration and provides a fully managed, auto-scaling orchestration service β no infrastructure to maintain.
Both tools orchestrate data workflows, but they solve different problems. Choose based on your team's needs and existing investments.
Airflow vs. Data Pipelines β Decision Matrix
| Criteria | Data Pipelines (Native) | Apache Airflow (Managed) | Why It Matters |
|---|---|---|---|
| Authoring model | Visual drag-and-drop UI | Python code (DAGs) | Team skills: Low-code teams prefer Pipelines; Python-fluent teams move faster in Airflow with version control and testing. |
| Dependency complexity | Linear/parallel with ForEach | Arbitrary DAG with complex branching, retries, pools | Infrastructure: Complex multi-hop workflows with conditional logic and backfills are Airflow's strength. |
| Multi-cloud orchestration | Fabric-only (with Web activities for external) | 1,000+ community operators (AWS, GCP, Databricks, etc.) | Architecture: If you orchestrate across clouds or non-Microsoft services, Airflow's ecosystem is unmatched. |
| Testing & CI/CD | Limited (JSON export, REST API) | Full Python unit testing, Git-native, standard CI/CD | Quality: Airflow DAGs can be unit-tested, linted, and peer-reviewed like application code. |
| Cost model | Activity execution CUs only | CUs while Airflow environment is running (auto-pause helps) | Cost: Pipelines cost nothing when idle. Airflow has baseline cost but auto-pauses during inactivity. |
| Learning curve | Low β ADF/Synapse familiarity | Medium β requires Python + Airflow concepts | Adoption: Teams already using Airflow elsewhere can reuse DAGs; greenfield teams may prefer native Pipelines. |
| Existing investment | ADF/Synapse pipelines migrate directly | Existing Airflow DAGs migrate with minimal changes | Migration: Bring your existing orchestration code β don't rewrite from scratch. |
Use native Data Pipelines when your orchestration is straightforward (copy β transform β load) and your team prefers low-code. Use Airflow when you need complex DAG dependencies, backfill capabilities, cross-cloud orchestration, or when your team already has Airflow expertise and existing DAGs to migrate.
Key Capabilities
π FabricRunItemOperator
Native Airflow operator to trigger Fabric items (pipelines, notebooks, warehouses) from within a DAG. Supports deferrable mode for efficient resource usage.
π Auto-Scale & Auto-Pause
Automatically scales workers for concurrent tasks and pauses when idle β you only pay CUs during active DAG execution.
π Entra ID Integration
SSO via Microsoft Entra, workspace-level RBAC, and service principal authentication for automated deployments.
π Built-in Monitoring
Real-time DAG run monitoring, task-level logging, and performance comparison across runs β all in the Fabric UI.
Example: Orchestrating a Medallion Pipeline
from airflow import DAG
from airflow.providers.microsoft.fabric.operators.run_item import FabricRunItemOperator
from datetime import datetime
with DAG(
"medallion_etl",
start_date=datetime(2025, 1, 1),
schedule_interval="@daily",
catchup=True, # enables backfill for historical dates
) as dag:
bronze_ingest = FabricRunItemOperator(
task_id="bronze_ingest",
fabric_conn_id="fabric_prod",
workspace_id="ws-abc-123",
item_id="notebook-bronze-ingest",
job_type="RunNotebook",
wait_for_termination=True,
deferrable=True, # frees Airflow worker while waiting
)
silver_transform = FabricRunItemOperator(
task_id="silver_transform",
fabric_conn_id="fabric_prod",
workspace_id="ws-abc-123",
item_id="notebook-silver-transform",
job_type="RunNotebook",
wait_for_termination=True,
deferrable=True,
)
gold_publish = FabricRunItemOperator(
task_id="gold_publish",
fabric_conn_id="fabric_prod",
workspace_id="ws-abc-123",
item_id="pipeline-gold-publish",
job_type="Pipeline",
wait_for_termination=True,
deferrable=True,
)
# DAG dependency chain
bronze_ingest >> silver_transform >> gold_publish
Migration from ADF Airflow
Azure Data Factory's Apache Airflow integration was deprecated in 2025. Microsoft recommends migrating to Fabric's managed Airflow:
- DAGs are compatible: Move your existing Python DAG files with minimal changes (update connection IDs and operator imports)
- Connections migrate: Reconfigure connections to use Fabric's Entra-based authentication
- Cost benefit: Fabric's auto-pause reduces baseline costs compared to ADF's always-on Airflow environments
Cost: Unlike Data Pipelines (pay-per-execution only), Airflow environments consume CUs while running. Auto-pause mitigates this, but teams with simple, infrequent workflows may not justify the overhead.
Complexity: Airflow adds operational concepts (DAG parsing, scheduler, workers, connections) that simpler workflows don't need.
Data Wrangler
Interactive, notebook-embedded data exploration and transformation β generates production-ready code from visual operations.
What Is Data Wrangler?
Data Wrangler is a visual, code-generation tool embedded directly in Fabric notebooks. You open a DataFrame in Wrangler, explore column distributions, apply transformations through a point-and-click interface, and it generates clean PySpark or Pandas code that you paste back into your notebook. It's designed for interactive profiling and prototyping β not for scheduled execution.
π Data Profiling
Automatic column statistics, null counts, distribution charts, and outlier detection β without writing a single line of code.
π Visual Transforms
Filter, sort, group, pivot, merge, split columns, handle nulls, change types, and derive columns β all through a guided UI.
π» Code Generation
Every visual operation generates PySpark or Pandas code. Export the full transformation as a reusable function ready for production notebooks.
π Live Preview
See transformation results immediately on a data sample before committing to the full dataset β fast iteration without wasted compute.
When to Use Data Wrangler
| Scenario | Fit | Why |
|---|---|---|
| Exploring a new dataset's shape and quality | β Ideal | Visual profiling is faster than writing df.describe() for every column |
| Prototyping transformations before writing Spark code | β Ideal | Generate correct syntax without memorizing API β then refine in notebook |
| Quick data cleansing (nulls, types, outliers) | β Good | Point-and-click cleaning generates reproducible code |
| Scheduled production ETL | β Use Pipelines + Notebooks | Wrangler is interactive-only β it cannot be scheduled or triggered |
| Loading from external sources (APIs, SaaS) | β Use Dataflows Gen2 | Wrangler operates on DataFrames already loaded in memory |
| Orchestrating multi-step workflows | β Use Data Pipelines | Wrangler has no orchestration or dependency management |
Best Practices
- Use Wrangler for exploration, not execution. Generate code in Wrangler, copy it into your notebook cell, then run the notebook in a Pipeline for production.
- Work on samples first. Wrangler loads data into memory β start with
.limit(10000)or a filtered subset for large tables, then apply transforms to the full dataset in Spark. - Export as functions. Use "Export to code" to get a clean transformation function. Wrap it in a reusable module for your team.
- Pair with Dataflows Gen2 for end-to-end flow: Wrangler prototypes the logic β you implement it in a notebook β Pipeline orchestrates the notebook on schedule.
Both are "low-code" β but they serve different stages. Data Wrangler is for data scientists and engineers exploring data interactively inside notebooks. Dataflows Gen2 is for scheduled, repeatable ETL that runs without human interaction. If you need a human to click "run," it's Wrangler territory. If it needs to run at 2 AM unattended, use Dataflows or Pipelines.
Integration Decision Matrix
Choose the right integration approach based on your data source, latency needs, and team skills.
Which Tool Should I Use?
| Criteria | Shortcuts | Mirroring | Dataflows Gen2 | Data Wrangler | Data Pipelines | Eventstreams |
|---|---|---|---|---|---|---|
| Data movement | None (zero-copy) | CDC replication | Batch copy + transform | None (in-memory) | Orchestrated copy | Streaming ingestion |
| Latency | Real-time (reads at source) | Near real-time (minutes) | Batch (minutesβhours) | Interactive (seconds) | Batch (minutesβhours) | Seconds |
| Transform capability | None | None (raw replication) | Power Query (visual) | PySpark/Pandas (generated) | Orchestration only | Inline transforms |
| Execution model | On-demand reads | Continuous CDC | Scheduled / triggered | Interactive only (human-driven) | Scheduled / triggered | Continuous streaming |
| Skill level | Low | Low | Low (Power Query) | LowβMedium (visual + code) | Medium (ADF patterns) | MediumβHigh |
| Best for | Cross-domain sharing, multi-cloud | Database β Lakehouse CDC | SaaS ingestion, simple ETL | Data profiling, transform prototyping | Production orchestration | IoT, logs, streaming |
| CU consumption | At query time only | Continuous (24/7) | During execution | During interactive session | During execution | Continuous |
Common Combinations
ποΈ Full Medallion ETL
Mirroring (source β Bronze) + Pipelines (orchestrate Spark notebooks for Silver/Gold) + Shortcuts (share Gold across domains).
βοΈ Multi-Cloud Analytics
Shortcuts to ADLS/S3/GCS for data lake consolidation + Dataflows Gen2 for SaaS connectors (Salesforce, SAP) + Pipelines to schedule refresh.
π Real-Time + Historical
Eventstreams for hot-path streaming into Eventhouse + Mirroring for operational databases + Shortcuts to join streaming and batch data.
π Quick Start / POC
Mirroring for instant database access + Dataflows Gen2 for quick SaaS loads + Direct Lake reports on mirrored data β zero Spark code required.
Don't move data unless you have to. Start with shortcuts for cross-domain sharing and mirroring for database replication. Add Dataflows Gen2 when you need transformations. Use Data Pipelines when you need production orchestration, scheduling, and error handling.