Audience: Data Engineers, Data Architects, Analytics Engineers

This section covers how to bring data into Fabric using shortcuts, mirroring, Dataflows Gen2, and data pipelines, including when to use each approach.
OneLake Shortcuts
Zero-copy data virtualization: access data where it lives without moving or duplicating it.
What Are Shortcuts?
Shortcuts are pointers to data in external or internal locations. They appear as folders in your lakehouse but don't copy data; queries read directly from the source. This enables unified data access across clouds, storage accounts, and Fabric workspaces with no ETL overhead.
Shortcuts are metadata-only references. There is no data movement, no duplication cost, and no sync lag. The data is always read from its original location at query time.
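The pointer semantics can be sketched in plain Python. This is an illustrative model only, not a Fabric API; the registry, shortcut names, and source paths below are hypothetical.

```python
# Illustrative model: a shortcut is a metadata-only pointer that is
# resolved to its source location at query time. No Fabric API involved;
# all names and URIs here are made up for the sketch.

# Shortcut registry: name -> source URI. Creating a shortcut stores only
# this mapping; no data is copied.
shortcuts = {
    "sc_sales_orders": "abfss://sales@contoso.dfs.core.windows.net/gold/orders",
    "sc_ref_calendar": "s3://shared-ref-data/calendar",
}

def resolve(shortcut_name: str) -> str:
    """Return the source path a query engine would actually read from."""
    try:
        return shortcuts[shortcut_name]
    except KeyError:
        raise LookupError(f"Unknown shortcut: {shortcut_name}")

# Reads always hit the source directly, so there is no sync lag to manage.
print(resolve("sc_sales_orders"))
```

The point of the sketch: deleting or retargeting the registry entry changes what queries see, but never touches the underlying data.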
Shortcut Types
| Shortcut Type | Source | Use Case |
|---|---|---|
| Internal | Other Fabric lakehouses, warehouses, or KQL databases | Cross-workspace and cross-domain data sharing within Fabric |
| ADLS Gen2 | Azure Data Lake Storage Gen2 accounts | Connect to existing Azure data lakes without migration |
| Amazon S3 | AWS S3 buckets | Multi-cloud analytics: query AWS data from Fabric |
| Google Cloud Storage | GCS buckets | Multi-cloud analytics: query GCP data from Fabric |
| Dataverse | Microsoft Dataverse tables | Access Dynamics 365 / Power Platform data directly |
| Amazon S3 Compatible | S3-compatible stores (MinIO, etc.) | Connect to on-prem or third-party object storage |
Architecture Patterns with Shortcuts
Cross-Domain Data Mesh
Each domain publishes a Gold layer in its own workspace. Other domains consume via shortcuts, with no data copies and no pipelines to maintain. The producing domain controls access.
Multi-Cloud Lakehouse
Create shortcuts to ADLS Gen2, AWS S3, and GCS in a single lakehouse. Run Spark or SQL queries across all sources as if they were local Delta tables.
Bronze Layer Virtualization
Instead of copying raw data into Fabric, create shortcuts to existing storage. Only materialize Silver and Gold layers, reducing storage cost and ingestion complexity.
Shared Reference Data
Centralize reference tables (calendars, currencies, region mappings) in one workspace and share via shortcuts to all consumers. Update once, reflect everywhere.
Security & Access Control
- Authentication: Internal shortcuts use Fabric identity. External shortcuts (ADLS, S3, GCS) use connection credentials configured by an admin.
- Authorization: Access to the shortcut is governed by workspace permissions. The user must also have access to the underlying data source.
- Row-Level Security: RLS defined on the source data applies when queries traverse shortcuts.
- Governance: Shortcuts respect Microsoft Purview sensitivity labels and Information Protection policies.
- Performance: Shortcut queries depend on source latency. For high-frequency analytics, consider materializing critical data instead of using shortcuts.
- Format support: Delta format is required for full feature support (time travel, ACID transactions). Non-Delta files (Parquet, CSV) are read-only through shortcuts.
Best Practices
✅ Use shortcuts for read-heavy, cross-domain sharing (Gold layer consumption)
✅ Use shortcuts to avoid duplicating Bronze data already in ADLS Gen2
✅ Apply consistent naming: sc_<source>_<dataset> (e.g., sc_sales_orders)
✅ Document shortcuts in a central catalog; they're invisible in lineage views
⚠️ Monitor source latency; external shortcuts add network round-trip time
⚠️ Test Direct Lake mode compatibility; not all shortcut types support it
❌ Don't use shortcuts for write-back scenarios; they're read-only
❌ Don't chain shortcuts (shortcut → shortcut); performance degrades
❌ Don't shortcut to volatile/temporary storage locations
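The `sc_<source>_<dataset>` naming convention recommended above is easy to enforce with a small helper. This is a hypothetical utility, not part of any Fabric SDK; the regex and function name are assumptions for the sketch.

```python
import re

# Hypothetical helper enforcing the sc_<source>_<dataset> naming
# convention recommended in the best practices above.
SHORTCUT_NAME = re.compile(r"^sc_[a-z0-9]+_[a-z0-9_]+$")

def make_shortcut_name(source: str, dataset: str) -> str:
    """Build a convention-compliant shortcut name, e.g. sc_sales_orders."""
    name = f"sc_{source.lower()}_{dataset.lower()}"
    if not SHORTCUT_NAME.match(name):
        raise ValueError(f"Invalid shortcut name: {name!r}")
    return name

print(make_shortcut_name("Sales", "Orders"))  # sc_sales_orders
```

Validating names at creation time keeps the central catalog searchable by source system.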
Fabric Mirroring
Near real-time data replication from operational databases into OneLake, with no ETL pipelines required.
What Is Mirroring?
Mirroring continuously replicates data from external databases into OneLake as Delta tables using Change Data Capture (CDC). Unlike traditional ETL, mirroring is managed by Fabric: you configure the source, and data flows automatically with near real-time latency.
Mirrored data lands in OneLake as Delta tables, which means it's immediately queryable via Spark, SQL, KQL, and Power BI Direct Lake, with no additional transformation needed for basic analytics.
Supported Sources
| Source | CDC Method | Typical Latency | Notes |
|---|---|---|---|
| Azure SQL Database | Change tracking / CDC | Near real-time (minutes) | Most common source; full and incremental sync |
| Azure Cosmos DB | Change feed | Near real-time | NoSQL → Delta conversion; analytical workloads on transactional data |
| Snowflake | Streams & tasks | Near real-time | Cross-platform analytics; bring Snowflake data into Fabric ecosystem |
| Azure Database for PostgreSQL | Logical replication | Near real-time | Flexible server required; supports selected tables |
| Azure Database for MySQL | Binlog replication | Near real-time | Flexible server required |
| Azure Databricks | Unity Catalog integration | Varies | Mirror Delta tables from Databricks lakehouse |
| SQL Server (on-prem) | CDC via on-prem gateway | Minutes | Requires on-premises data gateway configuration |
How Mirroring Works
1. Initial Snapshot
When mirroring is first enabled, Fabric takes a full snapshot of the selected tables and writes them as Delta tables in OneLake.
2. Incremental Sync
After the snapshot, CDC captures inserts, updates, and deletes from the source and applies them to the Delta tables continuously.
3. Delta Conversion
Source data is automatically converted to Delta format with proper schema mapping. Complex types are flattened where needed.
4. Query Ready
Mirrored Delta tables are instantly available in the SQL analytics endpoint, Spark notebooks, and Power BI Direct Lake reports.
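The four-step flow above (snapshot, then continuous CDC) can be modeled as applying a change feed to a keyed table. This is a simplified in-memory sketch; real mirroring writes Delta tables in OneLake, but the merge semantics are the same.

```python
# Simplified, in-memory model of the mirroring flow: an initial snapshot
# followed by CDC events applied continuously. Real mirroring targets
# Delta tables in OneLake; this only models the merge semantics.

def apply_cdc(table: dict, events: list) -> dict:
    """Apply insert/update/delete events to a table keyed by primary key."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["key"]] = ev["row"]          # upsert by key
        elif ev["op"] == "delete":
            table.pop(ev["key"], None)            # idempotent delete
    return table

# 1. Initial snapshot of the selected source table.
mirror = {1: {"name": "Ada"}, 2: {"name": "Grace"}}

# 2. Incremental sync: CDC events captured after the snapshot.
changes = [
    {"op": "update", "key": 1, "row": {"name": "Ada Lovelace"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Edsger"}},
]
apply_cdc(mirror, changes)
print(mirror)  # {1: {'name': 'Ada Lovelace'}, 3: {'name': 'Edsger'}}
```

Because updates and deletes are applied by key, the mirrored table converges to the source state rather than accumulating history.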
Architecture Patterns
Operational Reporting
Mirror production databases to Fabric and build Power BI reports with Direct Lake mode: get near real-time dashboards without impacting the transactional system.
Database Modernization
Use mirroring as the first step in migrating from SQL Server or Snowflake. Data flows into OneLake immediately while you build the target architecture.
Multi-Source Consolidation
Mirror multiple databases (Azure SQL, Cosmos DB, PostgreSQL) into a single Fabric workspace. Join across sources using the SQL analytics endpoint.
- Capacity cost: Mirroring consumes CU (Capacity Units) for both initial snapshots and ongoing CDC. Monitor usage in the Capacity Metrics app.
- Schema changes: Some schema modifications at the source (e.g., column drops, type changes) may require re-initialization of the mirror.
- Latency: While near real-time, mirroring is not sub-second; for true streaming, use Eventstreams instead.
Best Practices
✅ Start with high-value tables: mirror what's needed, not entire databases
✅ Use the SQL analytics endpoint for ad-hoc queries on mirrored data
✅ Combine mirroring with medallion architecture: mirror → Bronze, transform → Silver/Gold
✅ Monitor replication lag in the Fabric portal and set up alerts
⚠️ Plan for initial snapshot time; large tables may take hours on first sync
⚠️ Test schema evolution scenarios before going to production
❌ Don't use mirroring for sub-second latency requirements; use Eventstreams instead
❌ Don't mirror tables with frequent schema changes without a re-init plan
❌ Don't skip capacity planning; continuous CDC consumes CUs 24/7
Dataflows Gen2
Low-code data transformation and ingestion powered by Power Query, ideal for citizen integrators and standardized ETL.
What Are Dataflows Gen2?
Dataflows Gen2 provide a Power Query-based, low-code interface for connecting to data sources, applying transformations, and loading data into Fabric destinations (lakehouses, warehouses, or KQL databases). They run on a managed Spark backend and support 150+ connectors.
Key Capabilities
150+ Connectors
Connect to databases, SaaS apps, files, APIs, and cloud services. Popular connectors: SQL Server, SharePoint, Salesforce, SAP, REST APIs.
Power Query Transforms
Use the visual Power Query editor for joins, pivots, aggregations, type conversions, conditional columns, and custom M functions.
Multiple Destinations
Load data into Fabric lakehouses, warehouses, or KQL databases. Supports append, replace, and upsert (merge) load modes.
Staging Lakehouse
Dataflows Gen2 use an auto-provisioned staging lakehouse for intermediate processing, delivering significantly faster performance than Gen1.
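The append, replace, and upsert load modes mentioned above have simple, predictable semantics. The sketch below models them over in-memory rows; it is illustrative only, and the key column and function names are assumptions.

```python
# Simplified model of the three destination load modes: append adds rows,
# replace overwrites the table, and upsert merges on a key column.
# Illustrative only; real destinations are lakehouse/warehouse tables.

def load(dest: list, rows: list, mode: str, key: str = "id") -> list:
    if mode == "append":
        return dest + rows
    if mode == "replace":
        return list(rows)
    if mode == "upsert":
        merged = {r[key]: r for r in dest}
        merged.update({r[key]: r for r in rows})   # incoming rows win on key
        return list(merged.values())
    raise ValueError(f"Unknown load mode: {mode}")

dest = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
incoming = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(load(dest, incoming, "upsert"))
```

Upsert is the mode to reach for when reruns must be idempotent: loading the same batch twice leaves the destination unchanged.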
When to Use Dataflows Gen2
| Scenario | Recommended? | Why |
|---|---|---|
| SaaS data ingestion (Salesforce, SAP, SharePoint) | ✅ Yes | Built-in connectors handle auth, pagination, and schema mapping |
| Simple transformations (filter, rename, type cast) | ✅ Yes | Low-code Power Query is faster to build and maintain |
| Citizen integrator / business-user ETL | ✅ Yes | No Spark or SQL skills required; visual drag-and-drop interface |
| Complex multi-step transformations | ⚠️ Consider Notebooks | Spark notebooks offer more control for complex logic |
| High-volume data (100M+ rows) | ⚠️ Consider Pipelines + Spark | Dataflows may time out or consume excessive CUs on very large datasets |
| Real-time / streaming data | ❌ Use Eventstreams | Dataflows are batch-oriented, not designed for streaming |
Use Dataflows Gen2 for Bronze layer ingestion from SaaS sources and simple transformations. For complex Silver/Gold transformations, use Spark notebooks orchestrated by Data Pipelines.
Data Pipelines
Orchestration and scheduling for data movement and transformation: the backbone of production data flows.
What Are Data Pipelines?
Data Pipelines in Fabric are the orchestration engine (based on Azure Data Factory) for scheduling and coordinating data movement and transformation activities. They connect all other integration components, triggering Dataflows, Spark notebooks, stored procedures, and copy operations in sequence or parallel.
Core Activities
Copy Activity
High-performance data movement between 100+ sources and Fabric destinations. Supports full and incremental loads with configurable parallelism.
Notebook Activity
Execute Spark notebooks for complex transformations. Pass parameters, capture outputs, and chain notebooks in sequence.
Dataflow Activity
Trigger Dataflows Gen2 as a step in a larger pipeline. Combine low-code and code-first transforms in one orchestration.
Stored Procedure
Execute SQL stored procedures in Fabric warehouses for T-SQL transformations and data loading.
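A pipeline coordinates these activities with dependency and failure rules. The sketch below models sequential execution with failure short-circuiting; it is a toy model, not the pipeline runtime, and the activity names are hypothetical.

```python
# Toy model of sequential pipeline orchestration: each activity runs in
# order, and a failure stops downstream activities. Real pipelines define
# this via the visual designer or pipeline JSON; names here are made up.

def run_pipeline(activities):
    results = {}
    for name, fn in activities:
        try:
            results[name] = ("Succeeded", fn())
        except Exception as exc:
            results[name] = ("Failed", str(exc))
            break  # on-failure: skip everything downstream
    return results

pipeline = [
    ("copy_orders", lambda: 1000),         # Copy activity: rows copied
    ("transform_notebook", lambda: "ok"),  # Notebook activity
    ("load_proc", lambda: "done"),         # Stored procedure activity
]
print(run_pipeline(pipeline))
```

In a real pipeline the same effect comes from "on success" dependencies between activities, plus "on failure" branches for alerting.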
Orchestration Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Sequential Pipeline | Activities run one after another: Copy → Notebook → Stored Procedure | Standard ETL with dependencies between steps |
| Parallel Fan-Out | ForEach activity processes multiple tables or partitions in parallel | Ingesting many tables from the same source simultaneously |
| Master Pipeline | Parent pipeline calls child pipelines (Invoke Pipeline activity) | Large-scale orchestration with reusable sub-pipelines |
| Event-Driven | Pipeline triggered by storage events, schedule, or API call | React to file arrivals or system events |
| Incremental Load | Watermark-based or timestamp-driven loads with Lookup + Copy | Efficient daily/hourly loads from large tables |
Use master/child pipeline patterns for production workloads. The master pipeline handles error handling and notifications, while child pipelines encapsulate individual domain or table-group logic for reusability.
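The watermark-based incremental load pattern from the table above (Lookup + Copy) reduces each run to only the rows changed since the last load. Below is a minimal sketch over in-memory rows; the column names and stored watermark are assumptions for illustration.

```python
# Sketch of the watermark-based incremental load pattern: a Lookup reads
# the last high-water mark, Copy pulls only newer rows, and the watermark
# is then advanced. Source rows here are in-memory stand-ins; in a real
# pipeline the watermark lives in a control table.

source = [
    {"id": 1, "modified": "2024-01-01"},
    {"id": 2, "modified": "2024-01-05"},
    {"id": 3, "modified": "2024-01-09"},
]
state = {"watermark": "2024-01-04"}   # Lookup: last successful load point

def incremental_load(rows, state):
    # ISO-8601 date strings compare correctly as plain strings.
    new_rows = [r for r in rows if r["modified"] > state["watermark"]]
    if new_rows:
        state["watermark"] = max(r["modified"] for r in new_rows)  # advance
    return new_rows

loaded = incremental_load(source, state)
print(len(loaded), state["watermark"])  # 2 2024-01-09
```

Advancing the watermark only after a successful load keeps the pattern safe to rerun: a failed run reprocesses the same window instead of skipping rows.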
Integration Decision Matrix
Choose the right integration approach based on your data source, latency needs, and team skills.
Which Tool Should I Use?
| Criteria | Shortcuts | Mirroring | Dataflows Gen2 | Data Pipelines | Eventstreams |
|---|---|---|---|---|---|
| Data movement | None (zero-copy) | CDC replication | Batch copy + transform | Orchestrated copy | Streaming ingestion |
| Latency | Real-time (reads at source) | Near real-time (minutes) | Batch (minutes–hours) | Batch (minutes–hours) | Seconds |
| Transform capability | None | None (raw replication) | Power Query (visual) | Orchestration only | Inline transforms |
| Skill level | Low | Low | Low (Power Query) | Medium (ADF patterns) | Medium–High |
| Best for | Cross-domain sharing, multi-cloud | Database → Lakehouse CDC | SaaS ingestion, simple ETL | Production orchestration | IoT, logs, streaming |
| CU consumption | At query time only | Continuous (24/7) | During execution | During execution | Continuous |
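The matrix above can be condensed into a first-pass rule of thumb. The function below is an illustrative simplification, not official guidance; its inputs and ordering are assumptions, and real decisions also weigh cost, skills, and governance.

```python
# Illustrative first-pass tool chooser condensing the decision matrix
# above. The inputs and priority order are simplifications for the sketch.

def choose_tool(needs_streaming: bool, source_is_database: bool,
                needs_transforms: bool, data_already_in_lake: bool) -> str:
    if needs_streaming:
        return "Eventstreams"       # seconds-level latency
    if data_already_in_lake:
        return "Shortcuts"          # zero-copy: point at it, don't move it
    if source_is_database and not needs_transforms:
        return "Mirroring"          # managed CDC into OneLake
    if needs_transforms:
        return "Dataflows Gen2"     # low-code Power Query ETL
    return "Data Pipelines"         # orchestrated copy and scheduling

print(choose_tool(False, True, False, False))  # Mirroring
```

Treat the output as a starting point for the patterns in Common Combinations below, not a final architecture decision.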
Common Combinations
Full Medallion ETL
Mirroring (source → Bronze) + Pipelines (orchestrate Spark notebooks for Silver/Gold) + Shortcuts (share Gold across domains).
Multi-Cloud Analytics
Shortcuts to ADLS/S3/GCS for data lake consolidation + Dataflows Gen2 for SaaS connectors (Salesforce, SAP) + Pipelines to schedule refresh.
Real-Time + Historical
Eventstreams for hot-path streaming into Eventhouse + Mirroring for operational databases + Shortcuts to join streaming and batch data.
Quick Start / POC
Mirroring for instant database access + Dataflows Gen2 for quick SaaS loads + Direct Lake reports on mirrored data, with zero Spark code required.
Don't move data unless you have to. Start with shortcuts for cross-domain sharing and mirroring for database replication. Add Dataflows Gen2 when you need transformations. Use Data Pipelines when you need production orchestration, scheduling, and error handling.