Monitoring & Observability — Microsoft Fabric Guide

👤 Who is this for?

IT Admin, Platform Owner, Data Engineer: This guide brings together capacity telemetry, workspace health, event-driven alerts, Log Analytics integration, and operational response patterns for Microsoft Fabric.

Monitoring

Monitoring Overview

A unified observability strategy for Microsoft Fabric capacities, workloads, and security signals.

Microsoft Fabric is a SaaS analytics platform, which changes how operations teams think about troubleshooting. You do not patch servers, collect OS counters, or SSH into worker nodes. Instead, reliable operations come from understanding the telemetry Fabric exposes: capacity usage, workload execution signals, storage activity, and audit trails. The teams that succeed in Fabric are the teams that treat observability as a product, not as a last-minute dashboard.

A strong monitoring strategy in Fabric connects what users feel to what the platform is doing behind the scenes. When a pipeline misses its SLA, a report becomes slow, or a workspace starts consuming unexpected capacity units, you need a view that combines platform-level pressure with workload-level evidence. That is why unified monitoring matters: one signal tells you that something is wrong, but multiple signals tell you why.

In practice, Fabric observability usually rests on three pillars. Capacity monitoring tells you whether your shared compute pool is under pressure. Workload monitoring shows which artifacts, jobs, and interactive operations are creating the load. Security and audit monitoring reveals who changed configurations, accessed sensitive assets, or triggered unusual behavior. Taken together, these pillars let platform owners move from reactive firefighting to proactive operations.

📈 Capacity Monitoring

Track CU consumption, smoothing, throttling, and workload mix so you can understand whether the shared Fabric capacity is healthy and appropriately sized.

⚙️ Workload Monitoring

Watch pipeline runs, notebook jobs, refreshes, queries, and item-level execution patterns to pinpoint the workloads that affect user experience.

🔐 Security & Audit Monitoring

Capture audit events, configuration changes, access patterns, and suspicious activity so governance and security teams have operational visibility.

💡 Operating model mindset

Fabric monitoring works best when platform teams define ownership in advance: who watches capacity, who owns workspace SLAs, who triages security signals, and which alerts are informational versus urgent.

Capacity

Capacity Metrics App

The primary operational view for CU consumption, throttling behavior, and workload breakdown by item.

The Capacity Metrics App is the core monitoring tool for Fabric administrators and capacity owners. It is a Power BI app published by Microsoft that visualizes how a capacity spends compute units over time, which workloads are driving that spend, and whether the capacity is being throttled. If you want to understand the health of an F SKU capacity, this is usually the first place to look.

Installation is straightforward: deploy the app from Microsoft, connect it to the target capacity, and make sure the right admins or delegated operators can access it. Once configured, the app becomes the daily control room for shared compute. Teams use it to check whether spikes are expected, whether a single workspace is dominating consumption, and whether an upgrade, pause strategy, or workload redesign is needed.

The most important limitation is the 14-day rolling retention window. That makes the app excellent for active operations, but weak for long-term trending unless you export or archive the data elsewhere. Treat it as the source for short-horizon operational decisions, not as the only historical record of your environment.

How to install and configure

Install the app from Microsoft and bind it to the Fabric or Power BI capacity you want to monitor.
Confirm that capacity admins can access the report and that the dataset refreshes successfully.
Review workload mappings so platform owners understand which teams and workspaces correspond to the major usage spikes.
Decide whether you need an export/archive process to preserve data beyond the built-in retention period.

🕒 Timepoint page

Use this page to inspect a specific time slice, correlate spikes with events, and see how interactive and background operations contributed to total consumption.

📦 Items page

Break usage down by item so you can identify the exact notebook, semantic model, pipeline, report, or workspace that is creating pressure on the capacity.

🚦 Throttling indicators

Watch the throttling visuals closely; they tell you when the capacity is moving from healthy burst behavior into sustained pressure that can impact user-facing performance.

⚠️ Retention limitation

The Capacity Metrics App keeps a rolling 14 days of data. If you need monthly chargeback, capacity planning trends, or before/after comparisons for optimization work, export the data or snapshot it into your own monitoring store.
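One way to picture the archive step is an idempotent merge of periodic exports into a store you control. The sketch below is a minimal illustration, not an official export mechanism: the CSV layout, the column names (`date`, `item`, `cu_seconds`), the sample values, and the `merge_snapshot` helper are all assumptions made for the example.

```python
import csv
import io


def merge_snapshot(archive, snapshot_rows):
    """Upsert snapshot rows into the archive, keyed by (date, item).

    Re-importing an overlapping export overwrites rows instead of duplicating them.
    """
    for row in snapshot_rows:
        archive[(row["date"], row["item"])] = float(row["cu_seconds"])
    return archive


# Hypothetical exports: one row per day and item, with overlapping days.
export_day_1 = (
    "date,item,cu_seconds\n"
    "2024-05-01,SalesModel,5400\n"
    "2024-05-01,IngestPipeline,9100\n"
)
export_day_2 = (
    "date,item,cu_seconds\n"
    "2024-05-01,SalesModel,5400\n"
    "2024-05-02,SalesModel,6100\n"
)

archive = {}
for export in (export_day_1, export_day_2):
    merge_snapshot(archive, csv.DictReader(io.StringIO(export)))

print(len(archive))  # 3 distinct (date, item) rows survive the overlap
```

Keying on date and item means the job can run more often than strictly needed without inflating the history, which matters when the source only keeps a short rolling window.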

How to read the reports

Fabric separates interactive and background consumption because the user impact is different. Interactive CU is tied to user-facing experiences like report queries, while background CU usually comes from scheduled refreshes, Spark work, or pipeline execution. A capacity that looks busy can still feel healthy if the pressure is dominated by background work in the right windows. The app helps you make that distinction explicit.

The smoothing visuals are equally important. Fabric can absorb bursts and repay them over time, so not every spike is an emergency. What matters is whether short bursts become a continuous pattern that eats into headroom, triggers throttling, or overlaps with critical interactive hours. That is why the app is so valuable for right-sizing decisions and for proving whether a problem is architectural, operational, or simply a capacity mismatch.
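To make the burst-and-repay idea concrete, here is a deliberately simplified sketch that spreads each timepoint's consumption evenly across a fixed number of later timepoints. It is an illustration of the concept only, not Fabric's actual smoothing algorithm; the `smooth_usage` function, the window length, and the sample values are assumptions.

```python
def smooth_usage(raw_cu, window):
    """Spread each timepoint's CU evenly over the next `window` timepoints.

    Simplified illustration only; the platform's real smoothing rules differ.
    """
    smoothed = [0.0] * (len(raw_cu) + window - 1)
    for start, cu in enumerate(raw_cu):
        share = cu / window
        for offset in range(window):
            smoothed[start + offset] += share
    return smoothed


# Hypothetical raw CU per timepoint: a single burst followed by quiet periods.
raw = [0, 120, 0, 0, 0, 0]
smoothed = smooth_usage(raw, window=4)

print(max(raw), max(smoothed))  # 120 30.0
```

The total consumption is unchanged, but the peak that counts against headroom is much lower. A burst only becomes a problem when new bursts arrive before the earlier ones have been paid down.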

✅ Best use cases

Daily operations checks, right-sizing decisions, chargeback conversations, troubleshooting noisy neighbors, and validating whether optimizations reduced real CU pressure.

❌ Not enough by itself

The app shows where capacity was spent, but not every workload detail, root-cause log, or security event. Pair it with workspace monitoring and Log Analytics for a fuller story.

📚 Learn More

Capacity Metrics App ↗

Workspace

Workspace Monitoring

Built-in operational telemetry for what is happening inside an individual Fabric workspace.

Workspace Monitoring is Fabric's built-in view of operational activity within a workspace. While the Capacity Metrics App tells you how the shared capacity behaves, workspace monitoring brings you closer to the artifacts that developers and analysts actually use. It is the better lens for answering questions such as: which pipeline failed, which notebook ran longer than expected, and which item has a growing history of execution problems?

This feature is especially useful for domain teams and workspace owners who need operational visibility without becoming full tenant admins. It typically highlights activity such as pipeline runs, notebook executions, refresh history, query behavior, and other workload signals relevant to that workspace. In practice, it gives engineering teams a faster path from symptom to suspect artifact.

Enable and access it from the workspace experience, then use it as the first-line operational dashboard for delivery teams. The most effective pattern is to let workspace owners watch execution health locally while platform owners correlate those findings with capacity-level telemetry centrally.

What it tracks

Pipeline runs: successes, failures, durations, retry behavior, and scheduling patterns
Notebook executions: run history, job duration, and trends in failed or long-running sessions
Query performance: item- or experience-level clues that help explain slow interactive behavior
Refresh history: semantic model and data movement operations that affect freshness and user trust

📊 Key metrics

Look for failure counts, average duration, p95 duration, concurrency patterns, and sudden shifts in execution volume after deployments or schema changes.

🔎 What the signals mean

A rise in duration without an increase in volume can indicate upstream slowness or poor query plans; a rise in failures after deployment usually points to config drift or data contract breaks.

👥 Best audience

Workspace admins, engineering leads, and platform teams supporting production workspaces that need workload-level visibility without leaving Fabric.
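The key metrics above can be derived from any list of run records. The following is a minimal sketch assuming each record carries a `status` and a `duration_s` field; those field names, the `summarize_runs` helper, and the sample runs are hypothetical and not a Fabric schema.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def summarize_runs(runs):
    """Summarize failure count, average duration, and p95 duration."""
    durations = [run["duration_s"] for run in runs]
    return {
        "runs": len(runs),
        "failures": sum(1 for run in runs if run["status"] == "Failed"),
        "avg_duration_s": mean(durations),
        "p95_duration_s": p95(durations),
    }


# Hypothetical history: 20 runs lasting 10s, 20s, ..., 200s, one of them failed.
runs = [{"status": "Succeeded", "duration_s": d} for d in range(10, 210, 10)]
runs[3]["status"] = "Failed"

summary = summarize_runs(runs)
print(summary)
```

Tracking p95 alongside the average is useful because a handful of very slow runs can hide behind a healthy-looking mean.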

How it differs from the Capacity Metrics App

Workspace monitoring is artifact-centric, while the Capacity Metrics App is capacity-centric. Workspace monitoring helps you troubleshoot a specific workspace and its items. The Capacity Metrics App helps you understand whether the shared compute platform is under strain. You usually need both views to resolve incidents quickly.

The main limitation of workspace monitoring is scope. It does not give you the same tenant-wide, cross-workspace, cost-oriented perspective as capacity telemetry or admin monitoring solutions. It is excellent for local operations, but it is not a replacement for centralized observability.

📚 Learn More

Workspace Monitoring ↗

OneLake

OneLake Diagnostics

Observing storage growth, file activity, and storage-connected signals that affect data platform reliability.

OneLake is the storage foundation of Fabric, so storage diagnostics deserve explicit attention. Many operational issues present first as compute symptoms, but the root cause can be storage growth, file churn, poor layout, or unexpected data movement. Monitoring OneLake helps teams answer practical questions such as which domains are growing fastest, whether ingestion patterns are creating excessive file operations, and when storage-related behavior should trigger review.

For observability, think in terms of both consumption and activity. Consumption signals show how much data you are retaining and where growth is happening. Activity signals show what users and workloads are doing to that storage through file creation, modification, deletion, and access behavior. When combined with workspace and capacity views, these signals help explain downstream performance and governance outcomes.

Azure Monitor integration matters here because it provides the long-term, centralized alerting and retention model many enterprises want. Even if day-to-day teams live inside Fabric, platform and security teams often want storage-related telemetry in the same operational stack as the rest of Azure monitoring.

📦 Storage usage

Track how lakehouses, mirrored data, shortcuts, and workspace content grow over time so you can plan cost, lifecycle, and retention strategies.

📝 File operations

Watch creation, update, and delete behavior when investigating ingestion anomalies, runaway jobs, or unexpected data movement patterns.

🔔 Storage alerts

Set alerts for sudden storage growth, unusual file activity, or monitoring thresholds that indicate a broken ingestion loop or policy violation.

🔗 Azure Monitor integration

Use Azure Monitor and diagnostic export patterns when you need longer retention, alert rules, or centralized workbook views that join OneLake-related diagnostics with the rest of your operational data.
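A sudden-growth storage alert can be as simple as comparing each day's size with the previous day's. The sketch below assumes you already collect a daily size series in GB; the `growth_alerts` function, the 10% limit, and the sample sizes are illustrative assumptions rather than recommended values.

```python
def growth_alerts(daily_gb, max_daily_growth_pct):
    """Return (day_index, growth_pct) for days whose growth exceeds the limit."""
    alerts = []
    for day in range(1, len(daily_gb)):
        previous, current = daily_gb[day - 1], daily_gb[day]
        if previous <= 0:
            continue  # no baseline to compare against
        growth_pct = (current - previous) / previous * 100
        if growth_pct > max_daily_growth_pct:
            alerts.append((day, round(growth_pct, 1)))
    return alerts


# Hypothetical daily sizes in GB; day 3 jumps from 510 to 640.
sizes = [500, 505, 510, 640, 645]
print(growth_alerts(sizes, max_daily_growth_pct=10))  # [(3, 25.5)]
```

A relative threshold adapts to workspaces of different sizes, whereas an absolute GB threshold tends to be too sensitive for large domains and too lax for small ones.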

Alerts

Fabric Activator (Reflex)

Event-driven alerting that turns data changes into actions across email, Teams, and automation flows.

Fabric Activator adds an event-driven layer to Fabric observability. Instead of only reviewing dashboards after the fact, you define conditions on incoming or changing data and let Fabric trigger actions automatically. This is powerful because many business-critical issues are best detected by changes in metrics, sensor events, or streaming facts rather than by a human refreshing a report.

The operating model is simple: define a trigger against a data stream or monitored condition, evaluate it continuously or in near real time, then fire an action such as an email, a Teams notification, or a Power Automate flow. In other words, Activator closes the loop between observability and response. For KPI-led operations, that makes it one of the most practical monitoring tools in the platform.

Activator is particularly valuable when used with Real-Time Intelligence. If your organization already captures operational or business events in real time, Activator lets you define thresholds and anomaly conditions close to the data path instead of recreating the logic in multiple external tools.

Common use cases

📉 KPI threshold alerts

Notify stakeholders when inventory drops, latency rises, SLA attainment falls, or operational metrics cross agreed business thresholds.

📡 Anomaly detection

Trigger attention when volumes, durations, or event rates deviate sharply from the expected pattern for a process or business service.

⏱️ SLA monitoring

Escalate when a refresh has not completed by a deadline, when an event has not arrived on time, or when operational lag exceeds tolerance.
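The SLA case above reduces to one question: once the deadline has passed, which items have no completed refresh for today? Here is a minimal sketch of that check. The `missed_deadlines` function, the item names, and the timestamps are hypothetical, and the input is assumed to be the latest successful completion time per item.

```python
from datetime import datetime, time


def missed_deadlines(last_completed, deadline, now):
    """Return items with no successful refresh today once the deadline has passed.

    A refresh that finished today, even after the deadline, counts as completed.
    """
    if now.time() < deadline:
        return []  # too early to call anything late
    return sorted(
        item
        for item, finished in last_completed.items()
        if finished is None or finished.date() < now.date()
    )


# Hypothetical state at 07:30 with a 07:00 deadline.
last_completed = {
    "SalesModel": datetime(2024, 5, 2, 6, 40),
    "FinanceModel": datetime(2024, 5, 1, 6, 35),  # last success was yesterday
    "OpsModel": None,                             # never completed
}
late = missed_deadlines(last_completed, time(7, 0), datetime(2024, 5, 2, 7, 30))
print(late)  # ['FinanceModel', 'OpsModel']
```

Checking for the absence of an expected event catches stalled schedules that never produce a failure record, which failure-only alerts would miss.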

Setup walkthrough

Identify the business or technical signal you want to monitor, ideally one backed by real-time or frequently updated data.
Create the Activator item and point it to the relevant data source or event stream.
Define the trigger condition, threshold, or pattern that should produce an action.
Select the action path: email, Teams, or Power Automate for richer downstream workflows.
Test the trigger with controlled data changes, then define ownership and escalation expectations.
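The walkthrough above follows a condition-then-action shape that is easy to prototype before you build the real item. The sketch below is plain Python and does not use any Activator API; the `Trigger` class, the inventory threshold, and the sample events are assumptions made for illustration. It fires only when the condition turns from false to true, so a metric that stays below the threshold does not notify on every event.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[str, dict], None]
    active: bool = False  # True while the condition currently holds

    def evaluate(self, event: dict) -> bool:
        """Run the action only on a false-to-true transition of the condition."""
        holds = self.condition(event)
        fired = holds and not self.active
        if fired:
            self.action(self.name, event)
        self.active = holds
        return fired


sent = []  # stands in for an email, Teams, or flow action
low_stock = Trigger(
    name="low-stock",
    condition=lambda e: e["inventory"] < 20,
    action=lambda name, e: sent.append(f"{name}: {e['sku']} at {e['inventory']}"),
)

# Hypothetical event stream for one SKU.
for level in [35, 18, 12, 40, 9]:
    low_stock.evaluate({"sku": "A1", "inventory": level})

print(sent)  # ['low-stock: A1 at 18', 'low-stock: A1 at 9']
```

Feeding controlled events through a prototype like this is a cheap way to agree on threshold and reset behavior with the alert owner before testing the real trigger.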

✅ Best fit

Use Activator for fast, event-driven detection and lightweight response. Use Azure Monitor and Log Analytics when you need centralized platform logging, long retention, complex correlation, or security operations workflows.

📚 Learn More

Fabric Activator ↗

Admin

FUAM (Fabric Unified Admin Monitoring)

Tenant-wide monitoring patterns that help admins understand usage, adoption, and operational trends across workspaces.

FUAM is best understood as a tenant-wide admin monitoring pattern or solution layer for Fabric rather than a single built-in screen. The goal is to aggregate telemetry that matters to platform administrators: which teams are active, which workspaces generate sustained load, which experiences are being adopted, and where governance or cost conversations should start. It fills the visibility gap between local workspace operations and raw tenant-level logs.

For admins, the real value is the ability to turn scattered usage signals into a repeatable operating view. Capacity metrics show pressure, workspace monitoring shows execution behavior, and audit data shows change history. FUAM-style monitoring brings those streams together into dashboards that can be used for adoption reviews, cost governance, support triage, and executive reporting.

Deployment typically involves standing up a dedicated admin workspace, loading the telemetry sources you care about, and publishing reports that summarize usage across all workspaces. The exact implementation can vary, but the design principle is consistent: create a durable admin analytics layer that survives beyond short retention windows and supports tenant-wide decisions.

👥 Usage analytics

See who is using what across the tenant, which workspaces are most active, and where adoption is concentrated or stalling.

📈 Capacity utilization trends

Track cross-workspace patterns over longer periods so admins can spot chronic hot spots, growth trajectories, and candidate workloads for optimization.

🧭 Admin decision support

Use dashboards for governance reviews, chargeback conversations, support prioritization, and identifying where enablement or architectural intervention is needed.

Key reports to include

Workspace adoption: active users, active items, and trend lines by domain or business unit
Operational exceptions: failed refreshes, repeated job failures, and top incident-generating workspaces
Capacity pressure by business owner: who is driving background versus interactive demand
Governance posture: admin changes, sensitive workspaces, and gaps in monitoring coverage
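As a rough illustration of the workspace adoption report, the sketch below rolls activity events up into active users and active items per workspace. The event shape (`workspace`, `user`, `item`), the `adoption_by_workspace` helper, and the sample data are hypothetical and not taken from a FUAM schema.

```python
from collections import defaultdict


def adoption_by_workspace(events):
    """Count distinct users and distinct items seen per workspace."""
    users = defaultdict(set)
    items = defaultdict(set)
    for event in events:
        users[event["workspace"]].add(event["user"])
        items[event["workspace"]].add(event["item"])
    return {
        workspace: {
            "active_users": len(users[workspace]),
            "active_items": len(items[workspace]),
        }
        for workspace in sorted(users)
    }


# Hypothetical activity events for one reporting period.
events = [
    {"workspace": "Sales", "user": "ana", "item": "SalesModel"},
    {"workspace": "Sales", "user": "ben", "item": "SalesModel"},
    {"workspace": "Sales", "user": "ana", "item": "SalesReport"},
    {"workspace": "Finance", "user": "cy", "item": "Ledger"},
]
report = adoption_by_workspace(events)
print(report)
```

Counting distinct users and items rather than raw events keeps a single heavy user or a chatty scheduled job from looking like broad adoption.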

Azure

Azure Monitor & Log Analytics Integration

Centralizing Fabric diagnostics, KQL analysis, alert rules, and security monitoring in Azure operations tooling.

Fabric-native dashboards are excellent for local operations, but enterprise observability usually needs a central landing zone. That is where Azure Monitor and Log Analytics come in. By exporting Fabric diagnostic logs and correlating them with other operational data, teams can retain history longer, run KQL-based investigations, create reusable workbooks, and connect alerts to existing incident processes.

This integration is especially valuable when your platform team already works in Azure Monitor or Microsoft Sentinel. Instead of treating Fabric as a separate island, you bring its logs into the same analytics and security toolchain used for applications, infrastructure, and identity. That makes it easier to answer cross-cutting questions such as whether a failure spike aligns with a deployment, an authentication problem, or a data access anomaly.

For Fabric, a mature Azure Monitor setup usually includes diagnostic export, a Log Analytics workspace, KQL queries or workbooks for day-to-day triage, and targeted alert rules. Security teams often extend this with Sentinel so Fabric audit and access signals participate in broader detection and investigation flows.

What to send to Log Analytics

Diagnostic logs: operational events that support troubleshooting and alerting
Audit signals: admin changes, access events, and artifact-level actions relevant to governance
Execution outcomes: failures, retries, and duration-related signals needed for SLA tracking
Security-relevant activity: unusual access patterns, bulk export behavior, and high-risk changes

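KQL: Example Fabric monitoring queries

// Table and column names below are illustrative; match them to the schema your export actually produces.
// Failed Fabric operations in the last 24 hours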
FabricActivityLogs
| where TimeGenerated > ago(24h)
| where Status =~ "Failed"
| summarize Failures = count() by OperationName, WorkspaceName
| order by Failures desc

// Workspaces with the highest recent activity
FabricActivityLogs
| where TimeGenerated > ago(24h)
| summarize Events = count() by WorkspaceName
| order by Events desc

// Potential admin-sensitive changes
FabricAuditLogs
| where TimeGenerated > ago(7d)
| where OperationName in ("UpdateTenantSettings", "AssignWorkspaceRole", "UpdateCapacity")
| project TimeGenerated, UserId, OperationName, WorkspaceName

📓 Workbooks

Create Azure Monitor workbooks that combine failure trends, workspace hotspots, security events, and capacity-related context into a single operational view.

🚨 Alert rules

Define alerts for capacity thresholds, error-rate spikes, recurring pipeline failures, or long-running jobs that indicate an SLA breach or degraded user experience.

🛡️ Sentinel integration

Forward relevant audit and diagnostics data into Microsoft Sentinel so Fabric becomes part of centralized security monitoring and investigation workflows.

Fabric audit log schema considerations

Fabric audit data is most useful when you standardize a few core fields in your investigations: timestamp, user or service principal, operation name, workspace or item identity, result status, and any affected capacity or security object. Even if different logs expose slightly different shapes, your queries and workbooks should normalize around those fields so analysts can pivot quickly.

That schema mindset matters because monitoring is not just about counting errors. It is about preserving enough context to reconstruct intent, impact, and ownership. The more consistently you capture and query those fields, the faster your ops and security teams can move from alert to action.
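One way to apply that schema mindset is a small normalization step that maps differently shaped records onto the same core fields before analysis. The sketch below is an assumption-heavy illustration: the core field names and every alias in `ALIASES` are placeholders, so replace them with the names your logs actually expose.

```python
CORE_FIELDS = ("timestamp", "actor", "operation", "target", "status")

# Hypothetical source-field aliases, listed in order of preference.
ALIASES = {
    "timestamp": ("TimeGenerated", "CreationTime"),
    "actor": ("UserId", "ServicePrincipalId"),
    "operation": ("OperationName", "Activity"),
    "target": ("ItemName", "WorkspaceName"),
    "status": ("Status", "ResultStatus"),
}


def normalize(record):
    """Map a raw log record onto the core fields; missing values become None."""
    normalized = {}
    for core in CORE_FIELDS:
        normalized[core] = next(
            (record[alias] for alias in ALIASES[core] if record.get(alias) is not None),
            None,
        )
    return normalized


# Two hypothetical records with different shapes.
activity = {"TimeGenerated": "2024-05-02T09:00:00Z", "UserId": "ana@contoso.com",
            "OperationName": "RefreshDataset", "WorkspaceName": "Sales", "Status": "Failed"}
audit = {"CreationTime": "2024-05-02T09:05:00Z", "ServicePrincipalId": "spn-123",
         "Activity": "UpdateCapacity", "ResultStatus": "Succeeded"}

for raw in (activity, audit):
    print(normalize(raw))
```

Keeping missing values as explicit `None` rather than dropping the key lets analysts see at a glance which sources lack context such as the affected workspace or item.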

📚 Learn More

Azure Monitor & Fabric Monitoring Overview ↗

Patterns

Alert Design Patterns

Design alerts that are actionable, meaningful, and resistant to noise.

The hardest part of monitoring is usually not collecting data; it is deciding what deserves an alert. Fabric generates plenty of signals, but not every signal should wake someone up or create a ticket. Effective alerting starts by separating informative trends from urgent operational risks. If an alert does not have a clear owner and an expected response, it is probably a dashboard metric instead of an alert.

In Fabric environments, good alerts are usually tied to user impact, SLA risk, security risk, or sustained capacity stress. Bad alerts are often overly granular, trigger on transient blips, or duplicate information already visible elsewhere. Build alerts around durable operating questions: is the platform under stress, are critical pipelines missing deadlines, and is someone doing something unusual with sensitive data?

📈 Capacity alerts

Alert when CU utilization stays above 80% for sustained periods, when throttling events appear, or when overage activation indicates you are operating beyond expected headroom.

🔄 Pipeline alerts

Alert on repeated failures, SLA breaches, and long-running jobs that threaten downstream freshness or create user-facing delays in reports and data products.

🔐 Security alerts

Alert on unusual login patterns, bulk data exports, privileged admin changes, or access behavior inconsistent with the normal baseline for a user or workspace.
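The capacity alert above hinges on the word "sustained": a single sample over the threshold should not fire. Here is a minimal sketch of that logic over a series of utilization samples. The `sustained_breaches` function, the three-sample minimum, and the sample values are assumptions; pick a duration that matches your own sampling interval and tolerance.

```python
def sustained_breaches(utilization_pct, threshold, min_points):
    """Return (start, end) index ranges where utilization stays above the
    threshold for at least `min_points` consecutive samples."""
    breaches = []
    run_start = None
    for i, value in enumerate(utilization_pct):
        if value > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_points:
                breaches.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open at the end of the series.
    if run_start is not None and len(utilization_pct) - run_start >= min_points:
        breaches.append((run_start, len(utilization_pct) - 1))
    return breaches


# Hypothetical CU utilization samples in percent.
samples = [60, 85, 70, 82, 88, 91, 86, 75, 90, 95]
print(sustained_breaches(samples, threshold=80, min_points=3))  # [(3, 6)]
```

The isolated spike at index 1 and the two-sample run at the end are ignored, which is the behavior you want if short bursts are expected and only prolonged pressure needs a responder.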

What not to alert on

Do not alert on every single failure in a noisy development workspace, every short-lived capacity burst, or every one-off retriable job issue. Those conditions are better handled through trend dashboards, summary digests, or daily review queues. The goal is to surface actionable exceptions, not to mirror every event stream with a notification stream.

🟢 Severity tiers

Define informational, warning, and critical thresholds so teams know whether to observe, investigate during business hours, or escalate immediately.

⏸️ Suppression rules

Suppress duplicate alerts during active incidents, maintenance windows, or known deployments to avoid overwhelming responders with repeated noise.

📞 Escalation paths

Document who owns the first response, when to involve platform engineering, and when security or business stakeholders must be notified.
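Severity tiers and suppression rules are easier to agree on when they are written down as explicit logic. The sketch below shows one possible shape: a tier function plus a suppressor that drops repeats inside a quiet period or a maintenance window. The thresholds, the `Suppressor` class, the alert key, and the times are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def severity(cu_pct, warning=80, critical=95):
    """Map a utilization percentage to a tier; the thresholds are placeholders."""
    if cu_pct >= critical:
        return "critical"
    if cu_pct >= warning:
        return "warning"
    return "informational"


class Suppressor:
    """Drop repeats of an alert key inside a quiet period or maintenance window."""

    def __init__(self, quiet_period, maintenance_windows=()):
        self.quiet_period = quiet_period
        self.maintenance_windows = list(maintenance_windows)
        self.last_sent = {}

    def should_notify(self, key, at):
        if any(start <= at < end for start, end in self.maintenance_windows):
            return False
        previous = self.last_sent.get(key)
        if previous is not None and at - previous < self.quiet_period:
            return False
        self.last_sent[key] = at
        return True


t0 = datetime(2024, 5, 2, 9, 0)
suppressor = Suppressor(
    quiet_period=timedelta(minutes=30),
    maintenance_windows=[(datetime(2024, 5, 2, 12, 0), datetime(2024, 5, 2, 13, 0))],
)
decisions = [
    suppressor.should_notify("capacity-high", t0),                          # first alert
    suppressor.should_notify("capacity-high", t0 + timedelta(minutes=10)),  # repeat
    suppressor.should_notify("capacity-high", t0 + timedelta(minutes=45)),  # quiet period over
    suppressor.should_notify("capacity-high", datetime(2024, 5, 2, 12, 15)),  # maintenance
]
print(decisions)  # [True, False, True, False]
```

Suppressed repeats do not extend the quiet period here, so a condition that persists still produces a reminder every 30 minutes instead of going silent.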

🚫 Prevent alert fatigue

If the same non-critical Fabric alert fires repeatedly with no action taken, either tune it, aggregate it, or remove it. Noisy alerts train teams to ignore the very signals that matter during a real incident.
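A periodic review for alert fatigue can start from a simple question: which alerts fire often but almost never lead to action? The following is a minimal sketch under stated assumptions: the history is a list of `(alert_name, action_taken)` pairs, and the `noisy_alerts` function, its thresholds, and the alert names are hypothetical.

```python
from collections import Counter


def noisy_alerts(history, min_fires, max_action_rate):
    """Return alerts that fired at least `min_fires` times while the share of
    firings that led to an action stayed at or below `max_action_rate`."""
    fires = Counter(name for name, _ in history)
    actions = Counter(name for name, acted in history if acted)
    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and actions[name] / count <= max_action_rate
    )


# Hypothetical alert history for one review period.
history = (
    [("dev-refresh-failed", False)] * 6
    + [("capacity-high", True)] * 3
    + [("capacity-high", False)]
    + [("late-pipeline", False)] * 2
)
print(noisy_alerts(history, min_fires=5, max_action_rate=0.1))  # ['dev-refresh-failed']
```

The output is a candidate list for tuning, aggregation, or removal, not an automatic verdict; an alert that is rarely acted on may still be worth keeping if the rare case is severe.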

Operations

Monitoring Best Practices

Practical patterns for building an observability operating model that scales with your Fabric estate.

Monitoring in Fabric works best when you deliberately combine platform telemetry, workload execution data, and security signals into a single operating model. Do not let each persona keep separate blind spots. Capacity teams need enough workload detail to explain spikes. Workspace owners need enough platform context to know whether a failure is local or shared. Security teams need visibility into the same workspaces and capacities the operations team supports.

Build a dashboard set for the ops team that answers three questions fast: what is broken now, what is trending toward trouble, and where should we invest next? A practical template usually includes current capacity health, top failed workloads, data freshness or SLA compliance, recent admin changes, and a summary of alert volume by severity. That gives operators both immediate triage value and a path to continuous improvement.

Also plan for what Fabric does not retain long enough. The Capacity Metrics App is the clearest example, but the broader lesson applies everywhere: if a signal matters for monthly review, budgeting, adoption governance, or security investigation, archive it into your own monitoring store and keep the history you need.

📊 Build an ops dashboard

Include capacity health, top failing jobs, workspace hotspots, freshness/SLA indicators, and a compact security summary for shared situational awareness.

🚨 Alert before users complain

Use proactive thresholds for rising CU pressure, repeated job failures, and deteriorating execution times so teams act before business users notice impact.

🗃️ Archive short-retention data

Snapshot or export Capacity Metrics and other operational signals into a durable analytics store so weekly, monthly, and quarterly reviews are evidence-based.

Recommended review cadence

Daily ops

Review current capacity status, critical alerts, failed pipelines, failed refreshes, and any open incidents affecting business-facing workspaces.

Weekly trends

Look for repeat offenders, long-running workload drift, alert noise, data freshness patterns, and top workspaces by operational overhead.

Monthly capacity review

Assess SKU fit, workload scheduling, optimization results, overage behavior, and whether chargeback or governance interventions are needed.

Document runbooks

Every important alert should map to a short runbook: what the signal means, first checks to perform, where to look next, who owns escalation, and how to communicate impact. Runbooks are how you turn monitoring from knowledge held by a few experts into a repeatable operating capability for the whole team.

✅ Full-picture observability

Combine capacity + workload + security monitoring. Any one of them in isolation gives you only a partial truth; together they explain platform health, user impact, and governance posture.

Resources

Resources & Links

Official documentation and recommended starting points for operationalizing Fabric observability.

If you are standing up Fabric monitoring for the first time, start with the Capacity Metrics App and the monitoring overview documentation, then add workspace monitoring and Azure Monitor integration based on your operational maturity. That sequence gives you a fast path from basic visibility to enterprise-grade observability.

For teams working with event-driven use cases, Activator is a strong complement to platform monitoring. For governance-heavy environments, keep the monitoring overview and workspace docs close at hand so your operational model stays aligned with Microsoft guidance as Fabric evolves.

📈 Capacity Metrics App

🧭 Workspace Monitoring

⚡ Fabric Activator

☁️ Azure Monitor for Fabric