Application performance monitoring (APM): Smarter observability

Final Design

Before

My role

Designing Smart Alerts for applications and other services.

Time: 4 Sprints

Deliverables
Workshop artifacts
Wireframes
Hi-fi designs

User frustration

“Smart Alerts” may not offer the same capabilities as “Custom Events”

When the business shifted from a monolith to a microservices architecture, it decided to deprecate “Custom Events”. Their static configurations and lack of automation made them neither scalable nor efficient in dynamic, containerized environments.

As a result, users were moved from “Custom Events” to “Smart Alerts”. However, the customer was unhappy and raised concerns that Smart Alerts have limitations and do not offer the same functionality as Custom Events.

Research

Personas & need statements

Need: Automate anomaly detection

“I need an intelligent alerting system that automatically detects anomalies, adjusts thresholds, and minimizes alert fatigue.”

Johnathan
Site Reliability Engineer

Need: Automate alerting

“I need an intelligent, automated alerting system that dynamically adapts to my environment and diagnoses even unknown issues.”

Matt
DevOps 

Aligning on the user requirements

We held regular sync-ups to create a UX strategy blueprint, aligning everyone on the project objectives by identifying the goals and the pain points we wanted to solve.

Pain points

The pain points include restrictive evaluation windows, unclear time window types, an unmet 1-second MTTD promise, misaligned DFQ filters, limited metric threshold options, and a cumbersome alert preview experience.

  1. Evaluation windows: The smallest time window I can select in Smart Alerts is still too large, and the largest is still too small.

  1. Time windows: I don't know whether time windows in Custom Events and/or Smart Alerts are sliding windows or tumbling windows.

  1. Lowered MTTD: I can't alert within an acceptable MTTD, which does not align with the marketing message of 1-second detection.


  1. Scoping: I'm not sure whether the filter I applied using DFQs for the deprecated Custom Events matches the filters in Smart Alerts 100%.

  1. Setting thresholds: I don't understand why some metric threshold types are not available for some "group by" options (e.g., Adaptive Thresholds are not available for per-endpoint grouping).

  1. Alerts preview: I hate scrolling up and down the "advanced mode" modal dialog when configuring a Smart Alert.

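
The "Time windows" pain point above comes down to how samples are grouped before a threshold is checked. The following is a minimal sketch of the difference, not the product's actual evaluation logic: the function names, the sample latencies, the window size of 4, and the 400 ms threshold are all assumptions made for illustration.

```python
from typing import List


def tumbling_windows(samples: List[float], size: int) -> List[List[float]]:
    """Non-overlapping windows: each sample falls into exactly one window."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def sliding_windows(samples: List[float], size: int, step: int = 1) -> List[List[float]]:
    """Overlapping windows: the window advances by `step` samples each time."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


def breaches(windows: List[List[float]], threshold: float) -> List[bool]:
    """True for every window whose mean value exceeds the threshold."""
    return [sum(w) / len(w) > threshold for w in windows]


# Hypothetical latency samples (ms) with a short spike in the middle.
latency_ms = [100, 100, 100, 900, 900, 100, 100, 100]

# The spike straddles the boundary between the two tumbling windows,
# so neither window's mean crosses the threshold.
tumbling_result = breaches(tumbling_windows(latency_ms, size=4), threshold=400)

# Sliding windows include positions that contain the whole spike.
sliding_result = breaches(sliding_windows(latency_ms, size=4), threshold=400)

print(tumbling_result)
print(sliding_result)
```

With these sample values, the tumbling evaluation never fires while the sliding evaluation does. That is why users asked for the window type to be stated explicitly in the interface.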

High-level user journey stages (proposed) - creating an AP Smart Alert

This user journey outlines creating an AP Smart Alert, from selecting a blueprint and defining scope to configuring conditions, thresholds, persistence, and finalizing alert channels, properties, and payloads for effective monitoring.
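
The journey stages can also be sketched as a simple data structure. This is only an illustrative model and assumes each stage must be filled in order: the class, the field names, and the example values are hypothetical and do not reflect the product's API or the final design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Journey stages in the order described above (names are illustrative).
STAGES = [
    "blueprint", "scope", "condition", "threshold",
    "persistence", "channels", "properties", "payload",
]


@dataclass
class SmartAlertDraft:
    """Hypothetical in-progress alert; empty fields mark unfinished stages."""
    blueprint: Optional[str] = None
    scope: Optional[str] = None
    condition: Optional[str] = None
    threshold: Optional[float] = None
    persistence: Optional[int] = None  # assumed to be a duration in seconds
    channels: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)
    payload: Dict[str, str] = field(default_factory=dict)


def next_stage(draft: SmartAlertDraft) -> Optional[str]:
    """Return the first stage that is still empty, or None when all are set."""
    for stage in STAGES:
        value = getattr(draft, stage)
        if value is None or value == "" or value == [] or value == {}:
            return stage
    return None


draft = SmartAlertDraft(blueprint="latency", scope="service:checkout")
print(next_stage(draft))  # the stage the user should be guided to next
```

Modeling the journey as ordered stages makes it straightforward to show the user where they are in the flow and which step remains.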

Redesigning page layout

The redesigned interface enhances usability with an always-visible preview chart, optimized viewport usage, and reduced cognitive load. A structured left-to-right control layout, a "Group by" selector with guidance, and a dedicated "previous step" button improve navigation, clarity, and decision-making.

Redefining smart alerts layout for better usability

Our initial transition from Custom Events to Smart Alerts followed a specific layout. To enhance the user experience, we applied the Zeigarnik Effect and redefined the page structure for better clarity and usability.

Learnings

Strategic Alignment: Regular sync-ups and a unified UX strategy blueprint were essential. Aligning on user requirements helped us clearly identify pain points and set actionable goals for transforming Smart Alerts.

User-Centric Problem Solving: By deeply understanding user frustrations, such as restrictive evaluation windows and misaligned filters, we drove design enhancements that addressed specific user needs and improved overall usability.

Iterative Redesign: Applying principles like the Zeigarnik Effect, we continuously refined our page layout and interaction flows. This iterative approach not only improved navigation and clarity but also reinforced our commitment to a seamless user experience.