Developing a Mobile Crash Model for OpenTelemetry

Learn how the OpenTelemetry community is collaborating to adopt OTel’s "events" construct as a way to effectively model mobile crashes.
Nov 14th, 2024 6:35am by
Featured image by Shutterstock.

OpenTelemetry (OTel) provides flexible, extensible and vendor-neutral standards for instrumenting and monitoring applications. It’s completely changed the observability game within the last few years, prompting many solutions providers to participate in an ecosystem that encourages open standards.

So far, however, OpenTelemetry has focused largely on backend infrastructure monitoring.

With the growing importance of mobile as a means of transacting with businesses, plus users’ rising performance demands, it makes sense for mobile to become the next big frontier for OTel.

This is exactly what Embrace wanted to help with when we adopted the OTel standard and open sourced our software development kits (SDKs). We’ve long focused on providing specialized, highly granular means of collecting and analyzing mobile observability signals that reveal the true user impact of app performance issues.

As part of this effort, we’ve been working with OpenTelemetry maintainers, contributors and special interest groups (SIGs) to develop standards for modeling mobile data within OpenTelemetry. Our latest project has been to adopt events, one of OTel’s emerging constructs, as a way to effectively model mobile crashes.

Modeling Crashes as Logs

Prior to the introduction of events, we mapped out mobile crashes as regular LogRecords with a specific attribute, emb.type, to internally convey the schema. Each known value of emb.type maps to a well-known set of attributes, depending on the type of crash it is modeling. Unfortunately, no one outside Embrace knows this mapping, and even if we were to publicize it, it would be a solution specific to Embrace.

The lack of standardization for occurrences that are fairly standard in mobile means the crashes we record are less portable, as no other backend understands our proprietary typing system. Having a common understanding and definition for mobile telemetry is key to solving this problem.
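The portability problem can be illustrated with a minimal sketch. The `LogRecord` class, the `EMB_TYPE_SCHEMAS` registry and the `interpret` helper are hypothetical stand-ins, not Embrace’s SDK or the OpenTelemetry API, and the attribute set attached to `sys.android.crash` is assumed purely for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical private registry: only the vendor knows which attributes
# each emb.type value implies. The attribute names are assumptions.
EMB_TYPE_SCHEMAS = {
    "sys.android.crash": {"exception.type", "exception.stacktrace"},
}


@dataclass
class LogRecord:
    """Simplified stand-in for an OTel LogRecord."""
    timestamp_ns: int
    body: str
    attributes: dict = field(default_factory=dict)


def interpret(record, known_schemas):
    """Return the attribute names a backend can interpret, or None if
    the backend does not recognize the record's emb.type value."""
    expected = known_schemas.get(record.attributes.get("emb.type"))
    if expected is None:
        return None
    return sorted(name for name in expected if name in record.attributes)


crash = LogRecord(
    timestamp_ns=1_700_000_000_000_000_000,
    body="crash",
    attributes={
        "emb.type": "sys.android.crash",
        "exception.type": "java.lang.IllegalStateException",
        "exception.stacktrace": "(stack trace omitted)",
    },
)

vendor_view = interpret(crash, EMB_TYPE_SCHEMAS)  # vendor backend knows the mapping
third_party_view = interpret(crash, {})           # any other backend does not
```

The vendor backend resolves the schema from its private registry, while a backend without that registry sees only an ordinary log record with attributes it cannot interpret.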

OpenTelemetry Introduces the Event Data Type

OpenTelemetry maintainers and contributors have been working on introducing the event data type for some time. Currently, it’s in an experimental state, which means that breaking changes are still allowed.

Events are the next evolution of structured logs in OTel. They are based on the LogRecord signal, so they share many of the same characteristics as their parent. The main difference is that events have a specific schema of both required and optional attributes that a LogRecord must or can have, respectively.

The schema is defined in the OTel Semantic Conventions. This allows backends to know what data they can expect in a particular LogRecord and how to interpret the values of the expected attributes. The schema that applies is identified by the value of the required attribute, event.name, the existence of which qualifies a LogRecord as an event.

This event.name attribute functions in a very similar way to emb.type, except it is now part of the OpenTelemetry specification. Because of that status, all OpenTelemetry tooling will treat this as a first-class platform construct, so all backends supporting OpenTelemetry will be able to understand and use it.
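As a rough sketch of how a backend might use event.name, the code below treats any attribute set containing `event.name` as an event and checks it against a schema of required and optional attributes. The event name `example.crash`, the `example.crash.*` attribute names and the `EventSchema`, `is_event` and `validate_event` helpers are all hypothetical; the real definitions live in the OTel Semantic Conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSchema:
    """Required and optional attribute names for one event.name value."""
    required: frozenset
    optional: frozenset = frozenset()


# Hypothetical schema registry keyed by event.name.
SCHEMAS = {
    "example.crash": EventSchema(
        required=frozenset({"example.crash.type"}),
        optional=frozenset({"example.crash.encoding"}),
    ),
}


def is_event(attributes):
    """A LogRecord qualifies as an event when event.name is present."""
    return "event.name" in attributes


def validate_event(attributes, schemas=SCHEMAS):
    """Return a list of problems; an empty list means the event conforms."""
    if not is_event(attributes):
        return ["not an event: event.name is missing"]
    schema = schemas.get(attributes["event.name"])
    if schema is None:
        return ["unknown event.name: " + attributes["event.name"]]
    missing = sorted(schema.required - set(attributes))
    return ["missing required attribute: " + name for name in missing]


valid = {"event.name": "example.crash", "example.crash.type": "android_jvm"}
incomplete = {"event.name": "example.crash"}
plain_log = {"severity": "ERROR"}
```

Optional attributes are recorded in the schema but not enforced, which mirrors the "must or can have" distinction described above.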

Crashes are examples of events. Beyond crashes, other noteworthy happenings during the execution of mobile apps can be captured as events, including button clicks, session changes or network changes. Anything that occurs at a point in time when a mobile app is active is eligible to be an event. So crashes are just the start.

Crashes as Events

A crash in a mobile app is a “thing that happened at a point in time,” so using events to model them works well.

Because events are structured logs, the type of associated data that can be included with them is more useful, provides better context and is much easier to process. This makes OTel events ideal for use with an observability analysis platform, which does a lot of the heavy-lifting in terms of producing aggregated metrics from disparate telemetry and providing visualizations like charts and dashboards.

It also makes crashes (as events) more useful when it comes to forwarding data from the SDK to external observability backends because the structure of the payload is now well defined and part of OpenTelemetry.

Developing the Model

As events become better established, more of them will be accepted into OpenTelemetry’s official semantic conventions. At that point, they will have a documented payload structure in the form of an event schema, semantics for the defined attributes, and stability and requirement levels.

Embrace has been working with the community to define mobile crashes as an event because we believe it is vital for mobile telemetry to have shared, vendor-agnostic definitions that exist and are usable within OpenTelemetry and its ecosystem.

Unique Challenges With Mobile Crashes

The nature of mobile creates some unique challenges for developing a standard for mobile crashes as events in OTel.

Modeling the Many Flavors of Crashes

The most challenging issue, in terms of data modeling, is that crashes come in many flavors.

Depending on the platform, the nature of the crash and the data source of the crash details, you can get vastly different information. Much of this information is not usable without additional data that may not be available to the instrumentation tooling (e.g., mapping files that deobfuscate stack traces), or it is in a form the app cannot decode (e.g., binary data).

Our proposed solution to this problem is adding an attribute to the schema that contains a data blob, along with another attribute that describes the relevant, unique combination of factors that affect how the crash data can be interpreted (e.g., Android crashes obtained from an UncaughtExceptionHandler implementation will be android_jvm).

Additional information, like encoding, is optionally included so that backends can parse the blob. What the blob actually contains, however, will not be specified, as the structure can be complex and dynamic. Backends that understand specific types of crashes will need additional information to interpret the custom fields inside the blob, which are outside the scope of this event definition.
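The shape of this proposal can be sketched as follows. Only the `android_jvm` value comes from the proposal described above; the attribute names, the `ios_native` value, the base64 encoding and the JSON payload are assumptions made for illustration, and the helpers are hypothetical.

```python
import base64
import json


def build_crash_event(crash_type, payload, encoding="base64"):
    """Wrap an opaque crash payload (bytes) in event attributes."""
    return {
        "event.name": "example.crash",        # hypothetical event name
        "example.crash.type": crash_type,     # how to interpret the blob
        "example.crash.encoding": encoding,   # optional parsing hint
        "example.crash.data": base64.b64encode(payload).decode("ascii"),
    }


def parse_android_jvm(raw):
    """Assumed blob format for this sketch: a JSON document."""
    return json.loads(raw)


# A backend registers parsers only for the crash types it understands.
BLOB_PARSERS = {"android_jvm": parse_android_jvm}


def read_crash(attributes, parsers=BLOB_PARSERS):
    """Decode the blob, then parse it if the crash type is known."""
    raw = attributes["example.crash.data"]
    if attributes.get("example.crash.encoding") == "base64":
        raw = base64.b64decode(raw)
    parser = parsers.get(attributes["example.crash.type"])
    return parser(raw) if parser else raw


jvm_payload = json.dumps({"exception": "java.lang.NullPointerException"}).encode()
jvm_event = build_crash_event("android_jvm", jvm_payload)
unknown_event = build_crash_event("ios_native", b"\x00\x01")
```

A backend that recognizes the crash type gets structured data, while one that does not can still store and forward the decoded bytes untouched, since the event definition itself never specifies what the blob contains.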

This proposal differs from our proprietary solution, which maps each unique combination to a specific emb.type attribute value. For instance, the emb.type for Android crashes obtained from an UncaughtExceptionHandler is sys.android.crash.

Our proprietary approach doesn’t work as a general solution for crashes, however, because it can lead to a proliferation of events that are all trying to model crashes with slightly different data. This inevitably leads to a lot of overlap between each event and its effort to model the crash, making it hard to keep consistent definitions as more and more crash types are modeled.

Processing Delayed Data

Another challenge is dealing with delayed data, as the app may not know that a crash has happened until the next time it is launched.

When a mobile app experiences a crash, it’s no longer able to send data to the server in real time. Not only that, but the logging of the crash by the SDK installed on the device may also be delayed. The SDK will have to wait until the user reopens the app in order to emit the data it captured. That could be seconds, hours or even days later, and if not reported correctly, it has the potential to distort the overall data timeline and the conclusions drawn from it.

We’ve had to be very explicit in dealing with this challenge when modeling the new crash event.

To do so, we specify that any fields on the event object should describe the state of the client at the time of the crash, not the time when the event is logged. This includes fields that are automatically captured as part of the event spec, like timestamp, as well as globally defined attributes via semantic convention, like session ID. Therefore, when the backend sees a session ID or looks at the timestamp of the LogRecord, it will be based on the values at the time of the crash.
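A minimal sketch of this rule, assuming the SDK persists crash-time state to disk and emits the event on the next launch. The `LogRecord` class, `persist_crash`, `emit_on_next_launch` and the `example.crash` names are hypothetical; recording the emission time in a separate observed-timestamp field is one possible choice, not something the event definition described here mandates.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    """Simplified stand-in for an OTel LogRecord."""
    timestamp_ns: int            # when the crash happened
    observed_timestamp_ns: int   # when the record was actually emitted
    attributes: dict = field(default_factory=dict)


def persist_crash(now_ns, session_id, crash_type):
    """Capture client state at crash time so it survives process death."""
    return json.dumps({
        "crash_time_ns": now_ns,
        "session.id": session_id,
        "crash_type": crash_type,
    })


def emit_on_next_launch(stored, now_ns, current_session_id):
    """Build the crash event from persisted state.

    current_session_id is intentionally ignored: the event must describe
    the session that crashed, not the one that is reporting it.
    """
    saved = json.loads(stored)
    return LogRecord(
        timestamp_ns=saved["crash_time_ns"],
        observed_timestamp_ns=now_ns,
        attributes={
            "event.name": "example.crash",
            "example.crash.type": saved["crash_type"],
            "session.id": saved["session.id"],
        },
    )


stored = persist_crash(now_ns=1_000, session_id="session-a", crash_type="android_jvm")
record = emit_on_next_launch(stored, now_ns=9_000, current_session_id="session-b")
```

Here the emitted record carries the crash-time timestamp and the crashed session’s ID even though it was created later, in a different session.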

The Ongoing Process

Like many aspects of the OpenTelemetry initiative, the process for developing the mobile crash model is continuously evolving. If you’d like to learn more about it, including details on the schema being developed, head over to the docs section or feel free to follow along in the pull request (PR).

Additionally, check out some of the great ongoing work that’s being done by the OTel community to further define the measurement and modeling standards for mobile telemetry. If you’d like to learn more about Embrace, check out our open source SDKs or head over to our site.
