docs(sdks): Propagated Sampling Rates & Strict Trace Propagation (#11912)

Co-authored-by: Stephanie Anderson <stephanie.anderson@sentry.io>
Co-authored-by: Ivana Kellyer <ivana.kellyer@sentry.io>
Co-authored-by: Anton Pirker <anton.pirker@sentry.io>
Co-authored-by: Jan Michael Auer <mail@jauer.org>
Co-authored-by: Liza Mock <liza.mock@sentry.io>
6 people authored Nov 28, 2024
1 parent 2591056 commit 399f529
Showing 2 changed files with 93 additions and 13 deletions.
12 changes: 8 additions & 4 deletions develop-docs/sdk/telemetry/traces/dynamic-sampling-context.mdx
@@ -49,17 +49,20 @@ To align DSC propagation over all our SDKs, we defined a [unified propagation me
All of the attributes in the table below are required (non-optional) in the sense that, when they are known to an SDK at the time an envelope with an event (transaction or error) is sent to Sentry, or at the time a baggage header is propagated, they must also be included in said envelope or baggage.

At the moment, only `release`, `environment` and `transaction` are used by the product for dynamic sampling functionality.
The rest of the context attributes, `trace_id`, `public_key`, and `sample_rate`, are used by Relay for internal decisions (like transaction sample rate smoothing).
The rest of the context attributes, `trace_id`, `public_key`, `sampled` and `sample_rate`, are used by Relay for internal decisions and for extrapolation in the product.
Additional entries such as `replay_id`, `org` and `sample_rand` only use the DSC as a means of transport.

| Attribute | Type | Description | Example | Required Level |
| --------------------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------- | ------------------------------------ | ------------------------------------ |
| `trace_id` | string | The original trace ID as generated by the SDK. This must match the trace id of the submitted transaction item. [1] | `771a43a4192642f0b136d5159a501700` | strictly required [0] |
| `public_key` | string | Public key from the DSN used by the SDK. [2] | `49d0f7386ad645858ae85020e393bef3` | strictly required [0] |
| `sample_rate` | string | The sample rate as defined by the user on the SDK. [3] | `0.7` | strictly required [0] |
| `sampled` | string | `"true"` if the trace is sampled, `"false"` otherwise. This is set by the head of the trace. | `true` | required |
| `sample_rate` | string | The sample rate as defined by the user on the SDK. [3] [4] | `0.7` | strictly required [0] |
| `sample_rand` | string | A random number generated at the start of a trace by the head of trace SDK. [4] | `0.5` | required |
| `sampled` | string | `"true"` if the trace is sampled, `"false"` otherwise. This is set by the head of the trace SDK. [4] | `true` | required |
| `release` | string | The release name as specified in client options. | `myapp@1.2.3`, `1.2.3`, `2025.4.107` | required |
| `environment` | string | The environment name as specified in client options. | `production`, `staging` | required |
| `transaction` | string | The transaction name set on the scope. **Only include** if name has [good quality](#note-on-good-quality-transaction-names). | `/login`, `myApp.myController.login` | required (if known and good quality) |
| `org` | string | The org ID parsed from the DSN or received by a downstream SDK. | `1` | required |
| `user_segment` [DEPRECATED] | string | User segment as set by the user with `scope.set_user()`. | | deprecated |

0: In any case, `trace_id`, `public_key`, and `sample_rate` should always be known to an SDK, so these values are strictly required.
@@ -70,6 +73,8 @@ The rest of the context attributes, `trace_id`, `public_key`, and `sample_rate`,

3: This string should always be a number between (and including) 0 and 1 in a notation that is supported by the [JSON specification](https://www.json.org/json-en.html). If a `tracesSampler` callback was used for the sampling decision, its result should be used for `sample_rate` instead of the `tracesSampleRate` from `SentryOptions`. If `tracesSampler` returns `True`, it should be sent as `1.0`; `False` should be sent as `0.0`.

4: These attributes must conform to the invariant `sample_rand < sample_rate <=> sampled`.
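
For illustration, a DSC built from the example values in the table could be serialized into a `baggage` header roughly as sketched below (a sketch only; only entries known to the SDK are included, and the exact serialization rules are specified by the baggage propagation section):

```js
// Hypothetical serialization of the example DSC above into a `baggage` header
// using `sentry-` prefixed keys.
const dsc = {
  trace_id: "771a43a4192642f0b136d5159a501700",
  public_key: "49d0f7386ad645858ae85020e393bef3",
  sample_rate: "0.7",
  sample_rand: "0.5",
  sampled: "true", // consistent with the invariant: 0.5 < 0.7
  release: "myapp@1.2.3",
  environment: "production",
  org: "1",
};

const baggage = Object.entries(dsc)
  .map(([key, value]) => `sentry-${key}=${encodeURIComponent(value)}`)
  .join(",");
// "sentry-trace_id=771a43a4192642f0b136d5159a501700,sentry-public_key=...,sentry-sample_rate=0.7,..."
```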

<Alert level="warning">

### Note on good-quality transaction names
@@ -276,5 +281,4 @@ TODO - Add some sort of Q&A section on the following questions, after evaluating
- Why must baggage be immutable before the second transaction has been started?
- What are the consequences and impacts of the immutability of baggage on Dynamic Sampling UX?
- Why can't we just make the decision for the whole trace in Relay after the trace is complete?
- What is sample rate smoothing and how does it use `sample_rate` from the Dynamic Sampling Context?
- What are the differences between Dynamic Sampling on traces vs. transactions?
94 changes: 85 additions & 9 deletions develop-docs/sdk/telemetry/traces/index.mdx
@@ -10,32 +10,39 @@ This should give an overview of the APIs that SDKs need to implement, without
mandating internal implementation details.

Reference implementations:

- [JavaScript SDK](https://github.com/getsentry/sentry-javascript/tree/master/packages/core/src/tracing)
- [Python SDK](https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/tracing.py)

<Note>

This document uses standard interval notation, where `[` and `]` indicate closed intervals, which include the endpoints of the interval, while `(` and `)` indicate open intervals, which exclude them. An interval `[x, y)` covers all values starting from `x` up to but excluding `y`.

</Note>

## SDK Configuration

This section describes the options SDKs should expose to configure tracing and performance monitoring.

Tracing is enabled by setting any of three SDK config options, `enableTracing`, `tracesSampleRate` and `tracesSampler`. If not set, these options default to `undefined`, making tracing opt-in.
Tracing is enabled by setting either a `tracesSampleRate` or `tracesSampler`. If not set, these options default to `undefined` or `null`, making tracing opt-in.

### `enableTracing`

This option shall enable the generation of transactions and propagation of trace data. Sample rates shall be set at a default which is practical to the specific platform. Users may use the other options, listed below, should their use case require it. The standard should be to set the default sample rate at 100%, and only working back if there are inherent concerns for that platform. Users should be able to send most if not all of their data and rely on Sentry server side processing of their data.
This option is **deprecated** and should be removed from all SDKs.

### `tracesSampleRate`

This should be a float/double between `0.0` and `1.0` (inclusive) and represents the percentage chance that any given transaction will be sent to Sentry. So, barring [outside influence](https://develop.sentry.dev/sdk/performance/#sampling), `0.0` is a 0% chance (none will be sent) and `1.0` is a 100% chance (all will be sent). This rate applies equally to all transactions; in other words, each transaction should have the same random chance of ending up with `sampled = true`, equal to the `tracesSampleRate`.
This should be a floating-point number in the range `[0, 1]` and represents the percentage chance that any given transaction will be sent to Sentry. So, barring [outside influence](#sampling), `0.0` is a guaranteed 0% chance (none will be sent) and `1.0` is a guaranteed 100% chance (all will be sent). This rate applies equally to all transactions; in other words, each transaction has an equal chance of being marked as `sampled = true`, based on the `tracesSampleRate`.

See more about how sampling should be performed below.
See more about how sampling should be performed [below](#sampling).
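
For example, a minimal configuration sketch in JavaScript (the DSN is the example DSN used later in this document; option names on other platforms may differ):

```js
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: "https://1234@o1.ingest.us.sentry.io/1",
  // Roughly 25% of transactions end up with `sampled = true` and are sent to Sentry.
  tracesSampleRate: 0.25,
});
```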

### `tracesSampler`

This should be a callback, called when a transaction is started, which will be given a `samplingContext` object and which should return a sample rate between `0.0` and `1.0` _for the transaction in question_. This sample rate should behave the same way as the `tracesSampleRate` above, with the difference that it only applies to the newly-created transaction, such that different transactions can be sampled at different rates. Returning `0.0` should force the transaction to be dropped (set to `sampled = false`) and returning `1.0` should force the transaction to be sent (set `sampled = true`).
This should be a callback function, triggered when a transaction is started. It should be given a `samplingContext` object and should return a sample rate in the range of `[0, 1]` _for the transaction in question_. This sample rate should behave the same way as the `tracesSampleRate` above. The only difference is that it only applies to the newly-created transaction and that different transactions can be sampled at different rates. Returning `0.0` should force the transaction to be dropped (set to `sampled = false`) and returning `1.0` should force the transaction to be sent (set to `sampled = true`).

Optionally, the `tracesSampler` callback can also return a boolean to force a sampling decision (with `false` equivalent to `0.0` and `true` equivalent to `1.0`). If returning two different datatypes isn't an option in the implementing language, this possibility can safely be omitted.
Historically, the `tracesSampler` callback could have also returned a boolean to force a sampling decision (with `false` equivalent to `0.0` and `true` equivalent to `1.0`). This behavior is now **deprecated** and should be removed from all SDKs.

See more about how sampling should be performed below.
See more about how sampling should be performed [below](#sampling).
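
A sketch of a `tracesSampler` that returns numeric rates only (the property names on the sampling context are illustrative and vary by platform):

```js
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: "https://1234@o1.ingest.us.sentry.io/1",
  tracesSampler: (samplingContext) => {
    // Drop health checks entirely; return 0.0 rather than the deprecated `false`.
    if (samplingContext.name === "GET /health") {
      return 0.0;
    }
    // Sample everything else at 10%.
    return 0.1;
  },
});
```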

### `tracePropagationTargets`

@@ -65,6 +72,37 @@ This Option replaces the non-standardized `tracingOrigins` option which was prev

</Alert>

### `strictTraceContinuation`

This must be a boolean value. Default is `false`. This option controls trace continuation from unknown 3rd party services that happen to be instrumented by a Sentry SDK.

If the SDK is able to parse an org ID from the configured DSN, it must be propagated as a baggage entry with the key `sentry-org`. Given a DSN of `https://1234@o1.ingest.us.sentry.io/1`, the org ID is `1`, based on `o1`.

Additionally, the SDK must be configurable with an optional `org: <org-id>` setting that takes precedence over the parsed value from the DSN. This option should be set when running a self-hosted version of Sentry or if a non-standard Sentry DSN is used, such as when using a local Relay.
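
A possible sketch of the DSN parsing (the function name and regular expression are assumptions, not a prescribed implementation):

```js
// Sketch: extract the org ID from a DSN host such as "o1.ingest.us.sentry.io".
function parseOrgIdFromDsn(dsn) {
  const host = new URL(dsn).hostname; // e.g. "o1.ingest.us.sentry.io"
  const match = host.match(/^o(\d+)\./); // leading "o<digits>." subdomain
  return match ? match[1] : undefined; // "1", or undefined for non-standard DSNs
}

parseOrgIdFromDsn("https://1234@o1.ingest.us.sentry.io/1"); // => "1"
```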

On incoming traces, the SDK must compare the `sentry-org` baggage value against its own value, parsed from the DSN or taken from the `org` setting. The trace is continued only if both match. If they do not match, neither the trace ID, the parent sampling decision, nor the baggage should be taken into account.
In this case, the SDK should behave as if it were the head of the trace and not consider any propagated values.

This behavior can be disabled by setting `strictTraceContinuation: false` in the SDK init call.
Initially, SDKs must introduce this option with a default value of `false`.
Once the majority of SDKs have introduced this option, we'll change the default value to `true` (in a major version bump), making it opt-out.

Regardless of whether `strictTraceContinuation` is set to `true` or `false`, if the SDK is either configured with an `org` or was able to parse it from the DSN, and an incoming trace contains an `org` value in the baggage that does not match the receiving SDK's, the trace is not continued.

Examples:

- baggage: `sentry-org: 1`, SDK config: `org: 1, strictTraceContinuation: false` -> continue trace
- baggage: `sentry-org: none`, SDK config: `org: 1, strictTraceContinuation: false` -> continue trace
- baggage: `sentry-org: 1`, SDK config: `org: none, strictTraceContinuation: false` -> continue trace
- baggage: `sentry-org: none`, SDK config: `org: none, strictTraceContinuation: false` -> continue trace
- baggage: `sentry-org: 1`, SDK config: `org: 2, strictTraceContinuation: false` -> start new trace

- baggage: `sentry-org: 1`, SDK config: `org: 1, strictTraceContinuation: true` -> continue trace
- baggage: `sentry-org: none`, SDK config: `org: 1, strictTraceContinuation: true` -> start new trace
- baggage: `sentry-org: 1`, SDK config: `org: none, strictTraceContinuation: true` -> start new trace
- baggage: `sentry-org: none`, SDK config: `org: none, strictTraceContinuation: true` -> continue trace
- baggage: `sentry-org: 1`, SDK config: `org: 2, strictTraceContinuation: true` -> start new trace
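
The decision logic illustrated by these examples could be sketched as follows (hypothetical helper; `incomingOrg` is the `sentry-org` baggage value and `sdkOrg` the SDK's own org from the `org` option or the DSN, either of which may be unknown):

```js
// Sketch of the continuation decision described above.
function shouldContinueTrace(incomingOrg, sdkOrg, strictTraceContinuation) {
  // If both sides know an org and the values differ, never continue the trace.
  if (incomingOrg !== undefined && sdkOrg !== undefined && incomingOrg !== sdkOrg) {
    return false;
  }
  // In strict mode, an org known on only one side also starts a new trace.
  if (strictTraceContinuation && (incomingOrg === undefined) !== (sdkOrg === undefined)) {
    return false;
  }
  return true;
}

shouldContinueTrace("1", "1", true); // => true (continue trace)
shouldContinueTrace(undefined, "1", true); // => false (start new trace)
```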

### `traceOptionsRequests`

This should be a boolean value. Default is `false`. When set to `true` transactions should be created for HTTP `OPTIONS` requests. When set to `false` NO transactions should be created for HTTP `OPTIONS` requests. This configuration is most valuable on backend server SDKs. If this configuration does not make sense for an SDK it can be omitted.
@@ -146,7 +184,7 @@ tree as well as the unit of reporting to Sentry.

## Sampling

Each transaction has a "sampling decision," that is, a boolean which dictates whether or not it should be sent to Sentry. This should be set exactly once during a transaction's lifetime, and should be stored in an internal `sampled` boolean.
Each transaction has a _sampling decision_, that is, a boolean which declares whether or not it should be sent to Sentry. This should be set exactly once during a transaction's lifetime, and should be stored in an internal `sampled` boolean.

There are multiple ways a transaction can end up with a sampling decision:

@@ -156,7 +194,7 @@ There are multiple ways a transaction can end up with a sampling decision:
- If the transaction has a parent, inheriting its parent's sampling decision
- Absolute decision passed to `startTransaction`

When there's the potential for more than one of these to come into play, the following precedence rules should apply:
If more than one option could apply, the following rules determine which takes precedence:

1. If a sampling decision is passed to `startTransaction` (`startTransaction({name: "my transaction", sampled: true})`), that decision will be used, regardless of anything else
2. If `tracesSampler` is defined, its decision will be used. It can choose to keep or ignore any parent sampling decision, or use the sampling context data to make its own decision or choose a sample rate for the transaction.
@@ -176,6 +214,7 @@ Transactions should be sampled only by `tracesSampleRate` or `tracesSampler`. Th
If defined, the `tracesSampler` callback should be passed a `samplingContext` object, which should include, at minimum:

- The `transactionContext` with which the transaction was created
- A float/double `parentSampleRate` which contains the sampling rate passed down from the parent
- A boolean `parentSampled` which contains the sampling decision passed down from the parent, if any
- Data from an optional `customSamplingContext` object passed to `startTransaction` when it is called manually
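
For illustration, a minimal `samplingContext` could be shaped as follows (field contents are examples only; actual names and additional default data vary by SDK and platform):

```js
const samplingContext = {
  // The context the transaction was started with.
  transactionContext: { name: "GET /users", op: "http.server" },
  // Sampling rate propagated from the parent (from `sentry-sample_rate`), if any.
  parentSampleRate: 0.7,
  // Sampling decision propagated from the parent (from `sentry-trace`), if any.
  parentSampled: true,
  // ...plus any `customSamplingContext` passed to a manual `startTransaction` call.
};
```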

@@ -185,6 +224,43 @@ Depending on the platform, other default data may be included. (For example, for

A transaction's sampling decision should be passed to all of its children, including across service boundaries. This can be accomplished in the `startChild` method for same-service children and using the `sentry-trace` header for children in a different service.

### Propagated Random Value

To improve the likelihood of capturing complete traces when backend services use a custom sample rate via `tracesSampler`, the SDK propagates the same random value used for sampling decisions across all services in a trace. This ensures consistent sampling decisions across a trace instead of generating a new random value for each service.

If no `tracesSampler` callback is used, the SDK fully inherits sampling decisions for propagated traces, and the presence of `sample_rand` in the DSC doesn't affect the decision. However, this behavior may change in the future.

The random value is set according to the following rules:

1. When an SDK starts a new trace, `sample_rand` is always set to a random number in the range of `[0, 1]`. This explicitly includes traces that aren't sampled, as well as when the `tracesSampleRate` is set to `0.0` or `1.0`.
2. It is _recommended_ to generate the random number deterministically, using the trace ID as the seed or source of randomness. The exact method by which the random number is created is implementation-defined and may vary between SDK implementations. See 4. on why this behavior is desirable.
3. On incoming traces, an SDK adopts the `sample_rand` value along with the rest of the DSC, overriding an existing value if needed.
4. If `sample_rand` is missing on an incoming trace, the SDK creates a new random number on the fly and propagates it from then on, based on the following rules (see the sketch below):
   1. If `sample_rate` and `sampled` are propagated, create `sample_rand` so that it adheres to the invariant. This means: for a decision of `True`, generate a random number in the half-open range `[0, rate)`; for a decision of `False`, generate a random number in the range `[rate, 1]`.
   2. If the sampling decision is missing, generate a random number in the range `[0, 1]`, like for a new trace.
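
A sketch of the backfill described in rule 4 (using a uniform random source for brevity; in practice the deterministic, trace-ID-seeded generation recommended in rule 2 is preferable):

```js
// Sketch of rule 4: backfill `sample_rand` for an incoming trace that lacks it,
// so that the invariant `sample_rand < sample_rate <=> sampled` holds.
function backfillSampleRand(propagatedSampleRate, propagatedSampled) {
  if (propagatedSampleRate !== undefined && propagatedSampled !== undefined) {
    const rate = parseFloat(propagatedSampleRate);
    return propagatedSampled === "true"
      ? Math.random() * rate // sampled: random number in [0, rate)
      : rate + Math.random() * (1 - rate); // not sampled: random number in [rate, 1)
  }
  // No propagated decision: generate like for a new trace.
  return Math.random();
}
```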

The SDK should always use the stored random number (`sentry-sample_rand`) for sampling decisions and should no longer rely on `math.random()` or similar functions in tracing code:

1. When the `tracesSampler` is invoked, this also applies to the return value of the `tracesSampler`. That is, the decision is `trace["sentry-sample_rand"] < tracesSampler(context)`
2. Otherwise, when the SDK is the head of a trace, this also applies to sampling decisions based on `tracesSampleRate`. That is, the decision is `trace["sentry-sample_rand"] < config.tracesSampleRate`
3. There is no longer a direct comparison with `math.random()` during the sampling process.
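
Put together, a sketch of the head-of-trace decision (names are illustrative, not a prescribed API):

```js
// Sketch: the sampling decision compares the stored `sentry-sample_rand`
// against the effective rate; no fresh call to `Math.random()` is involved.
function decideSampled(trace, config, samplingContext) {
  const sampleRand = parseFloat(trace["sentry-sample_rand"]);
  const rate = config.tracesSampler
    ? config.tracesSampler(samplingContext) // rule 1
    : config.tracesSampleRate; // rule 2 (head of trace)
  return sampleRand < rate;
}
```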

When using a `tracesSampler`, the proper way to inherit a parent's sampling decision is to return the parent's sample rate, instead of forcing the decision with a fixed value (for example, `1.0`). This way, Sentry can still extrapolate counts correctly.

```js
tracesSampler: ({ name, parentSampleRate }) => {
  // Inherit the trace parent's sample rate if there is one. Sampling is deterministic
  // for one trace, i.e. if the parent was sampled, we will be sampled too at the same
  // rate.
  if (typeof parentSampleRate === "number") {
    return parentSampleRate;
  }

  // Else, use default sample rate (replacing tracesSampleRate).
  return 0.5;
},
```

### Backpressure

If the SDK supports backpressure handling, the overall sampling rate needs to be divided by the `downsamplingFactor` from the backpressure monitor. See [the backpressure spec](/sdk/performance/backpressure/#downsampling) for more details.
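
For example (a sketch; `downsamplingFactor` as defined in the backpressure spec):

```js
// Sketch: the effective rate is the configured rate divided by the
// backpressure monitor's current downsampling factor.
function effectiveTracesSampleRate(tracesSampleRate, downsamplingFactor) {
  return tracesSampleRate / downsamplingFactor;
}

effectiveTracesSampleRate(0.5, 4); // => 0.125
```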