Releases: apollographql/router
v2.0.0-preview.6
v1.60.1-rc.0
v1.60.0
🚀 Features
Improve BatchProcessor observability (Issue #6558)
A new metric has been introduced to allow observation of how many spans are being dropped by a telemetry batch processor.
apollo.router.telemetry.batch_processor.errors - The number of errors encountered by exporter batch processors.
- name: One of apollo-tracing, datadog-tracing, jaeger-collector, otlp-tracing, zipkin-tracing.
- error: One of channel closed, channel full.
By observing the number of dropped spans, you can estimate which batch processor settings will work for you.
In addition, the log message for dropped spans will now indicate which batch processor is affected.
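For instance, if this metric reports channel full errors for the otlp-tracing processor, you can raise that processor's limits. A minimal sketch, assuming the batch_processor settings exposed under each tracing exporter; the values are illustrative, not recommendations:

telemetry:
  exporters:
    tracing:
      otlp:
        batch_processor:
          # Buffer more spans so bursts are queued instead of dropped.
          max_queue_size: 4096
          # Number of spans sent per export call.
          max_export_batch_size: 512
          # How often queued spans are flushed.
          scheduled_delay: 5s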
By @BrynCooke in #6558
🐛 Fixes
Improve performance of query hashing by using a precomputed schema hash (PR #6622)
The router now uses a simpler and faster query hashing algorithm with more predictable CPU and memory usage. This improvement is enabled by using a precomputed hash of the entire schema, rather than computing and hashing the subset of types and fields used by each query.
For more details on why these design decisions were made, please see the PR description.
By @IvanGoncharov in #6622
Truncate invalid error paths (PR #6359)
This fix addresses an issue where the router was silently dropping subgraph errors that included invalid paths.
According to the GraphQL Specification an error path must point to a response field:
If an error can be associated to a particular field in the GraphQL result, it must contain an entry with the key path that details the path of the response field which experienced the error.
If a subgraph error includes a path that can't be matched to a response field, the router now truncates the path to the nearest valid field path.
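As a hypothetical illustration (the field names and message are invented for this example, not taken from the PR), an error path extending past a real response field is now truncated instead of the whole error being dropped:

# Error as returned by the subgraph; "details" does not match a response field:
errors:
  - message: "Could not resolve reviews"
    path: ["topProducts", 0, "reviews", "details"]

# Error as forwarded by the router, truncated to the nearest valid field path:
errors:
  - message: "Could not resolve reviews"
    path: ["topProducts", 0, "reviews"]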
By @IvanGoncharov in #6359
Eagerly init subgraph operation for subscription primary nodes (PR #6509)
When subgraph operations are deserialized, typically from a query plan cache, they are not automatically parsed into a full document. Instead, each node needs to initialize its operation(s) prior to execution. With this change, the primary node inside SubscriptionNode is initialized in the same way as other nodes in the plan.
By @tninesling in #6509
Fix increased memory usage in sysinfo since Router 1.59.0 (PR #6634)
In version 1.59.0, Apollo Router started using the sysinfo crate to gather metrics about available CPUs and RAM. By default, that crate uses rayon internally to parallelize its handling of system processes. In turn, rayon creates a pool of long-lived threads.
In a particular benchmark on a 32-core Linux server, this caused resident memory use to increase by about 150 MB. This is likely a combination of stack space (which only gets freed when the thread terminates) and per-thread space reserved by the heap allocator to reduce cross-thread synchronization cost.
This regression is now fixed by:
- Disabling sysinfo's use of rayon, so the thread pool is not created and system process information is gathered in a sequential loop.
- Making sysinfo not gather that information in the first place, since the router does not use it.
By @SimonSapin in #6634
Optimize demand control lookup (PR #6450)
The performance of demand control in the router has been optimized.
Previously, demand control could reduce router throughput due to the extra processing required for scoring.
This fix improves performance by shifting more data to be computed at plugin initialization and consolidating lookup queries:
- Cost directives for arguments are now stored in a map alongside those for field definitions.
- All precomputed directives are bundled into a struct for each field, along with that field's extended schema type. This reduces 5 individual lookups to a single lookup.
- Response scoring was looking up each field's definition twice. This is now reduced to a single lookup.
By @tninesling in #6450
Fix missing Content-Length header in subgraph requests (Issue #6503)
A change in 1.59.0 caused the router to send requests to subgraphs without a Content-Length header, which would cause issues with some GraphQL servers that depend on that header.
This solves the underlying bug and reintroduces the Content-Length header.
By @nmoutschen in #6538
🛠 Maintenance
Remove the legacy query planner (PR #6418)
The legacy query planner has been removed in this release. In the previous release, router v1.58, it was no longer used by default but was still available through the experimental_query_planner_mode configuration key. That key is now removed.
Also removed are configuration keys which were only relevant to the legacy planner:
- supergraph.query_planning.experimental_parallelism: the new planner can always use available parallelism.
- supergraph.experimental_reuse_query_fragments: this experimental algorithm, which attempted to reuse fragments from the original operation while forming subgraph requests, is no longer present. Instead, by default, new fragment definitions are generated based on the shape of the subgraph operation.
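If your configuration still sets any of these keys, remove them when upgrading; a sketch of what to delete, with placeholder values:

# None of these keys are accepted by this release:
experimental_query_planner_mode: new        # remove
supergraph:
  query_planning:
    experimental_parallelism: auto          # remove
  experimental_reuse_query_fragments: true  # remove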
By @SimonSapin in #6418
Migrate various metrics to OTel instruments (PR #6476, PR #6356, PR #6539)
Various metrics using our legacy mechanism based on the tracing crate are migrated to OTel instruments.
By @goto-bus-stop in #6476, #6356, #6539
📚 Documentation
Add instrumentation configuration examples (PR #6487)
The router telemetry docs have new example configurations covering common use cases for selectors and conditions.
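As one sketch of the kind of configuration those examples cover (the instrument name acme.graphql.mutations is hypothetical, and the exact selector shape is an assumption based on the telemetry docs), a custom instrument can combine a condition with a selector-derived attribute:

telemetry:
  instrumentation:
    instruments:
      supergraph:
        acme.graphql.mutations:   # hypothetical custom instrument
          value: unit
          type: counter
          unit: operation
          description: "Count of mutation operations, by operation name"
          # Condition: only count operations whose kind is mutation.
          condition:
            eq:
              - operation_kind: string
              - mutation
          attributes:
            # Selector: attach the GraphQL operation name to each data point.
            graphql.operation.name:
              operation_name: string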
🧪 Experimental
Remove experimental_retry option (PR #6338)
The experimental_retry option has been removed due to its limited use and functionality during its experimental phase.
v1.60.0-rc.1
v2.0.0-preview.5
v1.59.2
Important
This release contains important fixes which address resource utilization regressions which impacted Router v1.59.0 and v1.59.1. These regressions were in the form of:
- A small baseline increase in memory usage; AND
- Additional per-request CPU and memory usage for queries which included references to abstract types with a large number of implementations
If you have enabled distributed query plan caching, this release contains changes which necessarily alter the hashing algorithm used for the cache keys. You should anticipate additional cache regeneration cost when updating between these versions while the new hashing algorithm comes into service.
🐛 Fixes
Improve performance of query hashing by using a precomputed schema hash (PR #6622)
The router now uses a simpler and faster query hashing algorithm with more predictable CPU and memory usage. This improvement is enabled by using a precomputed hash of the entire schema, rather than computing and hashing the subset of types and fields used by each query.
For more details on why these design decisions were made, please see the PR description.
By @IvanGoncharov in #6622
Fix increased memory usage in sysinfo since Router 1.59.0 (PR #6634)
In version 1.59.0, Apollo Router started using the sysinfo crate to gather metrics about available CPUs and RAM. By default, that crate uses rayon internally to parallelize its handling of system processes. In turn, rayon creates a pool of long-lived threads.
In a particular benchmark on a 32-core Linux server, this caused resident memory use to increase by about 150 MB. This is likely a combination of stack space (which only gets freed when the thread terminates) and per-thread space reserved by the heap allocator to reduce cross-thread synchronization cost.
This regression is now fixed by:
- Disabling sysinfo's use of rayon, so the thread pool is not created and system process information is gathered in a sequential loop.
- Making sysinfo not gather that information in the first place, since the router does not use it.
By @SimonSapin in #6634
v1.60.0-rc.0
v1.59.2-rc.0
v2.0.0-preview.4
v1.59.1
Important
This release was impacted by a resource utilization regression which was fixed in v1.59.2. See the release notes for that release for more details. As a result, we recommend using v1.59.2 rather than v1.59.1 or v1.59.0.
🐛 Fixes
Fix transmitted header value for Datadog priority sampling resolution (PR #6017)
The router now transmits correct values of x-datadog-sampling-priority to downstream services.
Previously, an x-datadog-sampling-priority of -1 was incorrectly converted to 0 for downstream requests, and 2 was incorrectly converted to 1. When propagating to downstream services, this resulted in values of USER_REJECT being incorrectly transmitted as AUTO_REJECT.
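For reference, the standard Datadog sampling priority values behind this fix are shown below; this mapping reflects Datadog's documented semantics rather than text from this release note:

# x-datadog-sampling-priority values
USER_REJECT: -1   # a user rule or manual decision dropped the trace
AUTO_REJECT: 0    # the sampler dropped the trace
AUTO_KEEP: 1      # the sampler kept the trace
USER_KEEP: 2      # a user rule or manual decision kept the trace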
Enable accurate Datadog APM metrics (PR #6017)
The router supports a new preview feature, the preview_datadog_agent_sampling option, which enables sending all spans to the Datadog Agent so that APM metrics and views are accurate.
Previously, the sampler option in telemetry.exporters.tracing.common.sampler wasn't Datadog-aware. To get accurate Datadog APM metrics, all spans must be sent to the Datadog Agent with a psr or sampling.priority attribute set appropriately to record the sampling decision.
The preview_datadog_agent_sampling option enables accurate Datadog APM metrics. It should be used when exporting to the Datadog Agent, whether via OTLP or Datadog-native.
telemetry:
  exporters:
    tracing:
      common:
        # Only 10 percent of spans will be forwarded from the Datadog agent to Datadog. Experiment to find a value that is good for you!
        sampler: 0.1
        # Send all spans to the Datadog agent.
        preview_datadog_agent_sampling: true
Using these options can decrease your Datadog bill, because you will be sending only a percentage of spans from the Datadog Agent to Datadog.
Important
- Users must enable preview_datadog_agent_sampling to get accurate APM metrics. Users that have been using recent versions of the router will have to modify their configuration to retain full APM metrics.
- The router doesn't support in-agent ingestion control.
- Configuring traces_per_second in the Datadog Agent won't dynamically adjust the router's sampling rate to meet the target rate.
- Sending all spans to the Datadog Agent may require that you tweak the batch_processor settings in your exporter config. This applies to both OTLP and Datadog native exporters; see the sketch after this list.
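For example, when sending all spans through the Datadog native exporter you might enlarge the processor's queue; the batch_processor field names below follow the router's exporter settings, while the values are illustrative assumptions:

telemetry:
  exporters:
    tracing:
      datadog:
        enabled: true
        batch_processor:
          # Buffer more spans so the full, unsampled volume doesn't overflow the queue.
          max_queue_size: 8192
          # Flush more frequently to keep the queue drained.
          scheduled_delay: 2s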
See the updated Datadog tracing documentation for more information on configuration options and their implications.
Fix non-parent sampling (PR #6481)
When the user specifies a non-parent sampler, the router should ignore the information from upstream and use its own sampling rate.
The following configuration would not work correctly:
exporters:
  tracing:
    common:
      service_name: router
      sampler: 0.00001
      parent_based_sampler: false
With this configuration, all spans were being sampled despite the 0.00001 sample rate.
This is now fixed and the router will correctly ignore any upstream sampling decision.
By @BrynCooke in #6481