Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic ttrpc and bridge span creation #1379

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

helsaawy
Copy link
Contributor

@helsaawy helsaawy commented Apr 27, 2022

(Host) Bridge automatically creates spans for RPCs.
Update ttrpc interceptor to add request and response structs to span attributes.
Update gcs bridge to create spans for outgoing calls automatically.

Replaced redundant cmd/containerd-shim-runhcs-v1.(*Service)
and internal/gcs spans with .Trace log entries, since they include
all the same (or less) information as those automatically generated by
the ttrp interceptor and (*bridge).RPC

Relies on PRs:

This PR gets rid of the cmd/containerd-shim-runhcs-v1.(*Service) spans (eg, "State", "Create")
and instead only exports the TTRPC spans (eg, "containerd.task.v2.Task.State", "containerd.task.v2.Task.Create")

@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 4a510c9 to 01b6418 Compare May 9, 2022 18:06
@helsaawy helsaawy changed the title ttrpc and bridge span update Automatic ttrpc and bridge span creation May 9, 2022
@helsaawy helsaawy marked this pull request as ready for review May 9, 2022 18:08
@helsaawy helsaawy requested a review from a team as a code owner May 9, 2022 18:08
@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 01b6418 to 668fd6e Compare May 9, 2022 19:31
@helsaawy helsaawy force-pushed the he/trace-span-creation branch 2 times, most recently from 74fd8aa to aadc6de Compare May 9, 2022 20:04
@jterry75
Copy link
Contributor

In general I like it! The one comment that I have is I never loved the default name for gRPC entry spans and in our case here we are going to do "opengcs::bridge::" + header.Type.String(). While its not the worst, it doesn't intuitively scream to mean what "header.Type.String()" would be when creating my queries. I suppose over time that is just learned behavior though. I do love that all entry/exit spans are automatic which means they are actually measurable! Super cool

@helsaawy
Copy link
Contributor Author

In general I like it! The one comment that I have is I never loved the default name for gRPC entry spans and

Do you mean runtime.v1alpha2.RuntimeService.PodSandboxStatus, containerd.task.v2.Task.Start, and co?

in our case here we are going to do "opengcs::bridge::" + header.Type.String(). While its not the worst, it doesn't intuitively scream to mean what "header.Type.String()" would be when creating my queries.

Yeah ... we are planning to clean up the bridge protocol code (eventually). But it is cludgy to say the least.
Ideally we would define the protocols and message cleanly enough that we dont have to resort to "opengcs::bridge::" + header.Type.String() or "gcs::bridge::RPC::"+proc.String(), but this will have to hold over for now.

Unfortunately, the spans on the host will have different names than spans on the Linux guest, which resolve Type.String from:

case MiNone:
return "None"
case ComputeSystemCreateV1:
return "ComputeSystemCreateV1"
case ComputeSystemStartV1:
return "ComputeSystemStartV1"
case ComputeSystemShutdownGracefulV1:
return "ComputeSystemShutdownGracefulV1"
case ComputeSystemShutdownForcedV1:
return "ComputeSystemShutdownForcedV1"
case ComputeSystemExecuteProcessV1:
return "ComputeSystemExecuteProcessV1"
case ComputeSystemWaitForProcessV1:
return "ComputeSystemWaitForProcessV1"
case ComputeSystemSignalProcessV1:
return "ComputeSystemSignalProcessV1"
case ComputeSystemResizeConsoleV1:
return "ComputeSystemResizeConsoleV1"
case ComputeSystemGetPropertiesV1:
return "ComputeSystemGetPropertiesV1"
case ComputeSystemModifySettingsV1:
return "ComputeSystemModifySettingsV1"
case ComputeSystemNegotiateProtocolV1:
return "ComputeSystemNegotiateProtocolV1"
case ComputeSystemDumpStacksV1:
return "ComputeSystemDumpStacksV1"
case ComputeSystemDeleteContainerStateV1:
return "ComputeSystemDeleteContainerStateV1"
case ComputeSystemResponseCreateV1:
return "ComputeSystemResponseCreateV1"
case ComputeSystemResponseStartV1:
return "ComputeSystemResponseStartV1"
case ComputeSystemResponseShutdownGracefulV1:
return "ComputeSystemResponseShutdownGracefulV1"
case ComputeSystemResponseShutdownForcedV1:
return "ComputeSystemResponseShutdownForcedV1"
case ComputeSystemResponseExecuteProcessV1:
return "ComputeSystemResponseExecuteProcessV1"
case ComputeSystemResponseWaitForProcessV1:
return "ComputeSystemResponseWaitForProcessV1"
case ComputeSystemResponseSignalProcessV1:
return "ComputeSystemResponseSignalProcessV1"
case ComputeSystemResponseResizeConsoleV1:
return "ComputeSystemResponseResizeConsoleV1"
case ComputeSystemResponseGetPropertiesV1:
return "ComputeSystemResponseGetPropertiesV1"
case ComputeSystemResponseModifySettingsV1:
return "ComputeSystemResponseModifySettingsV1"
case ComputeSystemResponseNegotiateProtocolV1:
return "ComputeSystemResponseNegotiateProtocolV1"
case ComputeSystemResponseDumpStacksV1:
return "ComputeSystemResponseDumpStacksV1"
case ComputeSystemNotificationV1:
return "ComputeSystemNotificationV1"
default:
return strconv.FormatUint(uint64(mi), 10)
}

@helsaawy
Copy link
Contributor Author

Rebased to resolve merge conflicts

@@ -248,8 +228,9 @@ func (c *Container) Wait() error {
}

func (c *Container) waitBackground() {
ctx, span := oc.StartSpan(context.Background(), "gcs::Container::waitBackground")
_, span := oc.StartSpan(context.Background(), "gcs::Container::waitBackground")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we keeping this span but not the others on a Container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the other Container operations eventually end up with a gcs and then bridge call, so they have spans being created for them.
This is the only one that doesn't. Plus this is a long running call that is launched asynchronously, so we dont have any other way of tracking it, which i felt made it worth of a dedicated span

Comment on lines +243 to +244
var err error
_, span := oc.StartSpan(context.Background(), "gcs::Process::waitBackground")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here. Why are we starting a span here?

Comment on lines 419 to 420
ctx := r.Context
log.G(ctx).Trace("opengcs::bridge::modifySettingsV2")
Copy link
Contributor

@katiewasnothere katiewasnothere Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here. Should this and dumpStacksV2 and deleteContainerStateV2 use logEntry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumpStacksV2 doesnt apply to a particular container, right?
but yes, deleteContainerStateV2 should


// Duration format converts a [time.Duration] fields to an appropriate encoding.
//
// Default is [DurationFormatSeconds].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be milliseconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, using milliseconds is a better representation? I figured since some of the OCI requests are on the order of seconds, seconds was appropriate, but I can swap out milliseconds if that makes more sense

if s.isSandbox {
span.AddAttributes(trace.StringAttribute("pod-id", s.tid))
}
s.logEntry(ctx).Trace("State")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we're no longer logging request information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, we now add the entire request to the auto-generated span, instead of manually adding bits of the request to the individual spans we create here

if err = unmarshal(req); err == nil {
span.AddAttributes(trace.StringAttribute("request", log.Format(ctx, req)))
}

@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 99417e1 to 081c6d9 Compare March 15, 2023 19:26
@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 081c6d9 to 66588a1 Compare April 4, 2023 21:40
@helsaawy
Copy link
Contributor Author

helsaawy commented Apr 4, 2023

rebased

@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 66588a1 to 22fee56 Compare July 19, 2023 19:53
Comment on lines 126 to 189
defer func() {
if err == nil {
span.AddAttributes(trace.StringAttribute("response", log.Format(ctx, resp)))
}
setSpanStatus(span, err)
}()
if len(o.attrs) > 0 {
span.AddAttributes(o.attrs...)
}

return method(ctx, unmarshal)
return method(ctx, func(req interface{}) (err error) {
if err = unmarshal(req); err == nil {
span.AddAttributes(trace.StringAttribute("request", log.Format(ctx, req)))
}
return err
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about always logging the request/response:

  • These could be quite large (compared to a normal span) in many cases, which would bloat logging data or lead to worse performance.
  • These could contain sensitive information that should not be logged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll validate/update the scrubbing for the requests that contain spans, but we already log most components of the pieces of the requests/responses in the span, but we manually add the fields:

func (s *service) Create(ctx context.Context, req *task.CreateTaskRequest) (resp *task.CreateTaskResponse, err error) {
ctx, span := oc.StartSpan(ctx, "Create")
defer span.End()
defer func() {
if resp != nil {
span.AddAttributes(trace.Int64Attribute("pid", int64(resp.Pid)))
}
oc.SetSpanStatus(span, err)
}()
span.AddAttributes(
trace.StringAttribute("tid", req.ID),
trace.StringAttribute("bundle", req.Bundle),
// trace.StringAttribute("rootfs", req.Rootfs), TODO: JTERRY75 -
// OpenCensus doesnt support slice like our logrus hook
trace.BoolAttribute("terminal", req.Terminal),
trace.StringAttribute("stdin", req.Stdin),
trace.StringAttribute("stdout", req.Stdout),
trace.StringAttribute("stderr", req.Stderr),
trace.StringAttribute("checkpoint", req.Checkpoint),
trace.StringAttribute("parentcheckpoint", req.ParentCheckpoint))
if s.isSandbox {
span.AddAttributes(trace.StringAttribute("pod-id", s.tid))
}

The only exception is Exec process, which carries the OCI spec, and that we log latter on regardless.
Also, we log all requests/responses over the bridge (and log them again on the Linux GCS side):

brdg.log.WithFields(logrus.Fields{
"payload": string(b),
"type": typ.String(),
"message-id": id}).Debug("bridge send")

entry.WithField("message", s).Debug("request read message")

brdg.log.WithFields(logrus.Fields{
"payload": string(b),
"type": typ.String(),
"message-id": id}).Debug("bridge receive")

log.G(resp.ctx).WithField("message", string(responseBytes)).Debug("request write response")

Performance wise, the formatting uses google.golang.org/protobuf/encoding/protojson, which should be quite performant:

if m, ok := v.(proto.Message); ok {
// use canonical JSON encoding for protobufs (instead of [encoding/json])
// https://protobuf.dev/programming-guides/proto3/#json
return protojson.MarshalOptions{
AllowPartial: true,
// protobuf defaults to camel case for JSON encoding; use proto field name instead (snake case)
UseProtoNames: true,
}.Marshal(m)
}

@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 22fee56 to 88090d0 Compare August 3, 2023 19:00
@msscotb msscotb assigned kiashok and unassigned kevpar Jan 24, 2024
ttrpc interceptor automatically formats (as JSON) and adds
server requests and responses as span attributes.

Host and guest bridge automatically creates spans for RPCs.

Replaced redundant `cmd/containerd-shim-runhcs-v1/service.go.(Service)`
`internal/gcs`, and `interna\guest\bridge` spans with `.Trace` log
entries, since they include
all the same (or less) information as those automatically generated by
the ttrp interceptor and `(*bridge).RPC`

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Allow enabling adding the request/response to the ttrpc span as
attributes, since there are other customers of our code.

Allow scrubbing the `proto.Message` request and response messages in the
ttrpc server interceptor by specifying an arbitrary function to update
a clone of the payloads.

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
@helsaawy helsaawy force-pushed the he/trace-span-creation branch from 88090d0 to d455f6d Compare January 26, 2024 18:18
Copy link
Contributor

@kiashok kiashok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@helsaawy helsaawy marked this pull request as draft February 28, 2024 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants