Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove section on Character Encoding #389

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mprpic
Copy link
Contributor

@mprpic mprpic commented Feb 5, 2025

This section contained duplicate information that is already present for each component in the "Rules" section and included confusing sentences that applied too generally to certain purl components.

Each component's set of rules should clearly define how the content of that component should (or should not) be encoded. These improvements are being made in PRs such as #383 (more will be coming for other component sections).

These changes were discussed in the Feb 5, 2025 purl community meeting.

This section contained duplicate information that is already present for
each component in the "Rules" section and included confusing sentences
that applied too generally to certain purl components.

Each component's set of rules should clearly define how the content of
that component should (or should not) be encoded. These improvements are
being made in PRs such as #383 (more will be coming for other component
sections).
Copy link
Member

@johnmhoran johnmhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good @mprpic . 👍

pombredanne added a commit that referenced this pull request Feb 12, 2025
We do not want to update this here.

See-also: #389
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this pull request Feb 12, 2025
Streamline the definition of the scheme,
including encoding of colon between scheme and type.

Add test for colon between scheme and type

Move non-core scheme realted discussion to the new FAQ

Suggested-by: @gernot-h
Reference: #376
Reference: #39
See-also: #389
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@@ -231,42 +234,6 @@ The rules for each component are:
- The ``subpath`` must be interpreted as relative to the root of the package


Character encoding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mprpic all these could be moved to the FAQ maybe? We are losing details otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're actually losing anything. Most of the information is present for each of the purl components in the sections above. For example, the type section notes:

  • The type cannot contain spaces
  • The type must NOT be percent-encoded

while we have the same information below:

  • the type must NOT be encoded and must NOT contain separators

The guidance on how to encode certain character is, in my opinion, just confusing since those specific characters when used as field separators should not be encoded. When used as values, users should just refer to the percent-encoding rules linked in the wiki page or the RFC. Also, I don't think anyone out there will be manually encoding these values anyway, most people rely on things like Python's urllib.parse.

If you think some of this information is still important to retain, can you point out which exact parts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When used as values, users should just refer to the percent-encoding rules linked in the wiki page or the RFC.

This is incorrect. The RFC does not specify when which characters are supposed to be encoded in a PURL, and the rules are different for PURL than for URL (eg @).

Also, I don't think anyone out there will be manually encoding these values anyway, most people rely on things like Python's urllib.parse.

urllib.parse can decode PURLs correctly, but it cannot encode a canonical PURL. packageurl-python uses urllib.parse and a string replace, which is right most of the time. It fails if the package name contains a slash, as proposed in the pkg:go PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When used as values, users should just refer to the percent-encoding rules linked in the wiki page or the RFC.

@mprpic You make good points, but I have a slightly different take on the situation and our goals.

  • Re your type example, you can see what I've added to our new faq.rst for type in PR Update 'type' component in core spec and add faq.rst #378 #383. I'm currently updating that PR and will also review the test-suite-data.json tests to update/add type-related tests as needed (which we want to do as part of all of our core spec updating efforts).

  • If by "wiki page" you mean https://github.com/package-url/purl-spec/wiki/FAQ, that hasn't been updated since late 2017. The whole point of the faq.rst we've begun working on is to provide a place to flesh out encoding and other relevant details, including addressing the questions raised and differing views expressed by users in the numerous open issues and PRs. We can add sections in the faq.rst as part of each of the updating PRs. (Note that the component-by-component approach we've started with is good but can't cover everything we need to address -- see the encoding gSheet for some examples.)

  • Simply referring a user to that wiki or the RFC (I assume you mean RFC 3986) is not ideal -- after all, one reason for the issue/PR backlog is the number of encoding questions posed by members of the community. Providing details that address actual and reasonably anticipated questions in faq.rst would be far more helpful, particularly since even RFC 3986 is somewhat ambiguous and open to conflicting interpretations. (Section 2.2, for example, is helpful re encoding but remains somewhat ambiguous, e.g., check the conflicting use of "must" and "should":

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.  . . .  URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
  • Re using, e.g., urllib.parse quote(), I just ran a group of 32 reserved and other non-alphanumeric characters through quote() and all but -, ., /, _ and ~ returned percent-encoded values. This suggests that unless a character is a member of the unreserved group or defined as a purl separator (or "delimiter" to use RFC 3986's terminology), that character must be percent-encoded. And clarity would benefit us all. ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the attached ascii-encoding-examples-2025-02-13.txt for an example of the output.
ascii-encoding-examples-2025-02-13.txt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When used as values, users should just refer to the percent-encoding rules linked in the wiki page or the RFC.

This is incorrect. The RFC does not specify when which characters are supposed to be encoded in a PURL, and the rules are different for PURL than for URL (eg @).

https://datatracker.ietf.org/doc/html/rfc3986#section-2.1 spells out how to encode every character, no? And %40 is the standard way to encode the @ sign, so why do we need it spelled out in the purl spec?

Also, I don't think anyone out there will be manually encoding these values anyway, most people rely on things like Python's urllib.parse.

urllib.parse can decode PURLs correctly, but it cannot encode a canonical PURL. packageurl-python uses urllib.parse and a string replace, which is right most of the time. It fails if the package name contains a slash, as proposed in the pkg:go PR.

There is a single replacement that I can see here:

https://github.com/package-url/packageurl-python/blob/main/src/packageurl/__init__.py#L64

and it has been a point of confusion for quite a while:

package-url/packageurl-python#152 (comment)

So why not just let standard tools handle this instead of the magic that is currently in place? :)


@johnmhoran By wiki page I meant https://en.wikipedia.org/wiki/Percent-encoding, but I see I left that out of the paragraph I added at the top of the Rules section. I added it now. I'm also happy to move some of the remove content into an FAQ, I just don't think it's worth repeating information that is made clear in other sources. So if there is something in particular that you think should be further discussed in the FAQ, let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mprpic .

  • Re the FAQs, since we're doing the component updating component-by-component atm, it makes most sense for whoever handles a particular component PR add to the FAQs as that person sees fit. No need for you to do anything on that score imho since you're just removing the "Character encoding" section in this PR.

  • A purl user should be able to easily determine, for example, when/where in a purl string a { or : is permitted and when/where it MUST be percent-encoded. Neither the core spec nor RFC 3986 currently enable that. We can revise the core spec to provide language that clearly states the intended meaning of the rules and requirements reflected in the spec. I'm not recommending the substance of the following, but I wonder whether we could use this sort of approach --

At or near the top of the "Rules for each purl component" section we could add something like this:

A permitted character (as defined below for each purl component) MUST be percent-encoded unless it is:

(1) an Unreserved Character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3) or

(2) expressly defined in this PURL-SPECIFICATION.rst as

    (a) a purl separator (and only when used as such a separator) or

    (b) a permitted character in one or more specifically identified purl components.

Each of the seven component subsections would contain additional details and exceptions, clearly identify which characters are permitted (as several already do), and state that no other characters are permitted for that component.

BTW, back in 2018 @jgustie and @pombredanne discussed a similar approach to percent-encoding. See Clarifications, in particular about version encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://datatracker.ietf.org/doc/html/rfc3986#section-2.1 spells out how to encode every character, no? And %40 is the standard way to encode the @ sign, so why do we need it spelled out in the purl spec?

The RFC spells out how to encode any character, including characters that are not to be encoded, and the rules for when characters are to be encoded are different for PURL. For example, because URI has no concept of a package version, it specifies that @ signs in paths do not require escaping.

So why not just let standard tools handle this instead of the magic that is currently in place? :)

There is no one standard tool for this.

Python's urllib.parse defaults to not escaping /, which must be escaped in many cases.

The set of characters encoded varies by implementations, and sometimes a single implementation provides multiple methods with different behaviors.

Python's urllib.parse.quote escapes [0, 0x2c], [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's Uri.EscapeDataString escapes [0, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's Uri.EscapeUriString escapes [0, 0x20], 0x22, 0x25, 0x3c, 0x3e, 0x5c, 0x5e, 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's HttpUtility.UrlEncode escapes [0, 0x1f], 0x20 as +, [0x22, 0x27], [0x2b, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7f].
Ruby's CGI::escape escapes [0, 0x1f], 0x20 as +, [0x21, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
ECMAScript's escape escapes [0, 0x29], 0x2c, [0x3a, 0x3f], [0x5b, 0x5e], 0x60, [0x7b, 0x7f]
ECMAScript's encodeURI escapes [0, 0x20], 0x22, 0x25, 0x3c, 0x3e, [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f
ECMAScript's encodeURIComponent escapes [0, 0x20], [0x22, 0x26], [0x2b, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f

and it has been a point of confusion for quite a while:

The confusion is because

  1. RFC3986, the document where percent encoding is defined, is not about percent encoding, and only introducing percent encoding as a simple and consistent means to encode characters in URIs.
  2. The Wikipedia page on percent encoding has only one paragraph at the top about percent encoding and the rest of the page is about URIs.
  3. The PURL spec links to the Wikipedia page as the definition of percent encoding and readers likely skip the part about percent encoding and jump straight to the part about URI encoding rules because the part about percent encoding looks like just an introductory paragraph.
  4. The PURL spec gives confusing rules like "[...] ':' characters [...] may need to be encoded elsewhere[.] [...] It is unambiguous unencoded elsewhere" (ie it does not need to be encoded, contradicting the vague "may need to be encoded" from the line before) and makes misleading statements like "It is OK to percent-encode purl components otherwise" (it's not if you're building a canonical PURL).
  5. Most PURL implementations have at least one percent encoding bug.

Most readers don't realize that the set of encoded characters even when building a URI is different for each component of the URI (the "reserved" characters in RFC3986 language are only sometimes encoded).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC spells out how to encode any character, including characters that are not to be encoded, and the rules for when characters are to be encoded are different for PURL. For example, because URI has no concept of a package version, it specifies that @ signs in paths do not require escaping.

What does "it specifies that @ signs in paths do not require escaping." mean here, specifically "paths do not require escaping"? I assume that the at sign should not be encoded when used to separate the name and version component, but it should be encoded when used as part of the version or any qualifier value for example. That is in line with the guidance of percent-encode the values of each field, but ignore encoding of separators as noted in the "how to build/parse purl" sections.

Most readers don't realize that the set of encoded characters even when building a URI is different for each component of the URI (the "reserved" characters in RFC3986 language are only sometimes encoded).

Do you not feel we should make this a bit more easy to accomplish rather than requiring some obscure rules for encoding this, but not that? If I as a user have to read through the nitty-gritty parts of the spec and dig into the source code of the purl library of my choice to figure out if I need to encode or not encode a certain value using some special rules, I feel like I'm not set up for success. My intention behind simplifying this was to remove complexity, but I see there are a lot of opinions for keeping it as is.

How can we solve the problem of character encoding for purl once and for all? If you have time to join #377 (comment) tomorrow, we could try to find a better solution perhaps than my change here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "it specifies that @ signs in paths do not require escaping." mean here, specifically "paths do not require escaping"?

@ is 0x40, which is encoded by 5/8 of the above example implementations of URL encoding. That means if you build a PURL using one of the other three implementations you need a special case for encoding @. For ECMAScript, if you use encodeURIComponent it encodes @, but it also encodes /, which is correct for names but incorrect for namespaces. encodeURI is better for the namespace because it doesn't encode /, but it doesn't encode @. : is 0x3a, which is encoded by 6/8 when it should never be encoded in a canonical PURL with a valid package type.

Do you not feel we should make this a bit more easy to accomplish rather than requiring some obscure rules for encoding this, but not that?

Removing the encoding section certainly does not make it easier to implement encoding. This makes the PURL spec underspecified on which characters are to be encoded when, making it impossible to implement PURL canonicalization. PURL needs more clarity around character encoding rules for canonicalization, not less.

The encoding section needs to be rewritten, and possibly the information in the encoding section needs to be moved to the "how to build" section instead. I think that the WHATWG URL spec does a much better job of defining character encoding than RFC3986, which does a much better job than PURL. RFC3986 defines a set of characters that are not escaped, a set of characters that may be escaped, and then defines the set of characters that are escaped for a given URI component by specifying the characters that are not escaped. WHATWG URL defines a constructive hierarchy of named sets of characters to be encoded and references those sets when describing how to build the URL. If the PURL spec said something like "The qualifier percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+0023 (#), U+0026(&), U+003C (<), U+003E (>), U+003F (?)" (based on WHATWG URL's query percent-encode set) and "Percent encode the qualifier value using the qualifier percent-encode set", or even just listed the characters without naming the final set, there wouldn't be any confusion about whether colons are supposed to be encoded in qualifier values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL encoding
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

4 participants