Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix colon encoding in tests again #416

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

matt-phylum
Copy link
Contributor

About colons, the spec says:

- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere

Therefore, the colon should not be encoded. I've read #364 and I can't understand the reasoning for changing the unencoded colon to be encoded. #364 also quotes the spec and references another issue, contradicting its own implementation.

9/14 implementations agree that this colon is not to be encoded.

…ing"

This reverts commit e2c8eec, reversing
changes made to 2d06019.
@johnmhoran
Copy link
Member

johnmhoran commented Mar 10, 2025

@matt-phylum @pombredanne These test changes are putting the cart before the horse. Let's first clarify the qualifiers component section and the related "Character encoding" section (see #398), and we can then simply apply the clarified rules to this and other tests. That would address not just the long-discussed colon but all other characters that are permitted and need to be addressed. For example, if the colon MUST be unencoded throughout the purl, we can say so, clearly.

@johnmhoran
Copy link
Member

@pombredanne What did you mean by the language @matt-phylum cites at the top: It is unambiguous unencoded everywhere? The colon separator, i.e., the colon when (and only when) used as that defined separator? Or the colon, wherever used in the purl? Or something else? Why refer at all to "the ':' scheme and type separator" if we mean the literal colon character wherever used?

@matt-phylum
Copy link
Contributor Author

This PR is just changing the test case back to be consistent with the existing documentation and other tests. The test used to be this way for 7 years from 2017-11-16 when the test suite was introduced until 2025-02-05. As it currently stands, you cannot pass the tests without a special case to encode : in the version but not in the name or qualifiers, which is not part of the spec.

The colon character is unambiguous unencoded everywhere. Not the colon separator. Colons only have meaning when used after pkg.

The wording of these rules has always been problematic. The first rule says that the separator must not be encoded, but that the : may need to be encoded when it's not a separator, and then second rule immediately clarifies that the case where the character :, not the separator, requires encoding is never.

Compare to the other rules:

- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
does not need to and must NOT be percent-encoded. It is unambiguous unencoded
everywhere

This one is technically wrong. There is one time when slashes are ambiguous and do need to be encoded: when they are within the name. However, it follows the same pattern of referencing the character / when not used as a separator by calling it the separator. "It is unambiguous unencoded everywhere" means slashes within qualifiers are not encoded, and the tests agree.

- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
- the '=' ``qualifiers`` key/value separator must NOT be encoded
- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere

For these ones it is more obvious what's going on. @ is only the version separator when it is used as the version separator, and it must never be encoded when it is the version separator. However, the rule says that the @ character must be encoded as %40 by saying that the version separator character must be encoded when it's not the version separator, and the same pattern is used for the other characters that must be encoded.

These tests are in contradiction to the change being reverted:

{
"description": "Hugging Face model with staging endpoint",
"purl": "pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https://hub-ci.huggingface.co",
"canonical_purl": "pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https://hub-ci.huggingface.co",
"type": "huggingface",
"namespace": "microsoft",
"name": "deberta-v3-base",
"version": "559062ad13d311b87b2c455e67dcd5f1c8f65111",
"qualifiers": {"repository_url": "https://hub-ci.huggingface.co"},
"subpath": null,
"is_invalid": false
},

The colon in the qualifier is not encoded.

{
"description": "cpan module name are case sensitive",
"purl": "pkg:cpan/URI::PackageURL@2.11",
"canonical_purl": "pkg:cpan/URI::PackageURL@2.11",
"type": "cpan",
"namespace": null,
"name": "URI::PackageURL",
"version": "2.11",
"qualifiers": null,
"subpath": null,
"is_invalid": false
},

The colon in the name is not encoded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants