-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PNE-241] Data manager release #949
Conversation
36e6261
to
d4ce1df
Compare
src/citrine/resources/gemtables.py
Outdated
@@ -244,7 +246,12 @@ def get_by_build_job(self, job: Union[JobSubmissionResponse, UUID], *, | |||
The table built by the specified job. | |||
|
|||
""" | |||
status = _poll_for_job_completion(self.session, self.project_id, job, timeout=timeout) | |||
# TODO: Should this use the project or team version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this out here in case I forget to ask Pablo before the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly required, but if it happens that you have the teamId it would be nice, I'm trying to move the jobs API to the team level.
Datasets and all related objects now belong to teams, so new endpoints have been introduced into the platform to support this. As such, all such calls need to be updated. The result is a bunch of `ProjectCollection` methods are being deprecated in favor of their `TeamCollection` equivalents, and many methods which accepted a project or project_id must be moved to a team/team_id. The vast majority of behaviors will remain unchanged when using the deprecated methods, until we bump the major version of the SDK and drop the old endpoints. The major exception is that creating a dataset on a project endpoint will register it to a team, so it will be omitted when you list your datasets by project.
d4ce1df
to
c7117a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot of changes....
def __init__(self, | ||
*args, | ||
session: Session = None, | ||
dataset_id: Optional[UUID] = None, | ||
team_id: Optional[UUID] = None, | ||
project_id: Optional[UUID] = None): | ||
# Handle positional arguments for backward compatibility | ||
args = _pad_positional_args(args, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically, I've handled this with a signature like:
def __init__(self,
project_id: Optional[UUID] = None):
deprecated_dataset_id: Optional[UUID] = None,
deprecated_session: Session = None,
*,
session: Session = None,
dataset_id: Optional[UUID] = None,
team_id: Optional[UUID] = None,
Because this allows for making sure one or the other, not both, are expressed. If you are concerned about detecting passed None
s, that can be accomplished using a sentinel value: https://github.com/CitrineInformatics/gemd-python/blob/26c6be864b1a8eed2c2a9a2308da03db4588b650/gemd/entity/case_insensitive_dict.py#L5
I see how very similar operations are repeated, so following that pattern might require a fair bit of typing. It also has the added value of having the deprecation warning firing from the correct layer of the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we had to apply this to a bunch of classes, I was more worried about excessive verbosity as well as copying the same code in a bunch of places, making any changes to be applied broadly a huge pain. Given the deprecation warnings it triggers, I'm not too worried about people providing both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further reflection, I think team_id is not Optional. Scenarios where no team_id is passed result in either errors or deprecation warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think team_id is not Optional
Agreed! I intended to drop the Optional
typing, but seems I missed some spots (or maybe completely forgot). Good catch!
self.project_id = project_id or args[0] | ||
self.dataset_id = dataset_id or args[1] | ||
self.session = session or args[2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on non-None arguments to evaluate to True in boolean context is counter to best practice:
https://peps.python.org/pep-0008/#programming-recommendations
These should be the more verbose
self.project_id = project_id if project_id is not None else args[0]
or even the full block. Unlikely to trigger a bug in real-world context, but...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best practice is to beware. We are using typing
to indicate the caller should either be passing None
or a type for which this holds true. If the caller violates the stated contract, they take responsibility for unexpected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python has no contracts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course it does, the language just doesn't enforce them. Otherwise, we'd also check isinstance(project_id, UUID)
.
src/citrine/_utils/functions.py
Outdated
|
||
def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str): | ||
if project_id is None and team_id is None: | ||
raise TypeError("Missing one required argument: team_id.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is an error in the configuration of arguments and not passing a bad keyword or the wrong number of arguments, I think this is actually a ValueError. Grey zone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definite gray zone. But I like using TypeError
here because it will mirror the error they'd get in the future once the deprecated code is removed.
@@ -319,3 +319,25 @@ def resource_path(*, | |||
new_url = base._replace(path='/'.join(path), query=query).geturl() | |||
|
|||
return format_escaped_url(new_url, *action, **kwargs, uid=uid) | |||
|
|||
|
|||
def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this
if team_id is None:
if project_id is None:
raise Error
warn
import
team_id = Project.get_team_id_from_project_id(session=session, project_id=project_id)
return team_id
also, why aren't you checking if both ProjectID & TeamID are provided? Seems like you could screw up downstream stuff if you pass a team ID that's inconsistent with the Project ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inherited code. =) Yeah, I agree that your structure is better.
src/citrine/jobs/job.py
Outdated
if team_id is not None: | ||
path = format_escaped_url('teams/{}/execution/job-status', team_id) | ||
else: | ||
path = format_escaped_url('projects/{}/execution/job-status', project_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code path not deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check with Pablo. Based on his comment above, it sounds like the intent is for the project-level jobs endpoint to be deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the team_id is known we should call the team API. I'm unsure if we have guarantees of the team_id to be available here. But yes, ideally we should be moving to the new pattern. As a side note, the project-based job ID is not going to be deprecated because of the data manager, it will be for another reason (having a homogenous API for dealing with jobs).
src/citrine/resources/file_link.py
Outdated
def __init__( | ||
self, | ||
*args, | ||
session: Session = None, | ||
dataset_id: UUID = None, | ||
team_id: Optional[UUID] = None, | ||
project_id: Optional[UUID] = None | ||
): | ||
args = _pad_positional_args(args, 3) | ||
self.project_id = project_id or args[0] | ||
self.dataset_id = dataset_id or args[1] | ||
self.session = session or args[2] | ||
if self.session is None: | ||
raise TypeError("Missing one required argument: session.") | ||
if self.dataset_id is None: | ||
raise TypeError("Missing one required argument: dataset_id.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to DataConceptsCollection
def __init__( | ||
self, | ||
*args, | ||
dataset_id: UUID = None, | ||
session: Session = None, | ||
team_id: Optional[UUID] = None, | ||
project_id: Optional[UUID] = None | ||
): | ||
super().__init__(*args, | ||
team_id=team_id, | ||
dataset_id=dataset_id, | ||
session=session, | ||
project_id=project_id) | ||
args = _pad_positional_args(args, 3) | ||
self.project_id = project_id or args[0] | ||
self.dataset_id = dataset_id or args[1] | ||
self.session = session or args[2] | ||
self.team_id = team_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to DataConceptsCollection
def __init__(self, *args, team_id: UUID, project_id: UUID = None, session: Session = None): | ||
args = _pad_positional_args(args, 2) | ||
self.project_id = project_id or args[0] | ||
self.session: Session = session or args[1] | ||
self.team_id = team_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would a Table Config need a Project ID? Are Tables still affiliated with particular Projects and not affiliated with datasets anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it's still needed to retrieve it based on the table it built:
path = (f'projects/{self.project_id}/display-tables/{table.uid}/versions/{table.version}' |
I easily could be wrong though. @pacdaemon did I misinterpret what you said?
src/citrine/resources/project.py
Outdated
@@ -249,6 +353,9 @@ def publish(self, *, resource: Resource): | |||
""" | |||
resource_access = resource.access_control_dict() | |||
resource_type = resource_access["type"] | |||
if resource_type == ResourceTypeEnum.DATASET: | |||
warn("Datasets are no longer owned by Projects, so cannot be published by a Project.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this warning explicitly call out the Team-based route?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Publishing" no longer means anything with regards to datasets in the world of Data Manager. Previously, it's what made them visible to the team so another project could pull them in. Now, since everything is at a team level, it's a no-op.
Sounds like the message should probably be changed to be a bit clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, publishing doesn't mean anything now for team-based datasets, however, the accounts service doesn't implement an NOP for this type of dataset. Instead, it returns a 403, because the project doesn't have admin rights on the new type of dataset. I'd say that on top of the warning we should not call the API at all. The NOP is on our side.
from typing import TYPE_CHECKING | ||
if TYPE_CHECKING: # pragma: no cover | ||
from citrine.resources.project import Project | ||
from citrine.resources.team import Team | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we really benefiting from this?
https://docs.python.org/3.10/library/typing.html?highlight=type_checking#constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's debatable. This is what allowed me to add the proper type hints on lines 189-190 and 296-297 while still allowing it to compile and without flake8
complaining. Exactly how valuable that is depends on how much people rely on the type hinting. 🤷
e026979
to
c7117a6
Compare
if data and data[0]["roots"]: | ||
# Since the above query presents a single dataset to the endpoint, the response will be | ||
# a list of length one, with a single route. | ||
history_data = data[0] | ||
history_data["roots"] = history_data["roots"][0] | ||
return MaterialRun.build(history_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a change that came from Pablo running the e2es.
The result of the endpoint changed slightly, such that it can now encode multiple material histories. He confirmed to me that because this call is querying on a single dataset ID, it will always return a single element, and its roots
field will be an array of length 1.
I could have put this stuff in _pre_build
, but my impression is we very well might want to expose the whole thing at some point in the future, so I'd make it clear these changes apply to this specific usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:
history_data = data[0]
history_data["object"] = history_data.pop("roots")[0]
so that at least we're making it so that keywords have different meanings? 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it's not great. And yeah, I have no problem making that tweak. Looking at the code in GEMDResource
, it should work fine.
Also fix the deprecation warnings in Project to point at Team instead of TeamCollection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cosmetic fixes. Nothing blocking.
- I thought I submitted this review, sorry.
self.poll_async_update_job(job_id, timeout=timeout, | ||
self.poll_async_update_job(job_id=job_id, timeout=timeout, | ||
polling_delay=polling_delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/CitrineInformatics/citrine-python/blob/main/CONTRIBUTING.md#coding-style
Positional arguments are strongly discouraged for methods with multiple arguments
Since collection.poll_async_update_job(job_id)
actually reads well and you'd only be asking this one kind of question, it passes the requirements for a valid positional argument. It'd be user-antagonistic to make them type job_id=
when the only reasonable this to include there would be the job_id
.
Non-blocking.
@@ -268,7 +290,7 @@ def delete(self, uid: Union[UUID, str, LinkByUID, DataConcepts], *, dry_run=Fals | |||
collection = self.gemd._collection_for(uid) | |||
else: | |||
collection = self.gemd | |||
return collection.delete(uid, dry_run=dry_run) | |||
return collection.delete(uid=uid, dry_run=dry_run) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar response re: https://github.com/CitrineInformatics/citrine-python/blob/main/CONTRIBUTING.md#coding-style. Again, non-blocking.
] | ||
} | ||
data = self.session.post_resource(path, json=query) | ||
if data and data[0]["roots"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "roots" is not present, this is fatal. Maybe you wanted
if data and data[0]["roots"]: | |
if data and data[0].get("roots"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬 You are correct!
if data and data[0]["roots"]: | ||
# Since the above query presents a single dataset to the endpoint, the response will be | ||
# a list of length one, with a single route. | ||
history_data = data[0] | ||
history_data["roots"] = history_data["roots"][0] | ||
return MaterialRun.build(history_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:
history_data = data[0]
history_data["object"] = history_data.pop("roots")[0]
so that at least we're making it so that keywords have different meanings? 🤦
src/citrine/resources/project.py
Outdated
return session.get_resource( | ||
path=f'projects/{project_id}', | ||
version="v3")['project']['team']['id'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This passes linting?
return session.get_resource( | |
path=f'projects/{project_id}', | |
version="v3")['project']['team']['id'] | |
response = session.get_resource(path=f'projects/{project_id}', version='v3') | |
return response['project']['team']['id'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's been like that, so I was just gonna leave it. But since I'm in there anyways, sure, I'll clean it up.
src/citrine/resources/project.py
Outdated
return GEMDResourceCollection( | ||
project_id=self.uid, | ||
dataset_id=None, | ||
session=self.session, | ||
team_id=self.team_id | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointlessly inconsistent formatting
return GEMDResourceCollection( | |
project_id=self.uid, | |
dataset_id=None, | |
session=self.session, | |
team_id=self.team_id | |
) | |
return GEMDResourceCollection(project_id=self.uid, | |
dataset_id=None, | |
session=self.session, | |
team_id=self.team_id) |
src/citrine/resources/team.py
Outdated
return self.session.get_resource("/DATASET/authorized-ids", | ||
params=query_params, | ||
version="v3")['ids'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved legibility -- key gets lost in the arguments.
return self.session.get_resource("/DATASET/authorized-ids", | |
params=query_params, | |
version="v3")['ids'] | |
response = self.session.get_resource( | |
"/DATASET/authorized-ids", | |
params=query_params, | |
version="v3" | |
) | |
return response['ids'] |
src/citrine/resources/team.py
Outdated
return MeasurementTemplateCollection( | ||
team_id=self.uid, | ||
dataset_id=None, | ||
session=self.session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session=self.session) | |
session=self.session | |
) |
Mostly formatting things (and one time bomb of a bug).
Citrine Python PR
Description
I wanted to preserve Lenore's old PR for posterity, but clean up the release we'll be working with. Hence this separate branch and PR.
PR Type:
Adherence to team decisions