Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PNE-241] Data manager release #949

Merged
merged 4 commits into from
Jul 18, 2024

Conversation

anoto-moniz
Copy link
Collaborator

@anoto-moniz anoto-moniz commented Jul 10, 2024

Citrine Python PR

Description

I wanted to preserve Lenore's old PR for posterity, but clean up the release we'll be working with. Hence this separate branch and PR.

PR Type:

  • Breaking change (fix or feature that would cause existing functionality to change)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Maintenance (non-breaking change to assist developers)

Adherence to team decisions

  • I have added tests for 100% coverage
  • I have written Numpy-style docstrings for every method and class.
  • I have communicated the downstream consequences of the PR to others.
  • I have bumped the version in __version__.py

@anoto-moniz anoto-moniz force-pushed the feature/pne-241-data-manager-release branch 7 times, most recently from 36e6261 to d4ce1df Compare July 15, 2024 18:04
@@ -244,7 +246,12 @@ def get_by_build_job(self, job: Union[JobSubmissionResponse, UUID], *,
The table built by the specified job.

"""
status = _poll_for_job_completion(self.session, self.project_id, job, timeout=timeout)
# TODO: Should this use the project or team version?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this out here in case I forget to ask Pablo before the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly required, but if it happens that you have the teamId it would be nice, I'm trying to move the jobs API to the team level.

Datasets and all related objects now belong to teams, so new endpoints
have been introduced into the platform to support this. As such, all
such calls need to be updated. The result is a bunch of
`ProjectCollection` methods are being deprecated in favor of their
`TeamCollection` equivalents, and many methods which accepted a project
or project_id must be moved to a team/team_id.

The vast majority of behaviors will remain unchanged when using the
deprecated methods, until we bump the major version of the SDK and drop
the old endpoints. The major exception is that creating a dataset on a
project endpoint will register it to a team, so it will be omitted when
you list your datasets by project.
@anoto-moniz anoto-moniz force-pushed the feature/pne-241-data-manager-release branch from d4ce1df to c7117a6 Compare July 15, 2024 18:42
@anoto-moniz anoto-moniz marked this pull request as ready for review July 15, 2024 18:51
Copy link
Collaborator

@kroenlein kroenlein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of changes....

Comment on lines 217 to 224
def __init__(self,
*args,
session: Session = None,
dataset_id: Optional[UUID] = None,
team_id: Optional[UUID] = None,
project_id: Optional[UUID] = None):
# Handle positional arguments for backward compatibility
args = _pad_positional_args(args, 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically, I've handled this with a signature like:

    def __init__(self,
                 project_id: Optional[UUID] = None):
                 deprecated_dataset_id: Optional[UUID] = None,
                 deprecated_session: Session = None,
                 *,
                 session: Session = None,
                 dataset_id: Optional[UUID] = None,
                 team_id: Optional[UUID] = None,

Because this allows for making sure one or the other, not both, are expressed. If you are concerned about detecting passed Nones, that can be accomplished using a sentinel value: https://github.com/CitrineInformatics/gemd-python/blob/26c6be864b1a8eed2c2a9a2308da03db4588b650/gemd/entity/case_insensitive_dict.py#L5

I see how very similar operations are repeated, so following that pattern might require a fair bit of typing. It also has the added value of having the deprecation warning firing from the correct layer of the stack.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we had to apply this to a bunch of classes, I was more worried about excessive verbosity as well as copying the same code in a bunch of places, making any changes to be applied broadly a huge pain. Given the deprecation warnings it triggers, I'm not too worried about people providing both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further reflection, I think team_id is not Optional. Scenarios where no team_id is passed result in either errors or deprecation warnings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think team_id is not Optional

Agreed! I intended to drop the Optional typing, but seems I missed some spots (or maybe completely forgot). Good catch!

Comment on lines +225 to +227
self.project_id = project_id or args[0]
self.dataset_id = dataset_id or args[1]
self.session = session or args[2]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on non-None arguments to evaluate to True in boolean context is counter to best practice:

https://peps.python.org/pep-0008/#programming-recommendations

These should be the more verbose

self.project_id = project_id if project_id is not None else args[0]

or even the full block. Unlikely to trigger a bug in real-world context, but...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is to beware. We are using typing to indicate the caller should either be passing None or a type for which this holds true. If the caller violates the stated contract, they take responsibility for unexpected behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python has no contracts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it does, the language just doesn't enforce them. Otherwise, we'd also check isinstance(project_id, UUID).


def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str):
if project_id is None and team_id is None:
raise TypeError("Missing one required argument: team_id.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is an error in the configuration of arguments and not passing a bad keyword or the wrong number of arguments, I think this is actually a ValueError. Grey zone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definite gray zone. But I like using TypeError here because it will mirror the error they'd get in the future once the deprecated code is removed.

@@ -319,3 +319,25 @@ def resource_path(*,
new_url = base._replace(path='/'.join(path), query=query).geturl()

return format_escaped_url(new_url, *action, **kwargs, uid=uid)


def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this

if team_id is None:
    if project_id is None:
        raise Error
    warn
    import
    team_id = Project.get_team_id_from_project_id(session=session, project_id=project_id)
return team_id

also, why aren't you checking if both ProjectID & TeamID are provided? Seems like you could screw up downstream stuff if you pass a team ID that's inconsistent with the Project ID.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inherited code. =) Yeah, I agree that your structure is better.

if team_id is not None:
path = format_escaped_url('teams/{}/execution/job-status', team_id)
else:
path = format_escaped_url('projects/{}/execution/job-status', project_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code path not deprecated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check with Pablo. Based on his comment above, it sounds like the intent is for the project-level jobs endpoint to be deprecated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the team_id is known we should call the team API. I'm unsure if we have guarantees of the team_id to be available here. But yes, ideally we should be moving to the new pattern. As a side note, the project-based job ID is not going to be deprecated because of the data manager, it will be for another reason (having a homogenous API for dealing with jobs).

Comment on lines 199 to 214
def __init__(
self,
*args,
session: Session = None,
dataset_id: UUID = None,
team_id: Optional[UUID] = None,
project_id: Optional[UUID] = None
):
args = _pad_positional_args(args, 3)
self.project_id = project_id or args[0]
self.dataset_id = dataset_id or args[1]
self.session = session or args[2]
if self.session is None:
raise TypeError("Missing one required argument: session.")
if self.dataset_id is None:
raise TypeError("Missing one required argument: dataset_id.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to DataConceptsCollection

Comment on lines 29 to 46
def __init__(
self,
*args,
dataset_id: UUID = None,
session: Session = None,
team_id: Optional[UUID] = None,
project_id: Optional[UUID] = None
):
super().__init__(*args,
team_id=team_id,
dataset_id=dataset_id,
session=session,
project_id=project_id)
args = _pad_positional_args(args, 3)
self.project_id = project_id or args[0]
self.dataset_id = dataset_id or args[1]
self.session = session or args[2]
self.team_id = team_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to DataConceptsCollection

Comment on lines +419 to +423
def __init__(self, *args, team_id: UUID, project_id: UUID = None, session: Session = None):
args = _pad_positional_args(args, 2)
self.project_id = project_id or args[0]
self.session: Session = session or args[1]
self.team_id = team_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would a Table Config need a Project ID? Are Tables still affiliated with particular Projects and not affiliated with datasets anymore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it's still needed to retrieve it based on the table it built:

path = (f'projects/{self.project_id}/display-tables/{table.uid}/versions/{table.version}'

I easily could be wrong though. @pacdaemon did I misinterpret what you said?

@@ -249,6 +353,9 @@ def publish(self, *, resource: Resource):
"""
resource_access = resource.access_control_dict()
resource_type = resource_access["type"]
if resource_type == ResourceTypeEnum.DATASET:
warn("Datasets are no longer owned by Projects, so cannot be published by a Project.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this warning explicitly call out the Team-based route?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Publishing" no longer means anything with regards to datasets in the world of Data Manager. Previously, it's what made them visible to the team so another project could pull them in. Now, since everything is at a team level, it's a no-op.

Sounds like the message should probably be changed to be a bit clearer.

Copy link
Contributor

@pacdaemon pacdaemon Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, publishing doesn't mean anything now for team-based datasets, however, the accounts service doesn't implement an NOP for this type of dataset. Instead, it returns a 403, because the project doesn't have admin rights on the new type of dataset. I'd say that on top of the warning we should not call the API at all. The NOP is on our side.

Comment on lines +27 to +31
from typing import TYPE_CHECKING
if TYPE_CHECKING: # pragma: no cover
from citrine.resources.project import Project
from citrine.resources.team import Team

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's debatable. This is what allowed me to add the proper type hints on lines 189-190 and 296-297 while still allowing it to compile and without flake8 complaining. Exactly how valuable that is depends on how much people rely on the type hinting. 🤷

@anoto-moniz anoto-moniz force-pushed the feature/pne-241-data-manager-release branch from e026979 to c7117a6 Compare July 16, 2024 18:58
Comment on lines 142 to 147
if data and data[0]["roots"]:
# Since the above query presents a single dataset to the endpoint, the response will be
# a list of length one, with a single route.
history_data = data[0]
history_data["roots"] = history_data["roots"][0]
return MaterialRun.build(history_data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a change that came from Pablo running the e2es.

The result of the endpoint changed slightly, such that it can now encode multiple material histories. He confirmed to me that because this call is querying on a single dataset ID, it will always return a single element, and its roots field will be an array of length 1.

I could have put this stuff in _pre_build, but my impression is we very well might want to expose the whole thing at some point in the future, so I'd make it clear these changes apply to this specific usage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:

history_data = data[0]
history_data["object"] = history_data.pop("roots")[0]

so that at least we're making it so that keywords have different meanings? 🤦

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's not great. And yeah, I have no problem making that tweak. Looking at the code in GEMDResource, it should work fine.

@anoto-moniz anoto-moniz requested a review from kroenlein July 17, 2024 19:41
Also fix the deprecation warnings in Project to point at Team instead of
TeamCollection.
kroenlein
kroenlein previously approved these changes Jul 18, 2024
Copy link
Collaborator

@kroenlein kroenlein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic fixes. Nothing blocking.

  • I thought I submitted this review, sorry.

Comment on lines -485 to 517
self.poll_async_update_job(job_id, timeout=timeout,
self.poll_async_update_job(job_id=job_id, timeout=timeout,
polling_delay=polling_delay)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/CitrineInformatics/citrine-python/blob/main/CONTRIBUTING.md#coding-style

Positional arguments are strongly discouraged for methods with multiple arguments

Since collection.poll_async_update_job(job_id) actually reads well and you'd only be asking this one kind of question, it passes the requirements for a valid positional argument. It'd be user-antagonistic to make them type job_id= when the only reasonable this to include there would be the job_id.

Non-blocking.

@@ -268,7 +290,7 @@ def delete(self, uid: Union[UUID, str, LinkByUID, DataConcepts], *, dry_run=Fals
collection = self.gemd._collection_for(uid)
else:
collection = self.gemd
return collection.delete(uid, dry_run=dry_run)
return collection.delete(uid=uid, dry_run=dry_run)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

]
}
data = self.session.post_resource(path, json=query)
if data and data[0]["roots"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "roots" is not present, this is fatal. Maybe you wanted

Suggested change
if data and data[0]["roots"]:
if data and data[0].get("roots"):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬 You are correct!

Comment on lines 142 to 147
if data and data[0]["roots"]:
# Since the above query presents a single dataset to the endpoint, the response will be
# a list of length one, with a single route.
history_data = data[0]
history_data["roots"] = history_data["roots"][0]
return MaterialRun.build(history_data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:

history_data = data[0]
history_data["object"] = history_data.pop("roots")[0]

so that at least we're making it so that keywords have different meanings? 🤦

Comment on lines 120 to 122
return session.get_resource(
path=f'projects/{project_id}',
version="v3")['project']['team']['id']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This passes linting?

Suggested change
return session.get_resource(
path=f'projects/{project_id}',
version="v3")['project']['team']['id']
response = session.get_resource(path=f'projects/{project_id}', version='v3')
return response['project']['team']['id']

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's been like that, so I was just gonna leave it. But since I'm in there anyways, sure, I'll clean it up.

Comment on lines 321 to 326
return GEMDResourceCollection(
project_id=self.uid,
dataset_id=None,
session=self.session,
team_id=self.team_id
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointlessly inconsistent formatting

Suggested change
return GEMDResourceCollection(
project_id=self.uid,
dataset_id=None,
session=self.session,
team_id=self.team_id
)
return GEMDResourceCollection(project_id=self.uid,
dataset_id=None,
session=self.session,
team_id=self.team_id)

Comment on lines 384 to 386
return self.session.get_resource("/DATASET/authorized-ids",
params=query_params,
version="v3")['ids']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved legibility -- key gets lost in the arguments.

Suggested change
return self.session.get_resource("/DATASET/authorized-ids",
params=query_params,
version="v3")['ids']
response = self.session.get_resource(
"/DATASET/authorized-ids",
params=query_params,
version="v3"
)
return response['ids']

return MeasurementTemplateCollection(
team_id=self.uid,
dataset_id=None,
session=self.session)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
session=self.session)
session=self.session
)

Mostly formatting things (and one time bomb of a bug).
@anoto-moniz anoto-moniz merged commit 619e003 into main Jul 18, 2024
16 checks passed
@anoto-moniz anoto-moniz deleted the feature/pne-241-data-manager-release branch July 18, 2024 19:42
@pacdaemon pacdaemon mentioned this pull request Jul 19, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants