[PNE-241] Data manager release #949

anoto-moniz · 2024-07-10T18:28:43Z

Citrine Python PR

Description

I wanted to preserve Lenore's old PR for posterity, but clean up the release we'll be working with. Hence this separate branch and PR.

PR Type:

Breaking change (fix or feature that would cause existing functionality to change)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Maintenance (non-breaking change to assist developers)

Adherence to team decisions

I have added tests for 100% coverage
I have written Numpy-style docstrings for every method and class.
I have communicated the downstream consequences of the PR to others.
I have bumped the version in __version__.py

anoto-moniz · 2024-07-15T18:11:34Z

src/citrine/resources/gemtables.py

@@ -244,7 +246,12 @@ def get_by_build_job(self, job: Union[JobSubmissionResponse, UUID], *,
            The table built by the specified job.

        """
-        status = _poll_for_job_completion(self.session, self.project_id, job, timeout=timeout)
+        # TODO: Should this use the project or team version?


Calling this out here in case I forget to ask Pablo before the PR.

Not strictly required, but if you happen to have the teamId it would be nice; I'm trying to move the jobs API to the team level.

Datasets and all related objects now belong to teams, and new endpoints have been introduced into the platform to support this, so all such calls need to be updated. As a result, a number of `ProjectCollection` methods are being deprecated in favor of their `TeamCollection` equivalents, and many methods which accepted a project or project_id must move to a team or team_id. The vast majority of behaviors will remain unchanged when using the deprecated methods, until we bump the major version of the SDK and drop the old endpoints. The major exception is that creating a dataset on a project endpoint will register it to a team, so it will be omitted when you list your datasets by project.

kroenlein

That's a lot of changes....

kroenlein · 2024-07-15T19:30:12Z

src/citrine/resources/data_concepts.py

+    def __init__(self,
+                 *args,
+                 session: Session = None,
+                 dataset_id: Optional[UUID] = None,
+                 team_id: Optional[UUID] = None,
+                 project_id: Optional[UUID] = None):
+        # Handle positional arguments for backward compatibility
+        args = _pad_positional_args(args, 3)


Historically, I've handled this with a signature like:

    def __init__(self,
                 project_id: Optional[UUID] = None,
                 deprecated_dataset_id: Optional[UUID] = None,
                 deprecated_session: Session = None,
                 *,
                 session: Session = None,
                 dataset_id: Optional[UUID] = None,
                 team_id: Optional[UUID] = None):

Because this allows for making sure one or the other, not both, are expressed. If you are concerned about detecting passed Nones, that can be accomplished using a sentinel value: https://github.com/CitrineInformatics/gemd-python/blob/26c6be864b1a8eed2c2a9a2308da03db4588b650/gemd/entity/case_insensitive_dict.py#L5

I see how very similar operations are repeated, so following that pattern might require a fair bit of typing. It also has the added value of having the deprecation warning firing from the correct layer of the stack.
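A minimal sketch of the sentinel approach mentioned above, for readers who have not seen it. The names `_NOT_PASSED` and `ExampleCollection` are hypothetical and are not taken from citrine-python or gemd-python. The sentinel distinguishes "argument omitted" from "argument explicitly passed as None", and `stacklevel=2` attributes the warning to the caller's line rather than to the constructor.

```python
import warnings
from typing import Optional
from uuid import UUID, uuid4

# Hypothetical module-level sentinel; any unique object works.
_NOT_PASSED = object()


class ExampleCollection:
    """Illustrative stand-in, not the real DataConceptsCollection."""

    def __init__(self,
                 project_id=_NOT_PASSED,
                 *,
                 team_id: Optional[UUID] = None,
                 dataset_id: Optional[UUID] = None):
        if project_id is not _NOT_PASSED:
            # The caller supplied the deprecated argument, even if it was None.
            warnings.warn("project_id is deprecated; pass team_id instead.",
                          DeprecationWarning, stacklevel=2)
            if team_id is not None:
                raise TypeError("Pass either project_id or team_id, not both.")
        self.project_id = None if project_id is _NOT_PASSED else project_id
        self.team_id = team_id
        self.dataset_id = dataset_id
```

With this shape, `ExampleCollection(team_id=uuid4())` stays silent, `ExampleCollection(None)` still warns, and passing both identifiers fails immediately.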

Given that we had to apply this to a bunch of classes, I was more worried about excessive verbosity as well as copying the same code in a bunch of places, making any changes to be applied broadly a huge pain. Given the deprecation warnings it triggers, I'm not too worried about people providing both.

Upon further reflection, I think team_id is not Optional. Scenarios where no team_id is passed result in either errors or deprecation warnings.

I think team_id is not Optional

Agreed! I intended to drop the Optional typing, but seems I missed some spots (or maybe completely forgot). Good catch!

kroenlein · 2024-07-15T19:35:48Z

src/citrine/resources/data_concepts.py

+        self.project_id = project_id or args[0]
+        self.dataset_id = dataset_id or args[1]
+        self.session = session or args[2]


Relying on non-None arguments to evaluate to True in boolean context is counter to best practice:

https://peps.python.org/pep-0008/#programming-recommendations

These should be the more verbose

self.project_id = project_id if project_id is not None else args[0]

or even the full block. Unlikely to trigger a bug in real-world context, but...

Best practice is to beware. We are using typing to indicate the caller should either be passing None or a type for which this holds true. If the caller violates the stated contract, they take responsibility for unexpected behavior.

Python has no contracts.

Of course it does, the language just doesn't enforce them. Otherwise, we'd also check isinstance(project_id, UUID).
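For reference, a small illustration of the difference being debated. The two helper functions are hypothetical and only exist to contrast the idioms. It also shows why the risk is low for this particular PR: a `UUID` instance is always truthy, so the `or` form only misbehaves if a caller passes some other falsy value.

```python
from uuid import UUID


def pick_with_or(keyword, positional):
    # Falls through to the positional value whenever keyword is falsy.
    return keyword or positional


def pick_with_is_none(keyword, positional):
    # Falls through only when keyword is actually None.
    return keyword if keyword is not None else positional


# 0, "" and empty containers are falsy but are not None.
assert pick_with_or(0, 5) == 5
assert pick_with_is_none(0, 5) == 0

# UUID defines neither __bool__ nor __len__, so even the all-zero UUID is truthy.
assert bool(UUID(int=0))
assert pick_with_or(UUID(int=0), None) == UUID(int=0)
```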

kroenlein · 2024-07-15T19:38:58Z

src/citrine/_utils/functions.py

+
+def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str):
+    if project_id is None and team_id is None:
+        raise TypeError("Missing one required argument: team_id.")


Because this is an error in the configuration of arguments and not passing a bad keyword or the wrong number of arguments, I think this is actually a ValueError. Grey zone.

Definite gray zone. But I like using TypeError here because it will mirror the error they'd get in the future once the deprecated code is removed.
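To illustrate the point about mirroring the future error: Python itself raises `TypeError` when a required keyword-only argument is omitted. The function below is a hypothetical future signature, assumed only for this example.

```python
def future_init(*, team_id):
    """Hypothetical future signature where team_id is a required keyword argument."""
    return team_id


try:
    future_init()  # team_id omitted
except TypeError as err:
    message = str(err)  # the interpreter's own "missing ... argument" text
```

So a caller who sees `TypeError` from the deprecation check today would see the same exception class once the project-based path is removed.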

kroenlein · 2024-07-15T19:47:51Z

src/citrine/_utils/functions.py

@@ -319,3 +319,25 @@ def resource_path(*,
    new_url = base._replace(path='/'.join(path), query=query).geturl()

    return format_escaped_url(new_url, *action, **kwargs, uid=uid)
+
+
+def _data_manager_deprecation_checks(session, project_id: UUID, team_id: UUID, obj_type: str):


Why isn't this

    if team_id is None:
        if project_id is None:
            raise Error
        warn(...)     # deprecation warning
        import ...    # local import of Project
        team_id = Project.get_team_id_from_project_id(session=session, project_id=project_id)
    return team_id

Also, why aren't you checking if both ProjectID & TeamID are provided? Seems like you could screw up downstream stuff if you pass a team ID that's inconsistent with the Project ID.

Inherited code. =) Yeah, I agree that your structure is better.
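A runnable sketch of the structure agreed on above, including the consistency check that was asked about. `resolve_team_id` and its `lookup_team_id` parameter are hypothetical: the callable stands in for `Project.get_team_id_from_project_id(session=..., project_id=...)` so the example runs without a session. The choice of `ValueError` for a mismatch is also an assumption.

```python
import warnings
from typing import Callable, Optional
from uuid import UUID, uuid4


def resolve_team_id(project_id: Optional[UUID],
                    team_id: Optional[UUID],
                    lookup_team_id: Callable[[UUID], UUID],
                    obj_type: str) -> UUID:
    """Return the team ID to use, warning when only a project ID was given."""
    if team_id is None:
        if project_id is None:
            raise TypeError("Missing one required argument: team_id.")
        warnings.warn(f"{obj_type} now belong to Teams; pass team_id instead of project_id.",
                      DeprecationWarning, stacklevel=2)
        return lookup_team_id(project_id)
    if project_id is not None and lookup_team_id(project_id) != team_id:
        # Both were given, so make sure they agree before anything downstream uses them.
        raise ValueError("project_id does not belong to the given team_id.")
    return team_id
```

Note that the consistency check costs one lookup whenever both IDs are supplied, which in the real SDK would presumably be a network call. Whether that trade-off is acceptable is a design decision for the PR authors.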

kroenlein · 2024-07-15T19:49:03Z

src/citrine/jobs/job.py

+    if team_id is not None:
+        path = format_escaped_url('teams/{}/execution/job-status', team_id)
+    else:
+        path = format_escaped_url('projects/{}/execution/job-status', project_id)


Is this code path not deprecated?

I need to check with Pablo. Based on his comment above, it sounds like the intent is for the project-level jobs endpoint to be deprecated.

If the team_id is known we should call the team API. I'm unsure whether we have a guarantee that the team_id is available here. But yes, ideally we should be moving to the new pattern. As a side note, the project-based jobs API is not being deprecated because of the data manager; it will be deprecated for another reason (having a homogeneous API for dealing with jobs).
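A sketch of the "prefer the team endpoint when the team_id is known" rule. `job_status_path` is a hypothetical helper; it uses `urllib.parse.quote` in place of the SDK's `format_escaped_url`, whose exact behavior is not reproduced here. Raising `TypeError` when neither ID is given is an assumption.

```python
from typing import Optional
from urllib.parse import quote
from uuid import UUID


def job_status_path(team_id: Optional[UUID], project_id: Optional[UUID]) -> str:
    """Build the job-status path, preferring the team-level endpoint."""
    if team_id is not None:
        return f"teams/{quote(str(team_id), safe='')}/execution/job-status"
    if project_id is not None:
        # Fallback for callers that only know the project.
        return f"projects/{quote(str(project_id), safe='')}/execution/job-status"
    raise TypeError("Either team_id or project_id is required.")
```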

kroenlein · 2024-07-15T20:14:29Z

src/citrine/resources/file_link.py

+    def __init__(
+        self,
+        *args,
+        session: Session = None,
+        dataset_id: UUID = None,
+        team_id: Optional[UUID] = None,
+        project_id: Optional[UUID] = None
+    ):
+        args = _pad_positional_args(args, 3)
+        self.project_id = project_id or args[0]
+        self.dataset_id = dataset_id or args[1]
+        self.session = session or args[2]
+        if self.session is None:
+            raise TypeError("Missing one required argument: session.")
+        if self.dataset_id is None:
+            raise TypeError("Missing one required argument: dataset_id.")


Similar to DataConceptsCollection

kroenlein · 2024-07-15T20:15:46Z

src/citrine/resources/gemd_resource.py

+    def __init__(
+        self,
+        *args,
+        dataset_id: UUID = None,
+        session: Session = None,
+        team_id: Optional[UUID] = None,
+        project_id: Optional[UUID] = None
+    ):
+        super().__init__(*args,
+                         team_id=team_id,
+                         dataset_id=dataset_id,
+                         session=session,
+                         project_id=project_id)
+        args = _pad_positional_args(args, 3)
+        self.project_id = project_id or args[0]
+        self.dataset_id = dataset_id or args[1]
+        self.session = session or args[2]
+        self.team_id = team_id


Similar to DataConceptsCollection

kroenlein · 2024-07-15T20:21:30Z

src/citrine/resources/table_config.py

+    def __init__(self, *args, team_id: UUID, project_id: UUID = None, session: Session = None):
+        args = _pad_positional_args(args, 2)
+        self.project_id = project_id or args[0]
+        self.session: Session = session or args[1]
+        self.team_id = team_id


Why would a Table Config need a Project ID? Are Tables still affiliated with particular Projects and not affiliated with datasets anymore?

My understanding is that it's still needed to retrieve it based on the table it built:

citrine-python/src/citrine/resources/table_config.py

Line 460 in c7117a6

path = (f'projects/{self.project_id}/display-tables/{table.uid}/versions/{table.version}'

I easily could be wrong though. @pacdaemon did I misinterpret what you said?

kroenlein · 2024-07-15T20:22:42Z

src/citrine/resources/project.py

@@ -249,6 +353,9 @@ def publish(self, *, resource: Resource):
        """
        resource_access = resource.access_control_dict()
        resource_type = resource_access["type"]
+        if resource_type == ResourceTypeEnum.DATASET:
+            warn("Datasets are no longer owned by Projects, so cannot be published by a Project.",


Shouldn't this warning explicitly call out the Team-based route?

"Publishing" no longer means anything with regards to datasets in the world of Data Manager. Previously, it's what made them visible to the team so another project could pull them in. Now, since everything is at a team level, it's a no-op.

Sounds like the message should probably be changed to be a bit clearer.

Yes, publishing doesn't mean anything now for team-based datasets. However, the accounts service doesn't implement a NOP for this type of dataset. Instead, it returns a 403, because the project doesn't have admin rights on the new type of dataset. I'd say that on top of the warning we should not call the API at all. The NOP is on our side.
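A sketch of the client-side no-op described above: warn, then return without sending a request. `publish_resource` and `call_api` are hypothetical names, the string `"DATASET"` is assumed to be the value behind `ResourceTypeEnum.DATASET`, and the `False` return value and `UserWarning` category are placeholders, since the thread does not say what the real method should return or which category it uses.

```python
import warnings
from typing import Callable


def publish_resource(resource_type: str, call_api: Callable[[], bool]) -> bool:
    """Skip the API call for datasets; delegate for every other resource type."""
    if resource_type == "DATASET":
        warnings.warn("Datasets are owned by Teams, so publishing one from a Project "
                      "has no effect. No request was sent.",
                      UserWarning, stacklevel=2)
        return False
    return call_api()
```

Returning before `call_api()` is what avoids the 403 from the accounts service.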

kroenlein · 2024-07-15T20:24:20Z

src/citrine/resources/table_config.py

+from typing import TYPE_CHECKING
+if TYPE_CHECKING:   # pragma: no cover
+    from citrine.resources.project import Project
+    from citrine.resources.team import Team
+


Are we really benefiting from this?

https://docs.python.org/3.10/library/typing.html?highlight=type_checking#constant

It's debatable. This is what allowed me to add the proper type hints on lines 189-190 and 296-297 while still allowing it to compile and without flake8 complaining. Exactly how valuable that is depends on how much people rely on the type hinting. 🤷
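For context, a self-contained sketch of the pattern in question. `Decimal` stands in for `Project` and `Team`, and `describe` is a hypothetical function. The guarded import is seen by type checkers but never executed at runtime, which is the usual way to annotate with a class whose module cannot be imported normally (for example, because of a circular import, which is presumably the situation here).

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:  # pragma: no cover
    # Evaluated by type checkers only; skipped when the module actually runs.
    from decimal import Decimal


def describe(value: "Union[Decimal, int]") -> str:
    # The annotation is a string, so Decimal does not need to exist at runtime.
    return f"{type(value).__name__}: {value}"
```

The cost is an extra block plus a coverage pragma; the benefit is limited to people and tools that read the hints.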

anoto-moniz · 2024-07-17T19:39:45Z

src/citrine/resources/material_run.py

+        if data and data[0]["roots"]:
+            # Since the above query presents a single dataset to the endpoint, the response will be
+            # a list of length one, with a single route.
+            history_data = data[0]
+            history_data["roots"] = history_data["roots"][0]
+            return MaterialRun.build(history_data)


This was a change that came from Pablo running the e2es.

The result of the endpoint changed slightly, such that it can now encode multiple material histories. He confirmed to me that because this call is querying on a single dataset ID, it will always return a single element, and its roots field will be an array of length 1.

I could have put this stuff in _pre_build, but my impression is we very well might want to expose the whole thing at some point in the future, so I'd make it clear these changes apply to this specific usage.

This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:

    history_data = data[0]
    history_data["object"] = history_data.pop("roots")[0]

so that at least we're making it so that keywords have different meanings? 🤦

Agreed, it's not great. And yeah, I have no problem making that tweak. Looking at the code in GEMDResource, it should work fine.
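A standalone sketch of the agreed tweak, so the intended shape is unambiguous. `normalize_history` is a hypothetical helper and the payload layout in the usage note is assumed from the discussion, not copied from the platform's API.

```python
from typing import Any, Dict, List, Optional


def normalize_history(data: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the single history with its root under "object", or None if absent."""
    if not data or not data[0].get("roots"):
        return None
    history = dict(data[0])  # shallow copy so the response itself is not mutated
    history["object"] = history.pop("roots")[0]
    return history
```

Given `[{"context": [...], "roots": [root]}]`, this yields `{"context": [...], "object": root}`, matching the key names gemd-python was originally written to expect.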

Also fix the deprecation warnings in Project to point at Team instead of TeamCollection.

kroenlein

Cosmetic fixes. Nothing blocking.

I thought I submitted this review, sorry.

kroenlein · 2024-07-17T21:02:10Z

src/citrine/resources/data_concepts.py

-            self.poll_async_update_job(job_id, timeout=timeout,
+            self.poll_async_update_job(job_id=job_id, timeout=timeout,
                                       polling_delay=polling_delay)


https://github.com/CitrineInformatics/citrine-python/blob/main/CONTRIBUTING.md#coding-style

Positional arguments are strongly discouraged for methods with multiple arguments

Since collection.poll_async_update_job(job_id) actually reads well and you'd only be asking this one kind of question, it passes the requirements for a valid positional argument. It'd be user-antagonistic to make them type job_id= when the only reasonable thing to include there would be the job_id.

Non-blocking.

kroenlein · 2024-07-17T21:03:02Z

src/citrine/resources/dataset.py

@@ -268,7 +290,7 @@ def delete(self, uid: Union[UUID, str, LinkByUID, DataConcepts], *, dry_run=Fals
            collection = self.gemd._collection_for(uid)
        else:
            collection = self.gemd
-        return collection.delete(uid, dry_run=dry_run)
+        return collection.delete(uid=uid, dry_run=dry_run)


Similar response re: https://github.com/CitrineInformatics/citrine-python/blob/main/CONTRIBUTING.md#coding-style. Again, non-blocking.

kroenlein · 2024-07-17T21:06:40Z

src/citrine/resources/material_run.py

+            ]
+        }
+        data = self.session.post_resource(path, json=query)
+        if data and data[0]["roots"]:


If "roots" is not present, this is fatal. Maybe you wanted

Suggested change

-        if data and data[0]["roots"]:
+        if data and data[0].get("roots"):

😬 You are correct!

kroenlein · 2024-07-17T21:17:38Z

src/citrine/resources/material_run.py

+        if data and data[0]["roots"]:
+            # Since the above query presents a single dataset to the endpoint, the response will be
+            # a list of length one, with a single route.
+            history_data = data[0]
+            history_data["roots"] = history_data["roots"][0]
+            return MaterialRun.build(history_data)


This is so gross. When gemd-python was written, it expected "context" and "object" as the two keys at the base level of the serialized history object. When it was implemented on platform, someone decided to use a different keyword "roots", so I made the build method for a GEMDResource require the key "context" and just assume the other key points at the root thing. In addition, one library includes the object in the "context" array and one doesn't -- I don't recall which does which. As long as we're screwing with this structure, maybe we could make them a bit more harmonious? Something like:

    history_data = data[0]
    history_data["object"] = history_data.pop("roots")[0]

so that at least we're making it so that keywords have different meanings? 🤦

kroenlein · 2024-07-17T21:19:34Z

src/citrine/resources/project.py

+        return session.get_resource(
+            path=f'projects/{project_id}',
+            version="v3")['project']['team']['id']


This passes linting?

Suggested change

-        return session.get_resource(
-            path=f'projects/{project_id}',
-            version="v3")['project']['team']['id']
+        response = session.get_resource(path=f'projects/{project_id}', version='v3')
+        return response['project']['team']['id']

Yeah, it's been like that, so I was just gonna leave it. But since I'm in there anyways, sure, I'll clean it up.

kroenlein · 2024-07-17T21:23:20Z

src/citrine/resources/project.py

+        return GEMDResourceCollection(
+            project_id=self.uid,
+            dataset_id=None,
+            session=self.session,
+            team_id=self.team_id
+        )


Pointlessly inconsistent formatting

Suggested change

-        return GEMDResourceCollection(
-            project_id=self.uid,
-            dataset_id=None,
-            session=self.session,
-            team_id=self.team_id
-        )
+        return GEMDResourceCollection(project_id=self.uid,
+                                      dataset_id=None,
+                                      session=self.session,
+                                      team_id=self.team_id)

kroenlein · 2024-07-17T21:27:55Z

src/citrine/resources/team.py

+        return self.session.get_resource("/DATASET/authorized-ids",
+                                         params=query_params,
+                                         version="v3")['ids']


Improved legibility -- key gets lost in the arguments.

Suggested change

-        return self.session.get_resource("/DATASET/authorized-ids",
-                                         params=query_params,
-                                         version="v3")['ids']
+        response = self.session.get_resource(
+            "/DATASET/authorized-ids",
+            params=query_params,
+            version="v3"
+        )
+        return response['ids']

kroenlein · 2024-07-17T21:29:01Z

src/citrine/resources/team.py

+        return MeasurementTemplateCollection(
+            team_id=self.uid,
+            dataset_id=None,
+            session=self.session)


Suggested change

-            session=self.session)
+            session=self.session
+        )

Mostly formatting things (and one time bomb of a bug).

anoto-moniz force-pushed the feature/pne-241-data-manager-release branch 7 times, most recently from 36e6261 to d4ce1df Compare July 15, 2024 18:04

anoto-moniz commented Jul 15, 2024

anoto-moniz force-pushed the feature/pne-241-data-manager-release branch from d4ce1df to c7117a6 Compare July 15, 2024 18:42

anoto-moniz marked this pull request as ready for review July 15, 2024 18:51

anoto-moniz requested review from pacdaemon and kroenlein July 15, 2024 18:52

kroenlein reviewed Jul 15, 2024

anoto-moniz force-pushed the feature/pne-241-data-manager-release branch from e026979 to c7117a6 Compare July 16, 2024 18:58

Addressing PR comments.

3da0570

anoto-moniz commented Jul 17, 2024

anoto-moniz requested a review from kroenlein July 17, 2024 19:41

Add gemd_batch_delete to Team.

687d134

Also fix the deprecation warnings in Project to point at Team instead of TeamCollection.

kroenlein previously approved these changes Jul 18, 2024

Address some more PR comments.

0a85d27

Mostly formatting things (and one time bomb of a bug).

anoto-moniz dismissed kroenlein’s stale review via 0a85d27 July 18, 2024 17:54

anoto-moniz requested a review from kroenlein July 18, 2024 17:54

kroenlein approved these changes Jul 18, 2024

anoto-moniz merged commit 619e003 into main Jul 18, 2024
16 checks passed

anoto-moniz deleted the feature/pne-241-data-manager-release branch July 18, 2024 19:42

pacdaemon mentioned this pull request Jul 19, 2024

[PNE-241] Data Manager support #943

Closed

8 tasks
