Add a datasource field to the DesignWorkflow object #968

kroenlein · 2024-10-03T05:03:07Z

Citrine Python PR

Description

Add SnapshotDataSource to the implemented classes of DataSource
Augment DesignWorkflow to include a datasource field
Consolidate DesignWorkflow test objects into one factory

When the user of the UI creates a branch with a data source but not yet any predictor, the API DesignWorkflow object has a direct reference to the data source. When a Predictor was referenced by the DesignWorkflow, that Predictor contained a reference to the DataSource, thus an SDK user could ultimately recover the value, but without a Predictor, there is no such workflow.

Unfortunately, the API returns a DataSourceId, which is a specially formatted string, not a DataSource object, as with other API references to DataSources. To avoid creating a new class or requiring users to manually construct a specially formatted string, this PR also adds basic mapping tools between DataSourceId strings and DataSource objects, as well as a convenience method for generating a TableDataSource from a GemTable object.

I think this approach is ready, but happy to work through concerns or potential improvements.

Prerequisite PR for https://citrine.atlassian.net/browse/PNE-731

PR Type:

Breaking change (fix or feature that would cause existing functionality to change)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Maintenance (non-breaking change to assist developers)

Adherence to team decisions

I have added tests for 100% coverage
I have written Numpy-style docstrings for every method and class.
I have communicated the downstream consequences of the PR to others.
I have bumped the version in __version__.py

kroenlein · 2024-10-03T05:04:38Z

src/citrine/informatics/data_sources.py

+        # TODO Figure out how to populate the column definitions
+        return CSVDataSource(
+            file_link=FileLink(url=args[0], filename=args[1]),
+            column_definitions={}
+        )


There is no obvious method to identify what the column definitions and identifiers would be from just the DataSourceId.

Would it be too disruptive to issue a warning here, so the user knows that the data source they're getting is incomplete? Actually, maybe the warning should come when they access the DesignWorkflow.data_source field, if it's a CSVDataSource with no column_definition.

Either way, I'd like if we can alert them this isn't a completely accurate representation.

Yeah, using a warning makes sense. I don't think people are using CSVDataSources anymore, but...

kroenlein · 2024-10-03T05:05:21Z

src/citrine/informatics/data_sources.py

+        return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])
+
+
+class SnapshotDataSource(Serializable['SnapshotDataSource'], DataSource):


This will presumably be useful soon, since it's the next step after Multistep Materials tables.

Yeah, Bill and I were just talking about what's needed to properly support training, etc on snapshots.

kroenlein · 2024-10-03T05:06:25Z

src/citrine/informatics/workflows/design_workflow.py

+    @classmethod
+    def _pre_build(cls, data: dict) -> dict:
+        """Run data modification before building."""
+        data_source_id = data.pop("data_source_id", None)
+        if data_source_id is not None:
+            data["data_source"] = DataSource.from_data_source_id(data_source_id).dump()
+        return data
+
+    def _post_dump(self, data: dict) -> dict:
+        """Run data modification after dumping."""
+        data_source = data.pop("data_source", None)
+        if data_source is not None:
+            data["data_source_id"] = DataSource.build(data_source).to_data_source_id()
+        else:
+            data["data_source_id"] = None
+        return data
+


This uses the existing serde hooks to filter out data_source_id fields and insert DataSource objects.

kroenlein · 2024-10-03T05:07:35Z

tests/conftest.py

-
-@pytest.fixture
-def design_workflow_dict(generic_entity):
-    ret = generic_entity.copy()
-    ret.update({
-        "name": "Example Design Workflow",
-        "description": "A description! Of the Design Workflow! So you know what it's for!",
-        "design_space_id": str(uuid.uuid4()),
-        "predictor_id": str(uuid.uuid4()),
-        "predictor_version": random.randint(1, 10),
-        "branch_id": str(uuid.uuid4()),
-    })
-    return ret


Having two different DesignWorkflow test objects (a fixture & a Factory) made testing on the updated object complicated.

kroenlein · 2024-10-03T05:09:24Z

tests/resources/test_design_workflows.py

-@pytest.fixture
-def workflow_minimal(collection, workflow) -> DesignWorkflow:
-    workflow.predictor_id = None
-    workflow.predictor_version = None
-    workflow.design_space_id = None
-    return workflow


This became obsolete because it was only used twice, and so was better served by a Factory with arguments.

anoto-moniz · 2024-10-04T14:02:05Z

src/citrine/informatics/data_sources.py

+        types: List[Type[Serializable]] = [
+            CSVDataSource, GemTableDataSource, ExperimentDataSourceRef, SnapshotDataSource
+        ]


Nitpick: maybe lift this list to the class level, as it's duplicated above.

anoto-moniz · 2024-10-04T14:02:20Z

src/citrine/informatics/data_sources.py

+        ]
+        res = next((x for x in types if x._data_source_type == terms[0]), None)
+        if res is None:
+            raise ValueError("Unrecognized type: {}".format(terms[0]))


Nitpick: f-string.

anoto-moniz · 2024-10-04T14:06:09Z

src/citrine/informatics/data_sources.py

+        return "::".join(
+            str(x) for x in [self._data_source_type, self.table_id, self.table_version]
+        )


Nitpick: what's the benefit of this over an f-string?

f{self._data_source_type}::{self.table_id}::{self.table_version}

My thought when writing was that we have a series of identifiers joined by :: in all cases, but yes, it ended up not great. I thought of creating a get_identifiers method that got joined at the parent level, but then that seemed a bit absurd. I'll swap to f-string.

anoto-moniz · 2024-10-04T14:06:40Z

src/citrine/informatics/data_sources.py

+
+    def to_data_source_id(self) -> str:
+        """Generate the data_source_id for this DataSource."""
+        return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])


Same nitpick as above.

anoto-moniz · 2024-10-04T14:07:21Z

src/citrine/informatics/data_sources.py

+
+    def to_data_source_id(self) -> str:
+        """Generate the data_source_id for this DataSource."""
+        return "::".join(str(x) for x in [self._data_source_type, self.snapshot_id])


Same as above.

anoto-moniz

Mostly looks good, just some nitpick comments which aren't worth blocking on.

My biggest concern is the same you call out in TODOs regarding the CSVDataSource. I'd like the sort of warning I mention, but I'm approving anyways since I don't know it's the most crucial thing. And we can always add it later if it becomes a problem.

Add a datasource field to the DesignWorkflow object

f53a021

kroenlein requested a review from anoto-moniz October 3, 2024 05:03

kroenlein requested a review from a team as a code owner October 3, 2024 05:03

kroenlein commented Oct 3, 2024

View reviewed changes

Remove non-3.8 operator

b03680b

anoto-moniz reviewed Oct 4, 2024

View reviewed changes

anoto-moniz previously approved these changes Oct 4, 2024

View reviewed changes

kroenlein added 2 commits October 4, 2024 09:00

Merge branch 'main' into feature/add-design-workflow-datasource

e777535

PR feedback

f5f7538

kroenlein dismissed anoto-moniz’s stale review via f5f7538 October 4, 2024 15:12

PR feedback

169e809

anoto-moniz approved these changes Oct 4, 2024

View reviewed changes

kroenlein merged commit 9c59fb6 into main Oct 4, 2024
16 checks passed

kroenlein deleted the feature/add-design-workflow-datasource branch October 4, 2024 20:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a datasource field to the DesignWorkflow object #968

Add a datasource field to the DesignWorkflow object #968

kroenlein commented Oct 3, 2024

kroenlein Oct 3, 2024

anoto-moniz Oct 4, 2024

kroenlein Oct 4, 2024

kroenlein Oct 3, 2024

anoto-moniz Oct 4, 2024

kroenlein Oct 3, 2024

kroenlein Oct 3, 2024

kroenlein Oct 3, 2024

anoto-moniz Oct 4, 2024

anoto-moniz Oct 4, 2024

anoto-moniz Oct 4, 2024 •

edited

Loading

kroenlein Oct 4, 2024

anoto-moniz Oct 4, 2024

anoto-moniz Oct 4, 2024

anoto-moniz left a comment

		return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])


		class SnapshotDataSource(Serializable['SnapshotDataSource'], DataSource):

Add a datasource field to the DesignWorkflow object #968

Add a datasource field to the DesignWorkflow object #968

Conversation

kroenlein commented Oct 3, 2024

Citrine Python PR

Description

PR Type:

Adherence to team decisions

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anoto-moniz Oct 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anoto-moniz left a comment

Choose a reason for hiding this comment

anoto-moniz Oct 4, 2024 •

edited

Loading