Add a datasource field to the DesignWorkflow object #968
Conversation
# TODO Figure out how to populate the column definitions
return CSVDataSource(
    file_link=FileLink(url=args[0], filename=args[1]),
    column_definitions={}
)
There is no obvious method to identify what the column definitions and identifiers would be from just the DataSourceId.
Would it be too disruptive to issue a warning here, so the user knows that the data source they're getting is incomplete? Actually, maybe the warning should come when they access the DesignWorkflow.data_source field, if it's a CSVDataSource with no column_definitions.
Either way, I'd like it if we can alert them that this isn't a completely accurate representation.
Yeah, using a warning makes sense. I don't think people are using CSVDataSources anymore, but...
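If the warning does go in next to the TODO, a minimal sketch might look something like the following; the helper name _csv_from_id_terms and the import paths are assumptions for illustration, not the PR's actual code.

import warnings

# Import paths assumed; adjust to wherever CSVDataSource / FileLink live in the SDK.
from citrine.informatics.data_sources import CSVDataSource
from citrine.resources.file_link import FileLink


def _csv_from_id_terms(args):
    """Build a CSVDataSource from id terms, warning that the result is incomplete."""
    warnings.warn(
        "This CSVDataSource was reconstructed from a data_source_id; "
        "column_definitions and identifiers cannot be recovered and are left empty.",
        UserWarning,
    )
    return CSVDataSource(
        file_link=FileLink(url=args[0], filename=args[1]),
        column_definitions={},
    )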
return "::".join(str(x) for x in [self._data_source_type, self.datasource_id]) | ||
|
||
|
||
class SnapshotDataSource(Serializable['SnapshotDataSource'], DataSource): |
This will presumably be useful soon, since it's the next step after Multistep Materials tables.
Yeah, Bill and I were just talking about what's needed to properly support training, etc., on snapshots.
@classmethod
def _pre_build(cls, data: dict) -> dict:
    """Run data modification before building."""
    data_source_id = data.pop("data_source_id", None)
    if data_source_id is not None:
        data["data_source"] = DataSource.from_data_source_id(data_source_id).dump()
    return data

def _post_dump(self, data: dict) -> dict:
    """Run data modification after dumping."""
    data_source = data.pop("data_source", None)
    if data_source is not None:
        data["data_source_id"] = DataSource.build(data_source).to_data_source_id()
    else:
        data["data_source_id"] = None
    return data
This uses the existing serde hooks to filter out data_source_id fields and insert DataSource objects.
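The hook shape is easiest to see in isolation. Below is a self-contained stand-in that mirrors the same pre-build/post-dump swap; the class and the id format are made up for illustration and are not the SDK's API.

class DataSourceStandIn:
    """Stand-in for a DataSource that can round-trip to/from a string id."""

    def __init__(self, source_id: str):
        self.source_id = source_id

    def dump(self) -> dict:
        return {"id": self.source_id}

    @classmethod
    def build(cls, data: dict) -> "DataSourceStandIn":
        return cls(data["id"])

    def to_data_source_id(self) -> str:
        return self.source_id


def pre_build(data: dict) -> dict:
    # Before building: swap the API's string id for a dumped object.
    data_source_id = data.pop("data_source_id", None)
    if data_source_id is not None:
        data["data_source"] = DataSourceStandIn(data_source_id).dump()
    return data


def post_dump(data: dict) -> dict:
    # After dumping: swap the object back to the string id the API expects.
    data_source = data.pop("data_source", None)
    if data_source is not None:
        data["data_source_id"] = DataSourceStandIn.build(data_source).to_data_source_id()
    else:
        data["data_source_id"] = None
    return data


payload = {"data_source_id": "some_type::1234"}
assert post_dump(pre_build(dict(payload))) == payload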
@pytest.fixture
def design_workflow_dict(generic_entity):
    ret = generic_entity.copy()
    ret.update({
        "name": "Example Design Workflow",
        "description": "A description! Of the Design Workflow! So you know what it's for!",
        "design_space_id": str(uuid.uuid4()),
        "predictor_id": str(uuid.uuid4()),
        "predictor_version": random.randint(1, 10),
        "branch_id": str(uuid.uuid4()),
    })
    return ret
Having two different DesignWorkflow test objects (a fixture & a Factory) made testing on the updated object complicated.
@pytest.fixture
def workflow_minimal(collection, workflow) -> DesignWorkflow:
    workflow.predictor_id = None
    workflow.predictor_version = None
    workflow.design_space_id = None
    return workflow
This became obsolete because it was only used twice, and those cases were better served by a Factory with arguments.
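The Factory in the test suite is presumably the existing factory-style builder; as a rough illustration of "a Factory with arguments", a plain callable pytest fixture to the same effect might look like this. The import path and constructor signature are assumptions, with argument names mirroring the fields above.

import pytest

from citrine.informatics.workflows import DesignWorkflow  # import path assumed


@pytest.fixture
def workflow_factory():
    """Return a callable that builds a DesignWorkflow with optional overrides."""
    def _build(*, name="Example Design Workflow", description=None,
               design_space_id=None, predictor_id=None, predictor_version=None):
        return DesignWorkflow(
            name=name,
            description=description,
            design_space_id=design_space_id,
            predictor_id=predictor_id,
            predictor_version=predictor_version,
        )
    return _build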
types: List[Type[Serializable]] = [
    CSVDataSource, GemTableDataSource, ExperimentDataSourceRef, SnapshotDataSource
]
Nitpick: maybe lift this list to the class level, as it's duplicated above.
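As a sketch of the lift-to-class-level idea: since the concrete subclasses are defined after the base class, the shared list would likely need to be populated at the bottom of the module. The stand-in classes below only illustrate the pattern and are not the SDK's classes.

from typing import List, Type


class BaseSource:
    """Stand-in base class holding a single subtype registry."""

    # Populated at the bottom of the module, once the subclasses exist,
    # so both lookups can share one list instead of duplicating it.
    _subtype_registry: List[Type["BaseSource"]] = []

    @classmethod
    def lookup(cls, data_source_id: str) -> Type["BaseSource"]:
        terms = data_source_id.split("::")
        res = next((x for x in cls._subtype_registry if x._data_source_type == terms[0]), None)
        if res is None:
            raise ValueError(f"Unrecognized type: {terms[0]}")
        return res  # the real method would go on to build an instance from the remaining terms


class CsvLike(BaseSource):
    _data_source_type = "csv_like"


class TableLike(BaseSource):
    _data_source_type = "table_like"


BaseSource._subtype_registry = [CsvLike, TableLike]

assert BaseSource.lookup("csv_like::url::file.csv") is CsvLike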
]
res = next((x for x in types if x._data_source_type == terms[0]), None)
if res is None:
    raise ValueError("Unrecognized type: {}".format(terms[0]))
Nitpick: f-string.
return "::".join( | ||
str(x) for x in [self._data_source_type, self.table_id, self.table_version] | ||
) |
Nitpick: what's the benefit of this over an f-string?
f"{self._data_source_type}::{self.table_id}::{self.table_version}"
My thought when writing was that we have a series of identifiers joined by :: in all cases, but yes, it ended up not great. I thought of creating a get_identifiers method that got joined at the parent level, but that seemed a bit absurd. I'll swap to f-strings.
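For reference, the swapped version would presumably read:

def to_data_source_id(self) -> str:
    """Generate the data_source_id for this DataSource."""
    return f"{self._data_source_type}::{self.table_id}::{self.table_version}"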
def to_data_source_id(self) -> str:
    """Generate the data_source_id for this DataSource."""
    return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])
Same nitpick as above.
def to_data_source_id(self) -> str:
    """Generate the data_source_id for this DataSource."""
    return "::".join(str(x) for x in [self._data_source_type, self.snapshot_id])
Same as above.
Mostly looks good, just some nitpick comments which aren't worth blocking on.
My biggest concern is the same one you call out in the TODOs regarding the CSVDataSource. I'd like the sort of warning I mentioned, but I'm approving anyway since I don't know that it's the most crucial thing. And we can always add it later if it becomes a problem.
Citrine Python PR
Description
When a UI user creates a branch with a data source but no predictor yet, the API DesignWorkflow object holds a direct reference to the data source. When the DesignWorkflow references a Predictor, that Predictor contains a reference to the DataSource, so an SDK user can ultimately recover the value; without a Predictor, there is no such route.
Unfortunately, the API returns a DataSourceId, which is a specially formatted string rather than a DataSource object, unlike other API references to DataSources. To avoid creating a new class or requiring users to construct that string by hand, this PR also adds basic mapping tools between DataSourceId strings and DataSource objects, as well as a convenience method for generating a TableDataSource from a GemTable object.
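In rough terms, SDK usage would then look something like the sketch below; the project handle, collection access path, and variable names are assumptions for illustration.

# A usage sketch; project, design_workflows, and workflow_uid are assumed to exist.
workflow = project.design_workflows.get(workflow_uid)

# The new field, populated from the API's data_source_id during _pre_build...
ds = workflow.data_source
if ds is not None:
    # ...and convertible back to the API's string form.
    print(ds.to_data_source_id())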
I think this approach is ready, but happy to work through concerns or potential improvements.
Prerequisite PR for https://citrine.atlassian.net/browse/PNE-731
PR Type:
Adherence to team decisions