Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a datasource field to the DesignWorkflow object #968

Merged
merged 5 commits into from
Oct 4, 2024

Conversation

kroenlein
Copy link
Collaborator

Citrine Python PR

Description

  • Add SnapshotDataSource to the implemented classes of DataSource
  • Augment DesignWorkflow to include a datasource field
  • Consolidate DesignWorkflow test objects into one factory

When the user of the UI creates a branch with a data source but not yet any predictor, the API DesignWorkflow object has a direct reference to the data source. When a Predictor was referenced by the DesignWorkflow, that Predictor contained a reference to the DataSource, thus an SDK user could ultimately recover the value, but without a Predictor, there is no such workflow.

Unfortunately, the API returns a DataSourceId, which is a specially formatted string, not a DataSource object, as with other API references to DataSources. To avoid creating a new class or requiring users to manually construct a specially formatted string, this PR also adds basic mapping tools between DataSourceId strings and DataSource objects, as well as a convenience method for generating a TableDataSource from a GemTable object.

I think this approach is ready, but happy to work through concerns or potential improvements.

Prerequisite PR for https://citrine.atlassian.net/browse/PNE-731

PR Type:

  • Breaking change (fix or feature that would cause existing functionality to change)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Maintenance (non-breaking change to assist developers)

Adherence to team decisions

  • I have added tests for 100% coverage
  • I have written Numpy-style docstrings for every method and class.
  • I have communicated the downstream consequences of the PR to others.
  • I have bumped the version in __version__.py

@kroenlein kroenlein requested a review from anoto-moniz October 3, 2024 05:03
@kroenlein kroenlein requested a review from a team as a code owner October 3, 2024 05:03
Comment on lines 109 to 113
# TODO Figure out how to populate the column definitions
return CSVDataSource(
file_link=FileLink(url=args[0], filename=args[1]),
column_definitions={}
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no obvious method to identify what the column definitions and identifiers would be from just the DataSourceId.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be too disruptive to issue a warning here, so the user knows that the data source they're getting is incomplete? Actually, maybe the warning should come when they access the DesignWorkflow.data_source field, if it's a CSVDataSource with no column_definition.

Either way, I'd like if we can alert them this isn't a completely accurate representation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using a warning makes sense. I don't think people are using CSVDataSources anymore, but...

return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])


class SnapshotDataSource(Serializable['SnapshotDataSource'], DataSource):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will presumably be useful soon, since it's the next step after Multistep Materials tables.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Bill and I were just talking about what's needed to properly support training, etc on snapshots.

Comment on lines +67 to +83
@classmethod
def _pre_build(cls, data: dict) -> dict:
"""Run data modification before building."""
data_source_id = data.pop("data_source_id", None)
if data_source_id is not None:
data["data_source"] = DataSource.from_data_source_id(data_source_id).dump()
return data

def _post_dump(self, data: dict) -> dict:
"""Run data modification after dumping."""
data_source = data.pop("data_source", None)
if data_source is not None:
data["data_source_id"] = DataSource.build(data_source).to_data_source_id()
else:
data["data_source_id"] = None
return data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the existing serde hooks to filter out data_source_id fields and insert DataSource objects.

Comment on lines -952 to -964

@pytest.fixture
def design_workflow_dict(generic_entity):
ret = generic_entity.copy()
ret.update({
"name": "Example Design Workflow",
"description": "A description! Of the Design Workflow! So you know what it's for!",
"design_space_id": str(uuid.uuid4()),
"predictor_id": str(uuid.uuid4()),
"predictor_version": random.randint(1, 10),
"branch_id": str(uuid.uuid4()),
})
return ret
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two different DesignWorkflow test objects (a fixture & a Factory) made testing on the updated object complicated.

Comment on lines -57 to -62
@pytest.fixture
def workflow_minimal(collection, workflow) -> DesignWorkflow:
workflow.predictor_id = None
workflow.predictor_version = None
workflow.design_space_id = None
return workflow
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This became obsolete because it was only used twice, and so was better served by a Factory with arguments.

Comment on lines 55 to 57
types: List[Type[Serializable]] = [
CSVDataSource, GemTableDataSource, ExperimentDataSourceRef, SnapshotDataSource
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: maybe lift this list to the class level, as it's duplicated above.

]
res = next((x for x in types if x._data_source_type == terms[0]), None)
if res is None:
raise ValueError("Unrecognized type: {}".format(terms[0]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: f-string.

Comment on lines 154 to 156
return "::".join(
str(x) for x in [self._data_source_type, self.table_id, self.table_version]
)
Copy link
Collaborator

@anoto-moniz anoto-moniz Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: what's the benefit of this over an f-string?

f{self._data_source_type}::{self.table_id}::{self.table_version}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought when writing was that we have a series of identifiers joined by :: in all cases, but yes, it ended up not great. I thought of creating a get_identifiers method that got joined at the parent level, but then that seemed a bit absurd. I'll swap to f-string.


def to_data_source_id(self) -> str:
"""Generate the data_source_id for this DataSource."""
return "::".join(str(x) for x in [self._data_source_type, self.datasource_id])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nitpick as above.


def to_data_source_id(self) -> str:
"""Generate the data_source_id for this DataSource."""
return "::".join(str(x) for x in [self._data_source_type, self.snapshot_id])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

anoto-moniz
anoto-moniz previously approved these changes Oct 4, 2024
Copy link
Collaborator

@anoto-moniz anoto-moniz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just some nitpick comments which aren't worth blocking on.

My biggest concern is the same you call out in TODOs regarding the CSVDataSource. I'd like the sort of warning I mention, but I'm approving anyways since I don't know it's the most crucial thing. And we can always add it later if it becomes a problem.

@kroenlein kroenlein merged commit 9c59fb6 into main Oct 4, 2024
16 checks passed
@kroenlein kroenlein deleted the feature/add-design-workflow-datasource branch October 4, 2024 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants