
feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) #12632

Open

wants to merge 67 commits into base: master
Conversation

ryota-cloud (Collaborator):

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion and community-contribution labels Feb 13, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review label Feb 13, 2025

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 82.25256% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ingestion/src/datahub/ingestion/source/vertexai.py | 80.00% | 52 Missing ⚠️ |


@ryota-cloud ryota-cloud changed the title (WIP) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) Feb 24, 2025
hsheth2 (Collaborator) left a comment:

Every entity needs container aspects
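
For context on what that request means: in DataHub, a container aspect links an entity to its parent container (for Vertex AI, typically the project), which is what makes the entity browsable under that container in the UI. A minimal sketch of emitting one; the URNs below are made up for illustration and are not the connector's actual values:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import ContainerClass

# Hypothetical URNs; the real source would derive these from the GCP project id.
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:vertexai,test-project-id.model.example_model,PROD)"
container_urn = "urn:li:container:example-vertexai-project"

# The "container" aspect parents the model under the project container.
mcp = MetadataChangeProposalWrapper(
    entityUrn=model_urn,
    aspect=ContainerClass(container=container_urn),
)
workunit = mcp.as_workunit()  # ready to be yielded from the source
```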

@datahub-cyborg datahub-cyborg bot added pending-submitter-response, needs-review and removed needs-review, pending-submitter-response labels Feb 25, 2025
@datahub-cyborg datahub-cyborg bot added needs-review and removed pending-submitter-response labels Mar 4, 2025
@ryota-cloud (Collaborator, Author):

  • the logic that constructs each entity is sometimes split across multiple methods (e.g. for jobs), which makes it difficult to understand what metadata we're producing for each entity type

I tried to separate metadata extraction and workunit generation like this. @hsheth2

```python
def _get_training_job_workunits(
    self, job: VertexAiResourceNoun
) -> Iterable[MetadataWorkUnit]:
    job_meta: TrainingJobMetadata = self._get_training_job_metadata(job)
    yield from self._gen_training_job_workunits(job_meta)
    yield from self._gen_output_model_workunits(job_meta)
```
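
For readers following along, the snippet above assumes a plain metadata holder produced by the extraction step. A rough guess at its shape (only output_model and output_model_version are confirmed elsewhere in this review; the rest is illustrative):

```python
from dataclasses import dataclass
from typing import Any, Optional

from google.cloud.aiplatform import Model
from google.cloud.aiplatform.base import VertexAiResourceNoun


# A sketch only; the PR's actual TrainingJobMetadata may carry more fields.
@dataclass
class TrainingJobMetadata:
    job: VertexAiResourceNoun
    output_model: Optional[Model] = None
    output_model_version: Optional[Any] = None  # version info, typed loosely here
```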

hsheth2 (Collaborator) left a comment:
left a few additional comments

```json
},
{
  "entityType": "mlModelGroup",
  "entityUrn": "urn:li:mlModelGroup:(urn:li:dataPlatform:vertexai,test-project-id.model.mock_prediction_model_2,PROD)",
```
Collaborator:

this is a pretty critical part of the connector, so we should figure out how to mock and test it, even if it's tricky

@datahub-cyborg datahub-cyborg bot added pending-submitter-response and removed needs-review labels Mar 4, 2025
@datahub-cyborg datahub-cyborg bot added needs-review and removed pending-submitter-response labels Mar 5, 2025
2. Download a service account JSON keyfile.
Example credential file:

```json
Collaborator:

not fixed; this code block should be indented, similar to how the code blocks below are indented


```yaml
config:
    project_id: "acryl-poc"
    region: "us-west2"
    # Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication.
```
Collaborator:

Suggested change:

```diff
-    # Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication.
+    # You must either set GOOGLE_APPLICATION_CREDENTIALS or provide credential as shown below.
```
```python
        super().__init__(**data)

        if self.credential:
            self._credentials_path = self.credential.create_credential_temp_file(
```
Collaborator:
do we actually need to create a credentials file?
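
For reference, google-auth can build credentials from the parsed keyfile dict in memory, so a temp file may not be necessary. A sketch under the assumption that the config already holds the keyfile fields as a dict (the function below is illustrative, not this PR's API):

```python
from typing import Any, Dict

from google.cloud import aiplatform
from google.oauth2 import service_account


def init_vertexai(project_id: str, region: str, keyfile_info: Dict[str, Any]) -> None:
    # Build credentials directly from the parsed service-account JSON,
    # avoiding a temporary credentials file on disk.
    credentials = service_account.Credentials.from_service_account_info(keyfile_info)
    aiplatform.init(project=project_id, location=region, credentials=credentials)
```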

```python
        )
        return job_meta

    def _gen_endpoint_mcps(
```
Collaborator:

there could be multiple endpoints right?

Suggested change:

```diff
-    def _gen_endpoint_mcps(
+    def _gen_endpoints_mcps(
```

```python
            job_meta.output_model = model
            job_meta.output_model_version = model_version
        except GoogleAPICallError:
            logger.error(
```

Comment on lines +143 to +144
```python
            if func_to_mock == "google.cloud.aiplatform.Model.list":
                mock.return_value = gen_mock_models()
```
Collaborator:

these probably should not be in the for loop

instead, do something like this:

```python
exit_stack.enter_context(
    patch("google.cloud.aiplatform.Model.list")
).return_value = gen_mock_models()
```
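
Spelled out a bit more, the pattern being suggested looks roughly like this; gen_mock_models comes from this PR's test file, while the Endpoint patch target and the gen_mock_endpoints helper are assumptions for illustration:

```python
from contextlib import ExitStack
from unittest.mock import patch


def test_vertexai_ingestion_golden_file():
    with ExitStack() as exit_stack:
        # Each Vertex AI call is patched once, up front, instead of branching
        # on the mock target inside a for loop.
        exit_stack.enter_context(
            patch("google.cloud.aiplatform.Model.list")
        ).return_value = gen_mock_models()
        exit_stack.enter_context(
            patch("google.cloud.aiplatform.Endpoint.list")
        ).return_value = gen_mock_endpoints()  # hypothetical helper
        # ... run the pipeline here and compare output against the golden file
```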

```python
                mock.return_value = [mock_automl_job]
            elif (
                func_to_mock
                == "datahub.ingestion.source.vertexai.VertexAISource._get_training_job_metadata"
```
Collaborator:

not sure it makes sense to mock this method; we should be mocking whatever it fetches from Vertex AI

```python
    mock_endpoint.description = "test endpoint"
    mock_endpoint.create_time = datetime.now()
    mock_endpoint.display_name = "test endpoint display name"
    return mock_endpoint
```
Collaborator:

is this stuff copy-pasted from the other test file? if so, we should probably pull it into a vertex_ai_mocks.py that both import
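
A rough sketch of what that shared module could look like, reusing the endpoint fixture quoted above (the file name vertex_ai_mocks.py comes from the comment; everything else is illustrative):

```python
# vertex_ai_mocks.py: fixtures shared by both Vertex AI test files.
from datetime import datetime
from unittest.mock import MagicMock


def gen_mock_endpoint() -> MagicMock:
    mock_endpoint = MagicMock()
    mock_endpoint.description = "test endpoint"
    mock_endpoint.create_time = datetime.now()
    mock_endpoint.display_name = "test endpoint display name"
    return mock_endpoint
```

Both test files would then import the fixture (from vertex_ai_mocks import gen_mock_endpoint) instead of duplicating it.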

```python
    # Run _gen_ml_model_mcps
    mcp = [mcp for mcp in source._gen_ml_model_mcps(model_meta)]
    assert len(mcp) == 1
    assert hasattr(mcp[0], "aspect")
```
Collaborator:

these unit tests need some work:

  • removing hasattr calls
  • the `for mcp in mcps: if ...` flow does not work. I left a comment about this earlier as well (one way to rewrite these assertions is sketched below)
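
One hedged way those assertions could be tightened is to check the concrete aspect class rather than using hasattr; the aspect type actually emitted by _gen_ml_model_mcps may differ from the MLModelPropertiesClass assumed here:

```python
from datahub.metadata.schema_classes import MLModelPropertiesClass

mcps = list(source._gen_ml_model_mcps(model_meta))
assert len(mcps) == 1
# Assert on the concrete aspect type so an unexpected aspect fails loudly.
assert isinstance(mcps[0].aspect, MLModelPropertiesClass)
assert mcps[0].entityUrn is not None
assert mcps[0].entityUrn.startswith("urn:li:mlModel:")
```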

@datahub-cyborg datahub-cyborg bot added pending-submitter-response and removed needs-review labels Mar 5, 2025
Labels

  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • ingestion: PR or Issue related to the ingestion of metadata
  • pending-submitter-response: Issue/request has been reviewed but requires a response from the submitter
Projects
None yet
2 participants