
feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) #12632

Open

wants to merge 67 commits into base: master
Conversation

ryota-cloud (Collaborator):

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion and community-contribution labels Feb 13, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review label Feb 13, 2025

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 82.25256% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ingestion/src/datahub/ingestion/source/vertexai.py | 80.00% | 52 Missing ⚠️ |


@ryota-cloud ryota-cloud changed the title (WIP) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) Feb 24, 2025
hsheth2 (Collaborator) left a comment:

Every entity needs container aspects
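
For context on what that request means: in DataHub, a container aspect links an entity to its parent container (for Vertex AI, typically the project), which is what makes the entity browsable under that container in the UI. A minimal sketch of emitting one; the URNs below are made up for illustration and are not the connector's actual values:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import ContainerClass

# Hypothetical URNs; the real source would derive these from the GCP project id.
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:vertexai,test-project-id.model.example_model,PROD)"
container_urn = "urn:li:container:example-vertexai-project"

# The "container" aspect parents the model under the project container.
mcp = MetadataChangeProposalWrapper(
    entityUrn=model_urn,
    aspect=ContainerClass(container=container_urn),
)
workunit = mcp.as_workunit()  # ready to be yielded from the source
```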

@datahub-cyborg datahub-cyborg bot added pending-submitter-response, needs-review and removed needs-review, pending-submitter-response labels Feb 25, 2025
@datahub-cyborg datahub-cyborg bot added needs-review and removed pending-submitter-response labels Mar 4, 2025
@ryota-cloud (Collaborator, Author):

  • the logic that constructs each entity is sometimes split across multiple methods (e.g. for jobs), which makes it difficult to understand what metadata we're producing for each entity type

I tried to separate metadata extraction and workunit generation like this. @hsheth2

```python
def _get_training_job_workunits(
    self, job: VertexAiResourceNoun
) -> Iterable[MetadataWorkUnit]:
    job_meta: TrainingJobMetadata = self._get_training_job_metadata(job)
    yield from self._gen_training_job_workunits(job_meta)
    yield from self._gen_output_model_workunits(job_meta)
```
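
For readers following along, the snippet above assumes a plain metadata holder produced by the extraction step. A rough guess at its shape (only output_model and output_model_version are confirmed elsewhere in this review; the rest is illustrative):

```python
from dataclasses import dataclass
from typing import Any, Optional

from google.cloud.aiplatform import Model
from google.cloud.aiplatform.base import VertexAiResourceNoun


# A sketch only; the PR's actual TrainingJobMetadata may carry more fields.
@dataclass
class TrainingJobMetadata:
    job: VertexAiResourceNoun
    output_model: Optional[Model] = None
    output_model_version: Optional[Any] = None  # version info, typed loosely here
```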

hsheth2 (Collaborator) left a comment:
left a few additional comments

```json
},
{
  "entityType": "mlModelGroup",
  "entityUrn": "urn:li:mlModelGroup:(urn:li:dataPlatform:vertexai,test-project-id.model.mock_prediction_model_2,PROD)",
```
Collaborator:

this is a pretty critical part of the connector, so we should figure out how to mock and test it, even if it's tricky

@datahub-cyborg datahub-cyborg bot added pending-submitter-response and removed needs-review labels Mar 4, 2025
@datahub-cyborg datahub-cyborg bot added needs-review and removed pending-submitter-response labels Mar 5, 2025
2. Download a service account JSON keyfile.
Example credential file:

```json
Collaborator:

not fixed; this code block should be indented, similar to how the code blocks below are indented


```yaml
config:
    project_id: "acryl-poc"
    region: "us-west2"
    # Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication.
```
Collaborator:

Suggested change:

```diff
-    # Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication.
+    # You must either set GOOGLE_APPLICATION_CREDENTIALS or provide credential as shown below.
```
```python
        super().__init__(**data)

        if self.credential:
            self._credentials_path = self.credential.create_credential_temp_file(
```
Collaborator:
do we actually need to create a credentials file?
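
For reference, google-auth can build credentials from the parsed keyfile dict in memory, so a temp file may not be necessary. A sketch under the assumption that the config already holds the keyfile fields as a dict (the function below is illustrative, not this PR's API):

```python
from typing import Any, Dict

from google.cloud import aiplatform
from google.oauth2 import service_account


def init_vertexai(project_id: str, region: str, keyfile_info: Dict[str, Any]) -> None:
    # Build credentials directly from the parsed service-account JSON,
    # avoiding a temporary credentials file on disk.
    credentials = service_account.Credentials.from_service_account_info(keyfile_info)
    aiplatform.init(project=project_id, location=region, credentials=credentials)
```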

```python
        )
        return job_meta

    def _gen_endpoint_mcps(
```
Collaborator:

there could be multiple endpoints right?

Suggested change:

```diff
-    def _gen_endpoint_mcps(
+    def _gen_endpoints_mcps(
```

```python
            job_meta.output_model = model
            job_meta.output_model_version = model_version
        except GoogleAPICallError:
            logger.error(
```

Comment on lines +143 to +144
```python
            if func_to_mock == "google.cloud.aiplatform.Model.list":
                mock.return_value = gen_mock_models()
```
Collaborator:

these probably should not be in the for loop

instead, do something like this:

```python
exit_stack.enter_context(
    patch("google.cloud.aiplatform.Model.list")
).return_value = gen_mock_models()
```
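
Spelled out a bit more, the pattern being suggested looks roughly like this; gen_mock_models comes from this PR's test file, while the Endpoint patch target and the gen_mock_endpoints helper are assumptions for illustration:

```python
from contextlib import ExitStack
from unittest.mock import patch


def test_vertexai_ingestion_golden_file():
    with ExitStack() as exit_stack:
        # Each Vertex AI call is patched once, up front, instead of branching
        # on the mock target inside a for loop.
        exit_stack.enter_context(
            patch("google.cloud.aiplatform.Model.list")
        ).return_value = gen_mock_models()
        exit_stack.enter_context(
            patch("google.cloud.aiplatform.Endpoint.list")
        ).return_value = gen_mock_endpoints()  # hypothetical helper
        # ... run the pipeline here and compare output against the golden file
```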

```python
                mock.return_value = [mock_automl_job]
            elif (
                func_to_mock
                == "datahub.ingestion.source.vertexai.VertexAISource._get_training_job_metadata"
```
Collaborator:

not sure it makes sense to mock this method; we should be mocking whatever it fetches from Vertex AI

```python
    mock_endpoint.description = "test endpoint"
    mock_endpoint.create_time = datetime.now()
    mock_endpoint.display_name = "test endpoint display name"
    return mock_endpoint
```
Collaborator:

is this stuff copy-pasted from the other test file? if so, we should probably pull it into a vertex_ai_mocks.py that both import
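
A rough sketch of what that shared module could look like, reusing the endpoint fixture quoted above (the file name vertex_ai_mocks.py comes from the comment; everything else is illustrative):

```python
# vertex_ai_mocks.py: fixtures shared by both Vertex AI test files.
from datetime import datetime
from unittest.mock import MagicMock


def gen_mock_endpoint() -> MagicMock:
    mock_endpoint = MagicMock()
    mock_endpoint.description = "test endpoint"
    mock_endpoint.create_time = datetime.now()
    mock_endpoint.display_name = "test endpoint display name"
    return mock_endpoint
```

Both test files would then import the fixture (from vertex_ai_mocks import gen_mock_endpoint) instead of duplicating it.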

```python
    # Run _gen_ml_model_mcps
    mcp = [mcp for mcp in source._gen_ml_model_mcps(model_meta)]
    assert len(mcp) == 1
    assert hasattr(mcp[0], "aspect")
```
Collaborator:

these unit tests need some work:

  • removing hasattr calls
  • the `for mcp in mcps: if ...` flow does not work. I left a comment about this earlier as well (one way to rewrite these assertions is sketched below)
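
One hedged way those assertions could be tightened is to check the concrete aspect class rather than using hasattr; the aspect type actually emitted by _gen_ml_model_mcps may differ from the MLModelPropertiesClass assumed here:

```python
from datahub.metadata.schema_classes import MLModelPropertiesClass

mcps = list(source._gen_ml_model_mcps(model_meta))
assert len(mcps) == 1
# Assert on the concrete aspect type so an unexpected aspect fails loudly.
assert isinstance(mcps[0].aspect, MLModelPropertiesClass)
assert mcps[0].entityUrn is not None
assert mcps[0].entityUrn.startswith("urn:li:mlModel:")
```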

@datahub-cyborg datahub-cyborg bot added pending-submitter-response and removed needs-review labels Mar 5, 2025
Labels

  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • ingestion: PR or Issue related to the ingestion of metadata
  • pending-submitter-response: Issue/request has been reviewed but requires a response from the submitter
Projects
None yet
2 participants