
feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) #12632

Open · wants to merge 67 commits into base: master
Changes from all commits · 67 commits
45ce05e
feat(ingestion) Adding vertexAI ingestion source
ryota-cloud Feb 13, 2025
9a1355d
lintfix
ryota-cloud Feb 13, 2025
04315d4
minor comment change
ryota-cloud Feb 13, 2025
e3a17b5
minor
ryota-cloud Feb 13, 2025
2a5ea58
minor change in unit test
ryota-cloud Feb 13, 2025
3739c20
Adding sources and documents
ryota-cloud Feb 18, 2025
520eda6
delete unnecessary file
ryota-cloud Feb 18, 2025
c320a6c
fetch list of training jobs
ryota-cloud Feb 22, 2025
bc9e451
adding comments
ryota-cloud Feb 23, 2025
960129b
feat(ingest): add vertex AI sample data ingestion
ryota-cloud Feb 12, 2025
95712f5
Update vertexai.py
ryota-cloud Feb 24, 2025
78d184b
added endopint workunit creation and refactored
ryota-cloud Feb 24, 2025
d746a4c
commit temporarily
ryota-cloud Feb 24, 2025
5fbe0e5
lintfix
ryota-cloud Feb 24, 2025
9f8e8a3
removing unnecesary commits
ryota-cloud Feb 24, 2025
85d1830
cleanup recipe
ryota-cloud Feb 24, 2025
aae6893
minor change in config
ryota-cloud Feb 24, 2025
764f8fd
fixing dataset
ryota-cloud Feb 24, 2025
29ddcff
adding comments for dataset
ryota-cloud Feb 24, 2025
437e7d2
minor fix
ryota-cloud Feb 24, 2025
a2a1f0a
adding vertex to dev requirements in setup.py
ryota-cloud Feb 24, 2025
bf869da
minor fix
ryota-cloud Feb 24, 2025
c1f24b7
caching dataset list acquisitions
ryota-cloud Feb 25, 2025
453688d
review comment on dataset
ryota-cloud Feb 25, 2025
be03cf5
minor chagne
ryota-cloud Feb 25, 2025
8c76435
change name
ryota-cloud Feb 25, 2025
33a19c9
lint fix
ryota-cloud Feb 25, 2025
b76ec25
Refactor code to use auto_workunit
ryota-cloud Feb 25, 2025
c7d5165
flattern make_vertexai_name
ryota-cloud Feb 25, 2025
482c159
lint type error is fixed
ryota-cloud Feb 25, 2025
1032630
adding credentail config
ryota-cloud Feb 26, 2025
616b76a
refactor and changed GCP credential to pass project_id
ryota-cloud Feb 26, 2025
1dcfce1
Adding more unit test case coverage, fixed lint and test case
ryota-cloud Feb 26, 2025
f16c8f5
fix platform name
ryota-cloud Feb 26, 2025
1de43a0
fixed _get_data_process_input_workunit test case
ryota-cloud Feb 26, 2025
ea577cb
Adding subtype and container to dataset and training job
ryota-cloud Feb 27, 2025
46ff526
fix UI issue on timestamp and refactor
ryota-cloud Feb 27, 2025
9b6c01e
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Feb 27, 2025
7b0fb70
removed token
ryota-cloud Feb 27, 2025
cf9c242
Adding integration test for VertexAI
ryota-cloud Feb 28, 2025
398c380
Adding unit test cases
ryota-cloud Feb 28, 2025
4703cd9
increasing unit test coverage
ryota-cloud Feb 28, 2025
63e8e8e
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Feb 28, 2025
ba26abb
adding more unit tests
ryota-cloud Feb 28, 2025
3a85d8a
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Feb 28, 2025
84ebae0
fixed review comments
ryota-cloud Mar 3, 2025
0b6b7db
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Mar 3, 2025
5472929
fixed review comments, adding unit test cases
ryota-cloud Mar 3, 2025
0eeeb72
minor change
ryota-cloud Mar 3, 2025
6c43ecc
Change BigQueryCredentail to common function: GCPCredential
ryota-cloud Mar 3, 2025
d381b9e
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Mar 3, 2025
1f64a95
fixed one unit test case failure, and naming chagne
ryota-cloud Mar 3, 2025
b559286
Added Enum and refactoring
ryota-cloud Mar 3, 2025
4edd575
add comment
ryota-cloud Mar 3, 2025
5765025
fixed review comments
ryota-cloud Mar 4, 2025
4b09365
delete test case using real model
ryota-cloud Mar 4, 2025
eb261c3
delete commented out code
ryota-cloud Mar 4, 2025
e6feb8a
consolidate use of auto_workunit and change func output to mcps
ryota-cloud Mar 4, 2025
a8d7980
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Mar 4, 2025
b31d0f6
fix comment
ryota-cloud Mar 4, 2025
99269aa
Add POJO for model and change logic of model extraction and mcps crea…
ryota-cloud Mar 5, 2025
a517173
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Mar 5, 2025
f900f6d
use datetime_to_ts_millis helper
ryota-cloud Mar 5, 2025
5c46c59
refactored unit test case for better assertion
ryota-cloud Mar 5, 2025
1772b7e
Modified integration test to cover relationship between job to datase…
ryota-cloud Mar 5, 2025
8e40b7c
fix import error in test case
ryota-cloud Mar 5, 2025
2a91e6d
Merge remote-tracking branch 'oss-datahub/master' into vertex_src_temp
ryota-cloud Mar 5, 2025
7 changes: 7 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/sources.json
@@ -333,5 +333,12 @@
"description": "Import Nodes and Relationships from Neo4j.",
"docsUrl": "https://datahubproject.io/docs/generated/ingestion/sources/neo4j/",
"recipe": "source:\n type: 'neo4j'\n config:\n uri: 'neo4j+ssc://host:7687'\n username: 'neo4j'\n password: 'password'\n env: 'PROD'\n\nsink:\n type: \"datahub-rest\"\n config:\n server: 'http://localhost:8080'"
},
{
"urn": "urn:li:dataPlatform:vertexai",
"name": "vertexai",
"displayName": "VertexAI",
"docsUrl": "https://datahubproject.io/docs/generated/ingestion/sources/vertexai/",
"recipe": "source:\n type: vertexai\n config:\n project_id: # your GCP project ID \n region: # region where your GCP project resides \n # Credentials\n # Add GCP credentials"
}
]
Binary file added datahub-web-react/src/images/vertexai.png
48 changes: 48 additions & 0 deletions metadata-ingestion/docs/sources/vertexai/vertexai_pre.md
@@ -0,0 +1,48 @@
Ingesting metadata from VertexAI requires using the **Vertex AI** module.

#### Prerequisites
Please refer to the [Vertex AI documentation](https://cloud.google.com/vertex-ai/docs) for basic information on Vertex AI.

#### Credentials to access GCP
See the [GCP docs](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) to understand how to set up Application Default Credentials for GCP.

#### Create a service account and assign roles

1. Set up a service account per the [GCP docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console) and assign it the roles required to read Vertex AI metadata.
2. Download a service account JSON keyfile.
- Example credential file:

   Review thread on this code block:

   - Collaborator: this should be nested under list item 2 - not dedented
   - Author: fixed, probably big query doc also need to be fixed.
   - Collaborator: not fixed; this code block should be indented, similar to how the code blocks below are indented

```json
{
"type": "service_account",
"project_id": "project-id-1234567",
"private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
"client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
"client_id": "113545814931671546333",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com"
}
```

3. To provide credentials to the source, you can either:

- Set an environment variable:

```sh
$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
```

_or_

- Set credential config in your source based on the credential json file. For example:

```yml
credential:
private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
client_id: "123456678890"
```
16 changes: 16 additions & 0 deletions metadata-ingestion/docs/sources/vertexai/vertexai_recipe.yml
@@ -0,0 +1,16 @@
source:
type: vertexai
config:
project_id: "acryl-poc"
region: "us-west2"
# Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication.
    Suggested change (review comment):
    - Current: "# Note that GOOGLE_APPLICATION_CREDENTIALS or credential section below is required for authentication."
    - Suggested: "# You must either set GOOGLE_APPLICATION_CREDENTIALS or provide credential as shown below."
# credential:
# private_key: '-----BEGIN PRIVATE KEY-----\\nprivate-key\\n-----END PRIVATE KEY-----\\n'
# private_key_id: "project_key_id"
# client_email: "client_email"
# client_id: "client_id"

sink:
type: "datahub-rest"
config:
server: "http://localhost:8080"
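Assuming the `vertexai` plugin extra added to setup.py in this PR, a recipe file like the above would typically be run with the DataHub CLI:

```shell
# Install the ingestion framework with the VertexAI plugin
# (extra name per this PR's setup.py), then run the recipe.
pip install 'acryl-datahub[vertexai]'
datahub ingest -c vertexai_recipe.yml
```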
4 changes: 4 additions & 0 deletions metadata-ingestion/setup.py
@@ -532,6 +532,7 @@
"sigma": sqlglot_lib | {"requests"},
"sac": sac,
"neo4j": {"pandas", "neo4j"},
"vertexai": {"google-cloud-aiplatform>=1.80.0"},
}

# This is mainly used to exclude plugins from the Docker image.
@@ -677,6 +678,7 @@
"sac",
"cassandra",
"neo4j",
"vertexai",
]
if plugin
for dependency in plugins[plugin]
@@ -710,6 +712,7 @@
"mariadb",
"redash",
"vertica",
"vertexai"
]
if plugin
for dependency in plugins[plugin]
@@ -799,6 +802,7 @@
"sac = datahub.ingestion.source.sac.sac:SACSource",
"cassandra = datahub.ingestion.source.cassandra.cassandra:CassandraSource",
"neo4j = datahub.ingestion.source.neo4j.neo4j_source:Neo4jSource",
"vertexai = datahub.ingestion.source.vertexai:VertexAISource",
],
"datahub.ingestion.transformer.plugins": [
"pattern_cleanup_ownership = datahub.ingestion.transformer.pattern_cleanup_ownership:PatternCleanUpOwnership",
@@ -1,8 +1,6 @@
import json
import logging
import os
import re
import tempfile
from datetime import timedelta
from typing import Any, Dict, List, Optional, Union

@@ -17,10 +15,10 @@
PlatformInstanceConfigMixin,
)
from datahub.configuration.validate_field_removal import pydantic_removed_field
from datahub.configuration.validate_multiline_string import pydantic_multiline_string
from datahub.ingestion.glossary.classification_mixin import (
ClassificationSourceConfigMixin,
)
from datahub.ingestion.source.common.gcp_credentials_config import GCPCredential
from datahub.ingestion.source.data_lake_common.path_spec import PathSpec
from datahub.ingestion.source.sql.sql_config import SQLCommonConfig, SQLFilterConfig
from datahub.ingestion.source.state.stateful_ingestion_base import (
@@ -107,50 +105,8 @@ class BigQueryUsageConfig(BaseUsageConfig):
)


class BigQueryCredential(ConfigModel):
project_id: str = Field(description="Project id to set the credentials")
private_key_id: str = Field(description="Private key id")
private_key: str = Field(
description="Private key in a form of '-----BEGIN PRIVATE KEY-----\\nprivate-key\\n-----END PRIVATE KEY-----\\n'"
)
client_email: str = Field(description="Client email")
client_id: str = Field(description="Client Id")
auth_uri: str = Field(
default="https://accounts.google.com/o/oauth2/auth",
description="Authentication uri",
)
token_uri: str = Field(
default="https://oauth2.googleapis.com/token", description="Token uri"
)
auth_provider_x509_cert_url: str = Field(
default="https://www.googleapis.com/oauth2/v1/certs",
description="Auth provider x509 certificate url",
)
type: str = Field(default="service_account", description="Authentication type")
client_x509_cert_url: Optional[str] = Field(
default=None,
description="If not set it will be default to https://www.googleapis.com/robot/v1/metadata/x509/client_email",
)

_fix_private_key_newlines = pydantic_multiline_string("private_key")

@root_validator(skip_on_failure=True)
def validate_config(cls, values: Dict[str, Any]) -> Dict[str, Any]:
if values.get("client_x509_cert_url") is None:
values["client_x509_cert_url"] = (
f"https://www.googleapis.com/robot/v1/metadata/x509/{values['client_email']}"
)
return values

def create_credential_temp_file(self) -> str:
with tempfile.NamedTemporaryFile(delete=False) as fp:
cred_json = json.dumps(self.dict(), indent=4, separators=(",", ": "))
fp.write(cred_json.encode())
return fp.name


class BigQueryConnectionConfig(ConfigModel):
credential: Optional[BigQueryCredential] = Field(
credential: Optional[GCPCredential] = Field(
default=None, description="BigQuery credential informations"
)

@@ -0,0 +1,53 @@
import json
import tempfile
from typing import Any, Dict, Optional

from pydantic import Field, root_validator

from datahub.configuration import ConfigModel
from datahub.configuration.validate_multiline_string import pydantic_multiline_string


class GCPCredential(ConfigModel):
project_id: Optional[str] = Field(description="Project id to set the credentials")
private_key_id: str = Field(description="Private key id")
private_key: str = Field(
description="Private key in a form of '-----BEGIN PRIVATE KEY-----\\nprivate-key\\n-----END PRIVATE KEY-----\\n'"
)
client_email: str = Field(description="Client email")
client_id: str = Field(description="Client Id")
auth_uri: str = Field(
default="https://accounts.google.com/o/oauth2/auth",
description="Authentication uri",
)
token_uri: str = Field(
default="https://oauth2.googleapis.com/token", description="Token uri"
)
auth_provider_x509_cert_url: str = Field(
default="https://www.googleapis.com/oauth2/v1/certs",
description="Auth provider x509 certificate url",
)
type: str = Field(default="service_account", description="Authentication type")
client_x509_cert_url: Optional[str] = Field(
default=None,
description="If not set, it defaults to https://www.googleapis.com/robot/v1/metadata/x509/{client_email}",
)

_fix_private_key_newlines = pydantic_multiline_string("private_key")

@root_validator(skip_on_failure=True)
def validate_config(cls, values: Dict[str, Any]) -> Dict[str, Any]:
if values.get("client_x509_cert_url") is None:
values["client_x509_cert_url"] = (
f"https://www.googleapis.com/robot/v1/metadata/x509/{values['client_email']}"
)
return values

def create_credential_temp_file(self, project_id: Optional[str] = None) -> str:
configs = self.dict()
if project_id:
configs["project_id"] = project_id
with tempfile.NamedTemporaryFile(delete=False) as fp:
cred_json = json.dumps(configs, indent=4, separators=(",", ": "))
fp.write(cred_json.encode())
return fp.name
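The `project_id` override added to `create_credential_temp_file` is what lets the VertexAI source pass its configured project through. A standalone, stdlib-only sketch of the same behavior:

```python
import json
import tempfile
from typing import Optional


def create_credential_temp_file(configs: dict, project_id: Optional[str] = None) -> str:
    """Write credential fields to a temp JSON keyfile, mirroring
    GCPCredential.create_credential_temp_file above: the optional
    project_id (the parameter added in this PR) overrides the config's.
    """
    if project_id:
        configs = {**configs, "project_id": project_id}
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as fp:
        fp.write(json.dumps(configs, indent=4, separators=(",", ": ")).encode())
        return fp.name


path = create_credential_temp_file(
    {"type": "service_account", "project_id": "original"},
    project_id="acryl-poc",
)
with open(path) as f:
    print(json.load(f)["project_id"])  # acryl-poc
```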