improve doc

datahub-project · Mar 3, 2025 · 881cb34
1 parent a293129
Showing 1 changed file with 62 additions and 40 deletions.
diff --git a/docs/cli-commands/dataset.md b/docs/cli-commands/dataset.md
@@ -4,24 +4,6 @@ The `dataset` command allows you to interact with Dataset entities in DataHub. T
 
 ## Commands
 
-### upsert
-
-Create or update Dataset metadata in DataHub.
-
-```shell
-datahub dataset upsert -f PATH_TO_YAML_FILE
-```
-
-**Options:**
-- `-f, --file` - Path to the YAML file containing Dataset metadata (required)
-
-**Example:**
-```shell
-datahub dataset upsert -f dataset.yaml
-```
-
-This command will parse the YAML file, validate that any entity references exist in DataHub, and then emit the corresponding metadata change proposals to update or create the Dataset.
-
 ### sync
 
 Synchronize Dataset metadata between YAML files and DataHub.
@@ -46,47 +28,71 @@ datahub dataset sync -f dataset.yaml --from-datahub
 
 The `sync` command offers bidirectional synchronization, allowing you to keep your local YAML files in sync with the DataHub platform. The `upsert` command actually uses `sync` with the `--to-datahub` flag internally.
 
-### get
+For details on the supported YAML format, see the [Dataset YAML Format](#dataset-yaml-format) section.
 
-Retrieve Dataset metadata from DataHub and optionally write it to a file.
+### file
+
+Operate on a Dataset YAML file for validation or linting.
 
 ```shell
-datahub dataset get --urn DATASET_URN [--to-file OUTPUT_FILE]
+datahub dataset file [--lintCheck] [--lintFix] PATH_TO_YAML_FILE
 ```
 
 **Options:**
-- `--urn` - The Dataset URN to retrieve (required)
-- `--to-file` - Path to write the Dataset metadata as YAML (optional)
+- `--lintCheck` - Check the YAML file for formatting issues (optional)
+- `--lintFix` - Fix formatting issues in the YAML file (optional)
 
 **Example:**
 ```shell
-datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file my_dataset.yaml
+# Check for linting issues
+datahub dataset file --lintCheck dataset.yaml
+
+# Fix linting issues
+datahub dataset file --lintFix dataset.yaml
 ```
 
-If the URN does not start with `urn:li:dataset:`, it will be automatically prefixed.
+This command helps maintain consistent formatting of your Dataset YAML files. For more information on the expected format, refer to the [Dataset YAML Format](#dataset-yaml-format) section.
 
-### file
+### upsert
 
-Operate on a Dataset YAML file for validation or linting.
+Create or update Dataset metadata in DataHub.
 
 ```shell
-datahub dataset file [--lintCheck] [--lintFix] PATH_TO_YAML_FILE
+datahub dataset upsert -f PATH_TO_YAML_FILE
 ```
 
 **Options:**
-- `--lintCheck` - Check the YAML file for formatting issues (optional)
-- `--lintFix` - Fix formatting issues in the YAML file (optional)
+- `-f, --file` - Path to the YAML file containing Dataset metadata (required)
 
 **Example:**
 ```shell
-# Check for linting issues
-datahub dataset file --lintCheck dataset.yaml
+datahub dataset upsert -f dataset.yaml
+```
 
-# Fix linting issues
-datahub dataset file --lintFix dataset.yaml
+This command will parse the YAML file, validate that any entity references exist in DataHub, and then emit the corresponding metadata change proposals to update or create the Dataset.
+
+For details on the required structure of your YAML file, see the [Dataset YAML Format](#dataset-yaml-format) section.
+
+### get
+
+Retrieve Dataset metadata from DataHub and optionally write it to a file.
+
+```shell
+datahub dataset get --urn DATASET_URN [--to-file OUTPUT_FILE]
 ```
 
-This command helps maintain consistent formatting of your Dataset YAML files.
+**Options:**
+- `--urn` - The Dataset URN to retrieve (required)
+- `--to-file` - Path to write the Dataset metadata as YAML (optional)
+
+**Example:**
+```shell
+datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file my_dataset.yaml
+```
+
+If the URN does not start with `urn:li:dataset:`, it will be automatically prefixed.
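
The prefixing rule described in the line above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the CLI's actual implementation; `normalize_dataset_urn` is a hypothetical helper name introduced here.

```python
DATASET_URN_PREFIX = "urn:li:dataset:"


def normalize_dataset_urn(urn: str) -> str:
    """Hypothetical helper: add the dataset URN prefix when it is missing.

    Mirrors the documented rule only; the real CLI may perform
    additional validation that is not shown here.
    """
    if urn.startswith(DATASET_URN_PREFIX):
        return urn
    return DATASET_URN_PREFIX + urn


if __name__ == "__main__":
    # A bare key without the prefix and a full URN normalize to the same value.
    print(normalize_dataset_urn("(urn:li:dataPlatform:hive,example_table,PROD)"))
    print(normalize_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)"))
```

Under this reading, passing either the full URN or just the parenthesized key to `--urn` would resolve to the same Dataset.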
+
+The output file follows the structure described in the [Dataset YAML Format](#dataset-yaml-format) section.
 
 ### add_sibling
 
@@ -130,9 +136,7 @@ schema:
       doc: "First field"           # Alias for description
       nativeDataType: "VARCHAR"    # Native platform type (defaults to type if not specified)
       nullable: false              # Whether field can be null (default: false)
-      jsonPath: "$.field1"         # JSON path for the field
-      label: "Field One"           # Display label
-      recursive: false             # Whether field is recursive (default: false)
+      label: "Field One"           # Display label (optional business label for the field)
       isPartOfKey: true            # Whether field is part of primary key
       isPartitioningKey: false     # Whether field is a partitioning key
       jsonProps: {"customProp": "value"} # Custom JSON properties
@@ -146,6 +150,7 @@ schema:
       structured_properties:
         property1: "value1"
         property2: 42
+  file: example.schema.avsc      # Optional schema file (required if defining tables with nested fields)
 
 # Additional metadata (all optional)
 properties:                        # Custom properties as key-value pairs
@@ -214,12 +219,11 @@ The Schema Field object supports the following properties:
 |----------|------|-------------|
 | `id` | string | Field identifier/path (required if `urn` not provided) |
 | `urn` | string | URN of the schema field (required if `id` not provided) |
-| `type` | string | Data type (one of the supported field types) |
+| `type` | string | Data type (one of the supported [Field Types](#field-types)) |
 | `nativeDataType` | string | Native data type in the source platform (defaults to `type` if not specified) |
 | `description` | string | Field description |
 | `doc` | string | Alias for description |
 | `nullable` | boolean | Whether the field can be null (default: false) |
-| `jsonPath` | string | JSON path for the field |
 | `label` | string | Display label for the field |
 | `recursive` | boolean | Whether the field is recursive (default: false) |
 | `isPartOfKey` | boolean | Whether the field is part of the primary key |
@@ -229,6 +233,24 @@ The Schema Field object supports the following properties:
 | `glossaryTerms` | array | List of glossary terms associated with the field |
 | `structured_properties` | object | Structured properties for the field |
 
+
+**Important Note on Schema Field Types**:
+When specifying fields in the YAML file, you must follow an all-or-nothing approach with the `type` field:
+- If you want the command to generate the schema for you, specify the `type` field for ALL fields.
+- If you only want to add field-level metadata (like tags, glossary terms, or structured properties), do NOT specify the `type` field for ANY field.
+
+Example of fields with only metadata (no types):
+```yaml
+schema:
+  fields:
+    - id: "field1"                 # Field identifier
+      structured_properties:
+        prop1: prop_value
+    - id: "field2"
+      structured_properties:
+        prop1: prop_value
+```
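
The all-or-nothing rule for `type` could be checked before running the command with a small pre-flight script. The sketch below is an assumption-laden illustration: `check_field_types` is a hypothetical helper (not part of the DataHub CLI), it works on field entries that have already been loaded into plain dictionaries, and the `type` values used in the demo are placeholders rather than confirmed entries from the supported field types list.

```python
from typing import Any, Dict, List


def check_field_types(fields: List[Dict[str, Any]]) -> str:
    """Hypothetical pre-flight check for the all-or-nothing `type` rule.

    Returns "all" when every field declares a type, "none" when no field
    does, and raises ValueError when the list mixes both styles.
    """
    typed = [f for f in fields if f.get("type") is not None]
    if not typed:
        return "none"
    if len(typed) == len(fields):
        return "all"
    # Report the fields that are missing a type so the file can be fixed.
    missing = [
        str(f.get("id") or f.get("urn") or "<unnamed>")
        for f in fields
        if f.get("type") is None
    ]
    raise ValueError(
        "type must be set on all fields or on none; missing on: "
        + ", ".join(missing)
    )


if __name__ == "__main__":
    # Metadata-only fields, as in the YAML example above.
    metadata_only = [
        {"id": "field1", "structured_properties": {"prop1": "prop_value"}},
        {"id": "field2", "structured_properties": {"prop1": "prop_value"}},
    ]
    print(check_field_types(metadata_only))

    # Mixed usage violates the rule and is rejected.
    mixed = [{"id": "field1", "type": "string"}, {"id": "field2"}]
    try:
        check_field_types(mixed)
    except ValueError as err:
        print(err)
```

Whether the CLI itself rejects mixed files or silently ignores the partial types is not stated in this change, so a check like this is only a guard on the author's side.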
+
 ### Ownership Types
 
 When specifying owners, the following ownership types are supported: