[#3001] docs(spark-connector): add spark connector document #3018

Merged 3 commits on Apr 22, 2024
53 changes: 53 additions & 0 deletions docs/spark-connector/spark-catalog-hive.md
---
title: "Spark connector hive catalog"
slug: /spark-connector/spark-catalog-hive
keyword: spark connector hive catalog
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

With the Gravitino Spark connector, accessing data or managing metadata in Hive catalogs becomes straightforward, enabling seamless federation queries across different Hive catalogs.
**Reviewer (Contributor):** Do you have some Hive or Iceberg specific configurations that should be listed here?

**Author:** No, the configurations are retrieved from the Gravitino server.


## Capabilities

Supports most DDL and DML operations in SparkSQL, except for the following operations:

- Function operations
- Partition operations
- View operations
- Querying UDF
- `LOAD` clause
- `CREATE TABLE LIKE` clause
- `TRUNCATE TABLE` clause

## Requirements

* Hive metastore 2.x
* HDFS 2.x or 3.x

## SQL example


```sql
-- Suppose hive_a is the Hive catalog name managed by Gravitino
USE hive_a;

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

-- Create the table
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
age INT
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;
DESC TABLE EXTENDED employees;

INSERT OVERWRITE TABLE employees PARTITION(department='Engineering') VALUES (1, 'John Doe', 30), (2, 'Jane Smith', 28);
INSERT OVERWRITE TABLE employees PARTITION(department='Marketing') VALUES (3, 'Mike Brown', 32);

SELECT * FROM employees WHERE department = 'Engineering';
```
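The two `INSERT OVERWRITE ... PARTITION` statements above replace rows only within the named partition. A minimal plain-Python sketch of that semantics (illustrative only, not the connector's code):

```python
# Illustrative sketch of INSERT OVERWRITE ... PARTITION semantics:
# overwriting one partition replaces only that partition's rows,
# leaving the other partitions untouched.
table = {}  # partition value -> list of rows

def insert_overwrite(partition: str, rows: list) -> None:
    # Replace the whole partition with the new rows.
    table[partition] = list(rows)

insert_overwrite("Engineering", [(1, "John Doe", 30), (2, "Jane Smith", 28)])
insert_overwrite("Marketing", [(3, "Mike Brown", 32)])
insert_overwrite("Engineering", [(4, "New Hire", 25)])  # replaces Engineering only

print(len(table["Engineering"]), len(table["Marketing"]))  # 1 1
```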
60 changes: 60 additions & 0 deletions docs/spark-connector/spark-catalog-iceberg.md
---
title: "Spark connector Iceberg catalog"
slug: /spark-connector/spark-catalog-iceberg
keyword: spark connector iceberg catalog
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

## Capabilities

#### Supported basic DDL and DML operations

- `CREATE TABLE`

Supports the basic `CREATE TABLE` clause, including table schema, properties, and partitioning; does not support distribution and sort orders.

- `DROP TABLE`
- `ALTER TABLE`
- `INSERT INTO` & `INSERT OVERWRITE`
- `SELECT`
- `DELETE`

Supports file-level deletes only.

#### Unsupported operations

- Row-level operations, like `MERGE INTO`, `DELETE FROM`, `UPDATE`.
- View operations.
- Branching and tagging operations.
- Spark procedures.
- Other Iceberg extension SQL, like:
- `ALTER TABLE prod.db.sample ADD PARTITION FIELD xx`
- `ALTER TABLE ... WRITE ORDERED BY`

## SQL example

```sql
-- Suppose iceberg_a is the Iceberg catalog name managed by Gravitino
USE iceberg_a;

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

CREATE TABLE IF NOT EXISTS employee (
id bigint,
name string,
department string,
hire_date timestamp
) USING iceberg
PARTITIONED BY (days(hire_date));
DESC TABLE EXTENDED employee;

INSERT INTO employee
VALUES
(1, 'Alice', 'Engineering', TIMESTAMP '2021-01-01 09:00:00'),
(2, 'Bob', 'Marketing', TIMESTAMP '2021-02-01 10:30:00'),
(3, 'Charlie', 'Sales', TIMESTAMP '2021-03-01 08:45:00');

SELECT * FROM employee WHERE date(hire_date) = '2021-01-01';
```
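The `days(hire_date)` partition transform used above groups rows by calendar day, so the final query scans only the matching day's partition. A rough plain-Python illustration of the idea (not Iceberg's actual implementation, which works on epoch-day ordinals):

```python
from datetime import datetime

def days_transform(ts: datetime) -> str:
    """Illustrative stand-in for Iceberg's days() partition transform:
    rows that share a calendar day land in the same partition."""
    return ts.strftime("%Y-%m-%d")

rows = [
    (1, "Alice", datetime(2021, 1, 1, 9, 0)),
    (2, "Bob", datetime(2021, 2, 1, 10, 30)),
    (3, "Charlie", datetime(2021, 3, 1, 8, 45)),
]
partitions = {days_transform(ts) for _, _, ts in rows}
print(sorted(partitions))  # three distinct day partitions
```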
93 changes: 93 additions & 0 deletions docs/spark-connector/spark-connector.md
---
title: "Gravitino Spark connector"
slug: /spark-connector/spark-connector
keyword: spark connector federation query
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

## Overview

The Gravitino Spark connector leverages the Spark DataSourceV2 interface to facilitate the management of diverse catalogs under Gravitino. This capability allows users to perform federation queries, accessing data from various catalogs through a unified interface and consistent access control.

## Capabilities

1. Supports [Hive catalog](spark-catalog-hive.md) and [Iceberg catalog](spark-catalog-iceberg.md).
2. Supports federation query.
3. Supports most DDL and DML SQL statements.

## Requirements

* Spark 3.4
* Scala 2.12
* JDK 8, 11, or 17

## How to use it

1. [Build](../how-to-build.md) or download the Gravitino Spark connector JAR, and place it in the classpath of Spark.
2. Configure the Spark session to use the Gravitino spark connector.

| Property | Type | Default Value | Description | Required | Since Version |
|------------------------------|--------|---------------|-----------------------------------------------------------------------------------------------------|----------|---------------|
| spark.plugins | string | (none) | Gravitino spark plugin name, `com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin` | Yes | 0.5.0 |
| spark.sql.gravitino.metalake | string | (none) | The metalake name that the Spark connector uses to request metadata from Gravitino. | Yes | 0.5.0 |
| spark.sql.gravitino.uri | string | (none) | The URI of the Gravitino server. | Yes | 0.5.0 |

```shell
./bin/spark-sql -v \
--conf spark.plugins="com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
--conf spark.sql.gravitino.uri=http://127.0.0.1:8090 \
--conf spark.sql.gravitino.metalake=test \
--conf spark.sql.warehouse.dir=hdfs://127.0.0.1:9000/user/hive/warehouse-hive
```
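The three required properties from the table above can also be assembled programmatically before launching a session. A minimal plain-Python sketch (the URI and metalake values are placeholders, mirroring the `spark-sql` invocation above):

```python
# Sketch: the three required Gravitino connector properties from the table
# above, assembled as a plain dict. URI and metalake values are placeholders.
gravitino_conf = {
    "spark.plugins": "com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
    "spark.sql.gravitino.uri": "http://127.0.0.1:8090",
    "spark.sql.gravitino.metalake": "test",
}

def missing_required(conf: dict) -> list:
    """Return the required Gravitino properties absent from a Spark conf."""
    required = [
        "spark.plugins",
        "spark.sql.gravitino.uri",
        "spark.sql.gravitino.metalake",
    ]
    return [key for key in required if key not in conf]

print(missing_required(gravitino_conf))  # []
```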

3. Execute the Spark SQL query.

Suppose there are two catalogs in the metalake `test`, `hive` for Hive catalog and `iceberg` for Iceberg catalog.

```sql
-- use the Hive catalog
USE hive;
CREATE DATABASE db;
USE db;
CREATE TABLE hive_students (id INT, name STRING);
INSERT INTO hive_students VALUES (1, 'Alice'), (2, 'Bob');

-- use the Iceberg catalog
USE iceberg;
USE db;
CREATE TABLE IF NOT EXISTS iceberg_scores (id INT, score INT) USING iceberg;
INSERT INTO iceberg_scores VALUES (1, 95), (2, 88);

-- execute a federation query joining the Hive table and the Iceberg table
SELECT hs.name, isc.score FROM hive.db.hive_students hs JOIN iceberg_scores isc ON hs.id = isc.id;
```
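The federated query above is an ordinary inner join across two catalogs; Spark plans and executes it like any other join. A plain-Python sketch of the same join logic over the inserted rows (illustrative only, the connector delegates this to Spark):

```python
# The rows inserted into the two catalogs above.
hive_students = [(1, "Alice"), (2, "Bob")]
iceberg_scores = [(1, 95), (2, 88)]

# Inner join on id, mirroring the federated SQL query.
scores = dict(iceberg_scores)
result = [(name, scores[sid]) for sid, name in hive_students if sid in scores]
print(result)  # [('Alice', 95), ('Bob', 88)]
```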

:::info

Due to limitations in the Spark catalog manager, the `SHOW CATALOGS` command initially displays only the Spark default catalog, named `spark_catalog`; it does not list the catalogs in the metalake. However, after you explicitly select a catalog with the `USE` command, that catalog name becomes visible in the output of `SHOW CATALOGS`.
:::

## Datatype mapping

The Gravitino Spark connector supports the following data type mapping between Spark and Gravitino.

| Spark Data Type | Gravitino Data Type | Since Version |
|-----------------|---------------------|---------------|
| `BooleanType` | `boolean` | 0.5.0 |
| `ByteType` | `byte` | 0.5.0 |
| `ShortType` | `short` | 0.5.0 |
| `IntegerType` | `integer` | 0.5.0 |
| `LongType` | `long` | 0.5.0 |
| `FloatType` | `float` | 0.5.0 |
| `DoubleType` | `double` | 0.5.0 |
| `DecimalType` | `decimal` | 0.5.0 |
| `StringType` | `string` | 0.5.0 |
| `CharType` | `char` | 0.5.0 |
| `VarcharType` | `varchar` | 0.5.0 |
| `TimestampType` | `timestamp` | 0.5.0 |
| `DateType` | `date` | 0.5.0 |
| `BinaryType` | `binary` | 0.5.0 |
| `ArrayType` | `array` | 0.5.0 |
| `MapType` | `map` | 0.5.0 |
| `StructType` | `struct` | 0.5.0 |
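The table can be read as a simple name-to-name mapping; a sketch for quick lookup (note this is an illustration derived from the table, and parameterized types such as `DecimalType`, `CharType`, and `VarcharType` also carry precision/scale or length in practice):

```python
# Scalar Spark -> Gravitino type-name mapping from the table above.
# Parameterized types (DecimalType, CharType, VarcharType) also carry
# precision/scale or length, omitted in this name-only sketch.
SPARK_TO_GRAVITINO = {
    "BooleanType": "boolean",
    "ByteType": "byte",
    "ShortType": "short",
    "IntegerType": "integer",
    "LongType": "long",
    "FloatType": "float",
    "DoubleType": "double",
    "DecimalType": "decimal",
    "StringType": "string",
    "CharType": "char",
    "VarcharType": "varchar",
    "TimestampType": "timestamp",
    "DateType": "date",
    "BinaryType": "binary",
    "ArrayType": "array",
    "MapType": "map",
    "StructType": "struct",
}
print(SPARK_TO_GRAVITINO["IntegerType"])  # integer
```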