[#3001] docs(spark-connector): add spark connector document #3018
---
title: "Spark connector Hive catalog"
slug: /spark-connector/spark-catalog-hive
keyword: spark connector hive catalog
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

With the Gravitino Spark connector, accessing data or managing metadata in Hive catalogs becomes straightforward, enabling seamless federation queries across different Hive catalogs.

## Capabilities

Supports most DDL and DML operations in SparkSQL, except the following operations:

- Function operations
- Partition operations
- View operations
- Querying UDFs
- `LOAD` clause
- `CREATE TABLE LIKE` clause
- `TRUNCATE TABLE` clause

## Requirements

* Hive metastore 2.x
* HDFS 2.x or 3.x

## SQL example

```sql
-- Suppose hive_a is the Hive catalog name managed by Gravitino
USE hive_a;

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

-- Create a partitioned table
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  age INT
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;
DESC TABLE EXTENDED employees;

INSERT OVERWRITE TABLE employees PARTITION(department='Engineering') VALUES (1, 'John Doe', 30), (2, 'Jane Smith', 28);
INSERT OVERWRITE TABLE employees PARTITION(department='Marketing') VALUES (3, 'Mike Brown', 32);

SELECT * FROM employees WHERE department = 'Engineering';
```
---
title: "Spark connector Iceberg catalog"
slug: /spark-connector/spark-catalog-iceberg
keyword: spark connector iceberg catalog
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

## Capabilities

#### Supported DML and DDL operations:

- `CREATE TABLE`

  Supports the basic `CREATE TABLE` clause, including table schema, properties, and partitioning; does not support distribution and sort orders.

- `DROP TABLE`
- `ALTER TABLE`
- `INSERT INTO` and `INSERT OVERWRITE`
- `SELECT`
- `DELETE`

  Supports file delete only.

#### Unsupported operations:

- Row-level operations, like `MERGE INTO`, `DELETE FROM`, and `UPDATE`
- View operations
- Branching and tagging operations
- Spark procedures
- Other Iceberg extension SQL, like:
  - `ALTER TABLE prod.db.sample ADD PARTITION FIELD xx`
  - `ALTER TABLE ... WRITE ORDERED BY`

## SQL example

```sql
-- Suppose iceberg_a is the Iceberg catalog name managed by Gravitino
USE iceberg_a;

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

CREATE TABLE IF NOT EXISTS employee (
  id bigint,
  name string,
  department string,
  hire_date timestamp
) USING iceberg
PARTITIONED BY (days(hire_date));
DESC TABLE EXTENDED employee;

INSERT INTO employee
VALUES
(1, 'Alice', 'Engineering', TIMESTAMP '2021-01-01 09:00:00'),
(2, 'Bob', 'Marketing', TIMESTAMP '2021-02-01 10:30:00'),
(3, 'Charlie', 'Sales', TIMESTAMP '2021-03-01 08:45:00');

SELECT * FROM employee WHERE date(hire_date) = '2021-01-01';
```
---
title: "Gravitino Spark connector"
slug: /spark-connector/spark-connector
keyword: spark connector federation query
license: "Copyright 2024 Datastrato Pvt Ltd.
This software is licensed under the Apache License version 2."
---

## Overview

The Gravitino Spark connector leverages the Spark DataSourceV2 interface to facilitate the management of diverse catalogs under Gravitino. This capability allows users to perform federation queries, accessing data from various catalogs through a unified interface and consistent access control.

## Capabilities

1. Supports [Hive catalog](spark-catalog-hive.md) and [Iceberg catalog](spark-catalog-iceberg.md).
2. Supports federation queries.
3. Supports most DDL and DML SQL statements.

## Requirements

* Spark 3.4
* Scala 2.12
* JDK 8, 11, or 17

## How to use it

1. [Build](../how-to-build.md) or download the Gravitino Spark connector jar, and place it in the classpath of Spark.
2. Configure the Spark session to use the Gravitino Spark connector.

| Property                     | Type   | Default Value | Description                                                                                         | Required | Since Version |
|------------------------------|--------|---------------|-----------------------------------------------------------------------------------------------------|----------|---------------|
| spark.plugins                | string | (none)        | Gravitino Spark plugin name, `com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin` | Yes      | 0.5.0         |
| spark.sql.gravitino.metalake | string | (none)        | The metalake name that the Spark connector uses to request Gravitino.                               | Yes      | 0.5.0         |
| spark.sql.gravitino.uri      | string | (none)        | The URI of the Gravitino server address.                                                            | Yes      | 0.5.0         |

```shell
./bin/spark-sql -v \
--conf spark.plugins="com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
--conf spark.sql.gravitino.uri=http://127.0.0.1:8090 \
--conf spark.sql.gravitino.metalake=test \
--conf spark.sql.warehouse.dir=hdfs://127.0.0.1:9000/user/hive/warehouse-hive
```
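For programmatic setups, the same three required properties can be assembled in code before being passed as `--conf` flags or `SparkSession` builder options. The helper below is a hypothetical sketch for illustration, not part of the connector:

```python
# Hypothetical helper (not a connector API): collect the required Gravitino
# connector properties from the table above into a plain dict.
def gravitino_spark_conf(metalake: str, uri: str) -> dict:
    return {
        "spark.plugins": "com.datastrato.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
        "spark.sql.gravitino.metalake": metalake,
        "spark.sql.gravitino.uri": uri,
    }

# Example matching the spark-sql invocation above.
conf = gravitino_spark_conf("test", "http://127.0.0.1:8090")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```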

3. Execute the Spark SQL query.

Suppose there are two catalogs in the metalake `test`: `hive` for the Hive catalog and `iceberg` for the Iceberg catalog.

```sql
-- use the Hive catalog
USE hive;
CREATE DATABASE db;
USE db;
CREATE TABLE hive_students (id INT, name STRING);
INSERT INTO hive_students VALUES (1, 'Alice'), (2, 'Bob');

-- use the Iceberg catalog
USE iceberg;
USE db;
CREATE TABLE IF NOT EXISTS iceberg_scores (id INT, score INT) USING iceberg;
INSERT INTO iceberg_scores VALUES (1, 95), (2, 88);

-- execute a federation query joining the Hive table and the Iceberg table
SELECT hs.name, sc.score FROM hive.db.hive_students hs JOIN iceberg_scores sc ON hs.id = sc.id;
```

:::info
The command `SHOW CATALOGS` only displays the Spark default catalog, named `spark_catalog`, due to limitations within the Spark catalog manager; it does not list the catalogs present in the metalake. However, after explicitly switching to a catalog with the `USE` command, that catalog name becomes visible in the output of `SHOW CATALOGS`.
:::

## Datatype mapping

The Gravitino Spark connector supports the following datatype mapping between Spark and Gravitino.

| Spark Data Type | Gravitino Data Type | Since Version |
|-----------------|---------------------|---------------|
| `BooleanType`   | `boolean`           | 0.5.0         |
| `ByteType`      | `byte`              | 0.5.0         |
| `ShortType`     | `short`             | 0.5.0         |
| `IntegerType`   | `integer`           | 0.5.0         |
| `LongType`      | `long`              | 0.5.0         |
| `FloatType`     | `float`             | 0.5.0         |
| `DoubleType`    | `double`            | 0.5.0         |
| `DecimalType`   | `decimal`           | 0.5.0         |
| `StringType`    | `string`            | 0.5.0         |
| `CharType`      | `char`              | 0.5.0         |
| `VarcharType`   | `varchar`           | 0.5.0         |
| `TimestampType` | `timestamp`         | 0.5.0         |
| `DateType`      | `date`              | 0.5.0         |
| `BinaryType`    | `binary`            | 0.5.0         |
| `ArrayType`     | `array`             | 0.5.0         |
| `MapType`       | `map`               | 0.5.0         |
| `StructType`    | `struct`            | 0.5.0         |
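
As a quick sanity check in client code, the mapping above can be expressed as a lookup table. This is an illustrative sketch, not an API shipped with the connector:

```python
# Illustrative sketch (not a connector API): the Spark-to-Gravitino type
# mapping from the table above, keyed by Spark type name.
SPARK_TO_GRAVITINO_TYPE = {
    "BooleanType": "boolean",
    "ByteType": "byte",
    "ShortType": "short",
    "IntegerType": "integer",
    "LongType": "long",
    "FloatType": "float",
    "DoubleType": "double",
    "DecimalType": "decimal",
    "StringType": "string",
    "CharType": "char",
    "VarcharType": "varchar",
    "TimestampType": "timestamp",
    "DateType": "date",
    "BinaryType": "binary",
    "ArrayType": "array",
    "MapType": "map",
    "StructType": "struct",
}

print(SPARK_TO_GRAVITINO_TYPE["IntegerType"])  # → integer
```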