Commit 6e701ef

Merge pull request #2 from usdot-jpo-ode/develop
Sync the CDOT fork with USDOT
2 parents 8a67952 + 4d76447 commit 6e701ef

78 files changed (+5528 −327 lines)


.github/workflows/docker.yml (new file, +24)

```yaml
name: Docker build

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  jpo-deduplicator:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Build
        uses: docker/build-push-action@v3
        with:
          context: jpo-deduplicator
          build-args: |
            MAVEN_GITHUB_TOKEN_NAME=${{ vars.MAVEN_GITHUB_TOKEN_NAME }}
            MAVEN_GITHUB_TOKEN=${{ secrets.MAVEN_GITHUB_TOKEN }}
            MAVEN_GITHUB_ORG=${{ github.repository_owner }}
          secrets: |
            MAVEN_GITHUB_TOKEN: ${{ secrets.MAVEN_GITHUB_TOKEN }}
```

.github/workflows/dockerhub.yml (new file, +39)

```yaml
name: "DockerHub Build and Push"

on:
  push:
    branches:
      - "develop"
      - "master"
      - "release/*"

jobs:
  dockerhub-jpo-deduplicator:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to DockerHub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      # Derive the image tag from the last segment of the branch ref
      # (e.g. refs/heads/develop -> develop)
      - name: Replace Docker tag
        id: set_tag
        run: echo "TAG=$(echo ${GITHUB_REF##*/} | sed 's/\//-/g')" >> $GITHUB_ENV

      - name: Build
        uses: docker/build-push-action@v3
        with:
          context: jpo-deduplicator
          push: true
          tags: usdotjpoode/jpo-deduplicator:${{ env.TAG }}
          build-args: |
            MAVEN_GITHUB_TOKEN_NAME=${{ vars.MAVEN_GITHUB_TOKEN_NAME }}
            MAVEN_GITHUB_TOKEN=${{ secrets.MAVEN_GITHUB_TOKEN }}
            MAVEN_GITHUB_ORG=${{ github.repository_owner }}
          secrets: |
            MAVEN_GITHUB_TOKEN: ${{ secrets.MAVEN_GITHUB_TOKEN }}
```

.gitignore (+3 −1)

```diff
@@ -1 +1,3 @@
-**/.env
+**/.env
+
+**/target
```

README.md (+107 −22)

```diff
@@ -19,7 +19,12 @@ The JPO ITS utilities repository serves as a central location for deploying open
   - [Quick Run](#quick-run-1)
 - [4. MongoDB Kafka Connect](#4-mongodb-kafka-connect)
   - [Configuration](#configuration)
+    - [Configure Kafka Connector Creation](#configure-kafka-connector-creation)
   - [Quick Run](#quick-run-2)
+- [5. Deduplicator](#5-jpo-Deduplicator)
+  - [Deduplication Configuration](#deduplication-config)
+  - [Github Token Generation](#generate-a-github-token)
+  - [Quick Run](#quick-run-3)
 
 
 <a name="base-configuration"></a>
@@ -88,7 +93,7 @@ An optional `kafka-init`, `schema-registry`, and `kafka-ui` instance can be depl
 
 ### Configure Topic Creation
 
-The Kafka topics created by the `kafka-setup` service are configured in the [kafka-topics-values.yaml](kafka/kafka-topics-values.yaml) file. The topics in that file are organized by the application, and sorted into "Stream Topics" (those with `cleanup.policy` = `delete`) and "Table Topics" (with `cleanup.policy` = `compact`).
+The Kafka topics created by the `kafka-setup` service are configured in the [kafka-topics-values.yaml](jikkou/kafka-topics-values.yaml) file. The topics in that file are organized by the application, and sorted into "Stream Topics" (those with `cleanup.policy` = `delete`) and "Table Topics" (with `cleanup.policy` = `compact`).
 
 The following enviroment variables can be used to configure Kafka Topic creation.
 
@@ -103,8 +108,7 @@ The following enviroment variables can be used to configure Kafka Topic creation
 | `KAFKA_TOPIC_MIN_INSYNC_REPLICAS` | Minumum number of in-sync replicas (for use with ack=all) |
 | `KAFKA_TOPIC_RETENTION_MS` | Retention time for stream topics, milliseconds |
 | `KAFKA_TOPIC_DELETE_RETENTION_MS` | Tombstone retention time for compacted topics, milliseconds |
-
-
+| `KAFKA_TOPIC_CONFIG_RELATIVE_PATH` | Relative path to the Kafka topic yaml configuration script, upper level directories are supported |
 
 ### Quick Run
 
```
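The layout of `jikkou/kafka-topics-values.yaml` itself is not part of this diff. Purely as an illustrative sketch, a per-application entry following the stream/table split described in the Configure Topic Creation hunk above might look roughly like the following; every field name and value here is assumed rather than taken from the real file.

```yaml
# Hypothetical excerpt of jikkou/kafka-topics-values.yaml -- structure and values are assumed
deduplicator:
  streamTopics:                       # cleanup.policy = delete
    - name: topic.DeduplicatedOdeMapJson
      retentionMs: 300000             # see KAFKA_TOPIC_RETENTION_MS
  tableTopics:                        # cleanup.policy = compact
    - name: topic.KafkaConnectConfigs
      deleteRetentionMs: 3600000      # see KAFKA_TOPIC_DELETE_RETENTION_MS
```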
````diff
@@ -121,34 +125,49 @@ The following enviroment variables can be used to configure Kafka Topic creation
 <a name="mongodb-kafka-connect"></a>
 
 ## 4. MongoDB Kafka Connect
-The mongo-connector service connects to specified Kafka topics (as defined in the mongo-connector/connect_start.sh script) and deposits these messages to separate collections in the MongoDB Database. The codebase that provides this functionality comes from Confluent using their community licensed [cp-kafka-connect image](https://hub.docker.com/r/confluentinc/cp-kafka-connect). Documentation for this image can be found [here](https://docs.confluent.io/platform/current/connect/index.html#what-is-kafka-connect).
+The mongo-connector service connects to specified Kafka topics and deposits these messages to separate collections in the MongoDB Database. The codebase that provides this functionality comes from Confluent using their community licensed [cp-kafka-connect image](https://hub.docker.com/r/confluentinc/cp-kafka-connect). Documentation for this image can be found [here](https://docs.confluent.io/platform/current/connect/index.html#what-is-kafka-connect).
 
 ### Configuration
-Provided in the mongo-connector directory is a sample configuration shell script ([connect_start.sh](./kafka-connect/connect_start.sh)) that can be used to create kafka connectors to MongoDB. The connectors in kafka connect are defined in the format that follows:
-
-``` shell
-declare -A config_name=([name]="topic_name" [collection]="mongo_collection_name"
-[convert_timestamp]=true [timefield]="timestamp" [use_key]=true [key]="key" [add_timestamp]=true)
-```
-
-The format above describes the basic configuration for configuring a sink connector, this should be placed at the beginning of the connect_start.sh file. In general we recommend to keep the MongoDB collection name the same as the topic name to avoid confusion. Additionally, if there is a top level timefield set `convert_timestamp` to true and then specify the time field name that appears in the message. This will allow MongoDB to transform that message into a date object to allow for TTL creation and reduce message size. To override MongoDB's default message `_id` field, set `use_key` to true and then set the `key` property to "key". The "add_timestamp" field defines whether the connector will add a auto generated timestamp to each document. This allows for creation of Time To Live (TTL) indexes on the collections to help limit collection size growth.
-
-After the sink connector is configured above, then make sure to call the createSink function with the config_name of the configuration like so:
 
-``` shell
-createSink config_name
-```
-
-This needs to be put after the createSink function definition. To use a different `connect_start.sh` script, pass in the relative path of the new script by overriding the `CONNECT_SCRIPT_RELATIVE_PATH` environmental variable.
+Kafka connectors are managed by the `kafka-connect-setup` service.
 
 Set the `COMPOSE_PROFILES` environmental variable as follows:
 
-- `kafka_connect` will only spin up the `kafka-connect` service in [docker-compose-connect](docker-compose-connect.yml)
+- `kafka_connect` will only spin up the `kafka-connect` and `kafka-init` services in [docker-compose-connect](docker-compose-connect.yml)
   - NOTE: This implies that you will be using a separate Kafka and MongoDB cluster
 - `kafka_connect_standalone` will run the following:
   1. `kafka-connect` service from [docker-compose-connect](docker-compose-connect.yml)
-  2. `kafka` service from [docker-compose-kafka](docker-compose-kafka.yml)
-  3. `mongo` and `mongo-setup` services from [docker-compose-mongo](docker-compose-mongo.yml)
+  2. `kafka-init` service from [docker-compose-connect](docker-compose-connect.yml)
+  3. `kafka` service from [docker-compose-kafka](docker-compose-kafka.yml)
+  4. `mongo` and `mongo-setup` services from [docker-compose-mongo](docker-compose-mongo.yml)
+
+### Configure Kafka Connector Creation
+
+The Kafka connectors created by the `kafka-connect-setup` service are configured in the [kafka-connectors-values.yaml](jikkou/kafka-connectors-values.yaml) file. The connectors in that file are organized by the application, and given parameters to define the Kafka -> MongoDB sync connector:
+
+| Connector Variable | Required | Condition | Description |
+|---|---|---|---|
+| `topicName` | Yes | Always | The name of the Kafka topic to sync from |
+| `collectionName` | Yes | Always | The name of the MongoDB collection to write to |
+| `generateTimestamp` | No | Optional | Enable or disable adding a timestamp to each message (true/false) |
+| `connectorName` | No | Optional | Override the name of the connector from the `collectionName` to this field instead |
+| `useTimestamp` | No | Optional | Converts the `timestampField` field at the top level of the value to a BSON date |
+| `timestampField` | No | Required if `useTimestamp` is `true` | The name of the timestamp field at the top level of the message |
+| `useKey` | No | Optional | Override the document `_id` field in MongoDB to use a specified `keyField` from the message |
+| `keyField` | No | Required if `useKey` is `true` | The name of the key field |
+
+The following environment variables can be used to configure Kafka Connectors:
+
+| Environment Variable | Description |
+|---|---|
+| `CONNECT_URL` | Kafka connect API URL |
+| `CONNECT_LOG_LEVEL` | Kafka connect log level (`OFF`, `ERROR`, `WARN`, `INFO`) |
+| `CONNECT_TASKS_MAX` | Number of concurrent tasks to configure on kafka connectors |
+| `CONNECT_CREATE_ODE` | Whether to create kafka connectors for the ODE |
+| `CONNECT_CREATE_GEOJSONCONVERTER` | Whether to create topics for the GeojsonConverter |
+| `CONNECT_CREATE_CONFLICTMONITOR` | Whether to create kafka connectors for the Conflict Monitor |
+| `CONNECT_CREATE_DEDUPLICATOR` | Whether to create topics for the Deduplicator |
+| `CONNECT_CONFIG_RELATIVE_PATH` | Relative path to the Kafka connector yaml configuration script, upper level directories are supported |
 
 ### Quick Run
 
````
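The structure of `jikkou/kafka-connectors-values.yaml` is likewise not shown in this commit. As a rough sketch only, a sink-connector entry combining the variables documented in the table above could look something like the following; only the variable names come from the table, the surrounding layout is assumed.

```yaml
# Hypothetical excerpt of jikkou/kafka-connectors-values.yaml -- layout is assumed
ode:
  connectors:
    - topicName: topic.OdeBsmJson        # Kafka topic to sync from (required)
      collectionName: OdeBsmJson         # MongoDB collection to write to (required)
      generateTimestamp: true            # add an auto-generated timestamp to each document
      useTimestamp: true                 # convert timestampField to a BSON date
      timestampField: recordGeneratedAt  # example field name, required because useTimestamp is true
      useKey: false                      # keep MongoDB's default _id
```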
```diff
@@ -170,4 +189,70 @@ Set the `COMPOSE_PROFILES` environmental variable as follows:
    3. Click `OdeBsmJson`, and now you should see your message!
 8. Feel free to test this with other topics or by producing to these topics using the [ODE](https://github.com/usdot-jpo-ode/jpo-ode)
 
+
+<a name="deduplicator"></a>
+
+## 5. jpo-deduplicator
+The JPO-Deduplicator is a Kafka Java spring-boot application designed to reduce the number of messages stored and processed in the ODE system. This is done by reading in messages from an input topic (such as topic.ProcessedMap) and outputting a subset of those messages on a related output topic (topic.DeduplicatedProcessedMap). Functionally, this is done by removing duplicate messages from the input topic and only passing on unique messages. In addition, each topic will pass on at least 1 message per hour even if the message is a duplicate. This behavior helps ensure messages are still flowing through the system. The following topics currently support deduplication.
+
+- topic.ProcessedMap -> topic.DeduplicatedProcessedMap
+- topic.ProcessedMapWKT -> topic.DeduplicatedProcessedMapWKT
+- topic.OdeMapJson -> topic.DeduplicatedOdeMapJson
+- topic.OdeTimJson -> topic.DeduplicatedOdeTimJson
+- topic.OdeRawEncodedTIMJson -> topic.DeduplicatedOdeRawEncodedTIMJson
+- topic.OdeBsmJson -> topic.DeduplicatedOdeBsmJson
+- topic.ProcessedSpat -> topic.DeduplicatedProcessedSpat
+
+### Deduplication Config
+
+When running the jpo-deduplicator as a submodule in jpo-utils, the deduplicator will automatically turn on deduplication for a topic when that topic is created. For example, if the KAFKA_TOPIC_CREATE_GEOJSONCONVERTER environment variable is set to true, the deduplicator will start performing deduplication for ProcessedMap, ProcessedMapWKT, and ProcessedSpat data.
+
+To manually configure deduplication for a topic, the following environment variables can also be used.
+
+| Environment Variable | Description |
+|---|---|
+| `ENABLE_PROCESSED_MAP_DEDUPLICATION` | `true` / `false` - Enable ProcessedMap message Deduplication |
+| `ENABLE_PROCESSED_MAP_WKT_DEDUPLICATION` | `true` / `false` - Enable ProcessedMap WKT message Deduplication |
+| `ENABLE_ODE_MAP_DEDUPLICATION` | `true` / `false` - Enable ODE MAP message Deduplication |
+| `ENABLE_ODE_TIM_DEDUPLICATION` | `true` / `false` - Enable ODE TIM message Deduplication |
+| `ENABLE_ODE_RAW_ENCODED_TIM_DEDUPLICATION` | `true` / `false` - Enable ODE Raw Encoded TIM Deduplication |
+| `ENABLE_PROCESSED_SPAT_DEDUPLICATION` | `true` / `false` - Enable ProcessedSpat Deduplication |
+| `ENABLE_ODE_BSM_DEDUPLICATION` | `true` / `false` - Enable ODE BSM Deduplication |
+
+### Generate a Github Token
+
+A GitHub token is required to pull artifacts from GitHub repositories. This is required to obtain the jpo-deduplicator jars and must be done before attempting to build this repository.
+
+1. Log into GitHub.
+2. Navigate to Settings -> Developer settings -> Personal access tokens.
+3. Click "New personal access token (classic)".
+   1. As of now, GitHub does not support `Fine-grained tokens` for obtaining packages.
+4. Provide a name and expiration for the token.
+5. Select the `read:packages` scope.
+6. Click "Generate token" and copy the token.
+7. Copy the token name and token value into your `.env` file.
+
+For local development the following steps are also required:
+8. Create a copy of [settings.xml](jpo-deduplicator/jpo-deduplicator/settings.xml) and save it to `~/.m2/settings.xml`
+9. Update the variables in your `~/.m2/settings.xml` with the token value and target jpo-ode organization.
+
+### Quick Run
+1. Create a copy of `sample.env` and rename it to `.env`.
+2. Update the variable `MAVEN_GITHUB_TOKEN` to a github token used for downloading jar file dependencies. For full instructions on how to generate a token please see here:
+3. Set the password for `MONGO_ADMIN_DB_PASS` and `MONGO_READ_WRITE_PASS` environmental variables to a secure password.
+4. Set the `COMPOSE_PROFILES` variable to: `kafka,kafka_ui,kafka_setup, jpo-deduplicator`
+5. Navigate back to the root directory and run the following command: `docker compose up -d`
+6. Produce a sample message to one of the sink topics by using `kafka_ui` by:
+   1. Go to `localhost:8001`
+   2. Click local -> Topics
+   3. Select `topic.OdeMapJson`
+   4. Select `Produce Message`
+   5. Copy in sample JSON for a Map Message
+   6. Click `Produce Message` multiple times
+7. View the synced message in `kafka_ui` by:
+   1. Go to `localhost:8001`
+   2. Click local -> Topics
+   3. Select `topic.DeduplicatedOdeMapJson`
+   4. You should now see only one copy of the map message sent.
+
 [Back to top](#toc)
```
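The deduplication switches above are plain environment variables consumed by the `deduplicator` service defined in docker-compose-deduplicator.yml further down. One way to flip them for a local experiment is a standard Compose override file, assuming your `docker compose` invocation picks it up; the snippet below is only a sketch, not part of this commit.

```yaml
# docker-compose.override.yml -- illustrative only
services:
  deduplicator:
    environment:
      ENABLE_ODE_BSM_DEDUPLICATION: "false"   # stop deduplicating OdeBsmJson
      ENABLE_ODE_MAP_DEDUPLICATION: "true"    # keep deduplicating OdeMapJson
```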
docker-compose-connect.yml (+46 −10)

```diff
@@ -16,32 +16,68 @@ services:
           memory: 4G
     ports:
       - "8083:8083"
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8083/connectors"]
+      interval: 30s
+      timeout: 10s
+      retries: 4
     depends_on:
       mongo:
         condition: service_healthy
+      kafka:
+        condition: service_healthy
     environment:
-      MONGO_URI: ${MONGO_URI}
-      MONGO_DB_NAME: ${MONGO_DB_NAME}
       CONNECT_BOOTSTRAP_SERVERS: ${KAFKA_BOOTSTRAP_SERVERS}
       CONNECT_REST_ADVERTISED_HOST_NAME: connect
       CONNECT_REST_PORT: 8083
       CONNECT_GROUP_ID: kafka-connect-group
-      CONNECT_CONFIG_STORAGE_TOPIC: topic.kafka-connect-configs
-      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
-      CONNECT_CONFIG_STORAGE_CLEANUP_POLICY: compact
+      # Topics are created with jikkou in the kafka-setup service
+      CONNECT_CONFIG_STORAGE_TOPIC: topic.KafkaConnectConfigs
+      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: -1
       CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
-      CONNECT_OFFSET_STORAGE_TOPIC: topic.kafka-connect-offsets
-      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
+      CONNECT_OFFSET_STORAGE_TOPIC: topic.KafkaConnectOffsets
+      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: -1
       CONNECT_OFFSET_STORAGE_CLEANUP_POLICY: compact
-      CONNECT_STATUS_STORAGE_TOPIC: topic.kafka-connect-status
+      CONNECT_STATUS_STORAGE_TOPIC: topic.KafkaConnectStatus
       CONNECT_STATUS_STORAGE_CLEANUP_POLICY: compact
-      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
+      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: -1
       CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
       CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
       CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
       CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
       CONNECT_LOG4J_ROOT_LOGLEVEL: ${CONNECT_LOG_LEVEL}
       CONNECT_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.rest=${CONNECT_LOG_LEVEL},org.reflections=${CONNECT_LOG_LEVEL},com.mongodb.kafka=${CONNECT_LOG_LEVEL}"
       CONNECT_PLUGIN_PATH: /usr/share/confluent-hub-components
+
+  kafka-connect-setup:
+    profiles:
+      - all
+      - kafka_connect
+      - kafka_connect_standalone
+    image: jpo-jikkou
+    build:
+      context: jikkou
+      dockerfile: Dockerfile.jikkou
+    entrypoint: ./kafka_connector_init.sh
+    restart: on-failure
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 1G
+    depends_on:
+      kafka-connect:
+        condition: service_healthy
+    environment:
+      CONNECT_URL: ${CONNECT_URL}
+      CONNECT_TASKS_MAX: ${CONNECT_TASKS_MAX}
+      CONNECT_CREATE_ODE: ${CONNECT_CREATE_ODE}
+      CONNECT_CREATE_GEOJSONCONVERTER: ${CONNECT_CREATE_GEOJSONCONVERTER}
+      CONNECT_CREATE_CONFLICTMONITOR: ${CONNECT_CREATE_CONFLICTMONITOR}
+      CONNECT_CREATE_DEDUPLICATOR: ${CONNECT_CREATE_DEDUPLICATOR}
+      MONGO_CONNECTOR_USERNAME: ${MONGO_ADMIN_DB_USER}
+      MONGO_CONNECTOR_PASSWORD: ${MONGO_ADMIN_DB_PASS:?}
+      MONGO_DB_IP: ${MONGO_IP}
+      MONGO_DB_NAME: ${MONGO_DB_NAME}
     volumes:
-      - ${CONNECT_SCRIPT_RELATIVE_PATH}:/scripts/connect_start.sh
+      - ${CONNECT_CONFIG_RELATIVE_PATH-./jikkou/kafka-connectors-values.yaml}:/app/kafka-connectors-values.yaml
```

docker-compose-deduplicator.yml (new file, +44)

```yaml
services:
  deduplicator:
    profiles:
      - all
      - deduplicator
    build:
      context: jpo-deduplicator
      dockerfile: Dockerfile
      args:
        MAVEN_GITHUB_TOKEN: ${MAVEN_GITHUB_TOKEN:?error}
        MAVEN_GITHUB_ORG: ${MAVEN_GITHUB_ORG:?error}
    image: jpo-deduplicator:latest
    restart: ${RESTART_POLICY}
    environment:
      DOCKER_HOST_IP: ${DOCKER_HOST_IP}
      KAFKA_BOOTSTRAP_SERVERS: ${KAFKA_BOOTSTRAP_SERVERS:?error}
      spring.kafka.bootstrap-servers: ${KAFKA_BOOTSTRAP_SERVERS:?error}
      enableProcessedMapDeduplication: ${ENABLE_PROCESSED_MAP_DEDUPLICATION}
      enableProcessedMapWktDeduplication: ${ENABLE_PROCESSED_MAP_WKT_DEDUPLICATION}
      enableOdeMapDeduplication: ${ENABLE_ODE_MAP_DEDUPLICATION}
      enableOdeTimDeduplication: ${ENABLE_ODE_TIM_DEDUPLICATION}
      enableOdeRawEncodedTimDeduplication: ${ENABLE_ODE_RAW_ENCODED_TIM_DEDUPLICATION}
      enableProcessedSpatDeduplication: ${ENABLE_PROCESSED_SPAT_DEDUPLICATION}
      enableOdeBsmDeduplication: ${ENABLE_ODE_BSM_DEDUPLICATION}

    healthcheck:
      test: ["CMD", "java", "-version"]
      interval: 10s
      timeout: 10s
      retries: 20
    logging:
      options:
        max-size: "10m"
        max-file: "5"
    deploy:
      resources:
        limits:
          memory: 3G
    depends_on:
      kafka:
        condition: service_healthy
        required: false
```
