### Configure Topic Creation
The Kafka topics created by the `kafka-setup` service are configured in the [kafka-topics-values.yaml](jikkou/kafka-topics-values.yaml) file. The topics in that file are organized by application and sorted into "Stream Topics" (those with `cleanup.policy` = `delete`) and "Table Topics" (those with `cleanup.policy` = `compact`).

The following environment variables can be used to configure Kafka topic creation.

|`KAFKA_TOPIC_MIN_INSYNC_REPLICAS`| Minimum number of in-sync replicas (for use with `acks=all`) |
|`KAFKA_TOPIC_RETENTION_MS`| Retention time for stream topics, in milliseconds |
|`KAFKA_TOPIC_DELETE_RETENTION_MS`| Tombstone retention time for compacted topics, in milliseconds |
|`KAFKA_TOPIC_CONFIG_RELATIVE_PATH`| Relative path to the Kafka topic YAML configuration file; upper-level directories are supported |
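
For reference, a minimal `.env` sketch for these variables might look like the following; the values are illustrative examples, not defaults from this repository:

```shell
# Illustrative .env entries for Kafka topic creation (example values only)
KAFKA_TOPIC_MIN_INSYNC_REPLICAS=1
KAFKA_TOPIC_RETENTION_MS=300000
KAFKA_TOPIC_DELETE_RETENTION_MS=3600000
KAFKA_TOPIC_CONFIG_RELATIVE_PATH=jikkou/kafka-topics-values.yaml
```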
### Quick Run
<a name="mongodb-kafka-connect"></a>
## 4. MongoDB Kafka Connect
The mongo-connector service reads messages from specified Kafka topics and deposits them into separate collections in the MongoDB database. The codebase that provides this functionality comes from Confluent, using their community-licensed [cp-kafka-connect image](https://hub.docker.com/r/confluentinc/cp-kafka-connect). Documentation for this image can be found [here](https://docs.confluent.io/platform/current/connect/index.html#what-is-kafka-connect).
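
Once the stack is running, one way to confirm that connectors were registered is the Kafka Connect REST API. The sketch below assumes the REST interface is published on its default port, 8083, on localhost, which may differ in your compose configuration; `<connector-name>` is a placeholder.

```shell
# List the connectors registered with Kafka Connect
curl -s http://localhost:8083/connectors

# Check the status of a single connector
curl -s http://localhost:8083/connectors/<connector-name>/status
```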
### Configuration
Kafka connectors are managed by the `kafka-connect-setup` service; see the Configure Kafka Connector Creation section below for how they are configured.

Set the `COMPOSE_PROFILES` environment variable as follows (see the example after this list):

- `kafka_connect` will only spin up the `kafka-connect` and `kafka-init` services in [docker-compose-connect](docker-compose-connect.yml)
  - NOTE: This implies that you will be using a separate Kafka and MongoDB cluster
- `kafka_connect_standalone` will run the following:
  1. `kafka-connect` service from [docker-compose-connect](docker-compose-connect.yml)
  2. `kafka-init` service from [docker-compose-connect](docker-compose-connect.yml)
  3. `kafka` service from [docker-compose-kafka](docker-compose-kafka.yml)
  4. `mongo` and `mongo-setup` services from [docker-compose-mongo](docker-compose-mongo.yml)
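
As an illustrative example, the standalone profile can be enabled and the services started like this (the profile value can also be placed in your `.env` file instead of on the command line):

```shell
# Run Kafka Connect together with local Kafka and MongoDB services
COMPOSE_PROFILES=kafka_connect_standalone docker compose up -d
```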
### Configure Kafka Connector Creation
The Kafka connectors created by the `kafka-connect-setup` service are configured in the [kafka-connectors-values.yaml](jikkou/kafka-connectors-values.yaml) file. The connectors in that file are organized by application and given parameters that define the Kafka -> MongoDB sink connectors. The following environment variables can be used to configure Kafka connector creation:

|`CONNECT_TASKS_MAX`| Number of concurrent tasks to configure on Kafka connectors |
|`CONNECT_CREATE_ODE`| Whether to create Kafka connectors for the ODE |
|`CONNECT_CREATE_GEOJSONCONVERTER`| Whether to create Kafka connectors for the GeojsonConverter |
|`CONNECT_CREATE_CONFLICTMONITOR`| Whether to create Kafka connectors for the Conflict Monitor |
|`CONNECT_CREATE_DEDUPLICATOR`| Whether to create Kafka connectors for the Deduplicator |
|`CONNECT_CONFIG_RELATIVE_PATH`| Relative path to the Kafka connector YAML configuration file; upper-level directories are supported |
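
As with topic creation, these variables can be set in `.env`; the values below only illustrate the format and are not defaults from this repository:

```shell
# Illustrative .env entries for Kafka connector creation (example values only)
CONNECT_TASKS_MAX=1
CONNECT_CREATE_ODE=true
CONNECT_CREATE_GEOJSONCONVERTER=true
CONNECT_CREATE_CONFLICTMONITOR=false
CONNECT_CREATE_DEDUPLICATOR=false
CONNECT_CONFIG_RELATIVE_PATH=jikkou/kafka-connectors-values.yaml
```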
### Quick Run
3. Click `OdeBsmJson`, and now you should see your message!
8. Feel free to test this with other topics or by producing to these topics using the [ODE](https://github.com/usdot-jpo-ode/jpo-ode)
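
To also verify that the sink connector wrote documents into MongoDB, a check along the following lines can be used. This is only a sketch: the `mongo` service name, the `admin` user, the `<database>` placeholder, and the assumption that the collection name matches the topic name all need to be replaced with the values from your own `.env` and compose files.

```shell
# Count documents in the collection backing the OdeBsmJson topic
# (service name, user, database, and collection name are assumed values)
docker compose exec mongo mongosh \
  -u admin -p "$MONGO_ADMIN_DB_PASS" --authenticationDatabase admin \
  --eval 'db.getSiblingDB("<database>").OdeBsmJson.countDocuments()'
```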
<a name="deduplicator"></a>
## 5. jpo-deduplicator
The JPO-Deduplicator is a Kafka Java Spring Boot application designed to reduce the number of messages stored and processed in the ODE system. It does this by reading messages from an input topic (such as topic.ProcessedMap) and outputting a subset of those messages on a related output topic (topic.DeduplicatedProcessedMap). Functionally, this is done by removing duplicate messages from the input topic and only passing on unique messages. In addition, each topic will pass on at least one message per hour even if that message is a duplicate, which helps ensure messages are still flowing through the system. The following topics currently support deduplication.

When running the jpo-deduplicator as a submodule in jpo-utils, the deduplicator will automatically turn on deduplication for a topic when that topic is created. For example, if the `KAFKA_TOPIC_CREATE_GEOJSONCONVERTER` environment variable is set to true, the deduplicator will start performing deduplication for ProcessedMap, ProcessedMapWKT, and ProcessedSpat data.
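
As an illustrative `.env` sketch (example values, not defaults), creating the GeojsonConverter topics while the deduplicator profile is active is sufficient to turn on deduplication for those topics:

```shell
# Creating the GeojsonConverter topics with the deduplicator profile enabled
# turns on deduplication for ProcessedMap, ProcessedMapWKT, and ProcessedSpat
KAFKA_TOPIC_CREATE_GEOJSONCONVERTER=true
COMPOSE_PROFILES=kafka,kafka_ui,kafka_setup,jpo-deduplicator
```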
To manually configure deduplication for a topic, the following environment variables can also be used.

A GitHub token is required to pull artifacts from GitHub repositories. This is required to obtain the jpo-deduplicator jars and must be done before attempting to build this repository.
1. Log into GitHub.
2. Navigate to Settings -> Developer settings -> Personal access tokens.
3. Click "New personal access token (classic)".
   1. As of now, GitHub does not support `Fine-grained tokens` for obtaining packages.
4. Provide a name and expiration for the token.
5. Select the `read:packages` scope.
6. Click "Generate token" and copy the token.
7. Copy the token name and token value into your `.env` file.

For local development the following steps are also required:

8. Create a copy of [settings.xml](jpo-deduplicator/jpo-deduplicator/settings.xml) and save it to `~/.m2/settings.xml` (see the sketch after these steps).
9. Update the variables in your `~/.m2/settings.xml` with the token value and target jpo-ode organization.
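
A minimal sketch of step 8 from the repository root (the destination assumes the default Maven settings location):

```shell
# Copy the provided settings.xml into the default Maven settings location,
# then edit it to add your GitHub token value and target organization
mkdir -p ~/.m2
cp jpo-deduplicator/jpo-deduplicator/settings.xml ~/.m2/settings.xml
```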
### Quick Run
1. Create a copy of `sample.env` and rename it to `.env`.
2. Update the variable `MAVEN_GITHUB_TOKEN` to a GitHub token used for downloading jar file dependencies. For full instructions on how to generate a token, see the GitHub token steps above.
3. Set the `MONGO_ADMIN_DB_PASS` and `MONGO_READ_WRITE_PASS` environment variables to secure passwords.
4. Set the `COMPOSE_PROFILES` variable to: `kafka,kafka_ui,kafka_setup,jpo-deduplicator`
5. Navigate back to the root directory and run the following command: `docker compose up -d` (see the command sketch after these steps)
6. Produce a sample message to one of the sink topics using `kafka_ui`:
   1. Go to `localhost:8001`
   2. Click local -> Topics
   3. Select `topic.OdeMapJson`
   4. Select `Produce Message`
   5. Copy in sample JSON for a Map Message
   6. Click `Produce Message` multiple times
7. View the deduplicated message in `kafka_ui`:
   1. Go to `localhost:8001`
   2. Click local -> Topics
   3. Select `topic.DeduplicatedOdeMapJson`
   4. You should now see only one copy of the map message sent.
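
For steps 1 through 5, a minimal command-line sketch might look like the following; the token and passwords are placeholders to be replaced with your own values:

```shell
# Step 1: create a .env file from the provided sample
cp sample.env .env

# Steps 2-4: set these values inside .env (placeholders shown)
# MAVEN_GITHUB_TOKEN=<your GitHub token>
# MONGO_ADMIN_DB_PASS=<secure password>
# MONGO_READ_WRITE_PASS=<secure password>
# COMPOSE_PROFILES=kafka,kafka_ui,kafka_setup,jpo-deduplicator

# Step 5: start the services
docker compose up -d
```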