This is intended to be a monorepo containing all MDC entity recognition software, configuration, documentation, and so on.
There are currently 4 primary applications:
- The recognition API. This HTTP REST API calls out to the gRPC and HTTP recognizer services. (See the overview diagram and the API docs.)
- The regexer recognizer. This simple gRPC recognizer service receives a stream of tokens and returns a stream of entities based on a regex match.
- The dictionary recognizer. This gRPC recognizer service receives a stream of tokens and looks them up in a backend database, returning a stream of entities based on the result. (A number of things can complicate this; see the diagram.)
- The dictionary importer. This app reads a file line by line, parses it, and upserts it to a backend database that the dictionary recognizer is compatible with.
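As a rough illustration of how the dictionary importer and the dictionary recognizer fit together, here is a minimal sketch. It is not the code in this repo: the in-memory map stands in for the backend database, the tab-separated `term<TAB>identifier` line format is an assumption, and the names `importLines`, `recognise`, `sampleDictionary`, and the `EXAMPLE:` identifiers are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// dictionary is an in-memory stand-in for the backend database (assumption).
type dictionary map[string]string

// entity pairs a matched token with the identifier found in the dictionary.
type entity struct {
	Token      string
	Identifier string
}

// importLines reads r line by line, parses each line as "term<TAB>identifier"
// (a hypothetical format), and upserts the pair into dict.
// It returns the number of lines upserted.
func importLines(r io.Reader, dict dictionary) (int, error) {
	count := 0
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // skip blank lines
		}
		fields := strings.SplitN(line, "\t", 2)
		if len(fields) != 2 {
			return count, fmt.Errorf("malformed line %q", line)
		}
		// Upsert: insert a new term or overwrite an existing one.
		dict[strings.ToLower(fields[0])] = fields[1]
		count++
	}
	return count, scanner.Err()
}

// recognise looks up each token (case-insensitively) and returns an entity
// for every token that has a dictionary entry.
func recognise(tokens []string, dict dictionary) []entity {
	var entities []entity
	for _, token := range tokens {
		if id, ok := dict[strings.ToLower(token)]; ok {
			entities = append(entities, entity{Token: token, Identifier: id})
		}
	}
	return entities
}

// sampleDictionary imports two made-up lines into a fresh dictionary.
func sampleDictionary() (dictionary, error) {
	dict := dictionary{}
	input := "acetylcarnitine\tEXAMPLE:1\ncarnitine\tEXAMPLE:2\n"
	_, err := importLines(strings.NewReader(input), dict)
	return dict, err
}

func main() {
	dict, err := sampleDictionary()
	if err != nil {
		fmt.Println("import failed:", err)
		return
	}
	tokens := []string{"Acetylcarnitine", "is", "a", "compound"}
	for _, e := range recognise(tokens, dict) {
		fmt.Printf("%s -> %s\n", e.Token, e.Identifier)
	}
}
```

The real services stream tokens and entities over gRPC rather than passing slices; the slice-based signature here only keeps the sketch short.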
To see documentation for endpoints and types, run make docs from the project root. This requires go-swagger, which can be installed from source:
dir=$(mktemp -d)
git clone https://github.com/go-swagger/go-swagger "$dir"
cd "$dir"
go install ./cmd/swagger
cd -
Troubleshooting:
- If types are missing from the documentation, the file from which the docs derive type information has probably been overwritten, since it is generated by make proto. Looking through the git history for go/gen/pb/types.pb should reveal the missing annotations. When running make proto, make sure that the annotations are not lost.
make config
make build
docker-compose up -d redis
bin/dictionary-importer
make run
IMPORTANT: The make run command runs processes in the background using &. There is a bash trap which executes a function to foreground those processes on interrupt. If this doesn't work, you might have some hanging processes on your machine. Use ps or pgrep to find and kill them.
You can also just press the play button next to a main function in IntelliJ 😃.
Grab some HTML from a website (Ctrl+U in Chrome). Make a POST request to localhost:8080/text, localhost:8080/tokens, or localhost:8080/entities with the HTML in the body of the request.
For example:
curl -L https://en.wikipedia.org/wiki/Acetylcarnitine > /tmp/acetylcarnitine.html
curl -XPOST -H "Content-Type: text/html" --data-binary "@/tmp/acetylcarnitine.html" 'http://localhost:8080/entities?recogniser=dictionary'
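The same request can be made from Go. This is a minimal sketch using only the standard library; the endpoint, query parameter, and Content-Type come from the curl example above, while the helper name `newEntitiesRequest` and the command-line handling are hypothetical. The response format is not described here, so the sketch simply copies the response body to stdout.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// newEntitiesRequest builds a POST to <baseURL>/entities?recogniser=<recogniser>
// with the HTML document as the request body.
func newEntitiesRequest(baseURL, recogniser string, html io.Reader) (*http.Request, error) {
	query := url.Values{"recogniser": {recogniser}}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/entities?"+query.Encode(), html)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "text/html")
	return req, nil
}

// run posts the HTML file at path to a locally running recognition API
// and prints the response status and body.
func run(path string) error {
	file, err := os.Open(path)
	if err != nil {
		return err
	}
	defer file.Close()

	req, err := newEntitiesRequest("http://localhost:8080", "dictionary", file)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	fmt.Println(resp.Status)
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path to an HTML file, e.g. /tmp/acetylcarnitine.html")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```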
The Content-Type and x-leadmine-chemical-entities headers are required for requests made via Firefox to work. To facilitate local development, CORS headers have been 'baked in' to the code in go/cmd/recognition-api/main.go (this is not ideal and we may wish to extract the CORS setup into config at some stage).
The following annotations have been added to the ner-api-ingress yaml on k8s:
nginx.ingress.kubernetes.io/cors-allow-headers: x-leadmine-chemical-entities, content-type
nginx.ingress.kubernetes.io/cors-allow-methods: PUT, GET, POST, OPTIONS
nginx.ingress.kubernetes.io/cors-allow-origin: '*'
nginx.ingress.kubernetes.io/enable-cors: "true"
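For local development, equivalent headers can be set in Go with a small middleware. The sketch below is an assumption about how such a wrapper might look using only net/http; it is not the code in go/cmd/recognition-api/main.go. The header values mirror the ingress annotations above, and the names `withCORS`, `demoHandler`, and `preflight` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withCORS wraps next and adds CORS response headers that mirror the
// ingress annotations (allowed origin, methods, and request headers).
func withCORS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("Access-Control-Allow-Origin", "*")
		h.Set("Access-Control-Allow-Methods", "PUT, GET, POST, OPTIONS")
		h.Set("Access-Control-Allow-Headers", "x-leadmine-chemical-entities, content-type")
		// Browsers send an OPTIONS preflight before requests that carry
		// custom headers; answer it here without calling the wrapped handler.
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoHandler returns a placeholder API handler wrapped in the CORS middleware.
func demoHandler() http.Handler {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return withCORS(api)
}

// preflight sends an OPTIONS request through h and returns the recorded response.
func preflight(h http.Handler) *httptest.ResponseRecorder {
	req := httptest.NewRequest(http.MethodOptions, "/entities", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec
}

func main() {
	rec := preflight(demoHandler())
	fmt.Println(rec.Code, rec.Header().Get("Access-Control-Allow-Headers"))
}
```

Keeping the allowed origin, methods, and headers as plain strings in one place would also make it straightforward to move them into config later.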
Running go test ./... executes the unit and integration tests.
API tests require the env var NER_API_TEST to be set (the value doesn't matter). Running go test ./... with this set will run all unit, integration, and API tests, but the API tests require the regexer, recognition-api, and redis to be running, as well as the dictionaries with imported data. See scripts/test.sh for how this works.
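One way to gate tests on an environment variable whose value doesn't matter is to check only whether it is set. The sketch below assumes that approach; the repo's actual check may differ (for example, it might require a non-empty value), and the helper name `apiTestsEnabled` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// apiTestEnvVar is the variable that switches the API tests on.
const apiTestEnvVar = "NER_API_TEST"

// apiTestsEnabled reports whether the variable is set at all; its value is
// ignored. The lookup function is injected so the check can be exercised
// without touching the real environment (pass os.LookupEnv in practice).
func apiTestsEnabled(lookup func(string) (string, bool)) bool {
	_, set := lookup(apiTestEnvVar)
	return set
}

func main() {
	if apiTestsEnabled(os.LookupEnv) {
		fmt.Println("API tests would run")
	} else {
		fmt.Println("API tests would be skipped")
	}
}
```

In a test function, the same check could guard a call to t.Skip so that API tests are skipped when the variable is absent.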
Some code in this repo is generated. The generated code is committed, so you don't need to regenerate it yourself. See the Makefile for more info.
The recognition API doesn't do anything on its own. You need to configure it with some recognisers (either http or grpc).
To configure the recognisers, you need to mount a config map in /app/config with the name recognition-api.yml.
(When containerised, all apps in this repo will look for a config file in the /app/config folder.)
Currently, there are only two types of recogniser: grpc and http (of which leadmine is a subtype; see the example config file).
To summarise:
- Deploy a grpc or http recogniser. You may need to create additional resources for these recognisers such as configmaps, secrets, or even other deployments such as redis.
- Ensure the recogniser is accessible over the network.
- Create a key in a configmap with the recognition-api.yml contents as the value.
- Deploy the recognition API with the config map key mounted in /app/config at the path recognition-api.yml.
This project is licensed under the terms of the Apache 2 license, which can be found in the repository as LICENSE.txt