Segment text using spaCy

This project segments text into sentences and verbal phrases. For now, only German is supported. Our primary aim is to provide a simple way of splitting text into verbal phrases as proposed in Vauth et al. (2021); sentence splitting is provided in addition.

If you use this in your academic work, please cite our paper:

@inproceedings{vauthAutomatedEventAnnotation2021,
  title = {Automated {{Event Annotation}} in {{Literary Texts}}},
  booktitle = {{{CHR}} 2021: {{Computational Humanities Research Conference}}},
  author = {Vauth, Michael and Hatzel, Hans Ole and Gius, Evelyn and Biemann, Chris},
  date = {2021-11-17/2021-11-19},
  series = {{{CEUR Workshop Proceedings}}},
  volume = {2989},
  pages = {333--345},
  location = {Amsterdam, The Netherlands},
  url = {http://ceur-ws.org/Vol-2989/short_paper18.pdf},
  eventtitle = {{{CHR}} 2021: {{Computational Humanities Research Conference}}}
}

Building the Docker Image

In the project's top-level directory, run:

docker build -t verby .

This builds a Docker image that can be started with:

docker run -p 8000:80 verby

The -p option maps port 80 inside the container to port 8000 on your host, so the API is reachable on port 8000.

HTTP API

After starting the server, either via Docker or in a development setup, you should be able to post your segmentation requests.

Using the CLI tool httpie:

http POST 127.0.0.1:8000/segment text="Ich gehe auf einem Wagen, oder wie manche sagen einem Auto, spazieren. Du gehst nachhause."

Or from Python code:

import requests
response = requests.post("http://127.0.0.1:8000/segment", json={"text": "Ich gehe auf einem Wagen, oder wie manche sagen einem Auto, spazieren. Du gehst nachhause."})
print(response.json())
# Prints: {'verbal_phrases': [[[0, 30], [60, 69]], [[31, 47]], [[71, 90]]], 'sentences': [[0, 70], [71, 90]]}

The response object contains the character offsets of sentences and verbal phrases. Each verbal phrase is a list of [start, end] spans because verbal phrases may be discontinuous, as in the example above, where the inserted clause splits the first phrase into two parts.
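To turn the offsets back into text, you can slice the original string. The following is a minimal sketch that reuses the sample text and response from above, so it runs without a server. It assumes the offsets are end-exclusive, like Python slices, which is consistent with the sample response; the helper name spans_to_text is hypothetical and not part of verby.

```python
# Sample text and response copied from the example above.
text = (
    "Ich gehe auf einem Wagen, oder wie manche sagen einem Auto, "
    "spazieren. Du gehst nachhause."
)
response = {
    "verbal_phrases": [[[0, 30], [60, 69]], [[31, 47]], [[71, 90]]],
    "sentences": [[0, 70], [71, 90]],
}


def spans_to_text(text, spans):
    """Return the substring for each [start, end) offset pair (hypothetical helper)."""
    return [text[start:end] for start, end in spans]


# Assumption: offsets are end-exclusive, like Python slices.
sentences = spans_to_text(text, response["sentences"])
phrases = [spans_to_text(text, phrase) for phrase in response["verbal_phrases"]]

for parts in phrases:
    # Parts of a discontinuous phrase are joined with " ... " for display only.
    print(" ... ".join(parts))
```

Keeping each phrase as a list of parts, rather than concatenating them, preserves the information that a phrase is discontinuous.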

Development Server

To run a development server, execute:

fastapi dev web.py

Library Usage

If you would prefer using verby as a library rather than via HTTP, you can use this sample code as a starting point.

import verby

nlp = verby.pipeline.build_pipeline("de")

doc = nlp("Sie lassen alle die krank sind nachhause gehen.")
for phrase in doc._.verbal_phrases:
    for span in phrase:
        print(span.start_char, span.end_char)
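If you want the library results in the same nested-offset shape as the HTTP response, a small conversion may be enough. The sketch below relies only on the start_char and end_char attributes used in the loop above; the function name phrases_to_offsets is hypothetical, and the stand-in objects exist only so the example runs without verby or a spaCy model installed. That this shape matches the HTTP output exactly is an assumption.

```python
from types import SimpleNamespace


def phrases_to_offsets(verbal_phrases):
    """Convert phrases (lists of span-like objects) to [[start, end], ...] lists.

    Hypothetical helper: it only needs each span to expose
    start_char and end_char, as in the loop above.
    """
    return [
        [[span.start_char, span.end_char] for span in phrase]
        for phrase in verbal_phrases
    ]


# Stand-in spans for illustration; with verby you would pass doc._.verbal_phrases.
fake_phrases = [
    [SimpleNamespace(start_char=0, end_char=30), SimpleNamespace(start_char=60, end_char=69)],
    [SimpleNamespace(start_char=31, end_char=47)],
]

offsets = phrases_to_offsets(fake_phrases)
print(offsets)
```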