Update locations with ranges #261

Merged
merged 7 commits into from
Nov 20, 2024
114 changes: 57 additions & 57 deletions belief_pipeline/UG.tsv
@@ -26771,60 +26771,60 @@
12420819 Ripon Falls Ripon Falls 0.43291 33.19157 H FLLS UG E 33 7732228 8657707 0 1138 Africa/Kampala 2022-01-23
12541305 Kisoro Airport Kisoro Airport HUKI,KXO -1.28344 29.71846 S AIRP UG W 43 7874701 8658184 0 1989 Africa/Kampala 2023-06-21
12541311 Kajjansi Airfield Kajjansi Airfield HUKJ,KJJ 0.20034 32.54994 S AIRP UG C 96 234053 8658010 0 1200 Africa/Kampala 2023-06-21
Achumo Achumo 2.00736 34.56843
Agili Agili 2.54961 34.63878
Amboso River Amboso River
Bjordal Mine Bjordal Mine -1.15 29.8
Bukusu Complex Bukusu Complex 0.87 34.27056
Busia Tc Busia Tc 0.4676 34.0905 coordinates for Busia
Buteraniro Buteraniro 0.1202 30.62554
Chilima Chilima -1.25991 29.9823 coordinates for the Kabale district
Itega-Manengo Itega-Manengo -0.491 30.183 coordinates for the Bushenyi district
Kabatola Kabatola 0.87097 34.27095 coordinates for Bukusu
Kacharalum Kacharalum 2.956 32.8 coordinates for Pader district
Kafu River Kafu River
Kahengyere Kahengyere -0.90494 30.44871
Kakanena Kakanena -0.93702 30.30483 coordinates for Ntungamo district
Kalokwameri Kalokwameri 2.05253 34.57341 coordinates for Nabilatuk
Kamalera Kamalera 2.6389 34.64395 coordinates for Moroto district
Kamirambuzi Kamirambuzi -0.69077 29.87781 Hills, coordinates for Rukungiri district
Kampono Kampono -0.5 30.6 coordinates for Mbarara District
Kanyambogo Kanyambogo -1.8326 31.59554
Karamoja Karamoja 2.53453 34.66659 coordinates for Moroto town, which the expert stated is considered the regional capital
Kasukengo Kasukengo -0.5 30.6 coordinates for Masaka district
Kazumu Kazumu -0.93702 30.30483 coordinates for Ntungamo district
Kibalya Kibalya 0.17255 30.01446
Kirwa Kirwa -1.1704 29.67437
Kisai Kisai -0.7093 31.41309 coordinates for Rakai district
Kitahurira Kitahurira -1.11201 29.81967
Kitomi Forest Kitomi Forest -1.246389 30.25 coordinates for Kasyoha-Kitomi Central Forest Reserve
Kukor Kukor 2.956 32.8 coordinates for the Pader district
Kyanyamuzinda Kyanyamuzinda -1.20662 29.68113 coordinates for Kisoro district
Lolung-Moruamakale Lolung-Moruamakale 2.6389 34.64395 coordinates for Moroto district
Loyoro Napore Loyoro Napore 3.56534 33.69348 coordinates for Karenga
Lwamwire Lwamwire -0.93702 30.30483 coordinates for Ntungamo district
Matidi Matidi 3.26852 33.05337 coordinates for Kitgum Matidi
Mbale Estate Mbale Estate 1.0712 34.175 coordinates for Mbale the town
Mpororo Mpororo -0.72963 30.82065
Mugyera Mugyera -0.4328 30.45239
Murindi Murindi -1.22296 29.77341
Mutaka Mutaka -1.10319 30.09801
Muti Muti -0.3287 30.32488 coordinates for Buhweju district
Naam Naam 2.956 32.8 coordinates for the Pader district
Nakobekobe Nakobekobe 2.00736 34.56843
Namekara Uganda Vermiculite Mine Namekara Uganda Vermiculite Mine Namekara Mine,Namekhara 0.83972 34.25472
Nampeyo Nampeyo 0 32
Nangalwe Nangalwe
Napak Special Area Napak Special Area 1.93333 34.45 coordinates for Nabilatuk
Natopojo Natopojo 2.54961 34.63878
Nyabakweri Nyabakweri -0.93702 30.30483 coordinates for Ntungamo district
Nyaituma Nyaituma 1.47639 31.44056 coordinates for Bulindi
Nyinamaherere Nyinamaherere -0.93702 30.30483 coordinates for Ntungamo district
Okora Okora 2.956 32.8 coordinates for the Pader district
Omwodulum Omwodulum 2.274 32.953 coordinates for Lira district
Parobong Parobong 2.956 32.8 coordinates for the Pader district
Rwakirenzi Rwakirenzi -0.93702 30.30483 coordinates for Ntungamo district
Rwamanyinya Rwamanyinya
Rwenkanga Rwenkanga 0.83733 31.31765
Sikusi Sikusi 0.95711 34.33992
Surumbusa Surumbusa 0.89711 34.25711
-6 Achumo Achumo 2.00736 34.56843
-7 Agili Agili 2.54961 34.63878
-8 Amboso River Amboso River
-9 Bjordal Mine Bjordal Mine -1.15 29.8
-10 Bukusu Complex Bukusu Complex 0.87 34.27056
-11 Busia Tc Busia Tc 0.4676 34.0905 coordinates for Busia
-12 Buteraniro Buteraniro 0.1202 30.62554
-13 Chilima Chilima -1.25991 29.9823 coordinates for the Kabale district
-14 Itega-Manengo Itega-Manengo -0.491 30.183 coordinates for the Bushenyi district
-15 Kabatola Kabatola 0.87097 34.27095 coordinates for Bukusu
-16 Kacharalum Kacharalum 2.956 32.8 coordinates for Pader district
-17 Kafu River Kafu River
-18 Kahengyere Kahengyere -0.90494 30.44871
-19 Kakanena Kakanena -0.93702 30.30483 coordinates for Ntungamo district
-20 Kalokwameri Kalokwameri 2.05253 34.57341 coordinates for Nabilatuk
-21 Kamalera Kamalera 2.6389 34.64395 coordinates for Moroto district
-22 Kamirambuzi Kamirambuzi -0.69077 29.87781 Hills, coordinates for Rukungiri district
-23 Kampono Kampono -0.5 30.6 coordinates for Mbarara District
-24 Kanyambogo Kanyambogo -1.8326 31.59554
-1 Karamoja Karamoja 2.53453 34.66659 coordinates for Moroto town, which the expert stated is considered the regional capital
-25 Kasukengo Kasukengo -0.5 30.6 coordinates for Masaka district
-26 Kazumu Kazumu -0.93702 30.30483 coordinates for Ntungamo district
-5 Kibalya Kibalya 0.17255 30.01446
-2 Kirwa Kirwa -1.1704 29.67437
-27 Kisai Kisai -0.7093 31.41309 coordinates for Rakai district
-28 Kitahurira Kitahurira -1.11201 29.81967
-29 Kitomi Forest Kitomi Forest -1.246389 30.25 coordinates for Kasyoha-Kitomi Central Forest Reserve
-30 Kukor Kukor 2.956 32.8 coordinates for the Pader district
-31 Kyanyamuzinda Kyanyamuzinda -1.20662 29.68113 coordinates for Kisoro district
-32 Lolung-Moruamakale Lolung-Moruamakale 2.6389 34.64395 coordinates for Moroto district
-33 Loyoro Napore Loyoro Napore 3.56534 33.69348 coordinates for Karenga
-34 Lwamwire Lwamwire -0.93702 30.30483 coordinates for Ntungamo district
-35 Matidi Matidi 3.26852 33.05337 coordinates for Kitgum Matidi
-36 Mbale Estate Mbale Estate 1.0712 34.175 coordinates for Mbale the town
-4 Mpororo Mpororo -0.72963 30.82065
-40 Mugyera Mugyera -0.4328 30.45239
-41 Murindi Murindi -1.22296 29.77341
-42 Mutaka Mutaka -1.10319 30.09801
-43 Muti Muti -0.3287 30.32488 coordinates for Buhweju district
-44 Naam Naam 2.956 32.8 coordinates for the Pader district
-45 Nakobekobe Nakobekobe 2.00736 34.56843
-46 Namekara Uganda Vermiculite Mine Namekara Uganda Vermiculite Mine Namekara Mine,Namekhara 0.83972 34.25472
-47 Nampeyo Nampeyo 0 32
-48 Nangalwe Nangalwe
-49 Napak Special Area Napak Special Area 1.93333 34.45 coordinates for Nabilatuk
-3 Natopojo Natopojo 2.54961 34.63878
-50 Nyabakweri Nyabakweri -0.93702 30.30483 coordinates for Ntungamo district
-51 Nyaituma Nyaituma 1.47639 31.44056 coordinates for Bulindi
-52 Nyinamaherere Nyinamaherere -0.93702 30.30483 coordinates for Ntungamo district
-53 Okora Okora 2.956 32.8 coordinates for the Pader district
-54 Omwodulum Omwodulum 2.274 32.953 coordinates for Lira district
-55 Parobong Parobong 2.956 32.8 coordinates for the Pader district
-56 Rwakirenzi Rwakirenzi -0.93702 30.30483 coordinates for Ntungamo district
-57 Rwamanyinya Rwamanyinya
-58 Rwenkanga Rwenkanga 0.83733 31.31765
-59 Sikusi Sikusi 0.95711 34.33992
-60 Surumbusa Surumbusa 0.89711 34.25711
193 changes: 193 additions & 0 deletions belief_pipeline/tpi_location_patch.py
@@ -0,0 +1,193 @@
from pandas import DataFrame
from pipeline import InnerStage
# from tqdm import tqdm

# import itertools
import pandas
import re
import spacy

import sys

class TextWithIndices():
    # Pairs a string with one source offset per character, so that spans found in
    # edited text can still be mapped back to positions in the original sentence.
    def __init__(self, text, indices=None):
        super().__init__()
        self.text = text
        if indices is None:
            self.indices = [index for index, value in enumerate(text)]
        else:
            self.indices = indices

    def split(self, separator: str) -> list["TextWithIndices"]:
        parts = self.text.split(separator)
        textWithIndicesList = []
        offset = 0
        for part in parts:
            textWithIndices = TextWithIndices(part, self.indices[offset:offset + len(part)])
            textWithIndicesList.append(textWithIndices)
            offset += len(part)
            offset += len(separator)
        return textWithIndicesList

    def re_sub(self, pattern: str, repl: str) -> "TextWithIndices":
        done = False
        text = self.text
        indices = self.indices
        while not done:
            match = re.search(pattern, text)
            if match is None:
                done = True
            else:
                # The indices must be done before the text gets changed.
                # Replacement characters have no source position, so they are marked with -1.
                indices = indices[0:match.start()] + ([-1] * len(repl)) + indices[match.end():len(text)]
                text = text[0:match.start()] + repl + text[match.end():len(text)]
        return TextWithIndices(text, indices)


class Location():
    def __init__(self, textWithIndices: TextWithIndices, lat: float, lon: float, canonical: str, geonameid: str):
        # Make sure indices are reasonable.
        if -1 in textWithIndices.indices:
            print("There is a -1 among the indices!")
        for index, offset in enumerate(textWithIndices.indices):
            if offset != textWithIndices.indices[0] + index:
                print("The indices are not consecutive!")
        self.textWithIndices = textWithIndices
        self.lat = lat
        self.lon = lon
        self.canonical = canonical
        self.geonameid = geonameid

    def __str__(self):
        return f"{self.textWithIndices.text}\t{self.canonical}\t{self.geonameid}\t{self.textWithIndices.indices[0]}\t{self.textWithIndices.indices[-1] + 1}\t{self.lat}\t{self.lon}"

class LocationsPatch():
    def __init__(self, locations_file_name: str) -> None:
        super().__init__()
        # The Uganda locations file has an extra column of notes, but we are not using it, so it is not included
        # in order to keep the code compatible with the Ghana locations. The extra column causes confusion with
        # identification of the index column, so that is explicitly turned off here. You will see a warning
        # message on the console about lost data, probably from the extra column that we're not using here:
        # ParserWarning: Length of header or names does not match length of data. This leads to a loss of data
        # with index_col=False.
        locations_data_frame = pandas.read_csv(locations_file_name, sep="\t", encoding="utf-8", index_col=False, names=[
            "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude", "unk1", "unk2", "country_code",
            "cc2", "unk3", "unk4", "unk5", "unk6", "population", "elevation", "unk7", "timezone", "unk8", #, "notes"
        ], dtype={"geonameid": str}) # geonameid must be read as a string (or an integer). Read as a float, -1 would turn into -1.0.
        names = locations_data_frame["name"]
        ascii_names = locations_data_frame["asciiname"]
        alternate_names = locations_data_frame["alternatenames"]
        latitudes = locations_data_frame["latitude"]
        longitudes = locations_data_frame["longitude"]
        geonameids = locations_data_frame["geonameid"]
        self.names_to_canonical = self.get_names_to_values(names, ascii_names, alternate_names, names)
        self.names_to_latitudes = self.get_names_to_values(names, ascii_names, alternate_names, latitudes)
        self.names_to_longitudes = self.get_names_to_values(names, ascii_names, alternate_names, longitudes)
        self.names_to_geonameids = self.get_names_to_values(names, ascii_names, alternate_names, geonameids)
        # These are real locations in the Ghana locations file from geonames (https://download.geonames.org/export/dump/),
        # but they also happen to be frequent English words, so we exclude them from being identified as locations.
        self.common_words = ["We", "No", "To", "Some"]
        self.NER = spacy.load("en_core_web_sm")

    def get_names_to_values(self, names: list[str], ascii_names: list[str], alternate_names: list[str], values: list[str]) -> dict[str, str]:
        # In case there are duplicates, these are read from lowest to highest priority, so that the lower ones are overwritten.
        names_to_values = {}
        for name, value in zip(alternate_names, values):
            if type(name) is str:
                split = name.split(",")
                for item in split:
                    names_to_values[item] = value
        for name, value in zip(names, values):
            names_to_values[name] = value
        for name, value in zip(ascii_names, values):
            names_to_values[name] = value
        return names_to_values

    def add_location(self, locations: list[Location], entityWithIndices: TextWithIndices) -> None:
        # NER includes `the` in locations, but the location database file doesn't.
        # The above sentence is not true, so by removing "the" we could be missing locations.
        # TODO: what should be done if this is not the case? There are some instances
        # in the database. Should both be checked? Yes, they should indeed.
        # However, they were not in the initial pass and we don't want to find locations
        # that weren't there before, because that would necessitate reading every sentence
        # just in case there was a new one.
        # The canonical name should be the one in the database.
        # Let's not find any new ones with this.
        # entityWithIndices = entityWithIndices.re_sub("^[Tt]he\s", "")
        # Raw strings keep \s from being treated as an invalid string escape.
        entityWithIndices = entityWithIndices.re_sub(r"^the\s", "")
        if entityWithIndices.text in self.names_to_latitudes and entityWithIndices.text not in self.common_words and entityWithIndices.text[0].isupper():
            location = Location(entityWithIndices, self.names_to_latitudes[entityWithIndices.text], self.names_to_longitudes[entityWithIndices.text], self.names_to_canonical[entityWithIndices.text], self.names_to_geonameids[entityWithIndices.text])
            locations.append(location)

    def get_cell_locations(self, textWithIndices: TextWithIndices) -> list[Location]:
        # Replace author citations where possible because they sometimes get labeled as locations.
        textWithIndices = textWithIndices.re_sub(r"[A-Z][a-z]+,\s\d\d\d\d", "(Author, year)")
        entities = self.NER(textWithIndices.text).ents
        non_persons = [entity for entity in entities if entity.label_ != "PERSON"]
        locations = []
        for entity in non_persons:
            entitiesWithIndices = TextWithIndices(str(entity), textWithIndices.indices[entity.start_char:entity.end_char])
            entityWithIndicesList = entitiesWithIndices.split(", ")
            for entityWithIndices in entityWithIndicesList:
                self.add_location(locations, entityWithIndices)
        # We need to keep track of multiple, matching locations now.
        # However, they should have differing ranges, so the set operation is still OK.
        # It would work if only they could actually be compared to each other.
        return locations # sorted(list(set(locations)))

    def patch(self, sentence: str) -> list[Location]:
        return self.get_cell_locations(TextWithIndices(sentence))

def run():
    sys.stdin.reconfigure(encoding="utf-8")

    # Make sure the sentence is from the right dataset
    locationsPatch = LocationsPatch("./belief_pipeline/UG.tsv")

    # Read input until EOF
    while True:
        line = sys.stdin.readline()
        # readline() returns "" only at EOF (a blank input line is "\n"), so stop there
        # rather than looping forever on empty input.
        if line == "":
            break
        line = line.strip()

        locations = locationsPatch.patch(line)
        for location in locations:
            print(location)
        print("", flush=True)

        # with open("debug.txt", "a") as file:
        #     print(line, file=file)
        #     locations = locationsPatch.patch(line)
        #     for location in locations:
        #         print(location, file=file)
        #         print(location)
        #     print("", flush=True)

def test():
    locationsPatch = LocationsPatch("./belief_pipeline/UG.tsv")
    # sentence = "Where in Ghana is the town of Zugu, for the town of Zugu is unknown to me?" # This only gets the first.
    # sentence = "Lots of people live in Accra when I don't even know where Accra is located." # This also only gets the first.
    # sentence = "Do more people live in Accra or in Trume?" # This gets both.
    # sentence = "More people live in Accra than in Trume." # This gets both.
    # sentence = "More people live in Accra, Trume, Krasa, and Benu than in Damsa." # This misses Krasa.
    # sentence = "According to Smith, 2024, more people live in Accra, the Trume, Krasa, and Benu than in Damsa." # This misses Krasa.
    # sentence = "According to Smith, 2024, more people live in Accra, the Trume, Krasa, and Benu than in Damsa." # This misses Krasa.
    # sentence = "According to Smith, 2024, more people live in Yabonzue, the Tampielim, London, and Benu than in Damsa."
    # sentence = "The Bia River serves as a vital source of water for the residents of Bianouan in eastern Ivory Coast."
    # sentence = "Ivory Coast’s semi-public water distribution company, SODECI, recently shut down its water treatment plant in the area because of the level of pollution in the Bia River."
    # sentence = "According to Smith, 2024 and Jackson, 2022 more people live in Yabonzue, the Tampielim, London, and Benu than in Damsa."
    # sentence = "We are staying at The Aknac Hotel in Ghana."
    # sentence = "We are staying at the Aknac Hotel in Ghana."
    # sentence = "How many people live in the Volta part of Ghana?"
    # sentence = "We are staying at the Aknac Hotel in Ghana or the Aknac Hotel in Senegal."
    # sentence = "We are staying at either the Fiesta Royale Hotel in Ghana or the Fiesta Royale Hotel in Senegal."
    # sentence = "We are staying at either The Fiesta Royale Hotel in Ghana or The Fiesta Royale Hotel in Senegal."
    sentence = "Lots of people live in Karamoja."
    locations = locationsPatch.patch(sentence)
    for location in locations:
        print(sentence, location)


if __name__ == "__main__":
    run()
    # test()
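
Reading aid for tpi_location_patch.py: the sketch below illustrates, under stated assumptions, how the offset tracking in TextWithIndices lets a location found after citation masking be reported as a character range in the original sentence. It is not the PR's code. `TrackedText` and `substitute` are hypothetical names, the substitution uses a single `re.finditer` pass instead of the repeated `re.search` loop in the diff, and the entity is found with a plain string search because spaCy and the UG.tsv lookup are left out.

```python
import re


class TrackedText:
    """Hypothetical stand-in for TextWithIndices: text plus one source offset per character."""

    def __init__(self, text, indices=None):
        self.text = text
        self.indices = list(range(len(text))) if indices is None else indices

    def split(self, separator):
        # Each part keeps the offsets of its own characters; separator offsets are dropped.
        parts, offset = [], 0
        for part in self.text.split(separator):
            parts.append(TrackedText(part, self.indices[offset:offset + len(part)]))
            offset += len(part) + len(separator)
        return parts

    def substitute(self, pattern, repl):
        # Replacement characters have no source position, so they are marked with -1.
        text, indices, last = "", [], 0
        for match in re.finditer(pattern, self.text):
            text += self.text[last:match.start()] + repl
            indices += self.indices[last:match.start()] + [-1] * len(repl)
            last = match.end()
        return TrackedText(text + self.text[last:], indices + self.indices[last:])


sentence = TrackedText("According to Smith, 2024, people live in Accra, Trume")
# Mask the citation the same way get_cell_locations does before running NER.
masked = sentence.substitute(r"[A-Z][a-z]+,\s\d\d\d\d", "(Author, year)")
# Assumption: a string search stands in for the entity span that NER would return.
start = masked.text.index("Accra")
entity = TrackedText(masked.text[start:], masked.indices[start:])
# Report each name with its [start, end) range in the original sentence.
spans = [(part.text, part.indices[0], part.indices[-1] + 1) for part in entity.split(", ")]
print(spans)
```

The masked text is longer than the original, yet the reported ranges still index into the original sentence, which is what the start and end columns printed by `Location.__str__` rely on.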