This project houses the Latin America & Caribbean (LAC) Environment & Energy S C (ESSC) knowledge base. The knowledge base uses NLP and a graph database to facilitate discovery of Mission projects and related documents.
Project Epics describe the discrete steps required for buildout; Stories within each Epic describe the required components for that Epic.
TODO
Try subbing 'bolt+s' for 'neo4j+s' in the URI with the Aura database.
- Generate keywords to use as a tagging system from a sample of documents
- Automate tagging of documents and upload to graph database for initial build out
- Internal automatic tagging of new documents and keyword search of the graph database
- Host all functionality on a web platform, allow external upload, and user search
- Facilitate discovery of related content through manual graph exploration (aspirational)
- Dashboard to display metrics on environment and energy activities (aspirational)
- Deploy database
- See the Project Description here (link).
- Scoping Sheet (link)
- Energy Project Data Sheet (link): current database of project metadata and links to resources. The Writeups & Reports tab is the main source for the reports and writeups; copy/paste to data/reports.csv if updates are needed.
- Use Case Exercise (link): original use case exercise from Sue
- Sue Hoye: Approver
- Claire Price: helping improve metadata for projects, find missing information, draft project descriptions.
- Izzie: helper and search phrase lead
The tag system will take as input a representative subset of PDF resources from the Energy Project Data Sheet and recommend, using TextRank, the entities that have the highest 'importance' within the corpus. These tags will then be cleaned with human review and hyperparameter tuning, and augmented to support expected use cases.
The first run of the system was attempted using a simple GUI interface with a Google Sheets backend. After the initial proof of concept, Izzie downloaded all linked reports, writeups, and factsheets, and an automated process was run against the entire corpus.
The PDF reader used for this project is tika, which requires Java to be installed (setting up another user would require Python, Java, and SQLStudio).
- Read report
- Convert PDF to text
- Convert text to tags
- Summarize text
- Save summary, project number, filename, link and notes to graph database
- Upload list of tags and associated file name to graph database, relate tag to report title
- Upload project numbers, project names, and file names to database (to link files and projects)
- Use graph algorithms to rank tags (consider centrality, importance, etc.) or rank based on number of relationships
- Export tags to Excel for ranking, review, and augmentation with topical tags
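A minimal sketch of that per-report loop, assuming the repo's text_rank, text_summary, and graph_database scripts expose analyze/summarize/upload-style helpers (those function names are assumptions; tika.parser is the real package API):
import os
from tika import parser  # real tika API; requires a local Java install

import text_rank        # scripts/text_rank.py (assumed interface)
import text_summary     # scripts/text_summary.py (assumed interface)
import graph_database   # scripts/graph_database.py (assumed interface)

REPORTS_DIR = 'reports/'

for filename in os.listdir(REPORTS_DIR):
    if not filename.lower().endswith(('.pdf', '.txt')):
        continue
    # Convert PDF to text
    parsed = parser.from_file(os.path.join(REPORTS_DIR, filename))
    text = parsed.get('content') or ''
    # Convert text to tags and summarize
    tags = text_rank.analyze(text)          # assumed: returns a list of keywords
    summary = text_summary.summarize(text)  # assumed: returns the top sentences
    # Save summary and metadata, then upload tags related to the report
    graph_database.upload_report(filename, summary, tags)  # assumed wrapper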
To exclude specific entity types, add them to not_entity_types in analyze() of text_rank.py. See the spacy docs for the list of entity types.
PyTextRank may be an alternative to what we have, but appears to be less customizable.
Instead of choosing the top 7 sentences, the top X% of sentences could be chosen (implementation). In Annual Reports and similar reports that cover many topics, a sentence-based summary will not be perfect, but it can help the reader understand the type of content included. In some cases the summaries will even be misleading, connecting unrelated topics by placing their sentences next to each other. Summarization works best when the report is focused on a single case study.
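A possible implementation of the percentage-based cutoff, assuming the summarizer already produces sentences ranked by score (the ranking itself is unchanged):
import math

def top_percent(ranked_sentences, percent=0.10, minimum=3):
    """Return the top `percent` of ranked sentences (at least `minimum`)
    instead of a fixed top-7 cutoff."""
    n = max(minimum, math.ceil(len(ranked_sentences) * percent))
    return ranked_sentences[:n]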
// TOP TAGS
MATCH (n:Tag)-[r]-()
RETURN n.name, count(r) AS result
ORDER BY result DESC
- Copy the Writeups & Reports tab of the Google Sheet into a csv and save to data/
- In Neo4j Desktop (within this Project), click Add File and upload the csv
- Click the 3-dot menu for the file and select Copy URL
- In Neo4j Browser, use the following Cypher to test the csv
LOAD CSV WITH HEADERS FROM "<paste url here>" as row RETURN row
- Use the following Cypher to load regions and projects
LOAD CSV WITH HEADERS FROM "http://localhost:11001/project-5810fc37-0742-4c0b-b0d7-238646cc50ea/reports.csv" as row MERGE (p:Project {name: row.Project_Number}) MERGE (r:Region {name: row.Region}) RETURN p, r
- Create relationships between Projects and Regions
LOAD CSV WITH HEADERS FROM "http://localhost:11001/project-5810fc37-0742-4c0b-b0d7-238646cc50ea/reports.csv" as row MATCH (p:Project {name: row.Project_Number}) MATCH (r:Region {name: row.Region}) MERGE (p)-[:LOCATED_IN]->(r)
- Create relationships between Projects and Reports
LOAD CSV WITH HEADERS FROM "http://localhost:11001/project-5810fc37-0742-4c0b-b0d7-238646cc50ea/reports.csv" as row MATCH (p:Project {name: row.Project_Number}) MATCH (f:Report {name: row.Filename}) MERGE (f)-[:ABOUT]->(p)
The following Cypher creates relationships between projects whose reports share tags, with a count property recording the number of shared tags (the symmetric pattern matches each pair in both directions, so a directed SHARES_TAGS relationship is created each way).
MATCH (p:Project)<-[:ABOUT]-(r:Report)<-[:TAGGED_IN]-(t:Tag)-[:TAGGED_IN]->(r2:Report)-[:ABOUT]->(p2:Project)
WHERE p['name'] <> p2['name']
WITH p, p2, COUNT(*) AS count
CREATE (p)-[r:SHARES_TAGS]->(p2)
SET r.count = count
To add the Graph Data Science gallery, open the Graph Apps Gallery from the Open split button and click Install under Graph Data Science Playground. If nothing happens, copy the link that opens when you click the square icon (right-most) and paste it into the install file-or-link bar that appears when you click the four-squares button in the sidebar. You also need to install the APOC library (see the Add Plugin card at the bottom of Neo4j Desktop for this project). You can then find the Graph Data Science Playground in the Open split button. Follow the connection guide to get connected the first time.
Using the graph data science playground, use PageRank with the following parameters:
- Label: Project
- Relationship Type: SHARES_TAGS
- Relationship Orientation: Natural
- Weight Property: count
Write results to 'pagerank' property
Using the graph data science playground, use Louvain with the following parameters:
- Label: Project
- Relationship Type: SHARES_TAGS
- Relationship Orientation: Undirected
- Weight Property: count
- Seed property: None
- Intermediate Communities: False
- Community Node Limit: 50
Write results to 'louvain' property
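If you prefer to script these steps instead of using the playground, the equivalent GDS procedure calls can be run through py2neo. This is a sketch written against the GDS 1.x API (2.x renames gds.graph.create to gds.graph.project), so check your installed version:
from py2neo import Graph

graph = Graph('bolt://localhost:7687', auth=('neo4j', '<password>'))

# Project the Project/SHARES_TAGS graph into the GDS catalog (GDS 1.x syntax)
graph.run("""
    CALL gds.graph.create('projects', 'Project',
        {SHARES_TAGS: {orientation: 'NATURAL', properties: 'count'}})
""")

# Weighted PageRank, written back to the 'pagerank' property
graph.run("""
    CALL gds.pageRank.write('projects',
        {relationshipWeightProperty: 'count', writeProperty: 'pagerank'})
""")

# Louvain uses an undirected projection of the same relationships
graph.run("""
    CALL gds.graph.create('projects_undirected', 'Project',
        {SHARES_TAGS: {orientation: 'UNDIRECTED', properties: 'count'}})
""")
graph.run("""
    CALL gds.louvain.write('projects_undirected',
        {relationshipWeightProperty: 'count', writeProperty: 'louvain'})
""")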
Once the tag system is finalized, the new tags will need to be uploaded to the NLP model, and all available documents tagged and saved to a graph database.
Use the EntityRuler (can read files from JSONL) to add tags to the entity recognition model.
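A minimal sketch of wiring the pattern file into the pipeline, shown in spacy 2.x syntax (the 3.x equivalent is noted in comments). Whether the ruler sits before or after the statistical NER is one of the tuning choices discussed under tag cleanup, and the pattern file path is assumed from the data/ layout described later:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')

# spacy 2.x: build the ruler from the JSONL pattern file and insert it
# relative to the statistical NER component
ruler = EntityRuler(nlp).from_disk('data/entity-patterns.jsonl')  # path assumed
nlp.add_pipe(ruler, before='ner')

# spacy 3.x equivalent:
# ruler = nlp.add_pipe('entity_ruler', before='ner')
# ruler.from_disk('data/entity-patterns.jsonl')

doc = nlp('The activity promotes clean energy and combats illegal logging.')
print([(ent.text, ent.label_) for ent in doc.ents])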
- Update NLP model with custom tags and consolidated entities (using spacy, this would be accomplished with the entity ruler; otherwise training the model would require >200 explicit instances).
- outputs/top_tags_initial_full.csv holds the initial and consolidated tags for the first ~2,500 tags. The code DELETE signifies tags that should be deleted.
- data/custom_tags.csv includes tags provided by Sue.
- Build graph database (local)
- Read in PDFs & txt files and add tag cloud to graph database (automate process from Epic #1 using the filenames column as the iterator; include entity types)
- Drop bad tags
- Read in summaries
  - Read to csv (how to handle commas?)
  - Edit for clarity
  - Add to database
- Add project database
- Write initial search terms for Neo4j Bloom (new searches will be added with more user testing in Epic #3)
- Train Izzie(?) on use of Neo4j Bloom for search of products, exploration, and how to build new Cypher queries (Izzie is taking a class on SQL)
Conservation, deforestation, conservation crime, private sector engagement, plastics, ocean plastics, livelihoods, indigenous peoples, energy auction, clean energy, illegal logging, illegal fishing, fisheries, Amazon Vision, gender, women, vulnerable population, wildlife trafficking, climate change, innovation, self-reliance, conservation enterprise, water, farming, NMR, natural resource management, improving governance, governance, urban, Great Power Competition
These tags were added to data/custom_tags.csv and read into a JSONL file with scripts/create_pattern_file.
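The pattern file is newline-delimited JSON in spacy's EntityRuler format. A rough sketch of how create_pattern_file might turn data/custom_tags.csv into it (the 'tag' column name and the TOPIC label are assumptions, not the script's actual choices):
import csv
import json

# Each csv row becomes one EntityRuler pattern line, e.g.:
# {"label": "TOPIC", "pattern": [{"LOWER": "clean"}, {"LOWER": "energy"}]}
with open('data/custom_tags.csv', newline='') as f_in, \
        open('data/entity-patterns.jsonl', 'w') as f_out:
    for row in csv.DictReader(f_in):
        tokens = row['tag'].strip().split()  # 'tag' column name is an assumption
        pattern = {
            'label': 'TOPIC',  # assumed label
            'pattern': [{'LOWER': token.lower()} for token in tokens],
        }
        f_out.write(json.dumps(pattern) + '\n')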
After reading in all reports, tags were exported and reviewed. Over 3,000 unique tags were generated.
- Punctuation is not working correctly in the pattern matching with USAID/Haiti (for example).
- The upside-down question mark is coming up a lot. Maybe just drop all punctuation?
- Spanish words like Cuales and Como are showing up (they must not be part of the stop words)
- The entity matcher may be slightly stochastic, in that different variations of common names are showing up after loading the EntityRuler (e.g., the Nature Conservancy). Is there a way to overwrite entities if they contain the pattern at all, maybe with some REGEX pattern? How to figure out which entities need this ahead of time?
- Tags that are only used once have a higher likelihood of being unnecessary; however, a majority of tags are only used once, and there are hundreds of potentially interesting entities included.
- Try reducing the window size and scaling the number of keywords to the size of the text
- Some tags include bullet points--those should be handled somehow
- Checking entities against wikidata's knowledge base may be a way to weed out unhelpful tags
- There is duplication between entities that start with 'the' and those that don't ('the Environmental Defense Fund' vs. 'Environmental Defense Fund')
The scripts/automated_tag_generate.py script was edited to sample 5 reports from reports/ to test different solutions.
- drop (, ), and / before tokenization
- drop all punctuation except periods before tokenizing: re.sub(r'[^\w\s.-]', ' ', <text>)
- reduce keywords to 15 (from 30)
- decrease window size to 4 (from 8)
- decrease coefficient to .65 (from .85)
- delete 'the' before the token with regex
- put custom tags after NER in the pipeline, but keep aggregates before
Of the above, only dropping select punctuation and decreasing the coefficient seem to really add value.
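For reference, a sketch of a pre-tokenization cleanup that combines the punctuation drop that proved useful with the optional leading-'the' strip (function name is illustrative, not the actual script):
import re

def clean_text(text):
    """Drop bullets and most punctuation (keep periods and hyphens) and strip a
    leading 'the' so 'the Environmental Defense Fund' and 'Environmental
    Defense Fund' collapse to one candidate tag."""
    text = re.sub(r'[^\w\s.-]', ' ', text)            # keep only select punctuation
    text = re.sub(r'\b[Tt]he\s+(?=[A-Z])', '', text)  # drop 'the' before proper nouns
    return re.sub(r'\s+', ' ', text).strip()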
Once the tagging system was refined, tags were exported and any final aggregation was done. The create_pattern_file.py script will append, from the exported tags, any synonyms provided in a column titled 'alt'. Use the keyword DELETE to flag any tags that are spurious. Filter for DELETE (copy into a new csv), add the new csv to the project, match any nodes associated with the DELETE keyword, and delete.
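A sketch of that delete step, assuming the filtered csv has been added to the project (see the LOAD CSV steps above) and has a name column holding the tag text; delete_bad_tags.py does the same thing from data/inputs/bad_tags.csv:
from py2neo import Graph

graph = Graph('bolt://localhost:7687', auth=('neo4j', '<password>'))

# Remove every Tag node named in the DELETE csv, along with its relationships
graph.run("""
    LOAD CSV WITH HEADERS FROM "<paste csv url here>" AS row
    MATCH (t:Tag {name: row.name})
    DETACH DELETE t
""")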
USAID uses a slew of acronyms and abbreviations. In formal text (e.g., evaluations) these acronyms are introduced, as in 'artisanal and small scale mining (ASM)'. We use the abbreviation detector from scispacy to detect these and add them as entities (using entity_ids) to the list of entities before NER. To avoid name collisions, acronyms are not preferred over full-text nouns.
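A minimal sketch of the scispacy abbreviation pass (spacy 2.x-style registration; in scispacy releases built for spacy 3.x it becomes nlp.add_pipe('abbreviation_detector')):
import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load('en_core_web_sm')
abbreviation_pipe = AbbreviationDetector(nlp)  # spacy 2.x style
nlp.add_pipe(abbreviation_pipe)

doc = nlp('The program supports artisanal and small scale mining (ASM) reform.')
for abrv in doc._.abbreviations:
    # e.g. ASM -> artisanal and small scale mining; the long form is what gets
    # added to the entity list so the acronym does not collide with other nouns
    print(abrv, '->', abrv._.long_form)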
Coreferences are a broader case of acronyms where words like 'He' are replaced by the name of the person they represent, and words like 'The' are dropped when they only sometimes modify the entity. This is an important step in improving the NER and NEL components of the pipeline. Most co-reference resolution models are implemented as neural nets or other advanced systems that recognize semantics. A few options:
- neuralcoref (link, requires spacy==2.3)
- graphbrain (link): this uses a hypergraph and knowledge agents to detect different types of coreferences
- spacy's 'similarity' method also allows for identifying similar mentions of an entity in a text that can be used to disambiguate. My intuition is that combining the similarity method with a graph to identify communities (i.e., circular similar mentions) would be an approach to resolving these conflicts. Similarity by itself can be misleading, as different cutoff scores will work better for different entities, and some amount of error is probably unavoidable (e.g., United States and United States Agency for International Development).
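For reference, the neuralcoref option looks like this; it pins spacy to 2.x, so it cannot share an environment with spacy 3.0 components:
import spacy
import neuralcoref  # only compatible with spacy 2.x

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

doc = nlp('USAID funded the project. It runs through 2024.')
print(doc._.has_coref)       # True if any coreference clusters were found
print(doc._.coref_clusters)  # clusters of mentions, e.g. [the project: [the project, It]]
print(doc._.coref_resolved)  # text with each mention replaced by its main mention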
spacy comes out of the box with a Named Entity Recognition pipeline trained on the OntoNotes 5 corpus; however, there are other options as well. Besides training your own (see Named Entity Linking), many other models are trained on wikipedia and so have entity linking capability as well (see Named Entity Linking).
Keyword extraction is a simpler alternative to named entity extraction that looks for keywords in text based on prevalence and importance. There are a few options, including the text_rank.py script in scripts/ and the packages yake (yet another keyword extractor) and nltk. While keyword extraction is not better than NER, it could be used to supplement the provided keyword tags.
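A quick yake sketch for comparison against the TextRank output (depending on the yake version the tuples come back as (keyword, score) or (score, keyword), with lower scores meaning more relevant):
import yake

text = 'Parsed report text goes here ...'

# Top 15 keywords of up to 3 tokens, mirroring the TextRank tuning above
kw_extractor = yake.KeywordExtractor(lan='en', n=3, top=15)
for keyword, score in kw_extractor.extract_keywords(text):
    print(f'{score:.4f}  {keyword}')  # lower score = more relevant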
NEL links entities to a Knowledge Base. This helps to ensure aliases of entities are linked to the same 'real-world' entity and enriches the information available for each entity (esp. when linked to a Knowledge Base like wikipedia).
Wikifiers (the process of linking entities to Wikipedia is known as Wikification):
- Google's Knowledge Graph Search API (link): allows queries of the Google Knowledge Graph, based on wikipedia. Uses SPARQL to query through a RESTful API
- wikipedia: a wikipedia search API and python package
- http://wikifier.org/: a web-based API (see token in secrets/wikifier_api.txt) that returns the entity, its properties, and the QID
- NeuroNER (link): ships with a few other NER models
- BLINK (link): Facebook's entity linker
You can also train your own NER/NEL on a wikipedia dump (jsonified data of the entire wikidata; I have a copy on the E:/ drive). A spacy project provides the necessary scripting, you just provide the data. See here for a script to train a KnowledgeBase from wikidata and wikipedia, and here for a script to train a NER from wikidata (which might out-perform the default NER model, which is trained on the OntoNotes 5 corpus, since we're linking back to a wikidata/wikipedia KB).
I used the spacy project ner_wikiner to train a new NER on the wikidata dataset. Note you must have spacy v3.0 or later.
python -m spacy project clone ner_wikiner # clone project
python -m spacy project assets C:\Users\Erik\Documents\_dev\ner_wikiner # fetch wikidata
cd ner_wikiner
python -m spacy project run all # run training workflow
To use:
import spacy
MODEL_PATH = 'training/model-best'
nlp = spacy.load(MODEL_PATH)
doc = nlp("This is a sentence")
for ent in doc.ents:
    print(ent.label_, ':', ent.text)
I have not yet tested this on the corpus to see if it does better, however it does not have a knowledge base to connect entities back to.
https://github.com/explosion/projects/tree/master/nel-wikipedia
Thus far, we've simply inferred relationships between projects based on shared tags. Relationship extraction, however, can be much more advanced. For example, a single text could be mined to identify relationships between the entities within it (e.g., LAC is an office of USAID). This type of relationship extraction would result in a knowledge graph for each text, which could then be merged with the projects knowledge graph. That type of graph would provide a richer understanding of the relationships between entities and between entities and projects, but it would also require a much more robust search engine to make the information meaningful.
Example relationship extraction graph (source).
Topic matching is an advanced form of search that attempts to match a natural-language search with similar entries in a corpus. It would be a great option to expand the functionality of the tagging system if the entire corpus could be uploaded. Holmes is a good project for this (link) and runs on top of spacy and neuralcoref. In many ways, holmes replicates this pipeline on a provided corpus.
SEE SPACY PROJECTS FOR MORE
BIG CAVEAT: spacy recently updated to version 3.0, breaking most of the packages described above. I expect that many of these packages will be updated soon, but for now you'll need to select a spacy version that is compatible with all of them. The biggest changes in spacy 3.0 are how pipelines are set up and the matching syntax.
After the initial build out, Izzie will be responsible for (1) responding to search requests and (2) adding new products as they come out. Ultimately, the web interface will be needed to make this available to everyone, but this provides a useful check-in point for ensuring value of the project.
- Set Izzie up to automatically update knowledge graph with new products from GUI (expanded in Epic #2).
- Develop search terms as needed to respond to requests for information
- Add Sue's News Digest to the graph each week
To allow users to explore and upload new documents to the system, an interface for search and automated tagging is needed. If this can't be done on Heroku (see Tech Problems), EI may need to "own" adding files until a solution can be found. Uploading new documents is less important than making search available (some searches may be confidential).
- Host graph on Heroku (GrapheneDB)
- Replicate Movies Database site subbing in ESSC graph database (simple proof of concept)
- Add Cypher queries to search to replicate functionality from Bloom
- Convert Projects Sheet to graph database and link search results to project
- Can tika be used on Heroku? If not, how will PDFs be parsed? It appears that Java can be installed on Heroku using a specific buildpack (link).
A primary use case of the knowledge graph is discovery by exploration of relationships. If a graph can be hosted on the web for exploration, that may be ideal. Otherwise, we can serve all connected documents and projects, use NLP to discover similar documents, or allow filtering by tags to find connections relevant to the user.
- Need to figure out tech that can host graphs and whether worth paying or not (see Linkurious or GraphGists)
- Display rendered html from spacy to highlight entities upon upload.
- Show related projects and reports based on entities identified upon upload.
- Host interactive graph on the web (Linkurious, GraphGists)
This use case example is for a related project for another client. It is aspirational but illustrates the breadth of need. ("Like wanting a pizza delivery without wanting a restaurant")
Note that some of the contents of this repo are copied directly, with minor edits, from the ei-knowledge-graph repo.
scripts/
- add_project_db.py: reads a copy of the LAC ESSC energy database (as a csv in data/), creates nodes for entities, and adds properties to Project nodes.
- auto_tag_generator.py: reads reports from reports/ as .txt or .pdf files, summarizes, tags, and uploads to the graph database.
- create_pattern_file.py: creates a JSONL file for spacy's entity ruler to add custom tags and consolidate common entities. Uses the TOP TAGS query export from neo4j; add a column alt and specify the desired label for any duplicate entities.
- delete_bad_tags.py: deletes unwanted tags from the graph database. Reads bad tags from data/inputs/bad_tags.csv.
- graph_database.py: handles all graph database operations.
- gui_tag_generator.py: a GUI interface for uploading one or more files and generating tags. Upload all files from a project to get the project's top tags.
- text_rank.py: implements a page rank algorithm to extract keywords from text. Defaults to 20 keywords per run.
- text_summary.py: extracts the top 7 sentences from the text as a summary.
reports/: contains all reports for the initial set up of the graph database. Must be .pdf or .txt.
data/
- inputs/bad_tags.csv: list of tags to delete
- custom_tags.csv: list of tags to add to entities with spacy's entity ruler. Read to entity-patterns.jsonl for import to spacy.
- entity-patterns.jsonl: newline-delimited json file to add patterns to spacy's entity ruler
- projects.csv: copy of LAC ESSC Energy Database Sheet 1
- reports.csv: copy of LAC ESSC Energy Database reports & writeups sheet
- outputs/: directory to save TOP TAGS and other exports from the neo4j database
cypher/cypher.txt: stores plain-text cypher queries for reference
assets/: icon for GUI
- Re-create the virtual environment (conda create --name essc-knowledge-base); if spacy fails, you may need to downgrade python (conda install python=3.8)
- Download the English and Spanish models: python -m spacy download en_core_web_sm and python -m spacy download es_core_news_sm
- Download
- Create a Neo4j database
- Open database (should be on port 7...), or update
- Compile the pattern file (run create_pattern_file.py)
- Update the location of reports in auto_tag_generator.py, line 64: if f in os.listdir('<reports>') (E:/data/essc-knowledge-base/data/reports_eng)
- Copy the project database from the Google Sheet into data/projects.csv
- Read in the project database (run add_project_db.py)
- Delete bad tags (run delete_bad_tags.py)
- Upload regions, project numbers, and filenames (see guidance above)
- Create connections between projects (see guidance above)
- Run the data science algorithms PageRank and Louvain on the graph (see guidance above)
- Upload the projects database (run add_project_db.py)
Note: different machines parse the columns with \n differently. If you get KeyErrors, it's likely that the column name should be changed from \n to \n\r or vice versa; print the column names to see.
Why use conda-forge? It's not entirely clear, but using conda-forge rather than the default channel may better ensure compatibility, and it has more packages than the default channel. You can add conda-forge to your list of channels as the priority channel so that you don't have to constantly type -c conda-forge.
conda config --add channels conda-forge
from py2neo import Graph
PORT = 'bolt://localhost:7687'
PASSWORD = 'incentives'
graph = Graph(PORT, auth=('neo4j', PASSWORD))
tx = graph.begin()
# code
tx.commit()
Note that some py2neo commands are transactions on their own (e.g., updating the properties of a node with graph.push()), so you shouldn't mix them with an explicit transaction. If you have up to ~20,000 operations to run, you can batch them in one transaction; this is both safe and efficient. Beyond that you may run into memory issues and will need to commit periodically as you go (committing each of significantly more than 20,000 operations individually would take a long time).
tx = graph.begin()
new_node = Node('<LABEL>', <id_property>=<>,
primary_key='<id_property>',
primary_label='<LABEL>')
tx.merge(new_node, '<LABEL>', '<id_property>')
tx.commit()
You can also use a dictionary to add properties. Any kwargs are read as properties, any args are read as labels. This can be helpful when reading from a pandas dataframe (see Import from DataFrame rows below).
new_node = Node('<LABEL>', <id_property>=<>,
primary_key='<id_property>',
primary_label='<LABEL>',
**property_dict)
To get data from the graph
results = graph.run("<CYPHER QUERY>") # returns cursor to stream results
for result in results:
    # do something
Instead of streaming results, data can be read to a list of dictionaries
results = graph.run(
f"""
MATCH(n:Node)-[]-()
WHERE n.name = "{<name>}"
RETURN n.name, n.prop_1, n.prop_2
"""
).data()
# returns:
[
{'n.name': '', 'n.prop_1': '', 'n.prop_2': ''},
{'n.name': '', 'n.prop_1': '', 'n.prop_2': ''},
...
]
If your graph has spaces in the properties, use indexing:
results = graph.run(
f"""
MATCH(n:Node)-[]-()
WHERE n.name = "{<name>}"
RETURN n['my name'], n['my prop_1'], n['my prop_2']
"""
).data()
If labels have spaces, use backticks
results = graph.run(
f"""
MATCH(n:`My Node`)-[]-()
WHERE n.name = "{<name>}"
RETURN n['my name'], n['my prop_1'], n['my prop_2']
"""
).data()
If you need relationships from one central node to multiple other nodes, use OPTIONAL MATCH:
results = graph.run(
f"""
MATCH(n:Node)-[]-(o:other_node)
WHERE n.name="{<name>}"
OPTIONAL MATCH (o)-[]-(p)
WHERE p:<Label1> OR p:<Label2>
RETURN n.name, o.name, labels(p), p.name
"""
).data()
This query gives you nodes with labels Label1 or Label2 related to a node with label Node that is connected through other_node. Note that in the above example the identifying property for all additional nodes must be the same, namely name.
Note that nodes may not conform well to pandas expectations, and unexpected errors can occur.
df = pd.DataFrame(graph.data("MATCH (a:Person) RETURN a.name, a.born"))
# returns:
a.born a.name
0 1964 Keanu Reeves
1 1967 Carrie-Anne Moss
2 1961 Laurence Fishburne
3 1960 Hugo Weaving
# Alternatively
df = graph.run("MATCH (a:Person) RETURN a.name, a.born").to_data_frame()
If lists of nodes of the same type are stored in DataFrame columns, you can create a unique list from each column and create a node using that list with the label as the column header. Labels may contain spaces.
tx = graph.begin()
for column in df.columns:
    node_set = df[column].unique().astype(str)  # some data types not supported as labels
    for node in node_set:
        new_node = Node(column, <id_property>=node,
                        primary_key='<id_property>',
                        primary_label=column)
        tx.merge(new_node, column, '<id_property>')
tx.commit()
Alternatively, if the nodes correspond to rows and columns are properties (you should find this with tidy datasets), read the df with the primary label as the index column, convert to a dictionary, and iterate through the dictionary items to add nodes. Note that the index column must be unique.
df = pd.read_csv('path/to/data', index_col='<id_property>')
node_dict = df.to_dict('index')
tx = graph.begin()
for node, properties in node_dict.items():
    node = Node('<LABEL>', <id_property>=node,
                primary_key='<id_property>',
                primary_label='<LABEL>',
                **properties)
    tx.merge(node, '<LABEL>', '<id_property>')
tx.commit()
# needs to be tested
Using **properties
passes the dictionary of properties, where each column is a key and the data in the cell is a value, to the Node object.
[(a["name"], a["born"]) for a in graph.nodes.match("Person").limit(3)]
# returns
[('Laurence Fishburne', 1961),
('Hugo Weaving', 1960),
('Lilly Wachowski', 1967)]
This runs as its own transaction, so don't wrap it in an explicit transaction. Note that updates to properties occur only locally until pushed using graph.push().
from py2neo import NodeMatcher
matcher = NodeMatcher(graph)

node = matcher.match('<LABEL>', <id_property>=<>).first()
if node:
    node[<property>] = <>
    graph.push(node)
node = matcher.match('<LABEL>', <id_property>=<>).first()
if node:
    node.update(**properties)
    graph.push(node)
Where nodes are stored in columns, and nodes have already been imported, you can use either df.iterrows()
or convert the DataFrame to a dictionary to relate all nodes in a single row.
from py2neo import NodeMatcher, Relationship
matcher = NodeMatcher(graph)
df = pd.read_csv('path/to/data', index_col='<id_property>')
entity_dict = df.to_dict()
tx = graph.begin()
for node_label, node_dict in entity_dict.items():
    for project_id in entity_dict[node_label]:
        project_node = matcher.match('Project',
                                     project_number=project_id).first()
        entity_node = matcher.match(
            node_label, name=node_dict.get(project_id)).first()
        if project_node and entity_node:
            relationship = Relationship(project_node, "IN", entity_node)
            tx.create(relationship)
tx.commit()
How this works:
The entity_dict looks like:
{
'Country': {'AID-512-A-00-08-00005': 'Brazil',
'AID-512-A-00-08-00015': 'Brazil',
'AID-512-A-10-00004': 'Brazil',
'AID-512-A-11-00004': 'Brazil',
'AID-512-A-16-00001': 'Brazil'},
'Income Group': {'AID-512-A-00-08-00005': 'Upper Middle Income Country',
'AID-512-A-00-08-00015': 'Upper Middle Income Country',
'AID-512-A-10-00004': 'Upper Middle Income Country',
'AID-512-A-11-00004': 'Upper Middle Income Country',
'AID-512-A-16-00001': 'Upper Middle Income Country'}
}
Each column has the relationship required between the index (in this case a project number) and the node of the type contained in that column. You can create all of the relationships required and then repeat the process by specifying a new index column, if needed.
For each column (node_label), we use the dictionary associated to match each project id and each node. If a match is found for both, we create a relationship. Don't forget to commit the transaction.
You can use another dictionary to specify the label for the relationship if you want to have different relationship labels for different columns. Simply lookup the relationship name in place of "IN".
tx = graph.begin()
for bad_node in bad_node_list:
    node_matches = matcher.match('<LABEL>', <id_property>=bad_node)
    for node in node_matches:
        tx.delete(node)  # delete within the transaction rather than autocommitting with graph.delete()
tx.commit()
from py2neo.ogm import GraphObject, Property
class Person(GraphObject):
    name = Property()
    born = Property()
[(a.name, a.born) for a in Person.match(graph).limit(3)]
# returns
[('Laurence Fishburne', 1961),
('Hugo Weaving', 1960),
('Lilly Wachowski', 1967)]
- Create an account and create your database
- Store the credentials in a json file in your secrets/ directory
- Copy the connection URI from the database card (under Databases)
- Use the code below to connect with neo4j
from neo4j import GraphDatabase
import json
with open('secrets/aura_creds.json') as f:
    creds = json.load(f)
URI = creds.get('URI')
USERNAME = creds.get('USERNAME')
PASSWORD = creds.get('PASSWORD')
graph = GraphDatabase.driver(URI, auth=(USERNAME, PASSWORD))
Note that py2neo does not appear to support the protocol used by Aura. If your app simply queries data from the database with Cypher, the neo4j driver is sufficient. If you need more advanced functionality and don't want to learn the neo4j driver to that extent, check this Stack Overflow topic to see if anyone can help.
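For the simple-read case, a minimal query with the official driver looks like this (the Cypher is just the TOP TAGS query from earlier as an example):
# 'graph' is the neo4j driver created above; sessions handle simple reads
with graph.session() as session:
    result = session.run(
        'MATCH (t:Tag)-[r]-() '
        'RETURN t.name AS tag, count(r) AS uses '
        'ORDER BY uses DESC LIMIT 10'
    )
    for record in result:
        print(record['tag'], record['uses'])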
https://github.com/akash-kaul/Using-scispaCy-for-Named-Entity-Recognition
https://towardsdatascience.com/using-scispacy-for-named-entity-recognition-785389e7918d
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34 # step by step NLP pipeline for keyword extraction
https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0 # Full implementation of keyword extraction using TextRank algorithm
rake_nltk and multi_rake for rapid automated keyword extraction
https://prodi.gy/features/named-entity-recognition
https://medium.com/neo4j/py2neo-v4-2bedc8afef2
The benefit of SQLAlchemy is that you can quickly switch from SQLite to Postgres between local development and deployment.