We provide code for loading Microsoft Excel data into a Neo4j graph and for updating an existing graph with new data. We provide the following scripts:
load_data.py, which loads data from a given file and creates a Neo4j graph from it.
update_data.py, which loads data from a given file and updates an existing Neo4j graph with it.
The scripts used for anomaly detection require the following Python libraries to be installed:
pip install pandas py2neo tqdm
Our code works with a Neo4j database including the Graph Data Science plugin. The most straightforward way of setting up such a database is with docker:
docker run -p7474:7474 -p7687:7687 -e NEO4J_AUTH=neo4j/password --env NEO4JLABS_PLUGINS='["graph-data-science"]' neo4j
Note that in our code, we use the credentials:
username: neo4j
password: password
If you set up your own database with different credentials, please change the credentials in the code as well. The README.md file in each directory indicates which lines have to be changed.
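As an alternative to editing the credentials in several places, they could be read from environment variables, with the defaults above as fallbacks. The following is a minimal sketch, not part of the repository: the helper name `neo4j_settings` and the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions chosen for illustration.

```python
import os


def neo4j_settings(env=os.environ):
    """Return (uri, (user, password)) for the Neo4j connection.

    The environment variable names are hypothetical; the fallback values
    are the defaults used throughout this repository.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD", "password")
    return uri, (user, password)
```

The returned values could then be passed to the database connection call in each script, for example `uri, auth = neo4j_settings()` followed by `GraphDatabase.driver(uri, auth=auth)`.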
To load data into a graph, please run load_data.py.
Note that the script will attempt to connect to a Neo4j database instance. By default, we connect to the following instance, with the following credentials:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
Please modify this line in the __main__ function of the script to connect to your database instance with the correct credentials.
The current implementation loads the data from the path ../collector/example/iam_policy_data_2021-03-26_14:11.xlsx. To use a custom file instead, please change the following line in the __main__ function of the script:
# Load data from stored files
df_policies, df_users, df_groups, df_roles = load_excel("../collector/example/iam_policy_data_2021-03-26_14:11.xlsx")
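If the input file changes often, the hard-coded path could be replaced by a command-line option. The sketch below is one possible way to do this and is not part of load_data.py: the `--input` option and the `parse_args` helper are hypothetical, and the default is the example path used above.

```python
import argparse

# Default taken from the example path used in the script.
DEFAULT_INPUT = "../collector/example/iam_policy_data_2021-03-26_14:11.xlsx"


def parse_args(argv=None):
    """Parse the (hypothetical) --input option; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(
        description="Load Excel data into a Neo4j graph."
    )
    parser.add_argument(
        "--input",
        default=DEFAULT_INPUT,
        help="Path to the Excel file to load",
    )
    return parser.parse_args(argv)
```

With this in place, the call in the script would become `load_excel(parse_args().input)`, and running the script without arguments would keep the current behaviour.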
To update an existing graph with new data, please run update_data.py.
Note that the script will attempt to connect to a Neo4j database instance. By default, we connect to the following instance, with the following credentials:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
Please modify this line in the __main__ function of the script to connect to your database instance with the correct credentials.
The current implementation loads the original data from the path ../collector/example/iam_policy_data_2021-03-26_14:11.xlsx and the updated data from the path ../collector/example/iam_policy_data_2021-03-26_14:11.xlsx. To use custom files instead, please change the following lines in the __main__ function of the script:
df_policies, df_users, df_groups, df_roles = load_excel('../collector/example/iam_policy_data_2021-03-26_14:11.xlsx')
new_df_policies, new_df_users, new_df_groups, new_df_roles = load_excel('../collector/example/iam_policy_data_2021-03-26_14:11.xlsx')
Important: Before running any of the anomaly detection algorithms, we must create the graph embedding in the Neo4j database; otherwise, some features will be missing. To create the graph embedding for each policy node, we run the following command on the Neo4j database:
CALL gds.beta.node2vec.write({
  nodeProjection: "Policy",
  relationshipProjection: {
    contains: {
      type: "CONTAINS",
      orientation: "NATURAL"
    },
    works_on: {
      type: "WORKS_ON",
      orientation: "NATURAL"
    }
  },
  embeddingDimension: 128,
  iterations: 100,
  walkLength: 5000,
  writeProperty: "embeddingNode2vec"
})
This command writes a property embeddingNode2vec to each Policy node, which we use during anomaly detection.
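Because anomaly detection depends on this property, it may be worth checking that no Policy node lacks it before continuing. The following is a minimal sketch, assuming a `driver` object from the official Neo4j Python driver (as created by `GraphDatabase.driver(...)`); the function name and the check itself are illustrative and not part of the repository.

```python
# Counts Policy nodes for which the embedding property has not been written.
MISSING_EMBEDDING_QUERY = """
MATCH (p:Policy)
WHERE p.embeddingNode2vec IS NULL
RETURN count(p) AS missing
"""


def count_policies_without_embedding(driver):
    """Return the number of Policy nodes lacking embeddingNode2vec.

    `driver` is assumed to expose session() as a context manager, as the
    official Neo4j Python driver does.
    """
    with driver.session() as session:
        record = session.run(MISSING_EMBEDDING_QUERY).single()
        return record["missing"]
```

A result of 0 indicates that every Policy node has an embedding; any other value suggests the node2vec command should be run (or rerun) before anomaly detection.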