The script synchronizes the GMQL public repository between two servers.
The following tools must be installed on both servers for the script to work correctly:
- Apache Hadoop. A guide for installing Apache Hadoop can be found on the Hadoop installation page.
- Make sure that `rsync`, `ssh`, and `xpath` are installed. On Ubuntu/Debian you can install them from the terminal:
$ sudo apt-get install rsync
$ sudo apt-get install libxml-xpath-perl
$ sudo apt-get install ssh
To download and build the tool, run:
git clone https://github.com/DEIB-GECO/GMQL-Sync.git
cd GMQL-Sync
mvn clean install
$ ./bin/gmqlsync.sh [<options>] <SOURCE> <DEST>
`gmqlsync.sh` MUST be run on the `<SOURCE>` server.
Option | Description |
---|---|
`--delete` | Delete extraneous files from the destination dirs |
`--dry-run` | Perform a trial run with no changes made; this only generates a list of changed datasets |
`--user` | Set the user name (HDFS folder name) to synchronize |
`--tmpDir` | Set the temporary directory for local script output files. Default: `/share/repository/gmqlsync/tmpSYNC` |
`--tmpHdfsSource` | Set the temporary directory for HDFS file movement on the source server. Default: `/share/repository/gmqlsync/tmpHDFS` |
`--tmpHdfsDest` | Set the temporary directory for HDFS file movement on the destination server. Default: `/hadoop/gmql-sync-temp` |
`--logsDir` | Logging directory on the source server. Default: `/share/repository/gmqlsync/logs/` |
`--help`, `-h` | Show help |
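For example, to synchronize only the datasets under a specific user's HDFS folder while overriding the local temporary directory (the values below are illustrative; check `--help` for the exact option syntax):

$ ./bin/gmqlsync.sh --user public --tmpDir /tmp/gmqlsync /home/gmql/gmql_repository/data/public cineca:/gmql-data/gmql_repository/data/public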
- To synchronize the cineca server with the genomic server, run the following command:
$ ./bin/gmqlsync.sh /home/gmql/gmql_repository/data/public cineca:/gmql-data/gmql_repository/data/public
- To get a list of datasets that exist on the genomic server but are missing on the cineca server, run the following:
$ ./bin/gmqlsync.sh --dry-run /home/gmql/gmql_repository/data/public cineca:/gmql-data/gmql_repository/data/public
With the `--dry-run` option, the script only checks for differences between the servers and generates a file with the list of new datasets.
- To allow deletion of datasets that were removed from the genomic server, use the following command:
$ ./bin/gmqlsync.sh --delete /home/gmql/gmql_repository/data/public cineca:/gmql-data/gmql_repository/data/public
The script is built for synchronizing the GMQL repository of public datasets.
It first checks whether there are any differences in the local FS GMQL repository by running the `rsync` tool in dry-run mode.
The output of `rsync` is stored in the `rsync_out.txt` file and parsed into two lists: one with all the datasets that should be added on `<DEST>`, and the other with the datasets to delete from `<DEST>`. The lists of datasets to add and to delete are then saved to the `rsync_add.txt` and `rsync_del.txt` files, respectively.
NOTE: Dataset names are the file names in the local FS GMQL repository. If the tool was run with the `--dry-run` option, it exits after generating the dataset lists.
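As a rough sketch, the dry-run check and list generation look like this, assuming plain `rsync --dry-run` output (the exact flags and filtering used by `gmqlsync.sh` are assumptions):

```bash
# Hypothetical paths, matching the defaults documented above.
SRC=/home/gmql/gmql_repository/data/public
DEST=cineca:/gmql-data/gmql_repository/data/public
TMP=/share/repository/gmqlsync/tmpSYNC

# Trial run: list what would change without copying anything.
rsync -avn --delete "$SRC/" "$DEST/" > "$TMP/rsync_out.txt"

# "deleting <name>" lines mark datasets to remove from <DEST>;
# the remaining entries are datasets to add (summary lines filtered out).
grep '^deleting ' "$TMP/rsync_out.txt" | sed 's/^deleting //' > "$TMP/rsync_del.txt"
grep -v -e '^deleting ' -e '^sending' -e '^sent ' -e '^total ' \
    "$TMP/rsync_out.txt" > "$TMP/rsync_add.txt"
```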
After generating the dataset lists, the script retrieves the HDFS path of every dataset and copies the datasets from the HDFS repository to a temporary HDFS directory on the local FS.
Then, using `rsync`, the files in the temporary HDFS directory on `<SOURCE>` are copied to the temporary HDFS directory on `<DEST>`.
After that, on `<DEST>`, all files in the temporary HDFS directory are copied into the HDFS repository on `<DEST>`.
The dataset sizes in the HDFS repositories on both `<SOURCE>` and `<DEST>` are then compared to make sure that the copy was successful.
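The per-dataset transfer can be pictured as follows; this is a sketch under assumed paths and plain `hdfs dfs` commands, not the script's exact implementation:

```bash
DS=HG19_EXAMPLE_DS                       # hypothetical dataset name
SRC_TMP=/share/repository/gmqlsync/tmpHDFS
DEST_TMP=/hadoop/gmql-sync-temp

# 1. Stage the dataset from HDFS to the local temporary directory.
hdfs dfs -copyToLocal "/user/gmql/$DS" "$SRC_TMP/"

# 2. Copy the staged files to the temporary directory on <DEST>.
rsync -av "$SRC_TMP/$DS" "cineca:$DEST_TMP/"

# 3. On <DEST>, load the files into the HDFS repository.
ssh cineca "hdfs dfs -copyFromLocal '$DEST_TMP/$DS' /user/gmql/"

# 4. Verify: compare the dataset sizes in both HDFS repositories.
SRC_SIZE=$(hdfs dfs -du -s "/user/gmql/$DS" | awk '{print $1}')
DEST_SIZE=$(ssh cineca "hdfs dfs -du -s /user/gmql/$DS" | awk '{print $1}')
[ "$SRC_SIZE" = "$DEST_SIZE" ] || echo "WARNING: size mismatch for $DS" >&2
```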
If the copying of the HDFS files finished successfully, the files in the local FS GMQL repository are copied by running `rsync` in normal mode.
If the script was run with the `--delete` option, the removal of datasets on `<DEST>` is then performed.
Finally, the temporary HDFS directories are cleaned up.
The script also generates a `.log` file in the `logs` folder.
The tool consists of several script files in order to minimize the number of ssh connections:

- `gmqlsync.sh` is the main script file to be used
- `gmqlsyncCheckHdfsDest.sh` gets the dataset size in HDFS on the destination server
- `gmqlsyncDelHdfsDest.sh` removes datasets on the destination server
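Shipping a whole helper script to the remote host keeps the number of ssh round-trips down; conceptually (the actual invocation inside `gmqlsync.sh` is an assumption):

```bash
# Execute the helper on the destination host in a single ssh session,
# instead of issuing one ssh call per dataset.
ssh cineca 'bash -s' < bin/gmqlsyncCheckHdfsDest.sh
```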
NOTE: The `ssh` connection to the remote server must be passwordless.
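Passwordless access is typically set up with key-based authentication; for example (the user and host names below are placeholders):

```bash
# Create a key pair if you don't already have one, then install the
# public key on the destination server.
ssh-keygen -t rsa -b 4096
ssh-copy-id gmql@cineca
```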
The tool is designed for backing up the GMQL public repository after synchronization.
The script backs up the datasets from a given list to a given destination server. It first exports each dataset to a temporary folder on the local file system, then zips it and sends it to the destination backup server (see the conceptual sketch after the command list below).
$ bin/DSBackup $COMMAND
- `BackupTo <DS_NAME> <TMP_DIR> <BACKUP_PATH>`: backs up a dataset named `DS_NAME` to the backup folder. `TMP_DIR` is the full path to the temporary local location; `BACKUP_PATH` is the full path to the backup location on the remote server.
- `BackupALL <DS_LIST_FILE> <TMP_DIR> <BACKUP_PATH>`: backs up all datasets listed in `DS_LIST_FILE` to the backup folder. `TMP_DIR` is the full path to the temporary local location; `BACKUP_PATH` is the full path to the backup location on the remote server.
- `Zip <DS_NAME> <LOCAL_DIRECTORY>`: zips a dataset named `DS_NAME` to a local folder. `LOCAL_DIRECTORY` is the full path to the local location.
- `ZipALL <DS_LIST_FILE> <LOCAL_DIRECTORY>`: zips all datasets listed in the file `DS_LIST_FILE` to the local directory.
- `h` or `help`: shows usage.
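As a conceptual sketch, `BackupTo` roughly corresponds to the following steps (the paths, dataset name, and archiving tool are assumptions, not the script's actual implementation):

```bash
DS=HG19_EXAMPLE_DS                         # hypothetical dataset name
TMP=/share/repository/gmqlsync/tmpHDFS     # temporary local location
BACKUP=geco:/home/hdfs/gmql_repo_backup/   # remote backup location

hdfs dfs -copyToLocal "/user/gmql/$DS" "$TMP/"   # export dataset from HDFS
(cd "$TMP" && zip -rq "$DS.zip" "$DS")           # zip it locally
rsync -av "$TMP/$DS.zip" "$BACKUP"               # send to the backup server
rm -rf "$TMP/$DS" "$TMP/$DS.zip"                 # clean up the temp folder
```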
- After running the synchronization, you can find the datasets that were added in the following file:
"/share/repository/gmqlsync/tmpSYNC/rsync_add.txt"
You can use this file to back up the newly added datasets:
$ bin/DSBackup BackupAll /share/repository/gmqlsync/tmpSYNC/rsync_add.txt /share/repository/gmqlsync/tmpHDFS geco:/home/hdfs/gmql_repo_backup/
NOTE: The datasets will first be copied to the local folder `/share/repository/gmqlsync/tmpHDFS`, then zipped and sent to `geco:/home/hdfs/gmql_repo_backup/`. After that, the temporary folder will be cleaned up.
- If you want to back up only one dataset, use the following command:
$ bin/DSBackup BackupTo <DS_NAME> /share/repository/gmqlsync/tmpHDFS geco:/home/hdfs/gmql_repo_backup/
- If the backup directory is on the local server, you can use the `Zip` or `ZipAll` commands:
$ bin/DSBackup Zip <DS_NAME> /home/hdfs/gmql_repo_backup/
OR
$ bin/DSBackup ZipAll /share/repository/gmqlsync/tmpSYNC/rsync_add.txt /home/hdfs/gmql_repo_backup/