Merge pull request #93 from ascmitc/dev/chain-format
New chain format and minor cleanup
jwaggs authored Oct 1, 2021
2 parents 409a833 + f91f43d commit abcb928
Showing 45 changed files with 380 additions and 74 deletions.
63 changes: 59 additions & 4 deletions README.md
@@ -129,7 +129,7 @@ Additional utility commands:

The most common commands when using `ascmhl` in data management scenarios are the `create` and the `check` commands in their default behavior (without subcommand options).

Sealing a folder / drive with the `create` command traverses through a folder hierarchy, hashes all found files and compares the hashes against the records in the `ascmhl` folder (if present). The command creates a new generation (or an initial one) for the content of an entire folder at the given folder level. It can be used to document all files in a folder or drive with all verified or newly created file hashes of the moment the `create` command ran.
Creating a new generation for a folder / drive with the `create` command traverses through a folder hierarchy, hashes all found files and compares the hashes against the records in the `ascmhl` folder (if present). The command creates a new generation (or an initial one) for the content of an entire folder at the given folder level. It can be used to document all files in a folder or drive with all verified or newly created file hashes of the moment the `create` command ran.

Checking a folder / drive with the `verify` command traverses through the content of a folder, hashes all found files and compares the hashes against the records in the `ascmhl` folder. The `verify` command behaves like a `create` command (both without additional options), but doesn't write new generations. It can be used to verify the content of a received drive with existing ascmhl information.

@@ -151,8 +151,8 @@ The `info -sf` ("single file") command prints the known history of a single file

_Implementation status 2020-09-08:_

* __Implemented__: `create`, `verify` (partially), `diff`, `info` (partially), `xsd-schema-check`
* __Not implemented yet__: some subcommands for `verify`, `info`
* __Implemented__: `create`, `flatten` (partially), `verify` (partially), `diff`, `info` (partially), `xsd-schema-check`
* __Not implemented yet__: some subcommands for `flatten`, `verify`, `info`

_The commands are also marked below with their current implementation status._

@@ -220,6 +220,32 @@ for each file from input
add a new generation if necessary in appropriate `ascmhl` folder (mhllib)
```


### The `flatten` command

_TBD_

```
% ascmhl flatten --help
Usage: ascmhl flatten [OPTIONS] ROOT_PATH DESTINATION_PATH
  Flatten an MHL history into one external manifest
  The flatten command iterates through the mhl-history, collects all known files and
  their hashes in multiple hash formats and writes them to a new mhl file outside of the
  iterated history.
Options:
  -v, --verbose                Verbose output
  -n, --no_directory_hashes    Skip creation of directory hashes, only reference
                               directories without hash
  -i, --ignore TEXT            A single file pattern to ignore.
  -ii, --ignore_spec PATH      A file containing multiple file patterns to
                               ignore.
  --help                       Show this message and exit.
```


### The `verify` command

#### `verify` default behavior (for file hierarchy, with completeness check)
@@ -320,6 +346,13 @@ on error (including mismatching hash):
end with exit !=0
```


#### `verify` with `-pl` subcommand option (for packing lists)

_TBD_



### The `diff` command

The `diff` command is very similar to the `verify` command in the default behavior, only that it doesn't create hashes and doesn't verify them. It can be used to quickly check if a folder structure has new files that have not been recorded yet, or if files are missing.
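
The idea behind `diff` (compare the files found on disk against the recorded paths, without hashing anything) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the `ascmhl` implementation: the function name `diff_folder` and the `recorded_paths` set are hypothetical, and skipping everything below a folder named `ascmhl` is an assumption made for this sketch.

```python
from pathlib import Path
from typing import Set, Tuple


def diff_folder(root: Path, recorded_paths: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (new_files, missing_files) as paths relative to root.

    Hypothetical helper: only file names are compared, no hashes are created.
    """
    found = set()
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # assumption for this sketch: the history folder itself is not reported
        if path.is_file() and "ascmhl" not in relative.parts:
            found.add(relative.as_posix())
    new_files = found - recorded_paths      # on disk, but not recorded yet
    missing_files = recorded_paths - found  # recorded, but no longer on disk
    return new_files, missing_files
```

Because only set differences are computed, such a check stays fast even for large folders, which matches the "quick check" role described above.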
@@ -428,11 +461,33 @@ print directory hash

### The `xsd-schema-check` command

The `xsd-schema-check` command validates a given ASC MHL file against the XML XSD. This command can be used to ensure the creation of syntactically valid ASC MHL files, for example during implementation of tools creating ASC MHL files.
The `xsd-schema-check` command validates a given ASC MHL Manifest file against the XML XSD. This command can be used to ensure the creation of syntactically valid ASC MHL files, for example during implementation of tools creating ASC MHL files.

_Note: The `xsd-schema-check` command must be run from a directory with a `xsd` subfolder where the ASC MHL xsd files are located (for example it can be run from the root folder of the ASC MHL git repository)._

```
$ ascmhl xsd-schema-check /path/to/ascmhl/XXXXX.mhl
```

#### `xsd-schema-check` with the `-df` subcommand option

The `xsd-schema-check` command with the `-df` subcommand option validates an ASC MHL Directory file instead of a manifest file.

It is run with the path to an ASC MHL Directory file.

```
$ ascmhl xsd-schema-check -df /path/to/ascmhl/ascmhl_chain.xml
```


## Known issues

The current state of the implementation is intended to give a good overview of what can be done with ASC MHL. Nonetheless, this is not yet a complete implementation of the ASC MHL specification:

* Not all initially specified commands are implemented yet (see sections above)
* Renaming of files is not implemented yet (neither as a command, nor with proper handling in histories and packing lists)
* The chain file is not verified yet
* Some secondary features of the ASC MHL specification are not implemented yet.

_Also see the [GitHub issues](https://github.com/ascmitc/mhl/issues) page for more._

4 changes: 2 additions & 2 deletions ascmhl/__version__.py
@@ -16,8 +16,8 @@

ascmhl_folder_name = "ascmhl"
ascmhl_file_extension = ".mhl"
ascmhl_chainfile_name = "ascmhl_chain.txt"
ascmhl_collectionfile_name = "ascmhl_collection.txt"
ascmhl_chainfile_name = "ascmhl_chain.xml"
ascmhl_collectionfile_name = "ascmhl_collection.xml"
# decreasing priority list for verification
ascmhl_supported_hashformats = [
"md5",
4 changes: 2 additions & 2 deletions ascmhl/_debug_commands.py
@@ -12,7 +12,7 @@

import click
from .history import MHLHistory
from . import chain_txt_parser
from . import chain_xml_parser
from . import hashlist_xml_parser


@@ -24,7 +24,7 @@ def readchainfile(filepath, verbose):
    read an ASC-MHL file
    """

    chain = chain_txt_parser.parse(filepath)
    chain = chain_xml_parser.parse(filepath)

    if verbose:
        chain.log()
2 changes: 1 addition & 1 deletion ascmhl/chain.py
@@ -69,7 +69,7 @@ class for representing one generation
    hash_format: str
    hash_string: str

    def __init__(self, generation_number, ascmhl_filename, hash_format, hash_string):
    def __init__(self, generation_number=-1, ascmhl_filename=None, hash_format=None, hash_string=None):
        # line string examples:
        # 0001 A002R2EC_2019-10-08_100916_0001.ascmhl SHA1: 9e9302b3d7572868859376f0e5802b87bab3199e

142 changes: 142 additions & 0 deletions ascmhl/chain_xml_parser.py
@@ -0,0 +1,142 @@
"""
__author__ = "Patrick Renner"
__copyright__ = "Copyright 2021, Pomfort GmbH"
__license__ = "MIT"
__maintainer__ = "Patrick Renner, Alexander Sahm"
__email__ = "opensource@pomfort.com"
"""

from . import logger
from .__version__ import ascmhl_reference_hash_format
from .chain import MHLChain, MHLChainGeneration
from .hashlist import MHLHashList
from .hasher import create_filehash
import os
import textwrap
from timeit import default_timer as timer
import dateutil.parser

from lxml import etree
from lxml.builder import E

from . import logger
from .hashlist import *
from .utils import datetime_isostring
from .__version__ import ascmhl_supported_hashformats
from .hashlist import (
    MHLCreatorInfo,
    MHLHashEntry,
    MHLHashList,
    MHLHashListReference,
    MHLMediaHash,
    MHLProcessInfo,
    MHLTool,
)
from .ignore import MHLIgnoreSpec
from .utils import datetime_isostring


def parse(file_path):
    """parsing the MHL directory file and building the MHLChain for the chain member variable"""
    logger.debug(f'parsing "{os.path.basename(file_path)}"...')

    chain = MHLChain(file_path)
    chain.file_path = file_path
    current_object = None

    if not os.path.exists(file_path):
        return chain

    file = open(file_path, "rb")
    for event, element in etree.iterparse(file, events=("start", "end")):

        # check if we need to create a new container
        if event == "start":
            # the tag might contain the namespace like {urn:ASC:MHL:v2.0}hash, so we need to strip the namespace part
            # doing it with split is faster than using the lxml QName method
            tag = element.tag.split("}", 1)[-1]

            if not current_object:
                if tag == "hashlist":
                    current_object = MHLChainGeneration()

        elif event == "end":
            if current_object:
                tag = element.tag.split("}", 1)[-1]

                if type(current_object) is MHLChainGeneration:
                    if tag == "path":
                        current_object.ascmhl_filename = element.text
                    elif tag in ascmhl_supported_hashformats:
                        current_object.hash_format = tag
                        current_object.hash_string = element.text
                    elif tag == "hashlist":
                        current_object.generation_number = element.attrib.get("sequencenr")
                        chain.append_generation(current_object)
                        current_object = None

    return chain


def write_chain(chain: MHLChain, new_hash_list: MHLHashList):
    logger.debug(f'writing "{os.path.basename(chain.file_path)}"...')

    """creates a new chain file and writes the xml to disk
    """

    directory_path = os.path.dirname(chain.file_path)
    if not os.path.isdir(directory_path):
        os.mkdir(directory_path)

    file = open(chain.file_path, "wb")
    file.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<ascmhldirectory xmlns="urn:ASC:MHL:DIRECTORY:v2.0">\n')
    current_indent = " "

    for generation in chain.generations:
        _write_xml_element_to_file(file, _hashlist_xml_element_from_chaingeneration(generation), " ")

    # write new hashlist
    _write_xml_element_to_file(file, _hashlist_xml_element_from_hashlist(new_hash_list), " ")

    current_indent = current_indent[:-2]
    _write_xml_string_to_file(file, "</ascmhldirectory>\n", current_indent)
    file.flush()


def _write_xml_element_to_file(file, xml_element, indent: str):
    xml_string = etree.tostring(xml_element, pretty_print=True, encoding="unicode")
    _write_xml_string_to_file(file, xml_string, indent)


def _write_xml_string_to_file(file, xml_string: str, indent: str):
    result = textwrap.indent(xml_string, indent)
    file.write(result.encode("utf-8"))


def _hashlist_xml_element_from_hashlist(hash_list: MHLHashList):
    """builds and returns one <hashlist> element for a given HashList object"""

    hash_list_element = E.hashlist(
        E.path(os.path.basename(hash_list.file_path)),
        E.c4(hash_list.generate_reference_hash()),
    )
    hash_list_element.attrib["sequencenr"] = str(hash_list.generation_number)

    return hash_list_element


def _hashlist_xml_element_from_chaingeneration(generation: MHLChainGeneration):
    """builds and returns one <hashlist> element for a given ChainGeneration object"""

    if generation.hash_format == "c4":
        hash_list_element = E.hashlist(
            E.path(generation.ascmhl_filename),
            E.c4(generation.hash_string),
        )
        hash_list_element.attrib["sequencenr"] = str(generation.generation_number)

        return hash_list_element
    else:
        logger.error("ERR: fixme: non-c4 hash in chain file, not implemented")
        return E.hashlist()
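
The write path above (build one `<hashlist>` element per generation, serialize it, indent it with `textwrap.indent`, and wrap everything in `<ascmhldirectory>`) can be tried out in isolation with the standard library. The following is a minimal sketch under stated assumptions, not the code from this commit: it uses `xml.etree.ElementTree` instead of `lxml`, the names `hashlist_element` and `render_directory` are hypothetical, the two-space indent is an assumption, and `ET.indent` requires Python 3.9 or newer.

```python
import textwrap
import xml.etree.ElementTree as ET


def hashlist_element(sequence_nr: int, filename: str, c4_hash: str) -> ET.Element:
    """Build one <hashlist> element (hypothetical helper for this sketch)."""
    element = ET.Element("hashlist", {"sequencenr": str(sequence_nr)})
    ET.SubElement(element, "path").text = filename
    ET.SubElement(element, "c4").text = c4_hash
    return element


def render_directory(entries) -> str:
    """Render (sequence_nr, filename, c4_hash) tuples as a directory file string."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>\n',
        '<ascmhldirectory xmlns="urn:ASC:MHL:DIRECTORY:v2.0">\n',
    ]
    for sequence_nr, filename, c4_hash in entries:
        element = hashlist_element(sequence_nr, filename, c4_hash)
        ET.indent(element)  # pretty-print the element in place (Python 3.9+)
        xml_string = ET.tostring(element, encoding="unicode") + "\n"
        # shift the whole element one level to the right, as the helper above does
        parts.append(textwrap.indent(xml_string, "  "))
    parts.append("</ascmhldirectory>\n")
    return "".join(parts)
```

Writing the root tag by hand, as the commit does, lets the child elements inherit the default namespace without carrying it on every element object.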
17 changes: 15 additions & 2 deletions ascmhl/commands.py
@@ -116,7 +116,7 @@ def create_for_folder_subcommand(
    if not os.path.isabs(root_path):
        root_path = os.path.join(os.getcwd(), root_path)

    logger.verbose(f"Sealing folder at path: {root_path} ...")
    logger.verbose(f"Creating new generation for folder at path: {root_path} ...")

    existing_history = MHLHistory.load_from_path(root_path)

@@ -738,6 +738,7 @@ def flatten_history(root_path, destination_path, verbose, no_directory_hashes, i
hash_entry.hash_format,
hash_entry.hash_string,
action=hash_entry.action,
hash_date=hash_entry.hash_date,
)

commit_session_for_collection(session, root_path)
@@ -835,7 +836,15 @@ def info_for_single_file(root_path, verbose, single_file):

@click.command()
@click.argument("file_path", type=click.Path(exists=True))
def xsd_schema_check(file_path):
# subcommands
@click.option(
    "--directory_file",
    "-df",
    default=False,
    is_flag=True,
    help="Check directory file (e.g. ascmhl_chain.xml) instead of manifest file",
)
def xsd_schema_check(file_path, directory_file):
    """
    Checks a .mhl file against the xsd schema definition
@@ -847,6 +856,10 @@ def xsd_schema_check(file_path):
    """

    xsd_path = "xsd/ASCMHL.xsd"

    if directory_file:
        xsd_path = "xsd/ASCMHLDirectory__combined.xsd"

    xsd = etree.XMLSchema(etree.parse(xsd_path))

    # pass a file handle to support the fake file system used in the tests
4 changes: 2 additions & 2 deletions ascmhl/generator.py
@@ -10,7 +10,7 @@
from collections import defaultdict
from typing import Dict, List

from . import chain_txt_parser
from . import chain_xml_parser
from . import logger
from .ignore import MHLIgnoreSpec
from .hashlist import MHLHashList, MHLHashEntry, MHLCreatorInfo, MHLProcessInfo
@@ -175,4 +175,4 @@ def commit(self, creator_info: MHLCreatorInfo, process_info: MHLProcessInfo):
if history.parent_history is not None:
referenced_hash_lists[history.parent_history].append(new_hash_list)

chain_txt_parser.write_chain(history.chain, new_hash_list)
chain_xml_parser.write_chain(history.chain, new_hash_list)
2 changes: 1 addition & 1 deletion ascmhl/hashlist_xml_parser.py
@@ -55,7 +55,7 @@ def parse(file_path):
# doing it with split is faster than using the lxml QName method
tag = element.tag.split("}", 1)[-1]

if not current_object and event == "start":
if not current_object:
if tag == "creatorinfo":
current_object = MHLCreatorInfo()
elif tag == "processinfo":
6 changes: 3 additions & 3 deletions ascmhl/history.py
@@ -14,7 +14,7 @@
from datetime import datetime, date, time

from .__version__ import ascmhl_folder_name, ascmhl_file_extension, ascmhl_chainfile_name, ascmhl_collectionfile_name
from . import hashlist_xml_parser, chain_txt_parser
from . import hashlist_xml_parser, chain_xml_parser
from .utils import datetime_now_filename_string
from typing import Tuple, List, Dict, Optional, Set
from . import logger
@@ -216,7 +216,7 @@ def load_from_path(cls, root_path):
history.asc_mhl_path = asc_mhl_folder_path

file_path = os.path.join(asc_mhl_folder_path, ascmhl_chainfile_name)
history.chain = chain_txt_parser.parse(file_path)
history.chain = chain_xml_parser.parse(file_path)

hash_lists = []
for root, directories, filenames in os.walk(asc_mhl_folder_path):
@@ -280,7 +280,7 @@ def create_collection_at_path(cls, root_path, debug=False):
history.asc_mhl_path = collection_folder_path

file_path = os.path.join(collection_folder_path, ascmhl_collectionfile_name)
history.chain = chain_txt_parser.parse(file_path)
history.chain = chain_xml_parser.parse(file_path)

return history

2 changes: 1 addition & 1 deletion examples/scenarios/Output/scenario_01/log.txt
@@ -7,7 +7,7 @@ Assume the source card /A002R2EC is copied to a travel drive /travel_01.
Seal the copy on the travel drive /travel_01 to create the original mhl generation.

$ ascmhl.py create -v /travel_01/A002R2EC -h xxh64
Sealing folder at path: /travel_01/A002R2EC ...
Creating new generation for folder at path: /travel_01/A002R2EC ...
created original hash for Clips/A002C006_141024_R2EC.mov xxh64: 0ea03b369a463d9d
created original hash for Clips/A002C007_141024_R2EC.mov xxh64: 7680e5f98f4a80fd
calculated directory hash for Clips xxh64: 6d43a82e7a5d40f6 (content), a27e08b77ae22c78 (structure)

This file was deleted.

@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<ascmhldirectory xmlns="urn:ASC:MHL:DIRECTORY:v2.0">
  <hashlist sequencenr="1">
    <path>0001_A002R2EC_2020-01-16_091500.mhl</path>
    <c4>c44cT42udFcktEWg2GLRbcSsTeUGXTyHA7yaqxcL2NC2bhPjoYtFjNCiab5ndByhrpYWLbcAQ6s1sBPxHXLdRbyWqR</c4>
  </hashlist>
</ascmhldirectory>
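
To see how a directory file like the one above maps to generations, the start/end event loop and the namespace stripping used by `chain_xml_parser.parse` can be reproduced with the standard library. This is a hedged sketch, not the project's parser: it uses `xml.etree.ElementTree` instead of `lxml`, returns plain dictionaries instead of `MHLChainGeneration` objects, and the names `EXAMPLE` and `read_generations` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# the example directory file from this commit, embedded for a self-contained run
EXAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ascmhldirectory xmlns="urn:ASC:MHL:DIRECTORY:v2.0">
  <hashlist sequencenr="1">
    <path>0001_A002R2EC_2020-01-16_091500.mhl</path>
    <c4>c44cT42udFcktEWg2GLRbcSsTeUGXTyHA7yaqxcL2NC2bhPjoYtFjNCiab5ndByhrpYWLbcAQ6s1sBPxHXLdRbyWqR</c4>
  </hashlist>
</ascmhldirectory>
"""


def read_generations(xml_bytes: bytes) -> list:
    """Collect one dict per <hashlist> element (hypothetical helper)."""
    generations = []
    current = None
    for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        # tags look like {urn:ASC:MHL:DIRECTORY:v2.0}hashlist, so drop the namespace part
        tag = element.tag.split("}", 1)[-1]
        if event == "start" and tag == "hashlist":
            # attributes are already available on the start event
            current = {"sequencenr": int(element.attrib["sequencenr"])}
        elif event == "end" and current is not None:
            if tag == "path":
                current["path"] = element.text
            elif tag == "c4":
                current["c4"] = element.text
            elif tag == "hashlist":
                generations.append(current)
                current = None
    return generations
```

Element text is read on the `end` event because it is only guaranteed to be complete once the closing tag has been seen.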