Hadoop File Format
This Hadoop file format reads and writes row data from/to Office documents, such as MS Excel or ODF Spreadsheet, stored in HDFS. It can be used by any MapReduce/Tez/Spark application. Currently the following formats are supported:
- MS Excel. This module is based on Apache POI for parsing Excel documents. It provides the following formats:
- ExcelFileInputFormat: Deserializes rows containing cells into an array of SpreadSheetCellDAO objects of MS Excel (.xls, .xlsx). A null value for a cell means that the cell has no content. Each sheet is processed one after the other.
- ExcelCellFileInputFormat: Deserializes rows containing cells into objects of type SpreadSheetCellDAO of MS Excel (.xls, .xlsx).
- ExcelFileOutputFormat: Serializes SpreadSheetCellDAO objects into Excel files (.xls,.xlsx).
- ExcelRowFileOutputFormat: Serializes an array of SpreadSheetCellDAO objects into Excel files (.xls,.xlsx).
Please note that Office documents are generally not splittable due to their structure. Usually this is not a problem, because Office documents are rather small (< HDFS block size). However, another problem arises if you have many such small files: the namenode has to process a lot of requests. Hence, it is recommended to improve performance for processing a lot of small Office files with Hadoop Archives (HAR).
Note: Version 1.5.0 introduces a breaking change that requires all option keys to be written in lower case. Please make sure that you apply this change when upgrading to version 1.5.0 or higher.
Execute:
git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice
You can build the application by changing to the directory hadoopoffice/fileformat and using the following command:
../gradlew clean build publishToMavenLocal
The following configuration options for reading exist.
- "io.file.buffer.size": Size of io Buffer (in bytes). Defaults to 64K. This is an option of Hadoop.
- "hadoopoffice.read.mimetype" (Before 1.5: "hadoopoffice.read.mimeType"). mime type of the office document. Default: "" (hadoopoffice tries to autodetect it). Alternatively:
- "application/vnd.ms-excel" (.xls)
- "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" (.xlsx)
- "hadoopoffice.read.sheets". A ":" separated list of sheets to be processed. Default "", which means all sheets in a workbook are processed
- "hadoopoffice.read.locale.bcp47". Locale (in BCP47 format) of the document to be used for evaluating fields (e.g. numeric or date fields). Default "", which means that the system locale is used.
- "hadoopoffice.read.linkedworkbooks". True if linked workbooks should be read. False if not. Note: linked workbooks must be in the same folder as the main document. Default: false
- "hadoopoffice.read.ignoremissinglinkedworkbooks". True if missing linked workbooks should be ignored. Values are read from the cache of the main document, if exists. False if missing linked workbooks should not be ignored. Default: false
- hadoopoffice.read.linkedworkbooks.location (as of 1.2.3): location of linked workbooks (only one folder where all of them are located). If set to empty String then the same folder as the main excel file is used. Default: set to empty String
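For illustration, here is a minimal sketch of a MapReduce driver that sets some of the read options above. The option keys are taken from the list above; the input format class and its package (org.zuinnote.hadoop.office.format.mapreduce) are assumed and should be verified against the Javadoc of the version you use, and the mapper/reducer setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// assumed package of the input format - verify against your hadoopoffice version
import org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat;

public class ExcelReadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // option keys as documented above (lower case as of 1.5.0)
    conf.set("hadoopoffice.read.mimetype",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"); // .xlsx
    conf.set("hadoopoffice.read.locale.bcp47", "de");       // evaluate fields using the German locale
    conf.set("hadoopoffice.read.sheets", "Sheet1:Sheet2");  // only process these sheets
    conf.set("hadoopoffice.read.linkedworkbooks", "false"); // do not resolve linked workbooks

    Job job = Job.getInstance(conf, "excel-read-example");
    job.setJarByClass(ExcelReadDriver.class);
    job.setInputFormatClass(ExcelFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... set mapper, reducer, output key/value classes and output format as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```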
Spreadsheet files consume a lot of memory and CPU when reading them. Hence, we introduced in version 1.0.4 the low footprint mode that allows reading of spreadsheet files with a low CPU and memory footprint. For Excel this is based on the Apache POI event API (see also Limitations of reading Excel files in low footprint mode). The following configuration options exist:
- "hadoopoffice.read.lowfootprint" (Before 1.5: "hadoopoffice.read.lowFootprint"). true if low footprint reading should be used. False, if not. Default false
As of version 1.2.0 you have the following options:
- hadoopoffice.read.lowfootprint.parser (Before 1.5: "hadoopoffice.read.lowFootprint.parser"): Only valid for Excel files in the new Excel format. Can be either SAX (as implemented in versions previous to 1.2.0) or StAX (an improved, more memory-efficient parser). SAX generally consumes a lot of memory if the Excel file contains a lot of unique strings (due to a design mistake of the Excel format - it is not an implementation error) - even for small Excel files (which are compressed, e.g. a 50 MB .xlsx can easily consume several gigabytes of memory). StAX is much more memory-efficient in case of a lot of unique entries, but can also be slower to process. For encrypted files we recommend SAX and not StAX, because the encryption makes it difficult to randomly scan a file for entries in the unique string table. Default: StAX
- hadoopoffice.read.lowfootprint.stax.sst.cache (Before 1.5: "hadoopoffice.read.lowFootprint.stax.sst.cache"): if the StAX parser is used, a cache size can be defined for the so-called shared string table (an Excel concept that stores all unique strings; this can save space, but is not very memory efficient for large files). The cache can be -1 (everything in memory), 0 (nothing in memory) or n>0 (n entries in cache). Everything that does not fit in the cache is swapped to disk and read from disk when needed. This can be slow (especially if the source file is encrypted, because in this case the shared string table is also stored encrypted on disk). Generally, if you have a lot of entries repeating at various positions in the document, you should have a rather large in-memory cache; if entries appear in a sequential manner and ideally do not repeat, a smaller cache suffices. You may need to experiment with large Excel files if you want to save memory. Alternatively, provide enough memory and put everything in memory (can be potentially large!). Default: 10000 unique string entries (memory consumption depends on the size of the strings plus the overhead of the HashMap cache).
- hadoopoffice.read.lowfootprint.stax.sst.compress (Before 1.5: "hadoopoffice.read.lowFootprint.stax.sst.compress"): compress (gzip) Excel shared string table items swapped to disk (if the StAX parser is used). True if they should be compressed, false if not. Note: compression can significantly reduce performance. Default: false
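As a hedged sketch, the low footprint options above could be combined on the job's Hadoop Configuration as shown below; the values (and the exact spelling accepted for the parser value) are illustrative:

```java
// conf is the org.apache.hadoop.conf.Configuration used by the job (see the driver sketch above)
conf.set("hadoopoffice.read.lowfootprint", "true");                    // enable low footprint parsing
conf.set("hadoopoffice.read.lowfootprint.parser", "stax");             // memory-efficient parser for .xlsx
conf.set("hadoopoffice.read.lowfootprint.stax.sst.cache", "10000");    // shared string table entries kept in memory
conf.set("hadoopoffice.read.lowfootprint.stax.sst.compress", "false"); // do not gzip SST items swapped to disk
```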
You can filter documents by metadata. If the metadata does not match then the content of the document is ignored. For each metadata attribute it is supported to filter by a regular expression if you use the Excel format (see Excel Metadata Filter).
- "hadoopoffice.read.filter.metadata": filters documents according to metadata. For example, hadoopoffice.read.filter.metadata.author will filter by author, with the filter defined as the value (see the sketch below). Filtering is done by the parser; it is recommended that parsers support regular expressions for filtering, but this is up to the parser.
You may optionally skip some rows of the first sheet or of all sheets. Furthermore, you may want to read a header from the first sheet or from all sheets (see the sketch after this list). You have the following options:
- hadoopoffice.read.header.read: if true then the next row of the first sheet is interpreted as a header and ignored for subsequent reads (you can get the header row content by calling reader.getOfficeReader().getCurrentParser().getHeader()). Default: false
- hadoopoffice.read.header.skipheaderinallsheets: if true then the next row in all sheets is interpreted as a header and ignored. Default: false.
- hadoopoffice.read.sheet.skiplines.num: number of rows to be skipped of the first Excel sheet. Default: 0. Note: this applies before reading the header. E.g. if the header row is in the 5th line then you can skip the previous 4 lines.
- hadoopoffice.read.sheet.skiplines.allsheets: skip number of rows configured using the previous option in all sheets.
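A sketch of how the header and skip options could be combined, e.g. for a workbook whose header row is in the 5th line of every sheet:

```java
conf.set("hadoopoffice.read.sheet.skiplines.num", "4");             // skip the 4 rows before the header ...
conf.set("hadoopoffice.read.sheet.skiplines.allsheets", "true");    // ... in every sheet
conf.set("hadoopoffice.read.header.read", "true");                  // interpret the next row as header
conf.set("hadoopoffice.read.header.skipheaderinallsheets", "true"); // do so in all sheets
```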
You can also change the header names according to regular expressions (as of 1.2.0). This can be useful if your tool (e.g. a Big Data platform such as Flink or Spark) does not support certain characters in column names.
- hadoopoffice.read.header.column.names.regex: Regular expression that is supposed to be replaced by the text in the following option
- hadoopoffice.read.header.column.names.replace: Text to replace all (!) occurrences of the text matching the regular expression
Example to replace all occurrences of . in the column name with an empty String:
- hadoopoffice.read.header.column.names.regex = "."
- hadoopoffice.read.header.column.names.replace = ""
More information about regular expressions in Java can be found here: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
Please also think about hard disk encryption to avoid leakage of decrypted temporary files.
You can read and decrypt encrypted documents with the following options:
- hadoopoffice.read.security.crypt.password: if set then hadoopoffice will try to decrypt the file
- hadoopoffice.read.security.crypt.linkedworkbooks.*: if set then hadoopoffice will try to decrypt all the linked workbooks for which a password has been specified. If no password is specified then it is assumed that the linked workbook is not encrypted. Example: the property key for the file "linkedworkbook1.xlsx" is "hadoopoffice.read.security.crypt.linkedworkbooks.linkedworkbook1.xlsx"; the value is the password. You must not include path or protocol information in the filename.
Since version 1.1.0 you can also store passwords encrypted in a keystore file and use the keystore file to access the encrypted passwords in your Hadoop job (done transparently by HadoopOffice). You can create a keystore with encrypted passwords using keytool (see here for more details): keytool -importpass -alias filename.xlsx -storetype JCEKS -keystore keystore.jceks -storepass keystorepassword
You can use the filename as the alias or any other custom alias, if needed. One keystore can contain multiple passwords for different files.
You can then configure the following options:
- hadoopoffice.read.security.crypt.credential.keystore.file: filename of the keystore containing the encrypted passwords. Can be stored on HDFS
- hadoopoffice.read.security.crypt.credential.keystore.type: type of keystore. Default: JCEKS
- hadoopoffice.read.security.crypt.credential.keystore.password: password of the keystore to decrypt passwords. Note this is not the password of the office document.
- hadoopoffice.read.security.crypt.credential.keystore.alias: alias under which the password for an office document can be found. Default: name of the document without the path. e.g. example.xlsx
Note: this feature requires at least JDK 8.
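A sketch of the keystore-based read configuration; the HDFS path, password and alias are placeholders:

```java
conf.set("hadoopoffice.read.security.crypt.credential.keystore.file",
    "hdfs:///user/office/keystore.jceks");                                       // keystore holding the document passwords
conf.set("hadoopoffice.read.security.crypt.credential.keystore.type", "JCEKS");
conf.set("hadoopoffice.read.security.crypt.credential.keystore.password", "keystorepassword");
conf.set("hadoopoffice.read.security.crypt.credential.keystore.alias", "example.xlsx"); // alias used in keytool
```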
As of version 1.1.0 the HadoopOffice library supports verifying digital signatures of office documents. Currently Excel files in new Excel format (.xlsx) are supported (find more information here).
You simply configure the following option:
- hadoopoffice.read.security.sign.verifysignature: if true then the signature of the document will be verified. It throws an exception if the signature cannot be verified. Ignored if signature verification is not supported. Default: false
Optionally you can use a truststore to establish a certification chain by using the following options:
- hadoopoffice.read.security.sign.truststore.file: location of keystore (either local, HDFS or any other Hadoop-supported filesystem)
- hadoopoffice.read.security.sign.truststore.type: truststore type. Default: JKS
- hadoopoffice.read.security.sign.truststore.password: truststore password
If a truststore is defined then it will be verified that the signature certificate is part of a certification chain, i.e. that the person signing the document is indeed who they claim to be. Note: this also implies that certificate revocation lists are checked as part of the certificate validation. Depending on which certification authority has signed the certificate, you may need to open firewall traffic from all cluster nodes to the server hosting the certificate revocation lists and/or define a proxy to be used by all JDKs on each node of the cluster.
Please note that you need to import the root and other certificates as part of the chain in the truststore, e.g. by using keytool.
Note that signature verification requires reading the full file and thus might have performance implications.
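A sketch of signature verification with a truststore; the truststore location and password are placeholders:

```java
conf.set("hadoopoffice.read.security.sign.verifysignature", "true");  // throw an exception if verification fails
conf.set("hadoopoffice.read.security.sign.truststore.file",
    "hdfs:///user/office/truststore.jks");                            // truststore with the certification chain
conf.set("hadoopoffice.read.security.sign.truststore.type", "JKS");
conf.set("hadoopoffice.read.security.sign.truststore.password", "changeit");
```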
Please note that you will need to add the following dependencies to your application to use this option:
- org.bouncycastle:bcprov-ext-jdk15on:1.60
- org.bouncycastle:bcpkix-jdk15on:1.60
- org.apache.santuario:xmlsec:2.1.2
Some Big Data platform components of HadoopOffice, such as the ones for Flink, Hive and Spark, have special configurations to automatically convert Excel data into Flink/Spark/Hive datatypes to use them natively with those platforms. To interpret textual data in Excel correctly in these cases, a Date, Date/Time and decimal format need to be provided. This can be done by passing the following options (or setting them directly in HadoopOfficeReadConfiguration) as of version 1.2.0:
- hadoopoffice.read.simple.dateformat (Before 1.5: "hadoopoffice.read.simple.dateFormat"): applies only to HadoopOffice components that use the Converter to convert SpreadSheetCellDAOs into simple Java objects. Describes the date format to interpret dates using the BCP47 notation. Note that even in non-US Excel versions Excel stores dates most of the time internally in US format. Leave it empty to use the system locale. Default: "US".
- hadoopoffice.read.simple.datepattern (Before 1.5: "hadoopoffice.read.simple.datePattern"): applies only to HadoopOffice components that use the Converter to convert SpreadSheetCellDAOs into simple Java objects. Overrides "hadoopoffice.read.simple.dateformat" - describes a date pattern according to the pattern syntax of SimpleDateFormat - you can define any date pattern (e.g. yyyy-MM-dd for dates in the style 2001-01-30).
- hadoopoffice.read.simple.datetimeformat (Before 1.5: "hadoopoffice.read.simple.dateTimeFormat"): applies only to HadoopOffice components that use the Converter to convert SpreadSheetCellDAOs into simple Java objects. Describes the date/time format to interpret date/timestamps using the BCP47 notation. Leave it empty to use the system locale. Default: "US".
- hadoopoffice.read.simple.datetimepattern (Before 1.5: "hadoopoffice.read.simple.dateTimePattern"): applies only to HadoopOffice components that use the Converter to convert SpreadSheetCellDAOs into simple Java objects. Overrides "hadoopoffice.read.simple.datetimeformat" - describes a date/time pattern according to the pattern syntax of SimpleDateFormat - you can define any date/time pattern (e.g. yyyy-MM-dd HH:mm:ss for timestamps in the style 2001-01-30 12:12:01). Defaults to the java.sql.Timestamp pattern, if not specified.
- hadoopoffice.read.simple.decimalformat (Before 1.5: "hadoopoffice.read.simple.decimalFormat"): applies only to HadoopOffice components that use the Converter to convert SpreadSheetCellDAOs into simple Java objects. Describes the decimal format to interpret decimal numbers using the BCP47 notation. Leave it empty to use the system locale. Default: "".
- hadoopoffice.read.emulatecsv (since 1.2.1) (Before 1.5: "hadoopoffice.read.emulateCSV"): when reading Excel, interprets the content as if the workbook had been saved as CSV. Technically it is based on the emulateCSV option of the DataFormatter in Apache POI. Default: false
Note: very large numbers are automatically rounded by Excel: for example, 123456789101112000001 is automatically rounded to 1,23457E+20. If you want to avoid this then you need to format very large numbers in Excel as Text; HadoopOffice automatically recognizes this and uses the appropriate data type (BigDecimal) in case of conversions. However, you cannot use those numbers in Excel formulas due to Excel's restriction of numbers to doubles (see the whole explanation here).
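A sketch of how the conversion-related read options could be set; the patterns and locales are only examples:

```java
conf.set("hadoopoffice.read.simple.dateformat", "US");          // BCP47 locale used to interpret dates
conf.set("hadoopoffice.read.simple.datepattern", "yyyy-MM-dd"); // overrides the locale-based date format
conf.set("hadoopoffice.read.simple.datetimepattern", "yyyy-MM-dd HH:mm:ss");
conf.set("hadoopoffice.read.simple.decimalformat", "de");       // e.g. decimal comma
conf.set("hadoopoffice.read.emulatecsv", "true");               // interpret cells as a CSV export would
```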
The following configuration options for writing exist.
- hadoopoffice.write.mimetype (Before 1.5: "hadoopoffice.write.mimeType"): mime type of the document.
- "application/vnd.ms-excel" (.xls)
- Default: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" (.xlsx). Note you cannot write spreadsheets with linked workbooks in this Excel format at the moment due to a bug in Apache POI. All other options work perfectly fine.
- "hadoopoffice.write.locale.bcp47": Locale (in BCP47 format) of the document to be used for evaluating fields (e.g. numeric or date fields). Default "", which means that the system locale is used.
- "hadoopoffice.write.linkedworkbooks": a []: separated list of existing linked workbooks. Example: [hdfs:///home/user/excel/linkedworkbook1.xls]:[hdfs:///home/user/excel/linkedworkbook2.xls]. Note: these workbooks are loaded during writing the current workbook. This means you may need a lot of memory on the node writing the file. Furthermore, you can only specify files and NOT directories.
- "hadoopoffice.write.ignoremissinglinkedworkbooks": if you have specified linked workbooks then they are not read during writing. This implies also that the written document does NOT have a cached value. Value is ignored if you did not specify linked workbooks. Default: false.
- "hadoopoffice.write.comment.author": Default author for comments
- "hadoopoffice.write.comment.width": Default comment width (in columns)
- "hadoopoffice.write.comment.height": Default comment height (in rows)
If your Big Data platform has meta information about the header then you can instruct it to write that information into the first row of an Excel file by applying the option:
- "hadoopoffice.write.header.write": true, if meta data should be written in the first row, false if not. Default: false
Spreadsheet files consume a lot of memory and CPU when writing them. Hence, we introduced in version 1.0.4 the low footprint mode that allows writing of spreadsheet files with a low CPU and memory footprint. For Excel this is based on the Apache POI SXSSF API. Only new Excel files (.xlsx) are supported in this mode. Formulas can be inserted, but will not be evaluated during writing. The following options exist:
- "hadoopoffice.write.lowfootprint" (Before 1.5: "hadoopoffice.write.lowFootprint"): true if Excel (.xlsx) files should be written in low footprint mode, false if not. Default: false
- "hadoopoffice.write.lowfootprint.cacherows" (Before 1.5: "hadoopoffice.write.lowFootprint.cacherows"): Number of rows to cache in-memory before flushing them to disk. Default: 1000
You can write metadata of the document using the following options. Keep in mind that you need to check for each parser what metadata is supported.
- hadoopoffice.write.metadata.*: Write metadata properties of the document. All properties belonging to the base (e.g. hadoopoffice.write.metadata.author for author of the .xls fileformat) will be handed over to the corresponding writer. See writer documentation which properties are supported. See Supported Excel Metadata Attributes.
You can use templates (i.e. normal office documents) as input that are only updated when writing the office document. For example, you can create a diagram in an office spreadsheet and use this spreadsheet as a template. During writing you update the data fields to which the diagram refers and voilà, the spreadsheet has diagrams filled with data from your Big Data processing platform. Available since version 1.0.4. Options:
- hadoopoffice.write.template.file: location of the template. Example: hdfs:///home/user/excel/template.xlsx
- hadoopoffice.write.template.password: Optional, if template is encrypted the password of the template.
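A sketch of the template options; the HDFS path and password are placeholders:

```java
conf.set("hadoopoffice.write.template.file", "hdfs:///home/user/excel/template.xlsx"); // workbook used as template
conf.set("hadoopoffice.write.template.password", "secret");    // only needed if the template itself is encrypted
```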
Note that certain JDKs require optional extensions for best security (e.g. for the Oracle JDK the Java Cryptography Extension). Please also think about hard disk encryption to avoid leakage of decrypted temporary files.
You can use the following options to encrypt a document. The support depends on the corresponding Writer.
- hadoopoffice.write.security.crypt.password: use this password to encrypt the document. Note: there is no check of the password's strength. This is up to the application developer. See Wikipedia and other sources for strong passwords.
- hadoopoffice.write.security.crypt.encrypt.algorithm: use the following algorithm to encrypt. Note that some writers do not support all algorithms and an exception will be thrown if the algorithm is not supported. See corresponding writer documentation for supported algorithms.
- hadoopoffice.write.security.crypt.hash.algorithm: use the following algorithm to hash. Note that some writers do not support all algorithms and an exception will be thrown if the algorithm is not supported. See corresponding writer documentation for supported algorithms.
- hadoopoffice.write.security.crypt.encrypt.mode: use the following mode to encrypt. Note that some writers do not support all modes and an exception will be thrown if the mode is not supported. See the corresponding writer documentation for supported modes.
- hadoopoffice.write.security.crypt.chain.mode: use the following mode to chain. Note that some writers do not support all modes and an exception will be thrown if the mode is not supported. See the corresponding writer documentation for supported modes.
- hadoopoffice.write.security.crypt.linkedworkbooks.*: if set then hadoopoffice will try to decrypt all the linked workbooks for which a password has been specified. If no password is specified then it is assumed that the linked workbook is not encrypted. Example: the property key for the file "linkedworkbook1.xlsx" is "hadoopoffice.write.security.crypt.linkedworkbooks.linkedworkbook1.xlsx"; the value is the password. You must not include path or protocol information in the filename.
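A sketch of an encryption configuration; the algorithm, hash, mode and chaining values shown are only placeholders - consult the writer documentation for the identifiers your writer actually supports:

```java
conf.set("hadoopoffice.write.security.crypt.password", "a-strong-password");  // encrypt the document
conf.set("hadoopoffice.write.security.crypt.encrypt.algorithm", "aes256");    // placeholder value
conf.set("hadoopoffice.write.security.crypt.hash.algorithm", "sha512");       // placeholder value
conf.set("hadoopoffice.write.security.crypt.encrypt.mode", "agile");          // placeholder value
conf.set("hadoopoffice.write.security.crypt.chain.mode", "cbc");              // placeholder value
```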
Since version 1.1.0 you can also store passwords encrypted in a keystore file and use the keystore file to access the encrypted passwords in your Hadoop job (done transparently by HadoopOffice). You can create a keystore with encrypted passwords using keytool (see here for more details): keytool -importpass -alias filename.xlsx -storetype JCEKS -keystore keystore.jceks -storepass keystorepassword
You can use the filename as the alias or any other custom alias, if needed. One keystore can contain multiple passwords for different files.
You can then configure the following options:
- hadoopoffice.write.security.crypt.credential.keystore.file: filename of the keystore containing the encrypted passwords. Can be stored on HDFS
- hadoopoffice.write.security.crypt.credential.keystore.type: type of keystore. Default: JCEKS
- hadoopoffice.write.security.crypt.credential.keystore.password: password of the keystore to decrypt passwords. Note this is not the password of the office document.
- hadoopoffice.write.security.crypt.credential.keystore.alias: alias under which the password for an office document can be found. Default: name of the document without the path. e.g. example.xlsx
Note this feature requires at least JDK8.
As of version 1.1.0 the HadoopOffice library supports signing office documents with a digital cryptographic signature. Currently Excel files in the new Excel format (.xlsx) are supported (find more information here).
You simply configure the following options:
- hadoopoffice.write.security.sign.keystore.file: file where the private key and certificate are stored for signing. Can be a local file or on HDFS
- hadoopoffice.write.security.sign.hash.algorithm: hash algorithm to be used for the signature. See corresponding writer documentation for supported hash algorithms.
- hadoopoffice.write.security.sign.keystore.type: format of the keystore file. Default: PKCS12
- hadoopoffice.write.security.sign.keystore.password: passphrase to access the private key stored in the keystore file
- hadoopoffice.write.security.sign.keystore.alias: alias under which the private key can be found in the keystore file
- hadoopoffice.write.security.sign.ignoreLineBreaks (as of 1.2.0): if set to true then the Java system property (!) org.apache.xml.security.ignoreLineBreaks is set to true if signing is activated. Note: this affects the full JVM on which the code runs and thus may impact other processes running on it. This is needed because otherwise XMLSec would introduce line breaks into signatures > 64 byte (e.g. SHA512), which is not understood by Office (an exception is thrown in this case when writing a file) - see details here. If false, the system property is not touched. Default: false.
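A sketch of a signing configuration; the keystore location, password and alias are placeholders, and the hash algorithm must be one supported by your writer:

```java
conf.set("hadoopoffice.write.security.sign.keystore.file",
    "hdfs:///user/office/signing.p12");                                  // private key + certificate
conf.set("hadoopoffice.write.security.sign.keystore.type", "PKCS12");
conf.set("hadoopoffice.write.security.sign.keystore.password", "keystorepassword");
conf.set("hadoopoffice.write.security.sign.keystore.alias", "signing-key");
conf.set("hadoopoffice.write.security.sign.hash.algorithm", "sha512");   // use at least SHA-256
conf.set("hadoopoffice.write.security.sign.ignoreLineBreaks", "true");   // needed for signatures > 64 byte, e.g. SHA-512
```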
Important: Creating a signature, in low footprint mode as well as in normal mode, requires local temporary storage for the file. It is crucial that you restrict access to the tmp folder to the user executing the job. This depends on your operating system (e.g. /tmp is by default writable by everyone, which is a security issue); you should consult your security expert for this. The location of the tmp folder can additionally be modified by setting the global property java.io.tmpdir.
Please ask your security expert how to create a secure keystore with a good private key/certificate for signing. Additionally, please ask for a secure hash algorithm; use at least SHA-256, because SHA-1/MD5 have been broken.
Please note that you will need to add the following dependencies to your application to use this option:
- org.bouncycastle:bcprov-ext-jdk15on:1.60
- org.bouncycastle:bcpkix-jdk15on:1.60
- org.apache.santuario:xmlsec:2.1.2
Creating a signature requires local temporary space and may have impact on performance.
Some Big Data platform components of HadoopOffice, such as the ones for Flink, Hive and Spark, have special configurations to automatically convert Flink/Spark/Hive datatypes into Excel data. To write them properly to Excel in these cases, a Date, Date/Time and decimal format need to be provided. This can be done by passing the following options (or setting them directly in HadoopOfficeWriteConfiguration) as of version 1.2.0:
- hadoopoffice.write.simple.dateformat (Before 1.5: "hadoopoffice.write.simple.dateFormat"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. Describes the date format to format dates using the BCP47 notation. Note that even in non-US Excel versions Excel stores dates most of the time internally in US format. Leave it empty to use the system locale. Default: "US".
- hadoopoffice.write.simple.datepattern (Before 1.5: "hadoopoffice.write.simple.datePattern"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. Overrides "hadoopoffice.write.simple.dateformat" - describes a date pattern according to the pattern syntax of SimpleDateFormat - you can define any date pattern.
- hadoopoffice.write.simple.datetimeformat (Before 1.5: "hadoopoffice.write.simple.dateTimeFormat"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. Describes the date/time format to format date/timestamps using the BCP47 notation. Leave it empty to use the system locale. Default: "US".
- hadoopoffice.write.simple.datetimepattern (Before 1.5: "hadoopoffice.write.simple.dateTimePattern"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. Overrides "hadoopoffice.write.simple.datetimeformat" - describes a date/time pattern according to the pattern syntax of SimpleDateFormat - you can define any date/time pattern. Defaults to the java.sql.Timestamp pattern if the value cannot be converted according to the pattern or locale.
- hadoopoffice.write.simple.decimalformat (Before 1.5: "hadoopoffice.write.simple.decimalFormat"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. Describes the decimal format for decimal numbers using the BCP47 notation. Leave it empty to use the system locale. Default: "".
- hadoopoffice.write.simple.spilloversheet (as of 1.3.9) (Before 1.5: "hadoopoffice.write.simple.spillOverSheet"): applies only to HadoopOffice components that use the Converter to convert native objects into SpreadSheetCellDAO objects. If the number of rows written exceeds the per-sheet limit of Excel then additional rows are written on subsequent sheets. Note: Excel needs to have enough resources to open such large Excel files. Default: true.
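A sketch of the conversion-related write options, mirroring the read side; the patterns and locales are only examples:

```java
conf.set("hadoopoffice.write.simple.dateformat", "US");
conf.set("hadoopoffice.write.simple.datetimepattern", "yyyy-MM-dd HH:mm:ss");
conf.set("hadoopoffice.write.simple.decimalformat", "de");
conf.set("hadoopoffice.write.simple.spilloversheet", "true"); // continue on a new sheet when the Excel row limit is reached
```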
Additionally, you may use Hadoop Compression during writing of office documents.
- "mapreduce.output.fileoutputformat.compress": Enable compression. True if yes, False if not. Default depends on your Hadoop system configuration
- "mapreduce.output.fileoutputformat.compress.codec": Compression codec class. Default depends on your Hadoop system configuration.
During reading it is automatically detected if files are compressed.
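For example, to write gzip-compressed output (these are standard Hadoop keys; any codec available on your cluster can be used):

```java
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec",
    "org.apache.hadoop.io.compress.GzipCodec");
```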