Refactored CSVReader/first-class support for handling variable column rows (#80)

* Refactored ParseFlags out of CSVReader
* Got most tests to pass after refactoring feed() out
* CSV guessing is now independent of CSVReader
* Move guess_format() function definition
* Update single header file
* Ignore zero-length rows when guessing
* Attempt to fix Linux issue wrt empty rows
* Refactored ParseFlags out of CSVReader
* Got most tests to pass after refactoring feed() out
* CSV guessing is now independent of CSVReader
* Move guess_format() function definition
* Update single header file
* Ignore zero-length rows when guessing
* Attempt to fix Linux issue wrt empty rows
* Regenerated single header files after rebase from master
* Refactored get_col_names()
* Refactored write_record implementation
* Simplified write_record() even more
* Got rid of write_record()
* Refactored ColNames
* Fixed the dumbest bug ever
* No more CSVCollection
* Got rid of bad row handler
* Simplified CSVReader attributes
* Update csv_reader.cpp
* Code clean up + renaming
* Removed error message for unescaped single quote
* CSVStat tests pass again
* Added some small optimizations
* Simplified CSVRow implementation
* Update csv_reader.cpp
* Added ability to accept/reject/ignore variable length columns
* Fixed warnings
* Updated README
* Added too short/too long distinction
* Update README.md
vincentlaucsb · Mar 12, 2020 · commit cd78068
1 parent de1fa1d
Showing 26 changed files with 2,099 additions and 2,087 deletions.
diff --git a/README.md b/README.md
@@ -14,6 +14,7 @@
    * [Numeric Conversions](#numeric-conversions)
    * [Specifying the CSV Format](#specifying-the-csv-format)
       * [Trimming Whitespace](#trimming-whitespace)
+      * [Handling Variable Numbers of Columns](#handling-variable-numbers-of-columns)
       * [Setting Column Names](#setting-column-names)
    * [Converting to JSON](#converting-to-json)
    * [Parsing an In-Memory String](#parsing-an-in-memory-string)
@@ -31,18 +32,20 @@ This CSV parser uses multiple threads to simulatenously pull data from disk and
 
 On my computer (Intel Core i7-8550U @ 1.80GHz/Toshiba XG5 SSD), it is capable of parsing the [69.9 MB 2015_StateDepartment.csv](https://github.com/vincentlaucsb/csv-data/tree/master/real_data) in 0.33 seconds.
 
-### Robust
-#### RFC 4180 Compliance
-This CSV parser is much more than a fancy string splitter, and follows every guideline from [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.txt). An optional strict parsing mode can be enabled to sniff out errors in files.
+### Robust Yet Flexible
+#### RFC 4180 and Beyond
+This CSV parser is much more than a fancy string splitter, and parses all files following [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.txt).
 
-#### Non-RFC 4180 Deviations
-We know that actual CSV files come with many different quirks. In addition, there are many CSV-inspired formats like tab-separated values. Thus, this CSV library has many features for dealing with this reality:
+However, in reality we know that RFC 4180 is just a suggestion, and there are many "flavors" of CSV such as tab-delimited files. Thus, this library has:
  * Automatic delimiter guessing
  * Ability to ignore comments in leading rows and elsewhere
  * Ability to handle rows of different lengths
 
+By default, rows of variable length are silently ignored, although you may elect to keep them or throw an error.
+
 #### Encoding
-This CSV parser will handle ANSI and UTF-8 encoded files. It does not try to decode UTF-8, except for detecting and stripping byte order marks.
+This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files.
+It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks.
 
 ### Well Tested
 This CSV parser has an extensive test suite and is checked for memory safety with Valgrind. If you still manage to find a bug,
@@ -235,6 +238,25 @@ CSVFormat format;
 format.trim({ ' ', '\t'  });
 ```
 
+#### Handling Variable Numbers of Columns
+Sometimes, the rows in a CSV are not all of the same length. Whether this was intentional or not,
+this library is built to handle all use cases.
+
+```cpp
+CSVFormat format;
+
+// Default: Silently ignoring rows with missing or extraneous columns
+format.variable_columns(false); // Short-hand
+format.variable_columns(VariableColumnPolicy::IGNORE);
+
+// Case 2: Keeping variable-length rows
+format.variable_columns(true); // Short-hand
+format.variable_columns(VariableColumnPolicy::KEEP);
+
+// Case 3: Throwing an error if variable-length rows are encountered
+format.variable_columns(VariableColumnPolicy::THROW);
+```
+
 #### Setting Column Names
 If a CSV file does not have column names, you can specify your own:
 

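The three policies documented in the README hunk above can be sketched in isolation. This is a minimal illustration, not the library's implementation: the enum values are copied from the `csv_format.hpp` hunk in this commit, while `Row`, `policy_from_bool()`, `apply_policy()`, and the error messages are hypothetical names introduced only for this example. It also shows why the `bool` short-hand works: `false` casts to `IGNORE` (0) and `true` casts to `KEEP` (1).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Same enumerators and values as the enum added in csv_format.hpp.
enum class VariableColumnPolicy { THROW = -1, IGNORE = 0, KEEP = 1 };

// Hypothetical row type for this sketch (the library uses CSVRow).
using Row = std::vector<std::string>;

// Hypothetical helper: maps the bool short-hand onto a policy the same
// way a cast from bool does (false -> 0 -> IGNORE, true -> 1 -> KEEP).
constexpr VariableColumnPolicy policy_from_bool(bool keep) {
    return static_cast<VariableColumnPolicy>(keep);
}

// Hypothetical helper: returns the rows that survive under `policy`,
// given the number of columns a well-formed row should have.
std::vector<Row> apply_policy(const std::vector<Row>& rows,
                              std::size_t expected_cols,
                              VariableColumnPolicy policy) {
    assert(expected_cols > 0);
    std::vector<Row> kept;
    for (const Row& row : rows) {
        if (row.size() == expected_cols ||
            policy == VariableColumnPolicy::KEEP) {
            kept.push_back(row);  // well-formed, or kept regardless of length
        } else if (policy == VariableColumnPolicy::THROW) {
            // Assumed wording; the real messages are not shown in this diff.
            throw std::runtime_error(row.size() < expected_cols
                                         ? "Row too short"
                                         : "Row too long");
        }
        // IGNORE: the malformed row is dropped silently.
    }
    return kept;
}
```

Under these assumptions, calling `apply_policy(rows, 2, VariableColumnPolicy::IGNORE)` on rows of lengths 2, 1, 2, and 3 keeps only the two rows of length 2, `KEEP` returns all four, and `THROW` raises `std::runtime_error` at the first mismatched row.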
diff --git a/include/csv.hpp b/include/csv.hpp
@@ -1,5 +1,5 @@
 /*
-CSV for C++, version 1.2.5.1
+CSV for C++, version 1.6.0
 https://github.com/vincentlaucsb/csv-parser
 
 MIT License

diff --git a/include/internal/CMakeLists.txt b/include/internal/CMakeLists.txt
@@ -8,6 +8,8 @@ target_sources(csv
 		csv_format.cpp
 		csv_reader.hpp
 		csv_reader.cpp
+		csv_reader_internals.hpp
+        csv_reader_internals.cpp
 		csv_reader_iterator.cpp
 		csv_row.hpp
 		csv_row.cpp

diff --git a/include/internal/compatibility.hpp b/include/internal/compatibility.hpp
@@ -15,9 +15,6 @@
 // See: https://github.com/nemequ/hedley
 #include "../external/hedley.h"
 
-/** Used to supress unused variable warning in g++ */
-#define SUPPRESS_UNUSED_WARNING(x) (void)x
-
 namespace csv {
     /**
      *  @def IF_CONSTEXPR

diff --git a/include/internal/constants.hpp b/include/internal/constants.hpp
@@ -36,12 +36,12 @@ namespace csv {
         /** For functions that lazy load a large CSV, this determines how
          *  many bytes are read at a time
          */
-        const size_t ITERATION_CHUNK_SIZE = 50000000; // 50MB
+        constexpr size_t ITERATION_CHUNK_SIZE = 50000000; // 50MB
     }
 
+    /** Integer indicating a requested column wasn't found. */
+    constexpr int CSV_NOT_FOUND = -1;
+
     /** Used for counting number of rows */
     using RowCount = long long int;
-
-    class CSVRow;
-    using CSVCollection = std::deque<CSVRow>;
 }
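
A sentinel such as `CSV_NOT_FOUND` is typically returned by a column lookup when the requested name is absent. The sketch below assumes that usage; `index_of()` here is a hypothetical free function written for illustration, and only the constant's name and value come from the hunk above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same name and value as the constant added in constants.hpp.
constexpr int CSV_NOT_FOUND = -1;

// Hypothetical lookup: returns the zero-based position of `name` in
// `col_names`, or CSV_NOT_FOUND if no column has that name.
int index_of(const std::vector<std::string>& col_names,
             const std::string& name) {
    for (std::size_t i = 0; i < col_names.size(); ++i) {
        if (col_names[i] == name) {
            return static_cast<int>(i);
        }
    }
    return CSV_NOT_FOUND;
}
```

Callers would compare the result against `CSV_NOT_FOUND` before using it as an index, since a negative value is never a valid column position.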
diff --git a/include/internal/csv_format.hpp b/include/internal/csv_format.hpp
@@ -12,6 +12,13 @@
 namespace csv {
     class CSVReader;
 
+    /** Determines how to handle rows that are shorter or longer than the majority */
+    enum class VariableColumnPolicy {
+        THROW = -1,
+        IGNORE = 0,
+        KEEP   = 1
+    };
+
     /** Stores the inferred format of a CSV file. */
     struct CSVGuessResult {
         char delim;
@@ -64,11 +71,15 @@ namespace csv {
          */
         CSVFormat& header_row(int row);
 
-        /** Tells the parser to throw an std::runtime_error if an
-         *  invalid CSV sequence is found
-         */
-        CONSTEXPR CSVFormat& strict_parsing(bool is_strict = true) {
-            this->strict = is_strict;
+        /** Tells the parser how to handle rows whose number of columns differs from the others */
+        CONSTEXPR CSVFormat& variable_columns(VariableColumnPolicy policy = VariableColumnPolicy::IGNORE) {
+            this->variable_column_policy = policy;
+            return *this;
+        }
+
+        /** Tells the parser how to handle rows whose number of columns differs from the others (true = KEEP, false = IGNORE) */
+        CONSTEXPR CSVFormat& variable_columns(bool policy) {
+            this->variable_column_policy = (VariableColumnPolicy)policy;
             return *this;
         }
 
@@ -79,7 +90,7 @@ namespace csv {
         }
 
         #ifndef DOXYGEN_SHOULD_SKIP_THIS
-        char get_delim() {
+        char get_delim() const {
             // This error should never be received by end users.
             if (this->possible_delimiters.size() > 1) {
                 throw std::runtime_error("There is more than one possible delimiter.");
@@ -88,9 +99,10 @@ namespace csv {
             return this->possible_delimiters.at(0);
         }
 
-        CONSTEXPR int get_header() {
-            return this->header;
-        }
+        CONSTEXPR int get_header() const { return this->header; }
+        std::vector<char> get_possible_delims() const { return this->possible_delimiters; }
+        std::vector<char> get_trim_chars() const { return this->trim_chars; }
+        CONSTEXPR VariableColumnPolicy get_variable_column_policy() const { return this->variable_column_policy; }
         #endif
 
         /** CSVFormat for guessing the delimiter */
@@ -104,24 +116,13 @@ namespace csv {
             return format;
         }
 
-        /** CSVFormat for strict RFC 4180 parsing */
-        CSV_INLINE static CSVFormat rfc4180_strict() {
-            CSVFormat format;
-            format.delimiter(',')
-                .quote('"')
-                .header_row(0)
-                .detect_bom(true)
-                .strict_parsing(true);
-
-            return format;
-        }
-
-        friend CSVReader;
-    private:
         bool guess_delim() {
             return this->possible_delimiters.size() > 1;
         }
 
+        friend CSVReader;
+
+    private:
         /**< Throws an error if delimiters and trim characters overlap */
         void assert_no_char_overlap();
 
@@ -131,17 +132,17 @@ namespace csv {
         /**< Set of whitespace characters to trim */
         std::vector<char> trim_chars = {};
 
-        /**< Quote character */
-        char quote_char = '"';
-
         /**< Row number with columns (ignored if col_names is non-empty) */
         int header = 0;
 
+        /**< Quote character */
+        char quote_char = '"';
+
         /**< Should be left empty unless file doesn't include header */
         std::vector<std::string> col_names = {};
 
-        /**< RFC 4180 non-compliance -> throw an error */
-        bool strict = false;
+        /**< Allow variable length columns? */
+        VariableColumnPolicy variable_column_policy = VariableColumnPolicy::IGNORE;
 
         /**< Detect and strip out Unicode byte order marks */
         bool unicode_detect = true;