Refactored CSVReader/first-class support for handling variable column rows (#80)

* Refactored ParseFlags out of CSVReader

* Got most tests to pass after refactoring feed() out

* CSV guessing is now independent of CSVReader

* Move guess_format() function definition

* Update single header file

* Ignore zero-length rows when guessing

* Attempt to fix Linux issue wrt empty rows

* Regenerated single header files after rebase from master

* Refactored get_col_names()

* Refactored write_record implementation

* Simplified write_record() even more

* Got rid of write_record()

* Refactored ColNames

* Fixed the dumbest bug ever

* No more CSVCollection

* Got rid of bad row handler

* Simplified CSVReader attributes

* Update csv_reader.cpp

* Code clean up + renaming

* Removed error message for unescaped single quote

* CSVStat tests pass again

* Added some small optimizations

* Simplified CSVRow implementation

* Update csv_reader.cpp

* Added ability to accept/reject/ignore variable length columns

* Fixed warnings

* Updated README

* Added too short/too long distinction

* Update README.md
vincentlaucsb authored Mar 12, 2020
1 parent de1fa1d commit cd78068
Showing 26 changed files with 2,099 additions and 2,087 deletions.
34 changes: 28 additions & 6 deletions README.md
@@ -14,6 +14,7 @@
* [Numeric Conversions](#numeric-conversions)
* [Specifying the CSV Format](#specifying-the-csv-format)
* [Trimming Whitespace](#trimming-whitespace)
* [Handling Variable Numbers of Columns](#handling-variable-numbers-of-columns)
* [Setting Column Names](#setting-column-names)
* [Converting to JSON](#converting-to-json)
* [Parsing an In-Memory String](#parsing-an-in-memory-string)
@@ -31,18 +32,20 @@ This CSV parser uses multiple threads to simultaneously pull data from disk and

On my computer (Intel Core i7-8550U @ 1.80GHz/Toshiba XG5 SSD), it is capable of parsing the [69.9 MB 2015_StateDepartment.csv](https://github.com/vincentlaucsb/csv-data/tree/master/real_data) in 0.33 seconds.

### Robust
#### RFC 4180 Compliance
This CSV parser is much more than a fancy string splitter, and follows every guideline from [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.txt). An optional strict parsing mode can be enabled to sniff out errors in files.
### Robust Yet Flexible
#### RFC 4180 and Beyond
This CSV parser is much more than a fancy string splitter, and parses all files following [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.txt).

#### Non-RFC 4180 Deviations
We know that actual CSV files come with many different quirks. In addition, there are many CSV-inspired formats like tab-separated values. Thus, this CSV library has many features for dealing with this reality:
However, in reality we know that RFC 4180 is just a suggestion, and there are many "flavors" of CSV such as tab-delimited files. Thus, this library has:
* Automatic delimiter guessing
* Ability to ignore comments in leading rows and elsewhere
* Ability to handle rows of different lengths

By default, rows of variable length are silently ignored, although you may elect to keep them or throw an error.

#### Encoding
This CSV parser will handle ANSI and UTF-8 encoded files. It does not try to decode UTF-8, except for detecting and stripping byte order marks.
This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files.
It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks.
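
As a rough illustration of what "detecting and stripping UTF-8 byte order marks" involves, here is a minimal C++17 sketch. The helper name `strip_utf8_bom` is hypothetical and is not part of this library's API; the only fact relied on is that the UTF-8 BOM is the three-byte sequence `EF BB BF`.

```cpp
#include <string_view>

// Hypothetical helper (not the library's implementation): returns a view of
// `data` without a leading UTF-8 byte order mark, if one is present.
std::string_view strip_utf8_bom(std::string_view data) {
    // The UTF-8 BOM is the three-byte sequence EF BB BF.
    constexpr std::string_view bom = "\xEF\xBB\xBF";

    // substr() clamps the length, so inputs shorter than three bytes are safe.
    if (data.substr(0, bom.size()) == bom) {
        data.remove_prefix(bom.size());
    }
    return data;
}
```

Everything after the BOM is passed through untouched, which matches the "encoding-agnostic" description: the bytes are never decoded, only the marker is dropped.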

### Well Tested
This CSV parser has an extensive test suite and is checked for memory safety with Valgrind. If you still manage to find a bug,
@@ -235,6 +238,25 @@ CSVFormat format;
format.trim({ ' ', '\t' });
```
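
To make the effect of a trim set such as `{ ' ', '\t' }` concrete, the following C++17 sketch strips those characters from both ends of a single field. `trim_field` is a hypothetical helper written for illustration, and it assumes trimming applies only to leading and trailing characters; it is not taken from the library's source.

```cpp
#include <algorithm>
#include <string_view>
#include <vector>

// Hypothetical helper: removes any character listed in `trim_chars` from the
// start and end of `field`. Characters inside the field are left alone.
std::string_view trim_field(std::string_view field, const std::vector<char>& trim_chars) {
    auto is_trim_char = [&trim_chars](char c) {
        return std::find(trim_chars.begin(), trim_chars.end(), c) != trim_chars.end();
    };

    while (!field.empty() && is_trim_char(field.front())) {
        field.remove_prefix(1);
    }
    while (!field.empty() && is_trim_char(field.back())) {
        field.remove_suffix(1);
    }
    return field;
}
```

Under this assumption, a field made up entirely of trim characters becomes empty, and interior spaces or tabs are preserved.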
#### Handling Variable Numbers of Columns
Sometimes, the rows in a CSV are not all of the same length. Whether this was intentional or not,
this library is built to handle all use cases.
```cpp
CSVFormat format;
// Default: Silently ignoring rows with missing or extraneous columns
format.variable_columns(false); // Short-hand
format.variable_columns(VariableColumnPolicy::IGNORE);
// Case 2: Keeping variable-length rows
format.variable_columns(true); // Short-hand
format.variable_columns(VariableColumnPolicy::KEEP);
// Case 3: Throwing an error if variable-length rows are encountered
format.variable_columns(VariableColumnPolicy::THROW);
```
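
The three policies can be pictured with a small self-contained sketch. Only the `VariableColumnPolicy` enum and its values mirror this commit; `Row`, `apply_policy`, and `policy_from_bool` are hypothetical names used for illustration, and the real parser applies the policy while reading rather than on a finished list of rows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Mirrors the enum introduced in this commit.
enum class VariableColumnPolicy { THROW = -1, IGNORE = 0, KEEP = 1 };

// Hypothetical row type for this sketch.
using Row = std::vector<std::string>;

// Same mapping as the bool short-hand: false -> IGNORE (0), true -> KEEP (1).
VariableColumnPolicy policy_from_bool(bool keep) {
    return static_cast<VariableColumnPolicy>(keep);
}

// Hypothetical filter: returns the rows that survive the given policy.
std::vector<Row> apply_policy(const std::vector<Row>& rows,
                              std::size_t expected_cols,
                              VariableColumnPolicy policy) {
    std::vector<Row> kept;
    for (const Row& row : rows) {
        if (row.size() == expected_cols || policy == VariableColumnPolicy::KEEP) {
            kept.push_back(row);
        } else if (policy == VariableColumnPolicy::THROW) {
            // Distinguish too-short from too-long rows in the message.
            throw std::runtime_error(row.size() < expected_cols
                                         ? "Row too short"
                                         : "Row too long");
        }
        // IGNORE: the mismatched row is dropped silently.
    }
    return kept;
}
```

With an expected width of two columns, `IGNORE` keeps only the two-column rows, `KEEP` returns every row unchanged, and `THROW` raises `std::runtime_error` at the first mismatch.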

#### Setting Column Names
If a CSV file does not have column names, you can specify your own:

2 changes: 1 addition & 1 deletion include/csv.hpp
@@ -1,5 +1,5 @@
/*
CSV for C++, version 1.2.5.1
CSV for C++, version 1.6.0
https://github.com/vincentlaucsb/csv-parser
MIT License
2 changes: 2 additions & 0 deletions include/internal/CMakeLists.txt
@@ -8,6 +8,8 @@ target_sources(csv
csv_format.cpp
csv_reader.hpp
csv_reader.cpp
csv_reader_internals.hpp
csv_reader_internals.cpp
csv_reader_iterator.cpp
csv_row.hpp
csv_row.cpp
3 changes: 0 additions & 3 deletions include/internal/compatibility.hpp
@@ -15,9 +15,6 @@
// See: https://github.com/nemequ/hedley
#include "../external/hedley.h"

/** Used to supress unused variable warning in g++ */
#define SUPPRESS_UNUSED_WARNING(x) (void)x

namespace csv {
/**
* @def IF_CONSTEXPR
8 changes: 4 additions & 4 deletions include/internal/constants.hpp
@@ -36,12 +36,12 @@ namespace csv {
/** For functions that lazy load a large CSV, this determines how
* many bytes are read at a time
*/
const size_t ITERATION_CHUNK_SIZE = 50000000; // 50MB
constexpr size_t ITERATION_CHUNK_SIZE = 50000000; // 50MB
}

/** Integer indicating a requested column wasn't found. */
constexpr int CSV_NOT_FOUND = -1;

/** Used for counting number of rows */
using RowCount = long long int;

class CSVRow;
using CSVCollection = std::deque<CSVRow>;
}
57 changes: 29 additions & 28 deletions include/internal/csv_format.hpp
@@ -12,6 +12,13 @@
namespace csv {
class CSVReader;

/** Determines how to handle rows that are shorter or longer than the majority */
enum class VariableColumnPolicy {
THROW = -1,
IGNORE = 0,
KEEP = 1
};

/** Stores the inferred format of a CSV file. */
struct CSVGuessResult {
char delim;
@@ -64,11 +71,15 @@ namespace csv {
*/
CSVFormat& header_row(int row);

/** Tells the parser to throw an std::runtime_error if an
* invalid CSV sequence is found
*/
CONSTEXPR CSVFormat& strict_parsing(bool is_strict = true) {
this->strict = is_strict;
/** Tells the parser how to handle rows of a different length than the others */
CONSTEXPR CSVFormat& variable_columns(VariableColumnPolicy policy = VariableColumnPolicy::IGNORE) {
this->variable_column_policy = policy;
return *this;
}

/** Tells the parser how to handle rows of a different length than the others */
CONSTEXPR CSVFormat& variable_columns(bool policy) {
this->variable_column_policy = (VariableColumnPolicy)policy;
return *this;
}

@@ -79,7 +90,7 @@ namespace csv {
}

#ifndef DOXYGEN_SHOULD_SKIP_THIS
char get_delim() {
char get_delim() const {
// This error should never be received by end users.
if (this->possible_delimiters.size() > 1) {
throw std::runtime_error("There is more than one possible delimiter.");
@@ -88,9 +99,10 @@ namespace csv {
return this->possible_delimiters.at(0);
}

CONSTEXPR int get_header() {
return this->header;
}
CONSTEXPR int get_header() const { return this->header; }
std::vector<char> get_possible_delims() const { return this->possible_delimiters; }
std::vector<char> get_trim_chars() const { return this->trim_chars; }
CONSTEXPR VariableColumnPolicy get_variable_column_policy() const { return this->variable_column_policy; }
#endif

/** CSVFormat for guessing the delimiter */
@@ -104,24 +116,13 @@ namespace csv {
return format;
}

/** CSVFormat for strict RFC 4180 parsing */
CSV_INLINE static CSVFormat rfc4180_strict() {
CSVFormat format;
format.delimiter(',')
.quote('"')
.header_row(0)
.detect_bom(true)
.strict_parsing(true);

return format;
}

friend CSVReader;
private:
bool guess_delim() {
return this->possible_delimiters.size() > 1;
}

friend CSVReader;

private:
/**< Throws an error if delimiters and trim characters overlap */
void assert_no_char_overlap();

@@ -131,17 +132,17 @@ namespace csv {
/**< Set of whitespace characters to trim */
std::vector<char> trim_chars = {};

/**< Quote character */
char quote_char = '"';

/**< Row number with columns (ignored if col_names is non-empty) */
int header = 0;

/**< Quote character */
char quote_char = '"';

/**< Should be left empty unless file doesn't include header */
std::vector<std::string> col_names = {};

/**< RFC 4180 non-compliance -> throw an error */
bool strict = false;
/**< Allow variable length columns? */
VariableColumnPolicy variable_column_policy = VariableColumnPolicy::IGNORE;

/**< Detect and strip out Unicode byte order marks */
bool unicode_detect = true;