Add integration tests using example files from apache/orc #65
Merged
@@ -0,0 +1 @@
These files are imported from [Apache ORC's examples](https://github.com/apache/orc/tree/207085de3722054485e685811f8e5f2e11aa4deb/examples)
@@ -0,0 +1,10 @@
0,a,0.0
1,b,1.1
2,c,2.2
3,d,
4,,4.4
,f,5.5
,,
7,h,7.7
8,i,8.8
9,j,9.9
1 change: 1 addition & 0 deletions
tests/integration/data/TestCSVFileImport.testTimezoneOption.csv
@@ -0,0 +1 @@
2021-12-27 00:00:00.000
Binary file added: tests/integration/data/TestOrcFile.testSargSkipPickupGroupWithoutIndexCPlusPlus.orc (+307 Bytes)
Binary file added: tests/integration/data/TestOrcFile.testSargSkipPickupGroupWithoutIndexJava.orc (+25.9 KB)
Binary file added: tests/integration/data/TestOrcFile.testStringAndBinaryStatistics.orc (+341 Bytes)
Binary file added: tests/integration/data/TestOrcFile.testWithoutCompressionBlockSize.orc (+63 Bytes)
Binary file added: tests/integration/data/corrupt/missing_length_stream_in_string_dict.orc (+1.75 KB)
Binary file added: tests/integration/data/corrupt/stripe_footer_bad_column_encodings.orc (+780 Bytes)
Binary file added: tests/integration/data/expected/TestOrcFile.testMemoryManagementV11.jsn.gz (+13.4 KB)
Binary file added: tests/integration/data/expected/TestOrcFile.testMemoryManagementV12.jsn.gz (+13.4 KB)
Binary file added: tests/integration/data/expected/TestOrcFile.testPredicatePushdown.jsn.gz (+18.6 KB)
Binary file added: tests/integration/data/expected/TestOrcFile.testStringAndBinaryStatistics.jsn.gz (+146 Bytes)
Binary file added: tests/integration/data/expected/TestOrcFile.testStripeLevelStats.jsn.gz (+931 Bytes)
Binary file added: tests/integration/data/expected/TestOrcFile.testUnionAndTimestamp.jsn.gz (+3.35 KB)
Binary file added: tests/integration/data/expected/TestStringDictionary.testRowIndex.jsn.gz (+83.5 KB)
Other binary files are not shown.
@@ -0,0 +1,197 @@
#![allow(non_snake_case)]

//! Tests against `.orc` and `.jsn.gz` in the official test suite (`orc/examples/`)
use std::fs::File;
use std::io::Read;

use pretty_assertions::assert_eq;

use arrow::array::StructArray;
use arrow::record_batch::RecordBatch;
use datafusion_orc::arrow_reader::ArrowReaderBuilder;

/// Checks that parsing a `.orc` file produces the result expected by the matching `.jsn.gz` file
fn test_expected_file(name: &str) {
    let dir = env!("CARGO_MANIFEST_DIR");
    let orc_path = format!("{}/tests/integration/data/{}.orc", dir, name);
    let jsn_gz_path = format!("{}/tests/integration/data/expected/{}.jsn.gz", dir, name);
    let f = File::open(orc_path).expect("Could not open .orc");
    let builder = ArrowReaderBuilder::try_new(f).unwrap();
    let orc_reader = builder.build();
    let total_row_count = orc_reader.total_row_count();

    // Read .orc into JSON objects
    let batches: Vec<RecordBatch> = orc_reader.collect::<Result<Vec<_>, _>>().unwrap();
    let objects: Vec<serde_json::Value> = batches
        .into_iter()
        .map(|batch| -> StructArray { batch.into() })
        .flat_map(|array| {
            arrow_json::writer::array_to_json_array(&array)
                .expect("Could not convert row from .orc to JSON value")
        })
        .collect();

    // Read expected JSON objects
    let mut expected_json = String::new();
    flate2::read::GzDecoder::new(&File::open(jsn_gz_path).expect("Could not open .jsn.gz"))
        .read_to_string(&mut expected_json)
        .expect("Could not read .jsn.gz");

    let objects_count = objects.len();

    // Re-encode the expected input to normalize it
    let expected_lines = expected_json
        .split('\n')
        .filter(|line| !line.is_empty())
        .map(|line| {
            serde_json::from_str::<serde_json::Value>(line)
                .expect("Could not parse line in .jsn.gz")
        })
        .map(|v| {
            serde_json::to_string_pretty(&v).expect("Could not re-serialize line from .jsn.gz")
        })
        .collect::<Vec<_>>()
        .join("\n");

    let lines = objects
        .into_iter()
        .map(|v| serde_json::to_string_pretty(&v).expect("Could not serialize row from .orc"))
        .collect::<Vec<_>>()
        .join("\n");

    if lines.len() < 1000 {
        assert_eq!(lines, expected_lines);
    } else {
        // pretty_assertions consumes too much RAM and CPU on large diffs,
        // and it's unreadable anyway; `get` avoids a panic on short or non-ASCII input
        assert_eq!(lines.get(..1000), expected_lines.get(..1000));
        assert!(lines == expected_lines);
    }

    assert_eq!(total_row_count, objects_count as u64);
}

#[test]
fn columnProjection() {
    test_expected_file("TestOrcFile.columnProjection");
}
#[test]
fn emptyFile() {
    test_expected_file("TestOrcFile.emptyFile");
}
#[test]
#[ignore] // TODO: Why?
fn metaData() {
    test_expected_file("TestOrcFile.metaData");
}
#[test]
#[ignore] // TODO: Why?
fn test1() {
    test_expected_file("TestOrcFile.test1");
}
#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn testDate1900() {
    test_expected_file("TestOrcFile.testDate1900");
}
#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn testDate2038() {
    test_expected_file("TestOrcFile.testDate2038");
}
#[test]
fn testMemoryManagementV11() {
    test_expected_file("TestOrcFile.testMemoryManagementV11");
}
#[test]
fn testMemoryManagementV12() {
    test_expected_file("TestOrcFile.testMemoryManagementV12");
}
#[test]
fn testPredicatePushdown() {
    test_expected_file("TestOrcFile.testPredicatePushdown");
}
#[test]
#[ignore] // TODO: Why?
fn testSeek() {
    test_expected_file("TestOrcFile.testSeek");
}
#[test]
fn testSnappy() {
    test_expected_file("TestOrcFile.testSnappy");
}
#[test]
#[ignore] // TODO: arrow_json does not support binaries
fn testStringAndBinaryStatistics() {
    test_expected_file("TestOrcFile.testStringAndBinaryStatistics");
}
#[test]
fn testStripeLevelStats() {
    test_expected_file("TestOrcFile.testStripeLevelStats");
}
#[test]
#[ignore] // TODO: Non-struct root types are not supported yet
fn testTimestamp() {
    test_expected_file("TestOrcFile.testTimestamp");
}
#[test]
#[ignore] // TODO: Unions are not supported yet
fn testUnionAndTimestamp() {
    test_expected_file("TestOrcFile.testUnionAndTimestamp");
}
#[test]
fn testWithoutIndex() {
    test_expected_file("TestOrcFile.testWithoutIndex");
}
#[test]
fn testLz4() {
    test_expected_file("TestVectorOrcFile.testLz4");
}
#[test]
fn testLzo() {
    test_expected_file("TestVectorOrcFile.testLzo");
}
#[test]
#[ignore] // TODO: Differs on representation of some Decimals
fn decimal() {
    test_expected_file("decimal");
}
#[test]
#[ignore] // TODO: Too slow
fn zlib() {
    test_expected_file("demo-12-zlib");
}
#[test]
#[ignore] // TODO: Why?
fn nulls_at_end_snappy() {
    test_expected_file("nulls-at-end-snappy");
}
#[test]
#[ignore] // TODO: Why?
fn orc_11_format() {
    test_expected_file("orc-file-11-format");
}
#[test]
fn orc_index_int_string() {
    test_expected_file("orc_index_int_string");
}
#[test]
#[ignore] // TODO: not yet implemented
fn orc_split_elim() {
    test_expected_file("orc_split_elim");
}
#[test]
#[ignore] // TODO: not yet implemented
fn orc_split_elim_cpp() {
    test_expected_file("orc_split_elim_cpp");
}
#[test]
#[ignore] // TODO: not yet implemented
fn orc_split_elim_new() {
    test_expected_file("orc_split_elim_new");
}
#[test]
#[ignore] // TODO: not yet implemented
fn over1k_bloom() {
    test_expected_file("over1k_bloom");
}
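A note on the prefix comparison in the large-output branch of `test_expected_file`: indexing a `String` by a byte range panics when the range runs past the end of the string or splits a multi-byte character. Below is a minimal, std-only sketch of a panic-free prefix comparison. `bounded_prefix` and `assert_same_text` are hypothetical names that do not appear in this PR, and the sketch uses the standard `assert_eq!` rather than `pretty_assertions`.

```rust
/// Hypothetical helper (not part of the PR): returns at most `max_bytes` bytes
/// of `s`, backing off to the nearest earlier character boundary so that the
/// slice never panics.
fn bounded_prefix(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // `is_char_boundary(0)` is always true, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Hypothetical helper (not part of the PR): compares two potentially large
/// strings, printing only a short prefix of each when they differ early.
fn assert_same_text(actual: &str, expected: &str) {
    // Readable failure output for differences within the first 1000 bytes.
    assert_eq!(bounded_prefix(actual, 1000), bounded_prefix(expected, 1000));
    // Full comparison without dumping both strings into the failure message.
    assert!(actual == expected, "texts differ beyond the first 1000 bytes");
}
```

On a mismatch only the first 1000 bytes of each side are printed, which keeps the failure output readable; the second assertion still guards the full text.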
I wonder if we can parse the expected JSON into Arrow first with `arrow_json`, then compare on RecordBatches.
Assuming the schema inference works in our favour 🤔
That would avoid the issue of different decimal representation.
However, it makes the tests a little unreliable, as they wouldn't detect data lost by both `arrow_json` and `datafusion_orc`. Probably not a big deal.
That's a fair point. I guess we're a bit handicapped by the expected data being in JSON form which can make it harder for us to be rigorous. I'll raise an issue for exploring ways around this 👍
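One way to stay on the JSON side while tolerating representation differences is to canonicalize decimal text before comparing. The sketch below is std-only and rests on an assumption the thread does not confirm: that the representations differ only in leading or trailing zeros (for example `1.50` versus `1.5`). `normalize_decimal` is a hypothetical helper, not part of `datafusion_orc` or `arrow_json`, and it does not handle exponent notation.

```rust
/// Hypothetical helper (not part of the PR): canonicalizes a plain decimal
/// literal by dropping leading zeros, trailing fractional zeros, and a
/// redundant decimal point. Assumes input like "-12.340"; exponent notation
/// such as "1e3" is NOT handled.
fn normalize_decimal(text: &str) -> String {
    // Split off an optional leading minus sign.
    let (sign, digits) = match text.strip_prefix('-') {
        Some(rest) => ("-", rest),
        None => ("", text),
    };
    // Split into integer and fractional parts.
    let (int_part, frac_part) = match digits.split_once('.') {
        Some((int_part, frac_part)) => (int_part, frac_part),
        None => (digits, ""),
    };
    let int_part = int_part.trim_start_matches('0');
    let frac_part = frac_part.trim_end_matches('0');
    let int_part = if int_part.is_empty() { "0" } else { int_part };

    if frac_part.is_empty() {
        if int_part == "0" {
            // Map "0", "0.0", and "-0.0" to a single canonical zero.
            "0".to_string()
        } else {
            format!("{}{}", sign, int_part)
        }
    } else {
        format!("{}{}.{}", sign, int_part, frac_part)
    }
}
```

Two decimal strings would then be compared as `normalize_decimal(a) == normalize_decimal(b)`. This keeps the comparison independent of `arrow_json`, at the cost of hand-written normalization rules that only cover the assumed differences.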