
moving DLO strategy output table to a StrategyDAO for better testing #294

Open
cbb330 wants to merge 1 commit into base: main

Conversation

@cbb330 (Collaborator) commented Mar 7, 2025

Summary

Problem:
The CREATE TABLE and INSERT INTO statements for the DLO strategy output table are currently untested. This forces more reliance on Docker testing, which is manual and therefore prone to being forgotten or run incorrectly, and delete files cannot easily be tested in Docker. Ref: #287 (comment)

Solution:
IntegrationTest infrastructure that creates data and test tables, executes the DLO strategy, and verifies the output already exists, but it lives in a different module and follows a different paradigm. A StrategyDAO interface also already exists, so it is sensible to put both the tblproperties output and the DLO strategy output table behind that same interface. This enables shared unit tests AND makes the DLO strategy output table testable. A rough sketch of the idea is shown below.
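As a rough illustration of that direction (the interface and class names below are hypothetical and do not reflect the actual OpenHouse signatures), the shared DAO could look like this, with the existing tblproperties writer and the new strategy output table writer behind one contract:

// Hypothetical sketch only; names and signatures are illustrative, not the actual OpenHouse APIs.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface StrategyDao {
  // Persist the strategies generated for a fully qualified table name.
  void save(String fqtn, List<String> strategiesJson);

  // Read them back, e.g. from a shared unit test.
  List<String> load(String fqtn);
}

// Existing sink: strategies serialized into the table's tblproperties
// (write.data-layout.strategies).
class TablePropertiesStrategyDao implements StrategyDao {
  private final Map<String, List<String>> props = new HashMap<>();

  public void save(String fqtn, List<String> strategiesJson) {
    // The real implementation would issue ALTER TABLE ... SET TBLPROPERTIES here.
    props.put(fqtn, strategiesJson);
  }

  public List<String> load(String fqtn) {
    return props.getOrDefault(fqtn, new ArrayList<>());
  }
}

// New sink: strategies written as rows of a dedicated output table
// (e.g. dlo_strategies), created and populated via Spark SQL.
class StrategyOutputTableDao implements StrategyDao {
  private final Map<String, List<String>> rows = new HashMap<>();

  public void save(String fqtn, List<String> strategiesJson) {
    // The real implementation would run CREATE TABLE IF NOT EXISTS + INSERT INTO here.
    rows.put(fqtn, strategiesJson);
  }

  public List<String> load(String fqtn) {
    return rows.getOrDefault(fqtn, new ArrayList<>());
  }
}

With both sinks behind one interface, the same unit tests can be parameterized over either implementation, which brings the CREATE TABLE / INSERT INTO path of the strategy output table under test without Docker.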

Note: in a future PR, I will add delete files to the test table so that we aren't just testing against the default value of 0. Ref: #287 (review)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

docker results (the requirement to do this will be phased out)

➜ docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml up --build
➜ docker exec -it local.spark-master /bin/bash
bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0 \
  --jars openhouse-spark-runtime_2.12-*-all.jar  \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions   \
  --conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog   \
  --conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog     \
  --conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter    \
  --conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080   \
  --conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/$(whoami).token) \
  --conf spark.sql.catalog.openhouse.cluster=LocalHadoopCluster
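// the following commands are run inside the resulting spark-shell session: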
spark.sql("use openhouse");
spark.sql("drop table if exists u_openhouse.dlo_run_part")
spark.sql("drop table if exists u_openhouse.dlo_run")
spark.sql("create table u_openhouse.dlo_run_part (ts timestamp, id int, data string) partitioned by (days(ts), id)").show()
spark.sql("create table u_openhouse.dlo_run (ts timestamp, id int, data string)").show()
spark.sql("INSERT INTO u_openhouse.dlo_run (ts, id, data) VALUES (current_timestamp(), 1, 'data'), (current_timestamp(), 2, 'data'), (current_timestamp(), 3, 'data'), (current_timestamp(), 1, 'data'), (current_timestamp(), 2, 'data'), (current_timestamp(), 3, 'data');")
spark.sql("INSERT INTO u_openhouse.dlo_run_part (ts, id, data) VALUES (current_timestamp(), 1, 'data'), (current_timestamp(), 2, 'data'), (current_timestamp(), 3, 'data'), (current_timestamp(), 1, 'data'), (current_timestamp(), 2, 'data'), (current_timestamp(), 3, 'data');")
spark.sql("select content, file_path, file_size_in_bytes, record_count from u_openhouse.dlo_run.files").show(200, false)
spark.sql("select content, file_path, file_size_in_bytes, record_count, partition from u_openhouse.dlo_run_part.files").show(200, false)
spark.sql("show tables in u_openhouse").show(100, false)
docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml --profile with_jobs_scheduler run openhouse-jobs-scheduler - --type DATA_LAYOUT_STRATEGY_GENERATION --cluster local --tablesURL http://openhouse-tables:8080 --jobsURL http://openhouse-jobs:8080 --tableMinAgeThresholdHours 0
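// back in the spark-shell session, inspect the generated strategies: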
spark.sql("show tblproperties u_openhouse.dlo_run_part ('write.data-layout.strategies')").show(200, false)
spark.sql("show tblproperties u_openhouse.dlo_run ('write.data-layout.strategies')").show(200, false)
spark.sql("select * from u_openhouse.dlo_strategies").show(80,false)
+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key                         |value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|write.data-layout.strategies|[{"score":9.999985377015197,"entropy":2.77080843971934816E17,"cost":0.5000007311503093,"gain":5.0,"config":{"targetByteSize":526385152,"minByteSizeRatio":0.75,"maxByteSizeRatio":10.0,"minInputFiles":5,"maxConcurrentFileGroupRewrites":5,"partialProgressEnabled":true,"partialProgressMaxCommits":1,"maxFileGroupSizeBytes":107374182400,"deleteFileThreshold":2147483647},"posDeleteFileCount":0,"eqDeleteFileCount":0,"posDeleteFileBytes":0,"eqDeleteFileBytes":0,"posDeleteRecordCount":0,"eqDeleteRecordCount":0}]|
+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


+----------------------------+----------------------------------------------------------------------------------------+
|key                         |value                                                                                   |
+----------------------------+----------------------------------------------------------------------------------------+
|write.data-layout.strategies|Table openhouse.u_openhouse.dlo_run does not have property: write.data-layout.strategies|
+----------------------------+----------------------------------------------------------------------------------------+


scala> spark.sql("select * from u_openhouse.dlo_strategies").show(80,false)
+------------------------------------+-----------------------+----------------------+------------------------------+----------------------+---------------------+--------------------+---------------------+--------------------+-----------------------+----------------------+
|fqtn                                |timestamp              |estimated_compute_cost|estimated_file_count_reduction|file_size_entropy     |pos_delete_file_count|eq_delete_file_count|pos_delete_file_bytes|eq_delete_file_bytes|pos_delete_record_count|eq_delete_record_count|
+------------------------------------+-----------------------+----------------------+------------------------------+----------------------+---------------------+--------------------+---------------------+--------------------+-----------------------+----------------------+
|u_openhouse.dlo_strategies          |2025-02-27 08:03:54.471|0.5                   |0.0                           |2.7708015546118544E17 |0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_run_part            |2025-03-07 20:38:47.403|0.500001              |5.0                           |2.77080843971934816E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_run                 |2025-03-07 20:30:13.376|0.500001              |5.0                           |2.77080843971934816E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_strategies          |2025-03-07 20:11:26.879|0.500001              |3.0                           |2.77080117298345248E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_partition_strategies|2025-03-07 20:37:40.361|0.500004              |11.0                          |2.77079862528548448E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_run                 |2025-02-27 07:11:10.589|0.5                   |3.0                           |2.77080843971934848E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.trino_mor_test_table    |2025-03-07 20:29:48.501|0.500001              |5.0                           |2.77080860114398976E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_strategies          |2025-03-07 20:38:16.158|0.500003              |8.0                           |2.77080110543083456E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_strategies          |2025-03-07 20:29:12.733|0.500002              |4.0                           |2.77080115034893824E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_partition_strategies|2025-02-27 08:03:30.822|0.500001              |1.0                           |2.77079875424947968E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_run_part            |2025-03-07 20:28:31.053|0.500001              |5.0                           |2.77080843971934816E17|0                    |0                   |0                    |0                   |0                      |0                     |
|u_openhouse.dlo_run                 |2025-02-27 08:03:03.454|0.500001              |5.0                           |2.77080843971934816E17|0                    |0                   |0                    |0                   |0                      |0                     |
+------------------------------------+-----------------------+----------------------+------------------------------+----------------------+---------------------+--------------------+---------------------+--------------------+-----------------------+----------------------+

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.
