Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Fix multi-lines printing issue for datafusion-cli #14954

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Mar 1, 2025

Which issue does this PR close?

Initial PR:

Rationale for this change

  1. Revert the reverted PR Revert Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead #14948
  2. Fix multi-lines printing issue for datafusion-cli

What changes are included in this PR?

  1. Revert the reverted PR
  2. Fix multi-lines printing issue for datafusion-cli

Are these changes tested?

Yes

Are there any user-facing changes?

Fix the explain format

@zhuqi-lucas zhuqi-lucas marked this pull request as draft March 1, 2025 05:31
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 1, 2025

Sorry for the inconvenience!

We can review and merge this PR after the release: cc @alamb @xudong963

  1. Revert the reverted PR
  2. Fix multi-lines printing issue for datafusion-cli

Meanwhile, i think more forks can also test it, i am not sure if we will have more bugs come since it's a totally new way for datafusion-cli, and i have added more testing cases, thanks!

@zhuqi-lucas
Copy link
Contributor Author

Updated testing result:

DataFusion CLI v45.0.0
> create table foo(x int, y int) as values (1,2), (3,4);
0 row(s) fetched.
Elapsed 0.026 seconds.

> explain select * from foo where x = 4;
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: foo.x = Int32(4)                              |
|               |   TableScan: foo projection=[x, y]                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: x@0 = 4                                 |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
+---------------+-------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.022 seconds.

> explain verbose select * from foo where x = 4;
+------------------------------------------------------------+-------------------------------------------+
| plan_type                                                  | plan                                      |
+------------------------------------------------------------+-------------------------------------------+
| initial_logical_plan                                       | Projection: *                             |
|                                                            |   Filter: foo.x = Int64(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                        |
| logical_plan after expand_wildcard_rule                    | Projection: foo.x, foo.y                  |
|                                                            |   Filter: foo.x = Int64(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after resolve_grouping_function               | SAME TEXT AS ABOVE                        |
| logical_plan after type_coercion                           | Projection: foo.x, foo.y                  |
|                                                            |   Filter: CAST(foo.x AS Int64) = Int64(4) |
|                                                            |     TableScan: foo                        |
| analyzed_logical_plan                                      | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | Projection: foo.x, foo.y                  |
|                                                            |   Filter: foo.x = Int32(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                        |
| logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                        |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                        |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                        |
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_limit                         | SAME TEXT AS ABOVE                        |
| logical_plan after propagate_empty_relation                | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_one_union                     | SAME TEXT AS ABOVE                        |
| logical_plan after filter_null_join_keys                   | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_outer_join                    | SAME TEXT AS ABOVE                        |
| logical_plan after push_down_limit                         | SAME TEXT AS ABOVE                        |
| logical_plan after push_down_filter                        | SAME TEXT AS ABOVE                        |
| logical_plan after single_distinct_aggregation_to_group_by | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                        |
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_group_by_constant             | SAME TEXT AS ABOVE                        |
| logical_plan after optimize_projections                    | Filter: foo.x = Int32(4)                  |
|                                                            |   TableScan: foo projection=[x, y]        |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                        |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                        |
| logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                        |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                        |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                        |
| .                                                          | .                                         |
| .                                                          | .                                         |
| .                                                          | .                                         |
+------------------------------------------------------------+-------------------------------------------+
78 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.005 seconds.

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review March 1, 2025 11:52
@alamb
Copy link
Contributor

alamb commented Mar 1, 2025

Thanks @zhuqi-lucas -- I'll try and check this out soon (likely tomorrow)

@@ -209,14 +211,175 @@ impl PrintFormat {
}
Ok(())
}

#[allow(clippy::too_many_arguments)]
pub fn process_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we now have some more time, I wonder if you could potentially consider refactoring this code now.

I found it somewhat hard to understand when I was trying to debug why the explain plans were not showing up reasonably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for this suggestion, may after this PR finalizing. We can open the #14961 to my branch? Is it a common way for us to collaborate?

@@ -127,9 +107,125 @@ impl PrintOptions {
Ok(())
}

pub async fn print_table_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments here, as well as how it is different than print_batches

@@ -110,7 +90,7 @@ impl PrintOptions {
self.format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I noticed when reviewing this code is that there are different passes for print_batches and print_table_batch

I wonder if there is some way to combine them (so we always use the same printing logic)

@zhuqi-lucas
Copy link
Contributor Author

Thanks @zhuqi-lucas -- I'll try and check this out soon (likely tomorrow)

Thank you @alamb for review and good suggestion! I will try to address it today.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 4, 2025

I am continuing polish the code besides the #14886 which will add the streaming state struct. I believe the code is more readable now.

///
/// The query_start_time is used to calculate the elapsed time for the query.
/// The schema is used to print the header.
pub async fn print_stream(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unified to one API for all cases.

@@ -51,6 +51,163 @@ fn init() {
["--command", "show datafusion.execution.batch_size", "--format", "json", "-q", "-b", "1"],
"[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n"
)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add rich testing cases in cli_integration, it can avoid regression for datafusion-cli changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants