-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check duplicate keys dataframe rename column #850
Check duplicate keys dataframe rename column #850
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor suggestion :)
lib/explorer/data_frame.ex
Outdated
@@ -3609,6 +3609,7 @@ defmodule Explorer.DataFrame do | |||
df | |||
|
|||
pairs -> | |||
maybe_raise_column_duplicate(pairs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pairs_map
will already be free of duplicates. So maybe we can compare: Enum.count(pairs) == map_size(pairs_map)
and raise if they mismatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a fair point. It doesn't allow us to call out the specific column though, right? I felt that was a useful addition to the developer experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's helpful to not compute which are duplicates unless we have to. We could do something like:
if Enum.count(pairs) != map_size(pairs_map) do
duplicates =
pairs
|> Enum.frequencies_by(fn {col, _} -> col end)
|> Enum.filter(fn {_, count} -> count > 1 end)
|> Enum.map(fn {col, _} -> col end)
raise "found duplicate columns: #{duplicates}"
end
That way we don't traverse the list unless we already know there's an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I agree with this wholeheartedly, and I can make the change. However, there are two points I'm unclear on:
- Traversal of the pairs is probably constrained by the normal size of a dataframe. I would imagine that it would be relatively small in 99% of cases, and the use of
rename/2
is most likely a 1-time operation. That leads me to think that it's a bit of a boondoggle to try and optimize overly much. That's why I accepted the change to use a list, as it's concise, readable, and unlikely to be a performance issue. I admit I am fairly naive in this space though, so this assumption may be poor. - I wanted to verify that
Enum.count/1
would not traverse the list. I'll trust if I'm told that's the case, but here's what I was able to dig up myself:
Enum.count/1
guarded for a list will call toKernel.length/1
Kernel.length/1
will call:erlang.length/1
, which is where I start to get a little shaky.- It looks like
:erlang.length/1
is probably the ERTS length/1 function, whose docs don't mention complexity.
It's at this point I get a little lost, since I'm not great at digging into the erlang source code. I was hoping to find either the implementation for length/1
or a declaration for the List data structure which carried a len[gth]:
field (similar to python) whereby it could be assumed that length/1
would be a constant time operation.
If anyone has the time to help point me in the right direction, I'd love to get a little more savvy on reading the "nuts and bolts" of erlang. Otherwise, as I mentioned, I'm happy to simply implement this as suggested.
p.s. Thank you for introducing the Enum.frequencies_by/2
function. Great Today I Learned (TIL)!
p.p.s to_column_pairs/3
actually traverses the list, so if we wanted to be particularly parsimonious, we could place the check there. I opted not do because it damages the readability, in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Your assumption is correct AFAIK. Readable > optimized for this function, 100%.
- Great point! I had it in my head that
length
was O(1). That appears to be incorrect: https://www.erlang.org/doc/efficiency_guide/commoncaveats#length-1. (Jose would know better than I, to please call me out if that's wrong.)
In light of those points, I retract my statement!
Though I'm glad it resulted in a TIL. And I wish I knew more about Erlang too. One thing at a time :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to request my changes, where we compare if Enum.count(pairs) != map_size(pairs_map) do
. That's an O(n)
operation (Enum.count/1
is O(n)
, map_size/1
is O(1)
). I am afraid maybe_raise_column_duplicate/1
is O(n^2)
, as it traverses once and then again on col in seen
.
Renames could be done programmatically too, for example, by downcasing 2048 columns. Still unlikely to matter but I think avoiding O(n^2)
doesn't hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Man, this has been a really engaging way to get into the Explorer/Elixir/Erlang internals. I wouldn't have seen that coming!
I'm happy to implement the simple solution proposed by José. I would note that it fails to deal with the fact that rename
will also raise a NIF panic when there are duplicated target columns, e.g. rename(df, [a: "first", b: "first"]
. This is a little trickier, because the most reasonable place to deal with that is in to_column_pairs/3
. That leads down a rabbit hole dealing with the other places that this function is used.
The to_column_pairs/3
use in other locations expects to allow LazySeries
as values. In that case, they basically all look the same to any Map
or MapSet
. So you end up throwing up errors from mutate_with
and summarise_with
, as well as their dependent macros.
All this is to say, I spent a lot of time thinking about how to get this to work before I saw a funny little output in one of the tests that I wrote:
thread '<unnamed>' panicked at src/dataframe.rs:536:39:
should rename: Duplicate(ErrString("duplicate column names found"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And that immediately caught my attention. That's where I found this:
#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
df: ExDataFrame,
renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
let mut df = df.clone();
for (original, new_name) in renames {
df.rename(original, new_name).expect("should rename");
}
Ok(ExDataFrame::new(df))
}
The panic is called because of the expect
and is causing the NIF panic. Swap that out for something like:
#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
df: ExDataFrame,
renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
let mut df = df.clone();
for (original, new_name) in renames {
match df.rename(original, new_name){
Ok(df) => df,
Err(e) => return Err(ExplorerError::Polars(e))
};
}
Ok(ExDataFrame::new(df))
}
And you get a runtime error, with the same test output looking more like:
1) test rename/2 with binary tuples and a target column that is duplicated (Explorer.DataFrameTest)
test/explorer/data_frame_test.exs:2790
Expected exception ErlangError but got RuntimeError (Polars Error: duplicate: duplicate column names found)
code: assert_raise ErlangError, ~r"Erlang error: :nif_panicked", fn ->
stacktrace:
(explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:77: Explorer.PolarsBackend.Shared.apply_dataframe/4
test/explorer/data_frame_test.exs:2793: (test)
Now, I like the idea of knowing rust, but I honestly don't right now. This seems like the best solution to me because polars is already doing the check, but I would want someone who knows more about the interface between the two double checking this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I think after reading the docs on handling errors in Rust, this makes the most sense:
#[rustler::nif(schedule = "DirtyCpu")]
pub fn df_rename_columns(
df: ExDataFrame,
renames: Vec<(&str, &str)>,
) -> Result<ExDataFrame, ExplorerError> {
let mut df = df.clone();
for (original, new_name) in renames {
df.rename(original, new_name)?;
}
Ok(ExDataFrame::new(df))
}
The downside from our perspective is that the error returned in several cases is still weird. You would end up seeing RuntimeError (Polars Error: not found: a)
if you duplicated the :a
column name, for example. It might make more sense to capture the error and derive the issue after the fact. This eliminates concern over performing a perhaps costly operation during the expected path, and pushes any extra computation into a sad path.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My view is RuntimeError Polars Error
is better than nif_panic
as it at least hints error has some thing with column a
and it is duplicated. User will figure out something sooner or later.
Also regarding O(1), O(n), etc does it really matter for columns names as they will be less. Time complexity is kind of relevant for rows than columns. I agree with Jose on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pcapel great call on handling it at the Rust side as well. I think the Rust change + the conditional check is the way to go.
* Check pairs for duplicated source column values
* Refactor the reduction to use a list instead of a map. This preserves the ability to name the duplicated column while simplifying the procedure. Co-authored-by: Philip Sampaio <philip.sampaio@gmail.com>
* df.rename propagates error, resulting in RuntimeError for duplicated columns
a30b1c9
to
c0a5afb
Compare
💚 💙 💜 💛 ❤️ |
Adds a column pair check to the rename function prior to passing to Polars. Fixes #848
The root cause of the NIF panic was that Polars is unable to find the column a second time. So when given a rename with the same column name, the second parameter will always fail with
ColumnNotFound
.That could be something that is fixed from within Polars, or we could mirror the error message more directly. Though I don't think that would work well as an error without being able to report the partially updated column names which are stuck down in the Rust code at that point.
I opted for a simple argument error which raises when a duplicated column is found. I assume that there are few enough parameters passed to a rename that iterating them isn't a performance hit in any way that matters.
Performing the check after the column pairs are parsed makes things simplest. I'm not sure if this is something that bears abstracting into the
Shared
module, but I named the function according to the patterns I saw there.I'm not sold on my error message. If there's a consistent way that error messages are written, please advise on that.
I wrote a test for the case of a keyword list only, as the case of a map given to rename will raise a warning to the user alerting them to the duplicate key and it will be overwritten such that this error doesn't occur.