Support Float16 data type #7288
Float16 is a very limited type, and for many (most?) use cases it's quite inaccurate; do you know that your data would actually work with it? (Genuinely curious here, as there aren't that many times where you'd really want to do this ;)
I would also be happy to see f16 support. We use f16 for storage and transfer of large feature data. Conversion is trivial. For the majority of tasks, the memory saving is a massive plus. Loss of precision (especially during arithmetic) is something to think about, but I guess that goes for f32 as well if you push hard enough :)
There is also no native (IEEE half precision) Float16 calculation support on most CPUs, so conversions to Float32 have to be done. If you don't need most of the Float16 columns for calculations (with Polars), you can encode those Float16 columns as UInt16 or Int16 and, after filter, groupby, ... operations on other columns of the dataframe, convert them back to Float16:

```python
df = pl.DataFrame(
    {
        "a": ["some", "text", "!"],
        "b": np.array([5.0, 7.0, 3.0], dtype=np.float16).view(np.uint16),
    }
)

df.filter(pl.col("a").str.contains("e"))
```
```
Out[218]:
shape: (2, 2)
┌──────┬───────┐
│ a    ┆ b     │
│ ---  ┆ ---   │
│ str  ┆ u16   │
╞══════╪═══════╡
│ some ┆ 17664 │
│ text ┆ 18176 │
└──────┴───────┘
```
```python
(
    df.filter(
        # Filter on some columns, groupby, ...
        pl.col("a").str.contains("e")
    )
    .with_columns(
        # Convert all encoded Float16 (UInt16) columns to Float32 Polars Series.
        pl.col(pl.UInt16).map(
            lambda x: pl.Series(x.to_numpy().view(np.float16).astype(np.float32))
        )
    )
)
```
```
Out[219]:
shape: (2, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ str  ┆ f32 │
╞══════╪═════╡
│ some ┆ 5.0 │
│ text ┆ 7.0 │
└──────┴─────┘
```
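The zero-copy trick above can be sketched end to end in plain NumPy (a sketch with the same example values; Polars only ever stores the UInt16 payload, and the `view` calls reinterpret the same bytes without copying):

```python
import numpy as np

# Original half-precision data.
f16 = np.array([5.0, 7.0, 3.0], dtype=np.float16)

# Reinterpret the same 2-byte payloads as unsigned integers (zero-copy).
# 5.0 in IEEE 754 half precision is 0x4500 = 17664.
encoded = f16.view(np.uint16)

# Decoding is the same reinterpretation in reverse.
decoded = encoded.view(np.float16)
```

The round trip is lossless because no numeric conversion ever happens; only the dtype label on the buffer changes.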
Does this f16 -> u16 conversion preserve ordering? In other words, can one filter on the encoded column directly?
They should for all positive values. To expand on that, the first bit (the sign bit) is 1 if the floating-point value is negative, and 0 if not. For unsigned integers, this bit would be the most significant, so the ordering would definitely be disrupted as soon as negative values appear.
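A quick NumPy check of the sign-bit problem (a sketch with made-up values):

```python
import numpy as np

vals = np.array([-1.0, 1.0], dtype=np.float16)
bits = vals.view(np.uint16)

# -1.0 has the sign bit set, so its unsigned representation (48128)
# is *larger* than that of 1.0 (15360): unsigned integer ordering is
# broken as soon as negative values appear.
assert bits[0] > bits[1] and vals[0] < vals[1]
```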
If you want to keep the ordering, you need some additional manipulation, as is done when ordering Float32 values in Rust by their integer representation: https://doc.rust-lang.org/src/core/num/f32.rs.html#1338

```python
In [86]: x = np.array([-200.0, -180.00008, -8.0, -5.0, -0.000002, 0.0, 0.0000002, 3.0, 5.0, 7.0, 200.0], dtype=np.float16).view(np.int16)

In [87]: x
Out[87]:
array([ -9664,  -9824, -14336, -15104, -32734,      0,      3,  16896,
        17664,  18176,  23104], dtype=int16)

In [88]: x ^ (((x >> 15).view(np.uint16) >> 1)).view(np.int16)
Out[88]:
array([-23105, -22945, -18433, -17665,    -35,      0,      3,  16896,
        17664,  18176,  23104], dtype=int16)
```
Here are some encoding and decoding functions that will keep the ordering of float16 values when converted to int16:

```python
In [106]: def encode_float16_array_as_ordered_int16_series(array_f16):
     ...:     x = array_f16.view(np.int16)
     ...:     total_order_float16_as_int16 = x ^ (((x >> 15).view(np.uint16) >> 1)).view(np.int16)
     ...:     return pl.Series("", total_order_float16_as_int16)

In [107]: def decode_ordered_int16_series_to_float16_array(series_int16):
     ...:     x = series_int16.to_numpy().view(np.int16)
     ...:     array_float16 = (x ^ (((x >> 15).view(np.uint16) >> 1)).view(np.int16)).view(np.float16)
     ...:     return array_float16

In [108]: x = np.array([-200.0, -180.00008, -8.0, -5.0, -0.000002, 0.0, 0.0000002, 3.0, 5.0, 7.0, 200.0], dtype=np.float16)

In [109]: x
Out[109]:
array([-2.0e+02, -1.8e+02, -8.0e+00, -5.0e+00, -2.0e-06,  0.0e+00,
        1.8e-07,  3.0e+00,  5.0e+00,  7.0e+00,  2.0e+02], dtype=float16)

In [110]: s = encode_float16_array_as_ordered_int16_series(x)

In [111]: s
Out[111]:
shape: (11,)
Series: '' [i16]
[
    -23105
    -22945
    -18433
    -17665
    -35
    0
    3
    16896
    17664
    18176
    23104
]

In [112]: decode_ordered_int16_series_to_float16_array(s)
Out[112]:
array([-2.0e+02, -1.8e+02, -8.0e+00, -5.0e+00, -2.0e-06,  0.0e+00,
        1.8e-07,  3.0e+00,  5.0e+00,  7.0e+00,  2.0e+02], dtype=float16)
```
We have a slew of 0-to-1 score columns generated by ML for ranking purposes. In our case, the numerical precision does not matter too much, as long as it doesn't change the ranking, but the reduction in memory footprint is significant.
Can you scale the score to an integer type instead?
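For the 0-to-1 score use case, one rank-preserving alternative to Float16 is fixed-point quantization into UInt16 (a sketch with hypothetical scores; the rounding step is monotone, so ranking survives up to quantization ties):

```python
import numpy as np

def quantize_scores(scores: np.ndarray) -> np.ndarray:
    # Map [0.0, 1.0] onto the full UInt16 range. Rounding is monotone,
    # so relative order is preserved (nearby scores may land in the
    # same bucket and tie).
    return np.round(scores * 65535).astype(np.uint16)

def dequantize_scores(q: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) / 65535

scores = np.array([0.12, 0.5, 0.999, 0.0], dtype=np.float32)
q = quantize_scores(scores)
assert np.array_equal(np.argsort(q), np.argsort(scores))
```

Compared to Float16, this spends all 16 bits on the [0, 1] interval (uniform steps of about 1.5e-5) instead of on a wide exponent range, which suits scores better.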
Context: I have some paid time to work on Polars. My client would like Float16 support. As a general principle, it seems like "pass through data unchanged from inputs" is worth doing, especially given Polars now supports extension plugins. In this specific case, I know @ritchie46 has been skeptical in the past, so I would like to suggest as a starting point a minimal implementation that would have low maintenance overhead, with the presumption that I do the work. That's it.
I don't think so. I don't know the ins and outs of half floats, but from a usability perspective, I am not happy with a data type that doesn't support the full Polars API. We will get a host of issues asking why some operation doesn't work on Float16. We can include a bunch of caveats in our documentation, but I'm not happy with caveats. If we don't want to fully support Float16, we should not have it as a data type. I can see the potential benefits of allowing a Float16 data type as a 'passenger', but I think the downsides are too significant. Just my 2 cents though.
A comparison to another case may be useful in thinking this through. Arrow supports extension types, which is e.g. how the geoarrow work is being done: https://github.com/geoarrow/geoarrow/blob/main/extension-types.md. There are four things you'd want to do with extension-typed data: load it, save it, pass it through unchanged, and operate on it.

The above suggests that at minimum Polars must support having columns whose data types it can't handle internally; otherwise it would break an important third-party use case. Loading and saving these data types from e.g. Feather is probably worth doing too, and perhaps some sort of registry mechanism for mapping between Parquet and Arrow extensions for things like GeoParquet <-> GeoArrow. Float16 is a slightly special case in that it's not a custom extension type (it's more built-in to Arrow), but similar concerns apply. And I can see plenty of good reasons for Polars to say it doesn't want to have any Float16 processing, or even casting code, as @stinodego suggests. And that's fine, since casting could be done via a third-party extension namespace... but only if Polars is willing to have column types it doesn't know about, and ideally to load and save the data. Otherwise you're vetoing everyone else from supporting the use case.
Might be useful to have e.g. @kylebarron weigh in on the above, in case I'm wrong about the requirements from Polars for GeoArrow support.
@itamarst My 2 cents is that we could reasonably support a pass-through of such columns.
tl;dr: Similar to what I think @itamarst is advocating for, I would love an arbitrary series implementation defined on top of opaque storage.

Longer digression for GeoArrow: sometimes it's necessary to represent union types. In geospatial you have boolean operations like "intersection". When intersecting two polygons, the output could be a polygon, a line, a point, or a collection of those.

To handle this, you need to be able to represent a column of mixed geometries. GeoArrow will almost certainly represent this as an Arrow union type. (I've also already implemented it on a union type.) Unsurprisingly, Polars doesn't want to support that natively. But then Polars is unable to represent the output of "intersects". (I'm trying to stabilize https://github.com/geoarrow/geoarrow-rs a bit more before dusting off my work on https://github.com/geopolars/geopolars (which will be "just a wrapper" around geoarrow-rs), but I absolutely see geopolars as the way to get my work on geoarrow-rs to a wider audience. And >400 stars on a project that isn't currently functional implies users do want such a thing.)
I am suggesting an opaque binary-backed type, yes.
@itamarst Such a cast operation would still need to know whether it's a Float16 or just arbitrary binary data.
The Arrow standard has a Float16 type. When I say "opaque" I mean "you can't do math or other operations on it directly", in the same way a GeoArrow column would be opaque to Polars.
To summarize the discussion so far, and maybe make more explicit some of the options:

What to do about loading unsupported data types

An unsupported data type is a type that Polars can't do operations on. Examples given were Float16 and Arrow union types. There are three options for unsupported data types: refuse to load them, convert them to a supported type at load time, or pass them through as opaque columns.
What to do about Float16

Options include: full native support, keeping the current cast-to-Float32-at-load behavior, or treating it as an opaque pass-through type as described above.
Any thoughts on the above? I should be able to do the work, but there need to be some policy decisions first.
Would an extension type over binary storage be expressive enough for everyone's use cases?
I think the decision is "not yet". Adding additional data types brings a maintenance burden, and we're not prepared to take that on for the limited benefit that a Float16 type offers right now. I'll leave this open as we would like to support this at some point in the future.
Given this is "not yet", the question is what users can do for now. I could create a temporary third-party extension library... but there's still the problem of Polars' automatic conversion of Float16 to Float32 at load time, and the inability to write Float16.

How would you feel about a minimal patch that just allows, as an option, loading Float16 as 2-length binary, and allows converting 2-length binary to Float16 at write time? That means no new data types, just a little conversion logic, and then all the rest can be in a third-party extension. You could also imagine a generic mapping facility, similar to https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.view but applied at load/save time, which might be more broadly useful.
I'd also be happy with any other suggestion maintainers would find acceptable to make third-party Float16 support possible.
I would want to keep casting Float16 to Float32 on read by default. I'm not sure what the API would be for allowing reading Float16 as binary instead; we don't want to add extra read parameters for this, and adding a binary-to-Float16 cast on write doesn't seem great either. Plus any solution we come up with is going into the bin once we do support Float16. So I don't really see it. What would your proposed API be?
Maybe I'm off base, but wouldn't using or having something similar to a zero-copy view work here?
My thought was basically what @mcrumiller suggested, if going for a general API. Documentation would match how PyArrow describes the equivalent feature: "Return zero-copy 'view' of array as another data type."

The equivalent would also be needed on the write path. This is only a blocker for lazy/streaming operations. For just reading a Parquet file I can use PyArrow, apply the view there, and hand the result to Polars.
The goal, just to reiterate, is to then have Float16 operations in a third-party plugin. The problem is that lacking the ability to load (and ideally save) the data, a third-party plugin is impossible.
Hey Stijn, it's not just a Float16 type being gated behind additional data types. Proper support of geo types is also gated behind it. See Kyle's remark here and in particular here. GeoPolars coming of age would be very good for Polars! I think seriously considering a way for Polars to support custom data types would be very valuable. As of right now I use Polars for pretty much all of my data analysis, and then for the geo bits I have to cast back to pandas, leverage geopandas, and then cast back. These bits of my workflows are predictably slow.
@RmStorm I think you're talking about extension types; there's a separate issue for that here: #9112. We do plan to support this in the future. We are excited about the prospect of GeoPolars and definitely see the value there. But as mentioned before, supporting additional data types is not a small thing. It adds a lot of maintenance burden and we're a small team. Currently we're focused on different features.
Problem description
Hi there,
Is there any plan to add support for Float16? I recently ran into a situation where I have to optimize the memory footprint of my dataframe to some extreme due to the memory limits of our k8s clusters. The biggest area for improvement in my case would be to downcast many of the Float32 columns to Float16.
I can imagine this being useful in many such resource-constrained scenarios too.
Thanks.