Commit

Bug workaround (parquet serialization with non str values in Documents column)
picaultj committed Feb 14, 2025
1 parent f3eb5b2 commit 363f44a
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions bertrend_apps/prospective_demo/process_new_data.py
@@ -158,6 +158,13 @@ def train_new_model(
lambda x: list(set(x))
) # Removes duplicates within each list

# FIXME: for some unknown reasons, a few elements in the Documents column are not a str but a
# timestamp (the identifier of current model); this generates errors when trying to serialize the
# df to parquet. The code snippet below is a workaround to avoid this issue.
df["Documents"] = df["Documents"].apply(
lambda l: [x if isinstance(x, str) else "" for x in l]
)

output_path = interpretation_path / f"{df_name}.parquet"
df.to_parquet(output_path)
logger.success(f"{df_name} saved to: {output_path}")
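
The workaround in this hunk can be tried in isolation. The following is a minimal sketch, not code from the repository: the helper name `sanitize_documents`, its `placeholder` parameter, and the sample row are hypothetical, and a `datetime` value stands in for the stray timestamp mentioned in the FIXME.

```python
from datetime import datetime


def sanitize_documents(docs, placeholder=""):
    """Return a copy of docs with every non-str element replaced by placeholder."""
    # Same rule as the lambda in the commit: keep strings, blank out anything else.
    return [x if isinstance(x, str) else placeholder for x in docs]


# Hypothetical row of the Documents column; the second element is not a str.
row = ["first document", datetime(2025, 2, 14), "second document"]
cleaned = sanitize_documents(row)
print(cleaned)

# With a pandas DataFrame, the same helper could be applied per row:
# df["Documents"] = df["Documents"].apply(sanitize_documents)
```

Replacing the offending values with an empty string keeps the list length unchanged but discards the value. Whether that is preferable to dropping the element or converting it with `str(x)` depends on how the column is consumed downstream, which this commit does not state.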