PI-2526 Remove repeated punctuation strings
to prevent model errors due to "too many tokens":
```
Input validation error: `inputs` must have less than 512 tokens Given: 566
```

caused by strings such as:
```
---------------------------------------------------------
Comment added by name on 01/02/2023 at 12:34
Report Edited: 01/02/2023 at 12:34
---------------------------------------------------------
Comment added by name on 01/02/2023 at 12:34
Report Edited: 01/02/2023 at 12:34
...
```
marcus-bcl committed Feb 3, 2025
1 parent e5eae45 commit 3dea72d
Showing 1 changed file with 8 additions and 0 deletions.
@@ -1,6 +1,14 @@
 {
   "description": "Split text into chunks and generate embeddings",
   "processors": [
+    {
+      "gsub": {
+        "tag": "Remove any repeated non-alphanumeric strings. The pattern looks for 2 or more non-alphanumeric characters surrounded by whitespace.",
+        "field": "notes",
+        "pattern": "(^|\\s)[^\\w\\s]{2,}(\\s|$)",
+        "replacement": " "
+      }
+    },
     {
       "text_chunking": {
         "algorithm": {
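
To sanity-check what the `gsub` pattern removes, here is a minimal Python sketch that applies the same regular expression to the sample text from the commit message. The function name `strip_repeated_punctuation` is hypothetical. Using `re.ASCII` to mimic the pipeline's regex engine is an assumption, so treat this as an approximation rather than the processor's exact behaviour.

```python
import re

# Same pattern as the "gsub" processor, written as a Python raw string.
# Assumption: re.ASCII approximates an ASCII-only \w in the pipeline's
# regex engine; Python's default \w would also match non-ASCII letters.
REPEATED_PUNCTUATION = re.compile(r"(^|\s)[^\w\s]{2,}(\s|$)", flags=re.ASCII)


def strip_repeated_punctuation(notes: str) -> str:
    """Replace each whitespace-delimited run of 2+ punctuation characters with a space."""
    return REPEATED_PUNCTUATION.sub(" ", notes)


sample = (
    "---------------------------------------------------------\n"
    "Comment added by name on 01/02/2023 at 12:34\n"
    "Report Edited: 01/02/2023 at 12:34\n"
    "---------------------------------------------------------\n"
    "Comment added by name on 01/02/2023 at 12:34\n"
)

cleaned = strip_repeated_punctuation(sample)
print(repr(cleaned))
```

The separator lines collapse to a single space, while punctuation inside ordinary tokens (`01/02/2023`, `12:34`, `Edited:`) is left alone because it is not a whitespace-delimited run of two or more characters.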
