[ngram] Improve tokenization regarding punctuation, @mentions, hashtags #53

Open
4 tasks
andi-halim opened this issue Jan 31, 2025 · 0 comments

@andi-halim
Collaborator

Background
The default ngram analyzer removes punctuation. Code should also be written to do the same for @mentions and hashtags, but this will be implemented later as an n-gram customization.

Problems
Punctuation, @mentions, and hashtags are quite dominant in the output of certain ngram analyses, so we need to make their handling customizable.

Desired Outcome
Tokens are not split on punctuation.

Tasks

  • Punctuation functionality
  • Punctuation implemented in default ngram analyzer
  • @mention filtering functionality
  • hashtag filtering functionality
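
The four tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the analyzer's actual code: `TokenizerOptions`, `tokenize`, `ngrams`, and the option names are all hypothetical, and the sketch assumes tokens are split on whitespace only, with punctuation trimmed from token edges so that words like "don't" stay intact.

```python
import re
import string
from dataclasses import dataclass

# All names below are hypothetical; none come from the project's code base.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Punctuation that may be trimmed without destroying the "@" / "#" prefix.
EDGE_PUNCT = string.punctuation.replace("@", "").replace("#", "")


@dataclass(frozen=True)
class TokenizerOptions:
    strip_punctuation: bool = True  # assumed default, per the Background section
    drop_mentions: bool = False
    drop_hashtags: bool = False


def tokenize(text, options=TokenizerOptions()):
    """Split on whitespace only, so punctuation never splits a token."""
    tokens = []
    for raw in text.lower().split():
        # Trim surrounding punctuation but keep a leading "@" or "#".
        trimmed = raw.strip(EDGE_PUNCT)
        if MENTION_RE.fullmatch(trimmed):
            if options.drop_mentions:
                continue
            token = trimmed
        elif HASHTAG_RE.fullmatch(trimmed):
            if options.drop_hashtags:
                continue
            token = trimmed
        else:
            # Only edge punctuation is removed; "don't" keeps its apostrophe.
            token = raw.strip(string.punctuation)
        if not options.strip_punctuation:
            token = raw  # leave the token exactly as it appeared
        if token:
            tokens.append(token)
    return tokens


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Hey @bob, don't miss #NLP today!"
print(tokenize(text))
# ['hey', '@bob', "don't", 'miss', '#nlp', 'today']

filtered = tokenize(text, TokenizerOptions(drop_mentions=True, drop_hashtags=True))
print(ngrams(filtered, 2))
# [('hey', "don't"), ("don't", 'miss'), ('miss', 'today')]
```

Keeping each behavior behind its own flag would let the default analyzer strip punctuation while leaving @mention and hashtag filtering as opt-in customizations.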
@andi-halim andi-halim self-assigned this Jan 31, 2025