Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tackle the issue of XML files filtered as binaries in search results #910

Merged
merged 5 commits into from
Feb 6, 2025

Conversation

srijansk
Copy link
Contributor

@srijansk srijansk commented Feb 5, 2025

When skipping a doc, we currently report the detected language as "binary" (if it looks like binary) or "skipped" (if it's skipped for any other reason). Skipped docs are still added to the index and can still be returned as search results, for example if you only match on filename. So sometimes file matches are returned with "skipped" as their language, even though the file path is clearly some other language like XML.

This PR updates the indexing logic to still detect the language even if the document is skipped. However, we avoid passing the contents to the language detection library to avoid running detection on huge files.

Closes https://linear.app/sourcegraph/issue/SPLF-826/xml-files-filtered-as-binaries-in-search-results

Test example
Test example 2

@srijansk srijansk requested a review from jtibshirani February 5, 2025 17:01
@jtibshirani
Copy link
Member

@srijansk I made some changes to the PR:

  • Update description to explain problem and solution
  • Avoid passing doc.Content to language detection library when the document is skipped. I found this cleaner.
  • Remove check for maximum file size, as I'm not sure what this accomplished (?)
  • Add test for language detection

@jtibshirani jtibshirani requested a review from a team February 6, 2025 18:12
@jtibshirani jtibshirani merged commit 6b8e10c into main Feb 6, 2025
9 checks passed
@jtibshirani jtibshirani deleted the srijansk/bugfix/xml branch February 6, 2025 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants