[feature request] Adding boolean query method #240

Closed
alex-au-922 opened this issue Apr 20, 2024 · 5 comments
Labels
feature-parity (Feature parity with upstream tantivy), query (Issues related to the tantivy query engine), question (Further information is requested)

Comments

@alex-au-922
Contributor

Boolean queries can already be expressed through index.parse_query, as long as we prefix terms with the right characters: + for must and - for must_not.
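
As a rough illustration of the string-based approach, here is a minimal sketch. build_query_string is a hypothetical helper, not part of tantivy-py, and it does not escape special characters inside terms:

```python
def build_query_string(must=(), must_not=(), should=()):
    """Join terms into a query string using + (must) and - (must_not) prefixes."""

    def quote(term):
        # Multi-word terms are wrapped in double quotes so they read as phrases.
        return f'"{term}"' if " " in term else term

    parts = [f"+{quote(t)}" for t in must]
    parts += [f"-{quote(t)}" for t in must_not]
    parts += [quote(t) for t in should]  # no prefix: optional term
    return " ".join(parts)


query_string = build_query_string(must=["whale", "white dog"], must_not=["shark"])
print(query_string)  # +whale +"white dog" -shark
```

With a real index, the resulting string would then be handed to index.parse_query.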

However, users may want to build their inner queries dynamically, or, for the sake of readability, use a container for other query types such as FuzzyTermQuery and PhraseQuery.

Currently, the Rust tantivy package allows creating a boolean query through the struct tantivy::query::BooleanQuery. Will tantivy-py also get a boolean_query static method on the Query class?

alex-au-922 changed the title from "Adding boolean query method" to "[feature request] Adding boolean query method" on Apr 20, 2024
@cjrh
Collaborator

cjrh commented Apr 21, 2024

However, users may want to build their inner queries dynamically, or, for the sake of readability, use a container for other query types such as FuzzyTermQuery and PhraseQuery.

I agree with you that parse_query covers a lot of ground, using the tantivy query language. With my maintainer hat on, I see that as less code in tantivy-py, compared to adding extra explicit query types. However, the request does come up a fair bit and so I was wondering whether you could describe a specific use case here?

cjrh added the question, query, and feature-parity labels on Apr 21, 2024
@alex-au-922
Contributor Author

Sure. To give us common ground for the discussion, consider the following Elasticsearch query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "dis_max": {
                        "queries": [
                            {
                                "match": {
                                    "title": {
                                        "query": "sea whale",
                                        "boost": 2
                                    }
                                }
                            },
                            {
                                "match": {
                                    "body": {
                                        "query": "white dog",
                                        "boost": 1.5
                                    }
                                }
                            }
                        ],
                        "tie_breaker": 0.3
                    }
                }
            ]
        }
    }
}

This query cannot be constructed with the current parse_query method, because the tantivy query language cannot express other query types such as regex or disjunction max queries. However, this functionality is available through Rust's BooleanQuery and PyLucene's equivalent method.
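
For readers unfamiliar with dis_max, the scoring rule can be sketched as follows: the best-matching clause's score is taken, and tie_breaker times the scores of the other matching clauses is added on top. This is a minimal sketch of that rule only; dis_max_score is a hypothetical helper, the raw scores are made up for illustration, and boosts are assumed to simply multiply each clause's score:

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Made-up raw scores, for illustration only.
title_score = 1.0 * 2    # "sea whale" on title, boost 2
body_score = 0.8 * 1.5   # "white dog" on body, boost 1.5
print(dis_max_score([title_score, body_score], tie_breaker=0.3))
```

With tie_breaker=0.0 only the best clause counts, and with tie_breaker=1.0 the result is the plain sum of the clause scores.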

For tantivy-py's case, we might consider the following function signature:

class Query:
    ...
    @staticmethod
    def boolean_query(subqueries: Iterator[tuple[Occur, Query]]) -> Query:
        ...

This requires exposing the Occur enum from the tantivy Rust package in tantivy-py.
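
To make the proposed shape concrete, here is a pure-Python sketch that runs without tantivy installed. The Query class below is a stand-in that only records how it was built, term_query and its arguments are assumptions for illustration, and Iterable is used instead of Iterator so that a plain list is accepted. The three Occur members mirror the variants of tantivy::query::Occur:

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Occur(Enum):
    # Mirrors the three variants of tantivy::query::Occur.
    MUST = "must"
    SHOULD = "should"
    MUST_NOT = "must_not"


@dataclass(frozen=True)
class Query:
    """Stand-in for tantivy-py's Query; it only records how it was built."""

    kind: str
    args: tuple = ()

    @staticmethod
    def term_query(field: str, text: str) -> Query:
        # Hypothetical leaf query used to have something to nest.
        return Query("term", (field, text))

    @staticmethod
    def boolean_query(subqueries: Iterable[tuple[Occur, Query]]) -> Query:
        clauses = tuple(subqueries)
        for occur, query in clauses:
            if not isinstance(occur, Occur) or not isinstance(query, Query):
                raise TypeError("each clause must be an (Occur, Query) pair")
        return Query("boolean", clauses)


nested = Query.boolean_query([
    (Occur.MUST, Query.term_query("title", "whale")),
    (Occur.MUST_NOT, Query.term_query("body", "shark")),
])
```

Because boolean_query returns a Query, boolean queries can be nested inside each other to any depth.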

The Elasticsearch query above could then be written as:

Query.boolean_query(
    [
        (
            Occur.MUST,
            Query.dis_max_query(
                [
                    Query.phrase_query("title", "sea whale", boost=2),
                    Query.phrase_query("body", "white dog", boost=1.5)
                ],
                tie_breaker=0.3
            )
        )
    ]
)

which provides three benefits to developers:

  1. Easier syntax (compared to tantivy's query language and even Elasticsearch) and better maintainability.
  2. Easier unit testing, since the query is built block by block instead of being crammed into a single query string.
  3. Support for query types that tantivy's query language does not currently cover, e.g. ConstScoreQuery, DisjunctionMaxQuery, ExistsQuery.
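
The second point can be illustrated with a small sketch. Plain tuples stand in for Occur values and subqueries here, and build_clauses is a hypothetical helper; with a real binding, the resulting list would be what gets passed to boolean_query:

```python
def build_clauses(required, excluded):
    """Turn {field: text} mappings into (occur, subquery) pairs."""
    clauses = [("must", ("term", field, text)) for field, text in required.items()]
    clauses += [("must_not", ("term", field, text)) for field, text in excluded.items()]
    return clauses


def test_build_clauses():
    # Each clause can be checked on its own, with no query string to parse.
    clauses = build_clauses({"title": "whale"}, {"body": "shark"})
    assert clauses == [
        ("must", ("term", "title", "whale")),
        ("must_not", ("term", "body", "shark")),
    ]


test_build_clauses()
```

Since the clauses are ordinary Python values, they can be assembled from user input or configuration at run time and asserted on directly.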

@cjrh
Collaborator

cjrh commented Apr 22, 2024

Thanks for taking the time to write it out. You've explained it well 👍🏼

We are currently tracking progress on wrapping these query types in this comment in #20. I see BooleanQuery is already there along with the disjunction max and the regex query.

@alex-au-922
Contributor Author

Added a pull request for the implementation: #243

@cjrh
Collaborator

cjrh commented Apr 22, 2024

PR has been merged, thanks!

cjrh closed this as completed on Apr 22, 2024
cjrh pushed a commit to cjrh/tantivy-py that referenced this issue Sep 20, 2024