diff --git a/docs/source/FAQ/index.rst b/docs/source/FAQ/index.rst
index ffbc386a..915c788c 100644
--- a/docs/source/FAQ/index.rst
+++ b/docs/source/FAQ/index.rst
@@ -47,8 +47,7 @@ Below are some commonly asked questions.
    :animate: fade-in-slide-down
 
    - Dense retrieval: map the text into a single embedding, e.g., `DPR `_, `BGE-v1.5 <../bge/bge_v1_v1.5>`_
-   - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text.
-     e.g., BM25, `unicoil `_, and `splade `_
+   - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, `unicoil `_, and `splade `_
    - Multi-vector retrieval: use multiple vectors to represent a text, e.g., `ColBERT `_.
 
 .. dropdown:: Recommended vector database?
diff --git a/docs/source/Introduction/embedder.rst b/docs/source/Introduction/embedder.rst
new file mode 100644
index 00000000..71da13a3
--- /dev/null
+++ b/docs/source/Introduction/embedder.rst
@@ -0,0 +1,39 @@
+Embedder
+========
+
+.. tip::
+
+   If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!
+
+An embedder, also known as an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
+These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.
+
+A famous demonstration comes from `word2vec `_. It shows how word embeddings capture semantic relationships through vector arithmetic:
+
+.. image:: ../_static/img/word2vec.png
+   :width: 500
+   :align: center
+
+Nowadays, embedders are capable of mapping sentences and even passages into vector space.
+They are widely used in real-world tasks such as retrieval, clustering, etc.
+In the era of LLMs, embedding models play a pivotal role in RAG, enabling LLMs to access and integrate relevant context from vast external datasets.
+
+
+Sparse Vector
+-------------
+
+Sparse vectors usually have high dimensionality with only a few non-zero values, which makes them effective for tasks like keyword matching.
+Typically, though not always, the number of dimensions corresponds to the number of distinct tokens in the vocabulary.
+Each dimension is assigned a value representing the token's relative importance within the document.
+Well-known algorithms for sparse vector embedding include `bag-of-words `_, `TF-IDF `_, `BM25 `_, etc.
+Sparse vector embeddings are well suited to capturing key terms and their corresponding importance within documents.
+
+Dense Vector
+------------
+
+Dense vectors typically use neural networks to map words, sentences, and passages into a fixed-dimension latent vector space.
+We can then compare the similarity of two objects using metrics such as Euclidean distance or cosine similarity.
+These vectors can represent the deeper meaning of sentences, so we can distinguish sentences that use similar words but mean different things, and recognize different styles of speech and writing that express the same thing.
+Instead of counting and matching keywords, dense vector embeddings directly capture the semantics.
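To make the sparse representation concrete, here is a minimal, dependency-free Python sketch; the documents, vocabulary, and helper names are illustrative and not part of any BGE API. It builds bag-of-words sparse vectors (one dimension per vocabulary token) and compares them with cosine similarity, the same metric commonly used for dense embeddings:

```python
from collections import Counter
import math

def bow_vector(text, vocab):
    """Sparse bag-of-words vector: one dimension per vocabulary token."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock prices rose today",
]
# Vocabulary = all distinct tokens across the corpus.
vocab = sorted({tok for d in docs for tok in d.lower().split()})

vecs = [bow_vector(d, vocab) for d in docs]
query = bow_vector("cat on a mat", vocab)

scores = [cosine(query, v) for v in vecs]
# Documents sharing tokens with the query score > 0; the third shares none.
print(scores)
```

A dense embedder would replace ``bow_vector`` with a neural encoder returning a fixed-size vector, but the comparison step stays the same.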
\ No newline at end of file
diff --git a/docs/source/Introduction/index.rst b/docs/source/Introduction/index.rst
index a78f6256..f7ea3c3c 100644
--- a/docs/source/Introduction/index.rst
+++ b/docs/source/Introduction/index.rst
@@ -25,5 +25,6 @@ Quickly get started with:
    :caption: Concept
 
    IR
-   model
+   embedder
+   reranker
    retrieval_demo
\ No newline at end of file
diff --git a/docs/source/Introduction/model.rst b/docs/source/Introduction/reranker.rst
similarity index 52%
rename from docs/source/Introduction/model.rst
rename to docs/source/Introduction/reranker.rst
index 295171f7..05df215e 100644
--- a/docs/source/Introduction/model.rst
+++ b/docs/source/Introduction/reranker.rst
@@ -1,26 +1,5 @@
-Model
-=====
-
-If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!
-
-Embedder
---------
-
-Embedder, or embedding model, bi-encoder, is a model designed to convert data, usually text, codes, or images, into sparse or dense numerical vectors (embeddings) in a high dimensional vector space.
-These embeddings capture the semantic meaning or key features of the input, which enable efficient comparison and analysis.
-
-A very famous demonstration is the example from `word2vec `_. It shows how word embeddings capture semantic relationships through vector arithmetic:
-
-.. image:: ../_static/img/word2vec.png
-   :width: 500
-   :align: center
-
-Nowadays, embedders are capable of mapping sentences and even passages into vector space.
-They are widely used in real world tasks such as retrieval, clustering, etc.
-In the era of LLMs, embedding models play a pivot role in RAG, enables LLMs to access and integrate relevant context from vast external datasets.
-
 Reranker
---------
+========
 
 Reranker, or Cross-Encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
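The bi-encoder vs. cross-encoder distinction can be sketched with stand-in scoring functions. Nothing below loads a real model; ``embed`` and ``cross_score`` are illustrative stubs, not BGE APIs. The point is the interface: an embedder encodes each text independently (so document vectors can be precomputed and indexed), while a reranker sees the query and document together and scores the pair jointly:

```python
def embed(text):
    """Bi-encoder stub: each text is encoded alone, without seeing the query.
    Stand-in 'embedding': token membership over a tiny fixed vocabulary."""
    words = set(text.lower().split())
    vocab = ["cat", "mat", "dog", "stock"]
    return [1.0 if w in words else 0.0 for w in vocab]

def cross_score(query, doc):
    """Cross-encoder stub: query and document are scored jointly.
    Stand-in relevance score: Jaccard overlap of their token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

query = "cat on the mat"
docs = ["the cat sat on the mat", "stock prices rose today"]

# Stage 1: first-pass retrieval with the bi-encoder (cheap, precomputable).
dots = [sum(a * b for a, b in zip(embed(query), embed(d))) for d in docs]

# Stage 2: rerank the candidates with the more expensive cross-encoder.
reranked = sorted(docs, key=lambda d: cross_score(query, d), reverse=True)
print(reranked[0])
```

In a real pipeline, ``embed`` is a model such as a BGE embedder and ``cross_score`` a model such as a BGE reranker; the two-stage structure (fast recall, then precise rescoring) is what this sketch illustrates.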