Commit

docs update
krypticmouse committed May 23, 2024
1 parent 65636ee commit 99a60e8
Showing 23 changed files with 409 additions and 10 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
6 changes: 6 additions & 0 deletions docs/blog/.authors.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
authors:
krypticmouse:
name: Herumb Shandilya
description: Creator
avatar: https://avatars.githubusercontent.com/u/43719685?v=4
url: https://github.com/krypticmouse
3 changes: 3 additions & 0 deletions docs/blog/.meta.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
comments: true
hide:
- feedback
2 changes: 2 additions & 0 deletions docs/blog/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# Blog

9 changes: 9 additions & 0 deletions docs/blog/posts/mining-triples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
---
title: Mining Negatives with Pirate
date: 2024-05-22
authors: [krypticmouse]
description: An overview of how to mine negative examples with Pirate.
tags: [data-mining, negative-mining, pirate]
---

# Mining Negatives with Pirate
Empty file added docs/docs/chains/mine_chain.md
Empty file.
95 changes: 95 additions & 0 deletions docs/docs/data/passages.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# Passages

The `Passages` class in Pirate is used to handle and manipulate text data in the form of passages. It inherits from the `BaseData` class and provides additional functionality specific to working with passages.

## Use Case

The `Passages` class is particularly useful when you have a collection of text passages that you want to load, process, and manipulate. It supports loading data from various sources such as JSON files, CSV files, lists, and dictionaries. This flexibility allows you to work with passage data regardless of its original format.

Some common use cases for the `Passages` class include:

- Loading and preprocessing a dataset of text passages for further analysis or mining.
- Extracting specific passages based on certain criteria or conditions.
- Performing text-based operations on the loaded passages, such as filtering, transforming, or aggregating.

## Module Overview

The `Passages` class provides the following key features:

- **Initialization:** You can create a `Passages` object by passing the data source (file path, list, or dictionary) along with optional parameters for specifying the ID key and content key.
- **Data Loading:** The class supports loading passage data from JSON files, CSV files, lists, and dictionaries. It automatically determines the appropriate loading method based on the input format.
- **Data Saving:** You can save the passage data to JSON or CSV files using the `save()` method.
- **Data Access:** The loaded passages can be accessed using dictionary-like syntax, allowing you to retrieve specific passages by their ID or index.
- **Iteration:** The `Passages` object is iterable, enabling you to loop over the loaded passages easily.

## Loading Passages

To load passages into a `Passages` object, you can provide the data in one of three formats:

1. **File Path**: Pass the path to a file containing the passage data. The file should be in a supported format such as JSON or CSV.

2. **List**: Pass a list of passage dictionaries, where each dictionary represents a single passage and contains the necessary fields for id and content.

3. **Dictionary**: Pass a dictionary where the keys are passage IDs and the values are the corresponding passage contents.

Here is an example of loading passages from a file:

```python
from pirate.data import Passages

# Load passages from a JSON file
passages = Passages("passages.json")
```
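For reference, the list and dictionary inputs could look like the following. This is a minimal sketch in plain Python that does not call Pirate: the sample IDs and texts are made up, the field names `pid` and `content` are used for illustration, and serializing the list form as a JSON array is an assumption about the file layout, not a documented guarantee.

```python
import json
import tempfile
from pathlib import Path

# List form: one dictionary per passage, with an ID field and a content field.
passage_list = [
    {"pid": "p1", "content": "Pirates sailed the Caribbean in the 1700s."},
    {"pid": "p2", "content": "Negative mining selects hard non-relevant passages."},
]

# Dictionary form: passage ID -> passage content.
passage_dict = {row["pid"]: row["content"] for row in passage_list}

# Assumed file layout: the list form written out as a JSON array.
json_path = Path(tempfile.mkdtemp()) / "passages.json"
json_path.write_text(json.dumps(passage_list, indent=2), encoding="utf-8")

print(passage_dict)
```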

By default, for JSON files and dictionaries, the `Passages` class assumes that each passage has fields named `pid` and `content`. If your data uses different field names, you can specify custom ID and content keys when initializing the `Passages` object.

Here is an example of loading passages with custom ID and content keys:

```python
from pirate.data import Passages

# Load passages with custom ID and content keys
passages = Passages("passages.json", id_key="passage_id", content_key="passage_text")
```
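Conceptually, custom keys only tell the loader which field holds the ID and which holds the text. The sketch below illustrates that remapping in plain Python. `to_mapping` is a hypothetical helper written for this example, not part of Pirate, and converting IDs to strings is an assumption made here for consistency.

```python
from typing import Any, Dict, Iterable, Mapping


def to_mapping(
    records: Iterable[Mapping[str, Any]],
    id_key: str = "pid",
    content_key: str = "content",
) -> Dict[str, str]:
    """Hypothetical helper: map each record's ID field to its content field."""
    return {str(record[id_key]): record[content_key] for record in records}


records = [
    {"passage_id": 1, "passage_text": "First passage."},
    {"passage_id": 2, "passage_text": "Second passage."},
]

# The custom key names select which fields are read from each record.
mapping = to_mapping(records, id_key="passage_id", content_key="passage_text")
print(mapping)  # {'1': 'First passage.', '2': 'Second passage.'}
```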

Once passages are loaded into a `Passages` object, you can access individual passages using their IDs or indices. The `Passages` class provides a dictionary-like interface for accessing passages, making it easy to retrieve specific passages based on their IDs.

Here is an example of accessing a passage by its ID:

```python
# Access a specific passage by its ID
passage = passages["123"]
```
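To make the idea of a dictionary-like interface concrete, here is a small stand-in container that supports lookup by ID or by position. `PassageStore` is a hypothetical class written for this example. It is not Pirate's implementation, and treating integers as positions is an assumption.

```python
from typing import Dict, Iterator, Union


class PassageStore:
    """Hypothetical stand-in for a dictionary-like passage container."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)
        self._ids = list(self._data)  # preserves insertion order

    def __getitem__(self, key: Union[str, int]) -> str:
        # An integer is treated as a position, a string as a passage ID.
        if isinstance(key, int):
            key = self._ids[key]
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        # Iteration yields passage IDs, as with a plain dictionary.
        return iter(self._ids)

    def __len__(self) -> int:
        return len(self._ids)


store = PassageStore({"123": "A passage about ships.", "456": "A passage about maps."})
print(store["123"])  # lookup by ID
print(store[1])      # lookup by position
```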

## Saving Passages

You can save the loaded passages to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter.

Here is an example of saving passages to a CSV file:

```python
from pirate.data import Passages

# Load passages from a JSON file
passages = Passages("passages.json")

# Save passages to a CSV file
passages.save("passages.csv")
```
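To show how the two output formats differ, the sketch below writes the same records as JSON and as CSV using only the standard library. It does not call `save()`. The column names `pid` and `content` and the exact layout of both files are assumptions for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

passages = {"p1": "First passage.", "p2": "Second passage."}
rows = [{"pid": pid, "content": text} for pid, text in passages.items()]
out_dir = Path(tempfile.mkdtemp())

# JSON: an array with one object per passage.
json_path = out_dir / "passages.json"
json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")

# CSV: a header row followed by one line per passage.
csv_path = out_dir / "passages.csv"
with csv_path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["pid", "content"])
    writer.writeheader()
    writer.writerows(rows)

print(csv_path.read_text(encoding="utf-8"))
```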

## Iterating Over Passages

The `Passages` object is iterable, which means you can easily loop over the loaded passages using a `for` loop. This allows you to perform operations on each passage or extract specific information from the passages.

Here is an example of iterating over the loaded passages:

```python
from pirate.data import Passages

# Load passages from a JSON file
passages = Passages("passages.json")

# Iterate over the loaded passages
for pid in passages:
    print(pid, passages[pid])
```
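Iteration also covers the filtering and aggregation use cases listed earlier. The sketch below uses a plain dictionary as a stand-in for a loaded `Passages` object, on the assumption that iterating yields IDs and indexing by ID returns the passage text. The sample data is made up.

```python
# A plain dictionary stands in for a loaded Passages object.
passages = {
    "p1": "Pirates sailed the Caribbean.",
    "p2": "Dense retrievers need hard negatives.",
    "p3": "Pirates buried treasure on islands.",
}

# Filter: keep IDs whose text mentions a keyword (case-insensitive).
pirate_ids = [pid for pid in passages if "pirates" in passages[pid].lower()]

# Aggregate: average passage length in whitespace-separated tokens.
avg_tokens = sum(len(passages[pid].split()) for pid in passages) / len(passages)

print(pirate_ids)
print(avg_tokens)
```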
75 changes: 75 additions & 0 deletions docs/docs/data/queries.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# Queries

The `Queries` class in Pirate is used to handle and manipulate query data. It inherits from the `BaseData` class, so its functionality, use cases, and utilities are quite similar to those of the `Passages` class.

## Loading Queries

To load queries into a `Queries` object, you can provide the data in one of three formats:

1. **File Path**: Pass the path to a file containing the query data. The file should be in a supported format such as JSON or CSV.

2. **List**: Pass a list of query dictionaries, where each dictionary represents a single query and contains the necessary fields such as `id` and `content`.

3. **Dictionary**: Pass a dictionary where the keys are query IDs and the values are the corresponding query contents.

Here is an example of loading queries from a file:

```python
from pirate.data import Queries

# Load queries from a JSON file
queries = Queries("queries.json")
```
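For a CSV source, the data might look like the sketch below, which parses an in-memory CSV with the standard library and builds the dictionary form. The header names `qid` and `query` and the presence of a header row are assumptions for illustration. Nothing here calls Pirate.

```python
import csv
import io

# Assumed CSV layout: a header row naming the ID and text columns.
csv_text = "qid,query\nq1,who were the pirates\nq2,what is negative mining\n"

# Dictionary form: query ID -> query text.
reader = csv.DictReader(io.StringIO(csv_text))
query_dict = {row["qid"]: row["query"] for row in reader}

print(query_dict)
```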

By default for JSON files and dictionaries, the `Queries` class assumes that the query data contains fields named `qid` and `query` for each query. However, you can specify custom ID and content keys when initializing the `Queries` object if your data has different field names.

Here is an example of loading queries with custom ID and content keys:

```python
from pirate.data import Queries

# Load queries with custom ID and content keys
queries = Queries("queries.json", id_key="query_id", content_key="query_text")
```

Once queries are loaded into a `Queries` object, you can access individual queries using their IDs or indices. The `Queries` class provides a dictionary-like interface for accessing queries, making it easy to retrieve specific queries based on their IDs.

Here is an example of accessing a query by its ID:

```python
# Access a specific query by its ID
query = queries["123"]
```

## Saving Queries

You can save the loaded queries to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter.

Here is an example of saving queries to a CSV file:

```python
from pirate.data import Queries

# Load queries from a JSON file
queries = Queries("queries.json")

# Save queries to a CSV file
queries.save("queries.csv")
```

## Iterating Over Queries

The `Queries` object is iterable, which means you can easily loop over the loaded queries using a `for` loop. This allows you to perform operations on each query or extract specific information from the queries.

Here is an example of iterating over the loaded queries:

```python
from pirate.data import Queries

# Load queries from a JSON file
queries = Queries("queries.json")

# Iterate over the loaded queries
for qid in queries:
    print(qid, queries[qid])
```
67 changes: 67 additions & 0 deletions docs/docs/data/ranking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Ranking Class

The `Ranking` class is designed to handle ranking data in different formats such as JSON, JSONL, and CSV. It provides methods to load rankings from files or lists, save rankings to files, and access individual rankings. Currently, there is no way to pass a dictionary of rankings, though this may be added in the future.

## Loading Rankings

To load rankings into a `Ranking` object, you can provide the data in one of two formats:

1. **File Path**: Pass the path to a file containing the ranking data. The file should be in a supported format such as JSON, JSONL, or CSV.

2. **List**: Pass a list of ranking tuples, where each tuple represents a single ranking and contains the necessary fields such as `qid` and `ranking`.

Here is an example of loading rankings from a file:

```python
from pirate.data import Ranking

# Load rankings from a JSONL file
rankings = Ranking("rankings.jsonl")
```
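A JSONL file holds one JSON value per line. The sketch below parses such data into a list of ranking tuples with the standard library. The `[qid, pid, rank]` layout of each line is an assumption made for this example. The source does not specify the exact fields.

```python
import json

# Assumed layout: one JSON array per line, holding query ID, passage ID, and rank.
jsonl_text = '["q1", "p7", 1]\n["q1", "p3", 2]\n["q2", "p5", 1]\n'

# Parse each non-empty line into a tuple.
rankings = [tuple(json.loads(line)) for line in jsonl_text.splitlines() if line.strip()]

print(rankings[0])
```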

## Saving Rankings

You can save the loaded rankings to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter.

Here is an example of saving rankings to a CSV file:

```python
from pirate.data import Ranking

# Load rankings from a JSONL file
rankings = Ranking("rankings.jsonl")

# Save rankings to a CSV file
rankings.save("rankings.csv")
```

## Accessing Rankings

Once rankings are loaded into a `Ranking` object, you can access individual rankings using their indices. The `Ranking` class provides list-like indexing, allowing you to retrieve specific rankings based on their position in the list.

Here is an example of accessing a ranking by its index:

```python
# Access a specific ranking by its index
ranking = rankings[0]
```

## Getting a Passage Group for a Query

You can get the ranked passages for a query by passing the query ID to the `get_passage_groups()` method. This returns a `pl.DataFrame` containing the passages for that query, in the order specified in the ranking data.

Here is an example of getting a passage group for a query:

```python
# Get the passage group for a query
passage_grp = rankings.get_passage_groups("123")
```
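The grouping idea can be sketched in plain Python: select the rows for one query and order them by rank. This is not how `get_passage_groups()` is implemented, and it returns a list, not a `pl.DataFrame`. The `(qid, pid, rank)` row layout and the helper `passages_for_query` are assumptions for illustration.

```python
from typing import List, Tuple

# Assumed row layout: (query ID, passage ID, rank).
rankings: List[Tuple[str, str, int]] = [
    ("q1", "p3", 2),
    ("q2", "p5", 1),
    ("q1", "p7", 1),
]


def passages_for_query(rows: List[Tuple[str, str, int]], qid: str) -> List[str]:
    """Hypothetical helper: passage IDs for one query, best rank first."""
    matches = [row for row in rows if row[0] == qid]
    return [pid for _, pid, _ in sorted(matches, key=lambda row: row[2])]


print(passages_for_query(rankings, "q1"))  # ['p7', 'p3']
```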

## Iterating Over Rankings

You can iterate over rankings using a `for` loop, which will iterate over each ranking in the order they were loaded. Here is an example of iterating over rankings:

```python
for ranking in rankings:
    print(ranking)
```
79 changes: 79 additions & 0 deletions docs/docs/data/triples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# Triples Class

The `Triples` class is designed to handle triple or N-way tuple data in different formats such as JSON, JSONL, and CSV. It provides methods to load triples from files or lists, save triples to files, and access individual triples. Triples are central to Pirate, which relies heavily on them for negative mining.

## What are Triples?

Triples are widely used in Information Retrieval. Each triple groups a query with a passage that is relevant to it, called the **positive passage**, and a passage that is not relevant to it, called the **negative passage**. The `Triples` class is designed to handle such triple data.

```
[query, positive_passage, negative_passage]
OR
[query_id, positive_passage_id, negative_passage_id]
```

You can also load an *n-way tuple*, which generalizes the triple: instead of just one negative, it holds n-1 negatives. In this sense, a triple is a 2-way tuple.

```
[query, positive_passage, negative_passage_1, negative_passage_2, negative_passage_3, ...]
```
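One way to see the relationship between the two forms is to expand an n-way tuple into one triple per negative. The sketch below does this in plain Python. `expand_to_triples` is a hypothetical helper written for this example, and Pirate may handle n-way tuples differently.

```python
from typing import List, Sequence, Tuple


def expand_to_triples(row: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: one (query, positive, negative) triple per negative."""
    if len(row) < 3:
        raise ValueError("expected a query, a positive, and at least one negative")
    query, positive, *negatives = row
    return [(query, positive, negative) for negative in negatives]


n_way = ["q1", "p_pos", "p_neg1", "p_neg2", "p_neg3"]
for triple in expand_to_triples(n_way):
    print(triple)
```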

## Loading Triples

To load triples into a `Triples` object, you can provide the data in one of three formats:

1. **File Path**: Pass the path to a file containing the triple data. The file should be in a supported format such as JSON, JSONL, or CSV.

2. **List**: Pass a list of triple tuples, where each tuple represents a single triple and contains the necessary fields such as `qid`, `positive_pid`, and `negative_pid`.

3. **Dictionary**: Pass a dictionary where the keys are query IDs and the values are the corresponding positive and negative passage IDs.

Here is an example of loading triples from a file:

```python
from pirate.data import Triples

# Load triples from a JSONL file
triples = Triples("triples.jsonl")
```
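The list and dictionary forms carry the same information. The sketch below converts a list of ID triples into a dictionary keyed by query ID, in plain Python. The shape of the dictionary values, a list of `(positive_pid, negative_pid)` pairs, is an assumption for illustration. The source does not specify it.

```python
from collections import defaultdict

# List form: (query ID, positive passage ID, negative passage ID).
triple_list = [
    ("q1", "p1", "p9"),
    ("q1", "p1", "p4"),
    ("q2", "p2", "p8"),
]

# Dictionary form (assumed shape): query ID -> list of (positive, negative) pairs.
triple_dict = defaultdict(list)
for qid, positive_pid, negative_pid in triple_list:
    triple_dict[qid].append((positive_pid, negative_pid))

print(dict(triple_dict))
```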

## Saving Triples

You can save the loaded triples to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter.

Here is an example of saving triples to a CSV file:

```python
from pirate.data import Triples

# Load triples from a JSONL file
triples = Triples("triples.jsonl")

# Save triples to a CSV file
triples.save("triples.csv")
```

## Accessing Triples

Once triples are loaded into a `Triples` object, you can access individual triples using their indices. The `Triples` class provides list-like indexing, allowing you to retrieve specific triples based on their position in the list.

Here is an example of accessing a triple by its index:

```python
# Access a specific triple by its index
triple = triples[0]
```

## Iterating Over Triples

The `Triples` object is iterable, allowing you to loop over the loaded triples easily. You can use a `for` loop to iterate over the triples and access each one in turn.

Here is an example of iterating over triples:

```python
# Iterate over the loaded triples
for triple in triples:
    print(triple)
```
Empty file.
Empty file.
Binary file added docs/img/favicon.png
Binary file added docs/img/logo.png