Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent
65636ee
commit 99a60e8
Showing
23 changed files
with
409 additions
and
10 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
authors: | ||
  krypticmouse: | ||
    name: Herumb Shandilya | ||
    description: Creator | ||
    avatar: https://avatars.githubusercontent.com/u/43719685?v=4 | ||
    url: https://github.com/krypticmouse |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
comments: true | ||
hide: | ||
- feedback |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Blog | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
title: Mining Negatives with Pirate | ||
date: 2024-05-22 | ||
authors: [krypticmouse] | ||
description: An overview of how to mine negative examples with Pirate. | ||
tags: [data-mining, negative-mining, pirate] | ||
--- | ||
|
||
# Mining Negatives with Pirate |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
# Passages | ||
|
||
The `Passages` class in Pirate is used to handle and manipulate text data in the form of passages. It inherits from the `BaseData` class and provides additional functionality specific to working with passages. | ||
|
||
## Use Case | ||
|
||
The `Passages` class is particularly useful when you have a collection of text passages that you want to load, process, and manipulate. It supports loading data from various sources such as JSON files, CSV files, lists, and dictionaries. This flexibility allows you to work with passage data regardless of its original format. | ||
|
||
Some common use cases for the `Passages` class include: | ||
|
||
- Loading and preprocessing a dataset of text passages for further analysis or mining. | ||
- Extracting specific passages based on certain criteria or conditions. | ||
- Performing text-based operations on the loaded passages, such as filtering, transforming, or aggregating. | ||
|
||
## Module Overview | ||
|
||
The `Passages` class provides the following key features: | ||
|
||
- **Initialization:** You can create a `Passages` object by passing the data source (file path, list, or dictionary) along with optional parameters for specifying the ID key and content key. | ||
- **Data Loading:** The class supports loading passage data from JSON files, CSV files, lists, and dictionaries. It automatically determines the appropriate loading method based on the input format. | ||
- **Data Saving:** You can save the passage data to JSON or CSV files using the `save()` method. | ||
- **Data Access:** The loaded passages can be accessed using dictionary-like syntax, allowing you to retrieve specific passages by their ID or index. | ||
- **Iteration:** The `Passages` object is iterable, enabling you to loop over the loaded passages easily. | ||
|
||
## Loading Passages | ||
|
||
To load passages into a `Passages` object, you can provide the data in one of three formats: | ||
|
||
1. **File Path**: Pass the path to a file containing the passage data. The file should be in a supported format such as JSON or CSV. | ||
|
||
2. **List**: Pass a list of passage dictionaries, where each dictionary represents a single passage and contains the necessary fields for id and content. | ||
|
||
3. **Dictionary**: Pass a dictionary where the keys are passage IDs and the values are the corresponding passage contents. | ||
|
||
Here is an example of loading passages from a file: | ||
|
||
```python | ||
from pirate.data import Passages | ||
|
||
# Load passages from a JSON file | ||
passages = Passages("passages.json") | ||
``` | ||
|
||
By default, for JSON files and dictionaries, the `Passages` class assumes that each passage has fields named `pid` and `content`. If your data uses different field names, specify custom ID and content keys when initializing the `Passages` object. | ||
|
||
Here is an example of loading passages with custom ID and content keys: | ||
|
||
```python | ||
from pirate.data import Passages | ||
|
||
# Load passages with custom ID and content keys | ||
passages = Passages("passages.json", id_key="passage_id", content_key="passage_text") | ||
``` | ||
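
The relationship between the list form and the dictionary form, and the role of the ID and content keys, can be sketched in plain Python. This is only an illustration under stated assumptions: `records_to_mapping` is a hypothetical helper, not part of Pirate, and the sample records are made up. It does not show how `Passages` stores data internally.

```python
from typing import Dict, List


def records_to_mapping(
    records: List[dict],
    id_key: str = "pid",
    content_key: str = "content",
) -> Dict[str, str]:
    """Convert a list of passage records into an {id: content} mapping.

    Hypothetical helper for illustration; it is not part of Pirate.
    """
    mapping = {}
    for record in records:
        # Cast IDs to strings so lookups such as mapping["123"] behave consistently.
        mapping[str(record[id_key])] = record[content_key]
    return mapping


# List form: one dictionary per passage, using custom field names.
passage_list = [
    {"passage_id": 123, "passage_text": "Pirates sailed the Caribbean."},
    {"passage_id": 124, "passage_text": "Negative mining finds hard negatives."},
]

# Dictionary form: passage IDs mapped directly to passage contents.
passage_dict = records_to_mapping(
    passage_list, id_key="passage_id", content_key="passage_text"
)

print(passage_dict["123"])
```

Either shape carries the same information; the key names only matter for the list (and file) forms, where each record has to say which field is the ID and which is the content.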
|
||
Once passages are loaded into a `Passages` object, you can access individual passages using their IDs or indices. The `Passages` class provides a dictionary-like interface for accessing passages, making it easy to retrieve specific passages based on their IDs. | ||
|
||
Here is an example of accessing a passage by its ID: | ||
|
||
```python | ||
# Access a specific passage by its ID | ||
passage = passages["123"] | ||
``` | ||
|
||
## Saving Passages | ||
|
||
You can save the loaded passages to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter. | ||
|
||
Here is an example of saving passages to a CSV file: | ||
|
||
```python | ||
from pirate.data import Passages | ||
|
||
# Load passages from a JSON file | ||
passages = Passages("passages.json") | ||
|
||
# Save passages to a CSV file | ||
passages.save("passages.csv") | ||
``` | ||
|
||
## Iterating Over Passages | ||
|
||
The `Passages` object is iterable, which means you can easily loop over the loaded passages using a `for` loop. This allows you to perform operations on each passage or extract specific information from the passages. | ||
|
||
Here is an example of iterating over the loaded passages: | ||
|
||
```python | ||
from pirate.data import Passages | ||
|
||
# Load passages from a JSON file | ||
passages = Passages("passages.json") | ||
|
||
# Iterate over the loaded passages | ||
for pid in passages: | ||
    print(pid, passages[pid]) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
# Queries | ||
|
||
The `Queries` class in Pirate is used to handle and manipulate query data. It inherits from the `BaseData` class, so its functionality, use cases, and utilities are very similar to those of the `Passages` class. | ||
|
||
## Loading Queries | ||
|
||
To load queries into a `Queries` object, you can provide the data in one of three formats: | ||
|
||
1. **File Path**: Pass the path to a file containing the query data. The file should be in a supported format such as JSON or CSV. | ||
|
||
2. **List**: Pass a list of query dictionaries, where each dictionary represents a single query and contains the necessary fields such as `id` and `content`. | ||
|
||
3. **Dictionary**: Pass a dictionary where the keys are query IDs and the values are the corresponding query contents. | ||
|
||
Here is an example of loading queries from a file: | ||
|
||
```python | ||
from pirate.data import Queries | ||
|
||
# Load queries from a JSON file | ||
queries = Queries("queries.json") | ||
``` | ||
|
||
By default, for JSON files and dictionaries, the `Queries` class assumes that each query has fields named `qid` and `query`. If your data uses different field names, specify custom ID and content keys when initializing the `Queries` object. | ||
|
||
Here is an example of loading queries with custom ID and content keys: | ||
|
||
```python | ||
from pirate.data import Queries | ||
|
||
# Load queries with custom ID and content keys | ||
queries = Queries("queries.json", id_key="query_id", content_key="query_text") | ||
``` | ||
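
To try the loading examples above, you need a query file on disk. The following sketch writes one with the standard library only. The layouts are assumptions, not confirmed by the Pirate documentation: a JSON array of records and a CSV file with a header row, both using the default `qid` and `query` field names. The sample queries are invented.

```python
import csv
import json
import tempfile
from pathlib import Path

# Sample queries using the default field names mentioned above.
sample_queries = [
    {"qid": "123", "query": "what is negative mining"},
    {"qid": "124", "query": "how are passages ranked"},
]

out_dir = Path(tempfile.mkdtemp())

# JSON: a single array of records (assumed layout).
json_path = out_dir / "queries.json"
json_path.write_text(json.dumps(sample_queries, indent=2), encoding="utf-8")

# CSV: a header row followed by one row per query (assumed layout).
csv_path = out_dir / "queries.csv"
with csv_path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["qid", "query"])
    writer.writeheader()
    writer.writerows(sample_queries)

# Read both files back to confirm they hold the same records.
loaded_json = json.loads(json_path.read_text(encoding="utf-8"))
with csv_path.open(newline="", encoding="utf-8") as handle:
    loaded_csv = list(csv.DictReader(handle))

print(loaded_json == loaded_csv)
```

If your files use other field names, such as `query_id` and `query_text`, the `id_key` and `content_key` arguments shown above are the place to declare them.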
|
||
Once queries are loaded into a `Queries` object, you can access individual queries using their IDs or indices. The `Queries` class provides a dictionary-like interface for accessing queries, making it easy to retrieve specific queries based on their IDs. | ||
|
||
Here is an example of accessing a query by its ID: | ||
|
||
```python | ||
# Access a specific query by its ID | ||
query = queries["123"] | ||
``` | ||
|
||
## Saving Queries | ||
|
||
You can save the loaded queries to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter. | ||
|
||
Here is an example of saving queries to a CSV file: | ||
|
||
```python | ||
from pirate.data import Queries | ||
|
||
# Load queries from a JSON file | ||
queries = Queries("queries.json") | ||
|
||
# Save queries to a CSV file | ||
queries.save("queries.csv") | ||
``` | ||
|
||
## Iterating Over Queries | ||
|
||
The `Queries` object is iterable, which means you can easily loop over the loaded queries using a `for` loop. This allows you to perform operations on each query or extract specific information from the queries. | ||
|
||
Here is an example of iterating over the loaded queries: | ||
|
||
```python | ||
from pirate.data import Queries | ||
|
||
# Load queries from a JSON file | ||
queries = Queries("queries.json") | ||
|
||
# Iterate over the loaded queries | ||
for qid in queries: | ||
    print(qid, queries[qid]) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# Ranking Class | ||
|
||
The `Ranking` class is designed to handle ranking data in different formats such as JSON, JSONL, and CSV. It provides methods to load rankings from files or lists, save rankings to files, and access individual rankings. Currently there is no way to pass a dictionary of rankings, though this may be supported in the future. | ||
|
||
## Loading Rankings | ||
|
||
To load rankings into a `Ranking` object, you can provide the data in one of two formats: | ||
|
||
1. **File Path**: Pass the path to a file containing the ranking data. The file should be in a supported format such as JSON, JSONL, or CSV. | ||
|
||
2. **List**: Pass a list of ranking tuples, where each tuple represents a single ranking and contains the necessary fields such as `qid` and `ranking`. | ||
|
||
Here is an example of loading rankings from a file: | ||
|
||
```python | ||
from pirate.data import Ranking | ||
|
||
# Load rankings from a JSONL file | ||
rankings = Ranking("rankings.jsonl") | ||
``` | ||
|
||
## Saving Rankings | ||
|
||
You can save the loaded rankings to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter. | ||
|
||
Here is an example of saving rankings to a CSV file: | ||
|
||
```python | ||
from pirate.data import Ranking | ||
|
||
# Load rankings from a JSONL file | ||
rankings = Ranking("rankings.jsonl") | ||
|
||
# Save rankings to a CSV file | ||
rankings.save("rankings.csv") | ||
``` | ||
|
||
## Accessing Rankings | ||
|
||
Once rankings are loaded into a `Ranking` object, you can access individual rankings using their indices. The `Ranking` class provides list-like indexing, allowing you to retrieve specific rankings based on their position in the list. | ||
|
||
Here is an example of accessing a ranking by its index: | ||
|
||
```python | ||
# Access a specific ranking by its index | ||
ranking = rankings[0] | ||
``` | ||
|
||
## Getting a Passage Group for a Query | ||
|
||
You can get the ranked passages for a query by passing the query ID to the `get_passage_groups()` method. It returns a `pl.DataFrame` containing the passages for that query, in the order specified in the ranking data. | ||
|
||
Here is an example of getting a passage group for a query: | ||
|
||
```python | ||
# Get the passage group for a query | ||
passage_grp = rankings.get_passage_groups("123") | ||
``` | ||
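
Conceptually, a passage group is the set of passages for one query, ordered by rank. The sketch below illustrates that grouping in plain Python. It rests on assumptions: the `(qid, pid, rank)` row layout and the `group_passages` helper are hypothetical, and the real method returns a `pl.DataFrame` rather than a dictionary of lists.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each row is (query ID, passage ID, rank); a lower rank means more relevant.
ranking_rows: List[Tuple[str, str, int]] = [
    ("123", "p7", 2),
    ("123", "p3", 1),
    ("456", "p9", 1),
    ("123", "p5", 3),
]


def group_passages(rows) -> Dict[str, List[str]]:
    """Group passage IDs by query ID, ordered by rank within each query.

    Hypothetical helper for illustration; it is not part of Pirate.
    """
    grouped = defaultdict(list)
    for qid, pid, rank in rows:
        grouped[qid].append((rank, pid))
    # Sorting the (rank, pid) pairs orders each group by ascending rank.
    return {
        qid: [pid for _, pid in sorted(pairs)]
        for qid, pairs in grouped.items()
    }


groups = group_passages(ranking_rows)
print(groups["123"])
```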
|
||
## Iterating Over Rankings | ||
|
||
You can iterate over rankings using a `for` loop, which will iterate over each ranking in the order they were loaded. Here is an example of iterating over rankings: | ||
|
||
```python | ||
for ranking in rankings: | ||
    print(ranking) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
# Triples Class | ||
|
||
The `Triples` class is designed to handle triple or n-way tuple data in different formats such as JSON, JSONL, and CSV. It provides methods to load triples from files or lists, save triples to files, and access individual triples. Triples are central to Pirate, which relies heavily on them for negative mining. | ||
|
||
## What are Triples? | ||
|
||
Triples are used frequently in information retrieval. Each triple groups a query with a passage that is relevant to it, called the **positive passage**, and a passage that is not relevant to it, called the **negative passage**. The `Triples` class is designed to handle such triple data. | ||
|
||
``` | ||
[query, positive_passage, negative_passage] | ||
OR | ||
[query_id, positive_passage_id, negative_passage_id] | ||
``` | ||
|
||
You can also load an *n-way tuple*, which generalizes a triple: instead of just one negative, it holds n-1 negatives. In that sense, a triple is a 2-way tuple. | ||
|
||
``` | ||
[query, positive_passage, negative_passage_1, negative_passage_2, negative_passage_3, ...] | ||
``` | ||
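
One way to see how the two forms relate is to expand an n-way tuple into one triple per negative. The sketch below does this in plain Python. The `expand_to_triples` helper and the sample IDs are hypothetical and not part of Pirate; the source does not say whether Pirate performs this expansion itself.

```python
from typing import List, Sequence


def expand_to_triples(nway: Sequence[str]) -> List[List[str]]:
    """Split [query, positive, neg_1, ..., neg_k] into one triple per negative.

    Hypothetical helper for illustration; it is not part of Pirate.
    """
    if len(nway) < 3:
        raise ValueError("expected a query, a positive, and at least one negative")
    query, positive, *negatives = nway
    return [[query, positive, negative] for negative in negatives]


nway_tuple = ["q1", "p_pos", "p_neg_1", "p_neg_2", "p_neg_3"]
for triple in expand_to_triples(nway_tuple):
    print(triple)
```

Each resulting triple shares the same query and positive passage and differs only in its negative.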
|
||
## Loading Triples | ||
|
||
To load triples into a `Triples` object, you can provide the data in one of three formats: | ||
|
||
1. **File Path**: Pass the path to a file containing the triple data. The file should be in a supported format such as JSON, JSONL, or CSV. | ||
|
||
2. **List**: Pass a list of triple tuples, where each tuple represents a single triple and contains the necessary fields such as `qid`, `positive_pid`, and `negative_pid`. | ||
|
||
3. **Dictionary**: Pass a dictionary where the keys are query IDs and the values are the corresponding positive and negative passage IDs. | ||
|
||
Here is an example of loading triples from a file: | ||
|
||
```python | ||
from pirate.data import Triples | ||
|
||
# Load triples from a JSONL file | ||
triples = Triples("triples.jsonl") | ||
``` | ||
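
JSONL differs from JSON in that the file holds one JSON value per line instead of a single array. The sketch below writes and reads such a file with the standard library. The row layout, a `[query_id, positive_passage_id, negative_passage_id]` list per line, is an assumption based on the triple format described earlier, and the IDs are invented; check the layout Pirate expects before relying on it.

```python
import json
import tempfile
from pathlib import Path

triples_data = [
    ["q1", "p10", "p42"],
    ["q2", "p11", "p7"],
]

jsonl_path = Path(tempfile.mkdtemp()) / "triples.jsonl"

# JSONL stores one JSON value per line rather than one array for the whole file.
with jsonl_path.open("w", encoding="utf-8") as handle:
    for row in triples_data:
        handle.write(json.dumps(row) + "\n")

# Reading it back: parse each non-empty line independently.
with jsonl_path.open(encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle if line.strip()]

print(loaded)
```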
|
||
## Saving Triples | ||
|
||
You can save the loaded triples to a JSON or CSV file using the `save()` method. By default, the data will be saved in JSON format, but you can specify the file format using the `format` parameter. | ||
|
||
Here is an example of saving triples to a CSV file: | ||
|
||
```python | ||
from pirate.data import Triples | ||
|
||
# Load triples from a JSONL file | ||
triples = Triples("triples.jsonl") | ||
|
||
# Save triples to a CSV file | ||
triples.save("triples.csv") | ||
``` | ||
|
||
## Accessing Triples | ||
|
||
Once triples are loaded into a `Triples` object, you can access individual triples using their indices. The `Triples` class provides list-like indexing, allowing you to retrieve specific triples based on their position in the list. | ||
|
||
Here is an example of accessing a triple by its index: | ||
|
||
```python | ||
# Access a specific triple by its index | ||
triple = triples[0] | ||
``` | ||
|
||
## Iterating Over Triples | ||
|
||
The `Triples` object is iterable, allowing you to loop over the loaded triples easily. You can use a `for` loop to iterate over the triples and access each one in turn. | ||
|
||
Here is an example of iterating over triples: | ||
|
||
```python | ||
# Iterate over the loaded triples | ||
for triple in triples: | ||
    print(triple) | ||
``` |
Empty file.
Empty file.