Commit
Merge pull request #50 from inaridiy/feat/docs
Feat/docs
Showing 46 changed files with 5,249 additions and 478 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
--- | ||
"webforai": major | ||
"ai-learning": patch | ||
"translate": patch | ||
"worker": patch | ||
"bench": patch | ||
"site": patch | ||
--- | ||
|
||
New Documentation Site |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
github: inaridiy |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
node_modules | ||
dist | ||
.npmrc | ||
.DS_Store | ||
.DS_Store | ||
.wrangler |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,128 +1 @@ | ||
# Web for AI | ||
|
||
![LOGO](https://github.com/inaridiy/webforai/blob/main/images/logo.webp) | ||
|
||
A library that provides a web interface for AI | ||
|
||
## Features | ||
|
||
- Ultra-precise HTML2Markdown conversion | ||
- Ultra-precise Markdown segmentation based on AST | ||
- Ultra-precise HTML retrieval functions using headless browsers | ||
- The core functionality is edge-native (runs on Cloudflare Workers!!!) | ||
- Command-line interface for easy use without coding | ||
|
||
## Demo | ||
|
||
There is a demo API for Html2Markdown deployed on CloudflareWorker. Please access the following link | ||
|
||
- [NPM Package page](https://webforai.inaridiy.workers.dev/?url=https://www.npmjs.com/package/webforai) | ||
- [Wikipedia of Cloudflare (AI Mode)](https://webforai.inaridiy.workers.dev/?url=https://en.wikipedia.org/wiki/Cloudflare&mode=ai) | ||
|
||
## Installation | ||
|
||
### As a library | ||
To use WebforAI as a library in your project, install it along with playwright: | ||
```bash | ||
pnpm i webforai playwright | ||
``` | ||
|
||
### As a CLI tool | ||
To use WebforAI as a command-line tool, install it globally: | ||
```bash | ||
npm install -g webforai | ||
``` | ||
|
||
## Quick Start | ||
|
||
### Using as a library | ||
Just install and execute scripts | ||
|
||
```js | ||
import { promises as fs } from "fs"; | ||
import { htmlToMarkdown, htmlToMdast } from "webforai"; | ||
import { loadHtml } from "webforai/loaders/playwright"; | ||
|
||
const url = "https://www.npmjs.com/package/webforai"; | ||
const html = await loadHtml(url); | ||
|
||
const markdown = htmlToMarkdown(html, { baseUrl: url }); | ||
|
||
await fs.writeFile("output.md", markdown); | ||
``` | ||
|
||
other examples are in [examples](./examples/simple/src/index.ts) | ||
|
||
### Using as a CLI tool | ||
After installing globally, you can use the webforai command: | ||
```bash | ||
webforai https://www.npmjs.com/package/webforai output.md | ||
``` | ||
This will fetch the HTML from the specified URL, convert it to Markdown, and save it to output.md. | ||
|
||
For more CLI options, run: | ||
```bash | ||
webforai --help | ||
``` | ||
|
||
## Examples | ||
|
||
- [Simple Example](https://github.com/inaridiy/webforai/tree/main/examples/simple/src/index.ts) | ||
- [Scraping With ChatGPT API](https://github.com/inaridiy/webforai/blob/main/examples/scraping/src/index.ts) | ||
- [Translate Markdown with Splitter](https://github.com/inaridiy/webforai/tree/main/examples/translate) | ||
- [Cloudflare Worker with puppeteer & DO](https://github.com/inaridiy/webforai/tree/main/examples/worker) | ||
|
||
## Usage | ||
|
||
### Main Functions | ||
|
||
**`htmlToMarkdown(html: string, options?: HtmlToMarkdownOptions): string`** | ||
Convert HTML to Markdown. By default, unnecessary HTML is excluded and processed. | ||
If `solveLinks` is specified, the relative links in the Mdast will be resolved. | ||
This function just calls htmlToMdast and mdastToMarkdown in that order internally. | ||
|
||
**`htmlToMdast(html: string, options?: HtmlToMdastOptions): Mdast`** | ||
This project uses Hast and Mdast as defined by syntax-tree internally. | ||
This function converts HTML to Mdast, an intermediate representation, which is required when using `mdastSplitter`, etc. | ||
|
||
**`mdastToMarkdown(mdast: Mdast | RootContent[], options?: { solveLinks?: string }): string`** | ||
Convert Mdast to Markdown. If `solveLinks` is specified, the relative links in the Mdast will be resolved. | ||
|
||
### Loader Functions | ||
|
||
**`loadHtml(url: string, options?: LoadHtmlOptions): Promise<string>`** | ||
Load HTML from the specified URL. This function uses Playwright internally. | ||
|
||
### CLI Commands | ||
|
||
**`webforai <source?> <outputPath?> [options]`** | ||
Converts the HTML at the specified URL or path to Markdown and saves it to an output file. | ||
Arguments and options can be specified in the interactive interface even if they are not specified. | ||
- **`source`** | ||
The URL or path to the HTML file to convert. | ||
- **`outputPath`** | ||
The path to save the output Markdown file. | ||
|
||
#### Options | ||
- **`--mode <mode>`** | ||
Specify the mode to use for conversion. Options are `default` and `ai`. Default is `default`. | ||
- **`--loader <loader>`** | ||
Specify the loader to use for fetching HTML. Options are `fetch`, `playwright` and `puppeteer`. Default is `fetch`. | ||
- **`--baseUrl <baseUrl>`** | ||
Specify the base URL to use for relative links in the output Markdown. | ||
- **`-o --stdout`** | ||
Output the converted Markdown to the console instead of saving it to a file. | ||
- **`-d --debug`** | ||
Enable debug mode. This will output additional information to the console. | ||
|
||
**`webforai --help`** | ||
Displays help information for the CLI tool. | ||
|
||
## Contributing | ||
1. Fork it! | ||
2. Create your feature branch: `git checkout -b my-new-feature` | ||
3. Add your changes: `git add .` | ||
4. Commit your changes: `git commit -am 'Add some feature'` | ||
5. Add a changelog: `pnpm changeset` | ||
6. Push to the branch: `git push origin my-new-feature` | ||
7. Submit a pull request :sunglasses: | ||
packages/webforai/README.md |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,65 +1,31 @@ | ||
import { promises as fs } from "node:fs"; | ||
import Anthropic from "@anthropic-ai/sdk"; | ||
import arg from "arg"; | ||
import dotenv from "dotenv"; | ||
import { htmlToMdast, mdastSplitter, mdastToMarkdown } from "webforai"; | ||
import { google } from "@ai-sdk/google"; | ||
import { generateText } from "ai"; | ||
import dotevn from "dotenv"; | ||
import { htmlToMarkdown } from "webforai"; | ||
import { loadHtml } from "webforai/loaders/playwright"; | ||
|
||
dotenv.config(); | ||
dotevn.config(); | ||
|
||
const anthropic = new Anthropic({ | ||
apiKey: process.env.ANTHROPIC_API_KEY, | ||
}); | ||
|
||
await fs.mkdir(".output", { recursive: true }); | ||
|
||
const args = arg({ "--url": String }); | ||
|
||
const url = args["--url"] || "https://www.npmjs.com/package/webforai"; | ||
const fileName = url.replace(/[^a-zA-Z0-9]/g, "-"); | ||
const url = "https://blog.cloudflare.com/the-story-of-web-framework-hono-from-the-creator-of-hono/"; | ||
const targetLanguage = "ja"; | ||
|
||
await fs.writeFile(`.output/${fileName}.md`, url); | ||
const html = await loadHtml(url, { superBypassMode: true }); | ||
const markdown = htmlToMarkdown(html); | ||
|
||
const html = await loadHtml(url); | ||
const mdast = htmlToMdast(html); | ||
const prompt = `Translate mechanically converted HTML-based Markdown into ${targetLanguage}, while refining and correcting the content for clarity and coherence. | ||
const splitted = await mdastSplitter(mdast, async (md) => 2000 > md.length); | ||
The Markdown provided may contain redundant or unnecessary information and errors due to mechanical conversion. Your task is to translate the text into Japanese, fixing these issues and improving the overall quality of the Markdown document. | ||
for (const mdast of splitted) { | ||
let linkIndex = 0; | ||
const links: string[] = []; | ||
const baseMd = mdastToMarkdown(mdast, { baseUrl: url }).replace(/!?\[([^\]]+)\]\(([^)]+)\)/g, (match, text, url) => { | ||
const placeholder = `URL_${linkIndex++}`; | ||
links.push(url); | ||
return match.startsWith("!") ? `![${text}](${placeholder})` : `[${text}](${placeholder})`; | ||
}); | ||
if (baseMd.length < 10) { | ||
continue; | ||
} | ||
const prompt = `あなたは、Markdownドキュメントを翻訳するスペシャリストです。次のWebサイトをMarkdownに変換したドキュメントを、日本語に翻訳してください。 | ||
<input_document> | ||
${markdown} | ||
</input_document>`; | ||
|
||
<instruction> | ||
1. 翻訳する際は、推敲を重ね、あたかも最初から日本語の文章であるかのように翻訳してください。 | ||
2. 専門分野の翻訳では、その分野の用語を正しく理解し、適切な訳語をあててください。 | ||
3. ドキュメントは機械的に変換したもので、不整合や誤り等が含まれています。これらを発見し、修正してください。 | ||
4. まず翻訳に移る前に、文章をよく読んで<thinking></thinking>タグ内で内容のメモや、不整合確認を行ってください。 | ||
5. その後、<translation></translation>タグ内に翻訳したMarkdownを記入してください。 | ||
</instruction> | ||
<markdown> | ||
${baseMd} | ||
</markdown>`; | ||
const translatedRes = await anthropic.messages.create({ | ||
model: "claude-3-haiku-20240307", | ||
max_tokens: 4096, | ||
temperature: 0, | ||
messages: [{ role: "user", content: prompt }], | ||
}); | ||
|
||
let translatedMd = `${translatedRes.content[0].text.trim().match(/<translation>([\s\S]+)<\/translation>/)?.[1]}\n`; | ||
for (const [i, link] of links.entries()) { | ||
translatedMd = translatedMd.replace(new RegExp(`URL_${i}`, "g"), link); | ||
} | ||
const response = await generateText({ | ||
model: google("gemini-1.5-flash-latest"), | ||
temperature: 0, | ||
prompt, | ||
maxSteps: 10, | ||
experimental_continueSteps: true, | ||
}); | ||
|
||
await fs.appendFile(`.output/${fileName}.md`, translatedMd); | ||
} | ||
console.info(response.text); |
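
The version of this script removed by the diff masked every link target with a `URL_n` placeholder before sending a chunk to the model, then restored the targets with one global `RegExp` per link. The following is a minimal sketch of that masking technique, not the commit's code: `maskLinks` and `restoreLinks` are hypothetical helper names and are not part of `webforai`. The sketch restores all placeholders in a single pass, because a per-index global replace of `URL_1` would also match the start of `URL_10`.

```typescript
// Hypothetical helpers for illustration only; not part of webforai.
interface MaskedMarkdown {
  text: string;
  links: string[];
}

// Replace every link or image target with a short placeholder (URL_0, URL_1, ...)
// and remember the original targets in order.
function maskLinks(markdown: string): MaskedMarkdown {
  const links: string[] = [];
  const text = markdown.replace(
    /(!?)\[([^\]]+)\]\(([^)]+)\)/g,
    (_match: string, bang: string, label: string, target: string) => {
      const placeholder = `URL_${links.length}`;
      links.push(target);
      // Keep the leading "!" so images stay images.
      return `${bang}[${label}](${placeholder})`;
    },
  );
  return { text, links };
}

// Restore all placeholders in one pass. The regex captures the full index,
// so URL_10 is looked up as 10 rather than as URL_1 followed by "0".
// Unknown placeholders are left untouched.
function restoreLinks(text: string, links: string[]): string {
  return text.replace(/URL_(\d+)/g, (match: string, index: string) => links[Number(index)] ?? match);
}

// Usage: mask before translation, restore afterwards.
const masked = maskLinks("See [docs](https://example.com/a) and ![logo](./logo.webp).");
// masked.text is "See [docs](URL_0) and ![logo](URL_1)."
const restored = restoreLinks(masked.text, masked.links);
console.info(restored);
```

Masking keeps long URLs out of the prompt and prevents the model from rewriting them. One limitation of this sketch: any literal `URL_<digits>` text already present in the page would also be rewritten on restore, so a less common placeholder prefix may be safer in practice.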