Skip to content

Commit

Permalink
Merge pull request #50 from inaridiy/feat/docs
Browse files Browse the repository at this point in the history
Feat/docs
  • Loading branch information
inaridiy authored Oct 20, 2024
2 parents 7bd835d + ff85d73 commit 7f2d5d3
Show file tree
Hide file tree
Showing 46 changed files with 5,249 additions and 478 deletions.
10 changes: 10 additions & 0 deletions .changeset/famous-mangos-sneeze.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
"webforai": major
"ai-learning": patch
"translate": patch
"worker": patch
"bench": patch
"site": patch
---

New Documentation Site
1 change: 1 addition & 0 deletions .github/FUNDING.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
github: inaridiy
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
node_modules
dist
.npmrc
.DS_Store
.DS_Store
.wrangler
3 changes: 3 additions & 0 deletions .vscode/settings.json
Original file line number Diff line number Diff line change
Expand Up @@ -18,5 +18,8 @@
},
"[html]": {
"editor.defaultFormatter": "vscode.html-language-features"
},
"[javascript]": {
"editor.defaultFormatter": "biomejs.biome"
}
}
129 changes: 1 addition & 128 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,128 +1 @@
# Web for AI

![LOGO](https://github.com/inaridiy/webforai/blob/main/images/logo.webp)

A library that provides a web interface for AI

## Features

- Ultra-precise HTML2Markdown conversion
- Ultra-precise Markdown segmentation based on AST
- Ultra-precise HTML retrieval functions using headless browsers
- The core functionality is edge-native (runs on Cloudflare Workers!!!)
- Command-line interface for easy use without coding

## Demo

There is a demo API for Html2Markdown deployed on CloudflareWorker. Please access the following link

- [NPM Package page](https://webforai.inaridiy.workers.dev/?url=https://www.npmjs.com/package/webforai)
- [Wikipedia of Cloudflare (AI Mode)](https://webforai.inaridiy.workers.dev/?url=https://en.wikipedia.org/wiki/Cloudflare&mode=ai)

## Installation

### As a library
To use WebforAI as a library in your project, install it along with playwright:
```bash
pnpm i webforai playwright
```

### As a CLI tool
To use WebforAI as a command-line tool, install it globally:
```bash
npm install -g webforai
```

## Quick Start

### Using as a library
Just install and execute scripts

```js
import { promises as fs } from "fs";
import { htmlToMarkdown, htmlToMdast } from "webforai";
import { loadHtml } from "webforai/loaders/playwright";

const url = "https://www.npmjs.com/package/webforai";
const html = await loadHtml(url);

const markdown = htmlToMarkdown(html, { baseUrl: url });

await fs.writeFile("output.md", markdown);
```

other examples are in [examples](./examples/simple/src/index.ts)

### Using as a CLI tool
After installing globally, you can use the webforai command:
```bash
webforai https://www.npmjs.com/package/webforai output.md
```
This will fetch the HTML from the specified URL, convert it to Markdown, and save it to output.md.

For more CLI options, run:
```bash
webforai --help
```

## Examples

- [Simple Example](https://github.com/inaridiy/webforai/tree/main/examples/simple/src/index.ts)
- [Scraping With ChatGPT API](https://github.com/inaridiy/webforai/blob/main/examples/scraping/src/index.ts)
- [Translate Markdown with Splitter](https://github.com/inaridiy/webforai/tree/main/examples/translate)
- [Cloudflare Worker with puppeteer & DO](https://github.com/inaridiy/webforai/tree/main/examples/worker)

## Usage

### Main Functions

**`htmlToMarkdown(html: string, options?: HtmlToMarkdownOptions): string`**
Convert HTML to Markdown. By default, unnecessary HTML is excluded and processed.
If `solveLinks` is specified, the relative links in the Mdast will be resolved.
This function just calls htmlToMdast and mdastToMarkdown in that order internally.

**`htmlToMdast(html: string, options?: HtmlToMdastOptions): Mdast`**
This project uses Hast and Mdast as defined by syntax-tree internally.
This function converts HTML to Mdast, an intermediate representation, which is required when using `mdastSplitter`, etc.

**`mdastToMarkdown(mdast: Mdast | RootContent[], options?: { solveLinks?: string }): string`**
Convert Mdast to Markdown. If `solveLinks` is specified, the relative links in the Mdast will be resolved.

### Loader Functions

**`loadHtml(url: string, options?: LoadHtmlOptions): Promise<string>`**
Load HTML from the specified URL. This function uses Playwright internally.

### CLI Commands

**`webforai <source?> <outputPath?> [options]`**
Converts the HTML at the specified URL or path to Markdown and saves it to an output file.
Arguments and options can be specified in the interactive interface even if they are not specified.
- **`source`**
The URL or path to the HTML file to convert.
- **`outputPath`**
The path to save the output Markdown file.

#### Options
- **`--mode <mode>`**
Specify the mode to use for conversion. Options are `default` and `ai`. Default is `default`.
- **`--loader <loader>`**
Specify the loader to use for fetching HTML. Options are `fetch`, `playwright` and `puppeteer`. Default is `fetch`.
- **`--baseUrl <baseUrl>`**
Specify the base URL to use for relative links in the output Markdown.
- **`-o --stdout`**
Output the converted Markdown to the console instead of saving it to a file.
- **`-d --debug`**
Enable debug mode. This will output additional information to the console.

**`webforai --help`**
Displays help information for the CLI tool.

## Contributing
1. Fork it!
2. Create your feature branch: `git checkout -b my-new-feature`
3. Add your changes: `git add .`
4. Commit your changes: `git commit -am 'Add some feature'`
5. Add a changelog: `pnpm changeset`
6. Push to the branch: `git push origin my-new-feature`
7. Submit a pull request :sunglasses:
packages/webforai/README.md
7 changes: 6 additions & 1 deletion biome.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
{
"$schema": "https://biomejs.dev/schemas/1.5.3/schema.json",

"files": {
"ignore": ["worker-configuration.d.ts"]
},
"vcs": {
"enabled": true,
"clientKind": "git",
Expand All @@ -25,6 +27,9 @@
}
}
},
"correctness": {
"noUndeclaredVariables": "off"
},
"complexity": {
"noExcessiveCognitiveComplexity": {
"level": "error",
Expand Down
2 changes: 1 addition & 1 deletion examples/ai-learning/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
"license": "ISC",
"dependencies": {
"@ai-sdk/google": "^0.0.48",
"ai": "^3.3.39",
"ai": "^3.4.7",
"arg": "^5.0.2",
"dotenv": "^16.4.5",
"hast-util-from-html": "^2.0.3",
Expand Down
2 changes: 1 addition & 1 deletion examples/ai-learning/src/manual.ts
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ await fs.mkdirSync(".output", { recursive: true });
await fs.writeFileSync(".output/html.html", html);

const rawContent = await htmlToMarkdown(html, { baseUrl: url, extractors: false });
const cleanedContent = await htmlToMarkdown(html, { baseUrl: url, extractors: "readability" });
const cleanedContent = await htmlToMarkdown(html, { baseUrl: url, extractors: "takumi" });

await fs.writeFileSync(".output/raw.md", rawContent);
await fs.writeFileSync(".output/cleaned.md", cleanedContent);
2 changes: 1 addition & 1 deletion examples/bench/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ for (const url of targets) {

const markdown = htmlToMarkdown(html, {
baseUrl: url,
extractors: "readability",
extractors: "takumi",
linkAsText: true,
tableAsText: true,
hideImage: true,
Expand Down
2 changes: 2 additions & 0 deletions examples/translate/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,10 @@
"author": "",
"license": "ISC",
"dependencies": {
"@ai-sdk/google": "^0.0.48",
"@anthropic-ai/sdk": "^0.18.0",
"@google/generative-ai": "^0.12.0",
"ai": "^3.4.7",
"arg": "^5.0.2",
"dotenv": "^16.4.5",
"playwright": "^1.40.1",
Expand Down
56 changes: 0 additions & 56 deletions examples/translate/src/gemini.ts

This file was deleted.

78 changes: 22 additions & 56 deletions examples/translate/src/index.ts
Original file line number Diff line number Diff line change
@@ -1,65 +1,31 @@
import { promises as fs } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";
import arg from "arg";
import dotenv from "dotenv";
import { htmlToMdast, mdastSplitter, mdastToMarkdown } from "webforai";
import { google } from "@ai-sdk/google";
import { generateText } from "ai";
import dotevn from "dotenv";
import { htmlToMarkdown } from "webforai";
import { loadHtml } from "webforai/loaders/playwright";

dotenv.config();
dotevn.config();

const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});

await fs.mkdir(".output", { recursive: true });

const args = arg({ "--url": String });

const url = args["--url"] || "https://www.npmjs.com/package/webforai";
const fileName = url.replace(/[^a-zA-Z0-9]/g, "-");
const url = "https://blog.cloudflare.com/the-story-of-web-framework-hono-from-the-creator-of-hono/";
const targetLanguage = "ja";

await fs.writeFile(`.output/${fileName}.md`, url);
const html = await loadHtml(url, { superBypassMode: true });
const markdown = htmlToMarkdown(html);

const html = await loadHtml(url);
const mdast = htmlToMdast(html);
const prompt = `Translate mechanically converted HTML-based Markdown into ${targetLanguage}, while refining and correcting the content for clarity and coherence.
const splitted = await mdastSplitter(mdast, async (md) => 2000 > md.length);
The Markdown provided may contain redundant or unnecessary information and errors due to mechanical conversion. Your task is to translate the text into Japanese, fixing these issues and improving the overall quality of the Markdown document.
for (const mdast of splitted) {
let linkIndex = 0;
const links: string[] = [];
const baseMd = mdastToMarkdown(mdast, { baseUrl: url }).replace(/!?\[([^\]]+)\]\(([^)]+)\)/g, (match, text, url) => {
const placeholder = `URL_${linkIndex++}`;
links.push(url);
return match.startsWith("!") ? `![${text}](${placeholder})` : `[${text}](${placeholder})`;
});
if (baseMd.length < 10) {
continue;
}
const prompt = `あなたは、Markdownドキュメントを翻訳するスペシャリストです。次のWebサイトをMarkdownに変換したドキュメントを、日本語に翻訳してください。
<input_document>
${markdown}
</input_document>`;

<instruction>
1. 翻訳する際は、推敲を重ね、あたかも最初から日本語の文章であるかのように翻訳してください。
2. 専門分野の翻訳では、その分野の用語を正しく理解し、適切な訳語をあててください。
3. ドキュメントは機械的に変換したもので、不整合や誤り等が含まれています。これらを発見し、修正してください。
4. まず翻訳に移る前に、文章をよく読んで<thinking></thinking>タグ内で内容のメモや、不整合確認を行ってください。
5. その後、<translation></translation>タグ内に翻訳したMarkdownを記入してください。
</instruction>
<markdown>
${baseMd}
</markdown>`;
const translatedRes = await anthropic.messages.create({
model: "claude-3-haiku-20240307",
max_tokens: 4096,
temperature: 0,
messages: [{ role: "user", content: prompt }],
});

let translatedMd = `${translatedRes.content[0].text.trim().match(/<translation>([\s\S]+)<\/translation>/)?.[1]}\n`;
for (const [i, link] of links.entries()) {
translatedMd = translatedMd.replace(new RegExp(`URL_${i}`, "g"), link);
}
const response = await generateText({
model: google("gemini-1.5-flash-latest"),
temperature: 0,
prompt,
maxSteps: 10,
experimental_continueSteps: true,
});

await fs.appendFile(`.output/${fileName}.md`, translatedMd);
}
console.info(response.text);
6 changes: 3 additions & 3 deletions examples/worker/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -10,14 +10,14 @@
"devDependencies": {
"@cloudflare/puppeteer": "^0.0.6",
"@cloudflare/vitest-pool-workers": "^0.1.0",
"@cloudflare/workers-types": "^4.20240405.0",
"@cloudflare/workers-types": "^4.20241018.0",
"typescript": "^5.4.5",
"vitest": "1.3.0",
"wrangler": "^3.48.0"
"wrangler": "^3.81.0"
},
"dependencies": {
"@hono/valibot-validator": "^0.2.2",
"hono": "^4.1.1",
"hono": "^4.6.5",
"valibot": "^0.30.0",
"webforai": "workspace:^"
}
Expand Down
Loading

0 comments on commit 7f2d5d3

Please sign in to comment.