ExtractContentJS

A JavaScript library for main-text extraction

What it can do

Main-text extraction
Tag recommendation

Files

Basically, it works when the following files are loaded in this order:

lib/lib.js: common utilities
lib/extract-content.js: main-text extraction

Running make package at the repository root generates extract-content-all.js, which is the concatenation of these files.

To see in detail how it is actually used, look at:

sketch/extract-content.test.js: main-text extraction test
lib/scoring-words.js: tag scoring (sample)

How to use

Text extraction interface

Use this when you only want main-text extraction, or when you want to specify the handlers yourself.

ExtractContentJS.LayeredExtractor

var ex = new ExtractContentJS.LayeredExtractor();
//ex.addHandler( ex.factory.getHandler('Description') );
//ex.addHandler( ex.factory.getHandler('Scraper'));
//ex.addHandler( ex.factory.getHandler('GoogleAdsence') );
ex.addHandler( ex.factory.getHandler('Heuristics') );
var res = ex.extract(document);

if (res.isSuccess) {
    res.url;   // URL string
    res.title; // title string
    res.engine; // the handler that was used for extraction
    res.content; // an instance of the Content class (see below)
}

So far, only the Heuristics handler has been implemented.
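
The call sequence above can be wrapped in a small helper. The following is a minimal sketch, not part of the library: extractMainText is a hypothetical name, and it takes the ExtractContentJS namespace as an argument so that it relies only on the calls shown above (LayeredExtractor, addHandler, factory.getHandler, extract, isSuccess) plus content.toString().

```javascript
// Hypothetical helper (not part of ExtractContentJS): runs the Heuristics
// handler on a document and returns plain values, or null on failure.
// `lib` is assumed to be the ExtractContentJS namespace described above.
function extractMainText(lib, doc) {
    var ex = new lib.LayeredExtractor();
    ex.addHandler(ex.factory.getHandler('Heuristics'));
    var res = ex.extract(doc);
    if (!res.isSuccess) {
        return null; // extraction did not succeed
    }
    return {
        url: res.url,
        title: res.title,
        text: res.content.toString() // string form of the extracted content
    };
}
```

In a browser, once the library has been loaded, this would presumably be called as extractMainText(ExtractContentJS, document).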

Content class

content.asLeaves(); // returns an array of Leaf instances (see below) holding the leaf nodes judged to be main text
content.asNode(); // returns the deepest common ancestor of all the leaf nodes
content.asTextFragment(); // returns the concatenated text of the nodes included in asLeaves()
content.toString(); // returns the textContent of asNode()
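
To illustrate what asNode() is described as returning, here is a minimal sketch of finding the deepest common ancestor of a set of nodes. It is not the library's code: ancestorsOf and deepestCommonAncestor are hypothetical names, and the only assumption about a node is that it has a parentNode property, as DOM nodes do.

```javascript
// Returns the chain [root, ..., node] by following parentNode links.
function ancestorsOf(node) {
    var chain = [];
    for (var n = node; n; n = n.parentNode) {
        chain.unshift(n);
    }
    return chain;
}

// Returns the deepest node that is an ancestor of (or equal to) every
// node in `nodes`, or null if the list is empty or the nodes share no root.
function deepestCommonAncestor(nodes) {
    if (nodes.length === 0) {
        return null;
    }
    var common = ancestorsOf(nodes[0]);
    for (var i = 1; i < nodes.length; i++) {
        var chain = ancestorsOf(nodes[i]);
        var j = 0;
        while (j < common.length && j < chain.length && common[j] === chain[j]) {
            j++;
        }
        common = common.slice(0, j); // keep only the shared prefix
    }
    return common.length > 0 ? common[common.length - 1] : null;
}
```

Comparing root-to-node chains keeps the logic simple: the last element of the shared prefix is the deepest node that contains every leaf.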

Leaf class

leaf.node; // leaf node
leaf.depth; // depth from the body of the node
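
As an illustration of the depth value, the sketch below counts how many parentNode steps separate a node from the body. It rests on an assumption about what "depth from the body" means (the body itself at depth 0) and is not the library's actual code; depthFromBody is a hypothetical name.

```javascript
// Counts parentNode steps from `node` up to `body`.
// Returns 0 for the body itself and -1 if `body` is not an ancestor.
function depthFromBody(node, body) {
    var depth = 0;
    for (var n = node; n; n = n.parentNode) {
        if (n === body) {
            return depth;
        }
        depth++;
    }
    return -1;
}
```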

AUTHOR

Ina Lintro

Copyright

Copyright of the original implementation

labs.cybozu.co.jp/blog/nakatani/2007/09/web_1.html

LICENCE

MIT License
