Text extraction JavaScript library
- Text extraction
- Tag recommendation
Basically, load the following files in this order:
- lib/lib.js
  - common utilities
- lib/extract-content.js
  - text extraction
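In a browser without a build step, the load order above could be enforced with a small sequential loader. This is only a sketch: loadScriptsInOrder and its doc parameter are hypothetical names that are not part of the library, and the relative paths assume the page is served from the repository root.

```javascript
// Hypothetical helper (not part of ExtractContentJS): append <script>
// elements one at a time so each file is evaluated before the next starts.
// `doc` is the document to load into; passing it in keeps the helper testable.
function loadScriptsInOrder(doc, urls, done) {
  var index = 0;

  function next() {
    if (index >= urls.length) {
      done(null); // every script has loaded
      return;
    }
    var url = urls[index++];
    var script = doc.createElement('script');
    script.src = url;
    script.onload = next;
    script.onerror = function () {
      done(new Error('failed to load ' + url));
    };
    doc.head.appendChild(script);
  }

  next();
}

// In a page, assuming the files are reachable at these relative paths:
// loadScriptsInOrder(document, ['lib/lib.js', 'lib/extract-content.js'],
//   function (err) { if (!err) { /* ExtractContentJS can be used here */ } });
```

Loading strictly one file after another is slower than parallel loading, but it guarantees that the common utilities are defined before the extraction code runs.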
Running make package in the repository root generates extract-content-all.js, which is the concatenation of these files.
To see actual usage in detail, look at:
- sketch/extract-content.test.js
  - text extraction test
- lib/scoring-words.js
  - tag scoring (sample)
Use the following if you only need text extraction, or if you want to specify the handlers yourself:
    var ex = new ExtractContentJS.LayeredExtractor();
    //ex.addHandler( ex.factory.getHandler('Description') );
    //ex.addHandler( ex.factory.getHandler('Scraper') );
    //ex.addHandler( ex.factory.getHandler('GoogleAdsence') );
    ex.addHandler( ex.factory.getHandler('Heuristics') );
    var res = ex.extract(document);

    if (res.isSuccess) {
        res.url;     // URL string
        res.title;   // title string
        res.engine;  // the handler that was used for extraction
        res.content; // an instance of the content class (see below)
    }
So far, only the Heuristics handler has been implemented.
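The snippet above can be wrapped in a small function that returns plain data, or null when extraction fails. This is a minimal sketch that assumes only the calls shown above (LayeredExtractor, addHandler, factory.getHandler, extract, and the isSuccess, url, title, and content fields). The name extractWithHeuristics and the lib parameter are hypothetical; lib exists only so the library namespace can be passed in explicitly.

```javascript
// Hypothetical wrapper (not part of the library). `lib` is expected to be
// the ExtractContentJS namespace; `doc` is the document to extract from.
function extractWithHeuristics(lib, doc) {
  var ex = new lib.LayeredExtractor();
  // Heuristics is the only handler this README describes as implemented.
  ex.addHandler(ex.factory.getHandler('Heuristics'));

  var res = ex.extract(doc);
  if (!res.isSuccess) {
    return null; // extraction did not succeed
  }
  return {
    url: res.url,
    title: res.title,
    content: res.content
  };
}

// In a page where the library has been loaded:
// var page = extractWithHeuristics(ExtractContentJS, document);
// if (page) { console.log(page.title); }
```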
    content.asLeaves();       // returns an array of leaf class instances (see below), each holding a leaf node judged to be body text
    content.asNode();         // returns the deepest common ancestor of all the leaf nodes
    content.asTextFragment(); // returns the concatenated text of the nodes included in asLeaves()
    content.toString();       // returns the textContent of asNode()

    leaf.node;  // the leaf node
    leaf.depth; // depth of the node from the body
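To make the content and leaf accessors more concrete, here is a sketch that summarizes an extraction result. It relies only on asLeaves(), asTextFragment(), and leaf.depth as described above. summarizeContent is a hypothetical name, and wrapping asTextFragment() in String() assumes that its return value converts to the concatenated text.

```javascript
// Hypothetical helper (not part of the library): summarize a content
// instance using only the accessors described above.
function summarizeContent(content) {
  var leaves = content.asLeaves();
  var minDepth = Infinity;
  var maxDepth = -Infinity;

  for (var i = 0; i < leaves.length; i++) {
    var depth = leaves[i].depth; // depth of the leaf node from the body
    if (depth < minDepth) { minDepth = depth; }
    if (depth > maxDepth) { maxDepth = depth; }
  }

  // Assumption: asTextFragment() yields something String() turns into text.
  var text = String(content.asTextFragment());

  return {
    leafCount: leaves.length,
    minDepth: leaves.length ? minDepth : null,
    maxDepth: leaves.length ? maxDepth : null,
    textLength: text.length
  };
}

// With a successful result `res` from ex.extract(document):
// var summary = summarizeContent(res.content);
```

A wide gap between minDepth and maxDepth may hint that the extracted leaves come from differently nested parts of the page, which can be useful when debugging an unexpected result.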
INA Lintaro
Copyright © 2009 INA Lintaro / Hatena. All rights reserved.
Copyright © 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
MIT License