A Node.js wrapper for Boilerpipe, an excellent Java library for boilerplate removal and full-text extraction from HTML pages.
node-boilerpipe depends on Boilerpipe v1.2.0 or higher.
WARNING: Don't forget to set the Java environment variable that node-java relies on (typically JAVA_HOME).
Via npm:
$ npm install boilerpipe
$ mvn compile
$ mvn package
var Boilerpipe = require('boilerpipe');
The constructor takes an extractor, which must be one of the available Boilerpipe extractor types:
- DefaultExtractor
- ArticleExtractor
- ArticleSentencesExtractor
- KeepEverythingExtractor
- KeepEverythingWithMinKWordsExtractor
- LargestContentExtractor
- NumWordsRulesExtractor
- CanolaExtractor
If no extractor is passed, the DefaultExtractor is used. The options object also accepts either html (an HTML string) or url (a page URL).
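// Default extractor, no input set yet
var boilerpipe = new Boilerpipe();

// Canola extractor
var boilerpipe = new Boilerpipe({
  extractor: Boilerpipe.Extractor.Canola
});

// Article extractor reading from a URL
var boilerpipe = new Boilerpipe({
  extractor: Boilerpipe.Extractor.Article,
  url: 'http://...'
});

// Article-sentences extractor reading from an HTML string, with a callback
var boilerpipe = new Boilerpipe({
  extractor: Boilerpipe.Extractor.ArticleSentences,
  html: '<html>...</html>'
}, function(err) {
  ...
});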
If you set both url and html, only url is used; html is ignored in that case.
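The precedence rule above can be pictured with a small sketch. The helper `resolveSource` is hypothetical and is not part of the boilerpipe module; it only mirrors the documented behaviour that a url, when present, takes priority over html.

```javascript
// Hypothetical helper (not part of the boilerpipe module): it mirrors the
// documented rule that url wins when both url and html are supplied.
function resolveSource(options) {
  var opts = options || {};
  if (opts.url) {
    return { type: 'url', value: opts.url };
  }
  if (opts.html) {
    return { type: 'html', value: opts.html };
  }
  // Neither input was supplied.
  return { type: 'none', value: null };
}

// Both inputs given: the html value is ignored.
var source = resolveSource({ url: 'http://example.com/', html: '<html></html>' });
console.log(source.type);
```

If the html input is the one you want, pass it on its own without a url.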
boilerpipe.setUrl('http://...');
boilerpipe.setHtml('<html>...</html>');
// Extracted plain text
boilerpipe.getText(function(err, text) {
  ...
});

// Extracted content as HTML
boilerpipe.getHtml(function(err, html) {
  ...
});

// Images found in the extracted content
boilerpipe.getImages(function(err, images) {
  ...
});
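The getters above use error-first callbacks, so they could be wrapped with Node's `util.promisify` for use with promises or `async`/`await`. The following is a minimal sketch, not part of the module: `promisifyMethod` is a hypothetical helper, and `fakeBoilerpipe` is a stand-in object with the same `getText(callback)` shape so the snippet runs without Java installed.

```javascript
const { promisify } = require('util');

// Hypothetical stand-in exposing the same callback shape as getText,
// so this sketch runs without Java or the boilerpipe module installed.
const fakeBoilerpipe = {
  text: 'Extracted text',
  getText(callback) {
    // Call back asynchronously with (err, result), error first.
    setImmediate(() => callback(null, this.text));
  }
};

// Hypothetical helper: wrap a callback-style method so it returns a Promise.
// bind() keeps `this` pointing at the instance when the method runs.
function promisifyMethod(instance, methodName) {
  return promisify(instance[methodName].bind(instance));
}

const getTextAsync = promisifyMethod(fakeBoilerpipe, 'getText');

getTextAsync()
  .then((text) => console.log(text))
  .catch((err) => console.error(err));
```

With a real instance, the same helper would presumably be called as `promisifyMethod(boilerpipe, 'getText')`, assuming the method follows the `(err, result)` callback convention shown above.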