carson0321/node-boilerpipe

node-boilerpipe

A node.js wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Installation

node-boilerpipe depends on Boilerpipe v1.2.0 or higher.

WARNING: Don't forget to set the Java environment variable (typically JAVA_HOME) that node-java relies on.

Via npm:

$ npm install boilerpipe

To build the project from source:

$ mvn compile
$ mvn package
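
Because node-java needs to find a Java installation, a small pre-flight check before loading the module can make failures easier to diagnose. The following is a minimal sketch, not part of node-boilerpipe. It assumes the relevant variable is JAVA_HOME, which this README does not state explicitly, and checkJavaEnv is a hypothetical helper name.

```javascript
// Hypothetical pre-flight helper; not part of node-boilerpipe.
// ASSUMPTION: JAVA_HOME is the variable node-java looks for.
function checkJavaEnv(env) {
  var javaHome = env.JAVA_HOME;
  if (!javaHome || javaHome.trim() === '') {
    return {
      ok: false,
      message: 'JAVA_HOME is not set; node-java may fail to build or load.'
    };
  }
  return { ok: true, message: 'JAVA_HOME is set to ' + javaHome };
}

// Report the state of the current process environment.
console.log(checkJavaEnv(process.env).message);
```

Running the check before require('boilerpipe') gives a readable message instead of a native-module load error.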

Usage

Load in the module

  var Boilerpipe = require('boilerpipe');

Create a new instance

The constructor takes an extractor, which must be one of the available Boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed, DefaultExtractor is used. The options object also accepts the input to process, given as either html (HTML text) or url.

  var boilerpipe = new Boilerpipe();

  var boilerpipe = new Boilerpipe({
    extractor: Boilerpipe.Extractor.Canola
  });

  var boilerpipe = new Boilerpipe({
    extractor: Boilerpipe.Extractor.Article,
    url: 'http://...'
  });

  var boilerpipe = new Boilerpipe({
    extractor: Boilerpipe.Extractor.ArticleSentences,
    html: '<html>...</html>'
  }, function(err) {
    ...
  });
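
The examples above refer to extractors as Boilerpipe.Extractor.Canola, Boilerpipe.Extractor.Article and Boilerpipe.Extractor.ArticleSentences, that is, the listed type names without the "Extractor" suffix. The sketch below assumes the same pattern holds for every listed type, which this README does not confirm. toExtractorKey is a hypothetical helper, not part of the module.

```javascript
// Extractor type names exactly as listed in this README.
var EXTRACTOR_NAMES = [
  'DefaultExtractor',
  'ArticleExtractor',
  'ArticleSentencesExtractor',
  'KeepEverythingExtractor',
  'KeepEverythingWithMinKWordsExtractor',
  'LargestContentExtractor',
  'NumWordsRulesExtractor',
  'CanolaExtractor'
];

// Hypothetical helper: turn a listed type name into the key used on
// Boilerpipe.Extractor. ASSUMPTION: the key is the name minus "Extractor".
function toExtractorKey(name) {
  if (EXTRACTOR_NAMES.indexOf(name) === -1) {
    throw new Error('Unknown extractor type: ' + name);
  }
  return name.replace(/Extractor$/, '');
}

console.log(toExtractorKey('ArticleSentencesExtractor'));
```

If the assumption holds, the result could be used as Boilerpipe.Extractor[toExtractorKey('CanolaExtractor')] when building the constructor options.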

Set URL or HTML

If you set both a URL and HTML, only the URL is used; the HTML is ignored in that case.

  boilerpipe.setUrl('http://...');

  boilerpipe.setHtml('<html>...</html>');
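
The precedence rule can be illustrated in isolation. This is a sketch of the documented behaviour only, not the module's actual implementation. resolveSource and the shape of its return value are hypothetical.

```javascript
// Hypothetical illustration of the "URL wins over HTML" rule.
// Returns which input would be used, or null when neither is set.
function resolveSource(options) {
  if (options.url) {
    return { kind: 'url', value: options.url };
  }
  if (options.html) {
    return { kind: 'html', value: options.html };
  }
  return null;
}

var both = resolveSource({
  url: 'http://example.com/',
  html: '<html><body>ignored</body></html>'
});
console.log(both.kind); // the URL takes precedence
```

To switch an existing instance from a URL to raw HTML, it is therefore safer to create a new instance than to rely on setHtml overriding an earlier setUrl.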

Get text, HTML and images

  boilerpipe.getText(function(err, text) {
    ...
  });

  boilerpipe.getHtml(function(err, html) {
    ...
  });

  boilerpipe.getImages(function(err, images) {
    ...
  });
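
All three getters use Node's error-first callback convention, so they can be chained to collect every result in one place. The sketch below relies only on the getText, getHtml and getImages signatures shown above. extractAll is a hypothetical helper, and the stub object stands in for a real Boilerpipe instance so the example runs without Java.

```javascript
// Hypothetical helper: call the three getters in sequence and hand back
// one combined result, stopping at the first error.
function extractAll(boilerpipe, callback) {
  boilerpipe.getText(function (err, text) {
    if (err) { return callback(err); }
    boilerpipe.getHtml(function (err, html) {
      if (err) { return callback(err); }
      boilerpipe.getImages(function (err, images) {
        if (err) { return callback(err); }
        callback(null, { text: text, html: html, images: images });
      });
    });
  });
}

// Stand-in for a real instance (an assumption made for demonstration only).
var stub = {
  getText: function (cb) { cb(null, 'Hello'); },
  getHtml: function (cb) { cb(null, '<p>Hello</p>'); },
  getImages: function (cb) { cb(null, []); }
};

extractAll(stub, function (err, result) {
  if (err) { return console.error(err); }
  console.log(result.text);
});
```

With a real instance, pass the object returned by new Boilerpipe({...}) in place of the stub.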

License
