- Updates gem dependencies
- PR #52 Allow passing the URL as the `Wombat#crawl` argument
- PR #51 Allow crawler classes inheritance
- PR #50 Add HTTP methods support (`POST`, `PUT`, `HEAD`, etc)
- Updates gem dependencies
- Adds `user_agent` and `user_agent_alias` config options to `Wombat.configure`
- Updates gem dependencies
- Adds content-type=text/html header to Mechanize if missing
- Retry page.click on relative links
- Adds ability to crawl a prefetched Mechanize page (thanks to @dsjbirch)
- Added support for hash based property selectors (e.g. `css: 'header'` instead of `'css=.header'`)
- Updated gem dependencies
- Added header properties (thanks to @kdridi)
- Fixed bug in selectors that used XPath functions like `concat` (thanks to @viniciusdaniel)
- Added proxy settings configuration (thanks to @phortx)
- Fixed minor bug in HTML property locator
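
To make the two selector forms mentioned above concrete, here is a minimal sketch of how the older string form could map onto the hash form. `normalize_selector` is a hypothetical helper written for illustration only; it is not part of Wombat and does not claim to mirror the gem's internals.

```ruby
# Hypothetical helper, not part of Wombat: converts the string selector form
# ('css=.header') into the hash form ({ css: '.header' }).
def normalize_selector(selector)
  # Hash selectors are already in the newer form, so pass them through.
  return selector if selector.is_a?(Hash)

  # Split on the first '=' only, since XPath expressions may contain '=' too.
  type, expression = selector.split('=', 2)
  raise ArgumentError, "selector needs a type prefix: #{selector.inspect}" if expression.nil?

  { type.to_sym => expression }
end

string_form = normalize_selector('css=.header')        # hash with key :css
hash_form   = normalize_selector({ xpath: '//h1' })     # returned unchanged
```

Splitting on the first `=` only keeps expressions such as `xpath=//a[@id='x']` intact.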
This version contains some breaking changes (not backwards compatible), most notably to `for_each`, which is now specified through the `:iterator` option, and to nested block parameters, which are gone.
- Added syntactic sugar methods `Wombat.scrape` and `Crawler#scrape` that alias to their respective `crawl` method implementation;
- Gem internals suffered big refactoring, removed code duplication;
- DSL syntax simplified for nested properties. Now the nested block takes no arguments;
- DSL syntax changed for iterated properties. Iterators can now be named just like other properties and won't be automatically named as `iterator#{i}` anymore. Specified through the `:iterator` option;
- `Crawler#list_page` is now called `Crawler#path`;
- Added new `:follow` property type that crawls links in pages.
- Breaking change: `Metadata#format` renamed to `Metadata#document_format` due to method name clash with `Kernel#format`
- Fixed a bug on malformed selectors
- Fixed a bug where multiple calls to #crawl would not clean up previously iterated array results and would yield repeated results
- Added utility method `Wombat.crawl` that eliminates the need to have a Ruby class instance to use Wombat. Now you can use just `Wombat.crawl` and start working. The class based format still works as before though.
- Added the ability to provide a block to Crawler#crawl and override the default crawler properties for a one-off run (thanks to @danielnc)
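
The `Metadata#format` rename noted above is easier to follow with a small illustration of how a name clash with `Kernel#format` can arise. The sketch below assumes a DSL object that records settings through `method_missing`. `SketchMetadata` is a hypothetical stand-in, not Wombat's actual `Metadata` class, and the gem's real mechanism may differ.

```ruby
# Hypothetical stand-in for a DSL settings object; not Wombat's Metadata class.
class SketchMetadata
  attr_reader :settings

  def initialize
    @settings = {}
  end

  # Records any unknown method call as a setting, e.g. `base_url "..."`.
  def method_missing(name, *args)
    @settings[name] = args.first
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

meta = SketchMetadata.new
meta.instance_eval do
  format 'html'           # resolves to Kernel#format, so method_missing never fires
  document_format 'html'  # no such method exists, so it is recorded as a setting
end

format_recorded = meta.settings.key?(:format)   # false: the call was swallowed
document_format = meta.settings[:document_format]
```

Because `Kernel#format` is available on every object, a bare `format` call inside the block never reaches `method_missing`. A name that Ruby does not already define avoids the problem.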