A Ruby library for harvesting metadata from OAI-PMH repositories.
Current version: 0.12.0
Supported Ruby versions: 1.8.7, 1.9.2, 1.9.3, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6
gem install fieldhand -v '~> 0.12'
Or, in your Gemfile:
gem 'fieldhand', '~> 0.12'
require 'fieldhand'
repository = Fieldhand::Repository.new('http://example.com/oai')
repository.identify.name
#=> "Repository Name"
repository.metadata_formats.map { |format| format.prefix }
#=> ["oai_dc"]
repository.sets.map { |set| set.name }
#=> ["Set A.", "Set B."]
repository.records.each do |record|
# ...
end
repository.get('oai:www.example.com:12345')
#=> #<Fieldhand::Record: ...>
Fieldhand::Repository
Fieldhand::Identify
Fieldhand::MetadataFormat
Fieldhand::Set
Fieldhand::Record
Fieldhand::Header
Fieldhand::NetworkError
Fieldhand::ProtocolError
A class to represent an OAI-PMH repository:
A repository is a network accessible server that can process the 6 OAI-PMH requests [...]. A repository is managed by a data provider to expose metadata to harvesters.
Fieldhand::Repository.new('http://www.example.com/oai')
Fieldhand::Repository.new(URI('http://www.example.com/oai'))
Fieldhand::Repository.new('http://www.example.com/oai', :logger => Logger.new(STDOUT), :timeout => 10, :bearer_token => 'decafbad')
Fieldhand::Repository.new('http://www.example.com/oai', :logger => Logger.new(STDOUT), :timeout => 10, :headers => { 'Custom header' => 'decafbad' })
Fieldhand::Repository.new('http://www.example.com/oai', :logger => Logger.new(STDOUT), :retries => 5, :interval => 30)
Return a new Repository instance accessible at the given uri (specified either as a URI or something that can be coerced into a URI such as a String) with options passed as a Hash:
- :logger: a Logger-compatible logger, defaults to a platform-specific null logger;
- :timeout: a Numeric number of seconds to wait before timing out any HTTP requests, defaults to 60;
- :retries: a Numeric maximum number of times an HTTP request will be retried before raising an error, defaults to 0;
- :interval: a Numeric number of seconds to wait before the next retry attempt, defaults to 10;
- :bearer_token: a String bearer token to authorize any HTTP requests, defaults to nil;
- :headers: a Hash containing custom HTTP headers, defaults to {}.
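As a rough illustration of how these options might be assembled, the sketch below builds the options Hash from environment-style settings. The `repository_options` helper and the `OAI_*` variable names are hypothetical and not part of Fieldhand; only the option keys and their defaults come from the list above.

```ruby
# Hypothetical helper (not part of Fieldhand): build an options Hash suitable
# for Fieldhand::Repository.new from environment-style settings.
# The OAI_* variable names are assumptions made for this sketch.
def repository_options(env = ENV)
  options = {
    :timeout  => Integer(env.fetch('OAI_TIMEOUT', '60')),  # documented default: 60
    :retries  => Integer(env.fetch('OAI_RETRIES', '0')),   # documented default: 0
    :interval => Integer(env.fetch('OAI_INTERVAL', '10'))  # documented default: 10
  }

  # Only set a bearer token when one has actually been provided.
  token = env['OAI_BEARER_TOKEN']
  options[:bearer_token] = token if token && !token.empty?

  options
end

# Usage, assuming the fieldhand gem is installed:
#   require 'fieldhand'
#   repository = Fieldhand::Repository.new('http://www.example.com/oai', repository_options)
```

Anything that responds to `fetch` and `[]` with String keys (such as a plain Hash) can be passed in place of `ENV`, which keeps the helper easy to exercise in isolation.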
repository.identify
#=> #<Fieldhand::Identify: ...>
Return an Identify for the repository including information such as the repository name, base URL, protocol version, etc.
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
repository.metadata_formats
#=> #<Enumerator: ...>
repository.metadata_formats('oai:www.example.com:1')
Return an Enumerator of MetadataFormats available from the repository. Optionally takes an identifier that specifies the unique identifier of the item for which available metadata formats are being requested.
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
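One possible use of this method is to check for a metadata prefix before harvesting. The `supports_format?` helper below is a hypothetical sketch, not part of Fieldhand; it only relies on `metadata_formats` and `prefix` as described above and accepts any object that responds to them.

```ruby
# Hypothetical helper (not part of Fieldhand): true if the repository lists
# the given metadata prefix, optionally for a single item identifier.
def supports_format?(repository, prefix, identifier = nil)
  formats = identifier ? repository.metadata_formats(identifier) : repository.metadata_formats
  formats.any? { |format| format.prefix == prefix }
end

# Usage, assuming `repository` is a Fieldhand::Repository:
#   supports_format?(repository, 'oai_dc')
#   supports_format?(repository, 'oai_dc', 'oai:www.example.com:1')
```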
repository.sets
#=> #<Enumerator: ...>
Return an Enumerator of Sets that represent the set structure of a repository.
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
repository.records
repository.records(:metadata_prefix => 'oai_dc', :from => '2001-01-01')
repository.records(:metadata_prefix => 'oai_dc', :from => Date.new(2001, 1, 1))
repository.records(:set => 'A', :until => Time.utc(2010, 1, 1, 12, 0))
Return an Enumerator of all Records harvested from the repository.
Optional arguments can be passed as a Hash to permit selective harvesting of records based on set membership and/or datestamp:
- :metadata_prefix: a String or MetadataFormat to specify the metadata format that should be included in the metadata part of the returned record, defaults to oai_dc;
- :from: an optional argument with a String, Date or Time UTCdatetime value, which specifies a lower bound for datestamp-based selective harvesting;
- :until: an optional argument with a String, Date or Time UTCdatetime value, which specifies an upper bound for datestamp-based selective harvesting;
- :set: an optional argument with a set spec value (passed as either a String or a Set), which specifies set criteria for selective harvesting;
- :resumption_token: an exclusive argument with a String value that is the flow control token returned by a previous request that issued an incomplete list.
Note that datetimes should respect the repository's granularity, otherwise the request will result in a BadArgumentError.
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
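To illustrate the note about granularity, the sketch below formats a Time as a String at the granularity reported by `repository.identify.granularity` before passing it as `:from` or `:until`. The `datestamp_for` helper is hypothetical and not part of Fieldhand; it assumes the two granularity values listed under `Fieldhand::Identify` (YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ) are the only ones in play.

```ruby
# Hypothetical helper (not part of Fieldhand): render a Time as a UTC String
# matching the repository's granularity, so a day-granularity repository is
# never sent a full timestamp.
def datestamp_for(time, granularity)
  format = granularity == 'YYYY-MM-DD' ? '%Y-%m-%d' : '%Y-%m-%dT%H:%M:%SZ'
  time.getutc.strftime(format) # getutc returns a new Time, leaving `time` untouched
end

# Usage, assuming `repository` is a Fieldhand::Repository and `last_run` is a
# Time recorded by the caller:
#   granularity = repository.identify.granularity
#   repository.records(:from => datestamp_for(last_run, granularity))
```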
repository.identifiers
repository.identifiers(:metadata_prefix => 'oai_dc', :from => '2001-01-01')
repository.identifiers(:metadata_prefix => 'oai_dc', :from => Date.new(2001, 1, 1))
repository.identifiers(:set => 'A', :until => Time.utc(2010, 1, 1, 12, 0))
Return an Enumerator for an abbreviated form of records, retrieving only Headers with the given optional arguments.
See Fieldhand::Repository#records for supported arguments.
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
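Because headers are cheaper to fetch than full records, one possible pattern is to list identifiers first and skip anything flagged as deleted. The `live_identifiers` helper below is a hypothetical sketch, not part of Fieldhand; it only uses `identifiers`, `deleted?` and `identifier` as documented here.

```ruby
# Hypothetical helper (not part of Fieldhand): collect the identifiers of all
# headers that are not flagged as deleted. `options` takes the same arguments
# as Fieldhand::Repository#records.
def live_identifiers(repository, options = {})
  repository.identifiers(options).
    reject { |header| header.deleted? }.
    map { |header| header.identifier }
end

# Usage, assuming `repository` is a Fieldhand::Repository:
#   live_identifiers(repository, :set => 'A', :from => '2001-01-01')
```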
repository.get('oai:www.example.com:1')
repository.get('oai:www.example.com:1', :metadata_prefix => 'oai_dc')
#=> #<Fieldhand::Record: ...>
Return an individual metadata Record from a repository with the given identifier and optional :metadata_prefix argument (defaults to oai_dc).
May raise a NetworkError if there is a problem contacting the repository or any descendant ProtocolError if received in response.
A class to represent information about a repository as returned from the Identify request.
repository.identify.name
#=> "Repository Name"
Return a human readable name for the repository as a String.
repository.identify.base_url
#=> #<URI::HTTP http://www.example.com/oai>
Returns the base URL of the repository as a URI.
repository.identify.protocol_version
#=> "2.0"
Returns the version of the OAI-PMH protocol supported by the repository as a String.
repository.identify.earliest_datestamp
#=> 2011-01-01 00:00:00 UTC
repository.identify.earliest_datestamp
#=> #<Date: 2001-01-01 ((2451911j,0s,0n),+0s,2299161j)>
Returns the guaranteed lower limit of all datestamps recording changes, modifications, or deletions in the repository as a Time or Date. Note that the datestamp will be at the finest granularity supported by the repository.
repository.identify.deleted_record
#=> "persistent"
Returns the manner in which the repository supports the notion of deleted records as a String. Legitimate values are no, transient and persistent, with meanings defined in the section on deletion.
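A harvester might branch on this value when deciding how far to trust deletion information. The sketch below is only one possible policy: the `deletion_strategy` helper and the symbols it returns are invented for illustration and are not part of Fieldhand.

```ruby
# Hypothetical helper (not part of Fieldhand): map the value returned by
# repository.identify.deleted_record to a harvesting strategy.
# The returned symbols are made up for this sketch.
def deletion_strategy(deleted_record)
  case deleted_record
  when 'persistent'
    :incremental             # deletions are kept with no time limit
  when 'transient'
    :incremental_with_resync # deletions may not be kept persistently or consistently
  else
    :full_reharvest          # "no": the repository keeps no deletion information
  end
end

# Usage, assuming `repository` is a Fieldhand::Repository:
#   deletion_strategy(repository.identify.deleted_record)
```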
repository.identify.granularity
#=> "YYYY-MM-DDThh:mm:ssZ"
Returns the finest harvesting granularity supported by the repository as a String. The legitimate values are YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ with meanings as defined in ISO 8601.
repository.identify.admin_emails
#=> ["[email protected]"]
Returns the e-mail addresses of administrators of the repository as an Array of Strings.
repository.identify.compression
#=> ["gzip", "deflate"]
Returns the compression encodings supported by the repository as an Array of Strings. The recommended values are those defined for the Content-Encoding header in Section 14.11 of RFC 2616 describing HTTP 1.1.
repository.identify.descriptions
#=> ["<description>..."]
Returns descriptions of this repository as an Array of Strings.
As descriptions can be in any format, Fieldhand doesn't attempt to parse descriptions but leaves parsing to the client.
repository.identify.response_date
#=> 2017-05-08 11:21:38 +0100
Return the time and date that the response was sent.
A class to represent a metadata format available from a repository.
repository.metadata_formats.first.prefix
#=> "oai_dc"
Return the prefix of the metadata format to be used when requesting records as a String.
repository.metadata_formats.first.schema
#=> #<URI::HTTP http://www.openarchives.org/OAI/2.0/oai_dc.xsd>
Return the location of an XML Schema describing the format as a URI.
repository.metadata_formats.first.namespace
#=> #<URI::HTTP http://www.openarchives.org/OAI/2.0/oai_dc/>
Return the XML Namespace URI for the format as a URI.
repository.metadata_formats.first.response_date
#=> 2017-05-08 11:21:38 +0100
Return the time and date that the response was sent.
A class representing an optional construct for grouping items for the purpose of selective harvesting.
repository.sets.first.spec
#=> "A"
Return the unique identifier for the set, which is also the path from the root of the set hierarchy to the respective node, as a String.
repository.sets.first.name
#=> "Set A."
Return a short human-readable String naming the set.
repository.sets.first.descriptions
#=> ["<setDescription>..."]
Return an Array of Strings of any optional and repeatable containers that may hold community-specific XML-encoded data about the set.
repository.sets.first.response_date
#=> 2017-05-08 11:21:38 +0100
Return the time and date that the response was sent.
A class representing a record from the repository:
A record is metadata expressed in a single format.
repository.records.first.deleted?
#=> true
Return whether or not a record is deleted as a Boolean.
repository.records.first.status
#=> "deleted"
Return the optional status attribute of the record's header as a String or nil.
[A] value of deleted indicates the withdrawal of availability of the specified metadata format for the item, dependent on the repository support for deletions.
repository.records.first.identifier
#=> "oai:www.example.com:1"
Return the unique identifier for this record in the repository.
repository.records.first.datestamp
#=> 2011-03-03 16:29:24 UTC
Return the date of creation, modification or deletion of the record for the purpose of selective harvesting as a Time or Date depending on the granularity of the repository.
repository.records.first.sets
#=> ["A", "B"]
Return an Array of String set specs indicating set memberships of this record.
repository.records.first.to_xml
#=> "<record><metadata>...</metadata></record>"
Return the record as a String of XML.
repository.records.first.metadata
#=> "<metadata>..."
Return a single manifestation of the metadata from a record as a String or nil if this is a deleted record.
As the metadata can be in any format supported by the repository, Fieldhand doesn't attempt to parse the metadata but leaves parsing to the client.
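As an illustration of client-side parsing, the sketch below pulls Dublin Core titles out of an `oai_dc` metadata String using REXML. The `dc_titles` helper is hypothetical and not part of Fieldhand, and REXML is just one choice of XML parser; the sketch assumes the metadata uses the standard Dublin Core elements namespace.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Fieldhand): extract dc:title values from
# the XML String returned by Record#metadata. Returns an empty Array for
# deleted records, whose metadata is nil.
def dc_titles(metadata_xml)
  return [] if metadata_xml.nil?

  # Assumption: titles live in the Dublin Core elements namespace.
  namespaces = { 'dc' => 'http://purl.org/dc/elements/1.1/' }
  document = REXML::Document.new(metadata_xml)

  REXML::XPath.match(document, '//dc:title', namespaces).map { |title| title.text }
end

# Usage, assuming `repository` is a Fieldhand::Repository:
#   repository.records(:metadata_prefix => 'oai_dc').each do |record|
#     puts dc_titles(record.metadata)
#   end
```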
repository.records.first.about
#=> ["<about>..."]
Return an Array of Strings of any optional and repeatable containers holding data about the metadata part of the record.
repository.records.first.response_date
#=> 2017-05-08 11:21:38 +0100
Return the time and date that the response was sent.
A class representing the header of a record:
Contains the unique identifier of the item and properties necessary for selective harvesting. The header consists of the following parts:
- the unique identifier -- the unique identifier of an item in a repository;
- the datestamp -- the date of creation, modification or deletion of the record for the purpose of selective harvesting.
- zero or more setSpec elements -- the set membership of the item for the purpose of selective harvesting.
- an optional status attribute with a value of deleted indicates the withdrawal of availability of the specified metadata format for the item, dependent on the repository support for deletions.
repository.identifiers.first.deleted?
#=> true
Return whether or not a record is deleted as a Boolean.
repository.identifiers.first.status
#=> "deleted"
Return the optional status attribute of the header as a String or nil.
[A] value of deleted indicates the withdrawal of availability of the specified metadata format for the item, dependent on the repository support for deletions.
repository.identifiers.first.identifier
#=> "oai:www.example.com:1"
Return the unique identifier for this record in the repository.
repository.identifiers.first.datestamp
#=> 2011-03-03 16:29:24 UTC
Return the date of creation, modification or deletion of the record for the purpose of selective harvesting as a Time or Date depending on the granularity of the repository.
repository.identifiers.first.sets
#=> ["A", "B"]
Return an Array of String set specs indicating set memberships of this record.
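Since each header carries its set memberships, headers can be indexed by set spec without fetching full records. The `identifiers_by_set` helper below is a hypothetical sketch, not part of Fieldhand; it accepts any enumerable of objects responding to `sets` and `identifier`.

```ruby
# Hypothetical helper (not part of Fieldhand): build a Hash mapping each set
# spec to the identifiers of the headers that belong to it.
def identifiers_by_set(headers)
  index = Hash.new { |hash, spec| hash[spec] = [] }

  headers.each do |header|
    header.sets.each { |spec| index[spec] << header.identifier }
  end

  index
end

# Usage, assuming `repository` is a Fieldhand::Repository:
#   identifiers_by_set(repository.identifiers)
```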
repository.identifiers.first.response_date
#=> 2017-05-08 11:21:38 +0100
Return the time and date that the response was sent.
An error (descended from StandardError) to represent any network issues encountered during interaction with the repository. Any underlying exception is exposed in Ruby 2.1 onwards through Exception#cause.
An error (descended from NetworkError) to represent any issues in the response from the repository.
If the HTTP request is not successful (returning a status code other than 200), a ResponseError exception will be raised containing the error message and the response object.
begin
repository.records.each do |record|
# ...
end
rescue Fieldhand::ResponseError => e
puts e.response
#=> #<Net::HTTPServiceUnavailable 503 Service Unavailable readbody=true>
end
Returns the unsuccessful Net::HTTPResponse that caused this error.
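The response object can be inspected to decide how to react to a failure. The sketch below is one possible policy, not something Fieldhand provides: the `transient_failure?` helper is hypothetical, and treating every 5xx status as worth retrying later is an assumption of this example.

```ruby
# Hypothetical helper (not part of Fieldhand): guess whether an error is worth
# retrying later, based on the HTTP status of the response it carries.
# Errors without a response (e.g. other NetworkErrors) are not classified.
def transient_failure?(error)
  return false unless error.respond_to?(:response)

  status = error.response.code.to_i # Net::HTTPResponse#code is a String such as "503"
  status >= 500 && status < 600
end

# Usage, assuming the fieldhand gem is installed and `repository` is a
# Fieldhand::Repository:
#   begin
#     repository.records.each { |record| ... }
#   rescue Fieldhand::ResponseError => e
#     raise unless transient_failure?(e)
#     # schedule another attempt later
#   end
```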
The parent error class (descended from StandardError) for any errors returned by a repository as defined in the protocol's Error and Exception Conditions.
This can be used to rescue all the following child error types.
The request includes illegal arguments, is missing required arguments, includes a repeated argument, or values for arguments have an illegal syntax.
The value of the resumptionToken argument is invalid or expired.
The value of the verb argument is not a legal OAI-PMH verb, the verb argument is missing, or the verb argument is repeated.
The metadata format identified by the value given for the metadataPrefix argument is not supported by the item or by the repository.
The value of the identifier argument is unknown or illegal in this repository.
The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.
There are no metadata formats available for the specified item.
The repository does not support sets.
- Example XML responses are taken from Datacite's OAI-PMH repository.
- Null device detection is based on the implementation from the backports gem.
- Much of the documentation relies on wording from version 2.0 of The Open Archives Initiative Protocol for Metadata Harvesting.
Copyright © 2017-2019 Altmetric and Paul Mucur
Distributed under the MIT License.