Provide conversion utility that could create a SIMPLE_CSV format file from the more complete DWCA format download #121

Open
@nickynicolson

Description

Assuming that a SIMPLE_CSV format download is a subset of the DWCA download - as per the GBIF download FAQ entry:

CSV: Tab delimited CSV. Only contains the data after GBIF interpretation. No multimedia included. More information about CSV
Darwin Core Archive: The Darwin Core Archive (DwC-A) contains both the original data as publisher provided it and the GBIF interpretation. Links (but not files) to multimedia included. More information about DwC-A

... then it would be good to have a utility that could create a SIMPLE_CSV format file from the larger DWCA.

Rationale: a user develops a script that needs only minimal data and is therefore designed to operate on SIMPLE_CSV format input. Another user of the script has a pre-existing DWCA format download and wants to use it as input to the script without having to create another download, so they need a way to slim the DWCA down to the SIMPLE_CSV set of fields.

Is there a GBIF metadata service that returns the fieldnames used in each of these download formats which could help? If so, pygbif could provide access to this and a column rename mapping (if required).
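
A minimal sketch of what such a utility could look like, assuming the archive holds a tab-delimited occurrence.txt and that the requested columns are present in it. The function name slim_dwca and the SIMPLE_FIELDS list are hypothetical (an illustrative subset, not the authoritative SIMPLE_CSV field list) and are not part of pygbif:

```python
import csv
import io
import zipfile

# Hypothetical, illustrative subset of column names; the authoritative
# SIMPLE_CSV field list would have to come from GBIF.
SIMPLE_FIELDS = ["gbifID", "datasetKey", "scientificName", "countryCode"]


def slim_dwca(dwca_path, out_path, fields=SIMPLE_FIELDS, member="occurrence.txt"):
    """Copy only `fields` from the archive's occurrence table to a tab-delimited file.

    Returns the number of data rows written.
    """
    with zipfile.ZipFile(dwca_path) as archive, archive.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        # QUOTE_NONE: treat quote characters as ordinary data.
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        missing = [f for f in fields if f not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"columns not found in {member}: {missing}")
        with open(out_path, "w", encoding="utf-8", newline="") as out:
            writer = csv.DictWriter(
                out,
                fieldnames=fields,
                delimiter="\t",
                quoting=csv.QUOTE_NONE,
                quotechar=None,
                lineterminator="\n",
                extrasaction="ignore",  # drop every column not in `fields`
            )
            writer.writeheader()
            count = 0
            for row in reader:
                writer.writerow(row)
                count += 1
    return count
```

The sketch only selects columns by name; it does not rename anything and does not add fields that are absent from occurrence.txt.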

Activity

jmbarrios commented on Nov 15, 2023

I believe such a thing is outside the scope of this package. There is also a Python package for manipulating Darwin Core Archives, python-dwca-reader.
This package is developed to work with GBIF DWCA downloads.

nickynicolson commented on Nov 15, 2023

Author

I believe such a thing is outside the scope of this package. There is also a Python package for manipulating Darwin Core Archives, python-dwca-reader. This package is developed to work with GBIF DWCA downloads.

The default download format for occurrences appears to be SIMPLE_CSV:

def download(queries, format = "SIMPLE_CSV", user=None, pwd=None, email=None, pred_type="and"):

jmbarrios commented on Nov 15, 2023

Indeed, the current default parameter is SIMPLE_CSV, and the supported formats are listed here.

A major challenge when working with a DWCA is that this format is not always consistent. Many times, there isn't a common way to map extra tables to one table. For instance, in the case of a multimedia table, you can expect the id field to be present in a multimedia.txt file. However, it could have a one-to-many relation with an occurrence. In such cases, what would be a common mapping strategy from the multimedia table to a plain table?
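
To illustrate the one-to-many problem, here is a sketch of one possible flattening strategy: collapse all multimedia rows for an occurrence into a single delimited string. This is only an assumption about how it could be done, not something GBIF or pygbif does; the helper name collapse_multimedia and the column names gbifID and identifier are assumed for illustration.

```python
from collections import defaultdict


def collapse_multimedia(rows, key="gbifID", value="identifier", sep="|"):
    """Join every `value` that shares the same `key` into one string.

    `rows` is any iterable of dicts, e.g. from csv.DictReader over multimedia.txt.
    Rows with an empty or missing `value` are skipped.
    """
    grouped = defaultdict(list)
    for row in rows:
        if row.get(value):
            grouped[row[key]].append(row[value])
    return {occurrence_id: sep.join(values) for occurrence_id, values in grouped.items()}
```

Any such strategy loses structure (the per-image type, licence and so on), which is why there is no obviously correct default.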

I believe that the SIMPLE_CSV format is focused only on sharing occurrence data, without additional information.

nickynicolson commented on Nov 15, 2023

Author

My request is about occurrences only and relates only to DWCA downloads originating from GBIF (pygbif is about facilitating access to the GBIF API from Python).
As GBIF is able to represent occurrence data in both DWCA and SIMPLE_CSV formats I'd like to be able to convert from DWCA to SIMPLE_CSV.
Out of interest, are you speaking for GBIF @jmbarrios ?

jmbarrios commented on Nov 15, 2023

Out of interest, are you speaking for GBIF @jmbarrios ?

No, I'm not. Also I am not associated with GBIF.

MattBlissett commented on Nov 15, 2023

Member

Hi Nicky,

There's an experimental (not necessarily stable, not documented) API for the columns returned in GBIF downloads:

The SIMPLE_CSV format should be a Simple Darwin Core-compatible file, see §6.1 where these files can be shared without a meta.xml description. It's also a subset of the occurrence file in a DWCA format download, and the column names should be identical — if a CSV reader is referencing columns by name it should work fine with either file.
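
As a rough sketch of what referencing columns by name means in practice, the helper below (count_by_column, a hypothetical name) reads tab-delimited lines with csv.DictReader and never relies on column position, so it should behave the same on a SIMPLE_CSV file and on occurrence.txt from a DWCA download, assuming the column it asks for exists in both:

```python
import csv
from collections import Counter


def count_by_column(lines, column):
    """Count rows per distinct value of `column` in tab-delimited input.

    `lines` is any iterable of text lines, such as an open file. Columns are
    looked up by header name, so extra or reordered columns do not matter.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)
```

For example, `count_by_column(open("occurrence.txt", encoding="utf-8", newline=""), "countryCode")` and the same call on the SIMPLE_CSV file should give matching counts for the same download query.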

@CecSve is maintaining pygbif, but is on leave until the end of January.

nickynicolson commented on Nov 16, 2023

Author

Thanks @MattBlissett - if / when this becomes stable it might be good to consider making it available from the pygbif library.
I did find two fields in SIMPLE_CSV that are not in DWCA: publishingOrgKey and verbatimScientificNameAuthorship

MattBlissett commented on Nov 17, 2023

Member

My mistake, verbatimScientificNameAuthorship should be scientificNameAuthorship from verbatim.txt. publishingOrgKey would need to be retrieved using the API.
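
A minimal sketch of that mapping, assuming occurrence.txt and verbatim.txt rows share a gbifID column that can be used as the join key (the helper name add_verbatim_authorship is hypothetical). publishingOrgKey is deliberately left out, since it would need an API lookup:

```python
def add_verbatim_authorship(occurrence_rows, verbatim_rows, key="gbifID"):
    """Yield occurrence rows with verbatimScientificNameAuthorship filled in.

    The value is taken from scientificNameAuthorship in the verbatim rows,
    matched on `key`. Records without a verbatim match get an empty string.
    Both inputs are iterables of dicts, e.g. from csv.DictReader.
    """
    authorship = {
        row[key]: row.get("scientificNameAuthorship") or ""
        for row in verbatim_rows
    }
    for row in occurrence_rows:
        merged = dict(row)  # copy so the caller's rows are not modified
        merged["verbatimScientificNameAuthorship"] = authorship.get(row[key], "")
        yield merged
```

The lookup table holds one entry per verbatim record, so for very large downloads a streaming join over two files sorted by the key might be preferable.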

      Provide conversion utility that could create a SIMPLE_CSV format file from the more complete DWCA format download · Issue #121 · gbif/pygbif