Description
Assuming that a SIMPLE_CSV format download is a subset of the DWCA download - as per the GBIF download FAQ entry:
CSV: Tab delimited CSV. Only contains the data after GBIF interpretation. No multimedia included. More information about CSV
Darwin Core Archive: The Darwin Core Archive (DwC-A) contains both the original data as publisher provided it and the GBIF interpretation. Links (but not files) to multimedia included. More information about DwC-A
... then it would be good to have a utility that could create a SIMPLE_CSV format file from the larger DWCA.
Rationale: - a user develops a script that needs only minimal data and therefore is designed to operate on SIMPLE_CSV format input. Another user of the script has a pre-existing DWCA format download and wants to use this as input to the script (without having to create another download) - so they need a way to slim down the DWCA to the SIMPLE_CSV set of fields.
Is there a GBIF metadata service that returns the fieldnames used in each of these download formats which could help? If so, pygbif could provide access to this and a column rename mapping (if required).
Activity
jmbarrios commented on Nov 15, 2023
I believe such a thing is outside the scope of this package. Also, there is a Python package for manipulating a DwC Archive: python-dwca-reader.
That package is developed to work with GBIF DWCA downloads.
nickynicolson commented on Nov 15, 2023
The default download format for occurrences appears to be SIMPLE_CSV:
pygbif/pygbif/occurrences/download.py
Line 69 in 9590fcf
jmbarrios commented on Nov 15, 2023
Indeed, the current default parameter is SIMPLE_CSV, and the supported formats are listed here.
A major challenge when working with a DWCA is that this format is not always consistent. Often there is no common way to map the extra tables onto a single table. For instance, in the case of a multimedia table, you can expect the id field to be present in a multimedia.txt file. However, it could have a one-to-many relation with an occurrence. In such cases, what would be a common mapping strategy from the multimedia table to a plain table?
I believe that the SIMPLE_CSV format is focused only on sharing occurrence data, without additional information.
nickynicolson commented on Nov 15, 2023
My request is about occurrences only and relates only to DWCA downloads originating from GBIF (pygbif is about facilitating access to the GBIF API from Python).
As GBIF is able to represent occurrence data in both DWCA and SIMPLE_CSV formats I'd like to be able to convert from DWCA to SIMPLE_CSV.
Out of interest, are you speaking for GBIF @jmbarrios ?
jmbarrios commented on Nov 15, 2023
No, I'm not. Also I am not associated with GBIF.
MattBlissett commented on Nov 15, 2023
Hi Nicky,
There's an experimental (not necessarily stable, not documented) API for the columns returned in GBIF downloads:
The SIMPLE_CSV format should be a Simple Darwin Core-compatible file, see §6.1 where these files can be shared without a meta.xml description. It's also a subset of the occurrence file in a DWCA format download, and the column names should be identical; if a CSV reader is referencing columns by name it should work fine with either file.
@CecSve is maintaining pygbif, but is on leave until the end of January.
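As a rough illustration of reading columns by name, the sketch below applies the same reader to a SIMPLE_CSV-style table and to a wider DWCA-style occurrence table. The helper name `select_columns` and the sample rows are made up for this example, and it assumes both files are tab-delimited with a header row and no quoting.

```python
import csv
import io


def select_columns(handle, wanted):
    """Yield one dict per row, keeping only the columns named in `wanted`.

    `select_columns` is a hypothetical helper, not part of pygbif.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [name for name in wanted if name not in header]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    for row in reader:
        yield {name: row[name] for name in wanted}


# Made-up sample data: the second table stands in for occurrence.txt
# from a DWCA download, which carries extra columns.
simple_csv = io.StringIO(
    "gbifID\tscientificName\tcountryCode\n"
    "1\tQuercus robur L.\tGB\n"
)
dwca_occurrence = io.StringIO(
    "gbifID\tlicense\tscientificName\tcountryCode\n"
    "1\tCC0_1_0\tQuercus robur L.\tGB\n"
)

wanted = ["gbifID", "scientificName", "countryCode"]
rows_simple = list(select_columns(simple_csv, wanted))
rows_dwca = list(select_columns(dwca_occurrence, wanted))
```

Because the rows are looked up by header name rather than by position, the extra column in the wider table does not affect the result.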
nickynicolson commented on Nov 16, 2023
Thanks @MattBlissett - if / when this becomes stable it might be good to consider making it available from the pygbif library.
I did find two fields in SIMPLE_CSV that are not in DWCA: publishingOrgKey and verbatimScientificNameAuthorship
MattBlissett commented on Nov 17, 2023
My mistake, verbatimScientificNameAuthorship should be scientificNameAuthorship from verbatim.txt. publishingOrgKey would need to be retrieved using the API.
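Putting the thread together, a conversion could look roughly like the sketch below. It is only an outline under several assumptions: the function name `slim_occurrences` is hypothetical, the SIMPLE_CSV header is supplied by the caller rather than fetched from any GBIF service, both tables are tab-delimited without quoting and share a gbifID column, and publishingOrgKey is left empty because it is not in the archive.

```python
import csv
import io


def slim_occurrences(occurrence, verbatim, simple_columns, out):
    """Write a SIMPLE_CSV-style subset of a DWCA occurrence table to `out`.

    occurrence, verbatim: text handles for occurrence.txt and verbatim.txt.
    simple_columns: the SIMPLE_CSV header, supplied by the caller.
    Hypothetical helper, not part of pygbif.
    """
    # Look up scientificNameAuthorship from verbatim.txt by gbifID.
    verbatim_reader = csv.DictReader(
        verbatim, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    authorship = {
        row["gbifID"]: row.get("scientificNameAuthorship") or ""
        for row in verbatim_reader
    }

    reader = csv.DictReader(occurrence, delimiter="\t", quoting=csv.QUOTE_NONE)
    out.write("\t".join(simple_columns) + "\n")
    for row in reader:
        row["verbatimScientificNameAuthorship"] = authorship.get(row["gbifID"], "")
        # publishingOrgKey is absent from the archive, so it stays empty here;
        # filling it would need a separate API lookup.
        out.write("\t".join(row.get(name) or "" for name in simple_columns) + "\n")


# Made-up sample data standing in for the two archive files.
occurrence = io.StringIO(
    "gbifID\tlicense\tscientificName\n"
    "1\tCC0_1_0\tQuercus robur L.\n"
)
verbatim = io.StringIO(
    "gbifID\tscientificNameAuthorship\n"
    "1\tL.\n"
)
simple_columns = [
    "gbifID",
    "scientificName",
    "verbatimScientificNameAuthorship",
    "publishingOrgKey",
]
out = io.StringIO()
slim_occurrences(occurrence, verbatim, simple_columns, out)
```

For a real download, the two handles could come from `zipfile.ZipFile(...).open(...)` wrapped in `io.TextIOWrapper`; the file names inside the archive are assumed here to be occurrence.txt and verbatim.txt, as mentioned in the thread.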