CSL metadata incomplete when there's a consortium in the author list #158

Open
jmonlong opened this issue Oct 22, 2019 · 10 comments

@jmonlong

We noticed that some references didn't format well and that it always happened when there was a consortium in the author list. Also, the consortium name was missing.

There is always the solution of fixing them manually. After doing that I realized that using URLs (@url:https://doi.org/DOI) works too, like for the bioRxiv problem (#16).

Example:

> manubot cite doi:10.1038/ng.3834
[
  {
    "publisher": "Springer Science and Business Media LLC",
    "issue": "5",
    "DOI": "10.1038/ng.3834",
    "type": "article-journal",
    "page": "692-699",
    "source": "Crossref",
    "title": "The impact of structural variation on human gene expression",
    "volume": "49",
    "author": [
      {
        "given": "Colby",
        "family": "Chiang"
      },
      {},
      {
        "given": "Alexandra J",
        "family": "Scott"
      },
      {
        "given": "Joe R",
        "family": "Davis"
      },
      {
        "given": "Emily K",
        "family": "Tsang"
      },
      {
        "given": "Xin",
        "family": "Li"
      },
      {
        "given": "Yungil",
        "family": "Kim"
      },
      {
        "given": "Tarik",
        "family": "Hadzic"
      },
      {
        "given": "Farhan N",
        "family": "Damani"
      },
      {
        "given": "Liron",
        "family": "Ganel"
      },
      {
        "given": "Stephen B",
        "family": "Montgomery"
      },
      {
        "given": "Alexis",
        "family": "Battle"
      },
      {
        "given": "Donald F",
        "family": "Conrad"
      },
      {
        "given": "Ira M",
        "family": "Hall"
      }
    ],
    "container-title": "Nature Genetics",
    "language": "en",
    "issued": {
      "date-parts": [
        [
          2017,
          4,
          3
        ]
      ]
    },
    "URL": "https://doi.org/f9xvr6",
    "container-title-short": "Nat Genet",
    "PMCID": "PMC5406250",
    "PMID": "28369037",
    "id": "2gpKwL67"
  }
]

The consortium is missing, and the empty {} left in its place as the second author breaks the final output.

> manubot cite url:https://doi.org/10.1038/ng.3834
[
  {
    "id": "1Aom77Is5",
    "type": "article-journal",
    "title": "The impact of structural variation on human gene expression",
    "container-title": "Nature Genetics",
    "page": "692-699",
    "volume": "49",
    "issue": "5",
    "source": "www.nature.com",
    "abstract": "Structural variants (SVs) are an important source of human genetic diversity, but their contribution to traits, disease and gene regulation remains unclear. We mapped cis expression quantitative trait loci (eQTLs) in 13 tissues via joint analysis of SVs, single-nucleotide variants (SNVs) and short insertion/deletion (indel) variants from deep whole-genome sequencing (WGS). We estimated that SVs are causal at 3.5–6.8% of eQTLs—a substantially higher fraction than prior estimates—and that expression-altering SVs have larger effect sizes than do SNVs and indels. We identified 789 putative causal SVs predicted to directly alter gene expression: most (88.3%) were noncoding variants enriched at enhancers and other regulatory elements, and 52 were linked to genome-wide association study loci. We observed a notable abundance of rare high-impact SVs associated with aberrant expression of nearby genes. These results suggest that comprehensive WGS-based SV analyses will increase the power of common- and rare-variant association studies.",
    "URL": "https://www.nature.com/articles/ng.3834",
    "DOI": "10.1038/ng.3834",
    "ISSN": "1546-1718",
    "language": "en",
    "author": [
      {
        "family": "Chiang",
        "given": "Colby"
      },
      {
        "family": "Scott",
        "given": "Alexandra J."
      },
      {
        "family": "Davis",
        "given": "Joe R."
      },
      {
        "family": "Tsang",
        "given": "Emily K."
      },
      {
        "family": "Li",
        "given": "Xin"
      },
      {
        "family": "Kim",
        "given": "Yungil"
      },
      {
        "family": "Hadzic",
        "given": "Tarik"
      },
      {
        "family": "Damani",
        "given": "Farhan N."
      },
      {
        "family": "Ganel",
        "given": "Liron"
      },
      {
        "family": "GTEx Consortium",
        "given": ""
      },
      {
        "family": "Montgomery",
        "given": "Stephen B."
      },
      {
        "family": "Battle",
        "given": "Alexis"
      },
      {
        "family": "Conrad",
        "given": "Donald F."
      },
      {
        "family": "Hall",
        "given": "Ira M."
      }
    ],
    "issued": {
      "date-parts": [
        [
          "2017",
          5
        ]
      ]
    },
    "accessed": {
      "date-parts": [
        [
          "2019",
          10,
          22
        ]
      ]
    }
  }
]

Looks good.

@dhimmel
Member

dhimmel commented Oct 23, 2019

manubot cite doi:10.1038/ng.3834 uses DOI Content Negotiation. You can see the raw JSON we receive, which we then process into valid CSL JSON, with:

curl --silent --location \
  --header "Accept: application/vnd.citationstyles.csl+json" \
  https://doi.org/10.1038/ng.3834 \
  | python -m json.tool

You'll notice the following:

    "author": [
        {
            "given": "Colby",
            "family": "Chiang",
            "sequence": "first",
            "affiliation": []
        },
        {
            "name": "GTEx Consortium",
            "sequence": "first",
            "affiliation": []
        },

The mistake is that "name": "GTEx Consortium" should be "literal": "GTEx Consortium", since "literal" is the field defined by the CSL JSON specification. @gbilder, what is the correct place to report issues related to Crossref's DOI content negotiation?
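The same check can be sketched in Python. This is only an illustration under stated assumptions: build_csl_request, fetch_csl_item, and authors_without_csl_name are hypothetical helper names, not part of manubot, and the set of name keys is a deliberately small subset (family, given, literal) chosen for this example.

```python
import json
import urllib.request

# Assumption: an author entry is usable if it has at least one of these keys.
CSL_NAME_KEYS = {"family", "given", "literal"}


def build_csl_request(doi):
    """Build a DOI content negotiation request for CSL JSON (no network call)."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


def fetch_csl_item(doi):
    """Fetch the raw CSL JSON for a DOI. Requires network access."""
    with urllib.request.urlopen(build_csl_request(doi)) as response:
        return json.load(response)


def authors_without_csl_name(csl_item):
    """Return the author entries that carry none of the CSL name keys."""
    return [
        author
        for author in csl_item.get("author", [])
        if CSL_NAME_KEYS.isdisjoint(author)
    ]
```

Calling authors_without_csl_name(fetch_csl_item("10.1038/ng.3834")) should surface entries such as the "name"-only consortium shown above, assuming the service still returns that record in the same shape.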

After doing that I realized that using URLs (@url:https://doi.org/DOI) works too

Nice workaround... it ends up using a Zotero translator that is likely designed specifically for the Nature website.

@jmonlong
Author

I'm not sure I understand. Just to clarify, in practice all the references containing a Consortium didn't format well in the final output of the manuscript (when using @doi:). The problematic references were from Nature journals but also GigaScience and bioRxiv.

I didn't know if this issue fit better here or in the rootstock repo.

This was for this manuscript.

@dhimmel
Member

dhimmel commented Oct 24, 2019

https://jmonlong.github.io/manu-vgsv/ is looking great!

didn't know if this issue fit better here or in the rootstock repo.

This is correct! The code that actually downloads citation metadata lives here.

Just to clarify, in practice all the references containing a Consortium didn't format well in the final output of the manuscript (when using @doi:)

Yes, the issue (as I understand it) is that all of these DOIs are registered with Crossref. We therefore retrieve the metadata from Crossref using "DOI Content Negotiation". We request the metadata in CSL JSON format, but Crossref returns it in a pseudo CSL JSON format with lots of fields that don't meet the CSL JSON specification (past examples at CrossRef/rest-api-doc#222 (comment)). In this case, Crossref should put the consortium name in a field called "literal" rather than "name".

As far as solutions go:

  1. have Crossref fix how they convert consortiums to CSL JSON authors to use the "literal" field name. See crosscite/content-negotiation#92 ("Cross DOI Content Negotiation: use CSL JSON's author.literal field for consortiums").

  2. write a special case workaround in the manubot package to convert "name" to "literal" for these CSL JSON

  3. Switch our default protocol for retrieving DOI metadata to use Zotero's translators. This bypasses the dirty Crossref CSL JSON, but increases the number of services that DOI metadata retrieval depends on. This will get easier with the addition of #161 ("Add get_doi_csl_item_zotero functionality").

  4. Update the docs to suggest the URL citation workaround for bad DOI metadata.
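Option 2 could look roughly like the sketch below. It is an assumption about how such a workaround might be written, not code from the manubot package; name_to_literal is a hypothetical function name, and only the "author" list is handled.

```python
import copy


def name_to_literal(csl_item):
    """Return a copy of a CSL item whose authors use "literal" instead of "name"."""
    fixed = copy.deepcopy(csl_item)
    for author in fixed.get("author", []):
        # Only touch entries that have "name" and do not already have "literal".
        if "name" in author and "literal" not in author:
            author["literal"] = author.pop("name")
    return fixed
```

Working on a deep copy keeps the raw Crossref response untouched, which makes it easier to compare the data before and after the fix. Other name lists (for example editors) would need the same treatment if they show the same problem.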

Here is some prototype code for option 3, showing that it would work for this case (not perfectly, but enough):

>>> import manubot.cite.zotero
>>> zotero_data = manubot.cite.zotero.search_query('doi:10.1038/ng.3834')
>>> csl_json_data = manubot.cite.zotero.export_as_csl(zotero_data)
>>> csl_json_data[0]['author'][:2]
[{'family': 'GTEx Consortium', 'given': ''}, {'family': 'Chiang', 'given': 'Colby'}]

@dhimmel
Member

dhimmel commented Feb 21, 2020

@jmonlong I opened a pull request at #206 that should fix this issue. It switches the service we use to convert DOI metadata to CSL JSON. The new service does better at properly handling the consortium author name. See the new raw data in https://data.crosscite.org/application/vnd.citationstyles.csl+json/10.1038/ng.3834. The "GTEx Consortium" will now show up properly in the reference list, although it won't be in the right place. However, this is due to improperly submitted metadata.

The problematic references were from Nature journals but also GigaScience and bioRxiv.

If you happen to know those DOIs, it'd be helpful for us to see if we're now able to generate metadata correctly.

I found those DOIs and commented on them at crosscite/content-negotiation#92 (comment). The consortium name in https://doi.org/10.1101/664623 is properly parsed now, but https://doi.org/10.1186/s13742-015-0103-4 still has trouble using DOI content negotiation.

@twrightsman
Contributor

twrightsman commented Jun 23, 2022

This unfortunately still seems to be an issue for me with certain DOIs with consortium authors. The resulting references don't have any authors since the consortium is the only author. @dhimmel could we reopen this or is there a configuration option I'm missing?

$ manubot cite 10.1038/nature08747
[
  {
    "publisher": "Springer Science and Business Media LLC",
    "issue": "7282",
    "DOI": "10.1038/nature08747",
    "type": "article-journal",
    "page": "763-768",
    "source": "Crossref",
    "title": "Genome sequencing and analysis of the model grass Brachypodium distachyon",
    "volume": "463",
    "author": [
      {}
    ],
    "container-title": "Nature",
    "language": "en",
    "issued": {
      "date-parts": [
        [
          2010,
          2
        ]
      ]
    },
    "URL": "https://doi.org/d6n7pw",
    "container-title-short": "Nature",
    "PMID": "20148030",
    "id": "BHb1o1Tm",
    "note": "This CSL Item was generated by Manubot v0.5.2 from its persistent identifier (standard_id).\nstandard_id: doi:10.1038/nature08747"
  }
]
$ curl --silent https://api.crossref.org/v1/works/10.1038/nature08747 | jq .message.author
[
  {
    "name": "The International Brachypodium Initiative",
    "sequence": "first",
    "affiliation": []
  }
]

Also shows up for this DOI:

$ curl --silent https://api.crossref.org/v1/works/10.1038/nature06148 | jq .message.author
[
  {
    "name": "The French–Italian Public Consortium for Grapevine Genome Characterization",
    "sequence": "first",
    "affiliation": []
  }
]

@agitter
Member

agitter commented Jun 23, 2022

@twrightsman what author lists are you expecting to see for these two examples? The single consortium author matches what is shown at the publisher's site:
[Two screenshots of the author lists on the publisher's site]

For the second article, PubMed has a different author list that you can obtain with manubot cite pmid:17721507

@twrightsman
Contributor

twrightsman commented Jun 24, 2022

@agitter When manubot processes the JSON metadata the "name" key under authors gets dropped because it is invalid in the CSL schema, if I am understanding the initial conversation in the issue correctly.

See this section of the manubot cite 10.1038/nature08747 output for the Brachypodium genome paper:

"author": [
      {}
],

Even though the Brachypodium genome paper raw JSON has the author:

[
  {
    "name": "The International Brachypodium Initiative",
    "sequence": "first",
    "affiliation": []
  }
]

This ends up showing as a blank author list in my rendered manubot manuscript.
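The effect can be reproduced with a small sketch. This is not manubot's actual implementation, just an illustration of schema-style pruning; prune_author is a hypothetical name, and the allowed-key set is an assumption transcribed from the name fields of the CSL JSON schema, so it should be checked against the schema itself.

```python
# Assumption: keys permitted on a CSL JSON name object.
ALLOWED_NAME_KEYS = {
    "family",
    "given",
    "dropping-particle",
    "non-dropping-particle",
    "suffix",
    "comma-suffix",
    "static-ordering",
    "literal",
    "parse-names",
}


def prune_author(author):
    """Drop every key that is not a permitted CSL name key."""
    return {key: value for key, value in author.items() if key in ALLOWED_NAME_KEYS}


raw_author = {
    "name": "The International Brachypodium Initiative",
    "sequence": "first",
    "affiliation": [],
}
# None of the three keys is permitted, so nothing survives the pruning.
pruned = prune_author(raw_author)
```

Under these assumptions, pruned is an empty dict, which matches the {} seen in the manubot cite output.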

@agitter
Member

agitter commented Jun 24, 2022

Thanks. I understand the issue now.

@agitter
Member

agitter commented Jun 24, 2022

I believe this is related to the changes in #319.

I'll note that for 10.1038/nature08747, the workaround of using Zotero isn't helpful. Zotero puts all of the "Consortia" information into the author list, so authors are listed multiple times and contribution types are included as authors:

$ manubot cite https://doi.org/10.1038/nature08747
...
      {
        "family": "Baxter",
        "given": "Ivan"
      },
      {
        "family": "The International Brachypodium Initiative",
        "given": ""
      },
      {
        "family": "Principal investigators",
        "given": ""
      },
      {
        "family": "DNA sequencing and assembly",
        "given": ""
      },
      {
        "family": "Pseudomolecule assembly and BAC end sequencing",
        "given": ""
      },
      {
        "family": "Transcriptome sequencing and analysis",
        "given": ""
      },
      {
        "family": "Gene analysis and annotation",
        "given": ""
      },
      {
        "family": "Repeats analysis",
        "given": ""
      },
      {
        "family": "Comparative genomics",
        "given": ""
      },
      {
        "family": "Small RNA analysis",
        "given": ""
      },
      {
        "family": "Manual annotation and gene family analysis",
        "given": ""
      }
    ],

@dhimmel should we reconsider this possible solution you listed earlier:

write a special case workaround in the manubot package to convert "name" to "literal" for these CSL JSON

agitter reopened this Jun 24, 2022

@agitter
Member

agitter commented May 26, 2023

I'm following up on this issue with a workaround that has been helpful for some of my citations involving consortium authors that have broken metadata. Using Zotero to fetch the metadata from the PubMed URL works in many of my examples. For the paper above:

$ manubot cite https://www.ncbi.nlm.nih.gov/pubmed/20148030
[
  {
    "id": "1GGiDD9ah",
    "type": "article-journal",
    "abstract": "Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be sequenced. Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy and food crops.",
    "container-title": "Nature",
    "DOI": "10.1038/nature08747",
    "ISSN": "1476-4687",
    "issue": "7282",
    "journalAbbreviation": "Nature",
    "language": "eng",
    "note": "PMID: 20148030\nThis CSL Item was generated by Manubot v0.5.5 from its persistent identifier (standard_id).\nstandard_id: url:https://www.ncbi.nlm.nih.gov/pubmed/20148030",
    "page": "763-768",
    "source": "PubMed",
    "title": "Genome sequencing and analysis of the model grass Brachypodium distachyon",
    "volume": "463",
    "author": [
      {
        "literal": "International Brachypodium Initiative"
      }
    ],
    "issued": {
      "date-parts": [
        [
          "2010",
          2,
          11
        ]
      ]
    },
    "URL": "https://www.ncbi.nlm.nih.gov/pubmed/20148030"
  }
]

This works even though the PubMed citation does not (the literal author is still missing):

$ manubot cite pubmed:20148030
[
  {
    "title": "Genome sequencing and analysis of the model grass Brachypodium distachyon.",
    "volume": "463",
    "issue": "7282",
    "page": "763-8",
    "container-title": "Nature",
    "container-title-short": "Nature",
    "ISSN": "1476-4687",
    "issued": {
      "date-parts": [
        [
          2010,
          2,
          11
        ]
      ]
    },
    "author": [
      {}
    ],
    "PMID": "20148030",
    "DOI": "10.1038/nature08747",
    "abstract": "Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be sequenced. Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy and food crops.",
    "URL": "https://www.ncbi.nlm.nih.gov/pubmed/20148030",
    "type": "article-journal",
    "id": "pXInMU2w",
    "note": "This CSL Item was generated by Manubot v0.5.5 from its persistent identifier (standard_id).\nstandard_id: pubmed:20148030"
  }
]
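
Since the result depends on which identifier is used, one way to automate the choice is sketched below. It assumes the CSL items have already been fetched (for example by parsing the JSON printed by manubot cite for each citekey); has_named_authors and first_usable are hypothetical helper names, not manubot functions.

```python
def has_named_authors(csl_item):
    """True if the author list is non-empty and every entry carries a name."""
    authors = csl_item.get("author", [])
    return bool(authors) and all(
        any(author.get(key) for key in ("family", "given", "literal"))
        for author in authors
    )


def first_usable(candidates):
    """Return the first citekey whose CSL item has a complete author list.

    candidates is an iterable of (citekey, csl_item) pairs in order of preference.
    """
    for citekey, csl_item in candidates:
        if has_named_authors(csl_item):
            return citekey
    return None
```

With the three results from this thread for the Brachypodium paper (the DOI, pubmed:20148030, and the PubMed URL), only the PubMed URL item passes the check. Note that this only detects empty author entries; it would not catch the Zotero case above where section headings appear as authors.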
