Documentation for SWAG contradicts itself when constructing the first sentence. #35095

@bauwenst


System Info

Not relevant.

Who can help?

@stevhliu @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The docs for multiple choice use SWAG as an example, which is the task of selecting the next sentence given a context. Somewhat strangely, rather than being given in the format (sentence1, [sentence2a, sentence2b, sentence2c, sentence2d]), the dataset is given in the format (sentence1, sentence2_start, [sentence2_endA, sentence2_endB, sentence2_endC, sentence2_endD]).

The code given in the docs effectively converts the dataset into the first format: sentence 1 is kept intact, and the start of sentence 2 is prepended to each ending:

    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

Yet, the docs say:

The preprocessing function you want to create needs to:
1. Make four copies of the `sent1` field and combine each of them with `sent2` to recreate how a sentence starts.
2. Combine `sent2` with each of the four possible sentence endings.

What is being described is formatting the dataset as (sentence1 + sentence2_start, [sentence2_start + sentence2_endA, sentence2_start + sentence2_endB, sentence2_start + sentence2_endC, sentence2_start + sentence2_endD]), where there is overlap between the first and the second sentence (namely sentence2_start).
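To make the difference concrete, here is a minimal sketch that runs both readings on a single made-up example. The toy sentences and the two helper function names are hypothetical, and `ending_names` is assumed to hold the four ending column names (`ending0` to `ending3`); only the list comprehensions come from the docs and from the description quoted above.

```python
# Toy batch in the SWAG column layout; the sentences are made up for illustration.
examples = {
    "sent1": ["A man walks into a kitchen."],
    "sent2": ["He"],
    "ending0": ["opens the fridge."],
    "ending1": ["flies to the moon."],
    "ending2": ["paints the ceiling."],
    "ending3": ["reads a newspaper."],
}

# Assumption: these are the four ending columns the docs iterate over.
ending_names = ["ending0", "ending1", "ending2", "ending3"]


def build_pairs_as_coded(examples):
    """What the code in the docs does: sent1 is left untouched."""
    first = [[context] * 4 for context in examples["sent1"]]
    second = [
        [f"{header} {examples[end][i]}" for end in ending_names]
        for i, header in enumerate(examples["sent2"])
    ]
    return first, second


def build_pairs_as_described(examples):
    """What the prose describes: sent2 is also appended to sent1."""
    first = [
        [f"{s1} {s2_start}"] * 4
        for s1, s2_start in zip(examples["sent1"], examples["sent2"])
    ]
    second = [
        [f"{s2_start} {examples[end][i]}" for end in ending_names]
        for i, s2_start in enumerate(examples["sent2"])
    ]
    return first, second


coded_first, coded_second = build_pairs_as_coded(examples)
described_first, described_second = build_pairs_as_described(examples)

print(coded_first[0][0])      # A man walks into a kitchen.
print(described_first[0][0])  # A man walks into a kitchen. He
print(coded_second[0][0])     # He opens the fridge.
```

The second sentences are identical in both readings; only the first sentence differs, and in the described variant the `sent2` text appears on both sides of the pair.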

Expected behavior

Either the code is wrong or the description is wrong.

If the description is wrong, it should be:

The preprocessing function you want to create needs to:

  1. Make four copies of the `sent1` field.
  2. Combine `sent2` with each of the four possible sentence endings.

If the code is wrong, it should be:

    first_sentences = [[f"{s1} {s2_start}"] * 4 for s1, s2_start in zip(examples["sent1"], examples["sent2"])]
    second_sentences = [
        [f"{s2_start} {examples[end][i]}" for end in ending_names] for i, s2_start in enumerate(examples["sent2"])
    ]
