ogre_pad


🚀 Welcome to the Recap Human Eval Project!
📝 This is an introductory course to prepare you for this
exciting project, where you will evaluate an AI meeting
assistant that helps the user summarize meeting
transcripts!

Please have the project instructions open while you go through the course!
https://docs.google.com/document/d/e/2PACX-1vS1wkEjG3iZrJiLu7yx1euxT6-n_8Rfuqwo8jqQ7G1MoQnG2_TDSdPqr837aWHbPbCGyfFtwIV24uX-/pub
📚 Task Workflow
Here is a simple overview of the task:
🗂️You will be provided with a meeting transcript and 2 AI-generated recaps of the transcript. Your task is to evaluate these recaps and decide which one is better.
ONLY PROCEED IF YOU FULLY READ THE 5 TASK STEPS
BELOW.

Step 1: Read the meeting transcript.


Step 2: Rate the transcript in terms of how well you can understand it.
Step 3: Read recap A and evaluate it along multiple dimensions.
Step 4: Read recap B and evaluate it along multiple dimensions.
Step 5: Complete the Side by Side score between A and B, and explain your choice!
📤 So you will rate 2 different AI-generated recaps based on a provided meeting transcript!
🤖 What is a "Recap"?

You are presented with a full meeting transcript and its corresponding Recap from two different models. Your task is to evaluate the usefulness of the notes, identify any issues, and ultimately choose the one that is better.
Please read through the transcript carefully first
and try writing down all of the main topics covered
that you believe to be important to help someone who
was not in the meeting to catch up.
Each paragraph has a first sentence that states a key
topic from the discussion, then several talking points
from that discussion and, if applicable, an outcome
sentence. Definitions for “Key Topic”, “Talking point” and
“Outcome” are provided below.
Key Topic: A sentence that captures what a specific part
of the conversation was about. Key topics can be further
broken down into talking points that go into detail about
what was discussed about the topic. If there are two
speakers, it should mention both of their names. If there
are more than two speakers, referring to them as “the
group” is sufficient. This should answer the question,
“What was discussed in this portion of the meeting?”
Talking points: Supporting discussion about the topic
that provides context to the Key Topic including, points of
view, important facts to help someone not in the meeting
understand this topic, etc. The set of talking points should
comprehensively cover the conversation about the Key
Topic.
 Outcome (optional): A key takeaway/next step for the
topic. This may not exist. If the outcome does exist (a
decision was made, a conclusion reached after some
back and forth discussion, etc.) the model should
prioritize this information.

Transcript Comprehension
After reading the transcript, you will be rating it in two
dimensions:
Transcript Comprehension
Examine the transcript and indicate how much of the
notes you understand as defined by:

 Clear understanding of the topic that is being discussed
 Clear understanding of talking points/contributions of each participant in the discussion

Translation quality

 Examine the transcript and indicate how much of it felt poorly translated: whether it was translated too literally from English, translated completely incorrectly, or had other translation issues to the point where it becomes too confusing to interpret the meaning.

🔍 Evaluation Dimensions
You will be evaluating each Recap on:
Completeness – How well does the Recap capture the
important points from the transcript?
Understanding – How well can you understand the
Recap?
Groundedness – How well grounded is the Recap based
on the transcript?
Citation accuracy - How accurate were the citation links
in the Recap paragraphs (e.g. [3]) in directing you to the
correct location within the transcript?

IMPORTANT NOTE: The evaluation dimensions should be rated INDEPENDENTLY from each other! For example, a groundedness problem SHOULD NOT affect the completeness rating and vice versa!
✅ Overall Quality Scores and Side by Side Scoring
After completing the evaluation dimensions for each response, please rate the overall quality for each response! If you selected anything other than "Amazing" for overall quality, please specifically explain what made it NOT Amazing and point out specific instances from the recap/transcript.
After evaluations for each response are completed, complete the Side by Side scoring indicating which response is better. PLEASE BE VERY SPECIFIC IN YOUR JUSTIFICATIONS when explaining what made a response better than the other!
🔍 Final Step - Side by Side Comparison
The final step in the task is to complete the side by side
scoring of the two generated live summaries!

THANK YOU for completing this course! It is great to have you in this project! 🎊 🎉
Tasker Instructions
TNFM Human Eval Live Summary Instructions
NOTE: Please open these instructions so you can reference them as you
complete the task. → This is a link to these instructions that you can open
in a separate tab/window.
Table of Contents
Project Overview
Detailed Rating Instructions
Task Steps
Reading the Transcript
Transcript/Translation Comprehension
Important Note about transcript and translation comprehension!
Reading the Live Summary
Live Summary Evaluation Dimensions:
Side by Side scoring:
Examples

Project Overview
Get ready to work on an exciting task: evaluating an AI meeting assistant that helps summarize
meetings for you!
In this task you will be provided with a meeting transcript and 2 model-generated live summaries of the transcript. Your task is to evaluate these live summaries and decide which one is better.
Task Workflow
Step 1: Read the meeting transcript.
Step 2: Rate the transcript in terms of how well you can understand it.
Step 3: Read the live summary A, and evaluate it.
Step 4: Read the live summary B, and evaluate it.
Step 5: Complete the Side by Side score between A and B, and explain your choice!

Detailed Rating Instructions


You are presented with a segment of a meeting transcript and its corresponding summary from two different models. This "Live Summary" feature is useful when a user arrives late to a meeting or steps out for a minute: it helps them understand the details of the conversation they missed so they can quickly get up to speed on what the conversation has been about while they weren't there.
Good summaries are written as a short paragraph that concisely captures all the talking points of the transcript. A talking point is a meaningful discussion relevant to the meeting. A talking point is not small talk or pleasantries shared between meeting participants.
Live summaries should:

1. Be Concise: The live summary is not meant to be exhaustive and should cover only the
high level detail necessary for someone to catch up on what they missed from the
discussion in a concise way.

2. Be Descriptive: Although not all the details need to be covered, each sentence should
be descriptive enough that it ideally covers the main talking points. The reader should be
able to understand the points made. We don’t want any vague words such as things,
stuff, someone, something, etc., even if they are grounded in the transcript.

3. Have correct speaker attribution: When a name is attributed to a statement, it should be the correct name.

4. Not be repetitive: Live summaries that don’t have any repetitive/redundant sentences
should be preferred over those that are repetitive.

Task Steps
Here is what the task looks like!
Reading the Transcript
Start with reading the transcript carefully!
Transcript/Translation Comprehension
Answer the transcription comprehension question.
Question: Examine the transcript and indicate how much of the notes you understand as
defined by:

 Clear understanding of the topic that is being discussed
 Clear understanding of talking points/contributions of each participant in the discussion
Answer the translation comprehension question.
Question: Examine the transcript and indicate how much of it felt poorly translated: whether it was translated too literally from English, translated completely incorrectly, or had other translation issues to the point where it becomes too confusing to interpret the meaning.

Pro Tip!
Toggle the icon on the top right to view the rubric for EVERY QUESTION!
Important Note about transcript and translation comprehension!
YOU CAN’T CONTINUE THE TASK IF EITHER TRANSCRIPT OR TRANSLATION
COMPREHENSION IS “VERY UNCLEAR”. If either one of them is “very unclear” then
the platform will not allow you to continue, and you should submit the task without
completing the rest of it since it is very hard to understand!
Reading the Live Summary
This step (reading the live summary) and the following step (live summary evaluation
dimensions) will be done for both responses! First you will read live summary A and complete
the evaluation questions relating to A and then you will read live summary B, followed by
completing the evaluation questions for B.
Live Summary Evaluation Dimensions:
Please use the following dimensions to evaluate both live summary A and B.
HIGHLY IMPORTANT NOTE: EACH DIMENSION SHOULD BE
EVALUATED SEPARATELY, FOR EXAMPLE A COMPLETENESS
ISSUE SHOULD NOT IN ANY WAY AFFECT THE GROUNDEDNESS
RATING.
Eval Dimensions and Scoring Guidelines

Completeness issues: Please select any of the following issues with “Completeness” that apply to this summary:
 Vague summary: The summary overall does not provide enough context for it to be useful.
 Missing most talking points: The summary missed most of the talking points of the transcript. If it missed some talking points but was still useful, don’t penalize.
 Other: Issue not covered above

Completeness – How well does the Summary capture the important points from the transcript?
More details: [INTERNAL] TNFM Human Eval - Live Summary
 Incomplete: You indicated “Missing most talking points” in the previous question.
 Partially complete: The summary covered most of the transcript, but missed or misinterpreted some small parts, or you indicated “Vague summary” in the previous question.
 Complete: The summary covered all of the important points in the transcript.

Minor Understanding Issues: Please select any of the following minor issues that you observed in the summary that did not make it significantly harder to understand. Note: Please only select these issues if you observed them but could still understand the intended meaning. MINOR UNDERSTANDING ISSUES SHOULD NOT AFFECT OVERALL QUALITY / UNDERSTANDING EVALUATION.
 Minor redundancy: The summary repeats itself unnecessarily.
 Minor incorrect language / tone: The summary contains phrases that sound awkward, unnatural, overly formalized or uncommon in conversation. This issue is not observed in the transcript.
 Minor poor grammar / syntax: The summary contains bad grammar, punctuation or misspellings. This issue is not observed in the transcript excerpt, only the summary.
 Minor verbosity: The summary contained too much unnecessary detail/overly formalized language to be useful.
 Other: Issue not covered above

Major Understanding Issues: Please select any of the following major issues that you observed in the summary that made it challenging to understand. Note: Please only select these issues if they specifically made the summary hard to understand. If you saw an issue but could still understand the meaning, please don’t penalize this question.
 Redundant: The summary repeats itself unnecessarily.
 Incorrect language / tone: The summary contains phrases that sound awkward, unnatural, overly formalized or uncommon in conversation. This issue is not observed in the transcript.
 Poor grammar / syntax: The summary contains bad grammar, punctuation or misspellings. This issue is not observed in the transcript excerpt, only the summary.
 Too verbose: The summary contained too much unnecessary detail/overly formalized language to be useful.
 Other: Issue not covered above

Understanding – How well can you understand the Summary?
More details: [INTERNAL] TNFM Human Eval - Live Summary
 Not understandable: The summary was barely intelligible, very confusing, or so poorly organized that it was difficult to read and understand, or you chose multiple Understanding issues in the previous question.
 Partially understandable: You indicated one of the following issues in the previous question: “Redundant”, “Incorrect language / tone”, “Poor grammar / syntax”, “Too verbose” or “Other”.
 Understandable: You did not select any issues from the previous question, or the summary was mostly easy to understand.

Groundedness issues: Please select any of the following issues with “Groundedness” that apply to this summary:
 Does not preserve meaning: You can see where the model tried to summarize a portion of the transcript, but it did so incorrectly or without preserving the intended meaning. (e.g. “Lindsay agrees with X” when Lindsay actually disagrees with X)
 Wrong Attribution: The summary attributes a speaker’s words to another speaker (e.g. “A says something”, but B says it in the transcript)
 Assumed/Incorrect Pronoun: The summary assumes a pronoun without evidence. (e.g. "Jackie is doing her work," but there is no indication in the transcript of which pronouns Jackie prefers.)
 Other type of hallucination: The summary contains content that is not supported by clear evidence from the transcript and does not fall into one of the categories above.
 Other: Issue not covered above

Groundedness – How well grounded is the summary based on the transcript?
More details: [INTERNAL] TNFM Human Eval - Live Summary
 Not grounded: You chose “Does not preserve meaning” as an issue with the summary.
 Partially grounded: You chose “Wrong Attribution”, “Assumed/Incorrect Pronoun”, “Other type of hallucination” or “Other” as an issue with the summary.
 Grounded: You did not select any issues from the previous question with the summary.

Overall Quality – How good is the Summary overall?
 Horrible: The summary was marked Not Grounded. Compared to what a human in the meeting would write, this summary is much worse.
 Pretty bad: The summary was marked at least Partially Grounded, but was also marked either Not Understandable or Incomplete. Compared to what a human in the meeting would write, this summary is somewhat worse.
 Okay: The summary was at least Partially Grounded, at least Partially Understandable and at least Partially Complete. Compared to what a human in the meeting would write, this summary is slightly worse.
 Pretty good: The summary was Understandable, at least Partially Complete and at least Partially Grounded. Compared to what a human in the meeting would write, this summary is about the same.
 Amazing: The summary was Understandable, Complete and Grounded. Compared to what a human in the meeting would write, this summary is about the same or better.
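These guided ratings follow almost mechanically from the issue checkboxes and from each other. As a rough illustration only, here is a minimal Python sketch; the string labels are paraphrased from the table above and are not an official platform schema:

```python
# Sketch of the rating logic described above; labels are paraphrased, not an official schema.

def groundedness(issues: set[str]) -> str:
    """Derive the Groundedness rating from the selected Groundedness issues."""
    if "does not preserve meaning" in issues:
        return "Not grounded"
    if issues & {"wrong attribution", "assumed/incorrect pronoun",
                 "other type of hallucination", "other"}:
        return "Partially grounded"
    return "Grounded"

def overall_quality(grounded: str, understanding: str, completeness: str) -> str:
    """Map the three dimension ratings to an Overall Quality bucket."""
    if grounded == "Not grounded":
        return "Horrible"
    if understanding == "Not understandable" or completeness == "Incomplete":
        return "Pretty bad"
    if grounded == "Grounded" and understanding == "Understandable" and completeness == "Complete":
        return "Amazing"
    if understanding == "Understandable":
        return "Pretty good"
    return "Okay"

# Example: a summary with only a wrong-attribution issue that is otherwise
# complete and understandable lands on "Pretty good".
print(overall_quality(groundedness({"wrong attribution"}), "Understandable", "Complete"))
```

The final quality call is still your judgment; the sketch only mirrors the bucket definitions in the table above.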

Side by Side scoring:


Finally please compare the two responses side by side!
Preference – Please indicate the relative quality of the Summary.
1 is much better | 1 is better | 1 is slightly better | About the same | 2 is slightly better | 2 is better | 2 is much better

Explanation: If you chose one side as being better than the other, what were the main
reasons? (Select all that apply.)

 Captured summary for the transcript more concisely.


 Was easier to read (proper spelling, grammar, punctuation).
 Tone was more appropriate for the context.
 Contained less inaccurate, deceptive, or misleading content.
 Contained less off-topic information like small talk or pleasantries.
 Other (explain below)

Comment: Please explain your rating (i.e., why users would prefer one side over the other or
consider them about the same).
Completeness vs. Understanding vs. Groundedness
Completeness: You should only penalize completeness if the Summary is missing information
such as key topics discussed, talking points, or outcomes. The Summary may not include all
details from the transcript and that is ok as it is not meant to be a word for word transcription. If
the output misinterpreted points, was too wordy or had any grammar / language issues, these
should not be penalized under completeness. Completeness is only about missing
information.
Understanding: This section is split into two different types of understanding issues:

1. Minor issues: Any small issues in language, grammar, syntax, wording, verbosity,
redundancy etc., that may have been awkward or incorrect but does not significantly
impact your understanding of the intended meaning behind the phrase.
1. Examples: First name switched with last name, incorrect/missing punctuation,
incorrect/shorthand grammar but you still understood the intended meaning,
some slight redundancy but you could still understand, etc.

2. Major issues: Any significant issues in language, grammar, syntax, wording, verbosity,
redundancy etc., that significantly impacted your ability to understand the intended
meaning in the notes.

1. Examples: Phrases whose intended meaning is too hard to understand because of poor translation or wording; summaries so redundant/verbose that it is hard to understand the conclusions / meaning of the Summary.

Groundedness: You should only penalize groundedness if the Summary is including
information that is incorrect. This could be incorrect speaker attribution, incorrect interpretation
of information from the transcript, incorrect dates, etc. If it is missing information such as it
didn’t capture all the speaker’s perspectives, missed details, was too vague in parts, this
should not be penalized for groundedness. If it was hard to understand, but the information
can be traced back to the transcript, this also should not be penalized for groundedness
because it’s an understanding issue. Groundedness is only penalized if the information
included in the Summary is contradictory to what’s in the transcript or is hallucinated.
Examples
Example 1

Transcript:
Adam Andrews: but I want to bundle it up for these two days I want a dog walker, and then after that I need boarding. That kind of partnerships could also unlock.
Matthew Miller: Okay. Yeah, that's great. I was worried that this would just be a single revenue stream um model. So I'm glad to hear that you've already thought about all these different channels to pull in revenue. um At the start uh what markets do you think you'll be launching in?
Adam Andrews: Um I would like to start in the big cities. Um like my hope is to start with San Francisco, the Bay area um New York, Boston Philadelphia. Because this is where I feel like most of the cosmopolitan people live who travel and then um who also love dogs. Seattle for one which has the highest dog population and there I based on the market study we've done, the need is more pronounced.
Matthew Miller: Okay. Yeah. That all sounds really great uh to me, and I know we're just out of time right now. So if you can just send over um basically everything with uh actual numbers and just the paper for me to sign off on um that would be great.
Adam Andrews: Sure. That sounds like a plan.
Matthew Miller: Awesome. Thank you so much for your time, and uh yeah, looking forward to working with you.
Adam Andrews: Thank you. Thank you
Matthew Miller: Thank you. Bye bye.
Adam Andrews: Bye.

Live summary:
Adam Andrews proposed to launch the service in big cities such as San Francisco, New York, Boston, Philadelphia and Seattle. Matthew Miller agreed with the plan and asked for more details.

Rating:
Overall Quality: Amazing. It captured the main topic of this transcript segment, which was Adam proposing to launch the service in the cities listed and Matthew agreeing. It was also Complete, Grounded and Understandable.

Example 2

Transcript:
Frank Garcia: Yeah, I was actually going to ask you because some of the ad formats that our team was thinking about was the 6 second bumper ads and then the 15 second true view ads. So, we we have this time limitation of 6 seconds and 15 seconds in terms of the media we put out there. So, if you could give us the best 6 second and the best 15 seconds of each song or even multiples there of we could use that as the kind of background music in the creative itself because you're right. The first 6 or 15 seconds of a song are not always the best. You have to find it and people can manually do this. So if you already have that we could use this as the background score to attract people even more.
Mamadou Diallo: Okay, yeah. That's okay, that's that's a great point. I will take note of that, and I will talk to a couple of engineers on my team who are looking at this. They're they're trying to basically determine from like where cuz people tend to repeat the parts of the songs that they like in on YouTube and stuff. Maybe they can have some insight in to how can we narrow that down for like a large quantity of songs and see.
Frank Garcia: Yeah, I believe there is something called the YouTube Dashboard where you can see exactly where do people drop off and they rewind, so we can use the algorithmic they have going through that data and then cutting it down into those slices.
Mamadou Diallo: Okay, sounds good. Sounds like I have an action item there too.
Frank Garcia: Great. Well, we'll reconvene next week. I will share some more prototypes specifically in terms of what the creative could look like and then we can go through the technical APIs and everything too.
Mamadou Diallo: Awesome. Thank you so much.
Frank Garcia: Great. See you.
Mamadou Diallo: Bye-bye.
Frank Garcia: Bye.

Live summary:
Frank Garcia and Mamadou Diallo talked about a product.

Rating:
Overall Quality: Horrible. It missed the main points and incorrectly summarized what the conversation was about. It was also Incomplete, Ungrounded and Understandable (since it was easy to understand).

Justifications
⚠️Remember - all justifications need to be in English 🇺🇸

Content
🧐 Why Do We Need Justifications
😡 Let's Chat about Lousy Justifications
🤩 Getting to Good Justifications

🧐 Why Do We Need Justifications
What is Justification?
This is the rationale you provide for your evaluation of a
particular response to a prompt and/or to your selection
of one response over another.
Why are good justifications important?
Justifications allow reviewers, the project team, and the
client to better understand your thinking and preferences
so that we can provide accurate feedback, useful and
targeted training, and improve model alignment with
human preferences (e.g., values, satisfaction).
Justifications, in short, explain why you picked one response over the other. Crucially, you must write a thorough, holistic justification.
In the TNFM project there are multiple questions where
justifications are required! The justifications guidelines
apply to ALL JUSTIFICATIONS BOXES!

😡 Let's Chat about Lousy Justifications


Getting to Good Justifications
After doing your side-by-side rating, please write a detailed justification, in English, for
your rating with examples whenever possible. A good justification includes your
reasoning, specific examples from the responses, and how they influenced
your ratings.
🧑‍🏫Elements of a good justification
 Aim for at least 4-5 sentences
 Lead with what the user intent is (What does the user
want to achieve?)
 Includes key differences between the responses (How is
summary/recap A different to summary/recap B)
 Highlight these factors with reasonably detailed evidence
(use examples from the text!)
 INCLUDE appropriate textual evidence to support your
verdict.
 ALIGN it with your verdict (i.e. rating it “much better”
requires stronger evidence) MAKE SURE THERE ARE
NO CONTRADICTIONS BETWEEN JUSTIFICATIONS
AND RATINGS.
 Show your research & your work, based on the
dimensions:
 If an issue (e.g., grammar or phrasing that doesn't sound native) is identified for writing quality → share it in your comment
 If a summary/recap is not grounded on the transcripts
explain exactly what makes it not GROUNDED - give
examples.
 If a recap/summary is not COMPLETE, explain what
should have been included!
 If a response has other issues, explain what these are!
 Justifications without evidence and supporting
claims from the responses or your research will not
be counted as good.

Elements of good justifications:
 User Intent: The justification should include a comment that demonstrates an understanding of what the user is looking for with the prompt.
 Conclusion: We need to determine which response is better and why. The comment should clearly explain the reasoning behind this decision.
 “@Response A is better than @Response B ...”
 Supporting Claims: The justification needs to have key supporting points to defend its conclusion.
 “...because @Response B has a hallucination issue ...”
 Specific Evidence: The precise examples or evidence in the text used to support each supporting point.
 “...@Response B has a groundedness issue in the second paragraph when it's talking about...”
 Analysis: The explanation(s) of how the evidence defends the supporting claim.
 Summarize how your evidence supports your preference.

🤩 Guiding Principles for Good
Justification
Focus on ultimately why you selected your preference
(i.e., which response best aligned/answered the user's
core intents)
Be specific about what the response(s) might have
missed (i.e., if you say it didn't follow what the user asked
for, say specifically what it didn't follow; if you say it had
major factual inaccuracies, say precisely what it got
wrong)
Note
⚠️When working in Remotask ensure you use ‘@’ before talking about each response
@response 1 , @response 2
Example
TNFM Human Evals Reviewer Instructions
PROJECTS
goat_pad
gearshift_proposal
graph_houseboat_c830
Reviewer Rating System Guidelines
Detailed Rubric with Error Types
Single Task Reviews (L0 level reviews)
Multi Replica Reviews (L10 level reviewers)
Edge Cases/Error Examples
Congratulations on being selected as a reviewer! 🎉
Q: Why are reviewers important?
A: Reviewers play a crucial role in the project, ensuring that only high-quality tasks are
delivered to customers. They act as a safeguard by identifying high quality work and
fixing low quality tasks.
Q: How are reviewers chosen?
A: Reviewers are selected from a pool of top-performing contributors who consistently
demonstrate attention to detail and a high level of quality in their work.
✅Good Reviewer Traits:

 Knows the task and instructions very well


 Ratings given are accurate and precise, not just all 5s
 Ensures consistency between Likert scale ratings and task evaluations
 Makes logical changes across the reviewed tasks but does not artificially fix them
to create agreement

❌Bad Reviewer Traits:

 Does not really know the task


 Gives extreme ratings (e.g., all 5s) without reason or close attention
 Alters tasks to force agreement or makes contradictory or unclear changes
 Shows inconsistency between overall and individual evaluations
Only proceed if you have read and completely understand the Attempter
instructions for the project you are reviewing!

Reviewer Rating System Guidelines


Grading Rubric Overview
5 - Perfect (Approve): No changes necessary, in total agreement.
4 - Good (Make Changes): Slightly disagree with 1-2 eval questions (selected vs correct option has distance within 1), but the task is mostly correct and justifications are well-written.
3 - Okay (Make Changes): Task is overall okay, 1 eval question was marked wrong (with a distance of 2), justifications are okay.
2 - Bad (Make Changes): Task does not meet the basic requirements, requires fixes on 3+ questions (with a distance of 1) or 2+ eval questions (with a distance of 2), justifications are poor.
1 - Terrible (Reject): Task is spam or otherwise terrible quality, needs major fixes on 4+ eval questions which differ by more than 2 points, and justifications are missing or poorly written. Also applies if there are contradictions between the justification or explanations and the overall scores or the side by side score selected.
Example of distance calculation: Suppose the contributor selected "Not understandable" on the Understanding question. If the correct answer is "Understandable", the selection is wrong with a distance of 2. If the correct answer is "Partially understandable", the selection is wrong with a distance of 1.
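If it helps, the same distance idea can be written out as a tiny Python sketch, assuming the options are ordered as they appear on the Understanding scale:

```python
# Minimal sketch of the "distance" between the selected and the correct option,
# assuming the options are ordered along the scale.
UNDERSTANDING_SCALE = ["Not understandable", "Partially understandable", "Understandable"]

def distance(selected: str, correct: str, scale: list[str] = UNDERSTANDING_SCALE) -> int:
    """Number of steps between the selected option and the correct one."""
    return abs(scale.index(selected) - scale.index(correct))

print(distance("Not understandable", "Understandable"))            # 2
print(distance("Not understandable", "Partially understandable"))  # 1
```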

Detailed Rubric with Error Types


Check each dimension while you are reviewing the task and rate accordingly.
 All tasks start out with 5 points.
 Each minor error deducts 1 point.
 Each major error deducts 3 points.
 The remaining points are the task’s overall score (1–5).
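A quick Python sketch of that arithmetic (clamping the result at 1 is an assumption, since the overall score ranges from 1 to 5):

```python
# Start at 5, subtract 3 per major error and 1 per minor error.
# Clamping at 1 is an assumption based on the stated 1-5 score range.
def task_score(major_errors: int, minor_errors: int) -> int:
    return max(1, 5 - 3 * major_errors - minor_errors)

print(task_score(0, 0))  # 5: no errors
print(task_score(0, 2))  # 3: two minor errors
print(task_score(1, 1))  # 1: one major plus one minor error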

Field: Likert or Comparison Rating
Major Error (-3 pts):
- [Major Ranking Disagreement] You and the contributor disagree on the likert rating bucket (1,2) vs. (3,4,5) vs. (6,7) and the justification does not adequately support this variance.
- [Inconsistent Ranking] The contributor's individual quality ratings support the opposite conclusion of their likert rating and this contradiction is not adequately supported in the justification.
Minor Error (-1 pt):
- [Loose Ranking Disagreement] You and the contributor disagree between adjacent buckets (1,2 / 3,4,5) or (3,4,5 / 6,7) AND the contributor's justification is persuasive enough to make you uncertain in your position.
Additional Notes: Likert ratings can be separated into 3 buckets: 1. Response 1 is better than Response 2: 1/2. 2. Both responses are about the same: 3/4/5. 3. Response 2 is better than Response 1: 6/7. If you and the contributor disagree on the bucket (without good justification), the task is a fail.

Field: Transcript/Translation Comprehension
Major Error (-3 pts):
- [Incorrect Transcript/Translation Comprehension] The contributor marked the prompt in the wrong bucket. Bucket 1: "Very unclear, somewhat unclear, Neither clear or unclear". Bucket 2: "Very clear, somewhat clear".
Minor Error (-1 pt):
- [Loose comprehension disagreement] The contributor's response is in the same bucket as the correct response but is not the same response.

Field: Overall Quality Rating (check per response)
Major Error (-3 pts):
- [Major Overall Ranking Disagreement] You and the contributor disagree with a distance of 2 or more (Okay vs Horrible) in response 1 or 2.
Minor Error (-1 pt):
- [Loose Overall Ranking Disagreement] You disagree with the contributor by a distance of 1 (Pretty Good vs Okay) in response 1 or 2.

Field: Response Ratability (check per response)
Major Error (-3 pts):
- [Incorrect Response Rateability] The contributor marked the response as rateable when it was not rateable (or vice versa).

Field: Completeness / Understanding / Groundedness (check per response)
Major Error (-3 pts):
- [Major Rating Disagreement v1] You disagree with the contributor's rating by 2 points in any of these dimensions.
- [Major Rating Disagreement v2] You disagree with the contributor's rating by 1 point in more than one of these dimensions.
Minor Error (-1 pt):
- [Minor Rating Disagreement v3] You disagree with the contributor's rating by 1 point in one of these dimensions.

Field: Completeness issues / Understanding issues / Groundedness issues (check per response)
Major Error (-3 pts):
- [Major Rating Disagreement v4] You disagree with the contributor's response on multiple of these dimensions, but you agree with the contributor's selections in the following questions. Example: Groundedness issues and understanding issues are marked wrong but the following groundedness and understanding questions are marked correctly.
Minor Error (-1 pt):
- [Major Rating Disagreement v5] You disagree with the contributor's response on one of these dimensions, but you agree with the contributor's selection in the following question. Example: Groundedness issues is marked wrong but the following groundedness question is marked correctly.

Field: Other dimensions (if applicable)
Major Error (-3 pts):
- [Major Rating Disagreement other dimension] You disagree with the contributor's response.
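To make the Likert buckets from the first row concrete, here is a minimal Python sketch (illustrative only; the labels are paraphrased from the notes above):

```python
# Sketch of the three side-by-side Likert buckets described above:
# 1/2 -> Response 1 is better, 3/4/5 -> both are about the same, 6/7 -> Response 2 is better.
def likert_bucket(rating: int) -> str:
    if rating in (1, 2):
        return "Response 1 is better"
    if rating in (3, 4, 5):
        return "Both responses are about the same"
    if rating in (6, 7):
        return "Response 2 is better"
    raise ValueError("side-by-side rating must be between 1 and 7")

# A reviewer at 2 and a contributor at 5 fall in different buckets, which is a
# Major Ranking Disagreement unless the contributor's justification is persuasive.
print(likert_bucket(2) == likert_bucket(5))  # False
```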

Justification Rubric
Criteria: Analysis
Major Error (-3 pts):
- [Generic Justification] The justification isn't specific to the given task, and the justification needs to be specific in order to make sense of the rating.
Minor Error (-1 pt):
- The justification is not generic and is not skewed.
Additional Notes: If a specific justification is not needed, then do not dock points for this.

Criteria: Accuracy
Major Error (-3 pts):
- [Evidence Inaccurate] 1 piece of evidence used in the justification is inaccurate.
- [LLM USAGE] If it is clear that the tasker used an LLM to generate justifications or explanations.

Single Task Reviews (L0 level reviews)


IMPORTANT NOTE/EDGE CASES FOR
REVIEWERS:
 Our contributors were penalizing the same error in multiple error categories. For example, if there is a groundedness error in a task, it should NOT affect the completeness rating. As much as possible we need to map each error to ONE evaluation dimension. Another example: if there was a problem with citations, it should not affect the completeness dimension; it should only affect the citation accuracy dimension (this example is for recap specifically).
 Citation Accuracy should not affect Overall
Rating! (only for recap)
 Understanding dimension should not be marked
down when the contributors can still mostly
understand the recap/summary. If the
recap/summary only has a small grammar issue
or a formatting issue that DOES NOT affect the
understanding/comprehensibility the
understanding dimension can still be and should
be "Understandable"
 Justifications should be specific and point out
specific words/phrases/parts from the
recap/summary/transcript to support the rating
selection!
Multi Replica Reviews (L10 level reviewers)
Here is a recording explaining the multi replica reviewer workflow!
Multi Replica Reviewer Instructions.mp4
Multi Replica Workflow
1. You will be reviewing three identical tasks completed by different attempters.

1. Please have the attempter instructions to reference while completing the


reviews.
2. You can see each task at the top of the screen by clicking on it.

2. Look through all three tasks (WITHOUT EDITING) and check for accuracy
(ratability, SxS, eval questions).

A task is accurate if the rating from the rubric is 3, 4 or 5; a task is inaccurate if the rating is 1 or 2.
A task is accurate if most of the eval questions are correct, the CB has done a pretty good job attempting the task, and the side by side and the individual overall rating responses are correct.
For accurate tasks, reviewers do not need to do major fixes, but they still need to polish the task so it is ready to send to the customer, if they haven't already edited a task (KEEP IN MIND YOU CAN EDIT 1 TASK AT MOST). An example of this could be improving the justifications.
A task is inaccurate if more than two or three rating questions are clearly wrong, or the side by side evaluation or any of the overall individual rating responses is clearly wrong.
For an inaccurate task, fixing means fixing the task overall and correcting all mistakes.

3. If all 3 of the tasks are accurate:

1. Review all by scoring the task and giving feedback (do not make
any edits to the task)
2. Approve all three tasks and press submit all!

4. If 2 of them are accurate and 1 is inaccurate

1. Fix the one inaccurate task.
2. Approve all tasks and submit all.
5. If 1 of them is accurate and the other 2 are inaccurate

1. Fix one of the two inaccurate tasks (do not fix the last inaccurate one).
2. Select approve with changes for the fixed inaccurate task, send back to queue the second inaccurate task, and approve the accurate task.

6. If none of the three tasks are accurate:

1. Fix one of the inaccurate tasks (do not fix two or more inaccurate ones).
2. Review all the tasks considering the version in which you received them.
3. Send back to queue the inaccurate ones, and select approve with changes for the one you fixed.

(These cases are sketched in code just before the Workflow Diagram below.)

7. Here is what the reviewer box looks like: (THIS BOX DOES NOT COUNT AS A TASK EDIT)

1. Rate the quality of each task as you received it from 1-5. Refer to
the scoring guidelines below.
2. Your task ratings are really important for us to provide feedback to our
attempters!
Basically we should NOT have artificial agreement, i.e., fixing all attempts to be the same. When reviewing (3 attempts), reviewers should only fix one of the attempts; for the remaining incorrect one or ones you should select Send Back to Queue!
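For reference, a minimal Python sketch (a hypothetical helper, not platform code) of how the accurate/inaccurate counts above map to reviewer actions:

```python
# Sketch of the multi-replica rules above: approve accurate tasks, fix at most
# one inaccurate task, and send any remaining inaccurate tasks back to queue.
def review_actions(accuracies: list[bool]) -> list[str]:
    """accuracies[i] is True if attempt i is accurate (rubric score 3-5)."""
    actions = []
    fixed_one = False
    for accurate in accuracies:
        if accurate:
            actions.append("approve (score and give feedback, no edits)")
        elif not fixed_one:
            actions.append("fix, then approve with changes")
            fixed_one = True
        else:
            actions.append("send back to queue")
    return actions

print(review_actions([True, True, True]))    # approve all three
print(review_actions([True, False, False]))  # fix one, send the other back to queue
```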
Workflow Diagram
As soon as you edit a task, it counts as an edit. Reverting your edits does not reset your edit count (which is max 1)!
