Eval Project!
📝 This is an introductory course to prepare you for this
exciting project, where you will evaluate an AI meeting
assistant that helps the user summarize meeting
transcripts!
Transcript Comprehension
After reading the transcript, you will rate it on two dimensions:
- Transcript comprehension: Examine the transcript and indicate how much of the notes you understand as defined by the rubric.
- Translation quality
🔍 Evaluation Dimensions
You will be evaluating each Recap on:
- Completeness – How well does the Recap capture the important points from the transcript?
- Understanding – How well can you understand the Recap?
- Groundedness – How well grounded is the Recap based on the transcript?
- Citation accuracy – How accurate were the citation links in the Recap paragraphs (e.g. [3]) in directing you to the correct location within the transcript?
Project Overview
Get ready to work on an exciting task: evaluating an AI meeting assistant that helps summarize
meetings for you!
In this task you will be provided with a meeting transcript and two model-generated live summaries of the transcript. Your task is to evaluate these live summaries and decide which one is better.
Task Workflow
Step 1: Read the meeting transcript.
Step 2: Rate the transcript in terms of how well you can understand it.
Step 3: Read the live summary A, and evaluate it.
Step 4: Read the live summary B, and evaluate it.
Step 5: Complete the Side by Side (SxS) score between A and B and explain your choice!
A good live summary should:
1. Be Concise: The live summary is not meant to be exhaustive and should cover only the high-level detail necessary for someone to catch up on what they missed from the discussion in a concise way.
2. Be Descriptive: Although not all the details need to be covered, each sentence should be descriptive enough that it ideally covers the main talking points. The reader should be able to understand the points made. We don’t want any vague words such as things, stuff, someone, something, etc., even if they are grounded in the transcript.
3. Not be repetitive: Live summaries that don’t have any repetitive/redundant sentences should be preferred over those that are repetitive.
Task Steps
Here is what the task looks like!
Reading the Transcript
Start with reading the transcript carefully!
Transcript/Translation Comprehension
Answer the transcript comprehension question.
Question: Examine the transcript and indicate how much of the notes you understand as
defined by:
Pro Tip!
Toggle the icon at the top right to view the rubric for EVERY QUESTION!
RUBRIC:
Important Note about transcript and translation comprehension!
YOU CAN’T CONTINUE THE TASK IF EITHER TRANSCRIPT OR TRANSLATION
COMPREHENSION IS “VERY UNCLEAR”. If either one of them is “very unclear” then
the platform will not allow you to continue, and you should submit the task without
completing the rest of it since it is very hard to understand!
Reading the Live Summary
This step (reading the live summary) and the following step (live summary evaluation dimensions) will be done for both responses! First you will read live summary A and complete the evaluation questions for A, and then you will read live summary B and complete the evaluation questions for B.
Live Summary Evaluation Dimensions:
Please use the following dimensions to evaluate both live summaries A and B.
HIGHLY IMPORTANT NOTE: EACH DIMENSION SHOULD BE
EVALUATED SEPARATELY, FOR EXAMPLE A COMPLETENESS
ISSUE SHOULD NOT IN ANY WAY AFFECT THE GROUNDEDNESS
RATING.
Eval Dimension Scoring Guidelines

Completeness issues: Please select any of the following issues with “Completeness” that apply to this summary:
- Vague summary: The summary overall does not provide enough context for it to be useful.
- Missing most talking points: The summary missed most of the talking points of the transcript. If it missed some talking points but was still useful, don’t penalize.
- Other: Issue not covered above.

Completeness – How well does the Summary capture the important points from the transcript? (More details: [INTERNAL] TNFM Human Eval - Live Summary)
- Incomplete: You indicated “Missing most talking points” in the previous question.
- Partially complete: The summary covered most of the transcript, but missed or misinterpreted some small parts, or you indicated “Vague summary” in the previous question.

Understanding – How well can you understand the Summary? (More details: [INTERNAL] TNFM Human Eval - Live Summary)
- Not understandable: The summary was barely intelligible, very confusing, or so poorly organized that it was difficult to read and understand, or you chose multiple Understanding issues in the previous question.

Groundedness issues: Please select any of the following issues with “Groundedness” that apply to this summary:
- Does not preserve meaning – You can see where the model tried to summarize a portion of the transcript, but it did so incorrectly or without preserving the intended meaning. (e.g. “Lindsay agrees with X” when Lindsay actually disagrees with X)
- Wrong Attribution – The summary attributes a speaker’s words to another speaker (e.g. “A says something”, but B says it in the transcript)
- Assumed/Incorrect Pronoun – The summary assumes a pronoun without evidence. (e.g. "Jackie is doing her work," but there is no indication in the transcript of which pronouns Jackie prefers.)
- Other type of hallucination – The summary contains content that is not supported by clear evidence from the transcript and does not fall into one of the categories above.
- Other: Issue not covered above.

Groundedness – How well grounded is the summary based on the transcript? (More details: [INTERNAL] TNFM Human Eval - Live Summary)
- Not grounded: You chose “Does not preserve meaning” as an issue with the summary.
- Partially grounded: You chose “Wrong Attribution”, “Assumed/Incorrect Pronoun”, “Other type of hallucination”, or “Other” as an issue with the summary.

Overall Quality – How good is the Summary overall?
- Horrible: The summary was marked Not Grounded. Compared to what a human in the meeting would write, this summary is much worse.
Explanation: If you chose one side as being better than the other, what were the main
reasons? (Select all that apply.)
Comment: Please explain your rating (i.e., why users would prefer one side over the other or
consider them about the same).
Completeness vs. Understanding vs. Groundedness
Completeness: You should only penalize completeness if the Summary is missing information such as key topics discussed, talking points, or outcomes. The Summary may not include all details from the transcript, and that is OK, as it is not meant to be a word-for-word transcription. If the output misinterpreted points, was too wordy, or had any grammar/language issues, these should not be penalized under completeness. Completeness is only about missing information.
Understanding: This section is split into two different types of understanding issues:
1. Minor issues: Any small issues in language, grammar, syntax, wording, verbosity, redundancy, etc., that may be awkward or incorrect but do not significantly impact your understanding of the intended meaning behind the phrase.
   1. Examples: First name switched with last name, incorrect/missing punctuation, incorrect/shorthand grammar where you still understood the intended meaning, some slight redundancy that you could still follow, etc.
2. Major issues: Any significant issues in language, grammar, syntax, wording, verbosity, redundancy, etc., that significantly impact your ability to understand the intended meaning in the notes.
   1. Examples: Phrases whose intended meaning is too hard to understand because of poor translation or wording; text so redundant/verbose that it is hard to understand the conclusions/meaning of the Summary.
Examples

Example 1
Transcript segment:
Adam Andrews: but I want to bundle it up for these two days I want a dog walker, and then after that I need boarding. That kind of partnerships could also unlock.
Matthew Miller: Okay. Yeah, that's great. I was worried that this would just be a single revenue stream um model. So I'm glad to hear that you've already thought about all these different channels to pull in revenue. um At the start uh what markets do you think you'll be launching in?
Adam Andrews: Um I would like to start in the big cities. Um like my hope is to start with San Francisco, the Bay area um New York, Boston Philadelphia. Because this is where I feel like most of the cosmopolitan people live who travel and then um who also love dogs. Seattle for one which has the highest dog population and there I based on the market study we've done, the need is more pronounced.
Matthew Miller: Okay. Yeah. That all sounds really great uh to me, and I know we're just out of time right now. So if you can just send over um basically everything with uh actual numbers and just the paper for me to sign off on um that would be great.
Adam Andrews: Sure. That sounds like a plan.
Matthew Miller: Awesome. Thank you so much for your time, and uh yeah, looking forward to working with you.
Adam Andrews: Thank you. Thank you
Matthew Miller: Thank you. Bye bye.
Adam Andrews: Bye.
Live summary:
Adam Andrews proposed to launch the service in big cities such as San Francisco, New York, Boston, Philadelphia and Seattle. Matthew Miller agreed with the plan and asked for more details.
Evaluation:
Overall Quality: Amazing. It captured the main topic of this transcript segment, which was Adam proposing to launch the service in the cities listed and Matthew agreeing. It was also Complete, Grounded and Understandable.
Example 2
Transcript segment:
Frank Garcia: Yeah, I was actually going to ask you because some of the ad formats that our team was thinking about was the 6 second bumper ads and then the 15 second true view ads. So, we we have this time limitation of 6 seconds and 15 seconds in terms of the media we put out there. So, if you could give us the best 6 second and the best 15 seconds of each song or even multiples there of we could use that as the kind of background music in the creative itself because you're right. The first 6 or 15 seconds of a song are not always the best. You have to find it and people can manually do this. So if you already have that we could use this as the background score to attract people even more.
Mamadou Diallo: Okay, yeah. That's okay, that's that's a great point. I will take note of that, and I will talk to a couple of engineers on my team who are looking at this. They're they're trying to basically determine from like where cuz people tend to repeat the parts of the songs that they like in on YouTube and stuff. Maybe they can have some insight in to how can we narrow that down for like a large quantity of songs and see.
Frank Garcia: Yeah, I believe there is something called the YouTube Dashboard where you can see exactly where do people drop off and they rewind, so we can use the algorithmic they have going through that data and then cutting it down into those slices.
Mamadou Diallo: Okay, sounds good. Sounds like I have an action item there too.
Frank Garcia: Great. Well, we'll reconvene next week. I will share some more prototypes specifically in terms of what the creative could look like and then we can go through the technical APIs and everything too.
Mamadou Diallo: Awesome. Thank you so much.
Frank Garcia: Great. See you.
Mamadou Diallo: Bye-bye.
Frank Garcia: Bye.
Live summary:
Frank Garcia and Mamadou Diallo talked about a product.
Evaluation:
Overall Quality: Horrible. It missed the main points and incorrectly summarized what the conversation was about. It was also Incomplete, Ungrounded and Understandable (since it was easy to understand).
Justifications
⚠️Remember - all justifications need to be in English 🇺🇸
Content
🧐 Why Do We Need Justifications
😡 Let's Chat about Lousy Justifications
🤩 Getting to Good Justifications
🧐 Why Do We Need Justifications
What is a justification?
This is the rationale you provide for your evaluation of a particular response to a prompt and/or for your selection of one response over another.
Why are good justifications important?
Justifications allow reviewers, the project team, and the client to better understand your thinking and preferences so that we can provide accurate feedback and useful, targeted training, and improve model alignment with human preferences (e.g., values, satisfaction).
Justifications, in short, explain why you picked one response over the other. Crucially, your justification must be thorough and holistic.
In the TNFM project there are multiple questions where justifications are required! The justification guidelines apply to ALL JUSTIFICATION BOXES!
Elements of good justifications:
User Intent: The justification should include a comment that demonstrates an understanding of what the user is looking for with the prompt.
Each rubric row below lists the Field, the Major Errors (-3 pts), the Minor Errors (-1 pt), and any Additional Notes.

Field: Likert or Comparison Rating
Major Error (-3 pts):
- [Major Ranking Disagreement] You and the contributor disagree on the likert rating bucket (1,2) vs. (3,4,5) vs. (6,7) and the justification does not adequately support this variance.
- [Inconsistent Ranking] The contributor’s individual quality ratings support the opposite conclusion of their likert rating and this contradiction is not adequately supported in the justification.
Minor Error (-1 pt):
- [Loose Ranking Disagreement] You and the contributor disagree between adjacent buckets (1,2 / 3,4,5) or (3,4,5 / 6,7) AND the contributor's justification is persuasive enough to make you uncertain in your position.
Additional Notes: Likert ratings can be separated into 3 buckets: 1. Response 1 is better than Response 2: 1/2; 2. Both responses are about the same: 3/4/5; 3. Response 2 is better than Response 1: 6/7. If you and the contributor disagree on the bucket (without good justification), the task is a fail.
Field: Transcript/Translation Comprehension
Major Error (-3 pts):
- [Incorrect Transcript/Translation Comprehension] The contributor marked the prompt in the wrong bucket.
Minor Error (-1 pt):
- [Loose comprehension disagreement] The contributor’s response is in the same bucket as the correct response but it is not the same response.
Additional Notes: Bucket 1: “Very unclear, somewhat unclear, Neither clear or unclear”. Bucket 2: “Very clear, somewhat clear”.
Field: Response Ratability (check per response)
Major Error (-3 pts):
- [Incorrect Response Rateability] The contributor marked the response as rateable when it was not rateable (or vice versa).
Field: Completeness / Understanding / Groundedness (check per response)
Major Error (-3 pts):
- [Major Rating Disagreement v1] You disagree with the contributor’s rating by 2 points in any of these dimensions.
- [Major Rating Disagreement v2] You disagree with the contributor’s rating by 1 point in more than one of these dimensions.
Minor Error (-1 pt):
- [Minor Rating Disagreement v3] You disagree with the contributor’s rating by 1 point in one of these dimensions.
Field: Completeness issues / Understanding issues / Groundedness issues (check per response)
Major Error (-3 pts):
- [Major Rating Disagreement v4] You disagree with the contributor’s response on multiple of these dimensions, but you agree with the contributor's selections in the following questions. Example: Groundedness issues and understanding issues are marked wrong but the following groundedness and understanding question is marked correctly.
Minor Error (-1 pt):
- [Major Rating Disagreement v5] You disagree with the contributor’s response on one of these dimensions, but you agree with the contributor's selection in the following question. Example: Groundedness issues is marked wrong but the following groundedness question is marked correctly.
Field: Other dimensions (if applicable)
Major Error (-3 pts):
- [Major Rating Disagreement other dimension] You disagree with the contributor’s response.
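The likert bucket logic in the rubric above is easy to misapply, so here is a minimal, purely illustrative sketch of it (the function and variable names are hypothetical and are not part of the review platform):

```python
# Illustrative sketch only: maps a 1-7 side-by-side (SxS) rating to the three
# likert buckets described in the rubric. All names here are hypothetical.
def likert_bucket(rating: int) -> str:
    if rating in (1, 2):
        return "Response 1 is better than Response 2"
    if rating in (3, 4, 5):
        return "Both responses are about the same"
    if rating in (6, 7):
        return "Response 2 is better than Response 1"
    raise ValueError("SxS ratings must be integers from 1 to 7")

# A contributor rating of 5 and a reviewer rating of 2 land in different
# buckets, so (absent a persuasive justification) that is a major ranking
# disagreement; 4 vs. 5 stay in the same bucket.
print(likert_bucket(2) == likert_bucket(5))  # False -> different buckets
print(likert_bucket(4) == likert_bucket(5))  # True  -> same bucket
```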
Justification Rubric
Criteria Major Error (-3 pts) Minor Error (-1 pt) Additional Notes
2. Look through all three tasks (WITHOUT EDITING) and check for accuracy (ratability, SxS, eval questions).
     1. Review all by scoring the task and giving feedback (do not make any edits to the task).
     2. Approve all three tasks and press submit all!

     1. Fix one of the two inaccurate tasks (do not fix the last inaccurate one).
     2. Select approve with changes for the fixed inaccurate task, Send back to queue the second inaccurate task, and approve the accurate task.

     1. Fix one of the inaccurate tasks (do not fix two or more inaccurate ones).
     2. Review all the tasks considering the version in which you received them.
     3. Send back to queue the inaccurate ones, and select approve with changes for the one you fixed.
3. Here is what the reviewer box looks like: (THIS BOX DOES NOT COUNT AS A TASK EDIT)
     1. Rate the quality of each task as you received it from 1-5. Refer to the scoring guidelines below.
     2. Your task ratings are really important for us to provide feedback to our attempters!
Basically, we should NOT have artificial agreement (fixing all attempts to be the same). When reviewing (3 attempts), reviewers should only fix one of the attempts; for the remaining incorrect one or ones, you should select Send Back to Queue!
Workflow Diagram
As soon as you edit a task, it counts as an edit. Reverting your edits does not reset your edit count (the max is 1)!