Intensive Speaking
Evaluating the scoring scale on pages 190–191
Strengths:
- *Consistency*: The specific point ranges and clear descriptions (e.g., frequent
errors, occasional errors) help a rater apply the same criteria each time, improving
consistency.
- *Defined Criteria*: The emphasis on intelligibility (for pronunciation) and flow
(for fluency) makes it easier for a rater to consistently apply the same standard for
repeated assessments.
- Guidelines: The scale’s well-defined criteria for each point range support *inter-
rater reliability* by giving multiple raters a shared reference point for evaluating
pronunciation and fluency.
- *Scoring Bands*: Dividing the scale into specific bands (e.g., 0.0–0.4, 0.5–1.4)
helps limit variability between raters, as they have a narrower scope to consider.
Weaknesses:
- *Borderline Cases*: Since some score ranges are broad (e.g., 0.5–1.4 for both
pronunciation and fluency), raters might struggle to consistently place borderline
performances in the same category across different sessions. A single rater might
shift between categories (e.g., scoring a speaker 1.4 one time and 1.5 the next)
when re-evaluating similar performances.
- *Subjectivity*: Raters may interpret vague terms like “occasionally
unintelligible” or “occasional errors” slightly differently over time, introducing
variability in scoring.
- *Interpretation Differences*: Raters might interpret phrases like “frequent
errors” or “nonnative pauses” differently, which could result in *inconsistent
scoring between raters*. For example, one rater might score a performance at 1.4
(frequent errors) while another rates the same speaker at 1.5 (occasional errors).
- *Training Needs*: To ensure high inter-rater reliability, raters may need to
undergo *calibration sessions* or training to interpret the terms in the same way,
especially for more subjective categories.
Intra-Rater Reliability
- The clear criteria for each scoring range support consistency in a rater’s
evaluations over time. Specific guidelines on pronunciation errors and fluency
pauses help raters maintain consistent standards.
- Broad score ranges (e.g., 0.5–1.4) may cause inconsistencies in borderline cases.
A single rater might score similar performances differently during separate
evaluations due to subjective interpretations of terms like “frequent errors.”
Inter-Rater Reliability
- The structured scale with defined ranges helps different raters align their scoring,
reducing variability. This shared framework supports consistent evaluations across
raters.
- Subjective language (e.g., “occasional errors”) can lead to differing
interpretations, causing variability between raters when assessing the same
performance. Broad score ranges also increase the risk of different raters assigning
different scores for similar performances.