README.md (+4 −7: 4 additions & 7 deletions)
@@ -62,8 +62,6 @@ The task statistics are shown as follows:

For the three LLMs (Llama2-70b-Chat, Mixtral-8x7b-Instruct, and GPT-3.5-Turbo), we evaluate a total of 106,758 segments drawn from 54 MT systems. For GPT-4, we restrict the evaluation to Chinese–English, using 30 randomly selected segments per MT system, for a total of 600 samples ("WMT22-Subset").

-
-The queries and responses of the LLMs can be found in "[results](./results/)".

<h2 align="center">EAPrompt Implementation</h2>

The main implementation is provided in [./EAPrompt](./EAPrompt/).
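To illustrate the subset construction described above (30 randomly selected segments per MT system, 600 samples in total), here is a minimal sketch; the data layout and all names are assumptions for illustration, not the repository's code.

```python
import random

# Hypothetical layout: one list of candidate segment IDs per MT system.
# 600 total samples at 30 per system implies 20 Chinese-English systems.
segments_by_system = {f"system_{i:02d}": list(range(2000)) for i in range(20)}

random.seed(42)  # fix the seed so the subset is reproducible
wmt22_subset = {
    system: random.sample(ids, k=30)  # 30 random segments per system
    for system, ids in segments_by_system.items()
}
assert sum(len(v) for v in wmt22_subset.values()) == 600
```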
@@ -92,11 +90,9 @@ All prompt types used in the study are provided for replication. We adopt a stru

- `"SRC"` for **reference-free** evaluation (source only);
- `"REF"` for **reference-based** evaluation.

-
-> Note: For the counting step, we use a simple identifier "COUNT". No additional keywords are required.
-
+> Note: For the counting step, we use a simple identifier `"COUNT"`. No additional keywords are required.

-According to our ablation experiments, we recommend **ERROR\_\{LANG\}\_ITEMIZED\_\{IS_REF\}** as prompt type, e.g. ERROR_ENDE_ITEMIZED_SRC.
+According to our ablation experiments, we recommend using the prompt type **ERROR\_\{LANG\}\_ITEMIZED\_\{IS_REF\}** as the default configuration. For example: `ERROR_ENDE_ITEMIZED_SRC`.
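To make the naming scheme concrete, a minimal sketch of how these identifiers could be composed; the helper below is hypothetical and not part of the repository's API.

```python
def build_prompt_type(lang_pair: str, is_ref: bool, style: str = "ITEMIZED") -> str:
    """Compose an identifier such as ERROR_ENDE_ITEMIZED_SRC."""
    suffix = "REF" if is_ref else "SRC"  # reference-based vs. reference-free
    return f"ERROR_{lang_pair.upper()}_{style}_{suffix}"

assert build_prompt_type("ende", is_ref=False) == "ERROR_ENDE_ITEMIZED_SRC"

# The follow-up counting turn reuses a single identifier, per the note above.
COUNT_PROMPT_TYPE = "COUNT"
```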
**🚀 Generating Queries & Responses**
@@ -109,9 +105,10 @@ For large-scale evaluation across multiple MT systems, we provide two example sc

These scripts demonstrate the complete workflow for evaluating entire datasets efficiently.
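The repository's two scripts are the authoritative workflow; the sketch below only illustrates the general shape of such a batch loop, with placeholder loaders and an assumed LLM call.

```python
import json
from pathlib import Path

def load_outputs(system: str) -> list[str]:
    # Placeholder: the real scripts read each system's translations from disk.
    return [f"{system} translation {i}" for i in range(3)]

def query_llm(prompt_type: str, segment: str) -> str:
    # Placeholder for the actual LLM request issued by the scripts.
    return f"[{prompt_type}] response for: {segment}"

def evaluate_all(systems: list[str], prompt_type: str, out_dir: str = "results") -> None:
    """Query every segment of every MT system and store responses per system."""
    Path(out_dir).mkdir(exist_ok=True)
    for system in systems:
        responses = [query_llm(prompt_type, seg) for seg in load_outputs(system)]
        with open(Path(out_dir) / f"{system}.json", "w", encoding="utf-8") as f:
            json.dump(responses, f, ensure_ascii=False, indent=2)

evaluate_all(["system_A", "system_B"], "ERROR_ENDE_ITEMIZED_SRC")
```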
-
<h2 align="center">Results and Findings</h2>

+The queries and responses of the LLMs can be found in "[results](./results/)".
+
1. **EAPrompt significantly enhances the performance of LLMs at the system level**. Notably, prompting *GPT-3.5-Turbo* with EAPrompt outperforms all other metrics and prompting strategies, establishing a new state-of-the-art.
2. **EAPrompt surpasses GEMBA in 8 out of 9 test scenarios** across various language models and language pairs at the segment level.
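Segment-level comparisons like the one above require mapping each counting-step response to a numeric score. A hedged sketch, assuming MQM-style weights (5 per major error, 1 per minor) and a simple response format; neither is confirmed by this README.

```python
import re

def score_from_counts(response: str) -> int:
    """Convert a counting-step response into a (negative) segment score."""
    major = int(re.search(r"major errors?:\s*(\d+)", response, re.I).group(1))
    minor = int(re.search(r"minor errors?:\s*(\d+)", response, re.I).group(1))
    return -(5 * major + 1 * minor)  # fewer and lighter errors => higher score

print(score_from_counts("Major errors: 1\nMinor errors: 2"))  # -7
```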