LLMs' Self-Knowledge Evaluation

Zhangyue Yin♢  Qiushi Sun♠  Qipeng Guo♢
Jiawen Wu♢  Xipeng Qiu♢∗  Xuanjing Huang♢
♢ School of Computer Science, Fudan University
♠ Department of Mathematics, National University of Singapore
{yinzy21,jwwu21}@m.fudan.edu.cn  [email protected]
{qpguo16,xpqiu,xjhuang}@fudan.edu.cn
Abstract

Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks. Current research focuses on enhancing their performance within their existing knowledge. Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend. Therefore, the ability to understand their own limitations on the unknowns, referred to as self-knowledge, is paramount.

[Figure 1: quadrant diagram of "Knows" versus "Unknows", including the Known Knows and Known Unknows regions.]
[Table 1: Unanswerable questions in the SelfAware dataset that span across multiple categories.]
The SelfAware dataset contains a proportion of answerable questions that is twice as large as the volume of unanswerable ones. Nevertheless, to keep testing feasible, we have purposefully capped the number of answerable questions.

To detect uncertainty in a model's response, we define a similarity function, f_sim, to compute the similarity, S, between a given sentence, t, and a collection of reference sentences, U = {u1, u2, ..., un}, endowed with uncertain meanings.
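Below is a minimal sketch of this similarity check, assuming a SimCSE-style sentence encoder (Gao et al., 2021, which the paper cites) with cosine similarity as f_sim; the checkpoint name, the [CLS] pooling, and the 0.75 threshold are illustrative assumptions, not details fixed by this excerpt.

```python
# Hedged sketch of the uncertainty check: a sentence t is flagged as
# uncertain when its maximum similarity S = f_sim(t, u_i) to any reference
# sentence u_i in U exceeds a threshold. The encoder checkpoint, [CLS]
# pooling, and the 0.75 threshold are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "princeton-nlp/sup-simcse-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(sentences: list[str]) -> torch.Tensor:
    """Encode sentences into fixed-size vectors via [CLS] pooling."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0]

def is_uncertain(t: str, references: list[str],
                 threshold: float = 0.75) -> bool:
    """Return True if max_i S(t, u_i) >= threshold, with cosine f_sim."""
    sims = torch.cosine_similarity(embed([t]), embed(references))
    return sims.max().item() >= threshold

references = ["The answer is unknown.",
              "It is impossible to answer this question."]
print(is_uncertain("No one can say for certain what will happen.", references))
```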
[Figure 2: Experimental results using three different input forms on a series of models from GPT-3 (ada, babbage, curie, and davinci) and InstructGPT (text-ada-001, text-babbage-001, text-curie-001, and text-davinci-001). x-axis: model size (350M, 1.3B, 6.7B, 175B); y-axis: F1 scores.]
[Figure 3: Comparison between the davinci series and human self-knowledge in instruction input form (F1 scores: davinci 45.67, text-davinci-001 49.61, text-davinci-002 47.48, text-davinci-003 51.43, gpt-3.5-turbo-0301 54.12).]

[Figure 4: Experimental comparison of the davinci series in ICL input form.]
[Figure 5: Experimental results obtained from LLaMA and its derived models, Alpaca and Vicuna, in instruction input form.]

[Figure 6: Accuracy of the InstructGPT series when responding to answerable questions in instruction input form.]
Instruction Tuning. Figure 2 shows that models from the InstructGPT series exhibit a superior level of self-knowledge compared to their GPT-3 counterparts. Further evidence of this improvement is provided by Figure 4, where the text-davinci models improve markedly over the base davinci model. An additional comparative analysis, presented in Figure 5, evaluates LLaMA against its derivative models. The results show a notable increase in self-knowledge for Alpaca and Vicuna after instruction tuning, exceeding their base model performances. Among these, Vicuna-13B outperforms LLaMA-65B, corroborating the efficacy of instruction tuning for enhancing model self-knowledge.

Input Forms. As shown in Figure 2, incorporating instructions and examples boosts the self-knowledge of both the GPT-3 and InstructGPT series. Specifically, the ICL input form, which provides richer contextual information, yields a significant enhancement in models' self-knowledge. This impact is particularly noticeable for the davinci model, where ICL brings a 27.96% improvement over the direct input form. Moreover, a comparison between Figure 3 and Figure 4 reveals that including instructions and examples narrows the performance disparity between the davinci and text-davinci models, suggesting that the models acquire self-knowledge from the instructions and provided examples.
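To make the three input forms concrete, the sketch below shows how the corresponding prompts could be assembled. The instruction string is the one shown in the appendix template; the helper names and the in-context examples are hypothetical stand-ins rather than the paper's exact templates.

```python
# Illustrative sketch of the three input forms (direct, instruction, ICL).
# Only the instruction wording is taken from the appendix template; the
# helper names and example shots are hypothetical.
INSTRUCTION = (
    "When answering questions, interpret them literally and think them "
    "carefully. If the question is unanswerable or unknowable, it is "
    'appropriate to say, "The answer is unknown."'
)

def direct_form(question: str) -> str:
    """Direct input: the bare question."""
    return f"Q: {question}\nA:"

def instruction_form(question: str) -> str:
    """Instruction input: a guiding instruction prepended to the question."""
    return f"{INSTRUCTION}\n{direct_form(question)}"

def icl_form(question: str, examples: list[tuple[str, str]]) -> str:
    """ICL input: instruction plus worked Q/A examples for richer context."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{INSTRUCTION}\n{shots}\n{direct_form(question)}"
```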
Compared with Human. Figure 3 reveals that, without supplementary samples, GPT-4 currently performs best among the tested models, achieving an impressive F1 score of 75.47%. However, a noticeable gap remains when comparing this performance to the human benchmark of 84.93%, underscoring the considerable potential that remains for enhancing the self-knowledge of LLMs.

Answerable Questions. Figure 6 traces the performance evolution of the InstructGPT series on answerable questions, adhering to the closed-book question answering paradigm (Touvron et al., 2023), where output accuracy is contingent on the presence of the correct answer in the response. We observe a steady improvement in QA accuracy as model parameter size increases and training continues. In particular, accuracy rises from a meager 2.48% for text-ada-001 to 10.61% for text-davinci-001, whereas GPT-4 marks an even more striking jump to 42.64%.
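As a rough illustration of the two metrics above, the following sketch scores unanswerable-question detection with F1 (building on the is_uncertain helper sketched earlier) and answerable questions with a containment check on the gold answer. Treating unanswerable questions as the positive class, and using scikit-learn's f1_score, are assumptions of this sketch, not details fixed by the excerpt.

```python
# Illustrative scoring sketch. `is_uncertain` is the detector sketched in
# the uncertainty-check example; treating unanswerable questions as the
# positive class for F1 is an assumption of this sketch.
from sklearn.metrics import f1_score

def detection_f1(responses, gold_unanswerable, references):
    """F1 for flagging unanswerable questions via response uncertainty."""
    predictions = [is_uncertain(r, references) for r in responses]
    return f1_score(gold_unanswerable, predictions)

def qa_accuracy(responses, gold_answers):
    """Closed-book QA accuracy: a response counts as correct iff it
    contains the gold answer string (simple containment check)."""
    hits = sum(gold.lower() in resp.lower()
               for resp, gold in zip(responses, gold_answers))
    return hits / len(responses)
```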
5 Conclusion

This study investigates the self-knowledge of LLMs by evaluating their ability to identify unanswerable questions. Through the introduction of a novel dataset and an automated method for detecting uncertainty in the models' responses, we are able to accurately measure the self-knowledge of LLMs such as GPT-3, InstructGPT, and LLaMA. Our results reveal that while these models possess a certain degree of self-knowledge, an apparent disparity remains in comparison to human self-knowledge. This highlights the need for further research to enhance the ability of LLMs to understand their own limitations on the unknowns. Such efforts will lead to more accurate and reliable responses from LLMs, with a positive impact on their applications in diverse fields.
Limitations

• Generalization of reference sentences. At present, we have selected sentences with uncertain meanings exclusively from the GPT-3 and InstructGPT series, potentially overlooking uncertainty present in responses generated by other LLMs. However, it is not feasible to catalog all sentences with uncertain meanings exhaustively. As a direction for future research, we propose to concentrate on the automated acquisition of more accurate reference sentences to address this concern.

• Limitations of input forms. Our examination was confined to three input forms: direct, instruction, and ICL. There is burgeoning research aimed at bridging the gap between models and human-like methods of reasoning and problem-solving, including but not limited to approaches like Reflexion (Shinn et al., 2023), ToT (Yao et al., 2023), and MoT (Li and Qiu, 2023). Future endeavors will integrate additional cognitive and decision-making methods to delve deeper into the self-knowledge exhibited by these LLMs.
Ethics Statement

The SelfAware dataset, meticulously curated to evaluate LLMs' ability to discern unanswerable questions, is composed of unanswerable questions extracted from sources such as Quora and HowStuffWorks, alongside answerable questions procured from three distinct open datasets. Every question was thoroughly examined for relevance and harmlessness. To ensure content validity, three annotation analysts, compensated at local wage standards, dedicated regular working hours to content review.

Throughout our research process, we underscored the significance of privacy, data security, and strict compliance with dataset licenses. To protect data integrity, we implemented anonymization and content filtration mechanisms. Our adherence to OpenAI's stipulations remained unyielding for the usage of GPT-3 and InstructGPT models, and likewise for Meta's terms pertaining to LLaMA models. We rigorously vetted the licenses of the three publicly available datasets for compliance, ensuring that all our research methodologies were in alignment with ethical standards at the institutional, national, and global levels.

Adhering to the CC-BY-SA-4.0 protocol, the dataset, once publicly released, will be reserved exclusively for research purposes. We pledge to promptly and effectively address any concerns relating to the dataset, while expecting researchers to maintain high ethical standards in their use of this data.

Acknowledgement

We wish to express our gratitude to our colleagues in the FudanNLP group, whose insightful suggestions, perspectives, and thought-provoking discussions significantly contributed to this work. Our sincere appreciation also extends to the anonymous reviewers and area chairs, whose constructive feedback was instrumental in refining the quality of our study. This work was supported by the National Natural Science Foundation of China (No. 62236004 and No. 62022027) and the CAAI-Huawei MindSpore Open Fund.

References

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 technical report.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. ArXiv preprint, abs/2210.00720.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. ArXiv preprint, abs/2207.05221.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. ArXiv preprint, abs/2206.14858.

Xiaonan Li and Xipeng Qiu. 2023. MoT: Pre-thinking and recalling enable ChatGPT to self-improve with memory-of-thoughts. ArXiv preprint, abs/2305.05181.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv preprint, abs/2203.02155.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ArXiv preprint, abs/2206.04615.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
[Appendix template: instruction input form.]

When answering questions, interpret them literally and think them carefully. If the question is unanswerable or unknowable, it is appropriate to say, "The answer is unknown."

Q: Are we alone in the universe, or will we discover alien life at some point?
A: