Do Large Language Models Know What They Don’t Know?

Zhangyue Yin♢  Qiushi Sun♠  Qipeng Guo♢  Jiawen Wu♢  Xipeng Qiu♢∗  Xuanjing Huang♢
♢ School of Computer Science, Fudan University
♠ Department of Mathematics, National University of Singapore
∗ Corresponding author
{yinzy21,jwwu21}@m.fudan.edu.cn  [email protected]  {qpguo16,xpqiu,xjhuang}@fudan.edu.cn

arXiv:2305.18153v2 [cs.CL] 30 May 2023

Abstract

Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks. Current research focuses on enhancing their performance within their existing knowledge. Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend. Therefore, the ability to understand their own limitations on the unknowns, referred to as self-knowledge, is of paramount importance. This study aims to evaluate LLMs' self-knowledge by assessing their ability to identify unanswerable or unknowable questions. We introduce an automated methodology to detect uncertainty in the responses of these models, providing a novel measure of their self-knowledge. We further introduce a unique dataset, SelfAware, consisting of unanswerable questions from five diverse categories and their answerable counterparts. Our extensive analysis, involving 20 LLMs including GPT-3, InstructGPT, and LLaMA, reveals an intrinsic capacity for self-knowledge within these models. Moreover, we demonstrate that in-context learning and instruction tuning can further enhance this self-knowledge. Despite this promising insight, our findings also highlight a considerable gap between the capabilities of these models and human proficiency in recognizing the limits of their knowledge.

[Figure 1: Know-Unknow Quadrant. The horizontal axis represents the model's memory capacity for knowledge, and the vertical axis represents the model's ability to comprehend and utilize knowledge. The four quadrants are Known Knowns, Known Unknowns, Unknown Knowns, and Unknown Unknowns.]

"True wisdom is knowing what you don't know."
–Confucius

1 Introduction

Recently, Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al., 2023), and LLaMA (Touvron et al., 2023) have shown exceptional performance on a wide range of NLP tasks, including common sense reasoning (Wei et al., 2022; Zhou et al., 2022) and mathematical problem-solving (Lewkowycz et al., 2022; Chen et al., 2022). Despite their ability to learn from huge amounts of data, LLMs still have limitations in their capacity to retain and understand information. To ensure responsible usage, it is crucial for LLMs to have the capability of recognizing their limitations and conveying uncertainty when responding to unanswerable or unknowable questions. This acknowledgment of limitations, also known as "knowing what you don't know," is a crucial aspect in determining their practical applicability. In this work, we refer to this ability as model self-knowledge.

The Know-Unknow quadrant in Figure 1 illustrates the relationship between the model's knowledge and comprehension. The ratio of "Known Knowns" to "Unknown Knowns" demonstrates the model's proficiency in understanding and applying existing knowledge. Techniques such as Chain-of-Thought (Wei et al., 2022), Self-Consistency (Wang et al., 2022), and Complex CoT (Fu et al., 2022) can be utilized to increase this ratio, resulting in improved performance on NLP tasks. We focus on the ratio of "Known Unknowns" to "Unknown Unknowns", which indicates the model's self-knowledge level, specifically its understanding of its own limitations and deficiencies in the unknowns.
Existing datasets such as SQuAD2.0 (Rajpurkar et al., 2018) and NewsQA (Trischler et al., 2017), widely used in question answering (QA), have been utilized to test the self-knowledge of models with unanswerable questions. However, these questions are context-specific and could become answerable when supplemented with additional information. Srivastava et al. (2022) attempted to address this by evaluating LLMs' competence in delineating their knowledge boundaries, employing a set of 23 pairs of answerable and unanswerable multiple-choice questions. They discovered that these models' performance barely surpassed that of random guessing. Kadavath et al. (2022) suggested probing the self-knowledge of LLMs through the implementation of a distinct "Value Head". Yet, this approach may encounter difficulties when applied across varied domains or tasks due to task-specific training. Consequently, we redirect our focus to the inherent abilities of LLMs, and pose the pivotal question: "Do large language models know what they don't know?"

In this study, we investigate the self-knowledge of LLMs using a novel approach. By gathering reference sentences with uncertain meanings, we can determine whether the model's responses reflect uncertainty using a text similarity algorithm. We quantified the model's self-knowledge using the F1 score. To address the small and idiosyncratic limitations of existing datasets, we created a new dataset called SelfAware. This dataset comprises 1,032 unanswerable questions, which are distributed across five distinct categories, along with an additional 2,337 questions that are classified as answerable. Experimental results on GPT-3, InstructGPT, LLaMA, and other LLMs demonstrate that in-context learning and instruction tuning can effectively enhance the self-knowledge of LLMs. However, the self-knowledge exhibited by the current state-of-the-art model, GPT-4, measures at 75.47%, signifying a notable disparity when contrasted with human self-knowledge, which is rated at 84.93%.

Our key contributions to this field are summarized as follows:

• We have developed a new dataset, SelfAware, that comprises a diverse range of commonly posed unanswerable questions.

• We propose an innovative evaluation technique based on text similarity to quantify the degree of uncertainty inherent in model outputs.

• Through our detailed analysis of 20 LLMs, benchmarked against human self-knowledge, we identified a significant disparity between the most advanced LLMs and humans.¹

¹ The code pertinent to our study can be accessed at https://github.com/yinzhangyue/SelfAware

2 Dataset Construction

To conduct a more comprehensive evaluation of the model's self-knowledge, we constructed a dataset that includes a larger number and more diverse types of unanswerable questions than the Know-Unknowns dataset (Srivastava et al., 2022). To facilitate this, we collected a corpus of 2,858 unanswerable questions, sourced from online platforms like Quora and HowStuffWorks. These questions were meticulously evaluated by three seasoned annotation analysts, each operating independently. The analysts were permitted to leverage external resources, such as search engines. To ensure the validity of our dataset, we retained only the questions that all three analysts concurred were unanswerable. This rigorous process yielded a finalized collection of 1,032 unanswerable questions.

In pursuit of a comprehensive evaluation, we opted for answerable questions drawn from three datasets: SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), and TriviaQA (Joshi et al., 2017). Our selection was guided by SimCSE (Gao et al., 2021), which allowed us to identify and select the answerable questions semantically closest to the unanswerable ones. From these sources, we accordingly drew samples of 1,487, 182, and 668 questions respectively, amassing a total of 2,337. Given that these questions can be effectively addressed using information available on Wikipedia, the foundational corpus for the training of current LLMs, it is plausible to infer that the model possesses the requisite knowledge to generate accurate responses to these questions.
75.47%, signifying a notable disparity when con- Our dataset, christened SelfAware, incorporates
trasted with human self-knowledge, which is rated 1,032 unanswerable and 2,337 answerable ques-
at 84.93%. tions. To reflect real-world distribution, our dataset
Our key contributions to this field are summa- 1
The code pertinent to our study can be accessed
rized as follows: https://github.com/yinzhangyue/SelfAware
Category | Description | Example | Percentage
No scientific consensus | The answer is still up for debate, with no consensus in the scientific community. | "Are we alone in the universe, or will we discover alien life at some point?" | 25%
Imagination | The question asks about people's imaginings of the future. | "What will the fastest form of transportation be in 2050?" | 15%
Completely subjective | The answer depends on personal preference. | "Would you rather be shot into space or explore the deepest depths of the sea?" | 27%
Too many variables | The question involves too many variables to be answered accurately. | "John made 6 dollars mowing lawns and 18 dollars weed eating. If he only spent 3 or 5 dollar a week, how long would the money last him?" | 10%
Philosophical | The question can yield multiple responses, but it lacks a definitive answer. | "How come god was born from nothingness?" | 23%

Table 1: Unanswerable questions in the SelfAware dataset span multiple categories.

To reflect real-world distribution, our dataset contains a proportion of answerable questions that is twice as large as the volume of unanswerable ones. Nevertheless, to ensure the feasibility of testing, we have purposefully capped the number of answerable questions.

2.1 Dataset Analysis

To gain insight into the reasons precluding a certain answer, we undertook a manual analysis of 100 randomly selected unanswerable questions. As tabulated in Table 1, we have broadly segregated these questions into five distinctive categories. "No Scientific Consensus" encapsulates questions that ignite ongoing debates within the scientific community, such as those concerning the universe's origin. "Imagination" includes questions involving speculative future scenarios, like envisaged events over the next 50 years. "Completely Subjective" comprises questions that are inherently personal, where answers depend heavily on individual predispositions. "Too Many Variables" pertains to mathematical problems that become unsolvable owing to the overwhelming prevalence of variables. Lastly, "Philosophical" represents questions of a profound, often metaphysical, nature that resist concrete answers. Ideally, upon encountering such questions, the model should express uncertainty instead of delivering conclusive responses.

3 Evaluation Method

This section elucidates the methodology employed for assessing self-knowledge in the generated text. In order to achieve this, we define a similarity function, $f_{\mathrm{sim}}$, to compute the similarity, $S$, between a given sentence, $t$, and a collection of reference sentences, $U = \{u_1, u_2, \ldots, u_n\}$, endowed with uncertain meanings:

$$S_i = f_{\mathrm{sim}}(t, u_i). \qquad (1)$$

Whenever any $S_i$ surpasses a pre-determined threshold $T$, we perceive the text $t$ as encompassing uncertain meanings, thereby eliminating the need for manual evaluation of the response.

Given the substantial disparity in the volume of answerable and unanswerable questions in SelfAware, we adopt the F1 score as a measure of LLMs' self-knowledge. Our focus rests on identifying unanswerable questions, hence we designate them as positive cases and categorize answerable questions as negative cases.
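As a concrete illustration of this scoring scheme, below is a minimal sketch (not the authors' released implementation) of the uncertainty decision rule in Eq. (1) and of the F1 computation with unanswerable questions as the positive class. The similarity function is left as a pluggable callable; in the paper it is instantiated with SimCSE, and the threshold T is set to 0.75 (Section 4.2).

```python
from typing import Callable, Iterable

def is_uncertain(response_chunks: Iterable[str],
                 references: Iterable[str],
                 f_sim: Callable[[str, str], float],
                 threshold: float = 0.75) -> bool:
    """Eq. (1): flag a response as uncertain if any of its chunks is
    sufficiently similar to any reference sentence with uncertain meaning."""
    return any(f_sim(t, u) >= threshold
               for t in response_chunks
               for u in references)

def f1_self_knowledge(pred_uncertain: list[bool],
                      is_unanswerable: list[bool]) -> float:
    """F1 with unanswerable questions as the positive class: a question counts
    as a true positive when it is unanswerable and the model's response was
    flagged as uncertain."""
    tp = sum(p and g for p, g in zip(pred_uncertain, is_unanswerable))
    fp = sum(p and not g for p, g in zip(pred_uncertain, is_unanswerable))
    fn = sum((not p) and g for p, g in zip(pred_uncertain, is_unanswerable))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```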
its derivative models, namely Alpaca (Taori et al.,
3 Evaluation Method 2023) and Vicuna (Chiang et al., 2023). Our in-
vestigative approach employed three distinct input
This section elucidates the methodology employed forms: Direct, Instruction, and In-Context Learn-
for assessing self-knowledge in the generated text. ing (ICL), which is encapsulated in Appendix A.4.
[Figure 2: Experimental results (F1 scores) using three different input forms (Direct, Instruction, and In-Context Learning) on a series of models from GPT-3 (ada, babbage, curie, and davinci) and InstructGPT (text-ada-001, text-babbage-001, text-curie-001, and text-davinci-001).]
[Figure 3: Comparison between the davinci series and human self-knowledge in instruction input form (F1 scores: Human 84.93, gpt-4-0314 75.47, gpt-3.5-turbo-0301 54.12, text-davinci-003 51.43, text-davinci-002 47.48, text-davinci-001 49.61, davinci 45.67).]

[Figure 4: Experimental comparison of the davinci series in ICL input form (F1 scores).]

4.2 Setting

We devised the reference sentence set U through a process that combined automated generation by LLMs and manual filtering, detailed further in Appendix A.1. To quantify the similarity between target and reference sentences, we utilized SimCSE (Gao et al., 2021), setting the similarity threshold to 0.75 during our experiments. An exploration of threshold ablation is available in Appendix A.2. To counteract potential errors in similarity calculation induced by varying lengths of the target and reference sentences, we employed a sliding window of length 5 to parse the target sentence into semantic chunks. During the generation process, we set the temperature to 0.7. We selected a random sample of 100 instances for GPT-4, while the remainder of the models were scrutinized using the full SelfAware dataset.
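The following is a minimal sketch of how this similarity computation might be realized with SimCSE, assuming the Hugging Face princeton-nlp/sup-simcse-bert-base-uncased checkpoint (the paper does not specify which SimCSE checkpoint was used) and a word-level sliding window of length 5 with stride 1; names and details beyond those stated in the paper are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper only states that SimCSE (Gao et al., 2021) is used.
CKPT = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

def embed(sentences: list[str]) -> torch.Tensor:
    """SimCSE sentence embeddings (pooler output), L2-normalized."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = model(**batch, output_hidden_states=True, return_dict=True).pooler_output
    return torch.nn.functional.normalize(emb, dim=-1)

def sliding_chunks(sentence: str, window: int = 5) -> list[str]:
    """Split a target sentence into overlapping word-level chunks of length 5."""
    words = sentence.split()
    if len(words) <= window:
        return [sentence]
    return [" ".join(words[i:i + window]) for i in range(len(words) - window + 1)]

def response_is_uncertain(response: str, reference_sentences: list[str],
                          threshold: float = 0.75) -> bool:
    """Flag a response as uncertain if any chunk of any of its sentences is
    sufficiently similar to a reference sentence with uncertain meaning."""
    ref_emb = embed(reference_sentences)              # (n_refs, d)
    chunks: list[str] = []
    for sent in response.split("."):                  # crude sentence split
        if sent.strip():
            chunks.extend(sliding_chunks(sent.strip()))
    chunk_emb = embed(chunks)                         # (n_chunks, d)
    sims = chunk_emb @ ref_emb.T                      # cosine similarities
    return bool((sims >= threshold).any())
```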
4.3 Human Self-Knowledge

To establish a benchmark for human self-knowledge, we engaged two volunteers and selected 100 random samples from the SelfAware dataset. The volunteers had 30 minutes to make judgments on the same set of questions, yielding an average F1 score of 84.93%, which we subsequently adopted as the benchmark for human self-knowledge. Detailed scores are available in Appendix A.3.

4.4 Analysis

We evaluate the manifestation of LLMs' self-knowledge, centering our investigation on three fundamental dimensions: the size of the model, the impact of instruction tuning, and the influence exerted by different input forms.

Model Size. Figure 2 illustrates the correlation between model size and self-knowledge across various LLMs. It is noteworthy that across all three input forms, an augmentation in model parameter size is associated with an elevation in the F1 score, with the most conspicuous enhancement manifesting in the ICL input form. Therefore, our analysis indicates that an LLM's self-knowledge tends to improve with increasing model size, a trend consistent with the scaling law.
[Figure 5: Experimental results (F1 scores) obtained from LLaMA and its derived models, Alpaca and Vicuna, in instruction input form.]

[Figure 6: Accuracy of the InstructGPT series when responding to answerable questions in instruction input form.]

Instruction Tuning. Figure 2 delineates that models from the InstructGPT series exhibit a superior level of self-knowledge compared to their GPT-3 counterparts. Further evidence of model enhancement is provided by Figure 4, where text-davinci models show significant improvement relative to the base davinci model. An additional comparative analysis, presented in Figure 5, evaluates LLaMA against its derivative models. The results underscore a notable increase in self-knowledge for Alpaca and Vicuna upon instruction tuning, exceeding their base model performances. Among these, Vicuna-13B outperforms LLaMA-65B, corroborating the efficacy of instruction tuning for enhancing model self-knowledge.

Input Forms. As shown in Figure 2, the incorporation of instructions and examples serves to boost the self-knowledge of both the GPT-3 and InstructGPT series. Specifically, the ICL input form, providing richer contextual information, contributes to a significant enhancement in models' self-knowledge. This impact is particularly noticeable in the davinci model, where ICL facilitates a 27.96% improvement over the Direct input form. Moreover, a comparison between Figure 3 and Figure 4 reveals that the inclusion of instructions and examples successfully minimizes the performance disparity between the davinci and text-davinci models, suggesting an acquisition of self-knowledge from the instructions and provided examples.

Compared with Human. Figure 3 reveals that, without supplementary samples, GPT-4 currently performs best among the tested models, achieving an impressive F1 score of 75.47%. However, a noticeable gap becomes evident when comparing this performance to the human benchmark of 84.93%. This underscores the considerable potential that remains for enhancing the self-knowledge level of LLMs.

Answerable Questions. Figure 6 traces the performance evolution of the InstructGPT series in addressing answerable questions, adhering to the closed-book question answering paradigm (Touvron et al., 2023), where output accuracy is contingent on the presence of the correct answer. Our observations underscore a steady enhancement in QA task accuracy corresponding to an increase in model parameter size and continuous learning. Notably, accuracy climbs from a meager 2.48% for text-ada-001 to 10.61% for text-davinci-001, whereas GPT-4 marks an even more striking jump to 42.64%.
the self-knowledge of both the GPT-3 and Instruct-
GPT series. Specifically, ICL input form, providing 5 Conclusion
richer contextual information, contributes to a sig-
nificant enhancement in models’ self-knowledge. This study investigates the self-knowledge of
This impact is particularly noticeable in the davinci LLMs by evaluating their ability to identify unan-
model, where ICL facilitates a 27.96% improve- swerable questions. Through the introduction of a
ment over the direct. Moreover, a comparison be- novel dataset and an automated method for detect-
tween Figure 3 and Figure 4 reveals that the in- ing uncertainty in the models’ responses, we are
clusion of instructions and examples successfully able to accurately measure the self-knowledge of
minimizes the performance disparity between the LLMs such as GPT-3, InstructGPT and LLaMA.
davinci and text-davinci models, suggesting an ac- Our results reveal that while these models possess
quisition of self-knowledge from the instructions a certain degree of self-knowledge, there is still
and provided examples. an apparent disparity in comparison to human self-
knowledge. This highlights the need for further
Compared with Human. Figure 3 reveals that, research in this area to enhance the ability of LLMs
without supplementary samples, GPT-4 currently to understand their own limitations on the unknows.
performs best among the tested models, achieving Such efforts will lead to more accurate and reliable
an impressive F1 score of 75.47%. However, a no- responses from LLMs, which will have a positive
ticeable gap becomes evident when comparing this impact on their applications in diverse fields.
Limitations

• Generalization of reference sentences. At present, we have selected sentences with uncertain meanings exclusively from the GPT-3 and InstructGPT series, potentially overlooking uncertainty present in responses generated by other LLMs. However, it is not feasible to catalog all sentences with uncertain meanings exhaustively. As a direction for future research, we propose to concentrate on the automated acquisition of more accurate reference sentences to address this concern.

• Limitations of input forms. Our examination was confined to three unique input forms: direct, instruction, and ICL. There is burgeoning research aimed at bridging the gap between models and human-like methods of reasoning and problem-solving, including but not limited to approaches like Reflexion (Shinn et al., 2023), ToT (Yao et al., 2023), and MoT (Li and Qiu, 2023). Future endeavors will integrate additional cognitive and decision-making methods to delve deeper into the self-knowledge exhibited by these LLMs.

Ethics Statement

The SelfAware dataset, meticulously curated to evaluate LLMs' ability to discern unanswerable questions, is composed of unanswerable questions extracted from sources such as Quora and HowStuffWorks, alongside answerable questions procured from three distinct open datasets. Every question was thoroughly examined for relevance and harmlessness. To ensure content validity, three annotation analysts, compensated at local wage standards, dedicated regular working hours to content review.

Throughout our research process, we underscored the significance of privacy, data security, and strict compliance with dataset licenses. In order to protect data integrity, we implemented anonymization and content filtration mechanisms. Our adherence to OpenAI's stipulations remained unyielding for the usage of GPT-3 and InstructGPT models, and likewise for Meta's terms pertaining to LLaMA models. We rigorously vetted the licenses of the three publicly available datasets for compliance, ensuring that all our research methodologies were in alignment with ethical standards at the institutional, national, and global levels.

Adhering to the CC-BY-SA-4.0 protocol, the dataset, once publicly released, will be reserved exclusively for research purposes. We pledge to promptly and effectively address any concerns relating to the dataset, while expecting researchers to maintain high ethical standards in their use of this data.

Acknowledgement

We wish to express our gratitude to our colleagues in the FudanNLP group, whose insightful suggestions, perspectives, and thought-provoking discussions significantly contributed to this work. Our sincere appreciation also extends to the anonymous reviewers and area chairs, whose constructive feedback was instrumental in refining the quality of our study. This work was supported by the National Natural Science Foundation of China (No. 62236004 and No. 62022027) and the CAAI-Huawei MindSpore Open Fund.

References

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 technical report.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. ArXiv preprint, abs/2210.00720.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. ArXiv preprint, abs/2207.05221.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. ArXiv preprint, abs/2206.14858.

Xiaonan Li and Xipeng Qiu. 2023. MoT: Pre-thinking and recalling enable ChatGPT to self-improve with memory-of-thoughts. ArXiv preprint, abs/2305.05181.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv preprint, abs/2203.02155.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ArXiv preprint, abs/2206.04615.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. ArXiv preprint, abs/2305.10601.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. ArXiv preprint, abs/2205.10625.
A Appendix

A.1 Uncertainty Text

To assemble a set of reference sentences, we randomly chose 100 entries from the SelfAware dataset. For each model in the GPT-3 and InstructGPT series, we conducted a preliminary test using the direct input form and manually curated sentences that displayed uncertainty. From this pre-test, we procured 16 sentences manifesting uncertain connotations to serve as our reference sentences. After normalizing these sentences by eliminating punctuation and converting to lowercase, we utilized them to compute similarity with target sentences throughout our experimental procedure.

1. The answer is unknown.
2. The answer is uncertain.
3. The answer is unclear.
4. There is no scientific evidence.
5. There is no definitive answer.
6. There is no right answer.
7. There is much debate.
8. There is no known case.
9. There is no concrete answer to this question.
10. There is no public information available.
11. It is impossible to know.
12. It is impossible to answer.
13. It is difficult to predict.
14. It is not known.
15. We do not know.
16. I'm not sure.
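As an illustration of the normalization step described above, here is a minimal sketch (with hypothetical helper names, not the authors' code) that lowercases the reference sentences and strips punctuation before they are embedded.

```python
import string

# The 16 reference sentences from Appendix A.1 (abbreviated here for brevity).
REFERENCE_SENTENCES = [
    "The answer is unknown.",
    "The answer is uncertain.",
    "It is impossible to know.",
    "I'm not sure.",
    # ... remaining sentences from the list above
]

def normalize_reference(sentence: str) -> str:
    """Eliminate punctuation and convert to lowercase, as described in A.1."""
    return sentence.translate(str.maketrans("", "", string.punctuation)).lower().strip()

normalized_references = [normalize_reference(s) for s in REFERENCE_SENTENCES]
# e.g., "I'm not sure." -> "im not sure"
```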
A.2 Threshold Ablation

We generated 100 new responses using text-davinci-002 with the direct input form and manually filtered out sentences that contained uncertainty. We then used SimCSE (Gao et al., 2021) to calculate the similarity between these sentences and the reference sentences in Appendix A.1. We tested various thresholds for filtering sentences with uncertain meanings and compared them to the manually annotated sentences. We considered unanswerable questions as positive examples and calculated precision, recall, and F1 score. The results in Table 2 indicate that a threshold of 0.75 produced the highest F1 score, balancing precision and the inclusion of other uncertain sentences. As a result, we selected 0.75 as the similarity threshold for subsequent experiments.

Threshold | Precision | Recall | F1
0.95 | 100.00 | 70.00 | 82.35
0.90 | 100.00 | 75.00 | 85.71
0.85 | 100.00 | 75.00 | 85.71
0.80 | 100.00 | 80.00 | 88.89
0.75 | 100.00 | 85.00 | 91.89
0.70 | 89.47 | 90.00 | 89.73
0.65 | 86.95 | 90.00 | 88.45

Table 2: Evaluation results comparing sentences with uncertain meaning filtered by various thresholds.

A.3 Human Self-Knowledge Test

The evaluation results for the responses from our invited volunteers are presented in Table 3. The F1 scores for the responses were high, indicating that both volunteers exhibited a strong level of self-knowledge.

Human | Precision | Recall | F1
Volunteer A | 91.52 | 78.26 | 84.37
Volunteer B | 96.36 | 76.81 | 85.48

Table 3: Evaluation results of 100 responses from two volunteers.

A.4 Template

The input templates used in our experiments, Direct, Instruction, and ICL, are illustrated in Figures 7, 8, and 9, respectively. In the ICL template, we composed 3 answerable and 3 unanswerable questions and provided the corresponding answers manually.
Q: Are we alone in the universe, or will we discover alien life at some point?
A:

Figure 7: Direct input form.

When answering questions, interpret them literally and think them carefully. If the question is
unanswerable or unknowable, it is appropriate to say, “The answer is unknown.”

Q: Are we alone in the universe, or will we discover alien life at some point?
A:

Figure 8: Instruction input form.

Q: What is the nearest extraterrestrial civilization to us?


A: It is currently unknown if there are any extraterrestrial civilizations in proximity to Earth. Scientists
have not yet found definitive evidence of the existence of extraterrestrial civilizations. Thus, the answer is
unknown.
Q: What is the highest building in New York?
A: The highest building in New York is the One World Trade Center. It stands at a height of 1,776 feet (541
meters). Thus, the answer is the One World Trade Center.
Q: Will nuclear war break out in the world in 2050?
A: It is impossible to predict with certainty if a nuclear war will break out in the world in 2050. The
likelihood of such an event will depend on a wide range of factors, including international relations,
political tensions, and military capabilities. Thus, the answer is unknown.
Q: Alice's class is going on a field trip to the Ocean Park. There are 39 students and several teachers in the
class. A bus can seat 20 people. How many buses do they need?
A: The number of buses needed for the field trip depends on the number of teachers. If there is only one
teacher, then two buses will suffice. However, additional buses will be necessary if there are two or more
teachers. Thus, the answer is unknown.
Q: Are Local H and For Against both from the United States?
A: Local H is a rock band from Zion, Illinois, United States. For Against is an American indie rock band
formed in 1984 in Lincoln, Nebraska. Both of these bands are from the United States. Thus, the answer is
yes.
Q: Gjetost is the national cheese of which country?
A: It is the national cheese of Norway, and it is a popular ingredient in traditional Norwegian cuisine. Thus,
the answer is Norway.

Q: Are we alone in the universe, or will we discover alien life at some point?
A:

Figure 9: ICL input form.
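To show how these three input forms could be assembled programmatically, here is a minimal sketch. The template strings mirror Figures 7–9 (the ICL demonstrations are abbreviated); the function and variable names are illustrative assumptions rather than the authors' released code, and the assembled prompt is handed to whichever model is under evaluation.

```python
INSTRUCTION = (
    "When answering questions, interpret them literally and think them carefully. "
    "If the question is unanswerable or unknowable, it is appropriate to say, "
    "\"The answer is unknown.\"\n\n"
)

# Abbreviated ICL demonstrations; the full set of 3 answerable and 3
# unanswerable examples is shown in Figure 9.
ICL_EXAMPLES = (
    "Q: What is the nearest extraterrestrial civilization to us?\n"
    "A: ... Thus, the answer is unknown.\n"
    "Q: Gjetost is the national cheese of which country?\n"
    "A: ... Thus, the answer is Norway.\n\n"
)

def build_prompt(question: str, form: str = "direct") -> str:
    """Assemble the Direct, Instruction, or ICL input form (Figures 7-9)."""
    core = f"Q: {question}\nA:"
    if form == "direct":
        return core
    if form == "instruction":
        return INSTRUCTION + core
    if form == "icl":
        return ICL_EXAMPLES + core
    raise ValueError(f"unknown input form: {form}")

# Usage: prompt = build_prompt("Are we alone in the universe, or will we "
#                              "discover alien life at some point?", form="icl")
# The prompt is then sent to the model under evaluation (temperature 0.7).
```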
