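Earlier this month the American Statistical Association (ASA) published six principles on the use of p-values. Ronald L. Wasserstein, the association's Executive Director and the person responsible for the statement, gave an interview to Retraction Watch, a blog that monitors paper retractions*1, and explained that the recent reproducibility crisis forms the background to the statement (H/T Mostly Economics). The six principles have been picked up in many places in Japan as well; the Naver Matome page covers that coverage in detail.
On its site, the ASA has published the statement presenting the six principles together with responses from 21 statisticians to its discussion of p-values. One of them, UC Berkeley professor Philip B. Stark, writes in the short essay named in this post's title (original title: "The Value of p-Values") that he appreciates the spirit of the statement but disagrees somewhat with its content, and he raises the following points.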
- The informal definition of a p-value at the beginning of the document is vague and unhelpful.
- The statement draws a distinction between "the null hypothesis" and "the underlying assumptions" under which the p-value is calculated. But the null hypothesis is the complete set of assumptions under which the p-value is calculated.
- The "other approaches" section ignores the fact that the assumptions of some of those methods are identical to those of p-values. Indeed, some of the methods use p-values as input (e.g., the False Discovery Rate).
- The statement ignores the fact that hypothesis tests apply in many situations in which there is no parameter or notion of an "effect," and hence nothing to estimate or to calculate an uncertainty for.
- The statement ignores the crucial distinction between frequentist and Bayesian inference.
[Additional points made in a footnote] The document has other problems, among them: It characterizes a p-value of 0.05 as "weak" evidence against the null hypothesis, but strength of evidence depends crucially on context. It categorically recommends using multiple numerical and graphical summaries of data, but there are situations in which these would be gratuitous distractions—if not an invitation to p-hacking!
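(My translation)
- The informal definition of a p-value at the start of the document is vague and unhelpful*2.
- The statement distinguishes "the null hypothesis" from "the underlying assumptions" on which the p-value is computed. But the null hypothesis is the full set of assumptions on which the p-value is computed*3.
- The "other approaches" section ignores the fact that some of those methods rest on exactly the same assumptions as p-values. In fact, some of them take p-values as their input (e.g., the false discovery rate).
- The statement ignores the fact that hypothesis tests are applied in many situations where no parameter or notion of an "effect" exists. In those cases there is nothing to estimate and nothing to compute an uncertainty for.
- The statement ignores the decisive difference between frequentist and Bayesian inference.
[Additional points made in a footnote] The statement has other problems as well, such as:
- It characterizes a p-value of 0.05 as "weak" evidence against the null hypothesis, but the strength of evidence depends decisively on context.
- It strongly recommends using multiple numerical and graphical summaries of the data, but there are situations in which doing so is a distraction and, at worst, an entry point to p-hacking.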
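Stark's remark that some alternative methods take p-values as input can be made concrete with the false discovery rate. The following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one common way of controlling the FDR. Neither Stark's essay nor the ASA statement names a specific procedure, so the choice of procedure, the function name `benjamini_hochberg`, the level `q`, and the example p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where a hypothesis is rejected.

    Illustrative sketch: the only inputs are p-values, so every
    assumption behind those p-values carries over to the result.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight separate tests.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example_p_values, q=0.05))
```

With these made-up inputs only the two smallest p-values are rejected, even though five of the eight fall below 0.05 when each is judged on its own.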
Stark then offers what he considers a "simpler explanation" as an alternative to the statement.
Science progresses in part by ruling out potential explanations of data. p-values help assess whether a given explanation is adequate. The explanation being assessed is often called "the null hypothesis."
If the p-value is small, either the explanation is wrong, or the explanation is right but something unlikely happened—something that had a probability equal to the p-value. Small p-values are stronger evidence that the explanation is wrong: the data cast doubt on that explanation.
If the p-value is large, the explanation accounts for the data adequately—although the explanation might still be wrong. Large p-values are not evidence that the explanation is right: lack of evidence that an explanation is wrong is not evidence that the explanation is right. If the data are few or low quality, they might not provide much evidence, period.
There is no bright line for whether an explanation is adequate: scientific context matters.
A p-value is computed by assuming that the explanation is right. The p-value is not the probability that the explanation is right.
p-values do not measure the size or importance of an effect, but they help distinguish real effects from artifacts. In this way, they complement estimates of effect size and confidence intervals.
Moreover, p-values can be used in some contexts in which the notion of "effect size" does not make sense. Hence, p-values may be useful in situations in which estimates of effect size and confidence intervals are not.
Like all tools, p-values can be misused. One common misuse is to hunt for explanations that have small p-values, and report only those, without taking into account or reporting the hunting. Such "p-hacking," "significance hunting," selective reporting, and failing to account for the fact that more than one explanation was examined ("multiplicity") can make the reported p-values misleading.
Another misuse involves testing "straw man" explanations that have no hope of explaining the data: null hypotheses that have little connection to how the data were collected or generated. If the explanation is unrealistic, a small p-value is not surprising. Nor is it illuminating.
Many fields and many journals consider a result to be scientifically established if and only if a p-value is below some threshold, such as 0.05. This is poor science and poor statistics, and creates incentives for researchers to "game" their analyses by p-hacking, selective reporting, ignoring multiplicity, and using inappropriate or contrived null hypotheses.
Such misuses can result in scientific "discoveries" that turn out to be false or that cannot be replicated. This has contributed to the current "crisis of reproducibility" in science.
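(My translation)
Part of scientific progress consists of ruling out candidate explanations of data. p-values help assess whether a given explanation is adequate. The explanation under assessment is often called "the null hypothesis."
If the p-value is small, then either the explanation is wrong, or the explanation is right but something unlikely happened. The probability of that unlikely event equals the p-value. The smaller the p-value, the stronger the evidence that the explanation is wrong; that is, the data cast doubt on the explanation.
If the p-value is large, the explanation accounts for the data adequately, although it may still be wrong. A large p-value is not evidence that the explanation is right, because a lack of evidence that an explanation is wrong is not evidence that it is right. If the data are few or of low quality, they may not provide much evidence, and that is the end of the matter.
There is no clear dividing line for whether an explanation is adequate; the scientific context matters.
A p-value is computed on the assumption that the explanation is right. A p-value is not the probability that the explanation is right.
p-values do not measure the size or importance of an effect, but they help distinguish real effects from spurious ones. In that respect they complement estimates of effect size and confidence intervals.
p-values can also be used in some situations where the notion of "effect size" has no meaning. They may therefore be useful where estimates of effect size and confidence intervals are not.
Like all tools, p-values can be misused. One common misuse is to hunt for explanations with small p-values and report only those, without taking the hunt into account or reporting it. Such "p-hacking" or "significance hunting," selective reporting of results, and failure to account for having examined more than one explanation ("multiplicity") can make the reported p-values misleading.
Another misuse is to test "straw man" explanations that have no chance of explaining the data, meaning null hypotheses that have almost nothing to do with how the data were collected or generated. If the explanation is unrealistic, a small p-value is neither surprising nor illuminating.
Many fields and many journals regard a result as scientifically established if, and only if, the p-value falls below some threshold such as 0.05. This is poor practice both as science and as statistics, and it gives researchers an incentive to "game" their analyses through p-hacking, selective reporting, ignoring multiplicity, and using inappropriate or contrived null hypotheses.
Such misuses can lead to scientific "discoveries" that later turn out to be false or that cannot be replicated. This has contributed to the current "reproducibility crisis" in science.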
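The multiplicity problem in Stark's explanation can be illustrated with a small simulation. It is only a sketch under stated assumptions: every explanation examined is in fact correct, the tests are independent, and each p-value is therefore drawn uniformly from [0, 1] (which holds for a continuous test statistic under a true null). The function name `chance_of_false_discovery` and all parameter values are hypothetical.

```python
import random


def chance_of_false_discovery(num_tests, alpha=0.05, num_studies=20000, seed=0):
    """Estimate how often "hunting" yields at least one p-value below alpha.

    Each simulated study examines `num_tests` explanations that are all
    true, then reports only the smallest p-value it found.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(num_studies):
        # Assumption: under a true null, a p-value is uniform on [0, 1].
        smallest = min(rng.random() for _ in range(num_tests))
        if smallest < alpha:
            hits += 1
    return hits / num_studies


for k in (1, 5, 20):
    simulated = chance_of_false_discovery(k)
    exact = 1 - (1 - 0.05) ** k  # closed form for independent tests
    print(f"tests examined: {k:2d}  simulated: {simulated:.3f}  exact: {exact:.3f}")
```

Under these assumptions, a study that examines 20 correct explanations and reports only the best-looking one will show a p-value below 0.05 about 64% of the time, which is why a reported p-value is misleading unless the hunting is reported with it.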
*1: cf. Retraction Watch - Wikipedia. Related Japanese article 1, related Japanese article 2.
*2: The passage in the statement that this appears to refer to:
What is a p-value?
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
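The informal definition quoted above can be turned into a small worked example. The sketch below assumes one particular "specified statistical model" that the statement itself does not prescribe: the group labels are exchangeable, so every way of splitting the pooled data into two groups of the observed sizes is equally likely. The function name `permutation_p_value` and the data are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in sample means.

    Both groups must be non-empty. Assumed model: labels are exchangeable,
    so each relabeling of the pooled data is equally probable.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(sum_a):
        # Mean of the first group minus mean of the second group.
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))

    extreme = 0
    count = 0
    for indices in combinations(range(len(pooled)), n_a):
        count += 1
        relabeled_sum = sum(pooled[i] for i in indices)
        # "Equal to or more extreme than its observed value";
        # the tiny tolerance guards against floating-point ties.
        if abs(mean_diff(relabeled_sum)) >= observed - 1e-12:
            extreme += 1
    return extreme / count


# Hypothetical measurements for two compared groups.
print(permutation_p_value([1, 2, 3], [4, 5, 6]))  # prints 0.1
```

Of the 20 possible splits of these six values, only two give a mean difference at least as large in magnitude as the observed one, hence 2/20 = 0.1. The number depends entirely on the assumed model, which is the point of Stark's second bullet.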
*3: In the statement, principle 1 ("P-values can indicate how incompatible the data are with a specified statistical model.") is followed by this passage:
A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called "null hypothesis." Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.