1904 00812 PDF
Abstract
The pioneering research of G. K. Zipf on the relationship between word fre-
quency and other word features led to the formulation of various linguistic laws.
The most popular is Zipf’s law for word frequencies. Here we focus on two
laws that have been studied less intensively: the meaning-frequency law, i.e.
the tendency of more frequent words to be more polysemous, and the law of
abbreviation, i.e. the tendency of more frequent words to be shorter. In a pre-
vious work, we tested the robustness of these Zipfian laws for English, roughly
measuring word length in number of characters and distinguishing adult from
child speech. In the present article, we extend our study to other languages
(Dutch and Spanish) and introduce two additional measures of length: syllabic
length and phonemic length. Our correlation analysis indicates that both the
meaning-frequency law and the law of abbreviation hold overall in all the ana-
lyzed languages.
Keywords: Zipf’s laws, polysemy, brevity, word frequency
1. Introduction
The linguist George Kingsley Zipf (1902-1950) is known for his investigations
of statistical laws of language [1, 2]. Perhaps the most popular one is Zipf’s
law for word frequencies [1, 3], which states that the frequency of the i-th
most frequent word in a text follows approximately
\[ f \propto i^{-\alpha} \tag{1} \]
where f is the frequency of that word, i its rank or order and α is a constant
(α ≈ 1). Zipf’s law is an example of a power-law model for the relationship
between two variables [4]. Zipf’s law for word frequencies can be explained by
information theoretic models of communication [5] and is a robust pattern of
∗ Corresponding author
Authors: Bernardino Casas, Antoni Hernández-Fernández, Neus Català,
Ramon Ferrer-i-Cancho, Jaume Baixeries
\[ m \propto i^{-\gamma} \tag{3} \]
These laws are examples of laws where the predictor is the word frequency
and the response is another word feature. These laws are regarded as universal
although the only evidence of their universality is that they hold in every lan-
guage where they have been tested so far [11]. Because of their generality, these
laws have triggered modeling efforts that attempt to explain their origin and
support their presumable universality with the help of abstract mechanisms or
communication principles [12, 13], and work that explores these statistical
patterns directly in the voice signal, below the phoneme scale [14]. Therefore, investigating
the experimental conditions under which these laws surface is crucial.
In a previous work [15], we studied these linguistic laws in a large corpus
of child and adult language (CHILDES) [16]. We extracted semantic polysemy
values from WordNet [17] and the SemCor corpus¹, and defined word length simply
as the number of characters per word.
In the present article we extend our research in [15] by re-analyzing the
behaviour of those linguistic laws in children and adults separately, using the
transcripts in the CHILDES database, and exploring different definitions of
word frequency, word polysemy and word length. In order to test the statisti-
cal validity of these linguistic laws, we also expand the number of languages to
Dutch and Spanish, as well as English, which was the only language analyzed
in [15]. Concerning word frequency, we consider two major sources of estima-
tion: the CHILDES database [16] and Wikipedia [18]. Frequency estimates are
computed separately for children and adults (comprising mothers, fathers and
investigators). This division allows us to compare children’s and adults’ linguistic
1 http://multisemcor.fbk.eu/semcor.php
production: motherese, also known as child-directed speech (CDS) or
infant-directed speech (IDS), has been studied for many years and is still an
active topic of research [19].
Concerning polysemy, we define the polysemy of a word as the number of dif-
ferent senses it has, based on the WordNet of its corresponding language (Prince-
ton WordNet, Open Dutch WordNet and Multilingual Central Repository for
Spanish). Hereafter, we will refer to this polysemy as WordNet polysemy. We
assume that the polysemy measure provided by WordNet does not distinguish
between different types of polysemy and we are aware of the inherent difficulties
of borrowing this conceptual framework (see [20, 21, 22, 23]). Concerning word
length, we consider three different units of measurement: a graphical unit (num-
ber of characters) and two phonetic units (number of phonemes and number of
syllables). Combining the sources of word frequency and polysemy values
with the units of measurement for word length (two frequency sources, each
paired with WordNet polysemy and with three measures of length), we arrive at eight
major ways of investigating the meaning-frequency law and the law of abbreviation
for each language under study.
In this paper, we investigate these laws qualitatively using measures of cor-
relation between two variables. Thus, the law of abbreviation is defined as a
significant negative correlation between the frequency of a word and its length
for any unit of measurement. The meaning-frequency law is defined as a sig-
nificant positive correlation between the frequency of a word and its WordNet
polysemy, a proxy for the number of meanings of a word. While our approach
to these laws is non-parametric (we are not assuming any particular model for
the relationship between two variables), traditional research on statistical laws
of language is mostly parametric, assuming some sort of power law or general-
izations of power laws [4, 10, 24].
We adopt these correlational definitions to remain agnostic about the actual
functional dependency between the variables, which is currently under revision
for various statistical laws of language [25, 26, 27]. We will show that a
significant correlation of the right sign is found in the majority of combinations
of conditions mentioned above, providing support for the hypothesis that these
laws originate from abstract mechanisms. We also propose some hypotheses
to explain why in some exceptional cases the analyzed variables do not
correlate significantly.
The remainder of the article is organized as follows. Section 2 revisits the
power law model that Zipf proposed for the law of meaning distribution [1, 3]
to illustrate the challenges of the parametric approach. Then we
justify the non-parametric approach (correlation analysis)
that we have adopted in this article for statistical laws of language involving
word frequency. Sections 3 and 4 present, respectively, the materials (databases)
and the methods employed to analyze them. Section 5 presents the results of
our analysis of the meaning-frequency law and the law of abbreviation. Section
6 discusses our findings and suggests future work.
To check whether the equation that Zipf proposed for the law of meaning
distribution (Eq. 3) holds on modern corpora, we have reproduced the computations
described in [1] on a data set that we explore in depth in the next sections.
When plotting the relationship between number of meanings per word (on
the ordinate) and frequency rank (abscissa), Zipf applied a linear binning tech-
nique to reduce noise. When using bins of length λ, the 1st bin is formed by
the λ most frequent words, the 2nd bin is formed by the next λ most frequent
words,... etc. Formally, the j-th bin is defined by words whose rank i satisfies
λ(j − 1) + 1 ≤ i ≤ λj. (4)
Zipf plotted the relationship between the average number of meanings of the
j-th bin and j [1, p. 30] and fitted a power law (Eq. 3). We follow the same
method for a sample from the CHILDES database (see Section 3.5 for further
details).
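The linear binning of Eq. 4 can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: the function names and the toy data are hypothetical, and a final incomplete bin is simply averaged over the words it contains.

```python
from statistics import mean


def bin_index(rank: int, lam: int) -> int:
    """Return the 1-based bin j with lam*(j-1)+1 <= rank <= lam*j (Eq. 4)."""
    return (rank - 1) // lam + 1


def mean_meanings_per_bin(meanings_by_rank: list[int], lam: int) -> list[float]:
    """Average number of meanings in each successive bin of lam words.

    meanings_by_rank[k] is the number of meanings of the word of rank k+1.
    A final incomplete bin is averaged over the words it actually contains.
    """
    return [
        mean(meanings_by_rank[start:start + lam])
        for start in range(0, len(meanings_by_rank), lam)
    ]


# Hypothetical toy data: number of meanings for the words of rank 1..7.
toy_meanings = [9, 7, 8, 4, 3, 5, 2]
toy_bins = mean_meanings_per_bin(toy_meanings, lam=3)
print(toy_bins)  # one average per bin: [9, 7, 8], [4, 3, 5], [2]
```

With λ = 100, ranks 1 to 100 fall in bin 1 and rank 101 opens bin 2, which is what `bin_index` encodes.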
Figures 1 and 2 show two examples of these plots taking as input English
words produced by adults and by children, respectively. Frequencies have been
obtained from CHILDES for both data sets. For estimating the values of the
parameters, the slope and the Y-intercept of the best regression line, we have
used two different methods: non-linear least squares and maximum likelihood
[28] on the original curve (in normal scale). The values shown at the top of each
figure correspond to the parameters of the fitting in log-log scale, that define
the regression line.
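To make the notions of slope and intercept in log-log scale concrete, the following sketch fits a straight line to log-transformed data by ordinary least squares. This is a simpler technique than the non-linear least squares and maximum likelihood fits used for the tables, and it is not the procedure behind the reported values; the function name and the synthetic data are assumptions made for illustration.

```python
import math


def fit_power_law_loglog(ranks, values):
    """Ordinary least squares of log(values) on log(ranks).

    Returns (slope, intercept) of the line in log-log scale. For a power
    law m = c * j**(-gamma), slope = -gamma and exp(intercept) = c.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, noise-free power law (hypothetical values): m = 16 * j**(-0.3).
bins = range(1, 51)
meanings = [16 * j ** -0.3 for j in bins]
slope, intercept = fit_power_law_loglog(bins, meanings)
print(slope, math.exp(intercept))
```

On noise-free data the fit recovers the exponent and the constant up to floating-point error; on real binned data the three methods can disagree, which is part of the difficulty of the parametric approach.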
Tables 1 and 2 summarize the analyses performed over the whole data sets,
that is, for the three languages (English, Dutch and Spanish) and for the two
roles (adults and children). Table 1 corresponds to groupings of 100 words (the
abscissa of a point represents 100 words and the value of its ordinate is the
average number of meanings per word in this hundred). Table 2 corresponds to
groupings of 500 words.
                           Least squares                    Maximum likelihood
Language / role    N      slope     intercept  R-squared   slope     intercept
English/Adults     162    −0.3007   15.4126    0.8745      −0.3206   16.5226
English/Children   99     −0.2925   15.0790    0.8651      −0.3085   15.8425
Dutch/Adults       52     −0.1615   3.7226     0.7426      −0.1688   3.7993
Dutch/Children     26     −0.1701   3.5488     0.7767      −0.1758   3.5928
Spanish/Adults     32     −0.1771   6.2247     0.6305      −0.1859   6.3535
Spanish/Children   35     −0.1731   6.0254     0.5909      −0.1844   6.1916
Table 1: Estimated values for the parameters of Zipf’s law of meaning distribution (slope and
intercept) by least squares and by maximum likelihood. N is the number of data points after
grouping the words into groups of 100.
Table 2: Estimated values for the parameters of Zipf’s law of meaning distribution (slope and
intercept) by least squares and by maximum likelihood. N is the number of data points
after grouping the words into groups of 500.
Figure 1: Average number of meanings per word (on the ordinate) for each successive set of λ
words on the abscissa: the 1st set has rank 1, the 2nd set has rank 2, etc. (λ is the number
of words per bin; binning is used to reduce noise). The true data points (blue circles) are
compared against the power-law models fitted using non-linear least squares (LS; red line)
and maximum likelihood (ML; green line).
Top: normal scale. Bottom: log-log scale. Left: λ = 100. Right: λ = 500.
Data set: English words produced by adults, using the CHILDES frequency.
Figure 2: Same format as in Fig. 1 for English words produced by children.
These results suggest that, using a methodology similar to that of [1], the analyzed
data sets confirm Zipf’s meaning-frequency law. However, in this paper we want
to present an alternative way of confirming this law (as well as Zipf’s law
of abbreviation) by means of a correlation analysis.
The dependence of the exponent on bin size is a challenge for research on
the law of meaning distribution. Paradoxically, when using the same bin size as
Zipf did (λ = 1000), we get the number of non-empty bins reduced to a few for
Spanish and Dutch (this is the reason why we exclude that bin size in the tables
above). We can reduce the bin size to maximize the number of languages used
but then the exponents deviate from those originally reported by Zipf for English.
One also needs to control for the kind of binning. Logarithmic binning is often
used when investigating power law relationships [29]. In addition, we are fitting
Eq. 3 and estimating γ assuming that a power law holds. The validity of such
assumption must be tested. The plots above suggest a deviation from the power
law for high ranks.
Finally, we need to consider the role of the data source in the emergence of
the power law. For instance, word frequency could be estimated from Wikipedia
entries (see Section 4 for such a possibility). The magnitude of the whole
challenge can easily be reduced with a non-parametric approach based on a
correlation analysis that involves neither any kind of binning nor the assumption
of an exact model (an equation) that may not be generally valid. So, after revisiting
the classical Zipfian approach to the meaning-frequency law [1, 3], the next sections
develop our proposal of a correlation analysis between frequency, meaning and
word length.
3. Materials
In this section we describe the different corpora and tools that have been
used in this paper. We first describe the WordNet database which has been
used to compute the polysemy measure. We also describe the tools used to
convert text to phonetic transcription and to perform syllabic segmentation:
CELEX database and SAGA. Finally, we describe the two different sources for
calculating the frequency of words that are analyzed in this paper: CHILDES
database and Wikipedia as reference corpora.
WordNet databases contain only four main syntactic categories: nouns,
verbs, adjectives and adverbs. Words of other syntactic categories are not
present in these databases (for instance, in English the article the or the
preposition for). However, some words that would normally be considered function
words have been included in our analyses because they can also be considered
content words (e.g., in English, the determiner a can also be a noun, as in
Letter A or Vitamin A).
            Princeton       Open Dutch      Multilingual Central
            WordNet [17]    Wordnet [31]    Repository [32]
Language    English         Dutch           Spanish
Synsets     117,659         30,177          38,512
Words       147,306         43,077          36,681
Core        100%            67%             71%
Max         75              15              34
Mean        1.546           1.404           1.585
Median      1               1               1
STD         1.913           0.972           1.797
Table 3: Statistics on used wordnets in Open Multilingual Wordnet to calculate the WordNet
polysemy. Synsets: Total number of synsets. Words: Total number of words. Core: the
percentage of synsets covered from "core" word senses in Princeton WordNet (approximately
the 5000 most frequently used word senses). Max : Maximum number of synsets per word.
Mean: Mean number of synsets per word. Median: Median of the number of synsets per
word. STD: Standard Deviation.
3.3. SAGA
SAGA is an automatic tool for phonetic transcription in Spanish, considering
its multiple dialectal variants. The phonetic description is made in terms of the
SAMPA alphabet. The tool is able to split the words into syllables and mark
the prosodic stress.
2 http://celex.mpi.nl/
3 http://www.phon.ucl.ac.uk/home/sampa/index.html
SAGA is able to perform different kinds of transcriptions depending on the
output settings (phonemes, semi-phonemes, syllables, semi-syllables). In addition,
even though Spanish has a mostly phonetic orthography, there are some exceptions
to the general phonetic rules, for example foreign words, archaic language or
dialectal variants. To deal with these cases, SAGA contains dictionaries that
can be modified to customize the phonetic transcriptions as desired.
We have used SAGA for Spanish conversations to perform both phonetic
transcription and syllabic segmentation. This application is distributed under
the terms of the GNU General Public License⁴.
3.4. Wikipedia
Wikipedia is a free online encyclopedia built collaboratively and hosted by
the non-profit Wikimedia Foundation. It exists in 295 languages, of which
284 are currently active, with the number of pages ranging
from more than 5 million articles (English) to a few hundred articles (Zulu,
Romani, Greenlandic...)⁵.
Wikipedia includes articles that span across many topics and it is updated
with constant contributions. Thus, it turns out to be a useful resource as a
reference corpus for getting word frequencies. Since we use two different sources
for estimating word frequencies, we can compare the results obtained with
a general corpus (Wikipedia) against those obtained with a simpler one (CHILDES).
The contents of each Wikipedia can be downloaded and processed to cal-
culate the frequency of every word that appears in Wikipedia [18]. We have
downloaded from Gregory Grefenstette’s webpage the lexicons with the frequencies
extracted from Wikipedia for English, Dutch and Spanish⁶.
80/gregory.grefenstette/
                     Types                       Tokens
Lang.     Role       analyzed  total    cover    analyzed    total       cover
English   Children   9,930     29,017   34%      1,596,726   2,308,675   69%
English   Adults     16,235    25,135   64%      3,008,148   4,584,213   65%
Dutch     Children   2,627     13,666   19%      313,556     628,622     50%
Dutch     Adults     5,273     19,037   28%      1,008,393   2,122,354   48%
Spanish   Children   3,520     27,167   13%      253,145     864,603     29%
Spanish   Adults     3,217     19,975   16%      288,660     997,901     29%
Table 4: Number of analyzed and total words (by types and tokens) obtained from CHILDES
conversations for each language and role. cover : percentage of words (by types or tokens)
that appear in the corresponding WordNet and Wikipedia lexicon.
4. Methods
4.2. Frequency
We have extracted word frequency values from two different sources. Thus,
for each word that appears in the selected conversations of CHILDES, we obtain:
• Wikipedia frequency, the frequency that the given word has in the
Wikipedia dataset.
• CHILDES frequency, the frequency that the given word has in CHILDES
according to the speaker’s role: children or adults (comprising mothers,
fathers and investigators). For example, for the word book two different
frequencies are given: the number of times this word appears uttered by
children and uttered by adults, respectively.
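The role-dependent CHILDES frequency can be illustrated with a minimal sketch. The toy utterances, the role labels and the function name below are hypothetical; real CHILDES transcripts require parsing that is not shown here.

```python
from collections import Counter

# Hypothetical toy transcript: (speaker role, tokens of one utterance).
utterances = [
    ("child", ["book", "my", "book"]),
    ("mother", ["read", "the", "book"]),
    ("investigator", ["a", "book"]),
]

# Adults comprise mothers, fathers and investigators.
ADULT_ROLES = {"mother", "father", "investigator"}


def childes_frequencies(utterances):
    """Count word tokens separately for children and for adults."""
    child_freq, adult_freq = Counter(), Counter()
    for role, tokens in utterances:
        if role == "child":
            child_freq.update(tokens)
        elif role in ADULT_ROLES:
            adult_freq.update(tokens)
    return child_freq, adult_freq


child_freq, adult_freq = childes_frequencies(utterances)
print(child_freq["book"], adult_freq["book"])
```

In this toy transcript the word book receives two frequencies, one for children and one for adults, as in the example given above.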
4.3. Polysemy
In linguistics, polysemous words are words that have more than one meaning.
Linguists distinguish between words with multiple meanings, where the
meanings are unrelated (called homonyms), and words with multiple senses,
where the senses are related. An example of the former is the word bank, which has
unrelated meanings such as sloping land or a financial institution,
whereas an example of the latter is honey, having related senses such as sweet
yellow liquid produced by bees or a beloved person.
We have calculated the polysemy of a word as the number of different mean-
ings provided by the WordNet database of its corresponding language. In Word-
Net, the different senses of a polysemous word are assigned to different synsets.
Then, we have considered the number of synsets a word belongs to as the num-
ber of meanings it has. This count is what we call WordNet polysemy. We
assume that the polysemy measure provided by WordNet does not differentiate
between polysemy classes mentioned above.
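Counting the synsets a word belongs to can be sketched as follows. The synset identifiers and their members are invented toy data, not entries of any actual WordNet, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy synsets: identifier -> set of member words.
toy_synsets = {
    "bank.n.01": {"bank"},                       # sloping land
    "bank.n.02": {"bank", "depository"},         # financial institution
    "honey.n.01": {"honey"},                     # sweet yellow liquid
    "honey.n.02": {"honey", "dear", "beloved"},  # a beloved person
    "book.n.01": {"book"},
}


def wordnet_polysemy(synsets):
    """Map each word to the number of synsets it belongs to."""
    counts = defaultdict(int)
    for members in synsets.values():
        for word in members:
            counts[word] += 1
    return dict(counts)


polysemy = wordnet_polysemy(toy_synsets)
print(polysemy["bank"], polysemy["honey"], polysemy["book"])
```

Note that the count treats the homonym bank and the polysemous honey alike, which is the sense in which the measure does not differentiate between the classes mentioned above.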
We are aware that using the WordNet polysemy measure in the CHILDES
corpora induces a bias. First, because we are assuming that the same meanings
that are used in written text are also used in spoken language. Second, because
we are using all possible meanings of a word. An alternative would have been
to manually tag all corpora (which is not currently an available option) or to use
an automatic tagger. But also in this latter case, the possibility of biases or
errors would be present.
On top of the correlation analysis we build another analysis where we com-
pare pairs of correlations that have a variable in common. Our goal is to deter-
mine if the unit of measurement has some effect on the strength of a linguistic
law. When we investigate the law of abbreviation (the correlation between the
frequency of a word and its length), we keep the source used to estimate fre-
quency while we vary the way length is measured: number of characters, number
of phonemes or number of syllables. In particular, we determine if the difference
between two dependent correlations sharing one variable is significant.
Suppose that we have two different length measures L1, L2 (which can be
the number of characters, phonemes or syllables) and one frequency measure
F (which can be the CHILDES or Wikipedia frequency). Suppose that the
correlation between F and L1 is r(F, L1) and the correlation between F and
L2 is r(F, L2), and that both correlations are negative. To determine whether one of
those correlations is significantly stronger than the other, we apply a two-tailed
Steiger’s test [37] (we use the r.test standardized R function). If the p-value
is below the significance level and |r(F, L1)| < |r(F, L2)|, we conclude that
L2 is more correlated to F than L1. If the p-value is below the significance
level and |r(F, L1)| > |r(F, L2)|, we conclude that L1 is more correlated to F
than L2. If the test is not significant, we cannot conclude that one correlation
is stronger than the other.
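The decision rule above can be sketched in Python. The statistic below is the t statistic (with n − 3 degrees of freedom) commonly given for comparing two dependent correlations that share one variable; that it matches the exact variant computed by r.test is an assumption on our part. The p-value uses a standard normal approximation, which is only reasonable for large samples, and all function names are hypothetical.

```python
import math


def dependent_corr_t(r12, r13, r23, n):
    """t statistic (df = n - 3) for H0: rho12 == rho13, variable 1 shared.

    r12, r13: correlations of the shared variable with variables 2 and 3.
    r23: correlation between variables 2 and 3. n: sample size.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)


def two_tailed_p_normal_approx(t):
    """Two-tailed p-value, approximating the t distribution by a normal."""
    return math.erfc(abs(t) / math.sqrt(2))


def compare_lengths(r_f_l1, r_f_l2, r_l1_l2, n, alpha=0.05):
    """Return 'L1', 'L2' or 'no significant difference'."""
    t = dependent_corr_t(r_f_l1, r_f_l2, r_l1_l2, n)
    if two_tailed_p_normal_approx(t) >= alpha:
        return "no significant difference"
    return "L1" if abs(r_f_l1) > abs(r_f_l2) else "L2"


# Hypothetical correlations: F vs L1, F vs L2, L1 vs L2, and sample size.
print(compare_lengths(-0.30, -0.20, 0.90, 5000))
print(compare_lengths(-0.30, -0.29, 0.50, 50))
```

The third argument, the correlation between the two length measures, is what makes the two correlations dependent and is required by the test.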
We note that the r.test standardized R function requires a single sample
size as a parameter. For this reason, before performing the test, in order to
compute r(F, L1) and r(F, L2), we have selected from the dataset those records
that have a valid value on all three variables (F, L1, L2). However, when we
compute a correlation test between two single variables F and L1 (or L2), we select all
those records that have a valid value in both F and L1 (L2), but not necessarily
in all three of them. Therefore, the values computed for r(F, L1) and r(F, L2)
in the Steiger’s test may differ somewhat from those in a single
correlation test because of this constraint (inherent to the Steiger’s test).
As the theory of Steiger’s test is defined on Pearson correlation, this higher
level analysis is performed only on Pearson correlations and Spearman corre-
lations (Spearman correlation is a Pearson correlation on rank transformation
of the random variables [34]). As far as we know, it is not warranted that the
Steiger’s tests can be applied on Kendall correlations.
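The statement that Spearman correlation is a Pearson correlation on ranks can be checked with a short sketch. The helper names and the toy data are hypothetical; ties receive the average of the ranks they occupy, which is one common convention.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as Pearson correlation on rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: frequency decreases monotonically as length grows.
freq = [1000, 400, 90, 30, 5]
length = [1, 2, 4, 5, 9]
print(pearson(freq, length), spearman(freq, length))
```

On these toy data the relationship is monotonic but not linear, so the Spearman value reaches −1 while the Pearson value does not.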
We use three different measures of correlations for the following reasons.
Pearson correlation is included for its popularity and simplicity. Spearman
and Kendall correlation are included for their capacity to capture non-linear
dependencies. Spearman is needed for the Steiger’s tests (see Section 5) and
Kendall correlation allows one to interpret the strength of a correlation based
on the number of ties (this will be shown in Section 5.4).
We assume a significance level of 0.05 in all tests.
We remark that the analysis for the CHILDES corpora has been segmented
into two roles: children and adults.
5. Results
We describe the results that have been obtained in three different languages
(English, Dutch and Spanish) from the analysis of the relationship between:
1. Frequency and polysemy (meaning-frequency law).
2. Frequency and word length (Zipf’s law of abbreviation).
We use two different measures for frequency (CHILDES frequency and Wikipedia
frequency), one measure for polysemy (WordNet polysemy) and three measures
for word length (number of characters, phonemes and syllables) as previously
explained in Section 4.
Here we present the results in two formats:
1. A table that contains the results of a correlation test between a frequency
measure versus the following measures: WordNet polysemy, number of
characters, number of phonemes and number of syllables. Each table
shows the results of three (Pearson, Spearman and Kendall) correlation
tests. For each language we have produced two tables: one table where the
frequency measure is the CHILDES frequency, and another table where
the frequency measure is the Wikipedia frequency.
2. A plot for each pair of variables that have been analyzed in the previous
tables along with a nonparametric regression and a probability density
function (see Section 4 for details).
We also present the results of the Steiger’s test that shed light on which of the
three different length measures exhibits a stronger correlation with frequency.
Finally, we include a subsection in which we examine the impact of ties in our
analyses.
5.2. Frequency versus Length
The results of the analysis of the two measures of frequency versus the three measures of
length are in Tables C.13 and C.14 for English, Tables C.15 and C.16 for Dutch,
and Tables C.17 and C.18 for Spanish in Appendix C. In this case, the
results show a more uniform behavior, since all correlations are significant and
negative both for children and for adults.
As for the nonparametric regression in Figures A.5, A.6, A.7, A.8, A.9 and
A.10, the results are consistent with the previous patterns: in
all cases, the regression shows a negative slope.
To sum up, the variables number of characters and number of phonemes show
a stronger correlation with respect to frequency than the number of syllables
when the Steiger’s test was significant. The results also reveal (in English) a
slightly stronger correlation between the number of characters and frequency
than between the number of phonemes and frequency.
Table 5: For English, the results of the Steiger’s test between a frequency measure (CHILDES
or Wikipedia) and the three different length measures of a word (Char. for the number of
characters, Phon. for the number of phonemes and Syllables for the number of syllables). Column
Role indicates the subject (children or adults), the Frequency column indicates the source of
the frequency measure (the shared variable), and the column Correlation indicates the type
of correlation. As for the remaining columns, we have the combination of two length mea-
sures. The content is >∗ when the first length variable is more correlated to the frequency
measure than the second variable and the Steiger’s test is significant. If the second measure
is more correlated than the first and the test is significant, then, the content is <∗ . If the test
is not significant, then the contents are > or < depending on what variable shows a higher
correlation to the frequency measure.
Table 6: Results of the Steiger’s test for Dutch. The format is the same as in Table 5.
Role       Frequency   Correlation   Char. vs Phon.   Phon. vs Syllables   Char. vs Syllables
Children   CHILDES     Pearson       >                >                    >
                       Spearman      >                >∗                   >∗
                       Kendall       >                <                    <
Children   Wikipedia   Pearson       >                >                    >
                       Spearman      >∗               <                    >
                       Kendall       >                <                    <
Adults     CHILDES     Pearson       <                >                    >
                       Spearman      <                >                    >
                       Kendall       <                <                    <
Adults     Wikipedia   Pearson       >                >                    >
                       Spearman      >∗               <                    >
                       Kendall       >                <                    <
Table 7: Results of the Steiger’s test for Spanish. The format is the same as in Table 5.
where n is the sample size and Nc and Nd are, respectively, the number of
concordant and discordant pairs in the sample. We have that
\[ N_c + N_d + N_t = \binom{n}{2} \]
where Nt is the number of tied pairs (pairs that are neither concordant nor
discordant). Applying
\[ N_d = \binom{n}{2} - N_c - N_t \]
one can rewrite Eq. 5 equivalently as
\[ \tau = -1 + \upsilon + \frac{2 N_c}{\binom{n}{2}} \tag{6} \]
where
\[ \upsilon = \frac{N_t}{\binom{n}{2}} \]
is the proportion of tied pairs (0 ≤ υ ≤ 1). The fact that Nc ≥ 0 allows one to
see that
\[ \tau \geq \upsilon - 1 \tag{7} \]
Put differently, the strongest negative Kendall τ that can be obtained is υ − 1.
The higher the number of ties, the weaker the maximum Kendall τ correlation
that can be obtained. Table 8 shows the percentage of ties, namely 100υ, of
frequency (CHILDES and Wikipedia) versus polysemy and the measures of
length for every language and role.
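The effect of ties on the strongest attainable negative τ can be verified numerically. The sketch assumes that Eq. 5 is τ = (Nc − Nd) divided by the number of pairs, which is the form consistent with Eq. 6; the function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb


def pair_counts(x, y):
    """Return (Nc, Nd, Nt): concordant, discordant and tied pairs."""
    nc = nd = nt = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            nc += 1
        elif sign < 0:
            nd += 1
        else:  # tied in x, in y, or in both
            nt += 1
    return nc, nd, nt


def kendall_tau(x, y):
    """Kendall tau as (Nc - Nd) over the total number of pairs."""
    nc, nd, _ = pair_counts(x, y)
    return (nc - nd) / comb(len(x), 2)


def tie_proportion(x, y):
    """Proportion of tied pairs, upsilon."""
    return pair_counts(x, y)[2] / comb(len(x), 2)


# Hypothetical toy data: frequencies and syllable counts with several ties.
freq = [50, 20, 20, 7, 3]
syllables = [1, 1, 2, 2, 3]
tau = kendall_tau(freq, syllables)
upsilon = tie_proportion(freq, syllables)
print(pair_counts(freq, syllables), tau, upsilon - 1)
```

In this toy sample no pair is concordant, so τ attains the bound υ − 1 of Eq. 7: three of the ten pairs are tied and τ cannot fall below −0.7, however consistently length decreases with frequency.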
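It is possible to derive lower bounds for the Spearman correlation ρ from
that of the Kendall τ correlation. Knowing that [38]
\[ \rho \geq \frac{1}{2}(3\tau - 1) \]
and recalling Eq. 7, one obtains
\[ \rho \geq \frac{3}{2}\upsilon - 2 \tag{8} \]
Similarly, knowing that [39]
\[ \rho \geq \frac{1}{2}(1 + \tau)^2 - 1 \]
one obtains
\[ \rho \geq \frac{\upsilon^2}{2} - 1 \tag{9} \]
Combining Eqs. 8 and 9, we finally get
\[ \rho \geq \max\left( \frac{3}{2}\upsilon - 2,\; \frac{\upsilon^2}{2} - 1 \right) \tag{10} \]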
The lower bounds of ρ above are likely to be looser than the original lower bound
of τ because the former are derived from the latter.
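The two lower bounds of Eqs. 8 and 9 are easy to evaluate for a given proportion of ties. The function name below is hypothetical, and the sketch simply takes the larger of the two bounds.

```python
def spearman_lower_bound(upsilon):
    """Lower bound on Spearman rho given the proportion of tied pairs.

    Takes the larger of the bounds in Eq. 8 and Eq. 9.
    """
    if not 0 <= upsilon <= 1:
        raise ValueError("upsilon must lie in [0, 1]")
    bound_eq8 = 1.5 * upsilon - 2      # from rho >= (3 * tau - 1) / 2
    bound_eq9 = upsilon**2 / 2 - 1     # from rho >= (1 + tau)**2 / 2 - 1
    return max(bound_eq8, bound_eq9)


for upsilon in (0.0, 0.3, 1.0):
    print(upsilon, spearman_lower_bound(upsilon))
```

Without ties (υ = 0) the bound is the trivial −1, and it rises towards −0.5 as the proportion of ties approaches 1.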
Table 8: Percentage of ties, 100υ, between frequency (CHILDES and Wikipedia) and Word-
Net polysemy, number of characters, number of phonemes and number of syllables for every
language and role.
6. Discussion and Future Work
In this paper, we have reviewed two linguistic laws that we owe to Zipf
[3, 1] and that have probably been overshadowed by the better-known Zipf’s law for
word frequencies [1]. Our analysis of the correlation between brevity (measured
in number of characters, phonemes and syllables) and polysemy (number of
synsets) versus word frequency was conducted with three correlation tests with
varying assumptions and robustness. Pearson correlation is a measure of linear-
ity while the Spearman correlation and Kendall correlation are able to capture
monotonic non-linear dependencies, as we have explained in Section 4. Our analysis
confirms that a positive correlation between the frequency of the words and
the number of synsets (consistent with the meaning-frequency law [3]) and a
negative correlation between the length of the words and their frequency
(consistent with the law of abbreviation [1]) arise under different definitions of the
variables. In all cases, we find correlations whose sign matches the expected
sign. In addition, all correlations are significant except the Pearson correlations
in the meaning-frequency law for Dutch and Spanish. This behaviour could be
due to (a) the lower capacity of the Pearson correlation to detect non-linear
dependencies compared to Spearman and Kendall correlations or, (b) the fact
that English exhibits a larger sample size than those two languages (Table 4).
In optimization models of the law of abbreviation, length is regarded as a
proxy for the energetic cost of the word [12, 40]. Then one expects that a better
measure of energetic cost would give a stronger correlation with frequency. Our
meta-analysis of the correlation between frequency and length has shown that
this correlation is slightly stronger when length is measured in characters
than in phonemes in most cases in English, and that characters and phonemes
yield stronger correlations than syllables in both English and Dutch. In Spanish, no clear
pattern arises, which is consistent with the classical view of Spanish as a more
transparent language than English [41] or Dutch [42]. Thus, the grapheme
to phoneme conversion is easier in Spanish than in English or Dutch [42]. The
degree of transparency of a language is defined as the extent to which a language
maintains one-to-one relations between units from different dimensions, e.g.,
phonemes versus graphemes. Transparency is tied to the notion of “simplicity”
in accounting for acquisition data (see [42] for a review). Transparency facilitates
reading: learning to read in a transparent orthography imposes fewer
constraints than learning to read in a more opaque writing system [43].
The fact that the correlation between frequency and number of syllables
tends to be weaker than correlations with other measures of length does not
imply that syllables are a worse measure of length or energetic cost. It could
be simply due to the fact that ties of length values are easier to obtain with
syllabic length, a fact that is expected to yield weaker correlations and higher
p-values as we have shown in Section 5.4.
Interestingly, we have not found any remarkable qualitative difference in the
analysis of correlations for adults versus children in the CHILDES database. This
suggests that both child speech and infant-directed or child-directed
speech (the so-called motherese) [19] show the same general statistical
biases in the use of more frequent words (that tend to be shorter and more
polysemous), and confirms the results of our previous test in [15], where adults were
split into three different roles (mother, father and investigator) instead of being
considered together in a single class as in the present paper.
Our analyses have shown the robustness of these Zipfian patterns from the
standpoint of a correlation analysis. Such robustness provides support to Zipf’s
hypothesis that these laws originate from abstract principles, e.g., functional
pressures (least effort as he would put it), that are consistent with modern for-
malizations as a compression principle for the law of abbreviation [12, 40] or a
biased random walk over the mapping words into meanings for the origins of
Zipf’s meaning-frequency law [13]. These theoretical approaches strongly suggest
that it might be possible to provide a coherent and parsimonious explanation for
the laws we have examined in this article and other laws such as Zipf’s law for
word frequencies [44] or Menzerath’s law [45]. The need for an abstract stand-
point is not only suggested by our analyses but also by patterning consistent
with these laws in human language in different conditions, e.g., sign language
[46], Kanji or Chinese characters [47, 48], and also in animal communication
[12, 49, 50].
Our work offers many possibilities for future research.
First, expanding the set of languages to include languages from other families
(i.e. not Indo-European languages) and the set of lexical databases employed
(e.g., [51]). As for the latter, the challenge is to find sources that allow
different languages to be dealt with homogeneously.
Second, considering different definitions of the same variables. For instance,
a limitation of our study is the fact that we define word length using discrete
units: number of syllables, number of phonemes or number of characters. Future
research could benefit from viewing length as a continuous variable, e.g. the
(average) duration in time of the word, because that may yield a better estimate
of the actual energetic cost of a word and also because our research on language
laws is to some extent limited by the information that is transcribed and the
writing conventions, which add some degree of arbitrariness. These limitations
have been overcome to a large extent in novel investigations of language laws in
pure voice [14].
Third, our work can be extended by including other linguistic variables such as
homophony, i.e. words with different origin (and a priori different meaning) that
have converged to the same phonological form. This extension would require
to trace the history of each word, under a dynamical and lexical perspective,
following the connection between brevity of words and homophony that Jes-
persen (1933) suggested in his seminal work [52] and that has been confirmed
more recently [53, 54] as a strong association between shortness of words, token
frequency and homophony [54]. If polysemy is taken to be a form of motivated
homophony, by which a word has two or more related meanings, but with prob-
ably different representation than homophones (for which different meanings
are "stored separately" [55]) in a semantic space, both phenomena will only be
distinguishable if we analyze and segment directly voice signals or, as said be-
fore, we do that under a diachronic approach. In any case, we must be aware of
the limitations of synchronic approaches when homophony or homography are
studied, the latter being indistinguishable from polysemy in the present article.
Fourth, a parametric study of these laws with the help of power-law like
functions [4, 24]. In Section 2, we have shown some challenges of that kind of
investigation. We believe that such an investigation is much needed and worthwhile,
although, as we have argued, it is not as simple as commonly believed.
Finally, future work should bridge the gap between our classic Zipfian perspective
and psycholinguistics. We suggest a couple of ways. First, an exploration
of the structural differences between common and rare words [56].
Second, an application of our methodology to other magnitudes such as con-
textual diversity, which may be more relevant than the mere word frequency in
some lexical tasks [57, 58].
Acknowledgments
The authors thank Pedro Delicado for his helpful comments. This research work
was supported by the grant SGR2014-890 (MACDA) and the recognition 2017SGR-856
(MACDA) from AGAUR (Generalitat de Catalunya), and also the grants TIN2014-
57226-P (APCOM), TIN2017-89244-R (MACDA) and TIN2016-77820-C3-3-R (GRAPH-
MED) from MINECO (Ministerio de Economia, Industria y Competitividad).
Appendix A. Figures
Figures A.3, A.4, A.5, A.6, A.7, A.8, A.9 and A.10 show the results obtained
in this article for English, Dutch and Spanish.
In all these plots, frequency is placed on the x-axis for at least three reasons.
First, in information theory (coding theory in particular), frequency is given
(it is assumed to be constant) while length is variable [59]. In the
problem of compression, one aims to minimize the average length of codes given
their probabilities (estimated as relative frequencies). Information theory predicts
the length of a code as a function of its frequency [59, p. 111] or its frequency
rank [40]. Therefore, when plotting length versus frequency, it makes sense
to put frequency on the x-axis. Second, word frequency is a fundamental variable in
psycholinguistics to predict language processing costs [60, 61]. The third reason
comes from the popular Zipf’s law for word frequencies. Although Zipf’s law
is usually plotted as frequency as a function of rank (following Eq. 1), a sister
(but not identical) plot, the so-called frequency spectrum, consists of showing
the number of distinct words as a function of frequency [62]. The latter plot
(frequency on the x-axis) is preferred by various authors for investigating the
distribution of word frequencies (e.g., [63, p. 298], [64, p. 3]).
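The difference between the two plots mentioned above can be illustrated with a minimal sketch. It is not the code used in this article: the function names and the toy sentence are assumptions introduced only for illustration. The first function returns the frequencies ordered by rank (the usual presentation of Zipf's law), and the second returns the frequency spectrum, i.e. the number of distinct words for each frequency value.

```python
from collections import Counter


def rank_frequency(tokens):
    """Frequencies sorted from most to least frequent (rank 1 comes first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def frequency_spectrum(tokens):
    """Map each frequency f to the number of distinct words occurring f times."""
    counts = Counter(tokens)            # word -> frequency
    spectrum = Counter(counts.values()) # frequency -> number of distinct words
    return dict(sorted(spectrum.items()))


# Toy input (hypothetical, not taken from CHILDES or Wikipedia).
tokens = "the cat saw the dog and the dog saw a bird".split()

print(rank_frequency(tokens))      # frequency as a function of rank
print(frequency_spectrum(tokens))  # number of distinct words per frequency
```

In the frequency spectrum, frequency plays the role of the independent variable, which is the same choice made for the x-axis of the figures in this appendix.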
Tables B.9, B.10, B.11 and B.12 show the CHILDES corpora used in this
article for English, Dutch and Spanish.
Figure A.3: WordNet polysemy versus CHILDES frequency in double logarithmic scale. The
color indicates the density of points: dark green is the highest possible density. A smoothed
curve (blue line) and a curve proportional to the probability density of values of the x-axis
(red dashed line) are also shown. Panels: Children and Adults (columns); English, Dutch
and Spanish (rows).
Figure A.4: WordNet polysemy versus Wikipedia frequency. The format is the same as in
Fig. A.3.
Figure A.5: Number of characters versus CHILDES frequency. The format is the same as in
Fig. A.3.
Figure A.6: Number of characters versus Wikipedia frequency. The format is the same as in
Fig. A.3.
Figure A.7: Number of phonemes versus CHILDES frequency. The format is the same as in
Fig. A.3.
Figure A.8: Number of phonemes versus Wikipedia frequency. The format is the same as in
Fig. A.3.
Figure A.9: Number of syllables versus CHILDES frequency. The format is the same as in
Fig. A.3.
Figure A.10: Number of syllables versus Wikipedia frequency. The format is the same as in
Fig. A.3.
Corpus           Age Range   # children  Comments
Lara [65]        1;9 – 3;3   1           Longitudinal case study
Manchester [66]  1;8 – 3;0   12          12 English children recorded weekly for the period of a year
Wells [67]       1;6 – 5;0   32          Large study of the language of British preschool children
                                         collected at random intervals
Corpus              Age Range               # children  Comments
BolKuiken [77]      1;7 – 3;7               47          Dutch normal controls
CLPF [78, 79]       1;0 – 2;11              12          PHONBANK, longitudinal study with 20,000 utterances
Groningen [80]      1;5 – 3;7               6           ’Iris’ was removed because she subsequently displayed
                                                        delay in language development due to hearing problems.
                                                        ’Iri’ (ending with no ’s’) was also excluded (this person
                                                        was very likely a misspelling of ’Iris’ because he/she was
                                                        in the same subdirectory as ’Iris’ and was the only target
                                                        child in the only file where it appeared).
Schaerlaekens [81]  1;8 – 2;10, 1;10 – 3;1  6
van Kampen [82]     Laura: 1;9 – 5;10       2
                    Sarah: 1;6.16 – 6;0
Corpus                Age Range   # children  Comments
Aguirre [83]          1;7-2;10    1
BecaCESNo             3;6-11;6    40
ColMex                6;0-7;0     30          Mexican Spanish, picture and procedural description
DiezItza [84]         3;0-3;11    20
FernAguado            3;0-4;0     50
Hess [85]             6;0-12;0    24
JacksonThal [86, 87]  0;10-3;0    202         Cross-sectional data from Queretaro, San Diego, and Santa Barbara
Linaza [88]           2;0-4;0     1
LlinasOjea            0;11-3;02   1           Longitudinal study of two children in Asturias, but only Yasmin
                                              is considered.
Marrero               1;8-8;0     3           Longitudinal study of Spanish children from the Canaries
Nieva                 1;8-2;3     1
Ornat [89]            1;7-4;0     1
Remedi                1;11-2;10   1
Romero [90]           2;0         1           Mexican Spanish
SerraSole             1;4-3;10    1
Shiro [91]            6;0-9;0     113         Narratives from Venezuelan children
Appendix C. Correlations
Tables C.13, C.14, C.15, C.16, C.17 and C.18 show the correlations be-
tween Frequency and Polysemy, and Frequency and Length measures for En-
glish, Dutch and Spanish.
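The three correlation statistics reported in these tables can be sketched as follows. This is an illustrative implementation, not the code used in this article: the function names and the toy data are assumptions, Kendall's τ is computed in its tie-corrected form (τ-b), and p-values are not computed. Pearson's r measures linear association, whereas Spearman's ρ and Kendall's τ depend only on ranks and therefore capture monotonic association.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the (average) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, computed by direct comparison of all pairs (O(n^2))."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Toy data (hypothetical): frequencies and lengths in characters of six words.
frequency = [120, 45, 45, 10, 3, 1]
length = [2, 3, 4, 5, 7, 9]

print(pearson(frequency, length))
print(spearman(frequency, length))
print(kendall_tau_b(frequency, length))
```

In practice, library routines such as `scipy.stats.pearsonr`, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` return both the statistic and a p-value, and they scale better than the quadratic loop used here for clarity.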
Tables D.19, D.20 and D.21 show the results of Steiger’s test between number
of characters and number of phonemes, number of characters and number of
syllables, and number of phonemes and number of syllables for English, Dutch
and Spanish.
                Pearson             Spearman            Kendall
Role       r       p-value     ρ       p-value     τ       p-value     size
CHILDES frequency vs WordNet polysemy
Children   0.059   < 10^-8     0.249   < 10^-140   0.182   < 10^-323   9,930
Adults     0.09    < 10^-323   0.264   < 10^-257   0.196   < 10^-323   16,235
CHILDES frequency vs Number of characters
Children   -0.122  < 10^-33    -0.27   < 10^-164   -0.202  < 10^-164   9,930
Adults     -0.115  < 10^-48    -0.324  < 10^-323   -0.243  < 10^-323   16,235
CHILDES frequency vs Number of phonemes
Children   -0.123  < 10^-29    -0.31   < 10^-188   -0.234  < 10^-185   8,547
Adults     -0.11   < 10^-38    -0.361  < 10^-323   -0.273  < 10^-323   14,146
CHILDES frequency vs Number of syllables
Children   -0.077  < 10^-11    -0.239  < 10^-110   -0.193  < 10^-108   8,547
Adults     -0.075  < 10^-18    -0.303  < 10^-298   -0.243  < 10^-290   14,146
Table C.13: Analysis of the correlations in English with CHILDES frequency. For every
correlation test, the value of the statistic (r, ρ and τ) and the corresponding p-value are shown.
Significant correlations are indicated in bold.
Table C.14: Analysis of the correlations in English with Wikipedia frequency. The format is
the same as in Table C.13.
                Pearson             Spearman            Kendall
Role       r       p-value     ρ       p-value     τ       p-value     size
CHILDES frequency vs WordNet polysemy
Children   0.017   0.376       0.19    < 10^-22    0.147   < 10^-323   2,627
Adults     0.013   0.342       0.19    < 10^-43    0.149   < 10^-323   5,273
CHILDES frequency vs Number of characters
Children   -0.13   < 10^-10    -0.187  < 10^-21    -0.138  < 10^-21    2,627
Adults     -0.113  < 10^-15    -0.347  < 10^-148   -0.259  < 10^-143   5,273
CHILDES frequency vs Number of phonemes
Children   -0.136  < 10^-11    -0.2    < 10^-23    -0.149  < 10^-23    2,575
Adults     -0.114  < 10^-15    -0.358  < 10^-155   -0.271  < 10^-151   5,179
CHILDES frequency vs Number of syllables
Children   -0.1    < 10^-6     -0.169  < 10^-17    -0.134  < 10^-17    2,575
Adults     -0.093  < 10^-10    -0.323  < 10^-125   -0.258  < 10^-120   5,179
Table C.15: Analysis of the correlations in Dutch with CHILDES frequency. The format is
the same as in Table C.13.
Table C.16: Analysis of the correlations in Dutch with Wikipedia frequency. The format is
the same as in Table C.13.
                Pearson             Spearman            Kendall
Role       r       p-value     ρ       p-value     τ       p-value     size
CHILDES frequency vs WordNet polysemy
Children   -0.01   0.564       0.162   < 10^-21    0.12    < 10^-323   3,520
Adults     0.011   0.523       0.152   < 10^-17    0.113   < 10^-323   3,217
CHILDES frequency vs Number of characters
Children   -0.125  < 10^-13    -0.367  < 10^-111   -0.276  < 10^-108   3,520
Adults     -0.136  < 10^-13    -0.362  < 10^-99    -0.272  < 10^-96    3,217
CHILDES frequency vs Number of phonemes
Children   -0.114  < 10^-10    -0.361  < 10^-107   -0.271  < 10^-104   3,512
Adults     -0.136  < 10^-14    -0.362  < 10^-99    -0.272  < 10^-96    3,207
CHILDES frequency vs Number of syllables
Children   -0.109  < 10^-10    -0.344  < 10^-97    -0.276  < 10^-94    3,520
Adults     -0.126  < 10^-12    -0.354  < 10^-94    -0.284  < 10^-91    3,217
Table C.17: Analysis of the correlations in Spanish with CHILDES frequency. The format is
the same as in Table C.13.
Table C.18: Analysis of the correlations in Spanish with Wikipedia frequency. The format is
the same as in Table C.13.
                    CHILDES Frequency                     Wikipedia Frequency
                Pearson           Spearman            Pearson           Spearman
Role       t       p-value    t       p-value    t       p-value    t       p-value    size
English
Children   -2.484  0.013      0.022   0.982      -3.693  < 10^-3    -4.572  < 10^-5    8,548
Adults     -3.562  < 10^-3    2.501   0.012      -4.553  < 10^-5    -8.837  < 10^-17   14,149
Dutch
Children   0.395   0.693      1.417   0.157      -0.472  0.637      -1.049  0.294      2,579
Adults     -0.251  0.802      1.624   0.104      -0.793  0.428      -6.491  < 10^-10   5,190
Spanish
Children   -0.861  0.389      -0.391  0.696      -1.203  0.229      -3.596  < 10^-3    3,516
Adults     0.098   0.922      0.571   0.568      -1.194  0.233      -3.43   < 10^-3    3,210
Table D.19: Steiger’s test between variables Number of characters and Number of phonemes.
The test indicates whether the difference between r1 and r2 is significant, where r1 is the
correlation between Frequency and Number of characters and r2 is the correlation between
Frequency and Number of phonemes. Frequency is the shared variable. The table has two parts:
one for the results when CHILDES frequency is the shared variable, and another when
Wikipedia frequency is the shared variable. Within each part, the t statistic of Steiger’s test
and the corresponding p-value are shown for Pearson and Spearman correlations.
Significant results are in bold.
Table D.20: Steiger’s test between variables Number of characters and Number of syllables.
The format is the same as in Table D.19.
Table D.21: Steiger’s test between variables Number of phonemes and Number of syllables.
The format is the same as in Table D.19.
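To make the kind of comparison reported in these tables concrete, the following sketch computes a t statistic for the difference between two dependent correlations that share one variable. It uses Williams' t, one of the statistics recommended in [37] for this case; whether this is exactly the variant used to produce the tables is an assumption. The function names, the example correlations and the sample size are hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples (the exact test uses a t distribution with n - 3 degrees of freedom).

```python
import math
from statistics import NormalDist


def williams_t(r12, r13, r23, n):
    """t statistic (n - 3 degrees of freedom) for H0: rho12 == rho13.

    Variable 1 is the shared variable (here, frequency); variables 2 and 3
    are the two measures being compared (e.g. characters and phonemes).
    """
    # Determinant of the 3x3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)


def approx_two_sided_p(t):
    """Large-sample (normal) approximation to the two-sided p-value."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical inputs: r(frequency, characters), r(frequency, phonemes),
# r(characters, phonemes) and the number of words.
t = williams_t(r12=-0.30, r13=-0.35, r23=0.90, n=1000)
print(t, approx_two_sided_p(t))
```

Note that the test requires the correlation between the two non-shared variables (r23 above) in addition to the two correlations being compared.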
References
[1] G. K. Zipf, Human behaviour and the principle of least effort, Addison-
Wesley, Cambridge (MA), USA, 1949.
[2] G. K. Zipf, The Psycho-Biology of Language: an Introduction to Dynamic
Psychology, MIT Press, Cambridge, MA, USA, 1968, originally published
in 1935 by Houghton Mifflin - Boston - MA - USA.
[3] G. K. Zipf, The Meaning-Frequency Relationship of Words, Journal of General
Psychology 33 (1945) 251–256.
[4] S. Naranan, V. K. Balasubrahmanyan, Models for power law relations in lin-
guistics and information science, J. Quantitative Linguistics 5 (1-2) (1998)
35–61.
[5] R. Ferrer-i-Cancho, Optimization models of natural communication, in:
Journal of Quantitative Linguistics, Vol. 25, 2014.
URL http://arxiv.org/abs/1412.2486
[6] F. Font-Clos, G. Boleda, A. Corral, A scaling law beyond Zipf’s law and
its relation to Heaps’ law, New Journal of Physics 15 (9) (2013) 093033.
URL http://stacks.iop.org/1367-2630/15/i=9/a=093033
[7] A. Corral, G. Boleda, R. Ferrer-i-Cancho, Zipf’s law for word frequencies:
Word forms versus lemmas in long texts, PLoS ONE 10 (7) (2015) 1–23.
doi:10.1371/journal.pone.0129031.
[8] R. Ferrer-i-Cancho, The meaning-frequency law in Zipfian optimization
models of communication, Glottometrics 35 (2016) 28–37.
[9] P. Grzybek, Contributions to the science of text and language: word length
studies and related issues, Vol. 31, Springer Science & Business Media,
2006.
[10] U. Strauss, P. Grzybek, G. Altmann, Word length and word frequency,
Springer, Dordrecht, 2007, pp. 277–294.
[11] B. Ilgen, B. Karaoglan, Investigation of Zipf’s “law-of-meaning” on Turkish
corpora, in: 22nd International Symposium on Computer and Information
Sciences (ISCIS 2007), 2007, pp. 1–6.
[12] R. Ferrer-i-Cancho, A. Hernández-Fernández, D. Lusseau, G. Agoramoor-
thy, M. J. Hsu, S. Semple, Compression as a universal principle of animal
behavior, Cognitive Science 37 (8) (2013) 1565–1578.
[13] R. Ferrer-i-Cancho, M. Vitevitch, The origins of Zipf’s meaning-frequency
law, Journal of the American Association for Information Science and Tech-
nology 69 (2018) 1369–1379.
[14] I. Gonzalez Torre, B. Luque, L. Lacasa, J. Luque, A. Hernandez-Fernandez,
Emergence of linguistic laws in human voice, Scientific reports 7 (43862)
(2017) 1–10. doi:10.1038/srep43862.
URL http://www.nature.com/articles/srep43862
[15] A. Hernández-Fernández, B. Casas, R. Ferrer-i-Cancho, J. Baixeries, Test-
ing the Robustness of Laws of Polysemy and Brevity Versus Frequency,
Springer International Publishing, Cham, 2016, pp. 19–29. doi:10.1007/
978-3-319-45925-7_2.
URL http://dx.doi.org/10.1007/978-3-319-45925-7_2
[16] B. MacWhinney, The CHILDES project: tools for analyzing talk, 3rd Edi-
tion, Vol. 2: the database, Lawrence Erlbaum Associates, Mahwah, NJ,
2000.
[26] F. Font-Clos, A. Corral, Log-log convexity of type-token growth in Zipf’s
systems, Phys. Rev. Lett. 114 (2015) 238701. doi:10.1103/PhysRevLett.
114.238701.
[27] R. Ferrer-i-Cancho, A. Hernández-Fernández, J. Baixeries, Ł. Dębowski,
J. Mačutek, When is Menzerath-Altmann law mathematically trivial? A
new approach, Statistical Applications in Genetics and Molecular Biology
13 (2014) 633–644.
[28] H. L. Seal, The maximum likelihood fitting of the discrete Pareto law,
Journal of the Institute of Actuaries (1886-1994) 78 (1) (1952) 115–121.
[29] A. Corral, Dependence of earthquake recurrence times and independence
of magnitudes on seismicity history, Tectonophysics 424 (2006) 177–193.
[30] F. Bond, R. Foster, Linking and Extending an Open Multilingual Wordnet,
Sofia, 2013.
[31] M. Postma, E. van Miltenburg, R. Segers, A. Schoen, P. Vossen, Open
Dutch WordNet, in: Proceedings of the Eighth Global Wordnet Conference,
Bucharest, Romania, 2016.
[32] A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual central repository
version 3.0: upgrading a very large lexical knowledge base, in: Proceedings
of the 6th Global WordNet Conference (GWC 2012), Matsue, 2012.
[33] R. H. Baayen, R. Piepenbrock, L. Gulikers, CELEX (1996).
URL http://celex.mpi.nl
[34] W. J. Conover, Practical nonparametric statistics, Wiley, New York, 1999,
3rd edition.
[35] J. D. Gibbons, S. Chakraborti, Nonparametric statistical inference, Chap-
man and Hall/CRC, Boca Raton, FL, 2010, 5th edition.
[36] P. Embrechts, A. McNeil, D. Straumann, Correlation and dependence in
risk management: properties and pitfalls, in: M. A. H. Dempster (Ed.),
Risk management: value at risk and beyond, Cambridge University Press,
Cambridge, 2002, pp. 176–223.
[37] J. H. Steiger, Tests for comparing elements of a correlation matrix, Psy-
chological Bulletin 87 (1980) 245–251.
[38] H. E. Daniels, Rank correlation and population models, Journal of the
Royal Statistical Society, Series B 12 (1950) 171–81.
[39] J. Durbin, A. Stuart, Inversions and rank correlations, Journal of the Royal
Statistical Society, Series B 13 (1951) 303–309.
[40] R. Ferrer-i-Cancho, C. Bentz, C. Seguin, Compression and the origins of
Zipf’s law of abbreviation.
URL http://arxiv.org/abs/1504.04884
[41] R. Nash, Comparing English and Spanish: Patterns in Phonology and Or-
thography, Prentice Hall, 1977.
URL https://books.google.es/books?id=ke86OwAACAAJ
[42] S. Leufkens, Transparency in language: a typological study, LOT, 2015.
URL http://hdl.handle.net/11245/1.439561
[43] E. Ijalba, L. K. Obler, First language grapheme-phoneme transparency
effects in adult second-language learning, Vol. 27, 2015, pp. 47–70.
[44] R. Ferrer-i-Cancho, Compression and the origins of Zipf’s law for word
frequencies, Complexity 21 (2016) 409–411. doi:10.1002/cplx.21820.
URL http://dx.doi.org/10.1002/cplx.21820
[45] M. L. Gustison, S. Semple, R. Ferrer-i-Cancho, T. Bergman, Gelada vocal
sequences follow Menzerath’s linguistic law, Proceedings of the National
Academy of Sciences USA 13 (2016) E2750–E2758. doi:10.1073/pnas.
1522072113.
[46] C. Börstell, T. Hörberg, R. Östling, Distribution and duration of signs and
parts of speech in Swedish Sign Language, Sign Language & Linguistics 19
(2016) 143–196.
[47] H. Sanada, Investigations in Japanese historical lexicology, Peust &
Gutschmidt Verlag, Göttingen, 2008.
[48] Y. Wang, X. Chen, Structural complexity of simplified Chinese characters,
in: A. Tuzzi, J. M. M. Benesová (Eds.), Recent Contributions to Quanti-
tative Linguistics, De Gruyter, 2015, pp. 229–239.
[49] R. Ferrer-i-Cancho, B. McCowan, A law of word meaning in dolphin whistle
types, Entropy 11 (4) (2009) 688–701. doi:10.3390/e11040688.
[50] C. Hobaiter, R. W. Byrne, The meanings of chimpanzee gestures, Current
Biology 24 (2014) 1596–1600.
[51] A. Duchon, M. Perea, N. Sebastián-Gallés, A. Martí, M. Carreiras, Es-
Pal: One-stop shopping for Spanish word properties, Behavior Research
Methods 45 (4) (2013) 1246–1258.
[52] O. Jespersen, Monosyllabism in English, in: Linguistica: Selected Writings
of Otto Jespersen, George Allen and Unwin LTD, London, UK, 2007, pp.
574–598.
[53] J. Ke, A cross-linguistic quantitative study of homophony, Journal of Quan-
titative Linguistics (2006) 129–159.
[54] G. Fenk-Oczlon, A. Fenk, Frequency effects on the emergence of poly-
semy and homophony, International Journal Information Technologies and
Knowledge 4 (2) (2010) 103–109.
[55] I. Dautriche, E. Chemla, What homophones say about words, PLOS ONE
11 (9) (2016) 1–19. doi:10.1371/journal.pone.0162176.
URL http://dx.doi.org/10.1371%2Fjournal.pone.0162176
[56] T. Landauer, L. Streeter, Structural differences between common and rare
words: Failure of equivalence assumptions for theories of word recognition,
Journal of Verbal Learning and Verbal Behavior 12 (2) (1973) 119 – 131.
doi:https://doi.org/10.1016/S0022-5371(73)80001-5.
URL http://www.sciencedirect.com/science/article/pii/S0022537173800015
[71] R. Brown, A first language: the early stages, Harvard University Press,
Cambridge, MA, 1973.
[72] S. Kuczaj, The acquisition of regular and irregular past tense forms, Journal
of Verbal Learning and Verbal Behavior 16 (1977) 589–600.
[73] CHILDES, American English Corpora. CHILDES. The Database Manu-
als. Available at http://childes.psy.cmu.edu/manuals/02englishusa.doc. Ac-
cessed 17 December 2012., TalkBank (2012).
[85] K. Hess Zimmermann, El desarrollo lingüístico en los años escolares: análi-
sis de narraciones infantiles, Ph.D. thesis, El Colegio de México, México
(2003).
[91] M. Shiro, Getting the story across: A discourse analysis approach to eval-
uative stance in Venezuelan children’s narratives, unpublished Doctoral
Dissertation. Harvard University (1997).