Polysemy and Brevity versus Frequency in Language

Bernardino Casas (a), Antoni Hernández-Fernández (b), Neus Català (a),
Ramon Ferrer-i-Cancho (a), Jaume Baixeries (a,∗)

(a) Complexity & Quantitative Linguistics Lab, Laboratory for Relational Algorithmics,
Complexity and Learning (LARCA), Departament de Ciències de la Computació,
Universitat Politècnica de Catalunya, Barcelona, Catalonia.
(b) Complexity & Quantitative Linguistics Lab, Laboratory for Relational Algorithmics,
Complexity and Learning (LARCA), Institut de Ciències de l’Educació, Universitat
Politècnica de Catalunya, Barcelona, Catalonia.
arXiv:1904.00812v1 [cs.CL] 27 Mar 2019

Abstract
The pioneering research of G. K. Zipf on the relationship between word fre-
quency and other word features led to the formulation of various linguistic laws.
The most popular is Zipf’s law for word frequencies. Here we focus on two
laws that have been studied less intensively: the meaning-frequency law, i.e.
the tendency of more frequent words to be more polysemous, and the law of
abbreviation, i.e. the tendency of more frequent words to be shorter. In a pre-
vious work, we tested the robustness of these Zipfian laws for English, roughly
measuring word length in number of characters and distinguishing adult from
child speech. In the present article, we extend our study to other languages
(Dutch and Spanish) and introduce two additional measures of length: syllabic
length and phonemic length. Our correlation analysis indicates that both the
meaning-frequency law and the law of abbreviation hold overall in all the ana-
lyzed languages.
Keywords: Zipf’s laws, polysemy, brevity, word frequency

1. Introduction

The linguist George Kingsley Zipf (1902-1950) is known for his investigations
on statistical laws of language [1, 2]. Perhaps the most popular one is Zipf’s
law for word frequencies [1, 3], which states that the frequency of the i-th
most frequent word in a text follows approximately

$f \propto i^{-\alpha}$ (1)

where f is the frequency of that word, i its rank or order and α is a constant
(α ≈ 1). Zipf’s law is an example of a power-law model for the relationship
between two variables [4]. Zipf’s law for word frequencies can be explained by
information theoretic models of communication [5] and is a robust pattern of
language that presents invariance with text length in a sufficiently long text
[6], and little sensitivity with respect to the linguistic units considered [7].

∗ Corresponding author
Preprint submitted to Elsevier, April 2, 2019

The focus of this paper is to test the robustness of two statistical laws in
linguistics that have been studied less intensively:
• Meaning-frequency law [3], the tendency of more frequent words to be
more polysemous. Zipf predicted that the number of meanings m of a
word should follow

$m \propto f^{\delta}$ (2)

where f is the frequency of a word and δ ≈ 0.5. Zipf never tested the
validity of this equation. He only derived it from Zipf’s law for word
frequencies (Eq. 1) and the law of meaning distribution [3] (see [8] for
a general derivation). The latter links m with i, namely, the frequency
rank. Zipf proposed and tested [1, 3]

$m \propto i^{-\gamma}$ (3)

where γ is a constant (γ ≈ 0.5).

• Zipf’s law of abbreviation [1, 9], the tendency of more frequent words
to be shorter or smaller. In his pioneering research, Zipf made this
observation but did not propose any mathematical formula to model that
dependency [1]. Power-law-like functions were suggested later by other
researchers [10].
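
The step from Eqs. 1 and 3 to Eq. 2 can be sketched in one line, under the idealizing assumption that both power laws hold exactly (see [8] for the general derivation):

```latex
f \propto i^{-\alpha} \;\Longrightarrow\; i \propto f^{-1/\alpha},
\qquad
m \propto i^{-\gamma} \propto \left(f^{-1/\alpha}\right)^{-\gamma} = f^{\gamma/\alpha},
\qquad\text{so}\quad \delta = \frac{\gamma}{\alpha} \approx \frac{0.5}{1} = 0.5 .
```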

These laws are examples of laws where the predictor is the word frequency
and the response is another word feature. These laws are regarded as universal
although the only evidence of their universality is that they hold in every
language where they have been tested so far [11]. Because of their generality,
these laws have triggered modeling efforts that attempt to explain their origin
and support their presumable universality with the help of abstract mechanisms
or communication principles [12, 13], or that explore those statistical patterns
directly from voice at levels below the phoneme scale [14]. Therefore,
investigating the experimental conditions under which these laws surface is
crucial.
In a previous work [15], we studied these linguistic laws in a large corpus
of child and adult language (CHILDES) [16]. We extracted semantic polysemy
values from WordNet [17] and the SemCor corpus (footnote 1), and defined word
length simply as the number of characters per word.
In the present article we extend our research in [15] by re-analyzing the
behaviour of those linguistic laws in children and adults separately, using the
transcripts in the CHILDES database, and exploring different definitions of
word frequency, word polysemy and word length. In order to test the statistical
validity of these linguistic laws, we also extend the set of languages to
include Dutch and Spanish in addition to English, which was the only language
analyzed in [15]. Concerning word frequency, we consider two major sources of
estimation: the CHILDES database [16] and Wikipedia [18]. Frequency estimates
are computed separately for children and adults (comprising mothers, fathers and
investigators). This division allows us to compare the linguistic production of
children and adults: motherese, also known as child-directed speech (CDS) or
infant-directed speech (IDS), has been studied for many years and is still a
hot topic of research [19].

1 http://multisemcor.fbk.eu/semcor.php
Concerning polysemy, we define the polysemy of a word as the number of dif-
ferent senses it has, based on the WordNet of its corresponding language (Prince-
ton WordNet, Open Dutch WordNet and Multilingual Central Repository for
Spanish). Hereafter, we will refer to this polysemy as WordNet polysemy. We
assume that the polysemy measure provided by WordNet does not distinguish
between different types of polysemy and we are aware of the inherent difficulties
of borrowing this conceptual framework (see [20, 21, 22, 23]). Concerning word
length, we consider three different units of measurement: a graphical unit (num-
ber of characters) and two phonetic units (number of phonemes and number of
syllables). Combining the sources of word frequency and polysemy values
with the variety of measurement units for word length, we arrive at eight
major ways of investigating the meaning-frequency law and the law of
abbreviation for each language under study.
In this paper, we investigate these laws qualitatively using measures of cor-
relation between two variables. Thus, the law of abbreviation is defined as a
significant negative correlation between the frequency of a word and its length
for any unit of measurement. The meaning-frequency law is defined as a sig-
nificant positive correlation between the frequency of a word and its WordNet
polysemy, a proxy for the number of meanings of a word. While our approach
to these laws is non-parametric (we are not assuming any particular model for
the relationship between two variables), traditional research on statistical laws
of language is mostly parametric, assuming some sort of power law or general-
izations of power laws [4, 10, 24].
We adopt these correlational definitions to remain agnostic about the actual
functional dependency between the variables, which is currently under revision
for various statistical laws of language [25, 26, 27]. We will show that a
significant correlation of the right sign is found in the majority of
combinations of conditions mentioned above, providing support for the hypothesis
that these laws originate from abstract mechanisms. We also propose some
hypotheses to explain why, in some exceptional cases, the analyzed variables do
not correlate significantly.
The remainder of the article is organized as follows. Section 2 revisits the
power law model that Zipf proposed for the law of meaning distribution [1, 3]
to illustrate the challenges of the parametric approach presented here. Then we
justify the convenience of a non-parametric approach (and correlation analysis)
that we have adopted in this article for statistical laws of language involving
word frequency. Sections 3 and 4 present, respectively, the materials (databases)
and the methods employed to analyze them. Section 5 presents the results of
our analysis of the meaning-frequency law and the law of abbreviation. Section
6 discusses our findings and suggests future work.

2. Revisiting Zipf’s law of meaning distribution

To check whether the equation that Zipf proposed for the law of meaning
distribution (Eq. 3) holds on modern corpora, we have reproduced the
computations described in [1] on a data set that we explore in depth in the
next sections.

When plotting the relationship between number of meanings per word (on
the ordinate) and frequency rank (abscissa), Zipf applied a linear binning
technique to reduce noise. When using bins of length λ, the 1st bin is formed by
the λ most frequent words, the 2nd bin is formed by the next λ most frequent
words, and so on. Formally, the j-th bin is defined by words whose rank i
satisfies

$\lambda (j - 1) + 1 \le i \le \lambda j$ (4)

Zipf plotted the relationship between the average number of meanings of the
j-th bin and j [1, p. 30] and fitted a power law (Eq. 3). We follow the same
method for a sample from the CHILDES database (see Section 3.5 for further
details).
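
The linear binning of Eq. 4 can be illustrated with a minimal Python sketch. The function name and the toy data are hypothetical. Dropping a final incomplete bin is an assumption, although it appears consistent with the values of N in Tables 1 and 2, which equal the integer part of the number of analyzed types (Table 4) divided by λ.

```python
from statistics import mean

def linear_bins(meanings_by_rank, bin_size):
    """Average number of meanings per bin of `bin_size` consecutive ranks.

    `meanings_by_rank[k]` is the number of meanings of the word with rank k+1
    (words sorted by decreasing frequency). Bin j (1-based) covers ranks
    bin_size*(j-1)+1 .. bin_size*j, as in Eq. 4. A final partial bin is
    dropped (an assumption, see the text above).
    """
    n_full = len(meanings_by_rank) // bin_size
    return [
        mean(meanings_by_rank[bin_size * (j - 1): bin_size * j])
        for j in range(1, n_full + 1)
    ]

# Toy data (hypothetical): meanings of the 7 most frequent words.
toy_meanings = [12, 9, 7, 7, 4, 3, 2]
print(linear_bins(toy_meanings, bin_size=2))
```

With a bin size of 2, the seventh word does not fill a bin and is left out of the averages.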
Figures 1 and 2 show two examples of these plots taking as input English
words produced by adults and by children, respectively. Frequencies have been
obtained from CHILDES for both data sets. For estimating the values of the
parameters, the slope and the Y-intercept of the best regression line, we have
used two different methods: non-linear least squares and maximum likelihood
[28] on the original curve (in normal scale). The values shown at the top of each
figure correspond to the parameters of the fitting in log-log scale, that define
the regression line.
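
To make the link between a power law and its regression line in log-log scale concrete, the sketch below fits log m = a + b log j by ordinary least squares. This is a simpler stand-in and not the non-linear least squares or maximum likelihood procedures used for the tables. The function name and the synthetic data are hypothetical.

```python
import math

def fit_loglog(avg_meanings):
    """OLS fit of log(m_j) = a + b*log(j) for bins j = 1..N.

    Returns (b, a): the slope b estimates -gamma in Eq. 3 and exp(a) is the
    multiplicative constant. This is NOT the non-linear least-squares or
    maximum-likelihood fit used in the paper; it is a simpler stand-in.
    Requires at least two bins and strictly positive averages.
    """
    xs = [math.log(j) for j in range(1, len(avg_meanings) + 1)]
    ys = [math.log(m) for m in avg_meanings]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Synthetic check (hypothetical data): an exact power law m_j = 8 * j**-0.5
synthetic = [8 * j ** -0.5 for j in range(1, 51)]
slope, intercept = fit_loglog(synthetic)
print(round(slope, 4), round(math.exp(intercept), 4))
```

On noisy data the log-log fit and a fit on the original scale generally give different estimates, which is one reason to report the method used.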
Tables 1 and 2 summarize the analyses performed over the whole data sets,
that is, for the three languages (English, Dutch and Spanish) and for the two
roles (adults and children). Table 1 corresponds to groupings of 100 words (the
abscissa of a point represents 100 words and the value of its ordinate is the
average number of meanings per word in this hundred). Table 2 corresponds to
groupings of 500 words.
                          Least squares                    Maximum likelihood
Language / role     N     slope     intercept  R-squared   slope     intercept
English/Adults      162   −0.3007   15.4126    0.8745      −0.3206   16.5226
English/Children    99    −0.2925   15.0790    0.8651      −0.3085   15.8425
Dutch/Adults        52    −0.1615   3.7226     0.7426      −0.1688   3.7993
Dutch/Children      26    −0.1701   3.5488     0.7767      −0.1758   3.5928
Spanish/Adults      32    −0.1771   6.2247     0.6305      −0.1859   6.3535
Spanish/Children    35    −0.1731   6.0254     0.5909      −0.1844   6.1916

Table 1: Estimated values for the parameters of Zipf’s law of meaning distribution (slope and
intercept) by Least squares and by Maximum Likelihood. N is the number of data points
after grouping the words into groups of 100.

                          Least squares                    Maximum likelihood
Language / role     N     slope     intercept  R-squared   slope     intercept
English/Adults      32    −0.3751   11.5483    0.9812      −0.3762   11.5731
English/Children    19    −0.3756   11.3958    0.9790      −0.3772   11.4271
Dutch/Adults        10    −0.2241   3.2234     0.9741      −0.2261   3.2321
Dutch/Children      5     −0.2528   3.0483     0.9934      −0.2538   3.0510
Spanish/Adults      6     −0.2645   5.3619     0.9238      −0.2627   5.3526
Spanish/Children    7     −0.2710   5.2888     0.9478      −0.2720   5.2942

Table 2: Estimated values for the parameters of Zipf’s law of meaning distribution (slope and
intercept) by Least squares and by Maximum Likelihood. N is the number of data points
after grouping the words into groups of 500.

As shown in Tables 1 and 2, as groupings become larger, the slopes of the
regression lines become closer to the expected value of −0.5 [1, 11]. We want to
note here that, especially in the case of English, groupings of 1000 words (as in
Zipf’s work [1]) yield slopes even closer to −0.5.

Figure 1: Average number of meanings per word (on the ordinate) for each successive set of λ
words on the abscissa: the 1st set has rank 1, the 2nd set has rank 2, and so on (λ is the number
of words per bin used to reduce noise). The true data points (blue circles) are compared against
the power-law models fitted using non-linear least squares (LS; red line) and maximum likelihood
(ML; green line).
Top: normal scale. Bottom: log-log scale. Left: λ = 100. Right: λ = 500.
Data set: English words produced by adults, using the CHILDES frequency.

Figure 2: Same format as in Fig. 1 for English words produced by children.

These results suggest that, using a methodology similar to that in [1], the
analyzed data sets confirm Zipf’s meaning-frequency law. However, in this paper
we want to present an alternative way of confirming this law (as well as Zipf’s
law of abbreviation) by means of a correlation analysis.
The dependence of the exponent on bin size is a challenge for research on
the law of meaning distribution. Paradoxically, when using the same bin size as
Zipf did (λ = 1000), the number of non-empty bins is reduced to a few for
Spanish and Dutch (this is the reason why we exclude that bin size in the tables
above). We can reduce the bin size to maximize the number of languages used
but then the exponents deviate from the one originally reported by Zipf for
English. One also needs to control for the kind of binning. Logarithmic binning
is often used when investigating power law relationships [29]. In addition, we
are fitting Eq. 3 and estimating γ assuming that a power law holds. The validity
of such an assumption must be tested. The plots above suggest a deviation from
the power law for high ranks.
Finally, we need to consider the role of the data source in the emergence of
the power law. For instance, word frequency could be estimated from Wikipedia
entries (see Section 4 for such a possibility). The magnitude of the whole
challenge can easily be reduced with a non-parametric approach based on a
correlation analysis that involves neither any kind of binning nor the
assumption of an exact model (an equation) that may not be generally valid. So,
after revisiting the classical Zipfian approach to the meaning-frequency law
[1, 3], the next sections develop our proposal of a correlation analysis between
frequency, meaning and word length.

3. Materials
In this section we describe the different corpora and tools that have been
used in this paper. We first describe the WordNet database which has been
used to compute the polysemy measure. We also describe the tools used to
convert text to phonetic transcription and to perform syllabic segmentation:
the CELEX database and SAGA. Finally, we describe the two different sources for
calculating the frequency of words that are analyzed in this paper: the CHILDES
database and Wikipedia as reference corpora.

3.1. Open Multilingual Wordnet


The WordNet database can be seen as a set of senses (also called synsets) and
relationships among them, where a synset is the representation of an abstract
meaning, and it is defined as a set of words having (at least) the meaning that the
synset stands for. Each word-synset pair is also related to a syntactic category.
For instance, the pair formed by the word "book" and the synset "a written work
or composition that has been published" is related to the category noun, whereas
the pair formed by "book" and the synset "to arrange for and reserve (something
for someone else) in advance" is related to the category verb.
Open Multilingual WordNet [30] gives access to open wordnets in several
languages. In this paper we use the Princeton WordNet for English, the Open
Dutch WordNet for Dutch and the Multilingual Central Repository for Spanish.
Since these wordnets have been built by different projects, they vary
notably in size and coverage. Table 3 shows some statistics for every WordNet
used in this paper.

WordNet databases contain only four main syntactic categories: nouns,
verbs, adjectives and adverbs. Words of other syntactic categories are not
present in these databases (for instance, in English, the article "the" or the
preposition "for"). However, some words that would normally be considered
function words have been included in our analyses because they can also be
content words (e.g. in English, the determiner "a" can also be a noun, as in
"Letter A" or "Vitamin A").

                Princeton       Open Dutch      Multilingual Central
WordNet         WordNet [17]    Wordnet [31]    Repository [32]
Language        English         Dutch           Spanish
Synsets         117,659         30,177          38,512
Words           147,306         43,077          36,681
Core            100%            67%             71%
Max             75              15              34
Mean            1.546           1.404           1.585
Median          1               1               1
STD             1.913           0.972           1.797

Table 3: Statistics on the wordnets from Open Multilingual Wordnet used to calculate the
WordNet polysemy. Synsets: Total number of synsets. Words: Total number of words. Core: the
percentage of synsets covered from "core" word senses in Princeton WordNet (approximately
the 5000 most frequently used word senses). Max: Maximum number of synsets per word.
Mean: Mean number of synsets per word. Median: Median of the number of synsets per
word. STD: Standard deviation of the number of synsets per word.

3.2. CELEX database


CELEX [33] is the Dutch Centre for Lexical Information at the Max Planck
Institute for Psycholinguistics. The CELEX database comprises three different
searchable lexical databases: Dutch, English and German. The lexical data
contained in each database is divided into five categories: orthography,
phonology, morphology, syntax (word class) and word frequency.
We use the CELEX database to obtain the phonetic transcription and syllabic
segmentation of Dutch and English speech transcripts. Using WebCelex (footnote
2) we have created two lexicons, one for English and another for Dutch, by
selecting from the English Wordforms and Dutch Wordforms databases the following
items: Word (from the orthography category), and PhonSAM and PhonSylSAM
(from the phonology category). The phonetic transcription uses the SAMPA
charset (Speech Assessment Methods Phonetic Alphabet, footnote 3).

3.3. SAGA
SAGA is an automatic tool for phonetic transcription in Spanish, considering
its multiple dialectal variants. The phonetic description is made in terms of the
SAMPA alphabet. The tool is able to split the words into syllables and mark
the prosodic stress.

2 http://celex.mpi.nl/
3 http://www.phon.ucl.ac.uk/home/sampa/index.html

SAGA is able to perform different kinds of transcriptions depending on the
output settings (phonemes, semi-phonemes, syllables, semi-syllables). In
addition, even though Spanish has a mostly phonetic orthography, there are some
exceptions to the general phonetic rules, for example foreign words, archaic
language or dialectal variants. To deal with these cases, SAGA contains
dictionaries that can be modified to customize the phonetic transcriptions as
desired.
We have used SAGA on the Spanish conversations to perform both phonetic
transcription and syllabic segmentation. This application is distributed under
the terms of the GNU General Public License (footnote 4).

3.4. Wikipedia
Wikipedia is a free online encyclopedia built collaboratively and hosted by
the non-profit Wikimedia Foundation. It exists in 295 languages, of which
284 are currently active, with the number of pages ranging
from more than 5 million articles (English) to a few hundred articles (Zulu,
Romani, Greenlandic, ...) (footnote 5).
Wikipedia includes articles that span across many topics and it is updated
with constant contributions. Thus, it turns out to be a useful resource as a
reference corpus for getting word frequencies. Since we use two different sources
for estimating word frequencies, we can compare the results obtained by using
a general corpus (Wikipedia) with the use of a simpler one (CHILDES).
The contents of each Wikipedia can be downloaded and processed to calculate
the frequency of every word that appears in Wikipedia [18]. We have
downloaded from Gregory Grefenstette’s webpage the lexicons with the
frequencies extracted from Wikipedia for English, Dutch and Spanish (footnote 6).

3.5. CHILDES database


The CHILDES database [16] is a set of corpora of transcripts of conversations
between children and adults. The corpora included in this database are in
different languages.
In this paper we have studied the conversations of 60 children in English,
73 children in Dutch and 490 children in Spanish. Detailed information on
these conversations can be found in Table B.9 for British English, Table B.10
for American English, Table B.11 for Dutch and Table B.12 for Spanish in
Appendix B. For each spoken word of these conversations the following values are
given: CHILDES frequency (number of times this word appears in CHILDES, counted
separately for children and adults), Wikipedia frequency (number of times this
word appears in Wikipedia), number of synsets (according to the corresponding
WordNet), number of characters, number of phonemes and number of syllables.
Table 4 shows the number of different types and tokens obtained from the
selected corpora. The number of analyzed tokens and types is smaller than the
number of tokens and types initially extracted from the conversations, because
only those words that are present in the corresponding WordNet have been
retained.

4 Freely available at http://www.talp.upc.edu/index.php/technology/tools/signal-processing-tools/81-saga
5 https://en.wikipedia.org/wiki/List_of_Wikipedias
6 http://web.archive.org/web/20170205022929/http://pages.saclay.inria.fr:80/gregory.grefenstette/

                     Types                        Tokens
Lang.     Role       analyzed  total    cover     analyzed    total       cover
English   Children   9,930     29,017   34%       1,596,726   2,308,675   69%
English   Adults     16,235    25,135   64%       3,008,148   4,584,213   65%
Dutch     Children   2,627     13,666   19%       313,556     628,622     50%
Dutch     Adults     5,273     19,037   28%       1,008,393   2,122,354   48%
Spanish   Children   3,520     27,167   13%       253,145     864,603     29%
Spanish   Adults     3,217     19,975   16%       288,660     997,901     29%

Table 4: Number of analyzed and total words (by types and tokens) obtained from CHILDES
conversations for each language and role. cover : percentage of words (by types or tokens)
that appear in the corresponding WordNet and Wikipedia lexicon.

4. Methods

We now describe the different numerical and computational methods that
have been used in this paper.

4.1. Word Length Computation


There are several types of units to measure word length among which the
most used are graphic and phonetic. Graphical units are usually characters or
letters. Phonetic units are phonemes, syllables or sounds, and although they
are highly correlated with graphical units there can be differences depending on
the language.
Here, we have considered three different units of measurement: a graphical
unit (number of characters) and two phonetic units (number of phonemes and
number of syllables). When counting characters, we have ignored numbers, blanks,
separation characters and the like. The
resources used to obtain orthographic and phonetic information are described
in Section 3.
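
As an illustration of the graphical unit, the following Python sketch counts only letters. Treating every character for which str.isalpha() is true as a letter is our assumption for what "numbers, blanks, separation characters and the like" excludes. The function name and the sample words are hypothetical.

```python
def length_in_characters(word):
    """Number of letters in `word`, skipping digits, blanks and separators.

    Assumption: 'letters' are the characters for which str.isalpha() is true,
    so a precomposed accented letter such as 'é' counts as one character.
    """
    return sum(1 for ch in word if ch.isalpha())

for w in ["book", "ice-cream", "t-shirt 2"]:
    print(w, length_in_characters(w))
```

Phonemic and syllabic lengths would instead be read from the transcriptions provided by the resources in Section 3, whose exact formats are not reproduced here.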

4.2. Frequency
We have extracted word frequency values from two different sources. Thus,
for each word that appears in the selected conversations of CHILDES, we obtain:

• Wikipedia frequency, the frequency that the given word has in the
Wikipedia dataset.
• CHILDES frequency, the frequency that the given word has in CHILDES
according to the speaker’s role: children or adults (comprising mothers,
fathers and investigators). For example, for the word book two different
frequencies are given: the number of times this word appears uttered by
children and uttered by adults, respectively.

4.3. Polysemy
In linguistics, polysemous words are words that have more than one meaning.
Linguists distinguish between words with multiple meanings, where the
meanings are unrelated (called homonyms), and words with multiple senses,
where the senses are related. An example of the former is the word "bank", having
unrelated meanings such as "a sloping land" or "a financial institution",
whereas an example of the latter is "honey", having related senses such as "sweet
yellow liquid produced by bees" or "a beloved person".
We have calculated the polysemy of a word as the number of different mean-
ings provided by the WordNet database of its corresponding language. In Word-
Net, the different senses of a polysemous word are assigned to different synsets.
Then, we have considered the number of synsets a word belongs to as the num-
ber of meanings it has. This count is what we call WordNet polysemy. We
assume that the polysemy measure provided by WordNet does not differentiate
between the polysemy classes mentioned above.
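
The counting rule behind WordNet polysemy can be sketched on a toy wordnet. The synset identifiers and their contents below are hypothetical and are not taken from any real WordNet.

```python
from collections import defaultdict

# Toy wordnet (hypothetical identifiers and contents, not real WordNet data):
# each synset is a set of words sharing one meaning.
toy_synsets = {
    "book.n.01": {"book", "volume"},
    "book.v.01": {"book", "reserve"},
    "bank.n.01": {"bank"},
    "bank.n.02": {"bank", "depository"},
    "honey.n.01": {"honey"},
}

def wordnet_polysemy(synsets):
    """Map each word to the number of synsets it belongs to."""
    counts = defaultdict(int)
    for members in synsets.values():
        for word in members:
            counts[word] += 1
    return dict(counts)

polysemy = wordnet_polysemy(toy_synsets)
print(polysemy["book"], polysemy["bank"], polysemy["volume"])
```

Note that the count treats the noun and verb synsets of "book" alike, in line with the fact that the measure does not separate types of polysemy.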
We are aware that using the WordNet polysemy measure in the CHILDES
corpora induces a bias. First, because we are assuming that the same meanings
that are used in written text are also used in spoken language. Second, because
we are using all possible meanings of a word. An alternative would have been
to tag manually all corpora (which is currently an unavailable option) or to use
an automatic tagger. But also in this latter case, the possibility of biases or
errors would be present.

4.4. Statistical Methods


In the present work we have studied the relationship between (1) frequency
and polysemy and (2) frequency and length. For the three variables (frequency,
polysemy and word length) we have used different sources, which yields many
combinations to evaluate.
For each language selected in the CHILDES corpora, we have calculated
correlations between:
1. CHILDES frequency and WordNet polysemy
2. CHILDES frequency and number of characters
3. CHILDES frequency and number of phonemes
4. CHILDES frequency and number of syllables
5. Wikipedia frequency and WordNet polysemy
6. Wikipedia frequency and number of characters
7. Wikipedia frequency and number of phonemes
8. Wikipedia frequency and number of syllables
For each combination of two variables, we compute:
1. Correlation test. Pearson, Spearman and Kendall two-sided correlation
tests [34], using the cor.test R function. The traditional
Pearson correlation is a measure of linear dependency, while Spearman and
Kendall correlations are able to capture non-linear dependencies [35, 36].
2. Plot, in logarithmic scale, that also shows the density of points.
3. Nonparametric regression, to obtain a smoothed curve for the cloud
of points defined by the two variables. The smoothed curve is calculated
using the locpoly R function and added to the previous
plot.
4. Probability density function using local polynomials. The
density function is calculated using the locpoly R function
and added to the previous plot.
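
For readers who want to see what the three coefficients measure, here is a minimal pure-Python sketch. It is not the cor.test implementation and it omits the significance tests. Using tau-b as the Kendall variant is our assumption, and the function names and toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation (undefined if either variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither total
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Toy data (hypothetical): word frequencies and lengths in characters.
freq = [100, 50, 20, 10, 5, 1]
length = [1, 3, 3, 4, 6, 9]
print(pearson(freq, length), spearman(freq, length), kendall_tau_b(freq, length))
```

In the toy data length never increases with frequency, so the two rank-based coefficients are close to −1 while the Pearson coefficient is weaker because the relationship is not linear.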

On top of the correlation analysis we build another analysis where we com-
pare pairs of correlations that have a variable in common. Our goal is to deter-
mine if the unit of measurement has some effect on the strength of a linguistic
law. When we investigate the law of abbreviation (the correlation between the
frequency of a word and its length), we keep the source used to estimate fre-
quency while we vary the way length is measured: number of characters, number
of phonemes or number of syllables. In particular, we determine if the difference
between two dependent correlations sharing one variable is significant.
Suppose that we have two different length measures L1, L2 (which can be
the number of characters, phonemes or syllables) and one frequency measure
F (which can be the CHILDES or Wikipedia frequency). Suppose that the
correlation between F and L1 is r(F, L1) and the correlation between F and
L2 is r(F, L2), and that both correlations are negative. To determine if one of
those correlations is significantly stronger than the other, we apply a two-tailed
Steiger’s test [37] (we use the r.test R function). If the p-value
is below the significance level and |r(F, L1)| < |r(F, L2)|, we can conclude that
L2 is more correlated with F than L1. If the p-value is below the significance
level and |r(F, L1)| > |r(F, L2)|, we can conclude that L1 is more correlated
with F than L2. Otherwise, if the test is not significant, we cannot conclude
that one correlation is stronger than the other.
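
The decision rule above can be sketched with the Z statistic for two dependent correlations sharing one variable, in the form commonly attributed to Steiger (1980). The R function used for the analyses may compute a different variant of the statistic, so the sketch is illustrative only. The function name and the numbers in the example are hypothetical.

```python
import math

def steiger_z(r_f_l1, r_f_l2, r_l1_l2, n):
    """Two-tailed Z test for two dependent correlations that share one
    variable (here F), given the correlation between the two non-shared
    variables (L1, L2) and a single sample size n.

    Returns (z, p_value).
    """
    z1 = math.atanh(r_f_l1)          # Fisher z-transform
    z2 = math.atanh(r_f_l2)
    r_bar = (r_f_l1 + r_f_l2) / 2    # pooled correlation
    # Estimated covariance between the two z-transformed correlations.
    cov = ((r_l1_l2 * (1 - 2 * r_bar ** 2)
            - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r_l1_l2 ** 2))
           / (1 - r_bar ** 2) ** 2)
    z = (z1 - z2) * math.sqrt(n - 3) / math.sqrt(2 - 2 * cov)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p_value

# Hypothetical correlations: F vs characters, F vs phonemes, and
# characters vs phonemes, all computed on the same 1000 records.
z, p = steiger_z(-0.30, -0.40, 0.90, 1000)
if p < 0.05:
    print("L2 is more correlated with F" if abs(-0.30) < abs(-0.40)
          else "L1 is more correlated with F")
else:
    print("no significant difference")
```

The three correlations passed to the function must come from the same set of records, which is why a single sample size is required.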
We note that the r.test R function requires a single sample
size as a parameter. For this reason, before performing the test, in order to
compute r(F, L1) and r(F, L2), we have selected from the dataset those records
that have a valid value on all three variables (F, L1, L2). However, when we
compute a correlation test between two single variables, F and L1 (or L2), we
select all those records that have a valid value in both F and L1 (L2), but not
necessarily in all three of them. Therefore, the values of r(F, L1) and
r(F, L2) used in Steiger’s test may differ somewhat from those obtained in a
single correlation test because of this constraint (inherent to Steiger’s test).
As the theory of Steiger’s test is defined on Pearson correlation, this higher
level analysis is performed only on Pearson correlations and Spearman corre-
lations (Spearman correlation is a Pearson correlation on rank transformation
of the random variables [34]). As far as we know, there is no guarantee that
Steiger’s test can be applied to Kendall correlations.
We use three different measures of correlations for the following reasons.
Pearson correlation is included for its popularity and simplicity. Spearman
and Kendall correlation are included for their capacity to capture non-linear
dependencies. Spearman is needed for the Steiger’s tests (see Section 5) and
Kendall correlation allows one to interpret the strength of a correlation based
on the number of ties (this will be shown in Section 5.4).
We assume a significance level of 0.05 in all tests.
We remark that the analysis for the CHILDES corpora has been segmented
into two roles: children and adults.

5. Results

We describe the results that have been obtained in three different languages
(English, Dutch and Spanish) from the analysis of the relationship between:
1. Frequency and polysemy (meaning-frequency law).
2. Frequency and word length (Zipf’s law of abbreviation).

We use two different measures for frequency (CHILDES frequency and Wikipedia
frequency), one measure for polysemy (WordNet polysemy) and three measures
for word length (number of characters, phonemes and syllables) as previously
explained in Section 4.
Here we present the results in two formats:
1. A table that contains the results of a correlation test between a frequency
measure versus the following measures: WordNet polysemy, number of
characters, number of phonemes and number of syllables. Each table
shows the results of three (Pearson, Spearman and Kendall) correlation
tests. For each language we have produced two tables: one table where the
frequency measure is the CHILDES frequency, and another table where
the frequency measure is the Wikipedia frequency.
2. A plot for each pair of variables that have been analyzed in the previous
tables along with a nonparametric regression and a probability density
function (see Section 4 for details).

We also present the results of the Steiger’s test that shed light on which of the
three different length measures exhibits a stronger correlation with frequency.
Finally, we include a subsection in which we examine the impact of ties in our
analyses.

5.1. Frequency versus Polysemy


We now describe the results that analyze the relationship between frequency
and polysemy. We remind the reader that we compare the two sources of fre-
quency (CHILDES and Wikipedia) with WordNet polysemy.
In English, all correlations are significant and positive (Tables C.13 and C.14
in the Appendix C). In Dutch (Tables C.15 and C.16 in the Appendix C) and
Spanish (Tables C.17 and C.18 in the Appendix C), Spearman and Kendall
correlations are significant and positive and the Pearson correlations are non-
significant with a correlation value close to zero.
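
The gap between a weak Pearson correlation and a strong rank correlation can be
illustrated with a minimal sketch. It is not the analysis code of this article:
the data are hypothetical toy values, and the helper names (pearson,
average_ranks, spearman) are our own. The sketch builds a quantity that grows
monotonically with frequency but saturates, so that the dependency is perfectly
monotonic and yet far from linear.

```python
import math

def pearson(x, y):
    """Pearson r: strength of the linear association between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data (not taken from the article): a variable that
# increases with frequency but saturates quickly.
frequency = [1, 2, 4, 8, 16, 32]
saturating = [1 - 1 / f for f in frequency]

r = pearson(frequency, saturating)
rho = spearman(frequency, saturating)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

In this toy sample the rank correlation is maximal while the Pearson
correlation is clearly weaker, which is the qualitative pattern to keep in mind
when a Pearson test is non-significant but the Spearman and Kendall tests are
significant.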
A visual inspection of the graphics in Figure A.3 in the Appendix A confirms
the patterns that we have examined previously. When CHILDES frequency is
analyzed, we see in English that, in all cases, the nonparametric regressions show
a positive slope in the area where most of the points are concentrated. Where
this concentration decreases, so does the nonparametric regression. In both
Dutch and Spanish, the regression shows a similar pattern, but the increase is
not as strong as in English. But, as in English, the regression decays significantly
when it reaches the area with a smaller density of points.
When Wikipedia frequency is analyzed (Figure A.4 in the Appendix A), we
can observe two facts: points are distributed in a more compact way and the
nonparametric regression has a steeper slope in English, and, to a lesser degree,
in Dutch and Spanish.
To sum up, all the significant correlations are positive. Correlations are
significant in English in all cases, and for the two rank-based correlation
measures (Spearman and Kendall) in Dutch and Spanish.

5.2. Frequency versus Length
The results of the analysis of the two measures of frequency versus the three
measures of length are in Tables C.13 and C.14 for English, Tables C.15 and
C.16 for Dutch, and Tables C.17 and C.18 for Spanish, all in Appendix C. In
this case, the results are more uniform, since all correlations are significant
and negative both for children and for adults.
As for the nonparametric regressions in Figures A.5, A.6, A.7, A.8, A.9 and
A.10, the results are consistent with these patterns: in all cases, the
regression shows a negative slope.

5.3. Steiger’s test


We have seen in the previous section that all three measures of length exhibit
a negative correlation with frequency. We now turn to the question of which of
these measures holds the strongest correlation with frequency, by means of
Steiger’s test (see Section 4 for details about how this test has been
computed).
In Appendix D we present, for each pair of length measures (which can
be the number of characters, phonemes or syllables) a table that displays the
analytical results of the Steiger’s test (the t and p-value) with respect to the two
different sources of frequency (CHILDES or Wikipedia). Table D.19 shows the
results for the Steiger’s test between variables number of characters and number
of phonemes, Table D.20 shows the results between number of characters and
number of syllables and Table D.21 shows the results between variables number
of phonemes and number of syllables.
We also provide a more compact way of seeing the results contained in
those tables in a set of three different tables, one for each language: English
in Table 5, Dutch in Table 6 and Spanish in Table 7. In these tables, for each
analyzed language, we list all possible combinations of role, frequency measure
and correlation type and, for each combination, we display the order relationship
on the strength of the correlation for each pair of length variables. In the column
Char. vs Phon. we show the relation between number of characters and number
of phonemes, in the column Phon. vs Syllables we show the relation between
number of phonemes and number of syllables, and in the column Char. vs
Syllables we show the relation between number of characters and number of
syllables. For each pair, the relation may be >∗ or <∗ if the Steiger’s test has
determined that the difference of the correlations is significant. If the test is
not significant, then, the relation may be > or <. Since the Kendall correlation
test is not analyzed with the Steiger’s test (see Section 4), we have adopted the
convention of assuming that this test is always non-significant.
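
As an illustration of how an entry of these compact tables could be produced,
the following sketch computes a t statistic for two dependent correlations that
share one variable and maps the outcome to one of the four symbols. Several
points are assumptions on our side and not statements about the computation
described in Section 4: the statistic is the Williams-type t that is commonly
associated with Steiger’s test, the p-value uses a normal approximation (which
is reasonable only for large samples), the significance level alpha = 0.05 is
hypothetical, and the input correlations are invented values.

```python
import math

def steiger_t(r12, r13, r23, n):
    """t statistic for H0: rho12 == rho13, where variable 1 is shared.

    r12, r13: correlations of the shared variable with variables 2 and 3.
    r23: correlation between variables 2 and 3. n: sample size.
    Assumed variant: Williams-type t with n - 3 degrees of freedom.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_mean = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * (n - 1) / (n - 3) * det + r_mean**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt(numerator / denominator)

def two_sided_p(t):
    """Two-sided p-value under a normal approximation (large n only)."""
    return math.erfc(abs(t) / math.sqrt(2))

def relation_symbol(r_first, r_second, p_value, alpha=0.05):
    """Return '>', '<', '>*' or '<*' comparing correlation strengths."""
    stronger = ">" if abs(r_first) > abs(r_second) else "<"
    return stronger + "*" if p_value < alpha else stronger

# Hypothetical values: frequency vs characters, frequency vs phonemes,
# characters vs phonemes, and sample size.
r_freq_char, r_freq_phon, r_char_phon, n = -0.30, -0.25, 0.90, 1000

t = steiger_t(r_freq_char, r_freq_phon, r_char_phon, n)
p = two_sided_p(t)
print(f"t = {t:.3f}, p = {p:.2g}, relation: {relation_symbol(r_freq_char, r_freq_phon, p)}")
```

With negative correlations, a negative t means that the first length measure is
more strongly (more negatively) correlated with frequency than the second one.
For the Kendall rows, the convention above would amount to always returning the
symbol without the asterisk.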
Tables 5, 6 and 7 provide the following results. In English, most of the tests
are significant, and the pattern that emerges most frequently is Char. >∗
Phon. >∗ Syllables, that is, the number of characters is the length measure
most strongly correlated with a frequency measure, followed by the number of
phonemes and then by the number of syllables. In fact, the patterns Char. >∗
Syllables and Phon. >∗ Syllables appear in the vast majority of cases. In
Dutch, little can be said about whether the number of phonemes or the number of
characters is the most correlated length variable, for lack of significance.
However, the two patterns Char. >∗ Syllables and Phon. >∗ Syllables (as
in English) appear in most of the cases. As for Spanish, nothing can be said
because most of the relations are non-significant.

To sum up, the variables number of characters and number of phonemes show
a stronger correlation with respect to frequency than the number of syllables
when the Steiger’s test was significant. The results also reveal (in English) a
slightly stronger correlation between the number of characters and frequency
than between the number of phonemes and frequency.

Role      Frequency  Correlation  Char. vs Phon.  Phon. vs Syllables  Char. vs Syllables
Children  CHILDES    Pearson      >∗              >∗                  >∗
Children  CHILDES    Spearman     <               >∗                  >∗
Children  CHILDES    Kendall      <               >                   >
Children  Wikipedia  Pearson      >∗              >∗                  >∗
Children  Wikipedia  Spearman     >∗              >∗                  >∗
Children  Wikipedia  Kendall      >               >                   >
Adults    CHILDES    Pearson      >∗              >∗                  >∗
Adults    CHILDES    Spearman     <∗              >∗                  >∗
Adults    CHILDES    Kendall      <               >                   >
Adults    Wikipedia  Pearson      >∗              >∗                  >∗
Adults    Wikipedia  Spearman     >∗              >∗                  >∗
Adults    Wikipedia  Kendall      >               >                   >

Table 5: For English, the results of the Steiger’s test between a frequency measure (CHILDES
or Wikipedia) and the three different length measures of a word (Char. for number of charac-
ters, Phon. for the number of phonemes and Syllables for the number of Syllables). Column
Role indicates the subject (children or adults), the Frequency column indicates the source of
the frequency measure (the shared variable), and the column Correlation indicates the type
of correlation. As for the remaining columns, we have the combination of two length mea-
sures. The content is >∗ when the first length variable is more correlated to the frequency
measure than the second variable and the Steiger’s test is significant. If the second measure
is more correlated than the first and the test is significant, then, the content is <∗ . If the test
is not significant, then the contents are > or < depending on what variable shows a higher
correlation to the frequency measure.

Role      Frequency  Correlation  Char. vs Phon.  Phon. vs Syllables  Char. vs Syllables
Children  CHILDES    Pearson      <               >∗                  >∗
Children  CHILDES    Spearman     <               >∗                  >∗
Children  CHILDES    Kendall      <               >                   >
Children  Wikipedia  Pearson      >               >                   >∗
Children  Wikipedia  Spearman     >               >∗                  >∗
Children  Wikipedia  Kendall      >               >                   >
Adults    CHILDES    Pearson      >               >∗                  >∗
Adults    CHILDES    Spearman     <               >∗                  >∗
Adults    CHILDES    Kendall      <               >                   >
Adults    Wikipedia  Pearson      >               >                   >∗
Adults    Wikipedia  Spearman     >∗              >∗                  >∗
Adults    Wikipedia  Kendall      >               >                   >

Table 6: Results of the Steiger’s test for Dutch. The format is the same as in Table 5.

Role      Frequency  Correlation  Char. vs Phon.  Phon. vs Syllables  Char. vs Syllables
Children  CHILDES    Pearson      >               >                   >
Children  CHILDES    Spearman     >               >∗                  >∗
Children  CHILDES    Kendall      >               <                   <
Children  Wikipedia  Pearson      >               >                   >
Children  Wikipedia  Spearman     >∗              <                   >
Children  Wikipedia  Kendall      >               <                   <
Adults    CHILDES    Pearson      <               >                   >
Adults    CHILDES    Spearman     <               >                   >
Adults    CHILDES    Kendall      <               <                   <
Adults    Wikipedia  Pearson      >               >                   >
Adults    Wikipedia  Spearman     >∗              <                   >
Adults    Wikipedia  Kendall      >               <                   <

Table 7: Results of the Steiger’s test for Spanish. The format is the same as in Table 5.

5.4. Proportion of ties


Here we aim at shedding some light on the weakness of the correlation be-
tween frequency and other variables. We focus on the Kendall τ correlation
because it allows for a simple analysis of the influence of tied values.
The Kendall τ correlation is defined as [34]
\[ \tau = \frac{N_c - N_d}{\binom{n}{2}} \tag{5} \]

where n is the sample size and Nc and Nd are, respectively, the number of
concordant and discordant pairs in the sample. We have that

\[ N_c + N_d + N_t = \binom{n}{2} \]

where Nt is the number of tied pairs (pairs that are neither concordant nor
discordant). Applying
\[ N_d = \binom{n}{2} - N_c - N_t \]
one can rewrite Eq. 5 equivalently as
\[ \tau = -1 + \upsilon + \frac{2 N_c}{\binom{n}{2}} \tag{6} \]

where
\[ \upsilon = \frac{N_t}{\binom{n}{2}} \]

is the proportion of tied pairs (0 ≤ υ ≤ 1). The fact that Nc ≥ 0 allows one to
see that
\[ \tau \geq \upsilon - 1 \tag{7} \]
Put differently, the strongest negative Kendall τ that can be obtained is υ − 1.
The higher the number of ties, the weaker the maximum Kendall τ correlation
that can be obtained. Table 8 shows the percentage of ties, namely 100υ, of
frequency (CHILDES and Wikipedia) versus polysemy and the measures of
length for every language and role.
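
The effect of ties on Kendall τ can be checked numerically with a short sketch
that follows Eqs. 5 to 7 literally. The sample is a hypothetical toy example
(not data from this article) and the function names are our own. Pairs are
enumerated by brute force, which is quadratic in the sample size and therefore
only meant for illustration.

```python
from itertools import combinations

def kendall_counts(x, y):
    """Count concordant (Nc), discordant (Nd) and tied (Nt) pairs."""
    nc = nd = nt = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            nc += 1
        elif s < 0:
            nd += 1
        else:  # tie in x, in y, or in both: neither concordant nor discordant
            nt += 1
    return nc, nd, nt

def kendall_tau(x, y):
    """Return (tau, upsilon): tau as in Eq. 5 and the proportion of ties."""
    nc, nd, nt = kendall_counts(x, y)
    pairs = nc + nd + nt  # equals n choose 2
    return (nc - nd) / pairs, nt / pairs

# Hypothetical toy sample: word frequencies and syllabic lengths.
frequency = [50, 20, 20, 7, 3, 1]
syllables = [1, 1, 2, 2, 2, 3]

nc, nd, nt = kendall_counts(frequency, syllables)
tau, upsilon = kendall_tau(frequency, syllables)
pairs = nc + nd + nt

# Eq. 6: tau = -1 + upsilon + 2 Nc / (n choose 2)
assert abs(tau - (-1 + upsilon + 2 * nc / pairs)) < 1e-12
# Eq. 7: tau >= upsilon - 1
assert tau >= upsilon - 1 - 1e-12

print(f"Nc = {nc}, Nd = {nd}, Nt = {nt}, tau = {tau:.3f}, bound = {upsilon - 1:.3f}")
```

In this toy sample there are no concordant pairs, so τ attains the bound
υ − 1: the association is as negative as the ties allow, and yet τ stays well
above −1. Note that some statistical packages report a tie-corrected variant of
Kendall τ by default, so their output may differ from Eq. 5 when ties are
present.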

It is possible to derive lower bounds for the Spearman correlation ρ from
that of Kendall τ correlation. Knowing that [38]
\[ \rho \geq \frac{1}{2}(3\tau - 1) \]
and recalling Eq. 7, one obtains
\[ \rho \geq \frac{3}{2}\upsilon - 2 \tag{8} \]
Similarly, knowing that [39]
\[ \rho \geq \frac{1}{2}(1 + \tau)^2 - 1 \]
one obtains
\[ \rho \geq \frac{\upsilon^2}{2} - 1 \tag{9} \]
Combining Eqs. 8 and 9, we finally get

\[ \rho \geq \max\left(\frac{3}{2}\upsilon - 2, \frac{\upsilon^2}{2} - 1\right) \tag{10} \]


The lower bounds of ρ above are likely to be looser than the original lower bound
of τ because the former are derived from the latter.
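
To give a feeling for how loose these bounds are, the sketch below evaluates
the lower bound of τ (Eq. 7) and the lower bounds of ρ that follow from Eqs. 8
and 9 for one proportion of ties taken from Table 8. The function names are our
own and the code is only an arithmetic illustration.

```python
def tau_lower_bound(upsilon):
    """Lower bound of Kendall tau given the proportion of ties (Eq. 7)."""
    return upsilon - 1.0

def rho_lower_bound(upsilon):
    """Best of the two lower bounds of Spearman rho (Eqs. 8 and 9)."""
    from_eq8 = 1.5 * upsilon - 2.0
    from_eq9 = upsilon**2 / 2.0 - 1.0
    # For 0 <= upsilon <= 1 the Eq. 9 term is never below the Eq. 8 term,
    # because upsilon^2 - 3*upsilon + 2 = (upsilon - 1)(upsilon - 2) >= 0.
    return max(from_eq8, from_eq9)

# 35.4% of tied pairs: English, children, CHILDES frequency versus
# number of syllables (Table 8).
upsilon = 0.354

print(f"tau >= {tau_lower_bound(upsilon):.3f}")
print(f"rho >= {rho_lower_bound(upsilon):.3f}")
```

For this value the bound on ρ remains close to −1 while the bound on τ is
noticeably higher, in line with the remark that the bounds of ρ are looser.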

Language  Role      Frequency  Polysemy  Number of characters  Number of phonemes  Number of syllables
English   Children  CHILDES    18.2%     20.9%                 21.4%               35.4%
English   Children  Wikipedia  11.9%     14.4%                 15.0%               29.9%
English   Adults    CHILDES    21.3%     21.3%                 21.3%               33.9%
English   Adults    Wikipedia  13.7%     13.2%                 13.2%               26.9%
Dutch     Children  CHILDES    31.8%     19.5%                 21.9%               38.0%
Dutch     Children  Wikipedia  26.9%     13.1%                 15.6%               33.0%
Dutch     Adults    CHILDES    36.1%     21.8%                 23.3%               37.0%
Dutch     Adults    Wikipedia  28.2%     11.1%                 12.7%               28.4%
Spanish   Children  CHILDES    24.0%     21.4%                 21.0%               38.7%
Spanish   Children  Wikipedia  17.5%     14.3%                 13.8%               33.0%
Spanish   Adults    CHILDES    23.2%     21.5%                 21.1%               38.3%
Spanish   Adults    Wikipedia  16.4%     14.1%                 13.7%               32.3%

Table 8: Percentage of ties, 100υ, between frequency (CHILDES and Wikipedia) and Word-
Net polysemy, number of characters, number of phonemes and number of syllables for every
language and role.

In Sections 5.2 and 5.3, we have shown a tendency of syllabic length to be
the unit of length that is most weakly correlated with frequency. This could
be due to the generally higher proportion of ties for syllabic length (Table
8), which reduces the potential strength of the correlation according to Eqs. 7
and 10.

6. Discussion and Future Work

In this paper, we have reviewed two linguistic laws that we owe to Zipf
[3, 1] and that have probably been overshadowed by the best-known Zipf’s law for
word frequencies [1]. Our analysis of the correlation between brevity (measured
in number of characters, phonemes and syllables) and polysemy (number of
synsets) versus word frequency was conducted with three correlation tests with
varying assumptions and robustness. Pearson correlation is a measure of linear-
ity while the Spearman correlation and Kendall correlation are able to capture
monotonic non-linear dependencies as we have explained in Section 4. Our anal-
ysis confirms that a positive correlation between the frequency of the words and
the number of synsets (consistent with the meaning-frequency law [3]) and a
negative correlation between the length of the words and their frequency (con-
sistent with the law of abbreviation [1]) arises under different definitions of the
variables. In all cases, we find correlations whose sign matches the expected
sign. In addition, all correlations are significant except the Pearson correlations
in the meaning-frequency law for Dutch and Spanish. This behaviour could be
due to (a) the lower capacity of the Pearson correlation to detect non-linear
dependencies compared to Spearman and Kendall correlations or, (b) the fact
that English exhibits a larger sample size than those two languages (Table 4).
In optimization models of the law of abbreviation, length is regarded as a
proxy for the energetic cost of the word [12, 40]. Then one expects that a better
measure of energetic cost would give a stronger correlation with frequency. Our
meta-analysis of the correlation between frequency and length has shown that
this correlation is slightly stronger when length is measured in characters
than in phonemes in most cases in English, and that characters and phonemes
yield stronger correlations than syllables in both English and Dutch. In Spanish, no clear
pattern arises, which is consistent with the classical view of Spanish as a more
transparent language than English [41] or Dutch [42]. Thus, the grapheme
to phoneme conversion is easier in Spanish than in English or Dutch [42]. The
degree of transparency of a language is defined as the extent to which a language
maintains one-to-one relations between units from different dimensions, e.g.,
phonemes versus graphemes. Transparency is tied to the notion of “simplicity”
in accounting for acquisition data (see [42] for a review). Transparency facilitates
reading. Then, learning to read in a transparent orthography imposes fewer
constraints than learning to read in a more opaque writing system [43].
The fact that the correlation between frequency and number of syllables
tends to be weaker than correlations with other measures of length does not
imply that syllables are a worse measure of length or energetic cost. It could
be simply due to the fact that ties of length values are easier to obtain with
syllabic length, a fact that is expected to yield weaker correlations and higher
p-values as we have shown in Section 5.4.
Interestingly, we have not found any remarkable qualitative difference in the
analysis of correlations for adults versus children in the CHILDES database.
This suggests that both child speech and infant-directed or child-directed
speech (the so-called motherese) [19] show the same general statistical
biases in the use of more frequent words (which tend to be shorter and more
polysemous). This confirms the results of our previous study [15], where adults
were split into three different roles (mother, father and investigator) instead
of being considered together in a single class as in the present paper.

Our analyses have shown the robustness of these Zipfian patterns from the
standpoint of a correlation analysis. Such robustness provides support to Zipf’s
hypothesis that these laws originate from abstract principles, e.g., functional
pressures (least effort as he would put it), that are consistent with modern
formalizations as a compression principle for the law of abbreviation [12, 40] or a
biased random walk over the mapping of words into meanings for the origins of
Zipf’s meaning-frequency law [13]. These theoretical approaches strongly suggest
that it might be possible to provide a coherent and parsimonious explanation for
the laws we have examined in this article and other laws such as Zipf’s law for
word frequencies [44] or Menzerath’s law [45]. The need for an abstract stand-
point is not only suggested by our analyses but also by patterning consistent
with these laws in human language in different conditions, e.g., sign language
[46], Kanji or Chinese characters [47, 48], and also in animal communication
[12, 49, 50].
Our work offers many possibilities for future research.
First, expanding the set of languages to include languages from other families
(i.e. not Indo-European languages) and the set of lexical databases employed
(e.g., [51]). As for the latter, the challenge is to find sources that allow one
to deal with different languages homogeneously.
Second, considering different definitions of the same variables. For instance,
a limitation of our study is the fact that we define word length using discrete
units: number of syllables, number of phonemes or number of characters. Future
research could benefit from viewing length as a continuous variable, e.g. the
(average) duration in time of the word, because that may yield a better estimate
of the actual energetic cost of a word and also because our research on language
laws is to some extent limited by the information that is transcribed and the
writing conventions, which add some degree of arbitrariness. These limitations
have been overcome to a large extent in novel investigations of language laws in
pure voice [14].
Third, our work can be extended by including other linguistic variables such as
homophony, i.e. words with different origin (and a priori different meaning) that
have converged to the same phonological form. This extension would require
tracing the history of each word, under a dynamical and lexical perspective,
following the connection between brevity of words and homophony that
Jespersen (1933) suggested in his seminal work [52] and that has been confirmed
more recently [53, 54] as a strong association between shortness of words, token
frequency and homophony [54]. If polysemy is taken to be a form of motivated
homophony, by which a word has two or more related meanings, but probably with
a different representation than homophones (for which different meanings
are "stored separately" [55]) in a semantic space, both phenomena will only be
distinguishable if we analyze and segment voice signals directly or, as said
before, if we adopt a diachronic approach. In any case, we must be aware of
the limitations of synchronic approaches when homophony or homography are
studied, the latter being indistinguishable from polysemy in the present article.
Fourth, a parametric study of these laws with the help of power-law like
functions [4, 24]. In Section 2, we have shown some challenges of that kind of
investigation. We do think that such an investigation is much needed and
worthwhile; we have only argued that it is not as simple as commonly believed.
Finally, future work should bridge the gap between our classic Zipfian
perspective and psycholinguistics. We suggest a couple of ways. First, an
exploration of the structural differences between common and rare words [56].
Second, an application of our methodology to other magnitudes such as con-
textual diversity, which may be more relevant than the mere word frequency in
some lexical tasks [57, 58].

Acknowledgments

The authors thank Pedro Delicado for his helpful comments. This research work
was supported by the grant SGR2014-890 (MACDA) and the recognition 2017SGR-856
(MACDA) from AGAUR (Generalitat de Catalunya), and also the grants TIN2014-
57226-P (APCOM), TIN2017-89244-R (MACDA) and TIN2016-77820-C3-3-R (GRAPH-
MED) from MINECO (Ministerio de Economia, Industria y Competitividad).

Appendix A. Figures

Figures A.3, A.4, A.5, A.6, A.7, A.8, A.9 and A.10 show the results obtained
in this article for English, Dutch and Spanish.
In all these plots, frequency is placed on the x-axis for at least three reasons.
First, in information theory (coding theory in particular), frequency is given
(it is assumed to be constant) while length is variable [59]. In the
problem of compression, one aims to minimize the average length of codes given
probabilities (estimated as relative frequencies). Information theory predicts
the length of a code as a function of its frequency [59, p. 111] or its frequency
rank [40]. Therefore, when plotting length versus frequency, it makes sense
to put frequency on the x-axis. Second, word frequency is a fundamental variable in
psycholinguistics to predict language processing costs [60, 61]. The third reason
comes from the popular Zipf’s law for word frequencies. Although Zipf’s law
is usually plotted as frequency as a function of rank (following Eq. 1), a sister
(but not identical) plot, the so-called frequency spectrum, consists of showing
the number of distinct words as a function of frequency [62]. The second plot
(frequency on the x-axis) is preferred by various authors for investigating the
distribution of word frequencies (e.g., [63, p. 298], [64, p. 3]).

Appendix B. Information about CHILDES

Tables B.9, B.10, B.11 and B.12 show the CHILDES corpora used in this
article for English, Dutch and Spanish.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.3: WordNet polysemy versus CHILDES frequency in double logarithmic scale. The
color indicates the density of points: dark green is the highest possible density. A smoothed
curve (blue line) and a curve proportional to the probability density of values of the x-axis
(red dashed line) are also shown.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.4: WordNet polysemy versus Wikipedia frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.5: Number of characters versus CHILDES frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.6: Number of characters versus Wikipedia frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.7: Number of phonemes versus CHILDES frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.8: Number of phonemes versus Wikipedia frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.9: Number of syllables versus CHILDES frequency. The format is the same as in
Fig. A.3.

[Figure panels: columns Children and Adults; rows English, Dutch and Spanish.]

Figure A.10: Number of syllables versus Wikipedia frequency. The format is the same as in
Fig. A.3.

Corpus           Age Range  # children  Comments
Lara [65]        1;9 – 3;3  1           Longitudinal case study
Manchester [66]  1;8 – 3;0  12          12 English children recorded weekly for the period of a year
Wells [67]       1;6 – 5;0  32          Large study of the language of British preschool children collected at random intervals

Table B.9: CHILDES database for British English http://childes.psy.cmu.edu/access/Eng-UK/

Corpus                   Age Range                                        # children  Comments
Bloom 1970 [68, 69, 70]  1;9 – 3;2                                        2           A large longitudinal study of one child with a few samples for another. Gia was excluded because age information is not reported for her.
Brown [71]               Adam 2;3 – 4;10, Eve 1;6 – 2;3, Sarah 2;3 – 5;1  3           Large longitudinal study of three children: Adam 55 files, Eve 20 and Sarah 139
Kuczaj [72]              2;4 – 5;0                                        1           Diary study in the home environment
MacWhinney [73]          Ross 2;6 – 8;0, Mark 0;7 – 5;6                   2           Diary study of the development of two brothers recorded in spontaneous situations
Providence [74]          1–3                                              5           Ethan was excluded because he was diagnosed with Asperger’s Syndrome at the age of 5.
Sachs [75]               1;1 – 5;1                                        1           Longitudinal naturalistic study
Suppes [76]              1;11 – 3;11                                      1           Longitudinal study of a single child

Table B.10: CHILDES database for American English http://childes.psy.cmu.edu/access/Eng-NA/

Corpus              Age Range                             # children  Comments
BolKuiken [77]      1;7 – 3;7                             47          Dutch normal controls
CLPF [78, 79]       1;0 – 2;11                            12          PHONBANK, longitudinal study with 20,000 utterances
Groningen [80]      1;5 – 3;7                             6           ’Iris’ was removed because she subsequently displayed delay in language development due to hearing problems. ’Iri’ (ending with no ’s’) was also excluded (this person was very likely a misspelling of ’Iris’ because he/she was in the same subdirectory of ’Iris’ and was the only target child in the only file where it appeared).
Schaerlaekens [81]  1;8 – 2;10, 1;10 – 3;1                6
van Kampen [82]     Laura 1;9 – 5;10, Sarah 1;6.16 – 6;0  2

Table B.11: CHILDES database for Dutch http://childes.psy.cmu.edu/access/Dutch/

Corpus                Age Range  # children  Comments
Aguirre [83]          1;7-2;10   1
BecaCESNo             3;6-11;6   40
ColMex                6;0-7;0    30          Mexican Spanish, picture and procedural description
DiezItza [84]         3;0-3;11   20
FernAguado            3;0-4;0    50
Hess [85]             6;0-12;0   24
JacksonThal [86, 87]  0;10-3;0   202         Cross-sectional data from Queretaro, San Diego, and Santa Barbara
Linaza [88]           2;0-4;0    1
LlinasOjea            0;11-3;02  1           Longitudinal study of two children in Asturias, but only Yasmin is considered.
Marrero               1;8-8;0    3           Longitudinal study of Spanish children from the Canaries
Nieva                 1;8-2;3    1
Ornat [89]            1;7-4;0    1
Remedi                1;11-2;10  1
Romero [90]           2;0        1           Mexican Spanish
SerraSole             1;4-3;10   1
Shiro [91]            6;0-9;0    113         Narratives from Venezuelan children

Table B.12: CHILDES database for Spanish http://childes.psy.cmu.edu/access/Spanish/

Appendix C. Correlations

Tables C.13, C.14, C.15, C.16, C.17 and C.18 show the correlations be-
tween Frequency and Polysemy, and Frequency and Length measures for En-
glish, Dutch and Spanish.

Appendix D. Tables of Steiger’s tests

Tables D.19, D.20 and D.21 show the results of Steiger’s test between number
of characters and number of phonemes, number of characters and number of
syllables, and number of phonemes and number of syllables for English, Dutch
and Spanish.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
CHILDES frequency vs Wordnet polysemy
Children     0.059   < 10^-8      0.249   < 10^-140    0.182   < 10^-323    9,930
Adults       0.09    < 10^-323    0.264   < 10^-257    0.196   < 10^-323    16,235
CHILDES frequency vs Number of characters
Children     -0.122  < 10^-33     -0.27   < 10^-164    -0.202  < 10^-164    9,930
Adults       -0.115  < 10^-48     -0.324  < 10^-323    -0.243  < 10^-323    16,235
CHILDES frequency vs Number of phonemes
Children     -0.123  < 10^-29     -0.31   < 10^-188    -0.234  < 10^-185    8,547
Adults       -0.11   < 10^-38     -0.361  < 10^-323    -0.273  < 10^-323    14,146
CHILDES frequency vs Number of syllables
Children     -0.077  < 10^-11     -0.239  < 10^-110    -0.193  < 10^-108    8,547
Adults       -0.075  < 10^-18     -0.303  < 10^-298    -0.243  < 10^-290    14,146

Table C.13: Analysis of the correlations in English with CHILDES frequency. For every
correlation test the value of the statistic (r, ρ and τ) and the corresponding p-value are shown.
Significant correlations are indicated in bold.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
Wikipedia frequency vs Wordnet polysemy
Children     0.059   < 10^-8      0.415   < 10^-323    0.301   < 10^-323    9,975
Adults       0.068   < 10^-323    0.422   < 10^-323    0.307   < 10^-323    16,286
Wikipedia frequency vs Number of characters
Children     -0.106  < 10^-25     -0.241  < 10^-131    -0.172  < 10^-127    9,975
Adults       -0.094  < 10^-32     -0.2    < 10^-146    -0.142  < 10^-143    16,286
Wikipedia frequency vs Number of phonemes
Children     -0.1    < 10^-19     -0.242  < 10^-113    -0.176  < 10^-112    8,548
Adults       -0.084  < 10^-23     -0.171  < 10^-92     -0.122  < 10^-91     14,149
Wikipedia frequency vs Number of syllables
Children     -0.06   < 10^-7      -0.17   < 10^-55     -0.131  < 10^-54     8,548
Adults       -0.054  < 10^-10     -0.102  < 10^-33     -0.078  < 10^-32     14,149

Table C.14: Analysis of the correlations in English with Wikipedia frequency. The format is
the same as in Table C.13.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
CHILDES frequency vs Wordnet polysemy
Children     0.017   0.376        0.19    < 10^-22     0.147   < 10^-323    2,627
Adults       0.013   0.342        0.19    < 10^-43     0.149   < 10^-323    5,273
CHILDES frequency vs Number of characters
Children     -0.13   < 10^-10     -0.187  < 10^-21     -0.138  < 10^-21     2,627
Adults       -0.113  < 10^-15     -0.347  < 10^-148    -0.259  < 10^-143    5,273
CHILDES frequency vs Number of phonemes
Children     -0.136  < 10^-11     -0.2    < 10^-23     -0.149  < 10^-23     2,575
Adults       -0.114  < 10^-15     -0.358  < 10^-155    -0.271  < 10^-151    5,179
CHILDES frequency vs Number of syllables
Children     -0.1    < 10^-6      -0.169  < 10^-17     -0.134  < 10^-17     2,575
Adults       -0.093  < 10^-10     -0.323  < 10^-125    -0.258  < 10^-120    5,179

Table C.15: Analysis of the correlations in Dutch with CHILDES frequency. The format is
the same as in Table C.13.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
Wikipedia frequency vs Wordnet polysemy
Children     0.002   0.931        0.377   < 10^-89     0.286   < 10^-323    2,634
Adults       0.008   0.574        0.395   < 10^-196    0.302   < 10^-323    5,289
Wikipedia frequency vs Number of characters
Children     -0.079  < 10^-4      -0.4    < 10^-100    -0.285  < 10^-94     2,634
Adults       -0.07   < 10^-6      -0.347  < 10^-148    -0.244  < 10^-140    5,289
Wikipedia frequency vs Number of phonemes
Children     -0.077  < 10^-4      -0.389  < 10^-93     -0.282  < 10^-87     2,579
Adults       -0.067  < 10^-5      -0.315  < 10^-119    -0.224  < 10^-114    5,190
Wikipedia frequency vs Number of syllables
Children     -0.06   0.002        -0.363  < 10^-80     -0.28   < 10^-75     2,579
Adults       -0.055  < 10^-4      -0.281  < 10^-94     -0.212  < 10^-90     5,190

Table C.16: Analysis of the correlations in Dutch with Wikipedia frequency. The format is
the same as in Table C.13.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
CHILDES frequency vs Wordnet polysemy
Children     -0.01   0.564        0.162   < 10^-21     0.12    < 10^-323    3,520
Adults       0.011   0.523        0.152   < 10^-17     0.113   < 10^-323    3,217
CHILDES frequency vs Number of characters
Children     -0.125  < 10^-13     -0.367  < 10^-111    -0.276  < 10^-108    3,520
Adults       -0.136  < 10^-13     -0.362  < 10^-99     -0.272  < 10^-96     3,217
CHILDES frequency vs Number of phonemes
Children     -0.114  < 10^-10     -0.361  < 10^-107    -0.271  < 10^-104    3,512
Adults       -0.136  < 10^-14     -0.362  < 10^-99     -0.272  < 10^-96     3,207
CHILDES frequency vs Number of syllables
Children     -0.109  < 10^-10     -0.344  < 10^-97     -0.276  < 10^-94     3,520
Adults       -0.126  < 10^-12     -0.354  < 10^-94     -0.284  < 10^-91     3,217

Table C.17: Analysis of the correlations in Spanish with CHILDES frequency. The format is
the same as in Table C.13.

Correlation  Pearson              Spearman             Kendall
Role         r       p-value      ρ       p-value      τ       p-value      size
Wikipedia frequency vs Wordnet polysemy
Children     -0.003  0.849        0.385   < 10^-124    0.282   < 10^-120    3,524
Adults       -0.007  0.698        0.359   < 10^-97     0.262   < 10^-95     3,220
Wikipedia frequency vs Number of characters
Children     -0.085  < 10^-6      -0.144  < 10^-16     -0.103  < 10^-16     3,524
Adults       -0.088  < 10^-6      -0.167  < 10^-20     -0.119  < 10^-20     3,220
Wikipedia frequency vs Number of phonemes
Children     -0.078  < 10^-5      -0.122  < 10^-12     -0.087  < 10^-12     3,516
Adults       -0.081  < 10^-5      -0.145  < 10^-15     -0.103  < 10^-15     3,210
Wikipedia frequency vs Number of syllables
Children     -0.079  < 10^-5      -0.139  < 10^-15     -0.107  < 10^-16     3,524
Adults       -0.082  < 10^-5      -0.165  < 10^-20     -0.126  < 10^-20     3,220

Table C.18: Analysis of the correlations in Spanish with Wikipedia frequency. The format is
the same as in Table C.13.

Frequency    CHILDES             CHILDES             Wikipedia           Wikipedia
Correlation  Pearson             Spearman            Pearson             Spearman
Role         t        p-value    t        p-value    t        p-value    t        p-value    size
English
Children     -2.484   0.013      0.022    0.982      -3.693   < 10^-3    -4.572   < 10^-5    8,548
Adults       -3.562   < 10^-3    2.501    0.012      -4.553   < 10^-5    -8.837   < 10^-17   14,149
Dutch
Children     0.395    0.693      1.417    0.157      -0.472   0.637      -1.049   0.294      2,579
Adults       -0.251   0.802      1.624    0.104      -0.793   0.428      -6.491   < 10^-10   5,190
Spanish
Children     -0.861   0.389      -0.391   0.696      -1.203   0.229      -3.596   < 10^-3    3,516
Adults       0.098    0.922      0.571    0.568      -1.194   0.233      -3.43    < 10^-3    3,210

Table D.19: Steiger’s test between variables Number of characters and Number of phonemes.
The test indicates if the difference between r1 and r2 is significant, where r1 is the
correlation between Frequency and Number of characters and r2 is the correlation between
Frequency and Number of phonemes. Frequency is the shared variable. The table has two
parts: one for the results when CHILDES frequency is the shared variable, and another when
Wikipedia frequency is the shared variable. Within each part, the t statistic of Steiger’s
test and the corresponding p-value are shown for Pearson and Spearman correlations.
Significant results are in bold.

Frequency    CHILDES             CHILDES             Wikipedia           Wikipedia
Correlation  Pearson             Spearman            Pearson             Spearman
Role         t        p-value    t        p-value    t        p-value    t        p-value    size
English
Children     -7.922   < 10^-14   -9.812   < 10^-21   -7.895   < 10^-14   -13.23   < 10^-38   8,548
Adults       -9.045   < 10^-18   -9.3     < 10^-19   -8.817   < 10^-17   -19.092  < 10^-79   14,149
Dutch
Children     -3.277   0.001      -2.051   0.040      -2.01    0.045      -3.519   < 10^-3    2,579
Adults       -3.079   0.002      -4.199   < 10^-4    -2.164   0.031      -9.394   < 10^-20   5,190
Spanish
Children     -1.775   0.076      -2.704   0.007      -0.701   0.483      -0.491   0.623      3,524
Adults       -1.139   0.255      -1.006   0.314      -0.691   0.490      -0.28    0.780      3,220

Table D.20: Steiger’s test between variables Number of characters and Number of syllables.
The format is the same as in Table D.19.

Frequency    CHILDES             CHILDES             Wikipedia           Wikipedia
Correlation  Pearson             Spearman            Pearson             Spearman
Role         t        p-value    t        p-value    t        p-value    t        p-value    size
English
Children     -7.145   < 10^-12   -10.844  < 10^-26   -6.135   < 10^-9    -10.937  < 10^-26   8,548
Adults       -7.67    < 10^-13   -12.544  < 10^-35   -6.581   < 10^-10   -14.275  < 10^-45   14,149
Dutch
Children     -3.488   < 10^-3    -2.901   0.004      -1.662   0.097      -2.628   0.009      2,579
Adults       -3.007   0.003      -5.232   < 10^-6    -1.71    0.087      -4.951   < 10^-6    5,190
Spanish
Children     -1.093   0.274      -2.413   0.016      -0.09    0.929      1.19     0.234      3,516
Adults       -1.162   0.245      -1.227   0.220      -0.102   0.918      1.281    0.200      3,210

Table D.21: Steiger’s test between variables Number of phonemes and Number of syllables.
The format is the same as in Table D.19.

References
[1] G. K. Zipf, Human behaviour and the principle of least effort, Addison-
Wesley, Cambridge (MA), USA, 1949.
[2] G. K. Zipf, The Psycho-Biology of Language: an Introduction to Dynamic
Psychology, MIT Press, Cambridge, MA, USA, 1968, originally published
in 1935 by Houghton Mifflin - Boston - MA - USA.
[3] G. K. Zipf, The Meaning-Frequency Relationship of Words, Journal of Gen-
eral Psychology 33 (1945) 251–256.
[4] S. Naranan, V. K. Balasubrahmanyan, Models for power law relations in lin-
guistics and information science, J. Quantitative Linguistics 5 (1-2) (1998)
35–61.
[5] R. Ferrer-i-Cancho, Optimization models of natural communication, in:
Journal of Quantitative Linguistics, Vol. 25, 2014.
URL http://arxiv.org/abs/1412.2486
[6] F. Font-Clos, G. Boleda, A. Corral, A scaling law beyond Zipf’s law and
its relation to Heaps’ law, New Journal of Physics 15 (9) (2013) 093033.
URL http://stacks.iop.org/1367-2630/15/i=9/a=093033
[7] A. Corral, G. Boleda, R. Ferrer-i-Cancho, Zipf’s law for word frequencies:
Word forms versus lemmas in long texts, PLoS ONE 10 (7) (2015) 1–23.
doi:10.1371/journal.pone.0129031.
[8] R. Ferrer-i-Cancho, The meaning-frequency law in Zipfian optimization
models of communication, Glottometrics 35 (2016) 28–37.
[9] P. Grzybek, Contributions to the science of text and language: word length
studies and related issues, Vol. 31, Springer Science & Business Media,
2006.
[10] U. Strauss, P. Grzybek, G. Altmann, Word length and word frequency,
Springer, Dordrecht, 2007, pp. 277–294.
[11] B. Ilgen, B. Karaoglan, Investigation of Zipf’s “law-of-meaning” on Turkish
corpora, in: 22nd International Symposium on Computer and Information
Sciences (ISCIS 2007), 2007, pp. 1–6.
[12] R. Ferrer-i-Cancho, A. Hernández-Fernández, D. Lusseau, G. Agoramoor-
thy, M. J. Hsu, S. Semple, Compression as a universal principle of animal
behavior, Cognitive Science 37 (8) (2013) 1565–1578.
[13] R. Ferrer-i-Cancho, M. Vitevitch, The origins of Zipf’s meaning-frequency
law, Journal of the Association for Information Science and Technology
69 (2018) 1369–1379.
[14] I. Gonzalez Torre, B. Luque, L. Lacasa, J. Luque, A. Hernandez-Fernandez,
Emergence of linguistic laws in human voice, Scientific reports 7 (43862)
(2017) 1–10. doi:10.1038/srep43862.
URL http://www.nature.com/articles/srep43862

[15] A. Hernández-Fernández, B. Casas, R. Ferrer-i-Cancho, J. Baixeries, Test-
ing the Robustness of Laws of Polysemy and Brevity Versus Frequency,
Springer International Publishing, Cham, 2016, pp. 19–29. doi:10.1007/
978-3-319-45925-7_2.
URL http://dx.doi.org/10.1007/978-3-319-45925-7_2
[16] B. MacWhinney, The CHILDES project: tools for analyzing talk, 3rd Edi-
tion, Vol. 2: the database, Lawrence Erlbaum Associates, Mahwah, NJ,
2000.

[17] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Cam-
bridge, MA, 1998.
[18] G. Grefenstette, Extracting Weighted Language Lexicons from Wikipedia,
in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Mae-
gaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.),
Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC 2016), European Language Resources Association
(ELRA), Paris, France, 2016.
[19] C. Saint-Georges, M. Chetouani, R. Cassel, F. Apicella, A. Mahdhaoui,
F. Muratori, M.-C. Laznik, D. Cohen, Motherese in interaction: At the
cross-road of emotion and cognition? (A systematic review), PLOS ONE
8 (10). doi:10.1371/journal.pone.0078103.
URL http://dx.doi.org/10.1371%2Fjournal.pone.0078103
[20] N. Ide, Y. Wilks, Making Sense About Sense, Springer Netherlands, Dor-
drecht, 2006, pp. 47–73. doi:10.1007/978-1-4020-4809-8_3.
URL http://dx.doi.org/10.1007/978-1-4020-4809-8_3
[21] A. Kilgarriff, Dictionary word sense distinctions: An enquiry into their
nature, Computers and the Humanities 26 (5) (1992) 365–387.
doi:10.1007/BF00136981.
URL http://dx.doi.org/10.1007/BF00136981

[22] B. Armstrong, C. Zugarramurdi, A. Cabana, J. Valle Lisboa, D. Plaut,
Relative meaning frequencies for 578 homonyms in two Spanish dialects: A
cross-linguistic extension of the English eDom norms, Behavior Research
Methods 48. doi:10.3758/s13428-015-0639-3.
[23] I. Fraga, I. Padrón, M. Perea, M. Comesaña, I saw this somewhere else:
The Spanish Ambiguous Words (SAW) database, Lingua 185 (2017) 1–10.
doi:10.1016/j.lingua.2016.07.002.
URL http://www.sciencedirect.com/science/article/pii/S0024384116300596
[24] G. Altmann, Prolegomena to Menzerath’s law, Glottometrika 2 (1980) 1–
10.
[25] E. G. Altmann, M. Gerlach, Statistical Laws in Linguistics, Springer
International Publishing, Cham, 2016, pp. 7–26. doi:10.1007/
978-3-319-24403-7_2.
URL http://dx.doi.org/10.1007/978-3-319-24403-7_2

[26] F. Font-Clos, A. Corral, Log-log convexity of type-token growth in Zipf’s
systems, Phys. Rev. Lett. 114 (2015) 238701. doi:10.1103/PhysRevLett.
114.238701.
[27] R. Ferrer-i-Cancho, A. Hernández-Fernández, J. Baixeries, Ł. Dębowski,
J. Mačutek, When is Menzerath-Altmann law mathematically trivial? A
new approach, Statistical Applications in Genetics and Molecular Biology
13 (2014) 633–644.
[28] H. L. Seal, The maximum likelihood fitting of the discrete Pareto law,
Journal of the Institute of Actuaries (1886-1994) 78 (1) (1952) 115–121.
[29] A. Corral, Dependence of earthquake recurrence times and independence
of magnitudes on seismicity history, Tectonophysics 424 (2006) 177–193.
[30] F. Bond, R. Foster, Linking and Extending an Open Multilingual Wordnet,
Sofia, 2013.
[31] M. Postma, E. van Miltenburg, R. Segers, A. Schoen, P. Vossen, Open
Dutch WordNet, in: Proceedings of the Eighth Global Wordnet Conference,
Bucharest, Romania, 2016.
[32] A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual central repository
version 3.0: upgrading a very large lexical knowledge base, in: Proceedings
of the 6th Global WordNet Conference (GWC 2012), Matsue, 2012.
[33] R. H. Baayen, R. Piepenbrock, L. Gulikers, CELEX (1996).
URL http://celex.mpi.nl
[34] W. J. Conover, Practical nonparametric statistics, Wiley, New York, 1999,
3rd edition.
[35] J. D. Gibbons, S. Chakraborti, Nonparametric statistical inference, Chap-
man and Hall/CRC, Boca Raton, FL, 2010, 5th edition.
[36] P. Embrechts, A. McNeil, D. Straumann, Correlation and dependence in
risk management: properties and pitfalls, in: M. A. H. Dempster (Ed.),
Risk management: value at risk and beyond, Cambridge University Press,
Cambridge, 2002, pp. 176–223.
[37] J. H. Steiger, Tests for comparing elements of a correlation matrix, Psy-
chological Bulletin 87 (1980) 245–251.
[38] H. E. Daniels, Rank correlation and population models, Journal of the
Royal Statistical Society, Series B 12 (1950) 171–81.
[39] J. Durbin, A. Stuart, Inversions and rank correlations, Journal of the Royal
Statistical Society, Series B 13 (1951) 303–309.
[40] R. Ferrer-i-Cancho, C. Bentz, C. Seguin, Compression and the origins of
Zipf’s law of abbreviation.
URL http://arxiv.org/abs/1504.04884
[41] R. Nash, Comparing English and Spanish: Patterns in Phonology and Or-
thography, Prentice Hall, 1977.
URL https://books.google.es/books?id=ke86OwAACAAJ

[42] S. Leufkens, Transparency in language: a typological study, LOT, 2015.
URL http://hdl.handle.net/11245/1.439561
[43] E. Ijalba, L. K. Obler, First language grapheme-phoneme transparency
effects in adult second-language learning, Vol. 27, 2015, pp. 47–70.
[44] R. Ferrer-i-Cancho, Compression and the origins of Zipf’s law for word
frequencies, Complexity 21 (2016) 409–411. doi:10.1002/cplx.21820.
URL http://dx.doi.org/10.1002/cplx.21820
[45] M. L. Gustison, S. Semple, R. Ferrer-i-Cancho, T. Bergman, Gelada vocal
sequences follow Menzerath’s linguistic law, Proceedings of the National
Academy of Sciences USA 113 (2016) E2750–E2758.
doi:10.1073/pnas.1522072113.
[46] C. Börstell, T. Hörberg, R. Östling, Distribution and duration of signs and
parts of speech in Swedish Sign Language, Sign Language & Linguistics 19
(2016) 143–196.
[47] H. Sanada, Investigations in Japanese historical lexicology, Peust &
Gutschmidt Verlag, Göttingen, 2008.
[48] Y. Wang, X. Chen, Structural complexity of simplified Chinese characters,
in: A. Tuzzi, J. M. M. Benesová (Eds.), Recent Contributions to Quanti-
tative Linguistics, De Gruyter, 2015, pp. 229–239.
[49] R. Ferrer-i-Cancho, B. McCowan, A law of word meaning in dolphin whistle
types, Entropy 11 (4) (2009) 688–701. doi:10.3390/e11040688.
[50] C. Hobaiter, R. W. Byrne, The meanings of chimpanzee gestures, Current
Biology 24 (2014) 1596–1600.
[51] A. Duchon, M. Perea, N. Sebastián-Gallés, A. Martí, M. Carreiras, Es-
Pal: One-stop shopping for Spanish word properties, Behavior Research
Methods 45 (4) (2013) 1246–1258.
[52] O. Jespersen, Monosyllabism in English, in: Linguistica: Selected Writings
of Otto Jespersen, George Allen and Unwin LTD, London, UK, 2007, pp.
574–598.
[53] J. Ke, A cross-linguistic quantitative study of homophony, Journal of Quan-
titative Linguistics (2006) 129–159.
[54] G. Fenk-Oczlon, A. Fenk, Frequency effects on the emergence of poly-
semy and homophony, International Journal Information Technologies and
Knowledge 4 (2) (2010) 103–109.
[55] I. Dautriche, E. Chemla, What homophones say about words, PLOS ONE
11 (9) (2016) 1–19. doi:10.1371/journal.pone.0162176.
URL http://dx.doi.org/10.1371%2Fjournal.pone.0162176
[56] T. Landauer, L. Streeter, Structural differences between common and rare
words: Failure of equivalence assumptions for theories of word recognition,
Journal of Verbal Learning and Verbal Behavior 12 (2) (1973) 119–131.
doi:10.1016/S0022-5371(73)80001-5.
URL http://www.sciencedirect.com/science/article/pii/S0022537173800015
[57] M. Vergara-Martínez, M. Comesaña, M. Perea, The ERP signature of
the contextual diversity effect in visual word recognition, Cognitive, Af-
fective, & Behavioral Neuroscience 17 (3) (2017) 461–474.
doi:10.3758/s13415-016-0491-7.
URL https://doi.org/10.3758/s13415-016-0491-7
[58] J. S. Adelman, G. D. Brown, J. F. Quesada, Contextual diversity, not
word frequency, determines word-naming and lexical decision times, Psy-
chological Science 17 (9) (2006) 814–823.
doi:10.1111/j.1467-9280.2006.01787.x.
[59] T. M. Cover, J. A. Thomas, Elements of information theory, Wiley, New
York, 2006, 2nd edition.
[60] R. Brown, D. McNeill, The “tip of the tongue” phenomenon, Journal of
Verbal Learning and Verbal Behavior 5 (4) (1966) 325–337.
[61] C. M. Connine, J. Mullennix, E. Shernoff, J. Yelen, Word familiarity and
frequency in visual and auditory word recognition, Journal of Experimental
Psychology: Learning, Memory and Cognition 16 (1990) 1084–1096.
[62] J. Tuldava, The frequency spectrum of text and vocabulary, J. Quantitative
Linguistics 3 (1) (1996) 38–50.
[63] S. Naranan, V. K. Balasubrahmanyan, Information theoretic models in
statistical linguistics - Part II: Word frequencies and hierarchical structure
in language., Current Science 63 (1992) 297–306.
[64] I. Moreno-Sánchez, F. Font-Clos, A. Corral, Large-scale analysis of Zipf’s
law in English texts, PLOS ONE 11 (1) (2016) 1–19.
[65] C. F. Rowland, S. L. Fletcher, The effect of sampling on estimates of lexical
specificity and error rates, Journal of Child Language 33 (2006) 859–877.
[66] A. L. Theakston, E. V. M. Lieven, J. M. Pine, C. F. Rowland, The role of
performance limitations in the acquisition of verb-argument structure: an
alternative account, Journal of Child Language 28 (2001) 127–152.

[67] C. G. Wells, Learning through interaction: the study of language develop-
ment, Cambridge University Press, Cambridge, UK, 1981.
[68] L. Bloom, L. Hood, P. Lightbown, Imitation in language development: If,
when and why, Cognitive Psychology 6 (1974) 380–420.
[69] L. Bloom, P. Lightbown, L. Hood, M. Bowerman, M. Maratsos, M. P.
Maratsos, Structure and variation in child language, Monographs of the
Society for Research in Child Development (Serial no. 160) 40 (2) (1975)
1–97.
[70] L. Bloom, Language development: Form and function in emerging gram-
mars, MIT Press, Cambridge, MA, 1970.

[71] R. Brown, A first language: the early stages, Harvard University Press,
Cambridge, MA, 1973.

[72] S. Kuczaj, The acquisition of regular and irregular past tense forms, Journal
of Verbal Learning and Verbal Behavior 16 (1977) 589–600.
[73] CHILDES, American English Corpora. CHILDES. The Database Manu-
als. Available at http://childes.psy.cmu.edu/manuals/02englishusa.doc. Ac-
cessed 17 December 2012., TalkBank (2012).

[74] K. Demuth, J. Culbertson, J. Alter, Word-minimality, epenthesis, and coda
licensing in the acquisition of English, Language and Speech 49 (2006) 137–
174.
[75] J. Sachs, Talking about the there and then: the emergence of displaced ref-
erence in parent-child discourse, in: Children’s language, Vol. 4, Lawrence
Erlbaum Associates, Hillsdale, NJ, 1983, pp. 1–28.
[76] P. Suppes, The semantics of children’s language, American Psychologist 29
(1974) 103–114.
[77] G. Bol, F. Kuiken, Grammatical analysis of developmental language dis-
orders: A study of the morphosyntax of children with specific language
disorders, with hearing impairment and with Down’s syndrome, Clinical
Linguistics & Phonetics 4 (1) (1990) 77–86. doi:10.3109/02699209008985472.
URL http://dx.doi.org/10.3109/02699209008985472

[78] P. Fikkert, On the Acquisition of Prosodic Structure, no. 6, The Hague:
Holland Academic Graphics, 1994.
URL http://hdl.handle.net/2066/32125
[79] C. Levelt, On the Acquisition of Place, no. 8, The Hague: Holland Academic
Graphics, 1994.

[80] G. W. Bol, Implicational scaling in child language acquisition: The order of
production of Dutch verb constructions, in: M. Verrips, F. Wijnen (Eds.),
Amsterdam series in child language development: Vol. 3. Papers from the
Dutch-German Colloquium on Language Acquisition, Institute for General
Linguistics, Amsterdam, 1995, pp. 1–13.

[81] A. M. Schaerlaekens, The two-word sentence in child language, Mouton,
The Hague, 1973.
[82] J. Van Kampen, The learnability of the left branch condition, in: R. Bok-
Bennema, C. Cremers (Eds.), Linguistics in the Netherlands 1994, John
Benjamins, Amsterdam/Philadelphia, 1994, pp. 83–94.

[83] C. Aguirre, La adquisición de las categorías gramaticales en español, Edi-
ciones de la Universidad Autónoma de Madrid, 2000.
[84] E. Diez-Itza, Procesos fonológicos en la adquisición del español como lengua
materna, Actas del XI Congreso Nacional de Lingüística Aplicada (1995)
225–264.

[85] K. Hess Zimmermann, El desarrollo lingüístico en los años escolares: análi-
sis de narraciones infantiles, Ph.D. thesis, El Colegio de México, México
(2003).

[86] D. Jackson-Maldonado, D. Thal, Lenguaje y cognición en los primeros años
de vida, Project funded by the John D. and Catherine T. MacArthur Foun-
dation and CONACYT (1993).
[87] D. Thal, D. Jackson-Maldonado, Language and cognition in Spanish-
speaking infants and toddlers, Project funded by the John D. and Catherine
T. MacArthur Foundation (1993).
[88] J. Linaza, M. E. Sebastián, C. del Barrio, Lenguaje, comunicación y com-
prensión. La adquisición del lenguaje, Monografía de Infancia y Aprendizaje
(1981) 195–198.

[89] S. L. Ornat, A. Fernández, P. Gallo, S. Mariscal, La adquisición de la lengua
Española, Siglo XXI, Madrid, 1994.
[90] S. Romero, A. Santos, D. Pellicer, The construction of communicative com-
petence in Mexican Spanish speaking children (6 months to 7 years), Mexico
City: University of the Americas (1992).

[91] M. Shiro, Getting the story across: A discourse analysis approach to eval-
uative stance in Venezuelan children’s narratives, unpublished Doctoral
Dissertation. Harvard University (1997).
