-
Notifications
You must be signed in to change notification settings - Fork 2
/
05-simulations.Rmd
1121 lines (979 loc) · 60.5 KB
/
05-simulations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Factors Influencing Inference of Clonality in Diploid Populations
```{r, results = "asis", echo = FALSE}
out <- "
\\singlespacing
\\begin{center}
"
cat(beaverdown::iflatex(out))
```
Zhian N. Kamvar and Niklaus J. Grünwald
```{r, results = "asis", echo = FALSE}
out <- "
\\end{center}
\\vspace*{\\fill}"
cat(beaverdown::iflatex(out))
```
Target Journal: **Molecular Ecology**
```{r, results = "asis", echo = FALSE}
out <- "
\\doublespacing
\\newpage
"
cat(beaverdown::iflatex(out))
```
## Abstract
The index of association is a measure of multilocus linkage disequilibrium,
which reflects the deviation of observed genetic variation from expected. In
sexual populations, loci are randomly assorting due to recombination, resulting
in a near-zero value of the index of association. In clonal populations,
recombination is non-existent, meaning that loci are passed from parent to
offspring in a non-independent fashion, resulting in a significantly non-zero
value of the index of association. We build on previous work by investigating
the effect of sample size, mutation rate, and clone correction on the power of
the index of association to detect clonal reproduction in simulated data sets
generated with microsatellite and genomic markers. Our findings suggest that
power decreases with small sample sizes and low allelic diversity, and physical
linkage in genomic markers does not affect power if permutation tests are
performed with linkage groups. We hope that these novel insights provide useful
to the study of molecular pathogens.
## Introduction
Population genetic theory is largely based on model populations such as the
neutral Wright-Fisher model in which populations are assumed to be infinitely
large, with discrete generations, randomly assorting alleles, with no migration
and no mutation [@hartl1997principles; @nielsen2013introduction]. By using such
reductionist models, population geneticists are able to reduce the complexity
and enable analyses to ask fundamental questions and test hypotheses about
evolutionary processes that could explain population structure and history.
These neutral models, however, cannot be applied to populations whose life
history violate the fundamental assumption of random assortment of alleles, such
as populations that undergo clonal reproduction [@orive1993effective;
@milgroom1996recombination]. For many clonal populations, the contribution of
genetic variation from mutation is greater than that of recombination. While
this increases the risk of the accumulation of deleterious mutations, the
two-fold cost of sex is drastically reduced, meaning that selectively
advantageous combinations of alleles are maintained [@heitman2012evolution].
Pathogenic microorganisms can reproduce sexually or clonally
[@tibayrenc1996towards; @milgroom1996recombination]. Diseases caused by these
organisms are in part managed by the use of antimicrobial compounds that kill
these organisms. Detecting recombination in populations of pathogenic
microorganisms is therefore important for the implementation of rational
management strategies as a prevalence of sexual reproduction could lead to the
repeated emergence of novel, resistant genotypes [@smith1993how;
@milgroom1996recombination; @goss2014irish; @de2006molecular;
@nieuwenhuis2016frequency].
Several studies have attempted to infer the degree of sex in populations that
undergo clonal reproduction [@smith1993how; @balloux2003population;
@de2004clonal; @ali2016cloncase; @nieuwenhuis2016frequency]. For populations
with well-defined sexual and clonal phases occurring at separate times, such as
rust fungi, methods like *CloNcaSe* are effective for estimating the rate of
sexual reproduction and effective population size [@ali2016cloncase]. However,
this method cannot be applied to populations where the reproductive cycle is not
partitioned into discrete generations.
Simply detecting the presence of clonal reproduction, however can be useful in
and of itself [@milgroom1996recombination]. A method commonly used to assess
this is the index of association ($I_A$), and its standardized version,
$\bar{r}_d$, which measure multilocus linkage disequilibrium
[@brown1980multilocus; @smith1993how; @haubold1998detecting; @Agapow_2001;
@de2004clonal; @kamvar2014poppr]. The value of $I_A$, as shown in equation
\@ref(eq:ia), is measured as the ratio of observed variance ($V_O$) and expected
variance ($V_E$) in genetic distance between samples [@smith1993how;
@Agapow_2001]:
\begin{equation}
I_A = \frac{V_O}{V_E} - 1 (\#eq:ia)
\end{equation}
The expected variance is practically modeled as the sum of the variances over
*m* loci: $V_E = \sum^m{var_j}$ [@Agapow_2001; @haubold1998detecting]. If the
differences between samples are randomly distributed (linkage equilibrium), we
can expect the value of $I_A$ to be zero [@smith1993how; @Agapow_2001]. Under
scenarios of non-random mating (e.g. population structure or clonal
reproduction), the observed variance would be greater than the expected variance
due to a multi-modal distribution of distances, and $I_A$ would be greater than
zero [@smith1993how; @Agapow_2001; @milgroom2015population]. @Agapow_2001 noted
that this metric does not have an upper limit and increases with the number of
loci. To correct this, they developed $\bar{r}_d$ (equation \@ref(eq:rd)), which
has a similar structure to a correlation coefficient and ranges from 0 (no
linkage) to 1 (complete linkage):
\begin{align} % This mode aligns the equations to the '&=' signs
\begin{split} % This mode groups the equations into one.
\bar{r}_d &= \frac{\sum\sum{cov_{j,k}}}{
\sum\sum{\sqrt{var_{j} \cdot var_{k}}}} \\
&= \frac{V_O - V_E}{2\sum\sum{\sqrt{var_{j} \cdot var_{k}}}}
\end{split}
(\#eq:rd)
\end{align}
De Meeûx & Balloux [-@de2004clonal] investigated the effect of increasing levels
of sexual reproduction on $\bar{r}_d$ (noted in their publication as
$\bar{r}_D$). They found that very little ($\sim$ 1%) sexual reproduction is
required to produce a value of $\bar{r}_d$ close to zero. This indicates that
$\bar{r}_d$ alone might not be well suited as a measure of clonal reproduction.
@prugnolle2010apparent tested the effect of sampling design on $\bar{r}_d$,
finding that its value is drastically reduced when clones from multiple
populations are sampled, leading to an over-estimation of the level of
recombination.
These studies laid the groundwork for understanding the behavior of $\bar{r}_d$
under different scenarios of non-random mating in diploid organisms, but there
were some limitations in available technology. For both studies, the only
software available for calculation of $\bar{r}_d$ for diploid organisms was
<span style="font-variant:small-caps;">Multilocus</span>, which could only take
one data set at a time [@Agapow_2001; @de2004clonal; @prugnolle2010apparent;
@kamvar2014poppr]. This constrained researchers to only analyze a minimal set of
20 populations per scenario.
A non-zero value of $I_A$ and $\bar{r}_d$ does not always indicate a significant
departure from the null assumption of unlinked loci. Since the distribution of
$I_A$ and $\bar{r}_d$ are not known, the safest way to test for significance are
random permutation tests. These tests effectively create unlinked populations by
shuffling individuals at each locus, independently and re-calculating $I_A$ and
$\bar{r}_d$ [@smith1993how; @Agapow_2001; @haubold1998detecting]. An upper
one-sided test of significance is then used to see if the observed statistic is
greater than the observed distribution.
Standard practice for analyzing microbial populations is to perform this test on
both the whole data set (wd) and clone-corrected data (cc), where each
multilocus genotype is represented only once per population to avoid signatures
of linkage that arise from re-sampling the same individual [@goss2014irish;
@milgroom1996recombination; @mcdonald1997population]. If the p-value for
$\bar{r}_d$ is significant after clone-correction, then the population is
expected to be clonal. While significance testing is available in
<span style="font-variant:small-caps;">Multilocus</span> in the form of random
permutations, it is computationally expensive, and can only take one data set at
a time [@Agapow_2001; @kamvar2014poppr]. As a result, power analysis of
$\bar{r}_d$ to detect clonal reproduction with and without clone-correction has
not yet been performed [@de2004clonal].
In the years since the studies conducted by @de2004clonal and
@prugnolle2010apparent, reduced-representation, high-throughput sequencing
methods such as Genotyping-By-Sequencing (GBS) and Restriction site associated
DNA sequencing (RAD-seq) have become popular tools for population genetic
analysis [@davey2010rad; @davey2011genome; @elshire2011robust]. These methods
have the capability to generate thousands of unlinked markers at a fraction of
the cost and time necessary to develop high quality microsatellite markers.
These marker systems are also prone to missing data and high error rates
[@mastretta2015restriction]. The index of association was developed for multiple
loci at a time when obtaining even 100 unlinked markers posed a significant
challenge. With the advent of these technologies, it is unclear how marker
choice and genotyping error affect the index of association.
We developed the R package *poppr* for analysis of clonal populations, removing
the limitations of data input and computational expense of analyzing the index
of association [@kamvar2014poppr] and further expanded this to analysis of
genome-wide SNP data [@R2016; @kamvar2015novel]. With these tools we expand on
previous studies by asking how sample size, marker choice, clone-correction, and
the assumption of homogeneous mutation rates affect our ability to detect clonal
reproduction in diploid populations. Our objectives to answer these questions
are to (1) re-analyze $\bar{r}_d$ against increasing rates of sexual
reproduction in both microsatellite and SNP data sets, (2) perform a power
analysis of $\bar{r}_d$ to assess sensitivity and specificity, and (3) assess
how genotypic and allelic evenness and diversity affects $\bar{r}_d$. Because
studies have observed significantly negative values of $I_A$ and $\bar{r}_d$ (p
$\geq$ 0.95), we additionally seek to determine what factors result in negative
$\bar{r}_d$ values. This work provides novel insights into the sensitivity and
scope of the index of association for inferring clonality.
## Methods
We used simulations to evaluate the behavior of $\bar{r}_d$ under different
population scenarios. Initial sets of simulations were created for different
levels of sexual reproduction for each marker type. All simulations were
performed with the python package simuPOP version 1.1.7 in python version 3.4
[@peng2008forward]. For each scenario, 100 simulations with 10 replicates were
created with a census size of 10,000 diploid individuals with equal mating type
proportions evolved over 10,000 generations. We chose a census size of 10,000
individuals to minimize the effect of drift. We chose to run the simulations for
10,000 generations because this was previously shown in @de2004clonal to be
sufficient in reaching equilibrium for summary statistics.
The simulated populations were first stored in the native simuPOP format and
then transferred to feather format using the python and R packages *feather*
version 0.3.0 for downstream analyses. Microsatellite simulations were performed
on Ubuntu Linux version 14.04; SNP simulations were performed on CentOS Linux
version 6.8. During downstream analysis, 10, 25, 50, and 100 individuals were
sampled without replacement for each replicate in R version 3.2 with the package
*poppr* version 2.3.0 [@R2016;@kamvar2014poppr; @kamvar2015novel]. Analyses
(described below) were performed on both full and clone-corrected data sets. All
downstream analyses were run on the OSU CGRB Core Computing Facility.
### Simulating Microsatellite Loci
Each population was simulated with 21 co-dominant, unlinked loci containing 6 to
10 alleles per locus with frequencies drawn from a uniform distribution and
subsequently normalized so that allele frequencies at each locus sum to one.
We used 6 to 10 alleles per locus as this is the number of alleles commonly used
for population genetic studies to avoid statistical noise. Before mating,
mutations occurred at each locus at a rate of 1e-5 mutations/generation with the
exception of the first locus, at which the mutation rate was set to 1e-3.
The mutation rate of 1e-5 was selected as this was previously used in
@de2004clonal and is close to the estimated recombination rate of *Saccharomyces
cerevisiae* microsatellite loci [@lynch2008genome]. All mutations were applied
in a stepwise manner using the `StepwiseMutator()` operator in simuPOP.
### Simulating SNP Loci
Simulations of 10,000 diploid, biallelic loci spread evenly over 10 chromosomal
fragments were simulated with a mutation rate of 1e-5 mutations per generation
for forward and backward mutations using the `SNPMutator()` operator and and a
recombination rate of 0.01 between adjacent loci using the `Recombinator()`
operator in simuPOP.
### Mating
Simulations of sexual reproduction were conducted at 10 rates of sexual
reproduction over a log scale (0.0, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 0.1,
0.5, 1.0) reflecting the fraction of individuals in generation *t+1* produced
via sexual reproduction. One to three offspring could be produced at each mating
event. Rates were chosen to investigate the dynamics of $\bar{r}_d$ at low
levels of clonal reproduction. For sexual events, two parents were chosen
randomly from the population with the `RandomSelection()` operator and offspring
genotypes were created via the `MendelianGenoTransmitter()` operator. The clonal
fraction was created by randomly sampling individuals from the population and
duplicating their genotypes with the `CloneGenoTransmitter()` operator. If one
mating type was lost before 10,000 generations, the simulation would continue to
completion with only clonal reproduction.
### Analysis of Microsatellite Data
The standardized index of association ($\bar{r}_d$, @Agapow_2001) was calculated
for full and clone-corrected data using the *poppr* function `ia()`
[@kamvar2014poppr]. Tests for significance were performed by randomly permuting
the alleles at each locus independently and then assessing $\bar{r}_d$. A total
of 999 permutations were conducted for each replicate population to evaluate
statistical support. Analyses were done for both full and clone-corrected data.
The p-values reflect the proportion of observations greater than the observed
statistic. Estimates of genotypic diversity were assessed with the *poppr*
function `diversity_boot()` with 999 bootstrap replicates, recording the
estimate and variance [@kamvar2015novel]. The genotypic diversity statistics we
calculated were Shannon's Index, $H = -\sum p_i ln p_i$
[@shannon2001mathematical], Stoddart and Taylor's Index, $G = 1/\sum p_i^2$
[@stoddart1988genotypic], and Evenness, $E_5 = (G - 1)/(e^H - 1)$
[@pielou1975ecological] where $p_i$ is the frequency of the *i*th genotype. We
additionally calculated Nei's expected heterozygosity, also known as gene
diversity [@Nei:1978], and mean allelic evenness $E_{5A} = (1/m) \sum E_{5l}$,
where *m* is the number of loci and $E_{5l}$ is the evenness of the alleles at
locus *l* with the *poppr* function `locus_table()` [@grunwald2003analysis;
@kamvar2014poppr].
### Analysis of SNP Data
The overall value of $\bar{r}_d$ was calculated for each simulation with the
*poppr* function `bitwise.ia()` [@kamvar2015novel]. Significance was first
assessed by randomly shuffling genotypes at each locus independently and then,
to preserve existing background linkage structure, at each chromosome
independently. This was done 999 times for each replicate population. P-values
represent the proportion of random samples greater than or equal to the observed
statistic.
Because GBS data are associated with high error rates, we additionally wanted to
assess the effect of missing data on analysis. To do this, we used scripts
written for @kamvar2015novel to randomly insert missing data via the `pop_NA()`
function [@kamvar2015poppr2supp] at rates of 1%, 5%, and 10% each.
### Power Analysis
Permutation analysis of the index of association is commonly used to calculate
statistical support for detecting non-random mating. If the observed value of
$\bar{r}_d$ is greater than 95% of the results from the permutations, then the
null hypothesis of linkage equilibrium is rejected. The power or sensitivity of
this test can be seen as the fraction of significant results within a set of
simulation parameters when the rate of sexual reproduction is < 1.
The Receiver Operating Characteristic (ROC) curve is a statistical method of
assessing the balance between sensitivity and specificity of a diagnostic method
[@metz1978basic]. This is done by simultaneously assessing the true positive
fraction of tests to a false positive fraction along a gradient of thresholds of
increasing leniency. A simple way of thinking about ROC analysis is to imagine
two factories that both make marbles. On average, factory A makes marbles half
the size of factory B. Both factories manufacture for the same distributor,
which mixes the marbles from both factories. You are tasked with finding the
best sieve that can accurately and precisely separate the marbles. To measure
this quantity, you order a set of orange marbles from factory A and a set of
blue marbles from factory B. To calculate the ROC curve, you would take a range
of sieves and stack them such that the one with the largest mesh size is on top
and the one with the smallest mesh size is on the bottom and pour both orange
and blue marbles through. The fractions of orange and blue marbles that passed
through each sieve are the true positive and false positive fractions,
respectively. Plot the true positive fraction against the false positive
fraction for each sieve respectively, and you can assess how well the sieve
method works by examining the shape of the ROC curve.
A useful summary of the ROC curve is to calculate the area under the ROC curve
(AURC). Briefly, if a method has perfect explanatory power, the area under the
ROC curve will be equal to 1. If a method has no explanatory power, the AURC
will be equal to 0.5. We used this method to assess the efficacy of $\bar{r}_d$
to detect non-random mating.
To calculate the ROC curve, we first define what a positive value is. If we
consider the p-value of $\bar{r}_d$ as a classifier to determine clonality, we
can define the false positive fraction as the fraction of observations where $p
\leq \alpha$ for simulations where the rate of sexual reproduction is set to 1.0
(Table \@ref(tab:simtab1)). In contrast, the true positive fraction is the
fraction of observations where $p \leq \alpha$ in simulations where clonality is
introduced (e.g. all simulations where the rate of sexual reproduction is less
than 1.0).
Reproductive Mode $p \leq \alpha$ $p > \alpha$
------------------- ----------------- ---------------
Non-Random Mating True Positive False Negative
Random Mating False Positive True Negative
Table: (\#tab:simtab1) Definitions of false positive and true positive values
for ROC analysis of simulations. *p* is the p-value for $\bar{r}_d$, $\alpha$ is
a threshold value in [0, 1]. Randomly mating populations are simulated with a
sex rate of 1. Non-Random mating populations are simulated with a sex rate less
than one.
We constructed each ROC curve by plotting the true positive fraction on the y
axis and the false positive fraction on the x axis with values of $\alpha$ in
increments of 0.01 from 0 to 1. Curves were calculated hierarchically by rate of
sexual reproduction (< 1) and a unique seed was used to generate the
populations. For each hierarchical level, separate curves were calculated for
sample size, mutation rate, and clone-correction. The areas under the ROC curves
(AURC) were calculated using the `auc()` function in the R package *flux*
version 0.3-0 [@jurasinski2014flux].
Because simulations across rates of sexual reproduction were generated from
unique seeds, we additionally constructed ROC curves for each seed with respect
to clone-correction, sample size, and mutation rate. To test for significant
effects of clone-correction, sample size, and mutation rate on the inference of
non-random mating on microsatellite data, we then separated the data by rate of
sexual reproduction and performed a three way (type III) ANOVA on each rate
separately with formula \@ref(eq:anova). For SNP data, we did not have clone
correction or mutation rate differences, so we set these variables to 1.
\begin{equation}
AURC \sim Clone Correction \times Sample Size \times Mutation Rate (\#eq:anova)
\end{equation}
Because of a pattern that we saw in the results of $E_{5A}$ for microsatellite
data, we asked whether or not using a value of $E_{5A} \geq 0.85$ could help
detect clonal reproduction even if the p-value was non-significant. To assess
this, we conducted an additional ROC analysis to condition p-values on $E_{5A}$
such that non-significant values that also had $E_{5A} \geq 0.85$ would be
considered significant.
## Results
### Clone Correction Negatively Impacts Clonal Inference
For microsatellite data, the values of $\bar{r}_d$ for completely clonal and
nearly clonal populations ranged from -0.577 to 1 in completely clonal data sets
(Fig. \@ref(fig:sim1)). Genotypic diversity statistics (Fig. \@ref(fig:simdiv),
top six panels) showed a predictable pattern of genotypic diversity increasing
with sexual reproduction with diversity being higher on average for clonal
populations with uneven mutation rates as compared to even mutation rates.
Allelic diversity statistics (Fig. \@ref(fig:simdiv), bottom four panels) showed
few differences in means due to mutation rate, but $E_{5A}$ showed consistent
differences in rate of sexual reproduction and sample size. Both $E_{5A}$ and
$H_{exp}$ showed higher values for populations where the rate of sexual
reproduction is < 0.1%. Not shown in the graph is a bimodal pattern in the
$H_{exp}$ distributions with a rate of sexual reproduction < 0.1%.
The power (p <= 0.01) to detect non-random mating decreases significantly at low
levels of sexual reproduction (Fig. \@ref(fig:sim4)). This is further affected
by both mutation rate, sample size, and clone correction. Clone correction,
however, appears to affect even mutation rates more strongly than uneven
mutation rates. For all scenarios the power to detect non-random mating drops to
below 0.25 with rates of sexual reproduction greater than 1%. This trend is only
moderately improved when the threshold is raised to p <= 0.05 (data not shown).
There is less power to detect non-random mating at 0% and 0.01% sex as compared
to 0.05% and 0.1% sex. This appears to be due to the observed negative values of
$\bar{r}_d$ that result in a p-value of 1. At p <= 0.01, we observe a false
positive rate of 1.5% in the worst case scenario.
```{r sim1, echo = FALSE, fig.cap = figcap1, fig.scap = figscap1}
if (LATEX){
knitr::include_graphics("figure/simulations/rd_sexrate.pdf")
} else {
knitr::include_graphics("figure/simulations/rd_sexrate.png")
}
cap <- "Effect of rate of sexual reproduction, sample size (n), mutation rate, and
clone-correction on $\\bar{r}_d$ for microsatellite data. Violin plots represent
data sets simulated with even and uneven mutation rates over 20 and 21 loci,
respectively (columns) and sub-sampled in four different population sizes
(rows). Color indicates whether or not the calculations were performed on whole
or clone-corrected data sets. Each violin plot contains 1000 unique data sets.
Black lines in violins mark the 25, 50, and 75th percentile."
scap <- "Effect of rate of sexual reproduction, sample size, mutation rate, and
clone-correction on $\\bar{r}_d$ for microsatellite data."
figcap1 <- beaverdown::render_caption(caption = cap, figname = "fig1", to = TO)
if (!LATEX) figcap1 <- gsub("@ref", "\\\\@ref", figcap1)
figscap1 <- beaverdown::render_caption(caption = scap, figname = "fig1s", to = TO)
```
```{r simdiv, echo = FALSE, fig.cap = figcapdiv, fig.scap = figscapdiv}
if (LATEX){
knitr::include_graphics("figure/simulations/diversity_stats.pdf")
} else {
knitr::include_graphics("figure/simulations/diversity_stats.png")
}
cap <- "Effect rate of sexual reproduction and sample size (n) on genotypic and
allelic diversity for microsatellite data. Genotypic diversity:
H = Shannon-Wiener index, $E_5$ = evenness, G = Stoddart and Taylor's index.
Allelic Diversity: $E_{5A}$ = allelic evenness, $H_{exp}$ = Nei's expected
heterozygosity (gene diversity). Allelic diversity estimates calculated for
clone-corrected data. Points indicate means and lines extend to to standard
deviations on either side of the mean. Color/shading indicates sample size.
Vertical axis on different scales."
figscapdiv <- "The effect rate of sexual reproduction and sample size on
genotypic and allelic diversity for microsatellite data."
figcapdiv <- beaverdown::render_caption(caption = cap, figname = "figdiv", to = TO)
if (!LATEX) figcapdiv <- gsub("@ref", "\\\\@ref", figcapdiv)
```
```{r, results = "asis", echo = FALSE}
out <- "
\\newpage
"
cat(beaverdown::iflatex(out))
```
```{r sim4, echo = FALSE, fig.cap = figcap3, fig.scap = figscap3, out.width = "100%"}
if (LATEX){
knitr::include_graphics("figure/simulations/ssr_power.pdf")
} else {
knitr::include_graphics("figure/simulations/ssr_power.png")
}
figcap3 <- "Effect of rate of sexual reproduction, sample size (n), mutation rate,
and clone-correction on the power of permutation analysis for microsatellite
data. Power (true positive rate) is on the vertical axis. Values are shown for p
= 0.01. Plots are arranged in a grid with clone-correction in rows and mutation
rate in columns. Color indicates sample size."
figscap3 <- "Effect of rate of sexual reproduction, sample size, mutation rate,
and clone-correction on the power of permutation analysis for microsatellite
data."
```
```{r simroc, echo = FALSE, fig.cap = figcaproc, fig.scap = figscaproc}
if (LATEX){
knitr::include_graphics("figure/simulations/ROC_Curve.pdf")
} else {
knitr::include_graphics("figure/simulations/ROC_Curve.png")
}
figcaproc <- "Effect of rate of sexual reproduction, sample size (n), mutation
rate, clone-correction, and $E_{5A}$ augmentation on ROC analysis of
$\\bar{r}_d$. Rate of sexual reproduction arranged in rows. Clone corrected (cc)
and whole (wd) data sets with even and uneven mutation rates arranged in
columns. Line type indicates the use of $E_{5A}$ > 0.85 to augment the analysis.
Line shade indicates sample size. A dotted line of unity is shown in each plot.
Each ROC curve is calculated over 100 values of $\\alpha$ from 0 to 1."
figcaproc <- beaverdown::render_caption(caption = figcaproc, figname = "fig1", to = TO)
figscaproc <- "Effect of rate of sexual reproduction, sample size, mutation
rate, clone-correction, and $E_{5A}$ augmentation on ROC analysis of
$\\bar{r}_d$."
figscaproc <- beaverdown::render_caption(caption = figscaproc, figname = "fig1", to = TO)
```
```{r, results = "asis", echo = FALSE}
out <- "
\\newpage
"
cat(beaverdown::iflatex(out))
```
The ROC and AURC analyses showed a steady decrease in the ability to detect non-
random mating with increasing levels of sexual reproduction, mainly due to a
loss in sensitivity with increasing levels of $\alpha$ (Fig. \@ref(fig:simroc),
\@ref(fig:sim2)). As seen in the power analysis (Fig. \@ref(fig:sim4)), clone
correction consistently appears to lower the power to detect linkage based on
the AURC. For all scenarios, the diagnostic power of $\bar{r}_d$ to detect
non-random mating is no better than random at rates of sexual reproduction
greater than 10%.
A three-way ANOVA adds support for the hypothesis that clone correction
significantly affects the diagnostic power of $\bar{r}_d$ (Fig. \@ref(fig:sim3),
Table \@ref(tab:sim3)). While there are no significant effects at 50% sex, clone
correction, mutation rate, and sample size all have a significant effect on
$\bar{r}_d$'s diagnostic properties. At low levels of sex (< 1%), the
interaction of clone correction and mutation rate shows a significant effect,
which is not seen in the interaction of clone correction and sample size. Only
at sex rates of 0.1% and 0.05% do we see significant effects across all
interactions.
```{r sim2, echo = FALSE, fig.cap = figcap2, fig.scap = figscap2}
if (LATEX){
knitr::include_graphics("figure/simulations/AURC_box_plot.pdf")
} else {
knitr::include_graphics("figure/simulations/AURC_box_plot.png")
}
figcap2 <- "Effect of rate of sexual reproduction, sample size (n), mutation rate,
and clone-correction on area under the ROC curve. Box plots showing the
distribution of the AURC for each independent seed. Boxes span the interquartile
range (IQR) and the whiskers extend to 1.5 * IQR. An AURC of 1 indicates a
perfect classifier, while an AURC of 0.5 indicates the classifier is no better
than a random guess. Any values below 0.5 (shaded in grey) are considered worse
than random."
figscap2 <- "Effect of rate of sexual reproduction, sample size, mutation
rate, and clone-correction on area under the ROC curve."
```
```{r, results = "asis", echo = FALSE}
out <- "
\\newpage
"
cat(beaverdown::iflatex(out))
```
```{r sim3, echo = FALSE, fig.cap = figcap3, fig.scap = figscap3, out.width = "100%"}
if (LATEX){
knitr::include_graphics("figure/simulations/AURC_ANOVA.pdf")
} else {
knitr::include_graphics("figure/simulations/AURC_ANOVA.png")
}
figcap3 <- "P-values from ANOVA analysis assessing AURC distributions with the
model $AURC \\sim Clone Correction \\times Sample Size \\times Mutation Rate$
for each rate of sexual reproduction, separately. P-values are represented both
as colors and numbers."
figcap3 <- beaverdown::render_caption(caption = figcap3, figname = "fig1", to = TO)
figscap3 <- "P-values from ANOVA analysis assessing AURC distributions with the
model $AURC \\sim Clone Correction \\times Sample Size \\times Mutation Rate$
for each rate of sexual reproduction, separately."
figcap3 <- beaverdown::render_caption(caption = figscap3, figname = "fig1", to = TO)
```
sexrate term sumsq df statistic p.value
-------- ------------- ------- ----- ----------- --------
0.0000 (Intercept) 63.936 1 10569.674 0.000
CC 0.855 1 141.309 0.000
SS 0.060 3 3.289 0.020
MR 0.556 1 91.913 0.000
CC X SS 0.001 3 0.049 0.985
CC X MR 0.174 1 28.747 0.000
SS X MR 0.002 3 0.100 0.960
CC X SS X MR 0.000 3 0.006 0.999
Residuals 9.582 1584 NA NA
0.0001 (Intercept) 70.972 1 14652.094 0.000
CC 0.496 1 102.400 0.000
SS 0.027 3 1.886 0.130
MR 0.340 1 70.257 0.000
CC X SS 0.009 3 0.596 0.617
CC X MR 0.076 1 15.698 0.000
SS X MR 0.001 3 0.077 0.972
CC X SS X MR 0.003 3 0.241 0.868
Residuals 7.673 1584 NA NA
0.0005 (Intercept) 91.939 1 245382.751 0.000
CC 0.072 1 191.181 0.000
SS 0.089 3 78.753 0.000
MR 0.050 1 132.414 0.000
CC X SS 0.047 3 42.205 0.000
CC X MR 0.027 1 72.663 0.000
SS X MR 0.035 3 31.105 0.000
CC X SS X MR 0.018 3 15.818 0.000
Residuals 0.593 1584 NA NA
0.0010 (Intercept) 86.202 1 157278.033 0.000
CC 0.224 1 408.905 0.000
SS 0.269 3 163.853 0.000
MR 0.113 1 206.697 0.000
CC X SS 0.135 3 82.053 0.000
CC X MR 0.060 1 109.071 0.000
SS X MR 0.067 3 40.994 0.000
CC X SS X MR 0.034 3 20.426 0.000
Residuals 0.868 1584 NA NA
0.0050 (Intercept) 52.266 1 14127.572 0.000
CC 0.936 1 253.111 0.000
SS 2.592 3 233.531 0.000
MR 0.138 1 37.393 0.000
CC X SS 0.234 3 21.126 0.000
CC X MR 0.028 1 7.675 0.006
SS X MR 0.026 3 2.385 0.068
CC X SS X MR 0.022 3 1.974 0.116
Residuals 5.860 1584 NA NA
0.0100 (Intercept) 39.376 1 5724.212 0.000
CC 0.686 1 99.672 0.000
SS 2.455 3 118.955 0.000
MR 0.068 1 9.817 0.002
CC X SS 0.173 3 8.380 0.000
CC X MR 0.003 1 0.378 0.539
SS X MR 0.051 3 2.488 0.059
CC X SS X MR 0.056 3 2.720 0.043
Residuals 10.896 1584 NA NA
0.0500 (Intercept) 28.671 1 1903.303 0.000
CC 0.072 1 4.806 0.029
SS 0.239 3 5.293 0.001
MR 0.001 1 0.080 0.778
CC X SS 0.391 3 8.646 0.000
CC X MR 0.000 1 0.032 0.858
SS X MR 0.037 3 0.809 0.489
CC X SS X MR 0.001 3 0.028 0.994
Residuals 23.861 1584 NA NA
0.1000 (Intercept) 26.010 1 1441.495 0.000
CC 0.009 1 0.524 0.469
SS 0.131 3 2.419 0.065
MR 0.000 1 0.001 0.979
CC X SS 0.162 3 2.984 0.030
CC X MR 0.000 1 0.013 0.908
SS X MR 0.014 3 0.256 0.857
CC X SS X MR 0.001 3 0.015 0.998
Residuals 28.581 1584 NA NA
0.5000 (Intercept) 25.669 1 1489.342 0.000
CC 0.001 1 0.033 0.857
SS 0.042 3 0.806 0.491
MR 0.002 1 0.099 0.753
CC X SS 0.001 3 0.025 0.995
CC X MR 0.000 1 0.006 0.939
SS X MR 0.004 3 0.076 0.973
CC X SS X MR 0.001 3 0.011 0.998
Residuals 27.301 1584 NA NA
Table: (\#tab:sim3) ANOVA table comparing the model $AURC \sim Clone Correction
\times Sample Size \times Mutation Rate$ for each rate of sexual reproduction
separately. Columns: sexrate = rate of sexual reproduction, term = ANOVA term,
sumsq = sum of squares, df = degrees of freedom, statistic = F statistic,
p.value = p-value, Terms are as follows: CC = Clone Correction, SS = Sample
Size, MR = Mutation Rate.
### Negative Values of $\bar{r}_d$ in Clonal Populations Correlated With Low Allelic Diversity
When using the criterion of $E_{5A} \geq 0.85$ to augment the criteria for the
ROC analysis, we found that the power to detect non-random mating increased
significantly whereas the false positive rate appeared to only increase with
low sample sizes (Fig. \@ref(fig:sim4a), \@ref(fig:sim2a)). Analyzing the
distributions of the AURC with ANOVA, we found that the most consistently
significant effect found was that of sample size (Fig. \@ref(fig:sim3a), Tab.
\@ref(tab:sim4)) (albeit much like Fig. \@ref(fig:sim3), all effects and
interactions were significant for 0.1% and 0.05% sex).
```{r sim4a, echo = FALSE, fig.cap = figcap3, fig.scap = figscap3, out.width = "100%"}
if (LATEX){
knitr::include_graphics("figure/simulations/ssr_power_ea.pdf")
} else {
knitr::include_graphics("figure/simulations/ssr_power_ea.png")
}
figcap3 <- "Effect of rate of sexual reproduction, sample size (n), mutation rate,
and clone-correction on the power of permutation analysis for microsatellite
data taking into account allelic evenness. Plots are arranged in a grid with
clone-correction in rows and mutation rate in columns. Values are shown for
$\\alpha$ = 0.01 and $E_{5A} \\geq 0.85$"
figcap3 <- beaverdown::render_caption(caption = figcap3, figname = "fig1", to = TO)
figscap3 <- "Effect of rate of sexual reproduction, sample size, mutation rate,
and clone-correction on the power of permutation analysis for microsatellite
data taking into account allelic evenness."
```
```{r sim2a, echo = FALSE, fig.cap = figcap2a, fig.scap = figscap2a}
if (LATEX){
knitr::include_graphics("figure/simulations/AURC_box_plot_ea.pdf")
} else {
knitr::include_graphics("figure/simulations/AURC_box_plot_ea.png")
}
figcap2a <- "Effect of rate of sexual reproduction, sample size (n), mutation rate,
and clone-correction on area under the ROC curve taking into account allelic
evenness. Box plots showing the distribution of the area under the ROC Curve for
each independent seed. Boxes span the interquartile range (IQR) and the whiskers
extend to 1.5 * IQR. An AURC of 1 indicates a perfect classifier, while an AURC
of 0.5 indicates the classifier is no better than a random guess. Any values
below 0.5 (shaded in grey) is considered worse than random."
figscap2a <- "Effect of rate of sexual reproduction, sample size, mutation rate,
and clone-correction on area under the ROC curve taking into account allelic
evenness."
```
```{r sim3a, echo = FALSE, fig.cap = figcap3a, fig.scap = figscap3a, out.width = "100%"}
if (LATEX){
knitr::include_graphics("figure/simulations/AURC_ANOVA_ea.pdf")
} else {
knitr::include_graphics("figure/simulations/AURC_ANOVA_ea.png")
}
figcap3a <- "P-values from ANOVA analysis on AURC distributions with the model
$AURC \\sim Clone Correction \\times Sample Size \\times Mutation Rate$ for each
rate of sexual reproduction separately taking into account allelic evenness.
P-values are represented both as colors and numbers."
figcap3a <- beaverdown::render_caption(caption = figcap3a, figname = "fig1", to = TO)
figscap3a <- "P-values from ANOVA analysis on AURC distributions with the model
$AURC \\sim Clone Correction \\times Sample Size \\times Mutation Rate$ for each
rate of sexual reproduction separately taking into account allelic evenness."
figscap3a <- beaverdown::render_caption(caption = figscap3a, figname = "fig1", to = TO)
```
sexrate term sumsq df statistic p.value
-------- ------------- ------- ----- ----------- --------
0.0000 (Intercept) 96.914 1 169384.022 0.000
CC 0.001 1 0.952 0.329
SS 0.007 3 4.365 0.005
MR 0.000 1 0.268 0.605
CC X SS 0.001 3 0.638 0.591
CC X MR 0.000 1 0.013 0.908
SS X MR 0.001 3 0.704 0.550
CC X SS X MR 0.000 3 0.104 0.958
Residuals 0.906 1584 NA NA
0.0001 (Intercept) 96.531 1 202595.988 0.000
CC 0.001 1 2.838 0.092
SS 0.009 3 6.345 0.000
MR 0.000 1 0.823 0.365
CC X SS 0.001 3 0.404 0.750
CC X MR 0.000 1 0.819 0.366
SS X MR 0.000 3 0.178 0.911
CC X SS X MR 0.000 3 0.112 0.953
Residuals 0.755 1584 NA NA
0.0005 (Intercept) 91.136 1 158601.676 0.000
CC 0.051 1 89.102 0.000
SS 0.106 3 61.325 0.000
MR 0.036 1 61.799 0.000
CC X SS 0.033 3 19.237 0.000
CC X MR 0.019 1 33.262 0.000
SS X MR 0.024 3 14.139 0.000
CC X SS X MR 0.012 3 7.046 0.000
Residuals 0.910 1584 NA NA
0.0010 (Intercept) 84.686 1 113261.634 0.000
CC 0.211 1 282.532 0.000
SS 0.339 3 151.115 0.000
MR 0.103 1 137.833 0.000
CC X SS 0.126 3 56.324 0.000
CC X MR 0.053 1 71.367 0.000
SS X MR 0.061 3 26.976 0.000
CC X SS X MR 0.030 3 13.161 0.000
Residuals 1.184 1584 NA NA
0.0050 (Intercept) 51.087 1 13448.247 0.000
CC 0.888 1 233.702 0.000
SS 2.812 3 246.742 0.000
MR 0.127 1 33.301 0.000
CC X SS 0.226 3 19.851 0.000
CC X MR 0.024 1 6.284 0.012
SS X MR 0.031 3 2.682 0.045
CC X SS X MR 0.025 3 2.186 0.088
Residuals 6.017 1584 NA NA
0.0100 (Intercept) 38.906 1 5499.579 0.000
CC 0.634 1 89.610 0.000
SS 2.558 3 120.521 0.000
MR 0.070 1 9.912 0.002
CC X SS 0.182 3 8.581 0.000
CC X MR 0.003 1 0.443 0.506
SS X MR 0.048 3 2.270 0.079
CC X SS X MR 0.054 3 2.531 0.056
Residuals 11.206 1584 NA NA
0.0500 (Intercept) 28.404 1 1861.840 0.000
CC 0.064 1 4.165 0.041
SS 0.274 3 5.996 0.000
MR 0.001 1 0.072 0.788
CC X SS 0.401 3 8.760 0.000
CC X MR 0.001 1 0.049 0.825
SS X MR 0.035 3 0.759 0.517
CC X SS X MR 0.001 3 0.020 0.996
Residuals 24.165 1584 NA NA
0.1000 (Intercept) 25.985 1 1413.238 0.000
CC 0.008 1 0.422 0.516
SS 0.145 3 2.632 0.049
MR 0.000 1 0.004 0.952
CC X SS 0.164 3 2.974 0.031
CC X MR 0.000 1 0.019 0.891
SS X MR 0.013 3 0.229 0.876
CC X SS X MR 0.001 3 0.012 0.998
Residuals 29.124 1584 NA NA
0.5000 (Intercept) 25.366 1 1459.705 0.000
CC 0.001 1 0.030 0.862
SS 0.051 3 0.978 0.402
MR 0.003 1 0.158 0.691
CC X SS 0.001 3 0.026 0.994
CC X MR 0.000 1 0.005 0.943
SS X MR 0.004 3 0.078 0.972
CC X SS X MR 0.001 3 0.010 0.999
Residuals 27.526 1584 NA NA
Table: (\#tab:sim4) ANOVA table comparing the model $AURC \sim Clone Correction
\times Sample Size \times Mutation Rate$ for each rate of sexual reproduction
separately. This table generated with the allelic evenness correction. Columns:
sexrate = rate of sexual reproduction, term = ANOVA term, sumsq = sum of
squares, df = degrees of freedom, statistic = F statistic, p.value = p-value,
Terms are as follows: CC = Clone Correction, SS = Sample Size, MR = Mutation
Rate.
### Power To Detect Clonal Reproduction in SNP Data
Because of the computational intensity of the simulations, we were only able to
run 24 unique seeds over all rates of sexual reproduction with 10 replicates per
seed. Due to an unexpected corner condition, some replicates failed to save, so
we randomly sampled five replicates per seed per rate of sexual reproduction,
giving us a total of 1,200 total populations for analysis.
Analysis of $\bar{r}_d$ for SNP data showed a similar pattern to the
microsatellite data where clonal and nearly clonal (0.1% sexual reproduction)
distributions of $\bar{r}_d$ were bimodal, but they did not have as many
significantly negative values (Fig. \@ref(fig:sim5)). We observed that that none
of the values of $\bar{r}_d$ for the sexual population reached zero. Increasing
levels of missing data appeared to consistently decrease the value of
$\bar{r}_d$ (\@ref(fig:simmisssnp)). This effect was magnified when missing data
was treated as a new allele.
We initially randomized all loci independently to test for significant
departures from random mating. Our results from that analysis showed *p = 0.001*
for all but two clonal data sets (data not shown). After shuffling each
chromosome independently, we obtained significance values that better reflected
our observations from the microsatellite data. Because of the dearth of
significantly negative $\bar{r}_d$ values in clonal populations, the power of
the index to detect clonal reproduction was greater, but at the cost of an
increase in false positive rates (Fig. \@ref(fig:sim6)). An ANOVA analysis of
the distributions of ROC over the 24 seeds showed significant differences for
0.5% to 10% sexual reproduction (Fig. \@ref(fig:sim7), Tab. \@ref(tab:sim5)).
```{r sim5, echo = FALSE, fig.cap = figcap5, fig.scap = figscap5}
if (LATEX){
knitr::include_graphics("figure/simulations/genomic_rd.pdf")
} else {
knitr::include_graphics("figure/simulations/genomic_rd.png")
}
cap <- "Effect of rate of sexual reproduction and sample size (n) on $\\bar{r}_d$
for SNP data. Each panel represents a different sample size. Each violin plot
contains 120 unique data sets. Points represent observed values. Black lines in
violins mark the 25, 50, and 75th percentile."
scap <- "Effect of rate of sexual reproduction and sample size on $\\bar{r}_d$
for SNP data."
figcap5 <- beaverdown::render_caption(caption = cap, figname = "fig2", to = TO)
if (!LATEX) figcap1 <- gsub("@ref", "\\\\@ref", figcap1)
figscap5 <- beaverdown::render_caption(caption = scap, figname = "fig2s", to = TO)
```
```{r simmisssnp, echo = FALSE, fig.cap = figcapmisssnp, fig.scap = figscapmisssnp}
if (LATEX){
knitr::include_graphics("figure/simulations/genomic_missing.pdf")
} else {
knitr::include_graphics("figure/simulations/genomic_missing.png")
}
cap <- "Effect of rate of sexual reproduction, sample size (n), and missing data on
$\\bar{r}_d$ for SNP data. Panels are arranged horizontally with increasing
rates of missing data from left to right and vertically with increasing sample
size from top to bottom. Shading indicates whether or not missing data was
ignored or treated as a new allele. Black lines mark the 25, 50, and 75th
percentile. Each violin represents 120 values."
scap <- "Effect of rate of sexual reproduction, sample size, and missing data on
$\\bar{r}_d$ for SNP data."
figcapmisssnp <- beaverdown::render_caption(caption = cap, figname = "fig2", to = TO)
if (!LATEX) figcap1 <- gsub("@ref", "\\\\@ref", figcap1)
figscapmisssnp <- beaverdown::render_caption(caption = scap, figname = "fig2s", to = TO)
```
```{r sim6, echo = FALSE, fig.cap = figcap6, fig.scap = figscap6}
if (LATEX){
knitr::include_graphics("figure/simulations/genomic_power.pdf")
} else {
knitr::include_graphics("figure/simulations/genomic_power.png")
}
figcap6 <- "Effect of rate of sexual reproduction and sample size (n) the power to
detect non-random mating for SNP data. Color indicates sample size. False
positive rate is shown as increasing transparency. Data shown for both p = 0.01
and p = 0.05"
figscap6 <- "Effect of rate of sexual reproduction and sample size the power to
detect non-random mating for SNP data. Color indicates sample size."
```
```{r sim7, echo = FALSE, fig.cap = figcap7, fig.scap = figscap7}
if (LATEX){
knitr::include_graphics("figure/simulations/AURC_genomic.pdf")
} else {
knitr::include_graphics("figure/simulations/AURC_genomic.png")
}
figcap7 <- "Effect of rate of sexual reproduction and sample size (n) on area under
the ROC curve for SNP data. Each point represents AURC calculated with 10
populations total. 5 populations with a sex rate of 1.0 and 5 populations with a
sex rate specified on the horizontal axis."
figscap7 <- "Effect of rate of sexual reproduction and sample size on area under
the ROC curve for SNP data."
```
sexrate term sumsq df statistic p.value
-------- ------------ ------- --- ---------- --------
0.0000 (Intercept) 22.932 1 17347.604 0.000
SS 0.004 3 1.121 0.345
Residuals 0.122 92 NA NA
0.0001 (Intercept) 23.602 1 29949.701 0.000
SS 0.001 3 0.352 0.787
Residuals 0.072 92 NA NA
0.0005 (Intercept) 23.562 1 28317.512 0.000
SS 0.001 3 0.339 0.797
Residuals 0.077 92 NA NA
0.0010 (Intercept) 23.562 1 28317.512 0.000
SS 0.001 3 0.339 0.797
Residuals 0.077 92 NA NA
0.0050 (Intercept) 19.117 1 6103.399 0.000
SS 0.174 3 18.568 0.000
Residuals 0.288 92 NA NA
0.0100 (Intercept) 13.984 1 1590.890 0.000
SS 0.806 3 30.552 0.000
Residuals 0.809 92 NA NA
0.0500 (Intercept) 6.510 1 232.786 0.000
SS 1.340 3 15.972 0.000
Residuals 2.573 92 NA NA
0.1000 (Intercept) 5.920 1 155.018 0.000
SS 0.575 3 5.023 0.003
Residuals 3.514 92 NA NA
0.5000 (Intercept) 6.161 1 169.859 0.000
SS 0.059 3 0.542 0.655
Residuals 3.337 92 NA NA
Table: (\#tab:sim5) ANOVA table comparing the model $AURC \sim Sample Size$ for
each rate of sexual reproduction separately with SNP data. Columns: sexrate =
rate of sexual reproduction, term = ANOVA term, sumsq = sum of squares, df =
degrees of freedom, statistic = F statistic, p.value = p-value, Terms are as
follows: SS = Sample Size.
### Common Features of Microsatellite and SNP Markers
For both microsatellite and SNP data, as found in @de2004clonal, large variances
were observed with extremely low levels of sexual reproduction (Figs.
\@ref(fig:sim1), \@ref(fig:sim5)). The variances and means of $\bar{r}_d$
decrease with increasing levels of sexual reproduction. For low levels of sexual
reproduction (< 0.01%), the distributions appeared bimodal. This feature appears
in both the microsatellite and SNP data, but with microsatellite data, the lower
part of the distribution extends into the negative values of $\bar{r}_d$,
resulting in non-significant p-values, which occur when low allelic diversity is
observed. Additionally, with larger sample sizes, the variances decrease (Fig.
\@ref(fig:sim1), \@ref(fig:sim5)) and the power to detect clonal reproduction
increases (\@ref(fig:sim4a), \@ref(fig:sim6)), suggesting that the behavior of
the index is affected by sample size. Based on the power analysis, we observe
that the critical point to detect clonal reproduction in a population is at a
maximum of 1% sexual reproduction (Fig.\@ref(fig:sim4), \@ref(fig:sim4a),
\@ref(fig:sim6)).
## Discussion
Scientists have used the index of association to provide evidence for clonal
reproduction in populations. The index of association measure the ratio of
variance between observed and expected genetic distance and is a measure of
linkage among markers (equation \@ref(eq:ia) ) [@brown1980multilocus;
@smith1993how]. @Agapow_2001 created a standardized version, $\bar{r}_d$, that
resembles a correlation coefficient (equation \@ref(eq:rd)). De Meeûx & Balloux
[-@de2004clonal] previously showed that very little sexual reproduction is
required for $\bar{r}_d$ to reach values close to zero (e.g. the absence of
linkage among markers). We confirmed that the power to detect clonal
reproduction is greatly diminished after $\sim$ 1% sex in a population. We also
provide novel insights into how $\bar{r}_d$ responds to marker type, sample
size, mutation rate homogeneity, and clone-correction. We found that the power
to detect significant departures of $\bar{r}_d$ from expected under random
mating decreases with small sample sizes, and thus also decreases with
clone-correction as sample size is by definition reduced when clone-correcting.