This study was funded in part by NSF grant #0211851 and by the Initiative for Future Agriculture and Food Systems Grant no. 2001-52100-11292 from the USDA Cooperative State Research, Education, and Extension Service.
Contents: Methods, Simulation Results, Discussion
Methods
Xu et al. (2004) Genomics 84: 941-951 compares the following methods using simulations based on human chr22:
| Method | Enzymes | Band size range (bp) | Gel length [1] | Tolerance (Xu, ours) [2] |
| 1e | HindIII | 600 to 16k | 3300 | Variable, 7 |
| 2e | HindIII, HaeIII | 58 to 773 | (773-58) * 1 = 715 | 2, 2 |
| 3e [3] | HindIII, BamHI, HaeIII | 35 to 500 | (500-35) * 10 * 1 = 4650 | 2, 4 |
| 4e | HindIII/HaeIII; HindIII/RsaI; HindIII/DpnI | 75 to 500 | (500-75) * 10 * 3 = 12750 | 5, 4 |
| 5e | HindIII, BamHI, XbaI, XhoI, HaeIII | 35 to 500 | (500-35) * 10 * 4 = 18600 | 2, 4 |
[1] The paper did not specify the gel lengths used in the FPC configurations, so we computed
them from the band values (gel length = total number of possible band values).
[2] We used bands instead of sizes for 1e; hence the tolerance is different. For the three methods
that use sequencing machines, we used 0.4 bp, as estimated by ourselves and by Luo et al.
for the ABI 3700/3730.
[3] In the 3e method, as specified in the paper, only HindIII receives a label.
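The gel-length arithmetic in the table can be reproduced with a short sketch. It assumes our reading of the factors: the factor of 10 is the number of band values per bp (0.1 bp resolution, which is consistent with a 0.4 bp tolerance being entered as 4), and the last factor is the number of separately labeled reactions or dyes. The names `gel_length`, `steps_per_bp`, and `n_labels` are illustrative and do not come from FPC or from the paper.

```python
# Sketch of the gel-length calculation: gel length = total number of
# possible band values. Parameter names are illustrative assumptions.

def gel_length(min_bp, max_bp, steps_per_bp=1, n_labels=1):
    """Number of distinct band values across all labels."""
    return (max_bp - min_bp) * steps_per_bp * n_labels

# Values taken from the table above (1e is omitted because it uses
# band data rather than sizes).
CONFIGS = {
    "2e": dict(min_bp=58, max_bp=773, steps_per_bp=1, n_labels=1),
    "3e": dict(min_bp=35, max_bp=500, steps_per_bp=10, n_labels=1),
    "4e": dict(min_bp=75, max_bp=500, steps_per_bp=10, n_labels=3),
    "5e": dict(min_bp=35, max_bp=500, steps_per_bp=10, n_labels=4),
}

def tolerance_units(tolerance_bp, steps_per_bp):
    """Convert a tolerance in bp to band-value units (assumed conversion)."""
    return round(tolerance_bp * steps_per_bp)

if __name__ == "__main__":
    for method, cfg in CONFIGS.items():
        print(method, gel_length(**cfg))
    print("tolerance:", tolerance_units(0.4, 10))
```

Under these assumptions the computed values match the table, and the 0.4 bp tolerance corresponds to the value of 4 that we used for the sequencer-based methods.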
The paper concludes, based on the human chr22 data only, that the 2e and 3e methods are to be
preferred, which is counterintuitive because 4e and 5e contain more information.
We have extended the simulations to a large number of
other sequenced chromosomes and find that, as expected, the 5e method is superior, with 4e
a fairly close second.
Our simulations followed the exact methods enumerated above except for two differences.
First, we simulated agarose (1e) using band data rather than size data,
with a fixed tolerance of 7, as this is more customary and matches the error obtained in
realistic scoring. Second, we used a tolerance of 4 for all of the methods specifying detection by
automated sequencer (3e, 4e, 5e), corresponding to our practice with data from 3730xl machines.
We note that we were not able to fully reproduce the results of the paper using the information provided.
The gel lengths were not stated and these have a significant impact on cutoff values and other
simulation outcomes; see Discussion for additional discrepancies.
Simulation Results
Human Chr22 comparison:
We compare the human22 simulations of the paper with our own, using both the parameters
stated in the paper and the parameters which we derived using the methodology stated in the paper.
Full simulations, tabulated by coverage or by species: fully automated simulations on
12 different pseudomolecules, using 4 different coverages,
3 different build criteria, and 2 different random libraries for each case.
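To give a sense of the size of this design, the sketch below enumerates the simulation grid. It assumes that every factor is fully crossed with every other and with each of the five methods; the pseudomolecule, coverage, and build-criterion labels are placeholders, since the actual values are given only in the linked tables.

```python
from itertools import product

# Placeholder labels: the real pseudomolecules, coverages, and build
# criteria are listed in the result tables, not here.
pseudomolecules = [f"pseudomolecule_{i}" for i in range(1, 13)]  # 12
coverages = [f"coverage_{i}" for i in range(1, 5)]               # 4
build_criteria = [f"criterion_{i}" for i in range(1, 4)]         # 3
libraries = [1, 2]                                               # 2 random libraries
methods = ["1e", "2e", "3e", "4e", "5e"]

# One entry per (pseudomolecule, coverage, criterion, library) combination.
cases = list(product(pseudomolecules, coverages, build_criteria, libraries))

# Assumption: each case is built once per fingerprinting method.
builds = list(product(methods, cases))

if __name__ == "__main__":
    print(len(cases), "cases per method,", len(builds), "builds in total")
```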
The simulation software used may be downloaded here.
FPC v8.2 or later is required.
Discussion
The "by coverage" tables show that 5e is clearly
superior when measured by the number of contigs formed (or by F-, which is virtually the same).
The other measures, e.g. F+ or Q, arise in unpredictable quantities as soon as there
are badly formed contigs. A given false contig may contain many Q, or in some cases none at all.
Every false contig contains at least one F+ clone overlap, but sometimes there
will be many more. These measures fluctuate between the methods and show no clear
distinction between them. The combined "Map Score" defined in the paper (and displayed in
the "by species" tables) is therefore a
combination of an informative score (contig number)
with several essentially random scores that obscure the signal.
Our simulated digestions differed significantly from those reported in the paper in two cases.
In the 3e case, we obtained 47.5 bands/clone, compared to the 71.7 cited in the paper;
our figure seems more reasonable to us since it
is approximately the same as for the 2e method.
For the 4e method, we obtained 110 bands/clone, compared to 73.8 in the paper;
in this case also, our value seems reasonable to us since it is
approximately proportional to the number of labeled 6-cutters.
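A minimal sketch of how bands per clone can be counted in a simulated digestion follows. It is not the simulation software referred to above. It makes simplifying assumptions: digestion is complete, fragment sizes are measured between the start positions of recognition sites (cut offsets within the site are ignored), and a fragment yields one band if at least one of its ends was produced by a labeled enzyme and its size lies in the detectable range. The recognition sequences are the standard ones for these enzymes; the function names are illustrative.

```python
import re

# Standard recognition sequences; positions of the cut within the
# site are ignored in this sketch.
SITES = {"HindIII": "AAGCTT", "BamHI": "GGATCC", "HaeIII": "GGCC"}

def site_positions(seq, motif):
    """Start positions of every occurrence of motif (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]

def labeled_bands(seq, enzymes, labeled, min_bp, max_bp):
    """Sizes of fragments that have a labeled end and fall in the size range."""
    cuts = sorted(
        (pos, name) for name in enzymes for pos in site_positions(seq, SITES[name])
    )
    bands = []
    # Each pair of neighbouring cut sites bounds one fragment.
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        size = p2 - p1
        if (e1 in labeled or e2 in labeled) and min_bp <= size <= max_bp:
            bands.append(size)
    return bands

if __name__ == "__main__":
    # Toy clone: one HindIII site followed by two HaeIII sites.
    clone = "AAGCTT" + "T" * 44 + "GGCC" + "T" * 96 + "GGCC"
    print(labeled_bands(clone, ["HindIII", "HaeIII"], {"HindIII"}, 35, 500))
```

In the toy clone only the fragment adjacent to the HindIII site is reported, because the fragment bounded by two HaeIII sites carries no label. Under this model the band count depends on how many labeled sites the clone contains, which is the basis of the plausibility argument above.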
We were also unsure how Xu et al. defined the parameter F-, which is shown as
zero for all entries of their Table 1. There must be at least one
F- (i.e., missing) overlap for every contig break, and even if these are not
counted in F-, there are generally additional F- because of bridging clones.
Several of the test cases, including human22, have false overlaps (F+) already at
very stringent cutoffs (the others are human19, human20, and arab1). In human22
this is caused by a
128 kb repeat (interrupted by 56 gaps), which generates false overlaps in all 5 methods.
High-information-content fingerprinting (HICF) could be disadvantageous in such cases, as the false
overlap will be detected at a lower cutoff, and furthermore the gaps in the
duplicon are less likely to intersect the small HICF fragments and differentiate them.
This phenomenon deserves further study, but our simulations do not indicate any
disadvantage for HICF even in these cases.
Note that none of these tests attempts to simulate error. The error rate of HICF fingerprinting
in the maps we have constructed to date (whole-genome cereal assemblies) is higher
than that of a well-scored agarose project, but the higher throughput and
greater contig formation in HICF considerably outweigh this
disadvantage.