The University of Arizona

MTP Simulation  


This study was funded in part by NSF grant #0213764.

We generated simulation results for automatically selecting an MTP (minimal tiling path) based on different criteria. This document presents the results so that users will have an idea of what to expect from their MTPs.

Contents:
      Fingerprint method
      BSS method using BES-draft
      Longest clones and using sizes
      Conclusions

Fingerprint method

We created simulated datasets from the rice chromosome 3 pseudomolecule for a total of 24 datasets:

  • 10x and 20x coverages: For each dataset, only one library was made (i.e. clones cut with HindIII), and the average clone size was 150,000 bp with a standard deviation of 17,000 bp.
  • Agarose and snapshot HICF: The 1e and 5e methods discussed in Simulation of FP methods are agarose and HICF respectively.
  • 6% and 12.5% error: The error rate of a data set is the fraction of false positive and false negative bands in the clones, while band noise is a Gaussian error term added to all band values. With really good band calling, agarose can have a 6% error rate, but it is generally 12%. HICF tends to have a 12% error rate.
  • 3 datasets of each: These were averaged together for the results in the tables below.
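The dataset grid described above can be sketched as follows. This is only an illustration, not the simulation software used for this study: the function names are hypothetical, and placing clones uniformly at random is an assumption made for brevity (the real library was derived from HindIII digestion of the pseudomolecule).

```python
import random
from itertools import product

COVERAGES = (10, 20)            # fold coverage of the pseudomolecule
METHODS = ("agarose", "HICF")   # fingerprinting method
ERROR_RATES = (0.06, 0.125)     # fraction of false positive/negative bands
REPLICATES = (1, 2, 3)          # three datasets per combination


def dataset_grid():
    """Enumerate every (coverage, method, error, replicate) combination."""
    return [
        {"coverage": c, "method": m, "error": e, "replicate": r}
        for c, m, e, r in product(COVERAGES, METHODS, ERROR_RATES, REPLICATES)
    ]


def sample_clones(genome_length, coverage, mean_size=150_000, sd_size=17_000, seed=0):
    """Draw clone intervals with Gaussian sizes at uniform random positions.

    Assumption: uniform placement stands in for a real restriction digest.
    """
    rng = random.Random(seed)
    n_clones = round(coverage * genome_length / mean_size)
    clones = []
    for _ in range(n_clones):
        size = int(rng.gauss(mean_size, sd_size))
        size = max(1, min(size, genome_length))   # keep the size usable
        start = rng.randint(0, genome_length - size)
        clones.append((start, start + size))
    return clones
```

The grid has 2 x 2 x 2 x 3 = 24 entries, matching the number of datasets stated above.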

For each dataset, the clones were assembled into FPC contigs. The tolerance for agarose was 7 and for HICF was 4. For each dataset, the simulation software used a binary search to find the highest possible cutoff before getting the first false positive overlap. The same set of datasets was used for all the following tables.
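A minimal sketch of such a binary search is given below. It assumes that the cutoff is written as 1e-k and searched over the integer exponent k, and that false positives appear monotonically as the cutoff is loosened; both are assumptions for illustration. The callback `has_false_positive` is hypothetical and stands in for assembling the dataset at a given cutoff and checking the overlaps against the known clone coordinates.

```python
def highest_clean_exponent(has_false_positive, min_exp=1, max_exp=50):
    """Find the smallest exponent k (i.e. the highest cutoff 1e-k) whose
    assembly contains no false positive overlap.

    has_false_positive(k) -> bool is a hypothetical callback. It is assumed
    to be monotone: once a cutoff is clean, every stricter cutoff is clean.
    Returns None if even the strictest cutoff yields a false positive.
    """
    if has_false_positive(max_exp):
        return None
    lo, hi = min_exp, max_exp          # invariant: hi is known to be clean
    while lo < hi:
        mid = (lo + hi) // 2
        if has_false_positive(mid):
            lo = mid + 1               # too loose, move to stricter cutoffs
        else:
            hi = mid                   # clean, try a looser cutoff
    return lo
```

For example, if false positives appear for every cutoff looser than 1e-12, then `highest_clean_exponent(lambda k: k < 12)` returns 12.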

The quality of minimal tiling paths was evaluated by reference to the known positions of the clones in the pseudomolecule.

  • The average clone overlap is based on the real coordinates.
  • The percent false positive overlaps is the ratio of the number of false MTP overlaps to the total number of MTP overlaps. The false MTP overlaps are the neighboring clones in the MTP that appear to share bands but do not overlap based on the real coordinates.
  • The number of gaps is the number of gaps in the MTP that are not due to false positive overlaps; generally, this occurs wherever a region has no good, overlapping clone pairs to use, which usually indicates a region of the contig with low coverage or poor quality.
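The first two measures can be sketched as below, given the real coordinates of the MTP clones in tiling order. The function names are hypothetical, and averaging over true overlaps only is an assumption; the number of gaps is not computed here because it depends on where the MTP itself reports a break.

```python
def real_overlap(a, b):
    """Signed overlap of two (start, end) intervals; <= 0 means no overlap."""
    return min(a[1], b[1]) - max(a[0], b[0])


def mtp_quality(mtp):
    """Evaluate an MTP given as a list of (start, end) real coordinates.

    Each pair of neighboring clones is treated as an overlap claimed by
    the MTP and is checked against the real coordinates.
    """
    overlaps = [real_overlap(a, b) for a, b in zip(mtp, mtp[1:])]
    true_overlaps = [o for o in overlaps if o > 0]
    n_false = len(overlaps) - len(true_overlaps)
    return {
        # Assumption: the average is taken over true overlaps only.
        "average_overlap": sum(true_overlaps) / len(true_overlaps) if true_overlaps else 0.0,
        "percent_false_positive": 100.0 * n_false / len(overlaps) if overlaps else 0.0,
    }
```

For instance, an MTP of three clones at (0, 100), (80, 200) and (300, 400) has one true overlap of 20 and one false overlap, giving 50% false positive overlaps.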

In all the following tables, the default MTP parameters were used unless stated otherwise.
Weight parameter: The user may influence the tradeoff between minimizing overlaps and the risk of false positive overlaps in the MTP by varying the "weight" parameter. An indication of the effect of this parameter is given in the following two graphs, which show average overlap and percentage of false overlaps in the MTP as a function of the weight parameter for several data sets. These graphs may be used as guidelines for the effect of changing the weight parameter value, but should not be interpreted as a prediction of the results in any single case. The performance of MTP and the effect of the weight parameter will vary depending on the quality of the data set, the digestion method, the FPC assembly, and the values of other MTP parameters.

Agarose method:


HICF method:

BSS method using BES-draft

Draft sequence and BESs may also be used to find overlapping clones as a basis for MTP. We created sequence assemblies from simulated fragments at 1x, 2x, and 3x coverages of the rice chromosome 3 pseudomolecule. Using the 10x with 6% error datasets, BESs were created from the ends of clones. No error was introduced into the sequence. Additionally, the SeqCtgs are all correct; that is, we simulated extracting sequences of 800 bp, then computed the SeqCtg based on the known coordinates.

The BSS (Blast Some Sequence) function in FPC was used to align the BESs to the draft sequence, and then the MTP was used to find overlapping clone pairs based on the BSS results. The MTP software can be set to allow positive overlaps only, or to allow negative overlaps as well. The idea behind negative overlaps is that the draft sequence bridges the two clones, and hence the full sequence is known. We have the following types of results:

  • Negative allowed:
    1. Clones overlap and the BESs match a SeqCtg at the same location.
    2. Clones overlap and the BESs match a SeqCtg from a different location; this is not a false overlap, though it is a false SeqCtg.
    3. Clones do not overlap but are bridged by a SeqCtg at the same location.
    4. Clones do not overlap but are bridged by a SeqCtg from a different location. In this case, using the SeqCtg to bridge the two clones would be incorrect.
  • Positive only: the same 4 situations as above, except that cases where the clones do not overlap must be counted as false overlaps, even if there is a correct bridging SeqCtg.
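The case analysis above can be written as a small classifier. This is an illustrative sketch only: the function name and the label strings are hypothetical and are not FPC output.

```python
def classify_pair(clones_overlap, seqctg_same_location, allow_negative):
    """Classify a clone pair joined through BES hits to one SeqCtg.

    clones_overlap:       True if the clones overlap in real coordinates.
    seqctg_same_location: True if the SeqCtg comes from the same location.
    allow_negative:       True for the "negative allowed" setting.
    Returns (overlap_label, seqctg_label); labels are illustrative only.
    """
    seqctg_label = "correct SeqCtg" if seqctg_same_location else "false SeqCtg"
    if clones_overlap:
        overlap_label = "true overlap"        # cases 1 and 2
    elif allow_negative:
        overlap_label = "negative overlap"    # cases 3 and 4, bridged by a SeqCtg
    else:
        overlap_label = "false overlap"       # positive only: no real overlap
    return overlap_label, seqctg_label
```

With negative overlaps allowed, only the combination ("negative overlap", "false SeqCtg") represents an incorrect bridge; with positive only, any pair without a real overlap is a false overlap regardless of the SeqCtg.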

We provide two tables, one with positive only overlaps allowed and one allowing negative overlaps. For the first case, there is a column for F+ overlaps and in the second case there is a column for Negative overlaps. In both cases, there is a column for F+ SeqCtg, indicating the wrong SeqCtg bridges a negative overlap, as these are the true false positives for BES->Draft.

When only draft sequence is used to determine overlaps, there are generally not enough defined overlaps to determine an MTP across an entire contig; hence, we have the concept of expressways and junctions. An expressway is an MTP that does not span the contig, i.e. there may be multiple expressways in a contig. A junction is where the end clones of two expressways overlap but were not defined to have an acceptable overlap (e.g. there was no draft sequence covering the respective BESs). Junctions can cause large overlaps, so we suggest that the BSS method never be used by itself, but always in conjunction with fingerprint overlaps. When both are used, the MTP algorithm will always pick the BES-draft overlap when possible, and otherwise use fingerprints. The following two tables, which use only BSS overlaps, are provided only to illustrate the small overlaps that can be detected. The total overlap is the MTP and junction overlap combined.

The following two tables use the 10x with 6% error datasets.

Using the same data sets, we also ran MTP using both fingerprint-based pairs and BSS-based pairs with positive overlap only option.

Longest clones and using sizes

If band sizes are available in the Sizes directory of an FPC project, you may choose to use these for calculation of clone overlaps instead of the number of bands times the average restriction fragment size. The following table illustrates the differences in the MTP results as a consequence of using the accumulated sizes versus an approximation based on the number of bands. The table is based on the average of three simulated data sets of a 15x library of rice chromosome 3 pseudomolecule with agarose digestion at a 0% error rate.
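The difference between the two estimates can be illustrated with a short sketch. The function names, the band sizes, and the average fragment size of 4096 used below are made up for illustration; how FPC decides which bands are shared is not covered here.

```python
def overlap_from_sizes(shared_band_sizes):
    """Overlap estimate from accumulated sizes of the shared bands."""
    return sum(shared_band_sizes)


def overlap_from_count(n_shared_bands, avg_fragment_size):
    """Overlap approximation: number of shared bands times an average size."""
    return n_shared_bands * avg_fragment_size


# Hypothetical shared bands between two clones (values are illustrative).
shared = [1200, 3500, 800, 9000]
by_sizes = overlap_from_sizes(shared)             # 14500
by_count = overlap_from_count(len(shared), 4096)  # 16384
```

The two estimates diverge whenever the shared fragments are not close to the average size, which is the effect the table compares.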

Another way in which size information can be used by MTP is to modify the shortest path algorithm so that longer clones are used preferentially. An example of the effect of selecting the "Give preference to large clones" option is given in the following table, which shows an average over the same data sets as used in the previous table.
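One way such a preference could be expressed is sketched below with Dijkstra's algorithm. This is not the MTP implementation: the cost function (overlap plus a penalty proportional to how much shorter a clone is than the longest clone), the `size_penalty` parameter, and all names are assumptions chosen only to show the idea.

```python
import heapq


def tiling_path(lengths, overlaps, start, goal, size_penalty=0.0):
    """Cheapest path of clones from start to goal.

    lengths:  dict clone -> clone length
    overlaps: dict clone -> list of (next_clone, overlap) pairs
    The step cost is the overlap plus size_penalty * (max_len - next length),
    so a positive size_penalty (an assumed knob) favors longer clones.
    Returns the list of clones on the path, or None if goal is unreachable.
    """
    max_len = max(lengths.values())
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue                      # stale heap entry
        for nxt, overlap in overlaps.get(node, []):
            new_cost = cost + overlap + size_penalty * (max_len - lengths[nxt])
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical contig: B gives smaller overlaps, C is the longer clone.
lengths = {"A": 150, "B": 100, "C": 180, "D": 150}
overlaps = {"A": [("B", 10), ("C", 12)], "B": [("D", 10)], "C": [("D", 12)]}
```

With `size_penalty=0.0` the path goes through B (smaller total overlap); with `size_penalty=0.1` it goes through the longer clone C at the price of slightly larger overlaps.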

Conclusions

From our discussions with biologists who have selected MTPs by hand, the automatic scripts do as well as a person can do by hand. As these results show, using fingerprints is not the optimal way to select an MTP; using BESs and draft sequence gives better results. Another approach is to sequence 'seed' clones and then find the best neighbor to sequence by finding the clone whose BES is nearest the end. For this approach, we are currently developing an algorithm to select the next clone for sequencing based on a draft sequenced clone and BESs. This will also provide some ordering of the sequenced contigs.

Email Comments To: fpc@agcol.arizona.edu

Last Modified Thursday February 14, 2008 10:40 AM and 50 seconds