Adding Electronic Markers to FPC Maps

Supplemental information to provide detailed information on the data sets and experiments used to compare different strategies of electronic marker addition, using the search tools MegaBLAST (Zhang et. al., 2002), BLAST (Altschul et. al., 1997) and BLAT (Kent, 2001).

To add markers to a map, one uses an alignment program to return hits of the marker sequences against sequence associated to the clones in the map (this can be draft, finished, or bac-end sequence (BES)). Two primary decisions must be made: which program to use, and how to rank the resulting hits. These are discussed in sections I and II below, for a target of draft or finished sequence. BES targets are covered in section III.

In order to compare different programs and hit-ranking approaches, we have used rice markers provided by the Japanese Rice Genome Research Project(JGRP). The set consisted of 3444 ESTs from 2191 cDNAs, and 338 STSs with 263 unique ids. The average EST sequence length was 421, while the STS average was 325.

Our target database of clone sequences was the full set of draft sequence for rice, as of Feb. 2003, consisting of 11126 sequenced contigs from 3497 clones. These were downloaded from Genbank with the query

    Oryza [ORGN] AND (30000[SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC [ALL] OR 
    chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR EST [TITL] OR mRNA [TITL]
    OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL] OR protein[TITL]))

(this query was created by L. Teytelman at Gramene: http://www.gramene.org/documentation/Alignment_docs/rice_rflp.html)

Our BES database consisted of 127457 rice BESs of clones in the rice FPC map (Chen et al. 2001; Mao et al. 2000).

We scored hits as true-positive or false-positive, depending on whether the hit was to the correct chromosome, where the chromosome of the marker was known from the JGRP map, while that for the clone was known from the Genbank record (for draft sequence) or the FPC map (for BES). This is an admittedly simplistic measure, but if bad hits are randomly distributed, it should only mislabel a bad hit as good 1/12 of the time (since there are 12 chromosomes). To check whether bad hits might be concentrated on the "correct" chromosome, 10 hits labeled as true-positive were examined manually, using the detailed map location of the given marker and neighboring markers, and in all cases the hit was confirmed.

Another consideration is that the markers are not entirely specific; i.e., they may hybridize in the lab to chromosomes other than the one mapped; if these extra hits are also found electronically, they will be scored as false-positives according to our "wrong chromosome" test. Because of this, the error rate does not go to zero at high stringencies, but goes to a "base rate" which is determined by the specificity of the markers.Only error in excess of this base rate can be attributed to the search strategy. For the EST markers in our test set this rate is about 9%, while for the STS markers it is much lower.

I. Scoring Strategy: %Match vs. E-value

In a traditional BLAST search, hits are ranked according to the BLAST E-value, which measures the probability of that hit occurring by chance. This is a sensible measure to use for similarity searches, where one wants to find any target sequences that probably have the same function or evolutionary origin as the target.

When adding electronic markers, however, one isn't looking for similarity, but for identity. The marker sequence is supposed to be exactly present in at least one location in the genome. A perfect match of, say, half of the marker sequence, is not the kind of hit one is looking for, even though it could have a very low E-value.

For this reason, it seems sensible to score electronic marker hits according to the percentage of the total marker length which is matched; this score will be denoted "%Match". (Note that this is not the same as the "percent identity" from BLAST reports, because that score only gives the fraction of bases that match within the alignment region, which may not cover the whole marker.)

The %Match strategy is clearly seen to be superior to E-value in tests on the STS subset of JRGP markers (table I below). On this dataset, using MegaBLAST, a screen on %Match ≥ 95 finds 462 hits, with a false-positive rate of 2%;by contrast, taking E-value ≥ 170 gives only 179 hits, with 3% false-positives.[ %Match was implemented by parsing the output file and taking %Match = (match length/query length) *(percent identity) .]

It is understandable that E-value should work poorly in this case, because the sequences have a variety of lengths, and a given E-value is only well-suited for a small range of lengths. E-170 corresponds on this dataset to a match of about 300bp, and since the average STS marker has length only 325, there are many which are shorter than 300. Hits of these markers will never be found at E-170; however, if the E-value cutoff is lowered to accept the shorter markers, it becomes too low for the longer ones, and false hits are introduced. For example, marker L704 has length 268, and scores a 100% hit on clone AC018929, but this hit has E-value only E-149. To accept this hit, the E-value cutoff would have to be lowered at least to E-149, at which point the cumulative false-positive rate is 6%, twice as high as at E-170.

Another problem with E-value screens, which may not be sufficiently appreciated, is that the value returned for a given alignment varies with the size of the databases being compared. The larger the databases from which the two sequences are drawn, the larger will be the returned E-value. This means that hits which are accepted at a given E-value cutoff, could be rejected if the database sizes are increased, even if the two sequences are unchanged. So for example if one runs an E-value screen of some markers against batches of sequence that come in periodically over time, and then runs the same screen later against the full set of sequences, the set of hits will be different. By contrast, the %Match strategy is independent of database size, so this potential confusion is absent.

From the data in table I, these STS marker sequences are seen to be very accurate, so that one can expect to find typically at least 95% of the marker sequence in the target genome; this is the ideal situation for the %Match strategy. If the marker sequences are less accurate, or contain bad reads at either end, then the advantage of %Match over E-value will be reduced or eliminated. This effect can be seen within the EST marker set, which also has the additional complication of spliced introns, as discussed in the next section.

It is worth mentioning that not all of the markers get a hit at %Match ≥ 95, or indeed at any cutoff. At 95%, hits are found for 220 unique markers, or 83% of the total 263 markers in the set. Lowering the cutoff to 90% brings the marker total to 235, while going all the way down to 70% brings in only 5 more for a total of 240. So it seems that about 10% of the markers simply do not hit, which is not unreasonable given the gaps in clone coverage and in draft sequence, and possible errors in marker sequencing.

Finally we note that although MegaBLAST data was used in this comparison, BLAT gives the same result; furthermore, in the context of BSS, the %Match scoring is supported only through BLAT (the reason for this, as discussed below, is that %Match requires BLAT in order to work well on EST data). MegaBLAST data was used here merely to emphasize that it is the scoring strategy, not the program, which is being tested.

Supplemental Table I. %Match Strategy vs. E-value strategy for STS-to-sequence data.
F.P = False-Positive rate, as a percent of total hits

	%Match Strategy			E-value Strategy
%Match	F.P.(Cumulative)	Hits	E-value	F.P.(Cumulative)	Hits
100%	1%	131	200	3%	134
99	2	288	190	3	134
98	2	366	180	3	140
97	2	420	170	3	179
96	1	453	160	6	234
95	2	462	150	6	258
94	2	473	140	8	325
93	2	479	130	10	387
92	3	495	120	20	482
91	5	511	110	28	592
90	5	512	100	31	651

II. Unsplicing Introns: BLAST vs. BLAT

Marker sequences which have undergone splicing present a problem for the %Match strategy, because although the full marker sequence may be present in the target, it is in disconnected exons. BLAST (and MegaBLAST) returns each exon hit as a separate high-scoring pair, but from the report it is not straightforward to put those pieces together to produce the total %Match.

However, there is a search tool which does this automatically - BLAT. Given this capability, and the superiority of the %Match strategy seen above on STS data, it is reasonable to believe that BLAT + %Match will be superior on the EST data. In fact we will see that it is, but the margin of difference is much smaller than on the STS data.

Tables II and III display the EST-to-sequence data for BLAT and MegaBLAST, respectively. (In table II, the maximum allowed intron was taken to be 1000, i.e., all hits with any intron larger than 1000bp were excluded; see below. Note that, before these tests were run, all Poly(A) tails, including any embedded n's, were stripped from the markers. Also, MegaBLAST was run without any filtering, i.e., using command-line parameter -F F.)

Supplemental Table II. BLAT data, EST-to-sequence, maximum intron = 1000
F.P. = False-positive rate(given as percentage of hits found)
Intron Hits = Hits containing an intron, defined as a target gap ≥ 20bp
F.P(Introns)= Running false-positive rate for hits containing an intron

%Match	F.P.(Cumulative)	F.P.(Running)	Hits	Intron Hits	F.P.(Introns)
100%	9%	9%	402	192	7%
99	9	9	1561	765	9
98	8	5	2473	1173	5
97	8	10	3029	1436	11
96	8	12	3452	1639	12
95	8	7	3734	1766	7
94	8	5	3943	1866	3
93	8	10	4065	1917	3
92	8	11	4183	1985	7
91	9	17	4307	2048	19
90	9	8	4397	2087	5
89	9	10	4463	2123	8
88	9	13	4523	2149	15
87	9	16	4577	2179	26
86	9	19	4628	2192	30
85	9	16	4664	2213	23
84	9	24	4705	2230	41
83	9	12	4755	2253	8
82	9	20	4803	2274	47
81	9	20	4833	2290	25
80	9	31	4878	2317	33
79	9	13	4923	2339	13
78	10	32	4948	2349	30
77	10	10	4988	2378	13
76	10	33	5027	2392	28
75	10	32	5058	2410	38
74	10	41	5087	2427	23
73	10	57	5125	2445	33
72	11	25	5156	2458	23
71	11	50	5182	2473	53
70	11	52	5201	2483	60

Supplemental Table III. BLAST data, EST-to-sequence

E-value	F.P.(Cumulative)	F.P.(Running)	Hits
200	7	7	2116
195	7	0	2116
190	7	0	2116
185	7	0	2116
180	7	9	2157
175	8	12	2319
170	8	14	2481
165	8	11	2624
160	8	6	2771
155	8	7	2910
150	8	11	3046
145	8	5	3181
140	8	15	3296
135	8	12	3428
130	9	13	3552
125	9	11	3723
120	9	17	3897
115	10	21	4069
110	10	18	4249
105	10	21	4401
100	11	27	4561
95	11	22	4704
90	13	53	4934
85	14	34	5069
80	14	28	5221

For purposes of comparison, we can take the lowest cutoffs in each table which still give some fixed (cumulative) false-positive rate. Using 8% false-positives, we see that BLAT with %Match ≥ 92 finds 4183 hits, while MegaBLAST with E ≥ 135 finds 3428. Using 9%, BLAT with %Match ≥ 79 finds 4923, while MegaBLAST with E ≥ 120 finds 3897.

Hence, on this data, BLAT + %Match finds 20% more hits than MegaBLAST, with equal error rates. From table II it appears also close to 50% of the hits found by BLAT contain (putative) introns, here defined as hits containing at least one target gap larger than 20bp; also, the running false-positive rate for these intron hits separately is not higher than the total rate, if %Match ≥ 90. At %Match < 90, the running false-positive rates for intron hits do rise, suggesting that 90 is a good cutoff for %Match on this data.

Looking in more detail at the results for 8% cumulative false-positives, it turns out that BLAT finds 1145 true hits that MegaBLAST misses, of which 894 contain introns. However, MegaBLAST still finds 492 true hits that BLAT misses. The average length of these excess MegaBLAST sequences is 497 bases, larger than the 421 bp average length of all the EST sequences; this suggests that possibly the problem is low-quality sequence remaining at one end. Since the 3' ends were trimmed in most cases to a Poly(A) tail, the 5' ends are suspect.

In order to test this hypothesis, 5% of the marker sequence was removed from the 5' end of each marker, and BLAT was run again. On this data, total error 8% required only %Match ≥ 90, and the hits found by BLAT but not MegaBLAST increased to 1276, while those found by MegaBLAST and not BLAT decreased to 353. This bears out the hypothesis, but blindly cutting off larger pieces of the sequence doesn't work; the trimming must be done as accurately as possible at the time of sequencing, if the markers are to realize their full value.

It was mentioned above that the BLAT hits were screened for a maximum intron size of 1000. This number obviously must be chosen to suit the particular genome in question; for rice, with typical intron size 400-600bp (Sasaki et. al. 2002; Y. Yu, pers. comm.), a limit of 1000 seems reasonable. Our EST data supports this, because a higher intron size limit brings in few new hits, while quality declines considerably. For example, taking %Match ≥ 92, but considering only hits with maximum intron size 1000 or more, one finds only 465 additional hits, and the cumulative false-positive rate is 14%, as compared to 8% for the original set with maximum intron size 1000.

Before moving on we again discuss the false-negatives, i.e., the markers which do not hit. Using BLAT, with only the Poly(A) tails trimmed, and %Match ≥ 92, one finds hits to 1767 unique markers, or 81% of the total set of 2161. Carrying the search to %Match ≥ 70 brings the total to 86%, and %Match ≥ 50 gives 88%. So again roughly 10% of the markers are missing, a figure not out of line with the known gaps in coverage.

III. Short Target Sequences

Aligning marker sequences to short target sequences such as BESs presents a problem for the %Match strategy, since usually only a piece of the marker sequence can be expected to hit, and this is especially true for EST markers containing introns. The usefulness of BLAT's intron-unsplicing capability is likewise also reduced. Our dataset did not produce enough hits for conclusive results, but it appeared that both of the two strategies considered above perform the same.

References

Altschul, S. , Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. 1997. Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.Nucleic Acids Res. 25:3389-3402.

Chen, M., Presting, G., Barbazuk, W., Goicoechea, J., Blackmon, B., Fang, G., Kim, H., Frisch, D., Yu, Y., Higingbottom, S., Phimphilai, J., Phimphilai, D., Thurmond, S., Gaudette, B., Li, P., Liu, J., Hatfield, J., Sun, S., Farrar, K., Henderson, C., Barnett, L., Costa, R., Williams, B.,Walser, S., Atkins, M., Hall, C., Bancroft, I., Salse, J., Regad, F., Mohapatra, T., Singh, N., Tyagi, A., Soderlund, C., Dean, R., and Wing, R.2001.An integrated physical and genetic map of the rice genome.Plant Cell 14:537-545.

Kent, W. James. 2001.BLAT - The BLAST-Like Alignment Tool.Genome Research 12:656-664.

Mao, L., Wood, T.,Yu, Y., Budiman, M., Woo, S., Sasinowski, M., Goff, S., Dean, R. and Wing, R. 2000. Rice Transposable Elements: A Survey of 73,000 Sequence-Tagged-Connectors (BESs). Genome Research 10:982-990.

Sasaki, T., Matsumoto, T., Yamamoto, K., Sakata, K., Baba, T., Katayose, Y., Wu, J., Niimura, Y., Cheng, Z., Nagamura, Y., et al.2002. The genome sequence and structure of rice chromosome 1.Nature 420:312-316.

Zhang, Z., Schwartz, S., Wagner, L., and Miller, W.2002. A Greedy Algorithm for Aligning DNA Sequences. Journal of Computational Biology 7:203-214.