SyMAP 4.2 System Guide  

Contents

01. Overview
02. Publications
03. Getting Started
04. System Requirements
05. Installation
06. Running the Demo
07. Creating a New Project
08. Preparing the Sequences
09. Annotation Files
10. Working with FPC Files
11. SyMAP and MySQL
12. Runtime and Memory
13. Web Display
14. Database Parameters
15. Self Alignments
16. How SyMAP Works
17. References

Overview

SyMAP is a system for computing, displaying, and analyzing syntenic alignments between eukaryotic genomes of medium to high divergence. It does not work well for very similar genomes or for bacterial genomes.

Its features include the following (for a pictorial introduction, see the Tour):

GUI manager to run synteny computations and view results.
Multiple display modes (dot plot, circular, side-by-side, closeup, 3D).
Draft sequence ordering by synteny, i.e. aligning a draft genome to a fully sequenced genome (not draft-to-draft).
Construction of cross-species gene families.
Complete annotation-based queries.
All displays accessible through the web.
Alignment of FPC maps to sequenced genomes.

Publications

The back-end processing of SyMAP, including the synteny block and anchor filtering algorithms, is described in the following two publications. A sketch of the algorithms is also provided in How SyMAP Works.
        C. Soderlund,  W. Nelson, A. Shoemaker and A. Paterson (2006)
        SyMAP: A System for Discovering and Viewing Syntenic Regions of FPC maps 
        Genome Research 16:1159-1168.

        C. Soderlund, M. Bomhoff, and W. Nelson (2011) 
        SyMAP: A turnkey synteny system with application to plant genomes.
        Nucleic Acids Research 39(10):e68.
SyMAP is freely distributed software. However, if you use SyMAP results in published research, you must cite one or both of these articles.

Getting Started

To run SyMAP, the only things you must have are:
A. Mac or Linux computer with 1G RAM.
B. The SyMAP downloadable package.
C. Sequence files in FASTA format. Pay close attention to Preparing the Sequences.

Follow the steps below to get started with SyMAP. If you are working with FPC files, see also Working with FPC Files; for problems, see Troubleshooting. Note that each SyMAP application window has a separate Help button providing further details on the use of its functions.

1. Use a Linux or Mac machine. It needs to have Java v1.6 or later, and sufficient processing power. See system requirements.
2. Prepare sequences and annotation. Sequences can be in one or many files and can be masked or unmasked. See preparing the sequences. Annotation format is gff3; see annotation files.
3. Download SyMAP. Installation is a simple unzip. See installation.
4. Run the demo. Highly recommended. See running the demo.
5. Set up MySQL. SyMAP can run on its own but we recommend that you use a separate MySQL installation, which can be on any machine. See SyMAP and MySQL.
6. Import your files. The Manager interface makes this easy; see creating a new project.
7. Compute alignments and synteny. This is also easy through the Manager interface. See runtime and memory.
8. View results. These functions are located on the Manager interface. Detailed description of the user interface is in the User Guide.
9. Share through the web. A web install script is provided; see web display.

System Requirements

SyMAP runs primarily on Intel systems with Linux or Mac OSX (tested on OSX 10.6.8 and 10.8.3). Windows is supported for viewing and querying only.

For performing large alignments (e.g. genomes of 1Gb or more), it is essential to have multiple CPUs, a 64-bit computer, and at least 5GB of RAM for each CPU that you intend to use. (Note that you can set the number of CPUs for SyMAP to use.)

For viewing alignments, CPU and memory needs are typically negligible, unless you are performing queries on more than 4-5 genomes at once.

Installation

Installation simply consists of unzipping the download package, using the command
     > tar -xzvf symap_42.tar.gz
This can be done anywhere and creates a directory called symap_42. You can move this directory later if desired.

To run SyMAP, change into the symap_42 directory and run the command

     > ./symap 

Running the Demo

If you have not used SyMAP before, it is essential to run the demos. You can run them immediately after unzipping the package; on Linux it is not necessary to install MySQL, because a Java-controlled version of the database is included in the package.

Change into the symap_42 directory.
  1. If you have MySQL on your machine, then edit symap.config and enter database and host information (see SyMAP and MySQL).

  2. If you have Java 3D on your machine,
    then enter ./symap
    else enter ./symap -no3d

    Many lines of text will print to the console, as SyMAP launches the supplied MySQL database. When this is done, the Project Manager window opens, shown at right.


 
There are four demo projects listed on the left. Check "Demo-Seq" and "Demo-Seq2". A link "Load All Projects" will be displayed in the top of the right panel; select it to load the projects, which will take several minutes; when done, the Manager will look as shown in the image.

In the "Available Syntenies" table, click the cell for the "Demo_Seq2" row and the "Demo_Seq" column. If you have two CPUs available, set the "CPUs" setting to "2". Then click the "Selected Pair" button to start the alignment.

 

The alignment will take about 30 minutes (15 with two CPUs).

When done, the table will have a checkbox, signifying that the synteny is available for viewing. Click the cell again to enable the viewing buttons, e.g. "Dot Plot".

 
Click "Dot Plot" and you will see the dot plot shown here. By clicking and/or selecting regions you can zoom into certain regions and bring up detailed views of the alignments. The Help button (question mark) provides full information on the functions.
 
Return to the Manager, and click the "Chromosome Explorer" button. This brings up the Explorer, shown at right.
(If this does not work, restart with ./symap -no3d; the 3D view on the right will then not be available.)

Here you can pick different sets of chromosomes, using the small icons at left, and see them in different views. At first you only see the reference chromosome, which is initially chr3 from Demo-Seq (the reference has a red box around its number).

Click on the icons for Demo-Seq2 chr1 and chr3, to see the 3D view at right. The ribbons represent synteny blocks (green for inverted).

 
Click the "Circle" button to see a Circos-style [3] display of the same chromosomes:
 
Click the 2D button to see a side-by-side view of the same chromosomes. Note that the reference is in the middle. Brown lines show the individual anchors (see How SyMAP Works).
 
Selecting a region on one of the sequence tracks using the mouse zooms to that region.

Now the annotation icons (blue) for individual genes can be seen in the center of each sequence track.

If you zoom in even closer, you can click the "Sequence Filter" button at the top of a sequence track (or right-click in the sequence) and check "Show Descriptions for Annotations" to see the annotation text for each gene.

 
Returning to the Manager, click the "SyMAP Queries" button. This brings up the SyMAP Query window in Overview mode. Click the "Query Setup" option on the left-hand side and you will see the Query Setup window:

The query does two basic things:
A. Locate syntenic regions based on annotation
B. Create putative gene families across the species by grouping the genes (or regions) which are connected by anchors.

 
Enter "glycosyl" for the "Annotation String Search" and press "Do Search".

55 results are returned. Click the "PgFSize" column header to sort by this column, giving the table at right:

Each row is an anchor connecting two of the chromosomes. At the top of the table are 7 anchors grouped into a putative family (PgeneF=25 in the image, but it may be numbered differently when you run it).

The rows with a non-empty "BlockNum" are anchors involved in synteny blocks. You could restrict the query to only these anchors, if desired. Synteny anchors are more likely to represent a true ancestral relationship; however, synteny blocks cannot always be detected in sparser regions.

If you query with more than two species, you can ask interesting questions such as "show me the glycosyl-related gene families which are present in species A and B but not in species C".

 
The remainder of the demo relates to draft sequence, i.e. unanchored shotgun sequence. If you aren't working with draft sequence, you can skip to Creating a New Project.
 
Return to the Manager and load the Demo-Draft project, following the same steps used to load the previous two projects.

On the Summary List, under the Demo-Draft listing, you will see the parameter "Order Against: demo_seq". With this setting, the Demo-Draft contigs will be ordered using synteny to Demo-Seq, as soon as that alignment is run.

 
Use the "Selected Pair" button to align Demo-Draft and Demo-Seq, as before. The alignment will take about 20 minutes.

When done, open the dot plot for this pair and you will see that the draft contigs have been ordered and oriented to agree with the Demo-Seq. (For real projects the agreement will not be so good!)

 
The ordering on Demo-Draft is in the database only; it does not change the sequence files on disk.

However, SyMAP does write out ordered "pseudomolecule" files created from the draft contigs. These are put in the form of a new SyMAP project which can be loaded and aligned. You will see this new project in the Projects panel, as shown at right.

A text file showing the ordering information is also written to the Demo_Draft project directory, which you can find under the data/pseudo subdirectory of the symap directory.

This is the end of the demo.
At this point you will probably want to proceed by reading the next section to learn how to create your own project in SyMAP.

Creating a New Project

If MySQL is installed, edit symap.config as described in SyMAP and MySQL.

To create a new project, start symap (i.e. ./symap) and press the "Add Project" button at the lower left. Enter the name and type of the project. The Help button on the dialog provides further information.

After saving the new project, it appears in the Projects list on the left, but it is still an empty shell. Check its box and it will appear in the Summary section (right hand side) where you will then click the "Parameters" link to open the Parameters window.

On the Parameters window you will add the filenames for the sequences and annotations for the project, as well as setting other parameters if desired. The Help button on this window provides the necessary details.

After setting parameters, the project is ready to be loaded and aligned using the same steps as in the demos.

Preparing the Sequences

The first decision with whole-genome sequence is whether or not to apply repeat masking. Masking reduces alignment time and false-positive hits, but also runs a risk of concealing true hits due to inaccurate masking. Masking also requires considerable time.

Masking is not really necessary unless the genome is highly repetitive and those repeats are shared with other genomes being aligned. (Repeats cause particular trouble for self-alignments, see self-alignments in SyMAP).

Another masking option which is available if you have gene annotation is to mask out everything but the annotated genes. You can enable the "mask_genes" option on the Parameters window for your project (turn it on before doing the alignments).

Note that sequence files should be in FASTA format and the name of a sequence is the string immediately following the ">", e.g.

>chr3  oryza sativa
GAATTCGAATTTGGGTAATGCTAATCAATACAGGTCAAAATCTATGTATTGAGTGGAATATACTGCAAAGTAATTACCTT
CTTCCAAAGGAAAGCATTCCTTCTCTCTTGTGGGACTAGCAGATGATCTCGCAGCCAAGACGTGACCACCCAAGGCTCAC
...
Here the sequence name is "chr3", and the sequence follows. The additional information "oryza sativa" is ignored.
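The naming rule above can be checked before loading a project. The following is a minimal Python sketch, not part of SyMAP; the function name fasta_sequence_names is made up for illustration, and it assumes the name ends at the first whitespace character after the ">".

```python
def fasta_sequence_names(lines):
    """Yield the name of each FASTA record.

    The name is taken as the text immediately after '>' up to the first
    whitespace; anything after that on the header line is ignored.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # An empty header (just '>') yields an empty name.
            yield fields[0] if fields else ""


# Small in-memory example; a real file could be passed as open(path).
example = [
    ">chr3  oryza sativa",
    "GAATTCGAATTTGGGTAATGC",
    ">chr4",
    "CTTCCAAAGGAAAGCATTCC",
]
names = list(fasta_sequence_names(example))
```

Listing the names this way makes it easy to compare them with the first column of the annotation files before importing.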

Three things are important in naming sequences for SyMAP:
A. Sequence names can contain only letters, numbers, and underscores.
B. The sequence names must exactly match those used in the annotation files (first column), or the annotations will not be loaded.
C. If possible, use a consistent prefix such as "chr" for all sequences, followed by a short number. If this is not possible, see note below.
Note: If you are unable to meet condition C, then you should set the parameter "grp_prefix" to the prefix of the sequences. SyMAP will then remove the prefix and use the shortened names in the displays, saving space. (If you set "grp_prefix", and some sequences don't have the prefix, they will keep their full names.)
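As an illustration of conditions A and B and of the "grp_prefix" behavior described in the note, here is a small Python sketch. It is not SyMAP code: the helper names check_names and display_name are hypothetical, and the prefix handling only mirrors the description above (strip the prefix when present, otherwise keep the full name).

```python
import re

# Condition A: only letters, numbers, and underscores.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_names(seq_names, annot_names):
    """Return (invalid, unmatched).

    invalid   -- sequence names that break condition A
    unmatched -- annotation seqids with no identical sequence name (condition B)
    """
    invalid = [n for n in seq_names if not VALID_NAME.fullmatch(n)]
    unmatched = sorted(set(annot_names) - set(seq_names))
    return invalid, unmatched


def display_name(name, grp_prefix):
    """Shortened name as described for grp_prefix (assumed behavior)."""
    if grp_prefix and name.startswith(grp_prefix):
        return name[len(grp_prefix):]
    return name


invalid, unmatched = check_names(["chr1", "chr-2"], ["chr1", "Chr3"])
short = [display_name(n, "chr") for n in ["chr12", "scaffold_7"]]
```

Note that the comparison for condition B is case-sensitive here, in keeping with the requirement that names match exactly.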

Annotation Files

Annotation files should be in gff3 format. The first column (seqid) must exactly match the sequence names in the fasta files. The third column (type) determines how SyMAP uses the entry. Types "gene", "exon", "CDS", "centromere", and "gap" are recognized (other entries are ignored). "CDS" and "exon" are treated equivalently in SyMAP.

The last column (attributes) contains "tag=value" pairs describing the annotation. You can set which attributes to use, or use all those occurring more than a certain number of times (open the Parameters window for the project, look for parameter "annot_keywords").

Important: Only the annotation on "gene" entries is shown in the displays or used for searching. For entries "exon", "CDS", "gap", and "centromere", only the coordinates from the gff file are used; the annotation text is not read.
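The rules in this section can be summarized in a short Python sketch that pre-screens a gff3 file. This is an illustration only, not the SyMAP loader: the function names are hypothetical, and it assumes tab-separated nine-column lines and exact (case-sensitive) matching of the type column.

```python
RECOGNIZED_TYPES = {"gene", "exon", "CDS", "centromere", "gap"}


def parse_gff3_line(line):
    """Return (seqid, type, start, end, attributes), or None for
    comment, blank, or malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag.strip()] = value.strip()
    return seqid, ftype, int(start), int(end), attributes


def usable_entries(lines, sequence_names):
    """Yield the entries that would be used under the rules above."""
    for line in lines:
        rec = parse_gff3_line(line)
        if rec is None:
            continue
        seqid, ftype, start, end, attributes = rec
        if ftype not in RECOGNIZED_TYPES or seqid not in sequence_names:
            continue
        if ftype != "gene":
            attributes = {}  # only coordinates are used for non-gene entries
        yield seqid, ftype, start, end, attributes


sample = [
    "##gff-version 3",
    "chr3\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1;product=glycosyl hydrolase",
    "chr3\tdemo\texon\t100\t300\t.\t+\t.\tParent=g1",
    "chr3\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=m1",
    "chrX\tdemo\tgene\t5\t50\t.\t+\t.\tID=g2",
]
entries = list(usable_entries(sample, {"chr3"}))
```

In the sample, the mRNA line is dropped because its type is not recognized, and the chrX line is dropped because no sequence has that name.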

Working with FPC Files

SyMAP can also align genomes represented by an FPC physical map, by first aligning the BACs using BAC-end sequences or marker sequences. If you plan to work with FPC alignments, the first step is to run the provided demo "Demo-FPC". Align it to "Demo-Seq2" using the same steps as described above, and explore the various displays.

Creating an FPC project is the same as for a sequence project except that you choose the type "fpc", and the Project Parameters window then has some different parameters. The Parameters window is where you will enter the FPC file and your fasta files of marker and BAC-end sequences.

Note that the BAC-end sequence names must be exactly the clone names used in FPC, with extension "r" or "f" labeling the strand. In other words if the FPC map has a clone "a0435B26" then the BES for that clone can be named "a0435B26f" or "a0435B26r".
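The naming convention can be checked with a few lines of Python. This is a sketch under the assumption that the extension is exactly one trailing lowercase "r" or "f" with no separator; the function name bes_clone is hypothetical.

```python
def bes_clone(bes_name, clone_names):
    """Return the FPC clone that a BES name refers to, or None if the
    name does not follow the clone + 'r'/'f' convention."""
    if len(bes_name) > 1 and bes_name[-1] in ("r", "f"):
        clone = bes_name[:-1]
        if clone in clone_names:
            return clone
    return None


clones = {"a0435B26"}
checked = {name: bes_clone(name, clones)
           for name in ["a0435B26f", "a0435B26r", "a0435B26.r", "a0435B26"]}
```

Names that map to None here (for example one with an extra "." before the extension) would need to be renamed before loading.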

The BES and marker alignments in an FPC project are performed using BLAT [1], in contrast to MUMmer [2] for sequence projects. The running time is typically several times longer than that of MUMmer (described in Runtime and Memory), but the memory usage is much lower.

SyMAP and MySQL

SyMAP stores all data in a MySQL server. The package comes with a version of MySQL that is suitable for the demo or a small project, but not for large projects or for web display. For any substantial work we recommend using a standalone MySQL installation. Note, if you are using MacOSX 10.7+, the pre-packaged version of MySQL does not work, and cannot be upgraded since the Java/MySQL module is no longer supported. You will need to install MySQL.

The MySQL installation does not need to be on the machine where you will do the computations or view the results, as long as it is on an accessible network. Once the server is ready, fill out the database parameters in the "symap.config" file in the main SyMAP directory, as described in Database Parameters.

Note that it is a good idea to have separate admin and client usernames, where the client has read-only access. If you set up the web displays, they will use the client username. The "admin" user needs to have sufficient permissions to create a new database.

Note that the default settings of MySQL are poorly suited for large-scale data storage. You will want to adjust the parameters innodb_buffer_pool_size and innodb_flush_log_at_trx_commit as described here.

MySQL on the Mac:
Download and install the "MySQL Community Server" from http://dev.mysql.com/downloads/mysql/. Also install the Preferences Panel, and use that to start the server. In symap.config, use

db_name             = symap
db_server           = localhost
db_adminuser        = root
db_adminpasswd      = 
db_clientuser       = root
db_clientpasswd     = 
For large projects you will want to adjust the parameters as described here.

Runtime and Memory

The largest component of SyMAP execution time is in running MUMmer [2] (or BLAT [1]). The typical runtime is one CPU-hour per target sequence (or group of chromosomes, since SyMAP will group shorter sequences together for efficiency).

For example, aligning maize (10 chromosomes, 2Gb) to rice (12 chromosomes, 370Mb) required 1 hour, 3 minutes using 8 CPUs running at 2.3GHz. SyMAP grouped the shorter rice chromosomes into 8 groups and each processor handled one.

The memory usage of MUMmer is typically 5G per CPU, but it can be as high as 10G for very long or repetitive chromosomes.
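The rules of thumb in this section can be turned into a rough planning estimate. The Python sketch below is only back-of-the-envelope arithmetic based on the figures quoted here (one CPU-hour per target group, 5G per CPU); the function name and the assumption that groups are processed in equal-length rounds are illustrative, not a description of how SyMAP schedules work.

```python
import math


def estimate_alignment(target_groups, cpus, hours_per_group=1.0, gb_per_cpu=5.0):
    """Return (wall_clock_hours, peak_memory_gb) as a rough estimate."""
    # Each CPU handles one group at a time, so the groups run in rounds.
    rounds = math.ceil(target_groups / cpus)
    wall_hours = rounds * hours_per_group
    # Memory scales with the number of CPUs that are actually busy.
    busy_cpus = min(cpus, target_groups)
    memory_gb = busy_cpus * gb_per_cpu
    return wall_hours, memory_gb


# Maize-to-rice example: 8 target groups on 8 CPUs.
maize_rice = estimate_alignment(8, 8)
# The same job restricted to 2 CPUs.
two_cpus = estimate_alignment(8, 2)
```

For the maize-to-rice case the estimate is about one hour, in line with the measured 1 hour, 3 minutes; for very long or repetitive chromosomes, gb_per_cpu=10.0 would be the more cautious setting.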

Web Display

SyMAP's Java-based architecture makes it easy to share results over the web. All you have to do is create a single html file with the following project-browser tag (edited for your parameters):
<applet  archive='symapApplet.jar'
    code='symap.SyMAPBrowserApplet'
    name='SyMAP'
    codebase='http://URL_TO_JAR_FILE_DIRECTORY'
    width='100%'
    height='100%'
> 
    <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
    <param name='username' value=''> <!-- should be a READ-ONLY user!! -->
    <param name='password' value=''>
    <param name='title' value='SyMAP Project Browser'>
    <!-- param name='projects' value='maize, soy' --><!-- uncomment this line to show only certain projects -->
</applet>
You can also create custom pages showing dotplots, circle plots, and other displays using individual applet tags, following these examples:
Block View (two species):
 <applet code='blockview.BlockApplet'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>
Circle View (multi-species):
 <applet code='circview.CircApplet'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <param name='project3' value='soy'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>
Chromosome Explorer (multi-species):
 <applet code='symapCE.SyMAPAppletExp'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <param name='project3' value='soy'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>
Dot Plot (multi-species):
 <applet code='dotplot.DPApplet'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <param name='project3' value='soy'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>
Query Interface (multi-species):
 <applet code='symapQuery.SyMAPQueryApplet'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <param name='project3' value='soy'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>
Alignment Summary (two species):
 <applet code='symap.projectmanager.common.SummaryApplet'
        name='SyMAP'
        codebase='http://URL_TO_JAR_FILE_DIRECTORY'
        archive='symapApplet.jar' >
        <param name='database' value='jdbc:mysql://DATABASE_HOST_ADDRESS/DATABASE_NAME'>
        <param name='username' value=''>
        <param name='password' value=''>
        <param name='project1' value='grape'>
        <param name='project2' value='poplar'>
        <alt='The Java Applet is loading, please be patient'>
        </applet>

If one of the projects is an FPC project, indicate this with a "type" parameter whose number matches that of the project's name parameter. For example, if project1 is an FPC project:

<param name='type1' value='fpc'>
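If you maintain several such pages, the numbered project and type parameters can be generated rather than typed by hand. The following Python sketch only builds the param lines shown in the examples above; the function name project_params is hypothetical and nothing here is provided by SyMAP itself.

```python
from html import escape


def project_params(projects, fpc_projects=()):
    """Build the numbered <param> lines for an applet tag.

    projects     -- project names in display order
    fpc_projects -- names of those projects that are FPC projects
    """
    lines = []
    for i, name in enumerate(projects, start=1):
        value = escape(name, quote=True)
        lines.append("<param name='project%d' value='%s'>" % (i, value))
        if name in fpc_projects:
            # The type number must match the project number.
            lines.append("<param name='type%d' value='fpc'>" % i)
    return "\n".join(lines)


snippet = project_params(["grape", "poplar"], fpc_projects={"poplar"})
```

The resulting lines can be pasted between the opening and closing applet tags, after the database, username, and password parameters.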

Database Parameters

Parameters for accessing the MySQL database should be set in the "symap.config" file in the main symap directory, as follows:

db_name: Name of the MySQL database. SyMAP will create it if it does not exist yet. If it does exist, it should either already be a SyMAP database, or it should be empty.
db_server: The machine hosting the MySQL database, e.g. "myserver.myschool.edu".
db_adminuser: MySQL username of a user with sufficient privileges to create a database. Needed for creating and updating alignments.
db_adminpasswd: Password of the admin user.
db_clientuser: MySQL username of a user with read-only access. Used in the web displays, if installed, or if symap is launched with the "-r" parameter.
db_clientpasswd: Password of the client user.
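Before launching SyMAP it can be helpful to confirm that symap.config contains all of the parameters listed above. The Python sketch below assumes the simple "name = value" layout shown in the Mac example and treats lines starting with "#" as comments (an assumption; the file format is not fully specified here). The function names are hypothetical.

```python
KNOWN_KEYS = (
    "db_name", "db_server",
    "db_adminuser", "db_adminpasswd",
    "db_clientuser", "db_clientpasswd",
)


def parse_symap_config(text):
    """Parse 'name = value' lines into a dict; values may be empty."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the documented database parameters that are absent."""
    return [k for k in KNOWN_KEYS if k not in settings]


example = """
db_name             = symap
db_server           = localhost
db_adminuser        = root
db_adminpasswd      =
db_clientuser       = root
db_clientpasswd     =
"""
settings = parse_symap_config(example)
absent = missing_keys(settings)
```

In practice the text would come from reading symap.config in the main symap directory; an empty password is kept as an empty string rather than reported as missing.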

Self Alignments and SyMAP

Because of its reliance on MUMmer, SyMAP has to follow a slightly different procedure in carrying out a self-alignment. MUMmer ordinarily seeds its alignments with unique matches, which eliminates the possibility of off-diagonal seeds in the alignment of a chromosome to itself. To overcome this problem, SyMAP v4.2 runs the individual chromosome self-alignments using the MUMmer parameter -maxmatch, which removes the uniqueness requirement at the cost of greatly increased noise. The extra noise is then filtered to a large extent by the default SyMAP filters, but the diagonal squares of the dot plot will still have more noise visible than the off-diagonal. An example of such chromosome self-synteny blocks can be seen on chromosome 12 of soybean in the fabaceae section of SymapDB. Arabidopsis chromosome 1 also has good examples.

How SyMAP Works

This section provides a brief overview of the SyMAP processing steps; for more, see the SyMAP published papers [4,5]. The processing has four phases:
Alignment:
The sequences are written to disk*, with gene-masking if desired. In the alignment, one species is the "query" and the other is the "target". If one project is FPC, that is the query; if both are sequence projects, the query is the one whose name comes first alphabetically. The query sequences are written into one large file, while smaller target sequences are grouped into larger fasta files of up to 70Mb, for more efficient processing in MUMmer.
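The query/target choice and the grouping of target sequences can be sketched in Python as follows. This is an illustration under stated assumptions, not SyMAP's implementation: the text does not say how sequences are assigned to groups, so a simple in-order fill is used, and the "fpc"/"seq" type labels and function names are hypothetical.

```python
def pick_query(project_a, project_b):
    """Each project is (name, type). An FPC project is the query;
    otherwise the alphabetically first name is the query."""
    (name_a, type_a), (name_b, type_b) = project_a, project_b
    if (type_a == "fpc") != (type_b == "fpc"):
        return name_a if type_a == "fpc" else name_b
    return min(name_a, name_b)


def group_targets(seq_lengths, max_group_bp=70_000_000):
    """Group (name, length) pairs, in order, into lists whose total
    length stays within max_group_bp. A sequence longer than the limit
    ends up in a group of its own."""
    groups = []
    current, current_bp = [], 0
    for name, length in seq_lengths:
        if current and current_bp + length > max_group_bp:
            groups.append(current)
            current, current_bp = [], 0
        current.append(name)
        current_bp += length
    if current:
        groups.append(current)
    return groups


query = pick_query(("rice", "seq"), ("maize", "seq"))
groups = group_targets([
    ("chr1", 43_000_000), ("chr2", 36_000_000),
    ("chr3", 30_000_000), ("chr4", 90_000_000),
])
```

Each resulting group corresponds to one target fasta file, and therefore to roughly one unit of alignment work per CPU.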

Anchor Clustering:
The raw anchor set consists of the hits found by MUMmer or BLAT. These are first clustered into gene or putative-gene hits. This is done by clustering the hit regions on each sequence, and then defining new "gene" hits which connect these regions. For example, if three separate exons hit between two genes, they will be clustered into one "gene" hit having a combined score equal to the sum of the raw hit scores. Clustering is by gene if the hits overlap annotation; otherwise, it uses a maximum separation of 1kb, creating "putative gene" regions.
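The distance-based part of this clustering can be illustrated with a one-dimensional Python sketch. It is a simplification and not SyMAP's code: it merges hit regions on a single sequence when the gap between them is at most 1kb and sums their scores, and it does not model the annotation-based clustering or the step that connects regions across the two sequences. The function name is hypothetical.

```python
def cluster_hits(hits, max_gap=1000):
    """Merge (start, end, score) hits on one sequence.

    Hits are merged when the next hit starts within max_gap bases of the
    end of the current cluster; the cluster score is the sum of the raw
    hit scores.
    """
    clusters = []
    for start, end, score in sorted(hits):
        if clusters and start - clusters[-1][1] <= max_gap:
            c_start, c_end, c_score = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), c_score + score)
        else:
            clusters.append((start, end, score))
    return clusters


# Three nearby "exon" hits and one distant hit.
raw_hits = [(5000, 5200, 40), (100, 300, 50), (900, 1100, 30), (1500, 1700, 20)]
regions = cluster_hits(raw_hits)
```

Here the three nearby hits become one putative-gene region with the combined score, while the distant hit stays separate.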

Anchor Filtering:
The clustered "gene anchors" are now filtered using a version of reciprocal-best filtering which is adapted for retaining duplications and gene families. For each pair of genes (or putative genes) which is connected by a clustered anchor, the retained anchors must be among the top two anchors by score on both sides (top-2 allows for one ancestral whole-genome duplication). An anchor will also be retained if its score is at least 80% of that of the 2nd-best anchor on each side (this allows for retention of gene family anchors). These filter parameters may be adjusted through the Alignment & Synteny Parameters window.
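One way to read the filter described above is sketched below in Python. It is an interpretation, not SyMAP's implementation: an anchor passes on one side if it ranks in the top two for that gene or scores at least 80% of the second-best anchor there, and it is kept only if it passes on both sides. Since a top-two anchor always scores at least as much as the second-best, both cases reduce to a single threshold test. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def filter_anchors(anchors, top_n=2, keep_fraction=0.8):
    """anchors: list of (gene_a, gene_b, score). Returns the kept anchors."""
    by_a, by_b = defaultdict(list), defaultdict(list)
    for gene_a, gene_b, score in anchors:
        by_a[gene_a].append(score)
        by_b[gene_b].append(score)

    def passes(scores, score):
        ranked = sorted(scores, reverse=True)
        if len(ranked) <= top_n:
            return True  # everything is within the top N
        cutoff = ranked[top_n - 1]  # score of the 2nd-best anchor
        # Covers both "in the top N" and "within 80% of the Nth best".
        return score >= cutoff * keep_fraction

    return [(a, b, s) for a, b, s in anchors
            if passes(by_a[a], s) and passes(by_b[b], s)]


anchors = [("a1", "b1", 100), ("a1", "b2", 90), ("a1", "b3", 75), ("a1", "b4", 40)]
kept = filter_anchors(anchors)
```

In this example the anchor scoring 75 is retained because it is above 80% of the second-best score (90), while the anchor scoring 40 is discarded.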

Synteny Block Detection:
After the clustered anchors are loaded into the database, the synteny block algorithm runs. This algorithm looks for approximately collinear sequences of anchors, subject to several parameters including A) Number of anchors; B) Collinearity of the anchors; C) Amount of "noise" in the surrounding region (to help reject false-positive chains). Criterion A can be adjusted in the Alignment & Synteny Parameters window.

* Note that the sequences are re-written from the database to the disk for three reasons: A) To allow re-grouping for efficiency; B) To ensure elimination of invalid characters; C) To mask non-gene regions, if desired. This also ensures that sequence names will match those in the database, and prevents problems caused by moving the source sequences on disk.

References

[1] Kent, J. (2002) BLAT--the BLAST-like alignment tool. Genome Research 12:656-64.

[2] Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.L. (2004) Versatile and open software for comparing large genomes. Genome Biology 5:R12.

[3] Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S., Marra, M. (2009) Circos: An information aesthetic for comparative genomics. Genome Research doi:10.1101/gr.092759.109.

[4] Soderlund, C., Nelson, W., Shoemaker, A., and Paterson, A. (2006) SyMAP: A system for discovering and viewing syntenic regions of FPC maps. Genome Research 16:1159-1168.

[5] Soderlund, C., Bomhoff, M., and Nelson, W. (2011) SyMAP: A turnkey synteny system with application to plant genomes. Nucleic Acids Research 39(10):e68.

Email Comments To: symap@agcol.arizona.edu