The University of Arizona
Ensembl convert to SyMAP  

Overview

Ensembl supplies FASTA-formatted files for the genome sequence and GFF-formatted files for the annotation. The following steps produce correctly formatted files for SyMAP.

Contents
Download
Convert files
Load files into SyMAP
    What the ConvertEnsembl script does
    Editing the script

Download

  1. Go to Ensembl, which shows all species for which Ensembl has a genome. For plants and fungi, see EnsemblPlant and EnsemblFungi.
  2. Select your species.
  3. Select "Download DNA sequence (FASTA)". This takes you to an FTP site. It is recommended that you download the "*.dna_sm.toplevel.fa.gz" file, as it contains the soft-masked chromosome sequences.
  4. Select "GFF3" from the "Download genes, cDNAs, ncRNA, proteins - FASTA - GFF3" line. This takes you to an FTP site. Download the *.chr.gff3.gz file.

Convert files

  1. Go to the symap_5/data/seq directory.
  2. Make a subdirectory for your species and move the FASTA and GFF files into the directory. Leave the "fa.gz" and "gff3.gz" suffixes on the files.
  3. Type the following at the command line to copy the ConvertEnsembl script to the current (data/seq) directory:
    cp ../../scripts/ConvertEnsembl.class .
  4. Execute
    java ConvertEnsembl <species>

Example

In the symap_5/data/seq directory, I made a subdirectory called "rice" and moved the rice .fa.gz and .gff3.gz files into it, then executed:
cp ../../scripts/ConvertEnsembl.class .
java ConvertEnsembl rice
This results in the following contents:
data/seq/rice/
      Oryza_sativa.IRGSP-1.0.45.chr.gff3.gz
      Oryza_sativa.IRGSP-1.0.dna_rm.toplevel.fa.gz
      annotation/
         gene.gff
         exon.gff
      sequence/
         genomic.fna

ConvertEnsembl optional flags:
Flag   Description   Details                                        Default
-v     Verbose       Print out header lines of skipped sequences    No print

Load files into SyMAP

The above scenario puts the files in the default SyMAP directories. When you start SyMAP, you will see your projects listed on the left of the panel. Check the projects you want to load, which causes them to be shown on the right of the SyMAP window. The default parameters for the converted Ensembl projects will probably be sufficient. Select "Load All Projects". Once loaded, you can run the synteny algorithm by selecting "All Pairs". If you want to compute self-synteny, you have to do that for each project individually with the "Selected Pair" button.

What the ConvertEnsembl script does

The following occurs in the data/seq/<project directory name> directory, where "project directory name" is the argument supplied to ConvertEnsembl:
  1. Reads the file ending in '.fa.gz' and writes a new file called sequence/genomic.fna with the following changes:
    1. Only sequences with the word "chromosome" in their ">" header line and a number after the ">" are copied to the genomic.fna file.
    2. The header line is replaced with ">ChrN" where N is 1,2...
  2. Reads the file ending in 'gff3.gz' and writes two new files called annotation/gene.gff and annotation/exon.gff, as follows:
    1. Only lines with the 'type' (3rd column) equal to 'gene' or 'exon' are read.
    2. The gene line is written to the gene.gff file with the following changes:
      1. The first column N is replaced with 'ChrN'.
      2. The last column 'attributes' only keeps the "ID=" and "description=" fields. The "description=" keyword is shortened to "desc=". If the description field contains "Source:...", it is removed.
    3. The exon line is written to the exon.gff file with the first column N replaced with 'ChrN'.
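
The FASTA rules in step 1 can be illustrated with a short Java sketch. This is not the ConvertEnsembl source: the class and method names are hypothetical, the sample headers are invented, and it assumes that N in ">ChrN" is the number found directly after the ">" in the Ensembl header. A real genome file would be streamed line by line (for example through java.util.zip.GZIPInputStream) rather than held in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the FASTA filtering rules; not the actual ConvertEnsembl code.
final class FastaFilterSketch {
    // A header qualifies only if a number directly follows the '>'.
    private static final Pattern NUMBERED = Pattern.compile("^>(\\d+)\\b.*");

    /** Returns ">ChrN" for a qualifying header, or null if the sequence should be skipped. */
    static String rewriteHeader(String header) {
        Matcher m = NUMBERED.matcher(header);
        if (!m.matches() || !header.contains("chromosome")) {
            return null;
        }
        // Assumption: N is taken from the header, not from a running counter.
        return ">Chr" + m.group(1);
    }

    /** Copies qualifying sequences (rewritten header plus sequence lines); drops the rest. */
    static List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean copying = false;
        for (String line : lines) {
            if (line.startsWith(">")) {
                String header = rewriteHeader(line);
                copying = header != null;
                if (copying) {
                    out.add(header);
                }
            } else if (copying) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented headers in the general style of Ensembl toplevel files.
        List<String> fasta = List.of(
            ">1 dna_sm:chromosome chromosome:X:1:1:8:1 REF", "ACGTACGT",
            ">Mt dna_sm:chromosome chromosome:X:Mt:1:4:1 REF", "GGCC",
            ">scaf9 dna_sm:scaffold scaffold:X:scaf9:1:4:1 REF", "TTAA");
        filter(fasta).forEach(System.out::println);
    }
}
```

In this sketch the "Mt" sequence is skipped because no number follows the ">", and the scaffold is skipped because its header does not contain "chromosome"; only the first sequence is written, under the header ">Chr1".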

Editing the script

The ConvertEnsembl.java code is supplied in the scripts directory. It is very simply written: it does not use external libraries and relies only on common programming techniques found in all programming languages.

Once you make your changes, execute:

javac ConvertEnsembl.java
You will need to have the JDK installed to use the 'javac' command.

Some reasons for editing:

  1. This script has been tested on files from 8 different plant genomes. However, there can be variations that may not be accounted for.
  2. You may prefer to keep more gene attributes than the script provides.
  3. You may want to replace the "Chr" with some other prefix.
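
As a rough guide to what edits 2 and 3 might involve, here is a minimal Java sketch of the gene-line rewrite with the prefix and the kept attributes passed in as parameters. It is not the ConvertEnsembl source: the class name, method name, and sample line are hypothetical, and it assumes that the removed "Source:..." text is a trailing "[Source:...]" note inside the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the gene-line rewrite; not the actual ConvertEnsembl code.
final class GeneLineSketch {
    /**
     * Rewrites one tab-separated GFF3 gene line.
     * prefix: text placed before the sequence name in column 1 (e.g. "Chr").
     * keep:   input attribute key mapped to output key; unlisted attributes are dropped.
     */
    static String convertGeneLine(String line, String prefix, Map<String, String> keep) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 GFF3 columns, got " + cols.length);
        }
        cols[0] = prefix + cols[0];

        List<String> kept = new ArrayList<>();
        for (String field : cols[8].split(";")) {
            int eq = field.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value pair
            }
            String key = field.substring(0, eq);
            String outKey = keep.get(key);
            if (outKey == null) {
                continue; // attribute not requested
            }
            String value = field.substring(eq + 1);
            if (key.equals("description")) {
                // Assumption: "Source:..." is a trailing "[Source:...]" note.
                int src = value.indexOf("[Source:");
                if (src >= 0) {
                    value = value.substring(0, src).trim();
                }
            }
            kept.add(outKey + "=" + value);
        }
        cols[8] = String.join(";", kept);
        return String.join("\t", cols);
    }

    public static void main(String[] args) {
        // Invented gene line for illustration only.
        String gene = "1\tsrc\tgene\t100\t900\t.\t+\t.\t"
                + "ID=gene:G1;biotype=protein_coding;"
                + "description=Example protein [Source:Example%3BAcc:X1];gene_id=G1";
        // Behaviour described above: keep ID and description (renamed to desc).
        System.out.println(convertGeneLine(gene, "Chr",
                Map.of("ID", "ID", "description", "desc")));
        // A possible edit: different prefix, and biotype kept as well.
        System.out.println(convertGeneLine(gene, "chr",
                Map.of("ID", "ID", "description", "desc", "biotype", "biotype")));
    }
}
```

With the first call the attributes column becomes "ID=gene:G1;desc=Example protein"; the second call also keeps "biotype=protein_coding" and writes "chr1" in the first column. If the prefix is changed for gene.gff, the same prefix would need to be used for exon.gff and for the FASTA headers so that sequence names still match.
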
Email Comments To: symap@agcol.arizona.edu