PBIL

HOMOLENS : Homologous Sequences in Ensembl Animal Genomes

HOMOLENS release 03 (January 2007)

Release information: Protein, Nucleotide

Previous release

HOMOLENS is a database of homologous genes from Ensembl organisms, structured under the ACNUC sequence database management system. It allows users to select sets of homologous genes among species, and to visualize multiple alignments and phylogenetic trees. It is also possible to search for orthologous genes in a wide range of taxa. HOMOLENS is thus particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOMOLENS gives an overall view of what is known about a particular gene family. Note that HOMOLENS is split into two databases on this server: HOMOLENS contains the protein sequences, while HOMOLENSDNA contains the nucleotide sequences. The protein sequences of HOMOLENS were generated by translating the CDS of HOMOLENSDNA, and their annotations were generated from the associated cross-references.

New (02/2008): a new tree viewer application is available.

Query HOMOLENS using web applications

Access to several ACNUC databases: EMBL, GenBank, SwissProt, HOVERGEN, HOGENOM, HOMOLENS, etc.

Query HOMOLENS proteins

You may enter any word (sequence name, keyword, species, ...) to search for protein sequences or for protein families.
Check the "exact match" box if you want to report exact matches only.

Query HOMOLENS nucleotide sequences

You may enter any word (sequence name, keyword, species, ...) to search for CDS sequences or for CDS families.
Check the "exact match" box if you want to report exact matches only.

Query HOMOLENS using BLAST

You may blast your sequence against several databases at PBIL.

Query HOMOLENS using HoSeqI

You may search for the HOMOLENS family that is closest to your sequence. The associated alignment and phylogenetic tree are generated automatically.
  • HoSeqI: allows you to retrieve a protein family from HOMOLENS

Orthologs search

You can retrieve orthologous and paralogous genes with the FamFetch application. This powerful tool lets you query the phylogenetic tree database with a complex, user-built tree pattern that includes duplication and speciation events. A command-line version of FamFetch is also available. Alternatively, precalculated orthologous animal genes are available from the HOMOLENS database.

Access to HOMOLENS

You can query the database on this page or through several access methods:

Contents

Organisms

HOMOLENS is built from the Ensembl database:
  • Ensembl (Release 41): animals from the EBI:
    • Aedes aegypti
    • Anopheles gambiae
    • Apis mellifera
    • Bos taurus
    • Caenorhabditis briggsae
    • Caenorhabditis elegans
    • Canis familiaris
    • Ciona intestinalis
    • Ciona savignyi
    • Danio rerio
    • Dasypus novemcinctus
    • Drosophila melanogaster
    • Echinops telfairi
    • Gallus gallus
    • Gasterosteus aculeatus
    • Homo sapiens
    • Loxodonta africana
    • Macaca mulatta
    • Monodelphis domestica
    • Mus musculus
    • Oryctolagus cuniculus
    • Oryzias latipes
    • Pan troglodytes
    • Rattus norvegicus
    • Saccharomyces cerevisiae
    • Takifugu rubripes
    • Tetraodon nigroviridis
    • Xenopus tropicalis
Data are modified and re-annotated: sequence names are modified according to the organism, taxonomy fields are corrected when they are inconsistent or inaccurate, and gene family, GC content, internal intron, 3'UTR and 5'UTR information is added to the annotations.

Sequences, Families, Alignments, Phylogenetic trees

Number of proteins: 474,339
Number of CDS: 659,580
Number of genomic sequences: 81,903
Number of families (at least 2 sequences): 28,068
Number of orphans: 72,545 (15%)
Number of protein sequences associated to a family: 401,794 (84%)

Alignments and phylogenetic trees have been calculated for 28,021 families containing 2 to 2000 sequences. Phylogenetic trees are calculated with the program PHYML (substitution model = JTT, estimated proportion of invariable sites, 4 rate categories, estimated gamma shape parameter, initial tree built with BIONJ) on conserved blocks of the MUSCLE alignments selected with GBLOCKS.

Paralogy/Orthology Events Assignment

The phylogenetic tree of each gene family is analysed using RapMasse, which assigns a duplication or speciation event to each node by comparison with the species tree.
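
The idea of labelling gene-tree nodes by comparison with a species tree can be illustrated with the classic lowest-common-ancestor (LCA) reconciliation rule: a node is a duplication when it maps to the same species-tree node as one of its children, and a speciation otherwise. The sketch below is only an illustration of that general rule, not the RapMasse algorithm itself. The nested-tuple tree representation, the "GENE_species" leaf naming and all function names are assumptions made for this example.

```python
# Minimal sketch of LCA-based duplication/speciation labelling.
# Trees are nested tuples and leaves are strings. Gene leaves are named
# "GENE_species" (a hypothetical convention used only in this example).

def species_paths(tree, path=(), out=None):
    """Map each species leaf to its root-to-leaf path of child indices."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = path
    else:
        for i, child in enumerate(tree):
            species_paths(child, path + (i,), out)
    return out

def lca(path_a, path_b):
    """The longest common prefix of two paths is their lowest common ancestor."""
    n = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        n += 1
    return path_a[:n]

def label_events(gene_tree, paths):
    """Return a post-order list of (leaves under node, event) for internal nodes."""
    events = []

    def visit(node):
        if isinstance(node, str):
            # A leaf maps to the species encoded after the last underscore.
            return paths[node.rsplit("_", 1)[1]], [node]
        child_maps, leaves = [], []
        for child in node:
            mapped, child_leaves = visit(child)
            child_maps.append(mapped)
            leaves.extend(child_leaves)
        node_map = child_maps[0]
        for mapped in child_maps[1:]:
            node_map = lca(node_map, mapped)
        # Duplication: the node maps to the same species node as a child.
        event = "duplication" if node_map in child_maps else "speciation"
        events.append((tuple(leaves), event))
        return node_map, leaves

    visit(gene_tree)
    return events

species_tree = (("human", "mouse"), "chicken")
gene_tree = ((("A1_human", "A1_mouse"), ("A2_human", "A2_mouse")), "A_chicken")
for leaves, event in label_events(gene_tree, species_paths(species_tree)):
    print(event, leaves)
```

In this toy family, the node joining the A1 and A2 subtrees is labelled as a duplication because both subtrees already span human and mouse, while the other internal nodes are speciations.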

HOMOLENS is now available. You can make requests on the protein and the genome data via our web server or via the socket server.

Server mirroring

You do not need to install the server itself to have HOMOLENS running on your computer: the client is enough for that purpose. However, you may want to set up your own server to speed up your database access and to offer that service to potential users in your geographic area. To install a HOMOLENS server, you first need to register. The results of the registration page give you access to the server installation procedure.

The whole database will soon be available from our FTP server at URL: ftp://pbil.univ-lyon1.fr/pub/homolens/ Note that it is much more efficient to download the database with a dedicated FTP client than with a web browser.

Important note:
Some entries such as those found in HOMOLENS are copyrighted since they were initially produced by EBI, SIB, NCBI, JGI, TIGER. There are no restrictions on their use by non-profit institutions as long as their content is in no way modified. Usage by and for commercial entities requires a license agreement.

Family Building

To build the families, we perform a similarity search of all the proteins against each other with BLASTP2. For this purpose, we use the BLOSUM62 similarity matrix and an E-value threshold of 10^-4. Low-complexity sequences are filtered with SEG. The results are then processed as follows:
  • For each pair of sequences, High-scoring Segment Pairs (HSPs) that are not compatible with a global alignment are removed.
  • Two sequences in a pair are included in the same family if:
    • The remaining HSPs cover at least 80% of the proteins' length.
    • Their similarity is greater than or equal to 50% (two amino acids are considered similar if their BLOSUM62 similarity score is positive).
  • Families are built using simple transitive links: if a pair of sequences A + B and a pair of sequences B + C fulfill the conditions listed above, then A, B and C are placed in the same family, even if the pair A + C does not fulfill these conditions.
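
The transitive grouping described above amounts to single-linkage clustering, which can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not the actual HOMOLENS pipeline: the input format (dictionaries with the hypothetical keys "a", "b", "coverage_a", "coverage_b" and "similarity") is invented for the example, and it assumes that the 80% coverage criterion applies to both proteins of a pair.

```python
from collections import defaultdict

MIN_COVERAGE = 0.80    # remaining HSPs must cover at least 80% of the length
MIN_SIMILARITY = 0.50  # at least 50% similar residues (positive BLOSUM62 score)

def passes(pair):
    """Check the two inclusion criteria for one pair of sequences.

    Assumption: coverage is required on both proteins of the pair.
    """
    return (min(pair["coverage_a"], pair["coverage_b"]) >= MIN_COVERAGE
            and pair["similarity"] >= MIN_SIMILARITY)

def build_families(sequences, pairs):
    """Group sequences by transitive links; return (families, orphans)."""
    parent = {s: s for s in sequences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in pairs:
        if passes(pair):
            root_a, root_b = find(pair["a"]), find(pair["b"])
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for s in sequences:
        groups[find(s)].add(s)
    families = [g for g in groups.values() if len(g) >= 2]
    orphans = [s for g in groups.values() if len(g) == 1 for s in g]
    return families, orphans

pairs = [
    {"a": "A", "b": "B", "coverage_a": 0.95, "coverage_b": 0.90, "similarity": 0.70},
    {"a": "B", "b": "C", "coverage_a": 0.85, "coverage_b": 0.88, "similarity": 0.55},
    {"a": "A", "b": "C", "coverage_a": 0.60, "coverage_b": 0.65, "similarity": 0.40},
    {"a": "C", "b": "D", "coverage_a": 0.50, "coverage_b": 0.90, "similarity": 0.80},
]
families, orphans = build_families(["A", "B", "C", "D"], pairs)
print(families, orphans)
```

Here A and C end up in the same family through B even though the pair A + C fails both criteria, while D remains an orphan because its only hit does not reach the coverage threshold.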


Sequence Annotations

Family annotation

Protein sequences: for each entry, we add a line in the CC field that gives the number of the family the sequence belongs to:
CC   -!- GENE_FAMILY: HBG017522.
Genomic sequences: for each coding sequence, we add a qualifier that gives the number of the family the gene belongs to:
FT                   /gene_family="HBG017522"

This number is incorporated in the keywords associated with the corresponding entry in the ACNUC database structure. It is therefore possible to retrieve all the sequences of a family by its number, using the retrieval system Query or its on-line version WWW-Query.
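
For readers working directly with exported flat-file entries rather than with Query, the family number can also be read from the two annotation lines shown above. The following is a small sketch that assumes the lines look exactly like those examples. The function and pattern names are hypothetical and are not part of ACNUC or of any HOMOLENS tool.

```python
import re

# Patterns modelled on the two example lines above (an assumption: real
# entries may differ in spacing or wrap long lines).
FAMILY_PATTERNS = (
    re.compile(r'^CC\s+-!-\s+GENE_FAMILY:\s*(\w+)\.?\s*$'),  # protein entries
    re.compile(r'^FT\s+/gene_family="(\w+)"\s*$'),           # CDS qualifiers
)

def gene_families(lines):
    """Return the family numbers found in a list of flat-file lines."""
    found = []
    for line in lines:
        for pattern in FAMILY_PATTERNS:
            match = pattern.match(line)
            if match:
                found.append(match.group(1))
                break
    return found

entry = [
    "CC   -!- GENE_FAMILY: HBG017522.",
    'FT                   /gene_family="HBG017522"',
    "FT   3'ncr           2278..2368",
]
print(gene_families(entry))
```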

GC content and intron information annotations

In the genomic sequences, we include the GC content of each coding sequence:
FT                   /%(C+G)="CG<35%"
FT                   /note="C+G content in third codon positions = 31.4 % "
It is thus possible to select sequences according to their GC content.
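
The quantity reported in the note, the C+G content at third codon positions, can be computed as in the sketch below. It assumes an in-frame CDS given as a plain string starting at the first codon position; the function name is hypothetical, and the boundaries of the classes used in the /%(C+G) qualifier are not reproduced here because they are not given on this page.

```python
def gc3_percent(cds):
    """Percentage of G or C at third codon positions of an in-frame CDS."""
    third_positions = cds.upper()[2::3]  # bases 3, 6, 9, ... (1-based)
    if not third_positions:
        raise ValueError("CDS is shorter than one codon")
    gc_count = sum(base in "GC" for base in third_positions)
    return 100.0 * gc_count / len(third_positions)

# Codons ATG GCC AAA TAA have third positions G, C, A, A.
print(gc3_percent("ATGGCCAAATAA"))
```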

We also include in genomic sequences descriptions of non-coding regions:

  • INT_INT: internal introns (i.e. within CDS)
  • 5'INT: introns in 5'UTR
  • 3'INT: introns in 3'UTR
  • 5'NCR: 5' non-coding region
  • 3'NCR: 3' non-coding region
For example:
FT   3'ncr           2278..2368
These subsequences can be selected and extracted from the database in the same way as CDS, using WWW-Query (see Help).
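
Outside WWW-Query, a feature line such as the one above can be turned into a subsequence with a few lines of code. The sketch below handles only simple "start..end" locations on the forward strand, as in the example; compound or complemented locations are deliberately not covered, and the function names are hypothetical.

```python
import re

# Matches a feature line with a simple start..end location (an assumption:
# locations using operators such as join or complement are not handled).
FEATURE_RE = re.compile(r"^FT\s+(\S+)\s+(\d+)\.\.(\d+)\s*$")

def parse_feature(line):
    """Return (feature key, start, end) with 1-based inclusive coordinates."""
    match = FEATURE_RE.match(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))

def extract(sequence, start, end):
    """Extract a 1-based inclusive range from a sequence string."""
    return sequence[start - 1:end]

key, start, end = parse_feature("FT   3'ncr           2278..2368")
print(key, start, end, end - start + 1)
```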

Contact and reference

If you encounter problems when installing or using HOMOLENS, please contact Laurent Duret or Simon Penel. We also welcome any comments or suggestions on the database and/or its interface.

Acknowledgements

Calculations have been done at the IN2P3 Computing Center.

References


If you use families from HOVERGEN, HOMOLENS or HOGENOM, please cite:
  • Duret, L., Perrière, G. and Gouy, M. (1999) "HOVERGEN: database and software for comparative analysis of homologous vertebrate genes". In Bioinformatics Databases and Systems, Letovsky, S. (ed.), Kluwer Academic Publishers, Boston, pp. 13-29.
  • Dufayard, J.F., Duret, L., Penel, S., Gouy, M., Rechenmann, F. and Perrière, G. (2005) "Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases". Bioinformatics, vol. 21, pp. 2596-2603.