eggNOG mapper v2.0.2 to v2.0.8 - eggnogdb/eggnog-mapper GitHub Wiki (2024)

Overview
What's new in eggNOG-mapper v2
- development branches
- v2.0.8
- v2.0.7
- v2.0.6
- v2.0.5
- v2.0.4-rf1
- v2.0.3-rf1
- v2.0.2-rf1
- v2.0.1-rf1
- v2.0.1b
- v2.0.0
Requirements
- Software Requirements
- Storage Requirements
- Other Requirements
Installation
- Pypi version
- GitHub release
- Cloning a GitHub repository
- Setup
- Optional tools
Basic usage
A few recipes
Parameters
- General Options
- Input Data Options
- Gene Prediction Options
- Search Options
  - Search filtering common options
  - Diamond search options
  - MMseqs2 search options
  - HMMer search options
- Annotation Options
- Output options
Output format
- Output files
- Output fields
  - Seed orthologs file
  - Annotations file
  - Orthologs file
  - HMMer hits file
  - Sequences of predicted CDS
  - GFF of predicted CDS
  - Sequences without annotation
  - PFAM hits
Setting up large annotation jobs
- Phase 1. Homology searches
- Phase 2. Orthology and functional annotation
- Even larger jobs in large memory computers
Citation

Overview

EggNOG-mapper (a.k.a. emapper.py or just emapper) is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups (OGs) and phylogenies from the eggNOG database (http://eggnogdb.embl.de/) to transfer functional information from fine-grained orthologs only.

Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.

The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).

Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are available at https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb.

EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de

What's new in eggNOG-mapper v2

development branches

(no news)

v2.0.8

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.8

Added GFF decoration (--decorate_gff option), to create/modify a GFF including emapper hits and/or annotations.
Output file comments start with "##", making it easier to filter them out without removing the header (which starts with "#").
Fixed seed_orthologs header.
Fixed pident not shown in seed_orthologs output.
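
Because comments start with "##" and the header with a single "#", an output file can be loaded without losing the column names. The following is a minimal Python sketch, assuming a tab-separated file; the function name and the sample content (including the seed ortholog ID) are made up for illustration.

```python
import io

def read_emapper_table(handle):
    """Return (header, rows), skipping '##' comments but keeping the '#' header."""
    header, rows = None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or comment
        if line.startswith("#"):
            header = line[1:].split("\t")  # header line: drop the leading '#'
        else:
            rows.append(line.split("\t"))
    return header, rows

# Illustrative content only; real files have more columns.
sample = io.StringIO(
    "## comment line\n"
    "#query\tseed_ortholog\tevalue\tscore\n"
    "q1\t12345.protA\t1e-50\t180.2\n"
    "## another comment\n"
)
header, rows = read_emapper_table(sample)
```

For a real run, the same function could be called on an opened output file instead of the in-memory sample.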

v2.0.7

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.7

Added --trans_table (Diamond's --query-gencode, MMseqs2's --translation-table, Prodigal's -g/--trans-table) option, to specify a translation table for gene prediction and blastx searches.
Added --training_genome and --training_file options, to run Prodigal training mode.
Default search thresholds (pident, score, query and subject coverage) are set to None.
The seed_orthologs files from both Diamond and MMseqs2 now include percentage identity (pident), position of hits (qstart, qend, sstart, send), query coverage (qcov) and subject coverage (scov).
Added --outfmt_short option for Diamond, to run it producing only query, subject, evalue and score as output. This option could be useful to obtain better performance when no thresholds for pident, and query and subject coverages are used (see Diamond docs about traceback). Of course, seed_orthologs file will contain only those 4 fields.
Added subject coverage (target_coverage) to gff output from blastx based gene predictions.
Added --block_size (Diamond's -b/--block-size) and --index_chunks (Diamond's -c/--index-chunks) options.
Bug fixes.

v2.0.6

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.6

Added create_dbs.py script, to create diamond/mmseqs eggnog5 databases from a user-specified list of taxa.

v2.0.5

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.5

If the --translate option is used (along with --itype CDS), input sequences will be translated to proteins before searching with diamond "blastp", mmseqs "blastp" or hmmer. If --itype CDS is used without --translate, an error is raised for hmmer, whereas diamond and mmseqs are run in "blastx" mode.
Bug fix when running hmmer with only-numerical identifiers in input sequences
Other minor changes

v2.0.4-rf1

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.4-rf1

Gene prediction step using Prodigal.
Search and annotation of ORFs using diamond blastx or MMseqs2. This can be used to annotate ORFs of contigs without using Prodigal.
MMseqs2 support for the search step of eggNOG-mapper.
Parameters to allow users to control sensitivity of diamond/MMseqs2 searches.
Improved report of orthologs.
NCBITaxa support is now included in eggNOG-mapper without relying on ete3.
--md5 option, which can be used to add the md5 hash of the query as a new column in the annotations output file.
-m cache mode and -c FILE option, to annotate using an annotations file with md5 hashes as cached results. A FASTA file with the unannotated sequences is also output, which can be used in a subsequent conventional emapper annotation run.
"Bottom-top" orthology search if no proper orthologs are retrieved from a priori chosen best OG.
--go_evidence all option to report all GO terms.
--dbmem option to pre-load the eggnog.db sqlite3 DB into memory.

v2.0.3-rf1

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.3-rf1

New eggnog DB version 5.0.1 including PFAM annotations and PFAM HMMs.
Added expected eggNOG DB version, and a warning if the version found differs from the expected one.
Added PFAM annotations, which are directly transferred from orthologs.
Added --pfam transfer option to emapper.py.

v2.0.2-rf1

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.2-rf1

New --tax_scope modes.
Added eggNOG DB version to -v/--version option.

v2.0.1-rf1

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.1-rf1

All code migrated to Python 3. Therefore, python3 is now required to run eggnog-mapper scripts.
HMMER search capabilities moved to new scripts: hmm_search.py, hmm_server.py, hmm_worker.py. HMMER search options are still available through the emapper.py script, but only for searching against custom databases (no annotation), which is equivalent to using hmm_search.py.
Changes in output format.
Changes in available parameters and behaviour of existing ones.
Added some integration and unit tests

v2.0.1b

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.1b

Bug fixes, minor changes.

v2.0.0

https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.0

Expanded database of precomputed orthology assignments, now based on eggNOG v5.0. This includes 5,090 representative genomes (4445 bacteria, 168 archaea and 477 eukaryota), as well as 2502 viral proteomes.
HMMer search mode is deprecated. Read FAQ---Frequently-Asked-Questions#why-i-cannot-choose-hmmer-search-mode-in-version-20
Updated functional sources (e.g. KEGG, GeneOntology)
New output columns compared with eggNOG-mapper version 1 (see https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2).

Requirements

Software Requirements

Python 3.7 (or greater)
BioPython 1.76 (python package)
psutil 5.7.0 (python package, required only if using the HMMER server mode)
`wget` (linux command, required for downloading the eggNOG-mapper databases with download_eggnog_data.py)

Storage Requirements

~40 GB for the eggNOG annotation database (+ ~0.3 GB for taxa database)
~10 GB for Diamond database of eggNOG sequences (required if using -m diamond, which is the default search mode).
~90 GB for MMseqs2 database of eggNOG sequences (required if using -m mmseqs).
~3 GB for PFAM database (required if using --pfam_realign options for realignment of queries to PFAM domains).
The size of eggNOG diamond/mmseqs databases created with create_dbs.py is highly variable, depending on the size of the chosen taxonomic groups.
The size of eggNOG HMM databases is highly variable (check list of HMMER databases at http://eggnog5.embl.de/#/app/downloads).

Other Requirements

Using -m mmseqs requires at least 224-256 GB of RAM for the default DB, and is therefore only recommended for large datasets processed on large-memory systems. Databases created with create_dbs.py will require less RAM, depending on the size of the chosen taxonomic groups.
Using --dbmem loads the whole eggnog.db sqlite3 annotation database into memory during the annotation step, and therefore requires at least 44-48 GB of memory.
Using --pfam_realign denovo uses HMMER server mode when the number of queries is equal to or greater than 100, in which case the whole PFAM database is loaded into memory.
Also, using the --num_servers option when running HMMER in server mode (a.k.a. hmmpgmd, which is used for -m hmmer --usemem, --pfam_realign denovo or hmm_server.py) loads the HMM database as many times as specified in the argument (e.g. --pfam_realign denovo --num_servers 2 loads the PFAM database into memory twice).

Installation

Pypi version

pip install eggnog-mapper

GitHub release

Download the latest version of eggnog-mapper from the following link: https://github.com/eggnogdb/eggnog-mapper/releases/latest
Decompress the .tar.gz or .zip file
Enter the decompressed directory and install the dependencies, either with:
- setuptools: python setup.py install
- pip: pip install -r requirements.txt
- conda: conda install --file requirements.txt

Cloning a GitHub repository

Download (clone) the repository: git clone https://github.com/jhcepas/eggnog-mapper.git
Enter the repository directory and install the dependencies, either with:
- setuptools: python setup.py install
- pip: pip install -r requirements.txt
- conda: conda install --file requirements.txt

Setup

If you want to be sure that eggNOG-mapper is using the bundled binaries for external tools (hmmer, diamond, mmseqs), it may help to add the emapper scripts and binaries to the PATH. For example, if your eggnog-mapper path is /home/user/eggnog-mapper:

export PATH=/home/user/eggnog-mapper:/home/user/eggnog-mapper/eggnogmapper/bin:"$PATH"

Also, if you want to store eggNOG-mapper databases in a specific directory, you may wish to create an environment variable to avoid using --data_dir in all your commands. For example:

export EGGNOG_DATA_DIR=/home/user/eggnog-mapper-data

The next step is to download the eggNOG-mapper databases by running the following script:

download_eggnog_data.py

This will download the eggNOG annotation database (along with the taxa databases), and the Diamond database of eggNOG proteins.

If no EGGNOG_DATA_DIR variable was defined and no --data_dir option was given to download_eggnog_data.py, the latter will try to download the files to a `data` directory within your eggnog-mapper directory.
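
The lookup order described above can be sketched in Python as follows. This is an illustration, not the tool's code: the function name is hypothetical, and giving --data_dir precedence over EGGNOG_DATA_DIR is an assumption based on the variable being described as a way to avoid passing --data_dir.

```python
import os

def resolve_data_dir(cli_data_dir=None, install_dir="/home/user/eggnog-mapper", environ=None):
    """Pick the database directory: --data_dir, then EGGNOG_DATA_DIR, then <install_dir>/data."""
    environ = os.environ if environ is None else environ
    if cli_data_dir:
        # Assumption: an explicit --data_dir wins over the environment variable.
        return cli_data_dir
    if environ.get("EGGNOG_DATA_DIR"):
        return environ["EGGNOG_DATA_DIR"]
    # Fallback: the data directory within the eggnog-mapper directory.
    return os.path.join(install_dir, "data")
```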

Also, check download_eggnog_data.py --help for a detailed list of options. For example:

The -P flag is required to download the PFAM database.
The -M flag is required to download the MMseqs2 database. Note that no MMseqs2 index is included, so we recommend creating the index if using huge input datasets. To do so, you could run mmseqs createindex "$EGGNOG_DATA_DIR"/mmseqs tmp (see https://mmseqs.com/latest/userguide.pdf for more details).
The -H -d taxID flag is required to download a HMMER database (check list of databases at http://eggnog5.embl.de/#/app/downloads).

Note that, since eggnog-mapper version 2.0.6, you could also skip downloading the default diamond or mmseqs databases, and create them for specific taxa using create_dbs.py. For example, to create a diamond database for Bacteria only:

create_dbs.py -m diamond --dbname bacteria --taxa Bacteria

This will create a bacteria.dmnd diamond database in the default data directory or the one specified in the EGGNOG_DATA_DIR environment variable. Such a database can be used with emapper.py --dmnd_db bacteria.dmnd. The first time create_dbs.py is used it will take time to download the eggnog5 proteins and create the diamond or mmseqs database. Subsequent calls to create_dbs.py using the same data directory will not need to download the eggnog5 proteins again. For further info, check create_dbs.py --help.

Optional tools

Depending on the workflow being used with eggNOG-mapper you will need different external tools. Nonetheless, all of them are actually included, bundled, along with eggNOG-mapper code. If you are running eggNOG-mapper fine, you may not need to install anything else.

However, the bundled tools are compiled binaries and may cause trouble on some systems, or may not be the most optimized binaries for your system. In such cases, you may wish to install some or all of these tools independently. The tools are:

Prodigal: required if using --itype genome or --itype metagenome along with the option --genepred prodigal. Current bundled version is V2.6.3: February, 2016.
Diamond: required to run the search steps with -m diamond. Current bundled version is 2.0.4.
MMseqs2: required to run the search steps with -m mmseqs. Current bundled version is 113e3212c137d026e297c7540e1fcd039f6812b1.
HMMER: required to run the search steps with -m hmmer, to run the HMMER based scripts (hmm_mapper.py, hmm_server.py, hmm_worker.py), and to perform realignments to PFAM with --pfam_realign realign or --pfam_realign denovo. Current bundled version is HMMER 3.1b2 (February 2015).

Basically, whether eggNOG-mapper uses the one you installed or the bundled one will depend on what tool is found in your path first. If none are found in the path, eggNOG-mapper will try to use the bundled ones.
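
The lookup described above (PATH first, bundled binary as fallback) can be sketched in Python with shutil.which. The function name and the bundled directory argument are hypothetical; this is not the tool's own lookup code.

```python
import os
import shutil

def find_tool(name, bundled_bin_dir):
    """Return the tool found on PATH, else the bundled binary, else None."""
    on_path = shutil.which(name)
    if on_path:
        return on_path  # whatever comes first in PATH wins
    candidate = os.path.join(bundled_bin_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate  # fall back to the bundled binary
    return None

# Example call, using the illustrative install path from the Setup section.
diamond_path = find_tool("diamond", "/home/user/eggnog-mapper/eggnogmapper/bin")
```

Printing the returned path is a quick way to check which binary a given environment would pick up.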

Basic usage

To start an annotation job, provide a FASTA file containing your query sequences (-i option), specify a project name which will be used as a prefix for all the output files (-o option), and run emapper.py

emapper.py -i FASTA_FILE_PROTEINS -o test

A few recipes

- Run search and annotation, using diamond in blastx mode

emapper.py -m diamond -i FASTA_FILE_NTS --itype CDS -o test

- Run search and annotation, using MMseqs after translating input CDS to proteins

emapper.py -m mmseqs -i FASTA_FILE_CDS --itype CDS --translate -o test

- Run search and annotation for assembled contigs, using diamond "blastx" hits for gene prediction

emapper.py -m diamond -i FASTA_FILE_NTS --itype metagenome -o test

- Run search and annotation for a genome, using MMseqs search on proteins predicted by Prodigal

emapper.py -m mmseqs -i FASTA_FILE_NTS --itype genome --genepred prodigal -o test

- Run gene prediction using a genome to train Prodigal (since version 2.0.7)

emapper.py -m mmseqs -i FASTA_FILE_NTS --itype genome --genepred prodigal --training_genome FASTA_FILE --training_file OUT_TRAIN_FILE -o test

- 2-step run: a search step using diamond in "sensitive" mode, followed by an annotation step loading the eggnog.db sqlite3 into memory (--dbmem; requires around 40GB free mem)

emapper.py -i FASTA_FILE_PROTS -m diamond --sensmode sensitive --no_annot -o test
emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs --dbmem -o test_annot_1

- Repeat the annotation step, using specific taxa as target and reporting the one-to-one orthologs found, reading the eggnog.db from disk (no --dbmem option)

emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs --report_orthologs --target_orthologs one2one --target_taxa 72274,1123487 -o test_annot_2

- Use HMMER to search a database of bacterial proteins, using a scratch dir to write output on a different drive than the one used to read. Once emapper.py finishes, output files in the scratch dir will be moved to the actual output dir, and the scratch dir will be removed.

emapper.py -m hmmer -i FASTA_FILE_PROTS -d bact -o test --scratch_dir /scratch/test

- Realign queries to the PFAM domains found on seed orthologs

emapper.py -i FASTA_FILE_PROTS -o test --pfam_transfer seed_ortholog --pfam_realign realign

- Realign queries to the whole PFAM database

emapper.py -i FASTA_FILE_PROTS -o test --pfam_realign denovo

Parameters

General Options

--version

show version and exit.

--list_taxa

List available taxonomic names and IDs and exit.

--cpu NUM_CPU

number of CPUs to be used whenever possible (diamond, annotation tasks, etc). --cpu 0 to run with all available CPUs.

Input Data Options

-i FILE

input FASTA file containing query sequences (proteins by default; see --translate). Required unless -m no_search

--itype INPUT_TYPE

The type of sequences included in the input file. The options are:

- --itype proteins, which is the default.
- --itype CDS
- --itype genome
- --itype metagenome

For --itype proteins the input file is used directly as input for the search step. With --itype CDS, the input file will be used directly as input for diamond and MMseqs2, unless --translate is used (see below); for hmmer the input CDS are first translated to proteins. If --itype genome is used, the input sequences are considered contigs, and gene prediction will be performed (see --genepred option). --itype metagenome is the same as --itype genome, except that Prodigal will be run in a different mode when --genepred prodigal is used.

--translate

if --itype CDS and the --translate option is used, input sequences will be translated to proteins before search. If -m hmmer and --itype CDS, input sequences will be translated to proteins, as if --translate was automatically activated. If -m diamond or -m mmseqs and --itype CDS but no --translate option is given, searches will be performed in "blastx" mode. Note that this is different than using --itype genome or --itype metagenome, in which case the hits are used to identify one or more ORFs within the input sequences, whereas using --itype CDS without --translate will just annotate the best hit found for each input sequence.
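
The interaction between -m, --itype and --translate described above can be summarised as a small decision function. This is a sketch of the documented behaviour only; the function name and the returned labels are hypothetical, not emapper internals.

```python
def search_plan(mode, itype="proteins", translate=False):
    """Describe what happens to the input for a given -m MODE, --itype and --translate."""
    if itype == "proteins":
        return "protein search"
    if itype == "CDS":
        # hmmer behaves as if --translate were given.
        if translate or mode == "hmmer":
            return "translate, then protein search"
        # diamond / mmseqs without --translate: annotate the best blastx hit.
        return "blastx search, best hit per sequence"
    if itype in ("genome", "metagenome"):
        # Hits (or Prodigal) are used to identify ORFs first; see --genepred.
        return "gene prediction, then search"
    raise ValueError(f"unknown itype: {itype}")
```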

--annotate_hits_table FILE

annotate TSV formatted table with 4 fields: query, hit, evalue, score. Required if -m no_search.

-c FILE, --cache FILE

Annotations file with md5 checksums of sequences. Required if -m cache.

--data_dir DIR

path to eggnog-mapper databases (data/ folder or the one specified by the EGGNOG_DATA_DIR environment variable, by default).

Gene Prediction Options

--genepred GENE_PRED_MODE

When --itype genome or --itype metagenome is used, gene prediction is carried out. There are 2 gene prediction modes:

- The default is --genepred search, which means that either Diamond or MMseqs2 (depending on -m argument) is run in blastx mode. As of now, we cannot recommend using Diamond for complete genomes, unless the assembly is rather fragmented and/or contigs are not very large. MMseqs2 is faster than Diamond for assembled genomes, and it is the recommended one if the memory requirements can be met.
- If --genepred prodigal is specified, Prodigal is run for gene prediction, and the proteins predicted by Prodigal are used in the subsequent search and annotation steps. Prodigal will be run in a different mode depending on whether --itype genome or --itype metagenome is used.

--trans_table TRANS_TABLE_CODE (since version 2.0.7)

Option to change the translation table used for gene prediction. It corresponds to Diamond's --query-gencode, MMseqs2's --translation-table and Prodigal's -g/--trans-table. Usually the value is an integer corresponding to a specific translation table (e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). Check each program's documentation for more info.

--training_genome FASTA_FILE (since version 2.0.7)

FASTA file of the genome to be used for Prodigal's training mode. Requires --itype genome --genepred prodigal and also requires --training_file FILE. Note training will be run only if the training file does NOT exist. If the training file already exists, the latter will be used directly for gene prediction, and training will be skipped.

--training_file FILE (since version 2.0.7)

Training file to be created and/or used by Prodigal. If the training file does not exist, the training genome (--training_genome option) will be used to create a training file, and then immediately perform gene prediction from such training file. If the training file already exists, the training is skipped, the --training_genome option is ignored, and gene prediction is performed using the existing training file.
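
The rule shared by --training_genome and --training_file can be sketched in Python as below. The function name and return labels are hypothetical, and raising an error when neither an existing training file nor a training genome is available is an assumption, not documented behaviour.

```python
import os

def prodigal_training_plan(training_file, training_genome=None):
    """Decide whether Prodigal training is run or an existing training file is reused."""
    if os.path.exists(training_file):
        # Existing training file: training is skipped and --training_genome is ignored.
        return "use_existing"
    if training_genome is None:
        # Assumption: nothing to train from, so this is treated as an error here.
        raise ValueError("training file does not exist and no training genome was given")
    # Training file missing: train on the genome, then predict genes with the new file.
    return "train_then_predict"
```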

Search Options

-m MODE

how input queries will be searched against eggNOG sequences. Default is -m diamond. All MODE options are shown in the next table:

MODE	Description	Notes
diamond	search queries against eggNOG sequences using diamond	requires -i FILE
hmmer	search sequences/hmm against sequences/hmm using HMMER	requires -i FILE and -d DB_NAME
mmseqs	search queries against eggNOG sequences using MMseqs2	requires -i FILE
cache	look up queries in a previous annotations file which includes md5 hashes of the annotated sequences	requires -i FILE and -c FILE
no_search	skip search stage. Annotate an existing .seed_orthologs file	requires --annotate_hits_table FILE, unless `--no_annot` is used

Search filtering common options

--pident FLOAT

report only alignments equal to or above this percentage identity threshold. Default None (since version 2.0.7). No effect if -m hmmer.

--evalue FLOAT

report only alignments equal to or below this e-value threshold. Default 0.001

--score FLOAT

report only alignments equal to or above this bit score threshold. Default None (since version 2.0.7).

--query-cover FLOAT

report only alignments equal to or above this query coverage fraction threshold. Default None (since version 2.0.7).

--subject-cover FLOAT

report only alignments equal to or above this target (eggNOG sequence) coverage fraction threshold. Default None (since version 2.0.7). No effect if -m hmmer.
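
How these thresholds combine can be sketched in Python as follows, assuming the e-value acts as an upper bound (as is conventional) and the others as lower bounds, with None meaning "no filter". The function name, dictionary keys and example values are illustrative, not emapper's data structures; thresholds and hit values must use the same units.

```python
def passes_filters(hit, pident=None, evalue=0.001, score=None,
                   query_cover=None, subject_cover=None):
    """Return True if a hit satisfies every threshold that is not None."""
    # e-value: smaller is more significant, so it is an upper bound.
    if evalue is not None and hit["evalue"] > evalue:
        return False
    # The remaining thresholds are lower bounds.
    minimums = {"pident": pident, "score": score,
                "qcov": query_cover, "scov": subject_cover}
    for field, threshold in minimums.items():
        if threshold is not None and hit[field] < threshold:
            return False
    return True

# Illustrative hit; values are made up.
hit = {"pident": 45.0, "evalue": 1e-20, "score": 150.0, "qcov": 80.0, "scov": 60.0}
kept_by_default = passes_filters(hit)
```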

Diamond search options

--dmnd_db FILE

path to diamond-compatible database. Useful to specify a location different than data/ or --data_dir.

--sensmode DIAMOND_SENS_MODE

either fast, mid-sensitive, sensitive, more-sensitive, very-sensitive or ultra-sensitive. Default is sensitive (to be sure, check the default for your version in emapper.py --help).

--matrix MATRIX_NAME

which substitution matrix to be used by diamond, among BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30.

--gapopen INT

gap open penalty used by diamond. Default is diamond default.

--gapextend INT

gap extend penalty used by diamond. Default is diamond default.

--block_size FLOAT (since version 2.0.7)

Diamond's -b/--block-size option. Default is Diamond's default.

--index_chunks INT (since version 2.0.7)

Diamond's -c/--index-chunks option. Default is Diamond's default.

--outfmt_short (since version 2.0.7)

Diamond will produce only the query, subject, evalue and score fields in its output, and seed_orthologs file will have only those fields also. This option could be useful to obtain better performance when no thresholds for pident, and query and subject coverages are used (see Diamond docs about traceback).

MMseqs2 search options

--mmseqs_db FILE

path to MMseqs2-compatible database. Useful to specify a location different than data/ or --data_dir.

--start_sens FLOAT

Starting sensitivity for MMseqs2 iterative searches. Default 3.

--sens_steps INT

Number of iterative searches with different sensitivities for MMseqs2. Default 3.

--final_sens FLOAT

Final sensitivity for MMseqs2 iterative searches. Default 7.
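
As an illustration of how the three sensitivity options relate, the sketch below assumes evenly spaced steps between the starting and final sensitivity, which is how the MMseqs2 user guide describes its iterative search. The schedule actually used internally may differ, and the function name is hypothetical.

```python
def sensitivity_schedule(start_sens=3.0, sens_steps=3, final_sens=7.0):
    """Sensitivities tried in order, assuming evenly spaced steps (an assumption)."""
    if sens_steps < 1:
        raise ValueError("sens_steps must be at least 1")
    if sens_steps == 1:
        return [float(final_sens)]
    step = (final_sens - start_sens) / (sens_steps - 1)
    return [start_sens + i * step for i in range(sens_steps)]

# With the defaults listed above (start 3, 3 steps, final 7).
default_schedule = sensitivity_schedule()
```

Under this assumption the defaults give sensitivities 3, 5 and 7.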

--mmseqs_sub_mat MMSEQS_SUB_MAT

Matrix to be used for --sub-mat option of MMseqs2. Default: the default one used by MMseqs2.

HMMer search options

-d DB_NAME, --database DB_NAME

specify the target database for sequence searches. DB_NAME should be the name of a database downloaded using `download_eggnog_data.py -H -d taxID`, or such a database loaded in a server (e.g. db.hmm:host:port; see hmm_server.py documentation)

--servers_list FILE

A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from the -d option will be ignored.

--qtype QUERY_TYPE

hmm or seq. Type of input data (-i).

--dbtype DB_TYPE

hmmdb or seqdb. Type of data in DB (-d).

--usemem

Use this option to allocate the whole database (-d) in memory. If --dbtype hmmdb, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must be a HMMER-format database created with esl-reformat. Database will be unloaded after execution.

-p INT, --port INT

Port used to set up the HMM server, when --usemem is used. Also used for --pfam_realign modes.

--end_port PORT

Last port to be used to set up the HMM server, when --usemem is used. Also used for --pfam_realign modes.

--num_servers INT

When using --usemem, specify the number of servers to fire up. By default, only 1 server is used. Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. It is important to consider that for each server the HMM database will be loaded into memory, and therefore memory consumption will grow as --num_servers is increased.

--num_workers INT

When using --usemem, specify the number of workers per server to fire up. By default, cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. In our tests, --num_workers did not have the expected impact on performance, and increasing --num_servers was required to obtain an actual speed boost, although the memory requirements must be met. However, this could be different in other systems or if hmmpgmd's use of workers is fixed somehow.

--hmm_maxhits INT

Max number of hits to report (0 to report all). Default=1.

--report_no_hits

Whether queries without hits should be included in the output table.

--hmm_maxseqlen INT

Ignore query sequences longer than `maxseqlen`. Default=5000

--Z FLOAT

Fixed database size used in phmmer/hmmscan, which allows comparing e-values among databases. Default=40,000,000

--cut_ga

Adds the --cut_ga option to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation.

--clean_overlaps CLEAN_OVERLAPS_MODE

Removes those hits which overlap, keeping only the one with best evalue. Default "none". Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e. domains are the queries from -i file). The "clans" and "hmmsearch_clans" options only have an effect for hits to/from Pfam.
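
The idea behind overlap cleaning (keep the best e-value among overlapping hits) can be sketched with a simple greedy pass. This is an illustration under stated assumptions, not emapper's implementation: it ignores Pfam clans, treats coordinates as inclusive, and the function name and hit fields are hypothetical.

```python
def clean_overlaps(hits):
    """Greedy sketch: visit hits from best to worst e-value, keep non-overlapping ones."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        # Keep the hit only if it lies entirely before or after every kept hit.
        if all(hit["end"] < k["start"] or hit["start"] > k["end"] for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h["start"])

# Made-up domain hits on a single query.
hits = [
    {"name": "domA", "start": 1, "end": 100, "evalue": 1e-30},
    {"name": "domB", "start": 50, "end": 150, "evalue": 1e-10},
    {"name": "domC", "start": 200, "end": 300, "evalue": 1e-5},
]
cleaned = [h["name"] for h in clean_overlaps(hits)]
```

Here domB overlaps domA and has a worse e-value, so only domA and domC remain.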

Annotation Options

--no_annot

perform only the search stage and skip functional annotation, reporting only seed orthologs (.seed_orthologs file).

--dbmem

Store the whole eggnog sqlite DB into memory before retrieving the annotations. This requires ~40GB of RAM memory available, but can increase annotation speed considerably. Database will be unloaded after execution.

--seed_ortholog_evalue FLOAT

e-value threshold for a seed eggNOG ortholog: seed orthologs with a higher (less significant) e-value are not used. Queries without a significant seed ortholog will not be annotated. Default=0.001

--seed_ortholog_score FLOAT

min bit score expected when searching for a seed eggNOG ortholog. Queries without a significant seed ortholog will not be annotated. Default=60

--tax_scope auto|narrowest|LIST_OF_TAX_IDS

fix the taxonomic scope used for annotation, for each query sequence, so only speciation events from a particular clade are used for functional transfer. In more detail, each seed ortholog belongs to a list of Orthologous Groups (OGs). eggnog-mapper uses one of these OGs to analyze speciation events and retrieve orthologs from which to transfer functional annotation. This can be done from a broader or a narrower OG. The --tax_scope option helps control how this choice is carried out. Default is auto.

auto

eggnog-mapper uses a predefined list of tax IDs, so that the OG chosen will be the narrowest one which belongs to that list. Therefore, --tax_scope auto is equivalent to --tax_scope 10239,5794,33090,6231,6656,40674,78,8782,33208,4751,33154,2759,2157,2,1 (viruses,apicomplexa,plants,nematodes,arthropods,mammals,fishes,avian,metazoa,fungi,opisthokonta,euk,arch,bact,root). For example, if the OGs for a given query of our eggnog-mapper run are COG0012@1,COG0012@2,1MVM4@1224, the OG chosen to retrieve orthologs in auto mode will be COG0012@2, since 1MVM4@1224 does not belong to the list of tax IDs, and COG0012@2 is narrower than COG0012@1.

narrowest

Instructs eggnog-mapper to use the narrowest (most specific) taxon among the OGs identified for each hit. This could lead to scarce annotation, especially for less well-known clades. In the same example as before, COG0012@1,COG0012@2,1MVM4@1224, the OG chosen to retrieve orthologs will be 1MVM4@1224, which is the narrowest.

LIST_OF_TAX_IDS

Use a user-defined comma-separated list of tax IDs and/or tax names (you can use a mix of tax IDs and names; use the --list_taxa option to retrieve a list of the ones which are available). The order matters: the left-most tax IDs will have preference over the right-most ones. Furthermore, the list of tax IDs can be suffixed with none, narrowest or auto, to specify the behaviour when none of the tax IDs are found among the OGs of a target seed ortholog. If only the list of tax IDs is specified, the default behaviour is none.

none: no OG will be used for annotation, so no annotation will be obtained for this query.
auto: an OG will be chosen using the predefined list of tax IDs, and therefore at least the root level will be applied if no other taxa fits the target OGs (see auto above).
narrowest: the narrowest OG will be used for annotation, as if --tax_scope narrowest was chosen for this query.

An example of list of tax IDs would be --tax_scope 2759,2157,2,1 for euk, arch, bact and root, in that order of preference.

If, for example, the narrowest OG is preferred over root, the list could, instead of the previous, be --tax_scope 2759,2157,2,narrowest.

Another example: if a user wants to annotate all bacteria using the Bacteria level, and auto for all other taxa, they should use --tax_scope 2,auto
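
The selection rules for --tax_scope can be sketched in Python as follows. This is not the emapper implementation: the function name is hypothetical, the OG list is assumed to be ordered from broadest to narrowest, and auto_ids holds only the last four entries of the predefined list to keep the example short.

```python
def choose_og(ogs, tax_ids=(), fallback="none", auto_ids=("2759", "2157", "2", "1")):
    """Pick the OG used for annotation.

    ogs: 'OG@taxid' strings, assumed ordered from broadest to narrowest.
    tax_ids: user-defined list; left-most IDs have preference.
    fallback: 'none', 'auto' or 'narrowest', used when no tax ID matches.
    """
    by_tax = {og.rsplit("@", 1)[1]: og for og in ogs}
    for tax_id in tax_ids:
        if tax_id in by_tax:
            return by_tax[tax_id]
    if fallback == "narrowest":
        return ogs[-1] if ogs else None
    if fallback == "auto":
        return choose_og(ogs, auto_ids)
    return None  # 'none': the query gets no annotation

ogs = ["COG0012@1", "COG0012@2", "1MVM4@1224"]
auto_choice = choose_og(ogs, fallback="auto")            # like --tax_scope auto
narrowest_choice = choose_og(ogs, fallback="narrowest")  # like --tax_scope narrowest
```

With the example OGs used above, the auto-style call selects COG0012@2 and the narrowest-style call selects 1MVM4@1224, matching the behaviour described in the text.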

--target_orthologs one2one|many2one|one2many|many2many|all

defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer. Default: all

--target_taxa all|TAX_ID

broadest taxa which will be used to search for orthologs. By default ('all'), orthologs from all taxa, within a given taxonomic scope, are used. Note that this option interacts with the OG chosen due to the --tax_scope option. First, speciation events are identified among the Orthologous Groups based on --tax_scope. Then, annotation will be transferred from the orthologs found within those speciation events: from all the orthologs if --target_taxa all, or only from orthologs of a specific TAX_ID if --target_taxa TAX_ID.

--excluded_taxa TAXID

the opposite of --target_taxa: orthologs from the specified taxa are excluded (for debugging and benchmarking purposes). Default is none.

--report_orthologs

as a first step in functional annotation, eggnog-mapper identifies the orthologs of each query, using seed orthologs from the search stage as an anchoring or starting point. A list of these orthologs is not reported by default. Use this option to get the list of orthologs found for each query ('.orthologs' file).

--go_evidence experimental|non-electronic|all

defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic (default) = Use only non-electronically curated terms. all = all GO terms will be retrieved.

--pfam_transfer best_og|narrowest_og|seed_ortholog

PFAM domains will be retrieved from either best OG, the narrowest OG or directly from the seed ortholog. It has no effect if --pfam_realign denovo is used.

--pfam_realign none|realign|denovo

Defines how PFAM annotation will be performed.

none

A list of PFAMs, directly transferred from orthologs, will be reported.

realign

PFAMs from orthologs will be realigned to the query, and a list of PFAMs and their positions on the query will be reported.

denovo

Each query will be realigned to PFAM, and a list of PFAMs and their positions on the query will be reported.

--md5

Adds a column with the md5 hash of the query sequences in the annotations output file. An annotations output file created this way can be used as cache file (-c CACHE_FILE) for the -m cache mode.
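The cache workflow described above can be sketched as three commands. This is only an illustration: the function name `cache_workflow` and all file names and prefixes are hypothetical placeholders, and the commands are printed rather than executed so they can be reviewed first. It uses only options mentioned on this page (--md5, -m cache, -c, -i, -o) and the `no_annotations.fasta` output described under "Output files".

```shell
# Print (do not run) a cache-based annotation workflow.
# Arguments (all hypothetical names): old FASTA, old prefix, new FASTA, new prefix.
cache_workflow() {
    old_fasta=$1; old_prefix=$2; new_fasta=$3; new_prefix=$4

    # 1) Earlier run: add the md5 column so its annotations file can act as a cache.
    echo "emapper.py -i $old_fasta -o $old_prefix --md5"

    # 2) New run: reuse annotations of sequences already present in the cache.
    echo "emapper.py -m cache -c $old_prefix.emapper.annotations -i $new_fasta -o $new_prefix"

    # 3) Annotate the sequences not found in the cache with a regular run.
    echo "emapper.py -i $new_prefix.emapper.no_annotations.fasta -o ${new_prefix}_remaining"
}

cache_workflow old_proteins.faa run1 new_proteins.faa run2
```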

Output options

--output,-o FILE_PREFIX

base name for output files

--output_dir DIR

where output files should be written. default is current working directory.

--scratch_dir DIR

write output files in a temporary scratch dir, and move them to the final output dir when finished. This speeds up large computations when using network file systems.

--resume

resumes a previous execution skipping reported hits in the output file. Note that diamond runs (-m diamond) cannot be resumed, but search stage can be skipped with -m no_search --annotate_hits_table FILE.

--override

overwrites output files if they exist. By default, execution is aborted if conflicting files are detected.

--temp_dir DIR

where temporary files are created. Better if this is a local disk.

--no_file_comments

neither header lines nor stats are included in the output files

--decorate_gff no|yes|FILE[:FIELD]

Option to create/decorate a GFF file with emapper hits and/or annotations.

no: no GFF decoration will be performed. If running gene prediction with Prodigal, its GFF will be among the output files anyway. If running blastx-based gene prediction, the GFF with CDS of hits will be among output files anyway.
yes: a new GFF will be created including hits and/or annotations.
FILE[:FIELD]: a new GFF will be created, adding hits and/or annotations to the attributes already existing in the specified FILE. A FIELD (a GFF attribute) can be specified to identify the GFF features to which the hits and/or annotations should be added. For example, --decorate_gff genome_cds.gff:geneID will add hits and/or annotations to the features in which geneID matches the query name of the hit/annotation. By default, --decorate_gff is no and FIELD is ID.
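Because FILE:FIELD relies on the attribute value matching the query names, it may be worth listing the values of that attribute before running the decoration. The helper below is a hypothetical pre-check, not part of eggNOG-mapper; it assumes a tab-separated GFF whose 9th column holds `key=value` pairs separated by semicolons.

```shell
# List the values of one GFF attribute (e.g. geneID) found in the 9th column.
# Reads a GFF on stdin; the function name is a hypothetical helper.
gff_attr_values() {
    awk -F '\t' -v field="$1" '
        /^#/ { next }                                  # skip GFF comment lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                # keep only attributes whose key is exactly "field"
                if (index(attrs[i], field "=") == 1) {
                    print substr(attrs[i], length(field) + 2)
                }
            }
        }'
}

# Example (file name taken from the text above):
#   gff_attr_values geneID < genome_cds.gff | sort | head
```

The listed values can then be compared with the sequence identifiers of the FASTA file used as input.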

Output format

Output files

Seed orthologs (prefix.emapper.seed_orthologs)

A file with the results from the search phase. Therefore, each row represents a query hit against a target eggNOG sequence.

Annotations (prefix.emapper.annotations)

A file with the results from the annotation phase. Therefore, each row represents the annotation reported for a given query.

Orthologs (prefix.emapper.orthologs)

A file with the list of orthologs found for each query. This file is created only if using the --report_orthologs option.

HMM hits (prefix.emapper.hmm_hits)

A file with the results from the search phase, using hmm_mapper or emapper -m hmmer, which reports query-HMM target pairs, including the e-value and score of the hit, the starting and ending positions of the hit, as well as the query covered by the alignment to the HMM hit.

Sequences of predicted CDS (prefix.emapper.genepred.fasta)

A FASTA file with the sequences of the predicted CDS. It is generated when gene prediction is carried out, with --itype genome or --itype metagenome.

GFF of predicted CDS (prefix.emapper.genepred.gff)

A GFF (version 3) file with the position of the predicted CDS on the original input sequences. It is generated when gene prediction is carried out, with --itype genome or --itype metagenome.

Sequences without annotation (prefix.emapper.no_annotations.fasta)

A FASTA file with the sequences of queries for which an existing annotation was not found using the -m cache mode. This file can be used as input of another eggNOG-mapper run without using the cache, trying to annotate the sequences.

PFAM hits (prefix.emapper.pfam)

A file with the positions of the PFAM domains identified. Only created if --pfam_realign realign or --pfam_realign denovo.

Output fields

All files contain rows with tab-separated columns or fields.

Seed orthologs file

query
target

The target is what is also known, in eggnog-mapper, as 'seed ortholog'. It is the eggNOG sequence representing the best hit found for a given query during the search phase, and it will be used, during the annotation phase, to retrieve orthologs from which to transfer annotations.

e-value
bit-score

The e-value and bit-score fields are the values returned by the search tool being used (diamond by default, see -m option).

pident

Percentage of identity between the query and the subject (since version 2.0.7).

qstart

First position of query in the alignment (since version 2.0.7).

qend

End position of query in the alignment (since version 2.0.7).

sstart

Start position of subject (a.k.a. target) in the alignment (since version 2.0.7).

send

End position of subject (a.k.a. target) in the alignment (since version 2.0.7).

qcov

Percentage of the query length which is part of the alignment (since version 2.0.7).

scov

Percentage of the subject (a.k.a. target) length which is part of the alignment (since version 2.0.7).

Since version 2.0.7, the --outfmt_short option can be used to output only the first 4 fields of the seed orthologs file, when running searches with Diamond (see --outfmt_short option above).
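As an illustration of how these fields can be used, the following sketch filters a seed orthologs file by e-value and query coverage. It assumes the full (version 2.0.7 or later) layout in the order listed above, so that e-value is column 3 and qcov is column 10, and that any comment lines start with `#`. The function name and thresholds are arbitrary examples, not recommended values.

```shell
# Keep hits with e-value <= max_evalue and query coverage >= min_qcov.
# Assumed column order: 1 query, 2 target, 3 e-value, 4 bit-score, 5 pident,
# 6 qstart, 7 qend, 8 sstart, 9 send, 10 qcov, 11 scov.
filter_seed_orthologs() {
    awk -F '\t' -v max_e="$1" -v min_qcov="$2" '
        /^#/ { next }                                   # skip comment lines
        ($3 + 0) <= (max_e + 0) && ($10 + 0) >= (min_qcov + 0)
    '
}

# Example (hypothetical file names):
#   filter_seed_orthologs 1e-10 50 < prefix.emapper.seed_orthologs > filtered.seed_orthologs
```

This does not work on files written with --outfmt_short, which lack the qcov column.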

Annotations file

Search hit fields

query_name
seed_eggNOG_ortholog
seed_ortholog_evalue
seed_ortholog_score

Orthologous Groups fields

eggNOG OGs

a comma-separated, clade depth-sorted (broadest to narrowest), list of Orthologous Groups (OGs) identified for this query. Note that each OG is represented in the following format: OG@tax_id|tax_name

narr_og_name

OG@tax_id|tax_name for the narrowest OG found for this query.

narr_og_cat

COG category corresponding to narr_og_name

narr_og_desc

Description corresponding to narr_og_name

best_og_name

OG@tax_id|tax_name for the OG chosen based on --tax_scope.

best_og_cat

COG category corresponding to best_og_name

best_og_desc

Description corresponding to best_og_name
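The OG@tax_id|tax_name format can be split into its parts with standard tools. The sketch below takes the content of the "eggNOG OGs" field and prints one OG per line as three tab-separated values. The function name and the OG identifiers in the example are made up for illustration, and the sketch assumes taxon names contain no commas.

```shell
# Split a comma-separated list of OG@tax_id|tax_name entries into
# tab-separated lines: OG, tax_id, tax_name.
split_ogs() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        {
            at = index($0, "@")
            bar = index($0, "|")
            if (at == 0 || bar < at) next               # skip malformed entries
            og = substr($0, 1, at - 1)
            tax_id = substr($0, at + 1, bar - at - 1)
            tax_name = substr($0, bar + 1)
            print og "\t" tax_id "\t" tax_name
        }'
}

# Example with made-up identifiers:
split_ogs 'OG1@1|root,OG2@2|Bacteria'
```

Since the list is sorted from broadest to narrowest, the last line printed corresponds to the narrowest OG.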

Transferred annotations fields

Preferred_name
GOs
EC
KEGG_ko
KEGG_Pathway
KEGG_Module
KEGG_Reaction
KEGG_rclass
BRITE
KEGG_TC
CAZy
BiGG_Reaction
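A quick summary of an annotations file can be obtained by tallying the values of one column. The sketch below is a hypothetical helper that takes the column number as an argument, because the exact position of each field should be checked against the header of your own file. If the columns follow the order listed in this section, best_og_cat would be column 10. Empty values and `-` are both treated as missing, which is an assumption about how absent annotations are written.

```shell
# Count how often each value occurs in a given column of an annotations file.
# Output: count <TAB> value, most frequent first.
tally_column() {
    awk -F '\t' -v col="$1" '
        /^#/ { next }                                   # skip comment lines
        $col != "" && $col != "-" { count[$col]++ }
        END { for (v in count) print count[v] "\t" v }
    ' | sort -k1,1nr -k2,2
}

# Example (hypothetical file name):
#   tally_column 10 < prefix.emapper.annotations
```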

Orthologs file

query
comma-separated list of orthologs

HMMer hits file

query
hit
evalue
sum_score
query length
HMM position "from"
HMM position "to"
Sequence position "from"
Sequence position "to"
query coverage

Sequences of predicted CDS

If gene prediction is performed using search hits (diamond or mmseqs "blastx" hits), sequence identifiers include the identifier of the original sequence from which the CDS has been found, followed by an underscore and a number to differentiate among CDS from the same original sequence (e.g. A CDS found in >query_seq will be named >query_seq_1. A second one will be >query_seq_2, ...). If gene prediction is performed using prodigal, this output file is the one generated by Prodigal (check Prodigal documentation for output formats).
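For search-based gene prediction, the naming scheme above makes it possible to recover the original sequence of each CDS by removing the trailing underscore and number. The following sketch counts predicted CDS per original sequence. The function name is hypothetical, and the sketch assumes the identifier is the first whitespace-delimited word of each header.

```shell
# Count predicted CDS per original sequence in a genepred FASTA (stdin).
# Output: original sequence id <TAB> number of CDS.
cds_per_sequence() {
    grep '^>' |
        sed -e 's/^>//' -e 's/[[:space:]].*$//' -e 's/_[0-9][0-9]*$//' |
        sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Example (hypothetical file name):
#   cds_per_sequence < prefix.emapper.genepred.fasta
```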

GFF of predicted CDS

If gene prediction is performed using search hits (diamond or mmseqs "blastx" hits), the source field (2nd column) shows "eggNOG-mapper" and the attributes field (9th column) shows the results of the "blastx" search (e.g. ID=0_0;score=1597.8;evalue=0.0;eggnog5_target=316407.85674276;sstart=1;send=820;searcher=diamond). target_coverage is also included since version 2.0.7. If gene prediction is performed using prodigal, this output file is the one generated by Prodigal (check Prodigal documentation for output formats).

Sequences without annotation

Just a FASTA file with the same identifiers as the original sequences.

PFAM hits

TODO

Setting up large annotation jobs

The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).

eggNOG-mapper works in two phases: 1) finding seed orthologous sequences and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is more about disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.

Phase 1. Homology searches

1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M seqs per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so every record takes exactly two lines and splitting is very simple (2,000,000 lines per chunk for 1M sequences).

split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
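The split command above is only safe when every record occupies exactly two lines (header plus sequence). If the input FASTA wraps sequences over several lines, a record could be cut between two chunks. A minimal sketch for unwrapping such a file first is shown below. The function name and file names are hypothetical.

```shell
# Rewrite a FASTA file (stdin) so that each sequence is on a single line,
# giving exactly two lines per record.
linearize_fasta() {
    awk '
        /^>/ { if (seen) print seq; print; seq = ""; seen = 1; next }
        { gsub(/[ \t\r]/, ""); seq = seq $0 }           # join wrapped sequence lines
        END { if (seen) print seq }
    '
}

# Example (hypothetical file names):
#   linearize_fasta < wrapped_input.faa > input_file.faa
#   split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
```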

2) Use diamond mode. Each chunk can be processed independently in a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The necessary commands that need to be submitted to the cluster queue can be generated with something like this:

# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
    echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f
done
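Once the cluster jobs have finished, it may be worth checking that every chunk produced a seed orthologs file before moving on to the annotation phase. The helper below is a hypothetical sketch. It assumes each chunk was run with -o set to the chunk name, as in the loop above, so the expected output is the chunk name followed by `.emapper.seed_orthologs` in the current directory.

```shell
# Print the chunks for which no seed orthologs file exists yet.
missing_seed_orthologs() {
    for chunk in "$@"; do
        [ -e "$chunk.emapper.seed_orthologs" ] || printf '%s\n' "$chunk"
    done
}

# Example: the three-digit pattern matches the suffixes produced by
# "split -a 3 -d" and excludes the emapper output files.
#   missing_seed_orthologs input_file.chunk_[0-9][0-9][0-9]
```

An empty output means all chunks have a result file. Any chunk that is listed can be resubmitted on its own.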

Phase 2. Orthology and functional annotation

The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store `eggnog.db` in SSD disks or, if possible, under `/dev/shm` (memory based filesystem).

3) Concatenate all chunk_*.emapper.seed_orthologs files.

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs

4) Run the orthologs search and annotation phase in a single multi core machine (10 cores in our example), reading from a fast disk.

emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and having `eggnog.db` under the `/dev/shm` disk, but you can of course run many of those instances in parallel. If you are running `emapper.py` from a conda environment, check [this issue](https://github.com/jhcepas/eggnog-mapper/issues/80).

And _voilà_, you have your annotations.

Even larger jobs in large memory computers

Use MMseqs for the search step (~200 GB mem required if using the whole eggnog5 DB).

# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
    echo ./emapper.py -m mmseqs --no_annot --no_file_comments --cpu 16 -i $f -o $f
done

Load the annotation database into memory for the annotation step (~44 GB mem required)

emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10 --dbmem

Diamond searches (-m diamond) can also benefit from large memory computers, by tuning the --block_size and --index_chunks options. Setting --index_chunks may even be required by Diamond when running on computers with over 64 GB of RAM. (These options are available since version 2.0.7.)

Citation

Please cite the following two papers if you use eggNOG-mapper v2:

[1] Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork. Submitted (2016).

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085