Protein-coding potential exploration by kmers

We assume you have a database of kmer frequencies, cdskmerf_9.db (see below), and a fasta file (input.fna) with genomic sequences in which you want to find regions of high protein-coding potential. Then simply execute:

$ kodpot -f 1 cdskmerf_9.db input.fna > output.faa

which will create a fasta file of protein sequences (output.faa), whose names encode the coordinates and strand of the corresponding genomic regions (relative to your input sequence).
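
As a quick sanity check on the result, the number of reported protein sequences can be counted from the fasta headers. This is a minimal sketch that relies only on the fasta convention that every record starts with a ">" line; the function name count_fasta_records is hypothetical and is not part of kodpot.

```shell
# Count the records in a fasta file by counting its header lines.
# Assumption: standard fasta, where each record begins with a line starting with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage:
#   count_fasta_records output.faa      # number of reported protein sequences
#   grep '^>' output.faa | head         # look at the first few sequence names
```

The names printed by the second command are the ones that carry the coordinates and strand described above.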

Files:

Constructing a custom database of in-frame kmer frequencies

We need a set of trusted coding sequences (CDS) in fasta format. One source is the CDS of complete genomes at NCBI. For example, information on bacterial genome sequences can be obtained here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
The files to use are those with names ending in "_cds_from_genomic.fna.gz".
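
One possible way to turn that summary into a list of files to download is sketched below. It assumes the usual layout of the NCBI assembly summary: tab-separated, with the assembly level in column 12 and the assembly directory (ftp_path) in column 20, and with the CDS file named after the last component of that directory. The function name cds_urls, the file name cds_urls.txt, and the restriction to "Complete Genome" entries are illustrative choices, not something the tool requires.

```shell
# Print the URL of the CDS file for every complete genome listed in an
# NCBI assembly summary read from standard input.
# Assumptions: tab-separated input, column 12 = assembly_level,
# column 20 = ftp_path (directory of the assembly).
cds_urls() {
    awk -F '\t' '
        /^#/ { next }                           # skip header and comment lines
        $12 == "Complete Genome" && $20 != "na" {
            n = split($20, parts, "/")          # last path component is the file prefix
            print $20 "/" parts[n] "_cds_from_genomic.fna.gz"
        }'
}

# Usage (hypothetical file names):
#   cds_urls < assembly_summary.txt > cds_urls.txt
#   mkdir -p cds_dir && wget -P cds_dir -i cds_urls.txt
```

Check the column positions against the header line of the summary you downloaded before relying on this, since the layout is an assumption here.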

Assuming you already have all the CDS files you want to use in a directory called "cds_dir", and that you want to use kmers of size 9, you can run:

$ makekodpotdb 9 cds_dir

which will create a file called "cdskmerf_9.db" that you can use as described above.