Protein-coding potential by kmers

Protein-coding potential exploration by kmers

[goto online analysis]

We assume you have a database of kmer frequencies, cdskmerf_9.db (see below), and a FASTA file (input.fna) with the genomic sequences in which you want to find regions of high protein-coding potential. Then simply run:

$ kodpot -f 1 cdskmerf_9.db input.fna > output.faa

This creates a FASTA file of protein sequences (output.faa). The name of each sequence encodes the coordinates and strand of the corresponding genomic region, relative to your input sequence.
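
When there are several input files, the same command can be wrapped in a small shell function. This is only a sketch: the function name kodpot_batch and the "genomes" directory are hypothetical, and it assumes that kodpot takes one input file per call, exactly as in the command above.

```shell
# Run kodpot on every FASTA file given after the database argument.
# Each output is written next to its input, with .fna replaced by .faa.
# Usage: kodpot_batch cdskmerf_9.db genomes/*.fna   (genomes/ is hypothetical)
kodpot_batch() {
    db=$1
    shift
    for fna in "$@"; do
        # same invocation as the single-file example, one input per call
        kodpot -f 1 "$db" "$fna" > "${fna%.fna}.faa"
    done
}
```

Inputs whose names do not end in ".fna" get ".faa" appended to the full name instead.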

Files:

kodpot (Linux binary): program to search for regions of high protein-coding potential in genomic sequences. (source code)
makekodpotdb (Linux binary): program to build a database of in-frame kmer frequencies from trusted CDS sequences. (source code)
cdskmerf_9.db: database of in-frame 9-mer frequencies from the CDS of complete genomes in NCBI
cdskmerf_10.db: database of in-frame 10-mer frequencies from the CDS of complete genomes in NCBI
cdskmerf_11.db: database of in-frame 11-mer frequencies from the CDS of complete genomes in NCBI

Constructing a custom database of in-frame kmer frequencies

We need a set of trusted coding sequences (CDS) in FASTA format. One source is the CDS of complete genomes at NCBI. For example, information on the genome sequences of bacteria can be obtained here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
The files to use are those whose names end with "_cds_from_genomic.fna.gz".
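
The list of CDS files can be derived from the assembly summary with a short awk filter. The sketch below is not part of kodpot. It assumes the usual tab-separated layout of assembly_summary.txt, with the assembly level in column 12 and the assembly directory in column 20, and that each CDS file is named after the last element of that directory path; check the header line of your copy before relying on it. The function name cds_urls is hypothetical.

```shell
# Print one CDS file URL per complete genome listed in an assembly summary
# read from standard input.
# Assumed columns: 12 = assembly_level, 20 = ftp_path (verify in the header).
cds_urls() {
    awk -F'\t' '
        /^#/ { next }                       # skip comment and header lines
        $12 == "Complete Genome" && $20 != "na" && $20 != "" {
            n = split($20, parts, "/")      # last path element names the assembly
            print $20 "/" parts[n] "_cds_from_genomic.fna.gz"
        }'
}
```

For example, "cds_urls < assembly_summary.txt > cds_urls.txt" writes the list to a file (cds_urls.txt is a hypothetical name) that a download tool can then fetch into your CDS directory.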

Assuming all the CDS files you want to use are in a directory called "cds_dir", and that you want kmers of size 9, run:

$ makekodpotdb 9 cds_dir

This creates a file called "cdskmerf_9.db", which you can use as described above.
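
To make the idea of an in-frame kmer count concrete, here is a toy illustration in awk. It is not makekodpotdb's implementation and says nothing about the format of the .db files. It assumes nucleotide kmers that start at codon boundaries of each CDS, which is only one possible reading of "in-frame". The function name inframe_kmers is hypothetical.

```shell
# Toy illustration only (assumption: nucleotide kmers starting at codon
# boundaries, i.e. offsets 0, 3, 6, ... from the start of each CDS).
# Reads FASTA on stdin, prints "kmer<TAB>count" lines sorted by kmer.
# Usage: inframe_kmers K < cds.fna
inframe_kmers() {
    awk -v k="$1" '
        function flush_seq(   i) {
            # step by 3 so that every kmer starts in frame with the CDS start
            for (i = 1; i + k - 1 <= length(seq); i += 3)
                count[substr(seq, i, k)]++
            seq = ""
        }
        /^>/ { flush_seq(); next }          # header line: finish the previous record
        { seq = seq toupper($0) }           # sequence line: append, ignoring case
        END {
            flush_seq()
            for (kmer in count) print kmer "\t" count[kmer]
        }' | sort
}
```

Dividing each count by the total number of kmers would turn these counts into frequencies.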