Protein-coding potential exploration by kmers
We assume you have a database of kmer frequencies, cdskmerf_9.db (see below), and a fasta file (input.fna) with the genomic sequences in which you want to find regions of high protein-coding potential. Then simply execute:
$ kodpot -f 1 cdskmerf_9.db input.fna > output.faa
This creates a fasta file (output.faa) with the protein sequences. Each sequence name encodes the coordinates and strand of the corresponding genomic region, relative to your input sequence.
- kodpot (linux binary): program to search for regions of high protein-coding potential in genomic sequences. (source code)
- makekodpotdb (linux binary): program to construct a database of in-frame kmer frequencies from trusted CDS sequences. (source code)
- cdskmerf_9.db: database of in-frame 9-mer frequencies from the CDS of complete genomes in NCBI
- cdskmerf_10.db: database of in-frame 10-mer frequencies from the CDS of complete genomes in NCBI
- cdskmerf_11.db: database of in-frame 11-mer frequencies from the CDS of complete genomes in NCBI
Constructing a custom database of in-frame kmer frequencies
We need a set of trusted coding sequences (CDS) in fasta format. One source for these sequences is the CDS from complete genomes in the NCBI. For example, information for genome sequences of bacteria can be obtained here:
The files to use are those whose names end with "_cds_from_genomic.fna.gz".
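If the NCBI files were downloaded into a nested directory tree, the selection by file name can be sketched in Python as below. This is only an illustration: the function name collect_cds_files and the download directory passed to it are hypothetical, and the files are copied unchanged (still gzipped) into the target directory.

```python
import shutil
from pathlib import Path

# File name ending given in the text above.
CDS_SUFFIX = "_cds_from_genomic.fna.gz"


def collect_cds_files(download_root, cds_dir="cds_dir"):
    """Copy every file whose name ends with CDS_SUFFIX into cds_dir.

    download_root is a hypothetical directory holding the NCBI downloads.
    Returns the sorted list of copied file names.
    """
    target = Path(cds_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(download_root).rglob("*" + CDS_SUFFIX)):
        # Skip directories and files that already live in the target.
        if not path.is_file() or path.parent.resolve() == target.resolve():
            continue
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return sorted(copied)
```

For example, collect_cds_files("ncbi_download") would fill "cds_dir" from a download directory named "ncbi_download" (an assumed name).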
Assuming you already have all the CDS files you want to use in a directory called "cds_dir", and that you want to use kmers of size 9, you can run:
$ makekodpotdb 9 cds_dir
This creates a file called "cdskmerf_9.db", which you can use as described above.
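To give an intuition for what "in-frame kmer frequencies" could mean, here is a minimal Python sketch. It assumes that an in-frame kmer is a nucleotide kmer starting at a codon boundary of the CDS (offsets 0, 3, 6, ...). That reading is an assumption, and the code is neither the makekodpotdb implementation nor a reader or writer for its .db format; the function name is hypothetical.

```python
from collections import Counter


def inframe_kmer_frequencies(cds_sequences, k=9):
    """Relative frequencies of kmers that start at codon boundaries.

    cds_sequences: iterable of nucleotide strings, each assumed to start
    in frame (first base = first base of a codon).
    """
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        # Step by 3 so that every kmer starts on a codon boundary.
        for start in range(0, len(seq) - k + 1, 3):
            kmer = seq[start:start + k]
            # Ignore kmers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

With the single toy sequence "ATGAAACCCGGG" and k=9, the function counts the two kmers starting at offsets 0 and 3 and gives each a frequency of 0.5.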