Discrover
Discrover is a motif discovery method to find binding sites of nucleic acid binding proteins.
The corresponding publication is:
Binding site discovery from nucleic acid sequences by discriminative learning of hidden Markov models
Jonas Maaskola and Nikolaus Rajewsky
Nucleic Acids Research, 42(21):12995-13011, Dec 2014. doi:10.1093/nar/gku1083
Software
Discrover is distributed under the GNU General Public License v3 and is available on GitHub. There you can retrieve the source code; we also provide binary packages for select Linux distributions, including Debian, Fedora, and Ubuntu.
Synthetic sequence data for motif discovery performance evaluation
Please refer to the Materials section of the publication for details on how the data were generated.
Sequence archives
There are three sets of experiments. We provide a separate download for each set.
- Basic experiments (md5sum: 2a1245773c9ad105cd92a91521eb8359)
- 3'UTR experiments (md5sum: 19126f1b6ea41d78a83d51c3972b416e)
- Decoy experiments (md5sum: 34d550c1bccf48ceb486c68c69f1a030)
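A downloaded archive can be checked against the checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module. The archive file names in `EXPECTED_MD5` are hypothetical placeholders, since the actual file names are not stated here; only the checksum values are taken from the list.

```python
import hashlib

# Checksums copied from the list above. The archive file names are
# hypothetical placeholders; substitute the names of the files you downloaded.
EXPECTED_MD5 = {
    "basic.tar.gz": "2a1245773c9ad105cd92a91521eb8359",
    "3utr.tar.gz": "19126f1b6ea41d78a83d51c3972b416e",
    "decoy.tar.gz": "34d550c1bccf48ceb486c68c69f1a030",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify("basic.tar.gz", EXPECTED_MD5["basic.tar.gz"])` should return `True` for an intact download saved under that (assumed) name.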
Archive structure
The downloads for the basic and 3'UTR experiments contain paired FASTA files for the signal and control sequences in the following directory structure:
EXPNAME/nAAA/lenBBB/pCCC_klDDD.signal.fa
EXPNAME/nAAA/lenBBB/pCCC_klDDD.control.fa
For the decoy experiments the structure is as follows:
EXPNAME/nAAA/lenBBB/pCCC_klDDD_rpEEE_rklFFF.signal.fa
EXPNAME/nAAA/lenBBB/pCCC_klDDD_rpEEE_rklFFF.decoy_control.fa
The meaning of the uppercase strings is as follows:
Code | Meaning |
EXPNAME | Experiment name: PWM for the basic experiments, 3utr for the 3'UTR experiments, and decoy for the decoy experiments |
AAA | Number of sequences |
BBB | Length of sequences |
CCC | Signal motif implantation frequency |
DDD | Signal motif information content |
EEE | Control motif implantation frequency |
FFF | Control motif information content |
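The naming scheme can be decoded programmatically, for instance to collect results by parameter setting. The sketch below is one possible approach, not part of Discrover itself: the function name `parse_archive_path` and the field names are illustrative, and because the exact numeric formatting of the placeholders is not specified here, all values are returned as strings.

```python
import re

# Mirrors the layouts shown above. The optional _rp/_rkl part is only
# present for the decoy experiments. Values are kept as opaque strings.
PATH_PATTERN = re.compile(
    r"^(?P<expname>[^/]+)/n(?P<n_seqs>[^/]+)/len(?P<seq_len>[^/]+)/"
    r"p(?P<signal_freq>[^_/]+)_kl(?P<signal_ic>[^_/]+?)"
    r"(?:_rp(?P<control_freq>[^_/]+)_rkl(?P<control_ic>[^_/]+?))?"
    r"\.(?P<kind>signal|control|decoy_control)\.fa$"
)


def parse_archive_path(path):
    """Split an archive-relative FASTA path into its parameter fields.

    Returns a dict of strings; control_freq and control_ic are None
    for paths without the decoy-specific part.
    """
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f"unrecognized archive path: {path!r}")
    return match.groupdict()
```

Calling `parse_archive_path` on a path of the decoy form yields all eight fields, while paths from the basic and 3'UTR experiments leave the two control-motif fields as `None`.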
Sequences
The sequences were generated as described in the publication. Background sequence is in lower case characters; implanted motifs are in upper case. This is illustrated below:
>signal_0
cgttgtgcGCCACGCAaaag
>signal_1
taactttacGCCACCCActt
>signal_2
cacGCCACGGAggaggactc
>signal_3
TCCACGCAaatcaattcctt
>signal_4
atgtatgcgGTCACCGAgag
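Because implanted motifs are marked by case, their positions can be recovered directly from the FASTA text, which is useful when scoring predicted sites against the ground truth. The following is a minimal sketch using only the standard library; the function name `implanted_motifs` is illustrative, and positions are reported as 0-based offsets.

```python
import re


def implanted_motifs(fasta_text):
    """Return (sequence name, 0-based start, motif) for each upper-case run."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequences may span several lines; collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    hits = []
    for name, sequence in records:
        for match in re.finditer(r"[A-Z]+", sequence):
            hits.append((name, match.start(), match.group()))
    return hits


example = """>signal_0
cgttgtgcGCCACGCAaaag
>signal_3
TCCACGCAaatcaattcctt
"""
print(implanted_motifs(example))
# [('signal_0', 8, 'GCCACGCA'), ('signal_3', 0, 'TCCACGCA')]
```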