HS3D (Homo Sapiens Splice Sites Dataset) is a data set of Homo sapiens exon, intron and splice regions extracted from GenBank Rel.123. Its aim is to provide standardized material for training computational approaches to gene identification and characterization, and for assessing their prediction accuracy. From the complete GenBank Primate Sequences Division, Rel.123 (162,557 entries), the entries of human nuclear DNA that include a complete CDS and more than one exon were selected, and 4523 exons and 3802 introns were extracted from them. For each extracted exon and intron the following details are reported: locus, number, start and end position in the entry, sequence, length, G+C content, and presence of non-AGCT characters (nucleotide scan check). Summary statistics are also reported: overall nucleotides, average G+C content, nucleotide scan check results, number of introns not starting with GT or not ending with AG, minimum, maximum and average length, and length standard deviation.

Donor and acceptor sites (3799+3799) were extracted as windows of 140 nucleotides around each splice site. After discarding windows that do not include canonical GT–AG junctions (65+74), that have insufficient data (not enough material for a 140-nucleotide window) (686+589), that include non-AGCT bases (29+30), or that are redundant (218+226), there are 2796+2880 windows.
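The window filtering described above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' extraction procedure. The function name `filter_windows`, the 0-based `boundary` index, and the symmetric split of the window into 70 nucleotides on each side of the exon/intron boundary are assumptions made for the example. The description states only the window length and the four discard criteria.

```python
from typing import Dict, Iterable, List, Tuple

WINDOW = 140          # window length stated in the description
FLANK = WINDOW // 2   # assumption: 70 nt on each side of the exon/intron boundary


def filter_windows(
    candidates: Iterable[Tuple[str, int]], kind: str
) -> Tuple[List[str], Dict[str, int]]:
    """Apply the four discard criteria, in the order the description lists them.

    candidates: (entry sequence, boundary) pairs, where boundary is the 0-based
        index of the first intron base (donor) or first exon base (acceptor).
        This representation is hypothetical.
    kind: "donor" or "acceptor".
    Returns the kept windows and a count of discards per criterion.
    """
    if kind not in ("donor", "acceptor"):
        raise ValueError("kind must be 'donor' or 'acceptor'")

    kept: List[str] = []
    seen = set()
    discarded = {"non_canonical": 0, "insufficient": 0, "non_agct": 0, "redundant": 0}

    for seq, boundary in candidates:
        seq = seq.upper()

        # 1. Canonical junction: introns start with GT (donor) and end with AG (acceptor).
        if kind == "donor":
            dinucleotide, expected = seq[boundary:boundary + 2], "GT"
        else:
            dinucleotide, expected = seq[max(boundary - 2, 0):boundary], "AG"
        if dinucleotide != expected:
            discarded["non_canonical"] += 1
            continue

        # 2. Enough material in the entry for a full 140-nucleotide window.
        start, end = boundary - FLANK, boundary + FLANK
        if start < 0 or end > len(seq):
            discarded["insufficient"] += 1
            continue
        window = seq[start:end]

        # 3. Nucleotide scan check: only A, G, C and T are allowed.
        if set(window) - set("AGCT"):
            discarded["non_agct"] += 1
            continue

        # 4. Redundancy: here taken to mean an exact duplicate of a kept window.
        if window in seen:
            discarded["redundant"] += 1
            continue

        seen.add(window)
        kept.append(window)

    return kept, discarded
```

Under the symmetric-window assumption, the GT of a kept donor window falls at positions 71-72 (1-based) and the AG of a kept acceptor window at positions 69-70. Treating "redundant" as exact string identity is also an assumption; a similarity-based criterion would need a different check.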
1. P. Pollastro, S. Rampone (2002). HS3D, a Dataset of Homo Sapiens Splice Regions, and its Extraction Procedure from a Major Public Database, International Journal of Modern Physics C, 13(8), 1105-1117. (Please cite this paper.)
2. P. Pollastro, S. Rampone (2003). HS3D: Homo Sapiens Splice Site Data Set, Nucleic Acids Research, 2003 Annual Database Issue.