Schema for UCSC Genes - UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics)
  Database: hg19    Primary Table: knownGene    Row Count: 82,960
Format description: Genes based on RefSeq, GenBank, and UniProt.
| field | example | SQL type | info | description |
| --- | --- | --- | --- | --- |
| name | uc001aaa.3 | varchar(255) | values | Name of gene |
| chrom | chr1 | varchar(255) | values | Reference sequence chromosome or scaffold |
| strand | + | char(1) | values | + or - for strand |
| txStart | 11873 | int(10) unsigned | range | Transcription start position |
| txEnd | 14409 | int(10) unsigned | range | Transcription end position |
| cdsStart | 11873 | int(10) unsigned | range | Coding region start |
| cdsEnd | 11873 | int(10) unsigned | range | Coding region end |
| exonCount | 3 | int(10) unsigned | range | Number of exons |
| exonStarts | 11873,12612,13220, | longblob |  | Exon start positions |
| exonEnds | 12227,12721,14409, | longblob |  | Exon end positions |
| proteinID |  | varchar(40) | values | UniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC Genes |
| alignID | uc001aaa.3 | varchar(255) | values | Unique identifier for each (known gene, alignment position) pair |
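
The field list above maps naturally onto a small record type. The following is a minimal sketch, assuming the row arrives as one tab-separated line with the twelve fields in schema order (as in a typical table dump). The names `KnownGene`, `parse_coords`, and `parse_known_gene` are illustrative and not part of any UCSC tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KnownGene:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exons: List[Tuple[int, int]]  # (start, end) pairs taken from exonStarts/exonEnds
    protein_id: str               # may be empty, as in the example row
    align_id: str


def parse_coords(blob: str) -> List[int]:
    """Split a comma-terminated list such as '11873,12612,13220,' into ints."""
    return [int(x) for x in blob.split(",") if x]


def parse_known_gene(line: str) -> KnownGene:
    """Parse one tab-separated knownGene row (assumed layout: schema order)."""
    f = line.rstrip("\n").split("\t")
    starts, ends = parse_coords(f[8]), parse_coords(f[9])
    exon_count = int(f[7])
    if not (len(starts) == len(ends) == exon_count):
        raise ValueError(f"{f[0]}: exon lists do not match exonCount")
    return KnownGene(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]),
                     int(f[6]), exon_count, list(zip(starts, ends)), f[10], f[11])


# The example values from the schema table, joined with tabs.
row = "\t".join([
    "uc001aaa.3", "chr1", "+", "11873", "14409", "11873", "11873", "3",
    "11873,12612,13220,", "12227,12721,14409,", "", "uc001aaa.3",
])
gene = parse_known_gene(row)
print(gene.name, gene.exons)
```

Pairing exonStarts with exonEnds at parse time keeps later code from having to index two parallel lists.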

Connected Tables and Joining Fields
      hg19.bioCycPathway.kgID (via knownGene.name)
      hg19.ccdsKgMap.geneId (via knownGene.name)
      hg19.ceBlastTab.query (via knownGene.name)
      hg19.dmBlastTab.query (via knownGene.name)
      hg19.drBlastTab.query (via knownGene.name)
      hg19.foldUtr3.name (via knownGene.name)
      hg19.foldUtr5.name (via knownGene.name)
      hg19.gnfAtlas2Distance.query (via knownGene.name)
      hg19.gnfAtlas2Distance.target (via knownGene.name)
      hg19.gnfU95Distance.query (via knownGene.name)
      hg19.gnfU95Distance.target (via knownGene.name)
      hg19.humanHprdP2P.query (via knownGene.name)
      hg19.humanHprdP2P.target (via knownGene.name)
      hg19.humanVidalP2P.query (via knownGene.name)
      hg19.humanVidalP2P.target (via knownGene.name)
      hg19.humanWankerP2P.query (via knownGene.name)
      hg19.humanWankerP2P.target (via knownGene.name)
      hg19.keggPathway.kgID (via knownGene.name)
      hg19.kg5ToKg6.newId (via knownGene.name)
      hg19.kg6ToKg7.newId (via knownGene.name)
      hg19.kgAlias.kgID (via knownGene.name)
      hg19.kgColor.kgID (via knownGene.name)
      hg19.kgProtAlias.kgID (via knownGene.name)
      hg19.kgProtMap2.qName (via knownGene.name)
      hg19.kgSpAlias.kgID (via knownGene.name)
      hg19.kgTargetAli.qName (via knownGene.name)
      hg19.kgTxInfo.name (via knownGene.name)
      hg19.kgXref.kgID (via knownGene.name)
      hg19.knownBlastTab.query (via knownGene.name)
      hg19.knownBlastTab.target (via knownGene.name)
      hg19.knownCanonical.protein (via knownGene.name)
      hg19.knownCanonical.transcript (via knownGene.name)
      hg19.knownGeneMrna.name (via knownGene.name)
      hg19.knownGenePep.name (via knownGene.name)
      hg19.knownGeneTxMrna.name (via knownGene.name)
      hg19.knownGeneTxPep.name (via knownGene.name)
      hg19.knownIsoforms.transcript (via knownGene.name)
      hg19.knownToAllenBrain.name (via knownGene.name)
      hg19.knownToEnsembl.name (via knownGene.name)
      hg19.knownToGnfAtlas2.name (via knownGene.name)
      hg19.knownToHInv.name (via knownGene.name)
      hg19.knownToHprd.name (via knownGene.name)
      hg19.knownToKeggEntrez.name (via knownGene.name)
      hg19.knownToLocusLink.name (via knownGene.name)
      hg19.knownToPfam.name (via knownGene.name)
      hg19.knownToRefSeq.name (via knownGene.name)
      hg19.knownToSuper.gene (via knownGene.name)
      hg19.knownToTreefam.name (via knownGene.name)
      hg19.knownToU133.name (via knownGene.name)
      hg19.knownToU133Plus2.name (via knownGene.name)
      hg19.knownToU95.name (via knownGene.name)
      hg19.knownToVisiGene.name (via knownGene.name)
      hg19.knownToWikipedia.name (via knownGene.name)
      hg19.mmBlastTab.query (via knownGene.name)
      hg19.rnBlastTab.query (via knownGene.name)
      hg19.scBlastTab.query (via knownGene.name)
      hg19.ucscRetroInfo5.kgName (via knownGene.name)
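
Each entry above names a column in another table that holds the same value as knownGene.name, so the tables can be joined on that pair. The sketch below illustrates the idea with an in-memory SQLite database instead of the real MySQL server. The table contents are toy data, the `geneSymbol` column of kgXref is assumed for illustration, and the symbols are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                            txStart INTEGER, txEnd INTEGER);
    -- geneSymbol is an assumed column; only kgID appears in the list above.
    CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
""")
conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?, ?)", [
    ("uc001aaa.3", "chr1", "+", 11873, 14409),
    ("uc009vis.3", "chr1", "-", 14361, 16765),
])
conn.executemany("INSERT INTO kgXref VALUES (?, ?)", [
    ("uc001aaa.3", "SYMBOL_A"),  # placeholder symbols, not real annotations
    ("uc009vis.3", "SYMBOL_B"),
])

# Join on the documented pair: kgXref.kgID (via knownGene.name).
rows = conn.execute("""
    SELECT g.name, g.chrom, g.txStart, g.txEnd, x.geneSymbol
    FROM knownGene AS g
    JOIN kgXref AS x ON x.kgID = g.name
    ORDER BY g.txStart
""").fetchall()

for r in rows:
    print(r)
```

The same pattern applies to any other entry in the list by swapping in that table and its joining column.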

Sample Rows
 
| name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | proteinID | alignID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uc001aaa.3 | chr1 | + | 11873 | 14409 | 11873 | 11873 | 3 | 11873,12612,13220, | 12227,12721,14409, |  | uc001aaa.3 |
| uc010nxr.1 | chr1 | + | 11873 | 14409 | 11873 | 11873 | 3 | 11873,12645,13220, | 12227,12697,14409, |  | uc010nxr.1 |
| uc010nxq.1 | chr1 | + | 11873 | 14409 | 12189 | 13639 | 3 | 11873,12594,13402, | 12227,12721,14409, | B7ZGX9 | uc010nxq.1 |
| uc009vis.3 | chr1 | - | 14361 | 16765 | 14361 | 14361 | 4 | 14361,14969,15795,16606, | 14829,15038,15942,16765, |  | uc009vis.3 |
| uc009vit.3 | chr1 | - | 14361 | 19759 | 14361 | 14361 | 9 | 14361,14969,15795,16606,16857,17232,17914,18267,18912, | 14829,15038,15947,16765,17055,17742,18061,18366,19759, |  | uc009vit.3 |
| uc009viu.3 | chr1 | - | 14361 | 19759 | 14361 | 14361 | 10 | 14361,14969,15795,16606,16857,17232,17914,18267,18500,18912, | 14829,15038,15947,16765,17055,17742,18061,18369,18554,19759, |  | uc009viu.3 |
| uc001aae.4 | chr1 | - | 14361 | 19759 | 14361 | 14361 | 10 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,18912, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,19759, |  | uc001aae.4 |
| uc001aah.4 | chr1 | - | 14361 | 29370 | 14361 | 14361 | 11 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, |  | uc001aah.4 |
| uc009vir.3 | chr1 | - | 14361 | 29370 | 14361 | 14361 | 10 | 14361,14969,15795,16606,16857,17232,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17742,18061,18366,24891,29370, |  | uc009vir.3 |
| uc009viq.3 | chr1 | - | 14361 | 29370 | 14361 | 14361 | 7 | 14361,15795,16606,16857,17605,24737,29320, | 14829,15947,16765,17055,18061,24891,29370, |  | uc009viq.3 |

Note: all start coordinates in our database are 0-based, not 1-based, while end coordinates are 1-based, so the length of an interval is simply its end minus its start.
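
In practice, a stored interval pairs a 0-based start with an end that can be read either as 1-based and inclusive or as 0-based and exclusive. The short sketch below shows the resulting arithmetic for converting a stored interval to the 1-based form typically used in position strings. The helper names are illustrative only.

```python
from typing import Tuple


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a stored (0-based start, end) pair to a 1-based closed interval."""
    return start + 1, end


def interval_length(start: int, end: int) -> int:
    """Length in bases of a stored interval; no +1 is needed."""
    return end - start


def position_string(chrom: str, start: int, end: int) -> str:
    """Format a stored interval as a 1-based 'chrom:start-end' string."""
    s, e = to_one_based(start, end)
    return f"{chrom}:{s}-{e}"


# txStart and txEnd of uc001aaa.3 from the sample rows.
print(position_string("chr1", 11873, 14409))
print(interval_length(11873, 14409))
```

Keeping all arithmetic in the stored form and converting only for display avoids off-by-one errors.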

UCSC Genes (knownGene) Track Description
 

Description

The UCSC Genes track is a set of gene predictions based on data from RefSeq, GenBank, CCDS, Rfam, and the tRNA Genes track. The track includes both protein-coding genes and non-coding RNA genes. Both types of genes can produce non-coding transcripts, but non-coding RNA genes do not produce protein-coding transcripts. This is a moderately conservative set of predictions. Transcripts of protein-coding genes require the support of one RefSeq RNA, or one GenBank RNA sequence plus at least one additional line of evidence. Transcripts of non-coding RNA genes require the support of one Rfam or tRNA prediction. Compared to RefSeq, this gene set generally has about 10% more protein-coding genes, approximately four times as many putative non-coding genes, and about twice as many splice variants.

Display Conventions and Configuration

In general, this track follows the display conventions for gene prediction tracks. The exons for putative non-coding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker. The following color key is used:

  • Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
  • Dark blue -- transcript has been reviewed or validated by either the RefSeq, SwissProt or CCDS staff
  • Medium blue -- other RefSeq transcripts
  • Light blue -- non-RefSeq transcripts

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions.
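
The thin and thick blocks described above can be derived from a row's exon list together with cdsStart and cdsEnd. The following is a rough sketch of that split, not the browser's actual drawing code. It assumes that a row whose cdsStart equals cdsEnd has no coding region, which matches the sample rows that lack a proteinID. The function name `split_exons` is illustrative.

```python
from typing import List, Tuple

Block = Tuple[int, int, str]  # (start, end, "thin" or "thick")


def split_exons(exons: List[Tuple[int, int]],
                cds_start: int, cds_end: int) -> List[Block]:
    """Split each exon into untranslated (thin) and coding (thick) parts."""
    blocks: List[Block] = []
    for start, end in exons:
        thick_start = max(start, cds_start)
        thick_end = min(end, cds_end)
        if thick_start >= thick_end:
            # No overlap with the coding region: the whole exon is thin.
            blocks.append((start, end, "thin"))
            continue
        if start < thick_start:
            blocks.append((start, thick_start, "thin"))
        blocks.append((thick_start, thick_end, "thick"))
        if thick_end < end:
            blocks.append((thick_end, end, "thin"))
    return blocks


# uc010nxq.1 from the sample rows: cdsStart=12189, cdsEnd=13639.
coding = split_exons([(11873, 12227), (12594, 12721), (13402, 14409)],
                     12189, 13639)
# uc001aaa.3 from the sample rows: cdsStart == cdsEnd, so every block is thin.
noncoding = split_exons([(11873, 12227), (12612, 12721), (13220, 14409)],
                        11873, 11873)
print(coding)
print(noncoding)
```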

Methods

The UCSC Genes are built using a multi-step pipeline:

  1. RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping only the best alignments for each RNA. Alignments are discarded if they do not meet certain sequence identity and coverage filters. All sequences must align with high (98%) identity. The sequence coverage must be at least 90% for shorter sequences (those with 2500 or fewer bases), with the coverage threshold progressively relaxed for longer sequences.
  2. Alignments are broken up at non-intronic gaps, with small isolated fragments thrown out.
  3. A splicing graph is created for each set of overlapping alignments. This graph has an edge for each exon or intron, and a vertex for each splice site, start, and end. Each RNA that contributes to an edge is kept as evidence for that edge. Gene models from the Consensus CDS project (CCDS) are also added to the graph.
  4. A similar splicing graph is created in the mouse, based on mouse RNA and ESTs. If the mouse graph has an edge that is orthologous to an edge in the human graph, that is added to the evidence for the human edge.
  5. If an edge in the splicing graph is supported by two or more human ESTs, that EST support is added as evidence for the edge.
  6. If there is an Exoniphy prediction for an exon, that is added as evidence.
  7. The graph is traversed to generate all unique transcripts. The traversal is guided by the initial RNAs to avoid a combinatorial explosion in alternative splicing. All RefSeq transcripts are output. For other multi-exon transcripts to be output, an edge supported by at least one additional line of evidence beyond the RNA is required. Single-exon genes require either two RNAs or two additional lines of evidence beyond the single RNA.
  8. Alignments are merged in from the hg19 tRNA Genes track and from Rfam in regions that are syntenic with the mm9 mouse genome.
  9. Protein predictions are generated. For non-RefSeq transcripts we use the txCdsPredict program to determine if the transcript is protein-coding, and if so, the locations of the start and stop codons. The program weighs as positive evidence the length of the protein, the presence of a Kozak consensus sequence at the start codon, and the length of the orthologous predicted protein in other species. As negative evidence it considers nonsense-mediated decay and start codons in any frame upstream of the predicted start codon. For RefSeq transcripts the RefSeq protein prediction is used directly instead of this procedure. For CCDS proteins the CCDS protein is used directly.
  10. The corresponding UniProt protein is found, if any.
  11. The transcript is assigned a permanent "uc" accession. If the transcript was not in the previous release of UCSC Genes, the accession ends with the suffix ".1" indicating that this is the first version of this transcript. If the transcript is identical to some transcript in the previous release of UCSC Genes, the accession is re-used with the same version number. If the transcript is not identical to any transcript in the previous release but it overlaps a similar transcript with a compatible structure, the previous accession is re-used with the version number incremented.
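
The splicing graph of step 3 can be sketched with a few lines of code. This is a simplified illustration under stated assumptions, not the pipeline's implementation: each alignment is given as a list of exon blocks, every block boundary becomes a vertex, and each exon or intron becomes an edge that records which RNAs support it. The other evidence sources (CCDS, mouse orthology, ESTs, Exoniphy) are not modeled, and the RNA identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]  # (from_position, to_position, "exon" or "intron")


def build_splicing_graph(
    alignments: Dict[str, List[Tuple[int, int]]]
) -> Tuple[Set[int], Dict[Edge, Set[str]]]:
    """Build vertices and evidence-annotated edges from overlapping alignments."""
    vertices: Set[int] = set()
    evidence: Dict[Edge, Set[str]] = defaultdict(set)
    for rna_id, blocks in alignments.items():
        blocks = sorted(blocks)
        for i, (start, end) in enumerate(blocks):
            # Each block boundary is a vertex (splice site, start, or end).
            vertices.update((start, end))
            evidence[(start, end, "exon")].add(rna_id)
            if i + 1 < len(blocks):
                # The gap to the next block is treated as an intron edge.
                evidence[(end, blocks[i + 1][0], "intron")].add(rna_id)
    return vertices, dict(evidence)


# Exon structures borrowed from two sample rows, under hypothetical RNA names.
alignments = {
    "rnaA": [(11873, 12227), (12612, 12721), (13220, 14409)],
    "rnaB": [(11873, 12227), (12645, 12697), (13220, 14409)],
}
vertices, evidence = build_splicing_graph(alignments)
for edge in sorted(evidence):
    print(edge, sorted(evidence[edge]))
```

Edges shared by both RNAs end up with two supporting identifiers, which is the kind of per-edge evidence the later steps add to and consult.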

Related Data

The UCSC Genes transcripts are annotated in numerous tables, each of which is also available as a downloadable file. These include tables that link UCSC Genes transcripts to external datasets (such as knownToLocusLink, which maps UCSC Genes transcripts to Entrez identifiers, previously known as Locus Link identifiers), and tables that detail some property of UCSC Genes transcript sequences (such as knownToPfam, which identifies any Pfam domains found in the UCSC Genes protein-coding transcripts). A full list of the associated tables can be seen in the Table Browser by selecting UCSC Genes in the track menu; the list then appears in the table menu. Note that some of these tables refer to UCSC Genes by its former name of Known Genes, sometimes abbreviated as known or kg. While the complete set of annotation tables is too long to describe, some of the more important tables are described below.

  • kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA sequences (if any) on which each transcript was based.
  • knownToRefSeq identifies the RefSeq transcript that each UCSC Genes transcript is most closely associated with. That RefSeq transcript is either the RefSeq on which the UCSC Genes transcript was based, if there is one, or the RefSeq transcript that the UCSC Genes transcript overlaps at the most bases.
  • knownGeneMrna contains the mRNA sequence that represents each UCSC Genes transcript. If the transcript is based on a RefSeq transcript, then this table contains the RefSeq transcript, including any portions that do not align to the genome.
  • knownGeneTxMrna contains mRNA sequences for each UCSC Genes transcript. In contrast to the sequences in knownGeneMrna, these sequences are derived by obtaining the sequence of each exon from the reference genome and concatenating these exonic sequences.
  • knownGenePep contains the protein sequences derived from the knownGeneMrna transcript sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based on these sequences.
  • knownGeneTxPep contains the protein translation (if any) of each mRNA sequence in knownGeneTxMrna.
  • knownIsoforms maps each transcript to a cluster ID, which identifies a cluster of isoforms of the same gene.
  • knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
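
The knownGeneTxMrna entry above describes building a transcript sequence by concatenating exon sequences from the reference genome. A minimal sketch of that idea follows. It assumes the chromosome sequence is already in memory as a string indexed from 0, and it reverse-complements minus-strand transcripts, which is a standard convention assumed here and not stated in the text above. The function names and the toy sequence are illustrative.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequence(chrom_seq: str,
                        exons: List[Tuple[int, int]],
                        strand: str) -> str:
    """Concatenate exon sequences; exons use 0-based start, exclusive end."""
    spliced = "".join(chrom_seq[start:end] for start, end in sorted(exons))
    return reverse_complement(spliced) if strand == "-" else spliced


toy_chrom = "AACCGGTTAACC"      # toy sequence, not real genome data
toy_exons = [(0, 3), (6, 9)]    # two exons with one intron between them
print(transcript_sequence(toy_chrom, toy_exons, "+"))
print(transcript_sequence(toy_chrom, toy_exons, "-"))
```

Because stored starts are 0-based and ends are exclusive, the exon coordinates can be used directly as Python slice bounds.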

Credits

The UCSC Genes track was produced at UCSC using a computational pipeline developed by Jim Kent, Chuck Sugnet, Melissa Cline and Mark Diekhans. It is based on data from NCBI RefSeq, UniProt (including TrEMBL and TrEMBL-NEW), CCDS, and GenBank as well as data from Rfam and the Todd Lowe lab. Our thanks to the people running these databases and to the scientists worldwide who have made contributions to them.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32:D23-6.

Chan PP, Lowe TM. GtRNAdb: A database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 2009 Jan;37:D93-7.

Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res. 2011 Jan;39:D141-5.

Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46.

Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64.

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.

The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan 1;40(D1):D71-D75. Epub 2011 Nov 18.