GIS RNA PET Track Settings
RNA Sub-cellular Localization by Paired-end diTag Sequencing from ENCODE/GIS

Localizations: Whole Cell, Cytosol, Nucleus, Polysome, Nucleoplasm, Chromatin, Nucleolus
Cell lines: GM12878 (Tier 1), H1-hESC (Tier 1), K562 (Tier 1), A549 (Tier 2), HeLa-S3 (Tier 2), HepG2 (Tier 2), HUVEC (Tier 2), IMR90 (Tier 2), MCF-7 (Tier 2), SK-N-SH (Tier 2)
Subtracks:
Cell Line  Localization  Library Prep Method  Views  Replicate rank  Track Name  Restricted Until
H1-hESC  Whole Cell  Clone-free  Clusters  1st  H1-hESC whole cell polyA+ clone-free RNA PET Clusters Rep 1 from ENCODE/GIS  2010-08-13
H1-hESC  Whole Cell  Clone-free  Minus Raw Signal  1st  H1-hESC whole cell polyA+ clone-free RNA PET Minus Signal Rep 1 from ENCODE/GIS  2010-08-13
H1-hESC  Whole Cell  Clone-free  Plus Raw Signal  1st  H1-hESC whole cell polyA+ clone-free RNA PET Plus Signal Rep 1 from ENCODE/GIS  2010-08-13


Description

This track was produced as part of the ENCODE Transcriptome Project. It shows the starts and ends of full-length mRNA transcripts determined by Gene Identification Signature (GIS) paired-end ditag (PET) sequencing, using RNA extracts from different sub-cellular localizations in different cell lines. The short tags used in GIS-PET sequencing provide signatures of the 5' start and the 3' end of individual mRNA transcripts, thus demarcating the first and last exons. The tags contain enough sequence information to be mapped uniquely to the genome, which in turn makes it possible to identify unconventional fusion transcripts. The 5' and 3' tags, extracted by restriction enzyme digestion, are ligated together to form a ditag for sequencing. The 3' tag includes two adenine bases from the polyA tail, which reduces the amount of unique sequence it carries.

The RNA-PET data in this track come in two PET lengths, depending on which restriction enzyme was used to extract the tags. The cloning-based PET method (18 bp and 16 bp for the 5' and 3' ends, respectively) is the earlier version (Ng et al., 2006). The cloning-free PET approach (27 bp and 25 bp for the 5' and 3' ends, respectively) is a more recent modification that uses the Type III restriction enzyme EcoP15I to generate longer PETs (Ruan and Ruan, 2012), which significantly improves both library construction and mapping efficiency. Both versions of PET templates were sequenced on the Illumina platform using 2 x 36 bp paired-end sequencing. See the Methods and References sections below for more details.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. Color differences among the views are arbitrary. They provide a visual cue for distinguishing between the different cell types and compartments.

Clusters
The Clusters view shows clusters built from the alignments. In the graphical display, the ends are represented by blocks connected by a horizontal line. In full and packed display modes, the arrowheads on the horizontal line represent the direction of transcription. Although some of the subtracks have score information, most do not, and score filtering has been disabled.
Plus Raw Signal
The Plus Raw Signal view graphs the base-by-base density of tags on the forward strand.
Minus Raw Signal
The Minus Raw Signal view graphs the base-by-base density of tags on the reverse strand.
Alignments
The Alignments view shows alignments of individual PET sequences. The alignment files follow the standard SAM/BAM format described in the SAM Format Specification. Some files also use the XA tag, generated by Bowtie, to represent the total number of mismatches in the tag.
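
As a minimal sketch of how the optional XA field could be read from a SAM text line, the following Python function splits one alignment record and returns its optional TAG:TYPE:VALUE fields. The example record (read name, position, sequence, and tag value) is made up for illustration and is not taken from the files of this track.

```python
def parse_sam_tags(sam_line):
    """Return the optional fields of one SAM alignment line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns of a SAM record are mandatory;
    # optional fields follow in TAG:TYPE:VALUE form.
    for field in fields[11:]:
        tag, type_code, value = field.split(":", 2)
        # Type "i" marks an integer value; other types are kept as strings.
        tags[tag] = int(value) if type_code == "i" else value
    return tags


# Hypothetical 27 bp tag alignment, for illustration only.
example = "\t".join([
    "pet_0001", "0", "chr1", "1000", "255", "27M", "*", "0", "0",
    "ACGT" * 6 + "ACG",   # 27 bases
    "I" * 27,             # 27 quality characters
    "XA:i:1",
])

tags = parse_sam_tags(example)
mismatches = tags.get("XA")  # None when a file does not carry the XA tag
```

Because only some files carry the XA tag, the lookup uses `get` so that records without it yield `None` instead of raising an error.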

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.


Methods

Cells were grown according to the approved ENCODE cell culture protocols. Two different GIS RNA-PET protocols were used to generate the full-length transcriptome PETs: a cloning-free RNA-PET library construction and sequencing strategy (Ruan and Ruan, 2012), and a cloning-based library construction (Ng et al., 2005) combined with more recent Illumina paired-end sequencing.

Cloning-free RNA-PET (52 bp reads; 27 bp and 25 bp tags for the 5' and 3' ends, respectively)

Method: The cloning-free RNA-PET libraries were generated from polyA mRNA samples and constructed using a recently modified GIS protocol (Ruan and Ruan, 2012). High-quality total RNA was used as starting material and purified with a MACs polyT column to obtain full-length polyA mRNAs. Approximately 5 µg of enriched polyA mRNA was reverse transcribed into full-length cDNA. Specific linker sequences were ligated to the full-length cDNA, and the modified cDNA was circularized by ligation, generating circular cDNA molecules. The 27 bp tag from each end of the full-length cDNA was extracted by digestion with the type III enzyme EcoP15I. The resulting PETs were ligated to sequencing adaptors at both ends, amplified by PCR, and further purified as complex templates for paired-end sequencing on Illumina platforms.

Data: The sequenced RNA-PETs resulted in reads of 27 bp and 25 bp corresponding to the 5' and 3' end of each cDNA, respectively. Redundant and noisy reads were excluded from downstream analysis. The strand-specific orientation of each PET was determined using the barcode built into the sequencing template. The oriented RNA-PETs were mapped onto the reference genome allowing up to two mismatches. The majority of the PETs mapped to known transcripts. A small portion of misaligned PETs, defined as discordant PETs, had tags that mapped too far from each other, in the wrong orientation, or to different chromosomes. These discordant PETs indicate the existence of transcript variants that could be caused by genomic structural variants (such as fusions, deletions, insertions, inversions, tandem repeats, or translocations) or by RNA trans-splicing.

Cloning-based RNA-PET (34 bp reads; 18 bp and 16 bp tags for the 5' and 3' ends, respectively)

Method: The cloning-based RNA-PET (GIS-PET) libraries were generated from polyA RNA samples and constructed using the protocol described by Ng et al., 2005. Good-quality total RNA was used as starting material and further purified with a MACs polyT column to enrich for polyA mRNA. Approximately 10 µg of polyA-enriched mRNA was reverse transcribed into full-length cDNA. The full-length cDNA was modified with specific linker sequences and ligated to a GIS-developed vector (pGIS4). The resulting plasmids form a complex full-length cDNA library, which was cloned into E. coli. Plasmid DNA was then isolated from the library and digested with MmeI (a type II enzyme) to generate ditags of 18 bp and 16 bp from the two ends of each full-length cDNA. Single ditags (PETs) were then ligated to form a diPET structure (a concatemer of two unrelated PETs joined by a linker sequence) to facilitate Illumina paired-end sequencing.

Data: Sequencing of clone-based RNA-PETs resulted in paired reads of 18 bp and 16 bp corresponding to the 5' and 3' end of each cDNA, respectively. Redundant reads were filtered out and unique reads were retained for analysis. PET sequences were then mapped to the reference genome (GRCh37/hg19, excluding mitochondrion, haplotypes, randoms and chromosome Y) using the following specific criteria (Ruan et al., 2007):

  • A minimal continuous 16 bp match must exist for the 5' signature; the 3' signature must have a minimal continuous 14 bp match
  • Both 5' and 3' signatures must be present on the same chromosome
  • Their 5' to 3' orientation must be correct (5' signature followed by 3' signature)
  • The maximal genomic span of a PET genomic alignment must be less than one million bp

PETs mapping to 2-10 locations are also included and may represent duplicated genes or pseudogenes in the genome. A majority of the PETs mapped to known transcripts or splice variants. A small portion of misaligned PETs, defined as discordant PETs, mapped either too far from each other, in the wrong orientation, or to different chromosomes. The presence of discordant PETs indicates that some transcript variants exist. These variants could be caused by genomic structural variants (such as fusions, deletions, insertions, inversions, tandem repeats, or translocations) or by RNA trans-splicing.
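
The mapping criteria and the definition of discordant PETs given above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GIS pipeline: the `TagHit` record, the `classify_pet` function, the "rejected" label, and the 0-based half-open coordinates are all hypothetical choices made for this example. Only the thresholds (16 bp, 14 bp, one million bp) come from the criteria listed above.

```python
from dataclasses import dataclass

MIN_5P_MATCH = 16      # minimal continuous match for the 5' signature
MIN_3P_MATCH = 14      # minimal continuous match for the 3' signature
MAX_SPAN = 1_000_000   # genomic span must be less than this


@dataclass
class TagHit:
    """Hypothetical record for one mapped tag (0-based, half-open)."""
    chrom: str
    start: int
    end: int
    strand: str      # "+" or "-"
    match_len: int   # longest continuous match, in bp


def classify_pet(five, three):
    """Label a PET as 'concordant', 'discordant', or 'rejected'."""
    # Tags that fail the minimal match length are not considered mapped.
    if five.match_len < MIN_5P_MATCH or three.match_len < MIN_3P_MATCH:
        return "rejected"
    # Both signatures must be on the same chromosome.
    if five.chrom != three.chrom:
        return "discordant"
    # Assumption: both tags must share a strand, and the 5' signature
    # must precede the 3' signature in the direction of transcription.
    if five.strand != three.strand:
        return "discordant"
    if five.strand == "+":
        ordered = five.start <= three.start
    else:
        ordered = five.start >= three.start
    if not ordered:
        return "discordant"
    # The genomic span must be less than one million bp.
    span = max(five.end, three.end) - min(five.start, three.start)
    if span >= MAX_SPAN:
        return "discordant"
    return "concordant"


five = TagHit("chr1", 1000, 1018, "+", 18)
near = TagHit("chr1", 5000, 5016, "+", 16)
far = TagHit("chr1", 2_000_000, 2_000_016, "+", 16)
other = TagHit("chr2", 5000, 5016, "+", 16)

print(classify_pet(five, near))   # concordant
print(classify_pet(five, far))    # discordant
print(classify_pet(five, other))  # discordant
```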


PETs were clustered using the following procedure. The mapping locations of the 5' and 3' tags of a given PET were each extended by 100 bp in both directions, creating 5' and 3' search windows. If the 5' and 3' tags of a second PET mapped within the 5' and 3' search windows of the first PET, the two PETs were clustered and the search windows were adjusted to contain the tag extensions of the second PET. PETs whose 5' and 3' tags subsequently mapped within the adjusted 5' and 3' search windows, respectively, were also assigned to this cluster, and the search windows were readjusted. This iterative process continued until no new PETs fell within the search windows. The whole procedure was repeated until all PETs were assigned to a cluster.
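
The clustering procedure described above can be sketched in Python as follows. This is a simplified illustration rather than the code used to build the Clusters files. It assumes that all input PETs lie on the same chromosome and strand, that each PET is given as a pair of (start, end) intervals for its 5' and 3' tags, and that "within the search window" means the tag lies entirely inside the window. The function names are hypothetical.

```python
EXTENSION = 100  # bp added on each side of a tag to form its search window


def window(start, end):
    """Extend a tag interval by EXTENSION bp in both directions."""
    return (start - EXTENSION, end + EXTENSION)


def within(tag, win):
    """True if the tag interval lies entirely inside the window."""
    return win[0] <= tag[0] and tag[1] <= win[1]


def merge(win, other):
    """Smallest window that contains both input windows."""
    return (min(win[0], other[0]), max(win[1], other[1]))


def cluster_pets(pets):
    """Group PETs, each given as ((5' start, 5' end), (3' start, 3' end))."""
    unassigned = list(pets)
    clusters = []
    while unassigned:                      # repeat until every PET is assigned
        seed = unassigned.pop(0)
        members = [seed]
        win5, win3 = window(*seed[0]), window(*seed[1])
        grew = True
        while grew:                        # iterate until no new PET joins
            grew = False
            remaining = []
            for five, three in unassigned:
                if within(five, win5) and within(three, win3):
                    members.append((five, three))
                    # Adjust the windows to contain the new tag extensions.
                    win5 = merge(win5, window(*five))
                    win3 = merge(win3, window(*three))
                    grew = True
                else:
                    remaining.append((five, three))
            unassigned = remaining
        clusters.append(members)
    return clusters


# Made-up coordinates: the third PET joins only after the windows grow.
pets = [
    ((1000, 1027), (5000, 5025)),
    ((1050, 1077), (5060, 5085)),
    ((1150, 1177), (5150, 5175)),
    ((9000, 9027), (12000, 12025)),
]
clusters = cluster_pets(pets)
print([len(c) for c in clusters])  # [3, 1]
```

In this example the third PET falls outside the initial windows of the first PET and is captured only after the second PET has widened them, which is the behavior the iterative readjustment is meant to produce.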

The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. PETs that mapped to multiple locations may represent low complexity or repetitive sequences.


Verification

To assess overall PET quality and mapping specificity, the ten most abundant PET clusters that mapped to well-characterized known genes were examined. Over 99% of the PETs represented full-length transcripts, and the majority fell within 10 bp of the known 5' and 3' boundaries of these transcripts. The PET mapping was further verified by confirming the existence of the physical cDNA clones represented by the ditags. PCR primers designed from the PET sequences were used to amplify the corresponding cDNA inserts, either from the full-length cDNA library (cloning-based PET) or from isolated total RNA (cloning-free PET), for sequencing confirmation.

Release Notes

This is Release 2 (Aug 2012) of this track. It adds data for tier 2 cell lines (A549, SK-N-SH, IMR90, and MCF-7). This newer data has no scores in the Clusters files.

Note: As mentioned above, this track mixes two different methodologies. The clone-based data has functioning score fields in the Clusters files, which could be used for filtering or shading. However, the clone-free data either has scores that are not scaled well or scores that are set to zero for all items. Therefore, the scores are useful for some tables and not for others.


Credits

The GIS RNA-PET libraries and sequence data for transcriptome analysis were generated and analyzed by scientists Xiaoan Ruan, Atif Shahab, Chialin Wei, and Yijun Ruan at the Genome Institute of Singapore.

Contact: Yijun Ruan


References

Ng P, Tan JJ, Ooi HS, Lee YL, Chiu KP, Fullwood MJ, Srinivasan KG, Perbost C, Du L, Sung WK, et al. Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes. Nucleic Acids Res. 2006;34:e84.

Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH, et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods. 2005;2:105-111.

Ng P, Wei CL, Ruan Y. Paired-end diTagging for transcriptome and genome analysis. Curr Protoc Mol Biol. 2007 Jul 21.12.

Ruan Y, Ooi HS, Choo SW, Chiu KP, Zhao XD, Srinivasan KG, Yao F, Choo CY, Liu J, Ariyaratne P, et al. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res. 2007;17:828-838.

Ruan X, Ruan Y. Genome wide full-length transcript analysis using 5' and 3' paired-end-tag next generation sequencing (RNA-PET). Methods Mol Biol. 2012;809:535-562.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.