The data for the working draft is organized hierarchically by
chromosome and by the sequenced-clone contigs within each chromosome.
At the top level there are 25 folders; 22 of these are for the
numbered chromosomes (autosomes), folders X and Y are for the sex
chromosomes, and Un is for clone contigs that cannot be placed
confidently on a chromosome. Each of the 25 chromosomal folders
contains a separate clone contig folder for each of the clone contigs
for that chromosome.
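As a rough illustration of this layout, the sketch below walks a local copy of the hierarchy and lists the files in each clone contig folder. It assumes the full data set has been unpacked under a directory of your choosing and that the numbered chromosome folders are named 1 through 22; the function name list_contig_folders is hypothetical.

```python
from pathlib import Path

# Assumed folder names: "1".."22", plus "X", "Y" and "Un" (25 in total).
CHROMOSOME_FOLDERS = [str(n) for n in range(1, 23)] + ["X", "Y", "Un"]


def list_contig_folders(root):
    """Return (chromosome, contig, fa_files, agp_files) for each contig folder."""
    root = Path(root)
    found = []
    for chrom in CHROMOSOME_FOLDERS:
        chrom_dir = root / chrom
        if not chrom_dir.is_dir():
            continue  # tolerate a partial download
        for ctg_dir in sorted(p for p in chrom_dir.iterdir() if p.is_dir()):
            fa_files = sorted(ctg_dir.glob("*.fa"))
            agp_files = sorted(ctg_dir.glob("*.agp"))
            found.append((chrom, ctg_dir.name, fa_files, agp_files))
    return found
```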
Each clone contig folder contains two primary files, with the suffixes
.fa and .agp. The .fa file gives the working draft sequence for the
clone contig in FASTA format, e.g.
>NT_077768
GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGA
ATGTGTAATAATTTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAA
ATTTATTTTTATTTTTTCAGGTTGAGACTGAGCTAAAGTTAATCTGTGGC
...
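A minimal sketch of reading such a file, assuming it holds a single record whose header line starts with ">" and whose sequence is wrapped over the following lines. The function name read_fasta is hypothetical.

```python
def read_fasta(lines):
    """Return (name, sequence) for the first FASTA record in an iterable of lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                break  # stop at a second header; only the first record is read
            name = line[1:].split()[0]
        else:
            chunks.append(line)
    return name, "".join(chunks)
```

An open file object can be passed directly, since iterating over it yields lines.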
The .agp file is an index that describes how the sequence in the .fa
file is assembled. It looks like
17/NT_077768 1 6538 1 D AC021317.18 122280 128817 -
17/NT_077768 6539 56206 2 D AC021317.18 128918 178585 -
17/NT_077768 56207 56306 3 N 100 fragment yes
17/NT_077768 56307 117971 4 D AC021317.18 47188 108852 -
17/NT_077768 117972 170563 5 F AC115992.13 23659 76250 +
17/NT_077768 170564 274979 6 D AC124789.11 1 104416 -
...
Lines beginning with "#" are comments. Every other line represents
either a sequence record or a gap.
If the line represents a sequence record, it consists of the following
whitespace-separated fields, in order:
<chromosome/ctg>
<start-in-ctg>
<end-in-ctg>
<number>
<type>
<accession>.<version>
<start>
<end>
<orientation>
and if it represents a gap, it consists of the fields
<chromosome/ctg>
<start-in-ctg>
<end-in-ctg>
<number>
N
<number-of-Ns>
<kind>
<bridged?>
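The two record forms can be told apart by the fifth field. The following is a minimal parsing sketch, assuming whitespace-separated fields as in the example lines; the class and function names (SeqRecord, GapRecord, parse_agp_line) are hypothetical.

```python
from typing import NamedTuple


class SeqRecord(NamedTuple):
    ctg: str            # <chromosome/ctg>
    ctg_start: int      # <start-in-ctg>
    ctg_end: int        # <end-in-ctg>
    number: int         # <number>
    seq_type: str       # <type>
    accession: str      # <accession>.<version>
    start: int          # <start>
    end: int            # <end>
    orientation: str    # <orientation>


class GapRecord(NamedTuple):
    ctg: str
    ctg_start: int
    ctg_end: int
    number: int
    n_count: int        # <number-of-Ns>
    kind: str           # <kind>
    bridged: bool       # <bridged?>


def parse_agp_line(line):
    """Parse one .agp line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split()
    if f[4] == "N":
        return GapRecord(f[0], int(f[1]), int(f[2]), int(f[3]),
                         int(f[5]), f[6], f[7] == "yes")
    return SeqRecord(f[0], int(f[1]), int(f[2]), int(f[3]),
                     f[4], f[5], int(f[6]), int(f[7]), f[8])
```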
The positions <start-in-ctg> and <end-in-ctg> give the range that the
record occupies in the .fa file. For a sequence record, <start> and
<end> give the range the sequence was taken from in the GenBank record
<accession>.<version>. The field <orientation> is "-" if the sequence
must be reverse complemented before it is inserted into its place in
the .fa file, and "+" if it is inserted as is. For example, the records
above mean that to build the .fa file for clone contig NT_077768 from
chromosome 17 you take
AC021317 version 18, residues 122280 to 128817, reverse complemented, followed by
AC021317 version 18, residues 128918 to 178585, reverse complemented, followed by
a gap of 100 Ns, followed by
AC021317 version 18, residues 47188 to 108852, reverse complemented, followed by
AC115992 version 13, residues 23659 to 76250, followed by
AC124789 version 11, residues 1 to 104416, reverse complemented, followed by
...
The pieces abut exactly, with no overlap. In a sequence record, <type> can be
F - Finished,
A - in Active finishing,
D - Draft,
P - PreDraft,
O - Other sequence
and in a gap record it is always N.
The <number> field just sequentially numbers the records.
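The assembly procedure described above can be sketched as follows. This is an illustration only: it assumes positions are 1-based and inclusive (as the example records suggest), and that the GenBank sequences are already available in a dictionary keyed by <accession>.<version>. The names build_contig and sources are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(agp_lines, sources):
    """Assemble a contig sequence from .agp lines.

    sources maps "<accession>.<version>" to the full sequence of that record.
    """
    pieces = []
    expected_start = 1
    for line in agp_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        ctg_start, ctg_end = int(f[1]), int(f[2])
        if ctg_start != expected_start:
            raise ValueError("record %s does not abut the previous one" % f[3])
        if f[4] == "N":
            piece = "N" * int(f[5])
        else:
            start, end = int(f[6]), int(f[7])
            piece = sources[f[5]][start - 1:end]  # 1-based inclusive range
            if f[8] == "-":
                piece = reverse_complement(piece)
        if len(piece) != ctg_end - ctg_start + 1:
            raise ValueError("record %s has an inconsistent length" % f[3])
        pieces.append(piece)
        expected_start = ctg_end + 1
    return "".join(pieces)
```

The two checks reflect the properties stated in the text: each record starts where the previous one ended, and its length in the contig matches the length of the source range or gap.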
In a gap record, <number-of-Ns> is the size of the gap and
<kind> is one of
- fragment - a gap between two sequence contigs (also called a
"sequence gap")
- split_finished - a specially sized gap between two finished sequence
contigs
- clone - a gap between two clones that do not overlap
- contig - a gap between clone contigs in the genome layout (also called
a "layout gap")
- centromere - a gap inserted for the centromere
- short_arm - a gap inserted at the start of an acrocentric chromosome
- heterochromatin - a gap inserted for an especially large region of
heterochromatin (may include the centromere as well.)
- telomere - a gap inserted for a telomere
<bridged?> is "yes" if a cDNA, a BAC end pair, or a plasmid end pair
spans the gap; otherwise it is "no".
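As an illustration of how the gap fields might be used, the sketch below totals the number of Ns per gap kind and bridged status across a set of .agp lines. It assumes whitespace-separated fields and rejects any kind not listed above; the names GAP_KINDS and summarize_gaps are hypothetical.

```python
from collections import Counter

GAP_KINDS = {
    "fragment", "split_finished", "clone", "contig",
    "centromere", "short_arm", "heterochromatin", "telomere",
}


def summarize_gaps(agp_lines):
    """Return a Counter mapping (kind, bridged) to the total number of Ns."""
    totals = Counter()
    for line in agp_lines:
        f = line.split()
        # Skip blank lines, comments, and sequence records.
        if len(f) < 8 or f[0].startswith("#") or f[4] != "N":
            continue
        kind, bridged = f[6], f[7]
        if kind not in GAP_KINDS:
            raise ValueError("unknown gap kind: %s" % kind)
        totals[(kind, bridged)] += int(f[5])
    return totals
```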
We provide three ways you can download these .fa and .agp files:
- full data set: the entire hierarchy in a zipped format.
- by chromosome: one zipped file for each chromosome containing all
the sequence ordered along that chromosome.
- by individual clone contig: separate files, not zipped, for each
clone contig.