Blat Suite Program Specifications and User's Guide
Blat produces two major classes of alignments:
-
at the DNA level between two sequences that are of 95% or greater identity, but which
may include large inserts.
-
at the protein or translated DNA level between sequences that are of 80% or greater
identity and may also include large inserts.
The output of blat is flexible. By default, it is a simple tab-delimited file that describes
the alignment but does not include the sequence of the alignment itself.
Alternatively, blat can produce output compatible with BLAST or WU-BLAST, as well as several
other formats.
The main programs in the blat suite are:
-
gfServer – a server that maintains an index of the genome in memory and uses the index to
quickly find regions with high levels of sequence similarity to a query sequence.
-
gfClient – a program that queries gfServer over the network and does a detailed alignment
of the query sequence with regions found by gfServer.
-
blat – client and server combined into a single program, first building the index, then
using the index, and then exiting.
-
webBlat – a web-based version of gfClient that presents the alignments in an interactive
fashion.
Building an index of the genome typically takes 10 to 15 minutes. For interactive
applications, it is typical to use gfServer to build a whole-genome index. gfClient or
webBlat can then align a single query within a few seconds. When aligning a large
number of sequences in a batch mode, it may be more efficient to use blat, particularly
if it is run on a cluster of computers. Each blat run is typically done against a single
chromosome, but with a large number of query sequences.
Other programs in the blat suite are:
-
pslSort – combines and sorts the output of multiple blat runs. The blat default output
format is psl.
-
pslReps – selects the best alignments for a particular query sequence, using a "near best
in genome" approach.
-
pslPretty – converts alignments from psl format, which is a tab-delimited format that does
not include the bases themselves, to a more readable alignment format.
-
faToTwoBit – converts fasta format sequence files to the dense, randomly accessible .2bit
format that gfClient can use.
-
twoBitToFa – converts from .2bit format back to fasta format.
-
faToNib – converts from fasta to the somewhat less dense, randomly accessible .nib format
that predated .2bit format. Note that each .nib file can contain only a single sequence.
-
nibFrag – converts portions of a .nib file back to fasta format.
Usage details for each of the above programs are described in the
sections below. For all programs except webBlat, the usage options may also be viewed by
typing the name of the program (without arguments) on the UNIX command line.
The gfServer program requires approximately 1 byte of memory for every 3 bases in the genome
it is indexing in DNA mode, and 1.5 bytes for each unmasked base in translated mode. The
blat program requires approximately 2 bytes of memory for each base in the genome in DNA
mode, and 3 bytes of memory for each base in translated mode. The other programs use
relatively little memory.
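As a rough worked example of the figures above, the following shell sketch turns a genome size into an approximate memory requirement. The function name blat_mem_estimate and its arguments are illustrative only and are not part of the blat suite. For gfServer in translated mode, the caller is assumed to pass the number of unmasked bases.

```shell
# Rough memory estimates (in bytes) from the approximate ratios quoted above.
# blat_mem_estimate is a hypothetical helper, not a blat suite program.
# usage: blat_mem_estimate PROGRAM MODE BASES
#   PROGRAM: gfServer | blat      MODE: dna | trans
#   BASES:   genome size in bases (unmasked bases for gfServer trans)
blat_mem_estimate() {
    prog=$1 mode=$2 bases=$3
    case "$prog:$mode" in
        gfServer:dna)   echo $(( bases / 3 )) ;;       # 1 byte per 3 bases
        gfServer:trans) echo $(( bases * 3 / 2 )) ;;   # 1.5 bytes per unmasked base
        blat:dna)       echo $(( bases * 2 )) ;;       # 2 bytes per base
        blat:trans)     echo $(( bases * 3 )) ;;       # 3 bytes per base
        *) echo "unknown program/mode: $prog $mode" >&2; return 1 ;;
    esac
}

# Example: a 3,000,000,000-base genome indexed by gfServer in DNA mode.
blat_mem_estimate gfServer dna 3000000000
```

These are order-of-magnitude estimates only, since the ratios in the text are themselves approximate.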
You may also be interested in the following programs that are not part of the blat
suite:
-
in-silico PCR – quickly finds the
sequence between two primers. This includes webPCR, an interface similar to webBlat.
-
Genome Browser – displays annotations as a
series of tracks on top of the genome.
All of the above software and source code are freely available from the University of
California Santa Cruz for academic and non-commercial use. A license is required for
commercial use. For blat licensing, contact Kent Informatics. All other licensing is handled by the
UCSC Genome Browser project.
blat – Standalone blat sequence search command line tool.
Usage:
blat database query [-ooc=11.ooc] output.psl
Where:
database and query are each either a .fa, .nib, or .2bit file, or a list of these files, one file
name per line.
-ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will
increase the speed by a factor of 40 in many cases, but is not required.
output.psl is the output file.
Subranges of .nib and .2bit files may be specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the last form, which omits seqid, a sequence id of file:start-end is used.
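To make the subrange syntax concrete, here is a small shell sketch that splits such a specification into its parts. The helper parse_subrange is hypothetical and assumes that the path itself contains no ":" character. It only illustrates the syntax and does not reproduce how blat parses file names internally.

```shell
# Split a subrange spec of the form path:seqid:start-end or path:start-end
# into tab-separated fields: path, seqid, start, end.
# parse_subrange is a hypothetical helper for illustration only.
parse_subrange() {
    spec=$1
    range=${spec##*:}     # text after the last ':'  -> start-end
    rest=${spec%:*}       # text before the last ':' -> path or path:seqid
    start=${range%-*}
    end=${range#*-}
    case "$rest" in
        *:*) path=${rest%%:*}; seqid=${rest#*:} ;;
        *)   path=$rest; seqid= ;;   # no seqid given; blat assigns file:start-end
    esac
    printf '%s\t%s\t%s\t%s\n' "$path" "$seqid" "$start" "$end"
}

parse_subrange /path/file.2bit:chr1:1000-2000
parse_subrange /path/file.nib:1000-2000
```

The second call leaves the seqid field empty, reflecting the form in which blat derives the sequence id itself.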
Options:
-t=type Database type, one of:
dna - (default) DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
-q=type Query type, one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
-prot Synonymous with -t=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to the tileSize.
-tileSize=N Sets the size of match that triggers an alignment. Usually between 8 and
12. Default is 11 for DNA and 5 for protein.
-stepSize=N Spacing between tiles. Default is tileSize.
-oneOff=N If set to 1, this allows one mismatch in tile and still triggers an
alignment. Default is 0.
-minMatch=N Sets the number of tile matches. Usually set from 2 to 4. Default is 2
for nucleotide, 1 for protein.
-minScore=N Sets minimum score. This is the matches minus the mismatches minus a
gap penalty. Default is 30.
-minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide
searches, 25 for protein or translated protein searches.
-maxGap=N Sets the size of maximum gap between tiles in a clump. Usually set from 0
to 3. Default is 2. Relevant only for minMatch > 1.
-noHead Suppresses .psl header (so it's just a tab-separated file).
-makeOoc=N.ooc Makes overused tile file. Target must be complete genome.
-repMatch=N Sets the number of repetitions of a tile allowed before it is marked as
overused. Typically this is 256 for tileSize 12, 1024 for tileSize 11,
4096 for tileSize 10. Default is 1024. Typically needed only with
-makeOoc. Also affected by stepSize: when stepSize is halved, repMatch is
doubled to compensate.
-mask=type Masks out repeats. Alignments won't be started in masked region but may
extend through it in nucleotide searches. Masked areas are ignored
entirely in protein or translated searches. Types are:
lower - masks out lower-cased sequence
upper - masks out upper-cased sequence
out - masks according to database.out RepeatMasker .out file
file.out - masks database according to RepeatMasker file.out
-qMask=type Masks out repeats in query sequence. Similar to -mask above but for query
rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not be masked in any
way, but matches in repeat areas will be reported separately from matches
in other areas in the psl output.
-minRepDivergence=NN Minimum percent divergence of repeats to allow them to be unmasked.
Default is 15. Relevant only for masking using RepeatMasker .out files.
-dots=N Outputs a dot every N sequences to show program's progress.
-trimT Trims leading poly-T.
-noTrimA Don't trim trailing poly-A.
-trimHardA Removes poly-A tail from qSize as well as alignments in psl output.
-fastMap Run for fast DNA/DNA remapping, not allowing introns and requiring high
%ID.
-out=type Controls output file format, one of:
psl - (Default) tab-separated format, no sequence
pslx - tab-separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high-quality mRNAs, looks harder for small initial and terminal exons.
Not recommended for ESTs.
-maxIntron=N Sets maximum intron size. Default is 750000.
-extendThroughN Allows extension of alignment through large blocks of Ns.
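The relationship between tileSize, stepSize, and repMatch described under -repMatch can be sketched as a small lookup. The helper typical_rep_match is hypothetical. It covers only the three tile sizes for which the text gives values, and it assumes that the "halved stepSize, doubled repMatch" rule extends to proportional scaling by tileSize/stepSize, with integer division truncating any remainder.

```shell
# Typical -repMatch value for a DNA tileSize/stepSize pair, following the
# figures quoted for -repMatch above.
# typical_rep_match is a hypothetical helper, not a blat suite program.
typical_rep_match() {
    tile=$1
    step=${2:-$tile}      # stepSize defaults to tileSize
    case "$tile" in
        12) base=256 ;;
        11) base=1024 ;;
        10) base=4096 ;;
        *)  echo "no typical repMatch listed for tileSize $tile" >&2; return 1 ;;
    esac
    # Assumed proportional scaling: halving stepSize doubles repMatch.
    echo $(( base * tile / step ))
}

typical_rep_match 11      # stepSize equals tileSize
typical_rep_match 12 6    # stepSize halved, so repMatch doubles
```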
Blat settings for common usage scenarios:
-
Mapping ESTs to the genome within the same species:
-ooc=11.ooc
-
Mapping full length mRNAs to the genome in the same species:
-ooc=11.ooc -fine -q=rna
-
Mapping ESTs to the genome across species:
-q=dnax -t=dnax
-
Mapping mRNA to the genome across species:
-q=rnax -t=dnax
-
Mapping proteins to the genome:
-q=prot -t=dnax
-
Mapping DNA to DNA in the same species:
-ooc=11.ooc -fastMap
-
Mapping DNA from one species to another species (the query side of the alignment
should be split into chunks of 25 KB or less for best performance):
-q=dnax -t=dnax
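The scenarios above can be collected into a small lookup so that scripts use the same flags consistently. This is only a sketch: the function name blat_flags_for, the scenario keywords, and the file names database.2bit, query.fa, and output.psl are placeholders chosen for illustration. The command is printed rather than executed.

```shell
# Map a usage scenario to the blat flags listed above.
# blat_flags_for and its scenario keywords are hypothetical.
blat_flags_for() {
    case "$1" in
        est-same-species)   echo "-ooc=11.ooc" ;;
        mrna-same-species)  echo "-ooc=11.ooc -fine -q=rna" ;;
        est-cross-species)  echo "-q=dnax -t=dnax" ;;
        mrna-cross-species) echo "-q=rnax -t=dnax" ;;
        protein)            echo "-q=prot -t=dnax" ;;
        dna-same-species)   echo "-ooc=11.ooc -fastMap" ;;
        dna-cross-species)  echo "-q=dnax -t=dnax" ;;
        *) echo "unknown scenario: $1" >&2; return 1 ;;
    esac
}

# Print the command line for mapping full-length mRNAs within a species.
echo blat database.2bit query.fa $(blat_flags_for mrna-same-species) output.psl
```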
gfServer – Make a server to quickly find where DNA occurs in genome.
Usage:
To set up a server:
gfServer start host port file(s)
where files are in .nib or .2bit format
To remove a server:
gfServer stop host port
To query a server with DNA sequence:
gfServer query host port probe.fa
To query a server with protein sequence:
gfServer protQuery host port probe.fa
To query a server with translated DNA sequence:
gfServer transQuery host port probe.fa
To query a server with PCR primers:
gfServer pcr host port fPrimer rPrimer maxDistance
To process one probe .fa file against a .nib format genome (without starting a server):
gfServer direct probe.fa file(s).nib
To test pcr without starting server:
gfServer pcrDirect fPrimer rPrimer file(s).nib
To figure out usage level:
gfServer status host port
To get input file list:
gfServer files host port
Options:
-tileSize=N Size of n-mers to index. Default is 11 for nucleotides, 4 for proteins
(or translated nucleotides).
-stepSize=N Spacing between tiles. Default is tileSize.
-minMatch=N Number of n-mer matches that triggers detailed alignment. Default is 2 for
nucleotides, 3 for proteins.
-maxGap=N Number of insertions or deletions allowed between n-mers. Default is 2 for
nucleotides, 0 for proteins.
-trans Translate database to protein in 6 frames. Note: when using this option,
it is best to run on RepeatMasked data.
-log=logFile Keep a log file that records server requests.
-seqLog Include sequences in log file (not logged with -syslog).
-syslog Log to syslog.
-logFacility=facility Log to the specified syslog facility. Default is local0.
-mask Use masking from .nib file.
-repMatch=N Number of occurrences of a tile (n-mer) that triggers repeat-masking of
the tile. Default is 1024.
-maxDnaHits=N Maximum number of hits sent from the server for a DNA query. Default is
100.
-maxTransHits=N Maximum number of hits sent from the server for a translated query.
Default is 200.
-maxNtSize=N Maximum size of untranslated DNA query sequence. Default is 40000.
-maxAaSize=N Maximum size of protein or translated DNA queries. Default is 8000.
-canStop If set, a quit message will actually take down the server.
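Because the server limits query sizes through -maxNtSize and -maxAaSize, it can be useful to check a query file against the limit before submitting it. The sketch below lists FASTA records longer than a given size, defaulting to 40000, the -maxNtSize default. The helper oversize_records is hypothetical, and it simply counts characters on sequence lines.

```shell
# List FASTA records whose sequence length exceeds a limit.
# usage: oversize_records FILE [MAX]   (MAX defaults to 40000)
# oversize_records is a hypothetical helper, not part of gfServer.
oversize_records() {
    awk -v max="${2:-40000}" '
        /^>/ { if (name != "" && len > max) print name, len
               name = substr($1, 2); len = 0; next }
             { len += length($0) }
        END  { if (name != "" && len > max) print name, len }
    ' "$1"
}
```

Running oversize_records on a query file prints one line per oversized record, giving the record name and its length, and prints nothing when every record fits.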
gfClient – A client for the genomic finding program that produces a .psl file.
Usage:
gfClient host port seqDir in.fa out.psl
Where:
host is the name of the machine running the gfServer.
port is the same port used for the gfServer.
seqDir is the path of the .nib or .2bit files relative to the current dir. Note: these are needed
by the client as well as the server.
in.fa is a fasta format file. May contain multiple records.
out.psl is the output file.
Options:
-t=type Database type, one of:
dna - (Default) DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
-q=type Query type, one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
-prot Synonymous with -t=prot -q=prot
-dots=N Outputs a dot every N query sequences.
-nohead Suppresses psl header.
-minScore=N Sets minimum score. This is twice the matches minus the mismatches and minus a
gap penalty. Default is 30.
-minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide
searches, 25 for protein or translated protein searches.
-out=type Controls output file format, one of:
psl - (Default) tab-separated format without actual sequence
pslx - tab-separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-maxIntron=N Sets maximum intron size. Default is 750000.
faToTwoBit – Convert DNA from fasta to .2bit format.
Usage:
faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
Options:
-noMask Ignore lower-case masking in fa file.
-stripVersion Strip off version number after "." for GenBank accessions.
-ignoreDups Convert only first sequence if there are duplicate sequence names. Use
"twoBitDup" to find duplicate sequences.
twoBitToFa – Convert all or part of .2bit file to fasta format.
Usage:
twoBitToFa input.2bit output.fa
Options:
-seq=name Restrict this to just one sequence.
-start=X Start at given position in sequence (zero-based).
-end=X End at given position in sequence (non-inclusive).
-seqList=file File containing list of the desired sequence names in the format
seqSpec[:start-end], e.g. chr1 or chr1:0-189, where coordinates are
half-open zero-based, i.e. [start,end).
-noMask Convert sequence to all upper-case.
-bpt=index.bpt Use bpt index instead of the built-in one.
-bed=input.bed Grab sequences specified by input.bed. Will exclude introns.
-bedPos With -bed, to use chrom:start-end as the fasta ID in output.fa.
-udcDir=/dir/to/cache Location of cache for remote bigBeds and bigWigs.
Sequence and range may also be specified as part of the input file name using the syntax:
/path/input.2bit:name
or
/path/input.2bit:name:start-end
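The half-open, zero-based coordinates used by -start, -end, and -seqList are a common source of off-by-one errors. The following sketch shows the arithmetic for the chr1:0-189 example. The helper describe_range is hypothetical, and the one-based inclusive form is shown only for comparison with tools that use that convention.

```shell
# Describe a half-open, zero-based range [start,end):
# its length and the equivalent one-based, inclusive range.
# describe_range is a hypothetical helper, not part of twoBitToFa.
describe_range() {
    start=$1 end=$2
    echo "length=$(( end - start )) one_based=$(( start + 1 ))-$end"
}

describe_range 0 189
```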
faToNib – Convert from .fa to .nib format.
Usage:
faToNib [options] in.fa out.nib
Options:
-softMask Create nib that soft-masks lower-case sequence.
-hardMask Create nib that hard-masks lower-case sequence.
nibFrag – Extract part of a .nib file as .fa, with all bases and gaps
lower-case by default.
Usage:
nibFrag [options] file.nib start end strand out.fa
Where:
strand is + (plus) or m (minus)
Options:
-masked Use lower-case characters for bases meant to be masked out.
-hardMasked Use upper-case characters for bases that are not masked out, and Ns for
masked-out bases.
-upper Use upper-case characters for all bases.
-name=name Use given name after '>' in output sequence.
-dbHeader=db Add full database info to the header, with or without -name option.
-tbaHeader=db Format header for compatibility with tba, takes database name as argument.
pslPretty – Convert PSL to human-readable output
Usage:
pslPretty in.psl target.lst query.lst pretty.out
Options:
-axt Save in axt format.
-dot=N Output a dot every N records.
-long Don't abbreviate long inserts.
-check=fileName Output alignment checks to filename.
To improve performance, a .psl file containing multiple targets should be sorted by target.
The target and query lists can be .fasta, .2bit, or .nib files, or a list of these
files, one per line.
pslSort – Merge and sort psCluster .psl output files.
Usage:
To sort all of the .psl files in the "inDirs" directories in two stages: first into
temporary files in tempDir, and then into outFile:
pslSort dirs[1|2] outFile tempDir inDir(s)
The device holding tempDir must have sufficient free space, typically 15-20 gigabytes when
processing a whole genome.
To sort a genome-to-genome alignment, reflecting the alignments across the diagonal:
pslSort g2g[1|2] outFile tempDir inDir(s)
Appending [1|2] to the "dirs" or "g2g" option limits the program to only
the first or second pass of the sort.
Options:
-nohead Suppress the psl header.
-verbose=N Set verbosity level, higher for more output. Default is 1.
Note: pslSort will run out of memory on extremely large files. It may be preferable to use
the following UNIX command in these situations:
sort -k 10 *.psl > sorted.psl
In this case, keep header lines out of the .psl input files by running blat with the
-noHead option, so that the UNIX sort command sees only alignment rows.
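When .psl files were produced without -noHead, the UNIX sort approach described for pslSort can still be used by filtering out header lines first. The sketch below assumes that every alignment row starts with a digit (the match count in the first column) and that header lines do not. The helper sort_psl_by_query is hypothetical. Unlike a plain sort -k 10, it uses tab as the field separator and restricts the sort key to column 10, the query name.

```shell
# Drop psl header lines and sort alignment rows by column 10 (query name).
# usage: sort_psl_by_query FILE...
# sort_psl_by_query is a hypothetical helper, not a blat suite program.
sort_psl_by_query() {
    tab=$(printf '\t')
    # Keep only rows that begin with a digit, then sort on field 10 alone.
    grep -h '^[0-9]' "$@" | sort -t "$tab" -k 10,10
}
```

A typical call would be sort_psl_by_query *.psl > sorted.psl. This does not replace pslSort in every respect, since it orders rows by query name only.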
pslReps – Analyze repeats and generate genome-wide best alignments from a
sorted set of local alignments.
Usage:
pslReps in.psl out.psl out.psr
Where:
in.psl is an alignment file generated by psLayout and sorted by pslSort
out.psl is the best alignment output
out.psr contains repeat info
Options:
-nohead Suppress psl header.
-ignoreSize Reduce weighting in favor of larger alignments.
-noIntrons Don't penalize for not having introns when calculating size factor.
-singleHit Takes single best hit, not splitting into parts.
-minCover=0.N Minimum coverage to output. Default is 0.
-ignoreNs Ignore Ns when calculating minCover.
-minAli=0.N Minimum alignment ratio. Default is 0.93.
-nearTop=0.N Amount of deviation from top allowed to be taken. Default is 0.01.
-minNearTopSize=N Minimum size of alignment that is near top for alignment to be kept. Default
is 30.
-coverQSizes=file Tab-separated file with effective query sizes. When used with -minCover, this
allows polyAs to be excluded from the coverage calculation.
Installing webBlat
Installing a web-based blat server involves four major steps:
-
Create the sequence databases.
-
Run the gfServer program to create in-memory indexes of the databases.
-
Edit the webBlat.cfg file to set the machine and port that the gfServer(s) are
running on, and to optionally customize the webBlat appearance for users.
-
Copy the webBlat executable and webBlat.cfg to a directory where the web server can
execute webBlat as a cgi.
Creating sequence databases
Create databases with the program faToTwoBit. Typically, a separate database is created for
each genome being indexed. Each database can contain up to four billion bases of sequence in
an unlimited number of records. The databases for webPcr and webBlat are identical.
The input to faToTwoBit is one or more fasta format files, each of which can contain
multiple records. If the sequence contains repeat sequences, as is the case with
vertebrates and many plants, the repeat sequences can be represented in lower-case and the
other sequence in upper-case. The gfServer program can then be configured to ignore the
repeat sequences. The output of faToTwoBit is a file that is designed for fast random access
and efficient storage. The output file stores four bases per byte, and uses a small amount
of additional space to store the case of the DNA and to keep track of blocks of Ns in the
input. Other ambiguity codes in the input sequence, such as Y and U, will be converted to N.
Here is an example of how a typical installation might create a mouse and a human genome
database:
cd /data/genomes
mkdir twoBit
faToTwoBit human/hg16/*.fa twoBit/hg16.2bit
faToTwoBit mouse/mm4/*.fa twoBit/mm4.2bit
It is not necessary to put all of the databases in the same directory, but it can simplify
bookkeeping. The databases can use the older blat format, .nib, instead of the .2bit format.
The .nib format packs only two bases per byte, and handles only one record per .nib file.
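The packing densities above give a quick way to estimate how much disk space a database will need. The sketch below is an approximation: the helper seq_db_bytes is hypothetical, and it ignores the small overhead that .2bit uses for case information and blocks of Ns.

```shell
# Approximate on-disk size in bytes of a sequence database.
# usage: seq_db_bytes FORMAT BASES   (FORMAT: 2bit | nib)
# seq_db_bytes is a hypothetical helper, not a blat suite program.
seq_db_bytes() {
    fmt=$1 bases=$2
    case "$fmt" in
        2bit) echo $(( (bases + 3) / 4 )) ;;   # four bases per byte, rounded up
        nib)  echo $(( (bases + 1) / 2 )) ;;   # two bases per byte, rounded up
        *) echo "unknown format: $fmt" >&2; return 1 ;;
    esac
}

seq_db_bytes 2bit 3000000000
seq_db_bytes nib 3000000000
```

For a 3,000,000,000-base genome this suggests roughly 750 MB in .2bit format versus 1.5 GB in .nib format.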
Creating in-memory indexes with gfServer
The gfServer program creates an in-memory index of a nucleotide sequence database that can
be used for translated or untranslated searches. Translated indexes enable protein-based
blat queries and use approximately two bytes per unmasked base in the database. Untranslated
indexes are used for nucleotide-based blat queries and the in-silico PCR utility. A normal
blat index uses approximately one-quarter of a byte per base. For blat on smaller
(primer-sized) queries or for in-silico PCR, it is recommended that you use a more thorough
index that requires one-half of a byte per base. The gfServer program is memory intensive
but typically does not require a lot of CPU power. With sufficient memory, multiple
gfServers can be run on the same machine.
Here is an example of a typical installation:
ssh bigRamMachine
cd /data/genomes/twoBit
gfServer start bigRamMachine 17779 hg16.2bit &
gfServer -trans -mask start bigRamMachine 17778 hg16.2bit &
The -trans flag creates a translated index. Building an untranslated index takes
approximately 15 minutes; building a translated index takes approximately 45 minutes.
To build an untranslated index to be shared with in-silico PCR:
gfServer -stepSize=5 start bigRamMachine 17779 hg16.2bit &
This index will be slightly more sensitive with blat, particularly for small query
sequences.
Editing the webBlat.cfg file
The webBlat.cfg file tells the webBlat program where to look for gfServers and sequence. The
basic format of the .cfg file is line-oriented. The first word of each line is a command,
and lines that are blank or start with "#" are ignored. The webBlat.cfg and
webPcr.cfg files are similar. The webBlat.cfg commands are:
-
gfServer – defines the host and port on which the (untranslated) gfServer is
running, the associated sequence directory, and the name of the database to display on
the web page.
-
gfServerTrans – defines the location of a translated server.
-
background – (optional) defines the background image (if any) to display on the
web page.
-
company – (optional) defines the company name to display on the web page.
-
tempDir – specifies where to put temporary files. This path is relative to the
directory where the web server executes CGI scripts. It is recommended that an automated
mechanism, such as a cron job, be put in place to periodically remove files that haven't
been recently accessed.
The background and company commands are optional. The webBlat.cfg file must have at least
one valid gfServer or gfServerTrans line, and a tempDir line.
Here is an example of a webBlat.cfg file that you might find for a typical webBlat
installation:
company Awesome Research Amalgamated
background /images/dnaPaper.jpg
gfServerTrans bigRamMachine 17778 /data/genomes/2bit/hg16.2bit Human Genome
gfServer bigRamMachine 17779 /data/genomes/2bit/hg16.2bit Human Genome
gfServerTrans mouseServer 17780 /data/genomes/2bit/mm4.2bit Mouse Genome
gfServer mouseServer 17781 /data/genomes/2bit/mm4.2bit Mouse Genome
tempDir ../trash
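The two requirements stated above (at least one gfServer or gfServerTrans line, plus a tempDir line) can be checked before deploying a configuration. The following sketch does this with awk, skipping blank lines and lines that start with "#". The helper check_webblat_cfg is hypothetical and is not part of webBlat. It checks only for the presence of these commands, not the validity of their arguments.

```shell
# Verify that a webBlat.cfg contains the required commands.
# usage: check_webblat_cfg FILE   (exit status 0 if OK, 1 otherwise)
# check_webblat_cfg is a hypothetical helper, not part of webBlat.
check_webblat_cfg() {
    awk '
        /^[ \t]*$/ || /^#/ { next }    # ignore blank lines and comments
        $1 == "gfServer" || $1 == "gfServerTrans" { servers++ }
        $1 == "tempDir" { temp++ }
        END {
            if (servers < 1) { print "missing gfServer or gfServerTrans line"; bad = 1 }
            if (temp < 1)    { print "missing tempDir line"; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

For example, check_webblat_cfg webBlat.cfg prints nothing and returns success for a file like the one shown above.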
Locating webBlat
The details of this step vary widely among web servers. Unless you are administering your
own computer, you will likely have to ask your local system administrators for help with
this part of the webBlat installation. Here is an example of a typical installation on
Apache:
ssh webServer
cd kent/webBlat
cp webBlat webBlat.cfg /usr/local/apache/cgi-bin
mkdir /usr/local/apache/trash
chmod 777 /usr/local/apache/trash
In the above example, the executable and configuration file reside in the directory
kent/webBlat/. The program will create some files in the trash directory. It is advisable to
periodically remove old files from this directory.
Here is an example of an installation on Mac OS X:
cp webBlat webBlat.cfg /Library/WebServer/CGI-Executables
mkdir /Library/WebServer/trash
chmod 777 /Library/WebServer/trash
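Both installations above leave temporary files in a trash directory, so a periodic cleanup is worth automating. The sketch below removes files that have not been accessed for more than a given number of days. The helper clean_trash is hypothetical, and the age threshold is a site-specific choice. Note that find counts age in whole 24-hour periods.

```shell
# Remove files under a trash directory that have not been accessed recently.
# usage: clean_trash DIR [DAYS]   (DAYS defaults to 1)
# clean_trash is a hypothetical helper, not part of webBlat.
clean_trash() {
    dir=$1
    days=${2:-1}
    # -atime +N matches files last accessed more than N whole days ago.
    find "$dir" -type f -atime +"$days" -exec rm -f {} +
}
```

A script that calls, for example, clean_trash /usr/local/apache/trash 1 could be run daily from a cron job.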