Ancient DNA: Methods and Protocols (43 page)

23 Analysis of High-Throughput Ancient DNA Sequencing Data 221

sequences that are high-identity full-length prefi xes of each other.

This approach is computationally very expensive compared to approaches using alignment coordinates.

Cdhit only accepts FASTA fi les for the clustering step. In addition, this program requires paired-end reads to be in a single line, as they refl ect the read out of a single molecule.

Use a for-loop to iterate over the quality fi ltered fi les, calling the SplitMerged2CDhit script for each line. Exclude incomplete paired-end pairs, and defi ne the
alignments
subfolder as output directory:

 

Loop over the generated FASTA fi les, providing each as input to cd-hit-454 and defi ning an output fi le with the “.clust”

extension:

 

Evaluate the cdhit cluster fi les for the number of clusters using GNU tools. Print out the average coverage, the number of single-tons, the number of unique sequences, and the total number of sequences:

222

M. Kircher

 

1.0

0.8

0.6

0.4

Fraction of library seen (u/M)

0.2

0.0

0.0

0.2

0.4

0.6

0.8

1.0

Fraction of unique molecules (u/s)

Fig. 7. Interaction between the number of unique molecules (
u
) and the fraction of the library seen (
u/M
) when assuming sampling with replacement.
Dots
show equal amounts of sampled data (
s
) being added. The number of unique molecules (
u
) relates to the total number of different sequences (
M
) and the number of sampled sequences as follows: u =
M
·(1 − e −
s/M
). For the example data set, we see on average about 85% unique molecules, indicating sampling of about 28% of the total library (
dashed lines
).

The results of this analysis indicate that the libraries have not been sequenced deeply. On average, 84.5% of the molecules are unique (minimum 29.6%, maximum 96.4%), indicating that on

average 28% of the libraries wer
e sequenced (see Fig. 7 ).

4. Notes

 

1. The FastQ sequence and quality format is not native to the 454/Roche or Ion Torrent instrument, which use a binary format called SFF (standard fl owgram format). This format stores the observed fl ow values, and refl ects that the instrument determines stretches of the same nucleotide in one go and that the number of bases in these homopolymer runs is computationally inferred. Even though SFF can be converted to FastQ (e.g.,

( 67 )
or the instrument pipeline program

), the conversion fi xes the number of bases

observed in a homopolymer run and therefore loses information. For this reason, programs working with fl ow values (SFF

fi les) are preferable when using this type of data.

23 Analysis of High-Throughput Ancient DNA Sequencing Data 223

2. The color-space output of the Applied Biosystems/Life technologies SOLiD instrument can be converted to FastQ without performing a nucleotide sequence conversion, as the

output includes four different codes for the dinucleotides.

Several programs can handle color-space FastQ fi les (e.g., Maq/BWA
( 55
) , Bowtie
( 68
) ). Using color-space FastQ fi les is preferable to sequence conversion without a reference

genome, as conversion may create erroneous sequence downstream of every machine error
( 69
) .

3. The Wellcome Trust Sanger Institute and Illumina Inc. use an encoding of base qualities with one character per quality scores.

To achieve this, the computer representation of keyboard characters as integers is exploited. International ASCII encoding tables assign numbers 0–127 to the same characters world—

wide; however, ASCII character 0–32 are nonprintable. Thus, the fi rst character used in the Sanger standard is 33 (“!”). In its fi rst versions, the Illumina software did not use PHRED-like probabilities (quality score = −10·log (error probability))
( 51 )
.

10

Instead, they used a quality score model with negative values, and therefore set the zero quality to 64 (“@”). In current software versions, PHRED-like probabilities are used, but depending on the exact software version still printed with the offset of 64. Other encodings include the actual quality score numbers separated by space in the quality line (e.g., AltaCyclic base call fi les
( 30 )
). This encoding is considered to be ineffi cient and is used rarely today.

4. If the sample name contains dash (“-”) or underscore (“_”) characters, the script will split the name by these characters and only consider the last fi eld as the index name. The name should not contain any special characters, as it will be used as part of the name for the script’s output fi les.

5.
K
-mers, or more specifi cally
n
-mers where
n
is a constant number, refers to nucleic acid or amino acid sequence “words” of a specifi ed length. A word frequency analysis, or
k
-mer analysis, and more precisely the analysis of over-and underrepresentation of words of a specifi c length, can be used in different contexts. For example, k -mer analysis is used to identify

protein-binding motifs or experimental artifacts. A set of known reference sequences can also be used to determine

high-frequency
k
-mers for fi ltering a new data set, and the overlap of
k
-mers can be used to assemble sequences (de Bruijn graphs).

6. Sequence artifacts can be generated not only during the library preparation steps, but also during the actual sequencing process. For example, on the 454/Roche instrument, the light signal from one well position of the sequencing plate may bleed over to empty, neighboring wells and cause so-called “ghost 224

M. Kircher

wells” that contain a mixture of signals. On SOLiD and the Illumina instruments, chemistry particles, dust, and lint may be identifi ed as clusters. Frequently, these sequencing artifacts either fail the demultiplexing step because they contain an incorrect index sequence, have lower quality scores, or contain other features such as low sequence complexity that cause them to be fi ltered from the sequencing results.

7. The 454/Roche software automatically detects adapter

sequences and marks the determined set-in point (where the non-adapter sequence ends) in the SFF data fi le. When exporting sequences from SFF, the trimmed sequences can be obtained. If adapter sequences are degenerated, it may be diffi cult to identify the set-in point, and additional measures should be taken to assure adapter sequences are recognized and trimmed. The 454/Roche software also excludes sequences for which the adapter is identifi ed very early in the sequence readout. The exact cutoff is version-specifi c, but, given the current read length from this platform, using this fi lter may compromise aDNA studies by causing too many short reads to be excluded from analysis.

8. The fi rst fi ltration step of the raw data is generally performed by the instrument software. For example, the Illumina Genome Analyzer uses a signal purity fi lter called “chastity” that marks “good” sequences as PF (pass fi lters). This fi lter requires that the correct intensities for the bases called are 1.5 times higher than the next highest intensity for the fi rst 12 cycles. In more recent versions of the analysis pipeline, the fi lter requires this to be true for the fi rst 25 cycles, but allows for one outlier cycle within these 25. On the 454 platform, sequences are fi rst fi ltered for a matching key sequence (the fi rst four bases of the read) and then further fi ltered based on the number of active fl ows, i.e., how frequently a nucleotide provided for synthesis results in a base read out. Since bases are determined in a sequential order (TACG), about two active fl ows per cycle are expected. Reads with an average number of active fl ows falling at the outer edges of the distribution are frequently due to ghost wells or mixed beads.

9. The Illumina base call qualities are, like Sanger quality scores, inferred for a specifi c base. However, each software update provides new quality score tables and updated algorithms. On 454/Roche and Ion Torrent, the quality score concept fails with homopolymers, which are a single machine signal that is repeated over several bases when converted into sequence

space. On SOLiD, quality scores incorporate a signal deduced from dinucleotides (i.e., two read outs with the chance of one of them being wrong).

10. Sequence alignments vary according to how the ends of the alignment are scored (global alignments, semi-global, and local 23 Analysis of High-Throughput Ancient DNA Sequencing Data 225

alignments). Widely used programs like blast
( 70 )
and blat
( 71
) report local alignments, which fi nd the highest scoring substring of the query sequence to a target sequence (the ends of the query sequence do not need to match the target sequence).

ClustalW

( 72
) , T-Cof
fee

( 73
)
, MUSCLE

( 74 )
, and similar

alignment programs typically report global alignments, where the complete query sequence is assumed to cover the whole

target sequence (thus scoring incomplete coverage negatively).

Semi-global alignments require the complete query to be

aligned, but do not assume complete coverage of the target. It has been shown that local alignments introduce biases in the alignment of short reads to divergent reference sequences. Such biases are not observed with semi-global alignments
( 24
) .

11. Over the last several years, a number of aligners/mappers have been published with the aim to rapidly and accurately align millions of short reads
( 75
) . Some of these programs require long, perfect matches to the reference sequence (often called “seeds,” see refs.
( 76, 77 )
) while others use compressed suffi x arrays for effi cient matching
( 55, 68,
78
) . Most programs limit the number of substitutions and insertion/deletion events to speed up the alignment. Users of the different programs should be aware of the specifi c limitations of each program in terms of alignment accuracy and sensitivity. Most programs are not able to consider misincorporation patterns of aDNA, which may

Other books

Deadly Errors by Allen Wyler
The Castle Mystery by Gertrude Chandler Warner
Bad Blood by Chuck Wendig
The Carrion Birds by Urban Waite
Comanche Gold by Richard Dawes
The Forgotten Locket by Lisa Mangum