

PAGE – 1 ============
Trimmomatic Manual: V0.32

Introduction

Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

There are two major modes of the program: Paired end mode and Single end mode. The paired end mode will maintain correspondence of read pairs and also use the additional information contained in paired reads to better find adapter or PCR primer fragments introduced by the library preparation process.

Trimmomatic works with FASTQ files (using phred+33 or phred+64 quality scores, depending on the Illumina pipeline used).

Implemented trimming steps (Quick reference)

Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-ended data. The selection of trimming steps and their associated parameters is supplied on the command line.

The current trimming steps are:

    ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
    SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
    MAXINFO: An adaptive quality trimmer which balances read length and error rate to maximise the value of each read.
    LEADING: Cut bases off the start of a read, if below a threshold quality.
    TRAILING: Cut bases off the end of a read, if below a threshold quality.
    CROP: Cut the read to a specified length by removing bases from the end.
    HEADCROP: Cut the specified number of bases from the start of the read.
    MINLEN: Drop the read if it is below a specified length.
    AVGQUAL: Drop the read if the average quality is below the specified level.
    TOPHRED33: Convert quality scores to Phred-33.
    TOPHRED64: Convert quality scores to Phred-64.

PAGE – 2 ============
Running Trimmomatic

Processing Order

The different processing steps occur in the order in which the steps are specified on the command line. It is recommended in most cases that adapter clipping, if required, is done as early as possible, since adapters are harder to identify correctly once only partial matches remain.

Single End Mode

For single-ended data, one input and one output file are specified. The required processing steps (trimming, cropping, adapter clipping etc.) are specified as additional arguments after the input/output files.

    java -jar <path to trimmomatic jar> SE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] <input> <output> <step 1> ...

or

    java -classpath <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticSE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] <input> <output> <step 1> ...

-phred33 or -phred64 specifies the base quality encoding. If no quality encoding is specified, it will be determined automatically (since version 0.32). The prior default was -phred64.

Specifying a trimlog file creates a log of all read trimmings, indicating the following details:

    o the read name
    o the surviving sequence length
    o the location of the first surviving base, i.e. the amount trimmed from the start
    o the location of the last surviving base in the original read
    o the amount trimmed from the end

Multiple steps can be specified as required, by using additional arguments at the end, as described in the section on processing steps.

For input and output files, adding .gz/.bz2 to an extension tells Trimmomatic that the file is provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively.
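The trimlog fields listed above can be read back into a small record for later inspection. The following is a minimal sketch, assuming that each log line holds the five fields in the order listed, separated by single spaces, and that the read name itself may contain spaces. The names TrimlogEntry and parse_trimlog_line are hypothetical, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class TrimlogEntry(NamedTuple):
    read_name: str
    surviving_length: int
    first_surviving: int   # also the amount trimmed from the start
    last_surviving: int    # position in the original read
    trimmed_from_end: int


def parse_trimlog_line(line: str) -> TrimlogEntry:
    """Split one trimlog line into its five documented fields.

    Assumption: the four numeric fields are the last four space-separated
    tokens, so a read name that contains spaces is kept intact.
    """
    name, length, first, last, end = line.rstrip("\n").rsplit(" ", 4)
    return TrimlogEntry(name, int(length), int(first), int(last), int(end))


# Made-up example line: a read that lost 5 bases at each end.
entry = parse_trimlog_line("read_1 1:N:0 90 5 95 5")
print(entry)
```

Splitting from the right keeps the parser independent of how many space-separated parts the FASTQ header has.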

PAGE – 3 ============
Paired End Mode

For paired-end data, two input files and four output files are specified: two for the 'paired' output, where both reads survived the processing, and two for the corresponding 'unpaired' output, where a read survived but the partner read did not.

Figure 1: Flow of reads in Trimmomatic Paired End mode

    java -jar <path to trimmomatic jar> PE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] [-basein <inputBase> | <input 1> <input 2>] [-baseout <outputBase> | <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2>] <step 1> ...

or

    java -classpath <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] [-basein <inputBase> | <input 1> <input 2>] [-baseout <outputBase> | <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2>] <step 1> ...

-phred33 or -phred64 specifies the base quality encoding. If no quality encoding is specified, it will be determined automatically (since version 0.32). The prior default was -phred64.

-threads indicates the number of threads to use, which improves performance on multi-core computers. If not specified, it will be chosen automatically.

Specifying a trimlog file creates a log of all read trimmings, indicating the following details:

    o the read name
    o the surviving sequence length
    o the location of the first surviving base, i.e. the amount trimmed from the start
    o the location of the last surviving base in the original read
    o the amount trimmed from the end

Multiple steps can be specified as required, by using additional arguments at the end, as described in the section on processing steps.

PAGE – 4 ============
Input/Output Files

Paired-end mode requires 2 input files (for forward and reverse reads) and 4 output files (for forward paired, forward unpaired, reverse paired and reverse unpaired reads). Since these files often have similar names, the user has the option to provide either the individual file names, or just one name from which the file names can be derived.

For input files, either of the following can be used:

    o Explicitly naming the 2 input files
    o Naming the forward file using the -basein flag, where the reverse file can be determined automatically. The second file is determined by looking for common patterns of file naming, and changing the appropriate character to reference the reverse file. Examples which should be correctly handled include:
        o Sample_Name_R1_001.fq.gz -> Sample_Name_R2_001.fq.gz
        o Sample_Name.f.fastq -> Sample_Name.r.fastq
        o Sample_Name.1.sequence.txt -> Sample_Name.2.sequence.txt

For output files, either of the following can be used:

    o Explicitly naming the 4 output files
    o Providing a base file name using the -baseout flag, from which the 4 output files can be derived. If the name mySampleFiltered.fq.gz is provided, the following 4 file names will be used:
        o mySampleFiltered_1P.fq.gz - for paired forward reads
        o mySampleFiltered_1U.fq.gz - for unpaired forward reads
        o mySampleFiltered_2P.fq.gz - for paired reverse reads
        o mySampleFiltered_2U.fq.gz - for unpaired reverse reads

For input and output files, adding .gz to an extension tells Trimmomatic that the file is provided in gzipped format or that Trimmomatic should gzip the file, respectively. This extension can be used with both explicitly named and template-based file naming.
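The -baseout naming scheme described above can be reproduced with a few lines of Python, which is handy when a downstream script needs to know the output names in advance. This is a sketch only: the function name derive_baseout_names is hypothetical, and inserting the suffix before the first '.' of the file name is an assumption chosen to match the documented example, not a description of Trimmomatic's internal logic.

```python
from typing import List


def derive_baseout_names(baseout: str) -> List[str]:
    """Return the four output names in the order 1P, 1U, 2P, 2U.

    Assumption: the suffix goes directly before the first '.' of the
    file name, so 'x.fq.gz' becomes 'x_1P.fq.gz'.
    """
    directory, sep, filename = baseout.rpartition("/")
    stem, dot, extension = filename.partition(".")
    return [
        f"{directory}{sep}{stem}_{suffix}{dot}{extension}"
        for suffix in ("1P", "1U", "2P", "2U")
    ]


names = derive_baseout_names("mySampleFiltered.fq.gz")
for name in names:
    print(name)
```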

PAGE – 5 ============
Processing Steps in Detail

Most processing steps take one or more settings, delimited by ':' (a colon).

ILLUMINACLIP

This step is used to find and remove Illumina adapters.

Identifying adapter or other contaminant sequences within a dataset is inherently a trade-off between sensitivity (ensuring all contaminant sequences are removed) and specificity (leaving all non-contaminant sequence data intact). This problem is even more acute when only a small part of the contaminant sequence is included within the read. The possibility of sequencing errors within the reads complicates the process still further.

Although adapter and other technical sequences can potentially occur in any location within reads, by far the most common cause of adapter contamination is sequencing of a DNA fragment which is shorter than the read length. In this scenario, the beginning of the read contains valid data, but when the end of the fragment is reached, the sequencer continues to read through into the adapter, leaving a partial or complete adapter sequence at the 3' end of the read. While a full adapter sequence can be identified relatively easily, reliably identifying a short partial adapter sequence is inherently difficult.

Interestingly, in a paired-end dataset, read-through will occur on both the forward and reverse reads of a particular fragment in the same position, and also, since the fragment was entirely sequenced from both ends, the non-adapter portions of the forward and reverse reads will be reverse-complements.

Since adapter read-through is a relatively common occurrence, and since Illumina datasets are often paired-end, Trimmomatic includes a second adapter identification strategy, specifically for adapter read-through, which takes advantage of the added evidence available in paired-end datasets. This is called palindrome mode.

The diagram below illustrates both strategies. In A, the read contains the entire technical sequence within the read, and thus a standard alignment approach is sufficient to determine this fact.

In B, only part of the technical sequence is within the read, so the alignment is much shorter and the risk of a false positive is correspondingly higher. Below some length threshold, which depends on the relative costs of false positives and false negatives, it is no longer possible to identify an adapter sequence, thus many short adapter fragments will remain.

Example C illustrates how palindrome mode would check a similar situation with a short contaminant. Since it is in paired-end mode, the region tested as part of the alignment is much longer. Not only are both adapter sequences tested at once, but the fragment sequences from each read are also checked. This makes detection of short adapter fragments much more reliable.

Palindrome mode can also detect long adapter fragments, as shown in D. In this example, there is no useful fragment at all, and both reads begin with sequence from the adapters. Nonetheless, there is still a sizeable alignment, and thus this scenario can be reliably identified.
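The reverse-complement property that palindrome mode relies on can be illustrated with a toy example. The sketch below is not Trimmomatic's algorithm: it uses exact string comparison instead of a scored alignment, ignores quality values, and all names (simulate_pair, find_read_through, the sequences and read length) are hypothetical. It simulates a fragment shorter than the read length and then recovers the fragment length from the two reads alone.

```python
from typing import Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(fragment: str, adapter_fwd: str, adapter_rev: str,
                  read_len: int) -> Tuple[str, str]:
    """Simulate read-through: each read runs off the fragment into an adapter."""
    fwd = (fragment + adapter_fwd)[:read_len]
    rev = (reverse_complement(fragment) + adapter_rev)[:read_len]
    return fwd, rev


def find_read_through(fwd: str, rev: str, min_len: int = 8) -> Optional[int]:
    """Return the longest prefix length at which the forward read equals the
    reverse complement of the same-length prefix of the reverse read."""
    for length in range(min(len(fwd), len(rev)), min_len - 1, -1):
        if fwd[:length] == reverse_complement(rev[:length]):
            return length
    return None


fragment = "ACGGTCATTGCA"            # 12 bases, shorter than the 20-base reads
adapter = "AGATCGGAAGAGC"            # illustrative adapter start
fwd, rev = simulate_pair(fragment, adapter, adapter, read_len=20)
fragment_len = find_read_through(fwd, rev)
print(fwd[:fragment_len])            # the adapter-free part of the forward read
```

Because the whole fragment takes part in the comparison, the evidence is strong even when only a few adapter bases are present, which is the point made for panels C and D.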

PAGE – 6 ============
Trimmomatic uses a two-step approach to find matches between the adapters and reads. First, short sections of each adapter (maximum 16 bp) are tested in each possible position within the reads. If this short alignment, known as the seed, is a perfect or sufficiently close match, as determined by the seedMismatches parameter (see below), the entire alignment between the read and adapter is scored. This two-step strategy results in considerable efficiency gains, since the seed alignment can be calculated very quickly, while the full alignment score is calculated relatively rarely.

The full alignment score is calculated as follows. Each matching base increases the alignment score by 0.6, while each mismatch reduces the alignment score by Q/10, where Q is the quality score of the mismatching base. By considering the quality of the base calls, mismatches caused by read errors have less impact. A perfect match of a 12 base sequence will score just over 7, while 25 bases are needed to score 15. As such we recommend values between 7 and 15 as the threshold value for simple alignment mode.

For palindromic matches, a longer alignment is possible, as described above. Therefore this threshold can be higher, in the range of 30. Even though this threshold is very high (requiring a match of almost 50 bases), Trimmomatic is still able to identify very, very short adapter fragments. (See Figure 2 panels C and D, where the alignment regions are shown.)

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

    fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See the section below or use one of the provided adapter files.
    seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
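The scoring rule described above (0.6 per match, minus Q/10 per mismatch) is simple enough to check by hand. The following sketch computes it for an ungapped, already-positioned alignment. The function name alignment_score is hypothetical, and the handling of N bases and the seed step are deliberately left out.

```python
from typing import Sequence

MATCH_SCORE = 0.6  # per matching base, as stated in the text


def alignment_score(read: str, quals: Sequence[int], adapter: str) -> float:
    """Score an ungapped alignment of read against adapter.

    Each match adds 0.6; each mismatch subtracts Q/10, where Q is the
    quality score of the mismatching read base.
    """
    score = 0.0
    for base, qual, ref in zip(read, quals, adapter):
        if base == ref:
            score += MATCH_SCORE
        else:
            score -= qual / 10
    return score


adapter = "AGATCGGAAGAG"                       # 12 illustrative bases
perfect = alignment_score(adapter, [40] * 12, adapter)
one_error = alignment_score("AGATCGGAAGAT", [40] * 11 + [20], adapter)
print(round(perfect, 2), round(one_error, 2))
```

A perfect 12-base match gives 12 x 0.6 = 7.2, in line with the "just over 7" quoted above, while a single Q20 mismatch costs 2 points plus the lost match bonus.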

PAGE – 8 ============
[...] before a read is likely to be unique depends on the size and complexity of the target sequence, but a typical target length would be in the order of 40 bases.

Additional read length: There may be added value in retaining additional bases, beyond those needed to uniquely place a read. This is dependent primarily on the application. For pure counting applications, such as RNA-Seq, unique placement is sufficient. For assembly or variant finding tasks, additional bases provide extra evidence for or against putative results, and thus can be valuable.

Error sensitivity: The downstream analysis can be more or less sensitive to errors within the data. This is determined by the tools and settings used. One extreme would be tools where a single base error would cause the entire read to be ignored, which favours aggressive quality trimming. The other extreme would be tools which can tolerate or even correct a large number of errors, which favour retaining as much data as possible.

[...] second and third factors. [...] at every possible position, scoring as follows:

    Minimal read length: The difference between the putative read length and the target read length is scored using a logistic function. This means that reads shorter than the target length are heavily penalized, but most of the scoring benefit can be achieved by reads only marginally longer than this target.
    Additional read length: The putative read length is scored linearly, and weighted by (1 - strictness).
    Error sensitivity: The quality scores of the putatively retained bases are combined to calculate the probability that the read is error-free. This score is then weighted by the strictness.

The combined score is calculated for each possible read length, and the optimal score is used to determine where the read should be trimmed.

In practice, the different factors combine as follows:

    o At very short read lengths, the minimal read length factor dominates. This will heavily penalize reads which are too short to be useful.
    o Once the target read length has been achieved, the minimal read length factor penalty becomes a modest bonus. However, once the read is significantly longer than the target length, further bonuses from the minimal read length factor are limited, due to the logistic function.
    o The additional read length factor then provides a modest benefit as additional bases are retained. This is countered by the error sensitivity factor, since the probability that the read is error-free drops with increasing read length. The balance between these two factors is controlled by the strictness parameter.
    o For most reads, depending on the quality of the read and the strictness setting, the increasing penalty from the likelihood of error exceeds the bonus of retaining additional bases at some point, and the read is trimmed accordingly.
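How the three factors trade off can be explored with a small numerical sketch. Everything specific here is an assumption made for illustration: the logistic term is taken as 1 / (1 + exp(target - length)), the three factors are combined by summing their logarithms, and the name maxinfo_trim_length is hypothetical. Only the overall shape (logistic length term, length term weighted by 1 - strictness, error-free probability weighted by strictness) follows the description above.

```python
import math
from typing import Sequence


def maxinfo_trim_length(quals: Sequence[int], target_length: int,
                        strictness: float) -> int:
    """Return the read length with the best combined score (0 for an empty read)."""
    best_length, best_score = 0, -math.inf
    log_error_free = 0.0  # running log-probability that all kept bases are correct
    for length in range(1, len(quals) + 1):
        p_correct = 1.0 - 10 ** (-quals[length - 1] / 10)
        log_error_free += math.log(max(p_correct, 1e-12))  # guard against Q0

        # Assumed logistic form: heavy penalty below target, flat above it.
        minimal = -math.log1p(math.exp(target_length - length))
        additional = (1 - strictness) * math.log(length)
        error = strictness * log_error_free

        score = minimal + additional + error
        if score > best_score:
            best_length, best_score = length, score
    return best_length


# 60 high-quality bases followed by a 40-base low-quality tail.
quals = [40] * 60 + [2] * 40
print(maxinfo_trim_length(quals, target_length=40, strictness=0.5))
```

With these assumptions the read is cut where the low-quality tail begins, because each extra Q2 base costs far more in error probability than it gains in length.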

PAGE – 9 ============
MAXINFO:<targetLength>:<strictness>

    targetLength: This specifies the read length which is likely to allow the location of the read within the target sequence to be determined.
    strictness: This value, which should be set between 0 and 1, specifies the balance between preserving as much read length as possible vs. removal of incorrect bases. A low value of this parameter (<0.2) favours longer reads, while a high value (>0.8) favours read correctness.

LEADING

Remove low quality bases from the beginning. As long as a base has a value below this threshold the base is removed and the next base will be investigated.

LEADING:<quality>

    quality: Specifies the minimum quality required to keep a base.

TRAILING

Remove low quality bases from the end. As long as a base has a value below this threshold the base is removed and the next base (which, as Trimmomatic is starting from the end, is the base preceding the just removed base) will be investigated. This approach can be used to remove the special Illumina 'low quality segment' regions (which are marked with a quality score of 2), but we recommend Sliding Window or MaxInfo instead.

TRAILING:<quality>

    quality: Specifies the minimum quality required to keep a base.

CROP

Removes bases regardless of quality from the end of the read, so that the read has at most the specified length after this step has been performed. Steps performed after CROP might of course further shorten the read.

CROP:<length>

    length: The number of bases to keep, from the start of the read.

HEADCROP

Removes the specified number of bases, regardless of quality, from the beginning of the read.

HEADCROP:<length>

    length: The number of bases to remove from the start of the read.
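The four positional steps above are easy to express as small functions over a sequence and its list of quality scores. The sketch below follows the descriptions in this section; the function names are hypothetical and it does not reproduce Trimmomatic's own code.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base quality scores)


def leading(seq: str, quals: List[int], threshold: int) -> Read:
    """Drop bases from the start while their quality is below threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return seq[start:], quals[start:]


def trailing(seq: str, quals: List[int], threshold: int) -> Read:
    """Drop bases from the end while their quality is below threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def crop(seq: str, quals: List[int], length: int) -> Read:
    """Keep at most `length` bases from the start of the read."""
    return seq[:length], quals[:length]


def headcrop(seq: str, quals: List[int], count: int) -> Read:
    """Remove `count` bases from the start of the read."""
    return seq[count:], quals[count:]


seq, quals = "NNACGTACGTNN", [2, 2] + [30] * 8 + [2, 2]
seq, quals = leading(seq, quals, 3)     # like LEADING:3
seq, quals = trailing(seq, quals, 3)    # like TRAILING:3
print(seq)
```

Applying the functions one after another mirrors the way steps are processed in command-line order.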

PAGE – 10 ============
MINLEN

This module removes reads that fall below the specified minimal length. If required, it should normally be after all other processing steps. Reads removed by this step will be counted and presented in the Trimmomatic summary.

MINLEN:<length>

    length: Specifies the minimum length of reads to be kept.

TOPHRED33

This (re)encodes the quality part of the FASTQ file to base 33.

TOPHRED33 (no further parameters)

TOPHRED64

This (re)encodes the quality part of the FASTQ file to base 64.

TOPHRED64 (no further parameters)
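Re-encoding between base 33 and base 64 amounts to shifting every quality character by a fixed offset of 31. The sketch below shows that shift together with a MINLEN-style length check. The function names are hypothetical and no validation of the input encoding is attempted.

```python
def to_phred33(qual_phred64: str) -> str:
    """Shift a Phred-64 encoded quality string down to Phred-33."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual_phred64)


def to_phred64(qual_phred33: str) -> str:
    """Shift a Phred-33 encoded quality string up to Phred-64."""
    return "".join(chr(ord(c) - 33 + 64) for c in qual_phred33)


def passes_minlen(seq: str, min_length: int) -> bool:
    """A read is kept only if it is at least min_length bases long."""
    return len(seq) >= min_length


# 'h' is quality 40 in Phred-64; 'I' is quality 40 in Phred-33.
print(to_phred33("hhhh"), to_phred64("IIII"), passes_minlen("ACGT" * 9, 36))
```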

PAGE – 11 ============
Examples

Paired End

    java -jar trimmomatic-0.30.jar PE s_1_1_sequence.txt.gz s_1_2_sequence.txt.gz lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following, in this order:

    o Remove Illumina adapters provided in the TruSeq3-PE.fa file (provided). Initially Trimmomatic will look for seed matches (16 bases) allowing maximally 2 mismatches. These seeds will be extended and clipped if, in the case of paired end reads, a score of 30 is reached (about 50 bases), or, in the case of single ended reads, a score of 10 (about 17 bases).
    o Remove leading low quality or N bases (below quality 3)
    o Remove trailing low quality or N bases (below quality 3)
    o Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
    o Drop reads which are less than 36 bases long after these steps

Single End

    java -jar trimmomatic-0.30.jar SE s_1_1_sequence.txt.gz lane1_forward.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the same steps, using the single-ended adapter file. (Of course the :30: parameter has no effect, but a value has to be specified nevertheless.)
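The SLIDINGWINDOW:4:15 step used in both examples can be pictured with the following sketch. It is a simplification under stated assumptions: the read is cut at the start of the first window whose average quality falls below the threshold, and reads shorter than the window are left untouched. The function name sliding_window_keep_length is hypothetical, and Trimmomatic's exact cut position may differ.

```python
from typing import Sequence


def sliding_window_keep_length(quals: Sequence[int], window: int,
                               required_quality: float) -> int:
    """Return how many bases to keep from the 5' end of the read."""
    for start in range(0, len(quals) - window + 1):
        # Compare sums rather than averages to avoid a division per window.
        if sum(quals[start:start + window]) < required_quality * window:
            return start  # assumption: cut where the failing window begins
    return len(quals)


# Ten good bases followed by six poor ones, checked as in SLIDINGWINDOW:4:15.
quals = [30] * 10 + [5] * 6
keep = sliding_window_keep_length(quals, window=4, required_quality=15)
print(keep)
```

Averaging over a window means a single poor base inside an otherwise good stretch does not end the read, unlike the per-base test used by LEADING and TRAILING.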
