Blue and its kin: Correcting sequencing errors using k-mers

Paul Greenfield, 10th July 2015, CSIRO Oceans and Atmosphere Flagship

Correcting reads

CAAATTTTTATGAAAGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAGTTAATG
TTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTGTTATTATAAGTTAAAAAGT
TAAAAATTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATGATAACTTAAA
AATTTGTTTTATACATGATTATTTTAAAA-CATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAGTTAATG
GAAAGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAACATAAGAATAAAACGGGATATTTTTATTA
CAAATTTTTATGAAAGTGCATAAAAATTTGTTTTATACACGTTATTTTAAAAA-CATAACAATAAAACGG
TAAAAATTTGTTTTATACATGATTTATTTTAAAACAGAACAATAAAACGGGAAATTTTTTATATAAGTTAAAAAGTCGG
TGAAAGTGCATAAAAATTTGTTCTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAA
GTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAG
GCATAAAAATTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAGTT
AGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAACATAACAACAAAACGGGAAATTTTTATCTATAAGTTAA

- The starting point is a big pile of random reads
- Overlapping reads are buried amongst millions of others
- No reference sequence for read alignment
- Reads come from both strands
- Errors are almost random
- Types of error depend on platform (substitutions, deletions, insertions)
- Quality falls off towards the right-hand end of each read

How can the errors in this data be found and corrected?

Two interesting k-mer graphs

- k-mer distinctiveness within a bacterial genome
  - k-mers are distinct-ish once k >= 18
  - The last 4% are biological repeats such as ribosome, transposons, ...
  - Same % for Eukaryotes?
- k-mer repetition within reads
  - Typically 5 2 depth of coverage
  - k-mers from good parts of reads are present many times
  - k-mers covering sequencing errors are found very few times

[Figure: "k-mer Uniqueness Within Genome Sequences", % uniqueness against k-mer length, one curve per genome-size class (1-2 Mbp up to >6 Mbp, and all >1 Mbp)]
[Figure: % of k-mers against repetition depth]
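The first graph's quantity can be approximated on any sequence with a few lines of code. This is a minimal sketch, assuming "uniqueness" means the fraction of k-mer positions whose k-mer occurs exactly once in the sequence; the function name `unique_fraction` is hypothetical and is not part of Blue.

```python
from collections import Counter


def unique_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once.

    Assumption: this is the 'uniqueness' plotted against k on the slide.
    """
    total = len(sequence) - k + 1
    if total <= 0:
        return 0.0
    counts = Counter(sequence[i:i + k] for i in range(total))
    # A k-mer seen exactly once contributes exactly one unique position.
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / total


if __name__ == "__main__":
    toy = "ACGTACGTTT"
    for k in (2, 3, 4, 5):
        print(k, round(unique_fraction(toy, k), 3))
```

Run over a real genome for k from 5 to 35, this should give a curve of the same general shape as the one described, rising quickly and flattening once k is large enough.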

Error correction fundamentals

- Find an error within a read (or a part of a read)
- Find the best replacement for the broken part of the read
- But do this with no reference to say what is right or wrong
- Mostly using consensus as a guide
  - Most of the reads covering any spot in the genome will agree on the base
  - But can't just align the reads (as a SNP-calling algorithm would)

[Figure: pile-up of overlapping reads, as on the "Correcting reads" slide]

Error correction algorithms are all the same

- Published algorithms just differ in how they find errors and determine the correct replacement/fix
- Build a consensus from the reads using
  - Fixed-length read fragments (k-mers)
  - Variable-length read fragments (suffix arrays/trees of sub-reads)
  - Alignment to constructed consensus reads
  - CS data structure du jour, ...
- Check (each part of) each read against the consensus
  - Presence/absence or depth of coverage
- Choose the best replacement for error fragments
  - Minimal distance metric from error fragment to good fragment?
  - This is the major area of difference between published tools

Choosing the best replacement

- Trivially easy to correct most errors in Illumina data
  - Almost all errors are substitutions
  - Often only a single close replacement fragment
- But correcting the non-trivial cases properly is essential
  - Exiting repetitive regions: multiple correct corrections

TTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTGTTATTATAAGTTAAAAAGT
TAAAAATTTGTTTTATACATGATTATTTTAAAAAACTTTTTAACTTATAATAACAATTTCCCGTTTTATTGTTATG
AATTTGTTTTATACATGATTATTTTAAAAATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAGTTAATG
GAAAGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAAGATAAGAATAAAACGGGATATTTTTATTA
CAAATTTTTATGAAAGTGCATAAAAATTTGTTTTATACACGTTATTTTAAAAAACTTTTTAACTTATAATAACAATTTC
TAAAAATTTGTTTTATACATGATTTATTTTAAAAACTTTTTAACTTATAATAACAATTTCCCGTT
TGAAAGTGCATAAAAATTTGTTCTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAA
GTTTTATACATGATTATTTTAAAAAACTTTTTAACTTATAATAACAATTTCCCGTTTTATTG
GCATAAAAATTTGTTTTATACATGATTATTTTAAAAAACTTTTTAACTTATAATAACAATTTCCCGTTTTATTGTTATGA
AGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAAACTTTTTAACTTATAATAACAATTTCCCGTTTTATT

- Making the wrong fix introduces errors rather than fixing them
- Which fix is the right one for the read being corrected?

k-mer consensus correction

- k-mer consensus table (hash table or equivalent)
  - Set of distinct k-mers derived from the reads
  - Left-hand spike is k-mers including sequencing errors (low repetition, often singletons)
  - Right-hand curve contains good k-mers (those found in multiple reads)
  - A simple threshold can determine which k-mers are bad and which are good
- Basic correction algorithm
  - Scan each read, generating k-mers on the fly
  - Detect bad k-mers and replace them with good ones
    - Find possible variants of the bad k-mer
    - Select the best variant by checking against the good k-mer consensus
    - Overwrite the bad k-mer with the selected good k-mer and continue scanning/replacing

[Figure: histogram of % of k-mers against repetition depth]
[Figure: k-mer repetition depth along a read, original vs healed]
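A consensus table of this kind can be sketched with a plain Python `Counter`. This is an illustration only, not Blue's implementation: the names `canonical`, `count_kmers` and `split_by_depth` are hypothetical, and folding each k-mer with its reverse complement is an assumption made here because the reads come from both strands.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Tile every read into k-mers and count canonical forms."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def split_by_depth(counts, min_depth):
    """Apply a simple threshold: good k-mers have depth >= min_depth."""
    good = {kmer for kmer, n in counts.items() if n >= min_depth}
    bad = set(counts) - good
    return good, bad


if __name__ == "__main__":
    # Four forward copies, one reverse-strand copy, one read with a miscall.
    reads = ["GATTACAGGCT"] * 4 + ["AGCCTGTAATC", "GATTACTGGCT"]
    good, bad = split_by_depth(count_kmers(reads, 5), min_depth=3)
    print(len(good), "good k-mers;", len(bad), "bad k-mers")
```

In this toy data the single miscalled base produces k low-depth k-mers (the left-hand spike), while every k-mer from the error-free reads clears the threshold.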

Finding potentially bad k-mers

- Primarily looking for anomalies in the repetition depth
- But depth varies for many reasons, not just as a result of errors
  - A single miscalled base will cause depth to drop to near zero (for k bases)
  - Coming out of repeat regions causes drops that are not errors
  - Some platforms have frequent ins/del errors at the end of homopolymer runs
    - Depth drops perhaps to half, not zero

[Figures: k-mer repetition depth along two example reads, original vs healed]
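The depth anomaly caused by a single miscall can be made concrete with a small sketch. It assumes a forward-strand-only count table for brevity, and the helper names (`kmer_counts`, `depth_profile`, `low_depth_runs`) are hypothetical. A fixed threshold like this would also flag the non-error drops listed above, which is why depth alone is not sufficient.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Forward-strand k-mer counts (simplification: no reverse complements)."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def depth_profile(read, counts, k):
    """Repetition depth of each successive k-mer along the read."""
    return [counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]


def low_depth_runs(profile, min_depth):
    """Return (first, last) k-mer indices of each run below min_depth."""
    runs, start = [], None
    for i, depth in enumerate(profile):
        if depth < min_depth and start is None:
            start = i
        elif depth >= min_depth and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


if __name__ == "__main__":
    good, broken = "GATTACAGGCT", "GATTACTGGCT"  # one substitution at index 6
    counts = kmer_counts([good] * 5 + [broken], k=5)
    profile = depth_profile(broken, counts, k=5)
    print(profile)
    print(low_depth_runs(profile, min_depth=3))
```

The run of low-depth k-mers is k long because every k-mer that covers the miscalled base is affected.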

Most of the time correction is simple

Often only one variant of a bad k-mer will be found in the good k-mer set

GGATCTTCAACTGGACGGTGACGTCACT
 GATCTTCAACTGGACGGTGACGTCA  not found
 GATCTTCAACTGGACGGTGACGTCC  not found
 GATCTTCAACTGGACGGTGACGTCG  27 32
 GATCTTCAACTGGACGGTGACGTCT  not found
GGATCTTCAACTGGACGGTGACGTCG
  ATCTTCAACTGGACGGTGACGTCGA  32 3
  ATCTTCAACTGGACGGTGACGTCGC  not found
  ATCTTCAACTGGACGGTGACGTCGG  not found
  ATCTTCAACTGGACGGTGACGTCGT  not found
GGATCTTCAACTGGACGGTGACGTCGA
   TCTTCAACTGGACGGTGACGTCGAA  not found
   TCTTCAACTGGACGGTGACGTCGAC  not found
   TCTTCAACTGGACGGTGACGTCGAG  3 29
   TCTTCAACTGGACGGTGACGTCGAT  not found
GGATCTTCAACTGGACGGTGACGTCGAG

Naike: a naïve k-mer corrector

- 95% of k-mers in the genome are distinct
  - Only ~5% of k-mers are repeated (with consequent fix ambiguities)
- Progressively tile each read into k-mers
  - Detect error k-mers through poor depth of coverage
  - On-the-fly repair means only the last base needs to be varied
- Limit fixes to substitutions
  - Only 4 possible bases, so only 4 variants each time
- Look up the depth for each variant k-mer and choose the one with the highest repetition count
  - Most often only a single viable variant is found

ATCTTCAACTGGACGGTGACGTCGA  32 3
ATCTTCAACTGGACGGTGACGTCGC  not found
ATCTTCAACTGGACGGTGACGTCGG  not found
ATCTTCAACTGGACGGTGACGTCGT  not found
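The naïve scheme described on this slide fits in a few lines. The following is a sketch of the idea rather than the Naike source: it assumes a forward-strand count table, a fixed depth threshold, and hypothetical function names. It can only repair an error that sits in the last base of the current k-mer, so a miscall within the first k-1 bases of a read is left alone.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Forward-strand k-mer counts (simplification: no reverse complements)."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def naive_correct(read, counts, k, min_depth):
    """Scan left to right; when a k-mer is too rare, vary only its last base."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts.get(kmer, 0) >= min_depth:
            continue  # k-mer looks good, keep scanning
        prefix = kmer[:-1]
        best_base, best_depth = None, 0
        for base in "ACGT":  # substitutions only: four variants
            depth = counts.get(prefix + base, 0)
            if depth > best_depth:
                best_base, best_depth = base, depth
        if best_base is not None and best_depth >= min_depth:
            # Overwrite in place so later k-mers see the repaired base.
            bases[i + k - 1] = best_base
    return "".join(bases)


if __name__ == "__main__":
    good, broken = "GATTACAGGCT", "GATTACTGGCT"
    counts = kmer_counts([good] * 5 + [broken], k=5)
    print(naive_correct(broken, counts, k=5, min_depth=3))
```

Because the repair is written back into the read before the scan moves on, the k-1 following k-mers that covered the same miscall are already healed when they are reached.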

Naike: a naïve k-mer corrector

- Even this simplistic k-mer corrector works surprisingly well
  - Produces better results than all correction algorithms published prior to the last couple of years
  - And is much faster as well, because most of them are poorly written

[Figure: two panels, "Uncorrected reads" and "Naive k-mer correction", showing reads aligned against the reference (Bowtie2), grouped by number of mismatches, plus unaligned reads]

Blue: correcting errors using consensus & context

- Blue is a (non-naïve) k-mer consensus correction tool
  - doi:10.1093/bioinformatics/btu368
- Handles both Illumina and 454/IonTorrent-like data
  - ins/del errors as well as subs (unlike most tools)
  - Handling deletions means that there are always multiple possible fixes
    - The same k-mer can always be generated by substituting or inserting
- Recursively explores the tree of potential fixed reads to find the best fix
  - Trials potential fixes in the context of the read being repaired
    - Looking both left and right of the fix
  - Recursive exploration to allow for further errors to the right
  - Depth-first tree exploration, error-limited to improve efficiency
  - Generated metrics include number of matching bases to the right, bases to first fix, ...
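The point that a deletion always leaves more than one candidate fix can be shown with a toy depth-first search. This is not Blue's algorithm: it is a minimal sketch that assumes forward-strand counts, scores a candidate only by the number of fixes it needs, and uses hypothetical names throughout (`first_bad`, `candidate_fixes`, `explore`). Blue's own metrics, such as matching bases to the right, are not modelled.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Forward-strand k-mer counts (simplification)."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def first_bad(read, counts, k, min_depth):
    """Index of the first k-mer below min_depth, or None if all are good."""
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < min_depth:
            return i
    return None


def candidate_fixes(read, pos):
    """Every read one edit away at pos: substitutions, insertions, a deletion."""
    out = []
    for base in "ACGT":
        if base != read[pos]:
            out.append(read[:pos] + base + read[pos + 1:])  # substitution
        out.append(read[:pos] + base + read[pos:])          # insertion
    out.append(read[:pos] + read[pos + 1:])                 # deletion
    return out


def explore(read, counts, k, min_depth, max_fixes):
    """Depth-first search; returns (fixes_used, fixed_read) or None."""
    bad = first_bad(read, counts, k, min_depth)
    if bad is None:
        return 0, read
    if max_fixes == 0:
        return None  # error limit reached on this branch
    best = None
    for cand in candidate_fixes(read, bad + k - 1):
        # Cull candidates that do not heal the k-mer that triggered the fix.
        if counts.get(cand[bad:bad + k], 0) < min_depth:
            continue
        result = explore(cand, counts, k, min_depth, max_fixes - 1)
        if result is not None and (best is None or result[0] + 1 < best[0]):
            best = (result[0] + 1, result[1])
    return best


if __name__ == "__main__":
    good = "GATTACAGGCT"
    counts = kmer_counts([good] * 5, k=5)
    print(explore("GATTACGGCT", counts, 5, 3, max_fixes=2))  # one base deleted
```

In the deletion example both "substitute A" and "insert A" heal the first bad k-mer, but the substitution branch needs a second fix further right, so the insertion is preferred. Trying each candidate in the context of the rest of the read is what separates the two.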

Blue: correcting errors using consensus & context

- Can use longer pairs of k-mers to improve context for a fix
  - Currently pairs of 16-mers separated by a 16-48bp gap
  - Used to detect errors and improve correction
    - Both k-mers and pairs must be found in consensus tables
    - Cull potential variants if not supported by pairs
  - Prevents generation of chimeric corrected reads
    - Fixed tail of read is good but does not belong with the start of the read
  - Improves correction at the margin only
    - Just a small number of critical corrections
- Blue separates consensus from the reads being corrected
  - Possible to correct long (e.g. 454) reads with a larger set of Illumina k-mers
    - Combine 454 etc. read length with depth of Illumina
    - Addresses 454 etc. homopolymer problem (different error models)
    - Fixing PacBio data is work in progress and has been for a few months
  - Correct Illumina mate pairs with normal Illumina reads
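How paired k-mers catch a chimeric "correction" that single k-mers miss can be sketched on toy data. The code below assumes a single fixed gap and very short k-mers (3-mers with a 2-base gap, instead of 16-mers with a 16-48bp gap) so the example stays readable; `kmer_set`, `pair_set` and `unsupported_pairs` are hypothetical names, not Blue's API.

```python
def kmer_set(reads, k):
    """All k-mers seen in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def pair_set(reads, k, gap):
    """All (left k-mer, right k-mer) pairs separated by a fixed gap."""
    span = 2 * k + gap
    return {
        (r[i:i + k], r[i + k + gap:i + span])
        for r in reads
        for i in range(len(r) - span + 1)
    }


def unsupported_pairs(read, pairs, k, gap):
    """Start positions in the read whose k-mer pair is absent from the table."""
    span = 2 * k + gap
    return [
        i
        for i in range(len(read) - span + 1)
        if (read[i:i + k], read[i + k + gap:i + span]) not in pairs
    ]


if __name__ == "__main__":
    # Two regions sharing the repeat GTTG, and a chimera joining them.
    region1 = "ACGCA" + "GTTG" + "ACCAT"
    region2 = "CTATC" + "GTTG" + "GGAAC"
    chimera = "ACGCA" + "GTTG" + "GGAAC"
    kmers = kmer_set([region1, region2], 3)
    pairs = pair_set([region1, region2], 3, 2)
    all_kmers_ok = all(chimera[i:i + 3] in kmers for i in range(len(chimera) - 2))
    print(all_kmers_ok, unsupported_pairs(chimera, pairs, 3, 2))
```

Every individual k-mer of the chimera is present in the consensus, yet the pairs that straddle the repeat are not, which is the signal used to cull such a variant.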

Read tree exploration

CAAATTTTTATGAAAGTGCATAAAAATTTGTTTTATACATGATTATGTTAAAAATATAACAATAAAACGGGAAATTTTTATTATAACTTAAAAAGTTAATG (uncorrected)
CAAATTTTTATGAAAGTGCATAAAAATTTGTTTTATACATGATTATTTTAAAAACATAACAATAAAACGGGAAATTTTTATTATAAGTTAAAAAGTTAATG (correct read)

[Figure: tree of candidate fixes explored for this read. Each node is a trial substitution (Sub), insertion (Ins) or deletion (Del) at a read position, shown with the resulting k-mer context, e.g. Sub G T at ..TATGTTAAAA.., Sub T C at ..AAATATAACA.. and Sub C G at ..TAACTTAAAA..]

Performance tests (Blue 1.2 on Linux)

Columns: tool, threads, prep. time (min), correct. time (min), prep. mem (GB), correct. mem (GB), reads/min (elapsed). Tools other than Blue and Reptile report a single time and a single memory figure.

Pseudomonas (9,859,28 reads; var len, 15bp)
  Blue     8  5.4 6.3 1.3.4  843,876
  BLESS*   1  48.3.1  24,126
  Coral    8  437. 11.  22,561
  Echo     1  4934.8 19.  1,998
  HiTEC*   1  319.4 9.9  3,868
  HSHREC   8  329. 3.  29,967
  RACER    8  16.9 1.6  583,389
  Reptile  1  45.1 132. 4.2 2.6  55,686
  SHREC    8  113.7 17.  86,691
  * 8812148 reads, 12bp fixed len

Human Chr21 (13,486,136 reads)
  Blue     8  6.6 6.5 2.1 2.6  1,31,576
  Coral    8  132.5 11.  11,782
  HiTEC    1  414. 12.  32,575
  RACER    8  15.8 2.6  853,553
  Reptile  1  54.1 194.7 3.6 4.2  54,198
  SHREC    8  113.4 21.  118,978

E. coli DH10B (13,51,484 reads)
  Blue     8  6.7 9.3 1.3.3  815,548
  BLESS    1  94.3.1  138,44
  Coral    8  1223. 12.  1,672
  HiTEC    1  51. 9.7  25,591
  RACER    8  21.9 1.2  595,958
  Reptile  1  59. 15.4 3.6 3.  62,328
  SHREC    8  146.8 21.  88,97

Is correction worthwhile?

- Does correcting reads as part of a pipeline give you better results?
  - Longer, more accurate contigs?
  - More accurate SNP calling?
- Testing error correction using real data and real downstream tools
  - Find sequence datasets that have matching reference genomes?
    - Never going to be 100% identical, but can be close enough
    - Extraordinarily hard to find in practice
  - Need to use the entire dataset, not just a better-behaved subset
    - Common to avoid multi-mapped reads in accuracy testing!
- Align corrected reads and count matches
- Assemble and check contigs for differences, and compare (misleading) assembly stats
- Check alignments in difficult (repetitive) regions

Accuracy test alignments

[Figure: two stacked-bar panels, "E. coli K12 DH10B alignments" and "Human Chr 21 alignments". Each bar splits the reads for one dataset into classes by number of mismatches, plus unaligned and discarded reads.]
Bar labels, E. coli: 74.8%, 99.5%, 99.9%, 99.9%, 98.2%, 85.7%, 96.9%, 88.%, 97.1%, 94.9%
Bar labels, Human Chr 21: 69.%, 83.3%, 85.2%, 85.5%, 8.%, 81.7%, 75.%, 81.6%, 8.6%

Alignments against reference using Bowtie2, counting the number of mismatches per read.

Accuracy test assembly

[Figure: two panels, "E. coli K12 DH10B assembly" and "Pseudomonas assembly". Each plots N50, max aligned region (contig length axis) and broken CDS (breakages axis) for Synth, Original, Blue, Blue g8%, Blue g9%, Bless, Coral, Echo (Pseudomonas only), HiTEC, RACER, Reptile and SHREC.]

Metrics:
- N50 (a bit like a median length)
- Max aligned region (max contig can include chimeric contigs)
- CDS that contain mismatches against reference (Mauve)
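For readers unfamiliar with the contig-length metric, here is a minimal sketch, assuming the slide's metric is the standard N50: the length of the contig at which the running total of lengths, taken longest first, reaches half of the assembly size. The function name `n50` is illustrative.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no contigs at all


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

This is why it behaves "a bit like a median": it is a median weighted by contig length, so a single long chimeric contig can inflate it, which is the reason the slide also reports the maximum aligned region.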

Assembly accuracy: miscall density

[Figure: density of miscalls (including real differences, comparison errors and assembly artefacts). Mauve miscalls per 1kbp region plotted along the E. coli MG1655 genome, one track per dataset:]
- Synth (36)
- Original (385)
- Blue (36)
- BlueGood (8%) (33)
- BlueGood (9%) (29)
- Bless (41)
- Coral (355)
- HiTEC (48)
- RACER (31)
- Reptile (171)
- SHREC (7)
- 454 + Illumina (493)
- BlueG9+454x2Ig (16)
- 454Coral + IlluminaCoral (428)
- 454 + IlluminaHiTEC (58)

E. coli rhsD alignments

- rhsD: 95bp repeated pseudogene region (with somewhat divergent margins)
- Extracted corresponding contigs and BLASTed them onto the reference

[Figure: BLAST alignment panels for Synthetic data, Blue 454+Illumina, Blue, Blue Good, Original, HiTEC, Coral, Reptile and SHREC]

Is error correction worthwhile?

- Good error correctors fix almost all fixable errors
  - Sequencing artefacts may still be present but can be discarded
  - 99.9% of reads can be aligned with no mismatches
  - And some of the mismatches may be real differences
- Does this improve SNP calling?
  - Nice little paper waiting to be written here
- Greatly improved assemblies (from Velvet)
  - Far fewer errors in generated contigs
  - Longer contigs (limited by non-spannable repeats)
  - Results coming from non-corrected data are a bit disturbing

Blue 1.3

- Reworked/optimised hash table code
  - ~2% performance improvement
- Extensive parallelism changes in Tessel (the k-mer tiler)
  - ~3% performance improvement
- trim, extend options
  - Keeping as much of each read as possible
  - Grow reads where this is safe (unambiguous extension path)
- Handling pairs of files as pairs in multi-pair read sets
- IO improvement (as Blue performance is often IO sensitive)
- Adopting virtual rewrite optimisations from Pup
  - A GxG comparison program based on ideas from Blue
- Fix for the only reported bug (FASTQ qual determination)

Blue 2.0

- Requests for cross-correcting PacBio data
  - Blue was built to correct the occasional error in mostly good reads
  - PacBio is mostly very poor data with the occasional good patch
- Blue changes so far
  - Turn off protection against rewriting reads
  - Allow correction of very poor/unbalanced/deep reads
  - May need further performance optimisation (cf. Pup)

[Figures: k-mer repetition depth along long example reads]

Correcting metagenomic datasets

- Each (common) organism has its own k-mer consensus set
- Needs dynamic (per-read) good/bad threshold setting
- Blue works quite well in practice
  - Longer contigs, better metrics from normal Velvet
  - Even when coverage peaks are not distinct
- Used for several published draft genomes
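One way to picture a per-read threshold is to scale it to the read's own typical k-mer depth, so that reads from abundant and rare organisms are each judged against their own coverage. The rule below (a fixed fraction of the median depth, with a floor) is purely an assumption for illustration and is not taken from Blue; the names and default values are hypothetical.

```python
from statistics import median


def per_read_threshold(profile, fraction=0.25, floor=2):
    """Good/bad cut-off derived from this read's own k-mer depth profile.

    Assumption: a fixed fraction of the median depth, never below `floor`.
    """
    if not profile:
        return floor
    return max(floor, int(median(profile) * fraction))


def flag_bad(profile, fraction=0.25, floor=2):
    """True for each k-mer whose depth falls below the per-read threshold."""
    threshold = per_read_threshold(profile, fraction, floor)
    return [depth < threshold for depth in profile]


if __name__ == "__main__":
    abundant = [40, 42, 41, 1, 1, 39]  # read from a high-coverage organism
    rare = [6, 5, 6, 1, 6]             # read from a low-coverage organism
    print(per_read_threshold(abundant), flag_bad(abundant))
    print(per_read_threshold(rare), flag_bad(rare))
```

A single global threshold tuned for the abundant organism would mark every k-mer of the rare organism's read as bad, while the per-read rule flags only the depth-1 k-mers in both cases.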

Thank you

Paul Greenfield
Oceans and Atmosphere Flagship
t +61 2 9325 325
e paul.greenfield@csiro.au
w www.csiro.au/oa

- Paul Greenfield (O&A)
- Denis Bauer (Biosecurity)
- Konsta Duesing (F&N)
- Alexie Papanicolaou (UWS)
- David Midgley (Energy)
- Nai Tran Dinh (F&N)

Supported by David Lovell & CSIRO Transformational Biology