Recharacterization of barcodes and common priming sites
Re-characterization of the Yeast Barcode Deletion Collection.
The Yeast Deletion Collection was designed such that each deleted ORF is replaced by a kanamycin (G418) resistance cassette flanked by two nucleotide barcode sequences referred to as the uptag and the downtag barcodes. These barcodes are in turn flanked by priming sites common to all uptags and downtags (SI4). In our microarray-based analysis, all uptag or downtag barcode sequences are amplified and labeled in two separate reactions, then hybridized to the barcode array, which contains probes complementary to the designed barcodes (1, 2). The relative signal obtained from hybridization analysis reflects strain abundance. Variations in the barcode sequences from the expected designed sequence may affect hybridization efficiency, resulting in loss of signal associated with the barcode. Except in cases where there is complete signal loss, this effect can be accounted for when analysis depends on calculating changes in signal intensity relative to a control chip or a set of control chips.
Bar-seq analysis involves high-throughput sequencing and counting the number of occurrences of each barcode sequence within the amplified pool. Variation from the designed sequence will have a more profound effect on Bar-seq analysis than on hybridization because barcodes can be misidentified. For example, a barcode with a single substitution could still hybridize, while in Bar-seq it might be incorrectly assigned. This effect can be partially accounted for by allowing a degree of sequence variation in the Bar-seq analysis to better identify counted sequences, but a more robust approach is to re-characterize the barcode sequences in the deletion pool by re-sequencing all the barcodes present within the pool.
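The idea of allowing a degree of sequence variation when identifying Bar-seq reads can be illustrated with a minimal sketch. It is not the pipeline used here: it assumes substitution-only variation (Hamming distance), a hypothetical dictionary of tag identifiers to designed sequences, and made-up barcode sequences. The names hamming, assign_read and max_mismatches are illustrative.

```python
def hamming(a, b):
    """Count mismatched positions; return None when lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def assign_read(read, designed, max_mismatches=1):
    """Return the tag id of the single closest designed barcode within
    max_mismatches substitutions, or None if there is no hit or a tie."""
    hits = []
    for tag_id, seq in designed.items():
        d = hamming(read, seq)
        if d is not None and d <= max_mismatches:
            hits.append((d, tag_id))
    if not hits:
        return None
    hits.sort()
    # A tie between the two best hits is treated as ambiguous (assumption).
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None
    return hits[0][1]


# Made-up barcodes, for illustration only.
designed = {
    "tagA": "ACGTACGTACGTACGTACGT",
    "tagB": "TTGCAAGGCTTACCGGATAT",
}
print(assign_read("ACGTACGTACCTACGTACGT", designed))  # tagA (one substitution)
```

Tolerant matching of this kind cannot recover barcodes carrying insertions or deletions without a more general distance, which is one reason re-sequencing the pool is the more robust option.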
The complete barcode collection as contained in the Invitrogen 6000 pool was analyzed on the Illumina/Solexa sequencing platform to re-characterize each barcode sequence as well as the common priming sites associated with each ORF deletion. Yeast genomic DNA was fragmented and adapters were ligated to the ends. Fragments encompassing the uptag or downtag barcode were PCR amplified for sequencing using one of the adapter primers together with a second primer directed against the common U2 or D2 priming site, respectively. This resulted in amplification of a region containing the barcode, the U1/D1 common priming sites and a stretch of upstream/downstream genomic DNA of variable length. Paired-end sequencing was used to generate a 40bp sequence for the genomic end and the barcode end of each amplified fragment (SI4, Paired-End Sequencing).
The genomic-end sequences were mapped to yeast genomic DNA using MAQ software (3). All fragments with an identical mapped position were grouped together to define a single distinct fragment. For each distinct fragment, a single consensus sequence (CONS:EMBOSS) was determined from all corresponding barcode-end sequences. Distinct fragments were then grouped by assessing the similarity between barcode-end sequences and combining those with 3 or fewer base-pair variations. A single consensus sequence was then determined for each set of barcode-end sequences.
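The grouping and consensus steps described above can be sketched as follows. This is a simplified illustration, not the MAQ/EMBOSS workflow itself: it assumes the reads are already mapped, that barcode-end sequences are of equal length and need no alignment, that a column-wise majority vote stands in for the EMBOSS consensus, and that fragments are merged greedily by mismatch count. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Column-wise majority base; assumes equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def group_fragments(read_pairs, max_variation=3):
    """read_pairs: iterable of (mapped_position, barcode_end_sequence).

    Step 1: reads sharing a mapped position form one distinct fragment.
    Step 2: fragments whose consensus sequences differ by at most
            max_variation bases are merged into one fragment group.
    Returns a list of (positions, group_consensus) tuples.
    """
    by_position = defaultdict(list)
    for position, barcode_end in read_pairs:
        by_position[position].append(barcode_end)
    fragments = {pos: consensus(seqs) for pos, seqs in by_position.items()}

    groups = []
    for pos in sorted(fragments):
        seq = fragments[pos]
        for group in groups:
            # Greedy merge against the first member of each group (assumption).
            if mismatches(seq, group["seqs"][0]) <= max_variation:
                group["positions"].append(pos)
                group["seqs"].append(seq)
                break
        else:
            groups.append({"positions": [pos], "seqs": [seq]})
    return [(g["positions"], consensus(g["seqs"])) for g in groups]
```

Given reads at positions 100 and 250 that carry near-identical barcode ends and a read at position 900 with an unrelated barcode end, this returns two fragment groups, one spanning positions 100 and 250 and one for position 900.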
Each fragment group was assigned to an ORF deletion based on proximity to the mid-point of the genomic positions of all genomic ends in the group. The consensus sequence for the barcode-end was aligned to the designed tag sequences associated with this deletion using EMBOSS:Needle. The alignment was characterized as either a perfect match (PM), a mismatch (insertions, deletions, substitutions) of 1-3 bases (MM) or an extended mismatch (XM). PM and MM were accepted as identifying the correct ORF deletion. Any consensus resulting in an extended mismatch was further analyzed by comparison to all designed barcodes in the collection to determine if a better match could be found. If a better match was found, the fragment group was reassigned to the ORF deletion with the more similar barcode sequence.
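The PM/MM/XM classification and the fallback search can be sketched as below. This is an illustration under stated assumptions rather than the EMBOSS:Needle procedure: Levenshtein edit distance stands in for the alignment, XM is taken to mean more than 3 edits, the barcode is assumed to be already extracted from the consensus, and reassignment only happens when the best alternative is itself a PM or MM. The ORF labels and sequences are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def classify(consensus, designed_seq, max_mismatch=3):
    """Return 'PM', 'MM' (1..max_mismatch edits) or 'XM' (more edits)."""
    d = edit_distance(consensus, designed_seq)
    if d == 0:
        return "PM"
    return "MM" if d <= max_mismatch else "XM"


def assign(consensus, nearest_orf, designed):
    """designed: dict of ORF label -> designed barcode (hypothetical layout).

    Accept the positionally nearest ORF when the match is PM or MM;
    otherwise search every designed barcode for a better match.
    """
    label = classify(consensus, designed[nearest_orf])
    if label != "XM":
        return nearest_orf, label
    best = min(designed, key=lambda orf: edit_distance(consensus, designed[orf]))
    best_label = classify(consensus, designed[best])
    if best_label != "XM":
        return best, best_label
    return nearest_orf, "XM"


designed = {
    "ORF_A": "ACGTACGTACGTACGTACGT",
    "ORF_B": "TTGCAAGGCTTACCGGATAT",
}
print(assign("ACGTACGTACGACGTACGT", "ORF_A", designed))   # ('ORF_A', 'MM')
print(assign("TTGCAAGGCTTACCGGATAT", "ORF_A", designed))  # ('ORF_B', 'PM')
```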
For those few cases where barcodes were multiply assigned, the complete ORF deletion collection was reviewed to assess the characterized barcodes. The most likely barcode-to-ORF deletion assignment was determined based on the quality of the alignments. Barcodes not found during pool sequencing or with extensive mismatches to a barcode-end consensus were also flagged.
A second set of sequencing reactions was performed to capture the U2/D2 common priming sites within each deletion, as well as a portion of the barcode sequence (18nt for the uptags, 12nt for the downtags). This was a single-end read on product amplified from the genomic DNA isolated from the Invitrogen 6000 pool using primers directed against a common region of the kanamycin cassette to the common priming site (SI4, Single End Sequencing). Each unique sequence was compared to all designed barcode sequences to find the best alignment with no more than 3 variations. The single-end sequence was used to add support to the barcode characterization derived from the paired-end reads. In cases where no paired-end sequence was associated with an ORF deletion, the single-end sequence characterization was used instead, with the length of this characterization limited to 18nt for the uptag barcodes and 12nt for the downtag barcodes.
In addition to assessment of the barcode sequence, the U1/D1 priming sequences were characterized within each identified barcode-end consensus sequence, and the U2/D2 priming sequences within each identified single-end read, using EMBOSS:Needle. Priming sites were classified either as a perfect match (PM) to the designed common priming site or by the degree of variation identified.
Re-characterization Analysis
Details of the barcode assessment are in the file barcode_characterization.tab, which lists each deletion strain (designated by orf: batch, to indicate ORF deletions designed with different barcode combinations), followed by a flag indicating the presence of either an uptag or downtag. Barcode identifiers (tagids) that had been associated with a given strain are indicated for each, along with the tagid identified in the pool and the barcode associated with this tagid. The re-characterization contributes 5 columns:
1. Source: The source for the recharacterization
- R1Kan: both the paired-end (Read1) and single-end (Kan) reads
- R1: only the paired-end read (no single-end confirmation)
- Kan: only the single-end read used (not found in Read1 data)
- Multiple: multiple characterizations identified; no re-characterization was performed
2. Match: How well the found sequence matches the designed barcode (PM=Perfect match, MMx=Mismatch of x bases, ND=No data for this barcode)
3. Alterations: Classification of alterations (PM=perfect match, Sx=substitution, Ix=insertion, Dx=deletion, where x=number of bases)
4. Revision from designed barcodes: - = deletion, lowercase=insertion/substitution
5. Revised barcode: The revised barcode used for analysis
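The Match and Alterations codes listed above can be decoded with a short sketch. The exact layout of barcode_characterization.tab is not specified here, so the way multiple alteration codes are combined in one field (concatenated as in "S1D2", or separated by commas) is an assumption, and the function names are hypothetical.

```python
import re


def parse_match(field):
    """Decode the Match column: 'PM' -> 0, 'MMx' -> x, 'ND' -> None."""
    if field == "PM":
        return 0
    if field == "ND":
        return None
    m = re.fullmatch(r"MM(\d+)", field)
    if m is None:
        raise ValueError(f"unrecognized Match value: {field!r}")
    return int(m.group(1))


def parse_alterations(field):
    """Decode the Alterations column into base counts per alteration type.

    Assumes codes such as 'S1', 'I2', 'D1' may be concatenated or
    comma-separated within one field (an assumption about the file).
    """
    counts = {"S": 0, "I": 0, "D": 0}
    if field == "PM":
        return counts
    for kind, n in re.findall(r"([SID])(\d+)", field):
        counts[kind] += int(n)
    return counts


print(parse_match("MM2"))         # 2
print(parse_alterations("S1D2"))  # {'S': 1, 'I': 0, 'D': 2}
```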
Statistics on the Barcode Recharacterization
Less than 5% of the uptag or downtag barcodes were not found in the datasets. Nearly 90% of both the uptag and downtag barcodes were found in the paired-end (PE) dataset, with approximately 75% confirmation from the single-end (SE) dataset. About 7% of each set of tags were identified in the single-end dataset only.
Of those found, greater than 80% matched the designed barcode perfectly. Looking across the two barcodes, nearly 97% had either an uptag or a downtag that matched the designed sequence perfectly. The remainder consisted of mismatches of 1-3 nucleotides, which were a combination of nucleotide deletions, insertions or substitutions compared to the designed barcodes. These are broken down further in SI6, to account for single or multiple nucleotide alterations. Details of these variations from the designed sequence can be found in the file barcode_characterization.tab.
Statistics on the Common Primer Site Re-characterization
The common primer sites that flank the uptag and downtag barcode sequences were characterized by comparison to the designed sequences. Common primer 1, on the genomic side of the barcode cassette (U1: TGTCCACGAGGTCTCT, D1: GGTGTCGGTCTCGTAG), was assessed in sequences associated with each ORF deletion in the barcode end of the paired-end analysis. Approximately 10% of the barcodes were not identified in this analysis and a corresponding number of common primer 1 sites could not be assessed. Common primer 2, on the kanamycin side of the barcode cassette (U2: CGTACGCTGCAGGTCGAC, D2: ATCGATGAATTCGAGCTCGTTTTC), was assessed in sequences associated with each ORF deletion in the single-end sequence analysis. Approximately 20% of the barcodes could not be accurately identified in this dataset and a corresponding number of common primer 2 sites could not be assessed. The higher number in the single-end dataset is likely due to the shorter read length into the barcode (18nt for uptag, 12nt for downtag).
Of the common priming sites identified in each dataset, approximately 75-80% matched the designed sequence. This analysis is detailed in SI7. The alterations for each barcode are provided in the file recharacterized_primingsites.tab.
Re-characterized Tags and BarSeq Analysis
A total of 2042 barcodes from the collection of 11815 barcodes utilized in the yeast ORF deletion collection have been re-characterized. These have been provided in the file recharacterizedtags.tab, which is meant to supplement the file designedtags.tab. Each is identified by the designed sequence from which it is derived.
We have utilized these tags for analysis of Bar-seq data obtained from studies using both the Invitrogen yeast deletion pool and our own yeast deletion pool, the original source of the Invitrogen 6000 pool. Each sequence in the data file is identified by a complete match to either a designed sequence or a re-characterized sequence. The re-characterized sequences are those that differ from the designed barcode sequence. Sequences that match a re-characterized barcode that is a substring of a designed sequence, due to an end-deletion, are identified as the designed sequence. Sequence counts for each tag are extracted from the data files and annotated with the uptag or downtag ORF deletion to which the barcode had been assigned. Sequences that do not match a barcode in the designed or re-characterized collection are ignored; the majority of these occur with low counts and are likely due to sequencing errors.
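The counting step described above amounts to an exact-match lookup against the union of designed and re-characterized sequences. A minimal sketch follows, assuming two hypothetical dictionaries keyed by tag identifier (one loaded from designedtags.tab, one from recharacterizedtags.tab; the file parsing is omitted because the column layout is not given here) and made-up sequences.

```python
from collections import Counter


def build_lookup(designed, recharacterized):
    """Map every accepted sequence to the tag id it identifies.

    designed:        tag id -> designed barcode sequence
    recharacterized: tag id -> revised barcode sequence
    """
    lookup = {seq: tag_id for tag_id, seq in designed.items()}
    for tag_id, seq in recharacterized.items():
        # If a revised sequence collides with a designed one, the designed
        # assignment is kept (an assumption made for this sketch).
        lookup.setdefault(seq, tag_id)
    return lookup


def count_barcodes(reads, lookup):
    """Count reads per tag id; reads with no exact match are ignored."""
    counts = Counter()
    for read in reads:
        tag_id = lookup.get(read)
        if tag_id is not None:
            counts[tag_id] += 1
    return counts


# Made-up data: tagB was re-characterized with a one-base end-deletion.
designed = {"tagA": "ACGTACGTACGTACGTACGT", "tagB": "TTGCAAGGCTTACCGGATAT"}
recharacterized = {"tagB": "TTGCAAGGCTTACCGGATA"}
reads = ["ACGTACGTACGTACGTACGT", "TTGCAAGGCTTACCGGATA",
         "TTGCAAGGCTTACCGGATA", "GGGG"]
print(dict(count_barcodes(reads, build_lookup(designed, recharacterized))))
# {'tagA': 1, 'tagB': 2}
```

Exact-match lookup keeps counting fast and avoids the ambiguity of tolerant matching, at the cost of discarding reads that carry sequencing errors.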
Supplemental References
1. Pierce SE, Davis RW, Nislow C, & Giaever G (2007) Genome-wide analysis of barcoded Saccharomyces cerevisiae gene-deletion mutants in pooled cultures. Nature protocols 2(11):2958-2974.
2. Pierce SE, et al. (2006) A unique and universal molecular barcode array. Nature methods 3(8):601-603.
3. Li H, Ruan J, & Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 18(11):1851-1858.
4. Eason RG, et al. (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. Proceedings of the National Academy of Sciences of the United States of America 101(30):11046-11051.
5. Shoemaker DD, Lashkari DA, Morris D, Mittmann M, & Davis RW (1996) Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nature genetics 14(4):450-456.
6. Bentley DR, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218):53-59.
7. Craig DW, et al. (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nature methods 5(10):887-893.
8. Parameswaran P, et al. (2007) A pyrosequencing-tailored nucleotide barcode design unveils opportunities for large-scale sample multiplexing. Nucleic acids research 35(19):e130.
9. Ju J, et al. (2006) Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proceedings of the National Academy of Sciences of the United States of America 103(52):19635-19640.