Analyzing, interpreting, and editing nucleotide sequence data are among the most challenging conceptual aspects of the DNA barcoding workflow. This supplement to the BOLD-SDP user documentation is intended to assist by providing instructors and students with an overview of dye terminator cycle sequencing, recommendations for submitting COI PCR fragments (amplicons) to a sequencing facility, guidance on how to interpret raw sequence data, and detailed instructions for assembling COI nucleotide sequences and linking them to specimen/sample records. Basic information on how to navigate BOLD-SDP and utilize its major features and functionalities can be found in the BOLD-SDP System Overview, video tutorials, and earlier sections of the Quick Start Guide. We encourage you to carefully review the information in these resources before attempting to perform the steps outlined in this supplement.
Sequence Assembly and Editing in BOLD-SDP
As noted above, the fluorescently labeled DNA fragments generated during dye terminator cycle sequencing migrate sequentially through a capillary according to their size (smallest to largest) and pass by a laser. Upon exposure to the laser, fragments terminated by ddATP emit green light, fragments terminated by ddTTP emit red light, fragments terminated by ddCTP emit blue light, and fragments terminated by ddGTP emit yellow light. The light signals are detected by the DNA sequencer, processed by a software program, and represented as a series of colored peaks in a trace file (yellow light signals emitted from DNA fragments terminated by ddGTP are represented as black peaks in the trace file to make them more readable on a white background).
The software also uses an algorithm to assign base calls (nucleotides) to each peak in the trace file and to compute a confidence or quality (Q) score for each base call. The quality score represents the level of confidence that a base call was made correctly. To compute quality scores, the algorithm examines several parameters associated with the peak shape and resolution at each position in the trace file. The resulting scores are logarithmically linked to error probabilities according to the following equation:
Q = -10 log10 P
where P represents the probability of an incorrect base call
Based on this equation, a quality score of 20 indicates that the probability of an incorrect base call is 1 in 100, whereas a quality score of 40 indicates that the probability of an incorrect base call is 1 in 10,000. Generally speaking, base calls with quality scores less than 20 are considered unacceptable and must be edited.
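The relationship between Q and P can be checked numerically. The short Python sketch below simply applies the equation above in both directions; the function names are our own and are not part of BOLD-SDP or any sequencing software.

```python
import math


def error_probability(q):
    """Convert a quality score Q into the probability P of an incorrect base call."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Convert an error probability P (0 < P <= 1) into a quality score Q."""
    return -10 * math.log10(p)


# Print the error probability that corresponds to a few common quality scores.
for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: P = {p:g} (about 1 in {round(1 / p):,})")
```

Because the scale is logarithmic, every increase of 10 in the quality score corresponds to a tenfold decrease in the probability of an incorrect base call.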
Step 1. Obtain/receive sequence data from facility and organize
Many DNA sequencing facilities send their clients email notifications when sequence data are available. The notification will usually contain a link to a facility-managed website where the trace files can be downloaded. A trace file may bear one of several file extensions (e.g. .ab1, .abi, or .scf), depending on the type of DNA sequencer and software used by the sequencing facility. Files with the .ab1 extension are the unedited output files created by the Applied Biosystems Sequencing Analysis Software, which is extensively used by DNA sequencing facilities.
For each COI DNA sample that you submit, you are likely to receive two trace files with the .ab1 file extension. One trace file corresponds to the forward sequencing reaction (which produced the unedited nucleotide sequence of the sense strand of the amplicon), while the other trace file corresponds to the reverse sequencing reaction (which produced the unedited sequence of the antisense strand). We recommend that for each class, you download and organize trace files into a single folder that your students can copy onto a computer desktop.
Step 2. Upload forward and reverse trace files to BOLD-SDP
Unedited trace files form an important part of a DNA barcode record as they aid in the verification of barcode sequences, which are typically assembled using the raw data contained in the forward and reverse trace files. Before assembling barcode sequences with the BOLD-SDP Sequence Editor, the corresponding trace files must first be linked to a specimen/sample through the Data Management Console of BOLD-SDP. Trace files and associated data are linked to the appropriate specimen/sample by following the steps outlined below. Please be advised that in order to link trace files to a particular specimen/sample, you must have already completed a New Specimen page for the specimen/sample in the Data Submission Console of BOLD-SDP. Consult the relevant section of the BOLD-SDP Brochure and Quick Start Guide for details.
a. Log in to the Main Student Console by clicking on the Student Console icon on the BOLD-SDP home page.
b. On the Main Student Console page, click the <Upload Traces> icon.
c. In Section A of the Upload Traces page, enter a Sample ID to connect your traces with the appropriate specimen/sample. If you forgot the Sample ID of the specimen/sample from which the trace files were generated, click the <Lookup> button to find its ID in your class record list.
d. In Section B, use the pull-down menus to select the Forward and Reverse PCR primers used to generate your COI amplicon. If necessary, consult your instructor for guidance in selecting the correct PCR primers.
e. In Section C, use the pull-down menus to select the Forward and Reverse Sequencing Primers used to generate your trace files. Next, click the <Choose File> buttons to select and upload the forward and reverse trace files from your computer. Be sure that the forward trace file appears in the top selection alongside the forward sequencing primer, and that the reverse trace file appears in the bottom selection alongside the reverse sequencing primer.
f. Be sure to select your name from the <Student Attribution> pull-down menu in the upper right-hand corner of the page so that you are credited with uploading trace files. If you are working with another student, click <Add Student> and select his/her name from the pull-down menu that appears. If necessary, this operation may be repeated to include additional student partners.
g. Click the <Submit> button to incorporate the information on this page into the specimen/sample record.
Step 3. Visualize trace files in the specimen/sample record
Before assembling barcode sequences, it is extremely important to verify that your trace files were incorporated into the correct specimen/sample record. To perform this task, follow each of the steps below:
a. Navigate to the Main Student Console page of BOLD-SDP.
b. In the right sidebar of the Main Console page, click the <View Data> icon.
c. On the Record List page for your class, locate the row for the specimen/sample that you linked to the recently uploaded trace files. Then click the Process ID link for the specimen/sample to open its sequence page in a new window.
d. The PCR primer names, sequencing primer names, and trace file names should appear in the Sequencing Runs pane. BOLD-SDP also assigns a quality designation (high, medium, low, or failed) to each of the trace files. These designations are based on the average quality scores of the base calls in each trace file. High quality trace files have a mean quality score >40, medium quality trace files have a mean quality score between 30 and 40, and low quality trace files have a mean quality score <30. Trace files with fewer than 10 base calls are designated as failed. Please be advised that it may take several hours for BOLD to assign quality designations to the trace files once they are uploaded. A low or failed designation usually indicates that a procedural error occurred in the lab prior to the submission. Sequencing facilities sometimes provide an explanation of the likely cause of the problem. Trace files with a low or failed designation cannot be used to assemble contigs using the BOLD Sequence Editor (see below for additional details).
e. To view and examine the trace files, select both check boxes that appear next to their filenames and then click the <View Trace Files> button.
f. The forward and reverse trace files are displayed in the top and bottom panes of the Trace Viewer page, respectively.
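The quality designations described in step d above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the code that BOLD-SDP runs. The function name is hypothetical, and treating a mean score of exactly 30 or exactly 40 as medium is our assumption, since the text does not say how those boundary values are handled.

```python
def quality_designation(quality_scores):
    """Classify a trace file from the quality scores of its base calls.

    Thresholds follow the text: fewer than 10 base calls is 'failed',
    mean > 40 is 'high', mean between 30 and 40 is 'medium', mean < 30 is 'low'.
    Assumption: boundary values of exactly 30 and 40 count as 'medium'.
    """
    if len(quality_scores) < 10:
        return "failed"
    mean_q = sum(quality_scores) / len(quality_scores)
    if mean_q > 40:
        return "high"
    if mean_q >= 30:
        return "medium"
    return "low"


# Made-up quality scores for four imaginary trace files.
print(quality_designation([48] * 650))
print(quality_designation([35] * 650))
print(quality_designation([18] * 650))
print(quality_designation([50] * 5))
```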
The Trace File Viewer displays quality values for individual base calls in the trace files using a histogram. The quality value for each base call can be determined by comparing the height of its shaded bar to the vertical scale that appears on the right-hand side of the trace file window. Continuous stretches of low quality base calls that appear on the 5’ and 3’ ends of each trace file are displayed in reduced opacity.
BOLD-SDP also computes quality statistics and displays them in tabular and graphical format above each trace file window. Of these statistical values, the mean and standard deviation are the most informative. The mean refers to the average quality score for the base calls in a given trace file. The standard deviation (Stdev) is a measure of how close the quality scores for the base calls are to the mean. A low standard deviation value indicates that the quality scores are clustered near the mean, whereas a high standard deviation value indicates that the quality scores are dispersed over a large range of values. Lower standard deviation values therefore indicate a greater level of consistency in the quality of base calls, which imparts a higher degree of confidence in the overall accuracy of the trace file. The frequency histograms that appear above each trace file window show the percentage of base calls that correspond to different quality scores (QV). The data displayed in these histograms provide an indication of the range of quality scores for the base calls.
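For readers who want to see how such statistics are derived, the following Python sketch computes a mean, a standard deviation, and a percentage histogram from a list of quality scores. It is an illustration only. The function name, the bin width of 10, and the use of the population standard deviation are assumptions, because the text does not state how BOLD-SDP bins the scores or which standard deviation formula it uses.

```python
from collections import Counter
from statistics import mean, pstdev


def quality_summary(quality_scores, bin_width=10):
    """Summarize the quality scores (QV) of one trace file.

    Returns the mean, the population standard deviation (an assumption), and
    a histogram mapping the lower edge of each bin to the percentage of base
    calls whose quality score falls in that bin.
    """
    counts = Counter((q // bin_width) * bin_width for q in quality_scores)
    total = len(quality_scores)
    histogram = {start: 100 * n / total for start, n in sorted(counts.items())}
    return {
        "mean": mean(quality_scores),
        "stdev": pstdev(quality_scores),
        "histogram": histogram,
    }


# Two made-up traces with the same mean but very different consistency.
consistent = [40, 41, 39, 40, 40, 41, 39, 40]
variable = [10, 70, 15, 65, 20, 60, 25, 55]
print(quality_summary(consistent))
print(quality_summary(variable))
```

Both example traces have a mean of 40, but the second has a much larger standard deviation, which is the situation the paragraph above warns about.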
The scroll bar at the bottom of each trace file window allows you to examine the sequences along their entire length. Even in the absence of quality values/scores for individual base calls in the trace files, their quality can be inferred from the resolution of their corresponding peaks. Notice that the peaks at the beginning (5’ end) of the forward trace file are broad, overlapping, and poorly resolved. This semi-transparent region of the trace file corresponds to low quality base calls, which correlate with high error probabilities.
As you scroll to the right, the un-shaded peaks appear sharp, well resolved, and non-overlapping. This region of the trace files corresponds to high quality base calls, which correlate with low error probabilities.
As you scroll even further to the right (toward the 3’ end of the trace files), the peaks become lower in amplitude and begin to broaden and overlap. This semi-transparent region of the trace file corresponds to low quality base calls.
The low quality base calls that appear at the beginning and end of a trace file arise from technical limitations of dye terminator sequencing, which are complicated and caused by a variety of different interacting factors associated with capillary electrophoresis and the underlying chemistry of this particular sequencing method. Regardless of their cause, low quality base calls must be eliminated from the sequence in order to preserve its overall accuracy.
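One simple way to locate the low quality stretches at the two ends of a trace is sketched below in Python. This is not the method used by BOLD-SDP, which the text does not describe. The function name, the threshold of 20, and the requirement of five consecutive acceptable base calls are all assumptions chosen for illustration.

```python
def trim_low_quality_ends(quality_scores, threshold=20, run_length=5):
    """Return (start, end) so that quality_scores[start:end] is the kept region.

    The kept region begins at the first run of `run_length` consecutive base
    calls with quality >= threshold and ends after the last such run.
    Returns (0, 0) if no such run exists.
    """
    n = len(quality_scores)
    good = [q >= threshold for q in quality_scores]

    # Scan from the 5' end for the first acceptable run.
    start = None
    for i in range(n - run_length + 1):
        if all(good[i:i + run_length]):
            start = i
            break
    if start is None:
        return (0, 0)

    # Scan from the 3' end for the last acceptable run.
    end = n
    for j in range(n, run_length - 1, -1):
        if all(good[j - run_length:j]):
            end = j
            break
    return (start, end)


# Made-up trace: three poor base calls, ten good ones, three poor ones.
scores = [5, 8, 12] + [45] * 10 + [15, 9, 4]
start, end = trim_low_quality_ends(scores)
print(start, end, scores[start:end])
```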
Step 4. Assembling COI barcode sequences from trace files. Important Note: Steps 4-6 must be completed in their entirety.
Although low quality or ambiguous base calls are normally found at the 5’ and 3’ ends of a trace file, they may also appear elsewhere in the sequence. If only a single trace file was generated for a given COI amplicon, it would be difficult or impossible to confidently determine the identity of a base call that is assigned a low quality score value (i.e. a value < 20). However, a second trace file contains duplicate data that can help determine its identity with a greater level of statistical certainty.
Bringing two trace files into register and displaying them in the same window enables a researcher to identify regions of agreement or disagreement in base calls. In cases where a low quality base call is found in one trace file, the researcher can find the position of the base call in the other trace file and compare the differences in quality scores. If a higher quality score value (>20) is assigned to the base call in the second trace file, then that base call is regarded as the correct nucleotide and accepted.
The algorithm that operates within the BOLD-SDP Sequence Editor (and the Trace File Viewer described above) automatically reverses the sequence of base calls and peaks in the reverse trace file (which corresponds to the sequence of the antisense strand) so that they read in the opposite direction. It then converts each base call to its complementary nucleotide and re-colors the corresponding peaks accordingly. For example, a base call of <T> that appears at the first position of the trace file above a red peak is replaced with a base call of <A> and moved to the last position above a green peak of the same shape and height as the original red peak. The confidence score assigned to the original base call is also shifted to the last position. Next, the program aligns this sequence of complementary base calls and appropriately re-colored peaks with the unaltered sequence of base calls and peaks of the forward trace file (which corresponds to the sequence of the sense strand). The largely overlapping DNA sequences are displayed in a project window of the BOLD-SDP Sequence Editor as shown below.
The forward trace file appears in the top pane of the assembly project window along with its sequence of base calls. The reverse complement of the reverse trace file appears in the lower pane along with its corresponding base calls. For both trace files, quality scores are represented graphically in the form of a histogram, where higher bars indicate higher quality scores and vice versa. The vertical scale on the right side of each trace file histogram displays the numerical quality values.
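The reverse-complement operation applied to the reverse trace file can be sketched in Python as follows. The function name is hypothetical, and the sketch handles only the base calls and their quality scores, not the peak data. It shows that a <T> at the first position becomes an <A> at the last position and that its quality score moves with it.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement_read(bases, quality_scores):
    """Reverse-complement a read and reverse its quality scores to match."""
    if len(bases) != len(quality_scores):
        raise ValueError("each base call needs exactly one quality score")
    rc_bases = "".join(COMPLEMENT[b] for b in reversed(bases.upper()))
    rc_quals = list(reversed(quality_scores))
    return rc_bases, rc_quals


# Made-up four-base read: the first base call is a low quality T.
bases, quals = reverse_complement_read("TACG", [12, 30, 45, 50])
print(bases, quals)
```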
The nucleotide sequence that appears at the bottom of the project window represents a contig – a continuous nucleotide sequence assembled from two overlapping DNA sequences (in this case, from the forward trace file and the reverse complement of the reverse trace file). The BOLD-SDP Sequence Editor compares the quality scores of base calls at every position of the trace files and accepts the base call with the higher quality score for inclusion in the contig. The bars that appear above each nucleotide in the contig are graphical representations of quality values, which are color-coded according to the legend that appears in the upper right hand corner of the window.
It is important to realize that the algorithm utilized by BOLD-SDP to make these comparisons is not perfect, so students must scan through the contig to ensure that no errors were made. This task is simplified by examining the quality scores that appear over each nucleotide in the contig. These scores represent the algorithm’s confidence that the correct base call was chosen for inclusion in the contig. Low quality scores flagged with orange or red bars require human inspection.
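The basic rule of accepting the base call with the higher quality score can be sketched in Python as shown below. This is a simplified illustration, not the BOLD-SDP algorithm. It assumes the two reads have already been aligned to the same length, with a <-> and a quality of 0 wherever one read has no data. The function name, the tie-breaking rule (the forward read wins), and the use of the chosen base call's score as the contig score are all assumptions.

```python
GAP = "-"


def build_contig(fwd_bases, fwd_quals, rev_bases, rev_quals):
    """Build a contig from two aligned reads, keeping the higher quality call.

    `rev_bases` and `rev_quals` must already be reverse-complemented and
    aligned to the forward read. Ties go to the forward read (an assumption).
    """
    lengths = {len(fwd_bases), len(fwd_quals), len(rev_bases), len(rev_quals)}
    if len(lengths) != 1:
        raise ValueError("aligned reads and quality lists must have equal length")

    contig_bases, contig_quals = [], []
    for fb, fq, rb, rq in zip(fwd_bases, fwd_quals, rev_bases, rev_quals):
        if fb == GAP or (rb != GAP and rq > fq):
            base, qual = rb, rq
        else:
            base, qual = fb, fq
        contig_bases.append(base)
        contig_quals.append(qual)
    return "".join(contig_bases), contig_quals


# Made-up alignment: the reads disagree at the fourth position (T vs. A).
contig, quals = build_contig(
    "ACGT-", [40, 10, 35, 50, 0],
    "-CGAT", [0, 45, 35, 20, 42],
)
print(contig, quals)
```

Positions where the chosen quality score is still below 20 are the ones that require human inspection.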
To assemble contigs for your forward and reverse trace files, please follow the steps outlined below:
a. Log in to the Main Student Console by clicking on the Student Console icon on the BOLD-SDP home page.
b. On the Main Student Console page, click the <Add Sequence> icon.
c. In Section A of the Add Sequence page, enter a Sample ID that corresponds to the trace files that you wish to assemble and edit. If you forgot the Sample ID of the specimen/sample, click the <Lookup> button to find its ID in your class record list.
d. In Section B of the Add Sequence page, click the <BOLD Sequence Editor> button to load the trace files associated with the specimen/sample into the BOLD-SDP Sequence Editor. If you are unable to assemble a contig using the BOLD-SDP Sequence Editor, refer to the last section of this document for guidance on how to proceed.
e. The BOLD-SDP Sequence Editor simplifies the editing process by automatically eliminating continuous stretches of low quality base calls from the contig. It is important to realize that although these base calls are not included in the contig, they are still displayed in the forward and reverse trace files in reduced opacity. The scroll bar at the bottom of the assembly project window allows you to examine the trace files and contig along their entire length (moving from 5’ to 3’).
In the BOLD-SDP Sequence Editor, start by scanning the entire length of the assembly to identify low quality bases, which are flagged with orange or red bars above the consensus sequence. Moving the mouse pointer over a base call in the trace files or consensus sequence will highlight the alignment position and display the base calls and associated quality scores/values at the top of the editor. Clicking on a base will expose the editing tool, which enables a base call to be revised, deleted, or made ambiguous. This operation is performed by selecting one of the six options in the pull-down menu (e.g. by selecting an <A>, <T>, <C>, <G>, <N>, or <->).
f. The first step in the editing process is to carefully inspect the color of the bars that appear over each nucleotide in the contig, starting from the 5’ end (left side). The bars are graphical representations of quality values, which are color-coded according to the legend that appears in the upper right-hand corner of the window. It is important to watch for quality scores < 20 (which are indicated by orange and red bars). For assembly projects performed with higher quality trace files, orange or red bars are likely to appear only at the beginning and end of each contig.
If you discover a red bar at the beginning (5’ end) of the contig, highlight the nucleotide that appears beneath it with your mouse. Notice that the corresponding base call in each trace file is also highlighted. Next, carefully inspect the quality scores for the corresponding base calls in both trace files. If the quality score for the base call is >20 in at least one trace file, the nucleotide in the contig can be regarded as correct. However, if the quality score for the base call in both trace files is < 20, then the corresponding nucleotide and all nucleotides that precede it (to the left) in the contig must be changed to an <N> using the Edit Base pull-down menu. BOLD-SDP will automatically eliminate these base calls during the final processing of your sequence.
Once this operation is performed, click and drag the scroll bar at the bottom of the project assembly window to scan for additional orange or red bars above the contig. If the quality of the trace files is high, then orange or red bars signifying low quality scores will only appear over nucleotides at the extreme 3’ end of the contig. If you discover a nucleotide with an orange or red bar in this region of the contig, click the nucleotide that appears beneath it with your mouse and follow the steps outlined above. If a nucleotide requires deletion, you must change that nucleotide and all others that follow it (to the right) in the contig to an <N> using the Edit Base pull-down menu.
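The manual editing rule described in step f can be summarized in a short Python sketch. It is meant only to make the rule concrete and does not reproduce anything that BOLD-SDP does automatically. The function name and the `end_window` parameter (how many positions at each end count as "the beginning" or "the end" of the contig) are assumptions, as is passing a quality of 0 for positions that a trace file does not cover.

```python
def mask_contig_ends(contig, fwd_quals, rev_quals, threshold=20, end_window=50):
    """Replace unreliable nucleotides at both ends of a contig with N.

    A position is unreliable when the quality score is below `threshold` in
    BOTH trace files. At the 5' end, the unreliable nucleotide and everything
    to its left become N; at the 3' end, the unreliable nucleotide and
    everything to its right become N.
    """
    n = len(contig)
    low = [f < threshold and r < threshold for f, r in zip(fwd_quals, rev_quals)]

    # 5' end: mask through the last unreliable position inside the window.
    five_prime = [i for i in range(min(end_window, n)) if low[i]]
    mask_until = five_prime[-1] + 1 if five_prime else 0

    # 3' end: mask from the first unreliable position inside the window.
    three_prime = [i for i in range(max(n - end_window, mask_until), n) if low[i]]
    mask_from = three_prime[0] if three_prime else n

    return "N" * mask_until + contig[mask_until:mask_from] + "N" * (n - mask_from)


# Made-up ten-base contig with a tiny end window so the effect is visible.
edited = mask_contig_ends(
    "ACGTACGTAC",
    [5, 30, 10, 40, 40, 40, 40, 40, 10, 5],
    [8, 10, 12, 45, 45, 45, 45, 15, 30, 9],
    end_window=4,
)
print(edited)
```

Note that positions where only one of the two trace files has a low quality score are left unchanged, in keeping with the rule that one reliable base call is sufficient.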
Instructors should be aware that lower quality trace files invariably present a variety of challenging scenarios during sequence assembly and editing. For instance, low quality base calls will often occur at the same position in more central regions of the trace files. In these instances, deleting the corresponding nucleotide and all others that precede or follow it in the contig will result in a truncated sequence that does not meet the minimum barcode length of 500 nucleotides. A visual inspection of the peaks (and the environment surrounding the peaks) can sometimes provide insight into the likely identity of the base call, but these inspections are extremely subjective and require a modest degree of experience to interpret appropriately.
To simplify this segment of the barcoding workflow for students, sequence editing should only be performed on contigs assembled from two high quality trace files, or one high quality trace file and one medium quality trace file. The quality designation of each trace file can be found on the Trace Viewer page of BOLD-SDP as outlined above. In our experience, the engagement of students in editing contigs assembled from two medium quality trace files requires supervision by someone who is highly experienced in sequence analysis and editing procedures.
If you encounter difficulties in the sequence editing process, contact eBOL staff at the following URL for specific guidance:
http://www.educationandbarcoding.org/contact.php
Step 5. Inspecting contigs for the presence of STOP codons
COI is a mitochondrial gene that directs the production of a protein subunit vital for cellular respiration. All mitochondrial protein-coding genes terminate in a STOP codon, a nucleotide triplet that may take one of several forms depending on the taxon under study. During the process of transcription, the STOP codon of a protein-coding gene is transcribed into messenger RNA. At the conclusion of translation, the STOP codon binds a release factor, which signals the ribosome to dissociate and release the newly synthesized amino acid chain.
The ~650 bp region of the COI gene that you amplified by PCR is located upstream of the STOP codon found in the mitochondrial DNA template. Accordingly, STOP codons should be absent in your edited contig. The presence of a STOP codon indicates one of two likely possibilities: 1) a nucleotide was erroneously omitted in the contig, or 2) an extra nucleotide was erroneously included. Either possibility will require you to re-examine your contig for possible editing errors.
The BOLD-SDP Sequence Editor enables you to examine your sequence for the presence or absence of STOP codons. Because the COI barcode region that you amplified is also downstream of the START (ATG) codon found in the mitochondrial DNA template gene, the Auto Translator algorithm built into the BOLD-SDP Sequence Editor must first organize your contig into three reading frames. For reading frame 1, nucleotides are grouped into codons beginning with the first nucleotide in the contig. For reading frame 2, nucleotides are grouped into codons beginning with the second nucleotide in the contig (the first nucleotide is ignored). For reading frame 3, nucleotides are grouped into codons beginning with the third nucleotide in the contig (the first and second nucleotides are ignored). The translator then uses a translation matrix similar to a genetic code table used in classroom settings to determine the amino acid sequence of each reading frame. It then compares the three amino acid sequences to a database of known COI amino acid sequences to determine which reading frame is correct. The correct amino acid sequence is displayed at the bottom of the sequence editor project window.
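The grouping of a contig into three reading frames and the search for STOP codons can be sketched in Python as shown below. This is an illustration only and differs from the Auto Translator in an important way: it does not compare translations against a database of known COI sequences, it simply reports which frames are free of STOP codons. The function names are hypothetical. The sketch uses NCBI translation table 5 (invertebrate mitochondrial) as an assumption; the text does not state which table BOLD-SDP applies, and other taxa (e.g. vertebrates) use a different mitochondrial code.

```python
BASES = "TCAG"
# NCBI translation table 5 (invertebrate mitochondrial), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with bases in the order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(sequence, frame):
    """Translate `sequence` in reading frame 1, 2, or 3.

    Codons containing N or any other non-ACGT character are shown as X.
    STOP codons are shown as *.
    """
    seq = sequence.upper()
    start = frame - 1
    return "".join(
        CODON_TABLE.get(seq[pos:pos + 3], "X")
        for pos in range(start, len(seq) - 2, 3)
    )


def frames_without_stops(sequence):
    """Return the reading frames (1, 2, 3) whose translation has no STOP codon."""
    return [f for f in (1, 2, 3) if "*" not in translate(sequence, f)]


# Made-up nine-base example; a real COI contig is roughly 650 bases long.
example = "ATGTAAGCA"
for frame in (1, 2, 3):
    print(frame, translate(example, frame))
print(frames_without_stops(example))
```

In a correctly edited COI contig, the true reading frame should contain no asterisks along its entire length.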
a. The single letter amino acid code for the translated sequence appears below the contig sequence. The line that appears above each amino acid identifies the corresponding codon in the contig sequence. To scan the sequence for the presence of a STOP codon (indicated by a red asterisk <*>), click and drag the scroll bar at the bottom of the window. If a STOP codon is found in the amino acid sequence, then an editing error was made. To find the source of the editing error, repeat the steps outlined in the section above.
b. If no STOP codons were detected in the amino acid translation of the contig, then click the <Save> button in the toolbar to enter the contig into the Add Sequence window.
c. Next, instruct BOLD-SDP to automatically trim primer sequences from your edited sequence by clicking the <Trim Primers> button in Section C.
d. Once BOLD-SDP performs the trimming function, the <Trim Primers> button will be replaced by the <Check for Contaminant> button. Clicking this button will instruct BOLD-SDP to inspect your sequence for the presence of common lab contaminants, including human contaminants.
e. Once the contaminant check is passed, click the <Submit Sequence> button to link the edited and validated barcode sequence to your specimen/sample. Be sure to select your name from the <Student Attribution> pull-down menu in the upper right-hand corner of the page before submitting your sequence.
Step 6. Verify that the edited COI sequence was incorporated in the specimen/sample record
The final step in the editing process is to verify that the edited sequence was integrated into the barcode record of the appropriate specimen/sample. To perform this function:
a. Navigate to the Main Student Console page of BOLD-SDP.
b. In the right sidebar of the Main Student Console page, click the <View Data> icon.
c. On the Record List page, locate the row for the specimen/sample that you linked to the recently uploaded sequence.
d. Click the Process ID link for the specimen/sample to open its sequence page in a new window.
- The edited COI nucleotide sequence can be found in the Nucleotide Sequence pane along with associated data, including sequence length (in base pairs), sequence composition (e.g. the number of A’s, C’s, T’s, and G’s in the sequence), and the number of ambiguous characters or nucleotides (N’s).
- The amino acid translation and total number of amino acid residues encoded by your COI nucleotide sequence are located in the lower left pane of the Sequence page in the Amino Acid Sequence pane.
- The illustrative barcode that appears in the upper right hand corner of the Sequence page represents each nucleotide in your barcode sequence as a different colored line. A’s are represented with green lines, T’s with red lines, C’s with blue lines, and G’s with black lines.
e. To compare the barcode sequence in your record with other barcode sequences in the BOLD species database, click the <Species DB> button that appears at the bottom of the Nucleotide Sequence pane.
A Specimen Identification Request window will open that contains different forms of information.
- The Search Result pane that appears at the top of the page contains a summary statement of the search performed by the BOLD Identification System (BOLD-IDS), which is supported by the data displayed in other sections of the page.
- The Identification Summary Table shows the probability (expressed as %) that the specimen/sample belongs to the taxonomic groups listed in the middle column.
- The second table was generated by BOLD-IDS by comparing the COI nucleotide sequence of your specimen/sample with COI sequences of other specimens registered in the BOLD reference library. The percent nucleotide sequence similarity for the 20 closest matches is displayed along with the taxonomy for each match. Similar data for the top 100 matches is also displayed graphically on the page.
- The world map at the bottom of the page shows the collection site of specimens with COI sequences that are >98% similar to the COI barcode sequence of your specimen/sample.
Step 7. Review data in each specimen/sample record
The integration of trace files and edited COI sequences into a specimen/sample record completes the record assembly process in BOLD-SDP. If your project involves the creation of reference DNA barcode records, the information contained in each class record will be carefully reviewed by your instructor and submitted to scientific experts, who will review compliance with current barcode data standards. Required data standards for each specimen record minimally include:
- A species name assigned by an expert (or a provisional name)
- A unique specimen identifier
- A retrievable voucher specimen and the name of the institution that is storing the voucher
- A collection record containing the collector name, collection date, collection location, and geospatial coordinates
- A COI sequence of at least 500 nucleotides in length with fewer than 1% ambiguous base calls (N’s)
- The name of the PCR primers used to generate the COI amplicon
- The unedited trace files
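The sequence requirement in the list above (at least 500 nucleotides with fewer than 1% ambiguous base calls) is easy to check before a record is submitted for review. The Python sketch below is an informal aid, not part of BOLD-SDP. The function name is hypothetical, and counting N's toward the total sequence length is an assumption, because the standard as quoted here does not say whether they are included.

```python
def meets_sequence_standard(sequence, min_length=500, max_ambiguous_fraction=0.01):
    """Check a COI sequence against the length and ambiguity requirements.

    Returns True when the sequence has at least `min_length` nucleotides and
    the fraction of N's is strictly below `max_ambiguous_fraction`.
    Assumption: N's are counted as part of the total length.
    """
    seq = sequence.upper()
    length = len(seq)
    if length < min_length:
        return False
    return seq.count("N") / length < max_ambiguous_fraction


# Made-up sequences: 600 bases with 5 N's, 600 bases with 6 N's, and 400 bases.
print(meets_sequence_standard("A" * 595 + "N" * 5))
print(meets_sequence_standard("A" * 594 + "N" * 6))
print(meets_sequence_standard("ACGT" * 100))
```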
Student records that meet these standards are eligible for inclusion in the BOLD researcher workbench and for publication in GenBank. Maintained by the National Center for Biotechnology Information (NCBI), GenBank is a sequence database that contains an annotated collection of all publicly available nucleotide sequences and their protein translations.
The publication process generally takes between 6 and 8 weeks. During this time, a GenBank accession number will be assigned to your record and appear on the Sequence page of your barcode record(s) as an inactive link. The accession number is a unique identifier assigned by GenBank to your COI barcode sequence. Once your data is published in GenBank, the accession number on the Sequence page becomes an active link to GenBank. Clicking the number will retrieve your barcode record from the GenBank database.
Notice that the record contains the BARCODE designation in the KEYWORD row. GenBank records with this special designation incorporate data that appears in your specimen records (e.g. taxonomy, collection event details, nucleotide sequence information, etc.). Furthermore, the AUTHOR row lists the names of the students who contributed to the creation of the record.
Step 8. Analyze Barcode Data
Regardless of whether your project involves building the BOLD reference barcode library or using the reference library to determine the identity of an unknown sample, the Sequence Analysis Console of BOLD-SDP contains a suite of informatics tools that enables you to visualize and analyze your barcode data in powerful and informative ways. Once you complete the assembly of barcode records, we encourage you to consult the BOLD-SDP User Manual for information on how to access and utilize these tools.