BNFO301: Exam 1
1. List all the changes that can be produced by a single base pair mutation in the AGA codon encoding arginine and label the resulting amino acid. In addition, label each mutation as silent, missense, or nonsense. (4 pts)
2. What would be the value of using a dot plot to compare a sequence to its own reverse complement? (2 pts)
3. Sketch the dot plot of a 1 kb sequence in which a motif of approximately 50 consecutive bases appears six times in the N-terminal region of the sequence. (4 pts)
4. Use the PAM250 matrix to answer question 4.
a. Give the score for aligning two alanines (A). (1 pt)
b. Give the score for aligning two tryptophans (W) (1 pt)
c. Both of these alignments constitute “matches”, so why are the scores so different? (2 pts)
Use the BLOSUM62 matrix for questions 5 and 6.
5. Calculate the dynamic programming matrix and an optimal GLOBAL alignment for the protein sequences FKHMEDPLE and FMDTPLNE, scoring -2 for a gap (i.e., 2 is the gap penalty). Use the BLOSUM62 substitution matrix (given above).
a. Fill out the matrix. (6 pts)
b. Highlight the traceback alignment. (1 pt)
c. Write out the final alignment. (2 pts)
d. Score the final alignment. (1 pt)
6. Calculate the dynamic programming matrix and an optimal LOCAL alignment for the protein sequences FKHMEDPLE and FMDTPLNE. Use the BLOSUM62 matrix (provided above).
a. Fill out the matrix. (6 pts)
b. Highlight the traceback local alignments. (1 pt)
c. Write out the final alignment. (2 pts)
d. Score the final alignment. (1 pt)
7. What is 16S rRNA and what is its function inside a cell? (2 pts)
8. 16S rRNA is widely used in microbiome studies. List two strengths and two limitations of 16S rRNA sequencing. (4 pts)
9. Can 16S rRNA be used to classify viruses? Why or why not? (2 pts)
10. Which of the following amino acids is least mutable according to the PAM scoring matrix? (2 pts)
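The global and local alignments in questions 5 and 6 both fill a dynamic programming matrix and then trace back through it. A minimal Python sketch of that procedure follows. It is an illustration only: it assumes a toy match/mismatch score (+1/-1) in place of BLOSUM62, and the function name align and its parameters are hypothetical, not part of the course materials.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the DP matrix and trace back one optimal alignment.

    Assumption: a toy match/mismatch score stands in for BLOSUM62.
    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    def sub(i, j):
        # Substitution score for aligning seq_a[i-1] with seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + sub(i, j),  # align two residues
                       score[i - 1][j] + gap,            # gap in seq_b
                       score[i][j - 1] + gap)            # gap in seq_a
            if local:
                cell = max(cell, 0)  # local alignment never drops below zero
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # Global traceback starts in the bottom-right corner,
    # local traceback starts at the highest-scoring cell.
    i, j = best_pos if local else (rows - 1, cols - 1)
    final = score[i][j]
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq_a[i - 1]); bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq_b[j - 1])
            j -= 1
    return final, "".join(reversed(top)), "".join(reversed(bottom))


print(align("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Passing local=True switches the same function to a local alignment. Replacing the sub helper with a BLOSUM62 lookup would bring the sketch in line with the scoring the questions ask for.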
1. Which of the following sentences BEST describes the difference between a global
alignment and a local alignment between two sequences? (2 pts)
a. Global alignment is usually used for DNA sequences, while local alignment is usually used for protein sequences.
b. Global alignment has gaps, while local alignment does not have gaps.
c. Global alignment finds the global maximum, while local alignment finds the local maximum.
d. Global alignment aligns the whole sequence, while local alignment finds the best subsequence that aligns.
2. How does the BLOSUM scoring matrix differ most notably from the PAM scoring matrix?
a. It is best used for aligning very closely related proteins.
b. It is based on global multiple alignment from closely related proteins.
c. It is based on local multiple alignments from distantly related proteins.
d. It combines local and global alignment information.
3. A global alignment algorithm (such as Needleman-Wunsch algorithm) is guaranteed to
find an optimal alignment. Such an algorithm: (2 pts)
a. puts the two proteins being compared into a matrix and finds the optimal score by exhaustively searching every possible combination of alignments.
b. puts the two proteins being compared into a matrix and finds the optimal score by iterative recursions.
c. puts the two proteins being compared into a matrix and finds the optimal alignment by finding optimal subpaths that define the best alignment(s)
d. can be used for proteins but not for DNA sequences.
4. What are the basic concepts of library preparation? (4 pts)
5. List 3 applications of next-generation sequencing. (2 pts)
6. How many reads do you need to get 30x coverage of your genome if your read length is 300 bp and your genome size is 10 Mb? (2 pts)
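The read-count question rests on the relationship coverage = (number of reads × read length) / genome size. A small Python sketch of that arithmetic follows, assuming 10 Mb means 10,000,000 bp and that reads are spread evenly over the genome. The function name reads_for_coverage is illustrative.

```python
def reads_for_coverage(coverage, genome_size_bp, read_length_bp):
    """Smallest whole number of reads that reaches the requested coverage.

    Assumes uniform coverage: coverage = reads * read_length / genome_size.
    """
    total_bases = coverage * genome_size_bp
    # Integer ceiling division, so a partial read counts as one more read.
    return -(-total_bases // read_length_bp)


# Values from the question: 30x coverage, 10 Mb genome, 300 bp reads.
print(reads_for_coverage(30, 10_000_000, 300))  # 1000000
```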
Log in to Compile. Navigate to the bnfo301 directory (/home/bnfo301). There is a folder called exam1 where you will find all the files you need to answer the next set of questions.
Instructions for this section:
• Write your output files to your user-specific folder in /home/bnfo301 (e.g., my user-specific folder is /home/bnfo301/huangb2). You will be graded on the files found in your specific folder. If the files are not in that folder, you will not get credit for your answers. No exceptions.
• Make sure you name your output file as instructed in each question. I will take off 1 point for each output file that is not correctly named.
• Code is typically written using a fixed width font. Use a fixed width font to type your commands in this section (ex. courier, inconsolata, menlo, monaco).
• For each question, provide the command when specified, or the command and answer. All output files from this section should be written to your user-specific folder on Compile. I will access your user-specific folder to grade this section.
1. List the files in the exam1 folder. command only (2 pts)
2. Count how many sequences are in the protein-db.faa file. command and answer (2 pts)
3. You have an unknown1.faa sequence that you want to blast against sequences in the protein-db.faa file.
a. Copy the protein-db.faa to your user specific folder. command only
b. Create a blast database for protein-db.faa. command only (2 pts)
c. Blast unknown1.faa against the database you just created. Name your blast output file 3b-unknown-output.txt. command only, leave output file on Compile (2 pts)
d. Filter your blast results for hits with an e-value greater than 1e-05. Name your blast output file 3c-unknown-output.txt. command only, leave output file on Compile (2 pts)
e. What is the percent identity and alignment length of the best hit in your blast results when you filter based on an e-value greater than 1e-05? Hint: you may need to change your output format. (8 pts)
f. What is the percent identity and alignment length of the worst hit in your blast results when you filter based on an e-value greater than 1e-05? Hint: you may need to change your output format. (4 pts)
4. BLAST is a tool that can be used to query multiple databases. It is not always necessary to create your own database. One of the most common blast databases is the non-redundant database (nr).
a. Blast the unknown1.faa sequence against the nr database (/home/norrissw/bin/I-TASSER4.2/lib/nr/nr) to find out what it is. Name your blast output file 4a-unknown-nr-output.txt. NOTE: you do not need to run the makeblastdb command. Also, it can take a few minutes for your blast to run because the nr database is very big. command only, leave output file on Compile (2 pts)
b. Filter your blast results for hits with an e-value greater than 1e-10. Name your blast output file 4b-unknown-nr-output.txt. command only, leave output file on Compile (2 pts)
c. Based on the best hit from nr, take the accession number and identify what that protein is. (4 pts)
5. The next set of questions involves the pipeline.py script.
a. Copy the pipeline.py script to your /home/bnfo301/vcuid folder. (2 pts)
b. Rename the pipeline.py script to 5b-pipeline.py. (2 pts)
c. Describe in detail what the script is doing, including what the output from each step is. (4 pts)
d. Modify the script so it filters the blast results using an e-value cutoff of 1e-05. Save the modified script as 5d-pipeline.py. You do not need to run the script, just add in your modification. Leave output file on Compile (2 pts)
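The e-value filtering asked for above can be sketched in Python as follows. This is not the contents of pipeline.py, which are not shown here. It assumes the blast results are in the tab-separated tabular format (-outfmt 6) with its default column order, where the e-value is the 11th column, and it assumes the cutoff means keeping hits at or below the threshold. The function name filter_by_evalue and the example hits are hypothetical.

```python
def filter_by_evalue(lines, cutoff=1e-05, evalue_column=10):
    """Keep tabular BLAST hits whose e-value is at or below the cutoff.

    Assumptions: tab-separated -outfmt 6 lines with default columns,
    so the e-value sits at zero-based index 10.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[evalue_column]) <= cutoff:
            kept.append(line)
    return kept


# Two made-up hits, for illustration only.
hits = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t400",
    "query1\tsubjB\t35.0\t80\t52\t0\t1\t80\t5\t84\t0.5\t30",
]
for hit in filter_by_evalue(hits):
    print(hit)  # only the subjA hit passes the 1e-05 cutoff
```

Reversing the comparison would instead keep the hits above the threshold, which is how the earlier filtering questions are worded.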