Except where otherwise noted, this and all course materials for CS 112 are licensed under Attribution-NonCommercial-ShareAlike CC BY-NC-SA held by the Trustees of the University of Illinois (University of Illinois at Chicago).
On your computer, use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank. Once there, find a nucleotide sequence for the human coagulation factor IX (F9) gene; factor IX is sometimes called the “Christmas factor.” In other words, find a DNA sequence for the gene that encodes the coagulation factor IX protein. To find it, use the search area at the top of the GenBank web page. You are looking for a specific “accession” (a sequence submission record) with the accession ID NG_007994.
To summarize, we need to find the nucleotide sequence for the human coagulation factor IX (F9) gene. To do this, we enter NG_007994 in the search field.
Eukaryotic genes (like F9) are composed of messenger RNA (mRNA)-coding sequences called exons (expressed portions of the DNA sequence) and intervening sequences called introns (the name emphasizes their intervening role). The whole gene is first transcribed into pre-mRNA. Intron sequences in pre-mRNA are non-coding and are removed before the mRNA is translated into protein. The exons are then joined together (concatenated) to form mature mRNA. The process of removing introns and reconnecting exons is called ‘splicing.’ Mature mRNA consists of a coding sequence (CDS) and untranslated regions (UTRs) at the 5′ and 3′ ends. The coding sequence is the portion of mRNA that codes for amino acids, and it is made up of codons.
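As a toy illustration of splicing, the sketch below models a pre-mRNA as a list of labeled segments and joins only the exons. The sequences and the names `pre_mrna` and `splice` are invented for this example; they are not taken from the F9 gene or from the course template.

```python
# Toy pre-mRNA: (segment type, sequence) pairs. The sequences are made up
# for illustration and do not come from the F9 gene.
pre_mrna = [
    ("exon", "AUGGCC"),
    ("intron", "GUAAGUUUCAG"),
    ("exon", "UUUGGA"),
    ("intron", "GUGAGUCCUAG"),
    ("exon", "UAA"),
]


def splice(segments):
    """Drop the introns and concatenate the exons in their original order."""
    return "".join(seq for kind, seq in segments if kind == "exon")


mature_mrna = splice(pre_mrna)
print(mature_mrna)  # AUGGCCUUUGGAUAA
```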
The amino acid coding portions (CDS), along with other gene features, are annotated on the left side of the description in GenBank records. For example, you will see something similar to this in the annotations for the F9 gene:
    CDS join(5030..5117,11275..11438)
The actual line on the GenBank page will be much longer (i.e. containing more than just the ranges for two exons) but the first two ranges match exactly what is given above.
join in a GenBank record is analogous to a function in Python. It is an instruction to slice out and join (concatenate) the segments separated by commas within the parentheses. The resulting string represents the amino acid coding sequence (CDS). Assuming we have the entire F9 gene sequence stored in a variable F9, the example above could be written in Python as:
cds = F9[5029:5117] + F9[11274:11438]
Caution: Python indexes start at 0, but GenBank annotations start at 1. In addition, a GenBank range includes its end position, while a Python slice stops just before its end index, which is why only the start coordinate of each range changes. Notice how the coordinates differ between the GenBank record example and the Python code above. Failing to adjust indexes correctly is a common mistake in computer science, and the resulting bugs are known as off-by-one errors. While seemingly trivial, these errors may have serious consequences.
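The coordinate adjustment can be captured in a small helper. This is only a sketch: the names `genbank_range_to_slice` and `join_ranges` are hypothetical, and the demonstration uses a short toy string instead of the real F9 sequence.

```python
def genbank_range_to_slice(start, end):
    """Convert a 1-based, inclusive GenBank range to a Python slice."""
    # Only the start shifts by one: GenBank includes the end position,
    # while a Python slice stops just before its end index.
    return slice(start - 1, end)


def join_ranges(sequence, ranges):
    """Mimic GenBank's join(): cut out each range and concatenate the pieces."""
    return "".join(sequence[genbank_range_to_slice(s, e)] for s, e in ranges)


# Toy check: positions 2..4 are "BCD" and positions 7..8 are "GH".
toy = "ABCDEFGHIJ"
print(join_ranges(toy, [(2, 4), (7, 8)]))  # BCDGH
```

With this helper, the GenBank range 5030..5117 maps to the slice 5029:5117 used in the Python code above.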
1. Write a function extract_f9_cds, which has one parameter, F9, the F9 gene sequence. The goal of this function is to extract the coding regions from the F9 gene sequence (provided in the template), concatenate them, and return the resulting string. Hint: You can confirm your program is functioning correctly by clicking on the CDS annotation in GenBank. This will highlight the relevant parts of the sequence, which should match your output.
2. Write a function get_max_possible_codons, which has one parameter, seq, and returns the maximum number of codons this DNA sequence would contain if it were wholly composed of coding regions. Remember that each codon is made up of 3 nucleotide bases.
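One possible way to count whole codons is integer division of the sequence length by 3. This is a sketch, not the official solution; the handout does not say how leftover bases should be treated, so discarding them is an assumption.

```python
def get_max_possible_codons(seq):
    """Return how many complete 3-base codons fit in seq."""
    # Floor division discards 1 or 2 leftover bases (an assumption,
    # since an incomplete triplet is not a codon).
    return len(seq) // 3


print(get_max_possible_codons("ATGGCCTTTG"))  # 3 (10 bases, 1 left over)
```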
3. Write a function get_gc_percent, which has one parameter, seq. The goal of this function is to compute the proportion of G and C bases (characters) in seq to the total number of bases (characters) in seq. The returned value should be of type float in the range between 0.0 and 100.0 (as a percentage, not a fraction). To do this, use the string method count() to determine the number of ‘G’ bases and the number of ‘C’ bases.
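A minimal sketch of the GC calculation using count() follows. It assumes the sequence is written in uppercase letters, and it returns 0.0 for an empty sequence; the handout does not specify either case, so both are assumptions.

```python
def get_gc_percent(seq):
    """Return the percentage (0.0 to 100.0) of bases in seq that are G or C."""
    if len(seq) == 0:
        # Assumption: report 0.0 instead of dividing by zero.
        return 0.0
    # Assumption: bases are uppercase, so only 'G' and 'C' are counted.
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


print(get_gc_percent("ATGC"))  # 50.0
```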
4. Write a function get_coding_ratio, which has two parameters, seq and cds. The goal of this function is to calculate the proportion of coding nucleotides to total nucleotides in the entire sequence. In other words: of the total number of nucleotides in the gene (seq), what is the proportion that codes for amino acids (cds)? Remember that a ratio will be a value of type ‘float’ in the range between 0.0 and 1.0.
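The ratio can be sketched as one length divided by the other. Returning 0.0 for an empty seq is an assumption made here to avoid a division by zero; the handout does not cover that case.

```python
def get_coding_ratio(seq, cds):
    """Return the fraction (0.0 to 1.0) of nucleotides in seq that are coding."""
    if len(seq) == 0:
        # Assumption: an empty gene sequence yields a ratio of 0.0.
        return 0.0
    # In Python 3, / always produces a float, even for two integers.
    return len(cds) / len(seq)


print(get_coding_ratio("ATGCATGCAT", "ATGCA"))  # 0.5
```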
5. Write a function print_seq_info, which has two parameters, seq and cds. This function should use the functions you wrote for problems 1 through 4 and print a correctly formatted summary:
Sequence length: ...
Coding sequence length: ...
Number of possible codons: ...
Number of actual codons: ...
First 4 codons of the coding sequence: ...
Ratio of Coding NT to Total NT: ...
GC percent of the entire sequence: ...
GC percent of the coding sequence: ...
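Number of possible codons: use the get_max_possible_codons() function with the ‘seq’ parameter.
Number of actual codons: use the get_max_possible_codons() function with the ‘cds’ parameter.
Ratio of Coding NT to Total NT: use the get_coding_ratio() function with both the ‘seq’ and ‘cds’ parameters.
GC percent of the entire sequence: use the get_gc_percent() function with the ‘seq’ parameter.
GC percent of the coding sequence: use the get_gc_percent() function with the ‘cds’ parameter.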
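The summary could be assembled roughly as follows. To keep the sketch self-contained, it restates compact versions of the three helper functions. The labels come from the summary above, but how each value is formatted (for example, codons separated by spaces and unrounded floats) is an assumption, and the toy sequences are invented.

```python
def get_max_possible_codons(seq):
    return len(seq) // 3


def get_gc_percent(seq):
    if len(seq) == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def get_coding_ratio(seq, cds):
    if len(seq) == 0:
        return 0.0
    return len(cds) / len(seq)


def print_seq_info(seq, cds):
    """Print a summary of a gene sequence and its coding sequence."""
    # Take up to 4 complete codons from the start of the coding sequence.
    codon_count = min(4, len(cds) // 3)
    first_codons = [cds[3 * i:3 * i + 3] for i in range(codon_count)]

    print("Sequence length:", len(seq))
    print("Coding sequence length:", len(cds))
    print("Number of possible codons:", get_max_possible_codons(seq))
    print("Number of actual codons:", get_max_possible_codons(cds))
    print("First 4 codons of the coding sequence:", " ".join(first_codons))
    print("Ratio of Coding NT to Total NT:", get_coding_ratio(seq, cds))
    print("GC percent of the entire sequence:", get_gc_percent(seq))
    print("GC percent of the coding sequence:", get_gc_percent(cds))


# Toy data (not the real F9 gene): a 20-base "gene" with a 15-base CDS.
print_seq_info("CCATGGCCTTTGGATAAGGT", "ATGGCCTTTGGATAA")
```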