ICS2O Final Project: DNA Identification


Modified for Java and Grade 10 ICS2O by Mr. King
Original creators: David Malan and Brian Yu

Summary

Students write a Java program that reads a CSV file containing individuals’ names and a DNA sequence fragment belonging to each name. Using a combination of loops, objects, string manipulation, and file I/O, students write to an output file a profile of the maximum number of tandem repeats of specific fragments for each person.

Topics

File I/O. Loops. Java Objects. String Manipulation.

Difficulty

This assignment is aimed at Grade 10 students, but it requires careful attention and a considerable time commitment.

Strengths

The assignment shows a real-world application of algorithmic thinking (thinking in steps) and string manipulation.

Weaknesses

Students are limited to libraries and tools mentioned in class lessons such as slideshows.

Dependencies

Familiarity with creation of objects, strings, lists, CSV files, output files. Exposure to Java.

Introduction

DNA, or deoxyribonucleic acid, is a set of large molecules that make up our genetic blueprint. These molecules are located in each of our cells, packaged as chromosomes. The entire sequence of DNA for all 23 human chromosomes makes up the 3 billion base pairs of what is called our genome. The human genome has been sequenced since 2003. DNA sequencing has allowed forensic scientists to identify people based on trace samples of DNA.

These 3 billion base pairs are made up of the nucleotides cytosine, guanine, adenine and thymine, abbreviated C, G, A, and T, in some pseudo-random combination. Some parts of the genome are largely fixed in composition, while other parts allow for much diversity between individuals.

Identification of individuals

An example of this genetic diversity is shown in Short Tandem Repeats (STRs). These are short DNA sequences which repeat back to back, as in AGCAGCAGC rather than AGCTGCAGCGTCAGC. The first sequence has “AGC” repeating three times in a row. The second sequence also has “AGC” repeating in a sense, but with intervening bases between the copies: TGC and GTC. AGCAGCAGC has AGC repeated in an uninterrupted manner, which is why we call that a “tandem repeat”.

The number of these repeats can vary among individuals, and can be used to identify people. “AGCAGCAGC” consists of 3 tandem repeats of the sequence “AGC”, while “AGCTGCAGCGTCAGC” consists of only 1 tandem repeat, even though “AGC” occurs 3 separate times. We are only interested in counting the maximum number of tandem repeats in a DNA sample, that is, the longest uninterrupted run of the repeated sequence. The repeated sequence can consist of any number of bases.
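
The counting rule above can be sketched in Java as follows. This is only one possible approach under stated assumptions, not the required solution: the class name TandemRepeats and the method longestRun are hypothetical, and the sketch relies on String.startsWith with an offset, which may or may not be among the tools covered in class.

```java
class TandemRepeats {
    // Hypothetical helper: returns the longest run of back-to-back
    // copies of `unit` found anywhere in `dna`.
    static int longestRun(String dna, String unit) {
        if (unit.isEmpty()) {
            return 0; // an empty unit cannot repeat meaningfully
        }
        int best = 0;
        // Try every starting position in the fragment.
        for (int start = 0; start + unit.length() <= dna.length(); start++) {
            int count = 0;
            int pos = start;
            // Keep stepping forward one unit at a time while copies continue.
            while (dna.startsWith(unit, pos)) {
                count++;
                pos += unit.length();
            }
            if (count > best) {
                best = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRun("AGCAGCAGC", "AGC"));       // prints 3
        System.out.println(longestRun("AGCTGCAGCGTCAGC", "AGC")); // prints 1
    }
}
```

The two calls in main use the example sequences from this section: the first contains three uninterrupted copies of AGC, while the second never has more than one copy in a row.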

Your Task

Your task is to write a program that will look at a DNA fragment belonging to an individual and to profile the maximum number of tandem repeats of the sequences AGAT, AATG, and TATC which occur in each sample.

Program Specification

  • Your program should accept as input a CSV file “samples.csv”, which consists of two fields: the person’s name and a DNA sequence fragment.
  • Your program sends output to a file called “profiles.txt” in the format specified below. The output file should have the person’s name, followed by a profile giving the count of their STRs for AGAT, AATG, and TATC.
  • Your program should use strings, file input using Scanner, and file output using PrintWriter. Students are only permitted to use data structures, Java libraries and tools used in the course and presented in slideshows.
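
As a rough illustration of the file handling named above, the sketch below reads “samples.csv” line by line with Scanner and writes one line per person with PrintWriter. It is a minimal sketch under assumptions, not the expected solution: the class SampleReader, the helper splitRecord, and the output file “io-check.txt” are all hypothetical names, and the line it writes (name and fragment length) is only a stand-in to confirm that reading and writing work, not the required profile format.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

class SampleReader {
    // Hypothetical helper: splits "name,sequence" into {name, sequence}.
    // Returns null when the line contains no comma.
    static String[] splitRecord(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return null;
        }
        String name = line.substring(0, comma).trim();
        String sequence = line.substring(comma + 1).trim();
        return new String[] { name, sequence };
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Assumes samples.csv is in the folder the program is run from.
        Scanner in = new Scanner(new File("samples.csv"));
        // Hypothetical scratch file, used here only to check the I/O.
        PrintWriter out = new PrintWriter("io-check.txt");
        while (in.hasNextLine()) {
            String[] record = splitRecord(in.nextLine());
            if (record == null) {
                continue; // skip blank or malformed lines
            }
            out.println(record[0] + ": " + record[1].length() + " bases read");
        }
        in.close();
        out.close(); // closing the writer flushes the output to disk
    }
}
```

Forgetting to close the PrintWriter is a common reason for an empty output file, so the sketch closes both the Scanner and the PrintWriter at the end.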

Sample Session

From the CSV file “samples.csv”, which consists of “name,sequence” pairs in one unbroken line per person:

Alice,AGACGGGTTACCATGACTATCTATCTATCTATCTATCTATCTATCTATCACGTACGTACGTATCGAGATAGATAGATAGATAGATCCTCGACTTCGATCGCAATGAATGCCAATAGACAAAA
Bob,AACCCTGCGCGCGCGCGATCTATCTATCTATCTATCCAGCATTAGCTAGCATCAAGATAGATAGATGAATTTCGAAATGAATGAATGAATGAATGAATGAATG
Charlie,CCAGATAGATAGATAGATAGATAGATGTCACAGGGATGCTGAGGGCTGCTTCGTACGTACTCCTGATTTCGGGGATCGCTGACACTAATGCGTGCGAGCGGATCGATCTCTATCTATCTATCTATCTATCCTATAGCATAGACATCCAGATAGATAGATC

your program should generate the output to “profiles.txt”:

Alice, 5 AGAT, 2 AATG, 8 TATC
Bob, 3 AGAT, 7 AATG, 4 TATC
Charlie, 6 AGAT, 1 AATG, 5 TATC
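
One way the output lines above could be assembled is shown below. This is a small sketch under assumptions: the class ProfileFormat and the method formatProfile are hypothetical names, and the three counts are passed in directly rather than computed from a sequence.

```java
class ProfileFormat {
    // Hypothetical helper: builds one line of profiles.txt from a name
    // and the three maximum tandem repeat counts.
    static String formatProfile(String name, int agat, int aatg, int tatc) {
        return name + ", " + agat + " AGAT, " + aatg + " AATG, " + tatc + " TATC";
    }

    public static void main(String[] args) {
        // Counts taken from the sample session for Alice.
        System.out.println(formatProfile("Alice", 5, 2, 8));
        // prints: Alice, 5 AGAT, 2 AATG, 8 TATC
    }
}
```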