ICS4U Final Project: DNA Identification

Modified for Java and Grade 12 ICS4U by Mr. King
Original creators: David Malan and Brian Yu

Summary

Students write a Java program that accepts (1) a CSV file representing a DNA database and (2) a text file representing a DNA sequence. Using a combination of loops, objects, string manipulation, and file I/O, students identify to whom the DNA sequence belongs.

Topics

Algorithms. Computational Biology. File I/O. Loops. Java Objects. String Manipulation.

Difficulty

This is an intermediate assignment, but it requires careful attention and a significant time commitment.

Strengths

The assignment shows a connection between computation and biology; it also demonstrates a very real-world application of algorithmic thinking and string manipulation. It also offers a nice balance between exploring features of Java and thinking algorithmically about how to compute the longest run of a particular substring.

Weaknesses

The computation of the longest run can be a bit tricky. Students are limited to libraries and tools mentioned in class lessons such as slideshows.

Dependencies

Familiarity with creating objects, strings, lists, and CSV files. Exposure to Java.

Introduction

DNA, or deoxyribonucleic acid, is a set of large molecules that make up our genetic blueprint. They are located in each of our cells, packaged as chromosomes. The entire sequence of DNA for all 23 human chromosomes makes up the 3 billion base pairs of what is called our genome. The human genome has been sequenced since 2003. This has allowed forensic scientists to identify people based on trace samples of DNA.

These 3 billion base pairs are made up of the nucleotide bases cytosine, guanine, adenine, and thymine, abbreviated C, G, A, and T, in some pseudo-random combination. Some parts of the genome are pretty fixed in composition, while other parts allow for much diversity between individuals.

Identification of individuals

An example of this genetic diversity is shown in Short Tandem Repeats (STRs). These are short DNA sequences which repeat, as in AGCAGCAGC instead of AGCTGCAGCGTCAGC. The first sequence has “AGC” repeating three times. The second example also has “AGC” repeating in a sense, but has intervening bases in the sequence: TGC and GTC. AGCAGCAGC has AGC repeated in an uninterrupted manner, which is why we call that a “tandem repeat”.

The number of these repeats can vary among individuals, and can be used to identify people. “AGCAGCAGC” consists of 3 tandem repeats of the sequence “AGC”, while “AGCTGCAGCGTCAGC” has a longest run of only 1, even though “AGC” occurs 3 separate times. We are only interested in counting the maximum number of tandem repeats in a DNA sample. The tandem repeats can consist of any number of bases.
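
One way to compute the longest run described above is to try every starting position in the DNA string and count how many copies of the STR follow back to back. The sketch below is only an illustration, not the required solution: the class name StrCounter and the method longestRun are made up for this example, and it assumes String.startsWith(prefix, offset) is among the tools you are allowed to use.

```java
// Hypothetical helper class; the names here are not part of the assignment.
class StrCounter {

    // Returns the largest number of back-to-back copies of str found anywhere in dna.
    static int longestRun(String dna, String str) {
        if (str.isEmpty()) {
            return 0; // an empty STR has no meaningful run
        }
        int best = 0;
        for (int start = 0; start + str.length() <= dna.length(); start++) {
            int count = 0;
            int pos = start;
            // Step forward one whole STR at a time while copies keep matching.
            while (dna.startsWith(str, pos)) {
                count++;
                pos += str.length();
            }
            best = Math.max(best, count);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRun("AGCAGCAGC", "AGC"));       // prints 3
        System.out.println(longestRun("AGCTGCAGCGTCAGC", "AGC")); // prints 1
    }
}
```

If startsWith is not permitted, comparing dna.substring(pos, pos + str.length()) with equals gives the same result, provided the bounds are checked first.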

Suppose we have a data file of individuals and their STR counts. The first line consists of the number of people followed by the STR sequences to be counted, with fields separated by commas. Each line after that holds the name of an individual, followed by that person’s repeat count for each STR, in the order given on the first line. Example:

3, AGTC, TTAC, GTCA 
Bob, 4, 5, 7 
Medina, 8, 6, 11 
Leslie, 6, 9, 3

So, this means that in the second row, we get the name Bob, followed by the number of repeats of AGTC, TTAC, and GTCA in that order, the same order of appearance of the sequences in the first row. That is, AGTC repeats 4 times, TTAC 5 times, and GTCA 7 times in tandem in Bob’s sample. A sample in which all three of these STR counts match is pretty good evidence that the sample is Bob’s.

It is also possible that the combination of STR counts doesn’t match anyone in your database. If even one of the STR counts is off, it cannot be considered a match.
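
The all-or-nothing rule above can be sketched as a comparison of every count, rejecting a person as soon as one count differs. This is a minimal illustration using the example data file shown earlier; the class ProfileMatcher, its methods, and the use of parallel arrays are assumptions made for the example, not a required design.

```java
// Hypothetical example class; names and data layout are illustrative only.
class ProfileMatcher {

    // True only when both arrays have the same length and every count agrees.
    static boolean sameCounts(int[] a, int[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                return false; // a single mismatch rules this person out
            }
        }
        return true;
    }

    // names[i] belongs with counts[i]; sample holds the counts found in the DNA.
    static String findMatch(String[] names, int[][] counts, int[] sample) {
        for (int i = 0; i < names.length; i++) {
            if (sameCounts(counts[i], sample)) {
                return names[i];
            }
        }
        return "No match";
    }

    public static void main(String[] args) {
        String[] names = {"Bob", "Medina", "Leslie"};
        int[][] counts = {{4, 5, 7}, {8, 6, 11}, {6, 9, 3}};
        System.out.println(findMatch(names, counts, new int[] {4, 5, 7})); // prints Bob
        System.out.println(findMatch(names, counts, new int[] {4, 5, 8})); // prints No match
    }
}
```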

Your Task

Your task is to write a program that will take a sequence of DNA and a CSV file containing STR counts for a list of individuals and then output to whom the DNA (most likely) belongs.

Program Specification

  • Your program should open the CSV file and read its contents into an array.
    • You may assume that the first row of the CSV file will be the column names. The first number will be the number of individuals (or rows of data), and the remaining columns will be the STR sequences themselves.
  • Your program should open the DNA sequence and read its contents into memory.
  • For each of the STRs (from the first line of the CSV file), your program should compute the longest run of consecutive repeats of the STR in the DNA sequence to identify.
  • If the STR counts match exactly with any of the individuals in the CSV file, your program should print out the name of the matching individual.
    • You may assume that the STR counts will not match more than one individual.
    • If the STR counts do not match exactly with any of the individuals in the CSV file, your program should print “No match”.
  • CSV stands for “comma-separated values”. This means that the fields on each line are separated by commas, and each record ends with a line break. Since a CSV file is nothing more than a plain ASCII text file, there will be no special libraries used to deal with these files, other than what we did in class. CSV files contain no special formatting characters.
  • There will be an object for the three sequences to search for.
  • There will be an object holding each individual’s name and that individual’s STR count for each of the sequences. These can be stored in an array of objects. The length of the array would be determined by the number in the first line of the data file, and can thus be dynamically assigned.
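
The object layout in the specification above could look something like the following sketch. It is one possible arrangement under stated assumptions: the class names Person and DnaDatabase are invented for this example, and the lines are passed in as a list so the example runs without a file. In a real program the lines would come from the CSV file using whatever file-reading approach was covered in class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical record of one individual: a name plus one count per STR.
class Person {
    final String name;
    final int[] counts;

    Person(String name, int[] counts) {
        this.name = name;
        this.counts = counts;
    }
}

// Hypothetical container for the header sequences and the array of people.
class DnaDatabase {
    final String[] strs;   // STR sequences from the first line
    final Person[] people; // one entry per data row

    DnaDatabase(List<String> lines) {
        String[] header = lines.get(0).split(",");
        // The first header field is the number of individuals.
        int numPeople = Integer.parseInt(header[0].trim());
        strs = new String[header.length - 1];
        for (int i = 1; i < header.length; i++) {
            strs[i - 1] = header[i].trim(); // trim() tolerates spaces after commas
        }
        people = new Person[numPeople]; // array sized from the file itself
        for (int row = 0; row < numPeople; row++) {
            String[] fields = lines.get(row + 1).split(",");
            int[] counts = new int[strs.length];
            for (int c = 0; c < counts.length; c++) {
                counts[c] = Integer.parseInt(fields[c + 1].trim());
            }
            people[row] = new Person(fields[0].trim(), counts);
        }
    }

    public static void main(String[] args) {
        DnaDatabase db = new DnaDatabase(List.of(
                "3,AGAT,AATG,TATC",
                "Alice,5,2,8",
                "Bob,3,7,4",
                "Charlie,6,1,5"));
        System.out.println(Arrays.toString(db.strs));
        for (Person p : db.people) {
            System.out.println(p.name + " " + Arrays.toString(p.counts));
        }
    }
}
```

Sizing the array from the first header field mirrors the specification. The sketch does no error handling for malformed lines.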

Sample Sessions

For the CSV file

3,AGAT,AATG,TATC
Alice,5,2,8
Bob,3,7,4
Charlie,6,1,5

Each of the following sequences would sit in a separate file containing only the sequence, written as one unbroken line, so each test requires a separate run. To test your code, replace the old sequence with the new one in the same data file, using a text editor (which Replit provides).

The following sequence should match Alice:

AGACGGGTTACCATGACTATCTATCTATCTATCTATCTATCTATCTATCACGTACGTACGTATCGAGATAGATAGATAGATAGATCCTCGACTTCGATCGCAATGAATGCCAATAGACAAAA

The following sequence should match Bob:

AACCCTGCGCGCGCGCGATCTATCTATCTATCTATCCAGCATTAGCTAGCATCAAGATAGATAGATGAATTTCGAAATGAATGAATGAATGAATGAATGAATG

The following sequence should match Charlie:

CCAGATAGATAGATAGATAGATAGATGTCACAGGGATGCTGAGGGCTGCTTCGTACGTACTCCTGATTTCGGGGATCGCTGACACTAATGCGTGCGAGCGGATCGATCTCTATCTATCTATCTATCTATCCTATAGCATAGACATCCAGATAGATAGATC

And the following sequence should have no match:

GGTACAGATGCAAAGATAGATAGATGTCGTCGAGCAATCGTTTCGATAATGAATGAATGAATGAATGAATGAATGACACACGTCGATGCTAGCGGCGGATCGTATATCTATCTATCTATCTATCAACCCCTAG
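
Since each sequence above sits alone in its own file on a single line, reading it into memory can be very short. The sketch below assumes Java 11 or later and that java.nio.file.Files is permitted in your class; the class name SequenceReader and the file name sequence.txt are placeholders chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
class SequenceReader {

    // Reads the whole file as one string and strips surrounding whitespace,
    // such as the trailing newline that most text editors add.
    static String readSequence(Path file) throws IOException {
        return Files.readString(file).trim();
    }

    public static void main(String[] args) throws IOException {
        // "sequence.txt" is a placeholder; pass a different path as the first argument.
        Path file = Path.of(args.length > 0 ? args[0] : "sequence.txt");
        String dna = readSequence(file);
        System.out.println(dna.length() + " bases read");
    }
}
```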
