Researchers from the University of Georgia’s Center for Food Safety have developed software that marks an important step toward improving the accuracy of DNA sequence analysis in tests for microbial contamination.
Sepia is a cutting-edge read classifier, written by College of Agricultural and Environmental Sciences Assistant Professor Henk den Bakker, that is available as open-source software. It is expected to make genome sequencing much faster for researchers studying bacteria.
Bacterial chromosomes typically range in length from 1.5 million to roughly 9.5 million base pairs, but researchers who want to “read” the individual bases of a genome (the genome sequencing process) must, with modern technology, do it in pieces of 150 to 10,000 base pairs. These pieces are called “reads.”
When researchers sequence the DNA in a sample, such as a nasal swab, to determine which microorganisms and viruses are present, they use a tool called a “read classifier” to quickly sort through the reads and determine which microorganisms they most likely belong to.
Like other read classifiers, den Bakker’s new software works by cross-referencing the information from the sample to existing databases, but it is designed to address challenges in the process posed by potential errors in the taxonomic information available on some microorganisms or the switch to a new taxonomic system altogether.
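The cross-referencing step can be illustrated with a toy example. The Python sketch below is a generic k-mer lookup, not Sepia's actual implementation, which this article does not describe. The reference sequences, the organism names and the k-mer length of 5 are all made up for illustration; real classifiers use much longer k-mers and far larger databases.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(taxon)
    return index


def classify_read(read, index, k):
    """Return the taxon with the most k-mer hits, or None if nothing matches."""
    votes = Counter()
    for km in kmers(read, k):
        for taxon in index.get(km, ()):
            votes[taxon] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical reference "genomes" (invented sequences, not real organisms).
REFERENCES = {
    "Organism A": "ATGGCGTACGTTAGCCTAGGATCCGATTACA",
    "Organism B": "TTGACCGGTAAGCTTCCGGAATTCTGCAGTA",
}
K = 5  # assumed k-mer length, chosen small so the toy data works
INDEX = build_index(REFERENCES, K)

read_from_a = "GTACGTTAGCCTAGG"   # a fragment copied from Organism A
read_unknown = "CCCCCCCCCCCC"     # shares no k-mer with either reference

print(classify_read(read_from_a, INDEX, K))
print(classify_read(read_unknown, INDEX, K))
```

In this sketch a read that matches nothing in the database is simply left unclassified. The quality of every answer depends on the labels attached to the reference sequences, which is why errors in a database's taxonomy matter.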
Because bacteria are single-celled microorganisms that often lack distinguishing physical features, they are more difficult to classify than more complex organisms, such as mammals or reptiles. Researchers have only recently begun using DNA to determine the taxonomy of microorganisms. This means that the taxonomy used in some of the databases that read classifiers reference does not always agree with what similarities in DNA show.
“Only recently, in the last decade, we began sequencing these organisms and using the genetic data to build taxonomies. That’s very important because when we know things are genetically similar, a read classifier can use that information to make predictions,” den Bakker said.
Using these predictions, when the read classifier discovers an organism that is missing from the database, it can help researchers determine what that unidentified organism is most closely related to by comparing its genetic material to that of known microorganisms, he said.
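One generic way to express “most closely related to” in code is to find the lowest common ancestor of the known organisms a read resembles. The Python sketch below assumes a simplified taxonomy stored as a child-to-parent table, with most ranks skipped for brevity. It is an illustration of the general idea, not a description of how Sepia stores or searches its taxonomy.

```python
# Simplified child -> parent table; intermediate ranks are omitted for brevity.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Listeria monocytogenes": "Listeria",
    "Listeria": "Listeriaceae",
    "Listeriaceae": "Bacteria",
}


def lineage(taxon, parent):
    """Return the path from taxon up to the root, starting with taxon itself."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    # Start with the first lineage (deepest node first) and keep only the
    # nodes that also appear in every other lineage; order is preserved.
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0]


print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], PARENT))
print(lowest_common_ancestor(["Escherichia coli", "Listeria monocytogenes"], PARENT))
```

Because the table is a plain mapping, correcting a misplaced organism in this sketch means changing a single parent entry, after which every later lookup reflects the fix.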
When writing the software, den Bakker intentionally made it simple for the end user to address problems with the taxonomy used in databases. Given the software’s wide range of applications, much of his focus was on making it user-friendly, so that researchers who find an error in a database’s taxonomy can easily edit and correct it.
To test the software, den Bakker recruited the help of Lee Katz, a bioinformatician with the U.S. Centers for Disease Control and Prevention (CDC) and adjunct faculty member with the UGA Center for Food Safety. Katz tested the software’s ability to detect genome contamination, a check that lets researchers confirm that they have sequenced only the organism they are interested in, and not a mixture of organisms. Based on his findings, Katz has suggested its use to CDC colleagues for metagenomics analysis.
Den Bakker anticipates that the software in its current form will function as a base model onto which he will build additional features. One such upcoming feature is designed to help protect patient confidentiality by removing human DNA from test results. Researchers will then be able to share the results of their research while simultaneously complying with health information privacy laws.
“For me, writing software is also exploring new data structures on a data science level—how to make these things more efficient. Writing it is more or less like starting an experiment in the lab,” den Bakker said.
The software is available now and is free to download on GitHub. More information on Sepia can be found in The Journal of Open Source Software.