Person: Qiao, Dandi
Search Results
Publication: Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data
(BioMed Central, 2012) Qiao, Dandi; Yip, Wai-Ki; Lange, Christoph
Background: As next-generation sequencing data become available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of their enormous size. Researchers working with genetic data encounter this problem every day and will continue to do so. Some options exist for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, the currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data are accessed.
Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that it achieves a higher compression rate than the commonly used compression methods and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs.
Conclusions: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
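To illustrate why binary encodings shrink SNP data, the following is a minimal sketch of 2-bit genotype packing, the general idea behind binary formats such as the PLINK format mentioned in the abstract. It is not the SpeedGene algorithm, whose details the abstract does not give. The function names `pack_genotypes` and `unpack_genotypes` and the use of code 3 for a missing genotype are assumptions made for illustration.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes into bytes, four genotypes per byte.

    Assumed coding (hypothetical): 0, 1, 2 = minor-allele count, 3 = missing.
    """
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, 3):
            raise ValueError(f"invalid genotype code: {g}")
        # Each genotype occupies 2 bits; position within the byte is i % 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover `count` genotype codes from bytes produced by pack_genotypes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


genotypes = [0, 1, 2, 3, 2, 1]
packed = pack_genotypes(genotypes)
restored = unpack_genotypes(packed, len(genotypes))
```

Under the assumption that a text file spends two characters (a digit and a separator, 16 bits) per genotype, packing into 2 bits gives roughly an 8-fold reduction; the larger factors reported in the abstract would require further techniques beyond this sketch.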
Publication: Two Mutations in the SARS-CoV-2 Spike Protein and RNA Polymerase Complex Are Associated With COVID-19 Mortality Risk
(2021) Hahn, Georg; Wu, Chloe M.; Lee, Sanghun; Hecker, Julian; Lutz, Sharon; Haneuse, Sebastien; Qiao, Dandi; Demeo, Dawn; Tanzi, Rudolph; Choudhary, Manish; Etemad, Behzad; Mohammadi, Abbas; Esmaeilzadeh, Elmira; Cho, Michael M.; Li, Jonathan; Randolph, Adrienne; Laird, Nan; Weiss, Scott; Silverman, Edwin; Ribbeck, Katharina; Lange, Christoph
SARS-CoV-2 mortality has been extensively studied in relation to host susceptibility. How sequence variations in the SARS-CoV-2 genome affect pathogenicity is poorly understood. Testing for association between whole-genome sequencing (WGS) data of the virus and death in patients with SARS-CoV-2 is one potential method for early identification of highly pathogenic strains to target for containment. We analyzed 7,548 single-stranded RNA genomes from SARS-CoV-2 patients in the GISAID database and tested variants for association with mortality using logistic regression. In total, evaluating 29,891 sequenced loci of the viral genome for association with patient/host mortality, two loci, at 12,053bp and 25,088bp, achieved genome-wide significance (p-values of 4.09e-09 and 4.41e-23, respectively). Mutations at 25,088bp occur in the S2 subunit of the SARS-CoV-2 spike protein, which plays a key role in viral entry into target host cells. Additionally, mutations at 12,053bp are within the ORF1ab gene, in a region encoding the protein nsp7, which is necessary to form the RNA polymerase complex responsible for viral replication and transcription. Both mutations alter amino acid coding sequences, potentially imposing structural changes that could enhance viral infectivity and symptom severity, and may be important to consider as targets for therapeutic development. Identification of these highly significant associations, unlikely to occur by chance, may assist with early containment of potentially highly pathogenic COVID-19 strains.
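The abstract does not state the significance threshold that was used. As a rough check, assuming a Bonferroni correction at a family-wise level of 0.05 across the 29,891 tested loci (an assumption, not necessarily the authors' method), both reported p-values fall well below the threshold:

```python
ALPHA = 0.05        # assumed family-wise error rate (not stated in the abstract)
N_LOCI = 29_891     # number of loci tested, from the abstract

# Bonferroni threshold: divide the family-wise level by the number of tests.
threshold = ALPHA / N_LOCI   # about 1.67e-06

# Reported p-values keyed by genome position (bp), from the abstract.
reported = {12_053: 4.09e-09, 25_088: 4.41e-23}

# True for each locus whose p-value is below the assumed threshold.
significant = {pos: p < threshold for pos, p in reported.items()}
```

Bonferroni is a conservative choice, so loci passing it would also pass less stringent multiple-testing corrections.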