Browsing by Author "Jorge Duitama"

A graph clustering algorithm for detection and genotyping of structural variants from long reads
(University of Oxford, 2023) Nicolás Gaitán; Jorge Duitama
The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths, and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
A phased genome assembly of a Colombian Trypanosoma cruzi TcI strain and the evolution of gene families
(Nature Portfolio, 2024) Maria Camila Hoyos Sanchez; Hader Sebastian Ospina Zapata; Brayhan Dario Suarez; Carlos Fernando Giraldo Ospina; Hamilton Julián Barbosa; Julio César Carranza; Gustavo Adolfo Vallejo; Daniel Montes; Jorge Duitama
Comprehensive genomic resources related to domestication and crop improvement traits in Lima bean
(Research Square (United States), 2020) Tatiana García; Jorge Duitama; Stephanie Smolenski Zullo; Juanita Gil; Andrea Ariani; Sarah Dohle; Antonia Palkovic; Paola Skeen; Clara Isabel Bermúdez-Santana; Daniel G. Debouck
Lima bean (Phaseolus lunatus L.) is one of the five domesticated Phaseolus bean crops, which are essential sources of dietary proteins for human consumption. Compared to common bean (P. vulgaris), it shows a wider range of ecological adaptations along its distribution range from Mexico to Argentina. These adaptations and its phenotypic plasticity make Lima bean a promising crop for improving food security under predicted scenarios of climate change in Latin America and elsewhere. Lima bean is also an excellent model to study convergent evolution of the adaptive domestication syndrome due to its dual domestication in Mesoamerica and the Andes. Combining long and short read sequencing technologies with a dense genetic map from a biparental population, we obtained the first chromosome-level genome assembly for Lima bean. Annotation of 28,326 gene models showed high diversity among 1,917 genes with conserved domains related to disease resistance. Structural comparison across 21,180 orthologs with common bean revealed high genome synteny and two large intrachromosomal rearrangements. Speciation between P. lunatus and P. vulgaris occurred about six million years ago according to nucleotide evolution between these orthologs. Population genomic analysis of GBS data for 482 wild and domesticated accessions from the Mesoamerican and Andean gene pools provided novel evidence on population structure at a finer geographical scale. Results show that wild Lima bean is organized into six clusters with mostly non-overlapping distributions and that Mesoamerican landraces can be further subdivided into three subclusters. A new wild cluster of diversity was found in the Colombian Andes and a separate genetic cluster was observed for Mesoamerican landraces of the Peninsula of Yucatan in Mexico. This study also documents genome-wide patterns of selection and haplotype introgression events among gene pools.
Analysis of RNA-seq data obtained from wild and domesticated accessions at two different pod developmental stages revealed 4,275 differentially expressed genes, which could be related to pod dehiscence and seed development. We expect that the present resources serve as a solid basis to achieve a comprehensive view of the degree of convergent evolution of Phaseolus species under domestication and provide new tools and information for breeding for climate change resiliency of different domesticated species.
FlavorMiner: A Machine Learning Platform for Extracting Molecular Flavor Profiles from Structural Data
(2024) Fabio Herrera‐Rocha; Miguel Fernández‐Niño; Jorge Duitama; Mónica P. Cala; María José Chica; Ludger A. Wessjohann; Mehdi D. Davari; Andrés Fernando González Barrios
Flavor is the main factor driving consumer acceptance of food products. However, tracking the biochemistry of flavor is a formidable challenge due to the complexity of food composition. Current methodologies for linking individual molecules to flavor in foods and beverages are expensive and time-consuming. Predictive models based on machine learning (ML) are emerging as an alternative to speed up this process. Nonetheless, the optimal approach to predict flavor features of molecules remains elusive. In this work we present FlavorMiner, an ML-based multilabel flavor predictor. FlavorMiner seamlessly integrates different combinations of algorithms and mathematical representations, augmented with class balance strategies to address the inherent class imbalance of the input dataset. Notably, Random Forest and K-Nearest Neighbors combined with Extended Connectivity Fingerprint and RDKit molecular descriptors consistently outperform other combinations in most cases. Resampling strategies surpass weight balance methods in mitigating bias associated with class imbalance. FlavorMiner exhibits remarkable accuracy, with an average ROC AUC score of 0.88. This algorithm was used to analyze cocoa metabolomics data, unveiling its profound potential to help extract valuable insights from intricate food metabolomics data. FlavorMiner can be used for flavor mining in any food product, drawing from a diverse training dataset that spans over 934 distinct food products.
FlavorMiner: a machine learning platform for extracting molecular flavor profiles from structural data
(BioMed Central, 2024) Fabio Herrera‐Rocha; Miguel Fernández‐Niño; Jorge Duitama; Mónica P. Cala; María José Chica; Ludger A. Wessjohann; Mehdi D. Davari; Andrés Fernando González Barrios
Flavor is the main factor driving consumer acceptance of food products. However, tracking the biochemistry of flavor is a formidable challenge due to the complexity of food composition. Current methodologies for linking individual molecules to flavor in foods and beverages are expensive and time-consuming. Predictive models based on machine learning (ML) are emerging as an alternative to speed up this process. Nonetheless, the optimal approach to predict flavor features of molecules remains elusive. In this work we present FlavorMiner, an ML-based multilabel flavor predictor. FlavorMiner seamlessly integrates different combinations of algorithms and mathematical representations, augmented with class balance strategies to address the inherent class imbalance of the input dataset. Notably, Random Forest and K-Nearest Neighbors combined with Extended Connectivity Fingerprint and RDKit molecular descriptors consistently outperform other combinations in most cases. Resampling strategies surpass weight balance methods in mitigating bias associated with class imbalance. FlavorMiner exhibits remarkable accuracy, with an average ROC AUC score of 0.88. This algorithm was used to analyze cocoa metabolomics data, unveiling its profound potential to help extract valuable insights from intricate food metabolomics data. FlavorMiner can be used for flavor mining in any food product, drawing from a diverse training dataset that spans over 934 distinct food products. Scientific Contribution: FlavorMiner is an advanced machine learning (ML)-based tool designed to predict molecular flavor features with high accuracy and efficiency, addressing the complexity of food metabolomics. By leveraging robust algorithmic combinations paired with mathematical representations, FlavorMiner achieves high predictive performance. Applied to cocoa metabolomics, FlavorMiner demonstrated its capacity to extract meaningful insights, showcasing its versatility for flavor analysis across diverse food products.
This study underscores the transformative potential of ML in accelerating flavor biochemistry research, offering a scalable solution for the food and beverage industry.
Genetic diversity and comparative genomics across Leishmania (Viannia) species
(2024) Jorge Duitama; Laura Natalia González-García; María Rodríguez; Marcela Parra-Muñoz; Ana Clavijo; Laura Levy; Clemencia Ovalle‐Bracho; Claudia Colorado; Carolina Camargo; Eyson Quiceno
Leishmaniasis is a disease representing an important public health problem worldwide, with a broad spectrum of clinical and epidemiological features partly associated with the diversity and complex life cycle of the Leishmania parasites. This study analyzes genomic data from 205 Leishmania (Viannia) samples, including 66 newly sequenced clinical isolates. It also provides chromosome-level genome assemblies for 10 isolates representing different species and populations. The observed distribution of Leishmania genomic diversity across the sampling locations suggests rapid adaptation to different ecosystems. Pangenomic analysis of high-quality assemblies shows consistent copy number variation between species for different gene families. Amastin gene families have larger numbers and diversity than previous reports based on analysis of short-read data. This work provides comprehensive genomic resources to identify population markers for Leishmania spp., offering valuable insights into the biology, transmission dynamics, evolution of virulence mechanisms, and spread of resistance of the parasite.
Machine Learning Models for Accurate Prioritization of Variants of Uncertain Significance
(2020) Daniel Mahecha; Haydemar Núñez; María Claudia Lattig; Jorge Duitama
The growing use of new generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of Variants of Uncertain Significance (VUS). In this manuscript we compare three machine learning methods to classify VUS as Pathogenic or Non-pathogenic, implementing a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron (MLP). To train the models, we extracted 82,463 high quality variants from ClinVar, using 9 conservation scores, the loss of function tool and allele frequencies. For the RF and SVM models, hyperparameters were tuned using cross validation with a grid search. The three models were tested on a set of 5,537 variants that had been classified as VUS at some point during the previous three years but had been reclassified in August 2020. The three models yielded superior accuracy on this set compared to the benchmarked tools. The RF based model yielded the best performance across different variant types and was used to create VusPrize, an open source software tool for prioritization of variants of uncertain significance. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
Recent evolution, domestication and metabolism of cyanide compounds in Lima bean
(2025) Jorge Duitama; Erick Duarte; Tatiana García Navarrete; F. J. Pérez Zúñiga; Johanna Stepanian; Viviana Parra; Juan Pablo Londoño; Paula Siaucho; Edwin Bautista; Santiago Jiménez-Serrano
Understanding the evolution and functional genomics of the biosynthesis of secondary metabolites, including production of hydrogen cyanide (HCN), is a major goal in Lima bean research. This work describes our latest findings on short-term evolution, genomics and expression in Lima bean. This includes a chromosome-level assembly for the Andean gene pool and long-read sequencing of wild relatives. Large indels explained by transposable elements affect promoter regions of several genes related to domestication traits. The two major gene pools of P. lunatus diverged within the last million years, but recent insertions of LTRs produced important variations in genome size and composition. A core metabolic network for Phaseolus revealed patterns of variability in RNA expression for genes related to different primary and secondary metabolic processes. The Phaseolus genomes have important differences in the number, location, and expression of genes, which can explain the unique ability of Lima bean among domesticated Phaseolus species to produce HCN.
Retraining and evaluation of machine learning and deep learning models for seizure classification from EEG data
(Nature Portfolio, 2025) Juan Pablo Carvajal-Dossman; Laura Guio; Danilo García-Orjuela; Jennifer J Guzmán-Porras; Kelly Garcés; Andrés Naranjo; Silvia Juliana Maradei-Anaya; Jorge Duitama
Selection signatures and population dynamics of transposable elements in Lima bean
(Research Square (United States), 2023) Daniela Lozano‐Arce; Tatiana García; Laura Natalia González-García; Romain Guyot; María Isabel Chacón-Sánchez; Jorge Duitama
The domestication process in Lima bean (Phaseolus lunatus L.) involves at least two independent events, within the Mesoamerican and Andean gene pools. Both processes produced similar phenotypic changes in landraces, making Lima bean an excellent model to understand convergent evolution. Despite recent research efforts, the mechanisms of adaptation followed by Mesoamerican and Andean landraces are largely unknown. The genes related to these adaptations can be selected by identification of selective sweeps within gene pools. Most of the previous genetic analyses in Lima bean have relied on Single Nucleotide Polymorphism (SNP) loci and have ignored transposable elements (TEs), which are a major source of variation in plant genomes. The current availability of high-throughput sequencing technologies enables the collection of whole-genome resequencing (WGS) data to approach intraspecies population dynamics of TEs. The present research collected WGS data from 60 wild and domesticated Lima bean accessions to generate the most complete characterization developed to date of transposable elements and SNP loci in the Lima bean genome. We generated an updated annotation of 223,780 transposable elements in the Lima bean genome. Furthermore, we identified genes and variable TEs affected by selective sweeps. Combining three different approaches, selective sweeps were predicted to generate a set of domestication candidate genes. A small percentage of genes under selection (1.6%) were shared among gene pools, suggesting that domestication followed different genetic avenues in both gene pools. Up to 25% of the genes with previously reported selective sweeps in common bean were also detected in Lima bean. We also built a catalog of 39,459 TEs with presence-absence variation (PAV). The fact that 75% of these TEs were located close to genes shows their potential to affect gene functions in Lima bean.
The genetic structure inferred from variable TEs was consistent with that obtained from SNP markers, suggesting that TE dynamics can be related to the demographic history of wild and domesticated Lima bean and its adaptive processes, in particular selection processes during domestication.