{"id":1119953,"date":"2023-12-14T03:37:27","date_gmt":"2023-12-14T08:37:27","guid":{"rendered":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/uncategorized\/integrating-dna-methylation-and-gene-expression-data-in-a-single-gene-network-using-the-inetgrate-package-nature-com\/"},"modified":"2023-12-14T03:37:27","modified_gmt":"2023-12-14T08:37:27","slug":"integrating-dna-methylation-and-gene-expression-data-in-a-single-gene-network-using-the-inetgrate-package-nature-com","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/transhuman-news-blog\/dna\/integrating-dna-methylation-and-gene-expression-data-in-a-single-gene-network-using-the-inetgrate-package-nature-com\/","title":{"rendered":"Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package &#8230; &#8211; Nature.com"},"content":{"rendered":"<p><p>Description of datasets    <\/p>\n<p>    In this study we, utilized five independent cohorts including    four cancer- and one Alzheimer-related datasets. Gene    expression profiling was done using RNA-seq and DNA methylation    data were obtained using the Illumina Infinium    HumanMethylation450 BeadChip, measuring DNA methylation levels    (beta values) on about 450,000 genomic loci.  <\/p>\n<p>    The TCGA cohorts were obtained using the TCGAbiolinks    package50    (Version2.24.3). TCGA-LUSC22 and    TCGA-LUAD23 had clinical and    genomic data from 589 and 592 patients, respectively    (Supplementary TableS2). Information on    the pathological stages of the tumors was included in both    datasets. We used this information to stratify the patients    into distinct risk groups and compared the resulting    stratification with clusters obtained from our approach.  <\/p>\n<p>    TCGA-LIHC24 was provided by    a comprehensive study that included 436 cases with clinical    information available in the data. We used the Ishak fibrosis    score51 and    alpha-fetoprotein (AFP) level52,53,54,55,56 to stratify    patients into low-, intermediate-, and high-risk groups. The    employed score is described later in this section.  <\/p>\n<p>    TCGA-L AML was provided by a thorough genomic and epigenomic    study on 200 adult cases with AML25. The risk groups    were defined based on cytogenetic    abnormalities25,57.  <\/p>\n<p>    In addition, we used the ROSMAP cohort provided by the    longitudinal cohort studies of aging and dementia. We    downloaded the ROSMAP dataset from accelerating Medicines    Partnership- AD58 with Synapse IDs    syn3388564 (bulk RNA-seq) and syn5850422 (DNA methylation),    using the synapser (<a href=\"https:\/\/r-docs.synapse.org\/articles\/synapser.html\" rel=\"nofollow\">https:\/\/r-docs.synapse.org\/articles\/synapser.html<\/a>)    R package (Version0.6.61) and a custom R scripts    (Version3.6.1)59.  <\/p>\n<p>    In the TCGA cohorts, events were defined by patients death and    the time to an event referred to the duration from the initial    diagnosis to death time or the last follow-up. In the ROSMAP    cohort, the event was clinical diagnosis of any dementia    including mild cognitive impairment with or without other    cognitive conditions, Alzheimers dementia with or without    other cognitive conditions, and other primary causes of    dementia without clinical evidence of Alzheimers dementia. The    time to an event in this context referred to the age at which    the first dementiarelated diagnosis was made.  <\/p>\n<p>    To enhance the power of our network, we included cases that    have either a single type of data (i.e., gene expression or DNA    methylation) or both data available. In the survival analysis,    we included only patients whose gene expression, DNA    methylation, and survival data were available (Supplementary    TableS2).  <\/p>\n<p>    The initial step in preprocessing involves normalizing the gene    expression data. This is accomplished via a logarithmic    transformation in based 10 to stabilize the variance and make    the data more amenable to following analyses. Because logarithm    of zero is not defined, a small offset is added to the    expression levels prior to applying this transformation.    iNETgrate further preprocesses data in two steps: cleaning and    filtering. The former step involved cleaning DNA methylation    and clinical data using the wrapper function cleanAllData(). Loci with more than    (50%) missing beta values    were removed, while loci with less than (50%) missing values were imputed. The    imputation was performed by replacing each missing value with    the mean of the beta values for the corresponding locus    (preprocessDnam()). The    clinical data was subsequently cleaned by removing cases with    missing survival time and status (prepareSurvival()). The cleaned survival    data had patient information including ID, events, time, and    risk based on the clinical gold standard.  <\/p>\n<p>    The second step in the preprocessing data was filtering out    genes and loci that have a weak absolute Pearson correlation    with survival time and vital status. This was performed by    calling electGenes() inside    the cleanAllData() wrapper    function. In this study, we set the absolute correlation    coefficient cutoffs to 0.2 in all TCGA datasets and 0.1 in the    ROSMAP dataset.  <\/p>\n<p>    Every gene and locus that met the quality control criteria was    retained for the subsequent steps. In addition, we used    computeUnion() to include    corresponding loci of the selected genes and corresponding    genes of the selected loci in the next steps of analysis.  <\/p>\n<p>    In iNETgrate, every node represents a gene with two features    (i.e., gene expression and DNA methylation values). Therefore,    we needed to calculate the DNA methylation value for each gene    using computEigenloci(). This    function calculated a weighted average of loci levels for their    corresponding gene in the following way. When the number of    loci corresponding to a gene was less than six, the first    principal component (i.e., eigenloci) was calculated directly    by taking a weighted average of beta values using PCA. This was    the case for almost (95%)    of loci in our datasets (Supplementary Fig.S1).  <\/p>\n<p>    For the remaining (5%) of    cases, in which the number of loci representing a gene was six    or more, we used findCore() to    determine the most connected cluster of loci for each gene.    Specifically, a graph was constructed for each gene using the    igraph package (Version 1.5.0). In this graph, each locus is    represented by a node. We used a fast greedy    algorithm60 to calculate the    pairwise correlation between loci and detected communities    (i.e., clusters) in the graph. Within each community, the    average pairwise correlation was computed. The community with    the highest average pairwise correlation was identified as a    dense subset of highly co-methylated loci in the graph, and the    eigenloci value was then computed based on this subset.  <\/p>\n<p>    We constructed a network in which nodes represent genes and    edges are weighted based on the absolute correlation of gene    expression and DNA methylation levels for each pair of genes.    This was achieved using the makeNetwork() function. The weight of the    edges between genes (g_i)    and (g_j) was calculated    using the following equation:  <\/p>\n<p>      $$begin{aligned} mathscr      {W}(g_i,g_j)=(1-mu )|{{,textrm{cor},}}_E(g_i,g_j)|+ mu      |{{,textrm{cor},}}_M(g_i,g_j)|, end{aligned}$$    <\/p>\n<p>      (1)    <\/p>\n<p>    Here, (mathscr    {W}(g_i,g_j)) describes the integrated similarity    between genes (g_i) and    (g_j). The term (|{{,textrm{cor},}}_E(g_i,g_j)|)    represents the absolute value of the Pearson correlation    between the gene expression levels of genes (g_i) and (g_j). Similarly, (|{{,textrm{cor},}}_M(g_i,g_j)|)    represents the absolute value of the Pearson correlation    between the DNA methylation levels of these two genes. The    hyperparameter (mu ) is an    integrative factor controlling the relative contributions of    gene expression and DNA methylation data in the network. When    (mu =0), the network is    based solely on gene expression data. Increasing the value of    (mu ) emphasizes the DNA    methylation data in the model, whereas (mu =1) indicates that only DNA    methylation data is used in calculating the edge weights (i.e.,    gene similarities).  <\/p>\n<p>    Construction of the network and identification of the modules    were done by the wrapper function makeNetwork(), which first uses the    pickSoftTreshold() function    (RsquaredCut=0.75) from the weighted gene co-expression network    analysis20(WGCNA) package    (Version 1.72.1) to determine the optimal soft-thresholding    power for our integrated network. Then, the blockwiseModules() function (with    minModuleSize=5, the absolute value of Pearson correlation, and    the default values for the rest of parameters) is utilized to    execute a hierarchical clustering approach. This leads to    identification of modules, where each module is a group of    genes that exhibit similar patterns of expression and DNA    methylation. Additionally, module zero is designed to contain    outlier genes that cannot be confidently assigned to any module    due to their weak or negligible correlation with other genes.  <\/p>\n<p>    We employed PCA to compute an eigengene for every module    (computEgengenes()). In order    to balance the contribution of high-risk and low-risk groups,    the gene expression and DNA methylation data were oversampled.    Intermediate-risk cases were not included in the PCA. An    eigengene is computed from a weighted average of gene    expression levels ((E^e)),    DNA methylation levels ((E^m)), or both ((E^{em})), using the following    equations:  <\/p>\n<p>      $$begin{aligned} E^e = alpha      ^e_{_1} g^e_{_1} + alpha ^e_{_2} g^e_{_2} + cdots + alpha      ^e_{_n} g^e_{_n}, end{aligned}$$    <\/p>\n<p>      (2)    <\/p>\n<p>      $$begin{aligned} E^m = alpha      ^m_{_1} g^m_{_1} + alpha ^m_{_2} g^m_{_2} + cdots + alpha      ^m_{_n} g^m_{_n}, end{aligned}$$    <\/p>\n<p>      (3)    <\/p>\n<p>      $$begin{aligned} E^{em} = (1-mu )      E^e + mu E^m. end{aligned}$$    <\/p>\n<p>      (4)    <\/p>\n<p>    Here, n is the number of genes in the module,    (g^e_{_i}) is the expression    level of gene i, and (g^m_{_i}) is the methylation level    corresponding to gene i (i.e., eigenloci), while    (alpha ^e_{_n}) and    (alpha ^m_{_n}) are the    corresponding weights. These weights are computed using PCA    ensuring maximum variance and minimum loss of biological    information. The eigengene levels are then inferred for the    intermediate-risk group using the same weights obtained from    PCA. It should be emphasized that regardless of which    eigengenes are used, our network and the corresponding modules    are consistently constructed based on both gene expression and    DNA methylation data and they depend on the (mu ) hyperparameter. The resulting    eigengenes are robust features, carrying useful biological    information, which can be leveraged in classification,    clustering, and other downstream analyses including survival    analysis.  <\/p>\n<p>    To identify the optimal subset of modules for precise    prognostication of risk groups, we conducted a two-step    survival analysis using analyzeSurvival(). In the first step, we    performed a penalized Cox regression analysis using the least    absolute shrinkage and selection operator (lasso)    penalty29,30 from the glmnet    R package61 (Version 4.1.7).    This analysis identified the three modules that were most    associated with the survival data. Second, we fitted an AFT    model31 to each    combination of the top three modules and determined the optimal    combination that leads to the smallest p-value. p-values were    based on a log-rank test with a null hypothesis that there is    no difference between the two high- and low-risk    groups62.  <\/p>\n<p>    To categorize the risk groups, iNETgrate uses findAliveCutoff() that searches for a    cutoff on the AFT predictions such that the difference between    high- vs. low-risk groups is optimized. More specifically, for    each risk group, the function iterates over all possible cutoff    values leading to a recall of more than a given threshold    (i.e., for low-risk: minRecall=0.2, for high-risk: minRecall=0.1 in ROSMAP and 0.05 in other    datasets) and selects the cutoff value that maximizes    precision.  <\/p>\n<p>    To ensure the reliability of our integrative approach, we    performed a comparative analysis by benchmarking our results    against alternative methodologies including a well-known    patient similarity network called SNFtool. We also compared our    results vs. risk classification according to the clinical gold    standards based on the intrinsic nature of the disease in each    cohort.  <\/p>\n<p>    The SNFtool first computes a similarity matrix for each data    type (i.e., gene expression and DNA methylation). That is,    using each data type independently, a network is constructed    where nodes are patients and weights of the edges represent    similarity between patients computed based on correlation. The    networks (similarity matrices) are then fused to create a    consensus network representing the overall similarity between    patients across different data types. The resulting patient    similarity network is then used to cluster patients into    subgroups. We noted that the SNFtool faced some limitations in    using all the DNA methylation loci due to memory exhaustion    while computing the similarity matrices. We tackled this issue    by filtering out loci with a relatively low variation    characterized by a standard deviation of less than 0.1.    Determining the appropriate cutoff for a given dataset is    subjective and challenging for SNFtool users.  <\/p>\n<p>    In lung cohorts (LUSC and LUAD), we evaluated the risk groups    based on the tumor stage. Specifically, we classified stages    I,IA,IB,II, and    IIA as the low-risk group, stages IIIB and    IV as the high-risk group, and the remaining stages as    the intermediate-risk group. In the LIHC cohort, we considered    a case high-risk if the AFP level was greater than 500 or the    Ishak fibrosis score was six. In contrast, patients were    considered low-risk if their AFP levels were smaller than 250    and their Ishak fibrosis scores were 0, 1, or 2. The remaining    cases were considered intermediate-risk. In the LAML cohort, we    utilized the classification system available in the clinical    data that categorized cases based on cytogenetic criteria into    three groups of favorable (low-risk), intermediate, and poor    (high-risk). We utilized the Braak score63 to stratify the    ROSMAP cohort into three risk groups. Cases with a Braak score    of 0, 1, or 2 were considered low-risk, those with a Braak    score of 5 or 6 were classified high-risk, while the remaining    cases were grouped as intermediate-risk.  <\/p>\n<p><!-- Auto Generated --><\/p>\n<p>See the rest here:<br \/>\n<a target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41598-023-48237-8\" title=\"Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package ... - Nature.com\" rel=\"noopener\">Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package ... - Nature.com<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> Description of datasets In this study we, utilized five independent cohorts including four cancer- and one Alzheimer-related datasets. Gene expression profiling was done using RNA-seq and DNA methylation data were obtained using the Illumina Infinium HumanMethylation450 BeadChip, measuring DNA methylation levels (beta values) on about 450,000 genomic loci. The TCGA cohorts were obtained using the TCGAbiolinks package50 (Version2.24.3) <a href=\"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/transhuman-news-blog\/dna\/integrating-dna-methylation-and-gene-expression-data-in-a-single-gene-network-using-the-inetgrate-package-nature-com\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26],"tags":[],"class_list":["post-1119953","post","type-post","status-publish","format-standard","hentry","category-dna"],"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1119953"}],"collection":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/comments?post=1119953"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1119953\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/media?parent=1119953"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/categories?post=1119953"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/tags?post=1119953"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}