{"id":168244,"date":"2024-01-12T02:35:55","date_gmt":"2024-01-12T07:35:55","guid":{"rendered":"https:\/\/www.immortalitymedicine.tv\/genomic-prediction-in-multi-environment-trials-in-maize-using-statistical-and-machine-learning-methods-scientific-nature-com\/"},"modified":"2024-08-18T11:39:45","modified_gmt":"2024-08-18T15:39:45","slug":"genomic-prediction-in-multi-environment-trials-in-maize-using-statistical-and-machine-learning-methods-scientific-nature-com","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/machine-learning\/genomic-prediction-in-multi-environment-trials-in-maize-using-statistical-and-machine-learning-methods-scientific-nature-com.php","title":{"rendered":"Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific &#8230; &#8211; Nature.com"},"content":{"rendered":"<p><p>Phenotypic data    <\/p>\n<p>    The data are composed of 265 single cross hybrids from the    maize breeding program of Embrapa Maize and Sorghum evaluated    in eight combinations of trials\/locations\/years under irrigated    trials (WW) and water stress (WS) conditions at two locations    in Brazil (JanabaMinas Gerais and TeresinaPiau) over two    years (2010 and 2011). The hybrids were obtained from crosses    between 188 inbred lines and two testers. The inbred lines    belong to heterotic groups: dent (85 inbred lines), flint (86    inbred lines), and an additional group, referred to as group C    (17 inbred lines), which is unrelated to the dent and flint    origins. The two testers are inbred lines belonging to the    flint (L3) and dent (L228-3) groups. Among the inbred lines,    120 were crossed with both testers, 52 were crossed with the    L228-3 tester only, and 16 lines were crossed with the L3    tester only. Silva et al. (2020) evaluated the genetic    diversity and heterotic groups in the same database. These    authors showed the existence of subgroups within each heterotic    group. Therefore, once these groups were not genetically well    defined and the breeding program from Embrapa Maize was in the    beginning, the effect of allelic substitution in both groups    are assumed to be the same. More details on the experimental    design and procedures can be found in Dias et    al.13,30.  <\/p>\n<p>    The experiment originally included 308 entries, but hybrids    that were not present in all environments were also removed to    evaluate the genomic prediction within each environment,    resulting in a total of 265 hybrids for analysis. Each trial    consisted of 308 maize single cross hybrids, randomly divided    into six sets: sets 13 for crosses with L3 (61, 61, and 14    hybrids each), and sets 46 for crosses with L228-3 (80, 77,    and 15 hybrids each). Four checks (commercial maize cultivars)    were included in each set, and the experiment was designed in    completely randomized blocks. Between trials, hybrids within    each set remained the same, but hybrids and checks were    randomly allocated into groups of plots within each set. This    allocation varied between replicates of sets and between    trials. The WS trials had three replications, except for the    set containing 15 hybrids and the trials evaluated in 2010,    which had two replications. All WW trials, except for the trial    in 2011, had two replicates.  <\/p>\n<p>    Two agronomic traits related to drought tolerance were    analyzed: grain yield (GY) and female flowering time (FFT). GY    was determined by weighing all grains in each plot, adjusted    for 13% grain moisture, and converted to tons per hectare    (t\/ha), accounting for differences in plot sizes across trials.    FFT was measured as the number of days from sowing until the    stigmas appeared in 50% of the plants. A summary of means,    standard deviations, and ranges of both evaluated traits are    available in Table 1.  <\/p>\n<p>    To conduct the analyses, hybrids considered as outliers were    removed (i.e., hybrids that presented phenotypic values greater    than 1.5interquartile range above the third quartile or    below the first quartile) for the GY and FFT traits. The    variations in predictive abilities among hybrids of T2, T1, and    T0 are widely recognized31. However, the    primary aim of our study was to compare different prediction    methodologies in MET assays. In this study, there were 240 T2    hybrids and 68 T1 hybrids, with T2 hybrids had both parents    evaluated in different hybrid combinations, while hybrids being    single-cross hybrids sharing one parent with the tested    hybrids. Given the realistic nature of our scenario, we have a    limited and imbalanced distribution of these hybrid groups,    making a fair comparison challenging. Consequently, we opted to    construct a training set comprising T2 and T0 hybrids.  <\/p>\n<p>    To correct the phenotypic values for experimental design    effects, each trial (WW and WS) and environment were analyzed    independently to obtain the Best Linear Unbiased Estimator    (eBLUEs) for each hybrid, for the two traits evaluated. The    estimates were obtained based on the following model:  <\/p>\n<p>      $${varvec{y}} = 1mu + user2{      X}_{1} {varvec{r}} + {varvec{X}}_{2} {varvec{s}} +      {varvec{X}}_{3} {varvec{h}} + {varvec{e}}$$    <\/p>\n<p>      (1)    <\/p>\n<p>    where (user2{y }left( {n times 1}    right)) is the phenotype vector for (f) replicates, (t) sets of (p) hybrids, and (n) is the number of observations;    (mu) is the mean;    (user2{r }left( {f times 1}    right)) is the fixed effect vector of the replicates;    (user2{s }left( {t times 1}    right)) is the fixed effect vector of the sets;    ({varvec{h}}) (left( {p times 1} right)) is the    fixed effect vector of the hybrids; and (user2{e }left( {k times 1} right))    is the residue vector, with (user2{    e} sim ,MVNleft( {0,{varvec{I}}sigma_{e}^{2} }    right)), where ({varvec{I}}) is an identity matrix of    corresponding order, and (sigma_{e}^{2}) the residual variance.    ({varvec{X}}_{{1user2{ }}} left(    {k times f} right)), ({varvec{X}}_{{2user2{ }}} left( {k times t}    right)) e ({varvec{X}}_{{3user2{ }}} left( {k times p}    right)) represents incidence matrices for their    respective effects. The eBLUES of each environment were used in    further analyses.  <\/p>\n<p>    A total of 57,294 Single Nucleotide Polymorphisms (SNPs)    markers were obtained from 188 inbred lines, and two testers    used as parents of the 265 single cross hybrids. The genotyping    by sequencing (GBS) strategyare detailed in Dias et    al.13. For the    quality control, SNPs were discarded if: the minor allele    frequency was smaller than 5%, more than 20% of missing    genotypes were found, and\/or there were more than 5% of    heterozygous genotypes. After filtering, missing data were    imputed using NPUTE. Then, for each SNP, the genotypes of the    hybrids were inferred based on the genotype of their parents    (inbred line and tester). The number of SNPs per chromosome    ranged from 3121 (chromosome 10) to 7705 (chromosome 1),    totalizing 47,127 markers.  <\/p>\n<p>    The additive and dominance genomic relationship matrices were    constructed32 based on    information from the SNPs using the package    AGHmatrix33, following    VanRaden34 and Vitezica et    al., respectively.  <\/p>\n<p>    Genomic predictions were performed using the Genomic Best Line    Unbiased Prediction (GBLUP) method using the package AsReml v.    436. Two groups    were considered: the first group comprised four environments    under WW conditions, and the second included four environments    under WS conditions. The linear model is described below:  <\/p>\n<p>      $$overline{user2{y}} = mu 1 +      user2{ Xb} + {varvec{Z}}_{1} {varvec{u}}_{{varvec{a}}} +      {varvec{Z}}_{2} {varvec{u}}_{{varvec{d}}} +      {varvec{e}}$$    <\/p>\n<p>      (2)    <\/p>\n<p>    where (user2{overline{y} }left(    {pq times 1} right)) is the vector of eBLUES    previously estimated for each environment with (p) hybrids and (q) environments;(mu) is the mean; (user2{b }left( {q times 1} right))    is the vector of environmental effects (fixed); ({varvec{u}}_{{user2{a }}} left( {pq times    1} right)) is the vector of individual additive    genetic values nested within environments (random), with    ({varvec{u}}_{{varvec{a}}} sim    MVNleft( {0,left[ {{varvec{I}}_{{varvec{q}}} sigma_{{u_{a}    }}^{2} + rho_{a} left( {{varvec{J}}_{{varvec{q}}} -    {varvec{I}}_{{varvec{q}}} } right)} right] otimes    {varvec{A}}} right)), where ({varvec{A}}) is the genomic    relationship matrix between individuals for additive effects,    (rho_{a}) is the additive    genetic correlation coefficient between environments,    ({varvec{I}}_{{varvec{q}}} user2{    }left( {q times q} right)) is an identity matrix,    ({varvec{J}}_{{varvec{q}}} user2{    }left( {q times q} right)) is a matrix of ones, and    (otimes) denotes the    Kronecker product; ({varvec{u}}_{{varvec{d}}})    (left( {pq times 1}    right)) is the vector of individual dominance genetic    values nested within environments (random), with ({varvec{u}}_{{varvec{d}}} sim MVNleft(    {0,left[ {{varvec{I}}_{{varvec{q}}} sigma_{{u_{d} }}^{2} +    rho_{d} left( {{varvec{J}}_{{varvec{q}}} -    {varvec{I}}_{{varvec{q}}} } right)} right] otimes    {varvec{D}}} right)), where ({varvec{D}}) is the genomic    relationship matrix between individuals for dominance effects,    (rho_{{varvec{d}}}) is the    dominance correlation coefficient between environments;    ({varvec{e}}) (left( {pq times 1} right)) is the    random residuals vector with ({varvec{e}}sim MVNleft(    {0,{varvec{I}}sigma_{e}^{2} } right)). The capital    letters (user2{X }left( {pq times    q} right),user2{ Z}_{1} left( {pq times pq}    right)) and ({varvec{Z}}_{2} user2{ }left( {pq times pq}    right)) represent the incidence matrices for their    respective effects, (1user2{    }left( {pq times 1} right)) is a vector of ones. The    (co)variance components were obtained using the residual    maximum likelihood method (REML)37.  <\/p>\n<p>    Two alternative models were also used. The first for genomic    prediction retained only additive effects by removing    ({varvec{u}}_{{varvec{d}}}) from    Eq.(2). The second model    was used to estimate the genetic parameters within each    environment separately.  <\/p>\n<p>    The significance of random effects was tested using the    Likelihood Ratio Test (LRT)38, given by:  <\/p>\n<p>      $$LRT = 2*left( {LogL_{c} -      LogL_{r} } right)$$    <\/p>\n<p>      (3)    <\/p>\n<p>    where (LogL_{c}) is the    logarithm of the likelihood function of the complete model    (with all effects included), and (LogL_{r}) is the logarithm of the    restricted likelihood function of the reduced model (without    the effect under test). Effect significance was tested by LRT    using the chi-square (X2) probability density    function with a degree of freedom and significance level of    5%39.  <\/p>\n<p>    The narrow-sense heritability (({    }h^{2})), the proportion of variance explained by    dominance effects ((d^{2})),    and the broad-sense heritability (left( {H^{2} } right)) for each trait    were estimated following Falconer and Mackay    199635.  <\/p>\n<p>    Similar to the previous topic, the trials were divided between    WW and WS conditions, and the potential of regression trees    (RT) was explored using the following three algorithms:    bagging, random forest, and boosting22. Bagging (Bag)    is a methodology that aims to reduce the RT    variance22. In other    words, it consists of obtaining D samples with available    sampling replacement, thus obtaining D models (hat{f}^{1} left( x right), hat{f}^{2}    left( x right), ldots , hat{f}^{D} left( x    right)), and finally use the generated models to    obtain an average, given by:  <\/p>\n<p>      $$hat{f}_{medio} left( x right)      = frac{1}{D}mathop sum limits_{d = 1}^{D} hat{f}^{d}      left( x right)$$    <\/p>\n<p>      (4)    <\/p>\n<p>    This decreases the variability obtained in the decision trees.    The number of trees used in Bag is not a parameter that will    result in overfitting of the model. In practice, a number of    trees is used until the error has stabilized22. The number of    trees sampled for Bag was set at 500 trees.  <\/p>\n<p>    Random forest (RF) was proposed by HO40 and it is an    improvement of Bag to avoid the high correlation of the trees    and to improve the accuracy in the selection of individuals. RF    changes only the number of predictor variables used in each    split. That is, each time a split in a tree is considered, a    random sample of (m)    variables is chosen as candidates from the complete set of    (p) variables. Hastie et    al.21 suggest that    the number of predictor variables used in each partition is    equal to (m = frac{p}{3})    for regression trees. The number of trees for the RF was set at    500.  <\/p>\n<p>    Boosting uses RT by adjusting the residual of an initial model.    The residual is updated with each tree that grows sequentially    from the previous tree's residual, and the response variable    involves a combination of a large number of trees, such that:  <\/p>\n<p>      $$hat{f}left( x right) = mathop      sum limits_{b = 1}^{B} {uplambda } hat{f}^{b} left( x      right)$$    <\/p>\n<p>      (5)    <\/p>\n<p>    The function (hat{f}left( .    right)) refers to the final tree combined with    sequentially adjusted trees, and  is the shrinkage parameter    that controls the learning rate of the method. Furthermore,    this method needs to be adjusted with several splits in each of    the trees. This parameter controls the complexity of the Boost    and is known as the depth. For Boosting, the number of trees    sampled was 250, with a learning rate of 0.1 and a depth of 3.  <\/p>\n<p>    To perform hybrid prediction for each environment based on MET    dataset, we propose the incorporation of location and year    information in which the experiments were carried out as    factors in the data input file together with SNPs markers as    predictors in machine learning methodologies. As a response    variable, the eBLUEs previously estimated by    Eq.(1) were used.  <\/p>\n<p>    For the construction of the bagging and random forest models,    the randomForest function from the package    randomForest41 was used.    Finally, the package's gbm function    gbm42 was used for    boosting. All analyzes were implemented in the software    R43.  <\/p>\n<p>    Genomic predictions were carried out following Burgueo et    al.16, considering    two different prediction problems, CV1 and CV2, which simulate    two possible scenarios a breeder can face. In CV1, the ability    of the algorithms to predict the performance of hybrids that    have not yet been evaluated in any field trial was evaluated.    Thus, predictions derived from the CV1 scenario are entirely    based on phenotypic and genotypic records from other related    hybrids. In CV2, the ability of the algorithms to predict the    performance of hybrids using data collected in other    environments was evaluated. It simulates the prediction problem    found in incomplete MET trials. Here, information from related    individuals is used, and the prediction can benefit from    genetic relationships between hybrids and correlations between    environments. Within the CV2 scenario, two different situations    of data imbalance were evaluated. In the first, called CV2    (50%), the tested hybrids were not present in half of the    environments, while in the second, called CV2 (25%), the tested    hybrids were not present in only 25% of the environments. Table    2 provides a    hypothetical representation of this CV1, CV2 (50%), and CV2    (25%) validation scheme.  <\/p>\n<p>    To separate the training and validation sets, the k-folds    procedure was used, considering (k =    5). The set of 265 hybrids was divided into five    groups, with 80% of the hybrids considered as the training    population, and the remaining 20% hybrids considered as the    validation population. The hybrids were separated into sets    proportionally containing all the crosses performed    (DentDent, DentFlint,    FlintFlint, CDent,    CFlint). The cross-validation process was performed    separately for each trait, condition (WS or WW) and scenario    (CV1, CV2-50% and CV2-25%) and was repeated five times to    assess the predictive ability of the analyses.  <\/p>\n<p>    The predictive ability within each environment for the    conditions (WS and WW) was estimated by the Pearson correlation    coefficient44 between the    corrected phenotypic values (eBLUES) of    Eq.(1) for each environment    and the GEBVs predicted by each fitted method.  <\/p>\n<p>    The authors confirm that all methods were carried out by    relevant guidelines in the method section. The authors also    confirm that the handling of the plant materials used in the    study complies with relevant institutional, national, and    international guidelines and legislation.  <\/p>\n<p>    The authors confirm that the appropriate permissions and\/or    licenses for collection of plant or seed specimens are    taken.  <\/p>\n<p><!-- Auto Generated --><\/p>\n<p>Read more:<br \/>\n<a target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41598-024-51792-3\" title=\"Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific ... - Nature.com\" rel=\"noopener\">Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific ... - Nature.com<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> Phenotypic data The data are composed of 265 single cross hybrids from the maize breeding program of Embrapa Maize and Sorghum evaluated in eight combinations of trials\/locations\/years under irrigated trials (WW) and water stress (WS) conditions at two locations in Brazil (JanabaMinas Gerais and TeresinaPiau) over two years (2010 and 2011).  <a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/machine-learning\/genomic-prediction-in-multi-environment-trials-in-maize-using-statistical-and-machine-learning-methods-scientific-nature-com.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[1231415],"tags":[],"class_list":["post-168244","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/168244"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=168244"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/168244\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=168244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=168244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=168244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}