{"id":1118601,"date":"2023-10-16T06:42:06","date_gmt":"2023-10-16T10:42:06","guid":{"rendered":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/uncategorized\/genotyping-sequencing-and-analysis-of-140000-adults-from-mexico-nature-com\/"},"modified":"2023-10-16T06:42:06","modified_gmt":"2023-10-16T10:42:06","slug":"genotyping-sequencing-and-analysis-of-140000-adults-from-mexico-nature-com","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/transhuman-news-blog\/human-genetics\/genotyping-sequencing-and-analysis-of-140000-adults-from-mexico-nature-com\/","title":{"rendered":"Genotyping, sequencing and analysis of 140,000 adults from Mexico &#8230; &#8211; Nature.com"},"content":{"rendered":"<p><p>Recruitment of study participants<\/p>\n<p>The MCPS was established in the late 1990s following discussions between Mexican scientists at the National Autonomous University of Mexico (UNAM) and British scientists at the University of Oxford about how best to measure the changing health effects of tobacco in Mexico. These discussions evolved into a plan to establish a prospective cohort study that could investigate not only the health effects of tobacco but also those of many other factors (including factors measurable in the blood)1. Between 1998 and 2004, more than 100,000 women and 50,000 men 35 years of age or older (mean age 50 years) agreed to take part, were asked questions, had physical measurements taken, gave a blood sample and agreed to be tracked for cause-specific mortality. More women than men were recruited because the study visits were predominantly made during working hours, when women were more likely to be at home (although visits were extended into the early evenings and at weekends to increase the proportion of men in the study).
<\/p>\n<p>Participants were recruited from randomly selected areas within two contiguous city districts (Coyoacán and Iztapalapa). These two districts have existed since the pre-Hispanic period and are geographically close to the ancient Aztec city of Tenochtitlan. Originally, Indigenous populations settled there, but over the centuries the population dynamics have substantially changed. Many people from Spain, including the conqueror Hernán Cortés, resided in Coyoacán while the capital of New Spain was being built over the ruins of Tenochtitlan. The modern populations of Coyoacán and Iztapalapa derive largely from the development of urban settlements and migrations from the 1950s to the 1970s. Over this period, both districts, but particularly Iztapalapa, received large numbers of Indigenous migrants from the central (Nahuas, Otomíes and Purépechas), south (Mixtecos, Zapotecos and Mazatecos) and southeast (Chinantecos, Totonacas and Mayas) regions of the country.<\/p>\n<p>At recruitment, a 10-ml venous EDTA blood sample was obtained from each participant and transferred to a central laboratory using a transport box chilled (4–10 °C) with ice packs. Samples were refrigerated overnight at 4 °C and then centrifuged (2,100g at 4 °C for 15 min) and separated the next morning. Plasma and buffy-coat samples were stored locally at −80 °C, then transported on dry ice to Oxford (United Kingdom) for long-term storage over liquid nitrogen. DNA was extracted from buffy coat at the UK Biocentre using Perkin Elmer Chemagic 360 systems and suspended in TE buffer. UV–VIS spectroscopy using Trinean DropSense96 was used to determine yield and quality, and samples were normalized to provide 2 µg DNA at 20 ng µl⁻¹ concentration (2% of samples provided a minimum of 1.5 µg DNA at 10 ng µl⁻¹ concentration) with a 260:280 nm ratio of >1.8 and a 260:230 nm ratio of 2.0–2.2.
<\/p>\n<p>Genomic DNA samples were transferred to the Regeneron Genetics Center from the UK Biocentre and stored in an automated sample biobank at −80 °C before sample preparation. DNA libraries were created by enzymatically shearing DNA to a mean fragment size of 200 bp, and a common Y-shaped adapter was ligated to all DNA libraries. Unique, asymmetric 10 bp barcodes were added to the DNA fragments during library amplification to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled before overnight exome capture with a slightly modified version of IDT's xGen v1 probe library; all samples were captured on the same lot of oligonucleotides. The captured DNA was PCR amplified and quantified by quantitative PCR. The multiplexed samples were pooled and then sequenced using 75 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of 146,068 samples were made available for processing. We were unable to process 2,628 samples, most of which failed QC during processing owing to low or no DNA being present. A total of 143,440 samples were sequenced. The average 20× coverage was 96.5%, and 98.7% of the samples were above 90%.<\/p>\n<p>Of the 143,440 samples sequenced, 2,394 (1.7%) did not pass one or more of our QC metrics and were subsequently excluded. Criteria for exclusion were as follows: disagreement between genetically determined and reported sex (n=1,032); high rates of heterozygosity or contamination (VBID>5%) (n=249); low sequence coverage (less than 80% of targeted bases achieving 20× coverage) (n=29); genetically identified sample duplicates (n=1,062 total samples); WES variants discordant with the genotyping chip (n=8); uncertain linkage back to a study participant (n=259); and instrument issue at DNA extraction (n=6).
The remaining 141,046 samples were then used to compile a project-level VCF (PVCF) for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained 9,950,580 variants.<\/p>\n<p>Approximately 250 ng of total DNA was enzymatically sheared to a mean fragment size of 350 bp. Following ligation of a Y-shaped adapter, unique, asymmetric 10 bp barcodes were added to the DNA fragments with three cycles of PCR. Libraries were quantified by quantitative PCR, pooled and then sequenced using 150 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of 10,008 samples were sequenced. This included 200 mother–father–child trios and 3 more extended pedigrees. The rest of the samples were chosen to be unrelated to the third degree or closer and enriched for parents of nuclear families. The average mean coverage was 38.5×, 99% of samples had mean coverage of >30×, and all samples were above 27×.<\/p>\n<p>Of the 10,008 samples that were whole-genome sequenced, 58 (0.6%) did not pass one or more of our QC metrics and were subsequently excluded. Reasons for exclusion were as follows: disagreement between genetically determined and reported sex (n=16); high rates of heterozygosity or contamination (VBID>5%) (n=10); genetically identified sample duplicates (n=19 total samples); and uncertain linkage back to a study participant (n=14). The remaining 9,950 samples were then used to compile a PVCF for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained 158,464,363 variants.<\/p>\n<p>The MCPS WES and WGS data were reference-aligned using the OQFE protocol35, which uses BWA-MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates and adds additional per-read tags.
The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. Single-sample variants were called using DeepVariant (v.0.10.0) with default WGS parameters or custom exome parameters35, generating a gVCF for each input OQFE CRAM file. These gVCFs were aggregated and joint-genotyped using GLnexus (v.1.3.1). All constituent steps of this protocol were executed using open-source software.<\/p>\n<p>Similar to other recent large-scale sequencing efforts, we implemented a supervised machine-learning algorithm to discriminate between probable low-quality and high-quality variants8,12. In brief, we defined a set of positive control and negative control variants based on the following criteria: (1) concordance in genotype calls between array and exome-sequencing data; (2) transmitted singletons; (3) an external set of likely high-quality sites; and (4) an external set of likely low-quality sites. To define the external high-quality set, we first generated the intersection of variants that passed QC in both TOPMed Freeze 8 and gnomAD v.3.1 genomes. This set was additionally restricted to phase 1 high-confidence SNPs from the 1000 Genomes project36 and gold-standard insertions and deletions from the 1000 Genomes project and a previous study37, both available through the GATK resource bundle (<a href=\"https:\/\/gatk.broadinstitute.org\/hc\/en-us\/articles\/360035890811-Resource-bundle\" rel=\"nofollow\">https:\/\/gatk.broadinstitute.org\/hc\/en-us\/articles\/360035890811-Resource-bundle<\/a>). To define the external low-quality set, we intersected gnomAD v.3.1 fail variants with TOPMed Freeze 8 Mendelian or duplicate discordant variants.
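One way to use labelled control sites like these is to train a classifier on per-site quality metrics. A minimal scikit-learn sketch of that idea follows; the metric values, labels and parameter grid here are synthetic stand-ins, not the study's actual features or hyperparameters:

```python
# Sketch: classify variant sites as high vs low quality from per-site QC
# metrics, using an RBF-kernel SVM with cross-validated grid search.
# All data below are synthetic; real features would be site QC metrics.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Rows = labelled control variants, columns = hypothetical site metrics
# (e.g. allele balance in heterozygote calls).
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = high quality (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 80/20 split with fivefold cross-validated grid search over SVM
# hyperparameters, mirroring the procedure described in the text.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    cv=5,
)
grid.fit(X_tr, y_tr)
test_acc = grid.score(X_te, y_te)  # held-out accuracy check
```

The held-out test set plays the same role as in the text: confirming that the cross-validated hyperparameters generalize before applying the classifier genome-wide.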
Before model training, the control set of variants was binned by allele frequency and then randomly sampled such that an equal number of variants was retained in the positive and negative labels across each frequency bin. A support vector machine using a radial basis function kernel was then trained on up to 33 available site quality metrics, including, for example, the median value for allele balance in heterozygote calls and whether a variant was split from a multi-allelic site. We split the data into training (80%) and test (20%) sets. We performed a grid search with fivefold cross-validation on the training set to identify the hyperparameters that returned the highest accuracy during cross-validation, which were then applied to the test set to confirm accuracy. This approach identified a total of 616,027 WES and 22,784,296 WGS variants as low quality (of which 161,707 and 104,452 were coding variants, respectively). We further applied a set of hard filters to exclude monomorphs, unresolved duplicates, variants with >10% missingness, ≥3 Mendel errors (WGS only) or failed Hardy–Weinberg equilibrium (HWE) with excess heterozygosity (HWE P<1×10⁻³⁰ and observed heterozygote count >1.5× the expected heterozygote count), which resulted in a dataset of 9,325,897 WES and 131,851,586 WGS variants (of which 4,037,949 and 1,460,499 were coding variants, respectively).<\/p>\n<p>Variants were annotated as previously described38. In brief, variants were annotated using the Ensembl variant effect predictor, with the most severe consequence for each variant chosen across all protein-coding transcripts. In addition, we derived canonical transcript annotations based on a combination of MANE, APPRIS and Ensembl canonical tags. MANE annotation was given the highest priority, followed by APPRIS.
When neither MANE nor APPRIS annotation tags were available for a gene, the canonical transcript definition of Ensembl was used. Gene regions were defined using Ensembl release 100. Variants annotated as stop gained, start lost, splice donor, splice acceptor, stop lost or frameshift, for which the allele of interest was not the ancestral allele, were considered predicted loss-of-function variants. Five annotation resources were used to assign deleteriousness to missense variants: SIFT; PolyPhen2 HDIV and PolyPhen2 HVAR; LRT; and MutationTaster. Missense variants were considered likely deleterious if predicted deleterious by all five algorithms, possibly deleterious if predicted deleterious by at least one algorithm, and likely benign if not predicted deleterious by any algorithm.<\/p>\n<p>Samples were genotyped using an Illumina Global Screening Array (GSA) v.2 beadchip according to the manufacturer's recommendations. A total of 146,068 samples were made available for processing, of which 145,266 (99.5%) were successfully processed. The average genotype call rate per sample was 98.4%, and 98.4% of samples had a call rate above 90%. Of the 145,266 samples that were genotyped, 4,435 (3.1%) did not pass one or more of our QC metrics and were subsequently excluded. Reasons for exclusion were as follows: disagreement between genetically determined and reported sex (n=1,827); low-quality samples (call rates below 90%) (n=2,276); genotyping chip variants discordant with exome data (n=44); genetically identified sample duplicates (n=1,063 total samples); uncertain linkage back to a study participant (n=268); and sample affected by an instrument issue at DNA extraction (n=6). The remaining 140,831 samples were then used to compile a PVCF for downstream analysis. This dataset contained 650,380 polymorphic variants.
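Sample-level exclusions of the kind listed here reduce to boolean filters over a per-sample QC table. A minimal pandas sketch, with hypothetical column names and toy values standing in for the real QC flags:

```python
# Sketch: sample-level QC exclusion as boolean filtering.
# Column names and data are hypothetical stand-ins for real QC output.
import pandas as pd

samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "reported_sex": ["F", "M", "F", "M"],
    "genetic_sex": ["F", "F", "F", "M"],    # S2: sex disagreement
    "call_rate": [0.99, 0.98, 0.85, 0.97],  # S3: call rate below 90%
    "is_duplicate": [False, False, False, False],
})

fail = (
    (samples["genetic_sex"] != samples["reported_sex"])  # sex mismatch
    | (samples["call_rate"] < 0.90)                      # low call rate
    | samples["is_duplicate"]                            # duplicate sample
)
passed = samples.loc[~fail].reset_index(drop=True)
```

Each disjunct corresponds to one exclusion criterion; in practice every criterion is tracked separately so that per-reason exclusion counts (the n values above) can be reported.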
<\/p>\n<p>The input array data from the RGC Sequencing Laboratory consisted of 140,831 samples and 650,380 variants and were passed through the following QC steps: checks for consistency of genotypes in sex chromosomes (steps 1–4); sample-level and variant-level missingness filters (steps 5 and 6); the HWE exact test applied to a set of 81,747 third-degree unrelated samples, which were identified from the initial relatedness analysis using Plink and Primus (step 7); setting genotypes with Mendelian errors in nuclear families to missing (step 8); and a second round of steps 5–7 (step 9). Plink commands associated with each step are displayed in column 2 (Supplementary Table 9). The final post-QC array data consisted of 138,511 samples and 559,923 variants.<\/p>\n<p>We used Shapeit (v.4.1.3; <a href=\"https:\/\/odelaneau.github.io\/shapeit4\" rel=\"nofollow\">https:\/\/odelaneau.github.io\/shapeit4<\/a>) to phase the array dataset of 138,511 samples and 539,315 autosomal variants that passed the array QC procedure. To improve the phasing quality, we leveraged the inferred family information by building a partial haplotype scaffold on unphased genotypes at 1,266 trios from 3,475 inferred nuclear families (randomly selecting one offspring per family when there was more than one). We then ran Shapeit one chromosome at a time, passing the scaffold information with the --scaffold option.<\/p>\n<p>We separately phased the support-vector-machine-filtered WES and WGS datasets onto the array scaffold. The phased WGS data constitute the MCPS10k reference panel.
For the WGS phasing, we used WhatsHap (<a href=\"https:\/\/github.com\/whatshap\/whatshap\" rel=\"nofollow\">https:\/\/github.com\/whatshap\/whatshap<\/a>) to extract phase information from the sequence reads and from the subset of available trios and pedigrees, and this information was fed into Shapeit (v.4.2.2; <a href=\"https:\/\/odelaneau.github.io\/shapeit4\" rel=\"nofollow\">https:\/\/odelaneau.github.io\/shapeit4<\/a>) through the --use-PS 0.0001 option. Phasing was carried out in chunks of 10,000 and 100,000 variants (WES and WGS, respectively), using 500 SNPs from the array data as a buffer at the beginning and end of each chunk. The use of the phased scaffold of array variants meant that chunks of phased sequencing data could be concatenated together to produce whole-chromosome files that preserved the chromosome-wide phasing of array variants. A consequence of this process is that when a variant appeared in both the array and sequencing datasets, the data from the array dataset were used.<\/p>\n<p>To assess the performance of the WGS phasing process, we repeated the phasing of chromosome 2 after removing the children of the 200 mother–father–child trios. We then compared the phase of the trio parents to that in the phased dataset that included the children. We observed a mean switch error rate of 0.0024. Without using WhatsHap to leverage phase information in sequencing reads, the mean switch error rate increased to 0.0040 (Supplementary Fig. 23).<\/p>\n<p>The relatedness-inference criteria and relationship assignments were based on kinship coefficients and the probability of zero IBD sharing from the KING software (<a href=\"https:\/\/www.kingrelatedness.com\" rel=\"nofollow\">https:\/\/www.kingrelatedness.com<\/a>).
We reconstructed all first-degree family networks using PRIMUS (v.1.9.0; <a href=\"https:\/\/primus.gs.washington.edu\/primusweb\" rel=\"nofollow\">https:\/\/primus.gs.washington.edu\/primusweb<\/a>) applied to the IBD-based KING estimates of relatedness, along with the genetically derived sex and reported age of each individual. In total, 99.3% of the first-degree family networks were unambiguously reconstructed. To visualize the relationship structure in the MCPS, we used the software Graphviz (<a href=\"https:\/\/graphviz.org\" rel=\"nofollow\">https:\/\/graphviz.org<\/a>) to construct networks such as those presented in Supplementary Fig. 5. We used the sfdp layout engine, which uses a spring model that relies on a force-directed approach to minimize edge length.<\/p>\n<p>To identify IBD segments and to measure ROH, we ran hap-ibd (v.1.0; <a href=\"https:\/\/github.com\/browning-lab\/hap-ibd\" rel=\"nofollow\">https:\/\/github.com\/browning-lab\/hap-ibd<\/a>) using the phased array dataset of 138,511 samples and 538,614 sites from autosomal loci. Hap-ibd was run with the parameter min-seed=4, which looks for IBD segments that are at least 4 cM long. We filtered out IBD segments in regions of the genome with fourfold more or fourfold less than the median coverage along each chromosome, following the procedure in IBDkin (<a href=\"https:\/\/github.com\/YingZhou001\/IBDkin\" rel=\"nofollow\">https:\/\/github.com\/YingZhou001\/IBDkin<\/a>), and filtered out segments overlapping regions with fourfold less than the median SNP marker density (Supplementary Fig. 28). For the homozygosity analysis, we intersected the sample with the exome data to evaluate loss-of-function variants, which resulted in a sample of 138,200.
We further overlaid the ROH segments with local ancestry estimates and assigned ancestry where the ancestries were concordant between haplotypes and the posterior probability was >0.9, assigning ancestry to 99.8% of the ROH.<\/p>\n<p>We used the workflow implemented in the R package bigsnpr (<a href=\"https:\/\/privefl.github.io\/bigsnpr\" rel=\"nofollow\">https:\/\/privefl.github.io\/bigsnpr<\/a>). In brief, pairwise kinship coefficients were estimated using Plink (v.2.0) and samples were pruned for first-degree and second-degree relatedness (kinship coefficient <0.0884) to obtain a set of unrelated individuals. LD clumping was performed with a default LD r² threshold of 0.2, and regions with long-range LD were iteratively detected and removed using a procedure based on evaluating robust Mahalanobis distances of PC loadings. Sample outliers were detected using a procedure based on K-nearest neighbours. PC scores and loadings for the first 20 PCs were efficiently estimated using truncated singular value decomposition (SVD) of the scaled genotype matrix. After removal of variant and sample outliers, a final iteration of truncated SVD was performed to obtain the PCA model. The PC scores and loadings from this model were then used to project withheld samples, including related individuals, into the PC space defined by the model using the online augmentation, decomposition and Procrustes algorithm. For each PC analysis in this study, variants with MAF<0.01 were removed.<\/p>\n<p>Admixture (v.1.3.0; <a href=\"https:\/\/dalexander.github.io\/admixture\" rel=\"nofollow\">https:\/\/dalexander.github.io\/admixture<\/a>) was used to estimate ancestry proportions in a set of 3,964 reference samples representing African, European, East Asian and American ancestries from a dataset of merged genotypes.
This included 765 samples of African ancestry from 1000 Genomes (n=661) and HGDP (n=104), 658 samples of European ancestry from 1000 Genomes (n=503) and HGDP (n=155), 727 samples of East Asian ancestry from 1000 Genomes (n=504) and HGDP (n=223), and 1,814 American samples, including 716 Indigenous Mexican samples from the MAIS study, 64 admixed Mexican American samples from MXL, 21 Maya and 13 Pima samples from HGDP, and 1,000 unrelated Mexican samples from the MCPS. Included SNPs were limited to variants present on the Illumina GSA v.2 genotyping array for which TOPMed-imputed variants in the MAIS study had information r²≥0.9 (m=199,247 SNPs). To select the optimum number of ancestry populations (K) to include in the admixture model, fivefold cross-validation was performed for each K in the set 4 to 25 with the --cv flag. To obtain ancestry proportion estimates in the remaining set of 137,511 MCPS samples, the population allele frequencies (P) estimated from the analysis of reference samples were fixed as parameters so that the remaining samples could be projected into the admixture model. Projection was performed for the K=4 model and for the K=18 model that produced the lowest cross-validation error, and point estimation was attained using the block relaxation algorithm.<\/p>\n<p>The MAIS genotyping datasets were obtained from L. Orozco from Instituto Nacional de Medicina Genómica. For 644 samples, genotyping was performed using an Affymetrix Human 6.0 array (n=599,727 variants). An additional 72 samples (11 ancestry populations) were genotyped using an Illumina Omni 2.5 array (n=2,397,901 variants). The set of 716 Indigenous samples represents 60 out of the 68 recognized ethnic populations in Mexico3.
Per chromosome, VCFs for each genotyping array were uploaded to the TOPMed imputation server (<a href=\"https:\/\/imputation.biodatacatalyst.nhlbi.nih.gov\" rel=\"nofollow\">https:\/\/imputation.biodatacatalyst.nhlbi.nih.gov<\/a>) and imputed from a multi-ethnic reference panel of 97,256 whole genomes. Phasing and imputation were performed using the programs Eagle and Minimac, respectively. The observed coefficient of determination (r²) for the reference allele frequency between the reference panel and the genotyping array was 0.696 and 0.606 for the Affymetrix and Illumina arrays, respectively.<\/p>\n<p>Physical positions of imputed variants were mapped from genome build GRCh37 to GRCh38 using the program LiftOver, and only variant positions included on the Illumina GSA v.2 were retained. After further filtering out variants with imputation information r²<0.9, the following QC steps were performed before merging of the MAIS Affymetrix and Illumina datasets: (1) removal of ambiguous variants (that is, A\/T and C\/G polymorphisms); (2) removal of duplicate variants; (3) identification and correction of allele flips; and (4) removal of variants with position mismatches. Merging was performed using the --bmerge command in Plink (v.1.9).<\/p>\n<p>We used publicly available genotypes from the HGDP (n=929) and the 1000 Genomes project (n=2,504). To obtain a combined global reference dataset for downstream analyses of population structure, admixture and local ancestry, the HGDP and 1000 Genomes datasets were merged. The resulting merged public reference dataset was subsequently merged with the MAIS dataset and the MCPS genotyping array dataset.
Each merge was performed using the --bmerge function in Plink (v.1.9; <a href=\"https:\/\/www.cog-genomics.org\/plink\" rel=\"nofollow\">https:\/\/www.cog-genomics.org\/plink<\/a>) after removing ambiguous variants, removing duplicate variants, identifying and correcting allele flips, and removing variants with position mismatches. The combined global reference dataset comprised 199,247 variants and 142,660 samples.<\/p>\n<p>To characterize genetic admixture within the MCPS cohort, we performed a seven-way LAI analysis with RFMix (v.2.0; <a href=\"https:\/\/github.com\/slowkoni\/rfmix\" rel=\"nofollow\">https:\/\/github.com\/slowkoni\/rfmix<\/a>) that included reference samples from the HGDP and 1000 Genomes studies and Indigenous samples from the MAIS study. This merged genotyping dataset of samples across these studies with the 138,511 MCPS participants included 204,626 autosomal variants and 5,363 chromosome X variants.<\/p>\n<p>To identify reference samples with extensive admixture to exclude from LAI, we performed admixture analysis with the program TeraSTRUCTURE (<a href=\"https:\/\/github.com\/StoreyLab\/terastructure\" rel=\"nofollow\">https:\/\/github.com\/StoreyLab\/terastructure<\/a>) on a merged genotyping dataset (n=3,274) that included African (AFR), European (EUR) and American (AMR) samples from the HGDP, 1000 Genomes and MAIS studies, and 1,000 randomly selected unrelated MCPS samples. Following the recommended workflow in the TeraSTRUCTURE documentation (<a href=\"https:\/\/github.com\/StoreyLab\/terastructure\" rel=\"nofollow\">https:\/\/github.com\/StoreyLab\/terastructure<\/a>), we varied the rfreq parameter over the set {0.05, 0.10, 0.15, 0.20} of autosomal variants with K=4 and selected the value that maximized the validation likelihood (20% of autosomal variants; rfreq=45,365).
We then varied the K parameter and ran it in triplicate to identify the value that attained the maximal average validation likelihood (K=18). Each of the estimated K ancestries was assigned to a global superpopulation (that is, AFR, EUR or AMR), and the cumulative K ancestry proportion was used as an ancestry score for selecting reference samples. Using an ancestry score threshold of 0.9, 666 AFR, 659 EUR and 616 AMR samples were selected as reference samples. The AMR samples used for seven-way LAI comprised 98 Mexico_North, 42 Mexico_Northwest, 185 Mexico_Central, 128 Mexico_South and 163 Mexico_Southeast individuals.<\/p>\n<p>Reference samples were phased using Shapeit (v.4.1.2; <a href=\"https:\/\/odelaneau.github.io\/shapeit4\" rel=\"nofollow\">https:\/\/odelaneau.github.io\/shapeit4<\/a>) with default settings, and the phasing of the 138,511 MCPS participants was performed as described above (see the section Array phasing). Seven-way LAI was performed using RFMix (v.2.0), with the number of terminal nodes for the random forest classifier set to 5 (-n 5), the average number of generations since expected admixture set to 15 (-G 15), and ten rounds of the expectation–maximization (EM) algorithm (-e 10). Global ancestry proportion estimates were derived by taking the average per-chromosome Q estimates (weighted by chromosome length) for each of the seven ancestries (that is, AFR, EUR, Mexico_North, Mexico_Northwest, Mexico_Central, Mexico_South and Mexico_Southeast). Inferred three-way global ancestry proportion estimates were obtained by combining the proportions for each of the five Indigenous Mexican populations into a single AMR category.
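The chromosome-length-weighted averaging of per-chromosome Q estimates can be sketched in a few lines of NumPy; the Q values and chromosome lengths below are illustrative toys, whereas the real Q matrices come from RFMix output:

```python
# Sketch: global ancestry proportions as the chromosome-length-weighted
# average of per-chromosome Q estimates. Values are illustrative only.
import numpy as np

# Per-chromosome Q estimates for one individual over 3 ancestries.
Q = np.array([
    [0.50, 0.40, 0.10],   # chromosome A
    [0.60, 0.30, 0.10],   # chromosome B
])
lengths = np.array([200e6, 100e6])  # chromosome lengths in bp

weights = lengths / lengths.sum()   # normalize lengths to weights
global_q = weights @ Q              # weighted average across chromosomes
# global_q still sums to 1 across ancestries, since each Q row does
```

Collapsing to the three-way estimate then amounts to summing the columns of the five Indigenous Mexican ancestries into a single AMR column.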
<\/p>\n<p>To delineate local ancestry segments for use in the estimation of ancestry-specific allele frequencies (see the section Ancestry-specific allele frequency estimation), we performed a three-way LAI analysis using a merged genotyping dataset that excluded the MAIS samples, as this afforded greater genotyping density (493,036 autosomal variants and 12,798 chromosome X variants). Before LAI analysis, reference samples were selected using the same workflow for TeraSTRUCTURE as described above, with the modifications being the inclusion of 10,000 unrelated MCPS participants and an ancestry threshold of 0.95. RFMix was applied as described above, with the modifications being the use of 753 AFR, 649 EUR and 91 AMR reference samples, specification of 5 rounds of EM (-e 5), and use of the --reanalyze-reference option, which treated reference haplotypes as if they were query haplotypes and updated the set of reference haplotypes in each EM round.<\/p>\n<p>To measure the correlation in ancestry between partner pairs, we used a linear model to predict the ancestry of each partner using the ancestry of their spouse, education level (four categories) and district (Coyoacán or Iztapalapa) of both partners.<\/p>\n<p>We averaged local ancestry dosages (estimated using RFMix at 98,012 positions along the genome) from 78,833 unrelated MCPS samples and performed a per-ancestry scan testing for deviation of the local ancestry proportion from the global ancestry proportion19. The test is based on assumptions of binomial sampling and a normal approximation for the sample mean. The global ancestry proportion for each ancestry was estimated as a robust average over local ancestry using Tukey's biweight robust mean.
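Under these binomial-sampling and normal-approximation assumptions, the per-position deviation test reduces to a z-test against the global proportion. A sketch with illustrative numbers (the global and local proportions here are made up):

```python
# Sketch: z-test for deviation of local ancestry proportion from the
# global proportion, under binomial sampling with a normal approximation.
# p_global and p_local are illustrative values, not study estimates.
import math

n = 78_833 * 2          # haplotypes across the unrelated samples
p_global = 0.30         # robust global ancestry proportion (toy value)
p_local = 0.31          # local ancestry proportion at one position (toy)

se = math.sqrt(p_global * (1 - p_global) / n)   # binomial standard error
z = (p_local - p_global) / se
p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal P value
significant = p_value < 1.7e-7                  # 0.05 / (98,012 * 3)
```

With cohort-scale haplotype counts, even a one-percentage-point deviation is many standard errors from the global mean, which is why a stringent Bonferroni threshold is still easily exceeded at genuinely deviating loci.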
The scan was performed in all autosomes separately for African, European and Indigenous Mexican ancestries with the significance threshold 1.7×10⁻⁷ = 0.05\/(98,012×3), which accounts for the number of local ancestry proportions tested and the three ancestries.<\/p>\n<p>IBD segments from hap-ibd were summed across pairs of individuals to create a network of IBD sharing represented by the non-negative weight matrix W ∈ ℝ^{n×n} for n samples. Each entry w_ij of W gives the total length in cM of the genome that individuals i and j share identical by descent. We sought to create a low-dimensional visualization of the IBD network. We used a similar approach to that described in ref. 14, which used the eigenvectors of the normalized graph Laplacian as coordinates for a low-dimensional embedding of the IBD network. Let D be the degree matrix of the graph, with d_ii = Σ_j w_ij and 0 elsewhere. The normalized (random-walk) graph Laplacian is defined to be L = I − D⁻¹W, where I is the identity matrix.<\/p>\n<p>The matrix L is positive semi-definite, with eigenvalues 0 = λ_0 ≤ λ_1 ≤ … ≤ λ_{n−1}. The multiplicity of eigenvalue 0 is determined by the number of connected components in the IBD network. If the network is fully connected, the eigenvector associated with eigenvalue 0 is constant, whereas the remaining eigenvectors can be used to compute a low-dimensional representation of the IBD network. If p is the desired dimension, and u_1, …, u_p are the bottom eigenvectors 1 to p of L (indexed from 0), the matrix U ∈ ℝ^{n×p} with columns u_1, …, u_p defines a low-dimensional representation of each individual in the IBD network39. In practice, we solved the generalized eigenvalue problem to obtain u_1, …, u_p.
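The generalized eigenvalue formulation can be sketched on a toy IBD weight matrix; SciPy is assumed, and the matrix below is a made-up four-individual network rather than real data:

```python
# Sketch: embedding an IBD-sharing network via the bottom eigenvectors of
# the normalized (random-walk) graph Laplacian L = I - D^{-1} W, computed
# through the equivalent generalized problem (D - W) u = lambda D u.
import numpy as np
from scipy.linalg import eigh

# Toy total IBD sharing (cM) between 4 individuals; symmetric, zero diagonal.
W = np.array([
    [0., 8., 5., 0.],
    [8., 0., 4., 1.],
    [5., 4., 0., 2.],
    [0., 1., 2., 0.],
])
d = W.sum(axis=1)          # degrees d_ii = sum_j w_ij
D = np.diag(d)

# eigh(A, B) solves A u = lambda B u; eigenvalues come back in
# ascending order, so the "bottom" eigenvectors are the first columns.
eigvals, eigvecs = eigh(D - W, D)

# For a connected network, eigvals[0] is ~0 with a constant eigenvector;
# eigvecs[:, 1:p+1] give the p-dimensional embedding coordinates.
embedding = eigvecs[:, 1:3]   # 2D coordinates for the 4 individuals
```

Solving the generalized problem avoids forming D⁻¹W explicitly and keeps the problem symmetric, which is numerically preferable at cohort scale (where sparse solvers such as `scipy.sparse.linalg.eigsh` would replace the dense `eigh` used in this toy).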
<\/p>\n<p>If u is an eigenvector of L with eigenvalue λ, then u solves the generalized eigenvalue problem Wu = λ′Du with eigenvalue λ′ = 1 − λ.<\/p>\n<p>To apply this to the IBD network of the MCPS cohort, we first removed edges with weight >72 cM, as previously done14. We did this to avoid the influence of extended families on the visualization. We next extracted the largest connected component from the IBD network and computed the bottom eigenvectors u_1, …, u_20 of the normalized graph Laplacian.<\/p>\n<p>To examine fine-scale population structure using haplotype sharing, we calculated a haplotype copying matrix L using Impute5 (<a href=\"https:\/\/jmarchini.org\/software\/#impute-5\" rel=\"nofollow\">https:\/\/jmarchini.org\/software\/#impute-5<\/a>) with entries L_ij that are the length of sequence that individual i copies from individual j. Impute5 uses a scalable imputation method that can handle very large haplotype reference panels. At its core is an efficient hidden Markov model that can estimate the local haplotype sharing profile of a target haplotype with respect to a reference set of haplotypes. To avoid the costly computation of using all the reference haplotypes, an approach based on the PBWT data structure was used to identify a subset of reference haplotypes that led to negligible loss of accuracy. We leveraged this methodology to calculate the copying matrix L, using array haplotypes from a set of 58,329 unrelated individuals as both target and reference datasets, and used the --ohapcopy and --ban-repeated-sample-names flags to prevent each target haplotype from copying itself. SVD of the scaled, centred matrix was performed using the bigstatsr package (<a href=\"https:\/\/cran.r-project.org\/web\/packages\/bigstatsr\/index.html\" rel=\"nofollow\">https:\/\/cran.r-project.org\/web\/packages\/bigstatsr\/index.html<\/a>) to generate 20 PCs.
This is equivalent to an eigen-decomposition of the variance-covariance matrix of recipients' shared segment lengths.  <\/p>\n<p>    We imputed the filtered array dataset using both the MCPS10k reference panel and the TOPMed imputation server. For TOPMed imputation, we used Plink2 to convert this dataset from Plink1.9-format genotypes to unphased VCF genotypes. For compatibility with TOPMed imputation server restrictions, we split the samples in this dataset into six randomly assigned subsets of about 23,471 samples each, and into chromosome-specific bgzipped VCF files. Using the NIH BioData Catalyst API (<a href=\"https:\/\/imputation.biodatacatalyst.nhlbi.nih.gov\" rel=\"nofollow\">https:\/\/imputation.biodatacatalyst.nhlbi.nih.gov<\/a>), we submitted these six jobs to the TOPMed imputation server. After all jobs had completed, we used bcftools merge to join the resulting dosage VCFs spanning all samples. For the MCPS10k imputation, we used Impute5 (v.1.1.5). Each chromosome was split into chunks using the imp5Chunker program with a minimum window size of 5 Mb and a minimum buffer size of 500 kb. Information scores were calculated using qctool (<a href=\"https:\/\/www.well.ox.ac.uk\/~gav\/qctool_v2\/\" rel=\"nofollow\">https:\/\/www.well.ox.ac.uk\/~gav\/qctool_v2\/<\/a>).  <\/p>\n<p>    The 1000 Genomes WGS genotype VCF files were downloaded (<a href=\"http:\/\/ftp.1000genomes.ebi.ac.uk\/vol1\/ftp\/data_collections\/1000G_2504_high_coverage\/working\/20201028_3202_phased\/\" rel=\"nofollow\">http:\/\/ftp.1000genomes.ebi.ac.uk\/vol1\/ftp\/data_collections\/1000G_2504_high_coverage\/working\/20201028_3202_phased\/<\/a>) and filtered to remove sites that are multi-allelic or duplicated, have missingness >2%, have Hardy-Weinberg \(P < 1\times {10}^{-8}\) in any subpopulation or have MAF < 0.1% in any subpopulation. We used only the 490 AMR samples in the MXL, CLM, PUR and PEL subpopulations.
We constructed two subsets of genotypes on chromosome 2 from the Illumina HumanOmniExpressExome (8.v1-2) and Illumina GSA (v.2) arrays, and these were used as input to the TOPMed and MCPS10k imputation pipelines.  <\/p>\n<p>    We measured imputation accuracy by comparing the imputed dosage genotypes to the true (masked) genotypes at variants not on the arrays. Markers were binned according to their MAF in the 490 AMR samples. In each bin, we report the squared correlation (r²) between the concatenated vector of all the true (masked) genotypes at the markers and the vector of all imputed dosages at the same markers. Variants with a missingness rate of 100% in the WGS dataset before phasing were removed from the imputation assessment.  <\/p>\n<p>    The LAI results consist of segments of inferred ancestry across each haplotype of the phased array dataset. As the WES and WGS alleles were phased onto the phased array scaffold, we inferred the ancestry of each exome allele by interpolation from the ancestry of the flanking array sites. For each WES and WGS variant on each phased haplotype, we determined the RFMix ancestry probability estimates at the two flanking array sites and used their relative base-pair positions to linearly interpolate their ancestry probabilities.
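The interpolation step can be sketched as follows. Positions, ancestry probabilities and the function name are illustrative assumptions, not values from the paper.

```python
# Sketch of linearly interpolating local-ancestry probabilities at a
# sequenced variant from the two flanking genotyping-array sites.
import numpy as np

def interpolate_ancestry(pos, left_pos, right_pos, p_left, p_right):
    # weight is 0 at the left flanking array site and 1 at the right one
    w = (pos - left_pos) / (right_pos - left_pos)
    return (1 - w) * np.asarray(p_left) + w * np.asarray(p_right)

# Exome variant halfway between two array sites; three ancestry classes
p = interpolate_ancestry(1500, 1000, 2000,
                         p_left=[1.0, 0.0, 0.0],   # fully ancestry 1 at left
                         p_right=[0.0, 1.0, 0.0])  # fully ancestry 2 at right
# p -> array([0.5, 0.5, 0. ])
```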
For a given site, if \({p}_{ijk}\) is the probability that the jth allele of the ith individual is from population k, and \({G}_{ij}\) is the 0/1 indicator of the non-reference allele for the jth allele of the ith individual, then the weighted allele count (\({{\rm{AC}}}_{k}\)), the weighted allele number (\({{\rm{AN}}}_{k}\)) and the allele frequency (\({\theta }_{k}\)) of the kth population are given by  <\/p>\n<p>      $${{\rm{AC}}}_{k}=\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{2}{p}_{ijk}{G}_{ij},\quad {{\rm{AN}}}_{k}=\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{2}{p}_{ijk},\quad {\theta }_{k}=\frac{{{\rm{AC}}}_{k}}{{{\rm{AN}}}_{k}}$$    <\/p>\n<p>    An estimate of the effective sample size for population k at the site is \({n}_{k}={{\rm{AN}}}_{k}/2\). Singleton sites can be hard to phase using existing methods. Family information and phase information in sequencing reads were used in the WGS phasing, which helped to phase a proportion of the singleton sites. In the WES dataset, we found that 46% of exome singletons occurred in stretches of heterozygous ancestry. For these variants, we gave equal weight to the two ancestries when estimating allele frequencies.  <\/p>\n<p>    To validate the MCPS allele frequencies, we downloaded the gnomAD v.3.1 reference dataset (<a href=\"https:\/\/gnomad.broadinstitute.org\" rel=\"nofollow\">https:\/\/gnomad.broadinstitute.org<\/a>) and retained only high-quality variants that passed QC (FILTER=PASS), are SNVs, lie outside low-complexity regions and have a number of called samples greater than 50% of the total sample size (n=76,156). We additionally overlapped gnomAD variants with TOPMed Freeze 8 high-quality variants (FILTER=PASS) (<a href=\"https:\/\/bravo.sph.umich.edu\/freeze8\/hg38\" rel=\"nofollow\">https:\/\/bravo.sph.umich.edu\/freeze8\/hg38<\/a>).
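The ancestry-weighted allele count, allele number and frequency defined earlier in this section translate directly into code. The toy probabilities and genotypes below are illustrative assumptions.

```python
# Sketch of ancestry-weighted allele counts/frequencies. p[i, j, k] is the
# probability that allele j of individual i derives from population k;
# G[i, j] is the 0/1 non-reference-allele indicator. Toy values only.
import numpy as np

def ancestry_frequencies(p, G):
    AC = np.einsum('ijk,ij->k', p, G)   # AC_k = sum_ij p_ijk * G_ij
    AN = p.sum(axis=(0, 1))             # AN_k = sum_ij p_ijk
    return AC, AN, AC / AN              # theta_k = AC_k / AN_k

# Two individuals, two alleles each, two populations
p = np.array([[[1.0, 0.0], [1.0, 0.0]],   # individual 0: both alleles pop 0
              [[0.0, 1.0], [0.5, 0.5]]])  # individual 1: mixed ancestry
G = np.array([[1, 0],
              [1, 1]])
AC, AN, theta = ancestry_frequencies(p, G)
# effective sample sizes per population: n_k = AN / 2
```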
We further merged gnomAD variants and MCPS exome variants by chromosome, position, reference allele and alternative allele, and excluded MCPS singletons that were heterozygous in ancestry. This process resulted in 2,249,986 overlapping variants available for comparison with the MCPS WES data. Median sample sizes in the gnomAD non-Finnish European, African\/Admixed African and Admixed American populations were 34,014, 20,719 and 7,639, respectively.  <\/p>\n<p>    To investigate the effect of relatedness on allele frequency estimates, we implemented a method to compute relatedness-corrected allele frequencies using identical-by-descent (IBD) segments. This method computes allele frequencies at a locus by clustering alleles inherited IBD from a common ancestor, then counting alleles once per common ancestor rather than once per sample. Because IBD sharing is affected by both demography and relatedness, we limited IBD sharing to segments between third-degree relatives or closer. Conceptually, this is equivalent to tracing the genealogy of a locus back in time across all samples until no third-degree relatives remain, then computing allele frequencies in the ancestral sample.  <\/p>\n<p>    We estimated allele frequencies in two steps. First, we constructed a graph based on IBD sharing at a locus. Second, we estimated allele counts and allele numbers by counting the connected components of the IBD graph. Our approach is similar to the DASH haplotype clustering approach40, but we make different assumptions about how errors affect the IBD graph and additionally compute ancestry-specific frequencies using local ancestry inference estimates.  <\/p>\n<p>    To construct the IBD graph, suppose we have genotyped and phased N diploid samples at L biallelic loci. For each locus l we construct an undirected graph \({G}_{l}=({V}_{l},{E}_{l})\) describing IBD sharing among haplotypes.
Let the tuple \({(i,j)}_{l}\) represent haplotype j of sample i at locus l, and let \({h}^{{(i,j)}_{l}}\in \{0,1\}\) be the allele itself. Define  <\/p>\n<p>      $$\begin{array}{l}{V}_{l}=\{{(i,j)}_{l}:1\le j\le 2\,{\rm{and}}\,1\le i\le N\}\\ {E}_{l}=\{({(i,j)}_{l},{(s,t)}_{l}):{h}^{{(i,j)}_{l}}\,{\rm{and}}\,{h}^{{(s,t)}_{l}}\,{\rm{are}}\,{\rm{IBD}}\}\end{array}$$    <\/p>\n<p>    In words, the set of vertices \({V}_{l}\) comprises all haplotypes at locus l, and each edge in \({E}_{l}\) joins a pair of haplotypes that fall on the same IBD segment (Supplementary Fig. 25).  <\/p>\n<p>    If IBD segments are observed without error, then each maximal clique of \({G}_{l}\) represents a set of haplotypes descended from a common ancestor. In practice, edges will be missing owing to errors in IBD calling, so what we observe are connected components rather than maximal cliques. Because we limited edges to pairs of third-degree relatives or closer, we assumed that missing edges within connected components are false negatives and included them. We additionally removed edges between haplotypes for which the observed alleles conflicted.  <\/p>\n<p>    Given an IBD graph \({G}_{l}=({V}_{l},{E}_{l})\) for a locus l, we estimated alternative allele counts and allele numbers by counting the connected components of the graph. Let \({C}_{l1},\ldots ,{C}_{lm}\) be the connected components of \({G}_{l}\). Let \({C}_{{\rm{ALT}}}=\{{C}_{lm}:\text{haplotypes in }{C}_{lm}\text{ have the ALT allele}\}\) and \({C}_{{\rm{REF}}}=\{{C}_{lm}:\text{haplotypes in }{C}_{lm}\text{ have the REF allele}\}\). Then  <\/p>\n<p>      $$\begin{array}{l}{\rm{AC}}=|{C}_{{\rm{ALT}}}|\\ {\rm{AN}}=|{C}_{{\rm{ALT}}}|+|{C}_{{\rm{REF}}}|\\ {\rm{AF}}={\rm{AC}}/{\rm{AN}}\end{array}$$    <\/p>\n<p>    We additionally used LAI estimates to compute ancestry-specific frequencies. Let \({p}^{{(i,j)}_{l}}\in {{\mathbb{R}}}^{K}\) be the vector of probabilities that an allele on haplotype j from sample i at locus l comes from one of K populations.
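The connected-component allele counting can be sketched with a small union-find. The haplotype labels, edge list and function name below are illustrative assumptions, not from the paper.

```python
# Sketch of IBD-graph allele counting at one locus: haplotypes sharing an
# IBD segment are merged into components, and each component is counted
# once. Edges between conflicting alleles are dropped, as in the text.
from collections import defaultdict

def ibd_allele_frequency(alleles, ibd_edges):
    """alleles: dict haplotype -> 0/1; ibd_edges: haplotype pairs whose
    IBD segment covers the locus."""
    parent = {h: h for h in alleles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in ibd_edges:
        if alleles[a] == alleles[b]:        # discard conflicting edges
            parent[find(a)] = find(b)

    components = defaultdict(list)
    for h in alleles:
        components[find(h)].append(h)
    alt = sum(1 for c in components.values() if alleles[c[0]] == 1)
    ref = len(components) - alt
    AC, AN = alt, alt + ref                 # AC = |C_ALT|, AN = |C_ALT|+|C_REF|
    return AC, AN, AC / AN

# Four haplotypes; (A1, B1) share the ALT allele identical by descent
alleles = {'A1': 1, 'B1': 1, 'A2': 0, 'B2': 0}
AC, AN, AF = ibd_allele_frequency(alleles, [('A1', 'B1')])
# Naive frequency would be 2/4; the IBD-aware frequency is 1/3
```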
For each connected component, we averaged the local ancestry estimates:  <\/p>\n<p>      $${\bar{p}}_{{C}_{lm}}=\frac{1}{|{C}_{lm}|}\sum _{{(i,j)}_{l}\in {C}_{lm}}{p}^{{(i,j)}_{l}}$$    <\/p>\n<p>    We computed a vector of weighted allele counts W and allele numbers N by  <\/p>\n<p>      $$W=\sum _{C\in {C}_{{\rm{ALT}}}}{\bar{p}}_{C},\quad N=\sum _{C\in {C}_{{\rm{ALT}}}}{\bar{p}}_{C}+\sum _{C\in {C}_{{\rm{REF}}}}{\bar{p}}_{C}$$    <\/p>\n<p>    Ancestry-specific frequencies were estimated by dividing each component of W by the corresponding component of N.  <\/p>\n<p>    For singletons for which the phasing of haplotypes was unknown, we averaged the local ancestry estimates from the two haplotypes of the sample.  <\/p>\n<p>    To generate source datasets for assessing the trans-ancestry portability of a BMI PRS, whole-genome regression was performed using Regenie (<a href=\"https:\/\/rgcgithub.github.io\/regenie\/\" rel=\"nofollow\">https:\/\/rgcgithub.github.io\/regenie\/<\/a>) in individuals in the MCPS and in a predominantly European-ancestry cohort from the UK Biobank. Individuals with type 2 diabetes (ICD-10 code E11 or self-reported) were excluded. BMI values underwent rank-based inverse normal transformation (RINT) by sex and ancestry; models were additionally adjusted for age, age² and technical covariates (UK Biobank). The Regenie summary statistics from the UK Biobank were used to generate a BMI PRS in the MCPS; conversely, MCPS summary statistics were used to generate a BMI PRS in the UK Biobank.  <\/p>\n<p>    To avoid overfitting with respect to the selection of a PRS algorithm and its associated tuning parameters, LDpred (<a href=\"https:\/\/github.com\/bvilhjal\/ldpred\" rel=\"nofollow\">https:\/\/github.com\/bvilhjal\/ldpred<\/a>) with a ρ value of 1 was chosen, following a recent publication on BMI and obesity27.
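The rank-based inverse normal transformation applied to BMI above is simple to sketch. The offset constant below is the common Blom choice and is an assumption, since the paper does not state which offset was used; in the study the transform is applied separately by sex and ancestry.

```python
# Sketch of rank-based inverse normal transformation (RINT): replace each
# value by the normal quantile of its offset rank. c = 3/8 is the Blom
# offset (an assumption here, not stated in the paper).
import numpy as np
from scipy.stats import norm, rankdata

def rint(x, c=3.0 / 8):
    x = np.asarray(x, dtype=float)
    ranks = rankdata(x)                              # average ranks for ties
    return norm.ppf((ranks - c) / (len(x) - 2 * c + 1))

bmi = np.array([19.5, 24.1, 27.8, 31.2, 35.0])
z = rint(bmi)                                        # monotone, mean ~0
```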
Summary statistics were restricted to HapMap3 variants and followed existing filtering recommendations. In the MCPS, two PRS values were generated, with imputed variants obtained from either the MCPS10k reference panel or the TOPMed panel. In the UK Biobank data, PRS values were calculated separately by continental ancestry (African, East Asian, European, Latino, South Asian), determined from a likelihood-based inference approach8 in a merged dataset of variants from the UK Biobank and the 1000 Genomes project.  <\/p>\n<p>    To evaluate PRS performance, BMI values were transformed (RINT) by sex and ancestry and regressed on PRS, age and age². As for the generation of summary statistics, individuals with diabetes were excluded from the analysis. PRS accuracy was assessed by incremental R² (the proportional reduction in the regression sum of squared errors between models with and without the BMI PRS). Additionally, raw BMI values were modelled with PRS, age, age², sex and ancestry to obtain effect-size estimates per standard deviation of the BMI PRS. The impact of ancestry differences between the source summary statistics and the target PRS population was assessed with two approaches. For the MCPS, individuals were divided into quantiles of estimated Indigenous Mexican ancestry using the LAI approach described above. For the UK Biobank, metrics were calculated within each 1000 Genomes-based continental ancestry.  <\/p>\n<p>    The MCPS represents a long-standing scientific collaboration between researchers at the National Autonomous University of Mexico and the University of Oxford, who jointly established the study in the mid-1990s and have worked together on it ever since. Blood sample collection and processing were funded by a Wellcome Trust grant to the Mexican and Oxford investigators. However, at the time, no funding was requested to create an appropriate long-term sample storage facility in Mexico City.
Therefore, the Mexican investigators agreed for the samples to be shipped to Oxford, where they could be stored in a liquid-nitrogen sample storage archive (funded by the UK Medical Research Council and Cancer Research UK) that had previously been established by the Oxford team, on the understanding that control of the samples remained with the Mexican investigators. The shipping of blood samples from Mexico to the United Kingdom was approved by the Mexican Ministry of Health, and the study was approved by scientific and ethics committees within the Mexican National Council of Science and Technology (0595 P-M), the Mexican Ministry of Health and the Central Oxford Research Ethics Committee (C99.260). Although appropriate facilities to store the samples now exist in Mexico City, the Mexican investigators have decided that the costs of sending them back to Mexico exceed the benefits of having closer access to them. Study participants gave signed consent in keeping with accepted ethical practices at the time for observational cohort studies. The baseline consent form stated that their blood samples would be stored and used in the future for unspecified research purposes (with a specific statement that this would include future analysis of genetic factors) and that it would probably be many years before such blood analyses were done. The MCPS consent form also stated that the research was being done in collaboration with the University of Oxford and that the purpose of the study was to benefit future generations of Mexican adults.
In 2019, the Mexican and Oxford investigators jointly agreed to allow the extracted DNA to be sent to the Regeneron Genetics Center, after Regeneron had offered to genotype and exome sequence the entire cohort (thereby creating the resource now available for future research by Mexican scientists; see the Data Availability section) in exchange for sharing the other data with them for the purpose of performing joint collaborative genetic analyses. Formal approval to share MCPS data with commercial institutions was sought and obtained from the Medical Ethics Committee of the National Autonomous University of Mexico (FMED\/CEI\/MHU\/001\/2020). Major discoveries from the study have been disseminated through open-access scientific publications, local and international scientific meetings, press releases, social media and local television, but direct communication of study results to the original study participants is unfortunately not practical, as no telephone numbers or email addresses were collected at recruitment. As in other prospective cohort studies (such as the UK Biobank), it was agreed that there would be no feedback of individual blood results to participants, as it has been shown that such feedback can do more harm than good, whereas no feedback ensures that this is not the case.  <\/p>\n<p>    Recruitment of individuals in the MAIS cohort was done with the approval of the leaders of the Indigenous communities and with the support of the National Commission for the Development of Indigenous Communities of Mexico (CDI), now the Instituto Nacional de los Pueblos Indígenas (INPI). All participants provided written informed consent, and authorities or community leaders participated as translators where necessary. The consent form described how findings from the study may have commercial value and be used by for-profit companies.
Sample collection for MAIS was approved by the Bioethics and Research Committees of the Instituto Nacional de Medicina Genómica in Mexico City (protocol numbers 31\/2011\/I and 12\/2018\/I). Preliminary data from the MAIS cohort have been discussed with the Indigenous leaders and the volunteer individuals included in the study, explaining the meaning of the findings for health and population history and the potential use of the data in future collaborations.  <\/p>\n<p>    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.  <\/p>\n<p>Link:<br \/>\n<a target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41586-023-06595-3\" title=\"Genotyping, sequencing and analysis of 140,000 adults from Mexico ... - Nature.com\" rel=\"noopener\">Genotyping, sequencing and analysis of 140,000 adults from Mexico ... - Nature.com<\/a><\/p>