Exploring correlations in genetic and cultural variation across language families in northeast Asia – Science Advances

Posted: August 22, 2021 at 3:19 pm

Abstract

Culture evolves in ways that are analogous to, but distinct from, genomes. Previous studies examined similarities between cultural variation and genetic variation (population history) at small scales within language families, but few studies have empirically investigated these parallels across language families using diverse cultural data. We report an analysis comparing culture and genomes from in and around northeast Asia spanning 11 language families. We extract and summarize the variation in language (grammar, phonology, lexicon), music (song structure, performance style), and genomes (genome-wide SNPs) and test for correlations. We find that grammatical structure correlates with population history (genetic history). Recent contact and shared descent fail to explain the signal, suggesting relationships that arose before the formation of current families. Our results suggest that grammar might be a cultural indicator of population history while also demonstrating differences among cultural and genetic relationships that highlight the complex nature of human history.

The history of our species has involved many examples of large-scale migrations and other movements of people. These processes have helped shape both our genetic and cultural diversity (1). While humans are relatively homogeneous genetically, compared to other species, there are subtle population-level differences in genetic variation that can be observed at different geographical scales (2). Furthermore, while there are universal features of human behavior [e.g., all known societies have language and music (3)], our cultural diversity is immense. For example, we speak or sign more than 7000 mutually unintelligible languages (4), and for each ethno-linguistic group, there tend to be many different musical styles (5). Researchers have long been interested in reconstructing the history of global migrations and diversification by combining historical and archeological data with patterns of present-day biological and cultural diversity. Going back as far as Darwin, many researchers have argued that cultural evolutionary histories will tend to mirror biological evolutionary histories (6–9). However, differences in the ways that cultural traits and genomes are transmitted mean that genetic and cultural variation may be explained by different historical processes (10–15). Major advances in both population genetics and cultural evolution since the second half of the 20th century now allow us to test these ideas more readily by matching genetic and cultural data (10, 16).

The cultural evolution of language has proven particularly fruitful for understanding past population history (genetic history statistically inferred from genetic variations) (17–19). A classic approach involves identifying and analyzing sets of homologous (cognate) words among languages. This lexical approach allows the reconstruction of evolutionary lineages and relationships within a single language family, such as Austronesian (20) or Indo-European (17, 18). However, lexical methods cannot usually be applied to multiple language families (19), as they do not share robustly identifiable cognates due to a time limit of approximately 10,000 years, after which phylogenetic signals are generally lost (20, 21). An alternative approach is to study the distribution of features of grammar and phonology, such as the relative order of word classes in sentences or the presence of nasal consonants. Structural data in language tend to evolve too fast to preserve phylogenetic signals of language families (22, 23), and the history of lexica and structure might be partially independent as, for example, in the emergence of creole languages (12). However, the geographical distribution of language structure often points to contact-induced parallels in the evolution of entire sets of language families beyond their individual time depths (24, 25).

Yet language is only one out of many complex cultural traits that could serve as a proxy for deep history. It has been proposed that music may preserve even deeper cultural history than language (26–29). Standardized musical classification schemes (based on features such as rhythm, pitch, and singing style) can be used to quantify patterns of musical diversity among populations for the sake of comparison with genetic and linguistic differences (26, 27, 29). Among indigenous Taiwanese populations speaking Austronesian languages, these analyses revealed significant correlations between music, mitochondrial DNA, and the lexicon (27), suggesting that music may preserve population history. However, whether these relationships extend beyond the level of language families remains unknown.

To address this gap, we focus on populations in and around northeast Asia (Fig. 1). Northeast Asia provides a useful test region because it contains high levels of genetic and cultural diversity, including a large number of small language families or linguistic isolates (e.g., Tungusic, Chukotko-Kamchatkan, Eskimo-Aleut, Yukaghir, Ainu, Nivkh, Korean, and Japanese). Crucially, while genetic and linguistic data throughout much of the world have been published, northeast Asia is the only region for which published musical data allow direct matched comparison of musical, genetic, and linguistic diversity (30, 31).

Because some of the areas overlap in space, they are plotted in two separate maps.

Here we use these matched comparisons to test competing hypotheses about the extent to which different forms of cultural data reflect population history at a level beyond the limits of language families. Specifically, we aim to test whether patterns of cultural evolution are significantly correlated with patterns of genetic evolution (population history), and if so, whether music or language [lexicon (32), grammar (33, 34), or phonology (34–36)] would show the highest correlation with patterns of genetic diversity, after controlling for the influence of recent contact between languages (spatial autocorrelation) and shared inheritance within individual language families.

We selected all available populations from in and around northeast Asia (14 populations, encompassing 11 language families/isolates) for which all four sources of data [genome-wide single-nucleotide polymorphisms (SNPs), grammars, phonology, and music] were available (Fig. 1; Materials and Methods) (29). For genetic data, we newly genotyped 22 Nivkh individuals from Sakhalin Island in Russia using the Illumina Human Omni 2.5-8 BeadChip array (Materials and Methods). First, we investigated the similarity between populations in each of the dimensions of inquiry. For this purpose, we used split networks (37), which display multiple sources of similarity in a consistent manner (Fig. 2, figs. S12 to S16, and tables S2 to S6). Distance analysis of lexical data resulted in a network topology with an overall star-shaped structure (Fig. 2C). Exceptions are given by the three pairs of languages that are related to one another and that stand out as proximate (Even and Evenki both belong to the Tungusic family, Chukchi and Koryak both belong to the Chukotko-Kamchatkan family, and Selkup and Nganasan both belong to the Uralic family) (4). The results of this distance analysis are consistent with the fact that lexical material is able to detect relationships within language families, but cannot resolve historical relations between families.

Colors indicate language families: Selkup and Nganasan both belong to Uralic; Even and Evenki to Tungusic; and Koryak and Chukchi to Chukotko-Kamchatkan.

Distance analyses of grammatical, phonological, genetic, and musical data reveal potentially more informative structure. In agreement with the claim that language structure does not identify family relationships (20, 22), the clustering emerging from the distances does not generally coincide with language families, except for Chukotko-Kamchatkan (Chukchi and Koryak) in genetics and phonology [where the within-family distance dfam is smaller than the distance dnun to the next unrelated neighbor, relative to the total distance range: genetics dfam = 0.15 < dnun = 0.26; phonology dfam = 0.28 < dnun = 0.36 (Supporting Information 1, section 4.1)], and marginally for Tungusic (Even and Evenki) in grammar (dfam = 0.22 < dnun = 0.28). Most of the clustering instead points to interfamily relations: for example, Korean and Japanese are neighbors in the networks based on grammar, SNPs, and music, but not phonology (38). Buryat and Yakut are close together in SNPs (39), grammar, and phonology, but not in music. The music-based network is consistent with a previous study showing the uniqueness of Ainu music and a distinction of East Asian music from circumpolar music based on cluster analysis of musical components (29). Nivkh shows different patterns for each factor. For example, Nivkh is genetically closer to Korean, Japanese, and Buryat than to the other populations and shows the second highest affinity with Ainu of all populations in the distance matrix (table S3), reflecting the tree's branch position. However, music, grammar, and phonology do not follow these relationships in Nivkh.

Together, these results suggest that neither the population history nor the cultural features (other than the lexicon) evolved by simple vertical descent along language families. Instead, apart from the possible case of Chukotko-Kamchatkan, they might have each followed independent trajectories. While this challenges the idea of a unified phylogeny, it leaves open the possibility that some of the features are associated with each other because they trace back to a prehistoric maze of horizontal and vertical transmission. In other words, features might still be associated with each other because they were present in the same period(s) and places in which people were in contact and/or were genetically related. To find out whether any such association is still detectable today, we implemented a redundancy analysis (RDA) on the principal components (or coordinates) of the data (Materials and Methods and Supporting Information 1). RDA summarizes the variation in a response variable that can be explained by an explanatory variable and finds directed associations. The RDA reveals two associations that are significant under a permutation test (Fig. 3): Grammatical similarity predicts genetic similarity (grammar → genetics, adjusted R2 = 0.64), and genetic similarity predicts grammatical similarity (genetics → grammar, adjusted R2 = 0.54).

Variance in the response explained by each explanatory variable; * indicates a significant association (P ≤ 0.05).

While both associations possibly reflect deep-time correspondences, dating back to before the formation of current language families (as identifiable by cognate words), spatial proximity and contact between societies might lead to similar patterns of association that are relatively recent and shallow. To find out, we evaluated three possible scenarios to explain the signal in the data: (i) Recent contact scenario: The associations reflect recent and current contact and, hence, can be explained by spatial autocorrelation in the current data; that is, societies that are currently close to each other tend to have similar grammars and population history. (ii) Inheritance scenario: The associations reflect common ancestry. The associations result from vertical descent within the remaining linguistic families for which our sample contains more than one member (Tungusic, Chukotko-Kamchatkan, and Uralic). (iii) Deep-time correspondence scenario: The associations reflect a nonshallow correspondence between grammar and genetics that cannot be explained by recent contact or phylogenetic inheritance within known families.

To distinguish between the three scenarios, we treated spatial proximity and inheritance as potential confounds and carried out a partial RDA to control their effect (Supporting Information 1, section 5). As societies and languages placed far from the equator tend to display larger spatial ranges (40), we represented the territory of each society with areas rather than points and sampled random spatial locations from within these areas. The partial RDA reveals strong evidence against the recent contact scenario: Spatial proximity fails to explain both associations (figs. S18 to S20). When controlling for spatial autocorrelation (1000 random samples allowing for the uncertainty of people's locations), the observed explained variance is still greater than that of random permutations [normalized differences between observed and permuted explained variance z > 1 SD in more than 99% of spatial samples; Kullback-Leibler divergence (KLD) > 3; fig. S20 and table S7]. When controlling for both recent contact and phylogenetic inheritance of language in partial RDA, both associations still show stronger evidence than the other relationships (z > 1 SD in 90% of samples, KLD 1.5; Fig. 4, figs. S21 to S23, and table S8). Our analysis reveals no other associations at comparable strengths; there are a few weak signals (e.g., grammar, music, and phonology; Fig. 2), but they all disappear once we control for both spatial autocorrelation and genealogy (Fig. 4 and table S8), suggesting that any patterns here are likely to stem from recent contact and family-specific lines of inheritance.

Numbers to the right of the dashed line show the proportion of samples with a difference of at least one SD. Gray shading reflects the KLD between the observed and permuted adjusted R2. The KLD is transparent for associations where the z-normalized difference is negative for more than 50% of the samples.

Given the relatively small sample of only 14 groups, we evaluate the robustness of the grammar/genetics associations through three types of sensitivity analyses. First, we varied the number of principal components (or coordinates) passed to the RDA and, thus, the amount of variance in both the response and the predictor. Different thresholds of how much variance a component needs to explain to be included (10%, 15%, and 18%) show little effect on the results (z > 1 SD in at least 84%, KLD > 1.2; figs. S24 and S25 and table S9). Second, we varied the language sample passed to the RDA. While most languages have little to no effect on the signal, this is not true for Ainu, as removing Ainu from the analysis weakens the support for the associations of grammar and genetics (z > 1 SD in only 14 to 31%, KLD 0.2, when controlling for spatial proximity and inheritance; figs. S26 to S29 and tables S10 and S11). Third, in the partial RDA, some spatial samples happen to explain the variance in the response better than others (lower tail of observed adjusted R2 in figs. S21 and S22). Spatial clusters of locations with low adjusted R2 might indicate recent language contact (see section 5.4, Supporting Information 1), and clusters with high adjusted R2 might indicate that systematic outliers influence the signal. We mapped locations in the 0.2 (figs. S30 and S31) and 0.8 percentile (figs. S32 and S33). We find only weak and partial clustering in the high percentile, and none in the low percentile. This suggests that neither recent contact nor systematic outliers explain the signal.

To summarize, we found significant correlations between genetics and grammar by the basic RDA using the complete set of genomes, music, and language in northeast Asia. The partial RDA controlling for geography and linguistic inheritance as well as sensitivity analyses suggest that the relationships may trace back to earlier relationships between languages before the recent contacts and inheritance.

We have simultaneously explored the relations among genetic, linguistic, and musical data beyond the level of language families. We find remarkable evidence for the relationships between population history and grammatical similarity, while genomes and grammar might be influenced by different evolutionary forces, such as a difference between mating systems and cultural transmission (13).

A possible interpretation of our findings is that the relationship between grammar and population history was exceptionally well preserved over the recent contact beyond language families, regardless of whether or not the evolutionary mechanisms of grammar are the same as those of genomes. Population genetics detects gene flows between populations beyond phylogenetic relationships. Our dataset covers a phylogenetically broad range of populations: three lineages to the present-day East Eurasian (Ainu, East Asian, and northeast Asian) and one to North American (Greenlandic Inuit) (41), including gene flows beyond the lineages, such as Japanese-Ainu (38) and Buryat-Yakut (39). While the evolutionary forces that influence population history are fairly well understood, determining to what extent the genetic relationships of particular populations reflect shared ancestry versus prehistoric contact in culture is still challenging. Moreover, the evolutionary processes that influence culture and language are under debate (14) but can obviously be very different from those influencing genomes. For example, cultural replacement and language shift can occur even within a single generation due to colonization or other sociopolitical factors, like warfare and cultural expansion (15, 42). Our results removing the influence of the proximity in cultural similarities give support to the notion that these different data reveal different historical patterns, yet show that some cultural features can still preserve relationships extending even beyond the boundaries of language families. The similarities in grammar do not arise from simply following the genetic phylogeny (see Fig. 2D, which lacks the Korean-Japanese-Nivkh-Ainu and Koryak-Chukchi-West Greenlandic clusters in Fig. 2A). Instead, they are likely to reflect a complex interplay of partially independent vertical and horizontal transmission in prehistory.

This pattern is markedly different for the lexicon that traces language families but does not reveal higher-level relationships in our dataset (Fig. 2). This contrasts with expectations from historical linguistics (22) and also from recent findings that suggest that grammar evolves faster than the lexicon in Austronesian (23) and also shows rapid evolution in Indo-European (43); for example, while English and Hindi preserve many cognate words (name versus nām, hand versus hāth, etc.), they differ substantially in word order (verb-medial versus verb-final) and case-marking (invariable nouns versus complex case system). However, these findings bear on grammatical evolution within families, while our approach seeks to unravel a shared history that allows early contact between families. Therefore, our findings are compatible with a scenario where specific traits (e.g., word order) evolved rapidly within families but were repeatedly copied and readapted, yielding a relatively uniform profile over a prehistoric period (44) that mirrors the genetic network of the same period.

The statistical power to detect a signal is weakened when Ainu is removed in the sensitivity analysis (figs. S26 to S29 and table S10). While this might suggest a special position of Ainu in the northeast Asian context (45), we need larger samples of languages and populations inside and outside of the region to resolve this question.

Our results are qualitatively different from the only previous study to quantitatively compare genetic, linguistic, and musical relationships (27). Among indigenous Austronesian-speaking populations in Taiwan, music was significantly correlated with genetics but not language, while we find here that music is not robustly associated with either language or genetics. However, there are several methodological differences that might underlie these differences. In particular, the two studies looked at different types of data (genome-wide SNPs, structural linguistic features, and both group and solo songs here versus mitochondrial DNA, lexical data, and only group songs previously). Further research with larger samples and different types of data may help to elucidate general relationships among language, music, and genetics.

Recent studies highlight northeast Asian populations as one of the major genetic components of basal East Eurasians (46). The high linguistic diversity in northeast Asia may reflect prehistoric relationships that were less influenced by agricultural populations because of geographic barriers, as hypothesized in previous studies (24, 47). However, our knowledge about relationships between culture and local population history is limited in northeast Asia. In addition to revealing an association between genetic and grammatical patterns, our results also reveal complex dissociations in which these data reflect different local histories, potentially including cultural shift. For example, while previous studies suggest specific genetic and cultural relationships between Korean and mainland Japanese populations (38) or posit a shared origin (48, 49), our findings support similarities in SNPs, music, and grammar, but not in lexicon and phonology (Fig. 2 and Supporting Information 1) (50). Although the Ainu show particular genetic similarity to the Japanese, their music clusters more closely with that of the Koryak (Fig. 2 and tables S3 and S4). This may reflect different levels of genetic, linguistic, and musical exchange at different points of history. Musical patterns may reflect more recent cultural diffusion and gene flow from the Okhotsk and other circumpolar populations that interacted with the Ainu from the north within the past 1500 years (51), as we previously proposed in our triple structure model of Japanese archipelago history (29). The newly genotyped Nivkh samples are close to Ainu in SNPs but not in the other data types (Fig. 2A), suggesting historical relationships in the coastal region of northeast Asia. Nivkh might be a key population connecting Ainu and other northeast Asians; however, the population history of Nivkh is not well understood. Thus, Neighbornet trees might reflect the relationships linking populations, but further analyses are necessary to investigate, in more detail, the local population history and cultural relationships in northeast Asia including Nivkh. Most pressingly, future research will need a larger sample of societies and a richer coding of their cultural traits.

In conclusion, we have demonstrated a relationship between grammar and genome-wide SNPs across a variety of diverse northeast Asian language families. Our results suggest that grammatical structure may reflect population history more closely than other cultural (including lexical) data, but we also find that different aspects of genetic and cultural data reveal different aspects of our complex human histories. In other words, cultural relationships cannot be completely predicted by human population histories. Alternative interpretations of these mismatches would be historical events (e.g., language shift in local history) or culture-specific evolution independent from genetic evolution. Future analyses of these relationships at broader scales using more explicit models should help improve our understanding of the complex nature of human cultural and genetic evolution.

Selection of populations in this study. We selected 14 populations for which matching musical (Cantometrics/CantoCore), genetic (genome-wide SNP), and linguistic (grammatical/phonological features) data were available (tables S1 and S13 and Fig. 1). These represented a subset of 35 northeast Asian populations whose musical relationships were previously published and analyzed in detail (29). Linguistically, these 14 populations fall into 11 language families/isolates (4). Korean, Ainu, Nivkh, and Yukaghir are language isolates. Buryat, Japanese, Yakut, and West Greenland Inuit are the sole representatives in our sample of the Mongolic, Japonic, Turkic, and Eskimo-Aleut language families, respectively. The remaining languages are classified into three language families: Koryak and Chukchi are Chukotko-Kamchatkan languages; Even and Evenki are Tungusic languages; and Selkup and Nganasan are Uralic languages. Note that the need to assemble matching genetic, linguistic, and musical data meant that some important populations could not be included (e.g., we had matching musical and genetic data for multiple Ryukyuan populations, but no corresponding grammatical data were available, while for the Aleut genetic and linguistic data were available but not musical). Future research should attempt to collect new data to allow more complete comparisons within and between language families.

Music data. All music data and metadata are detailed in our previous report of circumpolar music (29). For the present analysis, we used a subset of 14 of the original 35 populations with matching genetic and linguistic data; these 14 populations are represented by 264 audio recordings of traditional songs. Each song was analyzed manually by P.E.S. using the same 41 classification characters used in (30) [from Cantometrics (29) and CantoCore (52)].

We used the DNAs of Nivkh maintained by the Asian DNA Repository Consortium (ADRC). The DNA samples were originally collected in Sakhalin, Russia by S. Horai in the 1990s (53) and were kept at 4°C in Sokendai. We genotyped 32 Nivkh individuals (14 females and 18 males) with the Illumina Omni 2.5-8 BeadChip Array at the National Center for Global Health and Medicine (table_S16_SampleID_Nivkh.xlsx). Two DNA samples were removed because of their poor quality. We selected 2,246,124 sites for SNPs with a call rate greater than 95%. Using PLINK (54), we performed a Hardy-Weinberg equilibrium test to exclude sites with P < 10⁻⁶, resulting in 2,246,123 sites. Then, we calculated inbreeding coefficients using sites with minor allele frequency (MAF) > 0.01, confirming that none of the cousin equivalents exceeded F = 0.0625. Using the same threshold of MAF, we found kinship between 12 pairs (involving 14 individuals) with PI_HAT > 0.125 (third-degree relative or closer). Eight samples were removed; 22 individuals thereby passed the quality control and kinship tests. Then, we carried out strand checks between the Illumina Human Omni 2.5-8 BeadChip SNPs and JPT + CHB in 1000 Genomes using BEAGLE 4.0 (55). In the Nivkh data, 2,041,779 sites passed the strand check and 114,077 sites were flipped using PLINK. After the strand check, all sites that did not have an allele match were removed. We converted the Illumina unique IDs to rsIDs.

Publicly available genome-wide SNP array data for 14 populations, including three Nivkh individuals (table S1) (38, 56–59), were obtained and curated as follows. As several genotyping platforms were used, to avoid discordancy of alleles on +/− strands, we used the strand check utility in BEAGLE for a dataset of Ainu against JPT and CHB in 1000 Genomes. To obtain shared SNPs among different platforms, genotype datasets including our Nivkh data were merged into a single dataset in PLINK file format by PLINK.

We manually removed outlier individuals from the merged dataset based on results of principal components analysis (PCA) and ADMIXTURE (60–62). Last, we used 15 individuals of Nivkh (13 individuals from our data and 2 individuals from public data) in the population genetics analysis (tables S1 and S16). The final merged genotype dataset included 245 individuals and 37,093 SNPs (total genotyping rate was 0.999). The merged dataset in PLINK format was converted to Genepop format using PGDSpider (63).

We measured lexical distances between those words in the ASJP (Automated Similarity Judgment Program) database v. 19 (32) that have best coverage in our sample, corresponding to 40 concepts that are attested in at least 74% of all word lists. These correspond to the concepts commonly thought to be most stable over time (64) and to best reflect language relatedness, at least as a first approximation (Supporting Information 3) (65).

We combined data on grammatical and phonological traits from AUTOTYP (34, 66), WALS (33), the ANU Phonotactics database (35), and PHOIBLE (36) and extracted a set of 25 grammar and 87 phonological features with coverage more than 80% in each language, and in most cases 100% (Supporting Information 2 and table S13).

In contrast to population history, standardized methods for modeling cultural evolution across different types of data are not yet established. Therefore, we matched population history to cultural similarities to analyze both genetic and cultural data in a common framework. We obtained distance matrices representing differences between populations/languages for a subsequent comparative analysis using the following procedures for music and language, because musical and linguistic (grammatical and phonological) data have different data structures.

Genetic analysis. To estimate population differentiations, pairwise Fst values between populations were calculated with Genepop version 4.2 (67). Pairwise Fst is the proportion of the total genetic variance due to between-population differences, and is a convenient measure because it does not depend on the actual magnitude of the genetic variance. In other words, genetic markers that evolve slowly are expected to have the same Fst value as markers that evolve more rapidly, because the total variance is decomposed into within-population and between-population components.
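The variance-partition idea behind Fst can be illustrated with a small sketch. This is not the estimator that Genepop implements: it is a simplified Nei-style ratio, Fst = (H_T − H_S) / H_T, that assumes biallelic sites, two populations of equal weight, and known allele frequencies with no sample-size correction. The function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def pairwise_fst(freqs_a, freqs_b):
    """Simplified Nei-style Fst between two populations (assumption: equal weights).

    freqs_a, freqs_b: per-site frequencies of the same reference allele.
    Returns (H_T - H_S) / H_T, computed as a ratio of sums across sites.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations must be scored at the same sites")
    h_s_sum = 0.0  # within-population heterozygosity, summed over sites
    h_t_sum = 0.0  # heterozygosity of the pooled population, summed over sites
    for p_a, p_b in zip(freqs_a, freqs_b):
        h_s_sum += (heterozygosity(p_a) + heterozygosity(p_b)) / 2.0
        h_t_sum += heterozygosity((p_a + p_b) / 2.0)
    if h_t_sum == 0.0:
        return 0.0  # no variation at all, so nothing to partition
    return (h_t_sum - h_s_sum) / h_t_sum
```

Because the numerator and denominator are both heterozygosities, the ratio stays between 0 (identical frequencies) and 1 (fixed differences) whatever the overall level of variation, which is the scale-independence the paragraph describes.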

Music analysis. A previously published matrix of pairwise distances among all 283 songs was calculated using normalized Hamming distances (68) to calculate the weighted average similarity across all 41 musical features (29). This distance matrix was then used to compute a distance matrix of pairwise musical Φst values among the 14 populations using Arlequin (69) and the lingoes function of the ade4 package in R (70). Φst is analogous to Fst but takes into account distances between individual items, making it more appropriate for analysis of cultural diversity (68, 70). Further details concerning the calculations can be found elsewhere (70).
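A normalized Hamming distance between two coded songs can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: each feature is treated as a single categorical code, missing codings are represented by None and skipped, and the optional weights argument is hypothetical.

```python
def normalized_hamming(song_a, song_b, weights=None):
    """Weighted share of features on which two coded songs differ (0 = identical).

    song_a, song_b: equal-length sequences of feature codes; None marks a
    missing coding and that feature is skipped (assumption).
    weights: optional per-feature weights (hypothetical); defaults to 1 each.
    """
    if len(song_a) != len(song_b):
        raise ValueError("songs must be coded on the same features")
    if weights is None:
        weights = [1.0] * len(song_a)
    compared = 0.0   # total weight of features coded in both songs
    differing = 0.0  # total weight of features whose codes differ
    for code_a, code_b, weight in zip(song_a, song_b, weights):
        if code_a is None or code_b is None:
            continue
        compared += weight
        if code_a != code_b:
            differing += weight
    if compared == 0.0:
        raise ValueError("no feature is coded in both songs")
    return differing / compared
```

Dividing by the weight of the features actually compared keeps the distance in the range 0 to 1 even when some codings are missing.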

For the main analysis, we compute distances in ASJP word alignments weighted by sound correspondence probabilities, a method that provides good first approximations of language relatedness (Supporting Information 3, table S14, and fig. S34) (65). For comparability with other ASJP-based work, we also report normalized Levenshtein distances (Supporting Information 3, table S15, and fig. S35).
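The normalized Levenshtein distance reported for comparability can be sketched as below. The sketch covers only the simpler of the two measures (the alignment weighted by sound correspondence probabilities is not reproduced), it normalizes by the length of the longer word, and it averages over the concepts attested in both lists. The function names and the dictionary layout are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the length of the longer word (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(wordlist_a, wordlist_b):
    """Mean normalized distance over concepts present in both word lists.

    wordlist_a, wordlist_b: dicts mapping a concept label to one transcribed
    word (hypothetical layout; synonyms are not handled).
    """
    shared = sorted(set(wordlist_a) & set(wordlist_b))
    if not shared:
        raise ValueError("the two word lists share no concept")
    return sum(normalized_levenshtein(wordlist_a[c], wordlist_b[c])
               for c in shared) / len(shared)
```

Restricting the average to shared concepts mirrors the coverage constraint described for the word lists, where not every concept is attested in every language.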

In contrast to songs and individual genotypes, language data do not represent individuals for each population. In view of the fact that the data are partly numerical and partly categorical, we used a balanced mix of PCA and multiple correspondence analysis (MCA) to calculate differences between languages (Supporting Information 1, section 3) (71). Empty values were imputed using the R package missMDA (72).

We performed a principal coordinate analysis (PCoA) on the distance matrices of pairwise Fst for SNPs and pairwise Φst for music (Fst and Φst matrices are available from GitHub; Supporting Information 1, section 3) (73). Similar to a PCA, a PCoA produces a set of orthogonal axes whose importance is measured by eigenvalues (figs. S2 to S6). However, in contrast to the PCA, non-Euclidean distance matrices can be used. Heat plots of PCo and PC were visualized by ggplot2 in R (figs. S7 to S11) (74).

Distances were visualized using the SplitsTree neighbornet algorithm [version 4; (37)] and are reported in detail in Supporting Information 1, tables S2 to S6, and figs. S12 to S16. To control for multicollinearity, we used PCA/MCAs and PCoAs as input rather than the raw data.

The geographical polygons were taken from the Ethnologue (75) via the World Language Mapping System (76), supplemented by a hand-drawn polygon estimate for Ainu.

To account for the mobility of speakers over time, we sampled 1000 random locations from within the polygons and used these for assessing correlations. Location samples were always taken from geometries (i.e., polygons on a sphere) and not from a potentially distorted image of these geometries on a map. Location samples were generated in PostGIS (https://postgis.net/) (Supporting Information 1, section 2.4). For each of the 1000 samples, we computed the spherical distance between all random locations and stored the result in a distance matrix. We then performed a distance-based Moran's eigenvector map analysis (dbMEM) to decompose the spatial structure of each of the resulting 1000 distance matrices (Supporting Information 1, section 3.3) (77). Similar to a PCoA, dbMEM reveals the principal coordinates of the spatial locations from which the distance matrix was generated. We retained only those eigenfunctions that correspond to positive spatial autocorrelation.
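
The spherical distance step can be illustrated with a short Python sketch that builds a great-circle distance matrix from latitude and longitude pairs using the haversine formula. The haversine formula, the mean Earth radius constant, the function names, and the example coordinates are all assumptions made for illustration; the original locations were sampled and processed in PostGIS, and this sketch does not reproduce the polygon sampling.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(points):
    """Symmetric matrix of pairwise great-circle distances for (lat, lon) tuples."""
    return [[haversine_km(*p, *q) for q in points] for p in points]


# Hypothetical sampled locations (lat, lon), one per population.
sample = [(43.0, 141.3), (37.5, 127.0), (62.0, 129.7)]
dists = distance_matrix(sample)
```

Repeating this for each of the 1000 location samples would give one distance matrix per sample, each of which can then be passed to the spatial decomposition.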

RDA was carried out to explore the linear relationship between SNPs, grammar, phonology, and music. Partial RDA was used to control for spatial dependence (Supporting Information 1, section 5) (78). (Partial) RDA is an alternative to the traditionally used Mantel test, which was found to yield severely underdispersed correlation coefficients and a high false-positive rate in the presence of spatially correlated data (79). RDA performs a regression of multiple response variables on multiple predictor variables (80), while partial RDA also allows controlling for the influence of confounders. RDA yields an adjusted coefficient of determination (adjusted R²), which captures the variation in the response that can be explained by the predictors. We compared the observed adjusted R² values against a distribution under random permutations (Fig. 4 and figs. S18 to S23). To assess robustness, we z-normalized the difference between observed and permuted adjusted R² and report the proportion of samples for which the observed adjusted R² is at least one SD larger than the permuted value (z > 1 SD). Moreover, we computed the Kullback-Leibler divergence (KLD) between the distribution of observed adjusted R² and that of permuted adjusted R². The KLD assesses the overall divergence of the two distributions, whereas z > 1 SD reports the proportion of samples with a strong positive difference. (p)RDA and subsequent analyses were performed in R using the vegan package (65).
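
To make the KLD comparison concrete, the Python sketch below computes the Kullback-Leibler divergence between two distributions represented as histogram counts over the same bins. The binning, the pseudocount used to avoid empty bins, the function name, and the example counts are assumptions made for illustration and are not taken from the original analysis, which was carried out in R.

```python
import math


def kl_divergence(p_counts, q_counts, pseudocount=0.5):
    """KL divergence D(P || Q) in nats between two binned distributions.

    `p_counts` and `q_counts` are counts over identical bins. A hypothetical
    pseudocount is added to every bin so that empty bins in Q do not make the
    divergence undefined.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("both histograms must use the same bins")
    p = [c + pseudocount for c in p_counts]
    q = [c + pseudocount for c in q_counts]
    sp, sq = sum(p), sum(q)
    # Bins with zero probability under P contribute nothing to the sum.
    return sum((a / sp) * math.log((a / sp) / (b / sq))
               for a, b in zip(p, q) if a > 0)


# Hypothetical bin counts of observed and permuted adjusted R² values.
observed_bins = [0, 2, 10, 30, 58]
permuted_bins = [55, 30, 10, 4, 1]
kld = kl_divergence(observed_bins, permuted_bins)
```

A value near zero means the observed and permuted distributions are nearly indistinguishable. Larger values mean they diverge, but the KLD alone does not indicate the direction of the difference, which is why it is paired with the z > 1 SD proportion.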

D. E. Brown, Human Universals (Temple Univ. Press, 1991).

B. Nettl, The Study of Ethnomusicology: Thirty-Three Discussions (University of Illinois Press, 2015).

C. Darwin, The Descent of Man, and Selection in Relation to Sex (J. Murray, 1871), volume 1.

L. L. Cavalli-Sforza, P. Menozzi, A. Piazza, The History and Geography of Human Genes (Princeton Univ. Press, 1994).

P. J. Richerson, R. Boyd, Not by Genes Alone: How Culture Transformed Human Evolution (University of Chicago Press, 2005).

A. Mesoudi, Cultural Evolution: How Darwinian Theory Can Explain Human Culture and Synthesize the Social Sciences (University of Chicago Press, 2011).

M. Stoneking, An Introduction to Molecular Anthropology (Wiley-Blackwell, 2016).

J. Nichols, in The Comparative Method Reviewed, M. Durie, M. Ross, Eds. (Oxford Univ. Press, 1996), pp. 39–71.

J. Nichols, Linguistic Diversity in Space and Time (University of Chicago Press, 1999).

A. Lomax, Folk Song Style and Culture (Transaction Books, 1978).

B. Pakendorf, in The Routledge Handbook of Historical Linguistics, C. Bowern, B. Evans, Eds. (Routledge, ed. 1, 2015), pp. 627–641.

B. Bickel, in Language Dispersal, Diversification, and Contact, M. Crevels, P. Muysken, Eds. (Oxford Univ. Press, 2020), pp. 78–101.

M. Robbeets, Diachrony of Verb Morphology: Japanese and the Transeurasian Languages (De Gruyter Mouton, 2015).

N. Tranter, Languages of Japan and Korea (Routledge, 2012).

H. Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2009).
