A guide to the Mantel test for linguists

Jon Carr

The Mantel test has become a common method in my field for estimating how structured a language is. In short, the Mantel test is useful if we have a set of words W and a set of corresponding meanings M and we want to know how systematically the two are related. For example, if you have two words that are similar, say bodu and bodi, then you would expect these two words to refer to similar meanings. Conversely, two words that are very different, say gova and hiju, would be expected to refer to very different meanings.

To do this analysis, you first need a distance metric for estimating the dissimilarity between two words (e.g. the Levenshtein distance) and a distance metric for estimating the dissimilarity between two meanings (e.g. the Hamming distance for feature values). These metrics will allow you to produce a distance matrix. I assume that you’ve already got this side of things figured out. Here I explain how to go about correlating the two distance matrices using the Mantel test to estimate how systematic the relationship is between them.
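To make this concrete, here’s a minimal Python sketch (not my actual analysis code) of how the two distance matrices might be built. The words, feature tuples, and function names are invented for illustration; in practice you’d use whatever metrics suit your data.

```python
def levenshtein(a, b):
    """String-edit distance between two words (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def hamming(m1, m2):
    """Number of differing feature values between two meanings."""
    return sum(f1 != f2 for f1, f2 in zip(m1, m2))

words    = ['bodu', 'bodi', 'gova', 'hiju']
meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]   # hypothetical feature tuples

word_dists    = [[levenshtein(a, b) for b in words] for a in words]
meaning_dists = [[hamming(a, b) for b in meanings] for a in meanings]
```

Each list of lists is a distance matrix of the kind described in the next section.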

I’ve tried to keep this guide simple. If you know what a correlation is, you should have no trouble following along. If you’re already familiar with distance matrices, you can probably skip down to section 3 below. I’ve tried to link to Wikipedia articles where any technical vocabulary arises.

1. Redundant distance matrices

A distance matrix represents the distances between all possible pairings of items. Let’s look at two example matrices for a set of four items (labelled A–D). To make things a little more concrete, let’s say that Matrix 1 gives the distances between word forms (string-edit distance) and that Matrix 2 gives the distances between meanings (meaning distance).

Matrix 1             Matrix 2

| A  B  C  D         | A  B  C  D
---------------      ---------------
A | 0  i  j  k       A | 0  p  q  r
B | i  0  l  m       B | p  0  s  t
C | j  l  0  n       C | q  s  0  u
D | k  m  n  0       D | r  t  u  0

The letters i–n in Matrix 1 represent the string-edit distances and the letters p–u in Matrix 2 represent the meaning distances. For example, m = the string-edit distance between word B and word D, and q = the meaning distance for the things described by words A and C.

Notice two things. Firstly, there are zeros down the diagonal. Secondly, the lower half of the matrix (below the zeros) is the mirror image of the upper half of the matrix (above the zeros). We refer to the matrix as redundant because half of the information it contains is easily predicted from the other half. The reason for these two features of a distance matrix is obvious when you think about it: (i) the distance between a given item and itself will always be 0 (the coincidence axiom) and (ii) the distance between A and B will always be the same as the distance between B and A (symmetry).

Since we have 4 items, A, B, C, and D, there are 6 possible ways of pairing the items up: AB, AC, AD, BC, BD, and CD. For any number of items n, we can easily calculate the number of possible pairings. Firstly, n² gives us the number of cells in the matrix. Then we subtract n, since there are n zeros. Finally, we can halve the result, since the lower half of the matrix is identical to the upper half. This gives (n² − n) ∕ 2 as the number of ways to pair the items up. The more conventional way to express this is: n(n − 1) ∕ 2.
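As a quick sanity check, the formula is easy to verify in Python (the function name is my own):

```python
def num_pairs(n):
    """Number of unordered pairings of n items: n(n - 1) / 2."""
    return n * (n - 1) // 2

print(num_pairs(4))   # 6 — the pairings AB, AC, AD, BC, BD, CD
```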

2. Condensed distance matrices

As we have seen, more than half of the information contained in a distance matrix is redundant. A more efficient encoding of the information takes the form of a condensed distance matrix. A condensed distance matrix is simply a vector listing the distances in the upper part of the regular matrix working from the top left to the bottom right. Here are our two matrices converted to the condensed format.

Vector 1               Vector 2

AB AC AD BC BD CD      AB AC AD BC BD CD
------------------     ------------------
i  j  k  l  m  n       p  q  r  s  t  u
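If you’re doing this in code, the conversion is a simple loop over the upper triangle. A minimal sketch (the function name is my own):

```python
def condense(matrix):
    """Flatten a redundant distance matrix into its upper triangle,
    read left to right, top to bottom."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

matrix1 = [[0, 'i', 'j', 'k'],
           ['i', 0, 'l', 'm'],
           ['j', 'l', 0, 'n'],
           ['k', 'm', 'n', 0]]

print(condense(matrix1))   # ['i', 'j', 'k', 'l', 'm', 'n']
```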

3. Correlation of distance matrices

At first glance, you might think we can just correlate Vector 1 with Vector 2 to arrive at an estimate of the tendency for Vector 1 (string-edit distance) to covary with Vector 2 (meaning distance). However, a regular correlation is problematic in the case of correlating sets of pairwise distances as we are doing here. This is because each distance is not independent of the others, so a regular correlation violates the assumption of independence. For example, if one letter in word C was modified, three of the string-edit distances (not one) would change: j (AC), l (BC), and n (CD).

Nathan Mantel’s (1967) solution to this problem was to use a Monte Carlo (randomization) technique on top of the normal correlation coefficient to arrive at a better estimate of the strength and significance of the correlation between two distance matrices.

4. The Mantel test

The Mantel test works as follows. First, we calculate the regular correlation as described above. Usually this means taking Pearson’s product-moment correlation coefficient between Vector 1 and Vector 2. This is referred to as the veridical correlation, which we will denote x.

x = correlation(vector 1, vector 2)

Next we shuffle one of the matrices (more on this in the next section), convert it to the vector format, and compute the correlation again. This process is repeated thousands of times. While running all these randomizations, we keep track of the correlations we’ve derived for each one. We then calculate the mean (μ) and standard deviation (σ) of all these correlations.

It doesn’t matter which matrix we shuffle. We could even shuffle both. However, for the sake of efficiency and consistency, we just shuffle one of them. In this example, let’s say that we’re going to shuffle Matrix 1 (string-edit distance).

Sometimes a random permutation of Matrix 1 will correlate positively with Matrix 2; other times the random permutation will correlate negatively. However, most of the time, there won’t be any correlation at all because you will have broken any link that exists between the two distance matrices by randomizing one of them. Therefore, the mean correlation (μ) of all these randomizations should be around 0 and the correlations should be normally distributed. In other words, it’s very likely that a random permutation of Matrix 1 will have no correlation with Matrix 2 and it’s very unlikely that a random permutation of Matrix 1 will have a strong positive or strong negative correlation with Matrix 2.

If there is indeed a link between Matrix 1 (string-edit distance) and Matrix 2 (meaning distance), then our veridical correlation x should be significantly greater than (or significantly less than) the mean of the randomizations (μ). We can determine whether or not this is the case by essentially counting the number of standard deviations between the mean of the randomly generated correlations (μ) and the veridical correlation (x). This gives us a Z-score, which is calculated in the following way:

z = (x − μ) / σ

A positive Z-score indicates a positive relationship; a negative Z-score indicates a negative relationship. If the Z-score is greater than 1.96 (or less than -1.96), it is significant at the α = 0.05 level. You can calculate the p-value for your Z-score in the normal way (or use this handy converter here).
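Putting all of the above together, here’s a minimal pure-Python sketch of the whole procedure. It’s illustrative rather than production code (for real implementations, see section 10), and all the function names are my own:

```python
import random
import statistics

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def condense(matrix):
    """Upper triangle of a redundant distance matrix as a flat vector."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def permute(matrix, order):
    """Reorder rows and columns by the same permutation."""
    return [[matrix[i][j] for j in order] for i in order]

def mantel(matrix1, matrix2, shuffles=10000):
    """Z-score for the veridical correlation between two distance matrices."""
    v2 = condense(matrix2)
    x = pearson(condense(matrix1), v2)        # veridical correlation
    order = list(range(len(matrix1)))
    correlations = []
    for _ in range(shuffles):
        random.shuffle(order)                 # shuffle one matrix only
        correlations.append(pearson(condense(permute(matrix1, order)), v2))
    mu = statistics.mean(correlations)
    sigma = statistics.stdev(correlations)
    return (x - mu) / sigma
```

Note that the shuffle reorders the rows and columns of the matrix together; the next section explains why this matters.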

5. Shuffling pitfalls

There are two common mistakes I’ve seen in relation to the shuffling method. In fact, I’ve made one of these mistakes myself, which was the reason I created this guide.

First we need to consider the number of possible permutations of a matrix. In the example above, we have 4 items (A–D) giving a 4×4 distance matrix. There are therefore 4! = 24 valid permutations of the matrix. A valid distance matrix is one with zeros down the diagonal and mirrored upper and lower halves. In other words, a valid distance matrix is one where the order of the columns is the same as the order of the rows: if you randomize the order of columns to D, A, C, B (from left to right) then the rows should match this (from top to bottom).

Below you’ll see an example of a valid and invalid permutation of Matrix 1. One common mistake is where the cells of the matrix are shuffled around without maintaining the order of elements for both rows and columns.

Valid matrix        Invalid matrix

| D  A  C  B        | B  A  D  C
---------------     ---------------
D | 0  k  n  m      C | l  j  n  0
A | k  0  j  i      A | i  0  k  j
C | n  j  0  l      B | 0  i  m  l
B | m  i  l  0      D | m  k  0  n
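In code, a valid permutation falls out naturally if you reorder rows and columns with the same index permutation. A sketch (the function name is my own):

```python
def permute(matrix, order):
    """Reorder rows and columns by the same index permutation, so the
    result is always a valid distance matrix."""
    return [[matrix[i][j] for j in order] for i in order]

matrix1 = [[0, 'i', 'j', 'k'],
           ['i', 0, 'l', 'm'],
           ['j', 'l', 0, 'n'],
           ['k', 'm', 'n', 0]]

# Reorder the items as D, A, C, B (indices 3, 0, 2, 1); this
# reproduces the valid matrix shown above:
for row in permute(matrix1, [3, 0, 2, 1]):
    print(row)
```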

The mistake that I made was slightly different, although related. The dataset I was working with was large with various distance metrics and lots of randomizations. Making the code as efficient as possible was therefore important, since I didn’t want to have to wait days to compute my results. My solution was to take two shortcuts:

Shortcut 1: Good idea

Don’t compute distances that you’ve already computed before, especially if your distance metric is computationally intensive. Once you’ve computed the distance matrix, there’s no need to compute it again for the next randomization; you can simply shuffle the matrix, so long as you’re careful to maintain the order of items along the rows and columns. If in doubt and efficiency isn’t an issue, just recompute the distance matrix (after randomizing the order of words or meanings) and you can’t go wrong.

Shortcut 2: Bad idea

Rather than shuffle the matrix and convert it to a vector for every randomization, I converted the matrix to a vector once, and then shuffled the vector for each randomization. This seemed intuitive, since the numbers in the vector will always be the same for every randomization – they’ll just be in a different order.

However, this is incorrect. To see why, recall that in our 4×4 matrix there are 6 possible pairings of items. Therefore, the vector representation of the distance matrix contains 6 distances. Shuffling this vector gives 6! = 720 possible permutations. However, as we saw above there are only 4! = 24 valid permutations of the distance matrix. Therefore, shuffling the vector representation of the matrix results in vector permutations that are not valid distance matrices.
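You can verify this discrepancy by brute force. The sketch below enumerates every shuffle of the condensed vector for a 4-item matrix and counts how many of them correspond to valid matrix permutations (I’ve substituted the numbers 1–6 for the distances i–n; the helper names are my own):

```python
from itertools import permutations

def condense(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def permute(matrix, order):
    return [[matrix[i][j] for j in order] for i in order]

matrix1 = [[0, 1, 2, 3],
           [1, 0, 4, 5],
           [2, 4, 0, 6],
           [3, 5, 6, 0]]

# Vectors reachable by validly permuting the matrix:
valid = {tuple(condense(permute(matrix1, order)))
         for order in permutations(range(4))}

# All possible shuffles of the condensed vector:
all_shuffles = set(permutations(condense(matrix1)))

print(len(all_shuffles), len(valid))   # 720 shuffles, only 24 valid
```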

In my case, I was working with 48×48 matrices. So the actual number of valid permutations of the matrices was 48!, but my vector representation had 1128! possible permutations. This means that approximately 10²⁹⁵⁵ of the vector permutations I was sampling from were invalid matrices – a discrepancy that’s impossible to convey in words (perhaps it helps to say that there are an estimated 10⁸⁰ atoms in the universe).

6. The curse of the 3×3 toy example

One reason why I didn’t notice the error immediately is because I was using 3×3 toy matrices to test my code. It turns out that 3 items is a special case. Let’s look at a 3×3 matrix.

Matrix 3

| A  B  C
------------
A | 0  i  j
B | i  0  k
C | j  k  0

Here there are 3 items: A, B, and C. Therefore, there must be 3! = 6 valid permutations of the matrix. If we convert this to a condensed distance matrix, we have: [i, j, k]. The condensed format also has 3 items, so there are also 3! = 6 possible permutations of the vector.

Therefore, in the case of a 3×3 matrix only, all 6 possible permutations of the vector are valid distance matrices, so Shortcut 2 works. However, as we will see in the next section, it wouldn’t be appropriate to use the Mantel test for such a small matrix anyway.

7. Small matrices

To see why the Mantel test shouldn’t be used for small matrices, we need to consider why we’re doing randomized shuffles of the matrix in the first place. As mentioned above, the Mantel test uses a Monte Carlo technique. Monte Carlo techniques are a large class of statistical methods that rely on randomization, so named because of the casinos in Monte Carlo.

Ideally, we would like to calculate the correlation for every possible valid permutation of the distance matrix and then compare the mean of these correlations against the veridical correlation. However, if we wanted to correlate two 15×15 matrices, then we would have to compute a correlation for 15! = 1.3 trillion permutations of one of the distance matrices. If your computer can calculate 1000 correlations per second, it would take 41 years to calculate them for all 1.3 trillion permutations. This is clearly intractable.

The Monte Carlo solution we use in the Mantel test avoids this issue by randomly sampling the possible permutations. So when we want to correlate large distance matrices (where by large I mean anything more than about 8×8), randomization becomes essential.

If you’re working with matrices smaller than 8×8, then it’s computationally feasible to compute all possible permutations in a reasonable amount of time. Doing so is also preferable since you can never guarantee the reliability of your random number generator in uniformly sampling the space of possible permutations.
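Here’s a sketch of this deterministic variant, which simply iterates over every valid permutation instead of sampling (as before, the function names are my own and it assumes pure Python rather than any particular library):

```python
from itertools import permutations
import statistics

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def condense(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def permute(matrix, order):
    return [[matrix[i][j] for j in order] for i in order]

def mantel_exact(matrix1, matrix2):
    """Deterministic variant: correlate against every valid permutation
    of matrix1 instead of a random sample."""
    v2 = condense(matrix2)
    x = pearson(condense(matrix1), v2)
    correlations = [pearson(condense(permute(matrix1, order)), v2)
                    for order in permutations(range(len(matrix1)))]
    mu = statistics.mean(correlations)
    sigma = statistics.stdev(correlations)
    return (x - mu) / sigma
```

Because nothing is sampled, this version returns exactly the same Z-score every time it is run on the same data.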

8. The appropriate number of randomizations

The more times you shuffle the matrix and measure the correlation between it and the unshuffled one, the more accurate and reliable your Z-score will be. However, there’s obviously a trade-off with computation time: the more shuffles you run, the longer it’s going to take.

If you run a Mantel test with a small number of shuffles (e.g. 100), you’ll find that the Z-score changes every time you run the test. This is pretty undesirable, since you want your results to be accurate and reproducible. I usually start with 1000 shuffles and experiment by raising this number until I consistently get the same Z-score to at least a few decimal places. You’re probably going to be presenting your results in a graph of some sort; if there’s no discernible difference between the results for 10,000 and 100,000 shuffles that you can see with the naked eye, then you may as well go with 10,000.

Once I’ve found a good number, I estimate how long it will take to run the test for all participants in all conditions under all metrics (and so forth), and then I adjust according to the time I have available.

9. Undefined results

The final thing to understand is what happens when all the distances in one of the two matrices are the same. This would be the case, for example, if you had a language where a single word is used for all meanings. In this case, no matter how you shuffle one of the matrices, the correlation with the unshuffled one will always be the same. If there’s no variation in the correlations, then the standard deviation will be 0. When you then calculate the Z-score, you will be attempting to divide by 0, which gives an undefined result.
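If you’re implementing the test yourself, it’s worth guarding against this case explicitly rather than letting the division fail. A sketch (the function name is my own):

```python
import statistics

def safe_z(x, correlations):
    """Z-score for the veridical correlation x, or None when the permuted
    correlations have zero spread (e.g. one word used for all meanings)."""
    mu = statistics.mean(correlations)
    sigma = statistics.stdev(correlations)
    if sigma == 0:
        return None    # undefined: better to flag and exclude than crash
    return (x - mu) / sigma
```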

Similar problems can arise if you have a language with little variation between the words. This can result in erratic Z-scores. I don’t have a nice answer for these issues. You should make a judgement call if you’re getting these kinds of results and consider leaving them off your final graphs (of course, it’s important to state why you’re doing this).

10. Example code

I’ve posted a simple Python implementation of the Mantel test on GitHub here. You can feed the code two distance matrices (in condensed vector format) and it will return the Z-score.

Kevin Stadler has a much more full-featured R package over here. This package has handy features for computing your distance matrices, running normality tests, and setting the type of correlation to use (among other things).

References

Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209–220.