April 7th, 2011

Notes for an analysis of (a) Rohde paper, part (1) LONG

I'm at my "sampling" stage, where I flip around in a text and read snippets, to get an overall view, before going back to the beginning and reading until I reach the end and then stop. I find that sampling counters the inevitable effects of narrative/rhetorical structure imposed by the author. Or at least exposes it.

Few agree with me, but I do it anyway.

"The first simulation, referred to as A1, used fairly liberal
parameters, as shown in the top row of Figure 3.
The maximum population was 25 million, reached in the
year 4000 BP. The ChangeTownProb was 20%, meaning
that 80% of the sims marry within their birth town. The
ChangeCountryProb was set to 0.1%, so about 1 in 1000
sims leave their home country, which may seem somewhat
high. But to put this in perspective, with a population of
25 million, there are about 48,000 people born in each
country per generation. So, on average, a ChangeCountryProb
of 0.1% will result in 48 people leaving a country
every 30 years, which certainly does not seem excessive."

(1) ChangeTownProb is _way too fucking high_.
(2) ChangeCountryProb is _way way way way_ too fucking high.
(3) 48 people leaving a country every 30 years over all human history is _insanely_ _mindbogglingly_ excessive. (Let's not even get into definition of "country", okay?)
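
For scale, here is the paper's own arithmetic written out, followed by the step it skips: adding the trickle up over time. This is a rough sketch in Python. The 48,000 births, the 0.1% and the 30-year generation come from the quote above. The 20,000-year span is my assumption, chosen only for illustration, and holding the population at its maximum for the whole span overstates the total.

```python
# Figures quoted from the paper's description of simulation A1.
BIRTHS_PER_COUNTRY_PER_GENERATION = 48_000
CHANGE_COUNTRY_PROB = 0.001  # 0.1%
YEARS_PER_GENERATION = 30

# Assumption (not from the paper): an illustrative span of time.
SPAN_YEARS = 20_000


def emigrants_per_generation(births: int, prob: float) -> float:
    """Expected number of people leaving one country in one generation."""
    return births * prob


def cumulative_emigrants(births: int, prob: float,
                         span_years: int, years_per_generation: int) -> float:
    """Expected emigrants from one country over the whole span,
    with the population held constant at the quoted figure."""
    generations = span_years / years_per_generation
    return emigrants_per_generation(births, prob) * generations


per_generation = emigrants_per_generation(
    BIRTHS_PER_COUNTRY_PER_GENERATION, CHANGE_COUNTRY_PROB)
total = cumulative_emigrants(
    BIRTHS_PER_COUNTRY_PER_GENERATION, CHANGE_COUNTRY_PROB,
    SPAN_YEARS, YEARS_PER_GENERATION)
print(f"{per_generation:.0f} per generation, about {total:,.0f} over {SPAN_YEARS:,} years")
```

At those settings, 48 per generation comes to roughly 32,000 emigrants per country over the span. That is the number to hold up against "does not seem excessive".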

"More liberal is the fact that, in simulation A1, there are
no locality constraints on inter-country migration or the
use of ports. Migrants have an equal chance of traveling to
any country within the continent and can use a port from
anywhere within the continent. The PortRate was set to 10
migrants per generation, in each direction, which is about
1 migrant every three years."

Tell that to the massive waves of migration that came into what is now New York Harbor and took many, many generations to get past New Jersey. In an era with _train travel_. This isn't "more liberal". _Insects_ don't behave this way.

I'm temporarily ignoring the population size and size change issues, because I think the migration assumptions are more damaging. I'm also ignoring statements like this: "Simulation A3 lowered the PortRate from 10 sims per generation to just one per generation." because I'm not entirely certain I understand what the author(s) mean(s).

Next is some discussion of the impact of changing individual parameters, followed by what happens when a bunch of parameters change simultaneously.

"If these five parameter changes have independent effects,
we might expect the net effect of combining all of
them to be either the sum of their independent additive effects
or the product of their multiplicative effects."

I cannot imagine why anyone would expect the parameters above to have independent effects. They aren't independent. At all.

"Thus, the effects of the parameters
appear to be greater than their independent additive
or multiplicative combination and we might conclude
that there is interaction between them."

Gee. Ya think?
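
To pin down what the paper means by those two yardsticks, here is a minimal sketch. The baseline and the five single-parameter results are made-up numbers for illustration, not values from the paper, and the function names are mine.

```python
from math import prod


def additive_prediction(baseline: float, single_results: list[float]) -> float:
    """Independent additive effects: baseline plus the sum of each shift."""
    return baseline + sum(r - baseline for r in single_results)


def multiplicative_prediction(baseline: float, single_results: list[float]) -> float:
    """Independent multiplicative effects: baseline times the product of each ratio."""
    return baseline * prod(r / baseline for r in single_results)


def suggests_interaction(observed: float, baseline: float,
                         single_results: list[float]) -> bool:
    """True when the combined result exceeds both independent predictions."""
    return observed > max(additive_prediction(baseline, single_results),
                          multiplicative_prediction(baseline, single_results))


# Hypothetical MRCA ages (years BP): one baseline run, then five runs
# that each change a single parameter.
baseline = 2000.0
single_results = [2200.0, 2400.0, 2100.0, 2300.0, 2500.0]

print(additive_prediction(baseline, single_results))
print(multiplicative_prediction(baseline, single_results))
print(suggests_interaction(5000.0, baseline, single_results))
```

With these invented inputs the additive guess is 3500 and the multiplicative guess is about 3985, so a combined run landing at 5000 would be flagged as interaction. Neither yardstick means much when the parameters were never independent to begin with.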

"the NonLocalCountryProb, the
ChangeCountryProb, and to some extent the NonLocal-
PortProb have greater effects in simulations A7-A12.
This indicates that these parameters are interacting, probably
with one another."

Like, maybe _technology_ is involved?

"Lineage can spread fairly rapidly if either
the ChangeCountryProb is high, meaning there are more
inter-country migrants or the NonLocalCountryProb is
high, meaning that there may be only a few migrants
but they are more likely to travel long distances. When
there are both few migrants and they tend to move short
distances, there is a much greater resulting effect on the
MRCA date."

This matches my historical and human-behavior intuition quite well, and is actually a quite succinct way of describing my issues with a recent MRCA. My sense of migration patterns is simple:

(1) People don't move.
(2) If they do move, then they stay put.
(3) They go with people they know.
(4) AND THEIR KIDS MARRY EACH OTHER WHEN THEY GET THERE.

Makes it hard for lineages to spread fast. Really, the twin inventions of automobiles and high schools are probably the only reason meaningful mixing has occurred at all.

That model did not have geographic limitations like oceans.

Here are snippets from Model B:

"However, because there are so
few ports and they are intended to resemble fairly major
intercontinental passages, the ports were given migration
rates 10 times higher than those in the first model."

I'm not sure what planet that would make sense on. Less connectivity leads to _more_ traffic? But they had to do that because:

"Following the increase in migration rate, simulations
B1-6 result in very similar MRCA dates as the corresponding
A1-6 simulations, averaging less than a 2% increase
in MRCA time."

Who wants a boring old MRCA long, long ago, after all? That's not going to get you a headline anywhere.

The parameter interaction problem has, as I would have expected, gotten stronger in Model B, including all parameters. I'm continuing to ignore population size and size change issues except for this:

"Therefore, although we
were not able to simulate a full-size world population, it
seems that the use of a reduced population has made these
models more, rather than less, conservative, resulting in
MRCA and ACA dates that are somewhat older than they
should be."

It may seem that way, but it's overly optimistic to believe that. I'm about a third through; I'm going to post this so I don't lose it.

Notes for an analysis of (a) Rohde paper, part (2) LONG

"We found that, even with different architectures and
quite widely varying parameters, the date of the MRCA is
relatively stable, roughly falling between 2000 and 6000
BP,"

Well, because you manipulated the fuck out of your migration rate to _get_ that effect, sure.

"Model B was possibly overly
conservative in that it allowed only four intercontinental
ports with between 10 and 100 migrants per generation
across them. Certainly the interchange between Africa
and Eurasia has been higher than that, and there are potentially
other routes of passage between the continents,
such as migration from Borneo to Madagascar and potential
contact between Greenland Inuit and Vikings and
between South America and Polynesia."

That would be a fair point if Rohde et al were arguing for a Eurasia MRCA in the time frames under discussion. We all _agree_ that interchange between Africa and Eurasia was higher than the rate you picked, but that doesn't really explain why you picked a rate that was so unbelievably high for exchange rates between the Americas and Eurasia (never mind all the Pacific Islands and everywhere else).

I'm continuing to ignore population issues.

Model C is supposed to fix some of these problems, but will in fact make some of them worse (for example, I disagree with his pick for likely start of habitation in the Americas). I'm more than a little disturbed by comparing the rates of migration between some of his ports:

"Between
Africa and Eurasia, there are ports between modern-day
Morocco and Spain (100 sims/generation), Tunisia and
Italy (100 s/g), Egypt and Israel (500 s/g), and between
Ethiopia and Yemen (50 s/g), providing several points of
contact. Other static ports include a pair between Thailand
and Malaysia (100 s/g), and from the tip of Indonesia
(Timor) to Arnhem Land and from New Guinea to Cape
York, both with a rate of just 5 s/g."

Really? Just 20X going between Morocco and Spain vs. New Guinea and Cape York? Yeah, I don't think so.

"The migration rates used in this model are not based on
firm historical data, because such information is, for the
most part, unknown (Jorde, 1980)."

Yes, Rohde et al used a _1980_ citation in support of this assertion. He has no problem at all deciding the Americas had no one in them before 20K BP, either. This isn't a matter of whether historians or archaeologists or whoever have one or many opinions. Rohde is basically creating a MRCA and ACA calculating SIM based on a Risk board. This thing bears exactly the same relationship to population genetics that Risk bears to military strategy, as well. All right, what's Jorde?

Jorde, L. B. (1980). The genetic structure of subdivided populations:
A review. In J. H. Mielke & M. H. Crawford (Eds.),
Current developments in anthropological genetics: Vol. 1
(pp. 135-208). New York: Plenum Press.

Believe me, there are more recent reviews of this issue than Jorde.

"Without a firm basis in fact, an attempt was
made to err on the side of conservatism."

I can't tell whether they really mean this, or if they just feel they need to say it. I'm suspicious they actually mean it.

"However, experience suggests that its results
are quite stable and insensitive to all but the most
significant changes."

That's grad student speak for, "I really want this to be true, and have worked hard to find a range of inputs that produces stable results. When you come up with more believable inputs that produce wildly different results, I'm going to be shocked, shocked, I tell you!"

"In the previous models, immigrants using a port could
settle in any random town in the destination country. As
a result, immigrants were immediately assimilated into
the host community. It is more often the case in modern
times, and presumably throughout history, that immigrants
will gravitate towards a sub-community of fellow
immigrants who share the same cultural or linguistic
background. The result is a delay in the exchange of
lineages between the immigrants and hosts. This is simulated
in the model by having new immigrants initially
choose from one of five towns, out of up to 46, in the destination
country. This set of towns is dependent on the
source country from which the migrant came. As a result,
immigrants will tend to cluster, though they will not be
entirely segregated."

I won't lie. I actually found this paragraph very early in the sampling process and have saved it until now, because you really needed to understand just _how amazingly awful_ the previous models were that the authors could truly believe this was an improvement.

This is not how immigration works. Once again, from the previous entry, I'll start calling them walkitout's rules of immigration if it'll make you happier:

(1) People don't move.
(2) If they do move, then they stay put.
(3) They go with people they know.
(4) AND THEIR KIDS MARRY EACH OTHER WHEN THEY GET THERE.

Immigrants don't come to 1 out of 5 out of up to 46 (2-10% of the available choices of where to settle) based on their source country. They come to the same few blocks of _one_ town -- where that town is well under 1% of their choices of where to settle. And they do that even with significant, cheap transport options to do otherwise.

"A basic assumption of the model is that sims act independently
in their migration decisions."

You know, I can't do this any more.

The whole MRCA question is completely open, as far as I'm concerned. We're not going to get any kind of answer out of simulations until their modeling improves. This isn't a believable approximation, and there isn't any obvious excuse for this. Spherical chickens of uniform density are required for FFT calculations; if you're running SimLineage, you can put in any assumptions you like.

Hopefully, they'll be better assumptions than these.

MRCA debate, what a good simulation needs, etc.

I realize the last few posts have been quite horrible and I apologize. On top of that, there's been some offline conversation with Ebeth that is therefore missing from whatever this blog is a record of. I'm going to attempt to summarize what I've been thinking about, why it matters, and what I think is worth looking for on this topic in the future.

There are a couple of mathematical constructs used when thinking about population genetics. One is MRCA (most recent common ancestor) -- the most recent person who shows up in _everyone living's_ family tree. Another -- less commonly used -- is ACA or All Common Ancestors, which I'm still ignoring (I don't disbelieve it, per se, because it's quite obvious that MRCA and ACA are both real and, for subpopulations such as Iceland, fall in historical times and are well documented). Mitochondrial MRCA provides a time frame of about 200K years ago. Y-Chromosomal MRCA provides a time frame of 60-90K years ago. If you are a grad student or postdoc in population genetics, you could potentially get a big, juicy chunk of fame if you could pull this date in a ways.

A math professor named Jack Lee is apparently something of a genealogy buff and was initially pleased to discover he was descended from Charlemagne, until he did some probability calculations that suggested to him that an awful lot of people are descended from Charlemagne. His essay on the topic is widely reproduced and has within it the basic problem that shows up in a lot of initial attempts to model lineages: it assumes random mating. Jack Lee recognized that as an issue, and also pointed out that while his scribble doesn't definitively show that everyone within a certain area is descended from a certain person, it does show that it is quite wildly improbable that any arbitrarily chosen person within that area is _not_ descended from that certain person. The argument about Charlemagne can be made about anyone in the same time frame, the same location, who can be shown to have living descendants today.
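
The back-of-envelope core of this kind of argument is just counting ancestor slots. Here is a rough sketch, not Lee's actual calculation. The 30-year generation is my round number, the anchor is Charlemagne's death in 814, and the population figure is a placeholder I picked to show the crossover, not a sourced estimate.

```python
# Assumptions for illustration, not Lee's figures.
YEARS_PER_GENERATION = 30            # round number
CHARLEMAGNE_DEATH_YEAR = 814
PRESENT_YEAR = 2011                  # year of this post
PLACEHOLDER_POPULATION = 30_000_000  # stand-in, not a sourced estimate


def ancestor_slots(generations: int) -> int:
    """Slots in a family tree that many generations back (2 parents, 4 grandparents, ...)."""
    return 2 ** generations


def generations_back(from_year: int, to_year: int, years_per_generation: int) -> int:
    """Approximate number of generations between two years."""
    return round((from_year - to_year) / years_per_generation)


def generations_until_slots_exceed(population: int) -> int:
    """First generation at which one person's ancestor slots outnumber the population."""
    g = 0
    while ancestor_slots(g) <= population:
        g += 1
    return g


gens = generations_back(PRESENT_YEAR, CHARLEMAGNE_DEATH_YEAR, YEARS_PER_GENERATION)
print(f"{gens} generations back: {ancestor_slots(gens):,} slots")
print(f"slots exceed {PLACEHOLDER_POPULATION:,} people at generation "
      f"{generations_until_slots_exceed(PLACEHOLDER_POPULATION)}")
```

About 40 generations back there are over a trillion slots, far more than there were people to fill them, so the same people fill many slots each. Under random mating that overlap spreads across everybody. When people marry the neighbors' kids it mostly doesn't, which is the whole problem.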

The next approach tried (AFAIK) was simulation. Basically, try to model lineages and find out where they meet up in the distant past. The initial attempt at this would appear to be Chang in 1999, the same Chang who appears as an author on Rohde's 2004 paper. I have not read Chang's first paper on the topic, but it, too, used random mating. I did eventually find a .pdf of a draft of Rohde's paper; I can't speak to how closely it resembles what was published in Nature. The draft is what I was poking at in the last two posts.

One of the reasons Lee's argument is so simple and compelling is because the area he is describing is accessible by walking within a comparatively short period of time and in fact we know people _did_ move around on foot throughout the generations in question. Mixing really was happening. The problem tackled by Chang and Rohde is much, much harder: they have geographic bottlenecks to confront. How far back do we have to trace to find someone whose lineage touches every part of the globe today? Rohde, Chang and people like Humphrys behave as if they are ideologically committed to a within-history time frame. They build into their assumptions that there are no truly isolated populations today. They assume a very high migration rate in very implausible locations over very broad ranges of time, and claim they are being conservative by pointing out how low (by comparison) they made the migration rate in plausible places like Morocco to Spain. They allow recent arrivals unhindered access to the interior. They fail to model centuries-long communities practicing significant endogamy within an otherwise mixing region. They have too high a rate of people leaving their village to marry someone from another region.

Migration from one village to another. Migration from one region to another. Migration from one continent to another. Number of people migrating. Which port they would choose. Rohde et al made all of these variables too high, but they convinced themselves they _weren't_ too high, because their max population multiplied by the probability produced a number that "felt" small. They also assumed many things were independent which anyone who has done their own genealogy knows perfectly well are not -- and anyone who knows some history would not even know where to begin the criticism. Then they were surprised that the effects of these parameters didn't just add up or multiply, either.

And at no point did they factor in any technology or resource effects (either in terms of enabling migration or forcing it) and they treated the decision to migrate as one made by each "sim" independently.

If you corrected for these errors, you would shortly discover all kinds of very interesting things -- but none of them would be a MRCA in historical times. Thus, no way to bag the fame.

The world _is_ an island. We are all "cousins". I believe these things. When I belonged to a crazy religion and believed that God created The World and Every Living Thing In It that swims in the sea and so forth, I did _not_ believe He did so in six 24-hour days and I did _not_ believe that you could add up all the years in the genealogies in the Bible and figure out how long ago that creation event took place. I don't need for us all to have been here for only 6000 or so years. I didn't then. I don't now. I understand it's hard to differentiate yourself from the crowd these days as a scientist, but this is not a good game to be playing.

I would _love_ to see a SimLineage some day (when we have the computing power for it, because we clearly don't yet). But I'd actually be okay with running these models with better parameters. Specifically, parameters that better match my sense of the Laws of Migration:

(1) People don't move.
(2) If they do move, then they stay put.
(3) They go with people they know.
(4) AND THEIR KIDS MARRY EACH OTHER WHEN THEY GET THERE.

There probably should be a 5th one, about how they pick their destination, but I'm still kind of hazy on exactly how that works. Maybe, (2a) If they do move, they go to the nearest larger place that will accept them.
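
To show why short, rare moves stretch the timescale, here is a toy sketch. It is not Rohde's model and not the SimLineage I'm wishing for. Everything in it is an assumption made up for illustration: towns sit in a line, one lineage starts in town 0, `cross_prob` is the chance per generation that it reaches an adjacent town, and `jump_prob` is the chance of a long-distance move to a random town. There are no populations, marriages or oceans in it.

```python
import random
from typing import Optional


def generations_to_cover(n_towns: int, cross_prob: float, jump_prob: float = 0.0,
                         max_generations: int = 100_000,
                         seed: int = 0) -> Optional[int]:
    """Generations until a lineage starting in town 0 is present in every town.

    Returns None if that has not happened within max_generations.
    """
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    reached = {0}
    if len(reached) == n_towns:
        return 0
    for generation in range(1, max_generations + 1):
        newly_reached = set()
        for town in reached:
            # Short-distance move: only to the next town over.
            for neighbor in (town - 1, town + 1):
                if 0 <= neighbor < n_towns and rng.random() < cross_prob:
                    newly_reached.add(neighbor)
            # Long-distance move: anywhere on the line.
            if rng.random() < jump_prob:
                newly_reached.add(rng.randrange(n_towns))
        reached |= newly_reached
        if len(reached) == n_towns:
            return generation
    return None


print(generations_to_cover(10, cross_prob=1.0))    # everyone always moves next door
print(generations_to_cover(10, cross_prob=0.05))   # people mostly stay put
print(generations_to_cover(10, cross_prob=0.05, jump_prob=0.05))
```

With only next-door moves the lineage can never cover the line in fewer than `n_towns - 1` generations, however mobile people are, and a low `cross_prob` pushes the count well past that floor. Turning on `jump_prob` is the toy equivalent of letting migrants land anywhere, which is exactly the kind of assumption that pulls the date in.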

ETA: I was never entirely certain whether Rohde's sim was a person or a household. I think it was a person. If Rohde et al had instead simulated family units, it might have made more sense.

50th cousin assertion update

The "some geneticists believe" assertion from Cecil Adams's "Straight Dope" quoted in the Wikipedia article on pedigree collapse was a cite to Alex Shoumatoff's _The Mountain of Names: A History of the Human Family_, which is an interesting book, probably worth reading, that I'll be poking at over the next few days. In the interim, I did find the 50th cousin quote in it on page 244 of the Kodansha Globe paperback. While my copy dates from later, the original was published in 1985 and excerpted earlier in the New Yorker, and was therefore written at some unknown earlier date. Here is the relevant bit:

"Most geneticists are in agreement that, as the science writer Guy Murchie explains, "no human can be less closely related to any other human than approximately fiftieth cousin, and most of us are a lot closer... [ that, in other words ] the family trees of all of us, of whatever origin or trait, must meet and merge into one genetic tree of all humanity by the time they have spread into our ancestors for about fifty generations."

That's bad enough (and obviously now known to be false on multiple levels). However, Shoumatoff exacerbates Murchie's understandable errors from the mid- to late 1970s and expands upon them.

"The main point is, rather, that each of us contains genetic contributions from practically everybody who ever lived. All it takes for widely divergent populations to merge genealogically is migration by one person. "A single indirect genetic contact between Africa and Asia in a thousand years can make every African closer than fiftieth cousin to every Chinese," Murchie writes. "Surprisingly, this may happen without any natives of either continent doing any particular travelling at all, but simply in consequence of the wanderings of nomads in intermediate territory." And Lewontin remarks that "a very small amount of migration -- as little as one migrant individual exchanged between groups in each generation -- is quite sufficient to prevent differentiation between groups by genetic drift."

Guy Murchie is _Seven Mysteries of Life_, 1978
Richard Lewontin is _Human Diversity_, 1982

I've ordered the Murchie (because it has unbelievably good reviews on Amazon); I'll look at the Lewontin next but I doubt I'll buy that one as it sounds textbook-y.

I'm mildly curious as to the math underlying the "single contact in a thousand years" and the one guy per generation idea. I suspect what happened here is a bad interface between the people-who-do-math and the people-who-understand-human-behavior. There are a lot of indications that Shoumatoff is one of the latter, rather than the former. It was clear that Rohde et al were members of the former, rather than the latter. I want to find the person who both understands math and understands human behavior AND thinks this comes down to fiftieth cousins. I don't think they existed even in the 1980s. They _definitely_ don't exist now. The error about DNA-in-us-from-everyone is just wildly fantastic now (and, arguably, was then as well, altho I'm less certain about that).

This feels like an echo chamber "fact", sort of like that one from the bicycling community and people in America and how far they live from where they work. Obviously, on the face of it, cannot be true, yet repeated hither and yon for reasons that make excellent psychological sense but no sense whatsoever in terms of reality (and boil down to, "we _want_ it to be true").