Mark Liberman has a couple of fascinating recent posts over at Language Log, one on comparing the vocabularies of different languages and one on comparing their efficiency:
Alex Baumans described a bilingual magazine’s problems in equalizing space and word-count allocations between Dutch and French…Alex’s discussion of Dutch compounds underlines a point that I made in the earlier post, namely that spaces are not a very helpful way to define the boundaries of words, especially in comparisons across languages. But what I’d like to follow up on today is his observation about comparisons of word and character counts.
As discussed in a post a few years ago (“One world, how many bytes?”, 8/5/2005), based on a variety of large collections of English-Chinese parallel texts, English texts are larger than their Chinese counterparts by a factor of between 1.37 and 2.27 before compression, or 1.19 to 1.41 after compression.
My impression is that there are several different factors at work here — but they don’t seem to me to account fully for the differences in length, especially in comparing compressed texts.
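For anyone who wants to try this kind of comparison on their own parallel texts, here is a minimal sketch of how the two ratios in the quote could be measured. It is not Liberman's actual procedure: the function name `size_ratios`, the UTF-8 encoding, and the choice of bzip2 as the compressor are all my assumptions, and the raw ratio in particular depends heavily on which encoding is used.

```python
from __future__ import annotations

import bz2


def size_ratios(text_a: str, text_b: str, encoding: str = "utf-8") -> tuple[float, float]:
    """Return (raw_ratio, compressed_ratio) of text_a's size relative to text_b.

    Assumptions (not from the quoted post): sizes are counted in bytes
    under `encoding`, and bzip2 stands in for "compression".
    """
    raw_a = text_a.encode(encoding)
    raw_b = text_b.encode(encoding)
    if not raw_a or not raw_b:
        raise ValueError("both texts must be non-empty")

    # Ratio of byte counts before compression.
    raw_ratio = len(raw_a) / len(raw_b)

    # Ratio of byte counts after compressing each text on its own.
    compressed_ratio = len(bz2.compress(raw_a)) / len(bz2.compress(raw_b))

    return raw_ratio, compressed_ratio


if __name__ == "__main__":
    # Placeholder strings only; a real measurement needs large parallel corpora,
    # because on short inputs the compressor's fixed overhead dominates.
    english = "This is a placeholder for the English side of a parallel text. " * 200
    other = "Dies ist ein Platzhalter für die andere Seite des Paralleltextes. " * 200
    raw, compressed = size_ratios(english, other)
    print(f"raw ratio: {raw:.2f}, compressed ratio: {compressed:.2f}")
```

A ratio above 1 means the first text is larger than the second. Swapping in a different compressor or encoding will shift the numbers, so results are only comparable when both are held fixed.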