The first question to ask is this: how much information needs to be obtained for each person? I am concerned with making all of human society work, so I will not accept a number which is insufficient to capture the differences between all members of human society. At the exact second in which I type this sentence, there are, according to one human population clock, 7,475,322,996 people in the world. Oops, by the time I had typed the last few letters an extra hundred or so had been added. Let me express that first number in binary: 110111101100100000110000001110100. If I have done that right, there are 33 binary digits, i.e. 33 bits there. That is the number of bits required to distinguish between all members of the world’s population.
To derive this easily, consider that 10 bits can represent 1,024 distinct values, roughly one thousand. 20 bits give 1,048,576, which we might treat as roughly one million. Then 30 bits give 1,073,741,824, which we can think of as one billion. Adding two more bits would multiply this by 4, giving 4,294,967,296, which is not enough, but adding 3 more bits would multiply it by 8, giving 8,589,934,592. That is more than necessary at the time I write this, which is January 5, 2017. So as of today, 33 bits of information are enough to distinguish between all members of the human race. It is not hard to foresee 34 bits being required soon, within a generation.
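The arithmetic can be checked in a few lines. This is only a minimal sketch in Python; the function name `bits_needed` is an illustrative choice, and the population figure is the one quoted above.

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits b such that 2**b >= count."""
    # (count - 1).bit_length() avoids floating-point logarithms entirely.
    return max(1, (count - 1).bit_length())


population = 7_475_322_996  # population clock reading quoted in the text

# Powers of two used as rough landmarks: thousand, million, billion.
for bits in (10, 20, 30, 32, 33):
    print(f"{bits} bits -> {2 ** bits:,} distinct values")

print("bits needed:", bits_needed(population))
```

Since 2**33 is 8,589,934,592, a 34th bit becomes necessary once the population passes that figure.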
Actually expressing all of the characteristics of an individual human being a priori is much too hard a task. The first bit must divide humanity exactly in half, or some of its information carrying power is being wasted. The second bit must also divide humanity in half, for the same reason, but it must do so in a way perfectly orthogonal to the first, which is profoundly difficult to do a priori, or again some information carrying capacity is lost.
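One way to put a number on the waste is to assume that "information carrying power" means Shannon entropy. That is my assumption for this sketch, as no particular measure is named here, and the function name `split_information` is likewise illustrative. Under that assumption, a bit carries a full bit of information only when it splits the population exactly in half.

```python
from math import log2


def split_information(p: float) -> float:
    """Shannon entropy, in bits, of a two-way split with a fraction p on one side."""
    if p in (0.0, 1.0):
        return 0.0  # a split that puts everyone on one side tells us nothing
    return -(p * log2(p) + (1 - p) * log2(1 - p))


for p in (0.5, 0.4, 0.1, 0.01):
    print(f"split {p:.2f}/{1 - p:.2f}: {split_information(p):.3f} bits")
```

The further the split drifts from half and half, the less of the bit's capacity is actually used.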
Note, however, that it is very easy to do this with a simple algorithm using empirical data. Collecting the data and processing it prior to conversion to binary is not so easy, but the actual division into orthogonal binary splits is not a hard problem.
To understand the problem with doing the division according to any philosophical or empirical categorization, consider the most obvious split, by gender. It should be easy enough to divide humanity into male and female, those with XY and with XX chromosome pairs and the corresponding genitals. But some people have XXY or XYY triplets instead of simple pairs, and there is a wider range of external genitalia which does not correspond exactly to the chromosomes. Hermaphroditic individuals with both male and female genitals, in some arrangement or other, do not necessarily have a corresponding chromosomal anomaly. So what bit is used to express gender?
One might also want a bit for sexual preference, which is not necessarily as expected from biological gender, even when the latter is clear. But sexual preference and biological gender are highly correlated, inversely, so having a separate bit for sexual preference wastes some of the information carrying capacity available in two bits of data.
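The cost of correlation between two bits can be illustrated the same way, again assuming Shannon entropy as the measure. The proportions below are invented purely for illustration and are not data about any real population; the bits are simply labelled A and B.

```python
from math import log2


def entropy(probabilities) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# Joint distributions over the pair (bit A, bit B). Illustrative numbers only.
# In the correlated case the two bits disagree 90% of the time.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
correlated = {(0, 0): 0.05, (0, 1): 0.45, (1, 0): 0.45, (1, 1): 0.05}

print(f"independent bits: {entropy(independent.values()):.3f} bits")
print(f"correlated bits:  {entropy(correlated.values()):.3f} bits")
```

Two independent, evenly split bits carry two full bits between them; with these made-up proportions the correlated pair carries roughly one and a half.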
Working a priori, I have found a ternary system better than a binary one. Instead of using 0 and 1 as symbols or values representing halves of a set of individual items, I often use -1, 0 and 1 for this purpose. For gender, -1 might be male, 1 might be female, and 0 might be “it’s complicated”. The advantages of the ternary system are more visible when used to express the very important matter of age. It is hard to define age in a binary system, which would require setting some arbitrary dividing line, say 18 years, which would be hard to justify. Using a ternary system, it is not hard to justify setting the two dividing lines according to ability to procreate. One is either too young for procreation, old enough to participate successfully in human reproduction, or too old for this purpose.
From a statistical point of view, neither ternary division is ideal, but they are much better for use in making up a description of a human being without a lot of empirical work.
Analogous to a bit, a ternary digit is a trit (trinary digit). One trit is equivalent to log₂ 3 (about 1.58496) bits of information. To find the number of trits required to distinguish the number of individuals in the human race, calculate the logarithm, base 3, of the population. It turns out that 21 trits are required, and using those 21 it would be possible to distinguish 10,460,353,203 people.
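The trit count can be verified without trusting a floating-point logarithm. The sketch below uses a hypothetical helper, `digits_needed`, which simply multiplies up until the capacity covers the population.

```python
from math import log2


def digits_needed(count: int, base: int) -> int:
    """Smallest number of base-`base` digits d such that base**d >= count."""
    digits, capacity = 0, 1
    while capacity < count:
        capacity *= base  # each extra digit multiplies the capacity by the base
        digits += 1
    return digits


population = 7_475_322_996  # population clock reading quoted earlier

print("bits per trit:", log2(3))
print("trits needed: ", digits_needed(population, 3))
print("capacity of 21 trits:", f"{3 ** 21:,}")
print("capacity of 20 trits:", f"{3 ** 20:,}")
```

Twenty trits cover only 3,486,784,401 individuals, which is why the twenty-first is needed.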
Using the ternary system has a real advantage for a priori work, but this should be firmly based on empirical evidence, so the number of bits of information is more important, and that would currently be 33. Though impossible to do ahead of time, there is a simple enough algorithm for doing it from empirical data. Assume that data is available for all people in a set of individuals. Assume that this data has been linearized, as described elsewhere, and represented in a large matrix of real numbers. Then take the Singular Value Decomposition, SVD, of the columns of data, where each row represents one person.
Apply the SVD, creating a new data matrix, with the columns sorted by singular value, from largest to smallest. Take the column with the largest singular value and split it in two: one holding only binary values, assigned according to which side of the column's mean each entry falls on, and the other created by subtracting those binary values from the original column. The two columns will sum to the original one, and are thus an equally good representation of that data.
The subsequent columns are orthogonal, so there would be no need to perform the SVD on them again, but it is necessary to get the new singular values. Ignore the newly created binary column, and do the SVD on the matrix of the original size, containing the newly created non-binary data column, made by subtracting the binary one. Again sort the columns of the matrix by singular value, and perform the same operation on the column with the largest singular value. Each time this operation is performed, a new binary division of the data is created. When the number of binary columns is equal to the base 2 logarithm of the number of rows, rounded up to the nearest integer, stop. The set of binary columns is the desired result.
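A literal reading of these steps can be sketched with NumPy. This is an illustration under stated assumptions, not a tested implementation of the method. The function name `binary_splits` is hypothetical. Entries above the column mean are assigned 1 and the rest 0, which is my assumption since the assignment is not spelled out. The demonstration data is random noise standing in for the linearized matrix. Because the binary column holds only 0s and 1s, how much each subtraction changes the data depends on the scale of the columns, and the sketch leaves the scale as given.

```python
import numpy as np


def binary_splits(data: np.ndarray) -> np.ndarray:
    """Return one binary column per iteration, ceil(log2(rows)) columns in all."""
    n_rows = data.shape[0]
    n_bits = max(1, (n_rows - 1).bit_length())  # ceil(log2(n_rows)) for n_rows >= 2
    work = np.asarray(data, dtype=float)
    bits = []
    for _ in range(n_bits):
        # NumPy returns singular values in descending order, so the columns
        # of u * s are already sorted from largest to smallest.
        u, s, _vt = np.linalg.svd(work, full_matrices=False)
        scores = u * s
        lead = scores[:, 0]
        # Assumption: 1 where the entry lies above the column mean, else 0.
        bit = (lead > lead.mean()).astype(float)
        # The residual column replaces the leading one; the matrix keeps its size.
        scores[:, 0] = lead - bit
        work = scores
        bits.append(bit.astype(int))
    return np.column_stack(bits)


rng = np.random.default_rng(0)
demo = rng.normal(size=(16, 5))  # 16 hypothetical people, 5 linearized features
print(binary_splits(demo))
```

With 16 rows the loop stops after four binary columns, matching the stopping rule described above.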
The same method can be used to produce a ternary division, and this is much more satisfactory when used to explain the results, though not ideal without some manual fudging along the way. If done mechanically, the division of people by gender into male, female and “it’s complicated” would have equal numbers in all three sets. That is unlikely and would have to be evaluated empirically, using statistics. But to divide the population of the earth appropriately could require only a few more than 21 trits, the dividing points being estimated through empirical research at each iteration of the algorithm. This would definitely be worth doing, as simply dividing the human population into sets is much less useful if the results cannot be explained to interested parties.
Whether one uses 33 bits or 21 trits, data compressed in this way is of little real value as such. Actual matching of one person to another for romantic purposes or simple friendship should involve at least twice as many bits of data, as explained elsewhere. Matching one person to a suitable job should also involve twice as many bits of data. Simultaneously matching a person to both a spouse and a job would involve the use of three times as many bits, almost a hundred for each person. One should add the same number of bits so each person will have a best friend, and more bits should be added for the sake of other aspects of a person’s social environment. All of this is explained elsewhere.