yet another birthday problem

Feb 2010
7
0
Hello everyone,

I would appreciate your help on the following variant of the birthday problem:

Assume that we have a number of identical boxes which all contain N black-or-white balls each. The order of the balls is important, i.e. each box has a placeholder for ball 1 to N. All placeholders of all boxes are filled randomly with balls.

Two given boxes are considered to be "similar" if at least K *corresponding* placeholders contain balls of the same color (which placeholders are not of interest, just their number).

Given N and K, which is the number of boxes that we need in order to have a better than even (>1/2) chance that at least two of the boxes are "similar"?

Example:

N=5, K=2
box 1: B-W-B-B-B
box 2: W-B-W-W-W
box 3: B-W-B-W-B

similarity between 1 and 2 = 0
similarity between 1 and 3 = 4
similarity between 2 and 3 = 1

boxes 1 and 3 are considered "similar" (4>=K), all other combination is "dissimilar".

TIA
 

CRGreathouse

Forum Staff
Nov 2006
16,046
936
UTC -5
Wow, this seems to be a really hard problem. An approximate solution would be to find the probability that two random boxes are similar, then find the number of comparisons between random boxes that would be expected to give a 50-50 chance of finding two similar ones, then double and take the square root: if you make T_n comparisons you'll find it on your n-th box (and double then square root approximates that transition).

For N = 15, K = 12 a given box is similar to
\(\displaystyle \sum_{i=0}^{N-K}{N\choose i}=\sum_{i=0}^3{15\choose i}=32647\)
other boxes, so the probability is 576/32768 = 9/512. (1 - 9/512)^39 is about 1/2, so sqrt(2*39) =~ 9 boxes should be about right. 9 boxes have 45 pairwise comparisons, giving match probability 55%. (The true probability is higher, but probably not too much.)
 
Feb 2010
7
0
First of all, thank you for your prompt reply!

This is not my area of expertise, so I will tell you how I understand your post - please correct me if I say something stupid. BTW this is not some college exercise...

I got a little bit confused with the number 576, it seems that is the outcome of the expression sigma ... etc (not 32647).

i=0 refers to an identical box (which is allowed, of course)
i=1 refers to a box with exactly 1 different ball, which can be anywhere in N positions
i=2 refers to a box with exactly 2 different balls, which can form any combination (N,2)
...

So the probability that two random boxes are similar is 9/512 if 3 out of 15 balls (at most) are allowed to have different color.

if M boxes are selected in the pool, then the number of unique pairwise comparisons T is M(M-1)/2. When M=10 (not 9) then T=45 (because the first box is not compared to anything), see below:

when M=1, T=0
when M=2, T=1
when M=3, T=1+2=3
when M=4, T=1+2+3=6
...
T= M(M-1)/2

The probability that NO 2 out of 10 boxes are similar is (1-9/512)^45=(503/512)^45=~0.450206. The probability that at least one pair of these 10 boxes is similar is 1-0.450206=0.549794.

I wrote a program to simulate this. With a million experiments and a decent random number generator, the at-least-one-similar-pair probability with 10 boxes was ~0.553 (stable 2nd digit), a little bit higher than 0.549794 (as you have said).

What I dont really get is WHY the actual probability is a little bit higher. I understand that, by working on the box level rather than on the ball level, we approximate the result (very nice approximation, though). If - somehow - we could evaluate the exact probability, why would it be a little bit higher?

Thank you very very much for your past and future help! =)
 

CRGreathouse

Forum Staff
Nov 2006
16,046
936
UTC -5
Randomas said:
What I dont really get is WHY the actual probability is a little bit higher. I understand that, by working on the box level rather than on the ball level, we approximate the result (very nice approximation, though). If - somehow - we could evaluate the exact probability, why would it be a little bit higher?
Conceptually: there is a space of possibilities and you're trying to see if you can fit a certain number in without any being "too close". If you've managed to get some large number of events in without any being "too close" then you have an unusually good target area. Normally, there would be some possibility that you would be close to several, and we subtract this probability off so as to not double-count it. But since everything is far away from everything else, you're less likely to be close to several things at once, meaning the stuff you have to subtract is smaller than you'd expect from chance alone. (It's not zero because you might be close to two things that aren't close to each other. You could calculate the probability in a way that ignores this possibility of overlap and it would give you an overestimate rather than an underestimate.)

I'm not sure that what I wrote is understandable; sorry. It's hard to phrase.
 
Feb 2010
7
0
If I understand correctly, the key to the observed difference is that the box entering the pool of dissimilar boxes might be similar to more than one boxes already inside the pool (although these boxes are dissimilar to each other). I can visualize this in the following way (although I am not sure it is appropriate): In an empty rectangle we start drawing circles of equal radius. The centre is the actual box configuration while the disk represents the similarity area. It is possible that a newly-drawn circle intersects two existing circles that are close but do not intersect each other. Since we only take account of the probability of pairwise intersection, we ignore this small possibility of three-or more circle intersections.

Assuming that the aforementioned is correct, I fail to see why the probability we evaluate is not "exact", in the sense that we first evaluate the probability of no-match i.e. (503/512)^45=~0.450206. The complementary probability 1-0.450206=0.549794 should include these three-or more circle intersections.

I also expect that this discrepancy between evaluated and expected probability is increased with the number of boxes already in the pool.

In any case, how do you suggest I evaluate the overestimate so that I can have an upper and lower bound?

Thank you very much in advance
 

CRGreathouse

Forum Staff
Nov 2006
16,046
936
UTC -5
Randomas said:
If I understand correctly, the key to the observed difference is that the box entering the pool of dissimilar boxes might be similar to more than one boxes already inside the pool (although these boxes are dissimilar to each other). I can visualize this in the following way (although I am not sure it is appropriate): In an empty rectangle we start drawing circles of equal radius. The centre is the actual box configuration while the disk represents the similarity area. It is possible that a newly-drawn circle intersects two existing circles that are close but do not intersect each other. Since we only take account of the probability of pairwise intersection, we ignore this small possibility of three-or more circle intersections.
The visualization is appropriate, though remember that generally the dimension will be much higher than 2.

Randomas said:
Assuming that the aforementioned is correct, I fail to see why the probability we evaluate is not "exact", in the sense that we first evaluate the probability of no-match i.e. (503/512)^45=~0.450206. The complementary probability 1-0.450206=0.549794 should include these three-or more circle intersections.
Let me give an example; that should be easier to make sense of than my fumbling explanations. The "no-match" probability calculated by this formula is always greater than 0, right? But with enough events the actual probability will fall to 0 (not close to 0) since the space will be filled.
 
Feb 2010
7
0
In my previous post, I obviously messed up some things. Two already drawn circles may intersect each other, as long as the distance between their centers is more than the radius (in other words, the center of each circle cannot be within the area of another).

Is this overlap of similarity area the reason of this discrepancy that what you are talking about? If I understood correctly, the formula will be more precise at the begining? As the space gets filled, the error is getting worse?

The point regarding the no-match formula yielding always positive results is understood. At some point, the no-match probability will be exactly zero, because the space will be filled completely.
 

CRGreathouse

Forum Staff
Nov 2006
16,046
936
UTC -5
Randomas said:
In my previous post, I obviously messed up some things. Two already drawn circles may intersect each other, as long as the distance between their centers is more than the radius (in other words, the center of each circle cannot be within the area of another).

Is this overlap of similarity area the reason of this discrepancy that what you are talking about? If I understood correctly, the formula will be more precise at the begining? As the space gets filled, the error is getting worse?
I don't think the "overlap of similarity area" is what I'm talking about -- 'no center can be within the diameter of another center' is sufficient. But you're right that the formula will be more precise in the beginning -- should be exact for fewer than 3. So maybe you do understand but we're not communicating well. (I know I'm not -- I'm rather hazy in the head today, probably `cause of lack of sleep.)
 
Feb 2010
7
0
The problem is not you - it is that I haven't even remotely studied probability since the university days - but I think that I understand what you are saying.

What confuses me is that, due to these overlaps, I would expect the "no match" space to be slightly bigger in reality as compared to the formula.

In other words, I would expect the actual at-least-one-similar-pair probability to be slightly lower (not higher) than the one derived from the formula.

Alternatively, I got it all wrong :)
 

CRGreathouse

Forum Staff
Nov 2006
16,046
936
UTC -5
I feel that I should be able to better explain, and ideally come up with a formula describing roughly how far off we expect to be with that method. Right now I'm actually unable... but let me look at it afresh tomorrow (hopefully after a long sleep!) and maybe I'll be of more help.

This is an interesting problem you raise and it's a pity we don't have a real expert in this subject to help out.