Hang on.
I've just come back from a job, so I'm not totally clued in yet (black shirts v white shirts, no less). But I'm a little confuzzled about what exactly the problem, or the paradox, is.
Firstly, an incident light reading basically measures the amount of light falling on a given scene. This means that using that setting under that lighting, middle grey should reproduce as middle grey.
Now with the dark shirt/light shirt example, is the dark shirt exactly as many stops below middle grey as the light shirt is above it? If the dark shirt is not as dark as the light shirt is light (eek for the expression) then it stands to reason that the light shirt may well not reproduce as well as the dark one. Have a further look at the zone system for a more thorough examination of this.
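To put rough numbers on "how many stops darker or lighter", here is a minimal sketch. It assumes middle grey is an 18% reflectance grey card, and the shirt reflectances are made-up figures for illustration only, not measurements of any real fabric.

```python
import math

MIDDLE_GREY = 0.18  # assumption: standard 18% grey card reflectance


def stops_from_middle_grey(reflectance: float) -> float:
    """Signed distance from middle grey in stops (positive = brighter)."""
    if reflectance <= 0:
        raise ValueError("reflectance must be positive")
    return math.log2(reflectance / MIDDLE_GREY)


# Hypothetical reflectances, purely illustrative
white_shirt_stops = stops_from_middle_grey(0.90)
black_shirt_stops = stops_from_middle_grey(0.04)

print(f"white shirt: {white_shirt_stops:+.2f} stops")
print(f"black shirt: {black_shirt_stops:+.2f} stops")
```

Swap in your own spot-meter readings or reflectance guesses to see whether your two shirts really sit symmetrically around middle grey.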
Add to that the fact that the dynamic range of the sensor is very pertinent. If your tone curve were linear (which it isn't), the dark shirt were exactly as many stops below middle grey as the light shirt is above it, and you had metered accurately for middle grey, then the two should reproduce equally well. But, especially with digital sensors, the tone curve tends to be S-shaped, headroom in the shoulder is poor, and dynamic range is limited, with particularly poor performance in the highlights once any single channel comes close to clipping.
Finally, you also have to bear in mind that most in-camera highlight clipping warnings trigger if any one of the RGB channels is clipped, rather than only when all three are. It's entirely possible that a light shirt that appears to be clipped is actually within tolerances.
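The difference between "any channel clipped" and "all channels clipped" is easy to see in a small sketch. This assumes 8-bit RGB values with 255 as the clipping point; the pixel value is a hypothetical example, and real cameras may apply their own thresholds.

```python
CLIP_LEVEL = 255  # assumption: 8-bit output, 255 = clipped


def any_channel_clipped(rgb, limit=CLIP_LEVEL):
    """True if at least one channel has hit the limit (a typical warning trigger)."""
    return any(channel >= limit for channel in rgb)


def all_channels_clipped(rgb, limit=CLIP_LEVEL):
    """True only if every channel has hit the limit (no detail left at all)."""
    return all(channel >= limit for channel in rgb)


# Hypothetical white-shirt pixel under warm light: red is clipped, green and blue are not
shirt_pixel = (255, 243, 236)

print(any_channel_clipped(shirt_pixel))   # True: the warning would flash
print(all_channels_clipped(shirt_pixel))  # False: two channels still hold detail
```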
Now also consider what happens if you reverse the lighting situation. Instead of the two shirts in sunlight, you have them in a dimly lit room. The dark shirt is likely to block up, while the light shirt would likely be properly exposed.
I'm still not entirely sure where the paradox comes in, but I suspect it's somewhere in the whole "correct exposure" concept. Incident metering takes care of your mid point. After that, a scene with lots of light tones will appear light, a scene with lots of dark tones will appear dark, and if the dynamic range of the scene exceeds the sensor's capabilities you will end up with clipped highlights or clipped shadows. With current sensor designs, clipped highlights are generally more likely than clipped shadows, because there is more latitude in the shadow areas than in the highlights.
Have a quick look at any dynamic range graph on DPReview, for example, or on any other reliable test site. As an example, I had a quick look at the D3000 review on that site: at ISO 200 the useable shadow range is 5.1 EV and the useable highlight range is 3.6 EV (below and above mid grey respectively). That's an extra 1.5 stops of latitude in the shadow areas, which is why your dark shirt is AOK while the light shirt gets clipped.
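As a rough sketch of that asymmetry, the snippet below takes the two D3000 figures quoted above and checks where a tone lands. The function name and the test tones of 4 stops either side of mid grey are made up for illustration, and a hard cut-off is a simplification of how a real sensor rolls off.

```python
SHADOW_RANGE_EV = 5.1     # useable range below mid grey (figure quoted above)
HIGHLIGHT_RANGE_EV = 3.6  # useable range above mid grey (figure quoted above)


def classify_tone(stops_from_mid_grey: float) -> str:
    """Say whether a tone fits inside the useable range (hard cut-off assumed)."""
    if stops_from_mid_grey > HIGHLIGHT_RANGE_EV:
        return "clipped highlight"
    if stops_from_mid_grey < -SHADOW_RANGE_EV:
        return "blocked shadow"
    return "within range"


extra_shadow_latitude = round(SHADOW_RANGE_EV - HIGHLIGHT_RANGE_EV, 1)
print(f"extra shadow latitude: {extra_shadow_latitude} EV")

# Hypothetical tones the same distance either side of mid grey
print(classify_tone(+4.0))  # light tone: clipped highlight
print(classify_tone(-4.0))  # dark tone: within range
```

Two tones equally far from mid grey get different treatment, which is the whole dark shirt v light shirt story in miniature.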