I thought of that too. But then, when I look at the A7R4 @ 61MP, the 20 megapixel M43 sensor is already just as dense (based on pixels per sq in).
Anyway, the Full Frame A7R4 has also reached its limits, with some reviewers noting that its noise level is slightly higher at high ISO compared to the 42MP A7R3.
So it looks like it will be challenging for M43 to push the megapixel count beyond 20MP.
The pixel density of a 20MP m43 sensor is equal to that of an 80MP FF sensor. So the A7R4 has a lower pixel density than 20MP m43 sensors.
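A quick back-of-envelope check of that claim (using the nominal sensor dimensions, 17.3×13mm for m43 and 36×24mm for full frame; real active areas vary slightly by model):

```python
# Back-of-envelope pixel density comparison. Sensor dimensions are the
# nominal published figures, not exact active areas.
M43_AREA_MM2 = 17.3 * 13.0   # ~224.9 mm^2
FF_AREA_MM2 = 36.0 * 24.0    # 864 mm^2

def density_mp_per_mm2(megapixels, area_mm2):
    """Megapixels per square millimetre of sensor area."""
    return megapixels / area_mm2

m43_20mp = density_mp_per_mm2(20, M43_AREA_MM2)
a7r4_61mp = density_mp_per_mm2(61, FF_AREA_MM2)

# A full-frame sensor built at the same density as a 20MP m43 sensor:
ff_equiv_mp = m43_20mp * FF_AREA_MM2

print(f"20MP m43 density:  {m43_20mp:.4f} MP/mm^2")
print(f"61MP FF density:   {a7r4_61mp:.4f} MP/mm^2")
print(f"FF at m43 density: {ff_equiv_mp:.1f} MP")  # ~77MP, i.e. roughly 80MP
```

So a 61MP full-frame chip is still less densely packed than a 20MP m43 chip.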
But all this is a misconception. Pixel density doesn't increase the noise level of an image. It may decrease the signal level slightly, since the pixel walls still occupy some surface area, but with BSI and gapless microlenses these have been non-issues for a long time already.
The differences between the A7R4 and A7R3 come down to how the sensors have been optimised. The A7R4 sensor has a larger bandwidth, and the need to offload data quicker does increase electronic noise.
But the two sensors occupy the same surface area (the FF frame size) and will perform largely the same regardless of pixel count. The sensor area is the main determinant of how much signal is collected (assuming the same sensor QE). Photon shot noise will be the same. Electronic read noise will differ depending on how the sensor is optimised.
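To make the shot-noise point concrete, here's a toy sketch (assuming Poisson photon statistics and equal QE and exposure): divide the same sensor area into more pixels and the whole-image SNR doesn't change, because downsampling N pixels to one recovers a factor of sqrt(N).

```python
import math

# Shot-noise-limited SNR for a whole image. Photon arrivals are Poisson,
# so noise is sqrt(signal); the total signal scales with sensor area,
# not with how that area is carved up into pixels.
def image_snr(photons_per_mm2, area_mm2, pixel_count):
    total_photons = photons_per_mm2 * area_mm2
    per_pixel = total_photons / pixel_count
    per_pixel_snr = math.sqrt(per_pixel)           # Poisson: SNR = sqrt(N)
    # Averaging pixel_count pixels down to one improves SNR by
    # sqrt(pixel_count), so the image-level SNR is pixel-count independent:
    return per_pixel_snr * math.sqrt(pixel_count)  # = sqrt(total_photons)

flux, ff_area = 1000.0, 36.0 * 24.0
snr_42mp = image_snr(flux, ff_area, 42_000_000)  # A7R3-like pixel count
snr_61mp = image_snr(flux, ff_area, 61_000_000)  # A7R4-like pixel count
print(snr_42mp, snr_61mp)  # identical, per-sensor-area signal is the same
```

What *does* differ between the two cameras is read noise and readout behaviour, which is exactly the electronic-optimisation point above.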
WRT smartphone sensors, the incredible pixel densities are real. And generally speaking, the tech on smaller sensors is far more advanced and more efficient than on large sensors. It generally takes a few years before sensor tech works its way up to larger sensors.
But all those pixels don't mean the sensor is resolving anywhere near its maximum potential. You've got diffraction rearing its head and, probably most importantly, the lenses in front of those sensors don't resolve anywhere near their maximum potential.
So you can get 100+MP of data from these tiny sensor modules, but you're not achieving anywhere near 100+MP of resolution.
But because these sensors are incredibly quick, you can do all sorts of things with that data, plus you wouldn't want 100MP files anyway due to storage issues.
Binning is just a way of combining pixels and has been around forever. Nikon's D1X was a binned sensor from 2001.
A 4:1 bin from a 108MP Bayer sensor gets you 27MP images, but each pixel now has all the RGB colour information, compared to a regular 27MP Bayer sensor where colour information is interpolated from its neighbouring pixels. I think they experiment with various different CFAs, such as quad-Bayer rather than regular Bayer.
They can do all sorts of creative methods to improve the final image by starting off with far more sampling (a large pixel count), and because the data offload is so quick and you've got powerful CPUs working on that data, there's not really any lag penalty, and it creates better final images at a far more reasonable (for storage) resolution.
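Here's a toy sketch of that 4:1 bin idea on a regular RGGB Bayer mosaic (array sizes here are illustrative, not a real 108MP chip; real pipelines are far more sophisticated). Each 2×2 block contains one red, two green and one blue sample, so every binned output pixel gets measured values for all channels instead of interpolating from neighbours:

```python
import numpy as np

def bin_bayer_rggb(raw):
    """raw: 2D array with even dimensions in RGGB layout.
    Returns an RGB image at half the linear resolution (a 4:1 bin)."""
    r = raw[0::2, 0::2]                              # top-left of each 2x2
    g = (raw[0::2, 1::2] + raw[1::2, 0::2]) / 2.0    # average the two greens
    b = raw[1::2, 1::2]                              # bottom-right of each 2x2
    return np.stack([r, g, b], axis=-1)

raw = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "sensor"
rgb = bin_bayer_rggb(raw)
print(rgb.shape)  # (2, 2, 3): quarter the pixel count, full colour per pixel
```

(On an actual quad-Bayer CFA the 2×2 groups share one colour filter, so a 4:1 bin there yields a regular Bayer mosaic instead; the full-RGB-per-pixel result above applies to binning a conventional Bayer pattern.)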
In terms of how the results look, forget resolution for a moment and think about magnification.
The main reason larger sensors produce better results when displaying or printing large is that they are magnified much less. If you don't magnify the results much (e.g. displaying on a phone or making small prints), you just can't see the difference.
When you do magnify the image for display, then resolution starts to matter, but this is relative to the amount of magnification.
Your viewing distance is also a factor, because how close you view an image/photo is similar to how much you magnify it. That's why iPhone images can look just fine at billboard size, as long as they're on a billboard, which means you're viewing them from far away.