I have never looked into this. I assume it was done the way you propose at some point, since it seems like the lowest-cost approach. I have, however, found a Billboard blurb from 1979 to the effect that internal electronic conversion was possible with dedicated equipment, but I can't say how it would have worked.
It's very possible, given the quoted $200,000 price point, that the equipment in question worked exactly the way it would 20 years later: using a digital framebuffer to capture the signal and then re-output it with different timing. Digital memory was obscenely expensive at the time, not to mention quality DACs and ADCs, but if you had the money, there were devices (for transcoding, effects, editing, etc.) that worked on precisely the same principle as modern equipment.
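To make the framebuffer idea concrete, here is a rough sketch of the simplest possible retiming: each 50 Hz output field just reads whichever 60 Hz input field was most recently captured. The function name `source_field` and the repeat/drop policy are my own assumptions for illustration; I have no idea whether the 1979 equipment did anything this crude, and real converters also have to deal with the 525 vs 625 line counts and the colour encoding, which this ignores.

```python
def source_field(out_index, in_rate=60, out_rate=50):
    """Index of the input field already in the buffer when output field
    `out_index` starts. Hypothetical helper; integer math avoids rounding."""
    return (out_index * in_rate) // out_rate

# Which 60 Hz input field feeds each of the first ten 50 Hz output fields.
mapping = [source_field(k) for k in range(10)]
print(mapping)  # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
```

Input field 5 never reaches the output: one field in every six is dropped, which is where the motion judder of a naive converter comes from.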
Intriguingly, however, the approach you described is no different in principle from the modern solution; it just lacks the digitization step. You have a pair of signal transducers, a CRT turning an NTSC electrical signal into photons and a camera tube (we're talking video tube era) turning photons back into a PAL electrical signal, and the two are separated by a phosphor screen which acts as a buffer, smoothing over the timing discrepancy between the devices. The phosphor either holds a 50 Hz image stable long enough for the 60 Hz tube to scan it 1.2 times (60/50), or it allows 60 Hz fields to be slightly merged together by PAL's slower scan.
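The merging case can be put into numbers with a small sketch. It assumes a deliberately idealized model: the screen and pickup tube together act as a perfect integrator over one output field period, so each 60 Hz input field contributes in proportion to how long it overlaps the 50 Hz output field. Real phosphor decay and tube lag are closer to exponential, so treat the weights as illustrative only; `blend_weights` is a made-up name.

```python
from fractions import Fraction

def blend_weights(out_index, in_rate=60, out_rate=50):
    """Share of each input field in one output field, assuming ideal
    box-shaped integration over the output field period (a simplification)."""
    start = Fraction(out_index, out_rate)
    end = Fraction(out_index + 1, out_rate)
    weights = {}
    j = (out_index * in_rate) // out_rate  # first input field that overlaps
    while Fraction(j, in_rate) < end:
        lo = max(start, Fraction(j, in_rate))
        hi = min(end, Fraction(j + 1, in_rate))
        weights[j] = (hi - lo) * out_rate  # overlap as a fraction of the output period
        j += 1
    return weights

print(blend_weights(0))  # input field 0 weighted 5/6, input field 1 weighted 1/6
print(blend_weights(1))  # input field 1 weighted 2/3, input field 2 weighted 1/3
```

Every output field covers 1.2 input field periods, so it always ends up as a mix of two neighbouring input fields, with the split drifting from field to field and repeating every five output fields.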