25, white-Latinx, plural trans therian photographer and musician. Anarcha-feminist. Occasionally NSFW

discord: hypatiacoyote


wolf-apparatus
@wolf-apparatus

so this color image (cinestill 50d / speed graphic) took about 6 minutes whereas yesterday it would have been about 10. getting firmly into "it really wouldn't be that bad to scan a whole roll on here" territory finally!

and it means I'm once again limited by CCD readout speed, which is limited by lack of hardware CCD timing support, so it's pretty close to the best I can do before more boards show up.

technical details under the cut, since it's some neat stuff


This is the "nuclear option" I referenced at the end of a previous post; it turned out to be much easier to implement than I'd feared. Basically, I've taken over control of the microcontroller's USB pipeline from the board-support library (at least while scanning is happening), and I'm letting it run much faster and without any CPU overhead.

Previously, the pipeline from "ADC" to "host computer" looked like this:

  1. Read 85 ADC samples into memory (copy 1)
  2. Package those 85 ADC samples into a 512-byte USB packet (copy 2)
  3. (in USB library code) Copy the 512-byte packet into an internal buffer (copy 3)
  4. Ask the USB controller to send a 512-byte DMA transfer from that internal buffer
  5. Wait until the USB request completes
  6. Repeat until the line's over

This is slow for a bunch of reasons. Obviously, we're copying the ADC data three different times, two of which are memory-to-memory copies (read: "lots of CPU time for no reason", and recall that we're right up against the margins on CPU time here), but, more subtly, telling the USB core to do a bunch of individual 512-byte transfers isn't very efficient either.
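To make the copy count concrete, here's a host-side Rust sketch of that old path. The wire format is purely my guess for illustration: I'm assuming three little-endian 16-bit color channels per sample, so 85 samples fill 510 of the packet's 512 bytes.

```rust
// Hypothetical wire format: three little-endian 16-bit channels per sample.
fn pack_packet(samples: &[[u16; 3]]) -> [u8; 512] {
    let mut packet = [0u8; 512];
    for (i, sample) in samples.iter().enumerate() {
        for (j, ch) in sample.iter().enumerate() {
            let off = i * 6 + j * 2;
            packet[off..off + 2].copy_from_slice(&ch.to_le_bytes());
        }
    }
    packet
}

fn main() {
    let adc_samples = [[0x0123u16, 0x0456, 0x0789]; 85]; // copy 1: ADC -> memory
    let packet = pack_packet(&adc_samples);              // copy 2: samples -> USB packet
    let internal_buffer = packet.to_vec();               // copy 3: packet -> library's buffer
    // The CPU has touched every byte three times before the USB DMA even starts.
    assert_eq!(internal_buffer.len(), 512);
    println!("first packed bytes: {:02x?}", &internal_buffer[..6]);
}
```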

Aside: Brief explanation of DMA

DMA, or direct memory access, is a way to interact with hardware peripherals. Without DMA, you have to interact with hardware via programmed I/O, where you essentially take a byte at a time from memory and run a CPU instruction that boils down to "hey, go tell that guy over there that 0x3C is happening" (which is also how I'm talking to the ADCs at the moment). This isn't a good use of anyone's time, so DMA lets you just say "hey hardware peripheral, I have a buffer of 65,536 bytes at address 0x2000_4000, go ahead and directly access memory to get it out". The key is that DMA doesn't require active participation from the CPU.
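A toy Rust sketch of the difference in CPU involvement (the `Peripheral` here is a pure stand-in for illustration, not real register-level code):

```rust
// Stand-in for a hardware peripheral; not real register access.
struct Peripheral {
    received: Vec<u8>,
}

impl Peripheral {
    // Programmed I/O: the CPU hands over one byte per call,
    // so moving a big buffer costs one loop iteration per byte.
    fn write_byte(&mut self, b: u8) {
        self.received.push(b);
    }

    // DMA-style: the CPU hands over a whole buffer descriptor once,
    // and the peripheral pulls the data out of memory by itself.
    fn dma_transfer(&mut self, buf: &[u8]) {
        self.received.extend_from_slice(buf);
    }
}

fn main() {
    let data = vec![0x3Cu8; 65_536];

    let mut pio_dev = Peripheral { received: Vec::new() };
    // PIO: 65,536 iterations of "hey, 0x3C is happening".
    for &b in &data {
        pio_dev.write_byte(b);
    }

    let mut dma_dev = Peripheral { received: Vec::new() };
    // DMA: one "go get it" from the CPU.
    dma_dev.dma_transfer(&data);

    assert_eq!(pio_dev.received, dma_dev.received);
    println!("both moved {} bytes", dma_dev.received.len());
}
```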

However, if I'm going to use the imxrt-usbd USB code, I have to put up with the internal buffer copy in step 3 and the 512-byte-at-a-time transfers. The internal buffer copy is stupid, but the packet-at-a-time limit is downright criminal considering how much memory the USB core can actually directly access.

See, 512 bytes is the limit on USB packet size, at least the way I have it set up, but the hardware USB controller isn't limited to sending a packet at a time under software control. In fact, a single "transfer descriptor" passed to the controller can target a buffer of up to 20 kiB, and the USB controller will automatically split it out into packets.

One line of scan data is about 60 kiB, so that should mean I only have to interact with the USB controller three times a line (down from 128), right? Well, there's one more trick. One of the fields in that transfer descriptor is actually the address of the next transfer descriptor the USB controller should process. That means there's no limit to the size of a single USB transfer - I could literally tell the controller to upload the entire 1 MiB of onboard RAM to the computer in one go.
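Here's a rough Rust model of that descriptor chaining. The field names and layout are simplified stand-ins (the real descriptor has status bits and holds a physical address for the next-descriptor link, not a Rust pointer):

```rust
// One descriptor can cover up to 20 kiB of buffer.
const MAX_TD_BYTES: usize = 20 * 1024;

// Simplified stand-in for a hardware transfer descriptor.
struct TransferDescriptor {
    offset: usize,                         // stand-in for the buffer address
    len: usize,                            // bytes this descriptor covers
    next: Option<Box<TransferDescriptor>>, // hardware stores a physical address here
}

// Split an arbitrarily large transfer into a chain of descriptors.
fn build_chain(total: usize) -> Option<Box<TransferDescriptor>> {
    let mut next = None;
    let mut starts: Vec<usize> = (0..total).step_by(MAX_TD_BYTES).collect();
    // Build back-to-front so each descriptor can link to the one after it.
    while let Some(start) = starts.pop() {
        let len = (total - start).min(MAX_TD_BYTES);
        next = Some(Box::new(TransferDescriptor { offset: start, len, next }));
    }
    next
}

// Walk the chain and count descriptors.
fn chain_len(mut cur: &Option<Box<TransferDescriptor>>) -> usize {
    let mut n = 0;
    while let Some(td) = cur {
        n += 1;
        cur = &td.next;
    }
    n
}

fn main() {
    // A ~60 kiB scan line splits into exactly three descriptors.
    let line = build_chain(60 * 1024);
    assert_eq!(chain_len(&line), 3);
    if let Some(td) = &line {
        println!("first descriptor: offset {}, len {}", td.offset, td.len);
    }
}
```

Since the last descriptor's `next` link can point at another descriptor anywhere in memory, the chain (and thus the transfer) can be as long as you like.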

And, as it turns out, telling the controller to do that isn't very hard. The USB library code is pretty darn complicated, but once it sets everything up I just have to construct the transfer descriptor structure somewhere and tell the USB controller "hey, go look at this one instead for a moment", and as long as I don't try to do USB with the library code while that's happening it actually Just Works.

Once I got that working, the final touch on the whole thing was to switch to a double-buffered architecture instead of a queue and write pre-formatted packets into the buffer as the ADC samples come in.

Again, if you're not familiar with double-buffering: it's a neat technique that lets two different threads or processes read and write "the same data" at the same time. You construct two buffers and take a pointer to each, calling one pointer the _front buffer_ and the other the _back buffer_. Whatever thread's generating data writes it into the _back buffer_ while the other thread reads the last set of data from the _front buffer_, and when both threads are done you swap the pointers.

Now the front pointer points to what was previously the back buffer, so on the next cycle the reader thread will read what got written this cycle, and the back buffer points to what was previously the front buffer, which has already been sent out and is usable as scratch space for the next dataset.
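In code, the swap discipline looks something like this. A single-threaded Rust sketch with made-up names; in the actual firmware the "reader" is the USB controller's DMA and the "writer" is the ADC loop:

```rust
// Minimal double-buffer sketch. Single-threaded here on purpose:
// the point is the pointer swap, not the concurrency machinery.
struct DoubleBuffer {
    front: Vec<u8>, // being read out (e.g. sent over USB)
    back: Vec<u8>,  // being filled with the next line's data
}

impl DoubleBuffer {
    fn new(len: usize) -> Self {
        DoubleBuffer { front: vec![0; len], back: vec![0; len] }
    }

    // The only synchronization point: exchange the two buffers.
    fn swap(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
    }
}

fn main() {
    let mut db = DoubleBuffer::new(8);
    for line in 0u8..3 {
        db.back.fill(line); // writer fills the back buffer...
        // ...while, conceptually in parallel, the reader drains db.front.
        db.swap();          // both sides done: swap
        // Next cycle the reader sees what was just written.
        assert!(db.front.iter().all(|&b| b == line));
    }
    println!("front buffer holds the most recently written line");
}
```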

Double-buffering can be a lot faster than using a queue because it completely removes data dependencies between the threads. When I was using a single buffer, the reader thread would have to wait until the writer thread completed (or at least wrote some data), and then at the end I'd have to wait for the reader thread to complete. With double-buffering, the only time synchronization is required is when I'm actually swapping the buffer pointers.

The downside is there's now a pipeline delay, since we're now reading the previous line's data instead of the current line. Thankfully, I've set up the host-side code to be able to deal with that since it's not the only source of delay in the system (the current pipeline is actually 3 lines long).

So now the cycle looks something like this:

  1. Tell the USB controller to start reading an entire fucking line out of the front buffer (no CPU involvement).
  2. Read 10,750 ADC samples, storing them in the back buffer in wire format (copy 1)
  3. Wait for the USB write to complete if it hasn't already (it usually has)
  4. Swap the buffers
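The cycle above can be sketched in host-side Rust, with a thread standing in for the USB controller's DMA engine. The channel plumbing and names are illustrative, not the real firmware:

```rust
use std::sync::mpsc;
use std::thread;

// Runs `lines` scan lines through the double-buffered pipeline and
// returns the front buffer so the caller can inspect the last line.
fn run_pipeline(lines: u16) -> Vec<u16> {
    let mut front = vec![0u16; 10_750];
    let mut back = vec![0u16; 10_750];

    let (to_usb, usb_rx) = mpsc::channel::<Vec<u16>>();
    let (done_tx, usb_done) = mpsc::channel::<Vec<u16>>();
    // Stand-in for the USB controller: "sends" each buffer it's
    // handed, then returns it to the main loop.
    let usb = thread::spawn(move || {
        for buf in usb_rx {
            done_tx.send(buf).unwrap();
        }
    });

    for line in 0..lines {
        to_usb.send(front).unwrap();            // 1. kick off USB read of front (no CPU work)
        for s in back.iter_mut() { *s = line; } // 2. ADC samples -> back buffer, wire format
        front = usb_done.recv().unwrap();       // 3. wait for the USB write to finish
        std::mem::swap(&mut front, &mut back);  // 4. swap the buffers
    }
    drop(to_usb); // close the channel so the "USB controller" thread exits
    usb.join().unwrap();
    front
}

fn main() {
    let front = run_pipeline(3);
    // After line 2 is written and the buffers swap, front holds line 2's data.
    assert!(front.iter().all(|&s| s == 2));
    println!("pipelined 3 lines through the double buffer");
}
```

Step 2 overlaps with the transfer started in step 1, which is where the time savings come from.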

Since the ADC copy-in is pretty much unavoidable (technically I might also be able to DMA from the ADC interface, but it's extremely fiddly and probably not worth it), this means sending the data back up to the host computer now costs zero additional CPU overhead, whereas previously it took about 1/3 of the CPU time during a scan. That means I can now run the USB write and the ADC read at the same time, knocking about 1/3 of the scan time off!

And of course getting that much margin back means I'm very confident I'll be able to ramp up to a 2.3 MHz pixel-clock when the next hardware revision shows up.
