Detour: copying ROM to RAM on startup

Posted on Sat 30 May 2020 in MUPS16

I've been toying with the idea of trying to use serial ROM chips in my build for quite a while, but I couldn't find anyone else who had done something similar, and wrote it off as too complicated. A couple of weeks ago, however, I started working on the PCB for the ROM component of the MUPS/16, and decided to have another look.

Parallel ROM chips are certainly easier to design for, but they have a few properties I really don't like:

they're huge, compared to most surface-mount components
reprogramming them requires removing them from the board, which forces them to be installed into a chip holder, further increasing their size.
they're expensive (roughly £5-7 each, and I need at least 8 of them)
they're slow; the fastest I could find that weren't hundreds of pounds had an access time of ~140ns, which is terrible compared to the ~10ns of a 50p SRAM chip. Given that every single cycle requires at least one ROM access for microcode, and we have to allow for a second memory access to ROM, this would automatically cap the clock rate at no more than about 3MHz.
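As a rough check on that figure, assuming the microcode access and the second ROM access happen back to back within one cycle, and ignoring every other delay in the data path:

```latex
t_{\text{cycle}} \ge 2 \times 140\,\text{ns} = 280\,\text{ns}
\qquad\Longrightarrow\qquad
f_{\text{max}} = \frac{1}{280\,\text{ns}} \approx 3.6\,\text{MHz}
```

Any gate or wiring delay elsewhere in the cycle only pushes this lower, so about 3MHz is a fair estimate once the rest of the circuit is accounted for.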

Now, admittedly, the speed problem is unlikely to be a big issue to start with, since I'd be surprised if I could get the clock rate that high anyway, but I'd still like to give myself some wiggle room. The biggest practical objection is the difficulty of reflashing the chips, especially since microcode changes would require flashing up to 7 separate ROM chips.

Serial ROM chips are much smaller and much cheaper than parallel ROM chips, but they're obviously completely unusable for microcode (or even normal ROM) in their serial form. The ideal combo would be to have a single serial ROM chip, and copy it into multiple SRAM chips on startup. I wasn't sure this would be practical, since I wanted to stick to my rule of having no microprocessors in my build, which ruled out the obvious solution of using an ATtiny and a couple of shift registers. Still, I decided to give it a go.

Most serial ROM chips these days seem to use the I2C protocol, so I started my first attempt with an I2C ROM chip. This really wasn't very successful. I2C is a reasonably complicated protocol to implement solely in discrete components. In particular, the clock-stretching and acknowledgement parts of the protocol gave me a few headaches. I got clock stretching working, but implementing acknowledgements was getting more complicated than I was comfortable with. Shortly after giving up on that, however, I found the Microchip 25LC512 ROM, which uses the SPI protocol. SPI is a much simpler protocol than I2C, and after about two weeks of sporadic evening work I had a reliable prototype working on breadboards.

Breadboard

Fundamentally, what we're trying to do here is to set up a circuit such that when power is first applied to the system, we hold the rest of the system in reset mode while we copy all 64KB out of the serial ROM into one or more SRAM chips, and once that's done, release the reset line and allow the system to boot.

To get an idea of what we need to do, we can look at the example read sequence from the ROM chip's datasheet:

Read sequence

Looking at this, we can see that we're going to need, at minimum:

a clock
some kind of flipflop for the CS line
a counter for individual bits
a counter for bytes
a shift register to hold the bytes coming out of the ROM chip

It turns out that it's much simpler to implement if we split the byte counter into two layers, with a state counter, which will have 5 states:

command, in which we send the READ command (00000011);
addr0, in which we send the first byte of the address (always 0, for our use-case);
addr1, in which we send the second byte of the address (also always 0);
data, in which we read 65,536 bytes of data from the MISO line and send it to RAM; and
finished, in which we set CS high again and stop the SPI clock

and a separate address counter, which will count from 0 to 2¹⁶-1 once we hit the data state.
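To make the state breakdown concrete, here's a small Python model of the five states and how many serial clock cycles each one lasts. It's only a sketch for checking the arithmetic, not part of the circuit: the `State` enum and the helper functions are my own names, and it assumes the data state covers the full 64KB at 8 clocks per byte.

```python
from enum import IntEnum

ROM_BYTES = 2 ** 16  # 25LC512: 64KB


class State(IntEnum):
    """State counter values, in the order the counter steps through them."""
    COMMAND = 0   # send READ (0b00000011)
    ADDR0 = 1     # send high address byte (always 0 here)
    ADDR1 = 2     # send low address byte (always 0 here)
    DATA = 3      # clock 65,536 bytes out of the ROM
    FINISHED = 4  # CS high, SPI clock stopped


def bytes_in_state(state: State) -> int:
    """How many bytes cross the SPI bus while in `state`."""
    if state is State.DATA:
        return ROM_BYTES
    if state is State.FINISHED:
        return 0
    return 1


def serial_clock_cycles() -> int:
    """Total SPI clock cycles from CS going low to reaching FINISHED."""
    return sum(8 * bytes_in_state(s) for s in State)


print(serial_clock_cycles())  # 524312
```

Almost all of that is the data state: the three header bytes only add 24 cycles.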

Power-on reset

Reset circuit

We're using a few stateful components (the flip-flop and counters), so we need to make sure that they're all initialised to a sensible value on powerup. We do this with a simple one-shot 555 timer circuit that stays low for a short time after startup, then goes high. This is connected to the green wires, which extend around the board to reach all the components that need resetting on startup.

Since we're going to be triggering the CS line off this, we also want to make sure that we synchronise the reset line with the clock, so that it goes high on a clock boundary. We do this with a JK flip-flop (we only really need a simple SR latch, but I had an SN74LS107 chip handy, so I used that). We wire the CLR line of the flipflop to reset, so that the state is initialised to 0 on startup, and we set J to 1 and K to 0, so that the latch is set on the first clock pulse after reset goes high.

Note that the J and K lines aren't directly wired to 1 and 0, since we want to be able to clear the flipflop when we're done. Instead, these are driven by the finished output from the state demux, which we'll see later.

Note also that these are master-slave JK flip-flops, which means that the value doesn't change until the falling edge of the clock. This turns out to be quite useful.
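The flip-flop's job over the whole copy can be sketched in a few lines of Python. This is a behavioural model, not a timing-accurate one: each loop step stands for one clock pulse (on the '107 the output updates on the falling edge), the function names are my own, and it assumes the wiring described in the state counter section, with J on the active-low finished line from the demux and K on its inverse.

```python
def jk_next(q: int, j: int, k: int) -> int:
    """State of a JK flip-flop after one clock pulse (CLR inactive)."""
    if j and not k:
        return 1       # set
    if k and not j:
        return 0       # reset
    if j and k:
        return 1 - q   # toggle
    return q           # J = K = 0: hold


def copy_enable_trace(states):
    """Flip-flop output for a sequence of state-counter values.

    J follows the active-low 'finished' line and K its inverse, so
    J=1, K=0 until the state counter reaches FINISHED (4).
    """
    FINISHED = 4
    q = 0  # CLR holds the flip-flop at 0 while reset is low
    trace = []
    for state in states:
        finished_n = 0 if state == FINISHED else 1
        q = jk_next(q, j=finished_n, k=1 - finished_n)
        trace.append(q)
    return trace


print(copy_enable_trace([0, 0, 1, 2, 3, 4, 4]))  # [1, 1, 1, 1, 1, 0, 0]
```

The output is set on the first pulse after reset is released, holds through the copy, and clears as soon as the finished state is reached.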

The final point of note here is that the 555 timer's own reset line is pulled high via a 1kΩ resistor. This keeps it high normally, but allows us to pull it low at any point to reset the whole circuit and restart the copy (and, since we're going to be tying the rest of the CPU's reset line to this circuit eventually, this lets us reset the whole CPU).

Serial clock

Clock circuit

There are actually three separate clock circuits here, though they are all synchronised:

the system clock (in white). In this implementation this is just a standard Ben Eater-style clock;
the internal clock used by the ROM->RAM copy circuit; and
the serial clock connected to the clock input of the ROM chip

Initially, only the system clock will be ticking. The other two are disconnected by the tri-state buffers in the 74HCT125 chip, and pulled low with pull-down resistors (see 'Problems' below for why it wasn't a great idea to use pull-down resistors for the internal clock here).

Since we've connected the Q line from the JK flip-flop to the enable input of the two buffers, as soon as the flipflop goes high (one clock cycle after the reset line goes high) we will connect both the internal and serial clocks to the system one, and they will all start ticking together.

The reason that we separate the serial clock from the internal one like this is so that we can leave the serial clock disconnected from the internal clock when we're done. This will allow other components in the CPU to use the ROM chip (for example, if we want to support flashing the ROM while the system is running, which is perfectly feasible). If we left the serial and internal clocks connected then anyone driving the serial clock would also be clocking this ROM copy circuit while the system is running, which we definitely don't want.

Bit counter

Reset circuit

Note: All the counter chips in this circuit are placed upside-down (pin 1 is at the top right, not the bottom left as usual). This made the routing much simpler, since the outputs always flow down.

This counter is responsible for counting from 0-7 repeatedly while the circuit is running. Unfortunately, I didn't have any 3-bit counters, but I did have lots of 4-bit 74HCT161 counters. This doesn't really matter, since we can just use the bottom three bits to cycle from 0-7. Before looking at the detail of how this counter is set up, however, it's useful to think about what we're going to use it for:

we need to know when the ROM is going to read the 7th and 8th bits on the MOSI line on the next clock tick, so we can set it high accordingly;
we need to know when we're going to wrap around to zero on the next clock cycle, so we can enable the state counter to advance at the same time;
we need to know when the shift register has clocked in all 8 bits, so we can set the RAM's WE line low to store the byte we just clocked out of the ROM; and
we need to know when it's safe to advance the RAM address counters by one, after we've safely set WE high again, so we don't change the address lines while we're writing.

The first three of these suggest that we actually want the bit counter to count 1 cycle earlier than the serial clock, since all of them deal with needing to know what's going to happen on the next cycle. Fortunately, this is exactly the behaviour that we get if we feed the serial clock directly into the counter:

Counter

The trace above is from a logic analyser tracing the main system clock (Clk), serial clk (SerClk), and CS lines in our circuit. As we can see, because the CS line goes low on the trailing edge of the clock cycle, the next rising edge on SerClk will clock in bit 0 to the ROM, and at the same time increment the counter to 1. On the following rising edge, we'll clock in bit 1, and increment the counter to 2, etc.

In reality, however, this isn't exactly what I ended up doing here. It turns out that it's really useful later on to have the counter count only half a cycle earlier, not a full one (see 'Memory Signals' below for why). To do this, we can simply invert the clock signal before passing it into the counter:

Counter

Here, the green line is the inverted clock that's passed to the bit counter. Now, rather than going high immediately at the start of the first cycle, it goes high half-way through each cycle, giving us the half-cycle delay we want. This means that the counter will be wrong for the first half of each cycle, but that still gives us plenty of time to settle down before the next cycle at the clock rates we'll be using.

Returning to our list of the counter's uses above, we can handle the second one (knowing when we're about to wrap around to zero) quite naturally as a consequence of this being a 4-bit counter, not a 3-bit one. We want to advance the state counter by one every time the bit counter rolls around to zero. Since we're counting half a cycle in advance of the clock, we can simply tie the top bit of the bit counter (bit 3, the Q3 output) to the enable line on the state counter: when we reach the second half of the cycle for bit 7 the counter will advance to 8 (1000), enabling the state counter, which will advance on the next clock cycle. We do the same thing for the address counters, too.

Of course, now that we're relying on Q3 going high every 8 cycles, we can no longer just let the counter count to 15 and wrap to zero (if we did, Q3 would only go high every second byte, which would be a problem). Instead, we want to reset the counter after 8 cycles. Fortunately, the 74LS161 lets us set its value synchronously by pulling the SPE input low: the counter then loads the values on the P0-P3 inputs on the next clock cycle. We simply tie SPE to the inverse of the Q3 output.

Since we're using the bottom 3 bits to indicate the current bit, we need the value following 1000 to be 0001, so we get the following counting cycle:

0000 - 0
0001 - 1
0010 - 2
0011 - 3
0100 - 4
0101 - 5
0110 - 6
0111 - 7
1000 - 0
0001 - 1
...
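The counting cycle above can be checked with a short Python model of the counter. It's a sketch that ignores the half-cycle clock inversion and only tracks the value after each clock edge; the function name is mine, and it assumes P3..P0 are wired to 0001 and SPE (active low) to the inverse of Q3, as described.

```python
def bit_counter(ticks: int) -> list[int]:
    """Values of the 4-bit '161 bit counter after reset and each clock edge.

    SPE (active low) is tied to NOT Q3, so whenever Q3 is high the next
    edge loads P3..P0 = 0001 instead of counting.
    """
    value = 0b0000  # state after the power-on reset
    values = [value]
    for _ in range(ticks):
        if value & 0b1000:          # Q3 high -> SPE low -> synchronous load
            value = 0b0001
        else:
            value = (value + 1) & 0b1111
        values.append(value)
    return values


seq = bit_counter(10)
print([format(v, "04b") for v in seq])
print([v & 0b111 for v in seq])   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2]
```

The bottom three bits give the bit index, and Q3 is high for exactly one clock in every eight, which is the once-per-byte enable the state and address counters use.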

State counter and demux

State circuit

The state counter keeps track of which of the 5 states (command, addr0, addr1, data or finished) we're in. This has the usual green reset line to initialise it to zero on startup.

The 74LS161 ❶ counter has two enable inputs, both of which must be high for the counter to advance. We have two conditions that must be met for the state to advance:

bit 3 (Q3) of the bit counter output must be high, indicating we're on the last bit of the current byte; and
if we're in the data state, we don't advance at all until the address counter reaches its final value, 2¹⁶-1.

We can get this effect by tying one enable line (PE) to Q3 of the bit counter, and the other to a line that's high in every state except data, and in the data state only once the address bits are all 1 (see below).

Since we need to know the current state value in a few places, and using the binary encoding of the count is unwieldy, we use a 74LS138 3-8 line demux ❷ to get 8 individual lines that go low on the corresponding counter value. We take values from this in several places, but of most interest here is the Q3 output, which is low whenever the state is data: we feed it to the 74LS32 quad-OR chip ❸ next to the state counter. The other input to this gate comes from the address counter, and will be high iff all 16 address bits are high (see "Address counters" below for how we work this out). This gives us the second enable input: it will be high whenever the data line is high (i.e. in any state except data), or when we're on the last byte.
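The two enable conditions boil down to a small piece of combinational logic. Here it is as a Python function, as a sketch only: the function and argument names are mine, and it assumes the data state is counter value 3 and that the '138 output for it is active low, as described above.

```python
DATA = 3  # state counter value for the data state


def state_counter_advances(bit_q3: bool, state: int, address_all_ones: bool) -> bool:
    """Whether the state counter advances on the next clock edge."""
    data_n = state != DATA                  # '138 output for 'data' is active low
    enable_2 = data_n or address_all_ones   # 74LS32 OR gate
    enable_1 = bit_q3                       # top bit of the bit counter
    return enable_1 and enable_2            # the '161 needs both enables high


# Outside the data state, every byte boundary advances the state...
print(state_counter_advances(True, 0, False))     # True
# ...but in the data state we wait for the last address.
print(state_counter_advances(True, DATA, False))  # False
print(state_counter_advances(True, DATA, True))   # True
```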

The final part of note here is the connection to the JK flipflop ❹, which is how we terminate the copying process. The J and K inputs to the flipflop are tied to the finished state line and its inverse respectively. On startup, the state counter will be initialised to zero, so the demux will have command low, and all other outputs high. This gives us J=1, K=0 on startup. In a JK flipflop this combo will set the value to 1 on the first clock cycle after the reset line goes high. J and K will keep these values until the state reaches finished, at which point they swap and we have J=0, K=1. This will clear the flipflop on the next cycle, which sets the control inputs to the buffer chip high, detaching the circuit from the clock and the ROM input lines, effectively terminating the copy. Since the yellow circuit clock isn't ticking any more, the state will remain at finished until power off, or a manual reset.

The inverted finished state value is also what we can use as the reset line for the rest of the CPU: it will start off low on powerup, and go high only after all the ROM copying is complete.

Sending the READ command

State circuit

At this point, we have everything that we need to send the READ command to the ROM chip. As we saw in the excerpt from the datasheet above, we do this by sending 3 bytes: 3 (indicating the READ command), then the high address byte, then the low address byte. We're fortunate that we only ever want to read the entire ROM chip, so our address is always 0. That lets us greatly simplify generating the MOSI line: it's always low, apart from when we're sending bits 6 and 7 of the first byte.

We could get this behaviour by ANDing the inverse of the command line from the state demux (which would be high only when we're sending the first byte) with bits 1 and 2 of the bit counter output (which would be high only when the bit counter value was 110 or 111). This requires an inverter and two AND gates, however, and there's a simpler way. Recall that we said above that we wanted to leave all the inputs to the ROM chip disconnected once the power-on copy circuit had completed. If we tie the command state line ❶ to the enable input of one of the spare buffers on the 74HCT125 chip, pass BIT1 AND BIT2 ❷ to the data input of the buffer, and connect the output of the buffer to the MOSI line of the ROM chip ❸, then we'll be driving the MOSI line high only when command is low and bits 1 and 2 are both high, which is exactly what we want. The rest of the time (during the addr0, addr1, data and finished states) the MOSI line is left disconnected, and pulled to ground by a 1kΩ resistor.
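To double-check that this buffer trick really produces 3, 0, 0 on the wire, here's a Python sketch of the MOSI level for each bit of each header byte. It deliberately ignores the half-cycle offset between the bit counter and the serial clock; `bit_index` is the bit counter value while that bit is being sent (0 for the first, most significant bit), and the function names are my own.

```python
COMMAND = 0  # state counter value for the command state


def mosi_level(state: int, bit_index: int) -> int:
    """Level on MOSI while sending bit `bit_index` (0 = first/MSB) of a byte."""
    if state == COMMAND:
        # Buffer enabled: it passes BIT1 AND BIT2 of the bit counter.
        return (bit_index >> 1) & (bit_index >> 2) & 1
    # Buffer disabled: the 1kΩ pull-down holds MOSI low.
    return 0


def byte_on_mosi(state: int) -> int:
    """Assemble the byte the ROM sees, MSB first, for a given state."""
    value = 0
    for bit_index in range(8):
        value = (value << 1) | mosi_level(state, bit_index)
    return value


print([byte_on_mosi(s) for s in (0, 1, 2)])  # [3, 0, 0]
```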

Putting all this together, the following trace shows the behaviour we want. The A1 and A2 markers show the point at which the ROM will snap the value of bits 6 and 7. The numbers in blue above the MOSI line show the logic analyser's decoding of the SPI protocol values, confirming that we send 3 as the command, then 0 for both address bytes:

Read command trace

ROM and shift register

State circuit

Now that we have everything in place to send the read command to the ROM ❶, we need to look at what we do with the data that comes out. Our RAM chip has an 8-bit parallel data interface, and the serial ROM has a 1-bit serial output. To bridge these two we use a 74HCT4094 shift register ❷.

First, the inputs to the ROM. All three inputs are connected through the 74HCT125 buffer chip, so that they can be disconnected once copying is finished, leaving the inputs in a high-impedance state so other parts of the system can drive the ROM:

CS is controlled by the output of the JK flip-flop. When the flip-flop output is low CS will be tied to ground, and the ROM is active. When the flip-flop is high (during the initial startup delay, or after copying finishes) then the 4kΩ pullup resistor keeps the chip deactivated.
the MOSI line (the control input to the ROM chip) is fed from BIT1 AND BIT2 of the bit counter, through the buffer enabled by the command state (see 'Sending the READ command' above)
the clock input is connected to the system clock

The 25LC512 has two other inputs: a hold input, which temporarily pauses the current command so another chip sharing the same SPI bus can use it, and a wp write-protection line. We won't be using the hold feature, so we tie that line to 5V. We tie wp low, so that we don't accidentally write to the chip if we make a mistake in the input signal.

The ROM chip only has one output (the MISO line). We tie this directly to the serial input of the shift register. Note that this output is in a high-impedance state whenever CS is high, so we should really have a pull-down or pull-up resistor here to stop the line floating. I didn't fit one, and it didn't affect the working of the circuit.

The spec says that the ROM chip will start clocking out the data 1 bit at a time, starting with the first falling edge after the end of the address, and it will continue to clock sequential bits out as long as the clock runs. This means that we can basically ignore the ROM chip after this, and the shift register will always give us a view of the most recent 8 bits that we clocked out of the ROM chip. We just need to make sure that we write the output of the shift register to RAM on byte boundaries, which we'll handle below.
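The "most recent 8 bits" view of the shift register is easy to model in Python. This sketch assumes the ROM sends each byte most significant bit first (as the 25LC512 does) and that the register is wired so the first bit clocked in ends up as the top bit of the byte presented to RAM; the function names are mine.

```python
def shift_in(register: int, miso_bit: int) -> int:
    """One clock of an 8-bit serial-in shift register."""
    return ((register << 1) | miso_bit) & 0xFF


def bytes_from_miso(bits):
    """Reassemble bytes from an MSB-first bit stream, one per 8 clocks."""
    register = 0
    out = []
    for count, bit in enumerate(bits, start=1):
        register = shift_in(register, bit)
        if count % 8 == 0:  # byte boundary: the point where WE should pulse
            out.append(register)
    return out


rom = [0xA5, 0x3C, 0xFF]
stream = [(byte >> (7 - i)) & 1 for byte in rom for i in range(8)]
print([hex(b) for b in bytes_from_miso(stream)])  # ['0xa5', '0x3c', '0xff']
```

Between byte boundaries the register holds a mix of two bytes, which is why the write to RAM has to be timed so carefully.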

The 74HCT4094 has one other nice feature: we can disable the 8 parallel outputs, putting them into a high-impedance state, using the OE input. In a simple setup in which one ROM chip is connected to one RAM chip this lets us avoid the need for an extra buffer chip to release the data lines once copying has finished and leave them free for the rest of the system to use. In this simple setup I've connected this to the inverse of the data state line ❸, so that the register outputs only during the data phase of the operation.

Address counters

Address circuit

The address counter setup is pretty simple. All these have to do is to count from 0 to 2¹⁶-1, incrementing once every 8 bits during the data phase of the copy.

Since these are 4-bit counters, and we need a 16-bit counter, we need to chain them together. The TC pin ❶ on the 74HCT161 is designed to allow easy chaining of multiple counters. This pin goes high only when the counter value is 15 (i.e. 1111). By connecting this pin on counter 0 to the TE enable pin ❷ of counter 1, TC on counter 1 to TE on counter 2, and so on, each successive counter will count up by one only when the previous counter is at its maximum value. Counter 1 will only count once every 16 cycles of counter 0, counter 2 once every 16 cycles of counter 1, and so on.

We control the counting behaviour of counter 0 by connecting its PE enable input to bit 3 of the bit counter, which is high once every 8 bits, though since we only want to count during the data phase, we first AND bit 3 with the inverse of the data output from the state demux ❸.

Note that I unnecessarily tied all the counters' PE inputs together. In theory I could have tied all but counter 0's PE inputs high, since they would still be controlled by their TE inputs, but it's harmless (just extra unnecessary wires to cut).

The only remaining point of interest here is how we stop the counting. Recall from the 'State counter' section above that we pause the state counter when we're in the data state until we reach address 2¹⁶-1. We construct the signal that signifies reaching that address by feeding all 4 counters' TC pins into a 74LS21 4-input AND gate ❹. Since the TC output is high only when the counter is at value 1111 this gives us a line that's high only when the counter is at 1111 1111 1111 1111.
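Here's a Python sketch of the four chained counters, useful for convincing yourself that the TC chain behaves like one 16-bit binary counter and that the 4-input AND is high only at 0xFFFF. Assumptions, all mine: counter 0's TE (CET) is tied high, every counter's PE (CEP) gets the shared "bit 3 AND data" enable, and, as on the '161 datasheets, each TC output is also qualified by that chip's own TE input. The function names are illustrative.

```python
def terminal_counts(values):
    """TC output of each chained 4-bit counter (index 0 = least significant).

    Each TC is high when that counter is at 15 and its TE input is high;
    TE of counter 0 is assumed tied high, TE of counter n is TC of n-1.
    """
    te = True
    tcs = []
    for v in values:
        tc = te and v == 15
        tcs.append(tc)
        te = tc
    return tcs


def tick(values, enable):
    """One clock edge; `enable` is the shared PE signal (bit 3 AND data)."""
    tes = [True] + terminal_counts(values)[:-1]
    return [(v + 1) % 16 if enable and te else v for v, te in zip(values, tes)]


def as_address(values):
    return sum(v << (4 * i) for i, v in enumerate(values))


counters = [0, 0, 0, 0]
for _ in range(0xFFFF):
    counters = tick(counters, enable=True)
print(hex(as_address(counters)), all(terminal_counts(counters)))  # 0xffff True
```

With TC qualified by TE, the top counter's TC alone would already imply the others, but ANDing all four gives the same answer.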

Writing to RAM

The final functional part of the circuit handles connecting the address lines to RAM, and controlling the RAM's WE line to write data.

Connecting the address lines up is simple. We take each of the Q outputs from the counters and wire them to a pair of 74LS245 line drivers, so that we can disconnect them when the copy is finished, and leave the address lines free for the rest of the computer to use. The '245 chip is somewhat overkill, as it is bidirectional and we're only ever going to use it in one direction, but I had lots of them around, and the pin layout is very convenient. We tie the DIR pin to ground, which connects the B pins on the north side of the chip to the A pins on the south.

The outputs of the line drivers are connected to the RAM chip's address inputs (note that the order that we connect the address pins really doesn't matter, as long as we're consistent). In this case I'm using an AS6C6264 8KB RAM in the test I photographed, so we only use 13 address lines. This means that each memory location will be written 8 times, and the RAM will end up containing the top 8KB of the ROM. In the final circuit I intend to take the top 3 address bits, feed them into a 74LS138 3-8 demux, with the outputs wired to the CE inputs of 8 different RAM chips, so that each in turn gets 8KB of ROM.
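The address split is just a shift and a mask. The Python sketch below shows both the planned 8-chip layout (top 3 bits choose the chip, low 13 bits the location) and why the single 8KB test chip ends up holding the top 8KB of the ROM. The function names and the test pattern are my own.

```python
RAM_BYTES = 8 * 1024  # AS6C6264: 8KB, 13 address lines


def ram_location(address: int) -> tuple[int, int]:
    """(chip, offset) for a ROM address in the planned 8-chip layout."""
    chip = address >> 13          # top 3 bits -> 74LS138 -> one CE line
    offset = address & 0x1FFF     # low 13 bits -> the RAM's A0-A12
    return chip, offset


def single_ram_after_copy(rom: bytes) -> bytearray:
    """Contents of one 8KB RAM that only sees the low 13 address lines."""
    ram = bytearray(RAM_BYTES)
    for address, value in enumerate(rom):
        ram[address & 0x1FFF] = value  # later writes overwrite earlier ones
    return ram


rom = bytes(i % 251 for i in range(2 ** 16))       # arbitrary test pattern
print(ram_location(0xE001))                         # (7, 1)
print(single_ram_after_copy(rom) == rom[0xE000:])   # True
```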

The data inputs of the RAM can be directly connected up to the QP outputs of the shift register, since that has tri-state outputs. I haven't shown that connection here as I used ribbon wires that obscured the chips beneath.

The final control for the RAM is the WE line. This must go low when the shift register contains the 8 bits we want to write, and the address lines contain the correct address. It's important to note that this RAM (and most SRAM these days) is not synchronous. There is no clock line to control when the inputs are snapped. In particular, we have to make sure that the address lines stay stable for the entire time that WE is low. If they aren't, and change even slightly while it's low, we could end up writing to the wrong address. It's less important that the data line is stable for the entire time, so long as it is stable for at least 25ns before WE goes high again (see the spec linked above for these TDW and TDH timing values).

we timing

The diagram above shows the period during which it's safe for WE to be low. Before A1 we won't have snapped the last bit into the shift register, and at A2 the data will change to include the first bit of the next byte, and no longer be correct.

My first implementations used a non-inverted count clock, with the bit counter advancing exactly one cycle ahead of the data. In theory, with this I could simply tie WE to the inverse of the bit counter's bit3 output, since bit3 is high exactly when we're in the period A1-A2 in the diagram above. The problem was that I ran into timing issues. In theory, bit3 goes low exactly at A2, at the same time as the data lines change to reflect the new bit, but in practice there's propagation delay to worry about, and the delay of the inverter on the bit3 line was sometimes just enough that WE stayed low after the data had changed. I tried ANDing the bit3 value with the clock, so that WE stayed low only during the first half of the cycle, but this ran into the same problem: at the end of the interval the clock line would go high a few nanoseconds before bit3 went low, causing WE to blip low again just after the data had changed:

we timing

I ended up settling on the approach I described above, where the bit count advances only half a cycle early, and used bit0 & bit1 & bit2 & clk as the input to the WE pin. With this, I get the desired effect, with writes happening only during the first half of the safe window, and no blips at the end:

we timing

And, with this in place, copying works! We can hook up an Arduino to the address and data lines of the RAM and read the contents once the copying circuit has reached the finished state and disconnected itself:

we timing

Progress bar

Progress bar circuit

One final part which has no functional value, but which adds the all-important blinkenlights, is the progress bar display which shows how far through the copy we are. For this we use another 74HCT4094 shift register ❶ that's initialised to zero, and which shifts in the value 1 every 8KB.

Advancing the register every 8KB is quite simple. We need a line that will go high exactly once every 8192 bytes, which we can tie to the clock input of the register. Counter 3 has almost exactly what we want in output Q1 ❷, which toggles every 4KB: it's low for 4KB, then high for 4KB, giving us one low-high transition every 8KB. This is almost perfect. The only minor problem is that the count is offset by 4KB: the low-high transitions happen after 4KB, 12KB, 20KB, etc., instead of 8KB, 16KB, 24KB etc. In practice, this doesn't really matter, and isn't noticeable.

Unfortunately, initialising the shift register to zero isn't quite as trivial as it sounds. There is no reset pin on a 74HCT4094, so we don't have any way to directly set it to zero. The only way to do this is to shift in 0 at least 8 times. To do this, we make use of a 74LS157 quad 2-1 multiplexer ❸. By tying the green reset line to the select input of the mux we can choose between the input value 0 during reset and 1 while the circuit is running, and connect the output to the D pin of the register. We use a second unit of the mux to choose between the white system clock line during reset (to ensure that we clock the 0s in as fast as possible, to clear the register) and the Q1 output from counter 3 mentioned above, and connect that output to the CP pin.
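A quick Python model shows both halves of the progress bar's behaviour: clearing arbitrary power-up contents by shifting in zeros, then shifting in a 1 on each rising edge of the clock output. It's a sketch under my own assumptions: I model the counter 3 output as address bit 12 (the bit that is low for 4KB, then high for 4KB), and the power-up value and function name are made up for illustration.

```python
def progress_bar(total_bytes: int = 2 ** 16, power_up_contents: int = 0b10110110):
    """Return (address, register) each time the progress register shifts in a 1."""
    register = power_up_contents & 0xFF  # the 4094 powers up in an unknown state
    # During reset the '157 selects D = 0 and the fast system clock: 8 zeros.
    for _ in range(8):
        register = (register << 1) & 0xFF
    updates = []
    previous = 0
    for address in range(total_bytes):
        bit12 = (address >> 12) & 1
        if bit12 and not previous:          # rising edge on the register's clock
            register = ((register << 1) | 1) & 0xFF
            updates.append((address, register))
        previous = bit12
    return updates


for address, register in progress_bar():
    print(f"{address // 1024:2d}KB {register:08b}")
```

The first LED lights at 4KB and the eighth at 60KB, which is the 4KB offset described above.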

Performance

The fastest clock speed I've been able to run this circuit at on breadboards is about 425kHz. I suspect I could significantly improve on that if I ditched the rubbish Elegoo breadboards I started with and replaced them with the BB830 ones, which have given me far less trouble, but for testing 400kHz is fine. At that speed copying 64KB takes almost exactly 1.25 seconds. With a 4MHz system clock the copy would take just under 150ms, which is perfectly acceptable during startup. Even if I have to step down the system clock to well below 4MHz (and I almost certainly will) we can still run this copy circuit at full speed.
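The copy time is just the number of SPI clock cycles divided by the clock rate. This Python sketch counts 24 cycles for the READ command and address plus 8 per byte of data, and ignores the 555 start-up delay and the one-cycle flip-flop sync, so treat its numbers as approximate; the function name is mine.

```python
HEADER_CYCLES = 3 * 8          # READ command + two address bytes
DATA_CYCLES = 2 ** 16 * 8      # 64KB, one SPI clock per bit


def copy_time_seconds(clock_hz: float) -> float:
    """Time spent with CS low, ignoring the 555 delay and flip-flop sync."""
    return (HEADER_CYCLES + DATA_CYCLES) / clock_hz


for hz in (400e3, 425e3, 4e6):
    print(f"{hz / 1e3:6.0f}kHz: {copy_time_seconds(hz) * 1e3:7.1f}ms")
```

This gives about 1.23s at exactly 425kHz, in the same ballpark as the measured figure, and about 131ms at 4MHz.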

Problems

Mostly, this circuit just worked once I'd sorted out the bit counting. The main problems I still have are:

I started out using the same Elegoo breadboards I used in my ALU, and I hadn't realised just how bad they were until I finally got my hands on some BB830 boards, and suddenly all my random power problems and bad connections went away. I should have replaced the three Elegoo ones used here, but I was lazy, and as a result I still see transient power issues, especially when running above 400kHz.
I used pull-down resistors on the internal clock line. This was a poor choice, as TTL chips already have pull-up resistors on their inputs, and with 8 TTL chips connected to the clock line the pull-up resistors are strong enough to drag the resting voltage of the clock line to nearly 0.6V, which is dangerously close to the 0.8V threshold for TTL inputs to be considered low. I had problems with transient voltage spikes happening at the time that the reset line went high and every chip in the circuit came out of reset, which caused phantom clock ticks (you can see these in the logic analyser traces in the 'Bit counter' section). Since these didn't happen while CS was low they were ignored, but this still isn't ideal.
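The pull-down problem can be put into numbers with Ohm's law: every 74LS input held low pushes current out through the shared resistor, and that current times the resistance is the resting voltage. This Python sketch uses 0.4mA per input, the usual worst-case low-level input current for one standard 74LS load; some inputs (clock inputs on some flip-flops, for example) are specified higher, so check each datasheet and treat the result as an optimistic upper bound on the resistor. The function names are mine.

```python
I_IL_LS = 0.4e-3  # worst-case current out of one standard 74LS input held low, amps
V_IL_MAX = 0.8    # highest input voltage TTL guarantees to read as low, volts


def resting_voltage(n_inputs: int, pulldown_ohms: float) -> float:
    """Worst-case voltage across a pull-down shared by n 74LS inputs."""
    return n_inputs * I_IL_LS * pulldown_ohms


def largest_safe_pulldown(n_inputs: int, v_limit: float = V_IL_MAX) -> float:
    """Largest pull-down that keeps the line at or below v_limit."""
    return v_limit / (n_inputs * I_IL_LS)


print(round(largest_safe_pulldown(8)))  # 250
```

With 8 inputs on the line, even a few hundred ohms is enough to eat the whole noise margin in the worst case.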