Since I have been playing about with bitcoin mining for the last few weeks, it has given me the opportunity to look at some of the publicly available source code.
We decided to take a gander at the communication channels used to report the results back from the FPGAs to the master computers.
On the whole the code is mostly abysmal… seemingly written by people who have absolutely NO comprehension of how real hardware actually works… or in some cases even basic binary maths.
Take for example the Serial communication:
Lots of fixed constants cause the code to fall apart as soon as you try to increase the clock rate of the design.
Why? Because the counters and constants in the design are dictated by the clock rate, so as you increase the clock rate you also need to increase the length of the counter chains in this code.
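As a sketch of the problem (the constant names and values here are illustrative, NOT taken from the actual miner source): a UART's bit-period divisor, and the width of the counter that holds it, are both functions of the master clock, so hard-coding either one ties the whole design to a single frequency:

```python
# Sketch: why a hard-coded UART bit-period constant breaks when the
# master clock changes. Names and values are illustrative, not from
# the real miner.vhd.

BAUD = 115200

def bit_period_cycles(clk_hz: int) -> int:
    """Number of master-clock cycles per UART bit interval."""
    return round(clk_hz / BAUD)

# A design hard-coded for a 100MHz master clock:
HARDCODED_DIVISOR = bit_period_cycles(100_000_000)    # 868 cycles per bit

# Raise the master clock to 200MHz without touching the constant and
# the UART now runs at roughly twice the intended baud rate:
effective_baud = 200_000_000 / HARDCODED_DIVISOR
print(round(effective_baud))    # ~230415 baud: garbage on a 115200 link

# The counter width must also grow with the divisor:
print(HARDCODED_DIVISOR.bit_length())                  # 10 bits at 100MHz
print(bit_period_cycles(400_000_000).bit_length())     # 12 bits at 400MHz
```

The same dependency applies to every other clock-derived constant in the design, which is exactly why the counter chains need lengthening whenever the clock goes up.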
Hahar… you say..
I can vary the clock rate for the main design but then keep the clock rate for the UART at a fixed rate.
You can't, because of this low-quality code in miner.vhd:
hit <= '1' when outerhash(255 downto 224) = x"00000000" and step = "000000" else '0';
As the SHA256(SHA256(x)) code does its magic, a new hash result is calculated from the base nonce on EVERY edge of the master clock.
Therefore any valid result has to be recovered in a SINGLE clock cycle of the master clock before the result is lost.
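For reference, the hit test in that line is simply "are the top 32 bits of the double hash all zero". A minimal Python model of it (byte order simplified; this ignores bitcoin's little-endian display conventions):

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """SHA256(SHA256(x)), one result per nonce in the miner."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def hit(outerhash: bytes) -> bool:
    """Model of: hit <= '1' when outerhash(255 downto 224) = x"00000000",
    i.e. the top 32 bits of the 256-bit result are all zero."""
    return outerhash[:4] == b"\x00\x00\x00\x00"

print(hit(b"\x00" * 4 + b"\xff" * 28))    # True: top 32 bits are zero
print(hit(double_sha256(b"hello")))       # False for a random-looking hash
```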
Now if the UART is running on another clock, it may very well miss the value that has been transferred to the TX logic; at the very least you would need to synchronise the two clock domains.
So unless we want to totally re-write the VHDL, we are stuck with the UART code running on the master clock.
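As a toy illustration of why a single-cycle result can be missed across clock domains (all numbers here are made up for the sketch; real CDC analysis needs proper synchronisers, not this): a strobe that is high for exactly one fast-clock cycle can fall entirely between two sampling edges of a slower, unrelated clock:

```python
# Toy model: a 'hit' strobe high for exactly one 300MHz cycle, sampled
# by logic on an unrelated 100MHz clock. All numbers are illustrative.

fast_period_ns = 1e9 / 300e6                # ~3.33ns strobe width
slow_period_ns = 1e9 / 100e6                # 10ns between samples

strobe_start = 50.0                         # strobe asserted at t = 50ns...
strobe_end = strobe_start + fast_period_ns  # ...for one fast cycle only

# Slow-clock sampling instants, with an arbitrary 4ns phase offset
samples = [n * slow_period_ns + 4.0 for n in range(20)]

seen = any(strobe_start <= t < strobe_end for t in samples)
print(seen)    # False: the strobe fell entirely between two slow edges
```

With a different phase offset the strobe would be caught, which is exactly the problem: whether the hit survives depends on the phase relationship of two unrelated clocks.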
The next problem
The actual UART code is written so badly that even if it were the ONLY core in the FPGA, it would max out if the master clock were increased past 140MHz.
However, with a SINGLE change, its maximum operating frequency can be taken up to about 294.898MHz.
yep... that is correct, a SINGLE change to the VHDL code can more than DOUBLE the frequency that the UART can operate at. That is before we even start to take a serious look at the code construction.
To buffer or not to buffer, that is the question
We now come to the issue of buffering (via a FIFO) the results of hash searches and nonce hits.
Since the generation of hashes is a continual process, there is no guarantee about when, or how many, valid hashes you will get when checking the full nonce range from 0x00000000 to 0xffffffff.
Furthermore, there is no guarantee that two valid nonces will not appear within a few clock cycles of each other (yes, there are mounds of research saying that changing a single bit radically changes the whole hash value, but that doesn't account for the fact that bitcoin is looking for any hash whose top end is populated with zeros).
So, a question arises:
How can a UART core operating at a baud rate of 115,200 possibly service a nonce generator running at 120MHz?
Well, only on the assumption that multiple nonces do NOT turn up within the 'send' window used for notifying a successful nonce back to the master controller.
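Some rough arithmetic makes that send window concrete (assuming 8N1 framing and an 8-byte report per hit; the miner's actual report format may differ):

```python
# Rough arithmetic: the UART 'send window' versus the nonce generator.
# Assumes 8N1 framing (10 baud intervals per byte) and an 8-byte report
# per hit; the miner's real report format may differ.

BAUD = 115200
BITS_PER_BYTE = 10          # start bit + 8 data bits + stop bit
REPORT_BYTES = 8

send_window_s = REPORT_BYTES * BITS_PER_BYTE / BAUD
print(f"{send_window_s * 1e6:.0f}us per report")           # 694us

HASH_RATE = 120e6           # 120MH/s nonce generator
nonces_tested_during_send = send_window_s * HASH_RATE
print(f"{nonces_tested_during_send:.0f} nonces tested meanwhile")   # 83333
```

So every single report blinds the system to tens of thousands of nonce tests; a second hit landing inside that window has nowhere to go without a buffer.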
From this we can ascertain that there is probably a need for a FIFO between the SHA256(SHA256(x)) engine and the UART. (Yes, yes… I know some people think it is very rare for nonces to be close together, but I'm afraid I have seen them regularly within 1ms of each other, and that gap only gets smaller as the SHA256(SHA256(x)) system speed increases.)
So we implemented a FIFO
We had to provide our own FIFO code, because the shitty Xilinx core generation tool only allows a FIFO with a minimum depth of 512 entries. (Yep, like 512 entries * 64 bits is going to help with the routing issues we already have!!)
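A behavioural sketch of such a FIFO (a drop-when-full ring buffer; the depth of 16 and the interface are illustrative, the real implementation is of course VHDL):

```python
# Behavioural model of a shallow hardware-style FIFO, standing in for
# the custom VHDL FIFO described above. Depth and interface are
# illustrative only.

class NonceFifo:
    """Drop-when-full ring buffer for successful nonces."""

    def __init__(self, depth: int = 16):
        self.buf = [None] * depth
        self.depth = depth
        self.count = 0
        self.wr = 0       # write pointer
        self.rd = 0       # read pointer

    def push(self, nonce: int) -> bool:
        """Store a hit; returns False (hit lost) if the FIFO is full."""
        if self.count == self.depth:
            return False
        self.buf[self.wr] = nonce
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def pop(self):
        """Hand the oldest stored nonce to the UART, or None if empty."""
        if self.count == 0:
            return None
        nonce = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return nonce

# Two hits arriving back-to-back while the UART is busy are both kept:
fifo = NonceFifo(depth=16)
fifo.push(0x12345678)
fifo.push(0x12345679)    # second hit lands during the first send
print(hex(fifo.pop()), hex(fifo.pop()))    # 0x12345678 0x12345679
```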
The results were actually quite interesting: There was a noticeable INCREASE in stability of the whole system, as well as an increase of successful nonces discovered.
diff 0x723b07 (7486215D)
Above we can see two nonces detected and reported correctly within 32ms of each other. Yep, we can say that this is STILL well within the capability of the UART to report without a FIFO, but the difference is that a nonce can now be detected at any time, even while the UART is busy sending data.
This code is difficult to test and almost impossible to simulate, because you need to be able to generate viable hashes that are very close together, and for that you would need to know the base hash (x) PRIOR to the SHA256(SHA256(x)) that produces the final results.
(if we knew that we could "mine" bit-coins without actually doing any work and we would be very rich).
Nevertheless even with these difficulties, we managed to capture the following:
Here we see two nonces within 1ms of each other, namely:
A quick subtraction of these two nonces tells us that they are:
0x1fa76b apart (2074475D), or 2074475 clock cycles…
Hang on a minute
2074475 * 5ns (at 200MH/s) = 10372375ns, which is roughly 10.37ms, yet the timing above says they arrived 1ms apart.
And from this we can see our FIFO is actually saving data, because something generated 10.37ms apart is being compressed into 1ms (the FPGA was sending a nonce to the controlling system, but during the send a second nonce was discovered, preserved in the FIFO, and then TAGGED onto the end of the send straight AFTER the first).
If indeed the two nonces had been generated closer together, rather than the UART being interrupted, then the subtracted result above would be SMALLER.
But the nonces would still be received by the controlling system within 1ms of each other (according to the Python reporting, but that's what you get for using a gash interpreted language to code a front end to a hardware design).
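The arithmetic above can be checked directly (using the 5ns period of the 200MH/s clock quoted elsewhere in this post):

```python
# Checking the nonce-spacing arithmetic: two nonces 0x1fa76b apart,
# one nonce tested per 5ns clock cycle (200MH/s, per the post).

diff = 0x1fa76b
assert diff == 2074475          # hex to decimal, as stated above

period_ns = 5                   # 1 / 200MHz
gap_ns = diff * period_ns
assert gap_ns == 10372375       # matches the figure in the post
print(gap_ns / 1e6, "ms")       # 10.372375 ms between generation
```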
And finally a "zero time nonce":
Diff 0x19fa93 (1702547D)
Now that we have a working FIFO we can FINALLY split the shitty UART code out from the master CLK domain.
Why? Well, now we can start to look at running the SHA256(SHA256(x)) engine at a higher CLK rate without having to worry about resources that will not route correctly because they fail to meet the timing requirements.
This gives the design the chance to compile at a far higher operating frequency; looking at the design, over 70% of the timing is lost to routing delays.
Cut the bullshit, anyone can say code is bad
OK, so how far have we improved the miner?
The current STABLE speed (200MH/s) is almost double that of the public-domain code.
And this is without even trying to work on the routing or adding constraints to the xilinx files.
Interestingly, a research project for NIST has already produced a working ASIC implementation of SHA3.
If you read the paper very carefully it states "It contains all the SHA-3 five finalists".. SO WHAT!!
Then it goes on to state: "and a reference SHA256"
Catch the link here:
SHA256 in ASIC
and something far more interesting related to die sizes and processes
ASIC Die sizes