RSS
 
 

Bitcoin Mining/publicly available VHDL source code and the Xilinx XUPV5

10 Dec

Since I have been playing about with bitcoin mining for the last few weeks, it has given me the opportunity to look at some of the publicly available source code.

We decided to take a gander at the communication channels used to communicate the results back from the FPGA’s to the master computers.
On the whole the code is mostly abysmal…., seemingly written by people that have absolutely NO comprehension about how real hardware actually works… or in some cases even basic binary maths.

Take for example the Serial communication:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/VHDL_Xilinx_Port/uart.vhd

Lots of fixed constants that cause the code to fall apart as soon as you try to increase the clock rate of the design.
Why? Because the counters and values in the design are dictated by the clock rate and as you increase the clock rate you need to increase the length of the counter chains in this code.

Hahar… you say..
I can vary the clock rate for the main design but then keep the clock rate for the UART at a fixed rate.

NOPE!!!!!!

Because of this low quality code in miner.vhd

hit <= '1' when outerhash(255 downto 224) = x"00000000" and step = "000000" else '0';

As the SHA256(SHA256(x)) code does its magic, a new hash result is calculated off the base nonce at the edge of EVERY clock cycle of the master clock.
Therefore any valid result has to be recovered in a SINGLE clock cycle of the master clock before the result is lost.
Now if the UART is running on another clock, it may very well miss the value that has been transferred to the TX logic, at the very least you would need to syncronise the clock edges.

So Unless we want to totally re-write the VHDL we are stuck with the UART code running on the master clock

The next problem
The actual UART code is written so badly that even if it were the ONLY core in the FPGA, it would max out if the master clock value was increased past 140Mhz

However with a SINGLE change, its maximum operating frequency can be taken up to about 294.898Mhz.
yep... that is correct, a SINGLE change to the VHDL code can more than DOUBLE the frequency that the UART can operate at. That is before we even start to take a serious look at the code construction.

To buffer or not to buffer that is the question
We now come to the issue of buffering(FIFO) the results of hash searches and nonce hits.
Since the generation of hashes is a continual process there is no guarantee about the quantity of collision hashes you get when checking the full range nonce from 0x00000000 to 0xffffffff.
Furthermore, there is no guarantee that two valid nonces will not appear within a few clock cycles of each other( Other than mounds of research that say changing a single bit radically changes the whole has value, but they don't take into account the fact that bit-coin is looking for a hash whose top-end is populated with zeros).
So, a question arises:
How can a UART core operating at a baud rate of 115,200 possibly service a nonce generator running at 120Mhz?

Well, only under the pretense of multiple nonces NOT turning up within the 'send' window used for notifying a successful nonce back to the master controller.

From this we can ascertain that there is probably a need for a FIFO to be inserted between the SHA256(SHA256(x)) engine and the UART (yes...yes.. I know some people think it is very rare for nonces to be close, but I'm afraid I have seen them regularly within 1ms of each other, plus that value only gets smaller as the SHA256(SHA256(x)) system speed increases)

So we implemented a FIFO
We had to provide our own FIFO code, because the shitty Xilinx core generation tool only allow a FIFO to have a minimum size of 512 places. (yep like 512 places * 64 bits is going to help with the routing issues we already have!!)
The results were actually quite interesting: There was a noticeable INCREASE in stability of the whole system, as well as an increase of successful nonces discovered.

0x1dc5323a
0x1e376d41
diff 0x723b07 (7486215D)

Above we can see two nonces detected and reported correctly within 32ms of each other, yep we can say that this is STILL well within the capability of the UART to report these without a FIFO, but the difference is that a NONCE can be detected at any time, even when the UART is currently sending data.

This code is difficult to test, almost impossible to simulate, because you need to be able to generate viable hashes that are very close together, and for that you need to know the base hash (x) PRIOR to the SHA256(SHA256(x)) that is going to produce the final results.
(if we knew that we could "mine" bit-coins without actually doing any work and we would be very rich).

Nevertheless even with these difficulties, we managed to capture the following:

Here we see two nonces within 1ms of each other namely:
0xc1f5cf25
0xc1d627ba
Diff 0x1fa76b(2074475d)
A quick subtraction of these two nonces, tell us that they are:
0x1fa76b apart (2074475) or 2074475 clock cycles....
Hang on a minute
2074475*5ns(at 2002MH/s) =10372375ns , which is 10.3ms but the timing above says they are 1ms apart.
and from this we can see our fifo Is actually saving data, because something 10.3ms apart is being compressed into 1ms apart(the FPGA was sending a nonce to the controlling system, but during the send a second nonce was discovered, preserved in the FIFO and then TAGGED onto the end of the send straight AFTER the first nonce).

If indeed the two nonces were generated closer together and not a potential interruption of the UART, then the subtracted result above would be SMALLER.
But the nonces would still be received by the controlling system within 1ms of each other(according to the python reporting, but that's what you get for using a gash interpreted programming language to code a front end to a hardware design).

And finally a "zero time nonce":

0x75c67f66
0x75e079f9
Diff 0x19fa93 (1702547D)

The Future
Now that we have a working FIFO we can FINALLY, split out the shitty UART code from the master CLK.
Why? Well now we can start to look at running the SHA256(SHA256(x)) at a higher CLK rate without having to worry about resources that will not route correctly due to them failing to meet the timing requirements.
This means the design has the chance to compile at a far higher operating frequency, looking at the design, over 70% of the timings are due to routing delays.

Cut the bullshit, anyone can say code is bad
OK, So how far have we improved the miner?:
The current STABLE speed (200MH/s) is almost double when compared to the Public domain code.

And this is without even trying to work on the routing or adding constraints to the xilinx files.

The Future
Interestingly a research project for NIST has already accomplished a working ASIC for SHA3.
If you read the paper very carefully it states "It contains all the SHA-3 five finalists".. SO WHAT!!
Then it goes on to state: "and a reference SHA256"
Catch the link here:
SHA256 in ASIC
and something far more interesting related to die sizes and processes
ASIC Die sizes

 
 

Leave a Reply

 
*

 
  1. Hardcorefs

    December 18, 2012 at 9:44 am

    Posted on Behalf of “THESEVEN”:

    Being the one who wrote the code you’re complaining about, I’d like to state a few things:
    – This code was written as a proof of concept back in the summer of 2011. It was never intended to be released, but at some point people pushed me into releasing it. Yes, it is a quick hack.
    – I intentionally avoided a clock domain crossing to avoid opening a can of worms. Yes, a cross-clock-domain fifo is an elegant solution here. I just didn’t bother back then.
    – The lack of a FIFO causes 0.0012% of the shares to be lost on average. Do you really care about that? I don’t. Network latency causes way more lost income than that.
    – “Lots of fixed constants”… there is just that UART clock divider constant. The whole UART is a quick and dirty hack for one specific test setup. This was done for convenience while stabilizing the UART (turned out to be a metastable flipflop problem in the end) and never ended up in real products. There are no other things that fall apart except for this single one thing.
    – I have no idea why you think that the hit detector is low quality code. That’s how every sane design implements this. How else could you even do it? You surely need this comparator/XOR tree.
    – “seemingly written by people that have absolutely NO comprehension about how real hardware actually works… or in some cases even basic binary maths.” is a harsh personal attack, and it is not true at all. I very well know how both of those work.
    – The code that you based this on is NOT public domain. It is licensed under the GPL. I hereby ask you to release your optimizations back into the public, both as source and as a compiled bitstream, like I did.

     
    • Hardcorefs

      December 18, 2012 at 10:00 am

      I will reply to Three of your points:

      1. You critisie my statement:
      “seemingly written by people that have absolutely NO comprehension about how real hardware actually works… or in some cases even basic binary maths.” is a harsh personal attack, and it is not true at all. I very well know how both of those work.

      I will now backup and quantify my statements:

      Consider the following code:
      from “UART.vhd”

      elsif rxclkdiv = “00000000000000” then
      rxclkdiv <= “10000010001”;

      1. Note the constants scattered ALL over the file, why not define:
      Constant bit_period : std_logic_vector(bit_period_in’range):=”10000010001″;
      (yep Ok you say you were in a ‘hurry’ and it was a hack)

      2. As regards your binary maths:
      elsif rxclkdiv = “00000000000000” then…

      I mean WTF a 14 input comparator and 3 levels of logic!!

      why not make the ‘rxclkdiv’ count 1 bit longer, decrement the count by one and then just test the “underflow”

      as in:
      elsif rxclkdiv(rxclkdiv’high)=’1′ then…..

      NO 14 input comparators, cleaner code, and it DOUBLES the speed the UART works at (which is CRITICAL for increasing the core speed)

      Please feel free to explain your statement “I very well know how both of those work.”
      yep you may “know how they work”, but do you UNDERSTAND it.

      Yep OK I may be/can be a little harsh, but please consider YOUR PUBLIC remarks and your kurt replies to my requested “bug fixes” for “Modular-Python-Bitcoin-Miner” on GITHUB.

      Where you stated that it was “my network” (later ABCpool publicly informed it was a problem with THEIR server).

      Sorry…. but you cannot have it BOTH ways.. You cannot slag me/ my setup down in public, then cry when I retaliate.

      Furthermore on another “bug” fix request you informed me that the code should NOT gracefully handle DOS attacks against the host server.
      So I let it ride and fixed the code myself.
      I can say now that prior to that my full intention was to bounce ALL the code back to you, but after your GIT hub replies , I though Fuck it…..

      2. Release the fixes back, under the GPL.
      OK I have released two fixes above.
      But note that I do not have to release back any external modules, such as the ram FIFO, since it sits EXTERNAL to “your code”.

      3. I will be releasing bitstreams, but I want to ensure they are correct and functional, My goal is to DOUBLE the speed of the XUPV5 FPGA (120HMS) WITHOUT having to mess about with constraints.

      A freebie (I said 3 points but lets make it 4):
      4. “The lack of a FIFO causes 0.0012% of the shares to be lost on average. Do you really care about that? I don’t.”
      This statement is nonsense and unquantifiable because:
      To quantify it:
      1. You would have to know every piece of work issues by all the bitcoin servers in the world.
      2. You would need to know every solution nonce and the pre-hash.
      3. You could only quantify this statement AFTER 2032, when all the bit-coins have been generated.
      The only way you can work out a % is if you have accurate figures to work with, rather than taking the 120MHS and dividing down by sticking in a decimal point and some zeros. (it is a function of baud rate and clock frequency, one of which you do not appear to have taken into account.)

      But the biggest point, is that you have STILL FAILED to see WHY a FIFO is needed, irrelevant of ‘dropping’ nonces that are close together.

      Sorry if you feel hard done by and instead need to justify all the problems on:
      “Its a hack”
      “I was not going to release it, I was forced”
      “it’s not my fault”

      But hay, things could have been worse….. You could have designed a Chinese ASIC product, or be trying to run a share based ‘scam’ on mining based from China…….

       
      • Luke-Jr

        December 18, 2012 at 1:06 pm

        “But note that I do not have to release back any external modules, such as the ram FIFO, since it sits EXTERNAL to “your code”.”
        I think perhaps you don’t understand the GPL terms. While it is true that your code is your own, the GPL forbids you to link *either to or from* any incompatibly licensed code. Therefore, if your code is linking to the GPL’d code, you must release it or you are infringing on the GPL’d code’s copyright.

        “…AFTER 2032, when all the bit-coins have been generated.”
        Not really important here, but the block subsidy won’t reach zero until around 2144.

         
      • TheSeven

        December 20, 2012 at 5:59 am

        “2. As regards your binary maths:
        elsif rxclkdiv = “00000000000000? then…

        I mean WTF a 14 input comparator and 3 levels of logic!!

        why not make the ‘rxclkdiv’ count 1 bit longer, decrement the count by one and then just test the “underflow”

        as in:
        elsif rxclkdiv(rxclkdiv’high)=’1? then…..

        NO 14 input comparators, cleaner code, and it DOUBLES the speed the UART works at (which is CRITICAL for increasing the core speed)”

        This isn’t even a comparator, it’s an OR tree. And it will be implemented on a total of 4 LUTs, cascading only 2 levels deep (if I remember the virtex5 architecture right). Your approach would extend a carry chain. Not sure what’s better here, but as I said, this whole module is just a quick hack.

        “Yep OK I may be/can be a little harsh, but please consider YOUR PUBLIC remarks and your kurt replies to my requested “bug fixes” for “Modular-Python-Bitcoin-Miner” on GITHUB.

        Where you stated that it was “my network” (later ABCpool publicly informed it was a problem with THEIR server).”

        I cannot tell from the miner side whether it’s your network or the pool. I can just say that it isn’t the miner. And if it is the pool, that clearly isn’t a miner bug.

        “Furthermore on another “bug” fix request you informed me that the code should NOT gracefully handle DOS attacks against the host server.”

        If your interpretation of “gracefully handling” is not displaying any error debug information, then I have to disagree with that interpretation. MPBM does gracefully handles downtimes, it will just produce a bit of chatter in the log that might help to locate the cause of the problem. I see nothing wrong with that. Hiding away that information is nothing that I would pull.

        “So I let it ride and fixed the code myself.
        I can say now that prior to that my full intention was to bounce ALL the code back to you, but after your GIT hub replies , I though Fuck it…..”

        Possible MPBM copyright infringement here.

        “2. Release the fixes back, under the GPL.
        OK I have released two fixes above.
        But note that I do not have to release back any external modules, such as the ram FIFO, since it sits EXTERNAL to “your code”.”

        As Luke already said, you apparently didn’t understand the meaning of the GPL here. This is clearly a copyright infringement. Just sayin’.

        “3. I will be releasing bitstreams, but I want to ensure they are correct and functional, My goal is to DOUBLE the speed of the XUPV5 FPGA (120HMS) WITHOUT having to mess about with constraints.”

        Sure, go for that. But remember to release everything that you need to reproduce that bitstream along with it to avoid copyright problems.

        “A freebie (I said 3 points but lets make it 4):
        4. “The lack of a FIFO causes 0.0012% of the shares to be lost on average. Do you really care about that? I don’t.”
        This statement is nonsense and unquantifiable because:
        To quantify it:
        1. You would have to know every piece of work issues by all the bitcoin servers in the world.
        2. You would need to know every solution nonce and the pre-hash.
        3. You could only quantify this statement AFTER 2032, when all the bit-coins have been generated.
        The only way you can work out a % is if you have accurate figures to work with, rather than taking the 120MHS and dividing down by sticking in a decimal point and some zeros. (it is a function of baud rate and clock frequency, one of which you do not appear to have taken into account.)”

        I don’t care about whether it’s 0.001% or 0.002%. And you can statistically prove that the average number of lost shares in a sufficiently big (let’s say a couple of months on one chip) data window will be inside that range with a very high confidence level. All of this is of course assuming a truly random distribution of the nonces, but that’s close enough to reality as well.

        “But the biggest point, is that you have STILL FAILED to see WHY a FIFO is needed, irrelevant of ‘dropping’ nonces that are close together.”

        I do see that the FIFO helps a lot with the clock domain crossing, and that a clever design which allows for runtime clock speed adjustments might take advantage of it. Actually that’s how it’s implemented in the verilog source code that we use on our production boards today. That’s in the same git if you want to have a look at it. It is just that one little virtex experiment that doesn’t have that FIFO.

         
        • Hardcorefs

          December 21, 2012 at 8:09 pm

          I just skimmed the above post and picked out a few main points.

          A FIFO WILL help with crossing a clock domain because a FIFO can be dual port accesses with separate clock domains.

          As long as the overall input rate into the FIFO does not exceed the rate at which you can empty it, and IF the input rate fills the FIFO you can always gate the FIFO full flag to the clock running the SHA engine, shut it down/slow it down, until the FIFO is emptied. (It may be far more efficient to empty the FIFO down, rather than just toggle off the “full flag, since a situation may arise where you are just one entry under a full FIFO and continually ping ponging the SHA Engine clock rate. as the FIFO oscillates around the FULL-flag)

          As regards “statistical proof” it is complete nonsense to say you can predict the losses on the nonces.
          Because that would pre-suppose they follow /track statistical analysis, and IF they did that, then you would be RICH.
          You could use said statistics to “exclude” search areas where there was a probability of no nonces, thereby significantly improving the speed to search for a solution nonce.

          The WHOLE point of SHA256(SHA256(x)) is that it prevents such attacks:
          To-date there are only a couple of research papers on shortcutting the *SHA256 system, but they break down above 21 bits, and I am not aware of any statistical predictions for nonces, it has been over 10 years since SHA256 made its appearance.
          *Such attacks appear to be mitigated by the double SHA256 in bit-coin (SHA256(SHA256(x)))

          However I am currently building statistical analysis into the “Modular-Python-Bitcoin-Miner”, in an attempt to gain reproducible research on any patterns that may exist.

          You are of course free to your own opinion as I am to mine, but personally I have to say you appear to just be trying to justify poor decision making…. Sometimes you just have to let go and look at the realty of the situation.

          As regards the GPL you are also misinformed. I recommend you RE-READ the GNU license

          Specifically:
          “One is allowed to make private modified versions, without any obligation to divulge the modifications as long as the modified software is not distributed to anyone else”

          Which basically precludes me from releasing finished FPGA bit files unless I want to breach an unenforceable licensing arrangement.(Just take a look at all the products coming out of China that breach the GNU)

           
          • TheSeven

            December 22, 2012 at 10:42 am

            Why? What’s wrong with displaying that kind of debug information on the console and the log file? Why make debugging of some problems harder for no reason?

             
          • Hardcorefs

            December 23, 2012 at 2:50 pm

            If you really do not know, Then I’m afraid I cannot really help you.
            But as a rough guide:

            1. Consistency, up until that point, the log is beautifully formatted.
            Making data extraction a dream and then you get punched in the face.

            2. professionalism.

            Read a book on User interfaces and design.