Taken for Granted

ESL, embedded processors, and more

Day 3 of DAC 2009: Bill Dally Keynote

Filed under: Uncategorized — July 29, 2009 @ 11:29 pm

Bill Dally, Chief Scientist and Senior VP of Research at Nvidia, and a professor at Stanford, gave today’s (Wednesday July 29) DAC keynote on the theme “The end of denial architecture and the rise of throughput computing”.

Bill Dally of Nvidia

This was quite an interesting keynote, partly because the views he advanced are not universally shared. He opened with the now-classical discussion of how single-thread processor architectures – the “denial architectures” – have stopped scaling in performance at anywhere near the previous long-term rate, for both technology and architectural reasons. He calls these “denial” architectures because they preserve two illusions: the illusion of serial execution, which denies parallelism, and the illusion of flat memory, which denies memory locality. The single-threaded processor relied on ILP (instruction-level parallelism), which has run out of steam, and on caches for memory access, which are inefficient when they are too small for an application’s working set.

Dally recommended that the industry “come out of the closet”, leave the old era of denial architecture behind, and enter a new era of throughput computing resting on two key premises:

performance = parallelism

efficiency = locality

This implies moving from latency-optimised CPUs, which are improving slowly, to throughput-optimised processors, which are improving rapidly. Naturally, given his Nvidia position, his primary example was Graphics Processing Units (GPUs), but he did acknowledge that other processor architectures may also fall into the throughput-optimised class. This is a good thing, since I would argue that application-specific instruction-set processors (ASIPs) applied to the dataplane, forming dataplane processor units (DPUs), also fall into the class of throughput computing.
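To make the “performance = parallelism” premise concrete, here is a minimal CUDA-style sketch of my own (not an example from the keynote): the same SAXPY loop written once for a latency-optimised serial core and once as a throughput-oriented kernel with one lightweight thread per element.

    // Latency-optimised view: one core walks the data serially.
    void saxpy_serial(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Throughput-optimised view: one thread per element; performance comes
    // from running thousands of such threads in parallel.
    __global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launched with enough 256-thread blocks to cover n elements:
    //   saxpy_parallel<<<(n + 255) / 256, 256>>>(n, a, x, y);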

He then discussed possible applications, focusing on high-performance scientific ones and media-centric embedded ones, which can take advantage of parallelism and data locality and thus of throughput computers. He then made a point that I agree with, although many may not: that for many current and future applications, Amdahl’s Law does not apply, because there is no significant serial part of the application. Rather, he described such applications as an hourglass, in which the parallelism may be wide or narrow, but never so narrow as to be serial. This is an argument that many will not accept, but one that I think does characterise many embedded data-intensive applications such as media processing. One other point he made was that Amdahl’s Law is not a “law” – it is an observation, based on old programmes from the time it was formulated, and there is no reason it must apply to all future programmes and algorithms. (hear, hear!)
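For reference, Amdahl’s Law bounds the speedup achievable on N processors:

    Speedup(N) = 1 / ((1 - p) + p / N)

where p is the fraction of the work that can be parallelised. With p = 0.95 the speedup can never exceed 20x, however large N becomes; but if p is effectively 1 – the “no significant serial part” case Dally describes – the speedup scales nearly linearly with N.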

For future architectures, he argued that the right model is stream processing, which optimises the use of scarce bandwidth, “overprovisions” arithmetic, and exposes a rich local storage hierarchy in which data movement is managed explicitly rather than left to implicitly reactive caches.
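That explicit management of locality is visible even in a toy CUDA kernel – again an illustrative sketch of my own, assuming 256-thread blocks, rather than anything shown in the talk – where each block stages its tile of data into on-chip shared memory before computing:

    // Three-point moving average: each block explicitly copies its tile
    // (plus a one-element halo on each side) into shared memory, so data
    // movement is managed by the programmer rather than by a reactive cache.
    __global__ void moving_avg3(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];                 // assumes blockDim.x == 256
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;

        if (gid < n)
            tile[lid] = in[gid];                        // stage my element
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;   // left halo
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;  // right halo
        __syncthreads();                                // tile is now in the local store

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }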

I needed to cut out a bit early, while he was illustrating the argument with Nvidia GPUs such as the GeForce GTX280, which is packaged in the Tesla T10 with 240 scalar processors, and with the CUDA programming model. I heard from a colleague that he forecast future machines in 2015, in 11 nm technology, having 5000 or more serial processing engines in them. It will be interesting to watch the programming model scale to that level and to see how wide an application space this will apply to.

All in all, an interesting and thought provoking keynote.

6 Comments »

  1. SKMurphy » DAC 2009 Blog Coverage Roundup:

    [...] Grant Martin on “Day 3 of DAC 2009: Bill Dally Keynote“ [...]

  2. Rishiyur Nikhil:

    In a recent post on DeepChip (http://www.deepchip.com/items/0477-10.html), a university professor suggested that we at Bluespec, Inc. were “in denial” for pursuing such an unconventional route to High-Level Synthesis (HLS).

    Last week, another university professor, Bill Dally of Stanford University and NVidia, gave a keynote speech at DAC that is already creating much buzz (Richard Goering: http://tinyurl.com/ldrt9a, Grant Martin: http://tinyurl.com/lla2j4, Tech On: http://tinyurl.com/nfuuwj, EE Times: http://tinyurl.com/mbe22q, DAC Blogspot: http://tinyurl.com/lfldbs).

    Dally called on the community to abandon what he termed “denial architectures”. He identified the two principal features of widespread denial:

    Sequential computation model (and weak variants like instruction-level parallelism)

    Flat memory model (fixed, equal cost of access to all data)

    Between C++ and BSV (Bluespec SystemVerilog), guess which one is the poster child for these two denial features?

    The HW/SW design competitions at the last three MEMOCODE conferences (http://csg.csail.mit.edu/Memocode2009/) were, in the end, all about quickly designing the most efficient parallel architectures. Is it any surprise that the winners did not use sequential C/C++ (2007: http://tinyurl.com/mszr77, 2008: http://tinyurl.com/nzgku2, 2009: http://tinyurl.com/ms6yvl)?

    Are we the ones in denial? No, we’re at the Denali Party (enjoying a DACiri)! :-)

    Rishiyur S. Nikhil, CTO Bluespec, Inc.

  3. Grant Martin:

    Rishiyur,

    Thanks very much for your comment. However, I don’t think you can claim that Bluespec is the only high-level synthesis input that deals with the “denial” issues, because none of the high-level synthesis tools using C/C++/SystemC takes “sequential” C/C++ alone as its input. All of them take extra pragmas or synthesis controls and directives in one form or another that let users specify and explore potential concurrency/parallelism, and also explore various input mechanisms and memory models/access methods. This may be more inherent in the language, as with SystemC, or provided by pragmas or special methods, as in other languages. But none of them, as far as I know, lacks a means to explore these aspects.
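    For illustration only – the directive names below are invented, since every tool spells its own differently – a C fragment given to such a tool might look like this, with the user, not the language, supplying the concurrency and memory-mapping information:

        /* Hypothetical HLS directives -- purely illustrative, not any
           particular tool's syntax.  The user asks for the coefficient
           array to be partitioned across on-chip memories and for the
           inner loop to be unrolled: the parallelism and the memory
           model are specified alongside otherwise sequential C. */
        #define TAPS 8
        void fir(const float coeff[TAPS], const float *sample, float *acc, int n) {
        #pragma hls_array_partition(coeff, complete)
            for (int i = 0; i < n; ++i) {
                float sum = 0.0f;
        #pragma hls_unroll(4)
                for (int j = 0; j < TAPS; ++j)
                    sum += coeff[j] * sample[i + j];
                acc[i] = sum;
            }
        }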

    So if there is an advantage to Bluespec – which its users will confirm or not – it will be because its means of specification is more efficient or more capable, because its underlying synthesis target architecture allows better results, or because its algorithms are better. But you should discuss these aspects in a follow-up if you feel they explain the better results in the competitions you mention.

  4. Rishiyur Nikhil:

    Grant,

    Thanks for your comment, and I’m glad you brought up the point that “all the current C/C++/SystemC HLS tools take extra pragmas or synthesis controls and directives” to steer the synthesis towards a desirable architecture (many people are unaware of this, and are surprised when they first see it). I have several follow-up comments on this.

    (1) The first comment is that algorithms people write in C/C++, the starting point for C/C++ synthesis tools, may simply be bad algorithms for parallel implementation. This is why we have fat textbooks on parallel algorithms, and high-quality conferences like ACM SPAA and PODC engaging some of the best minds in the business.

    (2) The second, and perhaps most serious, comment is that these pragmas are very weak and indirect knobs for controlling the final architecture, like pushing on a rope. In many head-to-head comparisons, we’ve found that although these tools may get you quickly to initial RTL, there’s often a very long and tedious tail in trying to squeeze it down to acceptable area, precisely because of the lack of architectural transparency and control. So, yes, I am claiming that BSV (Bluespec SystemVerilog) offers a much more direct, transparent, and controllable means of specifying parallel algorithms and architectures. See also my comment here for more on this.

    (3) Another comment is that all these “extra pragmas or synthesis controls and directives” detract from the “ANSI C/C++” rhetoric, because producing these pragmata involves most of the hard labor in getting a good synthesis result, and they are usually extremely tool-specific.

    (4) SystemC is of course a parallel language (albeit an unrealistically weak one, because of its non-preemptive thread scheduling semantics). But when people use the phrase “SystemC synthesis”, it’s hard to tell whether they mean synthesis from RTL-like SystemC – using sc_signals, sc_ins, sc_outs, clocked threads (equivalent to RTL’s ‘always’ blocks), etc. – or whether they mean CDFG-based synthesis. In the former case, it’s no better than RTL, and in the latter case, it’s no better than C/C++ synthesis.

  5. Grant Martin:

    Rishiyur, thanks for your follow-up comments. Have Bluespec users done a direct comparison of SystemC or C/C++ based synthesis vs. BSV synthesis, and published direct comparative results in an independent forum? That would be an interesting comparison of the capabilities. In the most recent Design & Test magazine there is an interesting article by authors from TI that looks at several of the high-level synthesis tools (three commercial ones) and, without naming them, draws some lessons and recommendations for improvement. It is interesting to note that the authors say that two of the three tools used C or variants of C as input; the third used a “nonsequential” input language that is unspecified. It is an interesting paper, but I think that if we are to make further progress, we probably need to start “naming names” when reporting on real user experiences – that is, if the names can be named.

    Thanks again
    Grant

  6. Amdahl’s Law is a Law (Hey, a New Post!) « Bugs Are Easy:

    [...] Scientist at Nvidia, gave a keynote speech in which he claimed that Amdahl’s law is not a law, but an observation. This uses the popular interpretation of law to mean “proven [...]
