Moore's Law is deader than corduroy bell bottoms. But with a bit of smart coding it's not the end of the road

In his 1959 address to the American Physical Society at Caltech, physicist Richard Feynman gave a lecture titled, "There's Plenty of Room at the Bottom," laying out the opportunity ahead to manipulate matter at the atomic scale. The semiconductor industry since then has made much of miniaturization, increasing the density of …

  1. ovation1357

    Could this finally mean an end to the programmers who get away with terrible code by arguing that they can simply throw more RAM and a faster CPU at it?

    I'm yet to be convinced by the strategy of old only optimising code which is deemed to be too slow. I'm not saying that everything has to be carefully written in assembler but I've seen so many code abominations which technically work but are hugely wasteful of resources and become extremely difficult to maintain and support.

    1. Anonymous Coward
      Devil

      Hammer ISO Nail

      "Could this finally mean an end to the programmers who get away with terrible code by arguing that they can simply throw more RAM and a faster CPU at it?"

      Of course not. Terrible code will be with us always.

      As to the study, I'm not surprised that programmers think the solution is programming.

      Chip designers, OTOH, think the solution is chip design. The current push is for multi layered chips with interlayer connections between the transistors.

      Then there are the engineering boffins who think spintronics will solve the problem, given enough funding. https://www.sciencedaily.com/releases/2020/06/200603122949.htm

      And, of course, the bean counters vote for cheap programmers, cheap chips, and no research that won't affect the current quarter.

      Same as it ever was.

    2. LucreLout
      Pint

      Could this finally mean an end to the programmers who get away with terrible code by arguing that they can simply throw more RAM and a faster CPU at it?

      As a programmer I do hope so.

      While I'm here though, here's a pint for all the generations of CPU designers and hardware engineers who have brought us so far as they have. Top quality work folks!

    3. The Man Who Fell To Earth Silver badge
      WTF?

      So what's the point other than Python & Java suck at calculations?

      I downloaded the paper and I didn't see any info on the precision of the calculation.

      So I wrote a little program in the oldest, least maintained compiler I have access to (PB, which hasn't had a compiler update in a decade or so and only generates 32-bit executables). My version dimensions the three matrices as 4096-by-4096, initializes A & B with data of the appropriate precision (so they are not just full of zeros), then times how long it takes to matrix multiply A & B and assign the result to C using FOR loops as they do in the paper (as opposed to PB's built-in matrix operators). I wrote it as a single-threaded 32-bit application. I redid it with the matrices & math at single, double & extended precision. Ran it on a 7-year-old laptop that sports an i7-3632QM @2.20GHz. Windows 10 Pro, 64-bit 1909 (18363 build).

      Matrices declared as Single Precision: 732.2969 seconds ( = 12.2 minutes)

      Matrices declared as Double Precision: 871.9297 seconds ( = 14.5 minutes )

      Matrices declared as Extended Precision: 1076.062 seconds ( = 17.9 minutes)

      The PB compiler does all floating point calcs in extended precision (10 bytes), so one would not expect huge speed differences between calculations where the matrices are declared as single (4 bytes), double (8 bytes) or extended (10 bytes) matrices, as the work difference is mostly all the type conversion overhead.

      But it once again underscores that Python & Java suck at numerical calculations.
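For readers who want to try the same experiment, the triple FOR loop described above can be sketched in plain Python. This is an illustration only, not the commenter's PB program: the function names and the matrix size of 128 are assumptions chosen so the demo finishes quickly, whereas the 4096-by-4096 case discussed in the thread would take hours in unoptimised Python.

```python
import random
import time


def matmul_naive(a, b):
    """Multiply two square matrices (lists of lists) with three nested loops."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def random_matrix(n, seed):
    """Build an n-by-n matrix of pseudo-random floats so it is not all zeros."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    n = 128  # assumption: much smaller than 4096 so the timing run is short
    a = random_matrix(n, seed=1)
    b = random_matrix(n, seed=2)
    start = time.perf_counter()
    matmul_naive(a, b)
    elapsed = time.perf_counter() - start
    print(f"{n}x{n} naive multiply took {elapsed:.3f} s")
```

The work grows with the cube of the matrix dimension, so timings from a small run can only be extrapolated roughly to larger sizes.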

      1. Tom 7

        Re: So what's the point other than Python & Java suck at calculations?

        I ran some tests too - but I used a math library that was written in C++ to do the matrix calcs and it ran only a fraction of a second slower than an optimised C++ version, but about 2000 times faster to write! That's some of what the author was trying to get across I think!

        1. DavCrav

          Re: So what's the point other than Python & Java suck at calculations?

          A very clever guy called Richard Parker has spent a significant amount of time trying to optimize a matrix multiplication algorithm, but over finite fields rather than using doubles. (Think {0,1} if you don't know what a finite field is.)

          He has managed to improve the current best algorithm, as implemented, by orders of magnitude. This is used for multiplying matrices in the hundreds of thousands of dimensions, rather than a few thousand.

          The first thing is you don't need to do as many rounds of Strassen as you might expect to do, only half a dozen at most. And then the problem becomes chopping your matrices up into the correct sized blocks so you work just inside each level of cache. You really have to worry about the distance between the processor and the memory when you are doing things like this, and trying to optimize it. But you also need code that works on lots of different computers. Thus he has to make conservative estimates on the amount of L1/2/3 cache, etc. so that it always works.

          He has to work in assembler, because he can't trust even a C compiler not to stuff up his code.

          It's fascinating stuff, but a little too close to the coalface for my taste.
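The idea of chopping matrices into blocks that fit in cache can be sketched as a tiled loop nest. This is only an illustration of the loop structure, not Richard Parker's algorithm: it uses ordinary floats rather than finite fields, involves no Strassen steps, and the block size of 32 is an arbitrary assumption. In pure Python the interpreter overhead will largely hide any cache benefit; the same structure pays off in a compiled language.

```python
def matmul_blocked(a, b, block=32):
    """Multiply square matrices tile by tile so each tile is reused while it is hot."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops work inside one tile.
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The `min(...)` bounds handle matrices whose dimension is not a multiple of the block size, which matters if the block size has to be chosen conservatively for unknown cache sizes.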

        2. Cuddles

          Re: So what's the point other than Python & Java suck at calculations?

          "That's some of what the author was trying to get across I think!"

          That's the impression I got. Anyone familiar with Python should be aware that while it's an easy one to learn, there's a reason pretty much all the heavy lifting is done by calling C libraries. But it's interesting to see a more quantitative look at exactly how much speedup you can get from specific changes.

          As for HildyJ's point about programmers saying programming is important and chip designers saying chip design is important, the obvious answer is that neither is particularly useful in isolation. This study gives a nice example of that. Parallelising the code to use all the cores gives a nice speedup. If the chip doesn't have multiple cores that obviously isn't going to help, while if the programmer doesn't use them it doesn't matter how many extra chips you add. Hardware and software have to develop together, otherwise you'll just end up with hardware providing options that aren't used by the code, and code trying to do things that aren't physically possible on the hardware.

      2. John H Woods

        Re: So what's the point other than Python & Java suck at calculations?

        Smalltalk is always my go-to --- 'notoriously' slow but very powerful and easy to knock something up.

        1966 seconds for a single threaded effort on an aging i5 whilst I was doing something else. Come on, is Python really that bad? I was going to try to learn it!

        Maybe before we try to improve our algorithms (and there's some pretty clever stuff you can do with matrices) we should ensure our languages aren't taking the mickey.

        1. Someone Else Silver badge

          Re: So what's the point other than Python & Java suck at calculations?

          Come on, is Python really that bad?

          No. But if you want balls-to-the-wall performance, look elsewhere. You know, matching the tool to the job....

      3. TimGJ

        Re: So what's the point other than Python & Java suck at calculations?

        Ordinary Python does suck at number crunching. That's why we use numpy, so you get the efficiency of C++ with the convenience of Python.

        The completely meaningless and cherry-picked example of the 4096x4096 matrix multiplication takes 37s to execute on a single core of an AMD 3900X.

      4. Someone Else Silver badge

        Re: So what's the point other than Python & Java suck at calculations?

        Wonder how this would go using APL (an unreadable, unwritable language optimized for matrix manipulations)?

        It'd probably go like a bat out of hell...once you got it working. But writing and debugging the damn thing would cost more than any runtime gains you might realize.

    4. Anonymous Coward

      the problem is (always) manglement

      In my experience it is the product managers who typically shy away from putting aside the necessary time for performance tuning. Security is also low on the list of priorities.

      Heck... (anon mode enabled) Last week we received an e-mail from our department mangler. He told us that a different product from within our company had been exposed, and now every product would be assigned somebody to be responsible for keeping third-party software up-to-date. This task would not require any planning, because it would take only five minutes (tops!) per week.

      Never mind that our team has a security shitlist containing about a year's worth of work if we were to start shoring up security properly (we still have some ancient sql-injection vulnerabilities. We hope that the remaining ones are impossible to exploit, but the low-hanging fruits have all been picked and those that are left are difficult/time-consuming to wrestle with). And never mind that it takes considerable time to convince internal IT to run our products in a secure manner. I spent a day alone convincing some bastards to shut down port 80 on an externally faced web application. "Why bother? The customers have been told to use https!"

      Jizusfreakingholymarycow.

      Same team mangler argued that we should use http internally between our microservices. "https carries a 20% overhead!". That guy incidentally has the confidence of Clint Eastwood, but we have noticed he says a lot of crap that just does not add up. We believe HR never checked his CV or references. The guy is a danger to himself and others - a potential Darwin award candidate.

      I care less and less. May these bastards rot in hell. In my experience, once a company reaches about 30 employees, the rot starts setting in.

      I'll keep my head down and write as good code as I can manage, but I feel my days are numbered.

      1. Anonymous Coward

        Re: the problem is (always) manglement

        Obligatory Dilbert.

        1. Claptrap314 Silver badge

          Re: the problem is (always) manglement

          I figured you were going for "If we eliminate the design & testing phases, we can hit the release window", but this is good, too.

    5. big_D Silver badge

      Run it in Fortran on a VAX, it will take less than a second... The optimizing compiler back then compiled a similar demo down to a single NOP wrapped in an .exe bundle.

      The same program running on a much more powerful mainframe took several days.

      The DEC compiler worked out that with 1) no input 2) fill a matrix with values 3) no output, it could optimize out part 2, because it wasn't needed, which left optimizing parts 1 and 3, which optimized down to NOP (no output), or an empty executable.

    6. Kubla Cant

      I'm yet to be convinced by the strategy of old only optimising code which is deemed to be too slow

      The trouble is that the programmers who currently write bad, inefficient code will produce something even worse if they're encouraged to "optimise" it.

    7. Someone Else Silver badge

      Could this finally mean an end to the programmers who get away with terrible code by arguing that they can simply throw more RAM and a faster CPU at it?

      No.

      Next question?

  2. rcxb Silver badge

    the seven hour number crunching task can be reduced to 0.41s, or 60,000x faster than the original Python code.

    Yes, but how many hours of programmer time did it take to do the optimization, and how much money does a few hours of a programmer's time cost versus a few hours of a single CPU core?

    I'm all for simple and efficient programming, but processors are rather ridiculously fast now. The only thing I concern myself with performance-wise is why my web browser takes so damn long to load what should be a simple Amazon product page, with its delayed loading of tons of JavaScript that happens AFTER I've started typing into the search box, messing things up, and not knowing when it's actually finished.

    Longer-term, I'm sure chipmakers aren't going to just give up. Quantum computers are in the works, much has been said of optical, shortening pipeline or increasing L1 cache helps, and there's potential for exotic layouts like 3D multi-layer chips to give a speed boost as well. With many billions to be made, there won't be a shortage of R&D when we actually hit the wall.

    1. RM Myers

      This seems like a case of "horses for courses". If you are writing the typical web applications, then the current optimizations built into the programming languages are probably more than enough. If you are writing operating system kernels, then you may well want to aggressively tune your algorithms. And if you are writing large scientific applications running full bore on supercomputers costing tens to hundreds of millions, then you probably want the type of optimizations mentioned in this article.

      1. Carlie J. Coats, Jr.

        If only...

        Except that many of these supercomputing applications are written to suit the hardware-behavior and compiler-limitations of 1980's vintage vector machines. Unnecessarily.

        Which is why my version of the WRF weather-model is so much faster than NCAR's...

      2. big_D Silver badge

        I've worked on optimizing a few web projects where the "built-in" optimizations weren't enough, because the code had been written to be elegant and human readable, with no thought about how "executable" the code was.

    2. vtcodger Silver badge

      Things grow ... until they don't

      Moore's law is just exponential growth. It's probably best stated as "a lot of things tend to grow at a constant rate ... until they don't." If you want an equation, try X = R^T where R is a growth rate and T is a time. For Moore's Law -- the growth in the number of "transistors" in a given area of an IC, R is about 1.414, so that for time (T) = two years, X=1.414**2.0 = 2.0. i.e. doubling in two years.

      What about "... until they don't". Well, things genuinely don't grow exponentially forever. But feature density has managed a pretty good run -- 60 years. Will it continue? For How long? Who knows?
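The arithmetic in the X = R^T formula above can be checked with a few lines of Python. The function names are made up for this sketch; the only inputs taken from the comment are the yearly rate of about 1.414 (the square root of 2) and the two-year doubling period.

```python
import math


def growth_factor(rate_per_year, years):
    """X = R ** T: total growth after `years` at a constant yearly rate."""
    return rate_per_year ** years


def doubling_time(rate_per_year):
    """Years needed for X to reach 2 at the given yearly rate."""
    return math.log(2) / math.log(rate_per_year)


if __name__ == "__main__":
    rate = math.sqrt(2)  # about 1.414, as in the comment
    print(f"growth over 2 years: {growth_factor(rate, 2):.3f}")
    print(f"doubling time: {doubling_time(rate):.3f} years")
    print(f"growth over 60 years: {growth_factor(rate, 60):.3e}")
```

Sixty years at that rate is thirty doublings, a factor of roughly a billion, which gives a sense of why "until they don't" is the interesting part.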

      1. Milo Tsukroff
        Big Brother

        Re: Things grow ... until they don't

        > feature density has managed a pretty good run -- 60 years.

        > Will it continue? For How long? Who knows?

        Now that chip making has moved to China, indeed, who knows? How can anyone know? 'Nuff said.

      2. Captain Hogwash Silver badge

        Re: Things grow ... until they don't

        Did you know that disco record sales were up 400% for the year ending 1976, if these trends continue...AY!

    3. Doctor Syntax Silver badge

      "Yes, but how many hours of programmer time did it take to do the optimization, and how much money does a few hours of a programmer's time cost versus a few hours of a single CPU core?"

      And how much is the user's time worth whilst they wait for a task to complete?

      The programmer only has to optimise it once. Many users may use the program many times.

      1. Carlie J. Coats, Jr.

        Performance limits

        And because of parallel overhead (for example), that inefficiency sets limits on best turnaround time. When you plot performance vs number of processors, you normally get a U-shaped curve: at first, adding processors cuts your turnaround time, but eventually the parallel overhead kicks in enough that adding more processors adds to your turnaround time. 64 processors may well be slower than 32 in that instance.

        The way to fix it is better algorithms and better coding, to push the whole curve downward toward the X-axis.

        Twenty years ago or so, I reviewed a paper on someone's parallelization of an atmospheric-chemistry model. Their best performance was happening at the 16-processor level. But. A different and equivalent model I know achieved better performance than that one on just 2 processors -- and scaled better, so that its performance-curve was best at 32 processors. If your model is too inefficient to begin with, it doesn't matter how many processors you throw at it, you are limited by that initial inefficiency.
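The U-shaped curve described above can be reproduced with a toy cost model. Everything numeric here is an assumption made for illustration: the model treats turnaround time as a fixed serial part, plus parallel work divided across processors, plus an overhead that grows linearly with the processor count. Real overheads are rarely this tidy.

```python
def turnaround_time(processors, serial=10.0, parallel=1000.0, overhead=1.0):
    """Toy model: serial part + divided parallel work + per-processor overhead."""
    return serial + parallel / processors + overhead * processors


def best_processor_count(max_processors, **model):
    """Processor count (1..max_processors) with the lowest modelled turnaround."""
    return min(range(1, max_processors + 1),
               key=lambda p: turnaround_time(p, **model))


if __name__ == "__main__":
    for p in (1, 2, 8, 16, 32, 64, 128):
        print(f"{p:4d} processors: {turnaround_time(p):8.2f}")
    print("best count:", best_processor_count(128))
    # A more efficient code (less total work) lowers the whole curve.
    print("best count with a quarter of the work:",
          best_processor_count(128, parallel=250.0))
```

With these made-up parameters the minimum falls at 32 processors and 64 is slower, matching the shape the comment describes. Reducing the `parallel` work term lowers the curve everywhere, which is the "better algorithms and better coding" fix.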

        1. Rol

          Re: Performance limits

          Anyone with a touch of Asperger's would vomit uncontrollably at the sight of the x86 instruction set.

          When it comes to inefficiencies, the instruction set that has been used as much as a weapon by the dominant player, Intel, against its competitors, is one of the most inglorious examples of how not to develop technology.

          If we were developing a cpu today, with no regard for the legacies of the past, then without any added leaps of science we could turn the x86 into a beast that would more than double performance of any application at a stroke. Of course the application would need to be compiled using a new compiler, that mapped to the more efficient use of the bytes set aside for instructions, and obviously allowing for many more instructions to be added, but this time, by a consensus of the players, not individual companies looking to steal an edge on their competitors.

          Seeing as the industry is transfixed with the idea of obsoleting everything they can in the search for greater revenues, it beggars belief that the x86 instruction set has not been completely overhauled and set on a more steady course, with a governing body overseeing additions to the universal set.

          1. Anonymous Coward

            Re: Performance limits

            Intel tried to replace it, moving stuff into software to improve performance. That's why we all now use AMD's 64-bit x86 instruction set.

          2. Mike 16

            Re: Performance limits

            Replacing x86 (and x86_64 aka amd64)? Go for it!

            https://www.ebay.com/b/Intel-Itanium-Computer-Processors-CPUs/164/bn_5757133

            Maybe compromise with a nice i860 box?

            (Yeah, barely touched a 4U Itanium server back in the day. Wouldn't mind picking up an Itanium VMS pizza-box, but really only for much the same reason as keeping my KIM-1 running.)

            1. bombastic bob Silver badge
              Devil

              Re: Performance limits

              yeah even with ARM upwardly gaining ground in the world of computing prowess, x86 and amd64 instruction sets are on the vast majority of high performance personal computing devices.

              There are reasons for this. Even though 'back in the day' everyone thought RISC would solve performance issues by acting more like microcode in the actual applications, they apparently forgot that instruction fetching takes time, too, and when it takes 2 or 3 or 4 instructions in a RISC architecture to do what 1 instruction does in CISC architecture, the lines of performance benefit get blurry. And of course, Intel and AMD made their pipelines and caching more efficient.

              The article got it right early on: the tech industry needs software performance engineering

              Right, Micros~1?

              People will buy a new computer when it's perceived to be faster (and better) than what they already have; otherwise, you only need to maintain the old one. And statistics on desktop usage focus on new sales, NOT on existing users that simply fix their existing boxen. This skewed the numbers, causing many bad decisions to be made, not the least of which is the assumption that performance DID NOT matter as much as "feature creep" and "new, shiny" features. And here we are today!

              1. Nick Ryan

                Re: Performance limits

                It's made more complicated because in order to handle the ridiculously complicated and often edge-case instructions cobbled into the extended x86 instruction set, both AMD and Intel implemented much of it using a simplified RISC instruction set underneath. Much easier to validate this and to performance tweak it.

                1. Anonymous Coward

                  Re: Performance limits

                  But also easier to fetch, as increasingly it's the stuff outside the CPU that's the bottleneck for performance as memory lost the ability to keep up with the CPU around the time of 80486DX2, the first of the clock-doublers. What the other poster was saying is that RISC loses its simplicity advantage when it gets choked by the memory. At this point, it's a general wash when it comes to computational efficiency because that's not the main chokepoint anymore. What you really need, especially in HPC, is a lot of memory bandwidth, and Intel/AMD currently have the ARM chips beat (because, like it or not, memory handling needs power). For ARM to compete, they need chipsets specifically built for high memory bandwidth, and that's a relatively new field in ARM.

          3. John Smith 19 Gold badge
            Unhappy

            "it beggars belief that the x86 instruction set has not been completely overhauled"

            But it's really all Microsoft know how to code for of course.

            They don't call them "Code museums" for nothing.

            And BTW the sort of orthogonal instruction set you're talking about was developed by both Motorola in the M68000 and IBM in the PowerPC ISAs.

            1. Ozzard
              Boffin

              Re: "it beggars belief that the x86 instruction set has not been completely overhauled"

              Take a look at the PDP-11 instruction set vs. the M68000. You might be surprised by the similarities - nothing new under the sun.

              (Nope, still missing the "old fogey" icon)

              1. Nick Ryan

                Re: "it beggars belief that the x86 instruction set has not been completely overhauled"

                I remember the pleasure of not having to spend more CPU cycles juggling a pitiful number of limited registers around rather than doing anything constructive with the CPU cycles. To "fix" this, they just added more and more complicated instructions that, after glaring at them for a while, holding the documentation sideways and waving a dead chicken at them, I'm often at a loss as to why some of them exist other than for the very occasional edge case.

                1. Anonymous Coward

                  Re: "it beggars belief that the x86 instruction set has not been completely overhauled"

                  I believe Parkinson's Law applies here. No matter how many registers you have in a processor, the jobs you have will expand to take them all up until you have to juggle them all over again.

              2. Anonymous Coward

                "the PDP-11 instruction set vs. the M68000."

                "the PDP-11 instruction set vs. the M68000." "nothing new under the sun"

                ?

                PDP11: 16bit instruction and operand and address size, any general register (R0-R5) could be used for any purpose with any addressing mode (lots to choose from, including indexed and pre-decrement and post-increment). R6 was conventionally the stack pointer, R7 was the PC. Both R6 and R7 could be used with the same addressing modes as other general registers. What might MOV -(PC), -(PC) do?

                Almost everything you need to know about PDP11 instruction set architecture could be found on the PDP11 Programming Card, still available on the Interweb at e.g.

                https://www.montagar.com/~patj/dec/pocket/pdp11_programmingcard_1975.pdf

                68K: various sizes of operand (mostly 32bit), lots of 32bit registers (but some separate registers for addresses and data), some (but not total) flexibility in register usage and addressing modes.

                I could go on, but what's the point.

                What was your point again?

    4. Kevin McMurtrie Silver badge

      Plenty of companies are paying millions of dollars a year in compute costs. The ones that will still be around tomorrow don't want to hear any crap about scaling up to accommodate lazy code.

      It doesn't matter how fast computers are. You're in trouble if your competitor can make them run even 50% faster.

      1. Anonymous Coward

        Damage limitation

        Actually if your software is wrong or unreliable, the slower the computers run it the better.

      2. LucreLout

        The ones that will still be around tomorrow don't want to hear any crap about scaling up to accommodate lazy code.

        On the one hand I agree with you, on the other hand "Javascript programmer".

    5. LucreLout

      Could this finally mean an end to the programmers who get away with terrible code by arguing that they can simply throw more RAM and a faster CPU at it?

      For the 10x increase you're basically talking about it being freely available. Most programmers who know Python will also know, or can quickly learn, sufficient .NET or Java to see gains.

      To be honest, multithreading the code provided should be within the capacity of any proper programmer in any proper language to do quickly and accurately. That will see you hitting way north of the 47x improvement, because you'll be able to leverage n cores on the CPU, meaning you should be able to get closer to a 47 * n speed increase, depending on what other compute is happening on the box. (ETA: You will of course never actually get 47 * n.) Thereafter, doing a distributed calculation across horizontally scaled boxes should allow for significant real-world gains if you still need more power.... or just learn assembler.

      Programmers that require hardware to bail them out of performance problems could do the industry a massive favour and move to another profession.

      1. Tomato42

        there's also the problem with subscripting arrays in Python being notoriously slow, I wouldn't be surprised if just rewriting the code to use iterators and array comprehension wouldn't speed it up by a factor of 10
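The two styles being compared can be sketched side by side. Both functions below are illustrative and were not taken from the article's code; whether the iterator version is really ten times faster is left unmeasured here, and would need checking with something like the standard `timeit` module on the machine in question.

```python
def matmul_indexed(a, b):
    """Matrix product using explicit subscripts on every access."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_iterators(a, b):
    """Same product using iteration and comprehensions instead of subscripts."""
    b_columns = list(zip(*b))  # transpose once so columns can be iterated directly
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]
```

The second version avoids repeated index lookups by walking rows and pre-transposed columns directly, which is the kind of rewrite the comment has in mind.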

        1. LucreLout

          Quite possibly it will. I guess there comes a point in any software performance curve where you need to be realistic about a language's relative horsepower. Sometimes Python won't get you there, so you need Java or .NET. Sometimes they won't get you there so you need C. Sometimes that won't get you there so you go for assembler.

          The most efficient trick is understanding that before you start coding, so you can make a realistic determination of language / framework trade-offs relating to speed of execution vs speed of development vs cost of maintenance. Let's face it, if I had 25 years' assembler or C experience, I'd be wanting to earn more than I do as a .NET/Java/Python dev, which should be a factor.

      2. jelabarre59

        Programmers that require hardware to bail them out of performance problems could do the industry a massive favour and move to another profession.

        Yes, but they'll just move into management.

    6. John Brown (no body) Silver badge

      "Yes, but how many hours of programmer time did it take to do the optimization, and how much money does a few hours of a programmer's time cost versus a few hours of a single CPU core?"

      You seem to be comparing "expensive" optimisation against a single run of a programme. How do you think the numbers stack up when the programme is run every day for a year? Maybe it runs multiple instances for different users? Or it's a product you sell and it's run millions of times per day all over the world? So maybe it cost a few grand to optimise, but it saves millions of hours of CPU time around the world, making your product that much more competitive than others so you sell more.

      1. gvp

        Rule of a million

        Back in the day when I was doing this stuff, I applied a rule of thumb that I called the rule of a million.

        Does it run a million times a year, or does it use more than a million CPU seconds when our largest client runs it? Optimise.

        The rule of a million stood next to the rule of three: if I can't think of three qualitatively different ways to solve a problem, I don't understand the problem. More investigation required.

        (The latter rule almost always led to me using the fifth or sixth approach that I thought of. Think of two ways, others are hard to see. Three, it gets easier.)
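The rule of a million is simple enough to write down as a predicate. The function and parameter names are invented for this sketch, and reading "a million times a year" as "at least a million" is an assumption.

```python
MILLION = 1_000_000


def worth_optimising(runs_per_year, cpu_seconds_largest_client):
    """Rule of a million: optimise if either threshold is reached."""
    return runs_per_year >= MILLION or cpu_seconds_largest_client > MILLION


if __name__ == "__main__":
    # A nightly job that burns 20 minutes of CPU for the largest client.
    print(worth_optimising(runs_per_year=365, cpu_seconds_largest_client=1200))
    # A small routine called three million times a year.
    print(worth_optimising(runs_per_year=3_000_000, cpu_seconds_largest_client=5))
```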

    7. big_D Silver badge

      It depends on what you are doing. If it is a one-off quick and dirty calculation, then the optimization probably doesn't matter.

      If it is a system used by hundreds or thousands (or web based, possibly millions of users), then the time invested by the programmer is very cheap, if he can bring down the processing time.

      Two examples:

      1) a set of financial reporting tools, run every month on users' computers, blocking each computer from all other use during that time. Runtime before optimization: 22 hours, times 250 financial users around the world every month. Optimization: 1 programmer for 2 weeks. Runtime after the optimization: < 3 hours. A saving of 4,750 processing hours per month and around 2,000 man hours of recovered activity involving using their computers. That was 80 hours well invested.

      2) an online shop with 4 load-balanced front-end servers and a big back-end MySQL server. When the PayPal newsletter came out and the shop was listed, the whole thing would keel over and die, when around 250 users were spread over the 4 front end servers - the query to generate the front page menu would go from under 1 second to over 2 minutes and the DBA would put in overtime restarting the MySQL database every few minutes.

      After 4 hours of looking at the code, optimizing some decision trees and re-ordering the "WHERE" clauses of the SQL statements, under a load of over 250 users PER SERVER, the menu query was down to under 500 milliseconds and the loading of the front page was under 4 seconds.

      They could have thrown a bigger database server and more load-balanced front end servers at the problem, but that wouldn't have been economical, especially when a programmer who understood MySQL and processor architecture was let loose on the code and could get that sort of performance improvement for less than the price of a new SAS drive...

      Those are both real-world examples I was involved in. I was brought in to fire-fight both projects, the first an MS-DOS based system, written in BASIC by FORTRAN mainframe programmers and maintained by COBOL mainframe programmers. Having someone who actually understood PC architecture and where the weaknesses were (video output was the biggest bottleneck) made that huge difference.

      Likewise, the second one was a couple of years back. The code was elegant and easy for a human to read, but the devs had little or no knowledge of processor architecture (and how to optimize PHP to work more efficiently) and little to no knowledge of optimizing MySQL. Quickly re-ordering the queries and some ifs and loops was all that was required; it was still elegant and easy for a human to read, but more importantly, it was also efficient for a computer to read and execute.
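The figures in the first example above (the financial reporting tools) can be checked with a little arithmetic. The runtimes and user count come from the comment; treating "< 3 hours" as exactly 3 and a working week as 40 hours are assumptions, and the function name is made up.

```python
def monthly_hours_saved(before_hours, after_hours, users):
    """Processing hours saved per month across all users."""
    return (before_hours - after_hours) * users


if __name__ == "__main__":
    saved = monthly_hours_saved(before_hours=22, after_hours=3, users=250)
    programmer_hours = 2 * 40  # two weeks at an assumed 40 hours per week
    print(f"processing hours saved per month: {saved}")
    print(f"programmer hours invested once: {programmer_hours}")
    print(f"hours saved per programmer hour, first month: {saved / programmer_hours}")
```

Because the runtime after optimization was under 3 hours, the 4,750 figure is a lower bound, and the saving recurs every month while the programmer cost is paid once.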

  3. J. R. Hartley

    Quantum

    Where does quantum computing fit into all of this?

    1. ThatOne Silver badge
      Devil

      Re: Quantum

      > Where does quantum computing fit into all of this?

      Principle of uncertainty applies...

    2. Ian Johnston Silver badge

      Re: Quantum

      It relies on fusion power and Linux on the Desktop. As soon as those have been cracked, quantum computing will be along. Well, in ten years.

      1. jake Silver badge

        Re: Quantum

        They'll all be delivered in a flying car, driven by a household robot which can also dig the spuds, do the dishes, fetch me the mail and a beer, and change the sprog's nappies ... all in one unit.

      2. Doctor Syntax Silver badge

        Re: Quantum

        Linux has been on my desktops and laptops for years. A glance out of the window shows that fusion power is ticking along nicely, as it has been all these billions of years. So what's holding up quantum computing?

        1. Tomato42
          Trollface

          Re: Quantum

          that uneven spread of The Future

    3. redpawn

      Re: Quantum

      Fusion powered Quantum processors cooled with black hole technology allow quantum computing to obey Moore's Law for the foreseeable future. Smart coding can wait.

      1. bombastic bob Silver badge
        Coat

        Re: Quantum

        actually, you just need to make the inside bigger than the outside. Voila!

      2. John Brown (no body) Silver badge

        Re: Quantum

        "Fusion powered"

        Really? That's so last century. All the cool kids are working on Zero Point Energy these days. It's the next (in only 50 years!!) big thing.

    4. Anonymous Coward
      Anonymous Coward

      Re: Quantum

      It will or it won't or it will and won't at the same time though you can't be sure even when looking at it.

    5. Annihilator

      Re: Quantum

      In one of the universes, it's already here.

  4. Anonymous Coward
    Anonymous Coward

    DEC Fortran

    There was a time that the DEC Fortran compiler was the bleeding edge of compiler design and every version was eagerly awaited, so we could eke out more performance from code. I was involved in important finance applications which depended on the quality of the code and the performance on whatever hardware DEC could provide. In short, we crafted and cared about the code quality, the efficiency of compiler optimization & linkers and algorithm performance in user code and the libraries. We even had image profilers to optimize application images for specific hardware characteristics.

    The world reverted to BASIC and interpreters and a multi-billion dollar company was created with bloated horrible code. C was mainstream for no good reason and then C++ and the whole "object" thing got out of hand and code got worse and more bloated. Hardware got faster and cheaper and no one cared, least of all MS who were taking over the world and re-aligning the industry's concept of quality and performance massively downward. Java VMs came for another level of abstraction from the HW and for more sluggishness and massive unreliability. Then we all went WWW and more interpreted and useless code was foisted upon us. JIT got in there to help but it hardly made a jot of difference.

    It's all so blurry now ... why did it go so wrong?

    The answer is "Russian Programmers", or at least that mindset.

    /end of rant

    1. jake Silver badge

      Re: DEC Fortran

      I know folks who say that that DEC Fortran compiler is still the bleeding edge of compiler design. At least a couple of them are in Redmond working on the Windows kernel.

      Russian programmers aren't the answer. They are still working with stolen ideas, and have come up with few of their own.

      1. Anonymous Coward
        Anonymous Coward

        Re: DEC Fortran

        I am sad to report that I haven't seen a VMS/Rdb machine since 2007 as I was forced over to the dark side of WindowsServer/SQLServer. Money and opportunity were the principal reasons. It was horrible, 8 years of horror, but mainly because of the MS fanboys of limited skill and even more limited knowledge with whom I had to share oxygen.

      2. Anonymous Coward
        Anonymous Coward

        Re: DEC Fortran

        The "Russian Programmer" reference was from the Michael Lewis book where one of the principal characters is a Russian programmer who himself is infamous.

    2. YetAnotherJoeBlow

      Re: DEC Fortran

      So, back full circle. It's nice to know that the way I was taught in the 70's and 80's is back in vogue again. I never lost sight of that and continue to this day writing fail-safe code.

      I have fond memories of DEC FORTRAN. Both F IV and F77. I used to burn EPROMS under RT11 and RSX.

    3. Alan J. Wylie

      Re: DEC Fortran

      Back in the early 80's, I used DEC FORTRAN on a VAX-11/780 developing an early Geographic Information System (as it is called these days). I remember one program (interpolating spot heights on a grid from contour lines, perhaps) which did a lot of looping over arrays. There was a DEC supplied program that drew a text based representation on a VT100 of pages being swapped (paged?) in and out of memory (we originally had a huge 512kB, later expanded to 3/4 of a MB). You could see when you had your array indices the wrong way round, pages were rapidly swapped in and out all over the place, rather than a neat little chunk with pages being added at the end and lost at the beginning.
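
      The effect on that paging display can be imitated with a toy model. This is only a sketch under assumed numbers: the page size, the four-page resident set and the LRU eviction are illustrative choices, not a description of the real VAX pager, and page_faults is a hypothetical helper. It sweeps a column-major (Fortran-style) array twice, once with the row index in the inner loop and once with the column index in the inner loop, and counts how often a page has to be brought in.

```python
from collections import OrderedDict


def page_faults(n_rows, n_cols, row_index_innermost,
                page_elems=128, resident_pages=4):
    """Count page faults for one full sweep of a column-major array.

    page_elems and resident_pages are made-up illustrative values.
    """
    resident = OrderedDict()  # pages currently in memory, oldest first
    faults = 0
    if row_index_innermost:
        outer, inner = n_cols, n_rows
    else:
        outer, inner = n_rows, n_cols
    for o in range(outer):
        for i in range(inner):
            row, col = (i, o) if row_index_innermost else (o, i)
            # Column-major layout: element (row, col) lives at col * n_rows + row.
            page = (col * n_rows + row) // page_elems
            if page in resident:
                resident.move_to_end(page)        # recently used, keep it
            else:
                faults += 1
                resident[page] = True
                if len(resident) > resident_pages:
                    resident.popitem(last=False)  # evict least recently used
    return faults


faults_sequential = page_faults(256, 256, row_index_innermost=True)
faults_strided = page_faults(256, 256, row_index_innermost=False)
print(faults_sequential, faults_strided)
```

      With these assumed sizes the sequential sweep touches each page once, while the strided sweep faults on every single access, which is the "pages swapped in and out all over the place" picture.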

      That computer (about 1MIPS) and memory were enough to run an interactive line following digitising program as well as several developers simultaneously editing and compiling.

      1. djnapkin

        Re: DEC Fortran

        512KB? We used to dream of having that much memory around 1980. The Unisys mainframe was as big as several cupboards, and had 192KB of memory. Each memory card was 16KB but was the size of what would be a large motherboard today.

        Sure taught us how to make sure our programs were optimised.

        1. Anonymous Coward
          Anonymous Coward

          Re: DEC Fortran

          We lived in shoebox in middle of road!

          But seriously, I worked on a S/34 writing RPG II code where we only got 1 (sometimes 2) compiles per day. The S/34 we used was maxed out (256K of RAM and 256MB) and Big Blue had no upgrade path. Sometime after that they invented the S/36 but that was a long time coming.

          I was glad to be out of the place, but it sure taught me to appreciate some good skills. There is just so much you can do with 99 indicators.

          1. big_D Silver badge

            Re: DEC Fortran

            Shudder. I remember S/36 and RPG II and III.

    4. Graham Newton

      Re: DEC Fortran

      My final year project at Uni was a mathematical model of the human eye at low light intensity. Like the article it relied on loops within loops. The university computer was a DEC 10 and the program was written in Fortran.

      I spent a lot of time ensuring that the program would run to completion without intervention. Unlike my fellow students who would babysit their programs overnight.

      However my program consumed 12 hours of run time and was terminated. I got a "see me" email.

      I was worried, this was my final year project.

      They didn't bollock me but suggested that I send my program to Manchester University. Not knowing about modems I thought I had to send my program by post but was put straight on this.

      After a compile failure a CDC (Control Data) machine executed my program in less than a second.

      This taught me that you had to program to the machine, not try to make it do things it wasn't built to do.

      So for example I have:

      Used the Transputer's ability to do 2D memory manipulation, parallel processing and CPU core linking to do image processing for world class astronomical telescopes.

      Produced a minimal memory and CPU cycle timeslicing OS to run experiments on the Cassini Huygens lander.

      Programmed SHARC DSPs to operate on multiple audio streams concurrently using the SIMD mode.

      Normal programming makes me weep, it's sledgehammer all the way, no artistry, no finesse.

      I sit and wonder WTF when it takes several seconds for a word document to load.

      1. Anonymous Coward
        Anonymous Coward

        Re: DEC Fortran

        I feel unworthy ...

    5. big_D Silver badge

      Re: DEC Fortran

      We were using it for seismic surveys of oil and other mineral fields (predominantly oil). You needed to eke every last millisecond out of the calculations, because they would tie up the computer room full of VAXes for hours at a time.

      I related above, but the optimization was so good, one mainframe sales-rep went away with his tail between his legs. They gave us a mainframe to play with and a test-suite to run in parallel on a spare VAX. The test-suite should run for a week on the mainframe and a few weeks on the VAX. We should call him in a week, when the mainframe was finished.

      When he got back to his office an hour later, there was a message for him to call us, the VAX was finished. The DEC FORTRAN compiler had looked at the code, worked out that 1) no input, 2) fill random array, 3) no output meant that 2 was superfluous and optimized that out of the executable, which was essentially empty and took less than a second to run...

    6. Greybearded old scrote

      Re: DEC Fortran

      I was going to protest the slur against a whole nation.

      Then I remembered the person who trashed performance on my favourite language with inefficiently implemented extended object libraries and an absolutely hideous ORM. He is, in fact, from Russia. A single datum I know, but it's the only one I have.

      Sadly both of those lardy things are generally seen as the One True Way these days. I'd fail at interview if I even mentioned any misgivings about them.

    7. Dan 55 Silver badge

      Re: DEC Fortran

      There's nothing that says that BASIC has to be interpreted, Dartmouth wasn't and the ones running on the 1970s mainframes weren't. CBASIC on CP/M wasn't either. The late 1970s-1980s computer versions were interpreted (and Microsoft did many of those so you know where the blame lies) but then later on they became compiled too as home computers and PCs became more powerful.

      1. big_D Silver badge

        Re: DEC Fortran

        Microsoft was selling a BASIC compiler throughout the 80s. Many computers came with BASIC interpreters built into ROM, including the original IBM PC (standard configuration was a cassette port and BASICA in ROM, no floppy drive - well, and no keyboard or display either, everything was extra).

        But Microsoft also sold a compiler for CP/M and MS-DOS. We had code that ran on HP-125 (CP/M), HP-150 (MS-DOS), HP Vectra (sort of IBM compatible, only not very, MS-DOS) and IBM PCs (PC-DOS). All were compiled using Microsoft's BASIC compiler and you had to replace the header file, which contain the definitions (strings with escape codes) for accessing the screen (moving the cursor, clearing the screen, inverse video, bold etc.) for each platform.

  5. jake Silver badge

    They are eyeballing the problem from the wrong orthagonal perspective ...

    "the tech industry needs software performance engineering, better algorithmic approaches to problem solving, and streamlined hardware interaction."

    None of those is the "top of the compute stack". The real TOTCS is BKAC ... the nut behind the wheel. The human.

    What we need to start doing is to teach humans to actually use computers to get more out of them. And by use computers, I don't mean facebook, twitter and power point etc.; nor do I mean iFads and Fandroids. I mean actually learning how to use computers.

    It'll never happen, alas. Not as long as marketing is still runni ... oh! SHINY!

    1. Anonymous Coward
      Anonymous Coward

      Re: They are eyeballing the problem from the wrong orthagonal perspective ...

      So it can't be the real TOTCS because humans don't compute. Otherwise, we wouldn't have computers in the first place. So the article is still correct.

      1. John Brown (no body) Silver badge

        Re: They are eyeballing the problem from the wrong orthagonal perspective ...

        Before digital electronic computers we had electric analogue computers. Before those, we had mechanical computers. Before those, "computer" was a job title.

        1. Anonymous Coward
          Anonymous Coward

          Re: They are eyeballing the problem from the wrong orthagonal perspective ...

          That was then. This is now. You underestimate the rate of dumbing of the collective human intelligence. Ask one of today's "experts" to figure out trigonometry with a slide rule.

    2. Greybearded old scrote
      FAIL

      Re: They are eyeballing the problem from the wrong orthagonal perspective ...

      I disagree, there's no excuse for requiring the mundanes to understand the innards, any more than they have to be able to strip and reassemble their car. (Yes some do, but it's optional.)

      Given that what is in your pocket would have qualified as a supercomputer not so long ago we ought to be able to have shiny that is also amazingly fast. We don't, for the sorts of reasons detailed in the article.

      We, as an industry, have failed everybody.

  6. Duncan Macdonald

    Look first at the problem

    Often a huge speedup can be obtained by spending a few minutes thinking about the problem before starting the design.

    Many years ago the company that I was working for needed to do a lot of computation on a few days' values from an Oracle database that had multiple years of data. The code that the consultants came up with worked - but would have taken over 2 weeks to produce the results as the main table was joined to itself in the query in a way that negated the speedup of the indexes. A bit of thinking and a much smaller table was produced by selecting only the required days from the main table and running the query using that table instead. That reduced the time required from over 2 weeks to under half an hour.
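
    The shape of that fix can be sketched as follows. This is a rough illustration only: it uses SQLite from the Python standard library instead of Oracle, and the readings table, its columns, the day range and the day-on-day difference are all invented for the example, since the original query isn't shown. The idea is just to copy the few required days into a small working table first and run the self-join against that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "main table": 1000 days of values for 5 sensors.
conn.execute("CREATE TABLE readings (day INTEGER, sensor INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(day, sensor, float(day * 10 + sensor))
     for day in range(1, 1001)
     for sensor in range(5)],
)

# Step 1: pull only the days of interest into a small working table.
conn.execute(
    "CREATE TEMP TABLE recent AS "
    "SELECT day, sensor, value FROM readings WHERE day BETWEEN 998 AND 1000"
)

# Step 2: run the self-join against the small table, not the main one.
rows = conn.execute(
    """
    SELECT a.day, a.sensor, a.value - b.value AS change
    FROM recent AS a
    JOIN recent AS b
      ON b.sensor = a.sensor AND b.day = a.day - 1
    ORDER BY a.day, a.sensor
    """
).fetchall()

print(len(rows), rows[0])
```

    The join now only ever sees 15 rows rather than 5000, whatever the optimiser decides to do with the indexes.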

    Another system was monitoring temperatures in a power station - the original spec had all the temps being monitored every second which was too much for the low performance mini computer of the time (early 1980's). A bit of looking at the items being monitored showed that many did not need a high scan rate (if the concrete pressure vessel temps are changing significantly in under 30 secs then it is well past time to run like hell!!). Changing the spec so that only the required items were scanned at high speed made the job easy for the computer to handle.
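
    A minimal sketch of that kind of tiered scan table, assuming made-up point names and periods (nothing here comes from the real plant spec, and points_due is a hypothetical helper): each monitored item gets its own scan period in seconds, and on every one-second tick only the items that are due get read.

```python
def points_due(tick, scan_periods):
    """Return the names of the points whose scan period divides this tick."""
    return [name for name, period in scan_periods.items() if tick % period == 0]


# Illustrative scan periods in seconds; the names and values are assumptions.
SCAN_PERIODS = {
    "turbine_bearing_temp": 1,           # fast-moving, scan every second
    "feedwater_temp": 5,
    "pressure_vessel_concrete_temp": 30,  # slow-moving, scan rarely
}

# Total reads over one minute with tiered rates versus scanning everything
# every second.
scans_tiered = sum(len(points_due(t, SCAN_PERIODS)) for t in range(60))
scans_flat = 60 * len(SCAN_PERIODS)
print(scans_tiered, scans_flat)
```

    Even with only three points the read count per minute drops from 180 to 74; with hundreds of slow-moving points the saving is what makes the job fit on the machine.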

    If you have a job that is going to heavily load a computer system then it is often worth spending some time to try to understand the problem (not just the spec) and see if there are any obvious inefficiencies in the spec that can be easily mitigated before starting coding.

    1. Julz
      Joke

      Heretic

      " spending a few minutes thinking about the problem before starting the design"

      How can you move fast and break stuff efficiently if you stop and think first! As for a design, what the hell...

      1. TwistedPsycho

        Re: Heretic

        If you don't work fast, you can't take on more underpaid jobs, and then who is going to keep the Vodka flowing!

  7. ratfox
    Pint

    The authors stress the need for hardware makers to focus on less rather than Moore.

    *clap clap clap*

  8. Ian Johnston Silver badge

    Anyone who multiplies two matrices like that is an idiot. Doing it properly was Exercise 1 in the IBM "How to use a supercomputer" course I took at the Rutherford Lab's place in Abingdon in 1989.

    1. KorndogDev

      This!

      Anyone who does not use an optimized library dedicated to numerical computations is an idiot. And of course even Python has one.

      1. bombastic bob Silver badge
        Meh

        Re: This!

        the example from the article in Python was an attempt at showing a gross performance method, vs optimized code. Use of "yet another Python lib" isn't helping, nor a solution, to this kind of gross inefficiency. You simply do NOT use an interpretive lingo like Python, with slow code repeated a zillion times within a big loop, like that example does. You write it properly, with efficient methods, in a lingo that's capable of rapidly and efficiently performing the necessary calculations. I believe THAT was the point.

        So, write it in C, using a hand-tweaked threaded algorithm, and inline assembly for the innermost parts. That'll do nicely!

        Based on this example, I'd say there are 2 kinds of programmers: Those who code low-level efficient code for things like kernel drivers and microcontrollers, and those who don't. The ones who don't often use inefficient lingos like Python and Javascript and THEN claim that "libraries" will somehow 'fix' the inefficiency. But they never do.

        (am I the only one who got that out of this portion of the article?)

        1. DavCrav

          Re: This!

          "So, write it in C, using a hand-tweeked threaded algorithm, and inline assembly for the innermost parts. That'll do nicely!"

          As I mentioned in a previous comment, C code, hand optimized, is about 100x or more too slow compared with a big brain (not mine) and assembler code that is optimized to hell, using each level of cache perfectly and only calling from RAM as and when needed.

          1. Paddy

            Re: This!

            > ... using each level of cache perfectly

            So this is code for only one CPU cache setup? That's a pretty precise hardware spec, not very useful to others in general.

            1. DavCrav

              Re: This!

              No, he uses the minimum amount that exists over all standard cores. What I meant by using each level perfectly is doing the right operation in the right level, and maximizing the number of operations done before moving the piece elsewhere.

              Almost all of the algorithm is memory management, trying to minimize I/O, because that's the real bound on the algorithm, it's not mathematical operation bound. He has to create his own scheduler in some sense, because there's no 'off the shelf' program that does what he needs. He needs to make sure that the piece of data hits the right core at the right time so it's used efficiently.

              He is, arguably, the leading authority on multiplying large matrices. His programs (Meataxe64) are the ones used by all people who want to manipulate big matrices.
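
              For anyone unfamiliar with the general idea of doing as much work as possible on a piece of data before moving it, here is a generic loop-tiling sketch. It is emphatically not Meataxe64's algorithm, and pure Python will not show any real cache benefit; the tile size and function names are illustrative assumptions. It only shows how the loops are restructured so that each small block of the matrices is reused before the next block is touched, while the result stays identical.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same arithmetic, but walked block by block (tile x tile pieces)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # All work on this trio of blocks happens before moving on.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


# Small integer matrices with sizes that are not multiples of the tile size.
A = [[i + j for j in range(10)] for i in range(7)]
B = [[i * j + 1 for j in range(5)] for i in range(10)]
print(matmul_tiled(A, B) == matmul_naive(A, B))
```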

        2. KorndogDev
          Go

          Re: This!

          No Honey, there is a reason why specialized libraries exist and are used by millions of people. One day you will discover it, I am sure of it.

          Meanwhile, you can still hope that your super-optimised hand-crafted code from last night is bug free and one day it will serve next generations.

    2. Anonymous Coward
      Anonymous Coward

      Yeh, not a good choice, real world problems tend to be sparse matrix and matrix multiply is implemented as a parallel algorithm nowadays anyway. So why mention that old brute algo and then mention that better algos exist???

      I honestly don't get the point of the article.

      " argue that the tech industry needs software performance engineering, better algorithmic approaches to problem solving, and streamlined hardware interaction."

      Yeh, but that's exactly what it is now. A new algo is designed, the hardware gets new APUs to handle it. How is that not what happens now?

  9. Gene Cash Silver badge

    They shoot themselves in the foot

    So they go on and on about how conventional CPUs should be improved to handle vector calculation... then run their code in an instant on a GPU, which is DESIGNED for massive parallel vector calculation.

    This is like saying "you don't need a floating point unit, the CPU should do that" or "you don't need a GPU, the CPU should do that"

    I think they should get out of their ivory tower.

    1. Joe W Silver badge

      Re: They shoot themselves in the foot

      That was the paper in a nutshell. A specialised processing unit beats a general purpose CPU hands down. Not only for this task, but also for others. Same with optimised libraries, moreso if it's a problem that is not embarrassingly parallel (multiple instances of the same code to speed it up).

      No, I don't find it surprising. It is a good reminder, but bleedin obvious. No wonder Science published it...

    2. djnapkin

      Re: They shoot themselves in the foot

      I had a rather different take on the article to you. I thought it was well laid out and covered the progress of the optimisation, with great clarity. I'd say the results from optimising on a multi threaded CPU were impressive. The overall message of optimising your software was well carried.

      Threading is beyond many programmers, and running on a GPU is surely a specialised art - and I am not sure how many servers, either inhouse or cloud, would have GPUs. Perhaps they do. I just have not heard of that being a thing.

      1. Anonymous Coward
        Anonymous Coward

        Re: They shoot themselves in the foot

        GPUs as a specialized server calculation unit have been a thing for a number of years now. Thus nVidia's Data Center GPUs designed for HPC and Deep Learning functions and so on.

    3. John Brown (no body) Silver badge

      Re: They shoot themselves in the foot

      "This is like saying "you don't need a floating point unit, the CPU should do that" or "you don't need a GPU, the CPU should do that""

      It looked to me like they were demonstrating not just optimisation but strongly emphasising the types of optimisation. They then showed what happens when you reach the end of the path on a general purpose CPU and went on to show how specialised, dedicated processors could further optimise specific tasks in specific ways. So, as we reach 5nm and likely can't go smaller, there are fewer methods of optimising the hardware of general purpose CPUs so software people need to concentrate on optimising their code and the hardware people need to step up with more and better specialised processors.

      You only have to look at GPUs, why they were invented and what they are now used for. Likewise in audio and similar waveforms. DSPs do that very well, but you can do signal processing, albeit more slowly on a general purpose CPU. Or Cryptoprocessors designed for, well, you get the idea.

      1. Anonymous Coward
        Anonymous Coward

        Re: They shoot themselves in the foot

        But then there's the old tradeoff: ASICs are good at what they do, but what happens when the job you need shifts away from the ASIC's specialty? That's why the big push for general-purpose computing in the first place. Sure, you end up with the Jack-of-All-Trades problem, but at least you're likely to find something for it to do that would make any ASIC choke because the job at hand isn't within their realms of expertise. Put simply, there's a reason the world still has General Practitioners along with Specialists.

  10. Anonymous Coward
    Anonymous Coward

    C rocks.

    It really does.

    1. djnapkin

      Re: C rocks.

      Yes it does, until the wrong subscript variable is used for an array index and you can't figure why you occasionally get memory corruption in a large program, causing disaster.

      Not that this ever happened to us.

      1. Anonymous Coward
        Anonymous Coward

        Re: C rocks.

        Well, it's the old tradeoff. If you want to go all out, you can't have safeguards get in your way.

      2. Rich 2 Silver badge

        Re: C rocks.

        ... whereas in some “higher level” (scripty) languages I can think of, the program doesn’t crash - you just get gibberish because undefined variables magically come into existence!! That’s even harder to debug. At least when C crashes, you can get a core dump out of it which often points to the problem pretty quickly.

      3. Tom 7

        Re: C rocks.

        You can do unit testing in C too you know!

        1. Anonymous Coward
          Anonymous Coward

          Re: C rocks.

          But then you're gonna have to test the unit test as well. Since C doesn't include its own safeguards, you can easily end up in a Turtles-All-The-Way-Down situation.

    2. munrobagger

      Re: C rocks.

      C is just a well dressed assembler, but same rules apply.

      I worked on a C application that wrote large chunks of csv numbers. Turns out the implementation used strcat (yes that old) to write each number to an ever growing buffer string. But strcat uses the null terminator to find the end of a string, so each additional write required strcat to start at the beginning of the buffer and scan all the way through it, before appending. Performance was horrendous when the chunks got large. The simple solution is to maintain a pointer to the end of the buffer string, and modern libraries will do it all for you, but it was a very good lesson on the need to know what is happening under the hood.
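
      The cost difference can be imitated outside C. The following is a Python emulation on a bytearray, not the original code: strcat_like and append_at are hypothetical helpers, and the field format is invented. The first helper finds the NUL terminator by scanning from the start on every call, as strcat does; the second is handed the end offset and appends directly.

```python
def strcat_like(buf, text):
    """Append text the strcat way: scan from the start for the NUL first.

    Returns how many bytes had to be scanned to find the terminator.
    """
    end = buf.index(0)  # linear scan over everything written so far
    buf[end:end + len(text) + 1] = text + b"\0"
    return end


def append_at(buf, end, text):
    """Append text at a known end offset and return the new end offset."""
    buf[end:end + len(text) + 1] = text + b"\0"
    return end + len(text)


# Invented workload: 1000 comma-terminated numbers.
fields = [b"%d," % n for n in range(1000)]
size = sum(len(f) for f in fields) + 1  # +1 for the final NUL

slow = bytearray(size)
bytes_scanned = 0
for field in fields:
    bytes_scanned += strcat_like(slow, field)

fast = bytearray(size)
end = 0
for field in fields:
    end = append_at(fast, end, field)

print(slow == fast, end, bytes_scanned)
```

      Both buffers end up identical, but the scanning version walks over the existing contents again on every append, so the total work grows with the square of the output size rather than linearly.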

      1. Brewster's Angle Grinder Silver badge

        Re: C rocks.

        When I moved from asm to C and discovered null terminated strings were the norm I had a heart attack. It was a real step down. The problems you outline are just the beginning.

  11. Anonymous Coward
    Anonymous Coward

    Underpinning Ideology

    I know there's a risk that this comment may just sound like the ranting of some old bawbag, but do hear me out.

    About 25 years ago, I worked for a smallish financial institution. The even smaller investment arm was good. Very good. They frequently topped the charts with their performance, using an in-house dBase application (eventually Clipper) that they had honed over 8 or 9 years by that stage.

    There were frustrations. Data had to be loaded each day from other systems, but the biggest issue was the network IO for each client workstation. This was seen as a Moore's Law issue at the time, so more hardware was thrown at it, but really it was that IO problem.

    After a while, they bought in an Oracle based system running on dedicated minicomputers. The initial budget was £1million and they ran way beyond that by the time the new system was working.

    But... Their investment performance didn't so much tail off as fall off a cliff. Whatever their previous modus operandi, the new system simply did not allow it, and within a short time they became also-rans in the investment game.

    I often think of this experience, as around the time of their migration I wanted to explore the possibility of a centralised system for running the Clipper app, like a Linux box running dosemu, which would resolve their IO problems.

    I'm not suggesting that that old Clipper program could have been stretched out indefinitely, but that following a perfectly valid and acceptable path tossed the baby out with the bathwater.

    The main point stretched by this example, though, is that we in technology, I think, are so deeply inculcated with the assumptions of "upgrade", "better", "improved" and various other ideological concepts that are so closely allied to the technology industry that we never even question them. We think such assumptions of improvement are a natural part of our lives, even though we should be able to learn from the conclusions of, say, Enlightenment thinkers who similarly got mired in a philosophy of improvement. We do not stop to question ourselves frequently enough, or ever.

    It may be argued that this ideology of improvement is what has driven the technology industry to achieve the heights it has. That may be the case, but we also need to think about limits to these assumptions, because of other unintended consequences, and also because reason dictates that "believing" in principles like Moore's Law is highly unlikely to be sustainable. We may, ourselves, become part of a problem, if we leave such underpinning assumptions unexamined.

    1. MJI Silver badge

      Re: Underpinning Ideology

      Clipper

      We used it up to XP, we had great database support especially when running on Netware.

      Our data server was Advantage Xbase later Database Server.

      https://en.wikipedia.org/wiki/Advantage_Database_Server

      We were wiping the floor of the SQL based competitors on performance.

  12. Rich 2 Silver badge

    “... tech industry needs software performance engineering, better algorithmic approaches to problem solving, and streamlined hardware interaction”

    I read that and my first thought was “shame the world seems to be migrating to stuff like python“

    ... only to have the issue highlighted a couple of paragraphs down. It’s been said zillions of times before of course, but hacky scripty languages like Python that pay no regard whatsoever to how the underlying hardware works are the reason we all use multi-GHz multi-core power-eating monster computers - just to do a bit of word processing.

    Hacky scripting languages have their place but things would work a lot better if the authors of the bigger (in size and/or time) python applications learned to use a more effective language that runs (for example) 47 times faster!! What a pile of shite modern software is :-(

    1. KorndogDev
      Holmes

      that puzzle

      There is a reason why a guy who is a C language expert decided to create Python (and write it in C).

      Now, YOU go and look for it.

    2. Tom 7

      Modern software is not shite. There are just a huge number of inexperienced programmers out there. As I said in a different comment, I used Python to run a matrix multiplication test that ran a fraction of a millisecond slower than the C++ version - because I had the experience to know it would be a problem and could use a library that would do it as well as possible. If I'd had a GPU I could have used that and the Python would have been even faster - it might even have finished running while I told the C++ version to use that library too. I might even have done the whole thing in ROOT, which would have been quicker than doing it in C++ and fixing compile time runs. Python is a good tool for playing and prototyping and exploring things - but most of the Python ecosystem is written in C++ for a reason - well, my Pip install seems to run GCC more than I do!

      The problem we have today is computing is incredibly complicated and yet people come out of a 3 year university course knowing maybe 10% of what they need to know.

    3. Anonymous Coward
      Anonymous Coward

      "It’s been said zillions of times before of course, but hacky scripty language’s like Python that pay no regard whatsoever to how the underlying hardware works are the reason we all use multi-GHz multi-core power-eating monster computers - just to do a bit of word processing."

      Hacky scripty languages like python and tcl are what form the backbone of silicon design/implementation. Without them you wouldn't have the performance increase seen in CPUs/GPUs over the last 25 years.

      EDA software on the other hand could do with some bottom-up rethink/redesign to better harness the potential of said silicon.

    4. Paddy

      Look! someones lying on the internet!!!

      > "hacky scripty language’s like Python that pay no regard whatsoever to how the underlying hardware works"

      Read the groups frequented by the CPython core developers and you would see your error. Read up about Timsort and you would find that those core developers also look above the hardware at how the language and its libraries are used and optimise that too.

      You may suffer from a narrow view of "to optimise" not shared by all.

  13. Richard Boyce

    Fundamental problem

    Businesses often regard external costs as irrelevant. For example, how much has been wasted by Microsoft because it's cheaper to produce inefficient products when it's the users who are paying for the megawatts of power and waiting for something to happen?

    Even within a company, a manager can get rewarded if his department produces something quick and dirty for some other department to use. The costs are coming out of someone else's budget.

    More competition helps, but we also need user education to accentuate the negative feedback, especially when mother nature is on the receiving end of planned inefficiency.

  14. RichardEM

    The same old problem

    It seems to me, from what I read of the article (I didn't go behind the paywall) and from many of the comments and replies, that people are either saying or implying that the problem is not hardware, software or any other specific thing. The problem is the same one I ran into when I was consulting: asking the right questions. Many times the client would say "we need to do this", but they were not really addressing the problem itself, only the problem with the result of what they were presently doing.

    I was constantly trying to find out what the root of the problem was so I could get them what they needed.

  15. Anonymous Coward

    Python 3

    Looks like the language could benefit from some behind the scenes tweaking in algos and also to make the most of the Hardware and OS that the program is being run on.

  16. DaemonProcess

    optimize / optimise

    There are after-market optimizing compilers for (even) Node, Python and Java - a language is just a language, GC or not. People just tend to use them in the standard manner (interpreted, JIT or whatever). A compilation stage could simply be added to a devops pipeline, if only people trusted their near-to-non-existent testing these days (a-hem iOS and Android).

    I programmed one of the world's first prototype cash machines. It only had 256 12-bit words of RAM in magnetic core storage. Out of that I had to handle screen i/o, cash dispensing and comms. Obviously it had fairly limited functionality and there were separate i/o processors and hardware controllers to handle stuff but it's amazing what skills we have lost. For example, I had to use self-modifying code to save memory, so the screen output routine was essentially the same as the cash dispensing and serial output loops, with a few changed words in the middle as required.

    As for the comment about C being close to Assembler, there was a language in-between those, called NEAT-3 which was like assembler but with variables, it was a fun way to learn about instructions, stacks and algorithms. Also you get millicode and microcode but those are different subjects.

    You are not a real programmer unless you remember when an assembler multiplication of 200x5 could be made faster than 5x200. But then again I did once know a programmer whose idea of a program was a C header followed by 200 lines of un-commented assembler...

    1. Charles 9

      Re: optimize / optimise

      "You are not a real programmer unless you remember when an assembler multiplication of 200x5 could be made faster than 5x200."

      Unless every little cycle counted (in a limited-resource environment, I'll grant you), the difference really wouldn't be all that great (if you take the shift-and-add approach, as both types of instructions are usually pretty cheap time-wise, you'd only need one additional shift-and-add--4+1 versus 128+64+8).
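
      The shift-and-add arithmetic above can be checked with a small sketch. This is an illustration only: the function name `shift_and_add` and the idea of counting one add per set bit of the multiplier are assumptions made for this example, not a model of any particular CPU or assembler.

```python
def shift_and_add(multiplicand, multiplier):
    """Multiply two non-negative integers using only shifts and adds.

    Returns (product, adds), where adds is the number of shift-and-add
    steps, i.e. one per set bit in the multiplier.
    """
    if multiplicand < 0 or multiplier < 0:
        raise ValueError("this sketch handles non-negative integers only")
    product = 0
    adds = 0
    shift = 0
    while multiplier:
        if multiplier & 1:
            # This bit of the multiplier is set: add the shifted multiplicand.
            product += multiplicand << shift
            adds += 1
        multiplier >>= 1
        shift += 1
    return product, adds


by_five = shift_and_add(200, 5)          # multiplier 5 = 4 + 1
by_two_hundred = shift_and_add(5, 200)   # multiplier 200 = 128 + 64 + 8
print(by_five, by_two_hundred)
```

      With 5 as the multiplier the loop performs two adds, with 200 it performs three, which is the "one additional shift-and-add" mentioned above.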

    2. G.Y.

      ASM e: optimize / optimise

      I heard that, when the law "thou shalt write COBOL" came out, lots of COBOL programs were written where line 1 was "enter assembler" and all else was assembly code

    3. Brewster's Angle Grinder Silver badge

      Re: optimize / optimise

      Self modifying code was a lot of fun. But an absolute bastard to maintain. And let's face it: memory, even cache, isn't in short supply and writing to the code segment is a security nightmare.

    4. John Smith 19 Gold badge
      Unhappy

      "I programmed one of the world's first prototype cash machines."

      Was that the IBM one with the one line dot matrix LEDs, or were they still using a printed roll of instructions?

      But TBH I only know either by reputation.

  17. iron

    > the MIT researchers wrote a simple Python 2 program

    At which point they lost all credibility. At least use a currently supported version of a language, not a very old and out of date one, and preferably a language that is actually up to the task put before it. Python is not a suitable language for matrix maths which is about all they proved.

    1. Anonymous Coward

      This car won't go! (key?)

      and you not reading the following paragraphs lost you all credibility. A starting point for them should not be the ending point for you, unless you really didn't want to know what the article said. Go read the article.

    2. Anonymous Coward

      Don't get me started on language bloat. There are too many languages and there seems to be a new one every year claiming to be better and different than last year's language.

  18. Alister

    deader than corduroy bell bottoms

    WHAT!

    why did nobody tell me?

    1. John Brown (no body) Silver badge
      Windows

      Re: deader than corduroy bell bottoms

      You are an OU lecturer and ICM£5

  19. Robert Grant

    The code, they say, takes seven hours to compute the matrix product, or nine hours if you use Python 3. Better performance can be achieved by using a more efficient programming language, with Java resulting in a 10.8x speedup and C (v3) producing an additional 4.4x increase for a 47x improvement in execution time.

    A "more efficient language"? It's not the language, it's the runtime. Run that in PyPy and see the difference between that and CPython. Any techies on the staff?

  20. Anonymous Coward

    Oozlum Computing

    Many years ago, when dinosaurs still roamed the planet, I recall a project using one of these new-fangled computers to improve an inspection stage that took a trained inspector four days to complete (on each part). It involved checking thousands of small (say 2mm) holes drilled in a metal ring. He had to check each was clear using a wire. There were inevitably a few that were not clear but returning to redrill them was a lengthy and expensive operation. The engineers could accept a certain number of blocked holes for each 15deg sector. So poking the wire and then calculating acceptance (no electronic calculators then) took time.

    With the computer, a small photoresistor array and some early LEDs, the job was reduced to four hours in BASIC. Once the principle was proven and accepted as a working solution, the program was rewritten in assembler and the job reduced to 20min. Even then, it was recognised that any high-level language was inefficient in computer time.

    I recall, many years later when the ZX81 came out, how it was possible to program in just 1k of RAM (the aforementioned works m/c had an enormous 32k). I gave up any pretence at programming 30 years ago, other than simple VB spreadsheets, but it never ceases to amaze me how big programmers manage to make simple programs nowadays.

    My point - I reckon we can go a long way by looking at coding. We had the luxury of faster chips making it too easy for too long.

    Anonymous to permit flaming...

  21. Paddy
    Linux

    That's a wrap!

    Those same MIT professors should progress to *wrapping* their orders of magnitude faster solution so it becomes simply callable from Python. That would then allow other scientists and engineers to benefit from superfast matrix multiplication with the ease of a Python function call. It's how people get things done in, for example, data-science in simple "Python" without having to know the intricate details of all of the libraries they are using.

  22. Andy Non Silver badge
    Happy

    Designing the right algorithm always helps

    One of the first programs I wrote for my employer was on an Apple 2 as I remember back in the early 80's. At the time one of their programs was taking around 2 to 3 DAYS continuous processing and I got that down to around half an hour. The software had to reconcile invoice information for the accounts dept, (outstanding invoices against payments made as I recall). Essentially there were two very large sequential lists (in data entry order) to check off against each other. The guy who had written the original software worked his way down the first list and checked every item against the second list to see if it matched the invoice number, so the second list was being searched top to bottom thousands of times. It worked, but the poor algorithm wasted lots of processing time. My approach was to sort both lists first by invoice number, then proceed crabwise down both lists necessitating only one pass of each list. The extra overhead of doing the sorts first was massively outweighed by the subsequent fast comparison. I found the sort algorithm in an old Commodore Pet book - Shell Metzner. Used it quite a lot after that. This was back in the days of sequential files before databases came along.
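
    The two approaches described above can be sketched as follows. This is a minimal illustration with assumptions: the function names and the sample invoice numbers are made up, invoice numbers are assumed to be unique within each list, and Python's built-in `sorted` stands in for the Shell-Metzner sort mentioned in the comment.

```python
def reconcile_nested(invoices, payments):
    """Original approach: scan the whole payment list for every invoice.

    Work grows with len(invoices) * len(payments).
    """
    matched = []
    for inv in invoices:
        for pay in payments:
            if pay == inv:
                matched.append(inv)
                break
    return sorted(matched)


def reconcile_sorted(invoices, payments):
    """Sort both lists first, then walk down them in step, one pass each."""
    a, b = sorted(invoices), sorted(payments)
    i = j = 0
    matched = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # invoice with no payment yet
        else:
            j += 1          # payment with no matching invoice
    return matched


# Hypothetical data, in data-entry order.
invoices = [1007, 1002, 1010, 1004, 1001]
payments = [1004, 1001, 1009, 1007]
print(reconcile_nested(invoices, payments))
print(reconcile_sorted(invoices, payments))
```

    Both functions return the same matched invoice numbers; the second pays for two sorts up front and then needs only a single pass over each list.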

    1. John Smith 19 Gold badge
      Unhappy

      "The guy who had written the original software worked his way down the first list and checked every item against the second list to see if it matched the invoice number,"

      Yup. That's the classic dumbass coding method in a nutshell.

  23. nautica Silver badge
    Boffin

    Why all this thrashing around seeking a suitable language?

    "By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality".--Donald Knuth

    He's talking about assembly language, folks. Can't get more efficient than that. And then...

    "Simplicity and elegance are unpopular because they require hard work and discipline to achieve and education to be appreciated."--Edsger Dijkstra

  24. nautica Silver badge
    Holmes

    The road to hell. It goes on and on and...

    There is a basic, basic philosophical disconnect here, and one which is taking over the whole of engineering and all forms of technological design--COMPLETELY--; to wit:

    ALL PROBLEMS CAN BE SOLVED WITH A SOFTWARE SOLUTION

    Get that? ALL problems...even those problems where a century of history and impeccable safety dictates the use of triply-redundant hardware. Here we have a basic problem, dictated by the immutable laws of physics; and the now-prevalent mind-set (with legitimacy provided by none other than the now-highly-questionable authority known as "MIT") says, "No problem, mon! We'll fix that with software! We can fix anything with software".

    Does the phrase "Boeing 737 Max" ring a bell, boys and girls? ...and MIT?

    The only answer to this mentality is the famous quote of Wolfgang Pauli--

    "This is so bad it's not even wrong."

  25. 89724102172714182892114I7551670349743096734346773478647892349863592355648544996312855148587659264921

    I wonder how long it will be before AIs make all programmers redundant

  26. Anonymous Coward

    Numpy

    Anyone doing serious numerical computations in Python would be using numpy, which can call MKL and other optimised libraries. Also, you would probably use Fortran or C for the number crunching and use Python to glue it together. For real world applications, matrices are often sparse or have a special structure, which allows the use of specialised algorithms that can be orders of magnitude faster.
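
    The point about sparse matrices can be illustrated without any libraries. The sketch below is an assumption-laden toy: the dict-of-dicts layout and all function names are invented for this example, and real work would use NumPy or SciPy's sparse types as the comment suggests. It only shows why skipping zero entries saves work.

```python
def dense_matmul(a, b):
    """Plain triple loop over every entry, zero or not."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def to_sparse(rows):
    """Keep only non-zero entries as {row: {col: value}}."""
    return {i: {j: v for j, v in enumerate(row) if v != 0}
            for i, row in enumerate(rows)}


def sparse_matmul(a, b):
    """Multiply two dict-of-dicts matrices, touching only non-zero entries."""
    c = {}
    for i, a_row in a.items():
        out = {}
        for k, aik in a_row.items():
            for j, bkj in b.get(k, {}).items():
                out[j] = out.get(j, 0.0) + aik * bkj
        if out:
            c[i] = out
    return c


def to_dense(sparse, n_rows, n_cols):
    """Expand a dict-of-dicts matrix back into a list of lists."""
    return [[sparse.get(i, {}).get(j, 0.0) for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical 3x3 matrices with two non-zero entries each.
A = [[2, 0, 0], [0, 0, 3], [0, 0, 0]]
B = [[0, 1, 0], [0, 0, 0], [4, 0, 0]]
sparse_product = sparse_matmul(to_sparse(A), to_sparse(B))
print(sparse_product)
```

    Here the dense loop performs 27 multiply-adds while the sparse version performs two, and both agree on the result.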

  27. The Rest of the Sheep
    Holmes

    Hal Hardenberg Rides Again

    This article should sound familiar to anyone who remembers the DTACK Grounded newsletter.

    Its Editor (nom de plume FNE) spent many pages extolling the virtues of assembler and efficient programming. I've been hiding from COVID for a while and lost track of time, but didn't realize I was back in the early eighties again. DTACK is archived at http://www.easy68k.com/paulrsm/dg/ should you desire a '60s electrical engineer's take on the "software efficiency doesn't matter, the chips will always get faster" argument.

  28. Grumpy Rob

    Horses for courses

    One problem I've seen with software development is the old "when you've got a hammer everything looks like a nail". Young programmers learn one language, and think that it's applicable to all problems they're given. So I've seen what should have been a simple web application run like an absolute dog because it used MEGABYTES of Javascript and MEGABYTES of HTML to render a few simple and small tables. But the developer was using Java/Swing (I think) and some client side libraries that were HUGE. Who needs click-sortable columns on a table with typically three or four lines of data?? But the developer clearly didn't know any better.

    Once you know a thing or two you can select Python for quick and dirty one-off jobs - it doesn't matter if it takes a few hours to extract/migrate data if you're only doing it once. While for a production task you may pick something more efficient and suited to the task - and (gasp!) actually do some thinking at the design stage.

    One of my early jobs (more than 30 years ago) was writing the telemetry driver for a SCADA system that had three dual-CRT operator consoles. All written in assembler and fitted into 256k bytes of core memory.. including the OS. And the Interdata 32 bit mini had its performance measured in DIPS (dozens of instructions per second). As with a previous poster, when I sit waiting long seconds for a 4 page Word document to load on an i5 machine with 8Gb of RAM I just shake my head in amazement/disgust.

  29. Anonymous Coward

    "Better performance can be achieved by using a more efficient programming language, with Java resulting in a 10.8x speedup and C (v3) producing an additional 4.4x increase for a 47x improvement in execution time."

    Well, "more efficient programming language" and Java don't really click. Java has always been shit and it is time everyone on the market realize this ...

  30. John Savard

    Oh, dear.

    I had read claims that the design of the Python interpreter was so advanced, code written in Python ran as fast as compiled code. Apparently that was mistaken.

  31. PaulVD
    Facepalm

    Amazing how many smart El Reg readers missed the point

    The authors picked a really simple problem for which we have a lot of analysis and some very good solutions, and showed that a really poor algorithmic choice falls far outside the achievable frontier. No doubt they did a bit of searching over languages etc to find a really bad starting point.

    But it is beside the point to argue that they should have used a modern BLAS library, a better language, and other optimisations that are obvious to all of us. They showed that there are design choices which make orders-of-magnitude differences to the performance of this very simple and well-understood problem.

    But now, apply that to problems that are not well-understood and for which there are no conveniently pre-optimised libraries: the database structures from which you extract that complicated query, or the nonlinear pattern-matching algorithm, or whatever programming and software design task you get paid for. Can thinking more carefully about your fundamental approach to the data structures or the mathematics yield orders of magnitude improvements? Given that we can no longer count on major improvements in future processing speed, we will have to depend on improving our high-level thinking about data structures, algorithms, and suitable programming languages.

    This is a very self-evident point, for which the authors have offered a correspondingly trivial example. My initial thought was that the article was not interesting enough to be publishable. But a surprising number of commentators have attacked the example and missed the underlying point, so perhaps the point is not as self-evident as it ought to be.

    1. Anonymous Coward

      Re: Amazing how many smart El Reg readers missed the point

      "This is a very self-evident point, for which the authors have offered a correspondingly trivial example. My initial thought was that the article was not interesting enough to be publishable. But a surprising number of commentators have attacked the example and missed the underlying point, so perhaps the point is not as self-evident as it ought to be."

      Plus there are the fundamental limits that make truly tackling these feats a matter of engineering: budgets and deadlines. IOW, taking a few minutes may mean missing the deadline...

  32. Torben Mogensen

    Dennard scaling

    The main limiter for performance these days is not the number of transistors per cm², it is the amount of power drawn per cm². Dennard scaling (formulated in the 1970s, IIRC) stated that this would remain roughly constant as transistors shrink, so you could get more and more active transistors operating at a higher clock rate for the same power budget as transistors shrink. This stopped a bit more than a decade ago: Transistors now use approximately the same power as they shrink, so with the same number of transistors in a smaller area you get higher temperatures, which requires more fancy cooling, which requires more power. This is the main reason CPU manufacturers stopped doubling the clock rate every two years (it has been pretty much constant at around 3GHz for laptop CPUs for the last decade). To get more compute power, the strategy is instead to have multiple cores rather than faster single cores, and now the trend is to move compute-intensive tasks to graphics processors, which are essentially a huge number of very simple cores (each using several orders of magnitude fewer transistors than a CPU core).

    So, if you want to do something that will benefit a large number of programs, you should exploit parallelism better, in particular the kind of parallelism that you get on graphics processors (vector parallelism). Traditional programming languages (C, Java, Python, etc.) do not support this well (and even Fortran, which does to some extent, requires very careful programming to do so), so the current approach is to use libraries of carefully coded code in OpenCL or CUDA and call these from, say, Python, so few programmers would even have to worry about parallelism. This works well as long as people use linear algebra (such as matrix multiplication) and a few other standard algorithms, but it will not work if you need new algorithms -- few programmers are trained to use OpenCL or CUDA, and using these effectively is very, very difficult. And expecting compilers for C, Java, Python etc. to automatically parallelize code is naive, so we need languages that from the start are designed for parallelism and do not add features unless the compiler knows how to generate parallel code for these. Such languages will require more training to use than Python or C, but far less than OpenCL or CUDA. Not all code will be written in these languages, but the compute-intensive parts will, while things such as business logic and GUI stuff will be written in more traditional languages. See Futhark.org for an example of such a language.

    On a longer term, we need to look at low-power hardware design, maybe even going to reversible logic (which, unlike irreversible logic, has no theoretical lower bound of power use per logic operation).

    1. Charles 9

      Re: Dennard scaling

      But what happens when you get caught between Scylla and Charybdis: stuck with an inherently serial job that requires a lot of raw computing power BUT can't be parallelized? Or even just a job that is highly serial (like high-ratio compression, including video compression)?

      1. Torben Mogensen

        Re: Dennard scaling

        Video compression is not really highly serial. The cosine transforms (or similar) used in video compression are easily parallelised. It is true that there are problems that are inherently sequential, but not as many as people normally think, and many of those that are are not very compute intensive. It is, however, true that not all parallel algorithms are suited to vector parallelism, so we should supplement graphics processors (SIMD parallelism) with multi-cores (MIMD parallelism), but even here we can gain a lot of parallelism by using many simple cores instead of few complex cores.

        But, in the end, we will have to accept that there are some problems that just take a very long time to solve, no matter the progress in computer technology.

        1. Charles 9

          Re: Dennard scaling

          "The cosine transforms (or similar) used in video compression are easily parallelised."

          Not if they're dependent on the ones BEFORE them, and the most efficient video codecs are INTER-frame, meaning you can't do the next frame until you've done the previous one. This is why x264 didn't go multithreaded for a dog's age and even now takes approaches that appear to have tradeoffs in quality or speed.

  33. KBeee
    Joke

    You all missed the point

    I can't believe that everyone commenting BTL has missed the whole point of the article! The article is purely there to denigrate those of us that choose to wear corduroy bell bottom trousers!

  34. Marco van de Voort

    looptiling

    Before you start throwing in technologies that require a radically different approach, start with simple optimizations like loop tiling to optimize for cache effects.
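
    For readers who have not met loop tiling, here is a minimal sketch of the loop structure. The function names, the square-matrix assumption and the tile size of 2 are choices made for this example. In CPython the interpreter overhead hides any cache benefit, so this shows the shape of the technique rather than a speedup; the payoff comes when the same structure is used in a compiled language with a tile sized to fit in cache.

```python
def matmul_naive(a, b):
    """Straightforward i-k-j triple loop over square matrices."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but processed in tile x tile blocks.

    The three outer loops pick a block; the three inner loops stay inside
    it, so the same small pieces of a, b and c are reused before moving on.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


# Hypothetical 5x5 inputs; 5 is deliberately not a multiple of the tile size.
n = 5
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
print(matmul_tiled(a, b, tile=2) == matmul_naive(a, b))
```

    The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size.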

  35. Caver_Dave Silver badge

    Time budgets

    I work in the certifiable, hard-realtime world now, but years ago had to write some echo cancellation and noise reduction code for a mobile phone producer.

    It was probably the best defined project I have ever worked on and consisted of source sound files, an algorithm and the expected output.

    The customer had working simulations in MATLAB (or similar) and had auto-converted that to C code, but this was still far too slow for their time budget and spent variable times computing the answers.

    My job was to take the algorithms and convert to assembler, in a manner so that all possible routes through the code took exactly the same number of clock cycles.

    As you can imagine the algorithms were complex, but I managed it with just one nop statement, and that was in a rarely used branch. The executable was about 30% the size of the C code produced executable and always took 50% of the time of the fastest path through the C code.

    That really was worth 2 weeks of my time for the customer.

    The two most important decisions at the start of a project in my mind are always algorithm and language, whether that be a little bit of scripting for a web page, or a deeply embedded PID controller.

    1. genghis_uk
      Happy

      Re: Time budgets

      I wrote an assembler program for a telco line card and every path through the main loop had to time to exactly 1ms - there were a lot of potential paths.

      Eventually I had to print it out on 30ft of paper and crawl up and down marking loops with different colour pens. It amused the office staff walking past me in the corridor!

      Happy times :)

  36. John Smith 19 Gold badge
    Unhappy

    Hmm. 5nm is 23 atoms wide.

    IOW plenty of room still to go.

    Although likely to be ballsachingly difficult quite challenging to get there.

    1. Charles 9

      Re: Hmm. 5nm is 23 atoms wide.

      Can't rely on the atom width at paths that small. Once you get that small, quantum phenomena come into play. Thus you have issues like quantum tunneling where subatomic particles (like electrons) suddenly appear on the other side of a barrier (which is a problem when the barrier in question is a transistor).

  37. Stephen Davison

    Code efficiency

    Java's multidimensional arrays are more efficient when the reference to the inner array is cached. Without that, most of the computation is the program trying to figure out where the data was stored in memory with modulo functions. The loop orders and values are also not very efficient unless the compiler is supposed to sort that out, which means that we're more measuring differences in compilers.

    The following optimised Java version took 42.6s on one core. It would likely speed up the code in most of the languages and remove the compiler as a factor.

    Python gets a lot of its speed up from compiled C functions like sort() but this isn't in play when it does some basic looping so it looks very bad in an example like this.

    int size = 4096;
    double[][] A = new double[size][size];
    double[][] B = new double[size][size];
    double[][] C = new double[size][size];

    // i-k-j loop order: the innermost loop walks along rows of B and C
    for (int i = 0; i < size; i++) {
        // cache the row references once per outer iteration
        double[] asubarray = A[i];
        double[] csubarray = C[i];
        for (int k = 0; k < size; k++) {
            double[] bsubarray = B[k];
            double asubarrayValue = asubarray[k];
            for (int j = 0; j < size; j++) {
                csubarray[j] += asubarrayValue * bsubarray[j];
            }
        }
    }

  38. Rol

    One for the conspiracy cats

    "Hey this is the biggest breakthrough of the century. 16 cores, running at 5GHz, on a 5nm die. It's out of this world"

    "Yeah, it took some time, but we aimed for the prize and made it"

    "Our customers are going to go bonkers with this"

    "Err...well...not really"

    "?"

    "You see marketing has quite rightly pointed out, that giving the world this piece of kit now, is the equivalent of making everlasting lightbulbs. We'll sell millions of them and then that will be it, as every Tom Dick and Harry will have a machine that fulfils their every desire for many years to come."

    "so what will we be marketing"

    "Something that will still blow their minds, but obviously seriously cut down. A single core running at 150MHz with a 60MHz bus"

    "Yep. Still quite impressive"

    "And every two years we'll double it up, so as to get people to upgrade. Instead of making millions we'll make trillions of dollars over the span of a few decades, as customers try to keep up to date"

    "Mr Moore, you are a genius."

    "You know. I predict our customers will probably see me as less of a genius and more of a prophet"

  39. Billy Bob Gascan

    Since I am bored and can’t go out and ride my bike in the rain I used XCode to write a C program on my iMac to multiply two 4086 x 4086 matrices of 64 bit floating point numbers. I used a complex, robust random number generating algorithm to fill the arrays. This takes 17 seconds. Then I multiplied the two arrays to fill a third array. This takes 0.102632 seconds. You have to wonder how the authors could have gotten such crappy results even using crappy languages.

  40. Someone Else Silver badge
    Joke

    I'll be here all week...try the veal...

    The authors stress the need for hardware makers to focus on less rather than Moore.

    Ba-DOOM-tish!
