Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads
· 2020
· Open Access
· DOI: https://doi.org/10.1109/tc.2020.3027900
· OA: W3013692244
Data-parallel applications, such as data analytics, machine learning, and scientific computing, place an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adapt to algorithmic changes. We propose an architectural concept that tackles the issue of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10 kGE control core, called Snitch, with a double-precision FPU to adjust the compute-to-control ratio. Minimizing non-FPU area and achieving high floating-point utilization have traditionally been a trade-off; with Snitch, we achieve both by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipelines by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and the FPU effectively dual-issue at a minimal incremental cost of 3.2%. The two low-overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a $2\times$ energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22 nm technology. We achieve more than $5\times$ multi-core speed-up and a $3.5\times$ gain in energy efficiency on several parallel microkernels.
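To make the effect of the two extensions concrete, the sketch below shows a double-precision dot product in plain C and annotates, in comments, which instructions SSRs and FREP would remove from the issue stream. The C function is generic; the instruction-level mapping in the comments is an illustrative assumption based on the abstract's description, not the paper's exact instruction sequences or runtime API.

```c
#include <stddef.h>

/* Conceptual sketch (not the paper's code): a double-precision dot product.
 * On a plain in-order RISC-V core, each iteration issues roughly:
 *   two fld (load a[i], b[i]) + one fmadd.d + address increments + branch.
 * With the two ISA extensions described in the abstract:
 *   - SSRs: reads of a[i] and b[i] become reads of stream-mapped FP
 *     registers, so the explicit loads and address bookkeeping disappear
 *     from the instruction stream.
 *   - FREP: the remaining fmadd.d is repeated n times from a micro-loop
 *     buffer, so the integer pipeline no longer issues the loop body at all
 *     and is free for other work (the "pseudo dual-issue" effect).
 */
double dot(const double *a, const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];   /* one fused multiply-add per element */
    }
    return acc;
}
```

Under this reading, the steady-state work left for the core is essentially the FPU datapath itself, which is how high floating-point utilization can coexist with a tiny non-FPU area.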