Quadric has announced that support for the Llama 2 large language model (LLM) is available on its Chimera general purpose neural processing unit (GPNPU) intellectual property (IP) core. Unlike other IP and semiconductor application processor suppliers, Quadric was able to add this support with a simple software port and no hardware changes, so existing designs can run the model. Other suppliers have announced plans to change their hardware to offer support in 2024 or beyond.
Meta introduced the Llama2 LLM for generative artificial intelligence (AI) on July 18 of this year. Coincident with the unveiling of Llama2, Meta and Qualcomm announced a partnership to port Llama2 to future Qualcomm Snapdragon chips expected in 2024 smartphones and laptops. Use of LLMs had previously been considered viable only in cloud data centres. Meta’s announcement set off a flurry of activity as chip and IP providers raced to capture market attention and investment dollars for on-device LLM implementations.
In addition to Qualcomm’s announced timeline of more than six months to port Llama2, fellow silicon company Mediatek announced four weeks later that it, too, was working on Llama2 support, with an expectation that 2024 mobile phones powered by Mediatek silicon would support Llama2. Also in August, IP licensor Ceva announced that a redesigned IP core that could support LLMs would soon be delivered to its customers. Cadence similarly unveiled new IP in September to support LLMs. Those Ceva and Cadence customers will need to license new cores and design new chips for delivery sometime in 2025 or 2026. Notably, all four chip and IP announcements directly or indirectly imply that silicon respins are required to gain this new capability.
“Why would these titans of the semiconductor and IP worlds need to wait until 2024 or 2025 or beyond to support today’s newest, hottest ML (machine learning) model? Why would a new machine learning model force a silicon respin typically costing in excess of $100 million (€94.33 million) to be able to run new ML code? SoCs with Quadric’s Chimera GPNPU are ready to run Llama2 today!” says Quadric CMO, Steve Roddy.
Existing silicon chips for consumer devices all have applications-class CPUs (central processing units); many have eight ostensibly high-performance CPU cores. Why wouldn’t SoC vendors simply port Llama2 to the CPU, which can run any ML model? Clearly, either the performance won’t meet consumer expectations, or the power consumed will kill device battery life. Hence the need to respin the ML accelerator to run LLMs. Or why not choose a fully programmable GPU from an ML vendor such as NVIDIA? Perhaps the 10 W power dissipation and the need for a cooling fan prohibit use in a smartphone?
Like those programmable but power-hungry CPU and GPU (graphics processing unit) solutions, Quadric’s GPNPU is also programmable. But only Quadric Chimera GPNPUs (general-purpose neural processing units) deliver programmability with the power-performance profile needed in portable consumer devices.
Quadric’s processor architecture combines the C++ programmability needed to run any ML model with the performance and efficiency of the NPU (neural processing unit) accelerators found in many first-generation SoCs (systems on chip) on the market. Unlike inflexible accelerators that force silicon respins when complex new models such as Llama2 appear, Chimera cores can simply be reprogrammed. Chimera GPNPUs can run any model in its entirety, all layers included, without requiring removal of problematic layers, without partitioning the network, and without forcing data scientists to convert convolutions to the limited subset of ‘conv’ types supported in hardware. Chimera supports any model, any network and any operator.
Quadric’s team invested 13 engineer-weeks over four elapsed weeks to port an INT8 quantised version of Llama2 to the Chimera platform and tune its performance. To get the model running, the Quadric applications team coded two new ML operator layers, plus two variants of existing operator kernels, in C++.
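The Chimera SDK’s actual operator APIs are not described in this article, so as a purely illustrative sketch of the kind of C++ operator kernel such a port involves, here is a plain, portable implementation of SiLU (x · sigmoid(x)), an activation function used inside Llama 2’s feed-forward blocks. The function name and structure are hypothetical, not Quadric’s code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative reference kernel only: SiLU (a.k.a. swish) applied
// elementwise to a tensor flattened into a 1-D float buffer.
// A real NPU kernel would additionally handle tiling, quantised
// (e.g. INT8) arithmetic and on-chip memory movement.
std::vector<float> silu(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // x * sigmoid(x), written as x / (1 + e^-x)
        y[i] = x[i] / (1.0f + std::exp(-x[i]));
    }
    return y;
}
```

Writing a missing operator as ordinary C++ like this, rather than waiting for new fixed-function hardware, is the flexibility the article attributes to the GPNPU approach.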
A further two engineer-weeks ironed out corner-case performance and accuracy tweaks to ensure operation across all three sizes of the Chimera QB series processors (1 TOPS, 4 TOPS and 16 TOPS variants). Other machine learning inference solution providers with much larger teams are still struggling to meet six-month porting targets. Quadric’s Chimera QB4 4 TOPS GPNPU running Llama2 15M delivers 225 tokens/sec/watt efficiency in a 5nm technology, while occupying only 2.5 mm².