In my experience, trying to switch VFX companies from CPU-based rendering to GPU-based rendering 10+ years ago, a 2-5x performance improvement wasn't enough. We even provided a compatible renderer that accepted RenderMan files and generated matching images. Given the rate of improvement of standard hardware (CPUs in our case, and GPU-based inference in yours), a 2-5x improvement will only last a few years, and the effort to get there is large (even larger in your case). Plus, I doubt you'll be able to get your HW everywhere inference is important (e.g. mobile), which means customers will need to support both their existing SW stack and your new one. The other issue is entirely non-technical, and may be an even bigger blocker: switching the infrastructure of a major LLM provider to a new upstart is just plain risky. If you do a fantastic job, though, you should get acqui-hired, probably with a small individual bonus, not enough to pay off your investors.
We're targeting the edge market first, such as NVIDIA's Jetson line, because it's far less supported/focussed on. In our experience, whenever we did training runs on H100 clusters with x86, any pip package would be easily installable, and a wide array of software just worked. This is not the case in Jetson, where we constantly have to rebuild packages from source, and in general, NVIDIA will only release a better board every five years. As for the second part of your question, we agree. Much of our work has been trying to make switching to our software layer straightforward (a single line of code). The ideal endgame is that, given an ONNX file, we can parse the generated node tree and determine if our hardware supports all the nodes. Of course, this is assuming we have a large enough share of the market using our software, so we know what operations we need to support on the hardware side of things.
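A minimal sketch of the ONNX coverage check described above, assuming the operator types have already been read out of the graph (with the onnx Python package that would be something like [n.op_type for n in onnx.load(path).graph.node]). The SUPPORTED_OPS set and the function name are hypothetical and only illustrate the idea; they are not deepsilicon's actual API.

```python
from collections import Counter

# Hypothetical set of ONNX operators the accelerator can execute.
SUPPORTED_OPS = {"MatMul", "Add", "Softmax", "LayerNormalization", "Gelu"}

def unsupported_ops(op_types, supported=SUPPORTED_OPS):
    """Return {op_type: count} for every node the hardware cannot run."""
    counts = Counter(op_types)
    return {op: n for op, n in counts.items() if op not in supported}

# Example: op types as they might appear in a small transformer block.
graph_ops = ["MatMul", "Add", "Softmax", "MatMul", "Einsum", "Einsum", "Gelu"]
missing = unsupported_ops(graph_ops)
fully_supported = not missing  # True only when every node is covered
```

Aggregating the missing operators across many customer models would also give the usage data the comment mentions for deciding which operations the hardware needs to support.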
I cannot see any way of building HW profitably for the Jetson market. You are really competing with Raspberry Pi, not Jetson, IMO. I mean, I'm no expert, but I would suggest doing a deep dive on your business plan if you intend to target the small hardware world rather than spending any time designing HW or SW. Then reduce your estimate by at least half, since doing anything in that embedded/edge world has many more technical issues.
In general, Jetson has quite a large market. Vehicle companies use automotive-rated Jetson Orins, and defense companies also use Jetson Orins to power ML applications on the edge (Anduril). Many of the companies we currently talk to are robotics companies that are forced to use Jetsons because they are both the least of the bad options and the only edge compute provider with enough juice to run larger transformer models.
And the auto and Defense markets are so easy to enter! /s
Both of these markets have long lead times, tight HW build times, and move incredibly slowly. They are not the kind of markets that like using stuff from new companies with no history. Again, I'm no expert, but I'd say you need to be concentrating on sales and market research now.
With respect it doesn't sound like you know much about any of these businesses. This startup is extremely early, the road to silicon is long, and there is a lot of external change and learning by doing that will happen between here and there. This is them getting started and based on my related work experience I think it's pretty interesting.
We are not under the illusion these markets are easy to enter. Still, we believe providing an effortless and compatible experience for edge ML computing is a strong competitive advantage. We have not met anyone who likes using Jetsons yet, unlike A100/H100s in the server market.
Edit: I should note that if it weren't for Dusty and his GitHub repo that generates Docker images for Jetson, we would have spent weeks trying to get our kernels and optimized models shipped to customers.
What's your point? Is it that one shouldn't attempt to enter a market just because it's difficult? Or are you trying to educate the founders about something obvious that they likely have already spent 1000x more time thinking about than you?
I think the only way this could work is if you had the backing of one of the major LLM providers who decided that your ideas are worth doing a PoC. That way you actually have a client on board before you spend all the money. I know you guys probably like the designing of the HW and SW, and maybe the implementation of both, but really, what you need now is to do sales.
There are multiple ways to run a business like this.
1. Go deep on the tech. There are funders who will want equity stakes in risky startups because they operate in adjacent markets. It's often cheaper to invest 1MM in a startup than in internal R&D activities. If it has promising results, those same investors may ramp up their spend or pivot to an acquisition strategy.
2. Get early customers. If you have 1-10 large enterprises with a committed spend, then you are likely golden. However, as nice as this option sounds, there are few avenues to get this type of commitment. If you are in the fortunate position of knowing the exec/founding/investor team of a large LLM provider, it's possible. But easier said than done.
3. Build it and they will come. Business strategies take time to develop, and maybe that time is poorly spent. Build the best version of your product and someone might take it up. There are a few investors who will take a flyer on this type of founder mentality. The benefit to the investor is that they can get a much larger equity stake/board position in exchange for the early creative freedom. If it works out, the investor can get a lot of alpha. A card which handled LLM inference at 1/100th the cost of an H100 could produce quite a bit of value for the right buyer.
4. Do the technical work to get it a little bit beyond just an idea and then get acqui-hired by a large company who has the resources to push this.
So if I were them, I would be doing thought experiments on how this technology could benefit a whole range of businesses, e.g. gaming consoles, televisions, etc. Not many people would've guessed that Palm's webOS would end up at LG, for example.
I'm currently working on a portable computer vision project using Pi/Jetson with some Luxonis camera modules and I completely see where you're headed. In the long-game I think you could capture hw accelerated robotics CV.
Why not target the enthusiast first? The buzz created around something interesting an "amateur" cooked up may be what you need. The investment involved with creating dev hardware should be minimal, correct?
I may be wrong, but from what I've seen in a few other enthusiast niches, there are too few enthusiasts to fund hardware development.
- You need millions of sales, but most real projects have only made thousands.
This has long been known: even the Raspberry Pi was born for a different market. Fortunately it wasn't killed off but was redirected at enthusiasts, and even now it is an incomplete project.
I've been working in DL inference for 7+ years now (5 of them at a startup), which makes me comparatively ancient in the AI world at this point. The performance rat race/treadmill is never-ending, and to your point, a large (i.e. 2x+) performance improvement is not enough of a "painkiller" for customers unless there is something that is impossible for them to achieve without your technology.
The second problem is distribution: it is already hard enough to obtain good enough distribution with software, let alone software + hardware combinations. Even large silicon companies have struggled to get their HW into products across the world. Part of this is due to the actual purchase dynamics and cycle of people who buy chips: many design products and commit to N-year production cycles of products built on certain hardware SKUs, meaning you have to both land large deals and have opportune timing to catch them when they are even shopping for a new platform. Furthermore, the people with existing distribution (i.e. the Apples, Googles, Nvidias, Intels, AMDs, and Qualcomms of the world) already have their own offerings in this space and will not partner with or buy from you.
My framing (which has remained unchanged since 2018) is that for a silicon platform to win you have to beat the incumbents (i.e. Nvidia) on the 3Ps: Price (really TCO), Performance, and Programmability.
Most hardware accelerators may win on one, but even then it is often theoretical performance because it assumes their existing software can/will work on your chip, which it often doesn't (see AMD and friends).
There are many other threats that come in this form, for example if you have a fixed function accelerator and some part of the model code has to run on CPU the memory traffic/synchronization can completely negate any performance improvements you might offer.
Even many of the existing silicon startups have been struggling with this since the middle of the last decade. The only thing that saved them is the consolidation to Transformers, but it is very easy for a new model architecture to come out and require everyone to rework what they have built. This need for flexibility is what has given rise to the design ethos around GPGPU, as flexibility in a changing world is a requirement, not just a nice-to-have.
Best of luck, but these things are worth thinking deeply about. When we started in this market we were already aware of many of these things, but their importance and gravity in the AI market have only grown, not shrunk :)
We've spent a lot of time thinking about these things, in particular, the 3Ps.
Part of making the one line of code work is addressing programmability. If you're on Jetson, we should load the CUDA kernels for Jetson. If you're using a CPU, we should load the CPU kernels. If it's a CPU with AVX512, we should load the kernels that use AVX512 instructions, etc.
The end goal is that when we introduce our custom silicon, one line of code should make it far easier to bring customers over from Jetson/any other platform because we handle loading the correct backend for them.
We know this will be bordering on impossible, but it's critical to ensure we take on that burden rather than shifting it to the ML engineer.
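A rough sketch of the backend selection described above, written as a pure function so the decision logic is easy to see. The backend names, the function, and the rule that "CUDA on aarch64 means Jetson" are all assumptions made for illustration; a real implementation would have to probe the device far more carefully.

```python
def select_backend(has_cuda, machine, cpu_flags):
    """Pick a kernel backend name from coarse platform facts.

    has_cuda:  whether a CUDA device is visible
    machine:   architecture string, e.g. "x86_64" or "aarch64"
    cpu_flags: set of CPU feature strings, e.g. parsed from /proc/cpuinfo
    """
    if has_cuda and machine == "aarch64":
        return "cuda-jetson"   # assumption: CUDA on ARM is treated as a Jetson board
    if has_cuda:
        return "cuda"
    if "avx512f" in cpu_flags:
        return "cpu-avx512"    # CPU kernels built with AVX512 instructions
    if "avx2" in cpu_flags:
        return "cpu-avx2"
    return "cpu-generic"       # portable fallback with no SIMD assumptions
```

Keeping the choice in one place is what would let a future custom-silicon backend be added as one more branch without changing the user's single line of code.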
Why start a company to make this product? Why not go work at one of the existing chip manufacturers? You'd learn a ton, get to design and work on HW and/or SW, and not have to do the million other things required to start a company.
We were waiting for a BitNet-based software and hardware stack, particularly from Microsoft, but it never came. We were essentially nerd-sniped into working on this problem, then we realized it was also monetizable.
On a side note, I deeply looked into every company in the space and was thoroughly unimpressed with how little they cared about the software stack to make their hardware seamlessly work. So, even if I did go to work at some other hardware company, I doubt a lot of customers would utilize the hardware.
I recommend getting a job at NVIDIA. They care deeply about SW. It is a great place to learn about HW and the supporting SW. There is much to learn. Maybe you will learn why you are unimpressed with their SW offerings. For me, the hard part was the long lead time (8+ years) from design to customers using the product. One of the things that always amazed me about NVIDIA was that so many of the senior architects, who have no financial need to keep working (true for more than a decade), are still working there because they need the company to do what they love.
I think there is a comment somewhere here where I comment on NVIDIA, but I think NVIDIA is the best hardware company for making good software. We had a very niche software issue for which NVIDIA maintained open-source repos. I don't think NVIDIA's main advantage is its hardware, though; I think it's the software and the flexibility it brings to its hardware.
Suppose that Transformers die tomorrow, and Mamba becomes all the rage. The released Mamba code already has CUDA kernels for inference and training. Any of the CSPs or other NVIDIA GPU users can switch their entire software stack to train and inference Mamba models. Meanwhile, we'll be completely dead in the water with similar companies that made the same bet, like Etched.
You said (implied?) that your reason for starting a company was that you were waiting for somebody (MS) to build your favorite tech, and you realized it was monetizable. Finding a gap is a great start. But, if money is your goal, it is far easier to make money working at a company than starting one. Existing companies are great places to learn about technology, business, and the issues that should really drive your desire to start something yourself.
I don't think I ever implied we started this for money. We started working on the technology because it was exciting and enabled us to run LLMs locally. We wouldn't have started this company if someone else came along and did it, but we waited a month or two and didn't see anyone making progress. It just so happens that hardware is capital intensive, so making hardware means you need access to a lot of capital through grants (which Dartmouth didn't have for chip hardware) or venture capital (which we're going for now). I'm not sure where you got the idea we're doing this solely for money when I explicitly said "We were essentially nerd-sniped into working on this problem"
Glad to hear money isn't your focus. Your comment "...then we realized it was also monetizable" was the reason for my interpretation. It's also a very common rationale. I don't know what "nerd-sniped" means, so...
Good luck with the VCs. I hope you all stay friends through the challenging process.
When performing performance optimization on CPUs, I was impressed with Intel's suite of tools (like VTUNE). NVIDIA has some unbelievable tools, like Nsys and, of course, its container registry (NGC), which I think surpasses even Intel's software support.
Is GPU rendering used today for VFX? From a quick Google search, it seems that GPU-based rendering is definitely an option, even if there are various reasons to still prefer CPU. So in your case, was what you were aiming to do really pointless, or did your particular solution simply fail to succeed?
You're right that as a small player it's very hard to gain traction, even if the tech is fantastic, because it's risky to switch your tech stack over. Though if you do a good job with the tech, I'd say you have a decent chance of an acquisition by a bigger player who wants a ready-made (or 90% of the way there) solution they can make their own. Perhaps you can call this an acqui-hire, but I think you're significantly underplaying the potential upside of this exit. Imagine this startup is seen as having a great ternary transformer solution and ternary transformers are the way to go: you could get multiple large players eyeing up an acquisition to get ahead, pushing the price up.
My feeling is that custom ASICs for ternary transformers are a great area to look at. There is a genuine chance of providing a significant step up from GPUs in terms of power efficiency and potentially performance. Plenty of risk of course: ternary models might just not perform as well as the full-fat equivalents, and building custom silicon, especially as a start-up, comes with all kinds of issues.
Yes by small studios with the agility to change their workflow without too much friction, and whose projects are small enough to fit into the constraints of GPU renderers, but largely not by huge studios who already have in-house CPU farms and whose projects need hundreds of gigs of RAM to render anyway.
Watching the video demo was key for me. I highly recommend everyone else here watches it.[a]
From a software development standpoint, usability looks great, requiring only one import,
import deepsilicon as ds
and then, later on, a single line of Python,
model = ds.convert(model)
which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!
The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.
I, for one, hope you guys are hugely successful.
---
[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.
CUDA and Nvidia are practically impenetrable on the server side. To be very concrete, we did training for our models on AWS with ParallelCluster. We used P5 instances (8xH100) that were scheduled with SLURM. One problem we ran into, however, was that our training jobs were containerized. Thankfully, pyxis and enroot exist to run containerized jobs on SLURM. And who else but Nvidia develops and maintains those plugins? For practically any weird niche use case, Nvidia seems to have some software solution, but only on x86.
Jetson is a whole other beast. There is no guarantee any pip package you install has an aarch64/arm64 wheel. For example, we could not use torch_tensorrt to compile to TensorRT via Torch Inductor. Why? Because the Bazel build system was only configured to build for Jetpack 4.6 or Jetpack 5.1, and we were using Jetpack 6. While Nvidia provides docker images for x86 systems that come with torch_tensorrt installed, their L4T (Linux for Tegra) images do not. Instead, we had to manually write out a new workspace file and compile for Jetpack 6 to get TensorRT compilation support.
tl;dr: Nvidia and CUDA have a great walled garden on x86, not so much on their edge computing devices
My understanding is that, so far, most deployments of AI on edge devices are on mass-market mobile and entertainment devices relying on software and hardware tightly controlled by a handful of mega-corporations, such as Apple (iOS), Google (Android), Samsung (phones, TVs, etc.), and Tesla (proprietary in-car chips for FSD), and so on. Aren't those mega-corporations, not Nvidia, the ones who have the actual walled gardens on AI edge computing?
You're absolutely right about mobile devices (Apple, Google, etc.). However, most companies, with the exception of Tesla, do use Nvidia for edge computing capabilities. We know for a fact that most of the automotive industry uses automotive-rated Orins (the 32GB unified RAM SKU) [1], and Anduril also uses Orins. Our primary GTM is with robotics companies, and we have not met a single robotics company that isn't using Jetson. I'm not exaggerating.
[1] Particularly vehicles with advanced self driving capabilities. Qualcomm is another large vendor of hardware for vehicles (though they have even worse support)
I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.
This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.
I think the biggest technical hurdle would be simulating the non linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW efficient non linear layer.
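To make the "multiplication-free" point above concrete, here is a small sketch of a dot product with ternary weights. It is only an illustration of the arithmetic, not a description of anyone's hardware: because every weight is -1, 0, or 1, each term is an add, a subtract, or a skip.

```python
def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, 1}, using only add/subtract."""
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0 contributes nothing, so the activation is skipped
    return acc

# 1*3 + 0*5 + (-1)*2 + 1*4
result = ternary_dot([1, 0, -1, 1], [3, 5, 2, 4])
```

In silicon this would replace each multiplier with a small mux feeding an adder, which is where the power savings discussed above would come from.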
The non-linear layers, particularly the softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block.
I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I’d worry that a similar development might happen in the LLMs space.
It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth. However, no one cared. The bet we're taking is that Transformers will dominate for at least a few more years, but our bet could be wrong.
Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?
We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.
To be more explicit, the weight matrix's values belong to the set of -1, 0, and 1. When using two bits to encode these weights, we are not effectively utilizing one possible state:
10 => 1,
01 => 0,
00 => -1,
11 => ?
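As a sketch of the two-bit encoding listed above, the following packs four ternary weights into each byte and unpacks them again. The bit order within a byte (first weight in the low bits) and the function names are my own assumptions; only the code-to-weight mapping comes from the comment.

```python
ENCODE = {1: 0b10, 0: 0b01, -1: 0b00}             # mapping from the list above
DECODE = {code: w for w, code in ENCODE.items()}  # 0b11 is never produced

def pack(weights):
    """Pack ternary weights, four per byte, first weight in the low bits."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= ENCODE[w] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(data, count):
    """Recover `count` ternary weights from bytes produced by pack()."""
    weights = []
    for i in range(count):
        code = (data[i // 4] >> (2 * (i % 4))) & 0b11
        weights.append(DECODE[code])
    return weights
```

Unpacking is just a shift and a mask per weight, which is the kind of cheap decode that keeps throughput high at the cost of the unused fourth state.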
I think selecting the optimal radix economy will have more of a play on custom silicon, where we can implement silicon and instructions to rapidly decompress weights or work with the compressed weights directly.
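For comparison, here is a sketch of a denser base-3 packing, which is one way (an assumption on my part, not the authors' stated scheme) to get closer to the log2(3) bound. Five ternary weights fit in one byte because 3^5 = 243 is at most 256, giving 1.6 bits per weight instead of 2. The decode needs divisions by 3, which illustrates why more complex unpacking hurts throughput without dedicated hardware.

```python
import math

def pack5(trits):
    """Pack exactly five weights from {-1, 0, 1} into one byte (base 3)."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)   # shift each weight into the digit range 0..2
    return value                      # always in 0..242, so it fits in a byte

def unpack5(value):
    """Invert pack5(); each weight costs a divide/modulo by 3."""
    trits = []
    for _ in range(5):
        value, digit = divmod(value, 3)
        trits.append(digit - 1)
    return trits

bits_two_bit = 2.0          # the two-bit scheme, with one state wasted
bits_base3 = 8 / 5          # 1.6 bits per weight
bits_ideal = math.log2(3)   # about 1.585 bits per weight, the lower bound
```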
What do you think about the tension between inference accuracy and the types of edge applications used today?
For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage I think that this will have a big advantage over Jetson. And I think there are a lot of potentially novel use cases for a technology like that (eg. if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates)
But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?
Great question. So a little bit of background about quantization (apologies if you are already familiar).
There are two types of quantization (generally), post training quantization (PTQ) and quantization aware training (QAT).
PTQ almost always suffers from some kind of accuracy degradation. This is because usually the loss is measured with respect to the FP16/BF16 parameters, and so the weights and distribution are selected to minimize the loss with those weights. Once the quantization function is applied, the weights and distribution change in some way (even if it's by a tiny amount), resulting in your model no longer being at a minimum.
We do QAT to get around the problem of PTQ. We actually quantize the weights during the forward pass of training, and measure the loss with respect to the quantized weights. As a result, once we converge the model, we have converged the ternary weights as well, and the accuracy it achieved at the end of training is the accuracy of the quantized model. At ~3B parameters the accuracy on downstream task performance between FP16 and ternary weights is identical.
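A toy sketch of "quantize in the forward pass and measure the loss on the quantized weights". The quantizer here is the absmean scheme from the BitNet b1.58 paper (scale by the mean absolute weight, round, clip to [-1, 1]); that choice is an assumption for illustration, and the authors note elsewhere that their linear layers differ from Microsoft's. A real QAT setup would also need a backward pass, typically a straight-through estimator that treats the rounding as identity, which is omitted here.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def forward(weights, x):
    """Forward pass that uses the quantized weights, as in QAT."""
    q, scale = ternarize(weights)
    return scale * sum(qi * xi for qi, xi in zip(q, x))

def loss(weights, x, target):
    """Squared error measured against the quantized model's output."""
    return (forward(weights, x) - target) ** 2

latent = [0.9, -0.05, -1.1, 0.4]   # full-precision "latent" weights kept during training
q, scale = ternarize(latent)       # what the forward pass actually uses
```

Because the loss is computed on the ternary weights from the start, the model that converges is already the quantized model, which is the difference from PTQ described above.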
I applaud the chutzpah of doing a company where you develop both hardware and software for the hardware. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.
Funnily enough, our ML engineer, Eddy, did a hackathon project working with Procyon to make a neural network with a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.
Edit: I don't think the company exists in its current form anymore
Is one expectation from moving from a 2^16-state parameter to a tristate one that the tristate one will only need to learn the subset of the 2^16 states that were actually significant? I.e., can we prune the "extra" bits from the 2^16 that did not really affect the result?
Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration rom into a shift register chain, instead of fixed. This would allow updating the weights without having to go through the whole production chain again.
I should note that our linear layers are not the same as Microsoft's, in fact, we think Microsoft made a mistake in the code they uploaded. When I have time later today, I'll link to where I think they made a mistake.
I've been following TriLLM. They've achieved great results, and I'm really impressed with the llama.cpp contributors already getting the models integrated.
1. They are everywhere and aren't going anywhere.
2. Network infrastructure to ingest and analyze thousands of cameras producing video footage is very demanding.
3. Low power and low latency scream ASIC to me.
Have you tried implementing your ternary transformers on AVX(-512)? I think it fits relatively well with the hardware philosophy, and being able to run inference without a GPU would be a big plus.
What kind of code did you try on the CPU for, say, ternary gemm? I imagine ternary values map nicely to vectorized mask instructions, and much of tiling etc from usual gemm
The most popular interfaces (human, API and network) I can imagine are ChatGPT, an OpenAI-compatible HTTP API, the HuggingFace Transformers API and models, Llama.cpp / Ollama / Llamafile, and PyTorch.
USB C, USB A, RJ45, HDMI/video(?)
If you can run a frontier model or a comparable model with a ChatGPT clone like Open UI, over a USB or LAN interface, that can work on private data quickly, securely and competitively with a used 3090, it would be super badass. It should be easy to plug in and use for chat, API use, fine-tuning, or raw primitives via PyTorch or a very similar compatible API. I've thought about this a bit. There's more I could say but I've got to sleep soon... Good luck, it's an awesome opportunity.
Have you sat in on my conversations with my cofounder?
The end plan is to have a single chip and flush all weights onto the chip at initialization. Because we are a single line of code that is Torch compatible (hence HF compatible), every other part of the codebase shouldn't change.
I've not but that sounds cool!
I would point out though, in terms of mind share, how memorable, and how relatable and useful the products are: it might help to have ways that directly show the application for the kinds of people buying GPUs for inference and training or using cloud for this that would love to not have to fight their ATX case in a hot sweaty corner while repeatedly dropping screwdrivers and calculating how much RAM they need to buy for the 405B while llama.cpp is recompiling again... I think people would throw money at that.
I'd be happy to listen in or have a chat some time!
Yeah, I've been thinking about this problem for a while at the gate level. I think the problem essentially breaks down to a couple of popcounts and a subtract, so it's eminently pipelineable.
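One way to read the "couple of popcounts and a subtract" remark, sketched under the assumption that activations are single bits (0 or 1) and the ternary weights are stored as two bitmasks, one marking the +1 positions and one marking the -1 positions. The commenter did not spell out the representation, so the helper names and layout here are hypothetical.

```python
def popcount(v):
    """Number of set bits in a non-negative integer."""
    return bin(v).count("1")

def to_masks(weights):
    """Split ternary weights into a +1 bitmask and a -1 bitmask."""
    pos = neg = 0
    for i, w in enumerate(weights):
        if w == 1:
            pos |= 1 << i
        elif w == -1:
            neg |= 1 << i
    return pos, neg

def ternary_dot_bits(pos, neg, x_bits):
    """Dot product of ternary weights with 0/1 activations packed in x_bits."""
    return popcount(pos & x_bits) - popcount(neg & x_bits)

pos, neg = to_masks([1, -1, 0, 1, -1])
x_bits = 0b10111   # activations [1, 1, 1, 0, 1], element 0 in the lowest bit
result = ternary_dot_bits(pos, neg, x_bits)
```

Each stage (AND, popcount, subtract) is independent of the next input word, which is why the computation pipelines so naturally.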
Ternary transformers existed for a long time before you guys (TerDiT, vision ones, etc.). Competing in the edge inference space is likely going to require a lot of capex and opex, plus breaking into markets like defense that are hard asf without connections and a strong team. Neither of you guys are chip architects either, and taping out silicon requires a lot of foresight into changing market demands. Good luck, hopefully it works out.
This 1000%. Just because a business in a tangential area didn't work, doesn't mean innovation shouldn't happen
The most realistic and likely scenario is:
4. Do the technical work to get it a little bit beyond just an idea and then get acqui-hired by a large company who has the resources to push this.
So if I was them I would be doing thought experiments on how this technology could benefit a whole range of businesses e.g. gaming consoles, televisions etc. Not many people would've guessed LG acquiring Palm for example.
Agreed. We don't plan on making hardware until there is enough demand from customers to make it economically viable.
I'm currently working on a portable computer vision project using Pi/Jetson with some Luxonis camera modules and I completely see where you're headed. In the long-game I think you could capture hw accelerated robotics CV.
Why not target the enthusiast first? The buzz created around something interesting an "amateur" cooked up may be what you need. The investment involved with creating dev hardware should be minimal, correct?
I may be wrong, but from a few other enthusiast niches I conclude that the number of enthusiasts is too small to fund hardware development. You need millions of sales, but most real projects have made only thousands.
And this has long been known: even Raspberry Pi was born for another market; fortunately, it was not simply killed off but converted to target enthusiasts, and even now it is an incomplete project.
I have been working in DL inference for 7+ years now (5 of them at a startup), which makes me comparatively ancient in the AI world at this point. The performance rat race/treadmill is never-ending, and to your point, a large (i.e. 2x+) performance improvement is not enough of a "painkiller" for customers unless there is something that is impossible for them to achieve without your technology.
The second problem is distribution: it is already hard enough to obtain good enough distribution with software, let alone software + hardware combinations. Even large silicon companies have struggled to get their HW into products across the world. Part of this is due to the actual purchase dynamics and cycle of people who buy chips: many design products and commit to N-year production cycles of products built on certain hardware SKUs, meaning you have to both land large deals and have opportune timing to catch them when they are even shopping for a new platform. Furthermore, the people with existing distribution, i.e. the Apples, Googles, Nvidias, Intels, AMDs, and Qualcomms of the world, already have their own offerings in this space and will not partner with or buy from you.
My framing (which has remained unchanged since 2018) is that for a silicon platform to win, you have to beat the incumbents (i.e. Nvidia) on the 3Ps: Price (really TCO), Performance, and Programmability.
Most hardware accelerators may win on one, but even then it is often theoretical performance because it assumes their existing software can/will work on your chip, which it often doesn't (see AMD and friends).
There are many other threats that come in this form, for example if you have a fixed function accelerator and some part of the model code has to run on CPU the memory traffic/synchronization can completely negate any performance improvements you might offer.
Even many of the existing silicon startups have been struggling with this since the middle of the last decade. The only thing that saved them is the consolidation to Transformers, but it is very easy for a new model architecture to come out and require everyone to rework what they have built. This need for flexibility is what has given rise to the design ethos around GPGPU, as flexibility in a changing world is a requirement, not just a nice-to-have.
Best of luck, but these things are worth thinking deeply about as when we started in this market we were already aware of many of these things but their importance and gravity in the AI market have only become more important, not less :)
We've spent a lot of time thinking about these things, in particular, the 3Ps.
Part of making the one line of code work is addressing programmability. If you're on Jetson, we should load the CUDA kernels for Jetson. If you're using a CPU, we should load the CPU kernels. If that CPU supports AVX-512, we should load the kernels that use AVX-512 instructions, and so on.
The end goal is that when we introduce our custom silicon, one line of code should make it far easier to bring customers over from Jetson/any other platform because we handle loading the correct backend for them.
We know this will be bordering on impossible, but it's critical to ensure we take on that burden rather than shifting it to the ML engineer.
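To make the "load the right kernels" idea concrete, here is a minimal sketch of that kind of dispatch. It is not our actual loader: the function name `select_backend` and every backend label are hypothetical, and it assumes the caller has already gathered the machine architecture, the CPU feature flags (for example `avx512f` and `avx2` as Linux reports them in /proc/cpuinfo, or `asimd` for NEON on aarch64), and whether CUDA is usable.

```python
from typing import Iterable


def select_backend(machine: str, cpu_flags: Iterable[str], has_cuda: bool) -> str:
    """Pick a kernel backend label from coarse platform facts.

    All backend labels returned here are hypothetical placeholders.
    """
    flags = set(cpu_flags)
    arch = machine.lower()
    is_arm = arch in ("aarch64", "arm64")
    is_x86 = arch in ("x86_64", "amd64")

    if has_cuda and is_arm:
        return "cuda_jetson"   # CUDA on an ARM host: treat it as Jetson-class
    if has_cuda:
        return "cuda_generic"
    if is_x86 and "avx512f" in flags:
        return "cpu_avx512"    # widest vector kernels first
    if is_x86 and "avx2" in flags:
        return "cpu_avx2"
    if is_arm and ("neon" in flags or "asimd" in flags):
        return "cpu_neon"
    return "cpu_generic"       # portable fallback so the one-liner never fails


if __name__ == "__main__":
    print(select_backend("x86_64", {"avx2", "avx512f"}, has_cuda=False))
    print(select_backend("aarch64", {"asimd"}, has_cuda=True))
```

The point of ordering the checks from most to least specialized is that the ML engineer never sees the decision; a new backend would only add one more branch above the fallback.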
Why start a company to make this product? Why not go work at one of the existing chip manufacturers? You'd learn a ton, get to design and work on HW and/or SW, and not have to do the million other things required to start a company.
We were waiting for a Bitnet-based software and hardware stack, particularly from Microsoft, but it never came. We were essentially nerd-sniped into working on this problem, then we realized it was also monetizable.
On a side note, I deeply looked into every company in the space and was thoroughly unimpressed with how little they cared about the software stack to make their hardware seamlessly work. So, even if I did go to work at some other hardware company, I doubt a lot of customers would utilize the hardware.
I recommend getting a job at NVIDIA. They care deeply about SW. It is a great place to learn about HW and the supporting SW. There is much to learn. Maybe you will learn why you are unimpressed with their SW offerings. For me, the hard part was the long lead time (8+ years) from design to customers using the product. One of the things that always amazed me about NVIDIA was that so many of the senior architects, who have no financial need to keep working (true for more than a decade), are still working there because they need the company to do what they love.
I think there is a comment somewhere here where I comment on NVIDIA, but I think NVIDIA is the best hardware company for making good software. We had a very niche software issue for which NVIDIA maintained open-source repos. I don't think NVIDIA's main advantage is its hardware, though; I think it's the software and the flexibility it brings to its hardware.
Suppose that Transformers die tomorrow, and Mamba becomes all the rage. The released Mamba code already has CUDA kernels for inference and training. Any of the CSPs or other NVIDIA GPU users can switch their entire software stack to train and inference Mamba models. Meanwhile, we'll be completely dead in the water with similar companies that made the same bet, like Etched.
You said (implied?) that your reason for starting a company was that you were waiting for somebody (MS) to build your favorite tech, and you realized it was monetizable. Finding a gap is a great start. But, if money is your goal, it is far easier to make money working at a company than starting one. Existing companies are great places to learn about technology, business, and the issues that should really drive your desire to start something yourself.
I don't think I ever implied we started this for money. We started working on the technology because it was exciting and enabled us to run LLMs locally. We wouldn't have started this company if someone else came along and did it, but we waited a month or two and didn't see anyone making progress. It just so happens that hardware is capital intensive, so making hardware means you need access to a lot of capital through grants (which Dartmouth didn't have for chip hardware) or venture capital (which we're going for now). I'm not sure where you got the idea we're doing this solely for money when I explicitly said "We were essentially nerd-sniped into working on this problem"
Glad to hear money isn't your focus. Your comment "...then we realized it was also monetizable" was the reason for my interpretation. It's also a very common rationale. I don't know what "nerd-sniped" means, so...
Good luck with the VCs. I hope you all stay friends through the challenging process.
> I think NVIDIA is the best hardware company for making good software
I must support your words. For a long time I thought Intel was the best, but unfortunately I can't say that anymore.
I must admit, I still don't understand how it happened, but now NVIDIA is the best.
100%.
When doing performance optimization on CPUs, I was impressed with Intel's suite of tools (like VTune). NVIDIA has some unbelievable tools, like Nsight Systems (nsys) and, of course, its container registry (NGC), which I think surpasses even Intel's software support.
Is GPU rendering used today for VFX? From a quick Google search, it seems that GPU-based rendering is definitely an option, even if there are various reasons to still prefer CPU. So in your case, was what you were aiming to do really pointless, or did your particular solution simply fail to succeed?
You're right that as a small player it's very hard to gain traction, even if the tech is fantastic, because it's risky to switch your tech stack over. Though if you do a good job with the tech, I'd say you have a decent chance of an acquisition by a bigger player who wants a ready-made (or 90% of the way there) solution they can make their own. Perhaps you can call this an acqui-hire, but I think you're significantly underplaying the potential upside of this exit. If this startup is seen as having a great ternary transformer solution, and ternary transformers are the way to go, you could get multiple large players eyeing an acquisition to get ahead, pushing the price up.
My feeling is custom ASICs for ternary transformers is a great area to look at. There is a genuine chance of providing a significant step up from GPUs in terms of power efficiency and potentially performance. Plenty of risk of course, ternary models might just not perform as well as the full fat equivalents and building custom silicon, especially as a start-up, comes with all kinds of issues.
> Is GPU rendering used today for VFX?
Yes by small studios with the agility to change their workflow without too much friction, and whose projects are small enough to fit into the constraints of GPU renderers, but largely not by huge studios who already have in-house CPU farms and whose projects need hundreds of gigs of RAM to render anyway.
The Unreal Engine I hear is getting a lot of work these days.
Watching the video demo was key for me. I highly recommend everyone else here watches it.[a]
From a software development standpoint, usability looks great, requiring only one import and then, later on, a single line of Python, which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!
The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.
I, for one, hope you guys are hugely successful.
---
[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.
Thank you!
CUDA and Nvidia are practically impenetrable on the server side. To be very concrete, we did training for our models on AWS with ParallelCluster. We used P5 instances (8xH100) that were scheduled with SLURM. A problem we ran into, however, was that our training jobs were containerized. Thankfully, pyxis and enroot exist to run containerized jobs on SLURM. And who else but Nvidia develops and maintains those plugins? For practically any weird niche use case, Nvidia seems to have some software solution, but only on x86.
Jetson is a whole other beast. There is no guarantee any pip package you install has an aarch64/arm64 wheel. For example, we could not use torch_tensorrt to compile to TensorRT via Torch Inductor. Why? Because the Bazel build system was only configured to build for JetPack 4.6 or JetPack 5.1, and we were using JetPack 6. While Nvidia provides Docker images for x86 systems that come with torch_tensorrt installed, their L4T (Linux for Tegra) images do not. Instead, we had to manually write out a new workspace file and compile for JetPack 6 to provide TensorRT compilation support.
tl;dr: Nvidia and CUDA have a great walled garden on x86, not so much on their edge computing devices
My understanding is that, so far, most deployments of AI on edge devices are on mass-market mobile and entertainment devices relying on software and hardware tightly controlled by a handful of mega-corporations, such as Apple (iOS), Google (Android), Samsung (phones, TVs, etc.), and Tesla (proprietary in-car chips for FSD), and so on. Aren't those mega-corporations, not Nvidia, the ones who have the actual walled gardens on AI edge computing?
Do you think otherwise?
You're absolutely right about mobile devices (Apple, Google, etc.). However, most companies, with the exception of Tesla, do use Nvidia for edge computing capabilities. We know for a fact that most of the automotive industry uses automotive rated Orins (the 32GB unified RAM SKU) [1] and Anduril also use Orins. Our primary GTM is with robotics companies, and we have not met a single robotics company not using Jetson, I'm not exaggerating.
[1] Particularly vehicles with advanced self driving capabilities. Qualcomm is another large vendor of hardware for vehicles (though they have even worse support)
> Our primary GTM is with robotics companies, and we have not met a single robotics company not using Jetson, I'm not exaggerating.
Huh. That's a really good sign. I'm rooting for you!
Video cropping issues should be fixed!
I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.
This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.
I think the biggest technical hurdle would be simulating the non linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW efficient non linear layer.
The non-linear layers, particularly the softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block.
I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I’d worry that a similar development might happen in the LLMs space.
It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth. However, no one cared. The bet we're taking is that Transformers will dominate for at least a few more years, but our bet could be wrong.
Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?
We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.
To be more explicit, the weight matrix's values belong to the set of -1, 0, and 1. When using two bits to encode these weights, we are not effectively utilizing one possible state:
10 => 1, 01 => 0, 00 => -1, 11 => ?
I think selecting the optimal radix economy will have more of a play on custom silicon, where we can implement silicon and instructions to rapidly decompress weights or work with the compressed weights directly.
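To illustrate the trade-off, here is a small sketch comparing the 2-bit encoding above with a denser base-3 packing. The function names are hypothetical and this is not our production unpacking code. The 2-bit scheme stores four weights per byte and decodes with a shift and a mask. The base-3 scheme fits five weights per byte (3^5 = 243 <= 256), which is 1.6 bits per weight against the theoretical log2(3) ≈ 1.585, but decoding needs a divide and a modulo per weight, which is the kind of unpacking cost that hurts throughput on general-purpose hardware.

```python
# Mapping from the comment above: 10 => 1, 01 => 0, 00 => -1 (11 is unused).
ENCODE = {1: 0b10, 0: 0b01, -1: 0b00}
DECODE = {code: weight for weight, code in ENCODE.items()}


def pack_2bit(weights):
    """Pack ternary weights four per byte, two bits each."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= ENCODE[w] << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n):
    """Inverse of pack_2bit: one shift and one mask per weight."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


def pack_base3(weights):
    """Pack five ternary weights per byte as one base-3 number."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)  # shift {-1, 0, 1} to digits {0, 1, 2}
        out.append(value)
    return bytes(out)


def unpack_base3(packed, n):
    """Inverse of pack_base3: one modulo and one divide per weight."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:n]  # drop padding digits from the last byte


if __name__ == "__main__":
    w = [1, 0, -1, 1, 0, 0, -1, -1, 1, 1]
    print(len(pack_2bit(w)), "bytes at 2 bits/weight")
    print(len(pack_base3(w)), "bytes at 1.6 bits/weight")
```

Against FP16, 2 bits per weight is an 8x reduction and 1.6 bits per weight is 10x, so the denser packing is only worth it where decode is cheap, which is the argument for doing it in silicon.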
What do you think about the tension between inference accuracy and the types of edge applications used today?
For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage I think that this will have a big advantage over Jetson. And I think there are a lot of potentially novel use cases for a technology like that (eg. if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates)
But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?
Great question. So a little bit of background about quantization (apologies if you are already familiar).
There are two types of quantization (generally), post training quantization (PTQ) and quantization aware training (QAT).
PTQ almost always suffers from some kind of accuracy degradation. This is because the loss is usually measured with respect to the FP16/BF16 parameters, so the weights and distribution are selected to minimize the loss with those weights. Once the quantization function is applied, the weights and distribution change in some way (even if it's by a tiny amount), resulting in your model no longer being at a minimum.
We do QAT to get around the problem of PTQ. We actually quantize the weights during the forward pass of training, and measure the loss with respect to the quantized weights. As a result, once we converge the model, we have converged the ternary weights as well, and the accuracy it achieved at the end of training is the accuracy of the quantized model. At ~3B parameters the accuracy on downstream task performance between FP16 and ternary weights is identical.
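A toy sketch of what "quantize in the forward pass, measure the loss on the quantized weights" means, in plain Python for a single linear neuron. Two things here are assumptions for illustration and not a description of our training code: the scale is the mean absolute weight (absmean scaling), and gradients are passed to the latent float weights with a straight-through estimator, i.e. as if the rounding step were the identity.

```python
def ternarize(weights):
    """Map float weights to {-1, 0, 1} plus one shared scale (absmean, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def forward(weights, x):
    """The forward pass uses the ternary weights, never the latent floats."""
    quantized, scale = ternarize(weights)
    return scale * sum(q * xi for q, xi in zip(quantized, x))


def train_step(weights, x, target, lr=0.05):
    """One SGD step on squared error measured on the quantized forward pass.

    Straight-through estimator: the gradient computed for the quantized
    weights is applied directly to the latent float weights.
    """
    error = forward(weights, x) - target
    grads = [2.0 * error * xi for xi in x]
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, error ** 2


if __name__ == "__main__":
    latent = [0.9, -0.05, -1.2, 0.3]
    for _ in range(20):
        latent, loss = train_step(latent, x=[1.0, 2.0, -1.0, 0.5], target=1.5)
    print(ternarize(latent), loss)
```

Because the loss is always computed through `ternarize`, whatever accuracy the model reaches at the end of training is already the accuracy of the ternary model; there is no separate conversion step that can knock it off its minimum.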
I applaud the chutzpah of doing a company where you develop both hardware and software for the hardware. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.
Congrats on launching. This is inspiring.
Combine it with TOC, and then you’d really be off to the races!
https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...
Funnily enough, our ML engineer, Eddy, did a hackathon project working with Procyon to make a neural network with a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.
Edit: I don't think the company exists in its current form anymore
> This represents an almost 8x compression ratio for every weight matrix in the transformer model
Surely you’d need more ternary weights though to achieve same performance outcome?
A bit like a Q4 quant is smaller than a Q8 but also tangibly worse so the “compression” isn’t really like for like
Either way, excited about more ternary progress.
We do quantization-aware training, so the model should minimize the loss w.r.t. the ternary weights, hence no degradation in performance.
Is one expectation from moving from a 2^16 state parameter to a tristate one that the tristate one will only need to learn the number of states of the 2^16 states that were actually significant? I.E. we can prune the "extra" bits from the 2^16 that did not really affect the result?
Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration rom into a shift register chain, instead of fixed. This would allow updating the weights without having to go through the whole production chain again.
We actually were thinking about this to flush the weights in at initialization
Cool.... if you want to make a general purpose compute engine out of it, you could go full BitGrid[1]. ;-)
[1] https://bitgrid.blogspot.com/2005/03/bitgrid-story.html
This seems super cool. I'll have my cofounder look into it :)
There’s more to it. https://x.com/NolanoOrg/status/1813969329308021167
I will be archiving the full report with more results soon.
I should note that our linear layers are not the same as Microsoft's. In fact, we think Microsoft made a mistake in the code they uploaded. When I have time later today, I'll link to where I think they made a mistake.
I've been following TriLLM. They've achieved great results, and I'm really impressed with the llama.cpp contributors already getting the models integrated.
An area worth exploring is IP cameras imho
1. They are everywhere and aren't going anywhere.
2. Network infrastructure to ingest and analyze thousands of cameras producing video footage is very demanding.
3. Low power and low latency scream ASIC to me.
There was another founder that said this exact same thing. We'll definitely look into it especially as we train more ViTs.
Have you tried implementing your ternary transformers on AVX(-512)? I think it fits relatively well with the hardware philosophy, and being able to run inference without a GPU would be a big plus.
Our CPU implementation for X86/AMD64 utilizes AVX-512 or AVX-2 instructions where possible. We're experimenting with support for ARM with NEON.
What kind of code did you try on the CPU for, say, ternary GEMM? I imagine ternary values map nicely to vectorized mask instructions, and that much of the tiling etc. from the usual GEMM carries over.
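For anyone wondering what the mask idea looks like, here is a scalar stand-in, assuming nothing about the actual kernels: the function name is hypothetical and a real AVX-512 version would use mask registers and masked adds over whole vectors with GEMM-style tiling on top. Each weight row is treated as two masks, one selecting the +1 lanes and one selecting the -1 lanes, so the matrix-vector product needs only adds and a subtract, no multiplies.

```python
def ternary_matvec(weight_rows, x):
    """Compute y = W @ x for W with entries in {-1, 0, 1} without multiplying."""
    y = []
    for row in weight_rows:
        # Lanes selected by the +1 mask are accumulated...
        pos = sum(xi for w, xi in zip(row, x) if w == 1)
        # ...lanes selected by the -1 mask are accumulated separately...
        neg = sum(xi for w, xi in zip(row, x) if w == -1)
        # ...and zero weights are skipped entirely.
        y.append(pos - neg)
    return y


if __name__ == "__main__":
    W = [[1, 0, -1],
         [-1, -1, 1]]
    print(ternary_matvec(W, [2.0, 3.0, 5.0]))
```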
What is the upper bound on the level of improvement (high performance networking, memory and compute) you can achieve with ternary weights?
Is there a possibility where this can run on a specialized hardware which is neither a CPU nor GPU, e.g. NextSilicon Maverick chips?
Great project, looking forward to seeing more as this develops.
Also FYI, your mail server seems to be down.
Thank you, and good catch.
We recently acquired deepsilicon.com, and it looks like the forwarding hasn't been registered yet. abhi@deepsilicon.net should work.
Congrats, always cool to see YC founders working on silicon!
The most popular interfaces (human, API and network) I can imagine are ChatGPT, OpenAI compatible HTTP API, Transformers HuggingFace API and models, Llama.cpp / Ollama / Llamafile, Pytorch. USB C, USB A, RJ45, HDMI/video(?) If you can run a frontier model or a comparable model with the ChatGPT clone like Open UI, with a USB or LAN interface, that can work on private data quickly, securely and competitively to a used 3090 it would be super badass. It should be easy to plug in and be used for running chat or API use or fine-tune or use with raw primitives via Pytorch or a very similar compatible API. I've thought about this a bit. There's more I could say but I've got to sleep soon... Good luck, it's an awesome opportunity.
Have you sat in on my conversations with my cofounder?
The end plan is to have a single chip and flush all weights onto the chip at initialization. Because we are a single line of code that is Torch compatible (hence HF compatible), every other part of the codebase shouldn't change.
I've not but that sounds cool! I would point out though, in terms of mind share, how memorable, and how relatable and useful the products are: it might help to have ways that directly show the application for the kinds of people buying GPUs for inference and training or using cloud for this that would love to not have to fight their ATX case in a hot sweaty corner while repeatedly dropping screwdrivers and calculating how much RAM they need to buy for the 405B while llama.cpp is recompiling again... I think people would throw money at that. I'd be happy to listen in or have a chat some time!
Can this run crysis?
Can this run Doom?
Can it generate Doom at runtime?
Yeah, I've been thinking about this problem for a while from the gate level. I think the problem essentially breaks down to a couple of pop counts and a subtract, and it's eminently pipelineable.
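A sketch of the pop-count formulation, under one loud assumption: the activations are binary (0 or 1) and packed into a bitmask. The helper names are hypothetical. The ternary weights are split into two bit-planes, one marking the +1 positions and one marking the -1 positions, and a dot product is then two ANDs, two pop counts, and one subtract.

```python
def to_bitplanes(weights):
    """Split ternary weights into (+1 mask, -1 mask), bit i for weight i."""
    pos = neg = 0
    for i, w in enumerate(weights):
        if w == 1:
            pos |= 1 << i
        elif w == -1:
            neg |= 1 << i
    return pos, neg


def to_bits(activations):
    """Pack binary activations (0 or 1) into an integer bitmask."""
    bits = 0
    for i, a in enumerate(activations):
        if a:
            bits |= 1 << i
    return bits


def popcount(value):
    """Number of set bits; a single instruction on most hardware."""
    return bin(value).count("1")


def ternary_dot_binary(pos, neg, x_bits):
    """Dot product of ternary weights with binary activations."""
    return popcount(pos & x_bits) - popcount(neg & x_bits)


if __name__ == "__main__":
    pos, neg = to_bitplanes([1, -1, 0, 1, -1])
    print(ternary_dot_binary(pos, neg, to_bits([1, 1, 1, 0, 1])))
```

With ternary activations as well, the same trick takes four pop counts (matching signs minus opposing signs); wider integer activations need a different scheme.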
Ternary transformers existed for a long time before you guys: TerDiT, vision ones, etc. Competing in the edge inference space is likely going to require a lot of capex and opex, plus breaking into markets like defense that are hard asf without connections and a strong team. Neither of you guys are chip architects either, and taping out silicon requires a lot of foresight into changing market demands. Good luck, hopefully it works out.
Very interesting!
you might want to redo the video as it's cropped too much, and maybe it's only me but it's _really_ annoying to watch like this.
Oops, good catch. Will re-upload shortly.
Thanks! We've updated the youtube link at the top to the fixed version.