
Origin E6
Optimal AI inference in a performance-, power-, and area-optimized package
The Origin™ E6 is a family of performance-optimized Neural Processing Unit (NPU) IP cores designed for image- and point-cloud-related Artificial Intelligence (AI) workloads in smartphones, AR/VR headsets, and other devices. Such products demand performance-intensive AI inference but must balance that performance against power and area requirements. The Origin E6 family minimizes power consumption through careful attention to utilization and reduced external memory requirements while keeping latency to an absolute minimum. The E6 offers performance from 16 to 32 TOPS.
Native Execution: a New NPU Paradigm
Typical AI accelerators, often repurposed CPUs (Central Processing Units) or GPUs (Graphics Processing Units), rely on a complex software stack that converts a neural network into a long sequence of basic instructions. Executing these instructions tends to be inefficient, with processor utilization typically ranging from just 20 to 40%. Taking a new approach, Expedera designed Origin specifically as an NPU that executes the neural network directly using metadata, achieving sustained utilization averaging 80%. The metadata indicates the function of each layer (such as convolution or pooling) and other important details, such as the size and shape of the convolution. No changes to your trained neural networks are required, and there is no perceivable reduction in model accuracy. This approach greatly simplifies the software, and Expedera provides a robust stack based on Apache TVM. Expedera's native execution eases the adoption of new models and reduces time to market.
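For illustration, here is a minimal sketch of what per-layer metadata of this kind could look like. The structure, field names, and values below are assumptions for explanatory purposes, not Expedera's actual metadata format.

```python
# Illustrative only: a hypothetical per-layer metadata descriptor in the
# spirit of the description above. Not Expedera's actual format.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LayerMetadata:
    op: str                          # layer function, e.g. "conv2d" or "max_pool"
    input_shape: Tuple[int, ...]     # shape of the incoming tensor (N, H, W, C)
    kernel_shape: Optional[Tuple[int, int]] = None  # e.g. (3, 3) for a 3x3 conv
    stride: Tuple[int, int] = (1, 1)
    activation: Optional[str] = None

# A network described as a list of layer descriptors rather than
# compiled into a long stream of low-level instructions:
network = [
    LayerMetadata("conv2d", (1, 224, 224, 3), kernel_shape=(3, 3),
                  stride=(2, 2), activation="relu"),
    LayerMetadata("max_pool", (1, 112, 112, 32), kernel_shape=(2, 2),
                  stride=(2, 2)),
]
```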
Support for a Wide Variety of Neural Networks
Artificial Intelligence is a fast-developing field, and new neural networks are released almost daily. Expedera's Origin E6 NPU supports RNNs, CNNs, LSTMs, and other neural network architectures. Native support is provided for Inception, MobileNet, YOLOv3, FSRCNN, EfficientNet, U-Net, and many other neural networks with input resolutions up to 8K, including support for transformers. In addition, Origin includes support for custom and proprietary networks and offers provisions for future-proofing your design.
Optimized for Your Specific Needs
While there are many general-purpose AI processors, a one-size-fits-all solution is rarely the most efficient. General-purpose AI processors are often larger than needed for a specific application and consume more power than necessary. The Origin E6 IP cores are optimized for a customer's application-specific area and power requirements. Whether performance-optimized for a single network or a series of networks, sized to meet silicon area constraints, or configured to meet system power requirements, the E6 provides optimal PPA (power, performance, area). Expedera can typically achieve superior performance in about half the silicon area required by other NPUs. During the design process, Expedera works with clients to understand their specific application needs and constraints and provides cycle-accurate PPA estimations before delivery of the IP.
The customer-reported data set below illustrates how Expedera's natively efficient architecture, combined with optimization, leads to huge performance gains. Running a 4K video denoising algorithm, the customer wanted to increase throughput, since their former NPU could process only a few frames per second (FPS). Using Expedera's Origin, FPS performance grew by 20X while consuming less than half the power, a more than 40X increase in PPA.
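The arithmetic behind that figure is straightforward. The sketch below reproduces it using frames per second per watt as the proxy metric (area held constant); the normalized baseline values are assumptions for illustration.

```python
# Customer-reported gain: 20x the frames per second at less than half
# the power implies more than a 40x gain in FPS per watt.
baseline_fps, baseline_power_w = 1.0, 1.0        # normalized baseline NPU
origin_fps = 20 * baseline_fps                   # 20x throughput
origin_power_w = 0.5 * baseline_power_w          # "less than half the power"

gain = (origin_fps / origin_power_w) / (baseline_fps / baseline_power_w)
print(gain)  # 40.0; anything under half power pushes the gain above 40x
```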
Market-leading Power Efficiency
Understanding the comparative power efficiencies of NPUs can be complicated. Ours isn't: Expedera's Origin family averages a market-leading 18 TOPS/W, assuming a TSMC 7nm process running ResNet50 at INT8 precision throughout with a 1 GHz system clock. No sparsity, compression, or pruning is applied, though all are supported and may further increase power efficiency. Origin has repeatedly been cited as the most power-efficient NPU available.
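Those headline numbers are easy to sanity-check. Assuming the conventional count of two operations per MAC (one multiply, one accumulate), the sketch below shows how the 8K and 16K MAC configurations map to the quoted 16 and 32 TOPS at a 1 GHz clock, and what power draw the 18 TOPS/W figure would imply; dividing peak TOPS by an effective TOPS/W number is only a back-of-envelope estimate.

```python
# Each MAC is conventionally counted as two operations (multiply + add).
def peak_tops(num_macs: int, clock_hz: float = 1e9) -> float:
    return num_macs * 2 * clock_hz / 1e12

for macs in (8 * 1024, 16 * 1024):
    tops = peak_tops(macs)            # ~16.4 and ~32.8 TOPS at 1 GHz
    watts = tops / 18.0               # rough power implied by 18 TOPS/W
    print(f"{macs} MACs -> {tops:.1f} TOPS, ~{watts:.1f} W at 18 TOPS/W")
```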
Silicon-Proven and Deployed in Millions of Consumer Products
Choosing the right AI processor can ‘make or break’ a design. The Origin architecture is silicon-proven in leading-edge process nodes and successfully shipped in millions of consumer devices worldwide.
- 16 to 32 TOPS performance, with power efficiency up to 18 TOPS/W
- Scalable performance up to 16K MACs
- Supports input resolutions up to 8K
- Advanced activation memory management
- Low latency
- Compatible with various DNN models
- Hardware scheduler for neural networks
- Processes models as trained; no software optimizations needed
- Works with familiar open-source platforms like TFLite (see the import sketch after this list)
- Delivered as soft IP: portable to any process
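Because the software stack is based on Apache TVM, bringing a trained model in follows the standard TVM import flow. The sketch below uses TVM's public ONNX frontend; the model file, input name and shape, and the generic "llvm" CPU target are placeholder assumptions, not Expedera-specific values.

```python
# A minimal sketch of importing a trained model through Apache TVM,
# the open-source stack Origin's software is based on.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")                 # placeholder model path
shape_dict = {"input": (1, 3, 224, 224)}             # assumed input name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for a generic CPU target here; a vendor flow would substitute
# its own target and codegen. The trained model itself is unchanged.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```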
Compute Capacity | 8K to 16K INT8 MACs
Multi-tasking | Runs up to 2 simultaneous jobs
Power Efficiency | 18 effective TOPS/W (INT8)
NN Support | CNN, RNN, LSTM, and other NN architectures
Layer Support | Standard NN functions, including Conv, Deconv, FC, Activations, Reshape, Concat, Elementwise, Pooling, Softmax, etc.; programmable general FP functions, including Sigmoid, Tanh, Sine, Cosine, Exp, etc.
Data Types | INT4/INT8/INT10/INT12/INT16 activations/weights; FP16/BFloat16 activations/weights
Quantization | Channel-wise quantization (TFLite specification); optional custom quantization based on workload needs (see the sketch after this table)
Latency | Deterministic performance guarantees
Memory | Smart system memory allocation and scheduling
Frameworks | TensorFlow, TFLite, ONNX
Workloads | Capable of running large DNN networks
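The Quantization row above references TFLite-style channel-wise quantization. As a minimal sketch of the idea, assuming symmetric INT8 weights with a zero-point of 0 and one scale per output channel, per the TFLite quantization specification (the weight shape is an illustrative assumption):

```python
import numpy as np

def quantize_per_channel(weights: np.ndarray):
    """Symmetric INT8, zero-point 0, one scale per output channel."""
    flat = weights.reshape(weights.shape[0], -1)
    scales = np.abs(flat).max(axis=1) / 127.0        # per-channel scale
    q = np.round(flat / scales[:, None]).astype(np.int8)
    return q.reshape(weights.shape), scales

w = np.random.randn(32, 3, 3, 3).astype(np.float32)   # assumed conv weights (O, H, W, I)
q, s = quantize_per_channel(w)
w_hat = q.astype(np.float32) * s[:, None, None, None] # dequantize to check
print(float(np.abs(w - w_hat).max()))                 # small quantization error
```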
Advantages
- Industry-leading performance and power efficiency
- Architected to serve a wide range of compute requirements
- On-chip memory and system DRAM work together to improve bandwidth
- Drastically reduces memory requirements
- Deterministic, real-time performance
- Flexible for changing applications
- Simple software stack
- Achieves the same accuracy as your trained model
- Simplifies deployment to end customers
Benefits
- Efficiency: industry-leading 18 TOPS/W enables greater processing efficiencies with lower power consumption
- Simplicity: eliminates complicated compilers, easing design complexity, reducing cost, and speeding time-to-market
- Configurability: independently configurable building blocks allow for design optimization and right-sized deployments
- Predictability: deterministic performance with quality-of-service (QoS) guarantees
- Scalability: from 16 to 32 TOPS, a single scalable architecture addresses a wide range of application performance requirements
- Deployability: best-in-market TOPS/mm² assures an ideal balance of processing capability and chip size
