
Origin E2
Optimal performance in a power- and area-optimized package
The Origin™ E2 is a family of power- and area-optimized Neural Processing Unit (NPU) IP cores designed for use in smartphones, edge nodes, and other devices. The Origin E2 family saves system power through careful attention to processor utilization and memory requirements while optimizing performance and reducing latency. The E2 offers highly configurable performance from 1 to 20 TOPS.
Native Execution: A New NPU Paradigm
Typical AI accelerators, often repurposed CPUs (Central Processing Units) or GPUs (Graphics Processing Units), rely on a complex software stack that converts a neural network into a long sequence of basic instructions. Execution of these instructions tends to be inefficient, with processor utilization as low as 20 to 40%. Taking a new approach, Expedera designed Origin specifically as an NPU that executes the neural network directly using metadata, achieving sustained utilization averaging 80%. The metadata indicates the function of each layer (such as convolution or pooling) and other important details, such as the size and shape of the convolution. No changes to your trained neural networks are required, and there is no perceptible reduction in model accuracy. This approach greatly simplifies the software, and Expedera provides a robust stack based on Apache TVM. Expedera's native execution eases the adoption of new models and reduces time to market.
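As an illustration, a per-layer description of this kind might look like the minimal Python sketch below. The field names and structure are hypothetical, chosen only to show the idea of describing a layer's function and shape rather than compiling it into low-level instructions; they are not Expedera's actual metadata format.

```python
# A hypothetical sketch of per-layer metadata of the kind described
# above: each entry names the layer's function plus its shape details.
# Field names are illustrative only, not Expedera's actual format.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class LayerMetadata:
    op: str                       # layer function, e.g. "conv2d", "pool"
    input_shape: Tuple[int, ...]  # activation shape (N, H, W, C)
    kernel: Tuple[int, int]       # convolution/pooling window size
    stride: Tuple[int, int]
    out_channels: int

# A network becomes a list of such descriptions rather than a long
# stream of low-level instructions:
network = [
    LayerMetadata("conv2d", (1, 224, 224, 3), (7, 7), (2, 2), 64),
    LayerMetadata("pool",   (1, 112, 112, 64), (3, 3), (2, 2), 64),
]
```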
Market-leading Power Efficiency
Understanding the comparative power efficiencies of NPUs can be complicated. Ours isn't: Expedera's Origin family averages a market-leading 18 TOPS/W, assuming a TSMC 7nm process, running ResNet50 at INT8 precision throughout with a 1GHz system clock. No sparsity, compression, or pruning is applied, though all are supported and may further increase power efficiency. Origin has repeatedly been cited as the most power-efficient NPU available by third parties and customers alike.
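For context, an "effective TOPS/W" figure of this kind is typically derived from measured inference throughput and power rather than peak MAC capacity. The sketch below shows only the arithmetic; the throughput and power values are placeholders, not Expedera measurements, and ResNet-50's roughly 7.7 GOPs per inference is a commonly cited approximation.

```python
# Hypothetical arithmetic behind an "effective TOPS/W" figure.
# The throughput and power numbers below are placeholders, not
# measured Expedera data.

OPS_PER_INFERENCE = 7.7e9        # ~7.7 GOPs per ResNet-50 pass (2 ops per MAC)

inferences_per_second = 2_300    # placeholder: measured sustained throughput
power_watts = 1.0                # placeholder: measured average power draw

# Effective TOPS counts only the operations the model actually needs,
# so it reflects real utilization rather than peak MAC capacity.
effective_tops = OPS_PER_INFERENCE * inferences_per_second / 1e12
effective_tops_per_watt = effective_tops / power_watts

print(f"{effective_tops:.1f} effective TOPS at {power_watts:.1f} W "
      f"= {effective_tops_per_watt:.1f} TOPS/W")
```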
In the customer-reported example below, Expedera’s power consumption is compared against the customer’s former NPU solution using an identical Neural Network (NN). Both NPUs are built in the same process node, so this head-to-head comparison provides a true measure of Expedera’s power savings—more than a 50% reduction with no sacrifice to accuracy, size, or performance.
Support for a Wide Variety of Neural Networks
Artificial Intelligence is a fast-developing science, and new neural networks are released almost daily. Expedera’s Origin E2 NPU supports RNN, CNN, LSTM, and other neural networks. Native support is provided for Inception, MobileNet, YOLO v3, FSRCNN, EfficientNet, Unet and many other neural networks with input resolutions up to 4K. In addition, Origin includes support for custom and proprietary networks and offers provisions for future-proofing your design.
Optimized for Your Specific Needs
While there are many general-purpose AI processors, a one-size-fits-all solution is rarely the most efficient. General-purpose AI processors are often larger than needed for a specific application and consume more power than necessary. The Origin E2 IP cores are optimized for a customer's application-specific area and power requirements. Whether performance-optimized for a single network or a series of networks, sized to meet silicon area constraints, or configured to meet system power requirements, the E2 provides optimal PPA (power, performance, area). Expedera can typically achieve superior performance in about half the silicon area required by other NPUs. During the design process, Expedera works with clients to understand their specific application needs and constraints and provides cycle-accurate PPA estimations before delivery of the IP.
Silicon-Proven and Deployed in Millions of Consumer Products
Choosing the right AI processor can ‘make or break’ a design. The Origin architecture is silicon-proven in leading-edge process nodes and successfully shipped in millions of consumer devices worldwide.
- Power efficient: 18 TOPS/W
- Scalable performance from 0.5K to 10K INT8 MACs
- Capable of processing real-time HD video and images on-chip
- Advanced activation memory management
- Low latency
- Tunable for specific workloads
- Hardware scheduler for neural networks
- Support for standard NN functions including Convolution, Deconvolution, FC, Activations, Reshape, Concat, Elementwise, Pooling, Softmax, Bilinear
- Processes models as trained, with no need for software optimizations
- Works with familiar open-source platforms like TFLite
- Delivered as soft IP: portable to any process
| Feature | Specification |
| --- | --- |
| Compute Capacity | 0.5K to 10K INT8 MACs |
| Power Efficiency | 18 effective TOPS/W (INT8) |
| Number of Jobs | Single |
| NN Support | CNN, RNN, and other NN architectures |
| Layer Support | Standard NN functions, including Conv, Deconv, FC, Activations, Reshape, Concat, Elementwise, Pooling, Softmax, Bilinear |
| Data Types | INT4/INT8/INT16 activations and weights |
| Quantization | Channel-wise quantization (TFLite specification); optional custom quantization based on workload needs (see the sketch below this table) |
| Latency | Optimized for lowest latency with deterministic guarantees |
| Memory | Smart on-chip dynamic memory allocation algorithms |
| Frameworks | TensorFlow, TFLite, ONNX |
| Workloads | Capable of processing 4K video and 8K images on-chip |
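As a reference point for the channel-wise quantization entry above, the following is a minimal sketch of the standard TFLite post-training quantization flow, which produces per-channel INT8 weights by default. The model path and calibration data are placeholders, and this shows the stock TensorFlow tooling, not Expedera's own toolchain.

```python
import numpy as np
import tensorflow as tf

# Placeholder calibration data; real use would sample the training
# distribution instead of random inputs.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

# "saved_model_dir" is a placeholder path to a trained model.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Conv weights are quantized per-channel by default in this flow.
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```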
Advantages
- Industry-leading performance and power efficiency (up to 18 TOPS/W)
- Architected to serve a wide range of compute requirements
- Drastically reduces memory requirements; no off-chip DRAM required
- Runs trained models unchanged, with no need for hardware-dependent optimizations
- Deterministic, real-time performance
- Improved performance for your workloads while still running a breadth of models
- Simple software stack
- Achieves the same accuracy as your trained model
- Simplifies deployment to end customers
Benefits
- Efficiency: industry-leading 18 TOPS/W enables greater processing efficiencies with lower power consumption
- Simplicity: eliminates complicated compilers, easing design complexity, reducing cost, and speeding time-to-market
- Configurability: independently configurable building blocks allow for design optimization and right-sized deployments
- Predictability: deterministic performance with quality-of-service (QoS) guarantees
- Scalability: from 1 to 20 TOPS a single scalable architecture addresses a wide range of application performance requirements
- Deployability: best-in-market TOPS/mm² assures an ideal balance of processing capability and chip size
