Taper: A Minimal Deep Learning Core
Taper: Building a Neural Network Library in Rust
Taper is a small neural network library written in Rust. The code is on GitHub: https://github.com/vaibhawvipul/taper
It’s not production-ready, and it’s not trying to replace existing frameworks. The goal was to learn: how automatic differentiation works, how to write fast kernels, and how low-level choices affect performance on CPU.
Performance (M4 Mac, CPU only):
- MLP on MNIST: ~0.22s/epoch (PyTorch baseline: 1.5s/epoch)
- CNN on MNIST: ~12s/epoch for batch size 256 (PyTorch baseline: 120s/epoch)
Tape-Based Autograd
Taper uses a tape to record operations during the forward pass. Each op pushes a closure that knows how to compute gradients. Backprop just walks the tape in reverse:
let (a, b, out) = (self.clone(), other.clone(), output.clone());
Tape::push_binary_op(self, other, &output, move || {
    // For addition the local gradient is 1, so the output gradient
    // flows straight through to both inputs.
    if let Some(gout) = out.grad.read().unwrap().as_ref() {
        accumulate_grad(&a, gout);
        accumulate_grad(&b, gout);
    }
});
This keeps autograd simple and fully dynamic.
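The reverse walk itself is just popping closures off a stack. A minimal sketch of the idea (illustrative types, not Taper's exact internals):

struct Tape {
    // Each recorded op contributes one backward closure.
    ops: Vec<Box<dyn FnOnce()>>,
}

impl Tape {
    fn backward(mut self) {
        // Pop in reverse order so every output gradient has been
        // accumulated before the closures that consume it run.
        while let Some(op) = self.ops.pop() {
            op();
        }
    }
}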
Matrix Multiplication and BLAS
Most of the training time is spent in matrix multiplication.
Taper uses two paths:
- BLAS when available (Apple Accelerate, Intel MKL, OpenBLAS):
// C = alpha * A·B + beta * C, row-major, no transposes.
// This is an FFI call into CBLAS, so it lives in an unsafe block.
unsafe {
    cblas_sgemm(
        CblasRowMajor, CblasNoTrans, CblasNoTrans,
        m, n, k,                  // A is m×k, B is k×n, C is m×n
        1.0, a.as_ptr(), lda,     // alpha, A, leading dimension of A
        b.as_ptr(), ldb,          // B, leading dimension of B
        0.0, c.as_mut_ptr(), ldc, // beta, C, leading dimension of C
    );
}
- Fallback Rust GEMM: cache-blocked SGEMM with SIMD micro-kernels (sketched below).
This hybrid design keeps it portable and fast.
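For the fallback path, here is a simplified sketch of what cache blocking looks like without the SIMD micro-kernel. Block sizes and names are illustrative, not Taper's tuned values:

// Row-major blocked SGEMM: C += A·B, with A m×k, B k×n, C m×n.
// Looping over MC×KC×NC tiles keeps the working set cache-resident.
const MC: usize = 64;
const KC: usize = 64;
const NC: usize = 64;

fn sgemm_blocked(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for i0 in (0..m).step_by(MC) {
        for p0 in (0..k).step_by(KC) {
            for j0 in (0..n).step_by(NC) {
                // Inner loops stay inside one tile of each operand.
                for i in i0..(i0 + MC).min(m) {
                    for p in p0..(p0 + KC).min(k) {
                        let aip = a[i * k + p];
                        for j in j0..(j0 + NC).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}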
SIMD Kernels
Elementwise ops don’t benefit from BLAS, so Taper uses explicit SIMD:
use std::arch::x86_64::*;

#[target_feature(enable = "avx")]
unsafe fn add_f32_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Process 8 f32 lanes at a time in 256-bit AVX registers.
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        let res = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(out.as_mut_ptr().add(i * 8), res);
    }
    // Scalar tail for lengths that are not a multiple of 8.
    for i in chunks * 8..a.len() {
        out[i] = a[i] + b[i];
    }
}
Supported backends: AVX2 (x86), NEON (ARM), scalar fallback.
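On ARM the same elementwise kernel is written with NEON intrinsics. A sketch, assuming the same slice layout as the AVX version above:

use std::arch::aarch64::*;

#[target_feature(enable = "neon")]
unsafe fn add_f32_neon(a: &[f32], b: &[f32], out: &mut [f32]) {
    // 128-bit NEON registers hold 4 f32 lanes.
    let chunks = a.len() / 4;
    for i in 0..chunks {
        let va = vld1q_f32(a.as_ptr().add(i * 4));
        let vb = vld1q_f32(b.as_ptr().add(i * 4));
        vst1q_f32(out.as_mut_ptr().add(i * 4), vaddq_f32(va, vb));
    }
    // Scalar tail for the remainder.
    for i in chunks * 4..a.len() {
        out[i] = a[i] + b[i];
    }
}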
Convolution Kernels
Convolution uses different strategies depending on the case:
- 3×3 direct: unrolled loops with the 9 multiply–adds written out, no im2col (see the sketch below).
- 1×1: just a matmul.
- General: im2col -> GEMM.
This avoids overhead for common cases.
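For example, the 3×3 direct path boils down to nine explicit multiply–adds per output pixel. A single-channel, stride-1, no-padding sketch (names are illustrative, not Taper's internals):

// One output pixel of a 3×3 convolution: the unrolled form lets the
// compiler keep all nine weights in registers.
#[inline]
fn conv3x3_pixel(input: &[f32], w: &[f32; 9], width: usize, i: usize, j: usize) -> f32 {
    let r0 = i * width + j;      // top row of the 3×3 patch
    let r1 = r0 + width;         // middle row
    let r2 = r1 + width;         // bottom row
    input[r0] * w[0] + input[r0 + 1] * w[1] + input[r0 + 2] * w[2]
        + input[r1] * w[3] + input[r1 + 1] * w[4] + input[r1 + 2] * w[5]
        + input[r2] * w[6] + input[r2 + 1] * w[7] + input[r2 + 2] * w[8]
}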
Note: im2col (image to column) is a data transformation used in CNNs that rearranges input patches into columns, so a convolution becomes a single general matrix–matrix multiplication (GEMM).
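A minimal single-channel sketch of the transform (stride 1, no padding; the function name and layout here are illustrative, not Taper's API):

// Rearranges each kh×kw patch of an h×w input into one column.
// Output layout: (kh * kw) rows × (h_out * w_out) columns, row-major,
// so convolution becomes weights · columns.
fn im2col(input: &[f32], h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let h_out = h - kh + 1;
    let w_out = w - kw + 1;
    let mut cols = vec![0.0f32; kh * kw * h_out * w_out];
    for ki in 0..kh {
        for kj in 0..kw {
            let row = ki * kw + kj;
            for oi in 0..h_out {
                for oj in 0..w_out {
                    let col = oi * w_out + oj;
                    cols[row * h_out * w_out + col] = input[(oi + ki) * w + (oj + kj)];
                }
            }
        }
    }
    cols
}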
Operation Fusion
Common patterns are fused to reduce memory traffic:
pub fn conv2d_relu(&self, weight: &Tensor, bias: Option<&Tensor>) -> Tensor {
    // Convolve, then apply ReLU in place so the output buffer is written once.
    let conv_out = self.conv2d(weight, bias, (1, 1), (0, 0), (1, 1));
    conv_out.relu_inplace()
}
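Because the ReLU runs in place on the convolution's output buffer, the intermediate feature map is written once and never makes a second round trip through memory, which is what matters when the op is bandwidth-bound rather than compute-bound.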
Parallelism and Memory Layout
- Batch and channel parallelism via Rayon:
output
    .par_chunks_mut(h_out * w_out)
    .enumerate()
    .for_each(|(idx, out_spatial)| { /* ... */ });
- Cache-blocked transpose and matmul to stay inside L1/L2 cache.
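As a sketch of the idea, a cache-blocked transpose copies square tiles so both the source rows and the destination columns stay resident in cache (the block size and names here are illustrative):

// Transpose a rows×cols row-major matrix into dst (cols×rows),
// one BLOCK×BLOCK tile at a time.
const BLOCK: usize = 32;

fn transpose_blocked(src: &[f32], dst: &mut [f32], rows: usize, cols: usize) {
    for i0 in (0..rows).step_by(BLOCK) {
        for j0 in (0..cols).step_by(BLOCK) {
            for i in i0..(i0 + BLOCK).min(rows) {
                for j in j0..(j0 + BLOCK).min(cols) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}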
Minimal Scope
The library includes:
- Tensor ops with autograd
- Linear, Conv2D, Pooling, ReLU/Sigmoid
- SGD and Adam
- Simple training loops (MNIST examples)
- About 3k lines of code in total.
Lessons Learned
- BLAS level-3 routines supercharge GEMM; use them whenever they're available.
- Use explicit SIMD for elementwise ops, which BLAS doesn't cover.
- Memory bandwidth is the bottleneck; fusing ops and blocking is critical.
- Rust makes parallelism safer to write.
- A minimal design is easier to optimize.
Try It
git clone https://github.com/vaibhawvipul/taper.git
cd taper
cargo run --release --features blas-accelerate --example train_mnist
cargo run --release --features blas-accelerate --example train_mnist_cnn
Taper is a compact experiment in deep learning internals. It shows how far careful engineering can take you, even on CPU, and makes the tradeoffs explicit in code that’s small enough to read in a weekend.