Phase 1 Live ARM + Intel Profiling Platform

Stop guessing
where you're slow.

Multi-architecture profiling platform for embedded systems.

Measure exactly where your code stands on the roofline model — compute-bound or memory-bound, and how far from peak hardware performance. Works across ARM Cortex-M/A and Intel x86-64.

Cortex-M/A
Intel x86
Join Waitlist
< 1 sprint to first profiling result
Minimal instrumentation overhead
sabueso — roofline.plot
LIVE
[Figure: roofline plot. X axis: Arithmetic Intensity (FLOPs/byte). Y axis: GFLOP/s. Ceilings: Peak Performance, L1 Bandwidth, L2/L3 Bandwidth, DRAM Bandwidth. Regions: MEMORY BOUND, COMPUTE BOUND. Points: kernel_a (DRAM Bound), kernel_b (L2 Bound), kernel_c (Compute Bound).]
main.c — sabueso_cortex_a
#include "sabueso_cortex_a.h"
// Init hardware performance counters
SabuesoCtx ctx;
sabueso_backend_init(&ctx);
SabuesoOutputBuffer buf = 0;
// Profile the critical path
SABUESO_MEASURE(&buf, your_function(args));
// → Roofline: MEMORY BOUND @ 0.18x ceiling
// → Cache miss rate: 12.4% | BW util: 87%

Clear answers.
Without the noise.

Generic profiling tools often create more noise than signal. Sabueso is built to provide actionable performance insights.

Analysis Paralysis

Flamegraphs

Flamegraphs visualize everything at once, making it nearly impossible to distinguish between "expectedly slow" and "accidentally slow" code.

Data Overload

Standard Tracing

Generates gigabytes of raw event data. You spend more time writing scripts to filter noise than actually optimizing.

Zero Context

Cycle Counters

Cycle counters tell you that you are slow, but not why. Is it a memory bottleneck or a compute stall? You're left guessing.

cycles: 1,847,392
// ...but why?
Actionable Intelligence

Our Edge

Instead of overwhelming you with data, we give you a single roofline coordinate: Compute-Bound or Memory-Bound — and exactly how far from peak you are.

→ MEMORY BOUND @ 0.18x ceiling
→ Optimize: cache locality first
→ Expected gain: 4–8x
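
For readers new to the roofline model, the classification behind a line like "MEMORY BOUND @ 0.18x ceiling" can be sketched in a few lines of C. This is an illustration of the standard roofline formula (attainable performance is the lower of the compute peak and arithmetic intensity times memory bandwidth), not Sabueso's implementation. The `Machine` and `RooflinePoint` types, the `roofline_classify` function, and all hardware numbers are assumptions made up for the example.

```c
#include <assert.h>

/* Hypothetical machine description: two ceilings of the roofline. */
typedef struct {
    double peak_gflops;   /* compute ceiling, GFLOP/s */
    double mem_bw_gbs;    /* memory bandwidth ceiling, GB/s */
} Machine;

/* Where a measured kernel lands on the roofline. */
typedef struct {
    double intensity;          /* FLOPs per byte moved */
    double achieved_gflops;    /* measured throughput */
    double attainable_gflops;  /* min(peak, intensity * bandwidth) */
    double fraction_of_peak;   /* achieved / compute ceiling */
    int    memory_bound;       /* 1 if the bandwidth roof is the limit */
} RooflinePoint;

static RooflinePoint roofline_classify(Machine m, double flops,
                                       double bytes, double seconds)
{
    RooflinePoint p;
    double mem_roof;

    assert(bytes > 0.0 && seconds > 0.0 && m.peak_gflops > 0.0);

    p.intensity = flops / bytes;
    p.achieved_gflops = flops / seconds / 1e9;

    /* Bandwidth roof at this intensity: (FLOPs/byte) * (GB/s) = GFLOP/s. */
    mem_roof = p.intensity * m.mem_bw_gbs;

    /* Left of the ridge point the bandwidth roof is lower than the peak. */
    p.memory_bound = mem_roof < m.peak_gflops;
    p.attainable_gflops = p.memory_bound ? mem_roof : m.peak_gflops;
    p.fraction_of_peak = p.achieved_gflops / m.peak_gflops;
    return p;
}
```

With an assumed machine of 10 GFLOP/s peak and 10 GB/s bandwidth, a kernel that performs 1.8e9 FLOPs over 9e9 bytes in one second has an intensity of 0.2 FLOPs/byte, is classified as memory-bound, and sits at 0.18 of the compute ceiling.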

Performance profiling
for your ARM hardware.

Purpose-built for the three domains where ARM performance matters most — and where generic tools fail.

ARM Cortex-M & A

Firmware Performance

Profile crypto, sensor, and control-loop code on ARM Cortex-M and Cortex-A chips. Pinpoint cache bottlenecks in production firmware before shipping.

STM32H5/F4/H7
nRF52840
Raspberry Pi 4/5
Real-time Systems

Robotics

Optimize real-time control loops and sensor fusion pipelines. Reduce jitter and ensure deterministic performance for autonomous systems.

Control loop profiling
Jitter analysis
Sensor fusion pipelines
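
As a rough idea of what jitter analysis on a control loop involves, the sketch below summarizes a series of measured loop periods. It is a generic illustration, not Sabueso's analysis: the `JitterStats` type, the `jitter_analyze` function, and the choice of metrics (peak-to-peak spread, worst deviation from the nominal period, and overrun count) are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of control-loop timing, all values in microseconds. */
typedef struct {
    uint32_t min_us;
    uint32_t max_us;
    uint32_t peak_to_peak_us;  /* max - min */
    uint32_t worst_dev_us;     /* largest |period - nominal| */
    size_t   overruns;         /* periods longer than nominal + tolerance */
} JitterStats;

static JitterStats jitter_analyze(const uint32_t *period_us, size_t n,
                                  uint32_t nominal_us, uint32_t tolerance_us)
{
    JitterStats s;
    size_t i;

    assert(period_us != NULL && n > 0);

    s.min_us = period_us[0];
    s.max_us = period_us[0];
    s.worst_dev_us = 0;
    s.overruns = 0;

    for (i = 0; i < n; i++) {
        uint32_t p = period_us[i];
        /* Absolute deviation without going through signed arithmetic. */
        uint32_t dev = p > nominal_us ? p - nominal_us : nominal_us - p;

        if (p < s.min_us) s.min_us = p;
        if (p > s.max_us) s.max_us = p;
        if (dev > s.worst_dev_us) s.worst_dev_us = dev;
        if (p > nominal_us + tolerance_us) s.overruns++;
    }

    s.peak_to_peak_us = s.max_us - s.min_us;
    return s;
}
```

For a nominal 1000 µs loop with a 10 µs tolerance, the samples {1000, 1002, 998, 1050, 1001} give a peak-to-peak jitter of 52 µs, a worst deviation of 50 µs, and one overrun.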
Industrial ICS

ICS Determinism

Validate SCADA commissioning and prove cycle-exact determinism on industrial controllers. Catch interrupt latency spikes before they reach the plant floor.

SCADA commissioning validation
Interrupt latency profiling
Cycle-exact determinism

Actionable insights
in minutes.

01

Install

Add the Sabueso library to your project dependencies. Single header file. No OS required. Minimal instrumentation overhead.

#include "sabueso_cortex_a.h"
02

Annotate Your Code

Wrap performance-critical functions with SABUESO_MEASURE macros. Or use explicit start/stop markers for fine-grained regions.
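
To make the difference between the wrapping macro and explicit markers concrete, here is a host-side sketch of how the two styles can relate to each other. It does not show Sabueso's internals: `DemoRegion`, `DEMO_START`, `DEMO_STOP`, `DEMO_MEASURE`, and `busy_sum` are hypothetical names, and the standard C `clock()` function stands in for the hardware performance counters a real backend would read.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical accumulator for one measured region. */
typedef struct {
    clock_t  start;      /* timestamp taken by DEMO_START */
    double   elapsed_s;  /* total CPU time spent inside the region */
    unsigned samples;    /* number of completed START/STOP pairs */
} DemoRegion;

/* Explicit markers: bracket any sequence of statements. */
#define DEMO_START(r) ((r)->start = clock())
#define DEMO_STOP(r)                                                      \
    do {                                                                  \
        (r)->elapsed_s += (double)(clock() - (r)->start) / CLOCKS_PER_SEC;\
        (r)->samples++;                                                   \
    } while (0)

/* Wrapping form: the same markers placed around a single call. */
#define DEMO_MEASURE(r, call) \
    do { DEMO_START(r); (void)(call); DEMO_STOP(r); } while (0)

/* Small workload to measure; volatile keeps the loop from being removed. */
static unsigned long busy_sum(unsigned long n)
{
    volatile unsigned long acc = 0;
    unsigned long i;
    for (i = 0; i < n; i++) acc += i;
    return acc;
}
```

The wrapping form keeps instrumentation to one line per call site, while the explicit markers cover regions that span several statements. Both feed the same accumulator, so their results can be compared directly.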

03

Analyze Bottlenecks

Review cycle counts, cache miss rates, and your roofline plot position. Know instantly if you're memory- or compute-bound, and by how much.
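
The metrics in this step are simple ratios over raw counter values. The sketch below shows the arithmetic under stated assumptions: the `RawCounters` struct and its field names are hypothetical, and which hardware events feed each field depends on the target's PMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of raw counters for one measured region. */
typedef struct {
    uint64_t cache_refs;    /* cache accesses */
    uint64_t cache_misses;  /* accesses that missed */
    uint64_t bytes_moved;   /* bytes transferred to or from memory */
    uint64_t flops;         /* floating-point operations executed */
    double   seconds;       /* wall time of the region */
} RawCounters;

/* Share of cache accesses that missed, in percent. */
static double miss_rate_pct(const RawCounters *c)
{
    assert(c->cache_refs > 0);
    return 100.0 * (double)c->cache_misses / (double)c->cache_refs;
}

/* Achieved bandwidth as a percentage of the platform's peak bandwidth. */
static double bw_utilization_pct(const RawCounters *c,
                                 double peak_bw_bytes_per_s)
{
    assert(c->seconds > 0.0 && peak_bw_bytes_per_s > 0.0);
    return 100.0 * ((double)c->bytes_moved / c->seconds) / peak_bw_bytes_per_s;
}

/* FLOPs per byte moved: the x coordinate on the roofline plot. */
static double arithmetic_intensity(const RawCounters *c)
{
    assert(c->bytes_moved > 0);
    return (double)c->flops / (double)c->bytes_moved;
}
```

For example, 124 misses out of 1000 cache accesses is a 12.4% miss rate, and moving 8710 bytes in one second on a link that peaks at 10000 bytes per second is 87.1% bandwidth utilization.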

04

Optimize & Iterate

A/B test your optimizations against the roofline baseline. Measure real impact. Iterate with confidence until you hit the hardware ceiling.

inference_pipeline.c
Profiling
#include "sabueso_cortex_a.h"
// ── Step 1: Initialize ─────────────────────
SabuesoCtx ctx;
sabueso_backend_init(&ctx);
SabuesoOutputBuffer buf = 0;
// ── Step 2: Annotate ───────────────────────
SABUESO_MEASURE(&buf, run_inference(model, input));
// Or explicit markers for custom regions:
SABUESO_START();
  preprocess_tensor(input);
  run_layer(conv1);
SABUESO_STOP(&buf);
// ── Output ─────────────────────────────────
Roofline Position: MEMORY BOUND
Ceiling utilization: 18.3%
Cache miss rate: 12.4%
BW utilization: 87.1%
Arith. intensity: 0.21 FLOPs/byte
→ Recommendation: optimize cache locality
→ Potential speedup: 4.2–8.7x
SaaS

Fleet Telemetry

Connected devices upload real-time performance data. Teams see roofline trends over time and identify performance variations across hardware units in production.

Real-time upload
Fleet dashboard
Trend analysis
CI/CD

Regression Tracking

Track instruction-level performance regressions on the roofline model. Block PRs that increase resource usage above defined thresholds. Never ship a regression again.

GitHub Actions
PR gates
Roofline diff
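
A threshold gate of this kind boils down to comparing each metric against a stored baseline. The sketch below is one possible shape for that check, not Sabueso's CI integration: `GateMetric`, `increase_pct`, and `gate_count_violations` are hypothetical names, and the thresholds are example values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-metric gate: baseline, new value, allowed growth. */
typedef struct {
    const char *name;
    double baseline;          /* value recorded on the target branch */
    double current;           /* value measured on the pull request */
    double max_increase_pct;  /* allowed growth before the gate fails */
} GateMetric;

/* Relative change from baseline to current, in percent. */
static double increase_pct(const GateMetric *m)
{
    assert(m->baseline > 0.0);
    return 100.0 * (m->current - m->baseline) / m->baseline;
}

/* Number of metrics whose growth exceeds their threshold. */
static size_t gate_count_violations(const GateMetric *metrics, size_t n)
{
    size_t i, violations = 0;
    for (i = 0; i < n; i++) {
        if (increase_pct(&metrics[i]) > metrics[i].max_increase_pct)
            violations++;
    }
    return violations;
}
```

In a CI job, a nonzero violation count would typically be mapped to a nonzero exit status so that the PR check fails. A 4% rise in cycles passes a 5% threshold, while a 20% rise in cache misses fails a 10% threshold.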
TEMU

Thermal Profiling

Uses Thermal Energy Management Units to predict power and thermal characteristics. Throttle code before hardware hits thermal safety limits — proactively.

Power prediction
Thermal modeling
Auto-throttle
Intel x86
HPC
Intel Raptor Lake
Available
Intel Alder Lake
In Development
ARM Cortex-A
Application processors
Alif E Series
Cortex-A32
Available
Raspberry Pi 4/5
Cortex-A72/A76
Available
NVIDIA Jetson Orin
Cortex-A78AE
In Development
Ampere Altra
Neoverse N1
AWS Graviton 3
Neoverse V1
ARM Cortex-M
Microcontrollers
Alif E Series
Cortex-M55
Available
STM32H5
Cortex-M33
Available
STM32G4
Cortex-M4
Available
STM32H7
Cortex-M7
In Development

Don't see your platform?

We're actively expanding support. Let us know what you need.

Request a Platform

Where we're going.

The profiler core is live. Vertical integrations for Wearables, ROS 2, and Industrial ICS are in active development.

Phase 1 Available now

Profiler core

C library and Python SDK. Instrument code regions, collect hardware PMU counters, analyze with Roofline.

C Library

SABUESO_MEASURE() macros to collect performance metrics via hardware counters with minimal footprint.

Cortex-M (STM32, nRF)
Cortex-A (RPi 4/5)
Alif E Series
Intel x86

Python SDK

Python bindings for the C library. Call sabueso.measure() directly from Python-based codebases such as ROS 2 nodes, control scripts, and robotics pipelines.

ROS 2 node integration
Same PMU counters as C lib
pip install
Phase 2 Vertical integrations

Segment-specific tooling

Vertical-specific SDK extensions, report templates, and integrations built on top of the profiler core.

Wearables

Binary-only profiling. One sprint. Customer acceptance proven.

JTAG/ETM binary profiling
Energy budget reports
IP confidentiality preserved

ROS 2

Unified latency visibility. ROS2-native. < 0.02% overhead.

Scheduler tracing
DDS topic latency breakdown
ros2_tracing + LTTng-UST integration
Phase 3 Mission-critical future

Mission-critical validation

High-stakes verticals where determinism requirements and commissioning rigour demand a dedicated validation layer.

Industrial ICS

SCADA commissioning validation. Determinism proof. < 0.5% overhead.

Want to influence which segments land next?

Join the waitlist

Be first to know
when we ship.

Join the waitlist for early access to Sabueso. We'll reach out when your platform is ready, or when new features launch.

No spam. We'll only email you when it matters.

No spam, ever
Unsubscribe anytime