Your mission
- Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
- Validate that collected traces accurately reflect real workload behavior by verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
- Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM and others) to extract meaningful execution data without disrupting performance
- Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
- Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns
Your profile
- 3+ years of experience in AI systems, ML infrastructure, or a closely related area
- Hands-on experience with at least one major LLM serving or training framework
- Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
- Solid understanding of distributed communication
- Familiarity with parallelism strategies and how they shape execution behavior across large clusters
- Open source contributions or published research in relevant areas will definitely be appreciated!
- Previous startup experience is a plus: we move fast and value people who are comfortable with that
Why us?
- Build something big: Help build and scale a fast-growing AI infrastructure startup
- Pay & perks: Competitive compensation with a performance-based incentive, subsidized Deutschlandticket, and access to a discount portal
- Work your way: Flexible hours with hybrid and remote-friendly options
- Fast lanes, no red tape: Flat hierarchies and rapid decision-making mean ideas ship quickly
- Global team: Work with a diverse, international team across Germany and the USA
- Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
- Top setup: Your choice of high-quality hardware and equipment
- Relocation support: We’ll help make your move to join us as smooth as possible
About us
turbalance is an innovative, emerging startup that transforms AI laws. We are a team of passionate problem-solvers who believe in what we’re building. We constantly push boundaries and embrace our inner nerds as we find new ways to tackle complex challenges. You will find a dynamic work environment here, with flat or even non-existent hierarchies and the chance to take on responsibility from day one.
