How We Score
Our scoring system measures how engineers work with AI across 7 calibrated dimensions in 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. The same behavior is scored differently depending on what the task demands.
Process Over Output
Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result. Base scores are deterministic from telemetry; optional LLM enhancement adds qualitative evidence (bounded to ±15 points per dimension).
Calibrated AI Judgment
(65% of score) This is where the gap between strong and weak AI-augmented engineers is widest. Scores are calibrated per task — the same behavior is scored differently depending on what the task demands.
Calibrated Trust
25% weight · hybrid. Does the engineer's level of verification match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.
Signals Measured
- Whether verification intensity matches the risk level of the change
- Appropriate trust calibration — accepting low-risk suggestions efficiently while scrutinizing high-risk ones
- Critical evaluation of AI output relative to task complexity
- Evidence of independent judgment in accepting or modifying AI suggestions
Context Engineering
20% weight · deterministic. How effectively does the engineer select and provide context to the AI? Quality over quantity — 200 relevant lines beat 2000 lines of noise. Scored against task-specific rubric files and key context targets.
Signals Measured
- Quality of investigation before seeking AI assistance
- Relevance and precision of context provided to the AI
- Whether prompts include meaningful architectural and behavioral constraints
- Context quality evolution as understanding deepens
Problem Decomposition
20% weight · deterministic. Does the engineer think before prompting? Exploration time is calibrated per task — a production triage task expects faster orientation than a complex refactor.
Signals Measured
- Exploration depth appropriate to the situation
- Evidence of structured thinking before acting
- Prompt specificity and targeted problem framing
- Whether complex problems are broken into manageable, scoped steps
Technical Execution
(25% of score) Core engineering skills that remain essential regardless of AI assistance.
Debugging & Recovery
12% weight · hybrid. When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?
Signals Measured
- Systematic approach to root-cause identification
- Use of appropriate debugging techniques vs. blind re-prompting
- Quality of recovery when an approach fails
- Speed of recognizing dead ends and pivoting
Architectural Judgment
8% weight · deterministic. Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations for scope and pattern adherence.
Signals Measured
- Respect for existing codebase patterns and conventions
- Deliberate decisions about scope and placement of changes
- Resistance to unnecessary AI-introduced complexity
Code Review Quality
5% weight · hybrid. Can they critically evaluate code — whether AI-generated or human-written?
Signals Measured
- Specificity and actionability of review feedback
- Ability to distinguish critical issues from minor style concerns
- Quality of suggested alternatives and explanations
Efficiency
(10% of score) Speed without quality is negative value — this is intentionally low-weighted.
Workflow Efficiency
10% weight · deterministic. Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.
Signals Measured
- Effective use of available development tools
- Productive momentum without unnecessary context switches
- Appropriate tool selection for the task at hand
- Read-before-write patterns indicating thoughtful workflow
Our scoring uses a combination of deterministic telemetry analysis and AI-assisted evaluation. Signal weighting includes per-session randomization to prevent pattern memorization.
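To make the deterministic aggregation concrete, here is a minimal sketch of a weighted sum with per-session weight randomization. The function names, the 2% jitter size, and the seeding scheme are all illustrative assumptions — only the seven dimension names and their weights come from this page.

```python
import random

# Dimension weights as documented on this page.
BASE_WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def session_weights(session_seed: int, jitter: float = 0.02) -> dict:
    """Perturb each base weight by up to +/-jitter for this session,
    then renormalize so the weights still sum to 1.0.
    (The jitter size and seeding scheme are assumptions.)"""
    rng = random.Random(session_seed)
    perturbed = {k: w * (1 + rng.uniform(-jitter, jitter))
                 for k, w in BASE_WEIGHTS.items()}
    total = sum(perturbed.values())
    return {k: w / total for k, w in perturbed.items()}

def overall_base_score(dimension_scores: dict, session_seed: int) -> float:
    """Weighted sum of per-dimension scores (each 0-100). Because the
    seed is fixed per session, the same session always reproduces
    the same base score."""
    weights = session_weights(session_seed)
    return sum(dimension_scores[k] * weights[k] for k in BASE_WEIGHTS)
```

Seeding the generator per session is what makes the randomized weighting still deterministic: replaying the same session yields exactly the same score.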
Behavioral Patterns
Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.
Calibrated Expert
+5 to +8. Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.
Methodical Verifier
+2 to +5. Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.
Explore-Plan-Execute
+3 to +5. High orientation time → specific prompts → targeted verification. Strong signal of structured problem-solving.
Recovery Pivot
+3 to +6. Initial approach fails → recognizes dead end → pivots strategy → succeeds. Shows resilience, experience, and intellectual honesty.
Context Blind
-5 to -10. Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions to specific problems.
Spray and Pray
-8 to -15. Immediate vague prompts → accept first output → no verification. The weakest signal of engineering judgment.
Comprehension Checks
After submission, candidates answer 3-5 targeted questions about their work. This prevents the "AI-generated code you don't understand" problem — research shows developers using AI score 17% lower on comprehension tests than those coding manually.
Grade Scale
The overall score (0-100) maps to a letter grade and performance band.
| Grade | Score Range | Band | What It Means |
|---|---|---|---|
| S | 90-100 | Exceptional | Top-tier performance across all dimensions. Rare. |
| A | 80-89 | Strong | Consistently strong AI collaboration. Recommended hire. |
| B | 70-79 | Competent | Solid fundamentals with room to grow. |
| C | 60-69 | Developing | Some good patterns, but significant gaps. |
| D | 50-59 | Needs Work | Below expectations for the role. |
| F | <50 | Significant Gaps | Fundamental skills not demonstrated. |
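The grade table above is a straightforward threshold lookup; a minimal sketch (function name and band order are assumptions):

```python
def letter_grade(score: float) -> tuple:
    """Map an overall 0-100 score to (letter grade, performance band)
    per the documented grade scale."""
    bands = [
        (90, "S", "Exceptional"),
        (80, "A", "Strong"),
        (70, "B", "Competent"),
        (60, "C", "Developing"),
        (50, "D", "Needs Work"),
    ]
    for cutoff, grade, band in bands:
        if score >= cutoff:
            return (grade, band)
    return ("F", "Significant Gaps")
```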
Scoring Reliability
Zero Rater Variance
4 of 7 dimensions are fully deterministic — computed from telemetry signals with per-session weight randomization to prevent pattern memorization. The same session always produces the same base scores. No human rater, no subjectivity.
Bounded LLM Enhancement
For 3 hybrid dimensions, LLM analysis can adjust scores by at most ±15 points. Each adjustment includes confidence level (high/medium/low) and written justification.
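The ±15-point bound amounts to clamping whatever adjustment the LLM proposes before it touches the deterministic base score. A minimal sketch (the function name is an assumption; the 15-point cap comes from this page):

```python
def apply_llm_adjustment(base_score: float, proposed: float, cap: float = 15.0) -> float:
    """Clamp the LLM-proposed adjustment to +/-cap before applying it,
    then keep the final dimension score on the 0-100 scale."""
    bounded = max(-cap, min(cap, proposed))
    return max(0.0, min(100.0, base_score + bounded))
```

Clamping at the boundary (rather than trusting the model's raw number) is what keeps a hybrid dimension anchored to its telemetry-derived base.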
Evidence Trail
Every score links to specific timestamped events in the session. You can review the raw evidence — prompts, edits, test runs — that produced each number.
Outcome Modifier
After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.
| Final Test Pass Rate | Score Adjustment |
|---|---|
| 100% | No penalty |
| 70-99% | -5 points |
| 30-69% | -10 points |
| 1-29% | -15 points |
| 0% | -20 points |
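The outcome modifier table maps directly to a banded lookup; a minimal sketch (the function name is an assumption, the bands are from the table above):

```python
def outcome_adjustment(pass_rate: float) -> int:
    """Map the final test pass rate (0.0-1.0) to the documented
    score adjustment."""
    if pass_rate >= 1.0:
        return 0
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20
```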
Why Process Over Output
Two engineers can produce identical, passing code through fundamentally different processes. One methodically reads the codebase, identifies the root cause, provides targeted context to AI, and verifies each change. The other pastes the error message into AI, accepts the first suggestion, hits a wall, re-prompts three times, and eventually lands on the same fix by trial and error.
The output is the same. The engineering behind it is not. And which process an engineer uses predicts how they will perform on the problems that actually matter in production — ambiguous bugs, unfamiliar codebases, cascading failures, and high-stakes refactors where brute-forcing with AI does not work.
Engineer A: Methodical
1. Reads the failing test and traces execution through 3 files
2. Identifies the race condition in the connection pool
3. Prompts AI with the specific function, mutex pattern, and constraints
4. Reviews the suggestion, catches a missing edge case, adjusts
5. Runs tests, verifies the fix, submits
Engineer B: Trial-and-Error
1. Pastes error message into AI chat immediately
2. Accepts first suggestion, tests fail differently
3. Re-prompts: “that didn’t work, try something else”
4. Third attempt produces a passing fix
5. Submits without reviewing what changed
How This Compares to Traditional Assessments
Most technical assessments were designed for a pre-AI world. They measure the wrong things or measure the right things in the wrong way.
| Method | What It Measures | Blind Spots |
|---|---|---|
| Take-Home Projects | Final output quality | No visibility into process. AI can generate an entire project in minutes. You cannot distinguish thoughtful engineering from sophisticated prompting. Unpaid labor deters strong candidates. |
| Whiteboard / LeetCode | Algorithm recall under pressure | Does not reflect actual work. No one implements a red-black tree in production. Measures memorization, not engineering judgment. Biased toward recent grads who just studied. |
| Timed Coding Challenges | Speed of code production | Rewards fast typing over careful thinking. Penalizes engineers who verify. With AI generating code instantly, speed of code production is meaningless. |
| DynaLab | Full engineering process: how engineers think, investigate, use AI, verify, and recover | Captures 30+ telemetry signals across 7 dimensions. Scores the process that produces the output, not just the output itself. Detects behavioral patterns that predict production reliability. |
Example Scorecard Walkthrough
Here is what each dimension captures in practice, using a production debugging task (fixing a connection pool exhaustion bug) as a concrete example.
Calibrated Trust
25% weight · 91/100. The candidate accepted a simple import fix without extra verification (appropriate for low risk) but ran the full test suite after the AI suggested restructuring the pool’s mutex strategy (appropriate for high risk). Their verification intensity matched the risk level of each change.
Context Engineering
20% weight · 78/100. Attached the pool configuration and the specific failing test to their prompt, not the entire file. Lost points for not including the existing timeout logic from a related module, which the task rubric flagged as critical context.
Problem Decomposition
20% weight · 85/100. Spent 2.5 minutes reading the pool implementation and test output before their first AI prompt. First prompt was specific: “The pool leaks connections when acquire times out mid-handshake — the cleanup in line 142 only fires on success.” Strong signal of structured thinking.
Debugging & Recovery
12% weight · 72/100. First fix attempt introduced a deadlock. The candidate recognized the problem from test output within 30 seconds and pivoted to a channel-based approach. Good recovery speed, but the initial approach showed limited consideration of concurrency constraints.
Architectural Judgment
8% weight · 88/100. Fix touched only 2 files and followed the existing error-handling pattern. Rejected an AI suggestion to add a new retry middleware (scope creep) and kept changes within the pool module.
Code Review Quality
5% weight · 65/100. Reviewed the AI’s output but missed a subtle resource leak in the error path. Comments were general (“looks good”) rather than specific. Demonstrates awareness of review but not depth.
Workflow Efficiency
10% weight · 80/100. Used the terminal to reproduce the issue and run targeted tests. Read files before writing. Maintained good flow between editor, AI, and terminal without unnecessary context switching.
Overall Result
This candidate demonstrates strong AI collaboration fundamentals with excellent calibrated trust and decomposition. Recovery from the deadlock was fast and decisive. Main growth area is code review depth — they tend to skim rather than critically evaluate AI-generated code.
See It In Action
View a complete sample scorecard with all dimensions scored, behavioral patterns detected, and evidence linked.