How We Score

Our scoring system measures how engineers work with AI across 7 calibrated dimensions in 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. The same behavior is scored differently depending on what the task demands.

Process Over Output

Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result. Base scores are deterministic from telemetry; optional LLM enhancement adds qualitative evidence (bounded to ±15 points per dimension).

TIER 1

Calibrated AI Judgment

(65% of score)

Where the gap between strong and weak AI-augmented engineers is widest. Scores are calibrated per task, so identical behavior can earn different marks depending on what the task demands.

Calibrated Trust

25% weight (hybrid)

Does the engineer's level of verification match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.

Signals Measured

  • Whether verification intensity matches the risk level of the change
  • Appropriate trust calibration — accepting low-risk suggestions efficiently while scrutinizing high-risk ones
  • Critical evaluation of AI output relative to task complexity
  • Evidence of independent judgment in accepting or modifying AI suggestions

Example

An engineer who accepts a simple import fix without testing but runs full test suites after an AI-suggested architectural change scores 85+. Testing everything equally — or testing nothing — both score lower.

Context Engineering

20% weight (deterministic)

How effectively does the engineer select and provide context to the AI? Quality over quantity — 200 relevant lines beat 2000 lines of noise. Scored against task-specific rubric files and key context targets.

Signals Measured

  • Quality of investigation before seeking AI assistance
  • Relevance and precision of context provided to the AI
  • Whether prompts include meaningful architectural and behavioral constraints
  • Context quality evolution as understanding deepens

Example

Attaching the relevant config file and test patterns before asking for a fix scores higher than pasting entire files without filtering.

Problem Decomposition

20% weight (deterministic)

Does the engineer think before prompting? Exploration time is calibrated per task — a production triage task expects faster orientation than a complex refactor.

Signals Measured

  • Exploration depth appropriate to the situation
  • Evidence of structured thinking before acting
  • Prompt specificity and targeted problem framing
  • Whether complex problems are broken into manageable, scoped steps

Example

A candidate who spends 3 minutes exploring a debugging task before prompting scores well. The same 3 minutes on a production triage (where speed matters) would score lower.

TIER 2

Technical Execution

(25% of score)

Core engineering skills that remain essential regardless of AI assistance.

Debugging & Recovery

12% weight (hybrid)

When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?

Signals Measured

  • Systematic approach to root-cause identification
  • Use of appropriate debugging techniques vs. blind re-prompting
  • Quality of recovery when an approach fails
  • Speed of recognizing dead ends and pivoting

Example

Spotting that the AI's fix passes tests but introduces a subtle race condition, then systematically narrowing the root cause.

Architectural Judgment

8% weight (deterministic)

Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations for scope and pattern adherence.

Signals Measured

  • Respect for existing codebase patterns and conventions
  • Deliberate decisions about scope and placement of changes
  • Resistance to unnecessary AI-introduced complexity

Example

A focused fix in 2-3 files that follows existing patterns scores higher than letting AI scatter changes across 10 files.

Code Review Quality

5% weight (hybrid)

Can they critically evaluate code — whether AI-generated or human-written?

Signals Measured

  • Specificity and actionability of review feedback
  • Ability to distinguish critical issues from minor style concerns
  • Quality of suggested alternatives and explanations

Example

Identifying a SQL injection vulnerability and suggesting parameterized queries scores higher than merely noting 'security issue'.

TIER 3

Efficiency

(10% of score)

Speed without quality is negative value — this is intentionally low-weighted.

Workflow Efficiency

10% weight (deterministic)

Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.

Signals Measured

  • Effective use of available development tools
  • Productive momentum without unnecessary context switches
  • Appropriate tool selection for the task at hand
  • Read-before-write patterns indicating thoughtful workflow

Example

Using the AI to read and understand code before writing, combined with the terminal for testing, while maintaining flow. Candidates are not penalized for taking time to be thorough.

Our scoring uses a combination of deterministic telemetry analysis and AI-assisted evaluation. Signal weighting includes per-session randomization to prevent pattern memorization.
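The per-session weight randomization could look something like the sketch below. The jitter magnitude, seeding scheme, and names here are illustrative assumptions, not the production implementation; the only documented properties are that weights vary per session and that the same session reproduces the same scores.

```python
import random

# Base dimension weights from the tiers above (sum to 1.0).
BASE_WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def randomized_weights(session_seed: int, jitter: float = 0.02) -> dict:
    """Perturb each weight by up to +/- jitter, then renormalize to sum to 1.

    Seeding per session keeps the randomization deterministic: replaying
    the same session always yields the same weights. The jitter size is
    an assumption for illustration.
    """
    rng = random.Random(session_seed)
    perturbed = {
        dim: max(0.0, w + rng.uniform(-jitter, jitter))
        for dim, w in BASE_WEIGHTS.items()
    }
    total = sum(perturbed.values())
    return {dim: w / total for dim, w in perturbed.items()}
```

Because the randomization is seeded per session, candidates cannot memorize which signals dominate, yet every replay of a given session remains reproducible.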

Behavioral Patterns

Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.

Calibrated Expert

+5 to +8

Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.

Methodical Verifier

+2 to +5

Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.

Explore-Plan-Execute

+3 to +5

High orientation time → specific prompts → targeted verification. Strong signal of structured problem-solving.

Recovery Pivot

+3 to +6

Initial approach fails → recognizes dead end → pivots strategy → succeeds. Shows resilience, experience, and intellectual honesty.

Context Blind

-5 to -10

Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions to specific problems.

Spray and Pray

-8 to -15

Immediate vague prompts → accept first output → no verification. The weakest signal of engineering judgment.
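In code, the pattern modifiers above amount to a small lookup step. The ranges come from the list; interpolating linearly within each range by signal strength is an assumption here, since how the real system picks a point inside a range is not documented.

```python
# Modifier ranges for each detected session pattern, per the list above.
PATTERN_MODIFIERS = {
    "calibrated_expert": (5, 8),
    "methodical_verifier": (2, 5),
    "explore_plan_execute": (3, 5),
    "recovery_pivot": (3, 6),
    "context_blind": (-10, -5),
    "spray_and_pray": (-15, -8),
}

def pattern_modifier(pattern: str, strength: float = 0.5) -> float:
    """Interpolate within the pattern's published range.

    strength in [0, 1] is a hypothetical signal-confidence knob:
    0 maps to the low end of the range, 1 to the high end.
    """
    lo, hi = PATTERN_MODIFIERS[pattern]
    return lo + (hi - lo) * strength
```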

Comprehension Checks

After submission, candidates answer 3-5 targeted questions about their work. This prevents the "AI-generated code you don't understand" problem — research shows developers using AI score 17% lower on comprehension tests than those coding manually.

  • Multiple choice: Root cause identification — can they explain what was actually wrong?
  • Multiple choice: Approach justification — why did they choose this approach over alternatives?
  • Multiple choice: Code comprehension — what does this specific code do and what trade-offs does it make?
  • Open-ended: Reflection — what would they do differently with more time?

Grade Scale

The overall score (0-100) maps to a letter grade and performance band.

Grade | Score Range | Band | What It Means
S | 90-100 | Exceptional | Top-tier performance across all dimensions. Rare.
A | 80-89 | Strong | Consistently strong AI collaboration. Recommended hire.
B | 70-79 | Competent | Solid fundamentals with room to grow.
C | 60-69 | Developing | Some good patterns, but significant gaps.
D | 50-59 | Needs Work | Below expectations for the role.
F | <50 | Significant Gaps | Fundamental skills not demonstrated.
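The grade table maps directly to a threshold lookup; this is an illustrative sketch, not product code:

```python
def grade(score: float) -> tuple:
    """Map an overall 0-100 score to (letter grade, band) per the table above."""
    bands = [
        (90, "S", "Exceptional"),
        (80, "A", "Strong"),
        (70, "B", "Competent"),
        (60, "C", "Developing"),
        (50, "D", "Needs Work"),
    ]
    for cutoff, letter, band in bands:
        if score >= cutoff:
            return (letter, band)
    return ("F", "Significant Gaps")
```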

Scoring Reliability

Zero Rater Variance

4 of 7 dimensions are fully deterministic — computed from telemetry signals with per-session weight randomization to prevent pattern memorization. The same session always produces the same base scores. No human rater, no subjectivity.

Bounded LLM Enhancement

For 3 hybrid dimensions, LLM analysis can adjust scores by at most ±15 points. Each adjustment includes confidence level (high/medium/low) and written justification.
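The ±15 bound amounts to clamping the LLM's proposed adjustment before applying it. Clamping the final result to the 0-100 scale is an assumption in this sketch:

```python
def apply_llm_adjustment(base_score: float, proposed: float) -> float:
    """Apply an LLM adjustment, bounded to +/-15 points per dimension.

    The proposed adjustment is clamped first; the final clamp to the
    0-100 scale is assumed, not documented.
    """
    bounded = max(-15.0, min(15.0, proposed))
    return max(0.0, min(100.0, base_score + bounded))
```

An adjustment of +30 on a base of 70 therefore lands at 85, not 100: the LLM can shade a deterministic score, never overturn it.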

Evidence Trail

Every score links to specific timestamped events in the session. You can review the raw evidence — prompts, edits, test runs — that produced each number.

Outcome Modifier

After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.

Final Test Pass Rate | Score Adjustment
100% | No penalty
70-99% | -5 points
30-69% | -10 points
1-29% | -15 points
0% | -20 points
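The modifier table translates to a simple threshold function. How pass rates below 1% but above zero are bucketed is an assumption here (treated as the 1-29% band):

```python
def outcome_modifier(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate (0.0 to 1.0),
    per the table above."""
    if pass_rate >= 1.0:
        return 0
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15   # assumed to cover any nonzero rate below 30%
    return -20
```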

Why Process Over Output

Two engineers can produce identical, passing code through fundamentally different processes. One methodically reads the codebase, identifies the root cause, provides targeted context to AI, and verifies each change. The other pastes the error message into AI, accepts the first suggestion, hits a wall, re-prompts three times, and eventually lands on the same fix by trial and error.

The output is the same. The engineering behind it is not. And which process an engineer uses predicts how they will perform on the problems that actually matter in production — ambiguous bugs, unfamiliar codebases, cascading failures, and high-stakes refactors where brute-forcing with AI does not work.

Engineer A: Methodical

  1. Reads the failing test and traces execution through 3 files
  2. Identifies the race condition in the connection pool
  3. Prompts AI with the specific function, mutex pattern, and constraints
  4. Reviews the suggestion, catches a missing edge case, adjusts
  5. Runs tests, verifies the fix, submits

Score: 84 (Grade A) — Strong calibrated trust, good decomposition

Engineer B: Trial-and-Error

  1. Pastes error message into AI chat immediately
  2. Accepts first suggestion, tests fail differently
  3. Re-prompts: “that didn’t work, try something else”
  4. Third attempt produces a passing fix
  5. Submits without reviewing what changed

Score: 47 (Grade F) — No decomposition, uncalibrated trust, no verification

How This Compares to Traditional Assessments

Most technical assessments were designed for a pre-AI world. They measure the wrong things or measure the right things in the wrong way.

Method | What It Measures | Blind Spots
Take-Home Projects | Final output quality | No visibility into process. AI can generate an entire project in minutes. You cannot distinguish thoughtful engineering from sophisticated prompting. Unpaid labor deters strong candidates.
Whiteboard / LeetCode | Algorithm recall under pressure | Does not reflect actual work. No one implements a red-black tree in production. Measures memorization, not engineering judgment. Biased toward recent grads who just studied.
Timed Coding Challenges | Speed of code production | Rewards fast typing over careful thinking. Penalizes engineers who verify. With AI generating code instantly, speed of code production is meaningless.
DynaLab | Full engineering process: how engineers think, investigate, use AI, verify, and recover | Captures 30+ telemetry signals across 7 dimensions. Scores the process that produces the output, not just the output itself. Detects behavioral patterns that predict production reliability.

Example Scorecard Walkthrough

Here is what each dimension captures in practice, using a production debugging task (fixing a connection pool exhaustion bug) as a concrete example.

Calibrated Trust

Weight: 25% · Score: 91/100

The candidate accepted a simple import fix without extra verification (appropriate for low risk) but ran the full test suite after the AI suggested restructuring the pool’s mutex strategy (appropriate for high risk). Their verification intensity matched the risk level of each change.

Context Engineering

Weight: 20% · Score: 78/100

Attached the pool configuration and the specific failing test to their prompt, not the entire file. Lost points for not including the existing timeout logic from a related module, which the task rubric flagged as critical context.

Problem Decomposition

Weight: 20% · Score: 85/100

Spent 2.5 minutes reading the pool implementation and test output before their first AI prompt. First prompt was specific: “The pool leaks connections when acquire times out mid-handshake — the cleanup in line 142 only fires on success.” Strong signal of structured thinking.

Debugging & Recovery

Weight: 12% · Score: 72/100

First fix attempt introduced a deadlock. The candidate recognized the problem from test output within 30 seconds and pivoted to a channel-based approach. Good recovery speed, but the initial approach showed limited consideration of concurrency constraints.

Architectural Judgment

Weight: 8% · Score: 88/100

Fix touched only 2 files and followed the existing error-handling pattern. Rejected an AI suggestion to add a new retry middleware (scope creep) and kept changes within the pool module.

Code Review Quality

Weight: 5% · Score: 65/100

Reviewed the AI’s output but missed a subtle resource leak in the error path. Comments were general (“looks good”) rather than specific. Demonstrates awareness of review but not depth.

Workflow Efficiency

Weight: 10% · Score: 80/100

Used the terminal to reproduce the issue and run targeted tests. Read files before writing. Maintained good flow between editor, AI, and terminal without unnecessary context switching.

Overall Result

Score: 81/100
Grade: A
Band: Strong
Pattern: Explore-Plan-Execute + Recovery Pivot

This candidate demonstrates strong AI collaboration fundamentals with excellent calibrated trust and decomposition. Recovery from the deadlock was fast and decisive. Main growth area is code review depth — they tend to skim rather than critically evaluate AI-generated code.
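As a rough cross-check, applying the base dimension weights to the scores above (ignoring pattern and outcome modifiers and any per-session weight randomization) lands near the published overall of 81:

```python
# Dimension scores from the walkthrough above, with their base weights.
scores = {
    "calibrated_trust": 91,
    "context_engineering": 78,
    "problem_decomposition": 85,
    "debugging_recovery": 72,
    "architectural_judgment": 88,
    "code_review_quality": 65,
    "workflow_efficiency": 80,
}
weights = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

# Weighted sum comes to roughly 82.3; the published 81 additionally
# reflects pattern and outcome modifiers, which are not modeled here.
base = sum(weights[d] * scores[d] for d in scores)
```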

See It In Action

View a complete sample scorecard with all dimensions scored, behavioral patterns detected, and evidence linked.