Scoring System
DynaLab.ai measures how engineers work with AI across 7 calibrated dimensions organized into 3 tiers. Every score is backed by telemetry evidence, not subjective judgment. Scores are calibrated per task: the same behavior is scored differently depending on what the task demands.
Process over output
Scoring Dimensions
The 7 dimensions are organized into 3 tiers by importance. Tier 1 accounts for 65% of the total score.
Tier 1: Calibrated AI Judgment (65%)
Calibrated Trust
Weight: 25% · Method: hybrid
Does the engineer's verification intensity match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.
Signals measured:
- Verification intensity relative to task-specific risk level
- Appropriate trust for low-risk AI suggestions (renames, formatting)
- Skepticism and testing for high-risk AI output (logic, architecture)
- Ratio of AI suggestions accepted vs. modified vs. rejected, calibrated to task complexity
Context Engineering
Weight: 20% · Method: deterministic
How effectively does the engineer select and provide context to the AI? Quality over quantity — scored against task-specific rubric files and key context targets.
Signals measured:
- Files explored before first AI prompt
- Relevant code/docs referenced in prompts vs. task rubric expectations
- Whether prompts include architectural constraints
- Quality of context vs. quantity (precision over volume)
Problem Decomposition
Weight: 20% · Method: deterministic
Does the engineer think before prompting? Exploration time is calibrated per task — a production triage expects faster orientation than a complex refactor.
Signals measured:
- Time spent reading/exploring before first AI interaction, calibrated to task archetype
- Whether they create a plan or outline before coding
- Prompt specificity (vague vs. targeted)
- Whether they break complex tasks into sequential, scoped prompts
Tier 2: Technical Execution (25%)
Debugging & Recovery
Weight: 12% · Method: hybrid
When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?
Signals measured:
- Root-cause identification speed and accuracy
- Whether they use systematic debugging vs. re-prompting blindly
- Recovery patterns: backtrack effectively or spiral?
- Dead-end detection speed and pivot quality
Architectural Judgment
Weight: 8% · Method: deterministic
Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations.
Signals measured:
- Respects existing patterns and conventions
- Doesn't introduce new dependencies without justification
- Makes deliberate decisions about where code lives
- Edit concentration and scope appropriateness vs. rubric expectations
Code Review Quality
Weight: 5% · Method: hybrid
Can they critically evaluate code — whether AI-generated or human-written?
Signals measured:
- Quality and specificity of review comments
- Whether they catch real issues vs. nitpicks
- Whether they explain why something is a problem
- Whether they suggest alternatives
Tier 3: Efficiency (10%)
Workflow Efficiency
Weight: 10% · Method: deterministic
Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.
Signals measured:
- AI tool feature usage (inline edit vs. chat vs. agent mode)
- Terminal proficiency and command diversity
- Appropriate tool selection for the task
- Read-before-write pattern
- Productive momentum without unnecessary context switches
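The seven dimension weights above sum to 100%, so the overall score can be read as a weighted average of the per-dimension scores. A minimal sketch of that aggregation, assuming each dimension is scored 0-100 (the dictionary keys are illustrative names, not official identifiers):

```python
# Dimension weights as documented above (fractions, sum to 1.0).
WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each on a 0-100 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

Because Tier 1 carries 65% of the weight, a weak Calibrated Trust score drags the overall result far more than a weak Workflow Efficiency score.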
Scoring Methods
- Deterministic (4 dimensions) — Computed entirely from telemetry signals. The same session always produces the same base scores. No human rater, no subjectivity.
- Hybrid (3 dimensions) — Deterministic base score enhanced by LLM analysis, which can adjust scores by at most ±15 points. Each adjustment includes confidence level and written justification.
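For the hybrid dimensions, the ±15-point cap means the LLM can refine a deterministic base score but never overturn it. A sketch of that bounding logic, assuming scores live on a 0-100 scale (the function name and signature are illustrative):

```python
def hybrid_score(base: float, llm_adjustment: float) -> float:
    """Deterministic base score plus an LLM adjustment clamped to
    +/-15 points, with the result kept within the 0-100 range."""
    adjustment = max(-15.0, min(15.0, llm_adjustment))
    return max(0.0, min(100.0, base + adjustment))
```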
Behavioral Patterns
Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.
| Pattern | Modifier | Description |
|---|---|---|
| Calibrated Expert | +5 to +8 | Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience. |
| Methodical Verifier | +2 to +5 | Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues. |
| Explore-Plan-Execute | +3 to +5 | High orientation time, specific prompts, targeted verification. Strong signal of structured problem-solving. |
| Recovery Pivot | +3 to +6 | Initial approach fails, recognizes dead end, pivots strategy, succeeds. Shows resilience and intellectual honesty. |
| Context Blind | -5 to -10 | Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions. |
| Spray and Pray | -8 to -15 | Immediate vague prompts, accept first output, no verification. The weakest signal of engineering judgment. |
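Each pattern above maps to a small modifier band. A sketch of applying a detected pattern to the overall score; the `strength` knob that interpolates within a band is an assumption for illustration, as is keeping the result clamped to 0-100:

```python
# Pattern -> (min, max) score modifier, from the table above.
PATTERN_MODIFIERS = {
    "calibrated_expert": (5, 8),
    "methodical_verifier": (2, 5),
    "explore_plan_execute": (3, 5),
    "recovery_pivot": (3, 6),
    "context_blind": (-10, -5),
    "spray_and_pray": (-15, -8),
}

def apply_pattern(score: float, pattern: str, strength: float) -> float:
    """Apply a detected pattern's modifier. `strength` in [0, 1] is an
    assumed interpolation point within the documented band; the real
    system may choose the point differently."""
    lo, hi = PATTERN_MODIFIERS[pattern]
    modifier = lo + (hi - lo) * max(0.0, min(1.0, strength))
    return max(0.0, min(100.0, score + modifier))
```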
Task Archetypes
Each task belongs to an archetype that calibrates scoring expectations. The same behavior produces different scores depending on the archetype.
| Archetype | Exploration Time | Verification Intensity | Example Tasks |
|---|---|---|---|
| Debugging | Moderate (2-5 min) | High — root cause must be verified | Connection pool fix, flaky test investigation |
| Production Triage | Low (1-2 min) | Moderate — speed matters | N+1 query, retry storm, goroutine leak |
| Code Review | High (5-10 min) | N/A — review is the verification | Security review, architecture review |
| Refactoring | High (5-8 min) | Very high — regressions are costly | State management, design system migration |
| DevOps / Infra | Moderate (3-5 min) | High — infrastructure changes are risky | K8s misconfiguration, CI pipeline, incident response |
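The exploration-time column above can be read as a per-archetype expectation band. A sketch of checking an engineer's orientation time against those bands (the archetype keys are illustrative identifiers; the band values come straight from the table):

```python
# Expected exploration time per archetype, in minutes (from the table above).
EXPLORATION_BANDS = {
    "debugging": (2, 5),
    "production_triage": (1, 2),
    "code_review": (5, 10),
    "refactoring": (5, 8),
    "devops_infra": (3, 5),
}

def exploration_within_band(archetype: str, minutes: float) -> bool:
    """True if orientation time falls inside the archetype's expected band."""
    lo, hi = EXPLORATION_BANDS[archetype]
    return lo <= minutes <= hi
```

This is why identical behavior scores differently across tasks: ninety seconds of exploration is on target for a production triage but well short of what a refactoring task expects.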
Grade Scale
The overall score (0-100) maps to a letter grade and performance band.
| Grade | Score Range | Band |
|---|---|---|
| S | 90-100 | Exceptional |
| A | 80-89 | Strong |
| B | 70-79 | Competent |
| C | 60-69 | Developing |
| D | 50-59 | Needs Work |
| F | <50 | Significant Gaps |
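The grade bands above are contiguous 10-point ranges with a descending cutoff at each letter. A minimal sketch of the mapping:

```python
def grade(score: float) -> str:
    """Map a 0-100 overall score to a letter grade per the table above."""
    bands = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # anything below 50
```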
Comprehension Checks
After submission, you answer 3-5 targeted questions about your work. This prevents the "AI-generated code you don't understand" problem.
- Root cause identification (MC) — Can you explain what was actually wrong?
- Approach justification (MC) — Why did you choose this approach over alternatives?
- Code comprehension (MC) — What does this specific code do and what trade-offs does it make?
- Reflection (Open-ended) — What would you do differently with more time?
Outcome Modifier
After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.
| Test Pass Rate | Adjustment |
|---|---|
| 100% | No penalty |
| 70-99% | -5 points |
| 30-69% | -10 points |
| 1-29% | -15 points |
| 0% | -20 points |
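The table above is a step function over the final pass rate. A sketch of that lookup, assuming the pass rate arrives as a fraction between 0.0 and 1.0:

```python
def outcome_adjustment(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate, per the table above.
    `pass_rate` is a fraction in [0.0, 1.0]."""
    if pass_rate >= 1.0:
        return 0    # all tests pass: no penalty
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20      # nothing passes
```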