Scoring System

DynaLab.ai measures how engineers work with AI across 7 calibrated dimensions organized into 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. Scores are calibrated per task: the same behavior is scored differently depending on what the task demands.

Process over output

Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result.

Scoring Dimensions

The 7 dimensions are organized into 3 tiers by importance. Tier 1 accounts for 65% of the total score.

Tier 1: Calibrated AI Judgment (65%)

Calibrated Trust

Weight: 25% · Method: hybrid

Does the engineer's verification intensity match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.

Signals measured:

  • Verification intensity relative to task-specific risk level
  • Appropriate trust for low-risk AI suggestions (renames, formatting)
  • Skepticism and testing for high-risk AI output (logic, architecture)
  • Ratio of AI suggestions accepted vs. modified vs. rejected, calibrated to task complexity

Context Engineering

Weight: 20% · Method: deterministic

How effectively does the engineer select and provide context to the AI? Quality over quantity — scored against task-specific rubric files and key context targets.

Signals measured:

  • Files explored before first AI prompt
  • Relevant code/docs referenced in prompts vs. task rubric expectations
  • Whether prompts include architectural constraints
  • Quality of context vs. quantity (precision over volume)

Problem Decomposition

Weight: 20% · Method: deterministic

Does the engineer think before prompting? Exploration time is calibrated per task — a production triage expects faster orientation than a complex refactor.

Signals measured:

  • Time spent reading/exploring before first AI interaction, calibrated to task archetype
  • Whether they create a plan or outline before coding
  • Prompt specificity (vague vs. targeted)
  • Whether they break complex tasks into sequential, scoped prompts

Tier 2: Technical Execution (25%)

Debugging & Recovery

Weight: 12% · Method: hybrid

When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?

Signals measured:

  • Root-cause identification speed and accuracy
  • Whether they use systematic debugging vs. re-prompting blindly
  • Recovery patterns: backtrack effectively or spiral?
  • Dead-end detection speed and pivot quality

Architectural Judgment

Weight: 8% · Method: deterministic

Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations.

Signals measured:

  • Respects existing patterns and conventions
  • Doesn't introduce new dependencies without justification
  • Makes deliberate decisions about where code lives
  • Edit concentration and scope appropriateness vs. rubric expectations

Code Review Quality

Weight: 5% · Method: hybrid

Can they critically evaluate code — whether AI-generated or human-written?

Signals measured:

  • Quality and specificity of review comments
  • Whether they catch real issues vs. nitpicks
  • Whether they explain why something is a problem
  • Whether they suggest alternatives

Tier 3: Efficiency (10%)

Workflow Efficiency

Weight: 10% · Method: deterministic

Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.

Signals measured:

  • AI tool feature usage (inline edit vs. chat vs. agent mode)
  • Terminal proficiency and command diversity
  • Appropriate tool selection for the task
  • Read-before-write pattern
  • Productive momentum without unnecessary context switches

Scoring Methods

  • Deterministic (4 dimensions) — Computed entirely from telemetry signals. The same session always produces the same base scores. No human rater, no subjectivity.
  • Hybrid (3 dimensions) — Deterministic base score enhanced by LLM analysis, which can adjust scores by at most ±15 points. Each adjustment includes confidence level and written justification.
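As a sketch of how the two methods could combine — the weights and the ±15 clamp come from the figures above, but the function and dimension names here are illustrative, not DynaLab.ai's actual implementation:

```python
# Illustrative sketch only; dimension keys and function names are hypothetical.
WEIGHTS = {
    "calibrated_trust": 0.25,       # hybrid
    "context_engineering": 0.20,    # deterministic
    "problem_decomposition": 0.20,  # deterministic
    "debugging_recovery": 0.12,     # hybrid
    "architectural_judgment": 0.08, # deterministic
    "code_review_quality": 0.05,    # hybrid
    "workflow_efficiency": 0.10,    # deterministic
}

def hybrid_score(base: float, llm_adjustment: float) -> float:
    """Deterministic base score plus an LLM adjustment clamped to +/-15 points."""
    delta = max(-15.0, min(15.0, llm_adjustment))
    return max(0.0, min(100.0, base + delta))

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of the 7 dimension scores (each 0-100)."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
```

Note that the weights sum to 1.0, so an engineer scoring 80 on every dimension gets an overall 80.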

Behavioral Patterns

Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.

  • Calibrated Expert (+5 to +8): Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.
  • Methodical Verifier (+2 to +5): Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.
  • Explore-Plan-Execute (+3 to +5): High orientation time, specific prompts, targeted verification. Strong signal of structured problem-solving.
  • Recovery Pivot (+3 to +6): Initial approach fails, recognizes the dead end, pivots strategy, succeeds. Shows resilience and intellectual honesty.
  • Context Blind (-5 to -10): Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions.
  • Spray and Pray (-8 to -15): Immediate vague prompts, acceptance of the first output, no verification. The weakest signal of engineering judgment.
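One way the modifier could be applied — the ranges come from the table above, but the interpolation mechanics are an assumption for illustration, not DynaLab.ai's published formula:

```python
# Hypothetical pattern modifiers as (mild end, strong end) of each range.
PATTERN_MODIFIERS = {
    "calibrated_expert": (5, 8),
    "methodical_verifier": (2, 5),
    "explore_plan_execute": (3, 5),
    "recovery_pivot": (3, 6),
    "context_blind": (-5, -10),
    "spray_and_pray": (-8, -15),
}

def apply_pattern(score: float, pattern: str, strength: float = 1.0) -> float:
    """Shift the overall score within the detected pattern's range.

    strength in [0, 1] interpolates from the mild end to the strong end
    of the range, depending on how pronounced the pattern is.
    """
    mild, strong = PATTERN_MODIFIERS[pattern]
    modifier = mild + (strong - mild) * strength
    return max(0.0, min(100.0, score + modifier))
```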

Task Archetypes

Each task belongs to an archetype that calibrates scoring expectations. The same behavior produces different scores depending on the archetype.

  • Debugging: moderate exploration (2-5 min); high verification — root cause must be verified. Example tasks: connection pool fix, flaky test investigation.
  • Production Triage: low exploration (1-2 min); moderate verification — speed matters. Example tasks: N+1 query, retry storm, goroutine leak.
  • Code Review: high exploration (5-10 min); verification N/A — the review is the verification. Example tasks: security review, architecture review.
  • Refactoring: high exploration (5-8 min); very high verification — regressions are costly. Example tasks: state management, design system migration.
  • DevOps / Infra: moderate exploration (3-5 min); high verification — infrastructure changes are risky. Example tasks: K8s misconfiguration, CI pipeline, incident response.
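The calibration above could be encoded as per-archetype expectations. A minimal sketch — the time windows are from the table, but the classification logic and names are assumptions:

```python
# Hypothetical per-archetype exploration-time windows, in minutes.
EXPLORATION_WINDOWS = {
    "debugging": (2, 5),
    "production_triage": (1, 2),
    "code_review": (5, 10),
    "refactoring": (5, 8),
    "devops_infra": (3, 5),
}

def exploration_fit(archetype: str, minutes: float) -> str:
    """Classify observed exploration time against the archetype's window."""
    lo, hi = EXPLORATION_WINDOWS[archetype]
    if minutes < lo:
        return "rushed"
    if minutes > hi:
        return "over-exploring"
    return "calibrated"
```

This is what "the same behavior produces different scores" means in practice: 2 minutes of orientation is calibrated for a production triage but rushed for a refactor.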

Grade Scale

The overall score (0-100) maps to a letter grade and performance band.

  • S (90-100): Exceptional
  • A (80-89): Strong
  • B (70-79): Competent
  • C (60-69): Developing
  • D (50-59): Needs Work
  • F (below 50): Significant Gaps
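The mapping is a simple threshold lookup; a sketch (function name is illustrative):

```python
def grade(score: float) -> tuple[str, str]:
    """Map a 0-100 overall score to a (letter grade, performance band) pair."""
    bands = [
        (90, "S", "Exceptional"),
        (80, "A", "Strong"),
        (70, "B", "Competent"),
        (60, "C", "Developing"),
        (50, "D", "Needs Work"),
    ]
    for cutoff, letter, band in bands:
        if score >= cutoff:
            return letter, band
    return "F", "Significant Gaps"
```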

Comprehension Checks

After submission, you answer 3-5 targeted questions about your work. This prevents the "AI-generated code you don't understand" problem.

  • Root cause identification (multiple choice) — Can you explain what was actually wrong?
  • Approach justification (multiple choice) — Why did you choose this approach over alternatives?
  • Code comprehension (multiple choice) — What does this specific code do and what trade-offs does it make?
  • Reflection (open-ended) — What would you do differently with more time?

Outcome Modifier

After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.

  • 100%: no penalty
  • 70-99%: -5 points
  • 30-69%: -10 points
  • 1-29%: -15 points
  • 0%: -20 points
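The bands above reduce to a threshold lookup on the pass rate; a minimal sketch (function name is illustrative):

```python
def outcome_penalty(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate, given as 0.0-1.0."""
    if pass_rate >= 1.0:
        return 0       # all tests pass: no penalty
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20         # nothing passes
```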