Scoring System
DynaLab.ai measures how engineers work with AI across 7 calibrated dimensions organized into 3 tiers. Every score is backed by telemetry evidence, not subjective judgment. Scores are calibrated per task: the same behavior is scored differently depending on what the task demands.
Process over output
Scoring Dimensions
The 7 dimensions are organized into 3 tiers by importance. Tier 1 accounts for 65% of the total score.
Tier 1: Calibrated AI Judgment (65%)
Calibrated Trust
Weight: 25% · Method: hybrid
Does the engineer's verification intensity match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.
Signals measured:
- Verification intensity relative to task-specific risk level
- Appropriate trust for low-risk AI suggestions (renames, formatting)
- Skepticism and testing for high-risk AI output (logic, architecture)
- Ratio of AI suggestions accepted vs. modified vs. rejected, calibrated to task complexity
Context Engineering
Weight: 20% · Method: deterministic
How effectively does the engineer select and provide context to the AI? Quality over quantity — scored against task-specific rubric files and key context targets.
Signals measured:
- Files explored before first AI prompt
- Relevant code/docs referenced in prompts vs. task rubric expectations
- Whether prompts include architectural constraints
- Quality of context vs. quantity (precision over volume)
Problem Decomposition
Weight: 20% · Method: deterministic
Does the engineer think before prompting? Exploration time is calibrated per task — a production triage expects faster orientation than a complex refactor.
Signals measured:
- Time spent reading/exploring before first AI interaction, calibrated to task archetype
- Whether they create a plan or outline before coding
- Prompt specificity (vague vs. targeted)
- Whether they break complex tasks into sequential, scoped prompts
Tier 2: Technical Execution (25%)
Debugging & Recovery
Weight: 12% · Method: hybrid
When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?
Signals measured:
- Root-cause identification speed and accuracy
- Whether they use systematic debugging vs. re-prompting blindly
- Recovery patterns: backtrack effectively or spiral?
- Dead-end detection speed and pivot quality
Architectural Judgment
Weight: 8% · Method: deterministic
Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations.
Signals measured:
- Respects existing patterns and conventions
- Doesn't introduce new dependencies without justification
- Makes deliberate decisions about where code lives
- Edit concentration and scope appropriateness vs. rubric expectations
Code Review Quality
Weight: 5% · Method: hybrid
Can they critically evaluate code — whether AI-generated or human-written?
Signals measured:
- Quality and specificity of review comments
- Whether they catch real issues vs. nitpicks
- Whether they explain why something is a problem
- Whether they suggest alternatives
Tier 3: Efficiency (10%)
Workflow Efficiency
Weight: 10% · Method: deterministic
Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.
Signals measured:
- AI tool feature usage (inline edit vs. chat vs. agent mode)
- Terminal proficiency and command diversity
- Appropriate tool selection for the task
- Read-before-write pattern
- Productive momentum without unnecessary context switches
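The seven dimension weights above sum to 100%, so the overall score can be read as a weighted average of the per-dimension scores. A minimal sketch of that aggregation, assuming each dimension is scored 0-100 (the dictionary keys are illustrative names, not official identifiers):

```python
# Dimension weights as documented above (fractions, sum to 1.0).
WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each on a 0-100 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

Because Tier 1 carries 65% of the weight, a weak Calibrated Trust score drags the overall result far more than a weak Workflow Efficiency score.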
Scoring Methods
- Deterministic (4 dimensions) — Computed entirely from telemetry signals. The same session always produces the same base scores. No human rater, no subjectivity.
- Hybrid (3 dimensions) — Deterministic base score enhanced by LLM analysis, which can adjust scores by at most ±15 points. Each adjustment includes confidence level and written justification.
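For the hybrid dimensions, the ±15-point cap means the LLM can refine a deterministic base score but never overturn it. A sketch of that bounding logic, assuming scores live on a 0-100 scale (the function name and signature are illustrative):

```python
def hybrid_score(base: float, llm_adjustment: float) -> float:
    """Deterministic base score plus an LLM adjustment clamped to
    +/-15 points, with the result kept within the 0-100 range."""
    adjustment = max(-15.0, min(15.0, llm_adjustment))
    return max(0.0, min(100.0, base + adjustment))
```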
Behavioral Patterns
Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.
| Pattern | Modifier | Description |
|---|---|---|
| Calibrated Expert | +5 to +8 | Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience. |
| Methodical Verifier | +2 to +5 | Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues. |
| Explore-Plan-Execute | +3 to +5 | High orientation time, specific prompts, targeted verification. Strong signal of structured problem-solving. |
| Recovery Pivot | +3 to +6 | Initial approach fails, recognizes dead end, pivots strategy, succeeds. Shows resilience and intellectual honesty. |
| Context Blind | -5 to -10 | Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions. |
| Spray and Pray | -8 to -15 | Immediate vague prompts, accept first output, no verification. The weakest signal of engineering judgment. |
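Each pattern above maps to a small modifier band. A sketch of applying a detected pattern to the overall score; the `strength` knob that interpolates within a band is an assumption for illustration, as is keeping the result clamped to 0-100:

```python
# Pattern -> (min, max) score modifier, from the table above.
PATTERN_MODIFIERS = {
    "calibrated_expert": (5, 8),
    "methodical_verifier": (2, 5),
    "explore_plan_execute": (3, 5),
    "recovery_pivot": (3, 6),
    "context_blind": (-10, -5),
    "spray_and_pray": (-15, -8),
}

def apply_pattern(score: float, pattern: str, strength: float) -> float:
    """Apply a detected pattern's modifier. `strength` in [0, 1] is an
    assumed interpolation point within the documented band; the real
    system may choose the point differently."""
    lo, hi = PATTERN_MODIFIERS[pattern]
    modifier = lo + (hi - lo) * max(0.0, min(1.0, strength))
    return max(0.0, min(100.0, score + modifier))
```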
Task Archetypes
Each task belongs to an archetype that calibrates scoring expectations. The same behavior produces different scores depending on the archetype.
| Archetype | Exploration Time | Verification Intensity | Example Tasks |
|---|---|---|---|
| Debugging | Moderate (2-5 min) | High — root cause must be verified | Connection pool fix, flaky test investigation |
| Production Triage | Low (1-2 min) | Moderate — speed matters | N+1 query, retry storm, goroutine leak |
| Code Review | High (5-10 min) | N/A — review is the verification | Security review, architecture review |
| Refactoring | High (5-8 min) | Very high — regressions are costly | State management, design system migration |
| DevOps / Infra | Moderate (3-5 min) | High — infrastructure changes are risky | K8s misconfiguration, CI pipeline, incident response |
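The exploration-time column above can be read as a per-archetype expectation band. A sketch of checking an engineer's orientation time against those bands (the archetype keys are illustrative identifiers; the band values come straight from the table):

```python
# Expected exploration time per archetype, in minutes (from the table above).
EXPLORATION_BANDS = {
    "debugging": (2, 5),
    "production_triage": (1, 2),
    "code_review": (5, 10),
    "refactoring": (5, 8),
    "devops_infra": (3, 5),
}

def exploration_within_band(archetype: str, minutes: float) -> bool:
    """True if orientation time falls inside the archetype's expected band."""
    lo, hi = EXPLORATION_BANDS[archetype]
    return lo <= minutes <= hi
```

This is why identical behavior scores differently across tasks: ninety seconds of exploration is on target for a production triage but well short of what a refactoring task expects.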
Grade Scale
The overall score (0-100) maps to a letter grade and performance band.
| Grade | Score Range | Band |
|---|---|---|
| S | 90-100 | Exceptional |
| A | 80-89 | Strong |
| B | 70-79 | Competent |
| C | 60-69 | Developing |
| D | 50-59 | Needs Work |
| F | <50 | Significant Gaps |
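The grade bands above are contiguous 10-point ranges with a descending cutoff at each letter. A minimal sketch of the mapping:

```python
def grade(score: float) -> str:
    """Map a 0-100 overall score to a letter grade per the table above."""
    bands = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # anything below 50
```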
Comprehension Checks
After submission, you answer 3-5 targeted questions about your work. This prevents the "AI-generated code you don't understand" problem.
- Root cause identification (MC) — Can you explain what was actually wrong?
- Approach justification (MC) — Why did you choose this approach over alternatives?
- Code comprehension (MC) — What does this specific code do and what trade-offs does it make?
- Reflection (Open-ended) — What would you do differently with more time?
Outcome Modifier
After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.
| Test Pass Rate | Adjustment |
|---|---|
| 100% | No penalty |
| 70-99% | -5 points |
| 30-69% | -10 points |
| 1-29% | -15 points |
| 0% | -20 points |
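The table above is a step function over the final pass rate. A sketch of that lookup, assuming the pass rate arrives as a fraction between 0.0 and 1.0:

```python
def outcome_adjustment(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate, per the table above.
    `pass_rate` is a fraction in [0.0, 1.0]."""
    if pass_rate >= 1.0:
        return 0    # all tests pass: no penalty
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20      # nothing passes
```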