The Gap Between Pitch and Practice
Sales conversations describe AI engineering practice. Code reviews reveal it. The gap between what partners say they do and what their code shows is one of the most consistent patterns in 2024-2025 partner evaluations. CTOs who asked for engineering artifacts from candidate partners before contract found gaps in approximately half of cases, according to ISG's services market research (ISG, "2024 Cloud Services Market Trends").
If your partner evaluation has not included an engineering artifact review, you are evaluating slides rather than capability. Six markers visible in actual engineering artifacts separate partners who ship from partners who pitch.
The markers are observable in code reviews, in a sample repository the partner agrees to share under NDA, in a walkthrough of a past project's pull request history, or in the partner's reference engagement that you can audit. Each marker is binary in the operational sense: present or absent.
The Six Practitioner Markers
The six markers are what good AI engineering looks like in code rather than in conversation. They are not exhaustive but they are diagnostic.
Marker one is eval discipline. Pull requests that modify prompts, retrieval pipelines, or model selection include eval results. The eval runs in CI and blocks merges that regress. Without this marker, the partner is changing AI behavior without measurement. With it, the partner has internalized the discipline that makes AI iteration sustainable.
Marker two is version control for prompts and configurations. Prompts live in source code or a versioned prompt registry. Changes are reviewed. History is recoverable. Without this marker, the partner cannot answer the question "what was running on March 14" with confidence. With it, the partner can.
Marker three is cost instrumentation. The code tracks tokens, calls, and supporting infrastructure usage per request. Cost dashboards exist. Anomaly detection alerts on cost spikes. Without this marker, the partner operates AI workloads without unit economics. With it, the partner can answer cost questions with current numbers.
Marker four is failure-mode design. The code includes fallback paths, circuit breakers, timeout handling, and graceful degradation. Production code does not assume the model is available and correct on every call. Without this marker, the partner ships code that breaks at the first production incident. With it, the partner has thought about the second and third order effects.
Marker five is structured outputs and validation. Where outputs feed downstream systems, the code uses structured generation with schema validation. Outputs that fail validation are caught and handled. Without this marker, the partner is producing strings that downstream code parses by hope. With it, the partner has reduced an entire class of integration failures.
Marker six is operational documentation. The repository has runbooks, on-call playbooks, escalation paths. The documentation matches the code. Without this marker, the partner has built systems nobody else can operate. With it, the partner has built systems that survive personnel changes.
A partner with all six markers visible in their code is engineering AI work at production grade. A partner with four or five has gaps that emerge under operational pressure. A partner with three or fewer is shipping accidentally rather than deliberately.
What Each Marker Reveals
Each marker maps to a specific risk that absent the marker materializes.
Eval discipline maps to silent quality regression. Without continuous measurement, the partner's outputs degrade over time and nobody notices. With it, regressions are caught at merge time.
Version control for prompts maps to inability to reconstruct decisions. Without versioning, when something goes wrong six weeks later, the partner cannot identify what changed. With it, forensics are possible.
Cost instrumentation maps to budget surprises. Without per-request cost data, the cloud bill grows and nobody can explain why. With it, the team has visibility into the economics of every workload.
Failure-mode design maps to incident severity. Without designed fallbacks, every model provider outage becomes a customer-facing incident. With it, the system degrades gracefully and customers may not notice.
Structured outputs maps to integration brittleness. Without schema validation, downstream code breaks when the model produces unexpected output formats. With it, the integration is contractually sound.
Operational documentation maps to bus-factor risk. Without runbooks, the partner's specific people are the only ones who can operate the system. With it, the system can be transitioned to your team or to another operator.
A partner that has consistently invested in the six markers has built engineering muscle that transfers to your project. A partner that has invested in some and skipped others has muscle that may not transfer.
How to Inspect for the Markers
Inspecting for the markers requires direct artifact review rather than conversation. Three approaches work.
The first approach is a sample repository review. The partner shares a representative repository (under NDA) covering a recent client engagement. The repository contains the AI code, the eval setup, the CI configuration, the deployment scripts, and the operational documentation. Your engineering team reviews against the six markers.
The second approach is a pull request walkthrough. The partner walks through the pull request history of a past project. Each marker is checked against the actual evidence in PR descriptions, comments, CI results, and merged code. This is faster than full repository review and produces high-signal evidence.
The third approach is a reference engagement audit. With permission from a reference customer, the partner walks through the actual production code that customer is running. The marker check is done against live production rather than against curated examples.
Each approach has practical constraints around NDA, customer permission, and time. The constraints can be addressed for serious procurement processes. The constraints that cannot be addressed should be themselves diagnostic about the partner's confidence in their work.
The Pattern of Skill Variation
Across the six markers, different partners typically excel at different ones.
Partners with traditional software engineering depth often excel at markers four (failure-mode design) and five (structured outputs) because these map to general engineering competence. They sometimes lag on marker one (eval discipline) because eval is AI-specific and newer.
Partners with strong ML or data science background often excel at marker one and lag on operational markers (four, five, six). The pattern reflects different professional formation.
Partners with strong DevOps or platform engineering depth often excel at marker six (operational documentation) and lag on AI-specific markers (one, three).
The six markers across the team is what matters more than any individual person mastering all six. Good AI engineering partnerships have collective coverage even when individuals specialize.
What Does Not Substitute for the Markers
Three things sometimes substitute for the markers in partner pitches and should not be accepted as substitutes.
Certifications do not substitute. AWS, Google Cloud, Microsoft, and various AI vendor certifications attest to product familiarity, not to engineering discipline. Certified partners can still ship code without the markers.
Case studies do not substitute. Case studies describe outcomes that may have been achieved in ways the partner would not repeat in your engagement. Six markers visible in current code are more diagnostic than three case studies from past projects.
Methodology branding does not substitute. Branded methodologies that the partner has developed and promotes are aspirational descriptions. The code is what reveals whether the methodology is actually practiced.
The substitutes are common in pitch processes because they are easier to produce than evidence of the markers. Buyers who accept the substitutes get the experience the substitutes describe; buyers who insist on the markers get the engineering the markers reveal.
What Logiciel Does Here
Logiciel works with enterprises that have been disappointed by previous partner engagements and want to evaluate next partners against engineering markers rather than against marketing materials. The work is typically structured around a defined artifact review against the six markers.
The Evaluate AI Implementation Partner framework covers the eight-item structural filter that complements the engineering markers. The Twelve Questions framework covers the diagnostic questions that surface the operational reality behind the markers.
A 30-minute working session is enough to design an artifact review process appropriate to your specific procurement.
Frequently Asked Questions
Can I check the markers without engineering staff?
With difficulty. The markers are technical and require engineering judgment to evaluate properly. Procurement-led evaluation usually misses the markers entirely. Engineering involvement in evaluation is mandatory for serious partner selection.
Should I require all six markers?
At production-grade work, yes. At pilot-grade work, the eval discipline and version control markers are the most critical. The operational markers can be retrofitted; the eval and version control markers retrofit poorly.
What if the partner declines to share artifacts?
That is itself a signal. Partners confident in their engineering will share with appropriate NDA protection. Partners unwilling to share usually have something they would prefer not to show.
How does this work for partners with a track record but for outdated work?
Look at recent work, not foundational work. AI engineering practice has evolved enough in 2023-2025 that 2021 work does not reflect current capability. Two-year-old artifacts should be supplemented with recent ones.
What if my procurement timeline does not allow artifact review?
Then your procurement timeline is producing higher risk than necessary. The artifact review adds 2-4 weeks. The alternative is selecting based on pitch and discovering engineering reality after contract. The 2-4 weeks usually pay back. Sources: - ISG, "2024 Cloud Services Market Trends" - Anthropic, "Building effective agents," 2024