scoring-research.md
---
title: Scoring System Research & Design
category: pipeline
last_verified: 2026-02-28
related_files:
  - src/score.js
  - src/stages/rescoring.js
  - prompts/CONVERSION-SCORING-VISION.md
  - prompts/CONVERSION-SCORING-NOVIS.md
  - docs/03-pipeline/scoring-system.md
tags: [scoring, CRO, research, rubric, design-decisions]
status: active
---

# Scoring System Research & Design

This document captures the original deep-research conversation that produced the 333 Method's website conversion scoring system. It covers the research foundations, rubric design rationale, factor weighting decisions, screenshot strategy, rescoring approach, contact extraction, and implementation considerations.

> **Source:** Open-WebUI chat with OpenRouter (January 2026), exported and consolidated.

---

## Table of Contents

1. [Research Question](#1-research-question)
2. [Foundations: Why These Factors](#2-foundations-why-these-factors)
3. [The Nine-Factor Rubric](#3-the-nine-factor-rubric)
4. [Factor Weights & Rationale](#4-factor-weights--rationale)
5. [Score Calculation & Grading Scale](#5-score-calculation--grading-scale)
6. [Screenshot Strategy](#6-screenshot-strategy)
7. [Two-Pass Architecture](#7-two-pass-architecture)
8. [Contact Extraction on Rescore](#8-contact-extraction-on-rescore)
9. [Popover Handling](#9-popover-handling)
10. [LLM Prompt Design](#10-llm-prompt-design)
11. [Validation & Calibration](#11-validation--calibration)
12. [Token Optimization](#12-token-optimization)
13. [Implementation Notes](#13-implementation-notes)
14. [Appendix: Original JSON Schemas](#14-appendix-original-json-schemas)

---

## 1. Research Question

The original question that kicked off this research:

> Propose a scoring system for website conversion (with standard school grading of A+ to F) based on factors such as clear offer, CTA, urgency, hook, strong headline, strong value proposition, clear reason to choose them (USP), no generic stock photos, trust elements (reviews, badges, guarantees), and anything else from best practices.
>
> This will be provided to an LLM to calculate the score, along with the HTML of the DOM after pageload and one or more screenshots. Please advise whether this scoring system will require a full-page screenshot, or will just an above-the-fold and maybe the first below-the-fold screenshots be sufficient to produce a reasonable score? This system will be scoring many hundreds of thousands of websites, so minimising LLM token usage is more important than score accuracy.

---

## 2. Foundations: Why These Factors

The scoring factors were selected by synthesizing established CRO (Conversion Rate Optimization) best practices, prioritization frameworks (RICE, PIE), and behavioral psychology research. The factors map to three broad categories of conversion influence, plus one contextual factor:

### Messaging Clarity & Value Communication

- **Headline Quality:** The primary hook; must communicate what, who, and why within 3-5 seconds
- **Value Proposition:** Extends the headline; shifts from features to benefits ("what's in it for me?")
- **Unique Selling Proposition:** Why choose _this_ option over alternatives
- **Clear Offer:** What exactly the visitor is being asked to do, and what they get

### User Confidence & Trust

- **Trust & Credibility Signals:** Testimonials, certifications, badges, partner logos, media mentions
- **Authentic Imagery:** Real product photos vs. generic stock; professional visual design

### Action & Engagement

- **Call-to-Action:** Copy clarity, visual prominence, and placement
- **Urgency/Scarcity:** Legitimate time/supply pressure for immediate action
- **Hook & Engagement:** Hero element that captures attention in the first seconds

### Additional Context

- **Industry Appropriateness** (3% weight): Whether the design serves its specific business model context (B2B SaaS vs. e-commerce vs. local services have different norms)

### Key Research Findings

- Users spend 57-80% of viewing time on above-the-fold content (Nielsen Norman Group)
- Google found ads above-fold achieve 73% viewability vs. 44% below-fold, a 66% relative gap (the "fold cliff")
- 90% of users begin scrolling within 14 seconds, but _only if above-fold content signals value_
- CTA copy changes alone can generate conversion improvements exceeding 200%
- GPT-4 Vision studies found cropped images actually _outperform_ full-page captures for identification tasks; background noise reduces accuracy

---

## 3. The Nine-Factor Rubric

Each of the nine core factors, plus a tenth contextual factor, is scored 0-10 against specific rubric definitions. Below is the condensed rubric; the full detailed version with examples lives in the LLM prompts.

### Factor 1: Headline Quality & Clarity (15%)

| Score | Description                                                                                           |
| ----- | ----------------------------------------------------------------------------------------------------- |
| 9-10  | Immediately communicates value; benefit-oriented; specific; creates curiosity or emotional connection |
| 7-8   | Clearly communicates basic benefit; mostly specific; adequate direction                               |
| 5-6   | Communicates a benefit but somewhat generic; requires modest interpretation                           |
| 3-4   | Vague, generic, or fails to communicate core benefit                                                  |
| 1-2   | Confusing, contradictory, or essentially absent above-fold                                            |
| 0     | No discernible headline or actively confusing                                                         |

### Factor 2: Value Proposition Clarity (14%)

| Score | Description                                                     |
| ----- | --------------------------------------------------------------- |
| 9-10  | Specific, benefit-oriented, compelling; clearly differentiates  |
| 7-8   | Clear and benefit-focused; adequately articulates core benefits |
| 5-6   | Present but generic or feature-heavy; requires interpretation   |
| 3-4   | Vague or feature-focused; unclear differentiation               |
| 1-2   | Barely present or confused with feature lists                   |
| 0     | No value proposition or contradictory messaging                 |

### Factor 3: Unique Selling Proposition (13%)

| Score | Description                                                       |
| ----- | ----------------------------------------------------------------- |
| 9-10  | Clear, compelling differentiation; specific competitive advantage |
| 7-8   | Reasonably clear; specific advantage identified                   |
| 5-6   | Some differentiation implied but not explicit                     |
| 3-4   | Vague; relies on generic claims ("best in class")                 |
| 1-2   | Barely present; no clear reasons to choose                        |
| 0     | No differentiation; appears identical to generic competitors      |

### Factor 4: Call-to-Action Design & Placement (13%)

| Score | Description                                                                                                 |
| ----- | ----------------------------------------------------------------------------------------------------------- |
| 9-10  | Visible above fold; specific action-oriented language; visually prominent; secondary CTAs at natural breaks |
| 7-8   | Visible above fold; action-oriented; reasonably prominent                                                   |
| 5-6   | Present; clear but generic ("Submit", "Learn More"); adequate placement                                     |
| 3-4   | Present but not prominent; vague language; requires scrolling                                               |
| 1-2   | Hard to find, confusing, or inadequately prominent                                                          |
| 0     | No CTA or buried below multiple scrolls                                                                     |

### Factor 5: Urgency & Scarcity (10%)

| Score | Description                                                           |
| ----- | --------------------------------------------------------------------- |
| 9-10  | Legitimate urgency with specifics (deadline, count); genuine pressure |
| 7-8   | Clear mechanism; specific rather than vague                           |
| 5-6   | Some urgency suggested but lacks specifics                            |
| 3-4   | Vague ("act soon", "don't miss out") without details                  |
| 1-2   | Minimal or ineffective urgency                                        |
| 0     | No urgency or false urgency undermining credibility                   |

### Factor 6: Hook & Initial Engagement (9%)

| Score | Description                                                                |
| ----- | -------------------------------------------------------------------------- |
| 9-10  | Visually compelling hero element; contextually relevant; strong engagement |
| 7-8   | Professional, relevant hero; adequate engagement                           |
| 5-6   | Present but generic; mild engagement                                       |
| 3-4   | Dated, poorly executed, or tangentially relevant                           |
| 1-2   | Missing, poor quality, or detracting                                       |
| 0     | No hook; purely text-based above-fold with no visual appeal                |

### Factor 7: Trust & Credibility Signals (11%)

| Score | Description                                                                                        |
| ----- | -------------------------------------------------------------------------------------------------- |
| 9-10  | Multiple relevant elements (named testimonials, certifications, badges, logos); prominently placed |
| 7-8   | Several elements; specific testimonials or credible certifications                                 |
| 5-6   | Some elements (generic testimonials or basic badges); adequate                                     |
| 3-4   | Minimal; generic or lacking credibility                                                            |
| 1-2   | Nearly absent                                                                                      |
| 0     | No trust signals at all                                                                            |

### Factor 8: Authentic Imagery & Visual Design (8%)

| Score | Description                                                             |
| ----- | ----------------------------------------------------------------------- |
| 9-10  | Authentic imagery (product photos, real customers); professional design |
| 7-8   | Mix of authentic and professional; solid design                         |
| 5-6   | Mostly professional with some stock; adequate; minor issues             |
| 3-4   | Significant stock photos; dated design; unprofessional impression       |
| 1-2   | Predominantly generic/low-quality; poor design                          |
| 0     | Broken images, extremely low-quality, or repelling                      |

### Factor 9: Clear Offer & Specificity (4%)

| Score | Description                                                |
| ----- | ---------------------------------------------------------- |
| 9-10  | Specific, unambiguous; visitor knows exactly what they get |
| 7-8   | Clear and specific; minor ambiguity                        |
| 5-6   | Generally clear but could be more specific                 |
| 3-4   | Somewhat vague; visitor must infer details                 |
| 1-2   | Unclear or hard to determine                               |
| 0     | No discernible offer                                       |

### Factor 10: Contextual Appropriateness (3%)

Evaluates whether the design serves its industry/business model context. B2B SaaS, e-commerce, and local services have different CRO norms.

---

## 4. Factor Weights & Rationale

The weights were derived from empirical research on correlation with actual conversion outcomes:

| Factor              | Weight   | Rationale                                                                                     |
| ------------------- | -------- | --------------------------------------------------------------------------------------------- |
| Headline Quality    | 15%      | Primary determinant of whether users engage or bounce; captures 80% of initial attention      |
| Value Proposition   | 14%      | Extends headline; answers "what's in it for me?"; directly drives consideration               |
| USP/Differentiation | 13%      | Critical for competitive markets; answers "why you over alternatives?"                        |
| CTA Design          | 13%      | The conversion mechanism itself; changes to CTA alone can drive 200%+ improvement             |
| Trust Signals       | 11%      | Addresses fundamental "is this trustworthy?" concern; increasingly important post-privacy era |
| Urgency/Scarcity    | 10%      | Drives immediate action vs. postponement; effective when legitimate                           |
| Hook/Engagement     | 9%       | First-impression visual; supports but doesn't replace messaging                               |
| Imagery/Design      | 8%       | Credibility signal; generic stock undermines trust but doesn't make or break conversion       |
| Offer Clarity       | 4%       | Important but usually redundant with headline + CTA when those are strong                     |
| Context             | 3%       | Catch-all for industry-specific norms                                                         |
| **Total**           | **100%** |                                                                                               |

The top 4 factors (headline, value prop, USP, CTA) account for 55% of the score. This reflects the research consensus that messaging clarity and the conversion mechanism are the dominant drivers.

---

## 5. Score Calculation & Grading Scale

### Formula

```
Overall Score = (Headline × 0.15) + (Value Prop × 0.14) + (USP × 0.13) + (CTA × 0.13)
              + (Urgency × 0.10) + (Hook × 0.09) + (Trust × 0.11) + (Imagery × 0.08)
              + (Offer × 0.04) + (Context × 0.03)
```

Each factor is scored 0-10 and the weights sum to 1.00, so the weighted sum also falls in 0-10; multiplying it by 10 gives the final 0-100 score.
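
The calculation above can be sketched in JavaScript as follows. This is a minimal illustration, not the production `computeScoreFromFactors()`: the helper name `weightedScore` is hypothetical, the factor keys are borrowed from the original JSON schema in the appendix, and the input validation and rounding to one decimal place are assumptions made for the example.

```javascript
// Weights copied from the formula above. Keys follow the factor names in the
// original JSON schema (appendix); production key names may differ.
const WEIGHTS = {
  headline_quality: 0.15,
  value_proposition: 0.14,
  unique_selling_proposition: 0.13,
  call_to_action: 0.13,
  urgency_messaging: 0.10,
  hook_engagement: 0.09,
  trust_signals: 0.11,
  imagery_design: 0.08,
  offer_clarity: 0.04,
  contextual_appropriateness: 0.03,
};

// Hypothetical helper: turns ten 0-10 factor scores into a 0-100 score.
function weightedScore(factorScores) {
  let sum = 0;
  for (const [factor, weight] of Object.entries(WEIGHTS)) {
    const score = factorScores[factor];
    // Reject missing, non-numeric, or out-of-range factor scores.
    if (!Number.isFinite(score) || score < 0 || score > 10) {
      throw new RangeError(`Factor "${factor}" must be a number from 0 to 10`);
    }
    sum += score * weight;
  }
  // sum is on a 0-10 scale; scale to 0-100 and round to one decimal place
  // so floating-point noise does not leak into the result.
  return Math.round(sum * 100) / 10;
}
```

For example, a site scoring 7 on every factor comes out at 70, and a perfect 10 across the board gives 100. Keeping the weights in a single table makes it easy to confirm that they still sum to 1.00 if any of them is ever changed.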

### Grading Scale

The production system uses a standard academic grade scale with +/- modifiers:

| Grade | Score Range | Interpretation                                        |
| ----- | ----------- | ----------------------------------------------------- |
| A+    | 97-100      | Exceptional: industry-leading conversion design       |
| A     | 93-96       | Excellent: strong conversion potential                |
| A-    | 90-92       | Very good: well-executed with small areas to improve  |
| B+    | 87-89       | Good: solid foundation, a few clear opportunities     |
| B     | 83-86       | Above average: noticeable room for improvement        |
| B-    | 80-82       | Borderline: several conversion barriers present       |
| C+    | 77-79       | Below average: meaningful improvements needed         |
| C     | 73-76       | Fair: significant issues across multiple factors      |
| C-    | 70-72       | Weak: substantial work required                       |
| D+    | 67-69       | Poor: major barriers across most factors              |
| D     | 63-66       | Very poor: fundamental issues need addressing         |
| D-    | 60-62       | Critical: actively losing most potential customers    |
| F     | 0-59        | Failing: urgent, comprehensive overhaul required      |

> See `src/score.js:computeGrade()` for the production implementation.

---

## 6. Screenshot Strategy

### Key Decision: Above-the-Fold Is Sufficient

The research concluded that **above-the-fold and first below-the-fold screenshots are substantially sufficient for reliable scoring**, reducing token consumption by 65-75% compared to full-page screenshots while maintaining evaluation accuracy above 90%.

### Recommended Approach

1. **Primary:** Desktop above-the-fold (1920x1080)
2. **Secondary:** Mobile above-the-fold (375x667); _later dropped in production for cost reasons_
3. **Conditional:** Below-the-fold screenshot if initial score is low (rescoring pass)

### Evidence

- Above-fold content captures 57-80% of user viewing time
- The 9 scoring factors cluster heavily above the fold on well-designed pages
- GPT-4 Vision cropping research showed focused images _improve_ accuracy by removing noise
- Token savings: ~1,000 tokens per above-fold image vs. ~2,000 for full-page
- At 500,000 websites: ~1 billion fewer tokens consumed

### Production Implementation

In the production system (`src/capture.js`):

- Desktop screenshot captured at page load (cropped + uncropped variants)
- DOM-aware intelligent cropping preserves CTAs, trust signals, hero imagery
- Cropped version saves 20-35% additional LLM tokens
- Below-fold screenshot captured separately for rescoring pass
- Mobile screenshot was dropped for cost efficiency

---

## 7. Two-Pass Architecture

### Design Decision: Conditional Resubmission

The research compared three approaches for handling below-the-fold content:

| Approach                             | Token Cost (100K sites, 30% low-scoring) |
| ------------------------------------ | ---------------------------------------- |
| **Conditional resubmission**         | 174M tokens                              |
| Always include below-fold            | 180M tokens                              |
| Include with "ignore if unnecessary" | 185M tokens                              |

**Conditional resubmission wins** because:

- Vision models charge for image tokens at input time regardless of whether the model "uses" the image
- Including an image and saying "only look at it if needed" does NOT save tokens
- The breakeven point is ~50% of sites scoring low; in practice only ~30% do

### Pass 1: Scoring (Above-the-Fold)

- Input: Desktop screenshot (cropped) + HTML DOM
- Output: Factor scores (0-10 each), weighted total, grade, strengths, weaknesses, improvement opportunities
- Sites scoring below threshold proceed to Pass 2

### Pass 2: Rescoring (Below-the-Fold)

- Input: Below-fold screenshot + HTML DOM + original score JSON
- Output: Adjusted factor scores (only where new content warrants change), recalculated total/grade, contact details
- Does NOT resend above-fold screenshots (the Pass 1 score JSON carries that context forward)
- Focused prompt references original scores and asks for adjustments, not full re-evaluation

### Threshold

The original research suggested C+ (77) as the resubmission threshold. The production system uses a configurable `LOW_SCORE_CUTOFF` (currently 82, i.e., B- and below).

> **Business logic:** We're selling web design services. High scorers don't need help; low scorers are prospects. Rescoring gives low-scoring sites a second chance with more data before proposal generation.

---

## 8. Contact Extraction on Rescore

Contact extraction was added to the rescoring pass (not the initial scoring) to save tokens: contacts are only extracted for sites you actually plan to contact (low scorers).
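
The gating described here can be sketched as a small decision function. This is only an illustration under stated assumptions: `planPasses` and `DEFAULT_LOW_SCORE_CUTOFF` are hypothetical names, the inclusive comparison follows the "B- and below" wording for a cutoff of 82, and how the production system actually reads `LOW_SCORE_CUTOFF` is not covered by this document.

```javascript
// Assumed default, taken from the cutoff value quoted in this document (82).
const DEFAULT_LOW_SCORE_CUTOFF = 82;

// Hypothetical planner: given a Pass 1 score (0-100), decide whether the site
// goes to Pass 2 and, therefore, whether contact extraction is requested.
function planPasses(pass1Score, cutoff = DEFAULT_LOW_SCORE_CUTOFF) {
  if (!Number.isFinite(pass1Score)) {
    throw new TypeError("pass1Score must be a finite number");
  }
  // Inclusive comparison: a score equal to the cutoff still counts as low.
  const isLowScorer = pass1Score <= cutoff;
  return {
    rescore: isLowScorer,
    // Contacts ride along with the rescoring call, so high scorers never pay for them.
    extractContacts: isLowScorer,
    // Inputs listed for Pass 2 in section 7; empty when no rescoring happens.
    pass2Inputs: isLowScorer
      ? ["below_fold_screenshot", "html_dom", "pass1_score_json"]
      : [],
  };
}
```

Tying `extractContacts` to the same condition as `rescore` mirrors the design choice above: there is no separate contact-extraction request for sites that will never be contacted.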

### What Gets Extracted

From the HTML DOM (not guessed):

- **Contact form details:** action URL, method, field presence (first_name, last_name, full_name, email, phone, company_name, subject_line, message) with field types, name attributes, and labels
- **Email addresses:** All explicit `mailto:` links or plain-text emails
- **Phone numbers:** All explicit `tel:` links or recognizable patterns
- **Social profiles:** Links to major platforms with platform identification
- **Contact page URLs:** Explicit "/contact" or "/support" links

### Design Rationale

- Extracting contacts in Pass 1 would waste tokens on high-scoring sites we won't contact
- The HTML DOM is already being sent in Pass 2 anyway (for score adjustment)
- Adding contact extraction to the same API call adds minimal token overhead
- All fields are optional: the LLM reports what it finds and uses `null`/empty for missing data

---

## 9. Popover Handling

**Decision: Close popovers before taking screenshots.**

Reasoning:

1. Popovers obscure the headline, hero image, CTA, and trust signals being evaluated
2. They represent a secondary conversion path (newsletter signup, discount), not the primary page conversion
3. They create inconsistent evaluation conditions (some sites show immediately, others on delay/exit)
4. The entire scoring methodology depends on evaluating above-fold content, which is completely blocked by modal overlays

### Implementation

The production system (`src/capture.js` / `src/utils/stealth-browser.js`):

- Waits 2-3 seconds after page load for delay-triggered popovers
- Attempts to close via common selectors (`[class*='close']`, `[aria-label='Close']`, etc.)
- Sends Escape key as fallback
- Takes screenshot immediately after closing to avoid new popovers

---

## 10. LLM Prompt Design

### Prompt Structure (Both Passes)

1. **System context:** Expert CRO specialist role
2. **Input specification:** What data is provided (screenshots, HTML, prior scores for rescoring)
3. **Evaluation framework:** Factor definitions with rubric anchors
4. **Scoring methodology:** Weighted calculation formula
5. **Output format:** Strict JSON schema
6. **Best practices:** Analyze HTML first, cross-reference with screenshots, assess mobile/desktop separately, provide specific evidence

### Key Design Principles

- **Rubric in system prompt:** Full rubric definitions appear once in the system prompt, not repeated per-website
- **Evidence-based scoring:** Each factor score requires 1-2 sentence reasoning with specific page evidence
- **Independent factor scoring:** Score each factor independently, then calculate weighted total
- **Dual analysis:** HTML content analysis + visual assessment cross-referenced
- **Confidence assessment:** Overall confidence (High/Medium/Low) with limitation notes

### Production Evolution

The production prompts (`prompts/CONVERSION-SCORING-VISION.md` and `prompts/CONVERSION-SCORING-NOVIS.md`) have evolved from this original design:

- Simplified output JSON (removed verbose nested structures for token efficiency)
- Added `recommendation_sms` and `recommendation_email` fields for proposal generation
- Split into vision-enabled and HTML-only variants
- Grade calculation moved from LLM to code (`computeGrade()` in `src/score.js`)
- LLM now returns only `factor_scores`; total and grade computed programmatically for consistency

---

## 11. Validation & Calibration

The research recommended the following validation approach (partially implemented):

### Inter-Rater Reliability

- Evaluate 50-100 websites with experienced CRO professionals
- Run same websites through LLM scoring
- Target Spearman correlation > 0.85 between LLM and expert scores
- Analyze divergence cases; iterate on rubric wording

### Expected Accuracy

- **Letter grade agreement:** 75-85% exact match with experts; remaining are adjacent grades
- **Factor-level accuracy:** 80-90% match (individual factors more objective than aggregated grades)
- **High-confidence cases:** >90% accuracy for clearly strong (A-) or weak (D/F) sites
- **Mid-range (B-C+):** Lower agreement due to inherent subjectivity

### Continuous Monitoring

- Distribution should approximate normal centered around C+/B-
- Factor correlations should match expectations (headline ↔ value prop should correlate > 0.6)
- 1% human spot-checks; recalibrate if divergences exceed 5-10%

### Production Status

In practice, the system has been validated through:

- Manual review of thousands of scored sites during outreach QA
- Grade/score mismatch detection and correction
- Programmatic grade computation (eliminating LLM grading inconsistencies)
- Iterative prompt refinement based on observed scoring patterns

---

## 12. Token Optimization

### Image Optimization (Implemented)

- JPEG conversion (quality 85): 40-50% file size reduction
- DOM-aware intelligent cropping: 20-35% token reduction
- Resolution targeting: minimum for text legibility
- Above-fold only: 65-75% reduction vs. full-page

### Prompt Optimization (Implemented)

- Rubric in system prompt (not repeated per website)
- Abbreviated factor references in rescoring prompt
- Strict JSON output (no prose explanation outside the JSON)
- LLM returns factor scores only; computation done in code

### HTML-Only Mode (Added Later)

When `ENABLE_VISION=false`, the system skips screenshots entirely:

- No Playwright screenshot capture
- Text-only analysis of HTML DOM
- Auto-promotes scored sites through rescoring (no below-fold vision needed)
- Cost: ~$0.0025/site vs. ~$0.030/site with vision (roughly 92% savings at these figures)

---

## 13. Implementation Notes

### What Changed from Research to Production

| Research Proposal                         | Production Implementation                     | Reason                                     |
| ----------------------------------------- | --------------------------------------------- | ------------------------------------------ |
| Academic grading scale (A+ to F with +/-) | Business scale (A+ to F, fewer subdivisions)  | Simpler for prospect identification        |
| Mobile + desktop screenshots              | Desktop only                                  | Cost reduction; mobile added minimal value |
| LLM computes grade                        | Code computes grade from factor scores        | Eliminates grading inconsistencies         |
| Batch processing of 50-100 sites          | Individual API calls with concurrency control | Rate limits; error isolation               |
| Few-shot examples in prompt               | Prompt-only (no examples)                     | Token savings; rubric detail sufficient    |
| C+ (77) rescore threshold                 | B- (82) configurable threshold                | Business need: more prospects              |
| 9 factors + context                       | 10 factors (context kept as factor 10)        | Consistent weighting                       |
| Full JSON with nested evidence/reasoning  | Simplified JSON with factor scores            | Token efficiency                           |

### Key Files

- `src/score.js`: Scoring logic, `computeScoreFromFactors()`, `computeGrade()`
- `src/stages/rescoring.js`: Below-fold rescoring pass
- `src/capture.js`: Screenshot capture
- `src/contacts/prioritize.js`: Contact extraction and prioritization
- `prompts/CONVERSION-SCORING-VISION.md`: Vision-enabled scoring prompt
- `prompts/CONVERSION-SCORING-NOVIS.md`: HTML-only scoring prompt

---

## 14. Appendix: Original JSON Schemas

### Pass 1: Scoring Output

```json
{
  "website_url": "https://example.com",
  "evaluation_date": "2026-01-14T12:00:00Z",
  "device_analysis": {
    "desktop_visible": true,
    "mobile_visible": true,
    "design_differences": "Mobile layout stacks hero and CTA; desktop shows them side by side."
  },
  "factor_scores": {
    "headline_quality": {
      "score": 8,
      "reasoning": "Headline clearly states what the product does and who it's for.",
      "evidence": "Headline text: 'Automate Your Invoices in Minutes for Small Businesses'."
    },
    "value_proposition": {
      "score": 7,
      "reasoning": "Benefits are clear but lack concrete quantified outcomes.",
      "evidence": "Copy mentions 'save time and reduce errors' but no specific numbers."
    },
    "unique_selling_proposition": {
      "score": 6,
      "reasoning": "USP is implied but not explicitly contrasted with competitors.",
      "evidence": "Mentions 'built specifically for freelancers' but no comparison."
    },
    "call_to_action": {
      "score": 9,
      "reasoning": "Primary CTA is above the fold, high contrast, and action-oriented.",
      "evidence": "CTA button 'Start Free 14-Day Trial' in hero, visually prominent."
    },
    "urgency_messaging": {
      "score": 3,
      "reasoning": "Weak urgency with vague wording and no specific deadline.",
      "evidence": "Text: 'Join now and don't miss out' without concrete time limit."
    },
    "hook_engagement": {
      "score": 7,
      "reasoning": "Hero image is relevant and supports the message.",
      "evidence": "Image shows a freelancer working with invoices on a laptop."
    },
    "trust_signals": {
      "score": 5,
      "reasoning": "Includes generic testimonials but no logos or certifications.",
      "evidence": "Three short text testimonials with first names only."
    },
    "imagery_design": {
      "score": 8,
      "reasoning": "Clean, modern design with custom-looking imagery.",
      "evidence": "Custom illustrations and product screenshots; no stock photos."
    },
    "offer_clarity": {
      "score": 8,
      "reasoning": "Offer is explicit: 14-day free trial, no credit card.",
      "evidence": "Text near CTA: 'Try it free for 14 days. No credit card needed.'"
    },
    "contextual_appropriateness": {
      "score": 7,
      "reasoning": "Design is appropriate for a B2B SaaS invoicing tool.",
      "industry_context": "B2B SaaS / invoicing software"
    }
  },
  "overall_calculation": {
    "weighted_total": 76.5,
    "letter_grade": "C",
    "grade_interpretation": "Acceptable fundamentals but room for improvement in USP, trust, and urgency."
  },
  "key_strengths": [
    "Strong, visible primary CTA with clear action and low-friction offer.",
    "Headline and offer are easy to understand for the target audience."
  ],
  "critical_weaknesses": [
    "Weak urgency messaging provides little reason to act now.",
    "Trust elements are generic and do not provide strong social proof."
  ],
  "quick_improvement_opportunities": [
    "Add specific trust signals (logos, detailed testimonials) above the fold.",
    "Introduce concrete urgency (time-bound offer or limited onboarding slots)."
  ],
  "confidence_assessment": {
    "overall_confidence": "High",
    "reasoning": "Above-fold content contains all major conversion elements.",
    "limitation_notes": "Does not consider deeper content, checkout flow, or post-click funnel."
  }
}
```

### Pass 2: Rescoring + Contact Extraction Output

The rescoring pass adds a `contact_details` section to the evaluation JSON:

```json
{
  "contact_details": {
    "primary_contact_form": {
      "form_action_url": "https://example.com/contact-submit",
      "form_method": "post",
      "fields": {
        "first_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "first_name",
          "label_or_placeholder": "First name"
        },
        "last_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "last_name",
          "label_or_placeholder": "Last name"
        },
        "full_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "email": {
          "present": true,
          "field_type": "email",
          "name_attribute": "email",
          "label_or_placeholder": "Your email"
        },
        "phone": {
          "present": true,
          "field_type": "tel",
          "name_attribute": "phone",
          "label_or_placeholder": "Phone number"
        },
        "company_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "subject_line": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "message": {
          "present": true,
          "field_type": "textarea",
          "name_attribute": "message",
          "label_or_placeholder": "Your message"
        }
      }
    },
    "email_addresses": ["support@example.com", "sales@example.com"],
    "phone_numbers": ["+1-555-123-4567"],
    "social_profiles": [
      {
        "platform": "facebook",
        "url": "https://www.facebook.com/example"
      },
      {
        "platform": "linkedin",
        "url": "https://www.linkedin.com/company/example"
      }
    ],
    "contact_pages": ["https://example.com/contact", "https://example.com/support"]
  }
}
```

---

## References

The original research cited ~60 sources. Key references that informed the design:

- Nielsen Norman Group: User attention and fold research (57-80% above-fold viewing time)
- Google ad viewability study: 73% above-fold vs. 44% below-fold viewability
- CRO best practices literature: Factor selection and weighting
- GPT-4 Vision cropping research: Focused images outperform full-page for accuracy
- PURE method: Inter-rater reliability calibration for expert-based evaluation (>0.8 reliability)
- LLMLingua (Microsoft): Prompt optimization with 35% token reduction while maintaining quality
- DeepSeek vision-text compression: 7-20x token reduction at >90% accuracy
- OpenAI vision model documentation: Token calculation and pricing for image inputs