<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Scoring System Research & Design — 333 Method</title>
    <style>
      :root {
        --bg: #fafafa;
        --fg: #1a1a1a;
        --accent: #2563eb;
        --border: #e5e7eb;
        --code-bg: #f3f4f6;
        --table-stripe: #f9fafb;
        --blockquote-border: #d1d5db;
        --blockquote-bg: #f9fafb;
      }
      @media (prefers-color-scheme: dark) {
        :root {
          --bg: #111827;
          --fg: #e5e7eb;
          --accent: #60a5fa;
          --border: #374151;
          --code-bg: #1f2937;
          --table-stripe: #1f2937;
          --blockquote-border: #4b5563;
          --blockquote-bg: #1f2937;
        }
      }
      * {
        box-sizing: border-box;
        margin: 0;
        padding: 0;
      }
      body {
        font-family:
          -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
        line-height: 1.7;
        color: var(--fg);
        background: var(--bg);
        max-width: 52rem;
        margin: 0 auto;
        padding: 2rem 1.5rem 4rem;
      }
      h1 {
        font-size: 2rem;
        margin: 2rem 0 1rem;
        border-bottom: 2px solid var(--accent);
        padding-bottom: 0.5rem;
      }
      h2 {
        font-size: 1.5rem;
        margin: 2.5rem 0 0.75rem;
        border-bottom: 1px solid var(--border);
        padding-bottom: 0.4rem;
      }
      h3 {
        font-size: 1.2rem;
        margin: 1.5rem 0 0.5rem;
      }
      p {
        margin: 0.75rem 0;
      }
      a {
        color: var(--accent);
        text-decoration: none;
      }
      a:hover {
        text-decoration: underline;
      }
      ul,
      ol {
        margin: 0.5rem 0 0.5rem 1.5rem;
      }
      li {
        margin: 0.25rem 0;
      }
      code {
        font-family: 'SF Mono', 'Fira Code', 'JetBrains Mono', Consolas, monospace;
        font-size: 0.875em;
        background: var(--code-bg);
        padding: 0.15em 0.35em;
        border-radius: 4px;
      }
      pre {
        background: var(--code-bg);
        border: 1px solid var(--border);
        border-radius: 8px;
        padding: 1rem;
        overflow-x: auto;
        margin: 1rem 0;
      }
      pre code {
        background: none;
        padding: 0;
        font-size:
          0.85em;
      }
      table {
        width: 100%;
        border-collapse: collapse;
        margin: 1rem 0;
        font-size: 0.95em;
      }
      th,
      td {
        border: 1px solid var(--border);
        padding: 0.5rem 0.75rem;
        text-align: left;
      }
      th {
        background: var(--code-bg);
        font-weight: 600;
      }
      tr:nth-child(even) {
        background: var(--table-stripe);
      }
      blockquote {
        border-left: 4px solid var(--blockquote-border);
        background: var(--blockquote-bg);
        padding: 0.75rem 1rem;
        margin: 1rem 0;
        border-radius: 0 8px 8px 0;
      }
      blockquote p {
        margin: 0.25rem 0;
      }
      hr {
        border: none;
        border-top: 1px solid var(--border);
        margin: 2rem 0;
      }
      @media print {
        body {
          max-width: 100%;
          padding: 1cm;
        }
        pre {
          white-space: pre-wrap;
        }
      }
    </style>
  </head>
  <body>
    <h1>Scoring System Research & Design</h1>
    <p>
      This document captures the original deep-research conversation that produced the 333
      Method's website conversion scoring system. It covers the research foundations, rubric
      design rationale, factor weighting decisions, screenshot strategy, rescoring approach, contact
      extraction, and implementation considerations.
    </p>
    <blockquote>
      <p>
        <strong>Source:</strong> Open-WebUI chat with OpenRouter (January 2026), exported and
        consolidated.
      </p>
    </blockquote>
    <hr />
    <h2>Table of Contents</h2>
    <ol>
      <li><a href="#1-research-question">Research Question</a></li>
      <li><a href="#2-foundations-why-these-factors">Foundations: Why These Factors</a></li>
      <li><a href="#3-the-nine-factor-rubric">The Nine-Factor Rubric</a></li>
      <li><a href="#4-factor-weights--rationale">Factor Weights & Rationale</a></li>
      <li>
        <a href="#5-score-calculation--grading-scale">Score Calculation & Grading Scale</a>
      </li>
      <li><a href="#6-screenshot-strategy">Screenshot Strategy</a></li>
      <li><a href="#7-two-pass-architecture">Two-Pass Architecture</a></li>
      <li><a href="#8-contact-extraction-on-rescore">Contact Extraction on Rescore</a></li>
      <li><a href="#9-popover-handling">Popover Handling</a></li>
      <li><a href="#10-llm-prompt-design">LLM Prompt Design</a></li>
      <li><a href="#11-validation--calibration">Validation & Calibration</a></li>
      <li><a href="#12-token-optimization">Token Optimization</a></li>
      <li><a href="#13-implementation-notes">Implementation Notes</a></li>
      <li><a href="#14-appendix-original-json-schemas">Appendix: Original JSON Schemas</a></li>
    </ol>
    <hr />
    <h2>1. Research Question</h2>
    <p>The original question that kicked off this research:</p>
    <blockquote>
      <p>
        Propose a scoring system for website conversion (with standard school grading of A+ to F)
        based on factors such as clear offer, CTA, urgency, hook, strong headline, strong value
        proposition, clear reason to choose them (USP), no generic stock photos, trust elements
        (reviews, badges, guarantees), and anything else from best practices.
      </p>
      <p>
        This will be provided to an LLM to calculate the score, along with the HTML of the DOM after
        pageload and one or more screenshots.
        Please advise whether this scoring system will require
        a full-page screenshot, or will just an above-the-fold and maybe the first below-the-fold
        screenshots be sufficient to produce a reasonable score? This system will be scoring many
        hundreds of thousands of websites, so minimising LLM token usage is more important than
        score accuracy.
      </p>
    </blockquote>
    <hr />
    <h2>2. Foundations: Why These Factors</h2>
    <p>
      The scoring factors were selected by synthesizing established CRO (Conversion Rate
      Optimization) best practices, prioritization frameworks (RICE, PIE), and behavioral psychology
      research. The factors map to three broad categories of conversion influence:
    </p>
    <h3>Messaging Clarity & Value Communication</h3>
    <ul>
      <li>
        <strong>Headline Quality</strong> — The primary hook; must communicate what, who, and why
        within 3-5 seconds
      </li>
      <li>
        <strong>Value Proposition</strong> — Extends the headline; shifts from features to benefits
        ("what's in it for me?")
      </li>
      <li>
        <strong>Unique Selling Proposition</strong> — Why choose <em>this</em> option over
        alternatives
      </li>
      <li>
        <strong>Clear Offer</strong> — What exactly is the visitor being asked to do, and what do
        they get
      </li>
    </ul>
    <h3>User Confidence & Trust</h3>
    <ul>
      <li>
        <strong>Trust & Credibility Signals</strong> — Testimonials, certifications, badges,
        partner logos, media mentions
      </li>
      <li>
        <strong>Authentic Imagery</strong> — Real product photos vs.
        generic stock; professional
        visual design
      </li>
    </ul>
    <h3>Action & Engagement</h3>
    <ul>
      <li><strong>Call-to-Action</strong> — Copy clarity, visual prominence, and placement</li>
      <li>
        <strong>Urgency/Scarcity</strong> — Legitimate time/supply pressure for immediate action
      </li>
      <li>
        <strong>Hook & Engagement</strong> — Hero element that captures attention in the first
        seconds
      </li>
    </ul>
    <h3>Additional Context</h3>
    <ul>
      <li>
        <strong>Industry Appropriateness</strong> (3% weight) — Whether design serves its specific
        business model context (B2B SaaS vs. e-commerce vs. local services have different norms)
      </li>
    </ul>
    <h3>Key Research Findings</h3>
    <ul>
      <li>Users spend 57-80% of viewing time on above-the-fold content (Nielsen Norman Group)</li>
      <li>
        Google found ads above-fold achieve 73% viewability vs. 44% below-fold — a 66% "fold
        cliff"
      </li>
      <li>
        90% of users begin scrolling within 14 seconds, but
        <em>only if above-fold content signals value</em>
      </li>
      <li>CTA copy changes alone can generate conversion improvements exceeding 200%</li>
      <li>
        GPT-4 Vision studies found cropped images actually <em>outperform</em> full-page captures
        for identification tasks — background noise reduces accuracy
      </li>
    </ul>
    <hr />
    <h2>3. The Nine-Factor Rubric</h2>
    <p>
      Each of the nine core factors, plus a tenth context factor, is scored 0-10 against specific
      rubric definitions. Below is the condensed rubric; the full version with examples lives in
      the LLM prompts.
    </p>
    <h3>Factor 1: Headline Quality & Clarity (15%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>
            Immediately communicates value; benefit-oriented; specific; creates curiosity or
            emotional connection
          </td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Clearly communicates basic benefit; mostly specific; adequate direction</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Communicates a benefit but somewhat generic; requires modest interpretation</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Vague, generic, or fails to communicate core benefit</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Confusing, contradictory, or essentially absent above-fold</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No discernable headline or actively confusing</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 2: Value Proposition Clarity (14%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Specific, benefit-oriented, compelling; clearly differentiates</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Clear and benefit-focused; adequately articulates core benefits</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Present but generic or feature-heavy; requires interpretation</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Vague or feature-focused; unclear differentiation</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Barely present or confused with feature lists</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No value proposition or contradictory messaging</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 3: Unique Selling Proposition (13%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Clear, compelling differentiation; specific competitive
          advantage</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Reasonably clear; specific advantage identified</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Some differentiation implied but not explicit</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Vague; relies on generic claims ("best in class")</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Barely present; no clear reasons to choose</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No differentiation; appears identical to generic competitors</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 4: Call-to-Action Design & Placement (13%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>
            Visible above fold; specific action-oriented language; visually prominent; secondary
            CTAs at natural breaks
          </td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Visible above fold; action-oriented; reasonably prominent</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>
            Present; clear but generic ("Submit", "Learn More"); adequate
            placement
          </td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Present but not prominent; vague language; requires scrolling</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Hard to find, confusing, or inadequately prominent</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No CTA or buried below multiple scrolls</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 5: Urgency & Scarcity (10%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Legitimate urgency with specifics (deadline, count); genuine pressure</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Clear mechanism; specific rather than vague</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Some urgency suggested but lacks specifics</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Vague ("act soon", "don't miss out") without
          details</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Minimal or ineffective urgency</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No urgency or false urgency undermining credibility</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 6: Hook & Initial Engagement (9%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Visually compelling hero element; contextually relevant; strong engagement</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Professional, relevant hero; adequate engagement</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Present but generic; mild engagement</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Dated, poorly executed, or tangentially relevant</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Missing, poor quality, or detracting</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No hook; purely text-based above-fold with no visual appeal</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 7: Trust & Credibility Signals (11%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>
            Multiple relevant elements (named testimonials, certifications, badges, logos);
            prominently placed
          </td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Several elements; specific testimonials or credible certifications</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Some elements (generic testimonials or basic badges); adequate</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Minimal; generic or lacking credibility</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Nearly absent</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No trust signals at all</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 8: Authentic Imagery & Visual Design (8%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Authentic imagery (product photos, real customers); professional design</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Mix of authentic and professional; solid design</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Mostly professional with some stock; adequate; minor issues</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Significant stock photos; dated design; unprofessional impression</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Predominantly generic/low-quality; poor design</td>
        </tr>
        <tr>
          <td>0</td>
          <td>Broken images, extremely low-quality, or repelling</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 9: Clear Offer & Specificity (4%)</h3>
    <table>
      <thead>
        <tr>
          <th>Score</th>
          <th>Description</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>9-10</td>
          <td>Specific, unambiguous; visitor knows exactly what they get</td>
        </tr>
        <tr>
          <td>7-8</td>
          <td>Clear and specific; minor ambiguity</td>
        </tr>
        <tr>
          <td>5-6</td>
          <td>Generally clear but could be more specific</td>
        </tr>
        <tr>
          <td>3-4</td>
          <td>Somewhat vague; visitor must infer details</td>
        </tr>
        <tr>
          <td>1-2</td>
          <td>Unclear or hard to determine</td>
        </tr>
        <tr>
          <td>0</td>
          <td>No discernable offer</td>
        </tr>
      </tbody>
    </table>
    <h3>Factor 10: Contextual Appropriateness (3%)</h3>
    <p>
      Evaluates whether design serves its industry/business model context. B2B SaaS, e-commerce, and
      local services have different CRO norms.
    </p>
    <hr />
    <h2>4.
    Factor Weights & Rationale</h2>
    <p>
      The weights were derived from empirical research on correlation with actual conversion
      outcomes:
    </p>
    <table>
      <thead>
        <tr>
          <th>Factor</th>
          <th>Weight</th>
          <th>Rationale</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Headline Quality</td>
          <td>15%</td>
          <td>
            Primary determinant of whether users engage or bounce; captures 80% of initial attention
          </td>
        </tr>
        <tr>
          <td>Value Proposition</td>
          <td>14%</td>
          <td>
            Extends headline; answers "what's in it for me?"; directly drives
            consideration
          </td>
        </tr>
        <tr>
          <td>USP/Differentiation</td>
          <td>13%</td>
          <td>Critical for competitive markets; answers "why you over alternatives?"</td>
        </tr>
        <tr>
          <td>CTA Design</td>
          <td>13%</td>
          <td>The conversion mechanism itself; changes to CTA alone can drive 200%+ improvement</td>
        </tr>
        <tr>
          <td>Trust Signals</td>
          <td>11%</td>
          <td>
            Addresses fundamental "is this trustworthy?" concern; increasingly important
            post-privacy era
          </td>
        </tr>
        <tr>
          <td>Urgency/Scarcity</td>
          <td>10%</td>
          <td>Drives immediate action vs.
          postponement; effective when legitimate</td>
        </tr>
        <tr>
          <td>Hook/Engagement</td>
          <td>9%</td>
          <td>First-impression visual; supports but doesn't replace messaging</td>
        </tr>
        <tr>
          <td>Imagery/Design</td>
          <td>8%</td>
          <td>
            Credibility signal; generic stock undermines trust but doesn't make or break
            conversion
          </td>
        </tr>
        <tr>
          <td>Offer Clarity</td>
          <td>4%</td>
          <td>Important but usually redundant with headline + CTA when those are strong</td>
        </tr>
        <tr>
          <td>Context</td>
          <td>3%</td>
          <td>Catch-all for industry-specific norms</td>
        </tr>
        <tr>
          <td><strong>Total</strong></td>
          <td><strong>100%</strong></td>
          <td></td>
        </tr>
      </tbody>
    </table>
    <p>
      The top 4 factors (headline, value prop, USP, CTA) account for 55% of the score. This reflects
      the research consensus that messaging clarity and the conversion mechanism are the dominant
      drivers.
    </p>
    <hr />
    <h2>5. Score Calculation & Grading Scale</h2>
    <h3>Formula</h3>
    <pre><code>Overall Score = (Headline × 0.15) + (Value Prop × 0.14) + (USP × 0.13) + (CTA × 0.13)
              + (Urgency × 0.10) + (Hook × 0.09) + (Trust × 0.11) + (Imagery × 0.08)
              + (Offer × 0.04) + (Context × 0.03)
</code></pre>
    <p>
      Each factor is scored 0-10, so the weighted sum also falls in 0-10; multiplying it by 10
      gives the 0-100 score.
    </p>
    <h3>Grading Scale</h3>
    <p>
      The original research used a standard academic scale.
      The production system now uses a
      business-oriented scale:
    </p>
    <table>
      <thead>
        <tr>
          <th>Grade</th>
          <th>Score Range</th>
          <th>Interpretation</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>A+</td>
          <td>95-100</td>
          <td>Exceptional conversion design</td>
        </tr>
        <tr>
          <td>A</td>
          <td>90-94</td>
          <td>Excellent; well-executed fundamentals</td>
        </tr>
        <tr>
          <td>A-</td>
          <td>85-89</td>
          <td>Very good; minor weaknesses</td>
        </tr>
        <tr>
          <td>B+</td>
          <td>83-84</td>
          <td>Good; some friction or messaging issues</td>
        </tr>
        <tr>
          <td>B</td>
          <td>82</td>
          <td>Satisfactory; multiple improvement opportunities</td>
        </tr>
        <tr>
          <td>B-</td>
          <td>70-81</td>
          <td>Below average but acceptable</td>
        </tr>
        <tr>
          <td>C</td>
          <td>50-69</td>
          <td>Marginal; substantial improvements needed</td>
        </tr>
        <tr>
          <td>D</td>
          <td>30-49</td>
          <td>Poor; critical issues present</td>
        </tr>
        <tr>
          <td>E</td>
          <td>0-29</td>
          <td>Fundamentally broken</td>
        </tr>
        <tr>
          <td>F</td>
          <td>Negative</td>
          <td>Should not occur in practice</td>
        </tr>
      </tbody>
    </table>
    <blockquote>
      <p>
        <strong>Note:</strong> The production grading scale (above) differs from the original
        academic scale proposed in the research (which had C+/C-/D+/D- subdivisions). The business
        scale was adopted because the scoring system is used to identify prospects who need help,
        not to give academic grades. See <code>src/score.js:computeGrade()</code> for the production
        implementation.
      </p>
    </blockquote>
    <hr />
    <h2>6.
    Screenshot Strategy</h2>
    <h3>Key Decision: Above-the-Fold Is Sufficient</h3>
    <p>
      The research concluded that
      <strong
        >above-the-fold and first below-the-fold screenshots are substantially sufficient for
        reliable scoring</strong
      >, reducing token consumption by 65-75% compared to full-page screenshots while maintaining
      evaluation accuracy above 90%.
    </p>
    <h3>Recommended Approach</h3>
    <ol>
      <li><strong>Primary:</strong> Desktop above-the-fold (1920x1080)</li>
      <li>
        <strong>Secondary:</strong> Mobile above-the-fold (375x667) —
        <em>later dropped in production for cost reasons</em>
      </li>
      <li>
        <strong>Conditional:</strong> Below-the-fold screenshot if initial score is low (rescoring
        pass)
      </li>
    </ol>
    <h3>Evidence</h3>
    <ul>
      <li>Above-fold content captures 57-80% of user viewing time</li>
      <li>The 9 scoring factors cluster heavily above the fold on well-designed pages</li>
      <li>
        GPT-4 Vision cropping research showed focused images <em>improve</em> accuracy by removing
        noise
      </li>
      <li>Token savings: ~1,000 tokens per above-fold image vs. ~2,000 for full-page</li>
      <li>
        At 500,000 websites: ~500 million fewer tokens per screenshot (~1 billion if two
        screenshots are captured per site)
      </li>
    </ul>
    <h3>Production Implementation</h3>
    <p>In the production system (<code>src/capture.js</code>):</p>
    <ul>
      <li>Desktop screenshot captured at page load (cropped + uncropped variants)</li>
      <li>DOM-aware intelligent cropping preserves CTAs, trust signals, hero imagery</li>
      <li>Cropped version saves 20-35% additional LLM tokens</li>
      <li>Below-fold screenshot captured separately for rescoring pass</li>
      <li>Mobile screenshot was dropped for cost efficiency</li>
    </ul>
    <hr />
    <h2>7.
    Two-Pass Architecture</h2>
    <h3>Design Decision: Conditional Resubmission</h3>
    <p>The research compared three approaches for handling below-the-fold content:</p>
    <table>
      <thead>
        <tr>
          <th>Approach</th>
          <th>Token Cost (100K sites, 30% low-scoring)</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><strong>Conditional resubmission</strong></td>
          <td>174M tokens</td>
        </tr>
        <tr>
          <td>Always include below-fold</td>
          <td>180M tokens</td>
        </tr>
        <tr>
          <td>Include with "ignore if unnecessary"</td>
          <td>185M tokens</td>
        </tr>
      </tbody>
    </table>
    <p><strong>Conditional resubmission wins</strong> because:</p>
    <ul>
      <li>
        Vision models charge for image tokens at input time regardless of whether the model
        "uses" the image
      </li>
      <li>
        Including an image and saying "only look at it if needed" does NOT save tokens
      </li>
      <li>The breakeven point is ~50% of sites scoring low; in practice only ~30% do</li>
    </ul>
    <h3>Pass 1: Scoring (Above-the-Fold)</h3>
    <ul>
      <li>Input: Desktop screenshot (cropped) + HTML DOM</li>
      <li>
        Output: Factor scores (0-10 each), weighted total, grade, strengths, weaknesses, improvement
        opportunities
      </li>
      <li>Sites scoring below threshold proceed to Pass 2</li>
    </ul>
    <h3>Pass 2: Rescoring (Below-the-Fold)</h3>
    <ul>
      <li>Input: Below-fold screenshot + HTML DOM + original score JSON</li>
      <li>
        Output: Adjusted factor scores (only where new content warrants change), recalculated
        total/grade, contact details
      </li>
      <li>Does NOT resend above-fold screenshots (LLM already has the context from Pass 1 JSON)</li>
      <li>
        Focused prompt references original scores and asks for adjustments, not full re-evaluation
      </li>
    </ul>
    <h3>Threshold</h3>
    <p>
      The original research suggested C+ (77) as the resubmission threshold.
      The production system
      uses a configurable <code>LOW_SCORE_CUTOFF</code> (currently 82, i.e., B- and below).
    </p>
    <blockquote>
      <p>
        <strong>Business logic:</strong> We're selling web design services. High scorers
        don't need help; low scorers are prospects. Rescoring gives low-scoring sites a second
        chance with more data before proposal generation.
      </p>
    </blockquote>
    <hr />
    <h2>8. Contact Extraction on Rescore</h2>
    <p>
      Contact extraction was added to the rescoring pass (not the initial scoring) to save tokens —
      you only extract contacts for sites you actually plan to contact (low scorers).
    </p>
    <h3>What Gets Extracted</h3>
    <p>From the HTML DOM (not guessed):</p>
    <ul>
      <li>
        <strong>Contact form details:</strong> action URL, method, field presence (first_name,
        last_name, full_name, email, phone, company_name, subject_line, message) with field types,
        name attributes, and labels
      </li>
      <li>
        <strong>Email addresses:</strong> All explicit <code>mailto:</code> links or plain-text
        emails
      </li>
      <li>
        <strong>Phone numbers:</strong> All explicit <code>tel:</code> links or recognizable
        patterns
      </li>
      <li>
        <strong>Social profiles:</strong> Links to major platforms with platform identification
      </li>
      <li>
        <strong>Contact page URLs:</strong> Explicit "/contact" or "/support"
        links
      </li>
    </ul>
    <h3>Design Rationale</h3>
    <ul>
      <li>
        Extracting contacts in Pass 1 would waste tokens on high-scoring sites we won't contact
      </li>
      <li>The HTML DOM is already being sent in Pass 2 anyway (for score adjustment)</li>
      <li>Adding contact extraction to the same API call adds minimal token overhead</li>
      <li>
        All fields are optional — the LLM reports what it finds, uses <code>null</code>/empty for
        missing data
      </li>
    </ul>
    <hr />
    <h2>9.
    Popover Handling</h2>
    <p><strong>Decision: Close popovers before taking screenshots.</strong></p>
    <p>Reasoning:</p>
    <ol>
      <li>Popovers obscure the headline, hero image, CTA, and trust signals being evaluated</li>
      <li>
        They represent a secondary conversion path (newsletter signup, discount) — not the primary
        page conversion
      </li>
      <li>
        They create inconsistent evaluation conditions (some sites show immediately, others on
        delay/exit)
      </li>
      <li>
        The entire scoring methodology depends on evaluating above-fold content, which is completely
        blocked by modal overlays
      </li>
    </ol>
    <h3>Implementation</h3>
    <p>
      The production system (<code>src/capture.js</code> /
      <code>src/utils/stealth-browser.js</code>):
    </p>
    <ul>
      <li>Waits 2-3 seconds after page load for delay-triggered popovers</li>
      <li>
        Attempts to close via common selectors (<code>[class*='close']</code>,
        <code>[aria-label='Close']</code>, etc.)
      </li>
      <li>Sends Escape key as fallback</li>
      <li>Takes screenshot immediately after closing to avoid new popovers</li>
    </ul>
    <hr />
    <h2>10.
    LLM Prompt Design</h2>
    <h3>Prompt Structure (Both Passes)</h3>
    <ol>
      <li><strong>System context:</strong> Expert CRO specialist role</li>
      <li>
        <strong>Input specification:</strong> What data is provided (screenshots, HTML, prior scores
        for rescoring)
      </li>
      <li><strong>Evaluation framework:</strong> Factor definitions with rubric anchors</li>
      <li><strong>Scoring methodology:</strong> Weighted calculation formula</li>
      <li><strong>Output format:</strong> Strict JSON schema</li>
      <li>
        <strong>Best practices:</strong> Analyze HTML first, cross-reference with screenshots,
        assess mobile/desktop separately, provide specific evidence
      </li>
    </ol>
    <h3>Key Design Principles</h3>
    <ul>
      <li>
        <strong>Rubric in system prompt:</strong> Full rubric definitions appear once in the system
        prompt, not repeated per-website
      </li>
      <li>
        <strong>Evidence-based scoring:</strong> Each factor score requires 1-2 sentence reasoning
        with specific page evidence
      </li>
      <li>
        <strong>Independent factor scoring:</strong> Score each factor independently, then calculate
        weighted total
      </li>
      <li>
        <strong>Dual analysis:</strong> HTML content analysis + visual assessment cross-referenced
      </li>
      <li>
        <strong>Confidence assessment:</strong> Overall confidence (High/Medium/Low) with limitation
        notes
      </li>
    </ul>
    <h3>Production Evolution</h3>
    <p>
      The production prompts (<code>prompts/CONVERSION-SCORING-VISION.md</code> and
      <code>prompts/CONVERSION-SCORING-NOVIS.md</code>) have evolved from this original design:
    </p>
    <ul>
      <li>Simplified output JSON (removed verbose nested structures for token efficiency)</li>
      <li>
        Added <code>recommendation_sms</code> and <code>recommendation_email</code> fields for
        proposal generation
      </li>
      <li>Split into vision-enabled and HTML-only variants</li>
      <li>
        Grade
        calculation moved from LLM to code (<code>computeGrade()</code> in
        <code>src/score.js</code>)
      </li>
      <li>
        LLM now returns only <code>factor_scores</code>; total and grade computed programmatically
        for consistency
      </li>
    </ul>
    <hr />
    <h2>11. Validation & Calibration</h2>
    <p>The research recommended the following validation approach (partially implemented):</p>
    <h3>Inter-Rater Reliability</h3>
    <ul>
      <li>Evaluate 50-100 websites with experienced CRO professionals</li>
      <li>Run same websites through LLM scoring</li>
      <li>Target Spearman correlation > 0.85 between LLM and expert scores</li>
      <li>Analyze divergence cases; iterate on rubric wording</li>
    </ul>
    <h3>Expected Accuracy</h3>
    <ul>
      <li>
        <strong>Letter grade agreement:</strong> 75-85% exact match with experts; remaining are
        adjacent grades
      </li>
      <li>
        <strong>Factor-level accuracy:</strong> 80-90% match (individual factors more objective than
        aggregated grades)
      </li>
      <li>
        <strong>High-confidence cases:</strong> >90% accuracy for clearly strong (A-) or weak
        (D/F) sites
      </li>
      <li><strong>Mid-range (B-C+):</strong> Lower agreement due to inherent subjectivity</li>
    </ul>
    <h3>Continuous Monitoring</h3>
    <ul>
      <li>Distribution should approximate normal centered around C+/B-</li>
      <li>
        Factor correlations should match expectations (headline ↔ value prop should correlate >
        0.6)
      </li>
      <li>1% human spot-checks; recalibrate if divergences exceed 5-10%</li>
    </ul>
    <h3>Production Status</h3>
    <p>In practice, the system has been validated through:</p>
    <ul>
      <li>Manual review of thousands of scored sites during outreach QA</li>
      <li>Grade/score mismatch detection and correction</li>
      <li>Programmatic grade computation (eliminating LLM grading inconsistencies)</li>
      <li>Iterative prompt refinement based
      on observed scoring patterns</li>
    </ul>
    <hr />
    <h2>12. Token Optimization</h2>
    <h3>Image Optimization (Implemented)</h3>
    <ul>
      <li>JPEG conversion (quality 85): 40-50% file size reduction</li>
      <li>DOM-aware intelligent cropping: 20-35% token reduction</li>
      <li>Resolution targeting: minimum for text legibility</li>
      <li>Above-fold only: 65-75% reduction vs. full-page</li>
    </ul>
    <h3>Prompt Optimization (Implemented)</h3>
    <ul>
      <li>Rubric in system prompt (not repeated per website)</li>
      <li>Abbreviated factor references in rescoring prompt</li>
      <li>Strict JSON output (no prose explanation outside the JSON)</li>
      <li>LLM returns factor scores only; computation done in code</li>
    </ul>
    <h3>HTML-Only Mode (Added Later)</h3>
    <p>When <code>ENABLE_VISION=false</code>, the system skips screenshots entirely:</p>
    <ul>
      <li>No Playwright screenshot capture</li>
      <li>Text-only analysis of HTML DOM</li>
      <li>Auto-promotes scored sites through rescoring (no below-fold vision needed)</li>
      <li>Cost: ~$0.0025/site vs. ~$0.030/site with vision (roughly 92% savings at these figures)</li>
    </ul>
    <hr />
    <h2>13.
    Implementation Notes</h2>
    <h3>What Changed from Research to Production</h3>
    <table>
      <thead>
        <tr>
          <th>Research Proposal</th>
          <th>Production Implementation</th>
          <th>Reason</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Academic grading scale (A+ to F with +/-)</td>
          <td>Business scale (A+ to F, fewer subdivisions)</td>
          <td>Simpler for prospect identification</td>
        </tr>
        <tr>
          <td>Mobile + desktop screenshots</td>
          <td>Desktop only</td>
          <td>Cost reduction; mobile added minimal value</td>
        </tr>
        <tr>
          <td>LLM computes grade</td>
          <td>Code computes grade from factor scores</td>
          <td>Eliminates grading inconsistencies</td>
        </tr>
        <tr>
          <td>Batch processing of 50-100 sites</td>
          <td>Individual API calls with concurrency control</td>
          <td>Rate limits; error isolation</td>
        </tr>
        <tr>
          <td>Few-shot examples in prompt</td>
          <td>Prompt-only (no examples)</td>
          <td>Token savings; rubric detail sufficient</td>
        </tr>
        <tr>
          <td>C+ (77) rescore threshold</td>
          <td>B- (82) configurable threshold</td>
          <td>Business need: more prospects</td>
        </tr>
        <tr>
          <td>9 factors + context</td>
          <td>10 factors (context kept as factor 10)</td>
          <td>Consistent weighting</td>
        </tr>
        <tr>
          <td>Full JSON with nested evidence/reasoning</td>
          <td>Simplified JSON with factor scores</td>
          <td>Token efficiency</td>
        </tr>
      </tbody>
    </table>
    <h3>Key Files</h3>
    <ul>
      <li>
        <code>src/score.js</code> — Scoring logic, <code>computeScoreFromFactors()</code>,
        <code>computeGrade()</code>
      </li>
      <li><code>src/stages/rescoring.js</code> — Below-fold rescoring pass</li>
      <li><code>src/capture.js</code> — Screenshot capture</li>
      <li><code>src/contacts/prioritize.js</code> — Contact extraction and prioritization</li>
      <li><code>prompts/CONVERSION-SCORING-VISION.md</code> — Vision-enabled scoring prompt</li>
      <li><code>prompts/CONVERSION-SCORING-NOVIS.md</code> — HTML-only scoring prompt</li>
    </ul>
    <hr />
    <h2>14. Appendix: Original JSON Schemas</h2>
    <h3>Pass 1: Scoring Output</h3>
    <pre><code class="language-json">{
  "website_url": "https://example.com",
  "evaluation_date": "2026-01-14T12:00:00Z",
  "device_analysis": {
    "desktop_visible": true,
    "mobile_visible": true,
    "design_differences": "Mobile layout stacks hero and CTA; desktop shows them side by side."
  },
  "factor_scores": {
    "headline_quality": {
      "score": 8,
      "reasoning": "Headline clearly states what the product does and who it's for.",
      "evidence": "Headline text: 'Automate Your Invoices in Minutes for Small Businesses'."
    },
    "value_proposition": {
      "score": 7,
      "reasoning": "Benefits are clear but lack concrete quantified outcomes.",
      "evidence": "Copy mentions 'save time and reduce errors' but no specific numbers."
    },
    "unique_selling_proposition": {
      "score": 6,
      "reasoning": "USP is implied but not explicitly contrasted with competitors.",
      "evidence": "Mentions 'built specifically for freelancers' but no comparison."
    },
    "call_to_action": {
      "score": 9,
      "reasoning": "Primary CTA is above the fold, high contrast, and action-oriented.",
      "evidence": "CTA button 'Start Free 14-Day Trial' in hero, visually prominent."
    },
    "urgency_messaging": {
      "score": 3,
      "reasoning": "Weak urgency with vague wording and no specific deadline.",
      "evidence": "Text: 'Join now and don't miss out' without concrete time limit."
    },
    "hook_engagement": {
      "score": 7,
      "reasoning": "Hero image is relevant and supports the message.",
      "evidence": "Image shows a freelancer working with invoices on a laptop."
    },
    "trust_signals": {
      "score": 5,
      "reasoning": "Includes generic testimonials but no logos or certifications.",
      "evidence": "Three short text testimonials with first names only."
    },
    "imagery_design": {
      "score": 8,
      "reasoning": "Clean, modern design with custom-looking imagery.",
      "evidence": "Custom illustrations and product screenshots; no stock photos."
    },
    "offer_clarity": {
      "score": 8,
      "reasoning": "Offer is explicit: 14-day free trial, no credit card.",
      "evidence": "Text near CTA: 'Try it free for 14 days. No credit card needed.'"
    },
    "contextual_appropriateness": {
      "score": 7,
      "reasoning": "Design is appropriate for a B2B SaaS invoicing tool.",
      "industry_context": "B2B SaaS / invoicing software"
    }
  },
  "overall_calculation": {
    "weighted_total": 76.5,
    "letter_grade": "C",
    "grade_interpretation": "Acceptable fundamentals but room for improvement in USP, trust, and urgency."
  },
  "key_strengths": [
    "Strong, visible primary CTA with clear action and low-friction offer.",
    "Headline and offer are easy to understand for the target audience."
  ],
  "critical_weaknesses": [
    "Weak urgency messaging provides little reason to act now.",
    "Trust elements are generic and do not provide strong social proof."
  ],
  "quick_improvement_opportunities": [
    "Add specific trust signals (logos, detailed testimonials) above the fold.",
    "Introduce concrete urgency (time-bound offer or limited onboarding slots)."
  ],
  "confidence_assessment": {
    "overall_confidence": "High",
    "reasoning": "Above-fold content contains all major conversion elements.",
    "limitation_notes": "Does not consider deeper content, checkout flow, or post-click funnel."
  }
}
</code></pre>
    <h3>Pass 2: Rescoring + Contact Extraction Output</h3>
    <p>The rescoring pass adds a <code>contact_details</code> section to the evaluation JSON:</p>
    <pre><code class="language-json">{
  "contact_details": {
    "primary_contact_form": {
      "form_action_url": "https://example.com/contact-submit",
      "form_method": "post",
      "fields": {
        "first_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "first_name",
          "label_or_placeholder": "First name"
        },
        "last_name": {
          "present": true,
          "field_type": "text",
          "name_attribute": "last_name",
          "label_or_placeholder": "Last name"
        },
        "full_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "email": {
          "present": true,
          "field_type": "email",
          "name_attribute": "email",
          "label_or_placeholder": "Your email"
        },
        "phone": {
          "present": true,
          "field_type": "tel",
          "name_attribute": "phone",
          "label_or_placeholder": "Phone number"
        },
        "company_name": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "subject_line": {
          "present": false,
          "field_type": null,
          "name_attribute": null,
          "label_or_placeholder": null
        },
        "message": {
          "present": true,
          "field_type": "textarea",
          "name_attribute": "message",
          "label_or_placeholder": "Your message"
        }
      }
    },
    "email_addresses": [
      "support@example.com",
      "sales@example.com"
    ],
    "phone_numbers": [
      "+1-555-123-4567"
    ],
    "social_profiles": [
      {
        "platform": "facebook",
        "url": "https://www.facebook.com/example"
      },
      {
        "platform": "linkedin",
        "url": "https://www.linkedin.com/company/example"
      }
    ],
    "contact_pages": [
      "https://example.com/contact",
      "https://example.com/support"
    ]
  }
}
</code></pre>
    <hr />
    <h2>References</h2>
    <p>The original research cited ~60 sources. Key references that informed the design:</p>
    <ul>
      <li>
        Nielsen Norman Group — User attention and fold research (57-80% above-fold viewing time)
      </li>
      <li>Google ad viewability study — 73% above-fold vs. 44% below-fold viewability</li>
      <li>CRO best practices literature — Factor selection and weighting</li>
      <li>GPT-4 Vision cropping research — Focused images outperform full-page for accuracy</li>
      <li>
        PURE method — Inter-rater reliability calibration for expert-based evaluation (>0.8
        reliability)
      </li>
      <li>LLMLingua (Microsoft) — Prompt optimization: 35% token reduction maintaining quality</li>
      <li>DeepSeek vision-text compression — 7-20x token reduction at >90% accuracy</li>
      <li>OpenAI vision model documentation — Token calculation and pricing for image inputs</li>
    </ul>

    <footer
      style="
        margin-top: 3rem;
        padding-top: 1rem;
        border-top: 1px solid var(--border);
        font-size: 0.85em;
        color: #6b7280;
      "
    >
      <p>
        Generated from original Open-WebUI research chat (January 2026). Consolidated February 2026.
      </p>
      <p>333 Method — SERP-to-Outreach Automation</p>
    </footer>
  </body>
</html>