# **A Developer's Playbook for Resilient Web Scraping: Advanced Evasion and Automation in a GitHub Actions Environment**

## **Executive Summary**

The extraction of data from dynamic, modern web platforms represents a significant engineering challenge, far removed from the realm of simple scripting. High-value targets, such as professional networking and job platforms like LinkedIn, are fortified with multi-layered, sophisticated anti-bot systems designed to detect and block automated access. This complexity is further amplified when the scraping operations must be conducted within the ephemeral, stateless, and inherently conspicuous environment of a Continuous Integration/Continuous Delivery (CI/CD) system like GitHub Actions. Standard scraping techniques are not merely insufficient; they are destined for immediate failure.

This playbook presents a definitive, expert-level guide for developers and engineers tasked with building a resilient, long-term web scraping engine under these demanding conditions. It deconstructs the problem into its core components—advanced browser automation, robust evasion tactics, and intelligent CI/CD orchestration—and provides a comprehensive, actionable solution. The architectural blueprint detailed herein is founded on a multi-layered defense strategy designed to consistently evade detection and ensure reliable data extraction.

The proposed solution architecture integrates a modern browser automation framework, Playwright, chosen for its superior handling of dynamic, JavaScript-heavy applications. This foundation is augmented with a suite of advanced evasion tools and techniques, including specialized stealth libraries that patch browser-level automation tells, sophisticated human behavior emulation for mouse and keyboard interactions, and a non-negotiable, strategically managed network of rotating residential proxies to mask the scraper's origin.
A key architectural innovation presented is a two-workflow system within GitHub Actions to manage session state securely, decoupling the high-risk login process from routine scraping operations to enhance both stealth and stability.

This document is structured to guide the reader through a logical progression, beginning with a deep analysis of the modern anti-scraping threat landscape—from browser and protocol-level fingerprinting to behavioral biometrics. It then provides a practical arsenal of evasion tools and techniques, complete with comparative analyses and code implementations. Finally, it culminates in a complete architectural blueprint, detailing the Python scraper's design and the full YAML configuration for a scalable, resilient, and automated GitHub Actions workflow. This playbook is intended not as a theoretical exercise, but as a production-ready guide for building a web scraping engine capable of operating successfully against the most challenging targets in the most constrained environments.

## **Section 1: The Modern Gauntlet: Understanding Advanced Anti-Scraping Defenses**

To construct a resilient scraping engine, one must first comprehend the adversary: the sophisticated, multi-layered defense systems of modern web platforms. These systems have evolved far beyond simple IP blocking or User-Agent string checks. They now employ a holistic approach that analyzes a client's identity from the initial network handshake up to the nuanced patterns of their mouse movements. A failure to present a consistent, human-like profile across every layer of this inspection will result in immediate detection and blocking.

### **1.1. Beyond User-Agents: A Taxonomy of Browser Fingerprinting**

Browser fingerprinting is a collection of techniques used to create a unique, stable identifier for a client by gathering information about its specific configuration.1 This process operates without storing data on the user's device, making it a stealthy and powerful alternative to traditional cookie-based tracking.1 The resulting fingerprint is a hash generated from a combination of passively and actively collected data points, which, when combined, can achieve a high degree of uniqueness.3

#### **Passive Fingerprinting**

The first layer of detection involves the passive analysis of information that a browser sends with every HTTP request. This includes HTTP headers, which provide details about the browser, operating system, and preferred language.1 While this data provides limited uniqueness on its own and can be easily spoofed, inconsistencies—such as a User-Agent string for Chrome on Windows being accompanied by network protocol characteristics of a Linux-based Python library—serve as an immediate red flag for detection systems.1

#### **Active JavaScript-based Fingerprinting**

The more formidable challenge lies in active fingerprinting, where the target website executes JavaScript on the client side to interrogate the browser and its environment in detail. This approach creates a high-entropy fingerprint that is far more difficult to forge.

* **Browser & OS Properties:** The most fundamental active checks involve querying properties of the navigator object. The navigator.webdriver property, which is set to true by default in standard automation frameworks like Selenium and Playwright, is a primary indicator of automation. Advanced systems go further, enumerating installed browser plugins (navigator.plugins), system fonts, and screen resolution (window.screen) to build a more detailed profile.1 While individual properties are not unique, the specific combination across a user base is highly distinct.
* **Canvas Fingerprinting:** This is a powerful technique where a script instructs the browser to render a hidden image or text onto an HTML5 \<canvas\> element. The exact pixel-by-pixel output of this rendering process is subtly influenced by a combination of the operating system, the graphics card (GPU), installed graphics drivers, and font rendering engines.3 The script then extracts the rendered image data as a Base64-encoded string and computes a hash of it.3 This hash serves as a highly stable and unique identifier because even imperceptible rendering variations between devices will produce a different hash.3 The widespread adoption of this technique, with usage nearly doubling on top websites over a seven-year period, highlights its effectiveness.1
* **WebGL Fingerprinting:** An even more potent and difficult-to-evade technique is WebGL fingerprinting. WebGL (Web Graphics Library) is a JavaScript API for rendering 2D and 3D graphics directly in the browser, providing low-level access to the GPU.6 A fingerprinting script can instruct the browser to render a complex 3D scene. During this process, it collects a wealth of hardware-specific information, including the GPU model and vendor, driver versions, shader precision, supported extensions, and the exact pixel data of the rendered output.6 Because these characteristics are tied directly to the physical hardware, they are exceptionally difficult to spoof convincingly.6 Any attempt to mask one parameter (e.g., the GPU vendor string) while the actual rendering output corresponds to a different GPU will create a detectable inconsistency.
* **Audio Fingerprinting:** This technique leverages the Web Audio API to generate a unique fingerprint from a device's audio stack. A script uses an OscillatorNode to generate a specific, often inaudible, sound wave. This wave is then processed, and the output is analyzed. The final waveform is subtly altered by the specific hardware, drivers, and browser implementation, creating a consistent and unique hash for the device. This method is particularly stealthy as it leaves no client-side state and does not require user interaction.8

The evolution of these techniques illustrates a clear escalation in the cat-and-mouse game between scrapers and anti-bot systems. Early detection focused on simple flags like navigator.webdriver. Once automation tools began patching this property, defense systems moved to more complex, multi-vector analyses like Canvas and WebGL fingerprinting. This progression reveals a critical principle for modern evasion: it is no longer sufficient to mask a single property. A resilient scraping engine must present a complete and internally consistent fingerprint. An automation tool that reports a Chrome-on-Windows User-Agent but produces a WebGL fingerprint characteristic of a Linux server's virtualized GPU will be instantly flagged. Success requires a holistic approach that ensures every detectable attribute tells the same plausible story.

### **1.2. The Protocol Layer: TLS and HTTP/2 Fingerprinting**

The most sophisticated anti-bot systems do not wait for JavaScript execution to detect automation. They can identify and block a scraper at the network protocol level, based on the signature of its initial connection request. This layer of detection is particularly effective because the characteristics of a TLS (Transport Layer Security) and HTTP/2 connection are determined by the underlying networking library (e.g., Python's http.client, Node.js's http2 module) and are often fundamentally different from those of a real browser.
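To make this concrete, the following is a minimal sketch of how a JA3-style signature is derived from Client Hello parameters. The numeric field values below are invented for illustration, not taken from a real capture: the JA3 scheme joins the decimal values within each field with dashes, the five fields with commas, and hashes the resulting string with MD5.

```python
# Sketch of JA3-style fingerprint derivation (illustrative values only).
import hashlib

def ja3_fingerprint(tls_version: int, ciphers: list[int], extensions: list[int],
                    curves: list[int], point_formats: list[int]) -> str:
    """Concatenate Client Hello fields in JA3 order and MD5-hash them.

    Field order and value order are hashed verbatim, which is why two
    clients offering the same capabilities in a different order produce
    different fingerprints.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Identical ciphers, different extension order: the hashes differ.
a = ja3_fingerprint(771, [4865, 4866], [0, 23, 65281], [29, 23], [0])
b = ja3_fingerprint(771, [4865, 4866], [23, 0, 65281], [29, 23], [0])
print(a != b)  # True
```

Because ordering alone changes the hash, the signature is highly stable per client library, which is exactly what makes it useful to defenders.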
During the initial TLS handshake, the client sends a Client Hello message. The specific combination of parameters in this message—such as the TLS version, supported cipher suites, and the list and order of extensions—creates a unique signature. This signature, often hashed into a string known as a JA3 or JA4 fingerprint, can reliably identify the underlying client library used to make the request.9 For example, the JA3 fingerprint of a standard Python requests session is distinctly different from that of a Chrome browser running on Windows.

Similarly, with the widespread adoption of the HTTP/2 protocol, a new fingerprinting vector has emerged. When an HTTP/2 connection is established, the client sends a series of initial frames, including SETTINGS, WINDOW\_UPDATE, and potentially PRIORITY frames. The specific values within these frames (e.g., SETTINGS\_MAX\_CONCURRENT\_STREAMS), their order of transmission, and the ordering of pseudo-headers (like :method, :path) in the subsequent HEADERS frame vary significantly between different clients.9 A real Chrome browser has a well-defined and consistent HTTP/2 fingerprint that is difficult for non-browser libraries to replicate perfectly.9

This protocol-level analysis means that a scraper can be blocked before it even sends its first GET request for the page's HTML. The implication for a resilient scraping architecture is profound: the choice of automation tool is not just about its ability to control a browser's DOM. The tool's underlying network stack must be indistinguishable from that of a genuine, user-operated browser. This is a primary reason why standard HTTP libraries are inadequate for scraping protected targets and why browser automation frameworks like Playwright, which use the browser's own network stack, are essential. Even then, patched versions of these frameworks are often necessary to ensure that no automation-related artifacts leak at this low level.

### **1.3. The Human Element: Behavioral Biometrics and Anomaly Detection**

Beyond fingerprinting the client's software and hardware, advanced anti-bot systems increasingly analyze the user's behavior itself. They collect data on how the user interacts with the page, building a biometric profile that can distinguish the fluid, slightly imperfect motions of a human from the rigid, mathematically perfect actions of a script.12

This analysis focuses on several key areas:

* **Mouse Movements:** Human mouse movements are never perfectly linear. They follow curved paths, exhibit variations in speed (accelerating and decelerating), and contain minute, subconscious "jitters".12 A script that moves the mouse in a straight line from point A to point B is an obvious sign of automation. Detection systems track the entire mouse trajectory, analyzing its curvature, velocity, and consistency to identify non-human patterns.13
* **Click and Typing Patterns:** Humans do not click instantly. There is a small but measurable delay between the mousedown and mouseup events. Similarly, typing cadence is not uniform; the time between keystrokes varies, and humans make and correct errors.14 Scripts that execute clicks with zero delay or type with perfect, metronomic regularity are easily flagged.
* **Scrolling Behavior:** Humans scroll with varying speeds, sometimes using the mouse wheel, sometimes clicking the scrollbar, and often pausing to read content. An automated script that scrolls in perfectly uniform chunks or jumps instantly to the bottom of a page exhibits a clear non-human pattern.

These behavioral data points are fed into machine learning models trained on vast datasets of genuine user interactions.13 These models learn the statistical signatures of human behavior and can detect anomalies that signify automation. Consequently, a resilient scraper cannot just programmatically execute events like .click() and .type().
It must do so in a way that emulates the natural, noisy, and slightly inefficient patterns of a human user. This requires the implementation of algorithms that generate non-linear mouse paths, introduce randomized delays, and simulate a realistic typing rhythm.

### **1.4. The Network Barrier: IP Reputation, Rate Limiting, and CAPTCHAs**

The final layer of defense operates at the network level, focusing on the origin of the traffic and its volume. This is a particularly acute challenge for a scraper operating within a GitHub Actions environment.

* **IP Reputation:** The IP address from which a request originates is one of the most fundamental data points used for risk assessment. IP addresses associated with data centers, including cloud providers like Microsoft Azure where GitHub-hosted runners operate, are inherently treated with a high degree of suspicion. Anti-bot systems maintain extensive databases of IP addresses, and traffic from a known data center IP range is often subjected to immediate, heightened scrutiny, more frequent CAPTCHA challenges, or outright blocks. This makes the native IP address of a GitHub runner a significant liability. The only viable solution is to mask this origin by routing all traffic through a **residential proxy network**. These networks provide IP addresses assigned by Internet Service Providers (ISPs) to real homes, making the scraper's traffic appear indistinguishable from that of a legitimate user.
* **Rate Limiting:** Websites monitor the number of requests originating from a single IP address over a given time period. Exceeding a certain threshold is a classic sign of scraping and will trigger a temporary or permanent block. A resilient scraper must therefore implement dynamic rate limiting, respecting the server's limits and incorporating exponential backoff strategies when throttled.
The use of a large, rotating proxy pool is essential for distributing requests across many different IP addresses, thereby avoiding per-IP rate limits.17
* **CAPTCHAs:** CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges are the final line of defense, presented when a user's fingerprint or behavior is deemed suspicious. While third-party services exist to solve these challenges, they add cost, latency, and complexity. The primary architectural goal of a stealthy scraper should be to **avoid triggering CAPTCHAs in the first place** by successfully navigating all the preceding layers of detection. Relying on CAPTCHA solving as a primary strategy is an admission of a failed evasion architecture.

In the context of GitHub Actions, the IP reputation problem is non-negotiable. The stateless and data-center-based nature of the runners means that a robust proxy management layer is not an optional enhancement but a foundational architectural requirement. Without it, even the most perfectly fingerprinted and behaviorally human-like scraper will be blocked based on its origin alone.

## **Section 2: The Ghost in the Machine: An Arsenal of Evasion Tools**

Having dissected the mechanisms of modern anti-bot systems, this section provides a practical guide to the tools and techniques required to build a scraper that can systematically neutralize these defenses. The strategy involves selecting a powerful browser automation framework, augmenting it with specialized stealth libraries, emulating human interaction patterns, and masking its network identity.

### **2.1. Choosing the Right Engine: Playwright for Dynamic Web Applications**

For scraping modern, dynamic single-page applications (SPAs) like LinkedIn, the choice of browser automation framework is critical.
While Selenium has been a long-standing tool, Playwright, a newer framework from Microsoft, offers significant architectural advantages that make it better suited for this task.18

The primary technical justification for choosing Playwright lies in its architecture and API design. Selenium communicates with the browser driver via the JSON Wire Protocol over HTTP, which introduces latency with each command. In contrast, Playwright communicates over a persistent WebSocket connection, enabling faster and more efficient command execution.18 This speed is crucial for complex scraping tasks involving numerous interactions.

Furthermore, Playwright's API is designed with the dynamic web in mind. Its "auto-waiting" mechanism is a key feature; before performing an action like a click, Playwright automatically waits for the element to be attached to the DOM, visible, stable, and able to receive events. This eliminates a major source of flakiness common in Selenium scripts, where developers must often insert manual or explicit waits, which can be unreliable and slow down execution.20 Playwright also provides native, powerful tools for network interception, allowing the scraper to monitor, modify, or block network requests, which is invaluable for advanced scraping and evasion tasks.

The following Python script demonstrates the simplicity and power of Playwright for extracting data from a dynamically loaded page. It navigates to a LinkedIn job search page and extracts the titles and company names of the initial job listings.

```python
# File: simple_scraper.py
import asyncio
import re
from urllib.parse import urlencode

from playwright.async_api import async_playwright

async def scrape_linkedin_jobs(job_title: str, location: str):
    """
    A simple Playwright script to scrape the first page of LinkedIn job listings.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Construct the URL for the job search
        base_url = "https://www.linkedin.com/jobs/search"
        params = {
            "keywords": job_title,
            "location": location,
            "position": 1,
            "pageNum": 0
        }
        url = f"{base_url}?{urlencode(params)}"

        print(f"Navigating to: {url}")
        await page.goto(url, wait_until="domcontentloaded")

        # Wait for the job listings container to be visible
        await page.wait_for_selector('ul.jobs-search__results-list', timeout=15000)

        # Extract job listings as element handles
        job_listings = await page.query_selector_all('ul.jobs-search__results-list > li')
        print(f"Found {len(job_listings)} job listings on the first page.")

        scraped_data = []
        for job_listing in job_listings:
            try:
                title_element = await job_listing.query_selector('h3.base-search-card__title')
                company_element = await job_listing.query_selector('h4.base-search-card__subtitle')
                link_element = await job_listing.query_selector('a.base-card__full-link')

                title = await title_element.inner_text() if title_element else "N/A"
                company = await company_element.inner_text() if company_element else "N/A"
                link = await link_element.get_attribute('href') if link_element else "N/A"

                # Collapse whitespace and newlines in the extracted text
                title = re.sub(r'[\s\n]+', ' ', title).strip()
                company = re.sub(r'[\s\n]+', ' ', company).strip()

                scraped_data.append({
                    "title": title,
                    "company": company,
                    "link": link
                })
            except Exception as e:
                print(f"Error processing a job listing: {e}")

        await browser.close()
        return scraped_data

if __name__ == "__main__":
    jobs = asyncio.run(scrape_linkedin_jobs("Software Engineer", "United States"))
    for job in jobs:
        print(job)
```

This script, while functional for public pages, would be quickly detected on authenticated routes or by more advanced anti-bot systems due to its default browser fingerprint. The next step is to augment this engine with stealth capabilities.

### **2.2. The Cloak of Invisibility: A Comparative Analysis of Stealth Frameworks**

Standard Playwright, while powerful for automation, is not designed for stealth and is easily detectable. To operate undetected, it is essential to use a specialized library that patches the browser automation framework to remove or obscure the telltale signs of automation. The open-source community has produced several such libraries, each with a different approach and level of maturity.

* **playwright-extra with puppeteer-extra-plugin-stealth**: This combination brings the well-established evasion modules from the Puppeteer ecosystem to Playwright.21 playwright-extra acts as a wrapper around the standard Playwright library, enabling a plugin architecture. The puppeteer-extra-plugin-stealth is a collection of individual evasion scripts that target specific detection vectors, such as masking navigator.webdriver, spoofing WebGL vendor information, and normalizing browser properties to match a real user's browser.23 This modular approach is powerful but can introduce compatibility risks, as the plugins are primarily developed for Puppeteer.22
* **undetected-playwright**: This Python library is a direct port of the popular undetected-chromedriver project's concepts to the Playwright framework.33 It functions by patching the Playwright browser instance upon launch to remove common automation signatures.
It is designed to be a simple, drop-in replacement that requires minimal configuration changes to an existing Playwright script.33
* **patchright-python**: This is a more recent and actively maintained drop-in replacement for Playwright that focuses on patching lower-level detection vectors that some other stealth libraries may miss.35 Specifically, it addresses leaks related to the Chrome DevTools Protocol (CDP) itself, such as the use of Runtime.enable and Console.enable, which can be detected by sophisticated anti-bot systems.35 Its focus on these more fundamental leaks represents a more modern approach to evasion, reflecting the continuous evolution of bot detection techniques.

The progression from early stealth plugins focused on high-level JavaScript properties to newer libraries targeting low-level CDP interactions demonstrates the ongoing arms race. An effective long-term strategy requires an understanding of these underlying mechanisms. A developer should not treat these libraries as a "magic bullet" but as tools to be selected based on the current state of detection technology. For the most challenging targets, a library like patchright-python that addresses the deepest layers of detection is likely the most resilient choice.

The following table provides a comparative analysis to aid in selecting the appropriate framework.
| Feature | playwright-extra \+ stealth | undetected-playwright | patchright-python |
| :---- | :---- | :---- | :---- |
| **Primary Language** | JavaScript/Node.js | Python | Python |
| **Maintenance Status** | Actively maintained (core plugin) | Less frequent updates | Actively maintained |
| **Evasion Method** | Runtime JavaScript patching | Runtime JavaScript patching | Low-level CDP patching & JS patching |
| **Key Patches** | navigator.webdriver, WebGL vendor, plugins, codecs, permissions | navigator.webdriver, various browser properties | Runtime.enable leak, Console.enable leak, navigator.webdriver, command flags |
| **Cross-Browser Support** | Primarily Chromium | Chromium | Chromium |
| **Community Activity** | High (via Puppeteer ecosystem) | Moderate | Growing |
| **Ease of Use** | Simple setup | Simple drop-in | Simple drop-in |

### **2.3. Simulating Humanity: Advanced Interaction Emulation**

To defeat behavioral biometric analysis, a scraper must perform actions in a way that is statistically indistinguishable from a human. This involves moving beyond the default, instantaneous methods like .click() and .type() and implementing functions that introduce natural-looking variability and imperfection.

#### **Mouse Traversal**

A script's mouse movements are a primary target for behavioral analysis. A straight, constant-speed path from one point to another is a definitive signature of a bot. To counter this, mouse movements must be non-linear and exhibit variable speed.

* **Bézier Curves:** This mathematical technique is ideal for generating smooth, curved paths that mimic the natural arc of a human's hand moving a mouse. A quadratic or cubic Bézier curve can be defined with a start point, an end point, and one or two control points.
By randomizing the position of the control points, an infinite variety of natural-looking curves can be generated for each movement.
* **Perlin Noise:** While Bézier curves create a smooth path, human movements are not perfectly smooth; they contain tiny, subconscious corrections and "jitters." Perlin noise, a type of gradient noise used in computer graphics to generate natural-looking textures, can be applied to the mouse path to simulate this organic imperfection.37 By adding small, Perlin noise-generated offsets to the coordinates along the Bézier curve, the final path becomes less mathematically perfect and more believably human. The Python library OxyMouse provides a ready-to-use implementation of both Bézier and Perlin noise algorithms for mouse movement generation.43

#### **Keyboard Dynamics**

Similarly, text input must not be instantaneous. A human types at a variable speed, with pauses between words and even characters. This can be simulated with a custom typing function that iterates through a string and types it character by character, with a small, randomized delay between each keystroke.

The following Python code provides a utility class, HumanEmulator, that encapsulates these techniques for use with Playwright:

```python
# File: human_emulator.py
import asyncio
import random

class HumanEmulator:
    """
    A class to provide human-like interaction emulation for Playwright.
    """

    @staticmethod
    async def human_like_typing(page, selector: str, text: str):
        """
        Types text into an element with human-like delays.
        """
        await page.click(selector)
        for char in text:
            await page.keyboard.type(char)
            # 50ms to 200ms delay between characters
            await asyncio.sleep(random.uniform(0.05, 0.2))

    @staticmethod
    async def bezier_mouse_move(page, start_x, start_y, end_x, end_y, duration_ms=1000):
        """
        Moves the mouse along a randomized Bézier curve.
        """
        # Control point randomization
        control_1_x = start_x + random.uniform(-50, 50) + (end_x - start_x) * 0.25
        control_1_y = start_y + random.uniform(-50, 50) + (end_y - start_y) * 0.25
        control_2_x = start_x + random.uniform(-50, 50) + (end_x - start_x) * 0.75
        control_2_y = start_y + random.uniform(-50, 50) + (end_y - start_y) * 0.75

        points = []
        num_points = int(duration_ms / 20)  # A point every ~20ms

        for i in range(num_points + 1):
            t = i / num_points
            # Cubic Bézier curve formula
            x = (1 - t)**3 * start_x + 3 * (1 - t)**2 * t * control_1_x + 3 * (1 - t) * t**2 * control_2_x + t**3 * end_x
            y = (1 - t)**3 * start_y + 3 * (1 - t)**2 * t * control_1_y + 3 * (1 - t) * t**2 * control_2_y + t**3 * end_y
            points.append((x, y))

        # Introduce Perlin-like noise/jitter by adding small random offsets
        noisy_points = []
        for x, y in points:
            jitter_x = random.uniform(-2, 2)
            jitter_y = random.uniform(-2, 2)
            noisy_points.append((x + jitter_x, y + jitter_y))

        # Move the mouse through the points
        for x, y in noisy_points:
            await page.mouse.move(x, y)
            await page.wait_for_timeout(random.uniform(15, 25))  # Small delay between moves

        await page.mouse.move(end_x, end_y)  # Ensure it ends at the exact point

    @staticmethod
    async def move_and_click(page, selector: str, duration_ms=1000):
        """
        Moves to an element with a human-like path and then clicks it.
        """
        element = page.locator(selector)
        await element.wait_for(state="visible")
        box = await element.bounding_box()

        if not box:
            raise Exception(f"Element with selector '{selector}' not found or not visible.")

        # Get current mouse position (approximate).
        # In a real scenario, you might track this; for simplicity, start from a random point.
        start_pos = await page.evaluate("() => ({ x: Math.random() * window.innerWidth, y: Math.random() * window.innerHeight })")

        # Target a random point within the element's bounding box
        target_x = box['x'] + random.uniform(0.2, 0.8) * box['width']
        target_y = box['y'] + random.uniform(0.2, 0.8) * box['height']

        await HumanEmulator.bezier_mouse_move(page, start_pos['x'], start_pos['y'], target_x, target_y, duration_ms)

        # Human-like click delay
        await page.mouse.down()
        await page.wait_for_timeout(random.uniform(50, 150))
        await page.mouse.up()
```

### **2.4. The Network Mask: Strategic Proxy Management**

As established in Section 1.4, the use of a high-quality proxy network is a mandatory component for any scraper running within the GitHub Actions environment. The goal is to mask the data center IP of the runner and present an IP address that is indistinguishable from a real residential user.

The architecture requires a subscription to a reputable **rotating residential proxy provider**. These services provide access to a large pool of IP addresses from real consumer devices around the world. Key features to look for in a provider are a large pool size, extensive geographic targeting options, and support for "sticky sessions," which allow a scraper to maintain the same IP address for the duration of a multi-step task, such as a login and subsequent data extraction.

Configuring Playwright to use a proxy is straightforward.
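Sticky sessions are commonly selected by embedding a session identifier in the proxy username. The exact scheme varies by provider, so the `-session-` username format and hostname below are hypothetical placeholders; a minimal sketch:

```python
# Sketch: pinning a rotating residential proxy to one "sticky" session.
# The "user-session-id" username format is HYPOTHETICAL; consult your
# provider's documentation for the real scheme.
import secrets
from typing import Optional

def sticky_proxy_config(base_user: str, password: str, host: str, port: int,
                        session_id: Optional[str] = None) -> dict:
    """Build a Playwright-style proxy dict pinned to one upstream session."""
    # Reusing the same session_id keeps the same exit IP across steps
    # (e.g., login followed by extraction); a fresh id rotates the IP.
    session_id = session_id or secrets.token_hex(4)
    return {
        "server": f"http://{host}:{port}",
        "username": f"{base_user}-session-{session_id}",  # hypothetical format
        "password": password,
    }

# Same session id for login and scraping, so both use the same exit IP.
login_proxy = sticky_proxy_config("user123", "pw", "proxy.example.com", 8000, session_id="job42")
scrape_proxy = sticky_proxy_config("user123", "pw", "proxy.example.com", 8000, session_id="job42")
print(login_proxy["username"] == scrape_proxy["username"])  # True
```

The resulting dict has the same shape as the proxy configuration passed to Playwright's launch call in the next snippet.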
The browser.launch() method accepts a proxy parameter. To manage credentials securely, the proxy URL, including username and password, should never be hardcoded. Instead, it should be passed to the GitHub Actions workflow as a secret and then exposed to the Python script as an environment variable.

The following code demonstrates a secure method for configuring a proxy in a Playwright script, assuming the proxy URL is available in an environment variable named PROXY\_URL.

```python
# File: browser_setup.py
import os
from urllib.parse import urlparse

from playwright.async_api import Browser, Playwright
from dotenv import load_dotenv

# Load environment variables from a .env file for local development
load_dotenv()

async def get_configured_browser(playwright: Playwright) -> Browser:
    """
    Launches a Chromium browser instance configured with a proxy
    from environment variables.
    """
    proxy_url = os.environ.get("PROXY_URL")
    proxy_config = None

    if proxy_url:
        try:
            # Standard proxy format: http://username:password@host:port
            parsed = urlparse(proxy_url)
            proxy_config = {
                "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
                "username": parsed.username,
                "password": parsed.password,
            }
            print("Proxy configured successfully.")
        except Exception as e:
            print(f"Warning: Could not parse PROXY_URL. Proceeding without proxy. Error: {e}")
    else:
        print("Warning: PROXY_URL environment variable not set. Proceeding without proxy.")

    # Launch the browser with the proxy configuration (None disables it)
    browser = await playwright.chromium.launch(
        headless=True,
        proxy=proxy_config
    )
    return browser
```

This modular approach ensures that the core scraping logic remains decoupled from the network configuration, allowing for easy updates to proxy credentials without modifying the scraper code itself.

## **Section 3: The Automated Scraper: A Resilient Architectural Blueprint**

With the necessary evasion tools identified, this section outlines the architecture of the Python application itself. The design prioritizes modularity, robustness, and a novel approach to session management tailored to the stateless nature of the GitHub Actions environment.

### **3.1. Core Scraper Logic (Python with Playwright)**

The Python application should be structured into distinct modules to separate concerns, enhancing maintainability and testability. A recommended structure includes:

* config.py: Stores constants and configuration values, such as target URLs, search parameters, and selectors.
* human\_emulator.py: Contains the HumanEmulator class developed in Section 2.3 for simulating user interactions.
* scraper.py: Houses the main scraping class or functions responsible for orchestrating the browser, navigating pages, and extracting data.
* main.py: The entry point for the script, which parses command-line arguments (e.g., from the GitHub Actions matrix) and initiates the scraping process.

The scraper.py module will contain the core logic. This includes functions to handle the entire scraping lifecycle for a given set of search terms:

1. **Initialization:** A function to launch the patched Playwright browser, configure it with proxy settings from environment variables, and, most importantly, load the persisted session state.
2. **Search Execution:** A function that takes search keywords and a location, navigates to the LinkedIn jobs search page, and uses the human emulation methods to input the search terms and submit the form.
3. **Pagination and Scrolling:** A robust loop to handle pagination. For sites like LinkedIn that use infinite scroll, this involves repeatedly scrolling to the bottom of the results list and waiting for new content to be dynamically loaded via AJAX requests. The scraper must monitor for a "no more results" indicator to terminate the loop gracefully.
4. **Data Extraction:** A function to iterate over the located job listing elements. For each listing, it extracts key data points such as job title, company name, location, and the URL to the full job description. This process should be wrapped in error handling to prevent a single malformed listing from halting the entire scrape.
5. **Data Storage:** After collecting the data, a function saves it to a structured format like JSON or CSV in a designated output directory.

### **3.2. State and Session Management**

The single greatest architectural challenge when scraping an authenticated site within a stateless environment like GitHub Actions is managing the login session. Runners are ephemeral; any cookies or local storage generated during a run are destroyed when the job completes.
Attempting to perform a full username/password login on every scheduled run is a highly aggressive and unnatural pattern that will inevitably trigger security alerts, CAPTCHAs, and account locks.

The solution is an architecture that decouples the high-risk login action from the low-risk, routine scraping action. This is achieved by persisting the browser's authentication state between workflow runs using GitHub Secrets. Playwright facilitates this by allowing the entire state of a browser context—including cookies, localStorage, and sessionStorage—to be saved to and loaded from a file.

The proposed architecture consists of two distinct GitHub Actions workflows:

1. **login-and-save-state.yml (Manual Workflow):**
   * **Trigger:** This workflow is triggered manually via workflow_dispatch. It is run only when a new session needs to be established (e.g., initially, or if the previous session expires).
   * **Process:**
     * It launches a **headed** Playwright browser within the GitHub Actions runner (using xvfb to provide a virtual display, as runners are headless by default).
     * It navigates to the LinkedIn login page.
     * Crucially, it pauses and waits for the user to solve any CAPTCHA challenges and perform the multi-factor authentication (MFA) required during login. This manual intervention is necessary for the initial secure login.
     * Once logged in, it saves the browser context's state to a session.json file using context.storage_state(path="session.json").
     * It then encrypts this session.json file using a strong encryption key (e.g., GPG), which is itself stored as a GitHub Secret (GPG_PASSPHRASE).
     * The encrypted session data is then stored as a new, separate GitHub Secret (SESSION_STATE).
2. **scrape-jobs.yml (Scheduled Workflow):**
   * **Trigger:** This workflow runs on a schedule (e.g., daily).
   * **Process:**
     * It retrieves the encrypted SESSION_STATE and the GPG_PASSPHRASE from GitHub Secrets.
     * It decrypts the session data back into a session.json file.
     * It launches a headless Playwright browser and creates a new context by loading the state from the decrypted file: browser.new_context(storage_state="session.json").
     * This new context is now fully authenticated. The scraper can navigate directly to internal pages and perform its tasks without needing to interact with the login form.

This two-workflow approach provides immense benefits. The high-risk, CAPTCHA-prone login process is performed infrequently and with human assistance, while the frequent, automated scraping jobs run in a stealthy, pre-authenticated state. This dramatically reduces the risk of detection and increases the overall resilience of the engine.

### **3.3. Robustness and Reliability**

A production-grade scraper must be resilient to the inherent unpredictability of the web. Network connections can fail, websites can change their layout, and individual elements may not load as expected. The application logic must anticipate and handle these failures gracefully.

* **Retry Mechanisms:** All network operations (e.g., page.goto()) and critical element interactions (e.g., page.locator().click()) should be wrapped in a retry loop. An effective strategy is to implement exponential backoff, where the delay between retries increases after each failure (e.g., wait 2s, then 4s, then 8s). This prevents overwhelming a temporarily struggling server.
* **Error Handling:** Every data extraction step should be enclosed in a try-except block. If a specific piece of data, like a job's salary, is not found on one listing, the scraper should log the error, record the field as null, and continue processing the rest of that listing and subsequent listings. A single missing element should never cause the entire scraping job to crash.
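This log-and-continue pattern can be sketched as follows. The sketch is illustrative rather than part of the playbook's codebase: the StubListing class and its find_text method are invented stand-ins for a Playwright locator scope (so the pattern is runnable without a browser), and the selectors are examples only.

```python
# Illustrative sketch of per-field error handling during extraction.
# StubListing fakes a Playwright locator scope so this runs without a browser.
class StubListing:
    def __init__(self, fields):
        self._fields = fields

    def find_text(self, selector):
        if selector not in self._fields:
            raise LookupError(f"element not found: {selector}")
        return self._fields[selector]


# Example selectors only; real ones depend on the target site's markup.
FIELD_SELECTORS = {
    "title": "h3.base-search-card__title",
    "company": "h4.base-search-card__subtitle",
    "salary": "span.salary",  # frequently absent from listings
}


def extract_listing(listing):
    """Extract every field, recording None and moving on when one is missing."""
    record = {}
    for field, selector in FIELD_SELECTORS.items():
        try:
            record[field] = listing.find_text(selector).strip()
        except Exception:
            record[field] = None  # log-and-continue; never abort the whole run
    return record
```

A listing that lacks a salary element thus yields a record with "salary" set to None instead of raising and killing the job.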
* **Timeouts:** Playwright's default timeouts should be adjusted based on the expected performance of the target site and the proxy network. Setting an aggressive global timeout can lead to premature failures, while an overly generous timeout can cause jobs to hang indefinitely. A reasonable job-level timeout should also be configured in the GitHub Actions workflow itself (e.g., timeout-minutes: 60) to prevent runaway jobs from consuming excessive resources.

### **3.4. Data Persistence Strategy: Git Commit vs. Artifacts**

Once the data is scraped, it must be saved. Within GitHub Actions, there are two primary methods for persisting data generated during a workflow run.

* **Method 1: Committing to Git:** In this approach, the workflow includes a final step that uses a pre-built action (e.g., stefanzweifel/git-auto-commit-action or EndBug/add-and-commit) to commit the newly generated data file directly back to the repository. This creates a version-controlled, historical dataset that is easily accessible and can trigger downstream processes. The main drawback is the potential for a "noisy" commit history, with frequent, automated commits.
* **Method 2: Using Workflow Artifacts:** This method uses the actions/upload-artifact action to store the data file as an artifact associated with the workflow run. This keeps the Git history clean and is ideal for temporary data like logs or reports. However, artifacts expire by default and are not suitable for persisting state *between* different workflow runs, as a new run cannot easily access artifacts from a previous one.

For the purpose of creating a persistent dataset of job postings, the **Git commit method is recommended**. For temporary, diagnostic data like trace files and screenshots generated upon failure, **artifacts are the superior choice**.

## **Section 4: The Factory Floor: Orchestration with GitHub Actions**

This section translates the architectural design into a concrete implementation, providing the complete YAML configuration for the GitHub Actions workflow. The workflow is designed for automation, scalability, efficiency, and security.

### **4.1. The Workflow File (.github/workflows/scraper.yml)**

The heart of the automation is the workflow file. It defines the triggers, permissions, jobs, and steps that constitute the scraping pipeline.
```yaml
# File: .github/workflows/scraper.yml
name: LinkedIn Job Scraper

on:
  # Schedule the workflow to run every day at midnight UTC
  schedule:
    - cron: '0 0 * * *'
  # Allow manual triggering from the GitHub Actions UI
  workflow_dispatch:
    inputs:
      job_title:
        description: 'Job Title to search for'
        required: true
        default: 'Software Engineer'
      location:
        description: 'Location to search in'
        required: true
        default: 'United States'

# Set default permissions for the GITHUB_TOKEN for security
permissions:
  contents: write  # Required to commit data back to the repository
  issues: write    # Required to create issues on failure

jobs:
  scrape:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # Allow other matrix jobs to continue if one fails
      matrix:
        # Define a matrix to run scrapers for different roles/locations in parallel.
        # For workflow_dispatch, these will be overridden by inputs.
        job_config:
          - { title: 'Software Engineer', location: 'United States' }
          - { title: 'Data Scientist', location: 'United States' }
          - { title: 'Product Manager', location: 'Canada' }

    # Set a timeout for each job to prevent runaways
    timeout-minutes: 60

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Cache Python dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright
          key: ${{ runner.os }}-playwright-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-playwright-

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Install Playwright browsers and dependencies
        run: python -m playwright install --with-deps chromium

      - name: Run Scraper
        id: run_scraper
        env:
          # Securely pass secrets to the Python script
          LINKEDIN_EMAIL: ${{ secrets.LINKEDIN_EMAIL }}
          LINKEDIN_PASSWORD: ${{ secrets.LINKEDIN_PASSWORD }}
          PROXY_URL: ${{ secrets.PROXY_URL }}
          GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
          SESSION_STATE_ENCRYPTED: ${{ secrets.SESSION_STATE }}
        run: |
          # Use workflow_dispatch inputs if available, otherwise use matrix values
          JOB_TITLE="${{ github.event.inputs.job_title || matrix.job_config.title }}"
          LOCATION="${{ github.event.inputs.location || matrix.job_config.location }}"
          python main.py --job-title "$JOB_TITLE" --location "$LOCATION"

      - name: Commit scraped data
        if: success()
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore: Update scraped job data"
          file_pattern: "data/*.json"

      - name: Upload Trace on Failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-trace-${{ matrix.job_config.title }}-${{ matrix.job_config.location }}
          path: trace.zip
          retention-days: 7

      - name: Create Issue on Failure
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE.md
          assignees: ${{ github.actor }}
          update_existing: true
          search_existing: open
```

This workflow incorporates several best practices.
It is triggered both on a schedule (cron) and manually (workflow_dispatch), providing flexibility for automated runs and on-demand execution. It also explicitly sets the permissions for the GITHUB_TOKEN to the minimum required, adhering to the principle of least privilege.

### **4.2. Environment and Dependency Optimization**

Efficiency is paramount in a CI/CD environment to minimize runtime and cost. The two most time-consuming steps in a scraping workflow are typically dependency installation and browser binary downloads. GitHub Actions provides a caching mechanism to persist these between runs.

* **Caching Python Dependencies:** The actions/cache action is used to store the pip cache directory. The cache key is composed of the runner's operating system and a hash of the requirements.txt file. This ensures that the cache is invalidated and rebuilt only when the dependencies change, saving significant time on subsequent runs.
* **Caching Playwright Browsers:** Similarly, the browser binaries downloaded by Playwright can be cached. The cache path is ~/.cache/ms-playwright.
  Caching these binaries, which can total several hundred megabytes, is a critical optimization that can reduce job setup time by minutes. It is important to note that while the official Playwright documentation discourages caching due to potential staleness issues, for a controlled scraping environment where the browser version is pinned, the performance benefits are substantial and generally outweigh the risks.

The workflow also uses python -m playwright install --with-deps chromium to install not only the Chromium browser but also all of its necessary operating-system dependencies, ensuring the environment is correctly configured on the ubuntu-latest runner.

### **4.3. Secure Operations: Managing Credentials and Secrets**

Hardcoding sensitive information like login credentials or API keys into workflow files is a severe security vulnerability. GitHub Actions provides a secure storage mechanism called "Secrets" for this purpose. Secrets are encrypted environment variables that are only exposed to the specific workflow run.

For this playbook, the following secrets must be created in the repository settings (Settings > Secrets and variables > Actions):

* LINKEDIN_EMAIL: The email address for the LinkedIn account.
* LINKEDIN_PASSWORD: The password for the LinkedIn account.
* PROXY_URL: The full connection string for the residential proxy service.
* GPG_PASSPHRASE: The passphrase used to encrypt and decrypt the session state file.
* SESSION_STATE: The GPG-encrypted, Base64-encoded session state JSON.
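Because a missing secret would otherwise surface only as an opaque failure partway through a run, it is worth validating the environment before any browser is launched. The following is a minimal, stdlib-only sketch of such a check — a hypothetical helper, not part of the playbook's codebase. Note that the SESSION_STATE secret reaches the script under the environment variable name SESSION_STATE_ENCRYPTED.

```python
import os

# Environment variable names the scraper expects. The SESSION_STATE secret is
# mapped to SESSION_STATE_ENCRYPTED in the workflow's env block.
REQUIRED_VARS = [
    "LINKEDIN_EMAIL",
    "LINKEDIN_PASSWORD",
    "PROXY_URL",
    "GPG_PASSPHRASE",
    "SESSION_STATE_ENCRYPTED",
]


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

The entry point can call missing_vars() once at startup and raise SystemExit listing the offending names, failing fast with an actionable message instead of crashing mid-scrape.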
These secrets are then passed to the Run Scraper step via the env block, making them available as environment variables within the Python script. The Python script should be designed to read these values from os.environ.

### **4.4. Scaling and Parallelism: Leveraging the Matrix Strategy**

To scrape a large volume of data efficiently, the scraping tasks must be parallelized. GitHub Actions' strategy: matrix feature is the ideal tool for this. A matrix allows a single job definition to be expanded into multiple parallel jobs, each with a different set of input variables.

In the example scraper.yml workflow, the matrix is defined under jobs.scrape.strategy.matrix.job_config. This creates three parallel jobs, each with a different combination of title and location. The Python script is then designed to read these variables from the matrix context (matrix.job_config.title and matrix.job_config.location) to perform its targeted search. This approach allows the scraping of multiple job categories and locations simultaneously, drastically reducing the total time required to complete the entire scraping run. The fail-fast: false setting ensures that the failure of one job in the matrix (e.g., a scrape for "Product Manager" fails) does not automatically cancel the other in-progress jobs (e.g., "Software Engineer" and "Data Scientist"), maximizing data collection even in the face of partial failures.

### **4.5. Monitoring and Debugging**

When a scraper fails in an automated, headless environment, debugging can be challenging. A robust monitoring and debugging strategy is essential for maintaining the long-term health of the scraping engine.

* **Trace on Failure:** The most powerful debugging tool Playwright offers is the Trace Viewer.
  By configuring trace: 'on-first-retry' in the Playwright config, a detailed trace file (trace.zip) is generated for any test that fails and is retried. This trace includes a screencast of the execution, a live DOM snapshot for each action, console logs, and network requests, providing a complete picture of what went wrong.
* **Conditional Artifact Upload:** The workflow is configured to upload this trace.zip file as a workflow artifact, but only when the run_scraper step fails. This is achieved using the if: failure() condition on the actions/upload-artifact step. This ensures that artifacts are only generated when they are needed for debugging, keeping successful runs clean.
* **Programmatic Issue Creation:** For critical, unrecoverable errors, the workflow can automatically create a GitHub Issue. The JasonEtco/create-an-issue action is used, again with an if: failure() condition. It can be configured with a template (.github/ISSUE_TEMPLATE.md) to pre-populate the issue with details from the workflow run, such as the run ID, the failed job, and a link to the logs. This creates a formal, trackable record of the failure, ensuring that it is investigated and resolved.

## **Section 5: The Complete Codebase and Final Recommendations**

This final section provides the complete, production-ready code for the Python scraper and the GitHub Actions workflow, followed by essential ethical and legal considerations for conducting web scraping activities.

### **5.1. Fully Commented Python Scraper Code**

The following is the fully integrated Python script (main.py), designed to be executed by the GitHub Actions workflow.
It incorporates the architectural principles of modularity, secure credential handling, state management, robust error handling, and human behavior emulation.

```python
# File: main.py

import argparse
import asyncio
import base64
import json
import os
import random
import re
from urllib.parse import quote_plus

from playwright.async_api import async_playwright, Page, BrowserContext, Playwright
from dotenv import load_dotenv
import gnupg  # Requires the python-gnupg library and GnuPG installed on the runner

# --- Configuration ---
# In a real project, this might live in a separate config.py
BASE_URL = "https://www.linkedin.com/jobs/search"
OUTPUT_DIR = "data"
SESSION_FILE = "session.json"
ENCRYPTED_SESSION_FILE = "session.json.gpg"

# --- Human Behavior Emulation ---
class HumanEmulator:
    @staticmethod
    async def human_like_typing(page: Page, selector: str, text: str):
        await page.locator(selector).click()
        for char in text:
            delay = random.uniform(0.08, 0.25)  # 80ms to 250ms delay
            await page.keyboard.type(char, delay=delay * 1000)
            await asyncio.sleep(delay / 2)

    @staticmethod
    async def bezier_mouse_move(page: Page, start_x, start_y, end_x, end_y, duration_ms=800):
        control_1_x = start_x + (end_x - start_x) * 0.25 + random.uniform(-75, 75)
        control_1_y = start_y + (end_y - start_y) * 0.25 + random.uniform(-75, 75)
        control_2_x = start_x + (end_x - start_x) * 0.75 + random.uniform(-75, 75)
        control_2_y = start_y + (end_y - start_y) * 0.75 + random.uniform(-75, 75)

        num_points = int(duration_ms / 20)
        points = []
        for i in range(num_points + 1):
            t = i / num_points
            x = (1-t)**3*start_x + 3*(1-t)**2*t*control_1_x + 3*(1-t)*t**2*control_2_x + t**3*end_x
            y = (1-t)**3*start_y + 3*(1-t)**2*t*control_1_y + 3*(1-t)*t**2*control_2_y + t**3*end_y
            jitter_x = random.uniform(-1.5, 1.5)
            jitter_y = random.uniform(-1.5, 1.5)
            points.append((x + jitter_x, y + jitter_y))

        for x, y in points:
            await page.mouse.move(x, y)
            await asyncio.sleep(random.uniform(0.015, 0.025))
        await page.mouse.move(end_x, end_y)

    @staticmethod
    async def move_and_click(page: Page, selector: str, duration_ms=800):
        element = page.locator(selector)
        await element.wait_for(state="visible", timeout=10000)
        box = await element.bounding_box()
        if not box:
            raise Exception(f"Element '{selector}' not found.")

        start_pos = await page.evaluate("() => ({ x: Math.random() * 500, y: Math.random() * 500 })")
        target_x = box['x'] + random.uniform(0.3, 0.7) * box['width']
        target_y = box['y'] + random.uniform(0.3, 0.7) * box['height']

        await HumanEmulator.bezier_mouse_move(page, start_pos['x'], start_pos['y'], target_x, target_y, duration_ms)
        await page.mouse.down()
        await asyncio.sleep(random.uniform(0.06, 0.18))
        await page.mouse.up()

# --- Scraper Class ---
class LinkedInScraper:
    def __init__(self, job_title: str, location: str):
        self.job_title = job_title
        self.location = location
        self.playwright: Playwright = None
        self.browser = None
        self.context: BrowserContext = None
        self.page: Page = None

    async def setup(self):
        self.playwright = await async_playwright().start()
        proxy_url = os.environ.get("PROXY_URL")
        proxy_config = None
        if proxy_url:
            # Assuming the format http://user:pass@host:port
            parts = re.match(r"http://(.*?):(.*?)@(.*?):(\d+)", proxy_url)
            if parts:
                proxy_config = {
                    "server": f"http://{parts.group(3)}:{parts.group(4)}",
                    "username": parts.group(1),
                    "password": parts.group(2)
                }

        self.browser = await self.playwright.chromium.launch(
            headless=True,
            proxy=proxy_config,
            args=["--disable-blink-features=AutomationControlled"]
        )

        await self._load_session_state()

    async def _load_session_state(self):
        encrypted_state_b64 = os.environ.get("SESSION_STATE_ENCRYPTED")
        passphrase = os.environ.get("GPG_PASSPHRASE")

        if not encrypted_state_b64 or not passphrase:
            print("Session state or passphrase not found in environment variables. "
                  "Cannot proceed with authenticated scraping.")
            # Fall back to a new context, which will be unauthenticated
            self.context = await self.browser.new_context()
            self.page = await self.context.new_page()
            return

        try:
            gpg = gnupg.GPG()
            with open(ENCRYPTED_SESSION_FILE, "wb") as f:
                f.write(base64.b64decode(encrypted_state_b64))

            with open(ENCRYPTED_SESSION_FILE, "rb") as f:
                decrypted_data = gpg.decrypt_file(f, passphrase=passphrase)
            if not decrypted_data.ok:
                raise Exception(f"GPG decryption failed: {decrypted_data.stderr}")

            with open(SESSION_FILE, "w") as sf:
                sf.write(str(decrypted_data))

            self.context = await self.browser.new_context(storage_state=SESSION_FILE)
            self.page = await self.context.new_page()
            print("Successfully loaded and decrypted session state.")

        except Exception as e:
            print(f"Error loading session state: {e}. Proceeding with a new context.")
            self.context = await self.browser.new_context()
            self.page = await self.context.new_page()

    async def scrape(self):
        url = f"{BASE_URL}?keywords={quote_plus(self.job_title)}&location={quote_plus(self.location)}"
        await self.page.goto(url, wait_until="networkidle", timeout=60000)

        # Simple check to see if we landed on a login page (meaning the session failed)
        if "login" in self.page.url.lower():
            print("Redirected to login page. Session state may be invalid.")
            # In a real scenario, you might want to trigger an alert here.
            return []

        # Handle infinite scroll
        last_height = await self.page.evaluate("document.body.scrollHeight")
        while True:
            await self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(2, 4))  # Wait for new content to load
            new_height = await self.page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

        # Extract data
        job_elements = await self.page.locator('ul.jobs-search__results-list > li').all()
        results = []
        for job in job_elements:
            try:
                title = await job.locator('h3.base-search-card__title').inner_text()
                company = await job.locator('h4.base-search-card__subtitle').inner_text()
                location = await job.locator('span.job-search-card__location').inner_text()
                link = await job.locator('a.base-card__full-link').get_attribute('href')
                results.append({
                    "title": title.strip(),
                    "company": company.strip(),
                    "location": location.strip(),
                    "link": link
                })
            except Exception:
                continue  # Skip a listing if an element is missing

        return results

    def save_data(self, data):
        if not os.path.exists(OUTPUT_DIR):
            os.makedirs(OUTPUT_DIR)

        filename = f"{self.job_title.replace(' ', '_')}_{self.location.replace(' ', '_')}.json"
        filepath = os.path.join(OUTPUT_DIR, filename)

        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4, ensure_ascii=False)
        print(f"Data saved to {filepath}")

    async def teardown(self):
        if self.browser:
            await self.browser.close()
        if self.playwright:
            await self.playwright.stop()

async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-title", required=True)
    parser.add_argument("--location", required=True)
    args = parser.parse_args()

    scraper = LinkedInScraper(job_title=args.job_title, location=args.location)
    try:
        await scraper.setup()
        data = await scraper.scrape()
        if data:
            scraper.save_data(data)
    finally:
        await scraper.teardown()

if __name__ == "__main__":
    load_dotenv()  # For local testing
    asyncio.run(main())
```

### **5.2. The Final workflow.yml**

This is the complete, annotated GitHub Actions workflow file that orchestrates the entire process. It should be placed in the repository at .github/workflows/scraper.yml.

```yaml
# File: .github/workflows/scraper.yml
name: LinkedIn Job Scraper

on:
  schedule:
    - cron: '0 3 * * *'  # Runs every day at 3 AM UTC
  workflow_dispatch:
    inputs:
      job_title:
        description: 'Job Title to search for'
        required: true
        default: 'Data Engineer'
      location:
        description: 'Location to search in'
        required: true
        default: 'Remote'

permissions:
  contents: write
  issues: write

jobs:
  scrape-and-commit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_config:
          - { title: 'Software Engineer', location: 'United States' }
          - { title: 'Data Scientist', location: 'United States' }
          - { title: 'Product Manager', location: 'Canada' }
          - { title: 'DevOps Engineer', location: 'United Kingdom' }

    timeout-minutes: 60

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright
          key: ${{ runner.os }}-playwright-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-playwright-

      - name: Install system dependencies for GnuPG
        run: sudo apt-get update && sudo apt-get install -y gnupg

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Install Playwright browsers and OS dependencies
        run: python -m playwright install --with-deps chromium

      - name: Run Python Scraper
        id: scraper_run
        env:
          LINKEDIN_EMAIL: ${{ secrets.LINKEDIN_EMAIL }}
          LINKEDIN_PASSWORD: ${{ secrets.LINKEDIN_PASSWORD }}
          PROXY_URL: ${{ secrets.PROXY_URL }}
          GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
          SESSION_STATE_ENCRYPTED: ${{ secrets.SESSION_STATE }}
        run: |
          JOB_TITLE="${{ github.event.inputs.job_title || matrix.job_config.title }}"
          LOCATION="${{ github.event.inputs.location || matrix.job_config.location }}"
          python main.py --job-title "$JOB_TITLE" --location "$LOCATION"

      - name: Commit and push if changed
        if: success()
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "ci: Automated job data update for ${{ matrix.job_config.title }}"
          file_pattern: "data/*.json"
          commit_user_name: "GitHub Actions Bot"
          commit_user_email: "github-actions[bot]@users.noreply.github.com"
          commit_author: "GitHub Actions Bot <github-actions[bot]@users.noreply.github.com>"

      - name: Upload Trace on Failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-trace-${{ matrix.job_config.title }}-${{ matrix.job_config.location }}
          path: trace.zip
          retention-days: 5

      - name: Create Issue on Failure
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE.md
          assignees: ${{ github.actor }}
```
update\_existing: true 929 search\_existing: open 930 title: "Scraping job failed for: ${{ matrix.job\_config.title }} in ${{ matrix.job\_config.location }}" 931 932 ### **5.3. Ethical and Legal Considerations** 933 934 While this playbook provides the technical means to perform advanced web scraping, it is imperative that these tools are used responsibly. Developers and organizations must be cognizant of the ethical and legal landscape surrounding data extraction\]. 935 936 * **Terms of Service (ToS):** Most websites, including LinkedIn, have terms of service that explicitly prohibit or restrict automated data collection. While ToS are not always legally enforceable in the same way as laws, violating them can lead to account suspension and legal action from the platform owner. It is crucial to review and understand the ToS of any target website. 937 * **Rate Limiting and Server Load:** A core tenet of responsible scraping is to "be a good web citizen." This means implementing conservative rate limits and backoff strategies to avoid overwhelming the target server's resources. An overly aggressive scraper can degrade the service for legitimate users and is more likely to be detected and blocked. 938 * **Data Privacy (GDPR, CCPA):** Scraping and storing personal data is subject to strict data protection laws like the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA). These regulations impose stringent requirements on the collection, processing, and storage of personally identifiable information (PII). Any scraping project involving personal data must have a clear legal basis for processing and must implement robust data security and privacy measures. 939 * **Copyright:** The data on websites may be protected by copyright. Scraping and republishing copyrighted content without permission can constitute infringement. 
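The rate-limiting point above is also an engineering task, not just a policy. As a minimal sketch (the function and parameter names here are illustrative, not part of the playbook's scraper), a jittered exponential-backoff wrapper keeps request pressure conservative and recovers politely from transient failures:

```python
import random
import time


def polite_request(fetch, max_retries=5, base_delay=2.0):
    """Call `fetch` with jittered exponential backoff.

    `fetch` is any zero-argument callable that raises on a failed or
    blocked request. Waits a random delay in [0, base_delay * 2**attempt)
    between attempts ("full jitter"), re-raising after the final attempt.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Give up after the last attempt
            # Full jitter: uniform over a window that doubles each attempt
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
```

Full jitter (a uniform draw over a doubling window) also avoids synchronized retry bursts when several matrix jobs run in parallel against the same target.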
Ultimately, the responsibility lies with the developer to ensure their scraping activities are conducted in an ethical, legal, and respectful manner. This playbook provides the "how," but the "why" and "if" must be carefully considered for each specific use case.

## **Section 6: References**

### **Works cited**

1. Fingerprinting and Tracing Shadows: The Development and Impact ..., accessed on July 30, 2025, [https://arxiv.org/pdf/2411.12045](https://arxiv.org/pdf/2411.12045)
2. Canvas fingerprinting: Explained and illustrated - Stytch, accessed on July 30, 2025, [https://stytch.com/blog/canvas-fingerprinting/](https://stytch.com/blog/canvas-fingerprinting/)
3. Canvas Fingerprinting: What Is It and How to Bypass It - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/canvas-fingerprinting](https://www.zenrows.com/blog/canvas-fingerprinting)
4. The Development and Impact of Browser Fingerprinting on Digital Privacy - arXiv, accessed on July 30, 2025, [https://arxiv.org/html/2411.12045v1](https://arxiv.org/html/2411.12045v1)
5. How to Use a Playwright Proxy in 2025 - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/playwright-proxy](https://www.zenrows.com/blog/playwright-proxy)
6. What is WebGL Fingerprinting? How It Works & Tips | Medium, accessed on July 30, 2025, [https://medium.com/@datajournal/webgl-fingerprinting-60893a9ca382](https://medium.com/@datajournal/webgl-fingerprinting-60893a9ca382)
7. Top 9 Browser Fingerprinting Techniques Explained - Bureau, accessed on July 30, 2025, [https://bureau.id/blog/browser-fingerprinting-techniques](https://bureau.id/blog/browser-fingerprinting-techniques)
8. Browser fingerprinting: implementing fraud detection techniques in the era of AI - Stytch, accessed on July 30, 2025, [https://stytch.com/blog/browser-fingerprinting/](https://stytch.com/blog/browser-fingerprinting/)
9. What Is HTTP/2 Fingerprinting and How to Bypass It? | Ultimate Guide, accessed on July 30, 2025, [https://www.scrapeless.com/en/blog/bypass-https2](https://www.scrapeless.com/en/blog/bypass-https2)
10. Applications of TLS Fingerprinting in Bot Mitigation - CDNetworks, accessed on July 30, 2025, [https://www.cdnetworks.com/blog/cloud-security/tls-fingerprinting-bot-mitigation/](https://www.cdnetworks.com/blog/cloud-security/tls-fingerprinting-bot-mitigation/)
11. HTTP2 Fingerprinting Tools - Scrapfly, accessed on July 30, 2025, [https://scrapfly.io/web-scraping-tools/http2-fingerprint](https://scrapfly.io/web-scraping-tools/http2-fingerprint)
12. Preventing Playwright Bot Detection with Random Mouse Movements | by Manan Patel, accessed on July 30, 2025, [https://medium.com/@domadiyamanan/preventing-playwright-bot-detection-with-random-mouse-movements-10ab7c710d2a](https://medium.com/@domadiyamanan/preventing-playwright-bot-detection-with-random-mouse-movements-10ab7c710d2a)
13. (PDF) Web Bot Detection Evasion Using Generative Adversarial ..., accessed on July 30, 2025, [https://www.researchgate.net/publication/354391714_Web_Bot_Detection_Evasion_Using_Generative_Adversarial_Networks](https://www.researchgate.net/publication/354391714_Web_Bot_Detection_Evasion_Using_Generative_Adversarial_Networks)
14. mehaase/js-typewriter: Simulate a person typing in a DOM node. - GitHub, accessed on July 30, 2025, [https://github.com/mehaase/js-typewriter](https://github.com/mehaase/js-typewriter)
15. TypeIt | The most versatile JavaScript typewriter effect library on the planet., accessed on July 30, 2025, [https://www.typeitjs.com/](https://www.typeitjs.com/)
16. How to simulate typing in an input box with JavaScript - Stack Overflow, accessed on July 30, 2025, [https://stackoverflow.com/questions/47617616/how-to-simulate-typing-in-an-input-box-with-javascript](https://stackoverflow.com/questions/47617616/how-to-simulate-typing-in-an-input-box-with-javascript)
17. How To Make Playwright Undetectable | ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-make-playwright-undetectable/](https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-make-playwright-undetectable/)
18. Playwright vs Selenium: Which to choose in 2025 | BrowserStack, accessed on July 30, 2025, [https://www.browserstack.com/guide/playwright-vs-selenium](https://www.browserstack.com/guide/playwright-vs-selenium)
19. Playwright vs Selenium: Key Differences | Sauce Labs, accessed on July 30, 2025, [https://saucelabs.com/resources/blog/playwright-vs-selenium-guide](https://saucelabs.com/resources/blog/playwright-vs-selenium-guide)
20. Playwright vs. Selenium for web scraping - Apify Blog, accessed on July 30, 2025, [https://blog.apify.com/playwright-vs-selenium/](https://blog.apify.com/playwright-vs-selenium/)
21. playwright-extra - npm, accessed on July 30, 2025, [https://www.npmjs.com/package/playwright-extra](https://www.npmjs.com/package/playwright-extra)
22. What is Playwright Extra - A Web Scrapers Guide - ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-extra/](https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-extra/)
23. puppeteer-extra-plugin-stealth/evasions - GitHub, accessed on July 30, 2025, [https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/readme.md](https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/readme.md)
24. Invisible Automation: Using puppeteer-extra-plugin-stealth to Bypass Bot Protection, accessed on July 30, 2025, [https://latenode.com/blog/invisible-automation-using-puppeteer-extra-plugin-stealth-to-bypass-bot-protection](https://latenode.com/blog/invisible-automation-using-puppeteer-extra-plugin-stealth-to-bypass-bot-protection)
25. Puppeteer Stealth Tutorial: How To Use & Setup (+Alternatives) - Scrapingdog, accessed on July 30, 2025, [https://www.scrapingdog.com/blog/puppeteer-stealth/](https://www.scrapingdog.com/blog/puppeteer-stealth/)
26. puppeteer-extra-plugin-stealth - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/puppeteer-extra-plugin-stealth](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)
27. Puppeteer-Extra-Stealth Guide - Bypass Anti-Bots With Ease | ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-extra-stealth-plugin/](https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-extra-stealth-plugin/)
28. Implementing "Stealth" in Puppeteer Sharp - LambdaTest Community, accessed on July 30, 2025, [https://community.lambdatest.com/t/implementing-stealth-in-puppeteer-sharp/29231](https://community.lambdatest.com/t/implementing-stealth-in-puppeteer-sharp/29231)
29. Puppeteer Stealth Tutorial; How to Set Up & Use (+ Working Alternatives) | ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/puppeteer-stealth-tutorial-with-examples/](https://www.scrapingbee.com/blog/puppeteer-stealth-tutorial-with-examples/)
30. How to Use Puppeteer Stealth: A Plugin for Scraping - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/puppeteer-stealth](https://www.zenrows.com/blog/puppeteer-stealth)
31. puppeteer-extra-plugin-stealth - UNPKG, accessed on July 30, 2025, [https://app.unpkg.com/puppeteer-extra-plugin-stealth@2.4.1/files/readme.md](https://app.unpkg.com/puppeteer-extra-plugin-stealth@2.4.1/files/readme.md)
32. puppeteer-extra - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/puppeteer-extra](https://www.npmjs.com/package/puppeteer-extra)
33. How to Make Playwright Scraping Undetectable | ScrapingAnt, accessed on July 30, 2025, [https://scrapingant.com/blog/playwright-scraping-undetectable](https://scrapingant.com/blog/playwright-scraping-undetectable)
34. undetected-playwright - PyPI, accessed on July 30, 2025, [https://pypi.org/project/undetected-playwright/0.2.0/](https://pypi.org/project/undetected-playwright/0.2.0/)
35. Kaliiiiiiiiii-Vinyzu/patchright-python: Undetected Python version of the Playwright testing and automation library. - GitHub, accessed on July 30, 2025, [https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python)
36. Playwright Web Scraping Tutorial | Become 100% Undetectable! - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=afobK3UbTeE](https://www.youtube.com/watch?v=afobK3UbTeE)
37. Playing with Perlin Noise: Generating Realistic Archipelagos | by Yvan Scher - Medium, accessed on July 30, 2025, [https://medium.com/@yvanscher/playing-with-perlin-noise-generating-realistic-archipelagos-b59f004d8401](https://medium.com/@yvanscher/playing-with-perlin-noise-generating-realistic-archipelagos-b59f004d8401)
38. Perlin Noise: Implementation, Procedural Generation, and Simplex Noise - Garage Farm, accessed on July 30, 2025, [https://garagefarm.net/blog/perlin-noise-implementation-procedural-generation-and-simplex-noise](https://garagefarm.net/blog/perlin-noise-implementation-procedural-generation-and-simplex-noise)
39. Perlin Noise: A Procedural Generation Algorithm - Raouf's blog, accessed on July 30, 2025, [https://rtouti.github.io/graphics/perlin-noise-algorithm](https://rtouti.github.io/graphics/perlin-noise-algorithm)
40. ghost-cursor - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/ghost-cursor](https://www.npmjs.com/package/ghost-cursor)
41. Using Perlin Noise to follow my mouse - Processing Forum, accessed on July 30, 2025, [https://forum.processing.org/two/discussion/20974/using-perlin-noise-to-follow-my-mouse.html](https://forum.processing.org/two/discussion/20974/using-perlin-noise-to-follow-my-mouse.html)
42. Making maps with noise functions - Red Blob Games, accessed on July 30, 2025, [https://www.redblobgames.com/maps/terrain-from-noise/](https://www.redblobgames.com/maps/terrain-from-noise/)
43. oxylabs/OxyMouse: Mouse Movement Algorithms - GitHub, accessed on July 30, 2025, [https://github.com/oxylabs/OxyMouse](https://github.com/oxylabs/OxyMouse)
44. Python Scrapy - Build A LinkedIn Jobs Scraper [2025] - ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/)
45. spinlud/py-linkedin-jobs-scraper - GitHub, accessed on July 30, 2025, [https://github.com/spinlud/py-linkedin-jobs-scraper](https://github.com/spinlud/py-linkedin-jobs-scraper)
46. speedyapply/JobSpy: Jobs scraper library for LinkedIn ... - GitHub, accessed on July 30, 2025, [https://github.com/speedyapply/JobSpy](https://github.com/speedyapply/JobSpy)
47. How to create a LinkedIn job scraper in Python with Crawlee, accessed on July 30, 2025, [https://crawlee.dev/blog/linkedin-job-scraper-python](https://crawlee.dev/blog/linkedin-job-scraper-python)
48. Managing GitHub Actions settings for a repository - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository)
49. Controlling permissions for GITHUB_TOKEN - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/en/actions/how-tos/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token](https://docs.github.com/en/actions/how-tos/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token)
50. GitHub Actions permissions - Graphite, accessed on July 30, 2025, [https://graphite.dev/guides/github-actions-permissions](https://graphite.dev/guides/github-actions-permissions)
51. Undetected ChromeDriver in Python Selenium: How to Use for Web Scraping - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/undetected-chromedriver](https://www.zenrows.com/blog/undetected-chromedriver)
52. How to avoid Selenium detection or change approach - Stack Overflow, accessed on July 30, 2025, [https://stackoverflow.com/questions/77907712/how-to-avoid-selenium-detection-or-change-approach](https://stackoverflow.com/questions/77907712/how-to-avoid-selenium-detection-or-change-approach)
53. How to Set Up Automated GitHub Workflows for Your Python and React Applications, accessed on July 30, 2025, [https://www.freecodecamp.org/news/how-to-set-up-automated-github-workflows-for-python-react-apps/](https://www.freecodecamp.org/news/how-to-set-up-automated-github-workflows-for-python-react-apps/)
54. til/github-actions/cache-playwright-dependencies-across-workflows.md at master, accessed on July 30, 2025, [https://github.com/jbranchaud/til/blob/master/github-actions/cache-playwright-dependencies-across-workflows.md](https://github.com/jbranchaud/til/blob/master/github-actions/cache-playwright-dependencies-across-workflows.md)
55. How to run Playwright on GitHub Actions - foosel.net, accessed on July 30, 2025, [https://foosel.net/til/how-to-run-playwright-on-github-actions/](https://foosel.net/til/how-to-run-playwright-on-github-actions/)
56. Setting up CI - Playwright, accessed on July 30, 2025, [https://playwright.dev/docs/ci-intro](https://playwright.dev/docs/ci-intro)
57. Trace viewer | Playwright, accessed on July 30, 2025, [https://playwright.dev/docs/trace-viewer](https://playwright.dev/docs/trace-viewer)
58. What are the steps to enable and view traces in Playwright tests run on GitHub Actions?, accessed on July 30, 2025, [https://ray.run/questions/what-are-the-steps-to-enable-and-view-traces-in-playwright-tests-run-on-github-actions](https://ray.run/questions/what-are-the-steps-to-enable-and-view-traces-in-playwright-tests-run-on-github-actions)
59. How to Use GitHub Actions to Automate Data Scraping | by Tom Willcocks - Medium, accessed on July 30, 2025, [https://medium.com/data-analytics-at-nesta/how-to-use-github-actions-to-automate-data-scraping-299690cd8bdb](https://medium.com/data-analytics-at-nesta/how-to-use-github-actions-to-automate-data-scraping-299690cd8bdb)
60. Scrapy Playwright Tutorial: How to Scrape Dynamic Websites | ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/scrapy-playwright-tutorial/](https://www.scrapingbee.com/blog/scrapy-playwright-tutorial/)
61. How to Scrape LinkedIn in 2025 - Scrapfly, accessed on July 30, 2025, [https://scrapfly.io/blog/posts/how-to-scrape-linkedin-person-profile-company-job-data](https://scrapfly.io/blog/posts/how-to-scrape-linkedin-person-profile-company-job-data)
62. Playwright for Python Web Scraping Tutorial with Examples - ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/playwright-for-python-web-scraping/](https://www.scrapingbee.com/blog/playwright-for-python-web-scraping/)
63. Web Scraping with Playwright - BrowserStack, accessed on July 30, 2025, [https://www.browserstack.com/guide/playwright-web-scraping](https://www.browserstack.com/guide/playwright-web-scraping)
64. Playwright Web Scraping Tutorial for 2025 - Oxylabs, accessed on July 30, 2025, [https://oxylabs.io/blog/playwright-web-scraping](https://oxylabs.io/blog/playwright-web-scraping)
65. From Puppeteer stealth to Nodriver: How anti-detect frameworks evolved to evade bot detection - The Castle blog, accessed on July 30, 2025, [https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/](https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/)
66. "Step-by-Step Guide": Build Python Project Using GitHub Actions | by Yagmur Ozden, accessed on July 30, 2025, [https://medium.com/@yagmurozden/step-by-step-guide-build-python-project-using-github-actions-025e67c164e9](https://medium.com/@yagmurozden/step-by-step-guide-build-python-project-using-github-actions-025e67c164e9)
67. Make an issue on github using API V3 and Python, accessed on July 30, 2025, [https://gist.github.com/JeffPaine/3145490](https://gist.github.com/JeffPaine/3145490)
68. The Python Developer's Guide: Mastering GitHub Actions | by Mayuresh K, accessed on July 30, 2025, [https://python.plainenglish.io/the-python-developers-guide-mastering-automated-workflows-with-github-actions-505110d89185](https://python.plainenglish.io/the-python-developers-guide-mastering-automated-workflows-with-github-actions-505110d89185)
69. How to Upload Artifacts with GitHub Actions? - Workflow Hub - CICube, accessed on July 30, 2025, [https://cicube.io/workflow-hub/github-actions-upload-artifact/](https://cicube.io/workflow-hub/github-actions-upload-artifact/)
70. Using secrets in GitHub Actions, accessed on July 30, 2025, [https://docs.github.com/actions/security-guides/using-secrets-in-github-actions](https://docs.github.com/actions/security-guides/using-secrets-in-github-actions)
71. Events that trigger workflows - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/actions/learn-github-actions/events-that-trigger-workflows](https://docs.github.com/actions/learn-github-actions/events-that-trigger-workflows)
72. Add & Commit · Actions · GitHub Marketplace, accessed on July 30, 2025, [https://github.com/marketplace/actions/add-commit](https://github.com/marketplace/actions/add-commit)
73. Unlimited Free Web-Scraping with GitHub Actions - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=gEZhTfaIxHQ](https://www.youtube.com/watch?v=gEZhTfaIxHQ)
74. vincentbavitz/bezmouse: Simulate human mouse movements with xdotool - GitHub, accessed on July 30, 2025, [https://github.com/vincentbavitz/bezmouse](https://github.com/vincentbavitz/bezmouse)
75. REST API endpoints for issues - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/rest/reference/issues](https://docs.github.com/rest/reference/issues)
76. Start Automating: Build Your First GitHub Action - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=N7zd6tkqq04](https://www.youtube.com/watch?v=N7zd6tkqq04)
77. Actions · GitHub Marketplace - Upload a Build Artifact, accessed on July 30, 2025, [https://github.com/marketplace/actions/upload-a-build-artifact](https://github.com/marketplace/actions/upload-a-build-artifact)
78. actions/upload-artifact - GitHub, accessed on July 30, 2025, [https://github.com/actions/upload-artifact](https://github.com/actions/upload-artifact)
79. Building and testing Python - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/actions/guides/building-and-testing-python](https://docs.github.com/actions/guides/building-and-testing-python)
80. A How-To Guide for using Environment Variables and GitHub Secrets in GitHub Actions for Secrets Management in Continuous Integration - GitHub Gist, accessed on July 30, 2025, [https://gist.github.com/brianjbayer/53ef17e0a15f7d80468d3f3077992ef8](https://gist.github.com/brianjbayer/53ef17e0a15f7d80468d3f3077992ef8)
81. graphite.dev, accessed on July 30, 2025, [https://graphite.dev/guides/github-actions-matrix#:~:text=The%20matrix%20strategy%20is%20a,256%20jobs%20per%20workflow%20run.](https://graphite.dev/guides/github-actions-matrix#:~:text=The%20matrix%20strategy%20is%20a,256%20jobs%20per%20workflow%20run.)
82. arXiv:2412.02266v1 [cs.LG] 3 Dec 2024, accessed on July 30, 2025, [https://arxiv.org/pdf/2412.02266](https://arxiv.org/pdf/2412.02266)
83. www.expressvpn.com, accessed on July 30, 2025, [https://www.expressvpn.com/webrtc-leak-test](https://www.expressvpn.com/webrtc-leak-test)
84. How to Fix WebRTC Leaks in 2025 (All Browsers) - CyberInsider, accessed on July 30, 2025, [https://cyberinsider.com/webrtc-leaks/](https://cyberinsider.com/webrtc-leaks/)
85. Scalable Web Scraping with Playwright and Browserless (2025 Guide), accessed on July 30, 2025, [https://www.browserless.io/blog/scraping-with-playwright-a-developer-s-guide-to-scalable-undetectable-data-extraction](https://www.browserless.io/blog/scraping-with-playwright-a-developer-s-guide-to-scalable-undetectable-data-extraction)
86. sarperavci/human_mouse: Ultra-realistic human mouse movements using bezier curves and spline interpolation. Natural cursor automation. - GitHub, accessed on July 30, 2025, [https://github.com/sarperavci/human_mouse](https://github.com/sarperavci/human_mouse)
87. A beautiful application of Bézier Curves to simulate natural mouse movements - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/math/comments/1hyfq73/a_beautiful_application_of_b%C3%A9zier_curves_to/](https://www.reddit.com/r/math/comments/1hyfq73/a_beautiful_application_of_b%C3%A9zier_curves_to/)
88. Bezier curve - The Modern JavaScript Tutorial, accessed on July 30, 2025, [https://javascript.info/bezier-curve](https://javascript.info/bezier-curve)
89. Is Playwright the best alternative to Selenium in 2025? - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/Playwright/comments/1jb29zu/is_playwright_the_best_alternative_to_selenium_in/](https://www.reddit.com/r/Playwright/comments/1jb29zu/is_playwright_the_best_alternative_to_selenium_in/)
90. Best Web Scraping Detection Avoidance Libraries for Javascript | ScrapingAnt, accessed on July 30, 2025, [https://scrapingant.com/blog/javascript-detection-avoidance-libraries](https://scrapingant.com/blog/javascript-detection-avoidance-libraries)
91. ELI5: Why is it hard to simulate human mouse movement? : r/explainlikeimfive - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/explainlikeimfive/comments/cv68fz/eli5why_is_it_hard_to_simulate_human_mouse/](https://www.reddit.com/r/explainlikeimfive/comments/cv68fz/eli5why_is_it_hard_to_simulate_human_mouse/)
92. Emulate Human Mouse Input with Bezier Curves and Gaussian Distributions - CodeProject, accessed on July 30, 2025, [https://www.codeproject.com/Tips/759391/Emulate-Human-Mouse-Input-with-Bezier-Curves-and-G](https://www.codeproject.com/Tips/759391/Emulate-Human-Mouse-Input-with-Bezier-Curves-and-G)
93. The Best Residential Proxies of 2025: Tested & Ranked - Proxyway, accessed on July 30, 2025, [https://proxyway.com/best/residential-proxies](https://proxyway.com/best/residential-proxies)
94. 10 Best Residential Proxies in 2025 (List of Residential IP Proxies From Best Provider) - GeeksforGeeks, accessed on July 30, 2025, [https://www.geeksforgeeks.org/websites-apps/best-residential-proxy-providers/](https://www.geeksforgeeks.org/websites-apps/best-residential-proxy-providers/)
95. Top 10 USA Proxy Providers in 2025 for Scraping - Medium, accessed on July 30, 2025, [https://medium.com/@datajournal/best-usa-proxies-9ca04be84754](https://medium.com/@datajournal/best-usa-proxies-9ca04be84754)
96. How to set proxy in Playwright - Pixeljets, accessed on July 30, 2025, [https://pixeljets.com/blog/proxy-in-playwright/](https://pixeljets.com/blog/proxy-in-playwright/)