# **A Developer's Playbook for Resilient Web Scraping: Advanced Evasion and Automation in a GitHub Actions Environment**

## **Executive Summary**

The extraction of data from dynamic, modern web platforms represents a significant engineering challenge, far removed from the realm of simple scripting. High-value targets, such as professional networking and job platforms like LinkedIn, are fortified with multi-layered, sophisticated anti-bot systems designed to detect and block automated access. This complexity is further amplified when the scraping operations must be conducted within the ephemeral, stateless, and inherently conspicuous environment of a Continuous Integration/Continuous Delivery (CI/CD) system like GitHub Actions. Standard scraping techniques are not merely insufficient; they are destined for immediate failure.

This playbook presents a definitive, expert-level guide for developers and engineers tasked with building a resilient, long-term web scraping engine under these demanding conditions. It deconstructs the problem into its core components—advanced browser automation, robust evasion tactics, and intelligent CI/CD orchestration—and provides a comprehensive, actionable solution. The architectural blueprint detailed herein is founded on a multi-layered defense strategy designed to consistently evade detection and ensure reliable data extraction.

The proposed solution architecture integrates a modern browser automation framework, Playwright, chosen for its superior handling of dynamic, JavaScript-heavy applications. This foundation is augmented with a suite of advanced evasion tools and techniques, including specialized stealth libraries that patch browser-level automation tells, sophisticated human behavior emulation for mouse and keyboard interactions, and a non-negotiable, strategically managed network of rotating residential proxies to mask the scraper's origin.
A key architectural innovation presented is a two-workflow system within GitHub Actions to manage session state securely, decoupling the high-risk login process from routine scraping operations to enhance both stealth and stability.

This document is structured to guide the reader through a logical progression, beginning with a deep analysis of the modern anti-scraping threat landscape—from browser and protocol-level fingerprinting to behavioral biometrics. It then provides a practical arsenal of evasion tools and techniques, complete with comparative analyses and code implementations. Finally, it culminates in a complete architectural blueprint, detailing the Python scraper's design and the full YAML configuration for a scalable, resilient, and automated GitHub Actions workflow. This playbook is intended not as a theoretical exercise, but as a production-ready guide for building a web scraping engine capable of operating successfully against the most challenging targets in the most constrained environments.

## **Section 1: The Modern Gauntlet: Understanding Advanced Anti-Scraping Defenses**

To construct a resilient scraping engine, one must first comprehend the adversary: the sophisticated, multi-layered defense systems of modern web platforms. These systems have evolved far beyond simple IP blocking or User-Agent string checks. They now employ a holistic approach that analyzes a client's identity from the initial network handshake up to the nuanced patterns of their mouse movements. A failure to present a consistent, human-like profile across every layer of this inspection will result in immediate detection and blocking.

### **1.1. Beyond User-Agents: A Taxonomy of Browser Fingerprinting**

Browser fingerprinting is a collection of techniques used to create a unique, stable identifier for a client by gathering information about its specific configuration.1 This process operates without storing data on the user's device, making it a stealthy and powerful alternative to traditional cookie-based tracking.1 The resulting fingerprint is a hash generated from a combination of passively and actively collected data points, which, when combined, can achieve a high degree of uniqueness.3

#### **Passive Fingerprinting**

The first layer of detection involves the passive analysis of information that a browser sends with every HTTP request. This includes HTTP headers, which provide details about the browser, operating system, and preferred language.1 While this data provides limited uniqueness on its own and can be easily spoofed, inconsistencies—such as a User-Agent string for Chrome on Windows being accompanied by network protocol characteristics of a Linux-based Python library—serve as an immediate red flag for detection systems.1

#### **Active JavaScript-based Fingerprinting**

The more formidable challenge lies in active fingerprinting, where the target website executes JavaScript on the client side to interrogate the browser and its environment in detail. This approach creates a high-entropy fingerprint that is far more difficult to forge.

* **Browser & OS Properties:** The most fundamental active checks involve querying properties of the navigator object. The navigator.webdriver property, which is set to true by default in standard automation frameworks like Selenium and Playwright, is a primary indicator of automation. Advanced systems go further, enumerating installed browser plugins (navigator.plugins), system fonts, and screen resolution (window.screen) to build a more detailed profile.1 While individual properties are not unique, the specific combination across a user base is highly distinct.
* **Canvas Fingerprinting:** This is a powerful technique where a script instructs the browser to render a hidden image or text onto an HTML5 \<canvas\> element. The exact pixel-by-pixel output of this rendering process is subtly influenced by a combination of the operating system, the graphics card (GPU), installed graphics drivers, and font rendering engines.3 The script then extracts the rendered image data as a Base64-encoded string and computes a hash of it.3 This hash serves as a highly stable and unique identifier because even imperceptible rendering variations between devices will produce a different hash.3 The widespread adoption of this technique, with usage nearly doubling on top websites over a seven-year period, highlights its effectiveness.1
* **WebGL Fingerprinting:** An even more potent and difficult-to-evade technique is WebGL fingerprinting. WebGL (Web Graphics Library) is a JavaScript API for rendering 2D and 3D graphics directly in the browser, providing low-level access to the GPU.6 A fingerprinting script can instruct the browser to render a complex 3D scene. During this process, it collects a wealth of hardware-specific information, including the GPU model and vendor, driver versions, shader precision, supported extensions, and the exact pixel data of the rendered output.6 Because these characteristics are tied directly to the physical hardware, they are exceptionally difficult to spoof convincingly.6 Any attempt to mask one parameter (e.g., the GPU vendor string) while the actual rendering output corresponds to a different GPU will create a detectable inconsistency.
* **Audio Fingerprinting:** This technique leverages the Web Audio API to generate a unique fingerprint from a device's audio stack. A script uses an OscillatorNode to generate a specific, often inaudible, sound wave. This wave is then processed, and the output is analyzed. The final waveform is subtly altered by the specific hardware, drivers, and browser implementation, creating a consistent and unique hash for the device. This method is particularly stealthy as it leaves no client-side state and does not require user interaction.8

The evolution of these techniques illustrates a clear escalation in the cat-and-mouse game between scrapers and anti-bot systems. Early detection focused on simple flags like navigator.webdriver. Once automation tools began patching this property, defense systems moved to more complex, multi-vector analyses like Canvas and WebGL fingerprinting. This progression reveals a critical principle for modern evasion: it is no longer sufficient to mask a single property. A resilient scraping engine must present a complete and internally consistent fingerprint. An automation tool that reports a Chrome-on-Windows User-Agent but produces a WebGL fingerprint characteristic of a Linux server's virtualized GPU will be instantly flagged. Success requires a holistic approach that ensures every detectable attribute tells the same plausible story.

### **1.2. The Protocol Layer: TLS and HTTP/2 Fingerprinting**

The most sophisticated anti-bot systems do not wait for JavaScript execution to detect automation. They can identify and block a scraper at the network protocol level, based on the signature of its initial connection request. This layer of detection is particularly effective because the characteristics of a TLS (Transport Layer Security) and HTTP/2 connection are determined by the underlying networking library (e.g., Python's http.client, Node.js's http2 module) and are often fundamentally different from those of a real browser.
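To make this concrete, the following is a minimal sketch of how a JA3-style signature is derived from Client Hello parameters. The numeric field values below are invented for illustration, not taken from a real capture: the JA3 scheme joins the decimal values within each field with dashes, the five fields with commas, and hashes the resulting string with MD5.

```python
# Sketch of JA3-style fingerprint derivation (illustrative values only).
import hashlib

def ja3_fingerprint(tls_version: int, ciphers: list[int], extensions: list[int],
                    curves: list[int], point_formats: list[int]) -> str:
    """Concatenate Client Hello fields in JA3 order and MD5-hash them.

    Field order and value order are hashed verbatim, which is why two
    clients offering the same capabilities in a different order produce
    different fingerprints.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Identical ciphers, different extension order: the hashes differ.
a = ja3_fingerprint(771, [4865, 4866], [0, 23, 65281], [29, 23], [0])
b = ja3_fingerprint(771, [4865, 4866], [23, 0, 65281], [29, 23], [0])
print(a != b)  # True
```

Because ordering alone changes the hash, the signature is highly stable per client library, which is exactly what makes it useful to defenders.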
During the initial TLS handshake, the client sends a Client Hello message. The specific combination of parameters in this message—such as the TLS version, supported cipher suites, and the list and order of extensions—creates a unique signature. This signature, often hashed into a string known as a JA3 or JA4 fingerprint, can reliably identify the underlying client library used to make the request.9 For example, the JA3 fingerprint of a standard Python requests session is distinctly different from that of a Chrome browser running on Windows.

Similarly, with the widespread adoption of the HTTP/2 protocol, a new fingerprinting vector has emerged. When an HTTP/2 connection is established, the client sends a series of initial frames, including SETTINGS, WINDOW\_UPDATE, and potentially PRIORITY frames. The specific values within these frames (e.g., SETTINGS\_MAX\_CONCURRENT\_STREAMS), their order of transmission, and the ordering of pseudo-headers (like :method, :path) in the subsequent HEADERS frame vary significantly between different clients.9 A real Chrome browser has a well-defined and consistent HTTP/2 fingerprint that is difficult for non-browser libraries to replicate perfectly.9

This protocol-level analysis means that a scraper can be blocked before it even sends its first GET request for the page's HTML. The implication for a resilient scraping architecture is profound: the choice of automation tool is not just about its ability to control a browser's DOM. The tool's underlying network stack must be indistinguishable from that of a genuine, user-operated browser. This is a primary reason why standard HTTP libraries are inadequate for scraping protected targets and why browser automation frameworks like Playwright, which use the browser's own network stack, are essential. Even then, patched versions of these frameworks are often necessary to ensure that no automation-related artifacts leak at this low level.

### **1.3. The Human Element: Behavioral Biometrics and Anomaly Detection**

Beyond fingerprinting the client's software and hardware, advanced anti-bot systems increasingly analyze the user's behavior itself. They collect data on how the user interacts with the page, building a biometric profile that can distinguish the fluid, slightly imperfect motions of a human from the rigid, mathematically perfect actions of a script.12

This analysis focuses on several key areas:

* **Mouse Movements:** Human mouse movements are never perfectly linear. They follow curved paths, exhibit variations in speed (accelerating and decelerating), and contain minute, subconscious "jitters".12 A script that moves the mouse in a straight line from point A to point B is an obvious sign of automation. Detection systems track the entire mouse trajectory, analyzing its curvature, velocity, and consistency to identify non-human patterns.13
* **Click and Typing Patterns:** Humans do not click instantly. There is a small but measurable delay between the mousedown and mouseup events. Similarly, typing cadence is not uniform; the time between keystrokes varies, and humans make and correct errors.14 Scripts that execute clicks with zero delay or type with perfect, metronomic regularity are easily flagged.
* **Scrolling Behavior:** Humans scroll with varying speeds, sometimes using the mouse wheel, sometimes clicking the scrollbar, and often pausing to read content. An automated script that scrolls in perfectly uniform chunks or jumps instantly to the bottom of a page exhibits a clear non-human pattern.

These behavioral data points are fed into machine learning models trained on vast datasets of genuine user interactions.13 These models learn the statistical signatures of human behavior and can detect anomalies that signify automation. Consequently, a resilient scraper cannot just programmatically execute events like .click() and .type().
It must do so in a way that emulates the natural, noisy, and slightly inefficient patterns of a human user. This requires the implementation of algorithms that generate non-linear mouse paths, introduce randomized delays, and simulate a realistic typing rhythm.

### **1.4. The Network Barrier: IP Reputation, Rate Limiting, and CAPTCHAs**

The final layer of defense operates at the network level, focusing on the origin of the traffic and its volume. This is a particularly acute challenge for a scraper operating within a GitHub Actions environment.

* **IP Reputation:** The IP address from which a request originates is one of the most fundamental data points used for risk assessment. IP addresses associated with data centers, including cloud providers like Microsoft Azure where GitHub-hosted runners operate, are inherently treated with a high degree of suspicion. Anti-bot systems maintain extensive databases of IP addresses, and traffic from a known data center IP range is often subjected to immediate, heightened scrutiny, more frequent CAPTCHA challenges, or outright blocks. This makes the native IP address of a GitHub runner a significant liability. The only viable solution is to mask this origin by routing all traffic through a **residential proxy network**. These networks provide IP addresses assigned by Internet Service Providers (ISPs) to real homes, making the scraper's traffic appear indistinguishable from that of a legitimate user.
* **Rate Limiting:** Websites monitor the number of requests originating from a single IP address over a given time period. Exceeding a certain threshold is a classic sign of scraping and will trigger a temporary or permanent block. A resilient scraper must therefore implement dynamic rate limiting, respecting the server's limits and incorporating exponential backoff strategies when throttled.
The use of a large, rotating proxy pool is essential for distributing requests across many different IP addresses, thereby avoiding per-IP rate limits.17
* **CAPTCHAs:** CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges are the final line of defense, presented when a user's fingerprint or behavior is deemed suspicious. While third-party services exist to solve these challenges, they add cost, latency, and complexity. The primary architectural goal of a stealthy scraper should be to **avoid triggering CAPTCHAs in the first place** by successfully navigating all the preceding layers of detection. Relying on CAPTCHA solving as a primary strategy is an admission of a failed evasion architecture.

In the context of GitHub Actions, the IP reputation problem is non-negotiable. The stateless and data-center-based nature of the runners means that a robust proxy management layer is not an optional enhancement but a foundational architectural requirement. Without it, even the most perfectly fingerprinted and behaviorally human-like scraper will be blocked based on its origin alone.

## **Section 2: The Ghost in the Machine: An Arsenal of Evasion Tools**

Having dissected the mechanisms of modern anti-bot systems, this section provides a practical guide to the tools and techniques required to build a scraper that can systematically neutralize these defenses. The strategy involves selecting a powerful browser automation framework, augmenting it with specialized stealth libraries, emulating human interaction patterns, and masking its network identity.

### **2.1. Choosing the Right Engine: Playwright for Dynamic Web Applications**

For scraping modern, dynamic single-page applications (SPAs) like LinkedIn, the choice of browser automation framework is critical.
While Selenium has been a long-standing tool, Playwright, a newer framework from Microsoft, offers significant architectural advantages that make it better suited for this task.18

The primary technical justification for choosing Playwright lies in its architecture and API design. Selenium communicates with the browser driver via the JSON Wire Protocol over HTTP, which introduces latency with each command. In contrast, Playwright communicates over a persistent WebSocket connection, enabling faster and more efficient command execution.18 This speed is crucial for complex scraping tasks involving numerous interactions.

Furthermore, Playwright's API is designed with the dynamic web in mind. Its "auto-waiting" mechanism is a key feature; before performing an action like a click, Playwright automatically waits for the element to be attached to the DOM, visible, stable, and able to receive events. This eliminates a major source of flakiness common in Selenium scripts, where developers must often insert manual or explicit waits, which can be unreliable and slow down execution.20 Playwright also provides native, powerful tools for network interception, allowing the scraper to monitor, modify, or block network requests, which is invaluable for advanced scraping and evasion tasks.

The following Python script demonstrates the simplicity and power of Playwright for extracting data from a dynamically loaded page. It navigates to a LinkedIn job search page and extracts the titles and company names of the initial job listings.

```python
# File: simple_scraper.py
import asyncio
import re
from urllib.parse import urlencode

from playwright.async_api import async_playwright

async def scrape_linkedin_jobs(job_title: str, location: str):
    """
    A simple Playwright script to scrape the first page of LinkedIn job listings.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Construct the URL for the job search
        base_url = "https://www.linkedin.com/jobs/search"
        params = {
            "keywords": job_title,
            "location": location,
            "position": 1,
            "pageNum": 0
        }
        url = f"{base_url}?{urlencode(params)}"

        print(f"Navigating to: {url}")
        await page.goto(url, wait_until="domcontentloaded")

        # Wait for the job listings container to be visible
        await page.wait_for_selector('ul.jobs-search__results-list', timeout=15000)

        # Extract job listings as element handles
        job_listings = await page.query_selector_all('ul.jobs-search__results-list > li')
        print(f"Found {len(job_listings)} job listings on the first page.")

        scraped_data = []
        for job_listing in job_listings:
            try:
                title_element = await job_listing.query_selector('h3.base-search-card__title')
                company_element = await job_listing.query_selector('h4.base-search-card__subtitle')
                link_element = await job_listing.query_selector('a.base-card__full-link')

                title = await title_element.inner_text() if title_element else "N/A"
                company = await company_element.inner_text() if company_element else "N/A"
                link = await link_element.get_attribute('href') if link_element else "N/A"

                # Collapse whitespace and newlines in the extracted text
                title = re.sub(r'[\s\n]+', ' ', title).strip()
                company = re.sub(r'[\s\n]+', ' ', company).strip()

                scraped_data.append({
                    "title": title,
                    "company": company,
                    "link": link
                })
            except Exception as e:
                print(f"Error processing a job listing: {e}")

        await browser.close()
        return scraped_data

if __name__ == "__main__":
    jobs = asyncio.run(scrape_linkedin_jobs("Software Engineer", "United States"))
    for job in jobs:
        print(job)
```

This script, while functional for public pages, would be quickly detected on authenticated routes or by more advanced anti-bot systems due to its default browser fingerprint. The next step is to augment this engine with stealth capabilities.

### **2.2. The Cloak of Invisibility: A Comparative Analysis of Stealth Frameworks**

Standard Playwright, while powerful for automation, is not designed for stealth and is easily detectable. To operate undetected, it is essential to use a specialized library that patches the browser automation framework to remove or obscure the telltale signs of automation. The open-source community has produced several such libraries, each with a different approach and level of maturity.

* **playwright-extra with puppeteer-extra-plugin-stealth**: This combination brings the well-established evasion modules from the Puppeteer ecosystem to Playwright.21 playwright-extra acts as a wrapper around the standard Playwright library, enabling a plugin architecture. The puppeteer-extra-plugin-stealth is a collection of individual evasion scripts that target specific detection vectors, such as masking navigator.webdriver, spoofing WebGL vendor information, and normalizing browser properties to match a real user's browser.23 This modular approach is powerful but can introduce compatibility risks, as the plugins are primarily developed for Puppeteer.22
* **undetected-playwright**: This Python library is a direct port of the popular undetected-chromedriver project's concepts to the Playwright framework.33 It functions by patching the Playwright browser instance upon launch to remove common automation signatures.
It is designed to be a simple, drop-in replacement that requires minimal configuration changes to an existing Playwright script.33
* **patchright-python**: This is a more recent and actively maintained drop-in replacement for Playwright that focuses on patching lower-level detection vectors that some other stealth libraries may miss.35 Specifically, it addresses leaks related to the Chrome DevTools Protocol (CDP) itself, such as the use of Runtime.enable and Console.enable, which can be detected by sophisticated anti-bot systems.35 Its focus on these more fundamental leaks represents a more modern approach to evasion, reflecting the continuous evolution of bot detection techniques.

The progression from early stealth plugins focused on high-level JavaScript properties to newer libraries targeting low-level CDP interactions demonstrates the ongoing arms race. An effective long-term strategy requires an understanding of these underlying mechanisms. A developer should not treat these libraries as a "magic bullet" but as tools to be selected based on the current state of detection technology. For the most challenging targets, a library like patchright-python that addresses the deepest layers of detection is likely the most resilient choice.

The following table provides a comparative analysis to aid in selecting the appropriate framework.
| Feature | playwright-extra \+ stealth | undetected-playwright | patchright-python |
| :---- | :---- | :---- | :---- |
| **Primary Language** | JavaScript/Node.js | Python | Python |
| **Maintenance Status** | Actively maintained (core plugin) | Less frequent updates | Actively maintained |
| **Evasion Method** | Runtime JavaScript patching | Runtime JavaScript patching | Low-level CDP patching & JS patching |
| **Key Patches** | navigator.webdriver, WebGL vendor, plugins, codecs, permissions | navigator.webdriver, various browser properties | Runtime.enable leak, Console.enable leak, navigator.webdriver, command flags |
| **Cross-Browser Support** | Primarily Chromium | Chromium | Chromium |
| **Community Activity** | High (via Puppeteer ecosystem) | Moderate | Growing |
| **Ease of Use** | Simple setup | Simple drop-in | Simple drop-in |

### **2.3. Simulating Humanity: Advanced Interaction Emulation**

To defeat behavioral biometric analysis, a scraper must perform actions in a way that is statistically indistinguishable from a human. This involves moving beyond the default, instantaneous methods like .click() and .type() and implementing functions that introduce natural-looking variability and imperfection.

#### **Mouse Traversal**

A script's mouse movements are a primary target for behavioral analysis. A straight, constant-speed path from one point to another is a definitive signature of a bot. To counter this, mouse movements must be non-linear and exhibit variable speed.

* **Bézier Curves:** This mathematical technique is ideal for generating smooth, curved paths that mimic the natural arc of a human's hand moving a mouse. A quadratic or cubic Bézier curve can be defined with a start point, an end point, and one or two control points.
By randomizing the position of the control points, an infinite variety of natural-looking curves can be generated for each movement.
* **Perlin Noise:** While Bézier curves create a smooth path, human movements are not perfectly smooth; they contain tiny, subconscious corrections and "jitters." Perlin noise, a type of gradient noise used in computer graphics to generate natural-looking textures, can be applied to the mouse path to simulate this organic imperfection.37 By adding small, Perlin noise-generated offsets to the coordinates along the Bézier curve, the final path becomes less mathematically perfect and more believably human. The Python library OxyMouse provides a ready-to-use implementation of both Bézier and Perlin noise algorithms for mouse movement generation.43

#### **Keyboard Dynamics**

Similarly, text input must not be instantaneous. A human types at a variable speed, with pauses between words and even characters. This can be simulated with a custom typing function that iterates through a string and types it character by character, with a small, randomized delay between each keystroke.

The following Python code provides a utility class, HumanEmulator, that encapsulates these techniques for use with Playwright:

```python
# File: human_emulator.py
import asyncio
import random

class HumanEmulator:
    """
    A class to provide human-like interaction emulation for Playwright.
    """

    @staticmethod
    async def human_like_typing(page, selector: str, text: str):
        """
        Types text into an element with human-like delays.
        """
        await page.click(selector)
        for char in text:
            await page.keyboard.type(char)
            # 50ms to 200ms delay between characters
            await asyncio.sleep(random.uniform(0.05, 0.2))

    @staticmethod
    async def bezier_mouse_move(page, start_x, start_y, end_x, end_y, duration_ms=1000):
        """
        Moves the mouse along a randomized Bézier curve.
        """
        # Control point randomization
        control_1_x = start_x + random.uniform(-50, 50) + (end_x - start_x) * 0.25
        control_1_y = start_y + random.uniform(-50, 50) + (end_y - start_y) * 0.25
        control_2_x = start_x + random.uniform(-50, 50) + (end_x - start_x) * 0.75
        control_2_y = start_y + random.uniform(-50, 50) + (end_y - start_y) * 0.75

        points = []
        num_points = int(duration_ms / 20)  # A point every ~20ms

        for i in range(num_points + 1):
            t = i / num_points
            # Cubic Bézier curve formula
            x = (1 - t)**3 * start_x + 3 * (1 - t)**2 * t * control_1_x + 3 * (1 - t) * t**2 * control_2_x + t**3 * end_x
            y = (1 - t)**3 * start_y + 3 * (1 - t)**2 * t * control_1_y + 3 * (1 - t) * t**2 * control_2_y + t**3 * end_y
            points.append((x, y))

        # Introduce Perlin-like noise/jitter by adding small random offsets
        noisy_points = []
        for x, y in points:
            jitter_x = random.uniform(-2, 2)
            jitter_y = random.uniform(-2, 2)
            noisy_points.append((x + jitter_x, y + jitter_y))

        # Move the mouse through the points
        for x, y in noisy_points:
            await page.mouse.move(x, y)
            await page.wait_for_timeout(random.uniform(15, 25))  # Small delay between moves

        await page.mouse.move(end_x, end_y)  # Ensure it ends at the exact point

    @staticmethod
    async def move_and_click(page, selector: str, duration_ms=1000):
        """
        Moves to an element with a human-like path and then clicks it.
        """
        element = page.locator(selector)
        await element.wait_for(state="visible")
        box = await element.bounding_box()

        if not box:
            raise Exception(f"Element with selector '{selector}' not found or not visible.")

        # Get current mouse position (approximate).
        # In a real scenario, you might track this; for simplicity, start from a random point.
        start_pos = await page.evaluate("() => ({ x: Math.random() * window.innerWidth, y: Math.random() * window.innerHeight })")

        # Target a random point within the element's bounding box
        target_x = box['x'] + random.uniform(0.2, 0.8) * box['width']
        target_y = box['y'] + random.uniform(0.2, 0.8) * box['height']

        await HumanEmulator.bezier_mouse_move(page, start_pos['x'], start_pos['y'], target_x, target_y, duration_ms)

        # Human-like click delay
        await page.mouse.down()
        await page.wait_for_timeout(random.uniform(50, 150))
        await page.mouse.up()
```

### **2.4. The Network Mask: Strategic Proxy Management**

As established in Section 1.4, the use of a high-quality proxy network is a mandatory component for any scraper running within the GitHub Actions environment. The goal is to mask the data center IP of the runner and present an IP address that is indistinguishable from a real residential user.

The architecture requires a subscription to a reputable **rotating residential proxy provider**. These services provide access to a large pool of IP addresses from real consumer devices around the world. Key features to look for in a provider are a large pool size, extensive geographic targeting options, and support for "sticky sessions," which allow a scraper to maintain the same IP address for the duration of a multi-step task, such as a login and subsequent data extraction.

Configuring Playwright to use a proxy is straightforward.
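Sticky sessions are commonly selected by embedding a session identifier in the proxy username. The exact scheme varies by provider, so the `-session-` username format and hostname below are hypothetical placeholders; a minimal sketch:

```python
# Sketch: pinning a rotating residential proxy to one "sticky" session.
# The "user-session-id" username format is HYPOTHETICAL; consult your
# provider's documentation for the real scheme.
import secrets
from typing import Optional

def sticky_proxy_config(base_user: str, password: str, host: str, port: int,
                        session_id: Optional[str] = None) -> dict:
    """Build a Playwright-style proxy dict pinned to one upstream session."""
    # Reusing the same session_id keeps the same exit IP across steps
    # (e.g., login followed by extraction); a fresh id rotates the IP.
    session_id = session_id or secrets.token_hex(4)
    return {
        "server": f"http://{host}:{port}",
        "username": f"{base_user}-session-{session_id}",  # hypothetical format
        "password": password,
    }

# Same session id for login and scraping, so both use the same exit IP.
login_proxy = sticky_proxy_config("user123", "pw", "proxy.example.com", 8000, session_id="job42")
scrape_proxy = sticky_proxy_config("user123", "pw", "proxy.example.com", 8000, session_id="job42")
print(login_proxy["username"] == scrape_proxy["username"])  # True
```

The resulting dict has the same shape as the proxy configuration passed to Playwright's launch call in the next snippet.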
The browser.launch() method accepts a proxy parameter. To manage credentials securely, the proxy URL, including username and password, should never be hardcoded. Instead, it should be passed to the GitHub Actions workflow as a secret and then exposed to the Python script as an environment variable.

The following code demonstrates a secure method for configuring a proxy in a Playwright script, assuming the proxy URL is available in an environment variable named PROXY\_URL.

```python
# File: browser_setup.py
import os
from urllib.parse import urlparse

from playwright.async_api import Browser, Playwright
from dotenv import load_dotenv

# Load environment variables from a .env file for local development
load_dotenv()

async def get_configured_browser(playwright: Playwright) -> Browser:
    """
    Launches a Chromium browser instance configured with a proxy
    from environment variables.
    """
    proxy_url = os.environ.get("PROXY_URL")
    proxy_config = None

    if proxy_url:
        try:
            # Standard proxy format: http://username:password@host:port
            parsed = urlparse(proxy_url)
            proxy_config = {
                "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
                "username": parsed.username,
                "password": parsed.password,
            }
            print("Proxy configured successfully.")
        except Exception as e:
            print(f"Warning: Could not parse PROXY_URL. Proceeding without proxy. Error: {e}")
    else:
        print("Warning: PROXY_URL environment variable not set. Proceeding without proxy.")

    # Launch the browser with the proxy configuration (None disables it)
    browser = await playwright.chromium.launch(
        headless=True,
        proxy=proxy_config
    )
    return browser
```

This modular approach ensures that the core scraping logic remains decoupled from the network configuration, allowing for easy updates to proxy credentials without modifying the scraper code itself.

## **Section 3: The Automated Scraper: A Resilient Architectural Blueprint**

With the necessary evasion tools identified, this section outlines the architecture of the Python application itself. The design prioritizes modularity, robustness, and a novel approach to session management tailored to the stateless nature of the GitHub Actions environment.

### **3.1. Core Scraper Logic (Python with Playwright)**

The Python application should be structured into distinct modules to separate concerns, enhancing maintainability and testability. A recommended structure includes:

* config.py: Stores constants and configuration values, such as target URLs, search parameters, and selectors.
* human\_emulator.py: Contains the HumanEmulator class developed in Section 2.3 for simulating user interactions.
* scraper.py: Houses the main scraping class or functions responsible for orchestrating the browser, navigating pages, and extracting data.
* main.py: The entry point for the script, which parses command-line arguments (e.g., from the GitHub Actions matrix) and initiates the scraping process.

The scraper.py module will contain the core logic. This includes functions to handle the entire scraping lifecycle for a given set of search terms:

1. **Initialization:** A function to launch the patched Playwright browser, configure it with proxy settings from environment variables, and, most importantly, load the persisted session state.
2. **Search Execution:** A function that takes search keywords and a location, navigates to the LinkedIn jobs search page, and uses the human emulation methods to input the search terms and submit the form.
3. **Pagination and Scrolling:** A robust loop to handle pagination. For sites like LinkedIn that use infinite scroll, this involves repeatedly scrolling to the bottom of the results list and waiting for new content to be dynamically loaded via AJAX requests. The scraper must monitor for a "no more results" indicator to terminate the loop gracefully.
4. **Data Extraction:** A function to iterate over the located job listing elements. For each listing, it extracts key data points such as job title, company name, location, and the URL to the full job description. This process should be wrapped in error handling to prevent a single malformed listing from halting the entire scrape.
5. **Data Storage:** After collecting the data, a function saves it to a structured format like JSON or CSV in a designated output directory.

### **3.2. State and Session Management**

The single greatest architectural challenge when scraping an authenticated site within a stateless environment like GitHub Actions is managing the login session. Runners are ephemeral; any cookies or local storage generated during a run are destroyed when the job completes.
Attempting to perform a full username/password login on every scheduled run is a highly aggressive and unnatural pattern that will inevitably trigger security alerts, CAPTCHAs, and account locks.

The solution is an architecture that decouples the high-risk login action from the low-risk, routine scraping action. This is achieved by persisting the browser's authentication state between workflow runs using GitHub Secrets. Playwright facilitates this by allowing the entire state of a browser context—including cookies, localStorage, and sessionStorage—to be saved to and loaded from a file.

The proposed architecture consists of two distinct GitHub Actions workflows:

1. **login-and-save-state.yml (Manual Workflow):**
   * **Trigger:** This workflow is triggered manually via workflow_dispatch. It is run only when a new session needs to be established (e.g., initially, or if the previous session expires).
   * **Process:**
     * It launches a **headed** Playwright browser within the GitHub Actions runner (using xvfb to provide a virtual display, as runners are headless by default).
     * It navigates to the LinkedIn login page.
     * Crucially, it pauses and waits for the user to solve any CAPTCHA challenges and perform the multi-factor authentication (MFA) required during login. This manual intervention is necessary for the initial secure login.
     * Once logged in, it saves the browser context's state to a session.json file using context.storage_state(path="session.json").
     * It then encrypts this session.json file using a strong encryption key (e.g., GPG), which is itself stored as a GitHub Secret (GPG_PASSPHRASE).
     * The encrypted session data is then stored as a new, separate GitHub Secret (SESSION_STATE).
2. **scrape-jobs.yml (Scheduled Workflow):**
   * **Trigger:** This workflow runs on a schedule (e.g., daily).
   * **Process:**
     * It retrieves the encrypted SESSION_STATE and the GPG_PASSPHRASE from GitHub Secrets.
     * It decrypts the session data back into a session.json file.
     * It launches a headless Playwright browser and creates a new context by loading the state from the decrypted file: browser.new_context(storage_state="session.json").
     * This new context is now fully authenticated. The scraper can navigate directly to internal pages and perform its tasks without needing to interact with the login form.

This two-workflow approach provides immense benefits. The high-risk, CAPTCHA-prone login process is performed infrequently and with human assistance, while the frequent, automated scraping jobs run in a stealthy, pre-authenticated state. This dramatically reduces the risk of detection and increases the overall resilience of the engine.

### **3.3. Robustness and Reliability**

A production-grade scraper must be resilient to the inherent unpredictability of the web. Network connections can fail, websites can change their layout, and individual elements may not load as expected. The application logic must anticipate and handle these failures gracefully.

* **Retry Mechanisms:** All network operations (e.g., page.goto()) and critical element interactions (e.g., page.locator().click()) should be wrapped in a retry loop. An effective strategy is to implement exponential backoff, where the delay between retries increases after each failure (e.g., wait 2s, then 4s, then 8s). This prevents overwhelming a temporarily struggling server.
* **Error Handling:** Every data extraction step should be enclosed in a try-except block. If a specific piece of data, like a job's salary, is not found on one listing, the scraper should log the error, record the field as null, and continue processing the rest of that listing and subsequent listings. A single missing element should never cause the entire scraping job to crash.
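This log-and-continue pattern can be sketched as follows. The sketch is illustrative rather than part of the playbook's codebase: the StubListing class and its find_text method are invented stand-ins for a Playwright locator scope (so the pattern is runnable without a browser), and the selectors are examples only.

```python
# Illustrative sketch of per-field error handling during extraction.
# StubListing fakes a Playwright locator scope so this runs without a browser.
class StubListing:
    def __init__(self, fields):
        self._fields = fields

    def find_text(self, selector):
        if selector not in self._fields:
            raise LookupError(f"element not found: {selector}")
        return self._fields[selector]


# Example selectors only; real ones depend on the target site's markup.
FIELD_SELECTORS = {
    "title": "h3.base-search-card__title",
    "company": "h4.base-search-card__subtitle",
    "salary": "span.salary",  # frequently absent from listings
}


def extract_listing(listing):
    """Extract every field, recording None and moving on when one is missing."""
    record = {}
    for field, selector in FIELD_SELECTORS.items():
        try:
            record[field] = listing.find_text(selector).strip()
        except Exception:
            record[field] = None  # log-and-continue; never abort the whole run
    return record
```

A listing that lacks a salary element thus yields a record with "salary" set to None instead of raising and killing the job.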
* **Timeouts:** Playwright's default timeouts should be adjusted based on the expected performance of the target site and the proxy network. Setting an aggressive global timeout can lead to premature failures, while an overly generous timeout can cause jobs to hang indefinitely. A reasonable job-level timeout should also be configured in the GitHub Actions workflow itself (e.g., timeout-minutes: 60) to prevent runaway jobs from consuming excessive resources.

### **3.4. Data Persistence Strategy: Git Commit vs. Artifacts**

Once the data is scraped, it must be saved. Within GitHub Actions, there are two primary methods for persisting data generated during a workflow run.

* **Method 1: Committing to Git:** In this approach, the workflow includes a final step that uses a pre-built action (e.g., stefanzweifel/git-auto-commit-action or EndBug/add-and-commit) to commit the newly generated data file directly back to the repository. This creates a version-controlled, historical dataset that is easily accessible and can trigger downstream processes. The main drawback is the potential for a "noisy" commit history, with frequent, automated commits.
* **Method 2: Using Workflow Artifacts:** This method uses the actions/upload-artifact action to store the data file as an artifact associated with the workflow run. This keeps the Git history clean and is ideal for temporary data like logs or reports. However, artifacts expire by default and are not suitable for persisting state *between* different workflow runs, as a new run cannot easily access artifacts from a previous one.

For the purpose of creating a persistent dataset of job postings, the **Git commit method is recommended**. For temporary, diagnostic data like trace files and screenshots generated upon failure, **artifacts are the superior choice**.

## **Section 4: The Factory Floor: Orchestration with GitHub Actions**

This section translates the architectural design into a concrete implementation, providing the complete YAML configuration for the GitHub Actions workflow. The workflow is designed for automation, scalability, efficiency, and security.

### **4.1. The Workflow File (.github/workflows/scraper.yml)**

The heart of the automation is the workflow file. It defines the triggers, permissions, jobs, and steps that constitute the scraping pipeline.
```yaml
# File: .github/workflows/scraper.yml
name: LinkedIn Job Scraper

on:
  # Schedule the workflow to run every day at midnight UTC
  schedule:
    - cron: '0 0 * * *'
  # Allow manual triggering from the GitHub Actions UI
  workflow_dispatch:
    inputs:
      job_title:
        description: 'Job Title to search for'
        required: true
        default: 'Software Engineer'
      location:
        description: 'Location to search in'
        required: true
        default: 'United States'

# Set default permissions for the GITHUB_TOKEN for security
permissions:
  contents: write  # Required to commit data back to the repository
  issues: write    # Required to create issues on failure

jobs:
  scrape:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # Allow other matrix jobs to continue if one fails
      matrix:
        # Define a matrix to run scrapers for different roles/locations in parallel.
        # For workflow_dispatch, these will be overridden by inputs.
        job_config:
          - { title: 'Software Engineer', location: 'United States' }
          - { title: 'Data Scientist', location: 'United States' }
          - { title: 'Product Manager', location: 'Canada' }

    # Set a timeout for each job to prevent runaways
    timeout-minutes: 60

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Cache Python dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright
          key: ${{ runner.os }}-playwright-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-playwright-

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Install Playwright browsers and dependencies
        run: python -m playwright install --with-deps chromium

      - name: Run Scraper
        id: run_scraper
        env:
          # Securely pass secrets to the Python script
          LINKEDIN_EMAIL: ${{ secrets.LINKEDIN_EMAIL }}
          LINKEDIN_PASSWORD: ${{ secrets.LINKEDIN_PASSWORD }}
          PROXY_URL: ${{ secrets.PROXY_URL }}
          GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
          SESSION_STATE_ENCRYPTED: ${{ secrets.SESSION_STATE }}
        run: |
          # Use workflow_dispatch inputs if available, otherwise use matrix values
          JOB_TITLE="${{ github.event.inputs.job_title || matrix.job_config.title }}"
          LOCATION="${{ github.event.inputs.location || matrix.job_config.location }}"
          python main.py --job-title "$JOB_TITLE" --location "$LOCATION"

      - name: Commit scraped data
        if: success()
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore: Update scraped job data"
          file_pattern: "data/*.json"

      - name: Upload Trace on Failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-trace-${{ matrix.job_config.title }}-${{ matrix.job_config.location }}
          path: trace.zip
          retention-days: 7

      - name: Create Issue on Failure
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE.md
          assignees: ${{ github.actor }}
          update_existing: true
          search_existing: open
```

This workflow incorporates several best practices.
It is triggered both on a schedule (cron) and manually (workflow_dispatch), providing flexibility for automated runs and on-demand execution. It also explicitly sets the permissions for the GITHUB_TOKEN to the minimum required, adhering to the principle of least privilege.

### **4.2. Environment and Dependency Optimization**

Efficiency is paramount in a CI/CD environment to minimize runtime and cost. The two most time-consuming steps in a scraping workflow are typically dependency installation and browser binary downloads. GitHub Actions provides a caching mechanism to persist these between runs.

* **Caching Python Dependencies:** The actions/cache action is used to store the pip cache directory. The cache key is composed of the runner's operating system and a hash of the requirements.txt file. This ensures that the cache is invalidated and rebuilt only when the dependencies change, saving significant time on subsequent runs.
* **Caching Playwright Browsers:** Similarly, the browser binaries downloaded by Playwright can be cached. The cache path is ~/.cache/ms-playwright.
  Caching these binaries, which can total several hundred megabytes, is a critical optimization that can reduce job setup time by minutes. It is important to note that while the official Playwright documentation discourages caching due to potential staleness issues, for a controlled scraping environment where the browser version is pinned, the performance benefits are substantial and generally outweigh the risks.

The workflow also uses python -m playwright install --with-deps chromium to install not only the Chromium browser but also all of its necessary operating-system dependencies, ensuring the environment is correctly configured on the ubuntu-latest runner.

### **4.3. Secure Operations: Managing Credentials and Secrets**

Hardcoding sensitive information like login credentials or API keys into workflow files is a severe security vulnerability. GitHub Actions provides a secure storage mechanism called "Secrets" for this purpose. Secrets are encrypted environment variables that are only exposed to the specific workflow run.

For this playbook, the following secrets must be created in the repository settings (Settings > Secrets and variables > Actions):

* LINKEDIN_EMAIL: The email address for the LinkedIn account.
* LINKEDIN_PASSWORD: The password for the LinkedIn account.
* PROXY_URL: The full connection string for the residential proxy service.
* GPG_PASSPHRASE: The passphrase used to encrypt and decrypt the session state file.
* SESSION_STATE: The GPG-encrypted, Base64-encoded session state JSON.
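Because a missing secret would otherwise surface only as an opaque failure partway through a run, it is worth validating the environment before any browser is launched. The following is a minimal, stdlib-only sketch of such a check — a hypothetical helper, not part of the playbook's codebase. Note that the SESSION_STATE secret reaches the script under the environment variable name SESSION_STATE_ENCRYPTED.

```python
import os

# Environment variable names the scraper expects. The SESSION_STATE secret is
# mapped to SESSION_STATE_ENCRYPTED in the workflow's env block.
REQUIRED_VARS = [
    "LINKEDIN_EMAIL",
    "LINKEDIN_PASSWORD",
    "PROXY_URL",
    "GPG_PASSPHRASE",
    "SESSION_STATE_ENCRYPTED",
]


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

The entry point can call missing_vars() once at startup and raise SystemExit listing the offending names, failing fast with an actionable message instead of crashing mid-scrape.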
These secrets are then passed to the Run Scraper step via the env block, making them available as environment variables within the Python script. The Python script should be designed to read these values from os.environ.

### **4.4. Scaling and Parallelism: Leveraging the Matrix Strategy**

To scrape a large volume of data efficiently, the scraping tasks must be parallelized. GitHub Actions' strategy: matrix feature is the ideal tool for this. A matrix allows a single job definition to be expanded into multiple parallel jobs, each with a different set of input variables.

In the example scraper.yml workflow, the matrix is defined under jobs.scrape.strategy.matrix.job_config. This creates three parallel jobs, each with a different combination of title and location. The Python script is then designed to read these variables from the matrix context (matrix.job_config.title and matrix.job_config.location) to perform its targeted search. This approach allows the scraping of multiple job categories and locations simultaneously, drastically reducing the total time required to complete the entire scraping run. The fail-fast: false setting ensures that the failure of one job in the matrix (e.g., a scrape for "Product Manager" fails) does not automatically cancel the other in-progress jobs (e.g., "Software Engineer" and "Data Scientist"), maximizing data collection even in the face of partial failures.

### **4.5. Monitoring and Debugging**

When a scraper fails in an automated, headless environment, debugging can be challenging. A robust monitoring and debugging strategy is essential for maintaining the long-term health of the scraping engine.

* **Trace on Failure:** The most powerful debugging tool Playwright offers is the Trace Viewer.
  By configuring trace: 'on-first-retry' in the Playwright config, a detailed trace file (trace.zip) is generated for any test that fails and is retried. This trace includes a screencast of the execution, a live DOM snapshot for each action, console logs, and network requests, providing a complete picture of what went wrong.
* **Conditional Artifact Upload:** The workflow is configured to upload this trace.zip file as a workflow artifact, but only when the run_scraper step fails. This is achieved using the if: failure() condition on the actions/upload-artifact step. This ensures that artifacts are only generated when they are needed for debugging, keeping successful runs clean.
* **Programmatic Issue Creation:** For critical, unrecoverable errors, the workflow can automatically create a GitHub Issue. The JasonEtco/create-an-issue action is used, again with an if: failure() condition. It can be configured with a template (.github/ISSUE_TEMPLATE.md) to pre-populate the issue with details from the workflow run, such as the run ID, the failed job, and a link to the logs. This creates a formal, trackable record of the failure, ensuring that it is investigated and resolved.

## **Section 5: The Complete Codebase and Final Recommendations**

This final section provides the complete, production-ready code for the Python scraper and the GitHub Actions workflow, followed by essential ethical and legal considerations for conducting web scraping activities.

### **5.1. Fully Commented Python Scraper Code**

The following is the fully integrated Python script (main.py), designed to be executed by the GitHub Actions workflow.
It incorporates the architectural principles of modularity, secure credential handling, state management, robust error handling, and human behavior emulation.

```python
# File: main.py

import argparse
import asyncio
import base64
import json
import os
import random
import re
from urllib.parse import quote_plus

from playwright.async_api import async_playwright, Page, BrowserContext, Playwright
from dotenv import load_dotenv
import gnupg  # Requires the python-gnupg library and GnuPG installed on the runner

# --- Configuration ---
# In a real project, this might live in a separate config.py
BASE_URL = "https://www.linkedin.com/jobs/search"
OUTPUT_DIR = "data"
SESSION_FILE = "session.json"
ENCRYPTED_SESSION_FILE = "session.json.gpg"

# --- Human Behavior Emulation ---
class HumanEmulator:
    @staticmethod
    async def human_like_typing(page: Page, selector: str, text: str):
        await page.locator(selector).click()
        for char in text:
            delay = random.uniform(0.08, 0.25)  # 80ms to 250ms delay
            await page.keyboard.type(char, delay=delay * 1000)
            await asyncio.sleep(delay / 2)

    @staticmethod
    async def bezier_mouse_move(page: Page, start_x, start_y, end_x, end_y, duration_ms=800):
        control_1_x = start_x + (end_x - start_x) * 0.25 + random.uniform(-75, 75)
        control_1_y = start_y + (end_y - start_y) * 0.25 + random.uniform(-75, 75)
        control_2_x = start_x + (end_x - start_x) * 0.75 + random.uniform(-75, 75)
        control_2_y = start_y + (end_y - start_y) * 0.75 + random.uniform(-75, 75)

        num_points = int(duration_ms / 20)
        points = []
        for i in range(num_points + 1):
            t = i / num_points
            x = (1-t)**3*start_x + 3*(1-t)**2*t*control_1_x + 3*(1-t)*t**2*control_2_x + t**3*end_x
            y = (1-t)**3*start_y + 3*(1-t)**2*t*control_1_y + 3*(1-t)*t**2*control_2_y + t**3*end_y
            jitter_x = random.uniform(-1.5, 1.5)
            jitter_y = random.uniform(-1.5, 1.5)
            points.append((x + jitter_x, y + jitter_y))

        for x, y in points:
            await page.mouse.move(x, y)
            await asyncio.sleep(random.uniform(0.015, 0.025))
        await page.mouse.move(end_x, end_y)

    @staticmethod
    async def move_and_click(page: Page, selector: str, duration_ms=800):
        element = page.locator(selector)
        await element.wait_for(state="visible", timeout=10000)
        box = await element.bounding_box()
        if not box:
            raise Exception(f"Element '{selector}' not found.")

        start_pos = await page.evaluate("() => ({ x: Math.random() * 500, y: Math.random() * 500 })")
        target_x = box['x'] + random.uniform(0.3, 0.7) * box['width']
        target_y = box['y'] + random.uniform(0.3, 0.7) * box['height']

        await HumanEmulator.bezier_mouse_move(page, start_pos['x'], start_pos['y'], target_x, target_y, duration_ms)
        await page.mouse.down()
        await asyncio.sleep(random.uniform(0.06, 0.18))
        await page.mouse.up()

# --- Scraper Class ---
class LinkedInScraper:
    def __init__(self, job_title: str, location: str):
        self.job_title = job_title
        self.location = location
        self.playwright: Playwright = None
        self.browser = None
        self.context: BrowserContext = None
        self.page: Page = None

    async def setup(self):
        self.playwright = await async_playwright().start()
        proxy_url = os.environ.get("PROXY_URL")
        proxy_config = None
        if proxy_url:
            # Assuming the format http://user:pass@host:port
            parts = re.match(r"http://(.*?):(.*?)@(.*?):(\d+)", proxy_url)
            if parts:
                proxy_config = {
                    "server": f"http://{parts.group(3)}:{parts.group(4)}",
                    "username": parts.group(1),
                    "password": parts.group(2)
                }

        self.browser = await self.playwright.chromium.launch(
            headless=True,
            proxy=proxy_config,
            args=["--disable-blink-features=AutomationControlled"]
        )

        await self._load_session_state()

    async def _load_session_state(self):
        encrypted_state_b64 = os.environ.get("SESSION_STATE_ENCRYPTED")
        passphrase = os.environ.get("GPG_PASSPHRASE")

        if not encrypted_state_b64 or not passphrase:
            print("Session state or passphrase not found in environment variables. "
                  "Cannot proceed with authenticated scraping.")
            # Fall back to a new context, which will be unauthenticated
            self.context = await self.browser.new_context()
            self.page = await self.context.new_page()
            return

        try:
            gpg = gnupg.GPG()
            with open(ENCRYPTED_SESSION_FILE, "wb") as f:
                f.write(base64.b64decode(encrypted_state_b64))

            with open(ENCRYPTED_SESSION_FILE, "rb") as f:
                decrypted_data = gpg.decrypt_file(f, passphrase=passphrase)
            if not decrypted_data.ok:
                raise Exception(f"GPG decryption failed: {decrypted_data.stderr}")

            with open(SESSION_FILE, "w") as sf:
                sf.write(str(decrypted_data))

            self.context = await self.browser.new_context(storage_state=SESSION_FILE)
            self.page = await self.context.new_page()
            print("Successfully loaded and decrypted session state.")

        except Exception as e:
            print(f"Error loading session state: {e}. Proceeding with a new context.")
            self.context = await self.browser.new_context()
            self.page = await self.context.new_page()

    async def scrape(self):
        url = f"{BASE_URL}?keywords={quote_plus(self.job_title)}&location={quote_plus(self.location)}"
        await self.page.goto(url, wait_until="networkidle", timeout=60000)

        # Simple check to see if we landed on a login page (meaning the session failed)
        if "login" in self.page.url.lower():
            print("Redirected to login page. Session state may be invalid.")
            # In a real scenario, you might want to trigger an alert here.
            return []

        # Handle infinite scroll
        last_height = await self.page.evaluate("document.body.scrollHeight")
        while True:
            await self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(2, 4))  # Wait for new content to load
            new_height = await self.page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

        # Extract data
        job_elements = await self.page.locator('ul.jobs-search__results-list > li').all()
        results = []
        for job in job_elements:
            try:
                title = await job.locator('h3.base-search-card__title').inner_text()
                company = await job.locator('h4.base-search-card__subtitle').inner_text()
                location = await job.locator('span.job-search-card__location').inner_text()
                link = await job.locator('a.base-card__full-link').get_attribute('href')
                results.append({
                    "title": title.strip(),
                    "company": company.strip(),
                    "location": location.strip(),
                    "link": link
                })
            except Exception:
                continue  # Skip a listing if an element is missing

        return results

    def save_data(self, data):
        if not os.path.exists(OUTPUT_DIR):
            os.makedirs(OUTPUT_DIR)

        filename = f"{self.job_title.replace(' ', '_')}_{self.location.replace(' ', '_')}.json"
        filepath = os.path.join(OUTPUT_DIR, filename)

        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4, ensure_ascii=False)
        print(f"Data saved to {filepath}")

    async def teardown(self):
        if self.browser:
            await self.browser.close()
        if self.playwright:
            await self.playwright.stop()

async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-title", required=True)
    parser.add_argument("--location", required=True)
    args = parser.parse_args()

    scraper = LinkedInScraper(job_title=args.job_title, location=args.location)
    try:
        await scraper.setup()
        data = await scraper.scrape()
        if data:
            scraper.save_data(data)
    finally:
        await scraper.teardown()

if __name__ == "__main__":
    load_dotenv()  # For local testing
    asyncio.run(main())
```

### **5.2. The Final workflow.yml**

This is the complete, annotated GitHub Actions workflow file that orchestrates the entire process. It should be placed in the repository at .github/workflows/scraper.yml.

```yaml
# File: .github/workflows/scraper.yml
name: LinkedIn Job Scraper

on:
  schedule:
    - cron: '0 3 * * *'  # Runs every day at 3 AM UTC
  workflow_dispatch:
    inputs:
      job_title:
        description: 'Job Title to search for'
        required: true
        default: 'Data Engineer'
      location:
        description: 'Location to search in'
        required: true
        default: 'Remote'

permissions:
  contents: write
  issues: write

jobs:
  scrape-and-commit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_config:
          - { title: 'Software Engineer', location: 'United States' }
          - { title: 'Data Scientist', location: 'United States' }
          - { title: 'Product Manager', location: 'Canada' }
          - { title: 'DevOps Engineer', location: 'United Kingdom' }

    timeout-minutes: 60

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright
          key: ${{ runner.os }}-playwright-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-playwright-

      - name: Install system dependencies for GnuPG
        run: sudo apt-get update && sudo apt-get install -y gnupg

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Install Playwright browsers and OS dependencies
        run: python -m playwright install --with-deps chromium

      - name: Run Python Scraper
        id: scraper_run
        env:
          LINKEDIN_EMAIL: ${{ secrets.LINKEDIN_EMAIL }}
          LINKEDIN_PASSWORD: ${{ secrets.LINKEDIN_PASSWORD }}
          PROXY_URL: ${{ secrets.PROXY_URL }}
          GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
          SESSION_STATE_ENCRYPTED: ${{ secrets.SESSION_STATE }}
        run: |
          JOB_TITLE="${{ github.event.inputs.job_title || matrix.job_config.title }}"
          LOCATION="${{ github.event.inputs.location || matrix.job_config.location }}"
          python main.py --job-title "$JOB_TITLE" --location "$LOCATION"

      - name: Commit and push if changed
        if: success()
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "ci: Automated job data update for ${{ matrix.job_config.title }}"
          file_pattern: "data/*.json"
          commit_user_name: "GitHub Actions Bot"
          commit_user_email: "github-actions[bot]@users.noreply.github.com"
          commit_author: "GitHub Actions Bot <github-actions[bot]@users.noreply.github.com>"

      - name: Upload Trace on Failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-trace-${{ matrix.job_config.title }}-${{ matrix.job_config.location }}
          path: trace.zip
          retention-days: 5

      - name: Create Issue on Failure
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE.md
          assignees: ${{ github.actor }}
```
update\_existing: true 929 search\_existing: open 930 title: "Scraping job failed for: ${{ matrix.job\_config.title }} in ${{ matrix.job\_config.location }}" 931 932 ### **5.3. Ethical and Legal Considerations** 933 934 While this playbook provides the technical means to perform advanced web scraping, it is imperative that these tools are used responsibly. Developers and organizations must be cognizant of the ethical and legal landscape surrounding data extraction\]. 935 936 * **Terms of Service (ToS):** Most websites, including LinkedIn, have terms of service that explicitly prohibit or restrict automated data collection. While ToS are not always legally enforceable in the same way as laws, violating them can lead to account suspension and legal action from the platform owner. It is crucial to review and understand the ToS of any target website. 937 * **Rate Limiting and Server Load:** A core tenet of responsible scraping is to "be a good web citizen." This means implementing conservative rate limits and backoff strategies to avoid overwhelming the target server's resources. An overly aggressive scraper can degrade the service for legitimate users and is more likely to be detected and blocked. 938 * **Data Privacy (GDPR, CCPA):** Scraping and storing personal data is subject to strict data protection laws like the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA). These regulations impose stringent requirements on the collection, processing, and storage of personally identifiable information (PII). Any scraping project involving personal data must have a clear legal basis for processing and must implement robust data security and privacy measures. 939 * **Copyright:** The data on websites may be protected by copyright. Scraping and republishing copyrighted content without permission can constitute infringement. 
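The rate-limiting point above is also an engineering task, not just a policy. As a minimal sketch (the function and parameter names here are illustrative, not part of the playbook's scraper), a jittered exponential-backoff wrapper keeps request pressure conservative and recovers politely from transient failures:

```python
import random
import time


def polite_request(fetch, max_retries=5, base_delay=2.0):
    """Call `fetch` with jittered exponential backoff.

    `fetch` is any zero-argument callable that raises on a failed or
    blocked request. Waits a random delay in [0, base_delay * 2**attempt)
    between attempts ("full jitter"), re-raising after the final attempt.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Give up after the last attempt
            # Full jitter: uniform over a window that doubles each attempt
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
```

Full jitter (a uniform draw over a doubling window) also avoids synchronized retry bursts when several matrix jobs run in parallel against the same target.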
Ultimately, the responsibility lies with the developer to ensure their scraping activities are conducted in an ethical, legal, and respectful manner. This playbook provides the "how," but the "why" and "if" must be carefully considered for each specific use case.

## **Section 6: References**

### **Works cited**

1. Fingerprinting and Tracing Shadows: The Development and Impact ..., accessed on July 30, 2025, [https://arxiv.org/pdf/2411.12045](https://arxiv.org/pdf/2411.12045)
2. Canvas fingerprinting: Explained and illustrated - Stytch, accessed on July 30, 2025, [https://stytch.com/blog/canvas-fingerprinting/](https://stytch.com/blog/canvas-fingerprinting/)
3. Canvas Fingerprinting: What Is It and How to Bypass It - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/canvas-fingerprinting](https://www.zenrows.com/blog/canvas-fingerprinting)
4. The Development and Impact of Browser Fingerprinting on Digital Privacy - arXiv, accessed on July 30, 2025, [https://arxiv.org/html/2411.12045v1](https://arxiv.org/html/2411.12045v1)
5. How to Use a Playwright Proxy in 2025 - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/playwright-proxy](https://www.zenrows.com/blog/playwright-proxy)
6. What is WebGL Fingerprinting? How It Works & Tips | Medium, accessed on July 30, 2025, [https://medium.com/@datajournal/webgl-fingerprinting-60893a9ca382](https://medium.com/@datajournal/webgl-fingerprinting-60893a9ca382)
7. Top 9 Browser Fingerprinting Techniques Explained - Bureau, accessed on July 30, 2025, [https://bureau.id/blog/browser-fingerprinting-techniques](https://bureau.id/blog/browser-fingerprinting-techniques)
8. Browser fingerprinting: implementing fraud detection techniques in the era of AI - Stytch, accessed on July 30, 2025, [https://stytch.com/blog/browser-fingerprinting/](https://stytch.com/blog/browser-fingerprinting/)
9. What Is HTTP/2 Fingerprinting and How to Bypass It? | Ultimate Guide, accessed on July 30, 2025, [https://www.scrapeless.com/en/blog/bypass-https2](https://www.scrapeless.com/en/blog/bypass-https2)
10. Applications of TLS Fingerprinting in Bot Mitigation - CDNetworks, accessed on July 30, 2025, [https://www.cdnetworks.com/blog/cloud-security/tls-fingerprinting-bot-mitigation/](https://www.cdnetworks.com/blog/cloud-security/tls-fingerprinting-bot-mitigation/)
11. HTTP2 Fingerprinting Tools - Scrapfly, accessed on July 30, 2025, [https://scrapfly.io/web-scraping-tools/http2-fingerprint](https://scrapfly.io/web-scraping-tools/http2-fingerprint)
12. Preventing Playwright Bot Detection with Random Mouse Movements | by Manan Patel, accessed on July 30, 2025, [https://medium.com/@domadiyamanan/preventing-playwright-bot-detection-with-random-mouse-movements-10ab7c710d2a](https://medium.com/@domadiyamanan/preventing-playwright-bot-detection-with-random-mouse-movements-10ab7c710d2a)
13. (PDF) Web Bot Detection Evasion Using Generative Adversarial ..., accessed on July 30, 2025, [https://www.researchgate.net/publication/354391714_Web_Bot_Detection_Evasion_Using_Generative_Adversarial_Networks](https://www.researchgate.net/publication/354391714_Web_Bot_Detection_Evasion_Using_Generative_Adversarial_Networks)
14. mehaase/js-typewriter: Simulate a person typing in a DOM node. - GitHub, accessed on July 30, 2025, [https://github.com/mehaase/js-typewriter](https://github.com/mehaase/js-typewriter)
15. TypeIt | The most versatile JavaScript typewriter effect library on the planet., accessed on July 30, 2025, [https://www.typeitjs.com/](https://www.typeitjs.com/)
16. How to simulate typing in an input box with JavaScript - Stack Overflow, accessed on July 30, 2025, [https://stackoverflow.com/questions/47617616/how-to-simulate-typing-in-an-input-box-with-javascript](https://stackoverflow.com/questions/47617616/how-to-simulate-typing-in-an-input-box-with-javascript)
17. How To Make Playwright Undetectable | ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-make-playwright-undetectable/](https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-make-playwright-undetectable/)
18. Playwright vs Selenium: Which to choose in 2025 | BrowserStack, accessed on July 30, 2025, [https://www.browserstack.com/guide/playwright-vs-selenium](https://www.browserstack.com/guide/playwright-vs-selenium)
19. Playwright vs Selenium: Key Differences | Sauce Labs, accessed on July 30, 2025, [https://saucelabs.com/resources/blog/playwright-vs-selenium-guide](https://saucelabs.com/resources/blog/playwright-vs-selenium-guide)
20. Playwright vs. Selenium for web scraping - Apify Blog, accessed on July 30, 2025, [https://blog.apify.com/playwright-vs-selenium/](https://blog.apify.com/playwright-vs-selenium/)
21. playwright-extra - npm, accessed on July 30, 2025, [https://www.npmjs.com/package/playwright-extra](https://www.npmjs.com/package/playwright-extra)
22. What is Playwright Extra - A Web Scrapers Guide - ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-extra/](https://scrapeops.io/playwright-web-scraping-playbook/nodejs-playwright-extra/)
23. puppeteer-extra-plugin-stealth/evasions - GitHub, accessed on July 30, 2025, [https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/readme.md](https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/readme.md)
24. Invisible Automation: Using puppeteer-extra-plugin-stealth to Bypass Bot Protection, accessed on July 30, 2025, [https://latenode.com/blog/invisible-automation-using-puppeteer-extra-plugin-stealth-to-bypass-bot-protection](https://latenode.com/blog/invisible-automation-using-puppeteer-extra-plugin-stealth-to-bypass-bot-protection)
25. Puppeteer Stealth Tutorial: How To Use & Setup (+Alternatives) - Scrapingdog, accessed on July 30, 2025, [https://www.scrapingdog.com/blog/puppeteer-stealth/](https://www.scrapingdog.com/blog/puppeteer-stealth/)
26. puppeteer-extra-plugin-stealth - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/puppeteer-extra-plugin-stealth](https://www.npmjs.com/package/puppeteer-extra-plugin-stealth)
27. Puppeteer-Extra-Stealth Guide - Bypass Anti-Bots With Ease | ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-extra-stealth-plugin/](https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-extra-stealth-plugin/)
28. Implementing "Stealth" in Puppeteer Sharp - LambdaTest Community, accessed on July 30, 2025, [https://community.lambdatest.com/t/implementing-stealth-in-puppeteer-sharp/29231](https://community.lambdatest.com/t/implementing-stealth-in-puppeteer-sharp/29231)
29. Puppeteer Stealth Tutorial; How to Set Up & Use (+ Working Alternatives) | ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/puppeteer-stealth-tutorial-with-examples/](https://www.scrapingbee.com/blog/puppeteer-stealth-tutorial-with-examples/)
30. How to Use Puppeteer Stealth: A Plugin for Scraping - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/puppeteer-stealth](https://www.zenrows.com/blog/puppeteer-stealth)
31. puppeteer-extra-plugin-stealth - UNPKG, accessed on July 30, 2025, [https://app.unpkg.com/puppeteer-extra-plugin-stealth@2.4.1/files/readme.md](https://app.unpkg.com/puppeteer-extra-plugin-stealth@2.4.1/files/readme.md)
32. puppeteer-extra - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/puppeteer-extra](https://www.npmjs.com/package/puppeteer-extra)
33. How to Make Playwright Scraping Undetectable | ScrapingAnt, accessed on July 30, 2025, [https://scrapingant.com/blog/playwright-scraping-undetectable](https://scrapingant.com/blog/playwright-scraping-undetectable)
34. undetected-playwright - PyPI, accessed on July 30, 2025, [https://pypi.org/project/undetected-playwright/0.2.0/](https://pypi.org/project/undetected-playwright/0.2.0/)
35. Kaliiiiiiiiii-Vinyzu/patchright-python: Undetected Python version of the Playwright testing and automation library. - GitHub, accessed on July 30, 2025, [https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python)
36. Playwright Web Scraping Tutorial | Become 100% Undetectable! - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=afobK3UbTeE](https://www.youtube.com/watch?v=afobK3UbTeE)
37. Playing with Perlin Noise: Generating Realistic Archipelagos | by Yvan Scher - Medium, accessed on July 30, 2025, [https://medium.com/@yvanscher/playing-with-perlin-noise-generating-realistic-archipelagos-b59f004d8401](https://medium.com/@yvanscher/playing-with-perlin-noise-generating-realistic-archipelagos-b59f004d8401)
38. Perlin Noise: Implementation, Procedural Generation, and Simplex Noise - Garage Farm, accessed on July 30, 2025, [https://garagefarm.net/blog/perlin-noise-implementation-procedural-generation-and-simplex-noise](https://garagefarm.net/blog/perlin-noise-implementation-procedural-generation-and-simplex-noise)
39. Perlin Noise: A Procedural Generation Algorithm - Raouf's blog, accessed on July 30, 2025, [https://rtouti.github.io/graphics/perlin-noise-algorithm](https://rtouti.github.io/graphics/perlin-noise-algorithm)
40. ghost-cursor - NPM, accessed on July 30, 2025, [https://www.npmjs.com/package/ghost-cursor](https://www.npmjs.com/package/ghost-cursor)
41. Using Perlin Noise to follow my mouse - Processing Forum, accessed on July 30, 2025, [https://forum.processing.org/two/discussion/20974/using-perlin-noise-to-follow-my-mouse.html](https://forum.processing.org/two/discussion/20974/using-perlin-noise-to-follow-my-mouse.html)
42. Making maps with noise functions - Red Blob Games, accessed on July 30, 2025, [https://www.redblobgames.com/maps/terrain-from-noise/](https://www.redblobgames.com/maps/terrain-from-noise/)
43. oxylabs/OxyMouse: Mouse Movement Algorithms - GitHub, accessed on July 30, 2025, [https://github.com/oxylabs/OxyMouse](https://github.com/oxylabs/OxyMouse)
44. Python Scrapy - Build A LinkedIn Jobs Scraper [2025] - ScrapeOps, accessed on July 30, 2025, [https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/)
45. spinlud/py-linkedin-jobs-scraper - GitHub, accessed on July 30, 2025, [https://github.com/spinlud/py-linkedin-jobs-scraper](https://github.com/spinlud/py-linkedin-jobs-scraper)
46. speedyapply/JobSpy: Jobs scraper library for LinkedIn ... - GitHub, accessed on July 30, 2025, [https://github.com/speedyapply/JobSpy](https://github.com/speedyapply/JobSpy)
47. How to create a LinkedIn job scraper in Python with Crawlee, accessed on July 30, 2025, [https://crawlee.dev/blog/linkedin-job-scraper-python](https://crawlee.dev/blog/linkedin-job-scraper-python)
48. Managing GitHub Actions settings for a repository - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository)
49. Controlling permissions for GITHUB_TOKEN - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/en/actions/how-tos/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token](https://docs.github.com/en/actions/how-tos/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token)
50. GitHub Actions permissions - Graphite, accessed on July 30, 2025, [https://graphite.dev/guides/github-actions-permissions](https://graphite.dev/guides/github-actions-permissions)
51. Undetected ChromeDriver in Python Selenium: How to Use for Web Scraping - ZenRows, accessed on July 30, 2025, [https://www.zenrows.com/blog/undetected-chromedriver](https://www.zenrows.com/blog/undetected-chromedriver)
52. How to avoid Selenium detection or change approach - Stack Overflow, accessed on July 30, 2025, [https://stackoverflow.com/questions/77907712/how-to-avoid-selenium-detection-or-change-approach](https://stackoverflow.com/questions/77907712/how-to-avoid-selenium-detection-or-change-approach)
53. How to Set Up Automated GitHub Workflows for Your Python and React Applications, accessed on July 30, 2025, [https://www.freecodecamp.org/news/how-to-set-up-automated-github-workflows-for-python-react-apps/](https://www.freecodecamp.org/news/how-to-set-up-automated-github-workflows-for-python-react-apps/)
54. til/github-actions/cache-playwright-dependencies-across-workflows.md at master, accessed on July 30, 2025, [https://github.com/jbranchaud/til/blob/master/github-actions/cache-playwright-dependencies-across-workflows.md](https://github.com/jbranchaud/til/blob/master/github-actions/cache-playwright-dependencies-across-workflows.md)
55. How to run Playwright on GitHub Actions - foosel.net, accessed on July 30, 2025, [https://foosel.net/til/how-to-run-playwright-on-github-actions/](https://foosel.net/til/how-to-run-playwright-on-github-actions/)
56. Setting up CI - Playwright, accessed on July 30, 2025, [https://playwright.dev/docs/ci-intro](https://playwright.dev/docs/ci-intro)
57. Trace viewer | Playwright, accessed on July 30, 2025, [https://playwright.dev/docs/trace-viewer](https://playwright.dev/docs/trace-viewer)
58. What are the steps to enable and view traces in Playwright tests run on GitHub Actions?, accessed on July 30, 2025, [https://ray.run/questions/what-are-the-steps-to-enable-and-view-traces-in-playwright-tests-run-on-github-actions](https://ray.run/questions/what-are-the-steps-to-enable-and-view-traces-in-playwright-tests-run-on-github-actions)
59. How to Use GitHub Actions to Automate Data Scraping | by Tom Willcocks - Medium, accessed on July 30, 2025, [https://medium.com/data-analytics-at-nesta/how-to-use-github-actions-to-automate-data-scraping-299690cd8bdb](https://medium.com/data-analytics-at-nesta/how-to-use-github-actions-to-automate-data-scraping-299690cd8bdb)
60. Scrapy Playwright Tutorial: How to Scrape Dynamic Websites | ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/scrapy-playwright-tutorial/](https://www.scrapingbee.com/blog/scrapy-playwright-tutorial/)
61. How to Scrape LinkedIn in 2025 - Scrapfly, accessed on July 30, 2025, [https://scrapfly.io/blog/posts/how-to-scrape-linkedin-person-profile-company-job-data](https://scrapfly.io/blog/posts/how-to-scrape-linkedin-person-profile-company-job-data)
62. Playwright for Python Web Scraping Tutorial with Examples - ScrapingBee, accessed on July 30, 2025, [https://www.scrapingbee.com/blog/playwright-for-python-web-scraping/](https://www.scrapingbee.com/blog/playwright-for-python-web-scraping/)
63. Web Scraping with Playwright - BrowserStack, accessed on July 30, 2025, [https://www.browserstack.com/guide/playwright-web-scraping](https://www.browserstack.com/guide/playwright-web-scraping)
64. Playwright Web Scraping Tutorial for 2025 - Oxylabs, accessed on July 30, 2025, [https://oxylabs.io/blog/playwright-web-scraping](https://oxylabs.io/blog/playwright-web-scraping)
65. From Puppeteer stealth to Nodriver: How anti-detect frameworks evolved to evade bot detection - The Castle blog, accessed on July 30, 2025, [https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/](https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/)
66. "Step-by-Step Guide": Build Python Project Using GitHub Actions | by Yagmur Ozden, accessed on July 30, 2025, [https://medium.com/@yagmurozden/step-by-step-guide-build-python-project-using-github-actions-025e67c164e9](https://medium.com/@yagmurozden/step-by-step-guide-build-python-project-using-github-actions-025e67c164e9)
67. Make an issue on github using API V3 and Python, accessed on July 30, 2025, [https://gist.github.com/JeffPaine/3145490](https://gist.github.com/JeffPaine/3145490)
68. The Python Developer's Guide: Mastering GitHub Actions | by Mayuresh K, accessed on July 30, 2025, [https://python.plainenglish.io/the-python-developers-guide-mastering-automated-workflows-with-github-actions-505110d89185](https://python.plainenglish.io/the-python-developers-guide-mastering-automated-workflows-with-github-actions-505110d89185)
69. How to Upload Artifacts with GitHub Actions? - Workflow Hub - CICube, accessed on July 30, 2025, [https://cicube.io/workflow-hub/github-actions-upload-artifact/](https://cicube.io/workflow-hub/github-actions-upload-artifact/)
70. Using secrets in GitHub Actions, accessed on July 30, 2025, [https://docs.github.com/actions/security-guides/using-secrets-in-github-actions](https://docs.github.com/actions/security-guides/using-secrets-in-github-actions)
71. Events that trigger workflows - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/actions/learn-github-actions/events-that-trigger-workflows](https://docs.github.com/actions/learn-github-actions/events-that-trigger-workflows)
72. Add & Commit · Actions · GitHub Marketplace, accessed on July 30, 2025, [https://github.com/marketplace/actions/add-commit](https://github.com/marketplace/actions/add-commit)
73. Unlimited Free Web-Scraping with GitHub Actions - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=gEZhTfaIxHQ](https://www.youtube.com/watch?v=gEZhTfaIxHQ)
74. vincentbavitz/bezmouse: Simulate human mouse movements with xdotool - GitHub, accessed on July 30, 2025, [https://github.com/vincentbavitz/bezmouse](https://github.com/vincentbavitz/bezmouse)
75. REST API endpoints for issues - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/rest/reference/issues](https://docs.github.com/rest/reference/issues)
76. Start Automating: Build Your First GitHub Action - YouTube, accessed on July 30, 2025, [https://www.youtube.com/watch?v=N7zd6tkqq04](https://www.youtube.com/watch?v=N7zd6tkqq04)
77. Actions · GitHub Marketplace - Upload a Build Artifact, accessed on July 30, 2025, [https://github.com/marketplace/actions/upload-a-build-artifact](https://github.com/marketplace/actions/upload-a-build-artifact)
78. actions/upload-artifact - GitHub, accessed on July 30, 2025, [https://github.com/actions/upload-artifact](https://github.com/actions/upload-artifact)
79. Building and testing Python - GitHub Docs, accessed on July 30, 2025, [https://docs.github.com/actions/guides/building-and-testing-python](https://docs.github.com/actions/guides/building-and-testing-python)
80. A How-To Guide for using Environment Variables and GitHub Secrets in GitHub Actions for Secrets Management in Continuous Integration - GitHub Gist, accessed on July 30, 2025, [https://gist.github.com/brianjbayer/53ef17e0a15f7d80468d3f3077992ef8](https://gist.github.com/brianjbayer/53ef17e0a15f7d80468d3f3077992ef8)
81. graphite.dev, accessed on July 30, 2025, [https://graphite.dev/guides/github-actions-matrix#:~:text=The%20matrix%20strategy%20is%20a,256%20jobs%20per%20workflow%20run.](https://graphite.dev/guides/github-actions-matrix#:~:text=The%20matrix%20strategy%20is%20a,256%20jobs%20per%20workflow%20run.)
82. arXiv:2412.02266v1 [cs.LG] 3 Dec 2024, accessed on July 30, 2025, [https://arxiv.org/pdf/2412.02266](https://arxiv.org/pdf/2412.02266)
83. www.expressvpn.com, accessed on July 30, 2025, [https://www.expressvpn.com/webrtc-leak-test](https://www.expressvpn.com/webrtc-leak-test)
84. How to Fix WebRTC Leaks in 2025 (All Browsers) - CyberInsider, accessed on July 30, 2025, [https://cyberinsider.com/webrtc-leaks/](https://cyberinsider.com/webrtc-leaks/)
85. Scalable Web Scraping with Playwright and Browserless (2025 Guide), accessed on July 30, 2025, [https://www.browserless.io/blog/scraping-with-playwright-a-developer-s-guide-to-scalable-undetectable-data-extraction](https://www.browserless.io/blog/scraping-with-playwright-a-developer-s-guide-to-scalable-undetectable-data-extraction)
86. sarperavci/human_mouse: Ultra-realistic human mouse movements using bezier curves and spline interpolation. Natural cursor automation. - GitHub, accessed on July 30, 2025, [https://github.com/sarperavci/human_mouse](https://github.com/sarperavci/human_mouse)
87. A beautiful application of Bézier Curves to simulate natural mouse movements - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/math/comments/1hyfq73/a_beautiful_application_of_b%C3%A9zier_curves_to/](https://www.reddit.com/r/math/comments/1hyfq73/a_beautiful_application_of_b%C3%A9zier_curves_to/)
88. Bezier curve - The Modern JavaScript Tutorial, accessed on July 30, 2025, [https://javascript.info/bezier-curve](https://javascript.info/bezier-curve)
89. Is Playwright the best alternative to Selenium in 2025? - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/Playwright/comments/1jb29zu/is_playwright_the_best_alternative_to_selenium_in/](https://www.reddit.com/r/Playwright/comments/1jb29zu/is_playwright_the_best_alternative_to_selenium_in/)
90. Best Web Scraping Detection Avoidance Libraries for Javascript | ScrapingAnt, accessed on July 30, 2025, [https://scrapingant.com/blog/javascript-detection-avoidance-libraries](https://scrapingant.com/blog/javascript-detection-avoidance-libraries)
91. ELI5: Why is it hard to simulate human mouse movement? : r/explainlikeimfive - Reddit, accessed on July 30, 2025, [https://www.reddit.com/r/explainlikeimfive/comments/cv68fz/eli5why_is_it_hard_to_simulate_human_mouse/](https://www.reddit.com/r/explainlikeimfive/comments/cv68fz/eli5why_is_it_hard_to_simulate_human_mouse/)
92. Emulate Human Mouse Input with Bezier Curves and Gaussian Distributions - CodeProject, accessed on July 30, 2025, [https://www.codeproject.com/Tips/759391/Emulate-Human-Mouse-Input-with-Bezier-Curves-and-G](https://www.codeproject.com/Tips/759391/Emulate-Human-Mouse-Input-with-Bezier-Curves-and-G)
93. The Best Residential Proxies of 2025: Tested & Ranked - Proxyway, accessed on July 30, 2025, [https://proxyway.com/best/residential-proxies](https://proxyway.com/best/residential-proxies)
94. 10 Best Residential Proxies in 2025 (List of Residential IP Proxies From Best Provider) - GeeksforGeeks, accessed on July 30, 2025, [https://www.geeksforgeeks.org/websites-apps/best-residential-proxy-providers/](https://www.geeksforgeeks.org/websites-apps/best-residential-proxy-providers/)
95. Top 10 USA Proxy Providers in 2025 for Scraping - Medium, accessed on July 30, 2025, [https://medium.com/@datajournal/best-usa-proxies-9ca04be84754](https://medium.com/@datajournal/best-usa-proxies-9ca04be84754)
96. How to set proxy in Playwright - Pixeljets, accessed on July 30, 2025, [https://pixeljets.com/blog/proxy-in-playwright/](https://pixeljets.com/blog/proxy-in-playwright/)