In this article, you will learn how to build AI agents that can browse and interact with real websites using Playwright, browser-use, and LangGraph.
Topics we will cover include:
- Why Playwright is the right foundation for browser automation in 2026, and how it differs from Selenium.
- How to scrape dynamic, JavaScript-rendered pages and complete multi-step forms reliably.
- How to wire browser actions into LangGraph and browser-use agents, handle anti-bot detection, manage waiting and session persistence, and deploy the result in Docker.

Building Browser-Using AI Agents in Python
Introduction
Most AI agent tutorials start with an API. They show you how to call OpenWeather, hit the Stripe endpoint, pull data from GitHub. That is a fine starting point until you try to build something real and realize that the task you actually need done does not have an API.
Think about what humans do with browsers every day: filing government forms, reading competitor pricing, extracting research from sites that guard their data behind JavaScript rendering, logging into portals that have never heard of OAuth. There are roughly 1.1 billion websites on the internet. A vanishingly small fraction of them have public APIs. The rest only speak browser.
An agent that is limited to API calls handles maybe 5% of the tasks a human worker does daily. Give that agent a browser, and the coverage approaches everything. That is the gap this article closes.
The global AI agents market stands at \$10.91 billion in 2026 and is projected to reach \$50.31 billion by 2030, with browser-capable agents at the center of that growth. 27.7% of enterprises are already running agentic browsers in production, up from virtually none two years prior. The tooling has matured fast, and the patterns are settled enough to teach properly.
By the end of this article, you will have a working browser agent that navigates real websites, fills forms, extracts structured data, and connects to an LLM that decides what to do next, all in Python.
Why Playwright, Not Selenium
If you built browser automation five years ago, you built it with Selenium. Selenium is still widely deployed, still works, and is not going anywhere. But for any new project in 2026, Playwright is the default. The reasons are practical, not theoretical.
Selenium communicates with the browser by sending individual HTTP requests to a WebDriver. Every action (click, type, scroll) is a separate request. Playwright uses a persistent WebSocket connection for the entire session, so commands flow through one open channel without the overhead of a new HTTP request per action. Published benchmarks commonly report Playwright running 30-50% faster than Selenium at the test-suite level and averaging ~290ms per action versus Selenium’s ~536ms. For a browser agent that might execute hundreds of actions, that gap compounds.
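To make the compounding concrete, here is a back-of-the-envelope calculation using the per-action averages quoted above. The action counts are illustrative assumptions, and real timings will vary by site, network, and machine.

```python
# Rough estimate of time spent on browser actions, using the per-action
# averages quoted above (~290 ms for Playwright, ~536 ms for Selenium).
# The action counts in the demo loop are hypothetical agent workloads.
PLAYWRIGHT_MS_PER_ACTION = 290
SELENIUM_MS_PER_ACTION = 536


def total_seconds(actions: int, ms_per_action: int) -> float:
    """Total time in seconds for a run of `actions` browser actions."""
    return actions * ms_per_action / 1000


def seconds_saved(actions: int) -> float:
    """Seconds saved over a whole run by the faster per-action average."""
    slow = total_seconds(actions, SELENIUM_MS_PER_ACTION)
    fast = total_seconds(actions, PLAYWRIGHT_MS_PER_ACTION)
    return slow - fast


if __name__ == "__main__":
    for actions in (50, 200, 500):
        print(f"{actions} actions: about {seconds_saved(actions):.1f} s saved")
```

A difference of a quarter of a second per action is invisible in a single click, but it adds up to most of a minute once an agent run reaches a few hundred actions.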
Playwright also bundles its own browser binaries. When you install it, you get pre-configured versions of Chromium, Firefox, and WebKit that are matched to your Playwright version. No driver version mismatches, no broken CI pipelines because someone updated Chrome. It also has built-in auto-waiting: before it clicks an element, it verifies that the element is visible, enabled, and not animating. You do not have to write time.sleep(2) and hope for the best.
For AI agents specifically, Playwright fires real mouse and keyboard events that mirror how humans interact with browsers. Sites designed to detect automation look for synthetic DOM clicks. Playwright’s interaction model is harder to distinguish from genuine human input.

A side-by-side architecture comparison diagram
There is also the browser-use library, which sits one level higher. Browser-use is a Python library that gives an LLM a working browser. Under the hood, it uses Playwright to drive the browser, but the LLM reads the page state and decides what to click, type, and extract, with no CSS selectors required. You give it a task in plain English, and it figures out the rest. We will cover both raw Playwright and browser-use in this article, because they serve different needs: Playwright when you want precise, predictable control; browser-use when you want the agent to handle navigation decisions autonomously.
Setting Up the Environment
You need Python 3.11 or higher (recent browser-use releases require it), an OpenAI API key, and about five minutes.
Step 1: Create a virtual environment
```bash
python -m venv browser_agent_env

# macOS / Linux
source browser_agent_env/bin/activate

# Windows
browser_agent_env\Scripts\activate
```
Step 2: Install dependencies
```bash
pip install playwright \
    browser-use \
    langchain \
    langchain-openai \
    langgraph \
    langchain-community \
    python-dotenv
```
Step 3: Install the browser binaries
This is the step most people miss. Playwright needs to download Chromium, Firefox, and WebKit separately from the Python package. Run this once after installing:
```bash
playwright install chromium
```
If you want all three browser engines: playwright install. Chromium alone is sufficient for most agent work and is smaller to download.
Step 4: Store your API key
Create a .env file in your project directory:
```
OPENAI_API_KEY=your_openai_api_key_here
```
Add .env to your .gitignore immediately. Do not commit API keys.
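Since every LLM-backed script later in this article depends on this key, it can help to fail early with a clear message when it is missing. The sketch below uses only the standard library. The helper name require_env is hypothetical, and in the scripts in this article you would call load_dotenv() first so the .env file is read into the environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


if __name__ == "__main__":
    try:
        key = require_env("OPENAI_API_KEY")
        # Never print the key itself; confirming its presence is enough.
        print(f"OPENAI_API_KEY found ({len(key)} characters)")
    except RuntimeError as err:
        print(err)
```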
Step 5: Verify everything works
Here is a first script that navigates to a URL, reads the heading, and saves a screenshot. Use example.com, a publicly available test domain maintained by IANA that will not block you.
How to run: Save as first_run.py and run python first_run.py
```python
# first_run.py
# Navigate to a URL, take a screenshot, and extract the page title.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python first_run.py

import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        # Launch Chromium in headless mode (no visible browser window).
        # Set headless=False if you want to watch it run during development.
        browser = await p.chromium.launch(headless=True)

        # A browser context is like a fresh browser profile.
        # It isolates cookies, storage, and cache from other contexts.
        context = await browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        )
        page = await context.new_page()

        # Navigate to the URL and wait until the network is idle.
        # "networkidle" means no open network connections for 500ms.
        # For faster pages, "domcontentloaded" is sufficient.
        await page.goto("https://example.com", wait_until="networkidle")

        # Extract the page title
        title = await page.title()
        print(f"Page title: {title}")

        # Extract the text content of the h1 heading
        h1 = await page.text_content("h1")
        print(f"H1 heading: {h1}")

        # Take a full-page screenshot and save it to disk
        await page.screenshot(path="screenshot.png", full_page=True)
        print("Screenshot saved to screenshot.png")

        await browser.close()


asyncio.run(main())
```
What this does: async_playwright() is the entry point for the entire Playwright session. The context created by browser.new_context() is equivalent to opening a fresh incognito window; cookies, local storage, and cache are isolated from everything else. wait_until="networkidle" tells Playwright to wait until there have been no network connections for at least 500 ms before your code continues. It is a convenient catch-all for a first script, but it can be slow, or never settle, on pages that poll in the background, so waiting for a specific element is usually the more reliable choice on dynamic pages.
If this runs and saves a screenshot, your environment is working correctly.
Web Navigation and Scraping
The reason you need Playwright instead of requests + BeautifulSoup is JavaScript rendering. Modern websites deliver a skeleton of HTML and then build the actual content dynamically after the page loads: React, Vue, Angular, Next.js. A plain HTTP request fetches the skeleton. Playwright runs a real browser, so it sees exactly what a human sees after all JavaScript has executed.
The target below is books.toscrape.com, a legal scraping sandbox built for practice. It paginates results, encodes star ratings as CSS class names, and closely mirrors the structure of real e-commerce product pages.
How to run: Save as scrape_books.py and run python scrape_books.py
```python
# scrape_books.py
# Scrape book titles, prices, and ratings from books.toscrape.com
# This is a legal scraping sandbox site built for practice.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python scrape_books.py

import asyncio
import json
from playwright.async_api import async_playwright


async def scrape_books(max_pages: int = 3) -> list[dict]:
    """
    Scrape book listings from books.toscrape.com across multiple pages.
    Returns a list of dicts with title, price, rating, and page number.
    """
    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={"width": 1280, "height": 720})
        page = await context.new_page()

        for page_num in range(1, max_pages + 1):
            url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
            print(f"Scraping page {page_num}: {url}")

            await page.goto(url, wait_until="domcontentloaded")

            # Wait for the product cards to be visible before extracting.
            # This is critical on JavaScript-heavy pages where content loads after the HTML.
            # timeout=10000 means wait up to 10 seconds before raising an error.
            await page.wait_for_selector("article.product_pod", timeout=10000)

            # Get all book cards on the current page
            books = await page.query_selector_all("article.product_pod")

            for book in books:
                # Extract title from the <a> tag's title attribute
                title_el = await book.query_selector("h3 a")
                title = await title_el.get_attribute("title") if title_el else "N/A"

                # Extract price text
                price_el = await book.query_selector(".price_color")
                price = await price_el.inner_text() if price_el else "N/A"

                # Extract star rating from the CSS class name.
                # e.g. <p class="star-rating Three"> → "Three"
                rating_el = await book.query_selector("p.star-rating")
                rating_class = await rating_el.get_attribute("class") if rating_el else ""
                rating = rating_class.replace("star-rating", "").strip()

                results.append({
                    "title": title,
                    "price": price,
                    "rating": rating,
                    "page": page_num
                })

            print(f"  Extracted {len(books)} books from page {page_num}")

        await browser.close()

    return results


async def main():
    books = await scrape_books(max_pages=2)
    print(f"\nTotal books scraped: {len(books)}")
    print(json.dumps(books[:3], indent=2))


asyncio.run(main())
```
What this does: wait_for_selector() is the key call here. Instead of sleeping for a fixed time and hoping the content has loaded, it watches the DOM and proceeds the moment the target element appears, or raises a TimeoutError if it does not appear within the timeout window. That is the right behavior: fail fast and explicitly rather than silently extracting from an empty page.
The rating extraction deserves attention. The star rating is encoded as a CSS class (star-rating Three), not a number. The code strips “star-rating” from the class string to get the text value. This is the kind of thing you only know by inspecting the actual HTML. When you hand this task to a raw LLM with no browser, it has no way to know what the class structure looks like. With Playwright, you can inspect it directly and extract it exactly.
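The scraper returns the rating as a word and the price as display text. If you want to sort or filter on them, a small normalization step helps. The following is a minimal sketch using only the standard library; the helper names parse_rating and parse_price are hypothetical, and the word-to-number mapping assumes the five class names the sandbox site uses.

```python
import re
from typing import Optional

# Mapping from the class-name word to a numeric rating (assumed One..Five).
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr: str) -> Optional[int]:
    """Turn a class string such as 'star-rating Three' into 3."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None  # No recognizable rating word found


def parse_price(price_text: str) -> Optional[float]:
    """Pull the numeric part out of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None


if __name__ == "__main__":
    print(parse_rating("star-rating Three"), parse_price("£51.77"))
```

Returning None instead of a default value keeps missing data visible downstream, in the same spirit as the fail-fast behavior of wait_for_selector().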
Form Completion and Multi-Step Flows
Filling forms is where browser agents earn their keep and where most automation scripts fail. The reason is that web forms are not just inputs and buttons. They fire focus, input, change, and blur events in sequence. JavaScript validation listens for those events. If you inject a value into an input field by directly setting value in the DOM (as older automation tools often do), the validation listeners never fire and the form breaks.
Playwright’s fill() and click() methods fire real browser events in the right order, which is why they work on form validation that would block lower-level approaches.
The target below is the-internet.herokuapp.com/login, a public test site maintained specifically for automation practice. It accepts tomsmith / SuperSecretPassword! as valid credentials and returns clear success/failure messages.
How to run: Save as form_submit.py and run python form_submit.py
```python
# form_submit.py
# Complete and submit a multi-field login form on a public demo site.
# Target: https://the-internet.herokuapp.com/login (public test site)
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python form_submit.py

import asyncio
from playwright.async_api import async_playwright


async def login_and_verify(username: str, password: str) -> dict:
    """
    Attempt to log in to a demo site and return whether it succeeded.
    Handles: input filling, button clicking, and result verification.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto("https://the-internet.herokuapp.com/login")

        # Wait for the form to be visible before interacting.
        # state="visible" is the default but makes the intent explicit.
        await page.wait_for_selector("#username", state="visible")

        # fill() clears the field first, then types the value.
        # It fires the focus, input, and change events in order.
        await page.fill("#username", username)
        await page.fill("#password", password)

        # click() fires real mouse events -- mousedown, mouseup, click.
        # This triggers JavaScript listeners that a plain DOM click misses.
        await page.click("button[type='submit']")

        # Wait for the page to settle after form submission
        await page.wait_for_load_state("networkidle")

        # Check which result element appeared
        success_el = await page.query_selector(".flash.success")
        error_el = await page.query_selector(".flash.error")

        if success_el:
            message = await success_el.inner_text()
            result = {"success": True, "message": message.strip()}
        elif error_el:
            message = await error_el.inner_text()
            result = {"success": False, "message": message.strip()}
        else:
            result = {"success": False, "message": "Unknown result"}

        await browser.close()
        return result


async def main():
    # Valid credentials for the demo site
    result = await login_and_verify("tomsmith", "SuperSecretPassword!")
    print(f"Valid login: {result}")

    # Invalid credentials to verify error handling
    result_fail = await login_and_verify("wronguser", "wrongpass")
    print(f"Invalid login: {result_fail}")


asyncio.run(main())
```
What this does: The pattern here, fill() → click() → wait_for_load_state() → check for result element, is the template for almost any form interaction. The wait_for_load_state("networkidle") after the submit is important: without it, you query the DOM before the page has updated and get the pre-submission state, not the result.
For more complex forms with file uploads, dropdowns, and checkboxes:
```python
# File upload
await page.set_input_files("#file-upload", "/path/to/document.pdf")

# Select dropdown by visible label text
await page.select_option("#country-select", label="Nigeria")

# Check a checkbox
await page.check("#agree-terms")

# Handle a modal dialog (confirm/alert)
page.on("dialog", lambda dialog: asyncio.ensure_future(dialog.accept()))
```
Tool Orchestration with LangChain and LangGraph
Raw Playwright scripts are powerful but fixed. They do exactly what you coded, no more. The moment a page changes its structure, or the task requires a decision the script did not anticipate, it breaks.
Connecting Playwright to an LLM changes this. Browser actions become tools the agent can call when it decides they are needed. The agent reads the task, reasons about what to do, calls a tool, reads the result, and decides what to do next. That loop handles variation that a fixed script cannot.
This is the bridge from “browser automation script” to “AI agent.”
How to run: Save as agent_tools.py, ensure OPENAI_API_KEY is in your .env, then run python agent_tools.py
```python
# agent_tools.py
# LangGraph agent with three browser tools: navigate_and_extract, fill_and_submit_form, take_screenshot
# Prerequisites: pip install playwright langchain langchain-openai langgraph python-dotenv
#                playwright install chromium
# How to run: python agent_tools.py

import asyncio
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from playwright.async_api import async_playwright

load_dotenv()

# ── SHARED BROWSER STATE ──────────────────────────────────────────────────────
# We keep a single browser instance alive for the agent's lifetime.
# Creating and destroying a browser on every tool call is slow and wasteful.

_browser = None
_page = None
_playwright = None


async def get_page():
    """Return the shared page, launching the browser if needed."""
    global _browser, _page, _playwright
    if _browser is None:
        _playwright = await async_playwright().start()
        _browser = await _playwright.chromium.launch(headless=True)
        context = await _browser.new_context(viewport={"width": 1280, "height": 720})
        _page = await context.new_page()
    return _page


async def close_browser():
    """Clean up browser resources when the agent session ends."""
    global _browser, _page, _playwright
    if _browser:
        await _browser.close()
        await _playwright.stop()
        _browser = None
        _page = None
        _playwright = None


# ── BROWSER TOOLS ─────────────────────────────────────────────────────────────
# Note: these are async tools (async def). LangChain's @tool decorator supports
# async functions directly, and the agent must be invoked with ainvoke() so that
# tool calls run on the same event loop instead of trying to start a second one.

@tool
async def navigate_and_extract(url: str) -> str:
    """
    Navigate to a URL and return the visible text content of the page.
    Use this to visit websites and read their content.
    Input: a full URL string including https:// (e.g., 'https://example.com').
    """
    page = await get_page()
    await page.goto(url, wait_until="domcontentloaded", timeout=15000)
    await page.wait_for_load_state("networkidle")

    content = await page.inner_text("body")

    # Truncate to avoid flooding the LLM context window
    return content[:3000] if len(content) > 3000 else content


@tool
async def fill_and_submit_form(selector_value_pairs: str) -> str:
    """
    Fill form fields and submit a form on the currently loaded page.
    Input: a comma-separated string of 'selector:value' pairs ending with
    'submit:button_selector'.
    Example: '#email:user@example.com,#password:secret,submit:button[type=submit]'
    """
    page = await get_page()

    try:
        pairs = selector_value_pairs.split(",")
        submit_selector = None

        for pair in pairs:
            key, val = pair.split(":", 1)
            key = key.strip()
            val = val.strip()

            if key == "submit":
                submit_selector = val
            else:
                await page.fill(key, val)

        if submit_selector:
            await page.click(submit_selector)
            await page.wait_for_load_state("networkidle")

        return f"Form submitted. Current URL: {page.url}"

    except Exception as e:
        return f"Form interaction failed: {str(e)}"


@tool
async def take_screenshot(filename: str) -> str:
    """
    Take a screenshot of the current browser page and save it to a file.
    Use this to visually verify the current state of the page.
    Input: filename string (e.g., 'result.png').
    """
    page = await get_page()
    await page.screenshot(path=filename, full_page=False)
    return f"Screenshot saved to {filename}"


# ── AGENT SETUP ───────────────────────────────────────────────────────────────

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

tools = [navigate_and_extract, fill_and_submit_form, take_screenshot]

# create_react_agent wires together the LLM, the tools, and the ReAct reasoning loop.
# The agent decides which tool to call, calls it, reads the result, and continues.
agent = create_react_agent(llm, tools)


# ── DEMO ──────────────────────────────────────────────────────────────────────

async def main():
    result = await agent.ainvoke({
        "messages": [HumanMessage(
            content=(
                "Go to https://example.com, read the page content, "
                "then take a screenshot called example.png"
            )
        )]
    })
    print(result["messages"][-1].content)
    await close_browser()


asyncio.run(main())
```
What this does: The three @tool-decorated functions are registered with the agent. Each docstring is what the LLM reads to understand what the tool does and when to use it. Write them like job descriptions, not code comments. The shared _browser and _page globals mean the browser stays open across multiple tool calls, which is essential for tasks that span several pages in the same session. Because the tools are defined with async def, the agent is invoked with ainvoke() rather than invoke(), so the tool calls run on the same event loop that main() is already using.

A vertical flow diagram showing how a task request flows through the agent
Image by Editor
The key design decision in this snippet is the shared browser instance. If each tool call launched and closed its own browser, you would lose all session state between calls, such as cookies, navigation history, and any form state the agent had already built up. Keeping the browser alive for the full agent session preserves that context.
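One fragile spot in fill_and_submit_form is its input format. Splitting on commas and colons breaks as soon as a selector or a value contains either character, for example a password with a comma or an attribute selector that includes a URL. A possible alternative is to have the LLM pass JSON instead. The helper below is only a sketch: the name parse_form_spec and the JSON shape are assumptions for illustration, not part of the script above.

```python
import json
from typing import Dict, Optional, Tuple


def parse_form_spec(spec: str) -> Tuple[Dict[str, str], Optional[str]]:
    """
    Parse a JSON form description into (fields, submit_selector).

    Assumed shape for this sketch:
    {"fields": {"#email": "user@example.com"}, "submit": "button[type=submit]"}
    """
    data = json.loads(spec)
    if not isinstance(data, dict):
        raise ValueError("Form spec must be a JSON object")

    fields = data.get("fields", {})
    if not isinstance(fields, dict):
        raise ValueError("'fields' must map selectors to values")

    submit = data.get("submit")
    # Coerce keys and values to strings so a fill call always receives text.
    clean_fields = {str(k): str(v) for k, v in fields.items()}
    return clean_fields, (str(submit) if submit else None)


if __name__ == "__main__":
    demo = '{"fields": {"#note": "a, b: c"}, "submit": "button[type=submit]"}'
    print(parse_form_spec(demo))
```

Inside the tool, you would loop over the returned fields with page.fill() and click the submit selector if one is present. The tool docstring would need to describe the JSON shape so the LLM knows what to send.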
Using browser-use for High-Level Agent Tasks
Raw Playwright with @tool functions gives you precise control. The trade-off is that you are still writing selectors, still thinking about page structure, still handling every edge case manually. If the site changes its HTML, your selectors break.
browser-use takes a different approach. Instead of writing selectors, you give the agent a task in plain English. browser-use uses Playwright under the hood, but the LLM reads the current page state on each step and decides what to do next: which element to click, what to type, and when the task is complete. The page structure is not hardcoded into your code. The agent figures it out at runtime.
That runtime decision-making is what makes it resilient to site changes that would break a selector-based script.
When to use browser-use over raw Playwright:
- If the task is exploratory and the page structure is unpredictable, use browser-use.
- If you are running a fixed, repeatable workflow where every selector is known and stable, raw Playwright is more reliable and cheaper per run.
- A browser-use agent makes multiple LLM calls per task step; a scripted Playwright run makes none.
How to run: Save as browser_use_agent.py, ensure OPENAI_API_KEY is in your .env, then run python browser_use_agent.py
```python
# browser_use_agent.py
# A browser-use agent that accepts a natural language task and completes it
# without any CSS selectors or hardcoded page structure.
# Prerequisites: pip install browser-use playwright python-dotenv
#                playwright install chromium
# How to run: python browser_use_agent.py

import asyncio
import os
from dotenv import load_dotenv
# Note: this import suits browser-use releases that accept LangChain chat models.
# Some newer releases ship their own wrapper (from browser_use import ChatOpenAI);
# if your installed version rejects the LangChain model, try that import instead.
from langchain_openai import ChatOpenAI
from browser_use import Agent

load_dotenv()


async def run_browser_task(task: str) -> str:
    """
    Hand a natural language task to a browser-use agent.
    The agent handles navigation, clicks, and extraction without selectors.
    """
    # temperature=0 keeps decisions deterministic and reduces hallucinated actions
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        api_key=os.getenv("OPENAI_API_KEY")
    )

    # Agent wraps the browser, the LLM, and the task loop together.
    # max_actions_per_step limits how many actions the agent takes before
    # re-reading the page -- prevents runaway loops on complex pages.
    agent = Agent(
        task=task,
        llm=llm,
        max_actions_per_step=5
    )

    # run() executes the full task loop:
    # read page → decide action → take action → read updated page → repeat
    result = await agent.run()

    # final_result() returns the agent's extracted content or conclusion
    return result.final_result() or "Task completed with no extracted output."


async def main():
    task = (
        "Go to https://books.toscrape.com and find the 3 most expensive books "
        "on the first page. Return their titles and prices."
    )

    print(f"Task: {task}\n")
    output = await run_browser_task(task)
    print(f"Result:\n{output}")


asyncio.run(main())
```
What this does: The entire task, navigating to the site, reading the page, identifying the three highest prices, and extracting them, is handled by the agent without a single CSS selector in your code. If books.toscrape.com redesigns its price display tomorrow, the script will most likely still work, because the agent re-reads the page on every run. A selector-based scraper would fail or return empty fields until its selectors were updated.
The max_actions_per_step=5 parameter is worth explaining. On each step, the agent reads the page and can decide to take up to five actions (click, type, scroll, navigate) before re-reading the page. Keeping this low forces the agent to check its work more frequently, which catches mistakes earlier.
Handling the Hard Parts
Three things break most browser agents in production. Each has a solution, but none of them is obvious until you have already been burned.
1. Anti-Bot Detection
Websites that do not want to be automated detect automation in several ways, such as checking the navigator.webdriver property (which Playwright sets to true by default), looking for headless browser fingerprints in the JavaScript environment, and analyzing interaction patterns that are too fast or too uniform to be human.
The most important mitigation is removing the webdriver flag. Beyond that, a realistic user agent string, a standard viewport size, and a realistic locale and timezone cover most detection methods short of sophisticated fingerprint analysis.
```python
# hard_parts.py -- Part 1: Anti-bot stealth launch
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python hard_parts.py

import asyncio
import json
from pathlib import Path
from playwright.async_api import async_playwright


async def launch_stealth_browser(playwright):
    """
    Launch a browser context that looks more like a real human session.
    Covers: realistic viewport, user-agent, locale, timezone, webdriver flag.
    Note: For serious anti-bot targets, consider a paid service like Browserbase.
    """
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",  # Hides webdriver detection
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ]
    )

    context = await browser.new_context(
        viewport={"width": 1366, "height": 768},  # Common desktop resolution
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
        java_script_enabled=True,
    )

    # Remove the 'webdriver' property that Playwright injects by default.
    # Bot detection systems check for this in the browser's JS environment.
    await context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )

    return browser, context
```
What this does: The add_init_script() call runs before any page JavaScript executes, which means the navigator.webdriver override is in place before the site’s detection code can check for it. The --disable-blink-features=AutomationControlled launch argument removes a separate automation flag at the browser engine level. Together, these two changes handle the most common detection methods.
For sites with aggressive fingerprinting and CAPTCHA systems, these mitigations will not be enough. Services like Browserbase, Spidra, and Bright Data’s Scraping Browser handle CAPTCHA solving, residential IP rotation, and browser fingerprint management as managed infrastructure.
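The launch settings above do not address the third signal mentioned earlier, interaction patterns that are too fast or too uniform. One simple way to vary pacing is to add a small randomized pause between actions. The sketch below is an illustration only: human_delay and its default values are hypothetical, and this is about pacing between actions, not a substitute for waiting on page content.

```python
import random
from typing import Optional


def human_delay(
    base: float = 0.4,
    jitter: float = 0.3,
    rng: Optional[random.Random] = None,
) -> float:
    """
    Return a pause in seconds drawn uniformly from [base - jitter, base + jitter],
    never shorter than 0.05 s. Pass a seeded random.Random for reproducible runs.
    """
    source = rng or random
    return max(0.05, source.uniform(base - jitter, base + jitter))


if __name__ == "__main__":
    seeded = random.Random(42)
    print([round(human_delay(rng=seeded), 2) for _ in range(5)])
```

In an async tool you would use it as await asyncio.sleep(human_delay()) between a fill() and the following click(), so the pause yields to the event loop instead of blocking it.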
2. Smart Waiting
The second failure mode is timing. The reflex is to add time.sleep() calls and increase them when things break. This is wrong in both directions: too short on slow connections, too long on fast ones, and completely opaque when debugging.
Playwright offers several event-based wait strategies, and the four below cover most cases. Use the one that matches what you are actually waiting for:
```python
# Part 2: Smart waiting strategies (add to your scraper or agent tools)

async def smart_wait_examples(page):
    """
    Four ways to wait for the right page state, without arbitrary sleeps.
    """

    # STRATEGY 1: Wait for a specific element to appear in the DOM
    # Use when you know exactly what element signals content has loaded
    await page.wait_for_selector(".product-list", state="visible", timeout=10000)

    # STRATEGY 2: Wait for a specific API response
    # Use when the content comes from an XHR/fetch call you can identify
    async with page.expect_response(
        lambda r: "/api/products" in r.url and r.status == 200
    ) as response_info:
        await page.click("#load-more")
    response = await response_info.value
    print(f"API responded: {response.status}")

    # STRATEGY 3: Wait for the URL to change after form submission
    # Use when a successful submit redirects to a new page
    await page.wait_for_url("**/dashboard**", timeout=10000)

    # STRATEGY 4: Wait for a JavaScript variable to be set
    # Use when no visual element reliably signals the ready state
    await page.wait_for_function(
        "() => window.__dataLoaded === true",
        timeout=10000
    )
```
What this does: Each strategy is tied to a specific observable event rather than an arbitrary time delay. wait_for_selector watches the DOM. expect_response hooks into the network layer. wait_for_url monitors navigation. wait_for_function evaluates JavaScript in the browser context. Use whichever one most directly signals “the thing I need is now ready.”
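The same principle applies to things Playwright cannot observe for you, such as a file appearing on disk or a background job finishing. The sketch below is a plain-asyncio illustration of condition-based waiting, not a Playwright API; wait_until and its parameters are hypothetical names.

```python
import asyncio
import time
from typing import Callable


async def wait_until(
    condition: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.05,
) -> float:
    """
    Poll `condition` until it returns True, then return the seconds waited.
    Raises asyncio.TimeoutError if the deadline passes first.
    """
    start = time.monotonic()
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise asyncio.TimeoutError(f"Condition not met within {timeout} s")
        # asyncio.sleep yields to the event loop; time.sleep would block it.
        await asyncio.sleep(interval)


async def demo() -> None:
    state = {"ready": False}

    async def finish_later() -> None:
        await asyncio.sleep(0.2)  # Simulated slow background job
        state["ready"] = True

    task = asyncio.create_task(finish_later())
    waited = await wait_until(lambda: state["ready"], timeout=2.0)
    await task
    print(f"Ready after {waited:.2f} s")


if __name__ == "__main__":
    asyncio.run(demo())
```

Like the Playwright calls above, the helper returns as soon as the condition holds and fails explicitly when it never does, which is the behavior a fixed sleep cannot give you.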
3. Session and Cookie Persistence
The third failure mode is losing session state. If your agent logs into a site during step one and then the browser context is destroyed, step two has no authentication. Recreating the login on every run is slow and can trigger rate limiting or lockout.
The solution is saving cookies to disk after login and loading them at the start of every subsequent run:
```python
# Part 3: Session persistence across runs

COOKIES_FILE = Path("session_cookies.json")


async def save_session(context) -> None:
    """Save browser cookies to disk after a successful login."""
    cookies = await context.cookies()
    COOKIES_FILE.write_text(json.dumps(cookies, indent=2))
    print(f"Session saved: {len(cookies)} cookies written.")


async def load_session(context) -> bool:
    """Load saved cookies before navigating. Returns True if session was found."""
    if not COOKIES_FILE.exists():
        print("No saved session. Fresh login required.")
        return False
    cookies = json.loads(COOKIES_FILE.read_text())
    await context.add_cookies(cookies)
    print(f"Session restored: {len(cookies)} cookies loaded.")
    return True
```
What this does: context.cookies() returns all cookies for the current browser context, including session tokens and authentication cookies. Writing them to JSON and reloading them on the next run means the browser starts in an authenticated state. Note that sessions expire; add a check that falls back to a fresh login if the saved session returns a redirect to the login page.
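A cheap first check, before making any request, is to discard saved cookies whose expiry time has already passed. This sketch assumes the cookie dictionaries have the shape that context.cookies() returns, where expires is a Unix timestamp in seconds and -1 marks a session cookie. The helper name drop_expired is hypothetical.

```python
import time
from typing import Dict, List, Optional


def drop_expired(cookies: List[Dict], now: Optional[float] = None) -> List[Dict]:
    """
    Return only the cookies that are still valid at `now` (Unix seconds).
    Cookies with no 'expires' field, or with expires == -1, are treated as
    session cookies and kept.
    """
    current = time.time() if now is None else now
    fresh = []
    for cookie in cookies:
        expires = cookie.get("expires", -1)
        if expires == -1 or expires > current:
            fresh.append(cookie)
    return fresh


if __name__ == "__main__":
    saved = [
        {"name": "auth", "expires": time.time() + 3600},
        {"name": "old", "expires": time.time() - 3600},
        {"name": "tmp", "expires": -1},
    ]
    print([c["name"] for c in drop_expired(saved)])
```

This does not replace the login-redirect check, because a server can invalidate a session well before its cookie expires. It only avoids loading cookies that are certain to be rejected.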
Deploying Browser Agents
Getting a browser agent working locally is one thing. Running it reliably in a cloud environment is another.
The main difference between a Python script that works on your laptop and one that fails in CI is system dependencies. Playwright’s Chromium browser requires a set of shared libraries that are present on most developer machines but absent from minimal cloud images. The cleanest solution is Docker.
Dockerfile — build a container that ships everything Playwright needs:
```dockerfile
# Dockerfile for headless Playwright-based browser agent
# Build: docker build -t browser-agent .
# Run:   docker run --rm -e OPENAI_API_KEY=your_key browser-agent

FROM python:3.11-slim

# Install system dependencies required by Chromium
RUN apt-get update && apt-get install -y \
    libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 \
    libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \
    libxrandr2 libgbm1 libasound2 libpangocairo-1.0-0 \
    libpango-1.0-0 libcairo2 libx11-6 libxext6 libxfixes3 \
    fonts-liberation wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies first (cached layer -- only rebuilds on requirements change)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browser binaries into the image
RUN playwright install chromium
RUN playwright install-deps chromium

# Copy application code last (changes here don't invalidate the pip/playwright layers)
COPY . .

CMD ["python", "agent_tools.py"]
```

requirements.txt:

```text
playwright
browser-use
langchain
langchain-openai
langgraph
python-dotenv
```
For concurrent workloads running multiple browser sessions in parallel, use Playwright’s async API with asyncio.gather():
```python
# Parallel scraping with semaphore rate limiting
# Runs up to 3 browser sessions simultaneously

import asyncio
from playwright.async_api import async_playwright

async def scrape_url(browser, url: str, semaphore: asyncio.Semaphore) -> dict:
    """Scrape a single URL, respecting the concurrency semaphore."""
    async with semaphore:
        context = await browser.new_context()
        try:
            page = await context.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            title = await page.title()
            return {"url": url, "title": title}
        finally:
            # Close context (not browser) to release resources, even if navigation fails
            await context.close()

async def scrape_parallel(urls: list[str], max_concurrent: int = 3) -> list[dict]:
    """Scrape a list of URLs in parallel, capped at max_concurrent sessions."""
    semaphore = asyncio.Semaphore(max_concurrent)  # Cap concurrent sessions

    async with async_playwright() as p:
        # One browser shared across all contexts -- much cheaper than one browser per URL
        browser = await p.chromium.launch(headless=True)
        tasks = [scrape_url(browser, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()

    return list(results)
```
What this does: The asyncio.Semaphore(max_concurrent) caps how many browser contexts run at the same time. Without it, launching 50 concurrent browser contexts will exhaust memory. One browser process is shared across all contexts; a context is cheap; a full browser instance is not.
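To see the cap in action without launching a browser, the same pattern can be exercised with stand-in jobs. The sketch below is illustrative only: `run_capped` and `fake_scrape` are hypothetical names, and the short `asyncio.sleep` stands in for a real `page.goto`. It also passes `return_exceptions=True` to `asyncio.gather()`, one possible way to keep a single failed URL from discarding the other results.

```python
import asyncio


async def run_capped(jobs, max_concurrent: int = 3):
    """Run coroutine factories with at most `max_concurrent` in flight.

    Returns (results, peak), where peak is the highest number of jobs
    that were running at the same moment.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    # return_exceptions=True: a failing job yields its exception as a result
    # instead of aborting the whole batch.
    results = await asyncio.gather(
        *(guarded(job) for job in jobs), return_exceptions=True
    )
    return results, peak


async def fake_scrape(i: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for page.goto(...) and extraction
    return i


if __name__ == "__main__":
    jobs = [lambda i=i: fake_scrape(i) for i in range(10)]
    results, peak = asyncio.run(run_capped(jobs, max_concurrent=3))
    print(f"results={results} peak_concurrency={peak}")
```

The reported peak never exceeds `max_concurrent`, however many jobs are queued; results come back in the order the jobs were submitted.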
On the managed infrastructure side, Amazon Nova Act launched in March 2025 as a dedicated SDK for building browser agents on AWS, integrating natively with Playwright for browser control. Playwright’s own MCP server gives AI assistants full browser control through the Model Context Protocol, using structured accessibility snapshots rather than screenshots, which means token costs stay low while the agent’s understanding of the page stays high.
Putting It All Together
Here is a complete end-to-end agent that takes a research question, navigates to a public data source, extracts structured results, and returns a clean summary. It uses the browser tools from Section 5 orchestrated by a LangGraph agent.
How to run: Save the code as reference_agent.py, make sure OPENAI_API_KEY is set in your .env file, and run python reference_agent.py.
```python
# reference_agent.py
# Full browser-using AI agent: navigates, extracts, summarizes.
# Target: books.toscrape.com (public scraping sandbox)
# Prerequisites: pip install playwright langchain langchain-openai langgraph python-dotenv
#                playwright install chromium
# How to run:    python reference_agent.py

import asyncio
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.prebuilt import create_react_agent
from playwright.async_api import async_playwright

load_dotenv()

# ── BROWSER STATE ─────────────────────────────────────────────────────────────
_browser = None
_context = None
_page = None
_playwright = None

async def get_page():
    """Lazily start Playwright and return the single shared page."""
    global _browser, _context, _page, _playwright
    if _browser is None:
        _playwright = await async_playwright().start()
        _browser = await _playwright.chromium.launch(headless=True)
        _context = await _browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        )
        # Remove webdriver fingerprint
        await _context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        _page = await _context.new_page()
    return _page

async def teardown():
    """Close the browser and reset all shared state."""
    global _browser, _context, _page, _playwright
    if _browser:
        await _browser.close()
        await _playwright.stop()
    _browser = None
    _context = None
    _page = None
    _playwright = None

# ── TOOLS ─────────────────────────────────────────────────────────────────────
@tool
async def navigate(url: str) -> str:
    """
    Navigate the browser to a URL and return the page's text content.
    Use when you need to open a website or move to a new page.
    Input: full URL with https:// prefix.
    """
    page = await get_page()
    await page.goto(url, wait_until="domcontentloaded", timeout=20000)
    await page.wait_for_load_state("networkidle")
    content = await page.inner_text("body")
    return content[:4000]  # Truncate so the page text fits in the LLM context

@tool
async def extract_structured(css_selector: str) -> str:
    """
    Extract text from all elements matching a CSS selector on the current page.
    Use when you need to pull specific elements from the loaded page.
    Input: valid CSS selector string (e.g., 'h3 a', '.price_color', 'article.product_pod').
    """
    page = await get_page()
    try:
        await page.wait_for_selector(css_selector, timeout=5000)
        elements = await page.query_selector_all(css_selector)
        texts = []
        for el in elements[:20]:  # Cap at 20 elements to keep output manageable
            text = await el.inner_text()
            texts.append(text.strip())
        return "\n".join(texts) if texts else "No elements found."
    except Exception as e:
        return f"Extraction failed: {str(e)}"

@tool
async def get_current_url() -> str:
    """Return the URL the browser is currently on. No input required."""
    page = await get_page()
    return page.url

# ── AGENT ─────────────────────────────────────────────────────────────────────
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

tools = [navigate, extract_structured, get_current_url]
agent = create_react_agent(llm, tools)

SYSTEM = (
    "You are a browser-based research agent. You have access to a real browser. "
    "Use navigate() to open pages, extract_structured() to pull specific elements, "
    "and get_current_url() to check where you are. "
    "Always navigate first, then extract. Be concise in your final answer."
)

async def run_agent(query: str) -> str:
    try:
        result = await agent.ainvoke({
            "messages": [
                SystemMessage(content=SYSTEM),
                HumanMessage(content=query)
            ]
        })
        return result["messages"][-1].content
    finally:
        # Always close the browser, even if the agent run raises
        await teardown()

# ── DEMO ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    query = (
        "Go to https://books.toscrape.com and extract the titles and prices "
        "of the first 5 books listed. Return them as a structured list."
    )
    print(f"Query: {query}\n")
    answer = asyncio.run(run_agent(query))
    print(f"Answer:\n{answer}")
```
What this does: This agent has three clean tools: navigate, extract_structured, and get_current_url, plus a system prompt that tells it exactly when to use each one. The agent calls navigate to load the page, extract_structured to pull the book titles and prices by CSS selector, and synthesizes a structured list in the final answer. The teardown() call after the agent finishes closes the browser cleanly so no zombie Chromium processes are left running.
Conclusion
The browser is not a specialized tool for automation engineers. It is the universal interface for the web, and the web is where most of the world’s actual work gets done. An AI agent that can use a browser does not need a partner team maintaining API integrations. It can reach anything a human can reach.
What makes this practical now, not just theoretically interesting, is the maturity of the tooling. Playwright handles the hard parts of browser interaction. browser-use removes the need to write selectors for exploratory tasks. LangGraph gives the LLM clean tool hooks and a reasoning loop that handles variable page structures. The patterns in this article are not demos. They are the same patterns 51% of enterprises now running AI agents in production are building on.
Start with the scraping example. Get it running against a site you actually need data from. Add the agent layer when you need decisions the script cannot anticipate. Add browser-use when the page structure is too dynamic for selectors. Deploy in Docker when you need it running somewhere other than your laptop.
The hard part is not the code. It is knowing which tool to reach for at each layer. Hopefully this article made that clearer.





