Building Browser-Using AI Agents in Python

By Shittu Olumide on June 22, 2026 in Artificial Intelligence

In this article, you will learn how to build AI agents that can browse and interact with real websites using Playwright, browser-use, and LangGraph.

Topics we will cover include:

Why Playwright is the right foundation for browser automation in 2026, and how it differs from Selenium.
How to scrape dynamic, JavaScript-rendered pages and complete multi-step forms reliably.
How to wire browser actions into LangGraph and browser-use agents, handle anti-bot detection, manage waiting and session persistence, and deploy the result in Docker.

Introduction

Most AI agent tutorials start with an API. They show you how to call OpenWeather, hit the Stripe endpoint, pull data from GitHub. That is a fine starting point until you try to build something real and realize that the task you actually need done does not have an API.

Think about what humans do with browsers every day: filing government forms, reading competitor pricing, extracting research from sites that guard their data behind JavaScript rendering, logging into portals that have never heard of OAuth. There are roughly 1.1 billion websites on the internet. A vanishingly small fraction of them have public APIs. The rest only speak browser.

An agent that is limited to API calls handles maybe 5% of the tasks a human worker does daily. Give that agent a browser, and the coverage approaches everything. That is the gap this article closes.

The global AI agents market stands at $10.91 billion in 2026 and is projected to reach $50.31 billion by 2030, with browser-capable agents at the center of that growth. 27.7% of enterprises are already running agentic browsers in production, up from virtually none two years prior. The tooling has matured fast, and the patterns are settled enough to teach properly.

By the end of this article, you will have a working browser agent that navigates real websites, fills forms, extracts structured data, and connects to an LLM that decides what to do next, all in Python.

Why Playwright, Not Selenium

If you built browser automation five years ago, you built it with Selenium. Selenium is still widely deployed, still works, and is not going anywhere. But for any new project in 2026, Playwright is the default. The reasons are practical, not theoretical.

Selenium communicates with the browser by sending individual HTTP requests to a WebDriver server. Every action (a click, a keystroke, a scroll) is a separate request. Playwright holds a persistent WebSocket connection for the entire session, so commands flow through that channel without the overhead of a new HTTP request per action. Independent benchmarks consistently show Playwright running 30-50% faster than Selenium at the test-suite level and averaging ~290ms per action versus Selenium’s ~536ms. For a browser agent that might execute hundreds of actions, that gap compounds.
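To make the compounding concrete, here is a back-of-the-envelope calculation using the per-action averages quoted above. The workload of 300 actions is an assumed figure chosen for illustration, not a measurement.

```python
# Per-action averages quoted in the text, in milliseconds.
PLAYWRIGHT_MS_PER_ACTION = 290
SELENIUM_MS_PER_ACTION = 536


def session_overhead_seconds(actions: int) -> float:
    """Extra wall-clock time Selenium would spend over Playwright in one session."""
    delta_ms = SELENIUM_MS_PER_ACTION - PLAYWRIGHT_MS_PER_ACTION
    return actions * delta_ms / 1000


# Assumed workload: an agent session with 300 browser actions.
print(f"{session_overhead_seconds(300):.1f} s")  # 300 * 246 ms = 73.8 s
```

Over a minute of difference per session is small for a single run, but it adds up quickly when many agent sessions run every day.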

Playwright also manages its own browser binaries. Its install command downloads builds of Chromium, Firefox, and WebKit that are matched to your Playwright version, so there are no driver version mismatches and no broken CI pipelines because someone updated Chrome. It also has built-in auto-waiting: before it clicks an element, it verifies that the element is visible, enabled, and not animating. You do not have to write time.sleep(2) and hope for the best.

For AI agents specifically, Playwright fires real mouse and keyboard events that mirror how humans interact with browsers. Sites designed to detect automation look for synthetic DOM clicks. Playwright’s interaction model is harder to distinguish from genuine human input.

A side-by-side architecture comparison diagram

There is also the browser-use library, which sits one level higher. Browser-use is a Python library that gives an LLM a working browser. Under the hood, it uses Playwright to drive the browser, but the LLM reads the page state and decides what to click, type, and extract, with no CSS selectors required. You give it a task in plain English, and it figures out the rest. We will cover both raw Playwright and browser-use in this article, because they serve different needs: Playwright when you want precise, predictable control; browser-use when you want the agent to handle navigation decisions autonomously.

Setting Up the Environment

You need Python 3.10 or higher, an OpenAI API key, and about five minutes.

Step 1: Create a virtual environment

python -m venv browser_agent_env

# macOS / Linux
source browser_agent_env/bin/activate

# Windows
browser_agent_env\Scripts\activate

Step 2: Install dependencies

pip install playwright \
            browser-use \
            langchain \
            langchain-openai \
            langgraph \
            langchain-community \
            python-dotenv

Step 3: Install the browser binaries

This is the step most people miss. The Python package does not include the browsers themselves; Playwright downloads its Chromium, Firefox, and WebKit builds in a separate step. Run this once after installing:

playwright install chromium

If you want all three browser engines: playwright install. Chromium alone is sufficient for most agent work and is smaller to download.

Step 4: Store your API key
Create a .env file in your project directory:

OPENAI_API_KEY=your_openai_api_key_here

Add .env to your .gitignore immediately. Do not commit API keys.
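A missing key is easier to diagnose at startup than halfway through an agent run. The helper below is a minimal sketch of that check; `require_env` is a hypothetical name, and it assumes the variable has already been loaded into the environment (for example by calling `load_dotenv()` from python-dotenv first).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Report the problem immediately instead of failing on the first LLM call.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY found")
except RuntimeError as exc:
    print(f"Configuration error: {exc}")
```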

Step 5: Verify everything works
Here is a first script that navigates to a URL, reads the heading, and saves a screenshot. Use example.com, a publicly available test domain maintained by IANA that will not block you.

How to run: Save as first_run.py and run python first_run.py

# first_run.py
# Navigate to a URL, take a screenshot, and extract the page title.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python first_run.py

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        # Launch Chromium in headless mode (no visible browser window).
        # Set headless=False if you want to watch it run during development.
        browser = await p.chromium.launch(headless=True)

        # A browser context is like a fresh browser profile.
        # It isolates cookies, storage, and cache from other contexts.
        context = await browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        )

        page = await context.new_page()

        # Navigate to the URL and wait until the network is idle.
        # "networkidle" means no open network connections for 500ms.
        # For faster pages, "domcontentloaded" is sufficient.
        await page.goto("https://example.com", wait_until="networkidle")

        # Extract the page title
        title = await page.title()
        print(f"Page title: {title}")

        # Extract the text content of the h1 heading
        h1 = await page.text_content("h1")
        print(f"H1 heading: {h1}")

        # Take a full-page screenshot and save it to disk
        await page.screenshot(path="screenshot.png", full_page=True)
        print("Screenshot saved to screenshot.png")

        await browser.close()

asyncio.run(main())

What this does: async_playwright() is the entry point for the entire Playwright session. A browser context is equivalent to opening a fresh incognito window; cookies, local storage, and cache are isolated from everything else. wait_until="networkidle" tells Playwright to wait until there have been no network connections for at least 500 ms before your code continues. It is a convenient catch-all for a first script, but Playwright's documentation discourages relying on it: a page that polls or streams in the background may never go idle, so waiting for a specific element is usually the more reliable strategy.

If this runs and saves a screenshot, your environment is working correctly.

Web Navigation and Scraping

The reason you need Playwright instead of requests + BeautifulSoup is JavaScript rendering. Modern websites deliver a skeleton of HTML and then build the actual content dynamically after the page loads: React, Vue, Angular, Next.js. A plain HTTP request fetches the skeleton. Playwright runs a real browser, so it sees exactly what a human sees after all JavaScript has executed.
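The difference is easy to see without a network call. The sketch below parses a hypothetical single-page-app shell, standing in for what a plain HTTP request might return before any JavaScript has run, and collects the text a reader would actually see. The markup is invented for illustration, and the parser uses only the standard library.

```python
from html.parser import HTMLParser

# Hypothetical app shell: the HTML served before any JavaScript executes.
SPA_SHELL = """
<!doctype html>
<html>
  <head><title>Shop</title><script src="/static/app.js"></script></head>
  <body><div id="root"></div></body>
</html>
"""


class VisibleText(HTMLParser):
    """Collect readable text, ignoring <script>, <style>, and <title>."""

    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)


# The shell contains an empty mount point and a script tag, nothing to scrape.
print(repr(visible_text(SPA_SHELL)))
```

The product data only exists after the script has run and populated the mount point, which is the step a real browser performs and a plain HTTP client does not.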

The target below is books.toscrape.com, a legal scraping sandbox built for practice. It paginates results, encodes star ratings in CSS class names, and closely mirrors the structure of real e-commerce product pages.

How to run: Save as scrape_books.py and run python scrape_books.py

# scrape_books.py
# Scrape book titles, prices, and ratings from books.toscrape.com
# This is a legal scraping sandbox site built for practice.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python scrape_books.py

import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_books(max_pages: int = 3) -> list[dict]:
    """
    Scrape book listings from books.toscrape.com across multiple pages.
    Returns a list of dicts with title, price, rating, and page number.
    """
    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={"width": 1280, "height": 720})
        page = await context.new_page()

        for page_num in range(1, max_pages + 1):
            url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
            print(f"Scraping page {page_num}: {url}")

            await page.goto(url, wait_until="domcontentloaded")

            # Wait for the product cards to be visible before extracting.
            # This is critical on JavaScript-heavy pages where content loads after the HTML.
            # timeout=10000 means wait up to 10 seconds before raising an error.
            await page.wait_for_selector("article.product_pod", timeout=10000)

            # Get all book cards on the current page
            books = await page.query_selector_all("article.product_pod")

            for book in books:
                # Extract title from the <a> tag's title attribute
                title_el = await book.query_selector("h3 a")
                title = await title_el.get_attribute("title") if title_el else "N/A"

                # Extract price text
                price_el = await book.query_selector(".price_color")
                price = await price_el.inner_text() if price_el else "N/A"

                # Extract star rating from the CSS class name.
                # e.g. <p class="star-rating Three"> → "Three"
                rating_el = await book.query_selector("p.star-rating")
                rating_class = await rating_el.get_attribute("class") if rating_el else ""
                rating = rating_class.replace("star-rating", "").strip()

                results.append({
                    "title": title,
                    "price": price,
                    "rating": rating,
                    "page": page_num
                })

            print(f"  Extracted {len(books)} books from page {page_num}")

        await browser.close()

    return results


async def main():
    books = await scrape_books(max_pages=2)
    print(f"\nTotal books scraped: {len(books)}")
    print(json.dumps(books[:3], indent=2))


asyncio.run(main())

What this does: wait_for_selector() is the key call here. Instead of sleeping for a fixed time and hoping the content has loaded, it watches the DOM and proceeds the moment the target element appears, or raises a TimeoutError if it does not appear within the timeout window. That is the right behavior: fail fast and explicitly rather than silently extracting from an empty page.
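The idea behind that behavior can be sketched with plain asyncio. This is a conceptual analogue only, not how Playwright implements wait_for_selector(); `wait_until` and the simulated page state are hypothetical names used for illustration.

```python
import asyncio
import time


async def wait_until(condition, timeout: float = 10.0, interval: float = 0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        await asyncio.sleep(interval)


async def demo() -> str:
    state = {"ready": False}

    async def render_later():
        # Stand-in for a page that finishes rendering after 0.2 seconds.
        await asyncio.sleep(0.2)
        state["ready"] = True

    task = asyncio.create_task(render_later())
    # Returns shortly after the content appears, well before the 2 s limit.
    await wait_until(lambda: state["ready"], timeout=2.0)
    await task
    return "content appeared"


print(asyncio.run(demo()))
```

A fixed sleep has to be tuned for the slowest page you expect; a condition-based wait costs only as long as the page actually takes and fails loudly when the content never arrives.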

The rating extraction deserves attention. The star rating is encoded as a CSS class (star-rating Three), not a number. The code strips “star-rating” from the class string to get the text value. This is the kind of thing you only know by inspecting the actual HTML. When you hand this task to a raw LLM with no browser, it has no way to know what the class structure looks like. With Playwright, you can inspect it directly and extract it exactly.
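Once the raw strings are extracted, downstream code usually wants numbers. The helpers below are a minimal sketch of that normalization step; the function names are hypothetical, and the price format (a currency symbol followed by a decimal number) is an assumption about how the sandbox renders prices.

```python
from typing import Optional

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(class_attr: str) -> Optional[int]:
    """Map a class string such as 'star-rating Three' to the integer 3."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


def price_to_float(price_text: str) -> float:
    """Keep digits and the decimal point, dropping any currency symbols."""
    digits = "".join(ch for ch in price_text if ch.isdigit() or ch == ".")
    return float(digits)


print(rating_from_class("star-rating Three"))  # 3
print(price_to_float("£51.77"))                # 51.77
```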

Form Completion and Multi-Step Flows

Filling forms is where browser agents earn their keep and where most automation scripts fail. The reason is that web forms are not just inputs and buttons. They fire focus, input, change, and blur events in sequence. JavaScript validation listens for those events. If you inject a value into an input field by directly setting value in the DOM (as older automation tools often do), the validation listeners never fire and the form breaks.

Playwright’s fill() and click() methods fire real browser events in the right order, which is why they work on form validation that would block lower-level approaches.

The target below is the-internet.herokuapp.com/login, a public test site maintained specifically for automation practice. It accepts tomsmith / SuperSecretPassword! as valid credentials and returns clear success/failure messages.

How to run: Save as form_submit.py and run python form_submit.py

# form_submit.py
# Complete and submit a multi-field login form on a public demo site.
# Target: https://the-internet.herokuapp.com/login (public test site)
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python form_submit.py

import asyncio
from playwright.async_api import async_playwright

async def login_and_verify(username: str, password: str) -> dict:
    """
    Attempt to log in to a demo site and return whether it succeeded.
    Handles: input filling, button clicking, and result verification.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto("https://the-internet.herokuapp.com/login")

        # Wait for the form to be visible before interacting.
        # state="visible" is the default but makes the intent explicit.
        await page.wait_for_selector("#username", state="visible")

        # fill() focuses the field, replaces any existing value, and
        # dispatches an input event so JavaScript listeners see the change.
        await page.fill("#username", username)
        await page.fill("#password", password)

        # click() fires real mouse events -- mousedown, mouseup, click.
        # This triggers JavaScript listeners that a plain DOM click misses.
        await page.click("button[type='submit']")

        # Wait for the page to settle after form submission
        await page.wait_for_load_state("networkidle")

        # Check which result element appeared
        success_el = await page.query_selector(".flash.success")
        error_el = await page.query_selector(".flash.error")

        if success_el:
            message = await success_el.inner_text()
            result = {"success": True, "message": message.strip()}
        elif error_el:
            message = await error_el.inner_text()
            result = {"success": False, "message": message.strip()}
        else:
            result = {"success": False, "message": "Unknown result"}

        await browser.close()
        return result


async def main():
    # Valid credentials for the demo site
    result = await login_and_verify("tomsmith", "SuperSecretPassword!")
    print(f"Valid login:   {result}")

    # Invalid credentials to verify error handling
    result_fail = await login_and_verify("wronguser", "wrongpass")
    print(f"Invalid login: {result_fail}")


asyncio.run(main())

What this does: The pattern here, fill() → click() → wait_for_load_state() → check for result element, is the template for almost any form interaction. The wait_for_load_state("networkidle") call after the submit is important: without it, you may query the DOM before the page has updated and get the pre-submission state, not the result.
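A variation on the same template is to wait for the result element itself rather than for the network to go quiet. The helper below is a sketch under stated assumptions: `submit_and_read_flash` is a hypothetical name, `page` is expected to be a Playwright async Page, and the default `.flash` selector is assumed to match both the success and error banners on the demo site.

```python
async def submit_and_read_flash(page, fields: dict, submit_selector: str,
                                result_selector: str = ".flash",
                                timeout_ms: int = 10000) -> str:
    """Fill fields, submit, then wait for the result element and return its text."""
    for selector, value in fields.items():
        await page.fill(selector, value)
    await page.click(submit_selector)
    # Wait for the element that signals an outcome instead of waiting
    # for all network activity to stop.
    await page.wait_for_selector(result_selector, state="visible", timeout=timeout_ms)
    text = await page.inner_text(result_selector)
    return text.strip()
```

Inside login_and_verify() it could be called as `await submit_and_read_flash(page, {"#username": username, "#password": password}, "button[type='submit']")`. Because the function only relies on the methods it calls, it can also be exercised in unit tests with a small stub object in place of a real page.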

For more complex forms with file uploads, dropdowns, and checkboxes:

# File upload
await page.set_input_files("#file-upload", "/path/to/document.pdf")

# Select dropdown by visible label text
await page.select_option("#country-select", label="Nigeria")

# Check a checkbox
await page.check("#agree-terms")

# Handle a modal dialog (confirm/alert)
page.on("dialog", lambda dialog: asyncio.ensure_future(dialog.accept()))

Tool Orchestration with LangChain and LangGraph

Raw Playwright scripts are powerful but fixed. They do exactly what you coded, no more. The moment a page changes its structure, or the task requires a decision the script did not anticipate, it breaks.

Connecting Playwright to an LLM changes this. Browser actions become tools the agent can call when it decides they are needed. The agent reads the task, reasons about what to do, calls a tool, reads the result, and decides what to do next. That loop handles variation that a fixed script cannot.

This is the bridge from “browser automation script” to “AI agent.”
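Before bringing in a real model, it can help to see the loop in isolation. The sketch below replaces the LLM with a scripted policy and the browser with two stub functions, so every name in it is hypothetical; it only illustrates the reason, act, observe cycle.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical tools: plain functions standing in for browser actions.
def navigate(url: str) -> str:
    return f"Loaded {url}: Example Domain"


def screenshot(filename: str) -> str:
    return f"Saved {filename}"


TOOLS: Dict[str, Callable[[str], str]] = {"navigate": navigate, "screenshot": screenshot}


def scripted_policy(task: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: pick the next (tool, argument) or ('finish', answer)."""
    if not observations:
        return "navigate", "https://example.com"
    if len(observations) == 1:
        return "screenshot", "example.png"
    return "finish", f"Done after {len(observations)} tool calls"


def run_agent(task: str, policy=scripted_policy, max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        action, arg = policy(task, observations)   # reason about the next step
        if action == "finish":
            return arg
        observations.append(TOOLS[action](arg))    # act, then record the observation
    return "Stopped: step limit reached"


print(run_agent("Visit example.com and take a screenshot"))  # Done after 2 tool calls
```

The step limit matters in practice: a real model can loop on a page it does not understand, and a hard cap keeps that failure bounded.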

How to run: Save as agent_tools.py, ensure OPENAI_API_KEY is in your .env, then run python agent_tools.py

# agent_tools.py
# LangGraph agent with three browser tools: navigate_and_extract, fill_and_submit_form, take_screenshot
# Prerequisites: pip install playwright langchain langchain-openai langgraph python-dotenv
#                playwright install chromium
# How to run: python agent_tools.py

import asyncio
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from playwright.async_api import async_playwright

load_dotenv()

# ── SHARED BROWSER STATE ──────────────────────────────────────────────────────
# We keep a single browser instance alive for the agent's lifetime.
# Creating and destroying a browser on every tool call is slow and wasteful.
_browser = None
_page = None
_playwright = None

async def get_page():
    """Return the shared page, launching the browser if needed."""
    global _browser, _page, _playwright
    if _browser is None:
        _playwright = await async_playwright().start()
        _browser = await _playwright.chromium.launch(headless=True)
        context = await _browser.new_context(viewport={"width": 1280, "height": 720})
        _page = await context.new_page()
    return _page


async def close_browser():
    """Clean up browser resources when the agent session ends."""
    global _browser, _page, _playwright
    if _browser:
        await _browser.close()
        await _playwright.stop()
        _browser = None
        _page = None
        _playwright = None


# ── BROWSER TOOLS ─────────────────────────────────────────────────────────────
# Note: these are async tools (async def). LangChain's @tool decorator supports
# async functions directly, and the agent must be invoked with ainvoke() so that
# tool calls run on the same event loop instead of trying to start a second one.

@tool
async def navigate_and_extract(url: str) -> str:
    """
    Navigate to a URL and return the visible text content of the page.
    Use this to visit websites and read their content.
    Input: a full URL string including https:// (e.g., 'https://example.com').
    """
    page = await get_page()
    await page.goto(url, wait_until="domcontentloaded", timeout=15000)
    await page.wait_for_load_state("networkidle")
    content = await page.inner_text("body")
    # Truncate to avoid flooding the LLM context window.
    # Slicing is a no-op when the text is already shorter than 3000 characters.
    return content[:3000]


@tool
async def fill_and_submit_form(selector_value_pairs: str) -> str:
    """
    Fill form fields and submit a form on the currently loaded page.
    Input: a comma-separated string of 'selector:value' pairs ending with 'submit:button_selector'.
    Example: '#email:user@example.com,#password:secret,submit:button[type=submit]'
    """
    page = await get_page()
    try:
        pairs = selector_value_pairs.split(",")
        submit_selector = None

        for pair in pairs:
            key, val = pair.split(":", 1)
            key = key.strip()
            val = val.strip()
            if key == "submit":
                submit_selector = val
            else:
                await page.fill(key, val)

        if submit_selector:
            await page.click(submit_selector)
            await page.wait_for_load_state("networkidle")

        return f"Form submitted. Current URL: {page.url}"
    except Exception as e:
        return f"Form interaction failed: {str(e)}"


@tool
async def take_screenshot(filename: str) -> str:
    """
    Take a screenshot of the current browser page and save it to a file.
    Use this to visually verify the current state of the page.
    Input: filename string (e.g., 'result.png').
    """
    page = await get_page()
    await page.screenshot(path=filename, full_page=False)
    return f"Screenshot saved to {filename}"


# ── AGENT SETUP ───────────────────────────────────────────────────────────────

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

tools = [navigate_and_extract, fill_and_submit_form, take_screenshot]

# create_react_agent wires together the LLM, the tools, and the ReAct reasoning loop.
# The agent decides which tool to call, calls it, reads the result, and continues.
agent = create_react_agent(llm, tools)


# ── DEMO ──────────────────────────────────────────────────────────────────────

async def main():
    result = await agent.ainvoke({
        "messages": [HumanMessage(
            content=(
                "Go to https://example.com, read the page content, "
                "then take a screenshot called example.png"
            )
        )]
    })
    print(result["messages"][-1].content)
    await close_browser()


asyncio.run(main())

What this does: The three @tool-decorated functions are registered with the agent. Each docstring is what the LLM reads to understand what the tool does and when to use it. Write them like job descriptions, not code comments. The shared _browser and _page globals mean the browser stays open across multiple tool calls, which is essential for tasks that span several pages in the same session. Because the tools are defined with async def, the agent is invoked with ainvoke() rather than invoke(), so the tool calls run on the same event loop that main() is already using.

Figure: A vertical flow diagram showing how a task request flows through the agent. Image by Editor.

The key design decision in this snippet is the shared browser instance. If each tool call launched and closed its own browser, you would lose all session state between calls, such as cookies, navigation history, and any form state the agent had already built up. Keeping the browser alive for the full agent session preserves that context.
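One fragile spot in the tools above is the 'selector:value' input string that fill_and_submit_form expects. Splitting on commas and colons fails when a value contains a comma (such as "Doe, Jane") or a selector contains a colon (such as input:not([disabled])). A minimal sketch of one alternative, assuming you are free to change the tool's input contract, is to have the LLM pass JSON instead. The function name parse_form_instructions and the JSON shape are illustrative assumptions, not part of LangChain or Playwright.

```python
import json
from typing import Optional


def parse_form_instructions(raw: str) -> tuple[dict[str, str], Optional[str]]:
    """Parse JSON such as
    {"fields": {"#email": "a@b.com"}, "submit": "button[type=submit]"}
    into (fields, submit_selector). Raises ValueError on malformed input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Input is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level.")

    fields = data.get("fields", {})
    if not isinstance(fields, dict):
        raise ValueError("'fields' must map selectors to values.")

    submit = data.get("submit")
    if submit is not None and not isinstance(submit, str):
        raise ValueError("'submit' must be a selector string.")

    # Coerce keys and values to strings because page.fill() expects text
    return {str(k): str(v) for k, v in fields.items()}, submit


fields, submit = parse_form_instructions(
    '{"fields": {"#name": "Doe, Jane", "input:not([disabled])": "x"}, '
    '"submit": "button[type=submit]"}'
)
print(fields)
print(submit)
```

Inside the tool, you would loop over fields with page.fill() and then click submit if it is set. Returning the ValueError message to the agent, rather than raising, lets the LLM correct its own input on the next step.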

Using browser-use for High-Level Agent Tasks

Raw Playwright with @tool functions gives you precise control. The trade-off is that you are still writing selectors, still thinking about page structure, still handling every edge case manually. If the site changes its HTML, your selectors break.

browser-use takes a different approach. Instead of writing selectors, you give the agent a task in plain English. browser-use uses Playwright under the hood, but the LLM reads the current page state on each step and decides what to do next: which element to click, what to type, and when the task is complete. The page structure is not hardcoded into your code. The agent figures it out at runtime.

Because nothing about the page layout is baked into your script, a browser-use agent is resilient to site changes that would break a selector-based script.

When to use browser-use over raw Playwright:

If the task is exploratory and the page structure is unpredictable, use browser-use.
If you are running a fixed, repeatable workflow where every selector is known and stable, raw Playwright is more reliable and cheaper per run.
Cost also matters: a browser-use agent calls the LLM on every step of a task, while a scripted Playwright run makes no LLM calls at all.

How to run: Save as browser_use_agent.py, ensure OPENAI_API_KEY is in your .env, then run python browser_use_agent.py

# browser_use_agent.py
# A browser-use agent that accepts a natural language task and completes it
# without any CSS selectors or hardcoded page structure.
# Prerequisites: pip install browser-use playwright langchain-openai python-dotenv
#                playwright install chromium
# How to run: python browser_use_agent.py

import asyncio
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from browser_use import Agent

load_dotenv()

async def run_browser_task(task: str) -> str:
    """
    Hand a natural language task to a browser-use agent.
    The agent handles navigation, clicks, and extraction without selectors.
    """
    # temperature=0 makes decisions more repeatable and reduces erratic actions
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        api_key=os.getenv("OPENAI_API_KEY")
    )

    # Agent wraps the browser, the LLM, and the task loop together.
    # max_actions_per_step limits how many actions the agent takes before
    # re-reading the page -- prevents runaway loops on complex pages.
    agent = Agent(
        task=task,
        llm=llm,
        max_actions_per_step=5
    )

    # run() executes the full task loop:
    # read page → decide action → take action → read updated page → repeat
    result = await agent.run()

    # final_result() returns the agent's extracted content or conclusion
    return result.final_result() or "Task completed with no extracted output."


async def main():
    task = (
        "Go to https://books.toscrape.com and find the 3 most expensive books "
        "on the first page. Return their titles and prices."
    )
    print(f"Task: {task}\n")
    output = await run_browser_task(task)
    print(f"Result:\n{output}")


asyncio.run(main())

What this does: The entire task (navigating to the site, reading the page, identifying the three highest prices, and extracting them) is handled by the agent without a single CSS selector in your code. If books.toscrape.com redesigns its price display tomorrow, the script should still work, because the agent re-reads the page at runtime. A selector-based scraper would likely break, often silently by returning empty results.

The max_actions_per_step=5 parameter is worth explaining. On each step, the agent reads the page and can decide to take up to five actions (click, type, scroll, navigate) before re-reading the page. Keeping this low forces the agent to check its work more frequently, which catches mistakes earlier.
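To make the effect of max_actions_per_step concrete, here is a toy model of the step loop in plain Python. It is not browser-use's implementation: run_task is a hypothetical stand-in, and the "page" is reduced to a count of actions still needed. It only illustrates that a lower cap means more page re-reads (and so more LLM calls) for the same amount of work.

```python
def run_task(total_actions_needed: int, max_actions_per_step: int) -> int:
    """Return how many times the page is read before the task finishes."""
    remaining = total_actions_needed
    page_reads = 0
    while remaining > 0:
        page_reads += 1  # the agent observes the page (one LLM call in this model)
        planned = min(remaining, max_actions_per_step)  # the plan is capped
        remaining -= planned  # the planned actions are executed
    return page_reads


for cap in (1, 5, 10):
    print(f"cap={cap}: {run_task(12, cap)} page reads")
```

In this model a task needing 12 actions takes 12 reads at a cap of 1, 3 reads at a cap of 5, and 2 reads at a cap of 10. The trade-off is between cost (fewer reads) and how soon the agent notices that an earlier action did not do what it expected.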

Handling the Hard Parts

Three things break most browser agents in production. Each has a solution, but none of them is obvious until you have already been burned.

1. Anti-Bot Detection

Websites that do not want to be automated detect automation in several ways, such as checking the navigator.webdriver property (which Playwright sets to true by default), looking for headless browser fingerprints in the JavaScript environment, and analyzing interaction patterns that are too fast or too uniform to be human.

The most important mitigation is removing the webdriver flag. Beyond that, a realistic user agent string, a standard viewport size, and a realistic locale and timezone cover most detection methods short of sophisticated fingerprint analysis.

# hard_parts.py -- Part 1: Anti-bot stealth launch
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python hard_parts.py

import asyncio
import json
from pathlib import Path
from playwright.async_api import async_playwright

async def launch_stealth_browser(playwright):
    """
    Launch a browser context that looks more like a real human session.
    Covers: realistic viewport, user-agent, locale, timezone, webdriver flag.
    Note: For serious anti-bot targets, consider a paid service like Browserbase.
    """
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",  # Hides webdriver detection
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ]
    )

    context = await browser.new_context(
        viewport={"width": 1366, "height": 768},   # Common desktop resolution
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
        java_script_enabled=True,
    )

    # Remove the 'webdriver' property that Playwright injects by default.
    # Bot detection systems check for this in the browser's JS environment.
    await context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )

    return browser, context

What this does: The add_init_script() call runs before any page JavaScript executes, which means the navigator.webdriver override is in place before the site’s detection code can check for it. The --disable-blink-features=AutomationControlled launch argument removes a separate automation flag at the browser engine level. Together, these two changes handle the most common detection methods.

For sites with aggressive fingerprinting and CAPTCHA systems, these mitigations will not be enough. Services like Browserbase, Spidra, and Bright Data’s Scraping Browser handle CAPTCHA solving, residential IP rotation, and browser fingerprint management as managed infrastructure.
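The stealth launch above addresses the webdriver flag and the browser fingerprint, but not the third signal mentioned earlier: interaction patterns that are too fast or too uniform. A minimal sketch of one way to soften that, assuming a Playwright async page object, is to type one character at a time with randomized pauses. The helper names jittered_delays and type_like_human and the delay bounds are illustrative assumptions; this makes timing less uniform but does nothing against fingerprint analysis.

```python
import asyncio
import random
from typing import Optional


def jittered_delays(n: int, low: float = 0.05, high: float = 0.25,
                    rng: Optional[random.Random] = None) -> list[float]:
    """Return n pause lengths in seconds, each drawn uniformly from [low, high]."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(n)]


async def type_like_human(page, selector: str, text: str,
                          low: float = 0.05, high: float = 0.25) -> None:
    """Click a field, then type its text with an uneven pause after each key."""
    await page.click(selector)  # focus the field first
    for char, pause in zip(text, jittered_delays(len(text), low, high)):
        await page.keyboard.type(char)  # send a single character
        await asyncio.sleep(pause)      # non-blocking pause between keystrokes
```

Inside a tool you would call await type_like_human(page, "#email", "user@example.com") in place of page.fill(). The cost is speed: a 20-character value takes a few seconds instead of being filled instantly, so reserve it for sites where uniform timing is actually being flagged.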

2. Smart Waiting

The second failure mode is timing. The reflex is to add time.sleep() calls and increase them when things break. This is wrong in both directions: too short on slow connections, too long on fast ones, and completely opaque when debugging.

Playwright offers several event-based waits, and the four below cover most cases. Use the one that matches what you are actually waiting for:

# Part 2: Smart waiting strategies (add to your scraper or agent tools)

async def smart_wait_examples(page):
    """
    Four ways to wait for the right page state, without arbitrary sleeps.
    """
    # STRATEGY 1: Wait for a specific element to become visible
    # Use when you know exactly what element signals content has loaded
    await page.wait_for_selector(".product-list", state="visible", timeout=10000)

    # STRATEGY 2: Wait for a specific API response
    # Use when the content comes from an XHR/fetch call you can identify
    async with page.expect_response(
        lambda r: "/api/products" in r.url and r.status == 200
    ) as response_info:
        await page.click("#load-more")
    response = await response_info.value
    print(f"API responded: {response.status}")

    # STRATEGY 3: Wait for the URL to change after form submission
    # Use when a successful submit redirects to a new page
    await page.wait_for_url("**/dashboard**", timeout=10000)

    # STRATEGY 4: Wait for a JavaScript variable to be set
    # Use when no visual element reliably signals the ready state
    await page.wait_for_function(
        "() => window.__dataLoaded === true",
        timeout=10000
    )

What this does: Each strategy is tied to a specific observable event rather than an arbitrary time delay. wait_for_selector watches the DOM. expect_response hooks into the network layer. wait_for_url monitors navigation. wait_for_function evaluates JavaScript in the browser context. Use whichever one most directly signals “the thing I need is now ready.”

3. Session and Cookie Persistence

The third failure mode is losing session state. If your agent logs into a site during step one and then the browser context is destroyed, step two has no authentication. Recreating the login on every run is slow and can trigger rate limiting or lockout.

The solution is saving cookies to disk after login and loading them at the start of every subsequent run:

# Part 3: Session persistence across runs

COOKIES_FILE = Path("session_cookies.json")

async def save_session(context) -> None:
    """Save browser cookies to disk after a successful login."""
    cookies = await context.cookies()
    COOKIES_FILE.write_text(json.dumps(cookies, indent=2))
    print(f"Session saved: {len(cookies)} cookies written.")


async def load_session(context) -> bool:
    """Load saved cookies before navigating. Returns True if session was found."""
    if not COOKIES_FILE.exists():
        print("No saved session. Fresh login required.")
        return False
    cookies = json.loads(COOKIES_FILE.read_text())
    await context.add_cookies(cookies)
    print(f"Session restored: {len(cookies)} cookies loaded.")
    return True

What this does: context.cookies() returns all cookies for the current browser context, including session tokens and authentication cookies. Writing them to JSON and reloading them on the next run means the browser starts in an authenticated state. Note that sessions expire; add a check that falls back to a fresh login if the saved session returns a redirect to the login page.
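The expiry check mentioned above can be sketched as follows. This is one possible shape under stated assumptions, not a prescribed pattern: the helper names, the "/login" path, and the do_login coroutine are all hypothetical, and the cookie filter relies on Playwright reporting session cookies with expires set to -1.

```python
import time
from typing import Optional
from urllib.parse import urlparse


def drop_expired_cookies(cookies: list[dict], now: Optional[float] = None) -> list[dict]:
    """Keep session cookies (expires missing or -1) and cookies not yet expired."""
    now = time.time() if now is None else now
    return [c for c in cookies if c.get("expires", -1) == -1 or c["expires"] > now]


def is_login_redirect(current_url: str, login_path: str = "/login") -> bool:
    """True if the browser ended up on the (assumed) login page."""
    path = urlparse(current_url).path.rstrip("/")
    return path == login_path or path.startswith(login_path + "/")


async def open_authenticated(page, target_url: str, do_login) -> bool:
    """Open target_url; if the site bounces to the login page, log in and retry.

    do_login is a hypothetical coroutine that fills and submits the login form.
    Returns True if a fresh login was needed.
    """
    await page.goto(target_url, wait_until="domcontentloaded")
    if not is_login_redirect(page.url):
        return False  # the saved session is still valid
    await do_login(page)
    await page.goto(target_url, wait_until="domcontentloaded")
    return True
```

A typical flow would be: call load_session(), call open_authenticated(), and call save_session() again only when it returns True, so the cookie file is refreshed whenever a new login happened. Filtering with drop_expired_cookies before add_cookies() avoids restoring cookies the site would reject anyway.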

Deploying Browser Agents

Getting a browser agent working locally is one thing. Running it reliably in a cloud environment is another.

The main difference between a Python script that works on your laptop and one that fails in CI is system dependencies. Playwright’s Chromium browser requires a set of shared libraries that are present on most developer machines but absent from minimal cloud images. The cleanest solution is Docker.

Dockerfile — build a container that ships everything Playwright needs:

# Dockerfile for headless Playwright-based browser agent
# Build: docker build -t browser-agent .
# Run:   docker run --rm -e OPENAI_API_KEY=your_key browser-agent

FROM python:3.11-slim

# Install system dependencies required by Chromium
RUN apt-get update && apt-get install -y \
    libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 \
    libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \
    libxrandr2 libgbm1 libasound2 libpangocairo-1.0-0 \
    libpango-1.0-0 libcairo2 libx11-6 libxext6 libxfixes3 \
    fonts-liberation wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies first (cached layer -- only rebuilds on requirements change)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browser binaries into the image
RUN playwright install chromium
RUN playwright install-deps chromium

# Copy application code last (changes here don't invalidate the pip/playwright layers)
COPY . .

CMD ["python", "agent_tools.py"]

requirements.txt:
playwright
browser-use
langchain
langchain-openai
langgraph
python-dotenv

For concurrent workloads running multiple browser sessions in parallel, use Playwright’s async API with asyncio.gather():

# Parallel scraping with semaphore rate limiting
# Runs up to 3 browser sessions simultaneously

import asyncio
from playwright.async_api import async_playwright

async def scrape_url(browser, url: str, semaphore: asyncio.Semaphore) -> dict:
    """Scrape a single URL, respecting the concurrency semaphore."""
    async with semaphore:
        context = await browser.new_context()
        try:
            page = await context.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            title = await page.title()
            return {"url": url, "title": title}
        finally:
            # Close the context (not the browser), even if navigation fails
            await context.close()


async def scrape_parallel(urls: list[str], max_concurrent: int = 3) -> list[dict]:
    """Scrape a list of URLs in parallel, capped at max_concurrent sessions."""
    semaphore = asyncio.Semaphore(max_concurrent)  # Cap concurrent sessions

    async with async_playwright() as p:
        # One browser shared across all contexts -- much cheaper than one browser per URL
        browser = await p.chromium.launch(headless=True)
        tasks = [scrape_url(browser, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()

    return list(results)

What this does: asyncio.Semaphore(max_concurrent) caps how many browser contexts run at the same time. Without it, launching 50 concurrent contexts can exhaust memory on a small machine. One browser process is shared across all contexts: a context is cheap, a full browser instance is not.
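The capping behavior is easy to verify without a browser. The sketch below swaps the real page load for asyncio.sleep() and records the highest number of jobs in flight at once; measure_peak_concurrency and fake_scrape are hypothetical names used only for this illustration.

```python
import asyncio


async def measure_peak_concurrency(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake jobs under a semaphore; return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_scrape(i: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for page.goto()
            in_flight -= 1

    # All tasks are scheduled at once; the semaphore decides who runs
    await asyncio.gather(*(fake_scrape(i) for i in range(n_tasks)))
    return peak


print(asyncio.run(measure_peak_concurrency(n_tasks=20, max_concurrent=3)))  # prints 3
```

All 20 tasks are created immediately, but only three ever hold the semaphore at the same time. The same reasoning applies to browser contexts: gather() schedules everything, and the semaphore bounds the memory actually in use.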

On the managed infrastructure side, Amazon Nova Act launched in March 2025 as a dedicated SDK for building browser agents on AWS, integrating natively with Playwright for browser control. Playwright’s own MCP server gives AI assistants full browser control through the Model Context Protocol, using structured accessibility snapshots rather than screenshots, which means token costs stay low while the agent’s understanding of the page stays high.

Putting It All Together

Here is a complete end-to-end agent that takes a research question, navigates to a public data source, extracts structured results, and returns a clean summary. It follows the shared-browser tool pattern shown earlier, orchestrated by a LangGraph agent.

How to run: Save as reference_agent.py, ensure OPENAI_API_KEY is in your .env, and run python reference_agent.py

# reference_agent.py
# Full browser-using AI agent: navigates, extracts, summarizes.
# Target: books.toscrape.com (public scraping sandbox)
# Prerequisites: pip install playwright langchain langchain-openai langgraph python-dotenv
#                playwright install chromium
# How to run: python reference_agent.py

import asyncio
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.prebuilt import create_react_agent
from playwright.async_api import async_playwright

load_dotenv()

# ── BROWSER STATE ─────────────────────────────────────────────────────────────
_browser = None
_context = None
_page = None
_playwright = None

async def get_page():
    global _browser, _context, _page, _playwright
    if _browser is None:
        _playwright = await async_playwright().start()
        _browser = await _playwright.chromium.launch(headless=True)
        _context = await _browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        )
        # Remove webdriver fingerprint
        await _context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        _page = await _context.new_page()
    return _page


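async def teardown():
    global _browser, _context, _page, _playwright
    if _browser:
        await _browser.close()
        await _playwright.stop()
        # Reset every handle so a later get_page() call starts from a clean state
        _browser = None
        _context = None
        _page = None
        _playwright = None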


# ── TOOLS ─────────────────────────────────────────────────────────────────────

@tool
async def navigate(url: str) -> str:
    """
    Navigate the browser to a URL and return the page's text content.
    Use when you need to open a website or move to a new page.
    Input: full URL with https:// prefix.
    """
    page = await get_page()
    await page.goto(url, wait_until="domcontentloaded", timeout=20000)
    await page.wait_for_load_state("networkidle")
    content = await page.inner_text("body")
    return content[:4000]


@tool
async def extract_structured(css_selector: str) -> str:
    """
    Extract text from all elements matching a CSS selector on the current page.
    Use when you need to pull specific elements from the loaded page.
    Input: valid CSS selector string (e.g., 'h3 a', '.price_color', 'article.product_pod').
    """
    page = await get_page()
    try:
        await page.wait_for_selector(css_selector, timeout=5000)
        elements = await page.query_selector_all(css_selector)
        texts = []
        for el in elements[:20]:  # Cap at 20 elements to keep output manageable
            text = await el.inner_text()
            texts.append(text.strip())
        return "\n".join(texts) if texts else "No elements found."
    except Exception as e:
        return f"Extraction failed: {str(e)}"


@tool
async def get_current_url() -> str:
    """Return the URL the browser is currently on. No input required."""
    page = await get_page()
    return page.url


# ── AGENT ─────────────────────────────────────────────────────────────────────

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.getenv("OPENAI_API_KEY")
)

tools = [navigate, extract_structured, get_current_url]
agent = create_react_agent(llm, tools)

SYSTEM = (
    "You are a browser-based research agent. You have access to a real browser. "
    "Use navigate() to open pages, extract_structured() to pull specific elements, "
    "and get_current_url() to check where you are. "
    "Always navigate first, then extract. Be concise in your final answer."
)


async def run_agent(query: str) -> str:
    try:
        result = await agent.ainvoke({
            "messages": [
                SystemMessage(content=SYSTEM),
                HumanMessage(content=query)
            ]
        })
        return result["messages"][-1].content
    finally:
        # Always close the browser, even if the agent raises mid-run
        await teardown()


# ── DEMO ──────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    query = (
        "Go to https://books.toscrape.com and extract the titles and prices "
        "of the first 5 books listed. Return them as a structured list."
    )
    print(f"Query: {query}\n")
    answer = asyncio.run(run_agent(query))
    print(f"Answer:\n{answer}")

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

# reference_agent.py

# Full browser-using AI agent: navigates, extracts, summarizes.

# Target: books.toscrape.com (public scraping sandbox)

# Prerequisites: pip install playwright langchain langchain-openai langgraph python-dotenv

# playwright install chromium

# How to run: python reference_agent.py

import asyncio

import os

from dotenv import load_dotenv

from langchain_openai import ChatOpenAI

from langchain.tools import tool

from langchain_core.messages import HumanMessage, SystemMessage

from langgraph.prebuilt import create_react_agent

from playwright.async_api import async_playwright

load_dotenv()

# ── BROWSER STATE ─────────────────────────────────────────────────────────────

_browser = None

_context = None

_page = None

_playwright = None

async def get_page():

global _browser, _context, _page, _playwright

if _browser is None:

_playwright = await async_playwright().start()

_browser = await _playwright.chromium.launch(headless=True)

_context = await _browser.new_context(

viewport={"width": 1280, "height": 720},

user_agent=(

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "

"AppleWebKit/537.36 (KHTML, like Gecko) "

"Chrome/120.0.0.0 Safari/537.36"

)

# Remove webdriver fingerprint

await _context.add_init_script(

"Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"

)

_page = await _context.new_page()

return _page

async def teardown():

global _browser, _playwright

if _browser:

await _browser.close()

await _playwright.stop()

_browser = None

_playwright = None

# ── TOOLS ─────────────────────────────────────────────────────────────────────

@tool

async def navigate(url: str) -> str:

"""

Navigate the browser to a URL and return the page's text content.

Use when you need to open a website or move to a new page.

Input: full URL with https:// prefix.

"""

page = await get_page()

await page.goto(url, wait_until="domcontentloaded", timeout=20000)

await page.wait_for_load_state("networkidle")

content = await page.inner_text("body")

return content[:4000]

@tool

async def extract_structured(css_selector: str) -> str:

"""

Extract text from all elements matching a CSS selector on the current page.

Use when you need to pull specific elements from the loaded page.

Input: valid CSS selector string (e.g., 'h3 a', '.price_color', 'article.product_pod').

"""

page = await get_page()

try:

await page.wait_for_selector(css_selector, timeout=5000)

elements = await page.query_selector_all(css_selector)

texts = []

for el in elements[:20]: # Cap at 20 elements to keep output manageable

text = await el.inner_text()

texts.append(text.strip())

return "\n".join(texts) if texts else "No elements found."

except Exception as e:

return f"Extraction failed: {str(e)}"

@tool

async def get_current_url() -> str:

"""Return the URL the browser is currently on. No input required."""

page = await get_page()

return page.url

# ── AGENT ─────────────────────────────────────────────────────────────────────

llm = ChatOpenAI(

model="gpt-4o",

temperature=0,

api_key=os.getenv("OPENAI_API_KEY")

)

tools = [navigate, extract_structured, get_current_url]

agent = create_react_agent(llm, tools)

SYSTEM = (

"You are a browser-based research agent. You have access to a real browser. "

"Use navigate() to open pages, extract_structured() to pull specific elements, "

"and get_current_url() to check where you are. "

"Always navigate first, then extract. Be concise in your final answer."

)

async def run_agent(query: str) -> str:
    try:
        result = await agent.ainvoke({
            "messages": [
                SystemMessage(content=SYSTEM),
                HumanMessage(content=query)
            ]
        })
        return result["messages"][-1].content
    finally:
        # Close the browser even if the agent raises, so no Chromium process is orphaned
        await teardown()

if __name__ == "__main__":
    query = (
        "Go to https://books.toscrape.com and extract the titles and prices "
        "of the first 5 books listed. Return them as a structured list."
    )
    print(f"Query: {query}\n")
    answer = asyncio.run(run_agent(query))
    print(f"Answer:\n{answer}")

What this does: This agent has three tools (navigate, extract_structured, and get_current_url) and a system prompt that tells it when to use each one. The agent calls navigate to load the page, calls extract_structured to pull the book titles and prices by CSS selector, and then synthesizes a structured list in its final answer. The teardown() call after the agent finishes closes the browser cleanly so no zombie Chromium processes are left running.
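
In the example above, the LLM assembles the final list itself. If you would rather build it deterministically, the newline-joined strings that extract_structured returns for a title selector and a price selector can be paired in plain Python. The following is a minimal sketch, assuming both selectors return items in the same page order. The helper name pair_titles_and_prices is hypothetical, and the sample strings are placeholders, not real output from the site.

```python
def pair_titles_and_prices(
    titles_text: str, prices_text: str, limit: int = 5
) -> list[dict[str, str]]:
    """Pair two newline-joined tool outputs into {title, price} records."""
    # Split each tool output into non-empty, trimmed lines
    titles = [line.strip() for line in titles_text.splitlines() if line.strip()]
    prices = [line.strip() for line in prices_text.splitlines() if line.strip()]
    # zip() stops at the shorter list, so unmatched trailing items are dropped
    return [
        {"title": title, "price": price}
        for title, price in zip(titles[:limit], prices[:limit])
    ]


# Placeholder strings standing in for the two tool outputs (assumed, not scraped)
sample_titles = "Book A\nBook B\nBook C"
sample_prices = "£10.00\n£12.50\n£8.99"

records = pair_titles_and_prices(sample_titles, sample_prices, limit=2)
print(len(records))
```

Pairing by position is only as reliable as the page layout. If an item can be missing its price, extracting each product container (for example article.product_pod) and parsing it as a unit is likely the safer design.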

Conclusion

The browser is not a specialized tool for automation engineers. It is the universal interface for the web, and the web is where most of the world’s actual work gets done. An AI agent that can use a browser does not need a partner team maintaining API integrations. It can reach anything a human can reach.

What makes this practical now, not just theoretically interesting, is the maturity of the tooling. Playwright handles the hard parts of browser interaction. browser-use removes the need to write selectors for exploratory tasks. LangGraph gives the LLM clean tool hooks and a reasoning loop that handles variable page structures. The patterns in this article are not demos. They are the same patterns that 51% of enterprises now running AI agents in production are building on.

Start with the scraping example. Get it running against a site you actually need data from. Add the agent layer when you need decisions the script cannot anticipate. Add browser-use when the page structure is too dynamic for selectors. Deploy in Docker when you need it running somewhere other than your laptop.

The hard part is not the code. It is knowing which tool to reach for at each layer. Hopefully this article made that clearer.
