Build AI Agents that Scrape the Web and Generate Dashboards with Crawl4AI
AI Agents are emerging as one of the most practical applications of large language models (LLMs). Systems that once relied solely on strict logic now show a degree of intuition and reasoning that, until recently, required human intervention. Instead of simply answering a question or following predefined logic gates, AI agents can reason, plan, and act more flexibly, using any tool at their disposal. That shift doesn't just change how we write code. It also reshapes how we think about scraping data, transforming it, and turning it into products like dashboards.
LLMs have fueled a wave of bold claims, such as: “The internet is dead.” “Dashboards are dead.” My personal standpoint is more measured, but this project shows a glimpse of why people are quick to make those claims.
What We’ll Build: a Self-Sustaining Sentiment Dashboard
We’ll design a system of AI Agents that scrapes, processes, and interprets data, and presents it in a dashboard that updates and improves itself. The pipeline pulls headlines from CNN and Fox, runs sentiment analysis to measure emotional tone, and visualizes the results.
- Sankey diagrams to show how sentiment flows through categories.
- Bar charts to compare sentiment between CNN and Fox.
- Word clouds to highlight the most common words across headlines.
The system is powered by Crawl4AI (scraping), LangGraph + LangChain (agents and orchestration), and Streamlit (interactive dashboards).
Why This Matters
The cost of any data product is not its initial development, but the technical debt: the continuous maintenance required to keep things functional and reliable. It is not uncommon for teams to spend 25% of their time on fixing rather than developing¹.
If you've ever done web scraping, you know the pain. Scrapers rely on specific web elements within a site's layout, which can change at any time. On day one, your scraper runs flawlessly. On day seven, CNN's site renames a single element your scraper depends on, and the pipeline breaks.
Dashboards aren’t much better. As a data analyst by profession, I know that building and maintaining visuals and layouts can consume hours, even with tools like Power BI or Tableau. Alternatively, using Python to render dashboards adds dependency management to the mix.
From an organizational perspective, the real cost is clear: either you hire more developers, or your existing ones spend a huge share of their time just keeping systems alive.
The Solution
Now, imagine a system that runs itself: an AI agent that decides what to scrape based on instructions, adapts as the website's structure changes, and then generates and renders dashboards automatically. For companies, the appeal is obvious: fewer maintenance cycles, lower operational costs, and the ability to ship insights faster without constantly firefighting.
Key Concepts: Agents
What is an AI Agent?
An AI agent is a system that can perceive, decide, and act towards a goal with some degree of autonomy. The term actually describes a wide array of systems: robotics, bots that trade stocks, even a smart thermostat that adjusts the temperature automatically. The common thread is that they all sense, reason, and act.
Today, the term most often refers to LLM-powered agents: systems with a large language model as their reasoning engine. There are different ways to structure these agents, but one of the most widely used is ReAct (Reason + Act). In this framework, the LLM reasons about what to do, acts by calling a tool or API, observes the result, and repeats the cycle until the goal is reached².
This is the pattern illustrated below:
- System Instruction (prompt): defines the task or goal for the agent.
- Reasoning (LLM Core): the model interprets instructions, makes decisions, and plans next steps.
- Tools (Functions): external functions and APIs the agent can call to act on its environment.
- Observation Loop: the agent interprets feedback from tools to guide further reasoning.
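To make the loop concrete, here is a rough pseudocode sketch of one ReAct cycle. The helper names (llm_decide, run_tool, goal_reached) are hypothetical placeholders, not functions used later in this project:
# Rough ReAct loop in pseudocode; llm_decide, run_tool, and goal_reached
# are hypothetical placeholders, not part of this project's code.
history = [system_instruction]
while not goal_reached(history):
    thought, action = llm_decide(history)       # Reason: pick the next tool call
    observation = run_tool(action)              # Act: execute the chosen tool
    history += [thought, action, observation]   # Observe: feed results back into context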
Why Multi-Agent Environments?
Asking one LLM-powered agent to "do it all" is a recipe for failure. The broader its role, the longer its instructions and the more room for ambiguity. That's when hallucinations creep in and pipelines start breaking. The idea of a multi-agent environment is that instead of one "overworked" agent, you deploy a team in which each member specializes in a specific task. Such systems often outperform monolithic, all-purpose agents³. Moreover, they are⁴:
- More accurate: each agent specializes in one domain.
- Easier to debug: you know where errors come from.
- More scalable: agents can be replaced, removed, or added, without directly affecting the entire pipeline.
The Project's Multi-Agent Setup
In this project, the system is split into three ReAct agents. Each has its own system instruction and tools.
- Data Agent: Scrapes and organizes data from CNN and Fox.
- Visualization Agent: Creates charts and generates the initial Streamlit dashboard.
- Refinement Agent: Tests, repairs, improves, and iterates on the dashboard until it works.
All three work together to produce the final app.
Key Concepts: Libraries
Crawl4AI: Scraping for the AI Era
Traditional scrapers are sensitive to every tiny site update. Crawl4AI, on the other hand, is built from the ground up with agents and LLMs in mind. Here are some of its advantages⁵:
- Machine-friendly by default: It outputs clean, structured data, ready for RAG pipelines and AI agents.
- Token Efficiency: It doesn’t output messy selectors or HTML clutter, meaning leaner LLM calls that are faster and cheaper.
- Resilient: Changes in site layouts won’t send you into maintenance hell.
That’s why it’s trending on GitHub. It’s the next-gen bridge between the messy web and intelligent systems.
LangGraph: The Agent’s Skeleton
Building an agent from scratch involves many steps and design decisions. Luckily, LangGraph provides this scaffolding out of the box. Specifically, it comes with a built-in ReAct agent⁶, the reasoning loop that powers decision-making in this project. With it, we can:
- Set system instructions to guide behavior.
- Easily manage and orchestrate different tools.
- Rely on built-in looping and retries so an agent doesn’t just crash at the first failure.
In short, LangGraph is what turns a clever LLM into a reliable worker.
LangChain: The Ecosystem Glue
In this project, LangChain is used to:
- Initialize LLMs from different providers (OpenAI, Anthropic).
- Wrap Python functions as tools with the @tool decorator.
- Provide messaging abstractions (SystemMessage, HumanMessage) that agents use to communicate.
LangGraph builds on LangChain. Together, they provide the core and the surrounding infrastructure.
Streamlit: Bringing It to Life
Streamlit is one of the most popular Python libraries for building dashboards:
- Low-code and fast: spin up a polished UI with just a few lines of Python.
- Widely adopted → one of the most popular Python dashboard libraries (and likely well-represented in LLM training data).
In this project, the entire dashboard is generated by AI, and Streamlit makes that practical.
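To make that concrete, here is a hand-written sketch of the kind of Streamlit app the agents end up generating later in the pipeline. The actual generated app.py differs from run to run, and the file names and output/ paths here are assumptions that mirror the outputs produced further below:
import streamlit as st
import plotly.io as pio

st.set_page_config(page_title="News Sentiment Dashboard", layout="wide")
st.sidebar.markdown("CNN vs Fox headline sentiment, rebuilt daily by AI agents.")

cnn_tab, fox_tab, compare_tab = st.tabs(["CNN", "Fox", "Comparison"])
with cnn_tab:
    st.plotly_chart(pio.read_json("output/cnn_sankey.json"), use_container_width=True)
    st.image("output/cnn_wordcloud.png")
with fox_tab:
    st.plotly_chart(pio.read_json("output/fox_sankey.json"), use_container_width=True)
    st.image("output/fox_wordcloud.png")
with compare_tab:
    st.plotly_chart(pio.read_json("output/comparison.json"), use_container_width=True)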
The Technical
The following section walks through the code, from the project structure to setup and agent orchestration.
Project Structure and Prerequisites
If you’re curious to try it out, you can clone the project from GitHub using the following link:
GitHub — raphaelschols/sentiment-agents-dashboard
The code block below shows the project structure:
news-agents-dashboard/
│── main.py
│── requirements.txt
│
├── _agents/                 # Agents that run different pipeline steps
│   ├── __init__.py
│   ├── data_agent.py
│   ├── viz_agent.py
│   ├── refine_agent.py
│
├── _llm/                    # Model initialization and config
│   ├── __init__.py
│   ├── openai_llm.py
│   ├── claud_llm.py
│
├── _system_instructions/    # System prompts for each agent
│   ├── __init__.py
│   ├── data_prompt.py
│   ├── viz_prompt.py
│   ├── refine_prompt.py
│
├── _tools/                  # Tools grouped per agent
│   ├── __init__.py
│   ├── data_tools.py
│   ├── viz_tools.py
│   ├── refine_tools.py
│   ├── tools.py
│
├── output/                  # Generated outputs (charts, JSON, wordclouds)
│
└── venv/                    # Virtual environment
Prerequisites: LLM Key, Requirements, and Crawl4AI Setup
To run the dashboard assistant, you'll need a few things set up:
🔑1. Get an LLM Developer Key
To run this project, you’ll need API keys for your language models. I used GPT-4o-mini for most of the work, but switched to Claude 3.5 for dashboard generation since Claude tends to be stronger at coding. You can grab your keys here:
- OpenAI API → generate a new key.
- Claude (Anthropic) API → create your developer key.
⚙️2. Requirements
All requirements for this project are listed in requirements.txt and include the following libraries:
langchain-openai
langgraph
langchain
crawl4ai
streamlit
vaderSentiment
plotly
wordcloud
langchain-anthropic
The requirements can then be installed with the following command:
pip install -r requirements.txt
🕷️ 3. Crawl4AI Setup
Unlike most Python libraries, Crawl4AI needs a quick setup step after installation. This ensures everything works smoothly under the hood.
- Run the setup command (required):
crawl4ai-setup
This prepares the runtime environment and installs the necessary backend for crawling.
- (Optional) Run a health check:
crawl4ai-doctor
Use this if you want to confirm your installation is healthy before running crawls.
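To verify everything works end to end, a minimal crawl, adapted from Crawl4AI's basic usage, fetches a page and prints the start of the clean Markdown it produces:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl a single page and get back clean, LLM-ready Markdown
        result = await crawler.arun(url="https://www.cnn.com")
        print(result.markdown[:300])

asyncio.run(main())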
The Project Code
The project uses roughly the same structure for every agent. Instead of pasting all the code below, let’s zoom in on the key parts.
- LLM Setup → initialized in _llm/
- System Prompts → step-by-step instructions in _system_instructions/
- Tools → Python functions in _tools/, wrapped with @tool so agents can call them autonomously
- Agents Setup → _agents/ modules that combine LLM + tools + prompt + memory
- Pipeline Orchestration → main.py stitches everything together and schedules runs
Let’s walk through these building blocks.
1. LLM Setup
Each agent needs a language model backend. To keep things modular, I put all model initialization inside the _llm/ folder.
- OpenAI models (GPT-4o-mini in this case) handle the Data Agent and Visualization Agent.
- Claude 3.5 handles the Refiner Agent, since it’s often better at reasoning about code.
For example, _llm/openai_llm.py contains:
import os
import getpass
from langchain.chat_models import init_chat_model

def init_openai_llm(model: str = "gpt-4o-mini", output_token_limit: int = 5000):
    """
    Initialize an OpenAI GPT model.
    Args:
        model (str): OpenAI model name (e.g., "gpt-4o-mini").
        output_token_limit (int): Maximum number of output tokens per response.
    Returns:
        llm: Initialized OpenAI language model.
    """
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API key: ")
    llm = init_chat_model(model, model_provider="openai", max_tokens=output_token_limit)
    return llm
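The Claude model used by the Refiner Agent is initialized the same way in _llm/claud_llm.py. A minimal sketch, assuming an Anthropic key and a Claude 3.5 model alias (the exact model name and token limit in the repo may differ):
import os
import getpass
from langchain.chat_models import init_chat_model

def init_claude_llm(model: str = "claude-3-5-sonnet-latest", output_token_limit: int = 5000):
    """Initialize an Anthropic Claude model (assumed model alias; adjust as needed)."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter Anthropic API key: ")
    return init_chat_model(model, model_provider="anthropic", max_tokens=output_token_limit)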
2. System Prompts
Each agent gets its own prompt file inside _system_instructions/. The prompt defines the role of the agent and how to use specific tools. 🔔I have shortened the prompts for the sake of the article; the full version in the repo contains more guardrails.
Data Agent — _system_instructions/data_prompt.py
import textwrap
data_prompt = textwrap.dedent("""
You are the Data Agent.
Process each outlet one at a time:
- ("cnn", "https://www.cnn.com")
- ("fox", "https://www.foxnews.com")
Steps per outlet:
1. Call scrape_links(homepage_url) → select Politics, World, Business, Sports, Entertainment → build full URLs.
2. Call extract_headlines(pages) → return distinct headlines.
3. Call sentiment_analysis(headlines, outlet):
- Max 100 items per call. If more, split into ≤100 batches and merge results.
- Passing `outlet` ensures results are saved.
Rules:
- Only one scrape_links() per homepage.
- Use homepage + section URLs only.
- Complete all steps for one outlet before moving to the next.
- Preserve exact headlines; use batching to avoid truncation.
""")
Visualization Agent — _system_instructions/viz_prompt.py
import textwrap
viz_prompt = textwrap.dedent("""
You are the Visualization Agent.
Steps:
1) Create per-outlet visuals:
- create_sankey_chart(outlet="cnn")
- create_wordcloud(outlet="cnn")
- create_sankey_chart(outlet="fox")
- create_wordcloud(outlet="fox")
2) Create the comparison bar chart:
- create_comparison_chart(cnn_outlet="cnn", fox_outlet="fox")
3) Generate a complete Streamlit dashboard (app.py).
4) Save outputs:
- cnn_sankey.json
- cnn_wordcloud.png
- fox_sankey.json
- fox_wordcloud.png
- comparison.json
- app.py
""")
Refine Agent — _system_instructions/refine_prompt.py
import textwrap
refine_prompt = textwrap.dedent("""
You are the Refiner Agent. Your job is to improve the Streamlit app.py and ensure it runs without errors.
Workflow:
1. Load the existing app.py.
2. Improve code (layout, design, error handling).
3. Save and re-test until the app passes without errors.
Key Improvements:
- Use modern Streamlit APIs (st.tabs, st.columns).
- Better styling and theming.
- Add sidebar with project description.
- Add comments and improve readability.
Critical Rules:
- Always save with save_dashboard_app().
- Always test with test_streamlit_code().
- Fix and re-test until no errors remain.
""")
3. Tools
As you can see in the system instructions, each agent references specific tools. These live in the _tools/ directory and are grouped by agent type. Tools are simply Python functions that an agent can use. LangChain offers several ways to make a tool callable inside an agent's reasoning loop; the simplest is LangChain's @tool decorator. Aside from the system instruction, the agent reads the tool's docstring to understand how and when to use it.
💡 A docstring is mandatory when using the @tool decorator; leaving it out will result in an error. Below, I show the full code for the scrape_links tool. The other tools (extract_headlines and sentiment_analysis) follow the same pattern but are shortened here for brevity.
Data Tools — _tools/data_tools.py
# --- Imports needed for the tools shown here ---
import asyncio
from typing import List

from crawl4ai import AsyncWebCrawler
from langchain_core.tools import tool

# --- Data Collection ---
@tool
def scrape_links(url: str) -> list[str]:
    """Scrape homepage links only. Reject repeat calls.
    Args:
        url (str): The homepage URL to scrape.
    Returns:
        list[str]: List of scraped links.
    """
    async def _scrape():
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url)
            return result.links
    return asyncio.run(_scrape())

@tool
def extract_headlines(urls: List[str], chunk_size: int = 5000) -> List[dict]:
    """
    Scrape URLs and return distinct headlines with inferred categories.
    Args:
        urls (List[str]): List of article URLs to scrape.
        chunk_size (int): Chunk size for scraping. Defaults to 5000.
    Returns:
        List[dict]: List of dicts with headlines and categories. Each {"headline": str, "category": str}.
    """
    ....

@tool
def sentiment_analysis(
    headlines: list[dict],
    outlet: str = None,
    path: str = "output/data"
) -> dict:
    """
    Run sentiment analysis on headlines with VADER.
    Always returns results in memory.
    If `outlet` is provided, also saves results to JSON and includes the filepath.
    """
    ....

# Collect all data tools in a list so the agent can call them
DATA_TOOLS = [
    scrape_links,
    extract_headlines,
    sentiment_analysis,
]
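For reference, the heart of sentiment_analysis is VADER's SentimentIntensityAnalyzer. A minimal sketch of that scoring step (not the repo's exact implementation, which also handles batching and saving to JSON):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_headlines(headlines: list[dict]) -> list[dict]:
    # Attach VADER's compound score (-1 = very negative, +1 = very positive) to each headline
    return [
        {**h, "sentiment": analyzer.polarity_scores(h["headline"])["compound"]}
        for h in headlines
    ]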
The other agents follow the same pattern:
Viz Tools — _tools/viz_tools.py
The visualization agent has its own toolkit for turning the sentiment data into visuals and packaging them into a dashboard:
- load_sentiment_data → loads saved sentiment JSON from disk
- create_sankey_chart → builds a Sankey diagram of sentiment flow per outlet
- create_wordcloud → generates a word cloud PNG from headlines
- create_comparison_chart → builds a bar chart comparing CNN vs Fox sentiment
- save_dashboard_app → writes out the Streamlit app.py file
VISUALIZATION_TOOLS = [
load_sentiment_data,
create_sankey_chart,
create_wordcloud,
create_comparison_chart,
save_dashboard_app,
]
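As an illustration of what a tool like create_sankey_chart might do internally, here is a minimal Plotly sketch (not the repo's exact code): categories flow into sentiment buckets, and the figure is saved as JSON so the Streamlit app can reload it later.
import plotly.graph_objects as go

def sketch_sankey(counts: dict[tuple[str, str], int], path: str) -> None:
    """counts maps (category, sentiment_bucket) -> number of headlines."""
    labels = sorted({c for c, _ in counts} | {s for _, s in counts})
    idx = {label: i for i, label in enumerate(labels)}
    fig = go.Figure(go.Sankey(
        node=dict(label=labels),
        link=dict(
            source=[idx[c] for c, _ in counts],
            target=[idx[s] for _, s in counts],
            value=list(counts.values()),
        ),
    ))
    fig.write_json(path)  # e.g. "output/cnn_sankey.json", reloaded by the dashboard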
Refiner Tools — _tools/refine_tools.py
The refiner agent uses tools to improve and validate the Streamlit app:
- load_dashboard_app → reads the current app.py file from disk
- save_dashboard_app → saves the updated code back to disk
- test_streamlit_code → runs a test to confirm the app executes without errors
REFINER_TOOLS = [
load_dashboard_app,
save_dashboard_app,
test_streamlit_code,
]
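One way a check like test_streamlit_code can work (a rough approximation, not the repo's code) is to parse the generated file for syntax errors and then run it briefly as a plain Python script, which surfaces import and runtime failures while Streamlit calls outside a real session generally only emit warnings:
import ast
import subprocess
import sys

def rough_app_check(app_path: str = "output/app.py") -> str:
    """Rough smoke test for a generated Streamlit app (illustrative only)."""
    source = open(app_path, encoding="utf-8").read()
    try:
        ast.parse(source)  # catch syntax errors without executing anything
    except SyntaxError as exc:
        return f"Syntax error: {exc}"
    # Running the script with plain Python surfaces missing imports and runtime errors
    proc = subprocess.run(
        [sys.executable, app_path],
        capture_output=True, text=True, timeout=60,
    )
    return proc.stderr.strip() or "OK"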
4. Agents Setup
Each agent lives in its own module inside _agents/. At a high level, every agent is built from the same building blocks:
- LLM: OpenAI (for Data & Viz) or Claude (for Refinement).
- Tools: imported from _tools/ and passed into the agent so it can call them autonomously.
- System Prompt: defines its role and step-by-step workflow, stored in _system_instructions/.
- Memory: a MemorySaver backend to persist state across turns in the reasoning loop.
We use LangGraph’s create_react_agent, which wires everything together using the ReAct paradigm (reasoning + acting). This allows the LLM to alternate between “thinking” (reasoning steps) and “doing” (tool calls).
Here’s the Data Agent:
# _agents/data_agent.py
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import SystemMessage

from _llm.openai_llm import init_openai_llm
from _tools.data_tools import DATA_TOOLS
from _system_instructions.data_prompt import data_prompt

def run_data_agent():
    """Run the Data Agent and return results."""
    model_openai = init_openai_llm()
    memory = MemorySaver()

    data_agent = create_react_agent(
        model=model_openai,
        tools=DATA_TOOLS,
        checkpointer=memory,
        version="v2",
    )

    config_data = {
        "configurable": {"thread_id": "agent_thread_1"},
        "recursion_limit": 60,
    }

    return data_agent.invoke(
        {"messages": [SystemMessage(content=data_prompt)]},
        config=config_data,
    )
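If you want to run the Data Agent on its own and inspect what it did, something like this works (a convenience snippet, not part of the repo):
from _agents.data_agent import run_data_agent

result = run_data_agent()
# The ReAct agent returns its full message history: system prompt, model reasoning,
# tool calls, and tool outputs. Print a short summary of each message.
for msg in result["messages"]:
    print(type(msg).__name__, "->", str(msg.content)[:120])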
The Viz Agent and Refiner Agent follow the same structure, with only a few differences:
- Viz Agent: uses VISUALIZATION_TOOLS and viz_prompt.
- Refiner Agent: uses Claude (init_claude_llm), REFINER_TOOLS, and refine_prompt.
5. Pipeline Orchestration
The main.py file ties everything up into a full workflow. It runs each agent in sequence — Data → Visualization → Refinement — and schedules the job to execute automatically once per day at 09:00.
- Data Agent: scrapes news outlets and runs sentiment analysis.
- Visualization Agent: generates Sankey diagrams, word clouds, and comparison charts.
- Refiner Agent: ensures the Streamlit dashboard code runs cleanly.
Here’s the orchestration code:
import schedule
import time

from _agents.data_agent import run_data_agent
from _agents.viz_agent import run_viz_agent
from _agents.refine_agent import run_refiner_agent

def run_pipeline():
    """Run the full pipeline: Data → Visualization → Refinement."""
    print("Step 1: Running Data Agent...")
    data_res = run_data_agent()

    print("Step 2: Running Visualization Agent...")
    viz_res = run_viz_agent()

    print("Step 3: Running Refinement Agent...")
    refine_res = run_refiner_agent()

    return {
        "data": data_res,
        "visualization": viz_res,
        "refinement": refine_res,
    }

def job():
    print("Running daily pipeline...")
    result = run_pipeline()
    print("Pipeline finished!")
    # optionally log or save `result` here

if __name__ == "__main__":
    print("Scheduler started. Pipeline will run every day at 09:00.")
    schedule.every().day.at("09:00").do(job)

    while True:
        schedule.run_pending()
        time.sleep(60)  # check every minute
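Once a run has finished, the generated dashboard can be viewed locally with streamlit run on the saved app.py (for example, streamlit run output/app.py, assuming the output path used in this project). To trigger a single run without waiting for the scheduler, you can simply call run_pipeline() directly.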
Thoughts, Limitations, and Improvements
Working with autonomous agents is fun and exciting, but it comes with trade-offs. When I first started this project, I assumed it would be simpler and require less code. Instead, I quickly realized that more structure and guardrails were necessary.
1. Autonomy vs. control
While agents reduce boilerplate and handle orchestration, they introduce unpredictability. Sometimes an agent follows its instructions perfectly; other times it goes off track. Avoiding ambiguity and keeping the system modular is then often key.
2. Testing and debugging can be tricky
It’s often difficult to understand why an agent breaks. Debugging requires inspecting both the reasoning steps and the tool calls.
3. Output randomness
The system sometimes generates a slightly different dashboard on each run, which is fun for a hobby project but not what many production environments need. Rather than letting the system regenerate a completely new dashboard each time, a better approach would be to persist past outputs and use the refinement agent to fix bugs or gradually enrich features. The current pipeline hasn't been deployed yet, which brings us to the last point.
4. Deployment gap
The current project runs locally. Running it in the cloud (e.g., on Google Cloud Run) would make it more reliable and accessible.
Potential solutions:
- Create a Dockerfile to containerize the app.
- Use a hosting option like Streamlit Community Cloud or Google Cloud (e.g., Cloud Run).
Final Thoughts
If you enjoyed this article, a few claps 👏 (or even 50!) would mean a lot and help more people discover it. You can also buy me a coffee if you'd like to support my work, and feel free to follow me for more content!
Sources:
- Ramač, R., Mandić, V., Taušan, N., Rios, N., Freire, S., Pérez, B., Castellanos, C., Correal, D., Pacheco, A., López, G., Izurieta, C., Seaman, C., & Spínola, R. (2022). Prevalence, common causes and effects of technical debt: Results from a family of surveys with the IT industry. Journal of Systems and Software, 184, 111114. https://doi.org/10.1016/j.jss.2021.111114
- Yao, S., Yang, J., Cui, D., Narasimhan, K., & Liang, P. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. https://arxiv.org/abs/2210.03629
- Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., & Wiest, O. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. arXiv preprint.
- Li, X., Wang, … (2024). A Survey on LLM-based Multi-Agent Systems: Workflow, Applications, Challenges. Springer. link.springer.com
- Crawl4AI Documentation. (n.d.). Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper. Retrieved from https://docs.crawl4ai.com/
- LangChain-AI. (n.d.). LangGraph agents reference. Retrieved from https://langchain-ai.github.io/langgraph/reference/agents/
Read the full article here: https://medium.com/data-science-collective/build-ai-agents-that-scrape-the-web-and-generate-dashboards-with-crawl4ai-1f9e5229e428