Build AI Agents that Scrape the Web and Generate Dashboards with Crawl4AI
AI Agents are emerging as one of the most practical applications of large language models (LLMs). Systems that once relied solely on strict logic now show a degree of intuition and reasoning that, until recently, required human intervention. Instead of simply answering a question or following predefined logic gates, AI agents can reason, plan, and act more flexibly, using any tool at their disposal. That shift doesn't just change how we write code. It also reshapes how we think about scraping data, transforming it, and turning it into products like dashboards.
LLMs have fueled a wave of bold claims, such as: “The internet is dead.” “Dashboards are dead.” My personal standpoint is more measured, but this project shows a glimpse of why people are quick to make those claims.
What We’ll Build: a Self-Sustaining Sentiment Dashboard
We’ll design a system of AI Agents that scrapes, processes, and interprets data, and presents it in a dashboard that updates and improves itself. The pipeline pulls headlines from CNN and Fox, runs sentiment analysis to measure emotional tone, and visualizes the results.
- Sankey diagrams to show how sentiment flows through categories.
- Bar charts to compare sentiment between CNN and Fox.
- Word clouds to highlight the most common words across headlines.
The system is powered by Crawl4AI (scraping), LangGraph + LangChain (agents and orchestration), and Streamlit (interactive dashboards).
Why This Matters
The cost of any data product is not its initial development, but the technical debt: the continuous maintenance required to keep things functional and reliable. It is not uncommon for teams to spend 25% of their time on fixing rather than developing¹.
If you've ever done web scraping, you know the pain. Scrapers rely on specific web elements within a site's layout, which can change at any time. On day one, your scraper runs flawlessly. On day seven, CNN's site renames a single element your scraper depends on, and the pipeline breaks.
Dashboards aren’t much better. As a data analyst by profession, I know that building and maintaining visuals and layouts can consume hours, even with tools like Power BI or Tableau. Alternatively, using Python to render dashboards adds dependency management to the mix.
From an organizational perspective, the real cost is clear: either you hire more developers, or your existing ones spend a huge share of their time just keeping systems alive.
The Solution
Now, imagine a system that runs itself: an AI agent that decides what to scrape based on instructions, adapts as the website's structure changes, and then generates and renders dashboards automatically. For companies, the appeal is obvious: fewer maintenance cycles, lower operational costs, and the ability to ship insights faster without constantly firefighting.
Key Concepts: Agents
What is an AI Agent?
An AI agent is a system that can perceive, decide, and act towards a goal with some degree of autonomy. The term actually describes a wide array of systems: robotics, bots that trade stocks, even a smart thermostat that adjusts the temperature automatically. The common thread is that they all sense, reason, and act.
Today, the term most often refers to LLM-powered agents: systems with a large language model as their reasoning engine. There are different ways to structure these agents, but one of the most widely used is ReAct (Reason + Act). In this framework, the LLM reasons about what to do, acts by calling a tool or API, observes the result, and repeats the cycle until the goal is reached².
This is the pattern illustrated below:
- System Instruction (prompt): defines the task or goal for the agent.
- Reasoning (LLM Core): the model interprets instructions, makes decisions, and plans next steps.
- Tools (Functions): external functions and APIs the agent can call to act on its environment.
- Observation Loop: the agent interprets feedback from tools to guide further reasoning.
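To make the loop concrete, here is a rough pseudocode sketch of one ReAct cycle. The helper names (llm_decide, run_tool, goal_reached) are hypothetical placeholders, not functions used later in this project:
# Rough ReAct loop in pseudocode; llm_decide, run_tool, and goal_reached
# are hypothetical placeholders, not part of this project's code.
history = [system_instruction]
while not goal_reached(history):
    thought, action = llm_decide(history)       # Reason: pick the next tool call
    observation = run_tool(action)              # Act: execute the chosen tool
    history += [thought, action, observation]   # Observe: feed results back into context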
Why Multi-Agent Environments?
Asking one LLM-powered agent to "do it all" is a recipe for failure. The broader its role, the longer its instructions and the more room for ambiguity. That's when hallucinations creep in and pipelines start breaking. The idea of a multi-agent environment is that instead of one "overworked" agent, you deploy a team in which each member specializes in a specific task. Such systems often outperform monolithic, all-purpose agents³. Moreover, they are⁴:
- More accurate: each agent specializes in one domain.
- Easier to debug: you know where errors come from.
- More scalable: agents can be replaced, removed, or added, without directly affecting the entire pipeline.
The Project's Multi-Agent Setup
In this project, the system is split into three ReAct agents. Each has its own system instruction and tools.
- Data Agent: Scrapes and organizes data from CNN and Fox.
- Visualization Agent: Creates charts and generates the initial Streamlit dashboard.
- Refinement Agent: Tests, repairs, improves, and iterates on the dashboard until it works.
All three work together to produce the final app.
Key Concepts: Libraries
Crawl4AI: Scraping for the AI Era
Traditional scrapers are sensitive to every tiny site update. Crawl4AI, on the other hand, is built from the ground up with agents and LLMs in mind. Here are some of its advantages⁵:
- Machine-friendly by default: It outputs clean, structured data, ready for RAG pipelines and AI agents.
- Token Efficiency: It doesn’t output messy selectors or HTML clutter, meaning leaner LLM calls that are faster and cheaper.
- Resilient: Changes in site layouts won’t send you into maintenance hell.
That’s why it’s trending on GitHub. It’s the next-gen bridge between the messy web and intelligent systems.
LangGraph: The Agent’s Skeleton
Building an agent from scratch involves many steps and design decisions. Luckily, LangGraph provides this scaffolding out of the box. Specifically, it comes with a built-in ReAct agent⁶, the reasoning loop that powers decision-making in this project. With it, we can:
- Set system instructions to guide behavior.
- Easily manage and orchestrate different tools.
- Rely on built-in looping and retries so an agent doesn’t just crash at the first failure.
In short, LangGraph is what turns a clever LLM into a reliable worker.
LangChain: The Ecosystem Glue
In this project, LangChain is used to:
- Initialize LLMs from different providers (OpenAI, Anthropic).
- Wrap Python functions as tools with the @tool decorator.
- Provide messaging abstractions (SystemMessage, HumanMessage) that agents use to communicate.
LangGraph builds on LangChain. Together, they provide the core and the surrounding infrastructure.
Streamlit: Bringing It to Life
Streamlit is one of the most popular Python libraries for building dashboards:
- Low-code and fast: spin up a polished UI with just a few lines of Python.
- Widely adopted → one of the most popular Python dashboard libraries (and likely well-represented in LLM training data).
In this project, the entire dashboard is generated by AI, and Streamlit makes that practical.
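To make that concrete, here is a hand-written sketch of the kind of Streamlit app the agents end up generating later in the pipeline. The actual generated app.py differs from run to run, and the file names and output/ paths here are assumptions that mirror the outputs produced further below:
import streamlit as st
import plotly.io as pio

st.set_page_config(page_title="News Sentiment Dashboard", layout="wide")
st.sidebar.markdown("CNN vs Fox headline sentiment, rebuilt daily by AI agents.")

cnn_tab, fox_tab, compare_tab = st.tabs(["CNN", "Fox", "Comparison"])
with cnn_tab:
    st.plotly_chart(pio.read_json("output/cnn_sankey.json"), use_container_width=True)
    st.image("output/cnn_wordcloud.png")
with fox_tab:
    st.plotly_chart(pio.read_json("output/fox_sankey.json"), use_container_width=True)
    st.image("output/fox_wordcloud.png")
with compare_tab:
    st.plotly_chart(pio.read_json("output/comparison.json"), use_container_width=True)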
The Technical
The following section walks through the code, from the project structure to setup and agent orchestration.
Project Structure and Prerequisites
If you’re curious to try it out, you can clone the project from GitHub using the following link:
GitHub — raphaelschols/sentiment-agents-dashboard
The code block below shows the project structure:
news-agents-dashboard/
│── main.py
│── requirements.txt
│
├── _agents/                 # Agents that run different pipeline steps
│   ├── __init__.py
│   ├── data_agent.py
│   ├── viz_agent.py
│   ├── refine_agent.py
│
├── _llm/                    # Model initialization and config
│   ├── __init__.py
│   ├── openai_llm.py
│   ├── claud_llm.py
│
├── _system_instructions/    # System prompts for each agent
│   ├── __init__.py
│   ├── data_prompt.py
│   ├── viz_prompt.py
│   ├── refine_prompt.py
│
├── _tools/                  # Tools grouped per agent
│   ├── __init__.py
│   ├── data_tools.py
│   ├── viz_tools.py
│   ├── refine_tools.py
│   ├── tools.py
│
├── output/                  # Generated outputs (charts, JSON, wordclouds)
│
└── venv/                    # Virtual environment
Prerequisites: LLM Key, Requirements, and Crawl4AI Setup
To run the dashboard assistant, you'll need a few things set up:
🔑1. Get an LLM Developer Key
To run this project, you’ll need API keys for your language models. I used GPT-4o-mini for most of the work, but switched to Claude 3.5 for dashboard generation since Claude tends to be stronger at coding. You can grab your keys here:
- OpenAI API → generate a new key.
- Claude (Anthropic) API → create your developer key.
⚙️2. Requirements
All requirements for this project are listed in requirements.txt and include the following libraries:
langchain-openai
langgraph
langchain
crawl4ai
streamlit
vaderSentiment
plotly
wordcloud
langchain-anthropic
The requirements can then be installed with the following command:
pip install -r requirements.txt
🕷️ 3. Crawl4AI Setup
Unlike most Python libraries, Crawl4AI needs a quick setup step after installation. This ensures everything works smoothly under the hood.
- Run the setup command (required):
crawl4ai-setup
This prepares the runtime environment and installs the necessary backend for crawling.
- (Optional) Run a health check:
crawl4ai-doctor
Use this if you want to confirm your installation is healthy before running crawls.
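To verify everything works end to end, a minimal crawl, adapted from Crawl4AI's basic usage, fetches a page and prints the start of the clean Markdown it produces:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl a single page and get back clean, LLM-ready Markdown
        result = await crawler.arun(url="https://www.cnn.com")
        print(result.markdown[:300])

asyncio.run(main())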
The Project Code
The project uses roughly the same structure for every agent. Instead of pasting all the code below, let’s zoom in on the key parts.
- LLM Setup → initialized in _llm/
- System Prompts → step-by-step instructions in _system_instructions/
- Tools → Python functions in _tools/, wrapped with @tool so agents can call them autonomously
- Agents Setup → _agents/ modules that combine LLM + tools + prompt + memory
- Pipeline Orchestration → main.py stitches everything together and schedules runs
Let’s walk through these building blocks.
1. LLM Setup
Each agent needs a language model backend. To keep things modular, I put all model initialization inside the _llm/ folder.
- OpenAI models (GPT-4o-mini in this case) handle the Data Agent and Visualization Agent.
- Claude 3.5 handles the Refiner Agent, since it’s often better at reasoning about code.
For example, _llm/openai_llm.py contains:
import os
import getpass
from langchain.chat_models import init_chat_model

def init_openai_llm(model: str = "gpt-4o-mini", output_token_limit: int = 5000):
    """
    Initialize an OpenAI GPT model.
    Args:
        model (str): OpenAI model name (e.g., "gpt-4o-mini").
        output_token_limit (int): Maximum number of output tokens per response.
    Returns:
        llm: Initialized OpenAI language model.
    """
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API key: ")
    llm = init_chat_model(model, model_provider="openai", max_tokens=output_token_limit)
    return llm
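The Claude model used by the Refiner Agent is initialized the same way in _llm/claud_llm.py. A minimal sketch, assuming an Anthropic key and a Claude 3.5 model alias (the exact model name and token limit in the repo may differ):
import os
import getpass
from langchain.chat_models import init_chat_model

def init_claude_llm(model: str = "claude-3-5-sonnet-latest", output_token_limit: int = 5000):
    """Initialize an Anthropic Claude model (assumed model alias; adjust as needed)."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter Anthropic API key: ")
    return init_chat_model(model, model_provider="anthropic", max_tokens=output_token_limit)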
2. System Prompts
Each agent gets its own prompt file inside _system_instructions/. The prompt defines the role of the agent and how to use specific tools. 🔔I have shortened the prompts for the sake of the article; the full version in the repo contains more guardrails.
Data Agent — _system_instructions/data_prompt.py
import textwrap
data_prompt = textwrap.dedent("""
You are the Data Agent.
Process each outlet one at a time:
- ("cnn", "https://www.cnn.com")
- ("fox", "https://www.foxnews.com")
Steps per outlet:
1. Call scrape_links(homepage_url) → select Politics, World, Business, Sports, Entertainment → build full URLs.
2. Call extract_headlines(pages) → return distinct headlines.
3. Call sentiment_analysis(headlines, outlet):
- Max 100 items per call. If more, split into ≤100 batches and merge results.
- Passing `outlet` ensures results are saved.
Rules:
- Only one scrape_links() per homepage.
- Use homepage + section URLs only.
- Complete all steps for one outlet before moving to the next.
- Preserve exact headlines; use batching to avoid truncation.
""")
Visualization Agent — _system_instructions/viz_prompt.py
import textwrap
viz_prompt = textwrap.dedent("""
You are the Visualization Agent.
Steps:
1) Create per-outlet visuals:
- create_sankey_chart(outlet="cnn")
- create_wordcloud(outlet="cnn")
- create_sankey_chart(outlet="fox")
- create_wordcloud(outlet="fox")
2) Create the comparison bar chart:
- create_comparison_chart(cnn_outlet="cnn", fox_outlet="fox")
3) Generate a complete Streamlit dashboard (app.py).
4) Save outputs:
- cnn_sankey.json
- cnn_wordcloud.png
- fox_sankey.json
- fox_wordcloud.png
- comparison.json
- app.py
""")
Refine Agent — _system_instructions/refine_prompt.py
import textwrap
refine_prompt = textwrap.dedent("""
You are the Refiner Agent. Your job is to improve the Streamlit app.py and ensure it runs without errors.
Workflow:
1. Load the existing app.py.
2. Improve code (layout, design, error handling).
3. Save and re-test until the app passes without errors.
Key Improvements:
- Use modern Streamlit APIs (st.tabs, st.columns).
- Better styling and theming.
- Add sidebar with project description.
- Add comments and improve readability.
Critical Rules:
- Always save with save_dashboard_app().
- Always test with test_streamlit_code().
- Fix and re-test until no errors remain.
""")
3. Tools
As you can see in the system instructions, each agent references specific tools. These live in the _tools/ directory and are grouped by agent type. Tools are simply Python functions that an agent can use. LangChain offers several ways to make a tool callable inside an agent's reasoning loop; the simplest is LangChain's @tool decorator. Aside from the system instruction, the agent reads the tool's docstring to understand how and when to use it.
💡 A docstring is mandatory when using the @tool decorator; leaving it out will result in an error. Below, I show the full code for the scrape_links tool. The other tools (extract_headlines and sentiment_analysis) follow the same pattern but are shortened here for brevity.
Data Tools — _tools/data_tools.py
# --- Imports needed for the tools shown here ---
import asyncio
from typing import List

from crawl4ai import AsyncWebCrawler
from langchain_core.tools import tool

# --- Data Collection ---
@tool
def scrape_links(url: str) -> list[str]:
    """Scrape homepage links only. Reject repeat calls.
    Args:
        url (str): The homepage URL to scrape.
    Returns:
        list[str]: List of scraped links.
    """
    async def _scrape():
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url)
            return result.links
    return asyncio.run(_scrape())

@tool
def extract_headlines(urls: List[str], chunk_size: int = 5000) -> List[dict]:
    """
    Scrape URLs and return distinct headlines with inferred categories.
    Args:
        urls (List[str]): List of article URLs to scrape.
        chunk_size (int): Chunk size for scraping. Defaults to 5000.
    Returns:
        List[dict]: List of dicts with headlines and categories. Each {"headline": str, "category": str}.
    """
    ....

@tool
def sentiment_analysis(
    headlines: list[dict],
    outlet: str = None,
    path: str = "output/data"
) -> dict:
    """
    Run sentiment analysis on headlines with VADER.
    Always returns results in memory.
    If `outlet` is provided, also saves results to JSON and includes the filepath.
    """
    ....

# Collect all data tools in a list so the agent can call them
DATA_TOOLS = [
    scrape_links,
    extract_headlines,
    sentiment_analysis,
]
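For reference, the heart of sentiment_analysis is VADER's SentimentIntensityAnalyzer. A minimal sketch of that scoring step (not the repo's exact implementation, which also handles batching and saving to JSON):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_headlines(headlines: list[dict]) -> list[dict]:
    # Attach VADER's compound score (-1 = very negative, +1 = very positive) to each headline
    return [
        {**h, "sentiment": analyzer.polarity_scores(h["headline"])["compound"]}
        for h in headlines
    ]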
The other agents follow the same pattern:
Viz Tools — _tools/viz_tools.py
The visualization agent has its own toolkit for turning the sentiment data into visuals and packaging them into a dashboard:
- load_sentiment_data → loads saved sentiment JSON from disk
- create_sankey_chart → builds a Sankey diagram of sentiment flow per outlet
- create_wordcloud → generates a word cloud PNG from headlines
- create_comparison_chart → builds a bar chart comparing CNN vs Fox sentiment
- save_dashboard_app → writes out the Streamlit app.py file
VISUALIZATION_TOOLS = [
load_sentiment_data,
create_sankey_chart,
create_wordcloud,
create_comparison_chart,
save_dashboard_app,
]
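As an illustration of what a tool like create_sankey_chart might do internally, here is a minimal Plotly sketch (not the repo's exact code): categories flow into sentiment buckets, and the figure is saved as JSON so the Streamlit app can reload it later.
import plotly.graph_objects as go

def sketch_sankey(counts: dict[tuple[str, str], int], path: str) -> None:
    """counts maps (category, sentiment_bucket) -> number of headlines."""
    labels = sorted({c for c, _ in counts} | {s for _, s in counts})
    idx = {label: i for i, label in enumerate(labels)}
    fig = go.Figure(go.Sankey(
        node=dict(label=labels),
        link=dict(
            source=[idx[c] for c, _ in counts],
            target=[idx[s] for _, s in counts],
            value=list(counts.values()),
        ),
    ))
    fig.write_json(path)  # e.g. "output/cnn_sankey.json", reloaded by the dashboard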
Refiner Tools — _tools/refine_tools.py
The refiner agent uses tools to improve and validate the Streamlit app:
- load_dashboard_app → reads the current app.py file from disk
- save_dashboard_app → saves the updated code back to disk
- test_streamlit_code → runs a test to confirm the app executes without errors
REFINER_TOOLS = [
load_dashboard_app,
save_dashboard_app,
test_streamlit_code,
]
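One way a check like test_streamlit_code can work (a rough approximation, not the repo's code) is to parse the generated file for syntax errors and then run it briefly as a plain Python script, which surfaces import and runtime failures while Streamlit calls outside a real session generally only emit warnings:
import ast
import subprocess
import sys

def rough_app_check(app_path: str = "output/app.py") -> str:
    """Rough smoke test for a generated Streamlit app (illustrative only)."""
    source = open(app_path, encoding="utf-8").read()
    try:
        ast.parse(source)  # catch syntax errors without executing anything
    except SyntaxError as exc:
        return f"Syntax error: {exc}"
    # Running the script with plain Python surfaces missing imports and runtime errors
    proc = subprocess.run(
        [sys.executable, app_path],
        capture_output=True, text=True, timeout=60,
    )
    return proc.stderr.strip() or "OK"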
4. Agents Setup
Each agent lives in its own module inside _agents/. At a high level, every agent is built from the same building blocks:
- LLM: OpenAI (for Data & Viz) or Claude (for Refinement).
- Tools: imported from _tools/ and passed into the agent so it can call them autonomously.
- System Prompt: defines its role and step-by-step workflow, stored in _system_instructions/.
- Memory: a MemorySaver backend to persist state across turns in the reasoning loop.
We use LangGraph’s create_react_agent, which wires everything together using the ReAct paradigm (reasoning + acting). This allows the LLM to alternate between “thinking” (reasoning steps) and “doing” (tool calls).
Here’s the Data Agent:
# _agents/data_agent.py
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import SystemMessage

from _llm.openai_llm import init_openai_llm
from _tools.data_tools import DATA_TOOLS
from _system_instructions.data_prompt import data_prompt

def run_data_agent():
    """Run the Data Agent and return results."""
    model_openai = init_openai_llm()
    memory = MemorySaver()

    data_agent = create_react_agent(
        model=model_openai,
        tools=DATA_TOOLS,
        checkpointer=memory,
        version="v2",
    )

    config_data = {
        "configurable": {"thread_id": "agent_thread_1"},
        "recursion_limit": 60,
    }

    return data_agent.invoke(
        {"messages": [SystemMessage(content=data_prompt)]},
        config=config_data,
    )
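If you want to run the Data Agent on its own and inspect what it did, something like this works (a convenience snippet, not part of the repo):
from _agents.data_agent import run_data_agent

result = run_data_agent()
# The ReAct agent returns its full message history: system prompt, model reasoning,
# tool calls, and tool outputs. Print a short summary of each message.
for msg in result["messages"]:
    print(type(msg).__name__, "->", str(msg.content)[:120])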
The Viz Agent and Refiner Agent follow the same structure, with only a few differences:
- Viz Agent: uses VISUALIZATION_TOOLS and viz_prompt.
- Refiner Agent: uses Claude (init_claude_llm), REFINER_TOOLS, and refine_prompt.
5. Pipeline Orchestration
The main.py file ties everything up into a full workflow. It runs each agent in sequence — Data → Visualization → Refinement — and schedules the job to execute automatically once per day at 09:00.
- Data Agent: scrapes news outlets and runs sentiment analysis.
- Visualization Agent: generates Sankey diagrams, word clouds, and comparison charts.
- Refiner Agent: ensures the Streamlit dashboard code runs cleanly.
Here’s the orchestration code:
import schedule
import time

from _agents.data_agent import run_data_agent
from _agents.viz_agent import run_viz_agent
from _agents.refine_agent import run_refiner_agent

def run_pipeline():
    """Run the full pipeline: Data → Visualization → Refinement."""
    print("Step 1: Running Data Agent...")
    data_res = run_data_agent()

    print("Step 2: Running Visualization Agent...")
    viz_res = run_viz_agent()

    print("Step 3: Running Refinement Agent...")
    refine_res = run_refiner_agent()

    return {
        "data": data_res,
        "visualization": viz_res,
        "refinement": refine_res,
    }

def job():
    print("Running daily pipeline...")
    result = run_pipeline()
    print("Pipeline finished!")
    # optionally log or save `result` here

if __name__ == "__main__":
    print("Scheduler started. Pipeline will run every day at 09:00.")
    schedule.every().day.at("09:00").do(job)

    while True:
        schedule.run_pending()
        time.sleep(60)  # check every minute
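Once a run has finished, the generated dashboard can be viewed locally with streamlit run on the saved app.py (for example, streamlit run output/app.py, assuming the output path used in this project). To trigger a single run without waiting for the scheduler, you can simply call run_pipeline() directly.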
Thoughts, Limitations, and Improvements
Working with autonomous agents is fun and exciting, but it comes with trade-offs. When I first started this project, I assumed it would be simpler and require less code. Instead, I quickly realized that more structure and guardrails were necessary.
1. Autonomy vs. control
While agents reduce boilerplate and handle orchestration, they introduce unpredictability. Sometimes an agent follows its instructions perfectly; other times it goes off track. Avoiding ambiguity and keeping the system modular is then often key.
2. Testing and debugging can be tricky
It’s often difficult to understand why an agent breaks. Debugging requires inspecting both the reasoning steps and the tool calls.
3. Output randomness
The system sometimes generates a slightly different dashboard on each run, which is fun for a hobby project but not what many production environments need. Rather than letting the system regenerate a completely new dashboard each time, a better approach would be to persist past outputs and use the refinement agent to fix bugs or gradually enrich features. The current pipeline hasn't been deployed yet, which brings us to the last point.
4. Deployment gap
The current project runs locally. Running it in the cloud (e.g., on Google Cloud Run) would make it more reliable and accessible.
Potential solutions:
- Create a Dockerfile to containerize the app.
- Use a hosting option like Streamlit Community Cloud or Google Cloud (e.g., Cloud Run).
Final Thoughts
If you enjoyed this article, a few claps 👏 (or even 50!) would mean a lot and help more people discover it. You can also buy me a coffee if you'd like to support my work, and feel free to follow me for more content!
Sources:
- Ramač, R., Mandić, V., Taušan, N., Rios, N., Freire, S., Pérez, B., Castellanos, C., Correal, D., Pacheco, A., López, G., Izurieta, C., Seaman, C., & Spínola, R. (2022). Prevalence, common causes and effects of technical debt: Results from a family of surveys with the IT industry. Journal of Systems and Software, 184, 111114. https://doi.org/10.1016/j.jss.2021.111114
- Yao, S., Yang, J., Cui, D., Narasimhan, K., & Liang, P. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. https://arxiv.org/abs/2210.03629
- Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., & Wiest, O. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. arXiv preprint.
- Li, X., Wang, … (2024). A Survey on LLM-based Multi-Agent Systems: Workflow, Applications, Challenges. Springer. link.springer.com
- Crawl4AI Documentation. (n.d.). Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper. Retrieved from https://docs.crawl4ai.com/
- LangChain-AI. (n.d.). LangGraph agents reference. Retrieved from https://langchain-ai.github.io/langgraph/reference/agents/
Read the full article here: https://medium.com/data-science-collective/build-ai-agents-that-scrape-the-web-and-generate-dashboards-with-crawl4ai-1f9e5229e428