How to Scrape Web Data for AI Agents

Learn how to scrape web content and use it dynamically inside your AI workflows

This guide walks through the process of scraping webpage content in MindStudio and using that content in a custom AI agent. The example agent extracts article content from a URL and turns it into a LinkedIn post.

Use Case Overview

We’ll build a URL to LinkedIn Post agent that:

Collects a URL from the user.
Scrapes the content from that page.
Uses AI to generate a LinkedIn post based on the page content.

Step 1: Create a User Input for the URL

Add a User Input block to your workflow.
Choose the Short Text input type.
Name the variable: url
Set the label: Enter the URL you'd like to write a LinkedIn post about
Add placeholder text: e.g., https://www.theverge.com/...
Enable URL validation to ensure the input is a proper URL.
(Optional) Set a test value for debugging, like a real article URL.
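To make the URL validation step concrete, the following Python sketch shows the kind of check such a setting typically performs. It is only an illustration: the function name is_valid_url is hypothetical, and MindStudio's actual validation rules are not documented here and may differ.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Return True if value looks like an absolute http(s) URL.

    Illustrative only: this is an assumed rule set, not MindStudio's
    actual validation logic.
    """
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        # urlparse can raise on some malformed inputs (e.g. bad IPv6 hosts).
        return False
    # Require a web scheme and a non-empty host.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://www.theverge.com/tech", "not a url", "ftp://example.com"):
        print(candidate, "->", is_valid_url(candidate))
```

A check like this rejects free text and non-web schemes before the value ever reaches the Scrape URL block, which is why enabling validation on the input is worthwhile.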

Step 2: Scrape the Webpage

Add a Scrape URL block.
In the URL field, use the variable: {{ url }}
Set the output variable name: scraped_content
Choose Output Format: Text only
Enable Auto-enhance to improve scraping reliability.
Keep the Default scraper selected (Firecrawl is also available if needed).
Leave Screenshot disabled unless required.

The block now extracts the webpage content and stores it in the scraped_content variable.
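For intuition about what a "Text only" output contains, here is a minimal Python sketch that reduces HTML to its visible text using only the standard library. It is not how the Scrape URL block is implemented; the class TextOnlyExtractor, the function html_to_text, and the sample HTML are all hypothetical, and a real scraper also handles fetching, JavaScript rendering, and page clutter.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page used only for illustration.
sample_html = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
scraped_content = html_to_text(sample_html)

if __name__ == "__main__":
    print(scraped_content)
```

The result is plain text with markup, styles, and scripts removed, which is the form a language model can use most directly in a prompt.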

Step 3: Generate AI Output

Add a Generate Text block.

Write your prompt, including the scraped content:

Write an attention-grabbing LinkedIn post based on the following article:
<content>{{ scraped_content }}</content>

Choose an appropriate model (e.g., Claude 3.5 Haiku).
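The {{ scraped_content }} placeholder is replaced with the variable's value before the prompt reaches the model. The Python sketch below imitates that substitution so you can see the final prompt shape. The function render_prompt and its error behavior are assumptions made for illustration, not MindStudio's actual templating engine.

```python
import re

PROMPT_TEMPLATE = (
    "Write an attention-grabbing LinkedIn post based on the following article:\n"
    "<content>{{ scraped_content }}</content>"
)


def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with the matching variable value.

    Hypothetical helper: raises KeyError if a placeholder has no value.
    """

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing workflow variable: {name}")
        return str(variables[name])

    # Using a function as the replacement avoids backslash-escape surprises
    # when the scraped text itself contains backslashes.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


if __name__ == "__main__":
    print(render_prompt(PROMPT_TEMPLATE, {"scraped_content": "Article text goes here."}))
```

Wrapping the article in <content> tags, as the prompt above does, keeps the instruction visually separate from the scraped text, which helps the model tell the two apart.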

Step 4: Test the Agent

Click Preview and open the draft agent.
Enter an invalid value (for example, plain text that is not a URL) to confirm that validation works.
Enter a valid URL or use the test value.
The agent will:
- Scrape the page.
- Analyze the content.
- Generate a LinkedIn post for you to copy or repurpose.

Recap and Best Practices

Use the Scrape URL block to pull live content from any webpage.
Always validate user input when collecting URLs.
Store scraped data in a clearly named variable for easy reuse.
Use the “Text only” output format for general analysis, and “JSON” for structured use cases.
Auto-enhance improves scraping accuracy on dynamic or complex websites.

You can further extend this workflow by adding post-processing steps or integration blocks to share or save the generated content.
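As one example of a post-processing step, the Python sketch below tidies the generated post and trims it to a maximum length before it is shared or saved. The function trim_post is hypothetical, and the default of 3000 characters is an assumption; check the current limit of the destination platform before relying on it.

```python
def trim_post(post: str, max_chars: int = 3000) -> str:
    """Strip stray whitespace and shorten the post to at most max_chars.

    max_chars defaults to 3000 as an assumed limit, not a confirmed one.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    # Remove leading/trailing blank space and trailing spaces on each line.
    cleaned = "\n".join(line.rstrip() for line in post.strip().splitlines())
    if len(cleaned) <= max_chars:
        return cleaned

    # Reserve one character for the ellipsis that marks the cut.
    cut = cleaned[: max_chars - 1]
    # Prefer to cut at the last space so a word is not split in half.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + "…"


if __name__ == "__main__":
    print(trim_post("  Scraping made simple: turn any article into a post.  ", max_chars=30))
```

In a workflow, logic like this would sit after the Generate Text block, so that whatever is saved or sent to an integration is already clean and within bounds.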