Scrape URL Block

Extract text content from a webpage

The Scrape URL Block allows you to extract text content from a webpage and use it within your workflow. This block is ideal for gathering data dynamically from online sources to provide context for your AI Agents.

Configurations

URL

Provide the webpage URL you want to scrape. You can input a static URL directly or use a {{variable}} for dynamic URLs.

Example:

  • Static: https://example.com/article

  • Dynamic: {{inputURL}}

Output Variable

Creates a new variable and saves the extracted text to it. Enter a variable name to store the scraped content for later use in the workflow.
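
For example, if you name the output variable scrapedContent (an illustrative name, not a requirement), a later step such as a Generate Text Block can reference the scraped text with the same {{variable}} templating:

  • Prompt: Summarize the key points of the following page: {{scrapedContent}}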

Provider

Select the scraping provider to process the webpage. Different providers will have different configuration settings and outputs. Choose the one that works best for your needs.


Types of Providers for the Scrape URL Block

The Scrape URL Block supports multiple providers to extract webpage content, each offering different levels of customization and functionality.

Default

The default provider extracts basic text content from the provided URL without additional configuration options. It is suitable for quick, straightforward scraping tasks that do not require advanced customization.

Firecrawl

Unlike traditional web scrapers, Firecrawl is equipped to handle dynamic content rendered with JavaScript. It offers advanced configuration options for greater control over how webpages are scraped.

Only Main Content

When enabled, this setting returns only the main content of the page, such as the body text, while excluding headers, navigation bars, and footers. Disabling it includes the entire page content, including headers and sidebars.

Include Screenshot

When enabled, captures and returns a screenshot of the top of the page being scraped. When disabled, no screenshot is included.

Wait For

Allows you to specify a delay (in milliseconds) before scraping begins to let the page fully load. By default, no wait time is applied. For example, setting it to 500 waits half a second before scraping.

Absolute Paths

Converts all relative paths in the scraped content to absolute URLs when enabled. This ensures that links and resources in the scraped content are fully qualified. When disabled, relative paths remain as they are.
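
For example, with this setting enabled, a link scraped from https://example.com might be rewritten as follows (URLs shown are illustrative):

  • Relative: /pricing

  • Absolute: https://example.com/pricing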

Headers

Lets you include custom HTTP headers with your scraping request. This is useful for adding cookies, specifying a User-Agent, or passing authentication tokens.
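
For example, you might pass headers such as the following (all values shown are placeholders):

  • User-Agent: Mozilla/5.0 (compatible; MyScraperBot/1.0)

  • Cookie: session_id=YOUR_SESSION_ID

  • Authorization: Bearer YOUR_API_TOKEN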

Remove Tags

Allows you to define HTML tags to exclude from the scraped content. For instance, adding <footer> removes footer elements from the output.

Use Extractor

When enabled, this option uses an LLM (Large Language Model) to extract structured data from the page. When disabled, the block returns raw textual content without further processing.
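
For example, when scraping an article page, the extractor might return structured fields instead of raw text (the field names below are illustrative and depend on how you configure the extraction):

  • title: Example Article Title

  • author: Jane Doe

  • published_date: 2024-01-15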


Best Practices

  • Validate URLs: Ensure the URL points to a publicly accessible page with the desired content.

  • Monitor Structure Changes: Webpages may change structure over time, which could affect scraping accuracy.

  • Use Variables: Leverage dynamic variables to adapt the block to multiple use cases without manually changing the URL (see the sketch after this list).

  • Set Clear Outputs: Choose meaningful variable names to make workflow debugging and customization easier.
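
As a sketch of the "Use Variables" practice, a simple workflow might chain blocks as follows (variable names are illustrative):

  1. A User Input Block collects a URL from the user and saves it to inputURL.

  2. The Scrape URL Block uses {{inputURL}} as its URL and saves the extracted text to scrapedContent.

  3. A Generate Text Block prompts the model with: Summarize the following page: {{scrapedContent}}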

