Tool
v0.1.2
Jina Reader and YouTube Loader Scraper
Tool ID
jina_reader_and_youtube_loader_scraper
Creator
@iamg30
Downloads
234+
Scrapes web pages using a locally hosted Reader instance, with optional Jina API key usage, and extracts YouTube transcripts with YoutubeLoader. Features asynchronous processing and content utilities.
README

Jina Reader and YouTube Loader Scraper

This tool integrates directly into Open WebUI. It scrapes web pages through a locally hosted Jina Reader instance, or with a Jina Reader API key (r.jina.ai) when one is configured, and extracts YouTube transcripts with YoutubeLoader.from_youtube_url from LangChain. Processing is asynchronous, and the tool includes features that improve scraping reliability and efficiency.

Key Features:

  • Web Scraping with Jina Reader: Uses a self-hosted Jina Reader instance to scrape web pages; Jina API keys can optionally be supplied for higher rate limits.
  • YouTube Transcript Extraction: Extracts transcripts from YouTube videos using YoutubeLoader from LangChain.
  • Asynchronous Processing: Designed for efficient and non-blocking scraping operations.
  • Rate Limiting: Implements domain-based rate limiting to avoid overloading websites and prevent IP blocking.
  • Caching: Caches scraped web page content and YouTube transcripts to reduce redundant requests and improve performance.
  • User-Agent Rotation: Rotates through a list of user agents to mimic normal user traffic and reduce the chance of being blocked.
  • Content Cleaning: Option to remove URLs and image links from scraped content to reduce token count and focus on text.
  • Status Updates: Provides real-time status updates via event emitters within Open WebUI, allowing for monitoring of scraping progress and errors (can be toggled on/off).
  • Configuration Valves: Offers configurable settings through Valves and UserValves classes to customize scraping behavior, caching, rate limits, and more.
  • Multi-URL Handling: Supports processing multiple URLs at once, accepting comma-separated or newline-separated lists.
  • Duplicate URL Handling: Automatically detects and skips duplicate URLs within a batch.
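
To make the domain-based rate limiting above more concrete, here is a minimal sketch of one way it could work. It is not the tool's actual implementation: the class name `DomainRateLimiter`, its methods, and the fixed minimum interval per domain are all assumptions made for illustration.

```python
import asyncio
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Hypothetical helper: spaces out requests to the same domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        # Maps domain -> earliest time the next request to it may start.
        self._next_allowed: dict[str, float] = {}

    def reserve(self, url: str, now: float) -> float:
        """Book the next slot for the URL's domain; return the delay in seconds."""
        domain = urlparse(url).netloc.lower()
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.min_interval
        return start - now

    async def wait(self, url: str) -> None:
        """Sleep until the URL's domain may be contacted again."""
        delay = self.reserve(url, time.monotonic())
        if delay > 0:
            await asyncio.sleep(delay)
```

A scraper would call `await limiter.wait(url)` just before each request. Because `reserve` contains no `await`, concurrent tasks cannot book the same slot twice, and requests to different domains do not delay each other.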

Prerequisites

  • Open WebUI Installation: You must have a working installation of Open WebUI.
  • Locally Hosted Jina Reader Instance: You need a running Reader instance that is reachable from the environment where the tool runs. By default, the tool connects to http://host.docker.internal:3000. Alternatively, you can use a Jina Reader API key.
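
As a rough illustration of how a request to the Reader instance might be put together, the sketch below builds (but does not send) a request. The function name `build_reader_request` is hypothetical, and two details are assumptions rather than facts taken from this tool: that the target URL is appended to the Reader base URL, and that an API key is sent as a bearer token.

```python
from typing import Optional
from urllib.request import Request

# Default Reader address mentioned in the prerequisites above.
DEFAULT_READER_URL = "http://host.docker.internal:3000"


def build_reader_request(target_url: str,
                         reader_url: str = DEFAULT_READER_URL,
                         api_key: Optional[str] = None) -> Request:
    """Build a GET request asking Reader to fetch target_url (nothing is sent)."""
    # Assumption: Reader expects the target URL appended to its own base URL.
    full_url = reader_url.rstrip("/") + "/" + target_url
    headers = {}
    if api_key:
        # Assumption: the key is passed as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(full_url, headers=headers, method="GET")
```

Building the request separately from sending it makes it easy to inspect the final URL and headers when a Reader instance is not responding as expected.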

Usage within Open WebUI

Once imported into Open WebUI, you can use the tool within your workflows. The tool provides the following functionalities:

  • Web Scraping a Single URL: Input a URL and the tool will scrape the content of the web page.
  • Extract Information (Title, Links, Images): Input a URL and the tool will scrape the page and extract the title, links, and images.
  • Get YouTube Transcript: Input a YouTube video URL and the tool will retrieve the transcript.
  • Process Multiple URLs: Input a comma-separated or newline-separated list of URLs for batch processing.
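
The batch input described above could be handled along these lines. This is a sketch under stated assumptions, not the tool's code: `split_urls`, `is_youtube`, and the `YOUTUBE_HOSTS` set are hypothetical names, and the tool may detect YouTube links differently.

```python
import re
from urllib.parse import urlparse

# Assumption: these hosts identify YouTube videos.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def split_urls(raw: str) -> list[str]:
    """Split on commas and newlines, trim, and drop blanks and duplicates."""
    seen = set()
    urls = []
    for part in re.split(r"[,\n]", raw):
        url = part.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)  # first occurrence wins, input order is kept
    return urls


def is_youtube(url: str) -> bool:
    """Decide whether a URL should go to the transcript path instead of Reader."""
    return urlparse(url).netloc.lower() in YOUTUBE_HOSTS
```

One limitation of this simple split is that a URL containing a literal comma would be cut in two; such URLs would need to be percent-encoded first.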

Configuration within Open WebUI:

You can configure the tool's behavior directly within the Open WebUI interface. Look for settings related to the imported tool to adjust parameters defined in the Valves and UserValves classes, such as:

  • Jina Instance URL
  • API Keys
  • Caching behavior
  • Rate Limits
  • Content Cleaning options
  • Logging and Status Updates
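
To show what enabling the content-cleaning option might do, here is a minimal sketch that assumes the scraped content is Markdown. The function name `clean_content` and the regular expressions are illustrative assumptions; the tool's own cleaning rules may differ.

```python
import re

# Markdown image: ![alt](target)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Markdown link: [text](target); the text is kept, the target is dropped.
LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")
# Bare http(s) URL.
URL_RE = re.compile(r"https?://\S+")


def clean_content(text: str) -> str:
    """Remove image links and URLs, keeping the readable text."""
    text = IMAGE_RE.sub("", text)       # images first, so LINK_RE never sees them
    text = LINK_RE.sub(r"\1", text)     # keep only the link text
    text = URL_RE.sub("", text)         # drop any remaining bare URLs
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse gaps left by removals
    return text.strip()
```

Stripping link targets and images this way shortens the text passed to the model, which is the token-count reduction the option is meant to provide.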

API Keys

You can raise the scraping rate limits by supplying Jina API keys. Configure the keys within the tool settings in Open WebUI. You can typically set:

  • Global Jina API Key: A general API key for the tool.
  • User-Specific Jina API Key: Allows individual users to provide their own API keys, potentially overriding the global key.
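
The precedence between the two keys could be resolved as sketched below. This assumes that a non-empty user key takes priority over the global key, which the description above only suggests; `resolve_api_key` is a hypothetical helper, not a function from the tool.

```python
from typing import Optional


def resolve_api_key(global_key: Optional[str],
                    user_key: Optional[str]) -> Optional[str]:
    """Prefer a non-empty user key, fall back to the global key, else None."""
    for key in (user_key, global_key):
        if key and key.strip():
            return key.strip()
    return None  # no key configured: requests go out unauthenticated
```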

License

This tool is licensed under the MIT License. See the LICENSE file for more details.