Fetch Safety

Resource Limits

| Limit | Default | Configurable via |
|---|---|---|
| Maximum response body | 256 MiB | `FetchConfig.max_body_size` |
| Maximum redirect chain | 10 | `FetchConfig.max_redirects` |
| Connect timeout | 30 seconds | `FetchConfig.connect_timeout` |
| Total request timeout | 120 seconds | `FetchConfig.total_timeout` |
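The limits above can be sketched as a config struct. The field names come from the table; the concrete types and the `Default` impl are assumptions for illustration.

```rust
use std::time::Duration;

/// Sketch of the fetch limits from the table above.
/// Field names match the documented `FetchConfig` fields;
/// the types are assumed.
pub struct FetchConfig {
    pub max_body_size: u64,
    pub max_redirects: u32,
    pub connect_timeout: Duration,
    pub total_timeout: Duration,
}

impl Default for FetchConfig {
    fn default() -> Self {
        Self {
            max_body_size: 256 * 1024 * 1024, // 256 MiB
            max_redirects: 10,
            connect_timeout: Duration::from_secs(30),
            total_timeout: Duration::from_secs(120),
        }
    }
}

fn main() {
    let cfg = FetchConfig::default();
    assert_eq!(cfg.max_body_size, 268_435_456);
    assert_eq!(cfg.max_redirects, 10);
    assert_eq!(cfg.connect_timeout.as_secs(), 30);
    assert_eq!(cfg.total_timeout.as_secs(), 120);
    println!("ok");
}
```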

Decompression Bomb Protection

Responses with `Content-Encoding: gzip` (or brotli, deflate) are decompressed with size validation: the decompressed size is checked against `Content-Length` multiplied by a reasonable expansion ratio, preventing decompression ("zip bomb") attacks.
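A minimal sketch of that check, assuming the ratio is a configured multiplier (the function name and signature are hypothetical):

```rust
/// Hypothetical guard: reject a decompressed stream once it exceeds
/// the declared Content-Length times an allowed expansion ratio.
/// Called incrementally during decompression so a bomb is caught early.
fn check_decompressed_size(
    decompressed: u64,
    content_length: u64,
    max_ratio: u64,
) -> Result<(), String> {
    // saturating_mul avoids overflow on a hostile Content-Length
    let limit = content_length.saturating_mul(max_ratio);
    if decompressed > limit {
        Err(format!(
            "decompressed {decompressed} bytes exceeds limit of {limit}"
        ))
    } else {
        Ok(())
    }
}

fn main() {
    // 10x expansion under a 100x cap: fine
    assert!(check_decompressed_size(1_000, 100, 100).is_ok());
    // 200x expansion: looks like a bomb, rejected
    assert!(check_decompressed_size(20_000, 100, 100).is_err());
    println!("ok");
}
```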

Unsafe URL Schemes

Link extraction blocks unsafe URL schemes. These are logged but never followed:

  • javascript: — code execution
  • data: — embedded content (can be arbitrarily large)
  • blob: — browser-internal references
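A scheme blocklist like the one above can be checked before a URL is followed. This is a self-contained sketch, not the actual implementation; it assumes the scheme is everything before the first `:`.

```rust
/// Schemes that are logged but never followed
/// (the list from the section above).
const UNSAFE_SCHEMES: &[&str] = &["javascript", "data", "blob"];

/// Returns true if the URL carries a blocked scheme.
/// URL schemes are case-insensitive, so compare ignoring ASCII case.
fn is_unsafe_scheme(url: &str) -> bool {
    match url.split_once(':') {
        Some((scheme, _)) => UNSAFE_SCHEMES
            .iter()
            .any(|s| scheme.eq_ignore_ascii_case(s)),
        None => false, // relative URL: no scheme, nothing to block
    }
}

fn main() {
    assert!(is_unsafe_scheme("javascript:alert(1)"));
    assert!(is_unsafe_scheme("DATA:text/html,hi"));
    assert!(is_unsafe_scheme("blob:abc-123"));
    assert!(!is_unsafe_scheme("https://example.com/page"));
    assert!(!is_unsafe_scheme("/relative/path"));
    println!("ok");
}
```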

HTML Sanitization

Before link extraction, <script> and <style> tag content is stripped entirely. This prevents extracting junk URLs from JavaScript source code (e.g., minified variable names that look like relative paths).

```rust
pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url> {
    let cleaned = strip_tag_content(html, &["script", "style"]);
    // ... scan for href, src attributes
}
```
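For illustration, `strip_tag_content` could be implemented roughly as follows. This is a hypothetical sketch, not the library's code; it assumes non-nested tags, which HTML guarantees for `<script>` and `<style>`.

```rust
/// Hypothetical sketch: drop everything from `<tag ...>` through
/// `</tag>` for each listed tag, case-insensitively. No nesting
/// handling, which is fine for <script>/<style>.
fn strip_tag_content(html: &str, tags: &[&str]) -> String {
    let mut out = html.to_string();
    for tag in tags {
        let open = format!("<{tag}");
        let close = format!("</{tag}>");
        loop {
            // Lowercase a copy for matching, but edit the original
            // so non-tag content keeps its case.
            let lower = out.to_ascii_lowercase();
            let Some(start) = lower.find(&open) else { break };
            // Find the closing tag after the opening one.
            let Some(rel_end) = lower[start..].find(&close) else { break };
            out.replace_range(start..start + rel_end + close.len(), "");
        }
    }
    out
}

fn main() {
    let html = "<p>a</p><script>var x = '/fake/path';</script><p>b</p>";
    let cleaned = strip_tag_content(html, &["script", "style"]);
    assert!(!cleaned.contains("fake/path")); // junk URL source removed
    assert!(cleaned.contains("<p>a</p>"));   // real content untouched
    assert!(cleaned.contains("<p>b</p>"));
    println!("ok");
}
```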

robots.txt Enforcement

Palimpsest respects robots.txt per RFC 9309:

  • Fetches and caches robots.txt per origin before crawling
  • Respects `Disallow` directives for the configured user agent
  • Honors `Crawl-delay` when specified
  • Blocked URLs are counted in metrics (`palimpsest_robots_blocked`)
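The core of a `Disallow` check can be sketched as a prefix match against the rules for our user agent. This is a deliberately minimal illustration: full RFC 9309 matching also handles `Allow` precedence, longest-match-wins, and `*`/`$` wildcards, all omitted here.

```rust
/// Minimal sketch of a robots.txt Disallow check: a URL path is
/// blocked if it starts with any non-empty disallowed prefix.
/// (An empty `Disallow:` line means "allow everything" in RFC 9309,
/// hence the is_empty guard.)
fn is_blocked(path: &str, disallow_rules: &[&str]) -> bool {
    disallow_rules
        .iter()
        .any(|rule| !rule.is_empty() && path.starts_with(rule))
}

fn main() {
    // Hypothetical rules parsed from a fetched robots.txt
    let rules = ["/private/", "/tmp"];
    assert!(is_blocked("/private/page.html", &rules));
    assert!(is_blocked("/tmp-data", &rules)); // prefix match, per RFC 9309
    assert!(!is_blocked("/public/", &rules));
    assert!(!is_blocked("/", &[""])); // empty rule blocks nothing
    println!("ok");
}
```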