# Fetch Safety

## Resource Limits
| Limit | Default | Configurable |
|---|---|---|
| Maximum response body | 256 MiB | `FetchConfig.max_body_size` |
| Maximum redirect chain | 10 | `FetchConfig.max_redirects` |
| Connect timeout | 30 seconds | `FetchConfig.connect_timeout` |
| Total request timeout | 120 seconds | `FetchConfig.total_timeout` |
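The limits above could be gathered into a single config struct. A minimal sketch, assuming the field names from the "Configurable" column and the defaults from the "Default" column (the actual struct may carry more fields):

```rust
use std::time::Duration;

/// Hypothetical sketch of the fetch limits table; field names follow
/// the "Configurable" column, defaults follow the "Default" column.
pub struct FetchConfig {
    pub max_body_size: u64, // bytes
    pub max_redirects: u32,
    pub connect_timeout: Duration,
    pub total_timeout: Duration,
}

impl Default for FetchConfig {
    fn default() -> Self {
        FetchConfig {
            max_body_size: 256 * 1024 * 1024, // 256 MiB
            max_redirects: 10,
            connect_timeout: Duration::from_secs(30),
            total_timeout: Duration::from_secs(120),
        }
    }
}
```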
## Decompression Bomb Protection

Responses with `Content-Encoding: gzip` (or brotli, deflate) are decompressed with size validation. The decompressed size is checked against `Content-Length * reasonable_ratio` to prevent zip bomb attacks.
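The ratio check can be sketched as a standalone predicate. `max_ratio` here is a hypothetical parameter standing in for `reasonable_ratio`; the name is illustrative, not the actual API:

```rust
/// Sketch of the decompression-bomb check: the decompressed size must not
/// exceed the declared Content-Length times an allowed expansion ratio.
pub fn within_decompression_limit(content_length: u64, decompressed: u64, max_ratio: u64) -> bool {
    // saturating_mul avoids overflow for absurd Content-Length headers;
    // a declared length of 0 gives no budget and rejects any expansion.
    decompressed <= content_length.saturating_mul(max_ratio)
}
```

In practice the check runs incrementally during streaming decompression, so the fetch can be aborted as soon as the budget is exceeded rather than after buffering the full body.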
## Unsafe URL Schemes

Link extraction blocks unsafe URL schemes. These are logged but never followed:

- `javascript:` — code execution
- `data:` — embedded content (can be arbitrarily large)
- `blob:` — browser-internal references
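A filter for the schemes above might look like the following sketch (the function name is illustrative; matching is case-insensitive because URL schemes are):

```rust
/// Returns true for the blocked schemes listed above.
pub fn is_unsafe_scheme(url: &str) -> bool {
    let lower = url.trim_start().to_ascii_lowercase();
    ["javascript:", "data:", "blob:"]
        .iter()
        .any(|scheme| lower.starts_with(scheme))
}
```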
## HTML Sanitization

Before link extraction, `<script>` and `<style>` tag content is stripped entirely. This prevents extracting junk URLs from JavaScript source code (e.g., minified variable names that look like relative paths).
```rust
pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url> {
    let cleaned = strip_tag_content(html, &["script", "style"]);
    // ... scan for href, src attributes
}
```
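One way `strip_tag_content` could work — a sketch under the assumption of well-formed, non-nested tags, not the actual implementation — is to repeatedly cut everything from each opening tag through its matching close, case-insensitively:

```rust
/// Remove the listed tags and everything inside them (case-insensitive).
/// Sketch only: assumes well-formed, non-nested tags.
pub fn strip_tag_content(html: &str, tags: &[&str]) -> String {
    let mut out = html.to_string();
    for tag in tags {
        let open = format!("<{tag}");
        let close = format!("</{tag}>");
        loop {
            // Lowercase a copy for searching; edit the original string.
            let lower = out.to_ascii_lowercase();
            let Some(start) = lower.find(&open) else { break };
            let Some(end_rel) = lower[start..].find(&close) else { break };
            let end = start + end_rel + close.len();
            out.replace_range(start..end, "");
        }
    }
    out
}
```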
## robots.txt Enforcement

Palimpsest respects robots.txt per RFC 9309:

- Fetches and caches `robots.txt` per origin before crawling
- Respects `Disallow` directives for the configured user agent
- Honors `Crawl-delay` when specified
- Blocked URLs are counted in metrics (`palimpsest_robots_blocked`)
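The `Disallow` matching above can be illustrated with a deliberately minimal sketch. This handles only a single user-agent group with prefix rules; full RFC 9309 matching (multiple `User-agent` lines per group, `Allow` precedence, `*` and `$` wildcards) is richer:

```rust
/// Minimal sketch of Disallow matching: true if `path` is blocked for
/// `user_agent`. Handles one User-agent line per group and plain prefix
/// rules only; not a full RFC 9309 parser.
pub fn is_disallowed(robots_txt: &str, user_agent: &str, path: &str) -> bool {
    let mut in_group = false;
    for line in robots_txt.lines() {
        // Strip comments, then surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        if let Some(ua) = line.strip_prefix("User-agent:") {
            let ua = ua.trim();
            in_group = ua == "*" || ua.eq_ignore_ascii_case(user_agent);
        } else if in_group {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                // An empty Disallow value allows everything.
                if !rule.is_empty() && path.starts_with(rule) {
                    return true;
                }
            }
        }
    }
    false
}
```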