Best approach to crawling a domain and identifying PDFs with publish dates without sitemap.xml

Feb 12, 2026 04:29 PM

Hey everyone, I'm looking for the best approach to crawling a domain and identifying PDFs and their publish dates. I've had some success with Python, but I'm wondering if there's a better approach. It needs to crawl the entire domain even if sitemap.xml is not available.

1 comment


Rana H.
You can try crawler.dev for the crawling process, to extract links and render JS.
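
If you'd rather stay in plain Python (which the question mentions), here is a minimal sketch of a sitemap-free crawl. It is not crawler.dev's API; it only uses the standard library. The names `crawl_for_pdfs`, `http_fetch`, `pdf_creation_date` and the `max_pages` limit are made up for illustration. It assumes that links appear as ordinary `<a href>` tags in the served HTML (no JS rendering) and that the PDF stores an uncompressed `/CreationDate` entry, which is not always the case.

```python
import re
from collections import deque
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def pdf_creation_date(data):
    """Best-effort read of /CreationDate (D:YYYYMMDD...) from raw PDF bytes."""
    match = re.search(rb"/CreationDate\s*\(D:(\d{4})(\d{2})(\d{2})", data)
    if not match:
        return None  # missing, compressed, or stored only in XMP metadata
    try:
        return date(*(int(g) for g in match.groups()))
    except ValueError:
        return None  # malformed date string


def http_fetch(url):
    """Hypothetical default fetcher: returns (content_type, body_bytes)."""
    req = Request(url, headers={"User-Agent": "pdf-date-crawler-sketch"})
    with urlopen(req, timeout=15) as resp:
        return resp.headers.get("Content-Type", ""), resp.read()


def crawl_for_pdfs(start_url, fetch=http_fetch, max_pages=500):
    """Breadth-first crawl of one host; returns {pdf_url: date or None}."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pdfs = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            content_type, body = fetch(url)
        except Exception:
            continue  # skip unreachable or broken URLs
        if "pdf" in content_type.lower() or body.startswith(b"%PDF"):
            pdfs[url] = pdf_creation_date(body)
            continue
        if "html" not in content_type.lower():
            continue
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(link)
            if (parts.scheme in ("http", "https")
                    and parts.netloc == domain
                    and link not in seen):
                seen.add(link)
                queue.append(link)
    return pdfs
```

Calling `crawl_for_pdfs("https://example.com/")` would walk that host from the start page. When the PDF has no usable `/CreationDate`, the HTTP `Last-Modified` response header is another signal you could fall back on, though it reflects the file on the server rather than the publish date. A real crawl should also respect robots.txt and rate limits.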
