Hey guys, I am looking for websites containing a noindex tag, i.e., websites that cannot be found on search engines. I have had little luck so far with scrapers / AI tools since they are not accurate. Has anyone here had success with similar tasks?
what do you mean?
Many websites have what is known as a robots tag in their HTML code. Its purpose is to tell search engine crawlers (Google, Bing, etc.) how to handle the website. One robots tag value is "noindex", which essentially means that the page should not show up in search engine results.
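For example, a page that opts out of indexing usually carries a tag like `<meta name="robots" content="noindex">` in its head, or an `X-Robots-Tag: noindex` response header. A rough Python check for a single URL could look something like this (just a sketch; the function name is mine and it ignores edge cases like crawler-specific tags):

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def has_noindex(url: str) -> bool:
    """Rough check: does this page ask search engines not to index it?"""
    resp = requests.get(url, timeout=10)

    # Case 1: noindex sent as an HTTP response header
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return True

    # Case 2: noindex in a robots meta tag in the HTML
    soup = BeautifulSoup(resp.text, "html.parser")
    for meta in soup.find_all("meta", attrs={"name": "robots"}):
        if "noindex" in meta.get("content", "").lower():
            return True

    return False

print(has_noindex("https://example.com"))
```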
indeed - but the question is, do you have the list of URLs already?
Sure, I could provide a list of URLs, but the question is more general and not about a specific list; it could be any websites. I am just asking whether anyone has found prompts or other ways to find websites with the noindex tag in a scalable way.
so you want to find websites with a noindex tag, or you want to find which pages of a website have a noindex tag? Finding websites with a noindex tag (site-wide) is pretty much impossible unless you scrape every website in the world 🙂
To clarify: given a set of URLs (in most scenarios this would be the start page), is there a smooth way to check whether the HTML code for a given URL has a noindex tag? Meaning, if I could find a reliable way to detect whether noindex is present for a given set of URLs, I would be more than happy. It does not matter whether other pages on the website contain the noindex tag or not 🙂
yes, given a set of pages you can run a simple URL status check via Screaming Frog
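If you'd rather script that than run Screaming Frog, a quick sketch over a plain list of URLs might look like this (the `urls.txt` filename is just an assumption, and the substring check is deliberately crude):

```python
import requests

# Hypothetical input file: one URL per line
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        resp = requests.get(url, timeout=10)
        noindex = (
            "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
            or "noindex" in resp.text.lower()  # crude: also matches 'noindex' in page copy
        )
        print(f"{url}\t{resp.status_code}\t{'noindex' if noindex else 'indexable'}")
    except requests.RequestException as exc:
        print(f"{url}\tERROR\t{exc}")
```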
Hey, thanks for reaching out! To find websites containing a noindex tag, you can use Google search operators in a Clay table. Here’s how you can set it up:
1. Create a new table in Clay.
2. Use Google search with the following operator: `filetype:txt inurl:"/robots.txt" -inurl:www`. This will help you find websites with a robots.txt file, where you can look for noindex tags.
Since Google search results are limited to 300 entries, if you need more:
• Copy the search URL (from Google when you run the search) and add pagination by modifying the URL with page numbers (ask ChatGPT if you need help here).
• Create a column in your Clay table with the part of the Google search URL that comes before the page number, then add another column for the page numbers, and use a formula column to combine the two (there’s a rough sketch of this below).
Next, use the Scrape Website function:
• Use a formula to extract the URLs from the scraped results. You can learn more about formulas here: AI Formulas Guide.
• Create a prompt to extract the URLs, separating them with commas.
• Use regex to extract values with a pattern like `https://[^\s,"]+`. You can check out this documentation on value extraction.
Finally, use Write to Table to save the extracted URLs as individual rows in a new table. For more on this, check out this video: Write to Table Video. Let me know if you need further assistance! 😊
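As a rough illustration of the pagination and regex steps above (the `start=` offset, the range of pages, and the exact query string shape are my assumptions, not Clay specifics):

```python
import re

# Base search URL copied from the browser after running the operator above
# (your exact query string will differ; this is just the shape of it)
base_url = 'https://www.google.com/search?q=filetype:txt+inurl:"/robots.txt"+-inurl:www'

# Google paginates with a start offset of 10 results per page
paginated_urls = [f"{base_url}&start={page * 10}" for page in range(10)]

# After scraping each results page, pull the URLs out with the regex from above
url_pattern = re.compile(r'https://[^\s,"]+')

scraped_text = "..."  # placeholder for the scraped page content
found_urls = url_pattern.findall(scraped_text)
print(found_urls)
```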
Bo, thanks for the update. Is there any other way to do this for a given set of companies (with URLs for their domains)? Your approach involves quite a lot of manual labor and is not tailored to a given set of companies. Trying your approach, I get websites of huge companies such as H&M and Samsung; these would never be potential prospects, so the approach is not viable. My main goal is to find prospects in Clay and then, for their domain (e.g., testsite.com), get an answer as to whether they have a noindex tag in their robots file (if they have a robots file at all).
Gotcha! In this case, here’s a more streamlined approach to checking whether a set of companies (with domain URLs) have a noindex tag in their robots.txt file:
1. Build your company list: start with the Clay Find Companies tutorial here to compile a list of companies with their domain URLs.
2. Create a formula column: add a new formula column that appends robots.txt to the end of each URL. The formula prompt can be as simple as: "Add /robots.txt at the end of this domain /domain", where you link the column that contains the domains after the forward slash. This will generate the full path to the robots.txt file for each domain.
3. Check if the URL is valid: use Clay’s Check if URL is Valid to ensure that the robots.txt file exists for each domain.
4. Scrape the robots.txt file: once you’ve verified the valid URLs, use the Clay Scrapers tool here to scrape the contents of each robots.txt file.
5. Use a conditional formula: to avoid scraping invalid URLs, apply a conditional formula so the scraper only runs on valid URLs. Check out the guide here to set this up.
6. Check for noindex: create a new AI formula to check whether the scraped content contains the noindex tag. The formula can look for the keyword “noindex” in the scraped text and return “yes” or “no” based on its presence (see the sketch below for the same logic in code).
This should help automate the process and target the correct domains more efficiently. Let me know if you need further clarification or additional steps! 😊
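Outside of Clay, the same flow in plain Python would look roughly like this (just a sketch of the steps above; the helper name, the bare-domain input, and the plain substring check are my simplifications):

```python
import requests

def robots_txt_has_noindex(domain: str) -> str:
    """Steps 2-6 above: build the robots.txt URL, check it exists,
    fetch it, and look for the keyword 'noindex' in its contents."""
    url = f"https://{domain.strip().rstrip('/')}/robots.txt"
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return "no robots.txt reachable"
    if resp.status_code != 200:
        return "no robots.txt"
    return "yes" if "noindex" in resp.text.lower() else "no"

# testsite.com is the example domain from the message above
for domain in ["testsite.com", "example.com"]:
    print(domain, robots_txt_has_noindex(domain))
```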
We haven't heard back from you in a bit, so we're going to go ahead and close things out here - feel free to let us know if you still need something!