Hey everyone, is it possible to scrape PDFs inside Clay? Or do you know a tool that can do PDF scraping that I can connect with Clay?
Hey Daniel, thanks for reaching out! There are a few possible approaches, like scraping a list of websites or PDF URLs if those are available. From those scraped results, ChatGPT could potentially read or infer information from the text, and that response can then be exported into your email sequence. Here's a video that might help, let us know!
Of course, if anyone else has used a different approach, feel free to recommend other tools or methods for this.
Thanks Arturo for your response. But how can ChatGPT read PDFs and extract data? I don't see how this would work.
Say Clay finds a URL to a PDF, e.g. www.test.com/example.pdf, and I want to find the keyword "example" in that PDF. In this case, a tool needs to scrape the PDF and find the keyword in the file behind that URL. How does this work, or is it even possible?
Sorry for the delay, Daniel! A few calls in between. Our Scrape Website or Zenrows integrations should be able to scrape the PDF URL, and the body of text they return is what ChatGPT could potentially read and extract data from. Alternatively, Claygent may be able to do both of these actions in one go.
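If it helps to see the idea outside of Clay, here's a rough sketch of what "scrape the PDF URL and check for a keyword" looks like in Python. This is just an illustration of the mechanics, not what Clay or Zenrows do internally, and it assumes the PDF is publicly downloadable and contains real text rather than a scanned image (the requests and pypdf libraries are one possible choice):

```python
# Minimal sketch: download a PDF from a URL and check whether a keyword
# appears in its text. Assumes the PDF is publicly accessible and text-based.
import io

import requests
from pypdf import PdfReader

def pdf_contains_keyword(pdf_url: str, keyword: str) -> bool:
    # Download the raw PDF bytes.
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()

    # Parse the PDF and pull the text out of every page.
    reader = PdfReader(io.BytesIO(response.content))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Case-insensitive keyword check.
    return keyword.lower() in text.lower()

# Example with the (hypothetical) URL from this thread.
print(pdf_contains_keyword("https://www.test.com/example.pdf", "example"))
```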
Ok, can you show me how? We couldn't figure out how it works.
Sure, here's a video with a different example of how to use Scrape Website, but the premise should be similar: https://www.clay.com/learn/enriching-a-companys-linkedin-profile-for-tons-of-data
Thanks Arturo. But the example shows how to scrape for information on websites, which is an easy task. The video doesn't explain how to scrape information that is hidden inside a file hosted on the website, which in my case means PDFs. So, again: how can I scrape PDFs with Clay, or is this even possible? To stay with your career example: on the career page there is a PDF to download. The URL is company.com/jobs. On this page there is a button, and if you click on it, you can download a PDF with a job description. I want to know if the word "remote" can be found within this PDF file. Do you know what I mean? 😉
Yep, I definitely understand the idea. Some sites, or even a Google search, could return a PDF URL, which could then be scraped. If it's behind a clickable wall, I think that's where it gets a bit more complicated, but maybe Claygent would be able to do such tasks altogether.
Ok, now we are slowly getting there. We have a PDF URL, yes. So HOW can we now scrape the PDF behind this URL? I need to know HOW, Arturo. Thx!
Yup, once you set up the Scrape Website integration, you give it that URL and it scrapes it. I'm currently testing a few examples to see if it works, but it's not very promising. However, it looks like ChatGPT is already capable of reading these URLs directly: since they're treated as files and not exactly a website, it's not trying to "browse" them. Hope this helps!
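If the built-in integrations don't work out, another option (hypothetical, not something Clay provides out of the box) is to host a tiny service yourself that does the PDF keyword check and call it from an HTTP API column in Clay. A minimal sketch with FastAPI and pypdf, assuming you have somewhere to deploy it:

```python
# Hypothetical micro-service you could host and call from Clay via HTTP:
# POST a PDF URL plus a keyword, get back whether the keyword was found.
import io

import requests
from fastapi import FastAPI
from pydantic import BaseModel
from pypdf import PdfReader

app = FastAPI()

class PdfCheckRequest(BaseModel):
    pdf_url: str
    keyword: str

@app.post("/check-pdf")
def check_pdf(req: PdfCheckRequest) -> dict:
    # Download the PDF and extract its text.
    response = requests.get(req.pdf_url, timeout=30)
    response.raise_for_status()
    reader = PdfReader(io.BytesIO(response.content))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Return a simple JSON payload that can be mapped into a table column.
    return {"keyword": req.keyword, "found": req.keyword.lower() in text.lower()}
```

For the career-page example, you'd send the job-description PDF URL with the keyword "remote" and map the "found" field back into your table.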
Maybe it comes down to the prompt? I tried this one and it found it for me.
Here's the prompt I used: "navigate to this site /website and go to their menu section, determine if there's a URL that redirects to a PDF menu, and return that value. Otherwise, search google to try to find it."