PDF Scraping in Clay: How to Scrape PDFs and Connect with External Tools

Daniel V. · 2023-12-18T10:13:12.819Z

Hey everyone, is it possible to scrape PDFs inside Clay? Or do you know of a tool that can do PDF scraping and that I can connect to Clay?

Clay Team
Hey Daniel, thanks for reaching out! There are a few possible approaches, such as scraping a list of websites or PDF URLs, if available. ChatGPT could then read or infer information from the scraped text, and its response can be exported into your email sequence. Here's a video that might help. Let us know!
Clay Team
Of course, if anyone else has used a different approach, let's see if a few people can recommend a few other tools or methods for this.
Daniel V.
Thanks, Arturo, for your response. But how can ChatGPT read PDFs and extract data? I don't see how this would work.
Daniel V.
Say Clay finds a URL that points to a PDF, for example www.test.com/example.pdf, and I want to find the keyword "example" in that PDF. In this case, a tool needs to scrape the PDF at that URL and search its text for the keyword. How does this work, or is it even possible?
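The keyword check described here can be sketched in Python. This is only an illustration, assuming the PDF's text has already been extracted by some scraper or PDF library; the function name `contains_keyword` and the sample text are hypothetical and not part of Clay.

```python
import re


def contains_keyword(text: str, keyword: str) -> bool:
    """Return True if `keyword` occurs in `text` as a whole word, ignoring case."""
    # re.escape keeps characters such as "." or "+" in the keyword literal.
    # The lookarounds reject matches embedded in a longer word.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical text, standing in for whatever a scraper returned
# for www.test.com/example.pdf.
pdf_text = "This is an Example document.\nIt has two lines."
print(contains_keyword(pdf_text, "example"))
print(contains_keyword(pdf_text, "remote"))
```

Matching whole words avoids false hits such as "example" inside "counterexample"; drop the lookarounds if substring matches are acceptable.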
Daniel V.
Hey Arturo O.: Can you help me with this issue or do we have to search for an independent PDF scraping service that can solve this issue for us? Thank you!
Clay Team
Sorry for the delay, Daniel! I had a few calls in between. Our Scrape Website or ZenRows integrations should be able to scrape the PDF URL, and the body of text they return is what ChatGPT could potentially read and extract data from. Alternatively, Claygent may be able to do both of these actions in one go.
Daniel V.
Ok, can you show me how? We couldn't figure out how it works.
Clay Team
Sure, here's a video with a different example and how to use Scrape Website but the premise should be similar: https://www.clay.com/learn/enriching-a-companys-linkedin-profile-for-tons-of-data
Daniel V.
Thanks, Arturo. But the example shows how to scrape information from websites, which is an easy task. The video doesn't explain how to scrape information from a file that is hosted on the website, which in my case means PDFs. So, again: how can I scrape PDFs with Clay, or is this even possible? To stay with your career example: the career page has a PDF to download. The page URL is company.com/jobs. On this page is a button. If you click it, you download a PDF with a job description. I want to know whether the word "remote" appears in this PDF file. Do you know what I mean? 😉
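One way to get from a page like company.com/jobs to the PDF URL behind its button is to look for links whose path ends in .pdf. The sketch below is a rough illustration using only the Python standard library. It assumes the button is a plain `<a href="...">` link in the page HTML (a button that triggers the download through JavaScript would not be found this way), and the names `PdfLinkParser`, `find_pdf_links`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so a query string like ?v=2 does not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


# Hypothetical career page markup with a download button and an unrelated link.
sample_html = (
    '<a class="button" href="/files/job-description.pdf?v=2">Download</a>'
    '<a href="/about">About</a>'
)
print(find_pdf_links(sample_html, "https://company.com/jobs"))
```

The resulting URLs could then be passed to whichever scraper or PDF text extractor is being used.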
Clay Team
Yep, I definitely understand the idea. Some sites, or even a Google Search, could return a PDF URL, which could then be scraped. If the PDF is behind a clickable wall, that's where it gets a bit more complicated, but maybe Claygent would be able to handle such tasks altogether.
Daniel V.
Ok - now we are slowly getting there. We have a PDF URL, yes. So HOW can we now scrape the PDF behind this URL? I need to know HOW, Arturo. Thx!
Clay Team
Yup, once you set up Scrape Website, you give it that URL and it scrapes it. I'm currently testing a few examples to see if it works, but the results aren't very promising. However, it looks like ChatGPT is already capable of reading these URLs directly: since they're considered files and not exactly websites, it isn't trying to "browse" them. Hope this helps!
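Because a PDF URL returns a file rather than an HTML page, it can help to check what a URL actually returned before deciding how to process it. The following is a minimal sketch under stated assumptions: the response body and Content-Type header have already been fetched by some other tool, and the function name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(body: bytes, content_type: str = "") -> bool:
    """Guess whether a response is a PDF from its Content-Type or its first bytes."""
    # Servers usually send "application/pdf", sometimes with parameters after ";".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # PDF files begin with the marker "%PDF-"; tolerate leading whitespace.
    return body.lstrip().startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"<html></html>", "text/html"))
```

A response that passes this check needs a PDF text extractor rather than an HTML scraper; one that fails it can be treated as a normal web page.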
Marvin K.
Arturo O., I tried to scrape restaurant menus that are in PDFs with Claygent, but it didn't work. See the error message:
During the visit to the website, the menu was located within a PDF document which could not be parsed for its contents. Access to the PDF menu can be found here: MENU2023
Clay Team
Hey Marvin K.! I'll take a look
Clay Team
Maybe it comes down to the prompt? I tried this one and it found it for me.
[Image attachment: 02d6f07e-0f62-4c01-b3f8-d23a343c4e80.png (48 kB)]