PDF Scraping in Clay: How to Scrape PDFs and Connect with External Tools

Hey everyone, is there a possibility to scrape PDFs inside Clay? Or do you know a tool that can do PDF scraping that I can connect with Clay?

  • Clay Team
    APP

    Hey Daniel, thanks for reaching out! There may be a few approaches for this, such as scraping a list of websites or PDF URLs, if available. ChatGPT could then potentially read or infer information from the scraped text, and its response can be exported into your email sequence. Here's a video that might help. Let us know!

  • Clay Team
    APP

    Of course, if anyone else has used a different approach, feel free to recommend other tools or methods for this.

  • Daniel V.

    Thanks, Arturo, for your response. But how can ChatGPT read PDFs and extract data? I don't see how this would work.

  • Daniel V.

    Say Clay finds a URL that points to a PDF, for example www.test.com/example.pdf, and I want to find the keyword "example" in that PDF. In that case some tool has to scrape the PDF at that URL and search its text for the keyword. How does this work, or is it even possible?

  • Daniel V.

    Hey Arturo O.: Can you help me with this issue or do we have to search for an independent PDF scraping service that can solve this issue for us? Thank you!

  • Clay Team
    APP

    Sorry for the delay, Daniel! I had a few calls in between. Our Scrape Website or ZenRows integrations should be able to scrape the PDF URL, and the body of text they return is what ChatGPT could potentially read and extract data from. Alternatively, Claygent may be able to do both of these actions in one go.
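
For readers who want to try the same two steps outside Clay (get the text out of the PDF, then look for the keyword), here is a minimal sketch. It is not how Clay's integrations work internally. It assumes the third-party pypdf package is installed for text extraction; the function names are hypothetical, and scanned, image-only PDFs would need OCR instead.

```python
import io
import re
import urllib.request


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Return the text layer of a PDF held in memory.

    Assumption: the third-party pypdf package is installed (pip install pypdf).
    """
    from pypdf import PdfReader  # imported lazily so contains_keyword works without it

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may return an empty string for pages with no text layer
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive whole-word search for keyword in text."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def keyword_in_pdf_url(url: str, keyword: str) -> bool:
    """Download the PDF at url and report whether keyword appears in its text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        pdf_bytes = response.read()
    return contains_keyword(extract_pdf_text(pdf_bytes), keyword)
```

With the placeholder URL from the question, a call would look like `keyword_in_pdf_url("https://www.test.com/example.pdf", "example")`. Note that the URL needs an explicit scheme such as `https://`.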

  • Daniel V.

    Ok, can you show me how? We couldn't figure out how it works.

  • Clay Team
    APP

    Sure, here's a video that uses a different example to show how Scrape Website works, but the premise should be similar: https://www.clay.com/learn/enriching-a-companys-linkedin-profile-for-tons-of-data

  • Daniel V.

    Thanks, Arturo. But the example shows how to scrape information from websites, which is an easy task. The video doesn't explain how to scrape information from a file that is hosted on the website, in my case a PDF. So, again: how can I scrape PDFs with Clay, or is this even possible? To stay with your career example: the career page, company.com/jobs, has a PDF to download. On this page is a button, and clicking it downloads a PDF with a job description. I want to know whether the word "remote" appears in that PDF file. Do you know what I mean? 😉

  • Clay Team
    APP

    Yep, I definitely understand the idea. Some sites or even a Google Search could return a PDF URL, which could then be scraped. If it's behind a clickable wall, I think that's where it gets a bit more complicated, but maybe Claygent would be able to handle such tasks altogether.

  • Daniel V.

    Ok, now we are slowly getting there. We have a PDF URL, yes. So HOW can we now scrape the PDF behind this URL? I need to know HOW, Arturo. Thx!

  • Clay Team
    APP

    Yup, once you set up Scrape Website, you give it that URL and it scrapes it. I'm currently testing a few examples to see if it works, but the results are not very promising. However, it looks like ChatGPT is already capable of reading these URLs directly: since they're considered files and not exactly websites, it doesn't try to "browse" them. Hope this helps!
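
The file-versus-website distinction can be checked directly. The sketch below is an illustration only, with a hypothetical function name, and says nothing about how Scrape Website or ChatGPT decide internally. It relies on two general facts: PDF files start with the marker `%PDF-`, and servers usually label them with the media type `application/pdf`.

```python
def looks_like_pdf(body: bytes, content_type: str = "") -> bool:
    """Guess whether an HTTP response is a PDF file rather than an HTML page."""
    # PDF files begin with "%PDF-"; a few have stray bytes in front of it,
    # so look within the first 1024 bytes instead of only at offset 0.
    if b"%PDF-" in body[:1024]:
        return True
    # Fall back to the Content-Type header, ignoring parameters such as charset.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```

A scraper that expects HTML will typically return little or nothing useful for a response where this check is true, which is one plausible reason for poor results on PDF URLs.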

  • Marvin K.

    Arturo O., I tried to scrape restaurant menus that are in PDFs with Claygent, but it didn't work. See the error message:

    During the visit to the website, the menu was located within a PDF document which could not be parsed for its contents. Access to the PDF menu can be found here: MENU2023

  • Clay Team
    APP

    Hey Marvin K.! I'll take a look

  • Clay Team
    APP

    Maybe it comes down to the prompt? I tried this one and it found it for me.

    [Image attachment: 02d6f07e-0f62-4c01-b3f8-d23a343c4e80.png (48 kB)]

  • Clay Team
    APP

    Here's the prompt I used: "navigate to this site /website and go to their menu section, determine if there's a URL that redirects to a PDF menu, and return that value. Otherwise, search google to try to find it."
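
The link-finding step that this prompt asks Claygent to do can be approximated in plain code. The sketch below is not Claygent's method, and the class and function names are hypothetical. It uses only the Python standard library to collect links on a page whose path ends in `.pdf`. It would miss PDFs that are only reachable through a JavaScript-driven button.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links such as /files/menu.pdf against the page URL
        absolute = urljoin(self.base_url, href)
        # Compare only the path so a query string like ?v=2 does not hide the extension
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return every PDF link found in html, in document order."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    return finder.pdf_links
```

Given the HTML of a page such as company.com/jobs, `find_pdf_links(html, "https://company.com/jobs")` returns the candidate PDF URLs, which can then be passed on to whatever tool extracts the text.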

  • Marvin K.

    Will try it, thanks Arturo O.!