I'm trying to scrape https://projects.propublica.org/nonprofits/search for a specific list of target companies. How can I put a URL or company name in their search box, and then grab specific information that is returned in a search (usually a link to a PDF)?
You've caught us outside of our support hours (9am-9pm EST), but don't worry - we'll be back in touch within 24 hours (often sooner!). If you haven't already, please include the URL of your table in the thread below so that we can help you as quickly as possible!
Hey there! Happy to provide a few suggestions. I assume you have a list of company names and want to run this programmatically? That sounds like a job for an RPA tool such as UiPath (I believe they have a generous free tier), Robomotion, or Bardeen, especially since some of the PDFs seem to sit behind a CAPTCHA. Of these, I have only used Robomotion (low-code) and Bardeen (no-code). The learning curve is steep, at least for Robomotion, so it may make sense to check Upwork or Fiverr for someone who can set this up for you. If you instead want to scrape all companies, links, and meta-information, you could try a visual web scraping tool such as ScrapeStorm or Octoparse. Again, a freelancer may be the better route depending on how keen you are to learn those tools :) Hope this was useful!
https://app.clay.com/shared-table/share_Jy6cH8EMT8f6 Daniel K. Is the table good?
Hello Brooks! You can use Claygent for this. To simulate typing text in the search box, you can add the search term to the URL as shown here: https://downloads.intercomcdn.com/i/o/w28k1kwz/1223360845/1f8d592a6c942a92caf5f862181d/CleanShot+2024-10-21+at+10_43_48.gif?expires=1729523700&signature=1458febeb240fdf52e18fbcc08b1ee53baa369c31faf3eb26d8403d3f72cc412&req=dSIlFcp4nYlbXPMW1HO4zVqeed9ViUadRYVgVVBCYaaEQQ9C0CF1Qmajwq2d%0A%2BxKt%0A We can also see that switching to the 'filings' tab changes the URL: https://downloads.intercomcdn.com/i/o/w28k1kwz/1223366795/2a55ec861cb2e1a623a330db79d7/CleanShot+2024-10-21+at+10_48_04.gif?expires=1729523700&signature=c1fd65439b04e133809d6cb5a2ea79fbb85a0fa972ef42915513f71f9fca9b9b&req=dSIlFcp4m4ZWXPMW1HO4zdLyThBlB%2ByKv1gvrvN%2FPwdDAf8uR2VP3AFe6%2BhY%0AH1gB%0A So you can pass Claygent the URL that lands closest to where the PDFs are, and it should be able to extract the text that you are looking for.
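For anyone who wants to prepare those URLs outside the table, here is a minimal Python sketch of the "put the search term in the URL" idea. It only builds the URL strings and does not fetch anything. The query parameter name `q` is an assumption (the GIF above shows the real URL pattern, so check it in your browser's address bar), and the company names and the `build_search_url` helper are hypothetical.

```python
from urllib.parse import urlencode

# Search page from the original question.
SEARCH_URL = "https://projects.propublica.org/nonprofits/search"


def build_search_url(company_name: str, param: str = "q") -> str:
    """Return a search URL with the company name URL-encoded into the query string.

    `param` defaults to "q", which is an assumed parameter name, not a
    confirmed one; change it if the site uses something different.
    """
    query = urlencode({param: company_name.strip()})
    return f"{SEARCH_URL}?{query}"


if __name__ == "__main__":
    # Hypothetical target list; replace with your own company names.
    for name in ["Example Foundation", "Sample Charity & Trust"]:
        print(build_search_url(name))
```

Each resulting URL could then go into a column in your table and be handed to Claygent, one row per company. `urlencode` takes care of spaces and characters like `&` in company names, which would otherwise break the query string.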
We haven't heard back from you in a bit, so we're going to go ahead and close things out here - feel free to let us know if you still need something!