Recommendations for Web Scraping and Data Accuracy in Healthcare

Clay S. I’m using Claygent to deal with pagination, and I know this is not ideal. I’m trying to scrape a database that spans multiple pages (for example, https://connect.health.gov.bc.ca/assisted-living-residence ). Any recommendations on the prompt or the method? I also noticed that some of the data I get from AI scrapers is not accurate. What is the best way to find accurate data with a web scraper, or to verify it? For example, I’m in the healthcare niche, and instead of pulling contact information for a care home in Canada, the scraper returns information for a facility with a similar name in the United States. Is there any way to avoid this?

  • Channeled (APP)

    When dealing with web scraping, especially for healthcare data, here's a strategic approach to ensure accuracy:

    1. First, evaluate if web scraping is the best approach for your needs:
      • Consider if the data is available through other Clay enrichment methods
      • Determine if web scraping will provide more reliable results for your specific use case
    2. For healthcare data accuracy:
      • Use multiple Apify actors in combination to cross-validate the data
      • Create comprehensive datasets by combining different data sources
      • Implement proper data validation and verification steps
    3. To improve location-specific accuracy:
      • Be explicit in your search parameters
      • Use location-specific identifiers in your queries
      • Cross-reference data points across multiple sources

    Remember to approach web scraping responsibly and ethically, ensuring compliance with website terms of service and data protection regulations.

    Sources:
      • Intro to Web Scraping [Limitless Research] - Clay University
      • Deep Dive: Apify Actors [Limitless Research] - Clay University
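
As an illustration of the "location-specific identifiers" point, one possible verification step is to check whether a scraped postal code looks Canadian or American before trusting the record. The sketch below is only a loose format check, not a lookup against real postal data, and the `postal_code` field name is an assumption about how the scraped rows are shaped.

```python
import re

# Canadian postal codes look like "A1A 1A1" (letter-digit-letter, digit-letter-digit).
# This is a loose shape check; it does not verify that the code actually exists.
CA_POSTAL = re.compile(r"^[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$")
# US ZIP codes are five digits, optionally followed by "-" and four more digits.
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")


def guess_country(postal_code: str) -> str:
    """Return "CA", "US", or "unknown" based only on the postal code's shape."""
    code = (postal_code or "").strip()
    if CA_POSTAL.match(code):
        return "CA"
    if US_ZIP.match(code):
        return "US"
    return "unknown"


def keep_canadian(records):
    """Keep only records whose (hypothetical) postal_code field looks Canadian."""
    return [r for r in records if guess_country(r.get("postal_code") or "") == "CA"]
```

Rows that come back as "US" or "unknown" would be candidates for a manual check or a second enrichment pass rather than automatic deletion.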

  • Edward Z.

    Talk to a real person

  • Channeled (APP)

    Looks like you're still looking for help. We're currently outside of our standard office hours (9 AM to 9 PM EST, Monday through Friday), so it may take a little longer for us to respond. If you’re dealing with a specific table, drop the URL below so we can help you quicker. Otherwise, someone from our team will be in touch soon!

  • Channeled (APP)

    Did that answer your question?

  • Owen C.

    Hey Edward, thanks for reaching out. The pagination challenge is real and shows up across many website scrapers. It may be worth testing a website crawler from Apify, such as the Website Content Crawler. A prompt can only handle so much, especially for tasks like clicking to the next page, collecting all of that data, and clicking again. To get the most accurate data from Claygent specifically, it helps to provide as much detailed context and as many explicit steps as possible for the task you're trying to perform. Additionally, in this case it seems we know the location, or at least the country, of each healthcare facility. That location could be another data point to add to the prompt (e.g. use this company name and its location to find the website), depending on how your prompt is set up. Speaking of which, do you mind sending the link (URL) to the table so we can take a closer look and find some workarounds? https://downloads.intercomcdn.com/i/o/w28k1kwz/1347419411/65baf32e1d2a789ffba395c6d35c/CleanShot+2025-01-22+at+_42I1FzxXHf%402x.png?expires=1738701000&signature=9ebd3df9c9d2a4c6dca8d5d0cb221e6c2c1abe6a130b7b5271cce78c3d339c48&req=dSMjEc1%2FlIVeWPMW1HO4zVws4pkwDVWogvYjA82bedBmYA%3D%3D%0A
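
The "click next page, collect, click again" loop described above is easier to reason about as code than as a prompt. Here is a minimal, generic sketch of that loop. It does not reflect how Claygent or any Apify actor works internally, and `fetch_page` is a hypothetical callable you would supply (for example, a function that requests page N of a directory and returns its rows).

```python
from typing import Callable, Dict, Iterator, List


def scrape_all_pages(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: int = 500,
) -> Iterator[Dict]:
    """Yield rows page by page until a page comes back empty.

    fetch_page is a hypothetical function: given a 1-based page number,
    it returns the list of rows found on that page (empty when there are none).
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:
            break  # an empty page is treated as the end of the listing
        yield from rows
```

A quick way to try it without touching a real site is to pass a fake fetcher backed by a dict, e.g. `list(scrape_all_pages(lambda p: {1: [{"n": 1}], 2: [{"n": 2}]}.get(p, [])))`.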

  • Edward Z.

    Here’s the link

  • LuisArturo

    Hey there Edward, jumping in for Owen here. To clarify, these are all in Canada, correct? As Owen mentioned, it would help to have the location of each facility to include in the prompt. Since we know that all of these are in Canada, we can create a formula column in your table with the following formula. https://downloads.intercomcdn.com/i/o/w28k1kwz/1365484634/d8b0f2eff19090ddfaaa867afaca/image.png?expires=1738712700&signature=067bea0b74f09283b8f0fa3e8002e338c6b16507343ed304f2fac0b55a4a995e&req=dSMhE812mYdcXfMW1HO4zQ6COVFis6VIZdVG%2F9wPna4k68Mi%2F%2F%2BRc7TmPVnG%0AN1NR%0A This gives each row a country value, which helps the AI focus on just Canada. Afterwards, we can use another formula to combine the "Facility Address" column, the "Facility City" column, and the country formula above into a single address column to include in the prompt. From there, we can also use the AI prompt helper to rewrite the prompt into an input that the AI integration can better read and execute. You can find the AI prompt helper by selecting the "Help Me" option in the bottom right of the prompt menu. https://downloads.intercomcdn.com/i/o/w28k1kwz/1365488580/529120dd9fd5d4b4bbbaa45e3ee1/image.png?expires=1738712700&signature=3cceb41b8506ba5d79620c71a174f079633fb5a08543bc8b2efbd6204bb0fa0d&req=dSMhE812lYRXWfMW1HO4zaKNUzNKeD1VAj%2F3v2QFPFd2GEy23FQTDcKemxVE%0AUstM%0A
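
For readers who want to see the combining step outside of Clay, here is a rough Python equivalent of what the two formula columns do. It is not the Clay formula from the screenshot. The "Facility Address" and "Facility City" column names come from the message above, while "Facility Name" and the prompt wording are assumptions made for illustration.

```python
def full_address(row: dict, country: str = "Canada") -> str:
    """Join address, city, and a fixed country into one string, skipping blanks."""
    parts = [row.get("Facility Address"), row.get("Facility City"), country]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def build_prompt(row: dict) -> str:
    """Build a location-qualified prompt (hypothetical wording and column name)."""
    return (
        f'Find the official website for "{row["Facility Name"]}" '
        f"located at {full_address(row)}. "
        "Ignore organizations with similar names in other countries."
    )
```

Putting the country directly next to the street address gives the model a concrete anchor, which is the point of the formula column described above.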

  • Edward Z.

    Hi team, that is a very impactful refinement. Thanks for the insight; I believe this will already help. One other thing I’m noticing is that my column count is getting pretty high and the database is getting messier. Do you have any suggestions for cleaning it up and organizing it?

  • LuisArturo

    Hey there Edward, one suggestion for cleaning up your table is to check whether there are any columns you can remove because they are not needed. If this is solely because you want the database to look cleaner, my other suggestion is to use formulas to combine any inputs that can be combined, just like we did above with address and city, and then hide the original columns so that they no longer clutter the table.

  • Edward Z.

    I acknowledge your suggestion. I believe the table is disorganized because I have both company data and personal data in one table. In addition, I was trying to extract more people data from the organizations. The original data was extracted from directories or Claygent. Is there any way to merge the two databases of people data?

  • LuisArturo

    Hey there Edward, we do have a method to merge two different tables. The following Loom video shows how this can be done. https://www.loom.com/share/1ae0bfb385094e9d98a8e22f7ed99c40

  • Edward Z.

    Much appreciated. I hope to become an intermediate user soon.

  • Channeled (APP)

    Hi Edward Z.! This thread was recently closed by our Support team. If you have a moment, please share your feedback:

  • Channeled (APP)

    Thanks! We've reopened this thread. You can continue to add more detail directly in this thread.

  • Edward Z.

    Hi, I have another issue that came up: finding the LinkedIn URL or personal URL info (the corner puzzle piece) for people.

  • Edward Z.

    Not a lot has come up for the people I’m looking for: https://app.clay.com/workspaces/464956/workbooks/wb_FjUdcBWYPoDQ/tables/t_8Skrb3pypgss/views/gv_K5pEiJWkzn9R Are there other effective ways to find either their emails or phone numbers?

  • Bo (.

    Hey! The ideal workflow for mobile is:

    1. LinkedIn URL -> Mobile Waterfall

    The ideal workflow for work email is:

    1. Full name, Domain -> Work Email

    Since you already have some data from Apollo, the best approach is to write it to another table, combine it with data from Claygent and Google Find Domain, and then do the enrichments from there. Example: Apollo + Claygent + Google Domains -> Write to Other table -> Enrich there

    A couple of things to note:

    1. You're getting heavily rate-limited due to your API tier. Upgrading might help, but Claygent is the best option here.
    2. Use the "Find Domain" integration from Google; it's free. Once you get the domain, you can pair it with the full name to find work emails.
    3. After getting LinkedIn profile URLs, you can run Enrich Mobile Phone or Enrich Contacts to pull more contact details.

    Let me know if you need help setting this up! 🚀

  • Edward Z.

    Hey Bo, much appreciated. I have the following questions:

    1. I can find LinkedIn profiles for only 300 of my 1,800 contacts. Is there any other way to enrich email and phone data for the rest?

    2. How does Dedupe work? I used it, but I'm not quite clear whether it is working: https://app.clay.com/workspaces/464956/workbooks/wb_FjUdcBWYPoDQ/tables/t_8Skrb3pypgss/views/gv_K5pEiJWkzn9R

    3. Regarding the ideal workflow for work email (Full name, Domain -> Work Email): does "domain" mean the company domain? The tricky part is that I have both the Organization and the Facility in my database, and one organization can own several facilities. How would you solve for work email and contact in this situation?

    4. Regarding 'Use the "Find Domain" integration from Google; it's free. Once you get the domain, you can pair it with full name to find work emails.': on my side it's asking for 1 credit. Is there something I'm not paying attention to?

    5. I want to hide repeated columns (or rows with repeated data) temporarily, enrich the remaining unique rows, and then propagate the enriched data back to all the duplicates. Is there a way to do this?
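
The "enrich unique rows once, then copy the result back to every duplicate" idea in the last question can be sketched in a few lines of Python. This only illustrates the logic and is not a Clay feature. `enrich` is a hypothetical function standing in for whatever paid lookup you want to avoid repeating, and `key_fields` are whichever columns define a duplicate in your table.

```python
def enrich_unique(rows, key_fields, enrich):
    """Call enrich() once per unique key and copy its result onto every matching row.

    rows:       list of dicts (one per table row)
    key_fields: column names that together identify a duplicate
    enrich:     hypothetical callable taking a row and returning a dict of new fields
    """
    cache = {}
    enriched_rows = []
    for row in rows:
        # Normalize the key so "Acme" and "acme " count as the same value.
        key = tuple(str(row.get(field) or "").strip().lower() for field in key_fields)
        if key not in cache:
            cache[key] = enrich(row)  # runs only for the first row with this key
        enriched_rows.append({**row, **cache[key]})
    return enriched_rows
```

Because the cache is keyed on normalized values, the expensive lookup runs once per organization (or per organization plus facility, if both are in `key_fields`), while every original row still receives the enriched fields.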

  • Bo (.

    Hey!

    1. For contacts without LinkedIn, we can try:
      - Use Claygent with this prompt to find domains: "Act as an expert web scraper focused on domain identification. Search for the domain using [Organization Name] and [Facility Address]. Verify through official sites. Output format: example.com"
      - Create a formula to combine first and last names
      - Add the newly found domain and the full name to the work email waterfall, with a condition to only run on rows without results
      - Once you have emails, use the LinkedIn waterfall to find more profiles
    2. For deduping tables, don't use the list deduping you have now. That is for objects that contain lists (they typically have a small arrow). Use the table auto-dedupe instead.
    3. For facilities with shared organizations, use the Claygent approach from step 1 to get accurate domain matches.
    4. My mistake: Clearbit gives free domain lookups, not Google.

    Let me know if you'd like help implementing any of these.
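
The three mechanical pieces of step 1 (a bare domain in `example.com` form, a combined full name, and a "only run where there is no result yet" condition) can be sketched in Python as follows. This is an illustration of the logic rather than Clay formula syntax, and the `Work Email` and `Domain` column names are assumptions.

```python
from urllib.parse import urlparse


def bare_domain(url: str) -> str:
    """Reduce a website URL to a bare domain such as "example.com"."""
    url = (url or "").strip()
    if not url:
        return ""
    if "://" not in url:
        url = "//" + url  # lets urlparse treat the first segment as the host
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def full_name(first: str, last: str) -> str:
    """Combine first and last names, ignoring blanks and stray whitespace."""
    return " ".join(p.strip() for p in (first, last) if p and p.strip())


def needs_work_email(row: dict) -> bool:
    """Run the email lookup only when a domain exists and no email was found yet."""
    return not row.get("Work Email") and bool(row.get("Domain"))
```

Normalizing the domain first also makes later matching easier, since `https://example.com/locations/a` and `www.example.com/locations/a` reduce to the same value.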

  • Channeled (APP)

    Hi Edward Z.! This thread was recently closed by our Support team. If you have a moment, please share your feedback:

  • Channeled (APP)

    Thank you so much for sharing your feedback Edward Z.!

  • Channeled (APP)

    Thanks! We've reopened this thread. You can continue to add more detail directly in this thread.

  • Edward Z.

    Hi Bo. Regarding the following:

    1. Dedupe: I noticed that some people run multiple facilities, so the same first and last names appear more than once. I therefore also combined the names with the facility names using a formula. This seems to work.

    2. Another thing that came up: how do I mass "reset to original value"? I can do it manually, but there are around 1,300 inputs...

  • Edward Z.

    And just curious: if I make a mistake, is there any way to go back?

  • Daniela D.

    Hey Edward! Thanks for reaching out. Happy to help. The quick way to do this is to highlight one cell (by clicking on it), then hold the Shift key and click the last cell at the bottom of the table to highlight all cells in between. Once they are highlighted, press the Delete key on your keyboard to reset them to their original values. Currently, it is not possible to undo a change in a table. The exceptions are "reset to original value" (when a cell has been manually modified) and the "undo delete" row option. For context, Clay closely resembles a spreadsheet, but since the data comes from third parties, imports, or manual inputs, how and when we receive it into a table plays a complex role in that logic. The reason is that we don’t hold the data; it’s just being “parked” here. So when data is deleted or removed, there’s not much we can do other than send a new API call to the provider or re-import the data. Let me know if you have any questions!

  • Channeled (APP)

    This thread was picked up by our in-app web widget and will no longer sync to Slack. If you are the original poster, you can continue this conversation by logging into https://app.clay.com and clicking "Support" in the sidebar. If you're not the original poster and require help from support, please post in 02 Support.

  • Edward Z.

    Hi Clay team, I’m running into a challenge where I have two similar lead lists from different sources, but each contains some unique information that the other doesn’t have. My goal is to merge the data while keeping the most complete and accurate details for each lead. Here’s an example of the issue:

    List 1 (More Contact Details, But Some Gaps)

    Association: BC Care Providers  
    Company: Park Place Seniors Living Inc.  
    Phone: (604) 796-3886  
    Location: Agassiz Seniors Community  
    Address: 1525 MacKay Crescent, Agassiz  
    Industry: Long-Term Care  
    Website: https://parkplaceseniorsliving.com/find-a-location/british-columbia/lower-mainland/agassiz  
    Employee: Cindy Zorn  
    Title: Site Leader  
    Phone: 604-796-3886  
    Email: czorn@ppsl.com  
    Notes: Left Voicemail (Rep: Xander)  

    List 2 (Different Contacts & Additional Info, But Missing Some Fields)

    Company: Park Place Seniors Living Inc.  
    Phone: (604) 796-3886  
    Location: Agassiz Seniors Community  
    Address: 1525 MacKay Crescent, Agassiz  
    Website: www.parkplaceseniorsliving.com/find-a-location/british-columbia/lower-mainland/agassiz  
    Employee: Catherine McColl  
    Title: Food & Beverage Manager  
    Phone: Undefined  
    Email: Undefined  
    Notes: Catherine McColl, Agassiz Seniors Community  

    What I need help with:

      • The best way to merge these records into a single complete entry while ensuring that no data is lost.
      • How to handle duplicate fields intelligently (e.g., when one list has a phone number but the other does not).
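
One way to think about the merge described above: facility-level fields (company, address, website, and so on) are shared by everyone at that facility, so any list can fill gaps in them, while each contact stays on its own row. The sketch below is one possible approach under that assumption, not a Clay feature. Because the example records have two "Phone" fields, the facility one is renamed to a hypothetical `Facility Phone`, and values such as "Undefined" are treated as blank.

```python
FACILITY_FIELDS = ("Association", "Company", "Facility Phone", "Location",
                   "Address", "Industry", "Website")
BLANKS = {"", "undefined", "n/a", "none", "null"}


def is_blank(value) -> bool:
    """Treat None and placeholder strings such as "Undefined" as missing."""
    return value is None or str(value).strip().lower() in BLANKS


def facility_key(row: dict) -> tuple:
    """Identify a facility by normalized company and location names."""
    return tuple(str(row.get(f) or "").strip().lower() for f in ("Company", "Location"))


def merge_lists(*lists):
    """Return one row per contact, with facility fields backfilled across lists."""
    rows = [dict(r) for lst in lists for r in lst]
    # Pass 1: for each facility, remember the first non-blank value of each field.
    best = {}
    for row in rows:
        known = best.setdefault(facility_key(row), {})
        for field in FACILITY_FIELDS:
            if field not in known and not is_blank(row.get(field)):
                known[field] = row[field]
    # Pass 2: blank out placeholders, then fill gaps from the facility's best values.
    merged = []
    for row in rows:
        known = best[facility_key(row)]
        out = {k: (None if is_blank(v) else v) for k, v in row.items()}
        for field in FACILITY_FIELDS:
            if out.get(field) is None:
                out[field] = known.get(field)
        merged.append(out)
    return merged
```

With the two example records, the second contact would inherit the association and industry from the first list, while her own missing phone and email would stay empty instead of being overwritten with another person's details.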

  • Edward Z.

    Another issue: is there any way to normalize the addresses?

  • Edward Z.

    The addresses use "Streets", "St.", "st.", or "st;". One standard format would be great.
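
A small regex-based normalizer is one way to collapse these street-suffix variants into a single form. The sketch below is a starting point under stated assumptions: the target abbreviations ("St", "Ave", "Cres", and so on) and the short suffix table are illustrative choices, and it only rewrites a suffix that sits at the end of the address or right before a comma, so that a leading "St." (as in a saint's name) is left alone.

```python
import re

# Illustrative suffix table; extend it with whatever variants appear in your data.
SUFFIXES = {
    "streets": "St", "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "road": "Rd", "rd": "Rd",
    "drive": "Dr", "dr": "Dr",
    "crescent": "Cres", "cres": "Cres",
}

# Match a suffix word, an optional trailing "." or ";", and require that it is
# followed by a comma or the end of the string.
_SUFFIX_RE = re.compile(
    r"\b(" + "|".join(sorted(SUFFIXES, key=len, reverse=True)) + r")\b[.;]?(?=\s*(?:,|$))",
    re.IGNORECASE,
)


def normalize_address(address: str) -> str:
    """Collapse whitespace and standardize the street suffix."""
    cleaned = re.sub(r"\s+", " ", address).strip()
    return _SUFFIX_RE.sub(lambda m: SUFFIXES[m.group(1).lower()], cleaned)
```

A known limitation of this sketch is that a suffix followed by a unit number without a comma (for example "Main St. Unit 4") is not rewritten.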

  • Edward Z.

    In the previous example, I'd be missing a phone number in the column for rows at the same facility.

  • Edward Z.

    And here, I'd have personal info but no name... Is there any way to mass merge these?