Best Practices for Scraping and Cleaning School Website Data

Pierre L.

Hello guys, I'd like to know what, in your opinion, would be the best practice for my use case.

1. I have a list of school websites.
2. I’m using Claygent to scrape the name of each school's headmaster (mentioned in the footer or found through a Google search).
3. Then I have a lot of “cleaning steps”:

If the result doesn’t have a first name, it fills both the first name and the last name with the full name.
I have to apply a “proper” function to get “Name” instead of “NAME”.
I have to check manually that all the first names are filled, etc.

And I have 5k schools to do. What are the best practices in your opinion? Build a waterfall with all those different steps? Build one big prompt for Claygent to do all of that at once? Build a lot of different formulas? I would love to know what works best for you in this case.

14 comments

Channeled (APP)
For efficiently processing 5,000 school websites using Claygent, here are some best practices:
1. Configure outputs: Define column outputs for each data point (e.g., full name, first name, last name) in the Claygent modal.
2. Create a comprehensive prompt: Instruct Claygent to extract the headmaster's information and perform initial cleaning steps.
3. Use Clay's data cleaning tools: Leverage our native integrations to clean and normalize the data after scraping.
4. Build a waterfall: Create a series of steps to handle different scenarios and cleaning tasks.
5. Utilize AI formulas: Implement additional cleaning and verification steps using AI formulas.
This approach combines Claygent's scraping capabilities with Clay's data cleaning tools for an efficient, automated process.
Sources: Claygent: AI Web Scraper [Automated Outbound] - Clay University; Data Cleaning + Normalization [Automated Outbound] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
kushagra t.
Jumping in here to share my experience; support may be able to answer better. An extensive Claygent prompt with structured JSON output can work. Alternatively, using one Claygent column to get the data and another to process it can be more reliable.
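As a rough illustration of the structured-output idea, here is a minimal Python sketch of how one JSON response per school could be parsed into fixed fields. It is not Clay's formula syntax or a Claygent API: the key names (`full_name`, `first_name`, `last_name`), the `Headmaster` type, and the "NA" placeholder are all assumptions chosen for the example.

```python
import json
from dataclasses import dataclass

PLACEHOLDER = "NA"  # assumed placeholder for unavailable information
FIELDS = ("full_name", "first_name", "last_name")  # hypothetical output keys


@dataclass
class Headmaster:
    full_name: str
    first_name: str
    last_name: str


def parse_agent_output(raw: str) -> Headmaster:
    """Parse one JSON response; missing or blank fields become the placeholder."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    values = {}
    for field in FIELDS:
        value = data.get(field)
        # Only non-empty strings count as real values.
        text = value.strip() if isinstance(value, str) else ""
        values[field] = text or PLACEHOLDER
    return Headmaster(**values)
```

Keeping the extraction output in a fixed shape like this means a second, separate cleaning step always receives the same three fields, whether or not the first step found everything.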
Channeled (APP)
For your task of scraping and cleaning data from 5,000 school websites, here's a recommended approach using Claygent:
1. Configure outputs: Define column outputs in the Claygent modal for each data point (e.g., full name, first name, last name).
2. Create a comprehensive prompt: Instruct Claygent to extract headmaster information and perform initial cleaning steps.
3. Use structured JSON output: This can help organize the data more effectively.
4. Leverage Clay's data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
5. Implement AI formulas: Use these for additional cleaning and verification steps.
6. Consider a two-step process: One Claygent column to get the data, another to process it for increased reliability.
Remember to use the "write to table" feature to save your results efficiently.
Sources: Data Cleaning + Normalization [Clay 101] - Clay University; Claygent: AI Web Scraper [Automated Outbound] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
kushagra t.
I frequently use Claygent to find people and return structured outputs, with safeguards in place to return NA wherever information is not available. This is what the output looks like.
Channeled (APP)
For efficiently scraping and cleaning data from 5,000 school websites using Claygent, here are some best practices:
1. Configure outputs: Define column outputs in the Claygent modal for each data point (e.g., full name, first name, last name).
2. Create a comprehensive prompt: Instruct Claygent to extract headmaster information and perform initial cleaning steps.
3. Use structured JSON output: This can help organize the data more effectively and handle missing information.
4. Implement safeguards: Include instructions in your prompt to return "NA" or a similar placeholder where information is not available.
5. Leverage Clay's data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
6. Use the "write to table" feature to save your results efficiently.
7. Consider a two-step process: One Claygent column to get the data, another to process it for increased reliability and structured outputs.
Sources: Claygent: AI Web Scraper [Automated Outbound] - Clay University; Claygent: AI Web Scraper [CRM Enrichment] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
Pierre L.
Thanks kushagra t. So you would use one Claygent column to do the research and extract the data (no cleaning), and another agent to clean the data? I’m not sure how to manage “scenarios” like NA, or the case where the first name isn't found (which happens all the time).
Channeled (APP)
To manage scenarios with missing data when using Claygent, consider this approach:
1. Configure outputs: Define column outputs for each data point (e.g., full name, first name, last name) in the Claygent modal.
2. Create a comprehensive prompt: Instruct Claygent to handle missing data by returning "NA" or a placeholder for unavailable information.
3. Use a two-step process: One Claygent column for research and data extraction, another for cleaning and processing.
4. Leverage Clay's data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
5. Implement AI formulas: Use these for additional cleaning and verification steps, especially for complex cases like missing first names.
Remember to use the "write to table" feature to save your results efficiently.
Sources: Data Cleaning + Normalization [Clay 101] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
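One possible way to make the missing-data "scenarios" explicit is to label each row with a status that a later filter can act on. The Python sketch below only illustrates that logic and is not a Clay formula; the status labels and the list of placeholder spellings in `MISSING_MARKERS` are assumptions and should match whatever the prompt is told to return.

```python
from typing import Optional

# Assumed spellings of "nothing found"; adjust to the prompt's actual placeholder.
MISSING_MARKERS = {"", "na", "n/a", "information not available"}


def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Return None for empty or placeholder values, otherwise the trimmed text."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in MISSING_MARKERS else text


def row_status(first_name: Optional[str], last_name: Optional[str]) -> str:
    """Label a row so that a filter can decide what happens downstream."""
    first = normalize_missing(first_name)
    last = normalize_missing(last_name)
    if first is None and last is None:
        return "not_found"
    if last is None:
        return "missing_last_name"
    if first is None:
        return "missing_first_name"
    return "ok"
```

With a status column like this, rows marked `not_found` can be filtered out, while `missing_first_name` rows can still be used when only the last name is needed.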
kushagra t.
The screenshot I shared uses only one column to both find and format the data. But if the school website makes it difficult to find and use the information in one go, trying two columns would be the next option. Another consideration is using different models.
Channeled (APP)
For scraping and cleaning data from 5,000 school websites using Claygent, here's a recommended approach:
1. Configure outputs: Define column outputs for each data point (e.g., full name, first name, last name) in the Claygent modal.
2. Create a comprehensive prompt: Instruct Claygent to extract headmaster information and handle missing data by returning "NA" or a placeholder for unavailable information.
3. Use structured JSON output: This can help organize the data more effectively and manage missing information.
4. Leverage Clay's data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
5. Use the "write to table" feature to save your results efficiently.
Consider using one Claygent column for research and data extraction, and another for cleaning and processing if needed. This approach combines Claygent's scraping capabilities with Clay's data cleaning tools for an efficient, automated process.
Sources: Data Cleaning + Normalization [Automated Outbound] - Clay University; Claygent: AI Web Scraper [Limitless Research] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
Pierre L.
It finds the headmaster about 99% of the time, but the “error” cases I have are:
• The first name and last name are the same (so I have to delete the first name manually; my main need is the last name).
• The first name is missing (same as the first case, but the result cell is empty).
• The last name or first name is in ALL CAPS, where I need “proper” format like Pierre and not PIERRE.
• None of the information is available, so the “response” is in red, and I need to add a filter for “Information not available” on my first and last name columns.
Do you think I can solve all this with a better prompt and one column?
Channeled (APP)
Yes, you can address these issues with a single Claygent column using a well-structured prompt. Here's how:
1. Configure outputs: Define separate column outputs for first name, last name, and full name.
2. Create a comprehensive prompt: Instruct Claygent to:
   • Extract the headmaster's name
   • Handle missing first names
   • Properly format names (capitalize first letter, lowercase rest)
   • Return "NA" for unavailable information
3. Use our data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
4. Implement the "write to table" feature to save your results efficiently.
This approach combines Claygent's capabilities with Clay's data cleaning tools for an efficient, automated process.
Sources: Data Cleaning + Normalization [Clay 101] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
kushagra t.
IMO, the first three can be handled by a dedicated AI cleanup column. The last one can be handled by adding conditional formulas and filters so that empty results are not used downstream.
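The first three cases are also simple enough to express as deterministic rules. The Python sketch below shows that logic under stated assumptions; it is not Clay formula syntax, and the helper names `proper_case` and `clean_names` are made up for illustration. It capitalizes each part of a name separated by spaces, hyphens, or apostrophes, and clears the first name when it merely duplicates the last name. Names with particles or internal capitals (for example "de" or "McDonald") are not handled.

```python
import re
from typing import Tuple


def proper_case(name: str) -> str:
    """Turn 'DUPONT' or 'jean-pierre' into 'Dupont' / 'Jean-Pierre'."""
    # Capitalize every run of characters between spaces, hyphens, and apostrophes.
    return re.sub(r"[^\s'-]+", lambda m: m.group(0).capitalize(), name.strip())


def clean_names(first: str, last: str) -> Tuple[str, str]:
    """Apply proper case and drop a first name that just repeats the last name."""
    first, last = proper_case(first), proper_case(last)
    if first and first.lower() == last.lower():
        # Both columns were filled with the same value; keep only the last name.
        first = ""
    return first, last
```

For example, `clean_names("DUPONT", "DUPONT")` returns an empty first name and `"Dupont"` as the last name, which matches the manual fix described above where the last name is what matters most.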
Channeled (APP)
Yes, you can handle these errors using a single Claygent column with a well-structured prompt. Here's how:
1. Configure outputs: Define separate columns for first name, last name, and full name.
2. Create a comprehensive prompt:
   • Extract the headmaster's name
   • Handle missing first names
   • Format names properly (capitalize first letter, lowercase rest)
   • Return "NA" for unavailable information
3. Use our data cleaning tools: After scraping, use our native integrations to further clean and normalize the data.
4. Use the "write to table" feature to save your results efficiently.
This approach combines Claygent's capabilities with Clay's data cleaning tools for an efficient, automated process.
Sources: Data Cleaning + Normalization [Clay 101] - Clay University; Claygent: AI Web Scraper [Clay 101] - Clay University
Pierre L.
OK, thanks a lot kushagra t. 🙂