Improving GPT Accuracy for Job Vacancy Classification in Clay
How can I avoid inconsistent scores of exactly 50 when classifying job vacancies with GPT in Clay? I'm using Clay with GPT-4o to analyze job vacancy pages and determine how likely a role is to be a customer service position at an online webshop versus a brick-and-mortar store.

Here's the exact prompt I'm using:

> Analyze /url
>
> Assignment:
> - You are an AI that analyzes job vacancy texts and determines how likely the job is for a customer service position at an online webshop versus a brick-and-mortar store. Return only a single number from 1 to 100, where:
>   - 1 = Definitely a job at a brick-and-mortar store
>   - 100 = Definitely a job at an online webshop
>
> Use clues in the text such as:
> - Brick-and-mortar indicators (score 1–33): "in store", "at location", "in the office", "local", "retail store", "shop", "physical store", "cash register", "face-to-face", "onsite", "field work", "brick and mortar", "in-store", "brick & mortar", "instore", "on site"
> - Hybrid indicators (score 34–66): "partially digital", "hybrid", "semi digital", "part digital", "limited digital", "multi-location", "partially remote", "store and e-commerce", "cross-channel"
> - Online webshop indicators (score 67–100): "remote", "online", "work from home", "e-commerce", "webshop", "ecommerce", "ecom", "digital", "chat/email", "virtual", "home-office", "home office"
>
> Scoring mechanism:
> - If multiple types of indicators are present, assign a middle-range score accordingly. If the text is vague, score based on the most probable context.
>
> Exception:
> - If you're not certain and would give a score of 50, do a second analysis by analyzing /description.
>
> Output:
> - Return only a number. Do not include any explanation or text.

The issue: when I run this formula across the full table, many rows return a score of exactly 50, marked with a red square in Clay (meaning the model wasn't confident). But when I re-run those exact same rows manually, one at a time, GPT almost always returns a more decisive score the second time (e.g., 22, 87, etc.).

My question: is there a way to get GPT to produce the right score on the first pass during the bulk run? I'd love to avoid manually re-running the 50s row by row after the fact. Has anyone dealt with this and found a reliable workaround? I'm open to suggestions around retries, batching, better prompt structure, or any Clay-specific setting that might help.
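For concreteness, the fallback behavior I'm trying to get in a single pass looks roughly like this if scripted outside Clay. This is a minimal sketch using the OpenAI Python SDK; `page_text` and `description` are placeholders for my /url and /description column values, and the prompt is abbreviated:

```python
# Minimal sketch of the "retry on 50" fallback, assuming the official
# OpenAI Python SDK. page_text / description are placeholders standing
# in for Clay's /url and /description column values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Determine how likely this job vacancy is a customer service position "
    "at an online webshop (100) versus a brick-and-mortar store (1). "
    "Return only a single number from 1 to 100, no explanation.\n\n"
)

def score(text: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # lower temperature reduces run-to-run variance
        messages=[{"role": "user", "content": PROMPT + text}],
    )
    return int(response.choices[0].message.content.strip())

def classify(page_text: str, description: str) -> int:
    first = score(page_text)
    if first == 50:  # model hedged: retry with the full description appended
        return score(page_text + "\n\nFull job description:\n" + description)
    return first
```

If there's a way to express that conditional second call inside a single Clay formula or prompt, rather than manually re-running rows, that would solve my problem.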