HC/EV: A Framework for Thinking About GPT-3 Use Cases
People are working on figuring out how to use GPT-3. I think I’ve identified a class of ideas worth paying attention to (and by extension a class to ignore).
A bad idea
I recently came across a GPT-3 use case that someone was thinking of putting into their product: generating SQL code from English.
How much revenue did we have in the last week?
SELECT SUM(amount) FROM charges WHERE charge_dt > now() - interval ‘7 days’
It’s trying to solve a real problem: business intelligence (BI) people who can’t code often have to bug developers to get answers to their questions (because the developers can write SQL). With this feature, someone who couldn’t write SQL could rely on GPT-3 to write it for them.
I’ve played around with similar use cases and it’s a plausible idea. The SQL generated by GPT-3 given a good prompt will sometimes run and will sometimes be correct.
The problem is that the BI person can’t validate that the result is correct (they can’t read SQL) and GPT-3 isn’t accurate enough for us to just ignore that. Maybe one day we’ll get to the point where you can ask GPT-3 for SQL code and it’ll return either the correct code or an “I can’t do that,” but we aren’t there yet. If you ask GPT-3 for the revenue in the last week it might give you the revenue in the last month, the number of charges in the last week, the number of customers who purchased in the last week, or a number of other things.
Today, you need a human in the system who can validate the result returned from GPT-3 1. If you need a developer to validate the SQL code, then couldn’t he generate the SQL code in the first place? Why even have GPT-3 in the loop? This isn’t a good use case because it’s not much harder to generate the result than to validate it.
A similar class of bad GPT-3 ideas: the task GPT-3 is performing just isn’t that hard in the first place.
One example: GPT-3 takes a list of bullets and expands those bullets into an email (text expansion) as a way to save you time writing the whole email out. But is it really going to save me time to write the bullets, wait for your service, and then validate (and possibly correct) the result?
- I’ll be in Cabo that week
- Schedule here: https://calendly.com/imavchaha/30min
I’m looking forward to catching up, but I’m not available then. I’ll be in Cabo!
You can see when I’m available and schedule some time to talk at https://calendly.com/imavchaha/30min
We want use cases where GPT-3 is doing something Hard to Create but Easy to Validate (HC/EV).
Many start ups are experimenting with generating ad copy (Copy.ai, Markket.ai, Copysmith.ai). This isn’t a horrible idea. It’s difficult (time consuming) to come up with lots of variations of text to A/B test and GPT-3 does a pretty good job at generating variations. And the user can quickly evaluate the results and pick the ones they like.
My only complaint for ideas like these (and any kind of brainstorming) is that it usually isn’t that hard to generate ideas for the user and the ideas generated by GPT-3 aren’t much better (and are often worse) than what a human would do.
The best use cases for GPT-3 are situations where it’s very difficult for humans to perform a task but trivial for them to validate the result. I haven’t seen any GPT-3 use cases that demonstrate this, so I’m going to use a product that I use called Looka as an example.
HC/EV in other domains
I’m hopeless when it comes to visual design. I don’t know what tools to use. I don’t know what colors go well together. I don’t know how to create or buy rights to logos or fonts. I often have ideas in my head for what would look good, but when I see the result I hate it.
Looka is an online tool that makes it easy to create a logo for your business/website/startup by showing you a ton of different variants. You can pick the one you like most and then see more variations of that variant. You keep going until you’re satisfied.
Looka is HC/EV. It would probably take me days to create a logo that I really like. With Looka, it takes me 20-30 minutes.
Canva does something similar when choosing a template. You can glance through fifty different templates and find one that looks like what you want. It would be much harder to build something from scratch.
You find HC/EV with visuals because it’s hard to create something that looks good, but everyone can find designs they think look bad.
Some other examples of HC/EV:
- One-way functions, some proofs, some other computer science/math problems.
- Explaining a complex concept (easy to determine if it makes sense and applies to a situation, difficult to determine if the explanation captures everything relevant or is true).
- Writing code (easy to tell that it runs and whether it returns something plausible, hard to determine if it returns the correct result).
- Creating a sudoku or crossword puzzle.
- Creating a funny joke or meme.
I’m not suggesting that GPT-3 is a good tool to solve any of these. We need situations where:
- GPT-3 outperforms some people (HC)
- Those people can verify what GPT-3 does (EV)
- No existing solution does as good a job
GPT-3 is insanely flexible and I’m confident that it (and what comes after) will change the world. We’re still very early and haven’t identified the best use cases yet. HC/EV has been helpful for me when thinking about GPT-3, so I decided to share it with you.
I’d love to hear your thoughts on this framework. Where does it fall short?
This framework does not capture all good GPT-3 use cases.
Here’s another class of problems where GPT-3 is worth considering: GPT-3 is correct more often that not and you don’t need 100% accuracy. Let’s say you want to find negative tweets about your company (sentiment analysis). Using GPT-3 as a first pass for sentiment analysis isn’t a horrible idea. You could then look through the tweets it flagged and remove any false positives. This is better than looking through all the tweets yourself (although is it better than the State of the Art alternatives?).
It’s worth noting that humans don’t necessarily have to do the validation. You could validate GPT-3’s result using a different tool (SQL linting) or use some metric to supervise the comparison of different results (A/B test different completions). ↩︎