
Evaluating the success of consumer generative AI products

This past year has been transformative, particularly with the introduction of generative AI (GAI), which has every industry figuring out how best to incorporate it into their products and services. Over the last year, we’ve been focused on evaluating the success of several GAI product launches at LinkedIn, ensuring that these new features are high-quality, easy to use, and solve meaningful problems for members. Our goal in launching these products is to enable members to learn, share knowledge, and advance their careers. Defining the right success criteria is a critical step in ensuring our products achieve that goal.

Here’s a look at how we’ve been evaluating the success of these launches, using three examples: collaborative articles, AI-powered conversation starters developed with our editorial team to which LinkedIn members can add their insights and perspectives; the AI-powered Premium experience, where subscribers see AI-powered takeaways on Feed posts and job-seeking advice on active job posts to help them unlock hidden opportunities; and personalized profile writing suggestions, which help subscribers optimize their profiles with suggestions for the Headline and About sections. We evaluate success using three main quantitative methods, each explained in detail in this post: human review to evaluate the quality of GAI output, in-product feedback to evaluate members’ perception of GAI output quality, and product usage metrics to evaluate the overall success of the feature.

In addition to these metrics, there are other areas of measurement that are critical to rolling out a new GAI feature, including red-teaming exercises to identify critical trust issues like biased and/or harmful content, and measuring the effectiveness of go-to-market efforts such as product discovery and adoption. While those metrics are important, this blog focuses specifically on evaluating the value of the core product experience. The best practices outlined here are designed to help our members better understand how we ensure product quality and how we put the power of GAI into members’ hands to help them more effectively connect to opportunities, be more productive, showcase their expertise and skills, and gain access to the knowledge they need to do their jobs.

Human review

Human review is absolutely critical: it is the gold standard and the most reliable way to guarantee GAI product quality, both during an initial launch and as ongoing evaluation to monitor quality over time. Whenever we add a new feature or functionality to a GAI product, we create guidelines for what constitutes an unacceptable, good, or great GAI output, and we note specific high-severity error types to flag, such as hallucinations or bias. We create a diverse set of sample outputs for reviewers to judge, striving for good coverage across the important variables that affect the GAI output.

For example, review samples of collaborative articles should cover a broad range of the topics that we write articles about, while sample conversations with a generative AI feature should cover a broad range of members from different industries, job functions, and geographic areas. Human reviewers then annotate the sample outputs with their quality scores. It’s important to have a sufficient sample size to identify fairly rare events: some of the most serious problems, like hallucinations, can be launch-blockers even if they happen infrequently, because they damage member trust in the product.
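As a rough illustration of why sample size matters (a back-of-the-envelope sketch, not a description of our actual review-planning tooling), the number of independently sampled outputs needed to catch a rare problem at least once follows directly from the binomial distribution:

```python
import math

def samples_needed(problem_rate: float, detection_prob: float = 0.95) -> int:
    """Minimum number of independently sampled outputs so that a problem
    occurring at `problem_rate` shows up at least once with probability
    >= `detection_prob` (P(at least one) = 1 - (1 - p)^n)."""
    return math.ceil(math.log(1.0 - detection_prob) / math.log(1.0 - problem_rate))

# A problem affecting 1% of outputs needs roughly 300 reviewed samples to be
# seen at least once with 95% probability; a 0.1% problem needs ~3,000.
print(samples_needed(0.01))   # 299
print(samples_needed(0.001))  # 2995
```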

Reader, I can hear you wondering: “If GAI is so clever, why do we still need humans? Can’t we ask GAI to review its own work?” For some review tasks that are stable over time (such as simple quality monitoring of stable features, or monitoring for well-understood categories of problems), we can automate the review by writing and validating prompts for GAI to check its output for specific problems. But in the case of a freshly launched product, the review tasks change too quickly for this to be useful: the standards for what the GAI output should look like shift every week or even every day. By the time we could write and evaluate prompts to have GAI do the evaluation, we could have done the review ourselves. So for product launches in an early stage of fast iteration, we rely heavily on direct human review.
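To make this concrete, here is a minimal sketch of what an automated check for a single, well-understood problem category might look like. The call_llm helper is a hypothetical stand-in for whatever model-calling interface a team uses; it is not a LinkedIn API, and the prompt is illustrative only.

```python
from typing import Callable

def automated_quality_check(call_llm: Callable[[str], str],
                            gai_output: str,
                            problem_description: str) -> bool:
    """Ask a model whether a single, well-understood problem category is
    present in a piece of GAI output. Returns True if the problem is flagged.

    `call_llm` is a hypothetical stand-in that maps a prompt string to a
    model response string."""
    prompt = (
        "You are reviewing AI-generated text for one specific quality problem.\n"
        f"Problem to check for: {problem_description}\n\n"
        f"Text to review:\n{gai_output}\n\n"
        "Answer with exactly YES if the problem is present, otherwise NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```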

The largest downside of human review is that it’s time-consuming. Even a well-funded human review program can only cover a small fraction of the possible GAI outputs that members might see in production. Therefore, we need to supplement human review with complementary, more scalable metrics that provide higher coverage.

In-product feedback

One straightforward way to collect higher-coverage feedback on GAI output quality is to directly ask users of a product to provide it. Every AI-assisted product must provide a simple, quick way for members to submit both positive and negative feedback within the product as they use it. We frequently use thumbs up/thumbs down buttons. This feedback collection can provide orders of magnitude higher sample size than a human review program, but it is still relatively low-coverage because most members don’t engage with this feedback feature. Also, the responses tend to have a positive skew because dissatisfied members tend to leave without giving feedback. In the case of GAI-powered features that involve a chat interface, it’s common for members to express their feelings directly in the chat experience rather than using the thumbs feedback. They type messages like “Thank you, that was great” or “No, you’ve got it all wrong” or “You don’t understand at all” directly into the chat window. 

Image of in-product feedback prompts
Feedback prompts and thumbs up/down buttons invite members to rate the quality of GAI-generated text.

Despite the low number of responses, this feedback is incredibly valuable both quantitatively and qualitatively. We track the thumbs-up rate over time to monitor changes in member satisfaction, and compare it across product features and user segments to identify key problem areas to focus on. For qualitative analysis, we review samples of the GAI outputs that received thumbs downs to identify trends. For example, in the case of collaborative articles, we analyze the articles that receive negative feedback to understand common problems and develop an action plan to improve quality.
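As a simplified sketch of this kind of quantitative monitoring (using a made-up feedback log and column names, not our production schema), computing a thumbs-up rate by week and product feature is a straightforward aggregation:

```python
import pandas as pd

# Hypothetical feedback log: one row per thumbs up/down event.
feedback = pd.DataFrame({
    "week":    ["2024-01", "2024-01", "2024-01", "2024-02"],
    "feature": ["takeaways", "takeaways", "profile", "takeaways"],
    "rating":  [1, 0, 1, 1],  # 1 = thumbs up, 0 = thumbs down
})

# Thumbs-up rate over time and by feature, to spot regressions and to
# compare satisfaction across product areas.
thumbs_up_rate = (
    feedback.groupby(["week", "feature"])["rating"]
            .mean()
            .rename("thumbs_up_rate")
)
print(thumbs_up_rate)
```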

Even a highly discoverable in-product feedback button generally only gets used by a single-digit percentage of the people who use a feature, and this small sample may not be representative of the entire population. Therefore, we need to supplement these feedback metrics with even higher-coverage measurements to evaluate the product experience across the entire member base.

Product usage metrics

The third category of metrics for evaluating GAI is usage metrics that quantify some aspect of how members interact with the product. These metrics have the highest coverage because every member who uses the product contributes to them, so they help us understand the product’s success across its entire member base. However, they give less specific information than explicit feedback metrics: if a metric is low, it’s hard to determine why. If members drop off of a flow or don’t return to the product, we know something is wrong, but the metric doesn’t necessarily tell us whether the GAI output quality is low or some other aspect of the product isn’t working well. Therefore, we often turn to researchers, who can talk to members who tried the product but never used it again, for example, to build a qualitative understanding of why they churned.

Product usage metrics tend to be very specific to the product itself. The ways members interact with GAI products vary according to the interaction modes available for engaging with the GAI-generated content, so the product metrics must adapt to those interaction modes. Here are three examples, each with unique interaction modes: one where LinkedIn leverages GAI to help draft text and shares it with all members; one where GAI generates text in response to a member’s query in a conversational format; and one where GAI generates suggested text that a member can then edit and use.

Collaborative Articles

Collaborative articles are GAI-generated articles that serve as conversation starters, augmented by views and perspectives from expert members on LinkedIn who share their thoughts directly in the article. In this case, the GAI-written text serves as a canvas to inspire members to share their experiences by contributing to the article. We have complete control over the prompts that GAI is processing, since the GAI-driven part of the process (main article writing) is completed before any members ever see the article.

Our key metrics for evaluating the initial rollout of this feature included contributions (are members adding their perspectives to the article?) and contributor retention (do members who contribute to articles come back and contribute again?). Because this is a social product, we also monitor distribution and feedback to contributors: how far their contributions spread within their network and beyond, and how much engagement they receive. These are key indicators of whether contributors find the experience valuable. Good-quality GAI-written article starters elicit good-quality contributions, which then circulate widely within the contributor’s network and beyond, receiving responses. As the product has matured, we’ve also begun to specifically identify high-quality contributions in order to focus on increasing them.

GAI-powered Premium experience

The AI-powered Premium experience gives subscribers GAI-powered takeaways and job advice on the platform, ranging from how to get more knowledgeable about important topics to what actions to take to advance their career. In this case, the GAI output takes the form of a back-and-forth conversation; members can ask any questions about professional topics. This gives us the ability to look at metrics like conversation depth (did members have a meaningful conversation with the AI?).
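Conversation depth itself is a simple count. Here is a minimal sketch, using a hypothetical chat log rather than our production data, of how member turns per conversation might be tallied:

```python
import pandas as pd

# Hypothetical chat log: one row per message within a conversation.
messages = pd.DataFrame({
    "conversation_id": [1, 1, 1, 1, 2],
    "sender": ["member", "assistant", "member", "assistant", "member"],
})

# Conversation depth: number of member turns per conversation, a rough
# proxy for whether an exchange went beyond a single question and answer.
depth = (
    messages[messages["sender"] == "member"]
    .groupby("conversation_id")
    .size()
    .rename("member_turns")
)
print(depth)  # conversation 1 has depth 2, conversation 2 has depth 1
```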

Unlike the other examples, this is a mostly non-public product. The GAI-powered features can help the member accomplish tasks, such as applying for a job or messaging someone, but many conversations are chats between the member and the feature itself. We aspire to build a product that is a critical part of the member’s daily product usage, so the key metrics are daily/weekly active users and feature usage retention (i.e., of all the members who used GAI features in a given week, how many use the feature again in the following week). As we add new functionality like messaging and posting integrations that allow the GAI-powered features to help the member complete actions, we also measure usage of these, as a way of understanding whether the feature is increasing productivity. A successful GAI-powered feature should allow the member to apply for more relevant jobs (and therefore get hired faster), or send more relevant messages (and therefore become more likely to receive a response). We do not set goals to increase the time spent chatting in this experience; its purpose is to make the member more productive.
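As an illustration of the retention calculation described above (again using a hypothetical usage log), week-over-week feature retention can be computed from the sets of members active in consecutive weeks:

```python
import pandas as pd

# Hypothetical usage log: one row per (member, week) in which the member
# used a GAI feature.
usage = pd.DataFrame({
    "member_id": [1, 2, 3, 1, 3],
    "week": ["2024-W01", "2024-W01", "2024-W01", "2024-W02", "2024-W02"],
})

def week_over_week_retention(usage: pd.DataFrame, week: str, next_week: str) -> float:
    """Fraction of members active in `week` who are active again in `next_week`."""
    cohort = set(usage.loc[usage["week"] == week, "member_id"])
    returned = set(usage.loc[usage["week"] == next_week, "member_id"])
    return len(cohort & returned) / len(cohort) if cohort else float("nan")

print(week_over_week_retention(usage, "2024-W01", "2024-W02"))  # 2 of 3 -> ~0.67
```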

Personalized profile writing suggestions

Our personalized profile writing suggestions help members summarize their career highlights in the About section of their profile, offering writing suggestions based on the content already in the profile. In this case, the member can decide to accept, reject, or edit the GAI-suggested text. We evaluate the quality of the GAI suggestions using metrics like acceptance rate (did members use the GAI-suggested text or reject it?) and edit distance (how heavily did the member need to edit the GAI-suggested content before accepting it?).
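For readers curious about the mechanics, here is a minimal sketch of how the distance between the GAI-suggested text and the text a member ultimately publishes could be computed and normalized. This uses the standard Levenshtein edit distance and is not necessarily the exact formulation used in production:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(suggested: str, published: str) -> float:
    """0.0 means the member kept the suggestion verbatim; values near 1.0
    mean the text was almost entirely rewritten."""
    longest = max(len(suggested), len(published)) or 1
    return levenshtein(suggested, published) / longest

print(normalized_edit_distance("Experienced data analyst",
                               "Experienced data analyst and team lead"))
```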

Unlike the other two examples, this is not a recurring use case. We expect that a member might like help revising their profile summary once every year or two, but not every day or every week. Therefore, retention is not a useful metric for this feature. If the personalized profile writing suggestions are good, members won’t need to use the feature again until they have new information to add to their profile. Instead, we care most about metrics like acceptance rate: out of all members who edit their profile summary and try this feature, what fraction choose to use the resulting output? We also focus on downstream engagement with profiles written using these suggestions; if this feature is successful, the quality of writing that members create with it should be at least as high as what they could create without it, so engagement with their profiles should also increase.

Image of the “Write with AI” button
The “Write with AI” button allows members to write a profile summary section that draws from all the other information on their profile, including work experience, education, and skills.

Conclusion

While every GAI product is unique and requires a thoughtful approach to evaluating success, these products share common challenges in measuring quality and ensuring a trustworthy experience. There are different ways to measure success for different types of GAI products, depending on the member’s goals, the interaction modes, and the stage of the product life cycle. We use a combination of human review (the lowest-coverage but highest-fidelity measurement of GAI output quality), in-product feedback (a medium-coverage, somewhat skewed measurement of GAI output quality), and product usage metrics (the highest-coverage, most general metrics, covering every aspect of product success) to evaluate product success.

By constantly monitoring and improving these metrics, we ensure that our GAI products are delivering value to our members and helping them achieve their professional aspirations. As GAI goes mainstream and becomes a normal part of our daily experience using technology, I hope these examples help GAI product users understand the thought process that goes into designing them, and give other GAI builders some fresh ideas for how to evaluate their product launches. 

Acknowledgements

Each of these product launches involved a large cross-functional team too numerous to list here, who envisioned, designed, and built something truly new in the world. Thanks to everyone who worked on them.

I would also like to thank Jia Ding, Stacy Li, Pete Merkouris, Ziyang Zhou, Adam Tishok, Daniella Bayon, Cathy Liu, Grace Tang, Alex Murchison, and Katherine Vaiente for reviewing this blog and providing valuable feedback.