How Much Is a Content Library Worth to a Generative Search Engine?

Kunal Menda & Ralph Benarrosh

August 14, 2024

Executive summary

Since the launch of ChatGPT in 2022, there has been much debate about the value of the underlying content to generative AI, as well as how those who produce it should be compensated. In this blog post, we present a framework for valuing content libraries (e.g., the set of articles on a website) for use in generative AI applications. We propose that content's value in this context is attributable to two main value drivers:

Its relevance; i.e., how much demand there is from the AI’s user base for the information in the content library.
Its uniqueness; i.e., how many other content providers have the same information available.

We integrate these insights into an underwriting framework and use it to analyze the contributions of various content libraries to the responses generated by Perplexity AI for a set of sample queries. Using various deal term assumptions, we then convert these attribution percentages into dollar amounts that would be paid to the publishers under the terms of a hypothetical revenue-sharing agreement between the AI company and the publisher.

While our framework uses historical data to analyze the AI's behavior, we provide the ability to add forward-looking assumptions around the relevance of different topics over time. These assumptions will be reflected in the results of the underwriting. For example, while US domestic politics may be very relevant within the period covered by our sample data set, it may become less relevant after the November elections. Our framework allows for such a view to be expressed and reflected in the forward-looking valuation.

If you would like to discuss the specifics, get in touch!

Introduction

Ever since ChatGPT took the world by storm in late 2022, there has been enormous hype around the potential for generative AI to transform the way people search for and consume information on the internet. However, these new technological trends are not without the potential for collateral damage. The publishing industry in particular faces a significant threat on several fronts because of generative AI. First, early generative AI models were trained on content that was not explicitly licensed from its original owners. Second, there is a widespread assumption that generative search experiences (GSEs) such as SearchGPT, Google’s AI Overviews, and Perplexity AI will dramatically reduce the number of visitors to third-party websites, taking advertising revenue along with them. The reasoning goes that because GSEs produce fully formed responses to queries, users will have less need to click through to the source material, even if it is cited in the response.

In response to these developments, publishers have begun licensing their content to AI companies. This raises the question: how does one value a publisher's content library (i.e., the set of articles it has published) in the context of these licensing deals? There is a lot of money riding on this question, with recent transactions having been in the tens or even hundreds of millions of dollars. However, this is not an easy question to answer for several reasons. First, generative AI is a nascent technology and we have observed very few transactions of this kind in the market. The transactions we have observed are large and idiosyncratic. Second, content libraries cannot be valued in isolation because differentiated content is more valuable than commoditized content. Third, analyzing the behavior of generative search engines is challenging without the use of large-scale automation and analytics. Developing these capabilities in-house is difficult for the traditional analysts who would normally create such underwriting frameworks.

In this blog post, we discuss AltLab’s framework for assessing the value of a content library from the perspective of an AI company (the demand side). Our next post will build on this analysis and examine the same question from the perspective of the publisher (the supply side). Want to discuss it in more detail? Shoot us a note…we’d be happy to chat!

Our approach

Data

We assembled a sample data set of Perplexity AI’s responses to queries across a variety of categories, such as "What are the latest updates on the presidential election?" and "What are the upcoming matches for the premier league?" For each query, we collected Perplexity’s response, including any citations provided in support. An excerpt from our dataset is shown below.

Category	Query	Perplexity response
Business news	What are the latest trends in the stock market?	Here are the latest trends in the stock market based on the...
Current events	What are the latest updates on the presidential election?	Here are the latest updates on the 2024 presidential...
Sports	What are the latest NFL standings?	The latest NFL standings for the 2024 season are as follows...
Travel	What are the top attractions in Paris?	Paris is renowned for its iconic landmarks and cultural...
US elections	What are the key issues in the upcoming US elections?	The upcoming 2024 U.S. elections are centered around...

Content library value drivers

For most generative AI use cases, the value of a piece of content may be split into two components:

The ability to use it to train a model, and
The ability to integrate it into real time responses to queries.

This post will focus on the second use case, but many of the concepts are also relevant when assessing the first. When assessing the value of a piece of content for the second use case, we again focus on two key value drivers:

Its relevance; i.e., how many responses will it be used to generate?
Its uniqueness; i.e., how many other sources can provide the same or comparable information?

Intuitively, content that is used to generate more responses is more valuable than content that is used in fewer responses. However, we expect the uniqueness of the information to play a much more significant role in determining its value than in previous contexts. This is because of the programmatic way in which AI retrieves and consumes information, as well as how it flattens source material into a single response, eliminating duplicate information in the process. This will exaggerate the effects of the uniqueness value driver: we expect a race to the bottom on price for undifferentiated content, while differentiated content commands a significant premium, proportional to the demand for it.

Algorithm overview

We capture these insights in an underwriting model that estimates revenue contributions at the article level before rolling them up to the library level. We separate our algorithm into three parts:

1. For each query in our historical data set, assess the relative contributions of each cited article to the response produced. Queries are grouped into categories, allowing us to roll up the contributions of publishers to entire categories.
2. Project the breakdown of future searches by category. For example, what percentage of future searches will be about the US election over time? These projections are combined with the source contribution breakdowns produced in step one to get forward-looking projections on contributions by publisher.
3. Layer on revenue assumptions and specific deal terms to compute cashflows.

Example

Below, we work through an example underwriting of publishers for licensing their data to Perplexity using a sample data set.

Assessing the contributions of each publisher

Our algorithm assesses the relative contributions of each cited article to the response produced by Perplexity AI. We assess both the relevance of the article to the response (i.e., how frequently it was cited within the response) and the uniqueness of the article (i.e., how many other citations supported the same statement). Using these factors, the algorithm produces the relative contribution of each citation to the overall response. Below, we show Perplexity's response to the question "What are the top attractions in Paris?" as well as the contributions of the various sources that it cites.

Paris is renowned for its iconic landmarks and cultural attractions. Here are some of the top attractions that should not be missed:

Iconic Landmarks

Eiffel Tower: The most famous symbol of Paris, offering stunning views from its various levels. Visitors can enjoy dining at the Michelin-starred restaurant on the second floor or relax at the panoramic champagne bar on the third floor[1][2].
Louvre Museum: Home to thousands of works of art, including the Mona Lisa and the Venus de Milo. The museum itself is an architectural marvel, featuring a glass pyramid entrance[3].
Arc de Triomphe: A monumental arch honoring those who fought for France, located at the top of the Champs-Élysées. The view from the top is breathtaking, making it a popular spot for tourists[1][3].
Notre-Dame Cathedral: Although currently under restoration due to the 2019 fire, the exterior remains a stunning example of Gothic architecture. Plans are in place for parts of the cathedral to reopen to the public by 2024[2].

Cultural Experiences

Seine River Cruises: A scenic way to see many of Paris's landmarks from the water, especially beautiful during sunset[1][2].
Château de Versailles: A short trip from Paris, this opulent palace is famous for its gardens and historical significance. It's a must-visit for history buffs and those interested in royal architecture[1][2].
Montmartre: A charming neighborhood known for its artistic history, featuring the Basilica of Sacré-Cœur and a vibrant atmosphere filled with cafes and street artists[4].
Centre Pompidou: A modern art museum known for its unique architectural design and vast collection of contemporary art. The rooftop offers great views of the city[3].

Parks and Gardens

Luxembourg Gardens: A beautiful park perfect for a leisurely stroll or a picnic, showcasing stunning landscaping and fountains[2].
Tuileries Garden: Located between the Louvre and Place de la Concorde, this formal garden is ideal for relaxing and enjoying the Parisian atmosphere[3].

Unique Attractions

Père Lachaise Cemetery: A famous cemetery that serves as the final resting place for many notable figures, featuring beautiful sculptures and serene pathways[2].
Covered Passages: These historical shopping arcades offer a glimpse into Parisian life, with boutique shops and cafes tucked away from the bustling streets[2].

These attractions reflect the rich history, art, and culture that Paris has to offer, making it a top destination for travelers worldwide.

Citations:

https://www.timeout.com/paris/en/attractions/best-paris-attractions
https://www.parisdiscoveryguide.com/paris-attractions-top-10.html
https://www.lonelyplanet.com/france/paris/attractions
https://www.reddit.com/r/ParisTravelGuide/comments/19c72su/the_absolute_mustdosee_in_paris/

The plot below breaks down the contributions of the various sources to the response above. In this response, content from parisdiscoveryguide.com was given the highest attribution percentage.
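
To make the relevance and uniqueness drivers concrete, here is a minimal sketch of one possible attribution rule. It is an assumption for illustration, not necessarily the rule our algorithm uses: each cited statement carries one unit of credit, split equally among the sources cited for it. A source earns more credit by being cited more often (relevance) and by being the only citation for a statement (uniqueness). The citation lists are read from the Paris response above; the function name `attribute` is hypothetical.

```python
from collections import defaultdict

# Citation indices attached to each statement in the Paris response above.
STATEMENT_CITATIONS = [
    [1, 2],  # Eiffel Tower
    [3],     # Louvre Museum
    [1, 3],  # Arc de Triomphe
    [2],     # Notre-Dame Cathedral
    [1, 2],  # Seine River Cruises
    [1, 2],  # Château de Versailles
    [4],     # Montmartre
    [3],     # Centre Pompidou
    [2],     # Luxembourg Gardens
    [3],     # Tuileries Garden
    [2],     # Père Lachaise Cemetery
    [2],     # Covered Passages
]

# Citation index -> domain, in the order listed under "Citations".
SOURCES = {
    1: "timeout.com",
    2: "parisdiscoveryguide.com",
    3: "lonelyplanet.com",
    4: "reddit.com",
}


def attribute(statement_citations):
    """Assumed rule: one unit of credit per statement, split equally
    among the sources cited for that statement, then normalized."""
    credit = defaultdict(float)
    for cited in statement_citations:
        for source in cited:
            credit[source] += 1.0 / len(cited)
    total = sum(credit.values())
    return {source: value / total for source, value in credit.items()}


shares = attribute(STATEMENT_CITATIONS)
for source, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{SOURCES[source]}: {share:.1%}")
```

Under this equal-split assumption, parisdiscoveryguide.com receives the largest share, which matches the ranking described above. The exact percentages depend on the weighting rule and may differ from those in the plot.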

We ran our algorithm on every response within the dataset and aggregated the contributions by category. The chart below shows the breakdown of contributions by publisher for the Current events category. Within this dataset, we found that Reuters' content contributed 10% of the information in the responses, The Associated Press's content contributed 7%, etc.
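
The roll-up from individual responses to a category can be sketched as follows. This sketch assumes every response in a category is weighted equally, which is one of several reasonable choices (responses could instead be weighted by query volume). The function name and the publisher labels in the example are hypothetical, and the numbers are made up.

```python
from collections import defaultdict


def aggregate_by_category(responses):
    """Average per-response attribution shares within each category.

    `responses` is a list of (category, {publisher: share}) pairs, where
    each share dict sums to 1. Equal weighting per response is an assumption.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, shares in responses:
        counts[category] += 1
        for publisher, share in shares.items():
            totals[category][publisher] += share
    return {
        category: {pub: value / counts[category] for pub, value in pubs.items()}
        for category, pubs in totals.items()
    }


# Made-up attribution results for two responses (not taken from the dataset).
example = [
    ("Current events", {"publisher_a": 0.6, "publisher_b": 0.4}),
    ("Current events", {"publisher_a": 0.2, "publisher_c": 0.8}),
]
by_category = aggregate_by_category(example)
print(by_category["Current events"])
```

A publisher that is absent from a response simply contributes zero to that response, so the category-level shares still sum to one.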

Projecting the breakdown of future searches by category

Next, we project how future searches will break down across categories. Intuitively, we would expect queries related to the US elections to be more common in the lead-up to the election, declining precipitously afterward. Similarly, we would expect queries related to sports to be more common beginning in the fall, while queries related to travel would be more common in the summer. The plot below shows our projections for the share of queries by categories over the next two years.

Note: The projections are treated as an input to our process. Projections shown are illustrative.
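
Combining the projected category mix with the per-category contributions amounts to a weighted sum. The sketch below illustrates this under stated assumptions: the Current events figures for Reuters and The Associated Press are the ones quoted earlier in this post, while the US elections figure, the monthly query mix, and the function name `blended_contribution` are hypothetical placeholders.

```python
def blended_contribution(category_mix, contributions_by_category):
    """Weight each publisher's per-category contribution by the share of
    queries expected to fall in that category."""
    blended = {}
    for category, weight in category_mix.items():
        shares = contributions_by_category.get(category, {})
        for publisher, share in shares.items():
            blended[publisher] = blended.get(publisher, 0.0) + weight * share
    return blended


contributions = {
    # Figures quoted in this post for the Current events category.
    "Current events": {"Reuters": 0.10, "The Associated Press": 0.07},
    # Hypothetical figure, for illustration only.
    "US elections": {"Reuters": 0.05},
}

# Illustrative query mix for two months; other categories are omitted,
# so the weights deliberately do not sum to 1.
mix_by_month = {
    "2024-10": {"Current events": 0.30, "US elections": 0.40},
    "2024-12": {"Current events": 0.35, "US elections": 0.10},
}

projection = {
    month: blended_contribution(mix, contributions)
    for month, mix in mix_by_month.items()
}
for month, shares in projection.items():
    print(month, shares)
```

In this made-up example, a publisher's overall contribution falls after the election even though its contribution within each category is unchanged, because the query mix shifts away from a category where it is cited.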

Layering on revenue assumptions and specific deal terms

Next, we need to make some assumptions about the monthly revenue generated by Perplexity AI and how it will grow over time. For reference, Perplexity AI is on track to make $35M USD in annualized revenue and has seen its usage grow 700% in the last year. We also need to make an assumption about the share of Perplexity's revenue that will go to publishers, as well as the discount rate to use for future cashflows.

For the analysis below, we have assumed that Perplexity's current annual revenue is $35M USD with a 350% YoY forward-looking revenue growth assumption. We assume a 65% revenue share with publishers. Finally, we assume a 10% APR discount rate on cashflows.

From the estimates of the contributions of different content libraries to query responses, the category projections, and our assumptions around total revenue growth, we can project the revenue share that each publisher's content library will generate for them over the coming two years. The chart below shows those projections for the five most valuable content libraries.

Using the assumed discount rate of 10% per year, we can compute the present value of the revenue projections for each content library, shown in the bar chart below.
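
The cashflow and present-value steps can be sketched as below. Several modeling choices here are assumptions made for illustration: revenue growth is compounded monthly so that revenue grows by the stated percentage over twelve months, the 10% APR is applied as a monthly rate of APR/12, and the publisher's contribution is held constant at a hypothetical 5% rather than varying with the category projections. The function names are hypothetical, and the sketch is not expected to reproduce the figures in our charts.

```python
def monthly_cashflows(annual_revenue, yoy_growth, revenue_share, contribution, months):
    """Publisher cashflow for months 1..months.

    Assumes revenue compounds monthly at a rate equivalent to `yoy_growth`
    per year, and that `contribution` (the publisher's share of attributed
    content) is constant over time.
    """
    monthly_growth = (1.0 + yoy_growth) ** (1.0 / 12.0)
    base = annual_revenue / 12.0
    return [
        base * monthly_growth ** t * revenue_share * contribution
        for t in range(1, months + 1)
    ]


def present_value(cashflows, annual_rate):
    """Discount monthly cashflows at annual_rate / 12 per month (APR convention)."""
    monthly_rate = annual_rate / 12.0
    return sum(
        cf / (1.0 + monthly_rate) ** t
        for t, cf in enumerate(cashflows, start=1)
    )


flows = monthly_cashflows(
    annual_revenue=35e6,   # from the assumptions above
    yoy_growth=3.5,        # 350% YoY growth
    revenue_share=0.65,    # share of revenue paid to publishers
    contribution=0.05,     # hypothetical, constant publisher contribution
    months=24,
)
print(f"Undiscounted total: ${sum(flows):,.0f}")
print(f"Present value:      ${present_value(flows, 0.10):,.0f}")
```

To reflect the category projections, `contribution` would be replaced with a month-by-month series produced by the previous step.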

Analysis

In the preceding example, Reuters' revenue share from Perplexity over the next two years is estimated to be worth $9.2M USD. While there is little publicly available information about recent deals, one data point is the deal between News Corp and OpenAI, which is reportedly worth $250M USD over 5 years. OpenAI's annual revenue is reportedly around $3.4 billion USD and has approximately doubled in the last twelve months. If we were to assume the same revenue and growth rate for Perplexity AI, extend the time horizon to five years to match the deal, and assume that Perplexity shares 65% of its revenue with publishers, then the most valuable content library (Reuters) would be worth $775M USD to Perplexity. OpenAI is not just a generative search engine, though, so assuming a 65% revenue share with publishers is likely too high. In our framework, the $250M USD deal value would imply a 21% revenue share with publishers under the aforementioned assumptions.
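
The implied revenue share follows from simple proportionality. Because each cashflow is revenue multiplied by the revenue share and the publisher's contribution, the valuation scales linearly with the revenue share when all other assumptions are held fixed. A quick check of the arithmetic, using only the figures above:

```python
assumed_share = 0.65            # revenue share used in the $775M estimate
value_at_assumed_share = 775e6  # estimated library value at that share
deal_value = 250e6              # reported News Corp / OpenAI deal value

# Valuation is linear in the revenue share, so scale the share by the
# ratio of the deal value to the estimated value.
implied_share = assumed_share * deal_value / value_at_assumed_share
print(f"Implied revenue share: {implied_share:.0%}")  # prints 21%
```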

Conclusion

This post has explored our framework for valuing a content library from the perspective of an AI company looking to license content from publishers. Using our framework, a generative AI company can value content libraries based on the historical relevance and uniqueness of their content as well as forward-looking views on category relevance, revenue growth, and other factors. However, we have not yet discussed the other side of the market: how should publishers weigh the benefits of licensing content to AI companies against potential lost advertising and subscription revenue? Stay tuned! In our next post, we consider the same question from the perspective of publishers.