Since the launch of ChatGPT in 2022, there has been much debate about the value of the underlying content to generative AI, as well as how those who produce it should be compensated. In this blog post, we present a framework for valuing content libraries (e.g., the set of articles on a website) for use in generative AI applications. We propose that content's value in this context is attributable to two main value drivers: its relevance (how often it is used to generate responses) and its uniqueness (how few other sources provide the same information).
We integrate these insights into an underwriting framework and use it to analyze the contributions of various content libraries to the responses generated by Perplexity AI for a set of sample queries. Using various deal term assumptions, we then convert these attribution percentages into dollar amounts that would be paid to the publishers under the terms of a hypothetical revenue-sharing agreement between the AI company and the publisher.
While our framework uses historical data to analyze the AI's behavior, we provide the ability to add forward-looking assumptions around the relevance of different topics over time. These assumptions will be reflected in the results of the underwriting. For example, while US domestic politics may be very relevant within the period covered by our sample data set, it may become less relevant after the November elections. Our framework allows for such a view to be expressed and reflected in the forward-looking valuation.
If you would like to discuss the specifics, get in touch!
Ever since ChatGPT took the world by storm in late 2022, there has been enormous hype around the potential for generative AI to transform the way people search for and consume information on the internet. However, these new technological trends are not without the potential for collateral damage. The publishing industry in particular faces a significant threat on several fronts because of generative AI. First, early generative AI models were trained on content that was not explicitly licensed from its original owners. Second, there is a widespread assumption that generative search experiences (GSEs) such as SearchGPT, Google’s AI Overviews, and Perplexity AI will dramatically reduce the number of visitors to third-party websites, taking advertising revenue with them. The reasoning goes that because GSEs produce fully formed responses to queries, users will have less need to click through to the source material, even if it is cited in the response.
In response to these developments, publishers have begun licensing their content to AI companies. This raises the question: how does one value a content library (i.e., the set of articles a publisher has published) in the context of these licensing deals? There is a lot of money riding on this question, with recent transactions valued in the tens or even hundreds of millions of dollars. However, this is not an easy question to answer, for several reasons. First, generative AI is a nascent technology and we have observed very few transactions of this kind in the market. The transactions we have observed are large and idiosyncratic. Second, content libraries cannot be valued in isolation because differentiated content is more valuable than commoditized content. Third, analyzing the behavior of generative search engines is challenging without large-scale automation and analytics. Developing these capabilities in-house is difficult for the traditional analysts who would normally create such underwriting frameworks.
In this blog post, we discuss AltLab’s framework for assessing the value of a content library from the perspective of an AI company (the demand side). Our next post will build on this analysis by taking the perspective of the publisher (the supply side). Want to discuss it in more detail? Shoot us a note…we’d be happy to chat!
We assembled a sample data set of Perplexity AI’s responses to queries across a variety of categories, such as "What are the latest updates on the presidential election?" and "What are the upcoming matches for the premier league?" For each query, we collected Perplexity’s response, including any citations provided in support. An excerpt from our dataset is shown below.
For most generative AI use cases, the value of a piece of content may be split into two components:
This post will focus on the second use case, but many of the concepts are also relevant when assessing the first. When assessing the value of a piece of content for the second use case, we again focus on two key value drivers: relevance and uniqueness.
Intuitively, content that is used to generate more responses is more valuable than content that is used in fewer responses. However, we expect the uniqueness of the information to play a much more significant role in determining its value than in previous contexts. This is because of the programmatic way in which AI retrieves and consumes information, and because it flattens source material into a single response, eliminating duplicate information in the process. This will exaggerate the effects of the uniqueness value driver, with a race to the bottom on price for undifferentiated content and differentiated content commanding a significant premium, proportional to the demand for it.
We capture these insights in an underwriting model that estimates revenue contributions at the article level before rolling them up to the library level. We separate our algorithm into three parts:
Below, we work through an example underwriting of publishers for licensing their data to Perplexity using a sample data set.
Our algorithm assesses the relative contribution of each cited article to the response produced by Perplexity AI. We assess both the relevance of the article to the response (i.e., how frequently it was cited within the response) and the uniqueness of the article (i.e., how many other citations supported the same statement). Using these factors, the algorithm produces the relative contribution of each citation to the overall response. Below, we show Perplexity's response to the question "What are the top attractions in Paris?" as well as the contributions of the various sources that it cites.
The plot below breaks down the contributions of the various sources to the response above. In this response, content from parisdiscoveryguide.com was given the highest attribution percentage.
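To make the two factors concrete, here is a minimal sketch of one way such an attribution could be computed. It is not the actual algorithm: it assumes the response has already been split into statements, that each statement gets one unit of credit divided equally among the sources cited for it, and that the credits are then normalized. The function name and the example domains are hypothetical.

```python
from collections import defaultdict


def attribute_response(statements):
    """Estimate each source's share of a single response.

    statements: one list of cited sources per statement in the response.
    Assumption (illustrative only): every statement carries one unit of
    credit, split equally among the distinct sources cited for it.
    """
    credit = defaultdict(float)
    for sources in statements:
        distinct = set(sources)
        if not distinct:
            continue  # uncited statements carry no attributable credit
        per_source = 1.0 / len(distinct)  # uniqueness: fewer co-citations, more credit
        for source in distinct:
            credit[source] += per_source  # relevance: more citations, more credit
    total = sum(credit.values())
    if total == 0:
        return {}
    return {source: value / total for source, value in credit.items()}


# Hypothetical response with four statements.
example_statements = [
    ["site-a.example"],
    ["site-a.example", "site-b.example"],
    ["site-c.example"],
    ["site-a.example", "site-b.example"],
]
shares = attribute_response(example_statements)
for source, share in sorted(shares.items()):
    print(f"{source}: {share:.0%}")
```

In this toy example, site-a.example supports three statements, one of them alone, so it ends up with half of the credit, while the other two sources receive a quarter each.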
We ran our algorithm on every response within the dataset and aggregated the contributions by category. The chart below shows the breakdown of contributions by publisher for the Current events category. Within this dataset, we found that Reuters' content contributed 10% of the information in the responses, The Associated Press's content contributed 7%, etc.
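The aggregation step can be sketched as follows, assuming each response has already been reduced to a category label and a mapping from publisher to attribution share. Weighting every response equally within its category is an assumption made here for illustration, and the publisher names and numbers are made up.

```python
from collections import defaultdict


def aggregate_by_category(responses):
    """Average per-response attribution shares within each category.

    responses: list of (category, {publisher: share}) pairs.
    Assumption (illustrative only): each response counts equally.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, shares in responses:
        counts[category] += 1
        for publisher, share in shares.items():
            totals[category][publisher] += share
    return {
        category: {pub: value / counts[category] for pub, value in pubs.items()}
        for category, pubs in totals.items()
    }


# Hypothetical attributions for two responses in one category.
example_responses = [
    ("Current events", {"Publisher A": 0.6, "Publisher B": 0.4}),
    ("Current events", {"Publisher A": 0.2, "Publisher C": 0.8}),
]
by_category = aggregate_by_category(example_responses)
for publisher, share in sorted(by_category["Current events"].items()):
    print(f"{publisher}: {share:.0%}")
```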
Next, we project how future searches will break down across categories. Intuitively, we would expect queries related to the US elections to be more common in the lead-up to the election, declining precipitously afterward. Similarly, we would expect queries related to sports to be more common beginning in the fall, while queries related to travel would be more common in the summer. The plot below shows our projections for the share of queries by categories over the next two years.
Note: The projections are treated as an input to our process. Projections shown are illustrative.
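One simple way to express such a forward-looking view is to assign each category a relative weight per period and normalize the weights into query shares. The sketch below assumes that representation; the category names and weights are illustrative, not our projections.

```python
def normalize_mix(raw_weights):
    """Turn relative category weights into shares that sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("category weights must sum to a positive number")
    return {category: weight / total for category, weight in raw_weights.items()}


# Illustrative view: election queries are weighted heavily before the
# election and much less afterward.
before_election = normalize_mix({"US elections": 3, "Sports": 2, "Travel": 1})
after_election = normalize_mix({"US elections": 1, "Sports": 2, "Travel": 1})
print(before_election["US elections"], after_election["US elections"])
```

Because the shares always sum to one, lowering the weight of one category automatically raises the share of the others.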
Next, we need to make some assumptions about the monthly revenue generated by Perplexity AI and how it will grow over time. For reference, Perplexity AI is on track to make $35M USD in annualized revenue, and has seen its usage grow 700% in the last year. We also need to make an assumption about the share of Perplexity's revenue that will go to publishers, as well as the discount rate to use for future cashflows.
For the analysis below, we have assumed that Perplexity's current annual revenue is $35M USD with a 350% YoY forward-looking revenue growth assumption. We assume a 65% revenue share with publishers. Finally, we assume a 10% annual discount rate on cashflows.
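As a rough sketch of how these assumptions translate into a monthly pool of dollars available to publishers, the snippet below makes two interpretive choices that are not stated above: "350% YoY growth" is read as revenue reaching 4.5x its starting level after twelve months, and that growth is compounded smoothly month by month. The function name is hypothetical.

```python
def project_publisher_pool(annual_revenue, yoy_growth, revenue_share, months):
    """Project the monthly dollars shared with all publishers combined.

    Assumptions (illustrative only): the current monthly run rate is
    annual_revenue / 12, and growth compounds monthly so that revenue is
    multiplied by (1 + yoy_growth) every twelve months.
    """
    monthly_factor = (1.0 + yoy_growth) ** (1.0 / 12.0)
    monthly_revenue = annual_revenue / 12.0
    pool = []
    for _ in range(months):
        pool.append(monthly_revenue * revenue_share)
        monthly_revenue *= monthly_factor
    return pool


pool = project_publisher_pool(
    annual_revenue=35e6, yoy_growth=3.5, revenue_share=0.65, months=24
)
print(f"Month 1 pool: ${pool[0]:,.0f}")
```

Other readings of the growth figure (for example, step changes once a year) would change the shape of the projection, so this choice should be treated as a modeling input.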
From the estimates of the contributions of different content libraries to query responses, the category projections, and our assumptions around total revenue growth, we can project the revenue share that each publisher's content library will generate for them over the coming two years. The chart below shows those projections for the five most valuable content libraries.
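The way these three inputs combine can be sketched as below. The sketch assumes that each month's publisher pool is allocated to categories in proportion to their share of queries and then to publishers in proportion to their contribution within each category. That allocation rule, along with all names and numbers, is an illustrative assumption.

```python
def project_publisher_revenue(pool, category_mix, contributions):
    """Project monthly revenue per publisher.

    pool: monthly dollars shared with all publishers combined.
    category_mix: one {category: share of queries} dict per month.
    contributions: {category: {publisher: attribution share}}.
    """
    months = min(len(pool), len(category_mix))
    revenue = {}
    for month in range(months):
        for category, query_share in category_mix[month].items():
            for publisher, contribution in contributions.get(category, {}).items():
                series = revenue.setdefault(publisher, [0.0] * months)
                series[month] += pool[month] * query_share * contribution
    return revenue


# Toy inputs: two months, two categories, two publishers.
example_pool = [100.0, 200.0]
example_mix = [
    {"News": 0.5, "Sports": 0.5},
    {"News": 0.25, "Sports": 0.75},
]
example_contributions = {
    "News": {"Publisher A": 0.5, "Publisher B": 0.5},
    "Sports": {"Publisher A": 1.0},
}
projected = project_publisher_revenue(example_pool, example_mix, example_contributions)
print(projected)
```

In the toy example, Publisher A's revenue rises in the second month both because the pool grows and because the query mix shifts toward the category where it contributes the most.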
Using the assumed discount rate of 10% per year, we can compute the present value of the revenue projections for each content library, shown in the bar chart below.
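For readers who want to reproduce the discounting step, here is a minimal sketch. It assumes monthly cashflows received at the end of each month and converts the annual rate to an equivalent monthly discount factor. The timing convention is an assumption, not something specified above.

```python
def present_value(cashflows, annual_rate):
    """Discount monthly cashflows at an annual rate, compounded monthly.

    Assumption (illustrative only): cashflow t arrives at the end of
    month t + 1, and the monthly factor is (1 + annual_rate) ** (-1/12).
    """
    monthly_discount = (1.0 + annual_rate) ** (-1.0 / 12.0)
    return sum(cf * monthly_discount ** (t + 1) for t, cf in enumerate(cashflows))


# Sanity check: $110 received in twelve months, discounted at 10% per year.
pv = present_value([0.0] * 11 + [110.0], annual_rate=0.10)
print(f"{pv:.2f}")  # 100.00
```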
In the preceding example, Reuters' revenue share from Perplexity over the next two years is estimated to be worth $9.2M USD. While there is little publicly available information about recent deals, one data point is the deal between News Corp and OpenAI, which is reportedly worth $250M USD over five years. OpenAI's annual revenue is reportedly around $3.4 billion USD and has approximately doubled in the last twelve months. If we were to assume the same revenue and growth rate for Perplexity AI, extend our time horizon to five years to match the deal, and assume that Perplexity shares 65% of its revenue with publishers, then the most valuable content library (Reuters) would be worth $775M USD to Perplexity. OpenAI is not just a generative search engine, though, so a 65% revenue share with publishers is likely too high. In our framework, holding the other assumptions fixed, the $250M USD deal value would imply a 21% revenue share with publishers.
This post has explored our framework for valuing a content library from the perspective of an AI company looking to license content from publishers. Using our framework, a generative AI company can value content libraries based on the historical relevance and uniqueness of their content as well as forward-looking views on category relevance, revenue growth, and other factors. However, we have not yet discussed the other side of the market: how should publishers weigh the benefits of licensing content to AI companies against potential lost advertising and subscription revenue? Stay tuned! In our next post, we consider the same question from the perspective of publishers.
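The implied revenue share can be recovered with simple arithmetic, assuming the valuation scales linearly with the revenue share (which holds when the share is a plain multiplier on projected revenue and all other inputs stay fixed). The function name below is hypothetical, and the inputs are the figures quoted above.

```python
def implied_revenue_share(assumed_share, value_at_assumed_share, observed_deal_value):
    """Back out the revenue share that would reproduce an observed deal value.

    Assumption: valuation is proportional to the revenue share, with all
    other inputs (revenue, growth, horizon, discount rate) held fixed.
    """
    return assumed_share * observed_deal_value / value_at_assumed_share


share = implied_revenue_share(
    assumed_share=0.65,
    value_at_assumed_share=775e6,
    observed_deal_value=250e6,
)
print(f"{share:.1%}")  # 21.0%
```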