Some questions to ask about datasets and their importance for markets, firms and allegations of market power: Is the data already replicated elsewhere? Is the data replicable? Is it rivalrous? If it is, does the value of the dataset continue to increase with the size of the dataset? And if so, up to what point? This starts off a little theoretical but becomes more concrete towards the end. That said, any numbers below are hypothetical, to illustrate the argument, not to point fingers at any company or market.
Is the data already replicated elsewhere? LinkedIn may know my date of birth and where I went to University. Other companies do as well. Can the data that LinkedIn has be put together from data sources elsewhere? The European Commission essentially found the answer to be yes in its Microsoft / LinkedIn merger decision.
Is the data replicable? If the data hasn’t already been gathered elsewhere, could it be? A theoretical possibility – eg the lack of legal barriers – wouldn’t seem to be enough because, to pick just one factor that may be relevant to replicability, the cost of replicating the dataset may be significant. If the value of the entire market is 1000, and the cost of replicating the dataset is 501, then we may be looking at a natural monopoly: two providers would, at best, split the market’s value 500 each, so a second entrant could never recover its replication cost.
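A toy calculation makes the point (a sketch of the arithmetic only; the even split of value between two providers is a best-case assumption for the entrant, not a feature of any real market):

```python
# Toy arithmetic for the hypothetical figures above. Assumes, as a best
# case for a second entrant, that two providers would split the market's
# value evenly between them.

MARKET_VALUE = 1000      # total value available in the market (hypothetical)
REPLICATION_COST = 501   # cost of building the dataset (hypothetical)

best_case_revenue = MARKET_VALUE / 2  # 500: the most a second entrant earns
print(f"Second entrant viable? {best_case_revenue >= REPLICATION_COST}")
# -> Second entrant viable? False
# A sole provider, by contrast, faces 1000 of value against a cost of 501.
```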
Is the data rivalrous? This is one factor to take into account in the replicability of the dataset. LinkedIn may know my date of birth, but someone else could also ask for my date of birth. But if Google Search knows what I searched for, what results it displayed, and what I then clicked on, then no-one else can know that. So clickstream data looks to be rivalrous. That in itself doesn’t mean anything, of course. If there were – say – ten search engines, each capturing the rivalrous clickstream data of 10% of the market, and each continuing to trade profitably, then there would be no obvious grounds for concern. So that leads to the next question:
If the data is rivalrous, does the value of the dataset continue to increase with the size of the dataset? And if so, up to what point? This needs a little more discussion.
To understand the value of data on a market, we need to know whether the value of a dataset continues to grow with the growth of the dataset, or whether the value of the dataset flattens out – plateaus – at a certain size.
For example, if you are gathering data from moving vehicles in order to create a traffic data product, then I can imagine that a dataset that gathers data from only 1 in 100,000 drivers (0.001% of vehicles on the road at any one time) has limited value in working out current traffic conditions. A dataset of 0.01% of vehicles will be more valuable than a dataset of 0.001%. A dataset of 0.1% will be more valuable still. (That still may not be enough to create a useful product, of course: if, for example, you need data from 10% of all moving vehicles in order to create a traffic data product that works, then a dataset with only 5% of all moving vehicles may be worthless.)
But if you have a dataset of 10% of all moving vehicles, is a dataset of 20% much more valuable? Or 50%? There will almost certainly come a point where – at least for the purposes of mapping and traffic information – a larger dataset no longer increases in value. The curve flattens out.
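To make that shape concrete, here is a minimal sketch of such a value curve. Every parameter in it – the 10% threshold below which the product doesn’t work, the flattening around 20% coverage, the maximum value of 100 – is invented to match the hypothetical figures above, not drawn from any real market:

```python
import math

def dataset_value(coverage, threshold=0.10, plateau=0.20, max_value=100.0):
    """Hypothetical value of a traffic dataset given the share of moving
    vehicles it covers (0.0 to 1.0). Below `threshold` the product doesn't
    work, so value is zero; above it, value rises steeply and then flattens
    around `plateau` coverage. All parameters are illustrative."""
    if coverage < threshold:
        return 0.0
    # Exponential saturation: each extra vehicle adds less value than the
    # one before. The factor 3 puts value at ~95% of its maximum once
    # coverage reaches `plateau`.
    return max_value * (1 - math.exp(-3 * (coverage - threshold) / (plateau - threshold)))

for pct in (0.00001, 0.001, 0.05, 0.10, 0.20, 0.50, 1.00):
    print(f"{pct:9.3%} coverage -> value {dataset_value(pct):5.1f}")
```

On these made-up numbers, 5% coverage is worthless, and 50% or 100% coverage is barely more valuable than 20%: the curve has flattened.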
If the curve flattens at, say, 20%, then, at least in theory, there is space for multiple competing providers of data – each could hold a dataset big enough to build a competitive product.
If the curve does not flatten, or flattens only at a much higher point, then the number of competing providers becomes more limited. If a dataset covering 51% of the market is more valuable (in terms of improving the product for which the dataset is an input) than one covering 49%, then the market is perhaps a natural monopoly – because the provider with 51% of the data will be able to produce a product more useful to consumers than the provider with 49%. Consumers may then all move to the 51% provider – and as they do, that provider’s dataset grows while its rival’s shrinks, widening the quality gap further. (This depends on other factors of course. This is a simplification to focus on the dataset issue.)
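That feedback loop can be sketched too. This is deliberately crude: it assumes product quality tracks dataset share one-for-one, and that each period some of the smaller provider’s users switch in proportion to the quality gap – both my assumptions for illustration, not claims about any real market:

```python
# Crude tipping simulation. Product quality is assumed to track dataset
# share one-for-one; each period, some of the smaller provider's users
# switch in proportion to the quality gap. Starting shares and the
# switching rate are hypothetical.

share_a, share_b = 0.51, 0.49
SWITCH_RATE = 0.5  # sensitivity of switching to the quality gap (assumed)

for period in range(1, 21):
    switchers = SWITCH_RATE * (share_a - share_b) * share_b
    share_a += switchers
    share_b -= switchers
    if period % 5 == 0:
        print(f"period {period:2d}: A = {share_a:.1%}, B = {share_b:.1%}")
```

On these assumptions a two-point head start tips to near-monopoly within twenty periods. If the value curve flattened instead – so that quality stopped improving with share – the loop would stall and both providers could coexist.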
So working out whether the value of the dataset continues to grow with the size of the dataset is important.
And, yes, this is a network effects issue. They come up a lot.