Introduction

Our approach to understanding and studying cities has radically changed in the last two decades1, 2. Advances in spatial network theory3, 4, dramatic increases in the volume and type of available data5 and insightful approaches based on out-of-equilibrium concepts6, 7 have contributed to the formation of a new, highly interdisciplinary science of cities. It is common knowledge that a fundamental role in the urbanization process is played by retail activity and indeed a city’s social life and physical shape is greatly influenced by such activities. There is continual motion in cities as people move to work and play, in restaurants, bars, grocery shops, shopping malls and so on. Understanding and describing the mechanisms that govern these activities and the processes that lead to their agglomeration is not only intriguing, but crucial to an understanding of how a city works. For this reason modeling retail location has been a pillar of urban simulation for a long time8. Researchers have attempted different types of approaches to characterize this phenomenon, from entropy maximizing models9, 10, to random utility models11,12,13 to agent based approaches14, 15. However, retail location and consumer location choice theory continue to draw heavily from fundamental ideas16, 17 in central place theory18, 19 and bid-rent theory20. Since the late 1990s, advances in the new economic geography21, 22 have reinvigorated the field and reorganized it, by considering economies of scale, cross-dependencies in markets (e.g. between labour, retail and housing) and forms of imperfect competition between firms23. However, apart from notable exceptions24,25,26, fresh approaches have been slow in informing state-of-the-art modelling and engaging with mainstream location choice modelling based on discrete choice and random utility theory, one of the cornerstones of consumer preference theory within economics developed over the last 40 years. Here we present such a model of consumer location choice based on random utility theory designed to capture both internal and external economies of scale at the individual retailer level. This means that we will consider single retailers, rather than retail centres which is usually the case when applying this type of approach, and describe how their economic properties depend both on personal characteristics (internal economies) and on the interaction with the other neighbouring retailers (external economies). In order to formalise both internal and external economies we need to quantify how the rent paid by a single retailer is affected by its floor-space (its surface are measured in m2), and by its proximity to other retailers. Measuring this second effect will allow us to define an interaction range between individual retailers and understand the role played by retail agglomeration in consumer choice.

In the following discussion, we will examine the model’s theoretical foundations, and propose an implementation strategy which takes advantage of unconventional data sources that have recently become available for calibration and validation. We then review the model’s output predictions. Our study focuses on central London and the data gives us information on the spatial distribution of the population and work places (from the 2011 Population Census), on work to retail and home to retail trips (from the London Travel Demand Survey 2012) which is be used to calibrate the cost function of our model, and on the position, size and paid rent of single retailers (from the Valuation Office Agency 2010 data) which we use to quantify the effect of internal and external economies. Finally we use data from Foursquare, a website where users keep track and comment places they have been to, to validate the results. A richer description of the datasets we use is found below in the Methods section. As we will see, this work shows the central role retail agglomeration plays in consumer choice, and how this is created by local interactions at the microscopic level of the single retailer. In our model, consumers are evaluating the attractiveness of each shop by considering its size and the retail activity with a 325 m perimeter around it. This is in line with observations reported in literature on the average length of walking trips.

Results

The Model

Using as input the spatial distribution of population and individual retailer locations in London, our goal is to build a model of consumer choice capable of predicting the success of a retailer to a high level of accuracy. This will be measured by the expected number of people that will visit the retail store, or in other words, by the fraction of the distribution of trips that the model predicts will flow towards each single retailer. As implied in the introduction we have considered the retail activities found in the VOA dataset, and have treated them with no distinction of category. Under this simplifying assumption all retailers interact among each other in the same way (see supplementary material section 1, for an in depth analysis of the data and the retail categories we have included in our analysis). In order to build our model we only consider a retailer’s location in the city, and its floor-space, i.e. the area of its surface or its size. We start by defining the utility of consumer i shopping for product p from retailer r as

$${u}_{ir}={u}_{p}-{p}_{r}-c({d}_{ir})+{\omega }_{ir}$$
(1)

where u p is the utility that comes by acquiring the product, p r the price at which the product is sold by retailer r, c(d ir ) a generalised cost of traveling between r and i and where ω ir is a random element that reflects the personal tastes of the consumer for location and product variation. As one can see eq. 1 is a random utility function.

As one can see eq. (1) is a random utility function. The standard way of solving such an equation is by assuming the random element ω ir to be i.i.d. (independently and identically distributed) Gumble distributed. The choice of the distribution is given because of its closed form, which allows us to calculate the probability of i choosing r over all other r′ possibilities. Details of the derivation of the equation are found in the supplementary material section 2. For now one only needs to understand that we define the probability of consumer i shopping in r as the probability of its utility being the greatest over all other alternatives \(P({u}_{ir} > {u}_{ir^{\prime} })\,\forall r^{\prime} \ne r\), which has the form

$${p}_{i\to r}=\frac{\exp \,(-{\beta }_{i}{u}_{ir})}{{\sum }_{r^{\prime} }\exp \,(-{\beta }_{i}{u}_{ir^{\prime} })}=\frac{\exp \,(-{\beta }_{i}({u}_{p}-{p}_{r}-c({d}_{ir})))}{{\sum }_{r^{\prime} }\exp \,(-{\beta }_{i}({u}_{p}-{p}_{r^{\prime} }-c({d}_{ir^{\prime} })))}$$
(2)

where β i is the inverse of the standard deviation of the distribution of ω ir . If we now consider β i  = β for every consumer, eq. (2) will depend only on the location of consumer i and retailer r.

Eq. (2) implies that each retailer r offers one, and only one, type of product p. Therefore, each consumer i associates only one random utility component ω ir with each retailer, reflecting the fit between this sole product variation and the preferences of the consumer. This assumption is quite unrealistic; typically retailers will stock several types of products. It is reasonable to assume that larger shops will stock larger varieties. Therefore, variety can be expressed as a function of floor-space. Following the work of Daly and Zachary27 and of Ben-Akiva and Lerman in ref. 28, and assuming that the price is constant for every retailer in the system, namely p r  = p = cost, the probability of consumer i shopping with retailer r is

$${p}_{i\to r}=\frac{{f}_{r}^{\alpha }\,\exp \,(-\beta \,c({d}_{ir}))}{{\sum }_{r^{\prime} }{f}_{r^{\prime} }^{\alpha }\,\exp \,(-\beta \,c({d}_{ir^{\prime} })}$$
(3)

where f r is the floor-space of retailer r and the exponent α quantifies the correlation between the stochastic components and the product variation. The model in eq. (3) is very similar to the multinomial logit model presented by McFadden29, 30 and to the entropy maximising equation introduced by Wilson9. Macroscopically, this suggests that the utility that consumer i gets from buying from retailer r is either a sub-linear (α < 1), linear (α = 1), or super-linear (α > 1) function of the floor-space of retail unit r. These internal economies (or dis-economies) of scale emerge from the application of a transparent utility-based approach and reflect perceived opportunities at the level of the individual retailer. Eq. (3) captures the potential impact of size at the individual shop level. As such it is sufficient in describing flow patterns and activity distribution associated with shop-size variations. But as we will see in the following section, it fails to account for the concentration of retail activity in clusters (markets/shopping centers). In other words, eq. (3) cannot explain agglomeration effects in the spatial distribution of retail activity.

In order to explicitly introduce agglomeration into the model, we start by defining the attractiveness of a retailer as

$${A}_{r}={f}_{r}^{\alpha }+\sum _{r^{\prime} :{d}_{rr^{\prime} }}\,{f}_{r^{\prime} }^{\alpha }\,\exp \,(-\gamma {d}_{rr^{\prime} })$$
(4)

where d rr is the distance between retailers, and γ is parameter that weights the role of the retailer’s neighborhood in the consumer’s perception. The perceived utility associated with a given retailer is influenced by the utility associated with the neighboring retailers. By substituting eq. (4) into eq. (3) we finally get the form of the distribution to be used in our model as:

$${p}_{i\to r}=\frac{{A}_{r}\,\exp \,(-\beta \,c({d}_{ir}))}{{\sum }_{r^{\prime} }{A}_{r^{\prime} }\,\exp \,(-\beta \,c({d}_{ir^{\prime} }))}$$
(5)

In the location choice model presented in eq. (5), we can see how the parameter α controls the internal and γ the external economies of scale. Once again α tells us if the utility scales sub-, super- or linearly with the retailer’s floor-space, while the γ values tune the interaction range between retailers: \(\gamma \to \infty \) implies that the utility of visiting a retailer is only a function of its own floor-space f r , while in the opposite case, when γ → 0, the utility of a retailer is equally shaped by all shops in its vicinity. Note that for α = 1 and \(\gamma \to \infty \), we derive the multinomial logit model. The model in eq. (5) is a particular case of the Cross-Nested Logit model (CNL)31, 32 in which retailer r represents a nest, and the extent of the overlap between nests is controlled by γ.

Now exploiting eq. (5), we can define the modeled turnover, i.e. the fraction of consumers that will shop in a given destination, for each retailer r as

$${Y}_{r}=\sum _{i}\,{n}_{i}\cdot {p}_{i\to r}=\sum _{i}\,(\frac{{A}_{r}{{\rm{e}}}^{-\beta C({d}_{ir})}}{{\sum }_{r^{\prime} }{A}_{r^{\prime} }{{\rm{e}}}^{-\beta C({d}_{ir^{\prime} })}})\cdot {n}_{i}$$
(6)

where we have considered the population concentrated in centroids i,and where n i indicates their respective populations. Therefore the term n i  · p ir tells us the fraction of the population of centroid i that will travel to retailer r, and depends both on the attractiveness A r of the retailer and on the distance between the two points. It will then be possible to compare the results yielded by eq. (6) with the actual rents paid by retailers. Quantifying the two parameters α and γ provides us with insight into the economical relationships between retailers which is the main objective of this paper. To do this we have to first calibrate the cost function which expresses how consumers perceive costs related to the distance between their origin and a shop.

Calibration of Cost Function

We define a cost function C(d, λ) as a Box-Cox transformation of the distance, namely

$$c(d,\lambda )=\frac{{d}^{\lambda }-1}{\lambda }$$
(7)

where d is the distance between the consumer and the retailer’s location. The Box-Cox function adds an extra modeling layer that maps objective costs into perceived costs. The curve of this transformation ranges from linear (λ = 1) to logarithmic (λ → 0). In the former case the marginal effect of cost travel d on trip distribution is exp(−β * d), while in the latter it becomes d β. To calibrate the two parameters β and λ we use the London Travel Demand Survey (LTDS) database which contains 5004 home to retail and 2242 work to retail trips (see Methods for more details). These numbers are not sufficient to calibrate the location choice model in full; we thus cannot use this dataset to give value to all the parameters used in model. In fact the LTDS supplementary report (TfL 2011) suggests that the sample size is sufficient only for an Inner/Outer London spatial classification. Therefore the survey is only used to calibrate distance profiles i.e. the distribution of distance traveled for shopping from home and work. As an input for the population distribution, we have used London Population Census data at the Lower Super Output Layer level (LSOA), where we have considered the population concentrated on the centroids. As we can see below in the Methods section, the database includes coordinates of the LSOA centroids, population and working population.

We start by arbitrarily assigning values to the α and γ parameters, which sets the values for the retailers’ attractiveness in eq. (4). We chose α = {0.8, 1.0, 1.8} in order to cover the 3 different scenarios of sub-linear, linear and super-linear scaling of attractiveness with floorspace, and γ = {0.005, 0.05, 0.5} which bearing in mind that we are working in meters, corresponds to a decay length of {200 m, 20 m, 2 m} respectively.

Now exploiting eq. (5) we can define the probability the model generates a trip of distance d as

$${p}_{m}(d)=\frac{{\sum }_{k}{n}_{k}\,{\sum }_{r:d < {d}_{rk} < d+{b}_{d}}{A}_{r}\cdot {e}^{-\beta C(\lambda ,{d}_{rk})}}{{\sum }_{k^{\prime} }{n}_{k^{\prime} }\,{\sum }_{r}{A}_{r}\cdot {e}^{-\beta C(\lambda ,{d}_{rk^{\prime} })}}$$
(8)

where the b d in the numerator is a conveniently defined bin. In other words eq. (8) tells us the fraction of trips that are of length d. To fix the β and λ values, we tune them so to maximize the likelihood equation

$${\mathbb{L}}(\beta ,\lambda )=\sum _{d}\,{p}_{e}(d)\cdot \,\mathrm{ln}({p}_{m}(d))$$
(9)

where d is the length of the trips and p e (d) is its observed distribution in the LTDS. We repeat this process on both types of trips, in Fig. (1) and for every couple (α, γ) in the previous list we obtain a couple (β, λ) that maximizes the likelihood. The results yielded for β and λ are robust for the different α, and γ inputs, and the variations are of ±0.1. As we can see from Fig. (1) there is a good agreement in both cases between the distributions observed in the data and those generated by the model with parameter values as β h  = 0.25, λ h  = 0.3, β w  = 0.35, λ w  = 0.25.

Figure 1
figure 1

We show the maximum likelihood modeled distributions and the observed ones. For home-retail trips in (a) β h  = 0.25 and λ h  = 0.3, while for work-retail trips in (b) we have found β w  = 0.35 and λ w  = 0.25. As we can see from the figures in both cases the parameters that maximize the likelihood generate a distribution that fits the observed one very well.

Correlation with Data

Considering the two different types of trips, the total modeled turnover predicted by the model is

$${Y}_{r}={Y}_{r}^{w}+{Y}_{r}^{h}=\sum _{i}\,({n}_{i}^{w}{p}_{i\to r}^{w}+{n}_{i}^{h}{p}_{i\to r}^{h})$$
(10)

and at an aggregated level this quantity can be used an indicator of the success of a retailer, with the idea that the more people visit a retailer the more that retailer earns. A more careful analysis should definitely take into account the type of goods sold by the retailer: indeed given the same number of visits, a car show room and a grocery corner shop yield different earnings, and a clothing shop in the vicinity of another clothing shop will have a different impact to that of a supermarket. These are two significant aspects that we leave for more detailed analysis in further research.

All the information on the retailers are found in the Valuation Office Agency database (see Methods VOA for more details), where we have found, among other data, information on the position, floorspace and the ratable value per meter for 98936 retailers in London. The rateable value represents the Valuation Office Agency’s estimate of the open market annual rental value of a business/non-domestic property, i.e. the rent the property would let for on the valuation date, if it were being offered on the open market; as such, it is considered a very good indicator of the property value. Our hope is that the number of visits estimated by the model in eq. (10) to a retailer of any type, could be a good predictor of its rent. To test this hypothesis we calculate the Pearson correlation between the two quantities

$${Y}_{{\rm{sq}}}(\alpha ,\gamma )=\frac{{Y}_{r}(\alpha ,\gamma )}{{f}_{r}}\quad {R}_{{\rm{sq}}}=\frac{{\rm{Rateable}}\,{\rm{Value}}}{{f}_{r}}$$
(11)

where f r is the floor-space. Not dividing by f r would result in higher levels of correlation, not generated however through higher accuracy of the model but because of the variable f r explicitly appearing in eq. (4), and because of the strong correlation between the two quantities

We therefore study the behavior of

$$C\,(\alpha ,\gamma )=\frac{{\rm{cov}}\,({Y}_{{\rm{sq}}},{R}_{{\rm{sq}}})}{{\sigma }_{{Y}_{{\rm{sq}}}}{\sigma }_{{R}_{{\rm{sq}}}}}$$
(12)

looking for the values (α max, γ max) that maximize the correlation. These parameters will be considered those that best describe the retail activity through internal and external economies. In Fig. (2a)a we show a heat map of the correlations found exploring the parameter space. The maximum correlation is C max = 0.48 which seems like a reasonable value given the level of aggregation we are working at with the parameter values (α max, γ max) = (1.3, 0.008). We should stress at this point that we do not consider the correlation level obtained as a validation of the model. These values are those normally obtained by models of this type and probably depend more on density effects33, 34 than on the details of our model. As we have said, our main goal is to quantify and weight of internal and external economies, and we will now do this characterising the values of the parameters that maximise the correlation. As we can see the α max tells us that a model with a super linear scaling between the number of visitors and floor-space best describes the data, while the γ max value indicates that retailers benefit from their proximity to other retailers given a characteristic length of the decay of \(\tfrac{1}{\gamma }=125\,m\).

Figure 2
figure 2

In this figure we present the results obtained comparing the retailer’s ratable value found in the VOA dataset, with the modeled turnovers Y(α, γ). In (a) we show a heat map of the correlation between the two quantities, we can see a clear maximum in the correlation for α max = 1.3 and γ max = 0.008. In (b) we show the correlation with the model of γ = 0, which implies that the attractiveness becomes \({A}_{r}={f}_{r}^{\alpha }+{\sum }_{r^{\prime} :{d}_{rr^{\prime} } < {d}_{o}}\,{f}_{r}^{\alpha }\), and we have studied the correlation for different values of d o . We can see how the maximum is found in α max = 1.2 and d max = 325 m which is 2.6 times the characteristic length of the decay and coincides with a dampening effect of 93% of the interactions. In (c) we compare the correlation in the case where \(\gamma =\infty \), or with no interactions, which is the form used in other more aggregated models. We can see how in this microscopic formalization the model always correlates better with the data for finite values of γ.

We can see in Fig. (2a) how for increasing values of γ, which imply a faster decay and therefore a shorter interaction range between retailers, the correlation levels decrease. In the extreme case of \(\gamma =\infty \), the attractiveness term in eq. (4) becomes \({A}_{r}={f}_{r}^{\alpha }\) meaning that retailer attractiveness only depends on its floor-space. An A r of this form is what is usually found in standard retail models such as those in ref. 9, and works well when considering retail centers. From these results it seems that for a microscopical formalization of the retail activity, this picture fails as a satisfactory description. In Fig. (2c) we compare the correlations obtained from the model with interactions and with no interactions. We can see how both models exhibit the maximum for the same value of α and how the model with interaction has constantly higher correlation values. This means that agglomeration has to be included when modeling consumers’ choice at the level of the single retailer.

Another interesting scenario is the opposite case with γ = 0, where consumers perceive the neighborhood of a retailer in the same way for a given distance. The attractiveness term in eq. (4) becomes

$${A}_{r}={f}_{r}^{\alpha }+\sum _{r^{\prime} :{d}_{rr^{\prime} } < {d}_{o}}\,{f}_{r}^{\alpha }$$
(13)

where we have introduced a parameter d o which explicitly sets the interaction range between retailers. We can now study the Pearson correlation for different values of the radius d o and α. In Fig. (2b) we can see how the maximum values are found for α = 1.2 which is very similar to what has been previously found for \({d}_{o}^{{\rm{\max }}}=325\), but despite the correlation levels being very close C max = 0.45 for the model with the exponential decay, this model typically shows lower values. What this seems to imply is that consumers implicitly take into account the distance and weight closer shops more than more distant ones, instead of equally considering their proximity.

In order to better quantify the level of agglomeration predicted by the model we can now compare the results in Fig. (2a,b). As implied, an exponent of γ = 0.008 indicates a characteristic decay length of \(\tfrac{1}{\gamma }=125\,m\) for the interaction range. If we compare this result with the γ = 0 model, we see that \({d}_{o}^{{\rm{\max }}}=2.6\cdot \tfrac{1}{\gamma }\), and if we substitute it in the exponential, we get \({{\rm{e}}}^{-0.008{d}_{o}^{{\rm{\max }}}}=0.07\) which means that at that distance retailers loose 93% of the benefits of neighboring floor-space. This analysis suggests that retail activity benefits from agglomeration, and this benefit is spread through local interactions between single retailers, whose main effect is felt within 325 m. From the consumer’s point of view, we could say that when choosing the retailer to visit, consumers take into account the amount of floor-space or choice within a radius of around 325 m. This result is in very good agreement with several studies on walking trips as in ref. 35 where it is shown that 65% of all walking trips are under \(0.25\,{\rm{miles}}\simeq 400\,{\rm{m}}\), and in ref. 36 where the average walking distance for several types of retail activity are just above \(\simeq \)500 m. We interpret these results as follows: the choice perceived when considering a specific retailer is quantifiable as its floor-space, and the floor-space of retailers which can be reached by foot.

Discussion

In this section we will work through the tests we have done in order to validate the levels of agglomeration and the scaling law predicted by the model. We have done so by repeating the same procedure both on a randomized data set and on data coming from Foursquare, a web site where people check in, and comment bars restaurants and other retail activities (see supplementary material for more information). We have randomized the data both in the rents/m 2, and in their locational positions. To randomize the rents/m 2, we have first measured their distribution in the VOA dataset and then assigned a random rent to each retailer taken from the same measured distribution, while to randomize the position we have simply reshuffled the position of retailers taking as lower and upper bounds the minimum and maximum coordinates found in the data. In both cases the correlation between the rents and the turnovers predicted by the model disappeared completely (C max 10−3). This seems to imply that there is a clear relation between the retailers’ spatial interactions and the rents they pay, and that these are captured, at an aggregated level by eq. (4).

A further validation consists in testing the model on check-in data coming from Foursquare. As we can see in the Methods section, this data set gives us information on the number of visitors who checked in a given venue on the application. The dataset contains all Foursquare venues within the M25 motorway, which consists of 300 × 103 venues, of which we have filtered just above 100 × 103 retail activity (more information on this process is in the Methods section). Each record contains location coordinates, number of check-ins since the venue was registered together with a detailed venue category. For each retailer r in the VOA dataset, we have summed the number of check-ins that were registered in the Foursquare dataset in a range R from the retailer. To choose the range R, we have studied the correlation between the Foursquare check-ins and the rent/m 2 seen in the VOA dataset.

As we can see from Fig. (3a) the maximum level of correlation is obtained by fixing R = 250 m as the range to count check-ins. Curiously enough this is quite close to the agglomeration characteristic exponential decay length. In Fig. (3b) we show the correlation between the modeled turnovers and the check-ins happening in range R = 250 m from the retailers. We can see how the two parameters for which the correlation is maximum are (α max, γ max) = (1.8, 0.01) which, considering the big difference in the data used, can be considered incredibly similar to what we previously obtained. The α value is still in line with a super-linear scaling between floor-space and attractiveness, and the γ implies a characteristic decay length of \(\tfrac{1}{\gamma }=100.0\,m\), against the \(\tfrac{1}{\gamma }=125\,m\) previously predicted. Once again given the great difference between the two datasets we have used, we consider that these results are a reassuring sign of the robustness of our findings.

Figure 3
figure 3

In the left panel in (a), we show the correlation between the rateable value of retailers in the VOA dataset, and the check-ins found in Foursquare around them for different radii. We can see how the correlation peaks at R = 250 m. In the right panel in (b) we show the correlation between the expected turnovers Y(α, γ) of each shop and the number of check-ins measured within a radius of R = 250 m from it. The set of parameters that maximize the correlation are α max, γ max = (1.8, 0.01), which are quite similar to what we found for the VOA rateable values.

In order to understand the strength and weaknesses of our model, for each retailer we have studied and compared the difference between the modeled turnover Y(α max, γ max) and the actual VOA rent. This will allow us to find out for which type of retailers the model is more accurate and which will have to be the future steps to better shape it. The first forced step is to make the two quantities comparable. We call R the vector of the fraction of rents, where the ith component \(\frac{{R}_{i}}{{\sum }_{i}\,{R}_{i}}\) represents the fraction of the total rent in the system that belongs to the ith retailer. In the same way, we define the vector of the fraction of modeled turnovers Y, with components \(\frac{{Y}_{i}}{{\sum }_{i}\,{Y}_{i}}\). We can now define the E error vector as

$${\bf{E}}=\,\mathrm{log}\,(\frac{{\bf{Y}}}{{\bf{R}}})$$
(14)

The components of the error vector in eq. (14) will indicate an overestimation of the success of a retailer, if E i  > 0 and an underestimation in the opposite case E i  < 0. By observing the characteristics of the retailers for which we underestimate and overestimate the rent, we can see how patterns emerge. In Fig. (4a) we show a ranked distribution of the errors E i . From the asymmetry of the curve we realize how the model, in this interpretation, tends to underestimate a retailer’s success. Indeed for the vast majority of the retailers we analysed, we noted that the Y i is smaller than their R i which implies that the overestimation errors are typically larger. We can therefore study the characteristics of the retailers we have overestimated, those above the red dashed line, and compare them with those we have underestimated, below the blue line. Observing the red curves in Fig. (4b,c), we can see how the model tends to overestimate retailers with a small floor-space surrounded by many neighbors, which looking at our definition of attractiveness in eq. (4) is predictable. On the other hand we can see how the blue curves, which represent the distributions of the underestimated retailers are very similar to the the distribution of the full dataset (indicated by the black dashed line). From the figures it is clear that the model underestimates the number of visits to big retailers in locations with low retail activity concentration, such as supermarkets and other retailers that likely fall under convenience rather than comparison retail, where intra-store variety is more important than inter-store variety.

Figure 4
figure 4

In this figure we present the results obtained analyzing the model’s errors in estimating the retailers’ success. In (a) we have ordered the errors by rank. The red and blue dashed lines indicate the thresholds we have used to define the subset of overestimated and underestimated retailers. In (b,c) we show that the overestimated retailers tend to be smaller than average with more floor-space than average in their neighborhood.

Outlook

In this paper we have presented a location choice model, based on random utility theory and following the growing popularity of the cross-nested choice structure, we have tested this on unconventional data sources of different types. The novelty of the proposed model is that it simulates retailers at the individual level. This opens up exciting opportunities towards integrating the consumer location choice component with explicit retail location micro simulation models able to take full advantage of the emerging availability of detailed data sources while also incorporating complex behavior on price setting, network dynamics and risk management. The proposed model in its current form has been simplified (with no loss of generality) into assuming that all retailers offer unique varieties of the same product. Moreover, it has been assumed that (i) all consumers have equal disposable retail budgets regardless of their location, (ii) all trips are uni-purpose (only shopping is considered), (iii) VOA rateable values are good indicators of floor-space rents and (iv) product prices do not vary in space. These assumptions mean that a considerable part of the complexity of the decision making mechanism is not represented by the model. The VOA and LTDS datasets that we are currently using have the potential to increase the complexity of the model significantly towards removing some of the existing simplifying assumptions, and when combined with passively collected social media datasets and formal datasets on economic activity (e.g. from the Business Structure Dataset developed by ONS), these models could offer sufficient detail to capture all the main dimensions of behavioral variation. Having said this, the basic model that we present in this paper remains very useful both as the baseline example of the proposed approach and as a benchmark; despite its simplicity, it is able to capture and reproduce both internal and external economies of scale very closely. Further immediate lines of research could consider different categories of retailers separately as well as applying the same analysis on data coming from other cities. Looking at the not too distant future, passively generated datasets of human presence promise to offer deeper insights into the dynamics of urban activities, including spatio-temporal patterns of shopping behavior. Having said that, the abundance of existing social media datasources and the relentless pace in which new data are currently being introduced, sustain the promise of accessible and highly dis-aggregated spatiotemporal information for anyone who manages to overcome lack of specification, representational biases and possibly absence of context37.