### Quantifying the number of hospital-acquired infections

#### Inferential approach

We estimate the total number of hospital-acquired infections in trust *i* (combining observed and unobserved infections), *z*_{i}, by applying Bayes’ formula:

$$P(z_i \mid y_i, \pi'_i) = \frac{P(y_i \mid z_i, \pi'_i)\,P(z_i)}{P(y_i \mid \pi'_i)}$$

where \(\pi'_i\) represents the probability that an infection acquired by a patient in trust *i* is both detected by a PCR test and meets the definition of a hospital-acquired infection (which requires the first positive sample to be taken 15 or more days after the day the patient is admitted to the trust and before patient discharge), assumed independent of *z*_{i}. Here, \(P(y_i \mid z_i, \pi'_i)\) represents the binomial likelihood of observing *y*_{i} identified hospital-acquired infections, \(P(z_i)\) is the prior distribution for the total number of infections, which we take to be uniform (bounded by 0 and 20,000), and we calculate \(P(y_i \mid \pi'_i)\) using the law of total probability, \(P(y_i \mid \pi'_i)=\sum_l P(y_i \mid \pi'_i, z_i=l)\,P(z_i=l)\).
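The posterior described above can be evaluated on a grid of candidate totals. The following is a minimal sketch rather than the authors' implementation: the function names and the example inputs (`y=30`, `pi_prime=0.25`, and a reduced upper bound of 2,000 to keep the example fast) are hypothetical. It works on the log scale to avoid underflow, and it assumes \(0<\pi'_i<1\).

```python
import math

def log_binom_pmf(y, z, p):
    """Log binomial probability of y detected infections out of z in total."""
    if y > z:
        return -math.inf  # cannot detect more infections than occurred
    log_coeff = math.lgamma(z + 1) - math.lgamma(y + 1) - math.lgamma(z - y + 1)
    return log_coeff + y * math.log(p) + (z - y) * math.log1p(-p)

def posterior_total_infections(y, pi_prime, z_max=20000):
    """Posterior P(z | y, pi') for z = 0..z_max under a uniform prior on z."""
    log_like = [log_binom_pmf(y, z, pi_prime) for z in range(z_max + 1)]
    peak = max(log_like)
    # The uniform prior is constant in z, so it cancels on normalisation.
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)  # denominator from the law of total probability
    return [w / total for w in weights]

# Hypothetical inputs for illustration only.
posterior = posterior_total_infections(y=30, pi_prime=0.25, z_max=2000)
posterior_mean = sum(z * p for z, p in enumerate(posterior))
posterior_mode = max(range(len(posterior)), key=posterior.__getitem__)
```

Credible intervals follow from the cumulative sum of the returned list.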

#### Effect of testing policy

The probability that a new hospital-acquired infection in trust *i* is detected is given by \(\pi_i=\sum_{m,d}\gamma_{imd}P_{imd}\), where \(P_{imd}\) is the probability that a patient admitted to trust *i* with length of stay *m* and infected on day of stay *d* (where *d* ≤ *m*) has a positive PCR test while in hospital and \(\gamma_{imd}\) is the probability that, given a new hospital-acquired infection in trust *i* occurs, it occurs in a patient with length of stay *m* on day of stay *d*. Similarly, the probability that a new hospital-acquired infection is both detected and meets the definition of a hospital-acquired infection is

$$\pi'_i=\sum_{m,d}\gamma_{imd}P'_{imd}$$

where \(P'_{imd}\) is the probability that an infection in a patient admitted to trust *i* with length of stay *m* infected on day of stay *d* is both detected and meets the definition of a hospital-acquired infection.

Consider an infection that a patient acquires *d* days after the day the patient is admitted to the hospital. The testing policy in place in the trust during the patient’s stay, the day of infection and the incubation period distribution together determine the probability that a patient is tested on day *k* after the patient is infected (for *k* = 0, 1, 2, 3 …). We assume the test has a specificity of 1. Let *ϕ*_{k} represent the sensitivity of a PCR test taken *k* days after the date of infection, and let \(\tau_{ik}\) represent the probability that such a test is performed *k* days after the infection event, assumed to be independent for each value of *k* of whether a test is performed on any other day. Then, \(P_{imd}=1-\prod_{k=d}^{m}\left(1-\tau_{i(k-d)}\phi_{k-d}\right)\).

The corresponding probability, \(P'_{imd}\), is zero for *m* < 15 (because in that case the definition of hospital-acquired infection is not met); otherwise, it is given by the probability that there is no positive test before day 15 and at least one positive test on or after day 15. For *d* ≥ 15 this probability is identical to \(P_{imd}\); otherwise, it is given by

$$P'_{imd}=\left[\prod_{k=d}^{14}\left(1-\tau_{i(k-d)}\phi_{k-d}\right)\right]\left[1-\prod_{k=15}^{m}\left(1-\tau_{i(k-d)}\phi_{k-d}\right)\right].$$
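The two detection probabilities can be computed directly from the products above. This is an illustrative sketch under stated assumptions: `tau` and `phi` are plain lists indexed by days since infection (*k* − *d*), the function names are hypothetical, and the constant values in the usage lines are made up for illustration rather than taken from the study.

```python
def prob_no_positive(tau, phi, d, first_day, last_day):
    """Probability of no positive test on days first_day..last_day of stay
    for an infection acquired on day d (tau and phi are indexed by k - d)."""
    prob = 1.0
    for k in range(max(first_day, d), last_day + 1):
        prob *= 1.0 - tau[k - d] * phi[k - d]
    return prob

def p_detected(tau, phi, m, d):
    """P_imd: at least one positive test between day d and day m."""
    return 1.0 - prob_no_positive(tau, phi, d, d, m)

def p_detected_and_classified(tau, phi, m, d, threshold=15):
    """P'_imd: first positive test falls on or after the threshold day."""
    if m < threshold:
        return 0.0  # stay too short to meet the definition
    if d >= threshold:
        return p_detected(tau, phi, m, d)
    early_negative = prob_no_positive(tau, phi, d, d, threshold - 1)
    late_positive = 1.0 - prob_no_positive(tau, phi, d, threshold, m)
    return early_negative * late_positive

# Hypothetical constant testing probability and sensitivity.
tau = [0.5] * 30
phi = [0.8] * 30
p_any = p_detected(tau, phi, m=20, d=10)
p_hai = p_detected_and_classified(tau, phi, m=20, d=10)
```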

If \(\lambda_{im}\) represents the probability that a patient at risk of nosocomial infection with SARS-CoV-2 admitted to trust *i* has a length of stay of *m* days, then, on a given day, the expected proportion of patients who both have a length of stay of *m* days and are currently on day of stay *d* is given by \(\psi_{imd}=\left[\frac{\lambda_{im}m}{\sum_n\lambda_{in}n}\right]\frac{1}{m}I(m\ge d)\), where \(I(m\ge d)\) is the indicator function, \(\left[\frac{\lambda_{im}m}{\sum_n\lambda_{in}n}\right]\) is the probability that on a randomly chosen day a randomly chosen patient has a length of stay *m* and \(\frac{1}{m}\) is the probability that this randomly chosen day is day *d* of stay. Analysis of individual-level patient data indicates that although daily risk of infection changes over calendar time, it does not vary appreciably with day of stay *d* for typical lengths of stay^{9}, and we therefore approximate \(\gamma_{imd}\) by \(\psi_{imd}\), which we estimate on the basis of the reported lengths of stay of completed episodes of patients admitted to each trust over the time period considered. This will represent a reasonable approximation provided that the infection hazard is small and roughly constant over a patient’s hospital stay.
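The length-biased weights \(\psi_{imd}\) and the resulting overall detection probability can be sketched as follows for a single trust. The function names and the toy length-of-stay distribution are hypothetical, and the sketch assumes that days of stay are numbered 1 to *m*.

```python
def day_of_stay_weights(length_of_stay_pmf):
    """Return psi[(m, d)] from a dict {m: lambda_m} of length-of-stay
    probabilities for one trust."""
    total_bed_days = sum(lam * m for m, lam in length_of_stay_pmf.items())
    psi = {}
    for m, lam in length_of_stay_pmf.items():
        share_of_bed_days = lam * m / total_bed_days  # length-biased weight
        for d in range(1, m + 1):                     # indicator I(m >= d)
            psi[(m, d)] = share_of_bed_days / m       # each day equally likely
    return psi

def overall_probability(psi, prob_by_stay_and_day):
    """Sum of psi_md * P_md over all (m, d), using psi in place of gamma."""
    return sum(w * prob_by_stay_and_day(m, d) for (m, d), w in psi.items())

# Toy distribution: half of admissions stay 1 day, half stay 3 days.
psi = day_of_stay_weights({1: 0.5, 3: 0.5})
```

In this toy case the 3-day stays account for three quarters of occupied bed-days, so each of their three days carries the same weight as the single day of the 1-day stays.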

#### Testing policies considered

We consider several different testing policies, which determine the probability values that the test is performed on day *k* after infection in trust *i* \((\tau_{ik})\), as exact data on what policies were available in each trust are unavailable.

The minimal testing policy, which involves the fewest tests, requires only that patients displaying symptoms of COVID-19 are tested, and we assume all such patients are tested on a single occasion, the date of symptom onset. When this policy is in place, the time of testing of patients with hospital-acquired infections, in relation to the time of infection, is determined by the incubation period and such a test is assumed to be performed if and only if the patient develops symptoms on or before the day of discharge. A second testing policy extends this by assuming that in the event of a negative screening result from a patient with symptoms, daily testing will continue to be performed until patient discharge, the first positive test or three consecutive negative tests (whichever occurs first). We consider further testing policies which combine symptomatic testing (without retesting if negative) with routine asymptomatic testing. In these policies all patients who have not already tested positive are screened at predetermined intervals using the same PCR test. We consider weekly, twice weekly, three times weekly and daily testing of all in-patients as well as a policy of testing twice in the first week of stay (in accordance with national guidance in England).

#### Accounting for uncertainty in test sensitivity, incubation period distribution and the proportion of infections that are symptomatic

For a given length-of-stay distribution, incubation period distribution, PCR sensitivity profile and probability that infection is symptomatic, the calculations outlined above to determine the probability that an infection is detected or both detected and classified as a hospital-acquired infection are deterministic, and require no simulation. We account for uncertainty in these quantities through a Monte Carlo sampling scheme, at each iteration sampling new values for PCR sensitivities, the incubation period distribution and the proportion of infections that are symptomatic. For PCR sensitivities, we directly sample from the posterior distribution reported by Hellewell et al.^{16}. For the incubation period we assume a lognormal distribution, and sample the parameters from normal distributions with means (s.d.) of 1.621 (0.064) and 0.418 (0.069) as estimated by Lauer et al.^{32}. Estimates of the proportions of infections that are symptomatic are taken from Mizumoto et al.^{33} and this quantity is sampled from a normal distribution with mean (s.d.) of 0.82 (0.012). Length-of-stay distributions are directly obtained from the Secondary Uses Service for NHS acute trusts, excluding: (1) patients who were admitted with PCR-confirmed COVID-19; (2) patients who had samples taken in the first 7 d of their hospital stay that were PCR positive for SARS-CoV-2; and (3) patients with a length of stay of less than 1 d. In the primary analysis we use aggregate length-of-stay data for all trusts taken from the 12 month period from 1 March 2020. We also present results from two sensitivity analyses: in the first we use trust-specific \(\lambda_{im}\) values; in the second we allow for the possibility that length-of-stay distributions change over time and use period-specific empirical length-of-stay distributions from the periods: June to August 2020; September to November 2020; and December 2020 to February 2021.
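One iteration of such a sampling scheme might look like the sketch below. Several points are assumptions made for illustration and are not stated in the text: that the first pair of values is the log-scale mean and the second the log-scale s.d. of the lognormal incubation period, that the incubation period is discretised into whole days, and that under the minimal policy the testing probability on day *k* is the symptomatic proportion multiplied by the probability of onset on that day (ignoring truncation at discharge). PCR sensitivities are omitted because the posterior samples of Hellewell et al. are not reproduced here. All function and key names are hypothetical.

```python
import math
import random

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal with log-scale mean mu and log-scale s.d. sigma."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def draw_inputs(rng, max_days=30):
    """Draw one set of uncertain inputs for a Monte Carlo iteration."""
    mu = rng.gauss(1.621, 0.064)            # assumed: log-scale mean
    sigma = rng.gauss(0.418, 0.069)         # assumed: log-scale s.d.
    p_symptomatic = rng.gauss(0.82, 0.012)  # proportion symptomatic
    # Probability that symptom onset falls k days after infection.
    onset_pmf = [lognormal_cdf(k + 1, mu, sigma) - lognormal_cdf(k, mu, sigma)
                 for k in range(max_days)]
    # Assumed minimal policy: one test, on the day of onset, if symptomatic.
    tau_minimal = [p_symptomatic * p for p in onset_pmf]
    return {"mu": mu, "sigma": sigma, "p_symptomatic": p_symptomatic,
            "onset_pmf": onset_pmf, "tau_minimal": tau_minimal}

rng = random.Random(2020)  # fixed seed so the sketch is reproducible
draws = [draw_inputs(rng) for _ in range(1000)]
```

Each draw would then be passed through the deterministic detection calculations, giving a distribution of detection probabilities rather than a single value.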

#### Quantifying drivers of nosocomial transmission

We used generalized linear mixed models to quantify factors associated with nosocomial transmission. In these models the dependent variable was either the observed number of healthcare-associated infections in trust *i* and week *j* among patients, \(y_{ij}\), or the imputed number of infections in HCWs, \(y'_{ij}\). When the dependent variable was healthcare-associated infections in patients, we used ECDC criteria, repeating the analysis using three different classifications of healthcare-associated infection: (1) definite; (2) definite and probable; (3) definite, probable and indeterminate. Three classes of independent variables were considered: (1) known exposures to others in the same trust infected with SARS-CoV-2 to account for within-trust temporal dependencies, with separate terms corresponding to exposures in the previous week to patients with community-onset SARS-CoV-2 infections \((z_{i(j-1)})\), patients with hospital-acquired SARS-CoV-2 \((y_{i(j-1)})\) and HCWs with SARS-CoV-2 \((y'_{i(j-1)})\); (2) characteristics of the trusts that were considered, a priori, to be plausibly linked to hospital transmission: bed occupancy, provision of single rooms, age of hospital buildings, heated hospital building air volume per bed and size (number of acute care beds); (3) regional data including vaccine coverage among HCWs and the proportion of isolates represented by the Alpha variant. Models were formulated to reflect presumed mechanisms generating the data, and we used negative binomial models with identity link functions, allowing the number of exposures to different categories of SARS-CoV-2 infections to contribute additively to the predicted number of weekly detected infections, while allowing for multiplicative effects of the other terms.
In models for which the dependent variable represented hospital-acquired infections in patients, the HCW vaccination effect was assumed to act only through a multiplicative term affecting transmission related to exposures to HCWs. By contrast, when the dependent variable represented infections in HCWs, vaccine exposure was allowed to have a multiplicative effect on the overall expected number of infections. Formally, we define the full model for infections in patients in trust *i* and week *j* (which we refer to as model P1.1.1) as:

$$y_{ij}\sim \mathrm{neg\,bin}(\mu_{ij},\varphi_{ij}),$$

where \(\mu_{ij}\) represents the mean and the variance is given by \(\mu_{ij}+\mu_{ij}^2/\varphi_{ij}\).

In the full model, \(\mu_{ij}=\left(a_i+b\,y_{i(j-1)}+c_{ij}\,y'_{i(j-1)}+d\,z_{i(j-1)}\right)m_{ij}\,n_{ij}\), with

$$\begin{array}{l}
m_{ij}=\exp(q\times \mathrm{single\ rooms}_i+r\times \mathrm{trust\ size}_i+s\times \mathrm{occupancy}_{i(j-1)}\\
\qquad\quad+\,t\times \mathrm{trust\ age}_{ij}+u\times \mathrm{trust\ volume\ per\ bed}_{ij})\\
n_{ij}=\exp(w\times \mathrm{proportion\ Alpha\ variant}_{ij})\\
c_{ij}=c\times \exp(v\times \mathrm{HCW\ vax}_{i(j-1)})\\
\varphi_{ij}=\varphi_0+k_i\,y_{i(j-1)}\\
a_i\sim N(a_0,\sigma_a^2)\\
k_i\sim N(k_0,\sigma_k^2).
\end{array}$$

The expression for the dispersion parameter of the negative binomial distribution, \(\varphi_{ij}\), reflects the fact that the sum of *n* independent negative binomially distributed random variables with mean *μ* and dispersion parameter *φ* will itself have a negative binomial distribution with mean *nμ* and dispersion parameter *nφ*. Thus, in the idealized case that each of *n* nosocomially infected patients in 1 week has a fully observed negative binomially distributed offspring distribution the next week with mean *μ* and dispersion parameter *φ*, then the total number of nosocomial infections observed would have a negative binomial distribution with parameters *nμ* and *nφ*. The term \(a_i\) represents a trust-level random effect to account for within-trust dependency. We also considered two nested models, P1.1.0 and P1.0.0, obtained by setting the terms *q*, *r*, *s*, *t* and *u* to zero in both cases (that is, removing the trust-level terms) and by additionally setting the terms *v* and *w* to zero in the latter case (that is, removing regional vaccine- and variant-related terms). As a further sensitivity analysis, we also considered a model that allowed for time-varying changes in the number of hospital-acquired infections not accounted for by the covariates, by setting

$$\mu_{ij}=\left(1+s(j)\right)\left(a_i+b\,y_{i(j-1)}+c_{ij}\,y'_{i(j-1)}+d\,z_{i(j-1)}\right)m_{ij}\,n_{ij}$$

where \(s(j)\) is a degree 3 spline with 6 equally spaced knots. We refer to this model as P1.1.1.tv. Similar models were used when the dependent variable was HCW infections, except that the HCW vaccine effect was included in the multiplicative term \(m_{ij}\) instead of operating only through the \(c_{ij}\) term.
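To make the structure of the mean concrete, the sketch below evaluates the expected weekly count and its variance for the patient model from given parameter values. It is a hedged illustration, not the fitted Stan model: the function names are hypothetical, the multiplicative terms are passed in precomputed, and all numerical values are made up.

```python
import math

def expected_infections(a_i, b, c, d, v, y_prev, y_hcw_prev, z_prev,
                        hcw_vax_prev, m_ij=1.0, n_ij=1.0, spline=0.0):
    """Mean weekly count: additive exposure terms scaled by the
    multiplicative trust-level (m_ij) and regional (n_ij) terms."""
    c_ij = c * math.exp(v * hcw_vax_prev)  # HCW vaccination acts via c_ij
    additive = a_i + b * y_prev + c_ij * y_hcw_prev + d * z_prev
    return (1.0 + spline) * additive * m_ij * n_ij

def dispersion(phi_0, k_i, y_prev):
    """Dispersion grows linearly with last week's hospital-acquired cases."""
    return phi_0 + k_i * y_prev

def neg_bin_variance(mu, phi):
    """Variance of a negative binomial with mean mu and dispersion phi."""
    return mu + mu ** 2 / phi

# Hypothetical parameter values for illustration only.
mu = expected_infections(a_i=0.5, b=0.2, c=0.1, d=0.01, v=0.0,
                         y_prev=10, y_hcw_prev=5, z_prev=100, hcw_vax_prev=0.3)
variance = neg_bin_variance(mu, dispersion(phi_0=1.0, k_i=0.1, y_prev=10))
```

The additivity property quoted in the text can be checked with the same helper: *n* times the variance at (*μ*, *φ*) equals the variance at (*nμ*, *nφ*).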

We used normal(0,1) prior distributions by default for model parameters, except for variance terms \(\sigma_a^2\) and \(\sigma_k^2\) for which we used half-Cauchy(0,1) prior distributions, and *φ* for which a half-normal(0,1) prior distribution was specified for the transformed parameter \(1/\sqrt{\varphi_0}\). All analyses were performed in Stan^{34} using the rstan package v.2.21.1 in R (ref. ^{35}), running each model with four chains using 1,000 iterations for warm-up and 5,000 iterations for sampling.

In the main analysis, we used weekly aggregated data, counting week numbers as 1 plus the number of complete 7 d periods since 1 January 2020. We included only acute hospital trusts in this analysis, and excluded trusts that predominantly admitted children.
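The week-numbering convention can be written as a short helper. The function names are hypothetical; the rule itself (1 plus the number of complete 7 d periods since 1 January 2020) is taken from the text.

```python
from datetime import date, timedelta

EPOCH = date(2020, 1, 1)

def week_number(day):
    """1 plus the number of complete 7-day periods since 1 January 2020."""
    return 1 + (day - EPOCH).days // 7

def week_start(week):
    """First calendar day belonging to the given week number."""
    return EPOCH + timedelta(days=7 * (week - 1))

first_week = week_number(date(2020, 1, 1))   # 1
autumn_week = week_number(date(2020, 10, 14))  # 42
```

Under this convention weeks start on the same weekday as 1 January 2020 (a Wednesday) rather than on Mondays.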

#### Imputation method for weekly number of infections in HCWs

Situation reports included fields allowing quantification of nosocomial transmission and number of HCWs isolated due to COVID-19 from 5 June 2020, but analysis here is restricted to data from week 42 (beginning 14 October 2020) to week 55 (beginning 13 January 2021), reflecting the date range for which all fields used in the analysis were consistently reported. Because situation reports did not explicitly include data on the number of infections in HCWs, only the number of HCWs absent due to COVID-19 on each day, we imputed the weekly number of infections among HCWs at each trust. We did this by first subtracting from the number of reported HCW COVID-19 absences in each trust on each day the reported number of such absences due to contact tracing and isolation policies (reflecting likely COVID-19 exposures in the community) to give \(a_t\), the number of HCWs absent on day *t* due to COVID-19 infection potentially arising from occupational exposure. Then, assuming that each HCW with COVID-19 was isolated for 10 d and assuming that durations of these absences were initially uniformly distributed (starting from week 36), the number imputed to have entered isolation on day *t*, \(x_t\), was taken as \(x_t=a_{t+1}+x_{t-10}-a_t\). For each trust we performed these calculations ten times, sampling the initial duration of staff absences from a multinomial distribution assigning equal probabilities to durations of 1 … 10 d, and then took the average (rounded to the nearest integer) of these samples. In some trusts it was evident that some days with missing HCW isolation data had been coded as zeroes. When such zeroes fell between daily counts in excess of ten we treated them as missing data and replaced them with the last number carried forward. Any negative numbers for daily imputed HCW infections resulting from the above procedure were replaced with zeroes.
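The recursion for the number entering isolation each day can be sketched as below. This is an illustration under assumptions that the text does not spell out: staff absent on the first day are assigned entry days uniformly over the preceding 10 d, negative values are set to zero only after averaging, and Python's built-in `round` is used for rounding. The function names and the example series are hypothetical, and the handling of zero-coded missing days is not included.

```python
import random

def impute_entries_once(absent, rng, isolation_days=10):
    """One replicate of x_t = a_{t+1} + x_{t-10} - a_t for a daily series
    `absent` of HCWs off work due to infection."""
    # Entries before the series starts: spread the staff absent on day 0
    # uniformly over the previous `isolation_days` days.
    x = {s: 0 for s in range(-isolation_days, 0)}
    for _ in range(absent[0]):
        x[-rng.randint(1, isolation_days)] += 1
    for t in range(len(absent) - 1):
        x[t] = absent[t + 1] + x[t - isolation_days] - absent[t]
    return [x[t] for t in range(len(absent) - 1)]

def impute_entries(absent, replicates=10, seed=0):
    """Average over replicates, round, and replace negatives with zero."""
    rng = random.Random(seed)
    runs = [impute_entries_once(absent, rng) for _ in range(replicates)]
    means = [round(sum(day) / replicates) for day in zip(*runs)]
    return [max(0, value) for value in means]

# Hypothetical series: 2 staff enter isolation on day 0 and 1 on day 1.
absent = [0, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 0, 0]
entries = impute_entries(absent)
```

Daily entries would then be summed within each week to give the weekly imputed count.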

Although data on healthcare-associated infections in patients were recorded consistently by all trusts throughout the inclusion period, in some trusts data on HCW absences due to COVID-19 were missing or had been recorded inconsistently throughout the inclusion period. Excluding such trusts and those with missing data for independent variables left 96 of the original 145 trusts included in the analysis.

### Negative control outcomes

As a negative control outcome, we used the number of patients admitted with community-acquired SARS-CoV-2 infection. We performed three analyses in which we adopted this negative control as our dependent variable, corresponding to models P1.1.1, P1.1.0 and P1.0.0 as defined above.

### Hospital–community interaction model

We modelled hospital–community interaction using ordinary differential equations for an expanded susceptible/exposed/infectious/removed model (Extended Data Fig. 9). This model included separate compartments for people in the community (*S*_{C}, *E*1_{C}, *E*2_{C}, *I*1_{C}, *I*2_{C}, *I*′_{C}, *R*_{C}), patients in hospital (*S*_{H}, *E*1_{H}, *E*2_{H}, *I*1_{H}, *I*2_{H}, *I*′_{H}, *R*_{H}) and HCWs (*S*_{HCW}, *E*1_{HCW}, *E*2_{HCW}, *I*1_{HCW}, *I*2_{HCW}, *I*′_{HCW}, *R*_{HCW}), in which the two exposed compartments (*E*1 and *E*2) and the two infectious compartments (*I*1 and *I*2) for each subpopulation correspond to assumptions of an Erlang-distributed latent and infectious period with shape parameter 2, whereas the *I*′ compartments represent people with severe disease potentially requiring hospitalization. The model allowed for patient–patient, HCW–HCW, HCW–patient and community–HCW transmission, as well as movements of people between the community and hospital. In the interest of simplicity, we neglect hospitalization of HCWs, who account for about 1% of the total population.
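The role of the paired *E*1/*E*2 and *I*1/*I*2 compartments can be illustrated with a deliberately simplified sketch for a single, closed population. It is not the model used in the study: it omits the *I*′ compartments, the hospital and HCW subpopulations and all movement between them, it uses a basic forward Euler step, and the rate values are hypothetical. Splitting each stage in two and doubling the exit rate keeps the mean stage duration at 1/σ or 1/γ while giving it an Erlang distribution with shape 2.

```python
def seir_erlang_step(state, beta, sigma, gamma, dt):
    """One Euler step of an SEIR model with two exposed and two infectious
    stages (Erlang shape 2) in a single closed population."""
    s, e1, e2, i1, i2, r = state
    n = s + e1 + e2 + i1 + i2 + r
    infection = beta * s * (i1 + i2) / n
    derivatives = (
        -infection,                         # S
        infection - 2 * sigma * e1,         # E1
        2 * sigma * e1 - 2 * sigma * e2,    # E2
        2 * sigma * e2 - 2 * gamma * i1,    # I1
        2 * gamma * i1 - 2 * gamma * i2,    # I2
        2 * gamma * i2,                     # R
    )
    return tuple(x + dt * dx for x, dx in zip(state, derivatives))

def simulate(state, beta, sigma, gamma, days, dt=0.1):
    """Integrate forward for `days` days and return the final state."""
    for _ in range(int(round(days / dt))):
        state = seir_erlang_step(state, beta, sigma, gamma, dt)
    return state

# Hypothetical values: 3-day mean latent and 4-day mean infectious period.
initial = (9990.0, 0.0, 0.0, 10.0, 0.0, 0.0)
final = simulate(initial, beta=0.5, sigma=1 / 3, gamma=1 / 4, days=100)
```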

We used the model to explore the impact of hospital transmission on overall epidemic dynamics with the aim of providing qualitative insights. We considered outcomes from high, intermediate and low hospital transmission scenarios in which the primary epidemic control measure was restricting rates of contact in the community (‘lockdowns’). This community control measure was assumed to have no direct impact on contact rates within hospitals as hospital infection control measures were in force throughout the study period irrespective of efforts aiming to limit community transmission. Full model details are provided in Supplementary Information section 1.2 and Supplementary Tables 1 and 2.

### Ethics approval

The study did not involve the collection of new patient data, or use any personal identifiable information, but used a combination of anonymized national aggregate data sources including C19SR01–COVID-19 Daily NHS Provider SitRep, and regionally aggregated vaccine coverage data from the SARS-CoV-2 immunity and reinfection evaluation (SIREN) study for which the study protocol was approved by the Berkshire Research Ethics Committee on 22 May 2020 with the vaccine amendment approved on 23 December 2020.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
