## Abstract

We analyse time series from 100 patients with bipolar disorder for correlates of depression symptoms. As the sampling interval is non-uniform, we quantify the extent of missing and irregular data using new measures of *compliance* and *continuity*. We find that uniformity of response is negatively correlated with the standard deviation of sleep ratings (*ρ* = –0.26, *p* = 0.01). To investigate the correlation structure of the time series themselves, we apply the Edelson–Krolik method for correlation estimation. We examine the correlation between depression symptoms for a subset of patients and find that self-reported measures of *sleep* and *appetite/weight* show a lower average correlation than other symptoms. Using surrogate time series as a reference dataset, we find no evidence that depression is correlated between patients, though we note a possible loss of information from sparse sampling.

## 1. Introduction

Health telemonitoring can benefit both patients and healthcare providers. A systematic review by Polisena *et al.* [1] found that home telehealth saved costs in 20 out of 22 studies, though it did note the poor quality of most of the economic evaluations. Another review by Paré *et al.* [2] examined 65 empirical studies of telemonitoring over four types of chronic illnesses: pulmonary conditions, diabetes, hypertension and cardiovascular diseases. They drew no conclusion about economic viability, but only because this was the subject of few studies, most of which had no detailed analysis. However, they suggested that telemonitoring might have a positive effect on the patients' condition and that this would be a promising avenue for research. A more recent *BMJ* review [3] found evidence of fewer hospital admissions and lower mortality among patients allocated to receive telehealth interventions, though again there was no evidence of cost savings. However, there are other benefits from both the patient's and clinician's point of view. The patients are monitored in their own environment, avoiding ‘white coat syndrome’, and they may have the freedom to manage their own reporting.

Most obvious from the researcher's point of view is the automated acquisition of data for analysis, sampled more often than an outpatient appointment would allow. Here, though, the freedom afforded to the patient has a potential disadvantage for time-series analysis. If data can be returned at any time, then the analyst cannot assume a regular reporting interval. As most time-series methods require uniform sampling, a common approach is to interpolate the data as a preprocessing step. In this study, we apply methods that may be used directly on non-uniform data and introduce two new measures for quantifying non-uniformity. The structure of the paper is as follows. In §2, we introduce time-series analysis and the Edelson–Krolik method for estimating correlation. In §3, we describe measures for quantifying non-uniformity in time series, and in §4 show their application to telemonitored data. In §5 we describe several different applications of the Edelson–Krolik correlation and correlation between time series using surrogate data. Finally, §6 summarizes the findings of this study.

## 2. Time series

Time-series analysis involves the description, explanation and prediction of observations taken sequentially in time [4]. Description implies the use of numerical and graphical descriptive statistics such as time plots and the correlogram. Correlograms can reveal *seasonality*, which is the tendency to repeat a pattern of a certain periodicity, such as a yearly cycle, and *trend*, or long-term variation up or down. Whereas description provides information about a given time series, inference induces a general form based on a finite number of observations. An example is time-series regression, which attempts to model an underlying relationship between dependent variables and time. Regression is often applied in the context of time-series prediction because of its many practical applications. Linear approaches are popular because they are readily interpretable and convenient [5]. Stationary, linear time-invariant Gaussian systems introduce several symmetries that have many conveniences, including statistical stability, sufficiency of first- and second-order moments, and convex and analytic inference procedures [6]. Nonlinear models can represent regime switching behaviour, and parsimonious nonlinear models have been shown to outperform linear methods in economic forecasting [7].

### (a) Correlation estimation

The autocorrelation function is an important measure of serial dependence in a time series and is defined for a stationary random process *Y*(*t*) as
2.1where *s* is the time lag and *γ*(*s*) is the *autocovariance function*, defined as the covariance between *Y*(*t*) and *Y*(*t* – *s*). An informative way of representing the serial dependence in a time series is by a graph of autocorrelation coefficients *ρ*(*k*) against the integer lag *k*. This sequence represents a sample autocorrelation function and is called the *correlogram* [8]. As natural time series often have missing or irregular data, it is often the applied sciences that have derived methods for their analysis. In astrophysics, Edelson & Krolik [9] derived the discrete correlation function (DCF) for correlation estimation in non-uniform time series. It is defined for two discrete, centred time series *a _{i}* and

*b*, first as a set of unbinned discrete correlation values 2.2for a measured pair of observations (

_{j}*a*,

_{i}*b*) whose time difference is

_{j}*Δ*

*t*. Here,

_{ij}*a*and

_{i}*b*are a concise notation for

_{j}*a*(

*t*) and

_{i}*b*(

*t*), respectively,

_{j}*and*

*σ*_{a}*are the respective standard deviations, and*

*σ*_{b}*e*and

_{a}*e*are estimates of the measurement noise in each time series. The DCF is derived by averaging the set of

_{b}*M*unbinned values 2.3where

*is the bin centre and*

*τ**Δ*

*is the bin width. The standard error is given by 2.4recalling that UDCF*

*τ**is a set and DCF(*

_{ij}*) is a scalar for given*

*τ**. The summation is over as before and the normalizing constant*

*τ**M*″ = ((

*M*– 1)(

*M*′ – 1))

^{2}with

*M*′ the number of unique measurement times for the series

*a*.

_{i}The Edelson–Krolik method is closely related to the *variogram*, an approach that is well known in geostatistics, where it is used to model spatial correlations [10]. It was until recently rarely mentioned in texts on time series or in the statistical literature as a whole [11], with the exception of Chatfield [4] and Diggle [8], who defines the variogram as
2.5
2.6where terms are defined as before. A plot of the quantities for all delays *k _{ij}* =

*t*−

_{i}*t*is called the sample variogram. As with the DCF, random scatter in the plot may arise from small sample sizes used in calculating

_{j}*v*. This scatter can be reduced by averaging

_{ij}*v*over binned time values to give .

_{ij}The binned variogram and DCF are examples of a *slotting* approach that uses a rectangular kernel to bin pairs of observations. They belong to one of four categories identified by Broerson *et al.* [12] for handling non-uniform data. The other categories are direct transform approaches, such as the Lomb–Scargle (LS) periodogram [13], model-based estimators (which presuppose a knowledge of the time-series dynamics) and resampling through interpolation. The LS approach, kernel methods (though not slotting) and linear interpolation are compared by Rehfeld *et al*. [14]. As the data analysed in this study have high relative noise and large gaps in the time indexes, we apply the Edelson–Krolik slotting approach. It provides a sample correlogram directly and avoids the assumptions necessary for interpolation or model-based estimators.

## 3. Measures of non-uniformity

We next introduce two measures for quantifying missing and non-uniform responses in time series. The first, which we call *compliance,* measures the proportion of real observations in a time series that contains imputed values. The second measure, called *continuity*, quantifies the sampling regularity among those real observations. Both measures are easily derived from a uniformly sampled series with missing data, but here we start from an irregular series and assume that a response is valid for an interval rather than a single point in time. This condition would apply, for example, to the answer from a questionnaire where the relevant interval is the week prior to the response. We begin by considering the process of resampling the time series into a homogenized equivalent with uniform intervals.

### (a) Compliance

Figure 1 illustrates the resampling process assuming that sampling is approximately once per week and that responses are valid for the previous week. The optimal weekday *w* for the resampled time series is chosen to minimize the total deviation of the original responses from their corresponding resampled position on the *x*-axis or ‘comb’ of weekdays. The deviation in this case is the elapsed time to the first response within 7 days.

The comb is then populated from the original series as follows. Starting from weekday *w* at the start, or the last instance of *w* before the start, of the time series, we record any response within 7 days. We repeat the search from weekday *w* in the following week and continue until the last response of the time series is reached. If no response is found within 7 days, a missing value is imputed by random selection from the previous four responses. The imputed value itself is chosen for the purposes of illustration and does not affect the non-uniformity measures.

Figure 2 shows the effect of resampling on two example series. Most of the original responses are not shifted, while some are moved to an earlier time point and where this cannot be accomplished, an imputation is made.

We define compliance as the proportion of non-imputed values in the resampled time series. Imputations occur when a response is later than the sample period * τ* which in this application is equal to 7 days. Formally,
3.1where

*C*is compliance,

_{m}*is the uniform sample period and*

*τ**t*is the

_{i}*i*th element of the time vector for the original series, which has

*N*′ points.

*N*is the number of points in the resampled series and is equal to the number of weeks spanned by the original time series, allowing for the period of validity. The function

*is equal to 0 if its argument is 0, otherwise it is equal to 1, and the indicator operator*

*Θ***1**has value 1 for a boolean argument of

`true`and 0 for

`false`. The value of

*C*lies between 0 and 1.

_{m}As long as the original series covers all the new sample time points, there will be no imputations and the compliance is 100%. For example, if responses are returned more often than every week, a uniform series may be derived by discarding some responses and without loss of compliance. A non-uniform series may also exhibit full compliance as long as no response is more than six (more generally, * τ* – 1) days late. However, longer gaps result in an imputed value being added to the uniform series and compliance being reduced. The measure thus penalizes missing data but not additions or late returns.

### (b) Continuity

A low compliance implies that there is a large proportion of imputed points in the resampled series but gives no information about their distribution throughout the observed responses. A second measure, which we call *continuity*, measures the connectedness of non-imputed responses in the resampled time series. To develop the measure, we examine the sequence of points in the resampled series and label them with a state indicator of `P` for imputed and `R` for not imputed. The number of sequential state changes `R` → `P` is a count of the discontinuity, and we use the ratio of this count to *N _{r}* – 1, where

*N*is the number of

_{r}`R`states. A simple example is the sequence

`RRR`

`PPP`

`R PPP R`. Here, there are two sequential changes of state from

`R`to

`P`out of a total of five

`R`states, giving a continuity of 2/4. The sequence

`RRRRR`then has a continuity of 1, and the sequence

`RPRPR`has a continuity of 0. In general, we then have 3.2where

*C*is continuity,

_{t}*N*is the length of the resampled series and

*w*∈ {

_{k}`R`,

`P`} is the state of the

*k*th data point. The minimum possible continuity occurs when the

`P`states are distributed throughout the time series. In this case, 3.3 3.4for , where

*N*is the number of

_{p}`P`states. It can be identified from (3.4) that as the compliance approaches 1, the minimum possible continuity approaches the compliance.

So compliance is the proportion of non-imputed responses and continuity is the proportion of correct intervals among them. Continuity summarizes the interval distribution using the probability density located only at the desired interval. The location of the remaining mass, corresponding to the distribution shape, does not influence its value.

This approach gives an advantage over standard dispersion measures (of either the raw or the homogenized series) because all intervals longer than the sampling period are classed together. Long gaps in the time series, when the patient fails to respond for a period, do not greatly influence the continuity value, although they are reflected in the compliance. The property is also relevant to the autocorrelation calculation because time series with high continuity can be treated as uniform for this purpose. Both compliance and continuity can be useful in both selection of near-uniform series for the application of standard methods and for exploring non-uniformity as an informative property in itself.

## 4. Application of measures

We apply the measures to time series from 153 patients with bipolar disorder who were monitored between 2006 and 2011. Data were collected as part of the OXTEXT programme funded by the National Institute for Health Research, which investigates the potential benefits of self monitoring of mood for people with bipolar disorder. The sub-sample of participants in this study was selected from the OXTEXT cohort and includes those patients who had used mood monitoring prior to recruitment into OXTEXT and who had given consent for the use of anonymized retrospective data for exploratory time-series analysis.

The mood data are returned approximately each week and comprise answers to standard self-rating questionnaires for both depression and mania. The rating scale used for depression is the Quick Inventory of Depressive Symptomatology-Self Report (QIDS-SR_{16}) [15], which has 16 questions covering nine symptom domains for depression according to the *Diagnostic and Statistical Manual of Mental Disorders*, 4th edn [16]. This self-rated instrument has highly acceptable psychometric properties, including high validity [17]. Each domain can contribute up to three points, giving a total possible score of 27 on the scale. The severity of mania is quantified using the Altman Self-Rating Mania Scale [18], which has five questions, each of which can contribute up to four points, giving a total possible score of 20.

### (a) Data selection

The initial set of 153 patients is first cleaned by removing repeated response values (i.e. those which share the same time stamp). These repeats arise when a patient resubmits a rating score either by mistake or in order to correct an earlier response. Assuming that earlier values are being corrected, we remove repeated responses by taking the most recent in the sequence. We then create set A (*n* = 93) with members whose time series have at least 25 data points, or approximately six months duration. Figure 3 illustrates the data selection process.

Two further subsets are then created from set A, one having equal numbers of male and female patients, and a second with equal numbers of Bipolar I (BPI) and Bipolar II (BPII) diagnoses. The first subset is labelled as set G (*n* = 40) and contains patients of whom all have a diagnosis of BPI disorder. It is created by selecting all the patients with BPI from set A and removing the female patient with the shortest time-series length. The second subset, labelled set D (*n* = 32), has equal numbers of patients diagnosed with BPI and BPII disorder, all of whom are female. Set D is created by retaining the 16 female BPII patients from set A and selecting 16 BPI female patients to match for time-series length. The selection algorithm attempts to match the length for each individual patient by progressively widening the search range until a suitable match is found. Descriptive statistics of the subsets are given in the electronic supplementary material, §I.

### (b) Non-uniformity

Using the subset of data labelled set A, we derive the compliance and continuity measures for each patient. A scatter plot is shown in figure 4. From (3.4), we see that the minimum continuity tends towards the compliance as the compliance approaches 1. For lower compliance, where there is a higher proportion of imputations, the continuity may be lower.

For the next analysis, we assume that any text message latency is small in comparison with the patient's delay in responding to a prompt from the monitoring system. We do not know when the prompt message is received by the patient, so we cannot distinguish total network latency from the patient's response delay. However, as the prompt messages are dispatched at weekly intervals, we can judge the scale of the overall delays by examining the time between prompt and receipt. The analysis is provided in electronic supplementary material, §5 and shows that most patients have a mean delay of half a day or more. This result is expected because the questionnaire relates to a weekly period rather than an instant in time: patients do not have to reply to the prompt immediately. However, the network delay remains unknown and a quantitative study of the monitoring infrastructure would be valuable in determining the scale and nature of network latency.

### (c) Demographic and mood data

We examine the correlation between continuity and both demographic and mood data over the set of patients using set G (*n* = 40), which has equal numbers of male and female patients, and set D (*n* = 32), with equal numbers of BPI and BPII diagnoses. No pattern emerges in either case, and a two-sample Kolmogorov–Smirnov test does not distinguish the distribution of male versus female or BPI versus BPII non-uniformity measures. Further details can be found in the electronic supplementary material, §IV.

Next, we look for correlates of non-uniformity with mood. There are nine variables for depression corresponding to symptoms of sleep, appetite, etc., and five variables for mania, which we summarize for each patient by mean, standard deviation and mean absolute difference. We take the rank correlation for each symptom with continuity over the set of 93 patients in set A. The results are shown in table 1. No correlations were found between mean symptom levels and continuity. For the dispersion statistics, only sleep in the depression questionnaire was found to have a correlation significant at the 1% level.

Variability of sleep correlates negatively with continuity when measured by standard deviation (*ρ* = –0.26, *p* = 0.01) and mean absolute difference between sequential values (*ρ* = –0.25, *p* = 0.02). A similar result was found when using compliance as the non-uniformity measure. The scatter plots for both statistics are shown in figure 5.

We note that there will be a sampling distribution for both the mean and variability measures arising from the limited sample sizes, which would manifest in figure 5 as a range for each point.

For some symptoms, any correlation with non-uniformity might be hidden by this effect. However, as the same sampling limits apply to all symptoms we can distinguish sleep variability as having a relatively strong association with non-uniformity of response.

The relationship of non-uniformity of response with sleep variability is an important finding from this analysis. The association is also interesting if response uniformity is taken as an indicator of general functioning. We would expect that delays in responding are caused by holidays, work commitments, physical illness, forgetting to reply, a low priority for replying or chaotic behaviour. Psychological factors may have an influence, and several of the symptoms explicitly measured on the QIDS scale are relevant, in particular severe lassitude or lack of energy, lack of interest, poor concentration and thoughts of death/suicide. As pointed out, it is quite possible that correlations with these variables exist, but that they are below the noise threshold. The relatively stronger effect of sleep points to a number of possibilities. First, a strong association between sleep and mental illness is well established, if not well understood [19]. So one possibility is that sleep is simply the strongest indicator of an underlying disorder, which causes irregularity through the behavioural issues listed above. The causation might be more direct; for example, sleep causing problems with memory or other functioning, leading to lost or delayed ratings. However, it is a high variability of sleep ratings rather than a high mean rating that predicts non-uniformity of response. It may be that there is some adaptation to poor sleep, whereas inconsistent sleep leads to inconsistent behaviour. The data are too noisy and do not provide a strong enough effect to distinguish these scenarios.

## 5. Application of methods

We now apply the Edelson–Krolik method to calculate autocorrelation and correlation using the time series for depression. We first examine evidence of seasonality from the correlogram for individual patients, then look at the correlation between symptoms of depression, and finally apply a surrogate data method to detect correlations among the set of time series themselves.

### (a) Seasonality

We examine the autocorrelation function of the depression time series using the Edelson–Krolik method to determine the autocorrelation at successive lags. Four examples of correlograms are shown in figure 6, in comparison with a standard correlogram (lighter line) that has not been adjusted for non-uniform response times. The third plot from the top shows a yearly seasonality for both the Edelson–Krolik method and the unadjusted correlogram, with the latter having a peak correlation at less than 50 weeks and less seasonal variation.

Figure 7 is the LS periodogram corresponding to this time plot. It shows a peak of spectral power at 370 days, indicating a yearly seasonality. The depression time series do not in general show clear evidence of yearly periodicity, though some have a peak at or near this period. Most exhibit a rapid decrease in correlation with lag and some show evidence of a trend, indicated by the correlogram not tending to zero as the lag increases.

### (b) Correlation between depression symptoms

The correlation between depression symptoms is examined for patients who have at least 100 data points in their homogenized time series. The first 100 responses are taken, the imputed values removed and the means subtracted from the individual domain scores. Correlation between domains is then calculated using the Edelson–Krolik method (2.3) and the scores averaged over the set of patients. In order to provide a comparison between symptoms, only those patients with non-zero symptom series and positive correlations are selected. There are six patients showing some pairs of negative correlations, but these did not show any common relationship. The subset of patients fulfilling these criteria is denoted set E and its statistical properties are summarized in the electronic supplementary material, §2, with further details about the selection. The selection of set E is illustrated in figure 8.

A heat map showing the relationship between symptom domains is shown in figure 9. On average, the symptom domains *sleep* and *appetite/weight* correlate less than other domains. By contrast, *feeling sad* correlates strongly with other domains, while *slowed down/restless* shows less correlation with others.

An analysis of the autocorrelation structure for symptom time series explains why the symptoms of *sleep* and *appetite/weight* tend to correlate less when paired with other domains. We take the 23 time series used above and find the autocorrelation at using the Edelson–Krolik method on the homogenized time series with imputed points removed. The results are shown in figure 10. The symptoms *sleep* and *appetite/weight* have a lower autocorrelation than the other symptoms, which explains their relatively low pairwise correlation in figure 9. Although *sleep* and *appetite/weight* have a similar first-order autocorrelation, figure 9 shows that they do not themselves correlate as a pair, the reason being that their autocorrelation structure is somewhat different: the autocorrelation for *sleep* remains higher than *appetite/weight* as the lag increases. Autocorrelation coefficients up to a lag of four are shown in the electronic supplementary material, §III.

We note that these two symptoms are the most amenable to objective measurement out of the nine symptoms in the QIDS rating scale, and that *slowed down/restless*, which might also fall into this category, also correlates less than the others. It may be that the other symptoms (*feeling sad*, *concentration*, *self-view*, *thoughts of death/suicide*, *interest* and *energy level*) have a common factor that influences them more than it does the other three symptoms. This finding is similar to that of Rush *et al*. [20], who identified three factors in the IDS instrument: cognitive/mood, anxiety/arousal and sleep (or sleep/appetite for the self-rated instrument).

### (c) Time-series correlation

In this section, we look for similar mood changes in patients by examining pairwise correlations between their time series of depression ratings. We take a set of 28 patients who have complete depression series during the years 2009 and 2010, which we denote as set F.

The selection process is illustrated in figure 11 and descriptive statistics are given in the electronic supplementary material, §II. We create a reference set of surrogate time series by shuffling the time order of existing series while maintaining their mean, variance and autocorrelation function. The algorithm used for this process is described by Kantz & Schreiber [21] and is implemented using the *TISEAN* function `surrogates` [22]. The distribution of the pairwise correlations for both the original and surrogate datasets is shown in figure 12.

The correlations between time series for original and surrogate datasets appear to have the same distribution, and a two-sample Kolmogorov–Smirnov test returns a value of *p* = 0.53. Although external factors do not appear to have a strong influence on depression over the set of patients, this does not preclude the possibility that there may be strong environmental effects in individual cases.

## 6. Conclusion

We have addressed the problem of describing and modelling time series with missing or irregularly spaced values. Two new measures for quantifying missing and non-uniform data were introduced and applied to a database of telemonitored mood data. The quantification of non-uniformity can be useful in (i) investigation of non-uniformity as a correlate of other variables; (ii) selecting subsets of data where uniformity is a requirement; and (iii) use as supplementary information for a clinician. We found that time-series uniformity does not correlate with either gender or diagnostic subtype. However, variability of sleep correlates with continuity. This finding has implications for selecting time series according to their uniformity as it may exclude patients with more variable sleep ratings.

The Edelson–Krolik method uses relative distances rather than fixed lags to determine time-series correlation, and so it is robust to non-uniform sampling intervals. We used the method to generate correlograms of depression ratings and showed that one patient exhibited mood with yearly seasonality. Most patients do not show evidence of seasonality, but rather a short-term autocorrelation structure.

We examined correlations between depression symptoms and found that *sleep* and *appetite/weight* show a lower average correlation than other symptoms. We found evidence that the autocorrelation structure for these domains is different from that of the others. Finally, we examined correlations between patients' depression time series but found no evidence of correlation in general. We note that for some patients, the weekly sampling will be below the Nyquist frequency for depression, so information will be lost. A study identifying the range of frequencies in depression in bipolar disorder would therefore help in choosing an optimal sample rate, consistent with practical considerations.

## Data accessibility

Data are available by request to john.geddes{at}psych.ox.ac.uk.

## Funding statement

This article presents independent research funded by the NIHR under its Grants for Applied Research Programme (grant reference no. RP-PG-0108-10087).

## Acknowledgements

The authors would like to acknowledge financial support from the John Fell Fund, Oxford University. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

- Received September 6, 2013.
- Accepted November 22, 2013.

© 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.