Title: | Estimating Hidden Population Size using Respondent Driven Sampling Data |
---|---|
Description: | Estimate the size of a networked population based on respondent-driven sampling data. The package is part of the "RDS Analyst" suite of packages for the analysis of respondent-driven sampling data. See Handcock, Gile and Mar (2014) <doi:10.1214/14-EJS923>, Handcock, Gile and Mar (2015) <doi:10.1111/biom.12255>, Kim and Handcock (2021) <doi:10.1093/jssam/smz055>, and McLaughlin et al. (2023) <doi:10.1214/23-AOAS1807>. |
Authors: | Mark S. Handcock [aut, cre, cph], Krista J. Gile [aut, cph], Brian Kim [ctb], Katherine R. McLaughlin [ctb] |
Maintainer: | Mark S. Handcock <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.1.0-2 |
Built: | 2024-11-06 03:57:28 UTC |
Source: | https://github.com/cran/sspse |
dsizeprior
computes the prior distribution of the population
size of a hidden population. The prior is intended to be used in Bayesian
inference for the population size based on data collected by Respondent
Driven Sampling, but can be used with any Bayesian method to estimate
population size.
dsizeprior( n, type = c("beta", "nbinom", "pln", "flat", "continuous", "supplied"), mean.prior.size = NULL, sd.prior.size = NULL, mode.prior.sample.proportion = NULL, median.prior.sample.proportion = NULL, median.prior.size = NULL, mode.prior.size = NULL, quartiles.prior.size = NULL, effective.prior.df = 1, alpha = NULL, beta = NULL, maxN = NULL, log = FALSE, maxbeta = 120, maxNmax = 2e+05, supplied = list(maxN = maxN), verbose = TRUE )
n |
count; the sample size. |
type |
character; the type of parametric distribution to use for the
prior on population size. The options are "beta", "nbinom", "pln", "flat", "continuous" and "supplied". |
mean.prior.size |
scalar; A hyperparameter being the mean of the prior distribution on the population size. |
sd.prior.size |
scalar; A hyperparameter being the standard deviation of the prior distribution on the population size. |
mode.prior.sample.proportion |
scalar; A hyperparameter being the mode
of the prior distribution on the sample proportion |
median.prior.sample.proportion |
scalar; A hyperparameter being the
median of the prior distribution on the sample proportion |
median.prior.size |
scalar; A hyperparameter being the median of the prior distribution on the population size. |
mode.prior.size |
scalar; A hyperparameter being the mode of the prior distribution on the population size. |
quartiles.prior.size |
vector of length 2; A pair of hyperparameters
being the lower and upper quartiles of the prior distribution on the
population size. For example, |
effective.prior.df |
scalar; A hyperparameter being the effective number of samples worth of information represented in the prior distribution on the population size. By default this is 1, but it can be greater (or less!) to allow for different levels of uncertainty. |
alpha |
scalar; A hyperparameter being the first parameter of the Beta prior model for the sample proportion. By default this is NULL, meaning that 1 is chosen. It can be any value of at least 1 to allow for different levels of uncertainty. |
beta |
scalar; A hyperparameter being the second parameter of the Beta prior model for the sample proportion. By default this is NULL, meaning that 1 is chosen. It can be any value of at least 1 to allow for different levels of uncertainty. |
maxN |
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution. |
log |
logical; return the prior (FALSE, the default) or the logarithm of the prior (TRUE). |
maxbeta |
integer; maximum beta in the prior for population size. By default this is determined to ensure numerical stability. |
maxNmax |
integer; maximum possible population size. By default this is determined to ensure numerical stability. |
supplied |
list; If the argument |
verbose |
logical; if this is |
dsizeprior
returns a list consisting of the following
elements:
x |
vector; vector of degrees |
lpriorm |
vector; vector of probabilities
corresponding to the values in |
N |
scalar; a starting value for the population size computed from the prior. |
maxN |
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution. |
mean.prior.size |
scalar; A hyperparameter being the mean of the prior distribution on the population size. |
mode.prior.size |
scalar; A hyperparameter being the mode of the prior distribution on the population size. |
effective.prior.df |
scalar; A hyperparameter being the effective number of samples worth of information represented in the prior distribution on the population size. By default this is 1, but it can be greater (or less!) to allow for different levels of uncertainty. |
mode.prior.sample.proportion |
scalar; A hyperparameter
being the mode of the prior distribution on the sample proportion
|
median.prior.size |
scalar; A hyperparameter being the median of the prior distribution on the population size. |
beta |
scalar; A
hyperparameter being the second parameter of the Beta distribution that is a
component of the prior distribution on the sample proportion |
type |
character; the type of parametric distribution to use for the
prior on population size. The possible values are |
The best way to specify the prior is via the hyperparameter mode.prior.size, which specifies the mode of the prior distribution on the population size. You can alternatively specify the hyperparameter median.prior.size, which specifies the median of the prior distribution on the population size, or mode.prior.sample.proportion, which specifies the mode of the prior distribution on the proportion of the population that is in the sample.
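The link between a Beta prior on the sample proportion and a prior on the population size can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes the sample proportion n/N follows a Beta(alpha, beta) distribution, applies a change of variables to obtain a density on N, and renormalizes over a truncated grid. The helper name beta_size_prior and its maxN default are hypothetical, and the sketch ignores effective.prior.df and the package's own choice of maxN.

```r
# Illustrative only: discretized prior on N induced by n / N ~ Beta(alpha, beta).
beta_size_prior <- function(n, alpha = 1, beta = 1, maxN = 20 * n) {
  N <- seq.int(n + 1, maxN)  # candidate population sizes (N > n)
  # Change of variables p = n / N: density on N is dbeta(n / N) * |dp/dN| = dbeta(n / N) * n / N^2
  dens <- dbeta(n / N, alpha, beta) * n / N^2
  list(N = N, prior = dens / sum(dens))  # renormalize over the truncated grid
}

n <- 100
mode_target <- 1000
# With alpha = 1 the continuous density is proportional to (N - n)^(beta - 1) / N^(beta + 1),
# whose mode is n * (beta + 1) / 2, so the target mode determines beta.
beta_par <- 2 * mode_target / n - 1
sp <- beta_size_prior(n, alpha = 1, beta = beta_par, maxN = 20000)
prior_mode <- sp$N[which.max(sp$prior)]
prior_mode
```

Under these assumptions a prior mode of 1000 with n = 100 corresponds to a second Beta parameter of 19, which shows why a single hyperparameter such as mode.prior.size can be enough to pin down the prior.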
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
network, statnet, degreenet
prior <- dsizeprior(n=100, type="beta", mode.prior.size=1000)
This is a faux data set used to illustrate how the estimators for multiple respondent-driven sampling surveys perform under different populations and RDS schemes.
A list with the first element being an rds.data.frame
of the first survey and the
second element being an rds.data.frame
of the second survey.
The population is based on fauxmadrona
from the RDS
package.
It is a population with N=1000 nodes from which two successive respondent-driven samples are drawn.
For the first survey, the sample size is 200 so
that there is a relatively small sample fraction (20%). There is homophily
on disease status (R=5) and there is differential activity by disease status
whereby the infected nodes have mean degree nearly twice that of the uninfected
(w=1.8).
In the sampling, the seeds are chosen randomly from the full population, so there is no dependency induced by seed selection.
Each sample member is given 2 uniquely identified coupons to distribute to other members of the target population among their acquaintances. Further, each respondent distributes their coupons completely at random among those they are connected to.
For the second sample the sample size is 250. The second survey has an additional variable recapture
indicating if the respondent was also surveyed in the first survey.
Each survey is represented as an rds.data.frame
and they are stored in a list with two elements.
The original network is included in the RDS
package as
fauxmadrona.network
, a network
object.
The RDS
package
also includes a third respondent-driven sample from the network and is referred to as
fauxmadrona
.
Use data(package="sspse")
to get a full list
of datasets.
Gile, Krista J., Handcock, Mark S., 2010 Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. doi:10.1111/j.1467-9531.2010.01223.x.
Kim, Brian J. and Handcock, Mark S. 2021 Population Size Estimation Using Multiple Respondent-Driven Sampling Surveys, Journal of Survey Statistics and Methodology, 9(1):94–120. doi:10.1093/jssam/smz055.
Estimates each person's personal visibility based on their self-reported degree and the number of their (direct) recruits. It uses the time the person was recruited as a factor in determining the number of recruits they produce.
impute.visibility( rds.data, max.coupons = NULL, type.impute = c("median", "distribution", "mode", "mean"), recruit.time = NULL, include.tree = FALSE, reflect.time = FALSE, parallel = 1, parallel.type = "PSOCK", interval = 10, burnin = 5000, mem.optimism.prior = NULL, df.mem.optimism.prior = 5, mem.scale.prior = 2, df.mem.scale.prior = 10, mem.overdispersion = 15, return.posterior.sample.visibilities = FALSE, verbose = FALSE )
rds.data |
An rds.data.frame |
max.coupons |
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). By default it is taken from the attribute or data, else the maximum recorded number of coupons. |
type.impute |
The type of imputation based on the conditional distribution.
It can be of type "median", "distribution", "mode" or "mean". |
recruit.time |
vector; An optional value for the date/time that the person was interviewed. It needs to resolve as a numeric vector with number of elements the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL then the sequence number of the recruit in the data is used. If it is NA then the recruitment time is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person. |
include.tree |
logical; If |
reflect.time |
logical; If |
parallel |
count; the number of parallel processes to run for the Monte-Carlo sample. This uses MPI or PSOCK. The default is 1, that is, no parallel processing is used. |
parallel.type |
The type of parallel processing to use. The options are "PSOCK" or "MPI". This requires the corresponding type to be installed. The default is "PSOCK". |
interval |
count; the number of proposals between sampled statistics. |
burnin |
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number. |
mem.optimism.prior |
scalar; A hyper parameter being the mean of the distribution of the optimism parameter. |
df.mem.optimism.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the optimism parameter. This gives the equivalent sample size that would contain the same amount of information inherent in the prior. |
mem.scale.prior |
scalar; A hyper parameter being the scale of the concentration of baseline negative binomial measurement error model. |
df.mem.scale.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation of the dispersion parameter in the visibility model. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation. |
mem.overdispersion |
scalar; A parameter being the overdispersion of the negative binomial distribution that is the baseline for the measurement error model. |
return.posterior.sample.visibilities |
logical; If TRUE then return a
matrix of dimension |
verbose |
logical; if this is |
McLaughlin, Katherine R.; Johnston, Lisa G.; Jakupi, Xhevat; Gexha-Bunjaku, Dafina; Deva, Edona and Handcock, Mark S. (2023) Modeling the Visibility Distribution for Respondent-Driven Sampling with Application to Population Size Estimation, Annals of Applied Statistics, doi:10.1214/23-AOAS1807
## Not run:
data(fauxmadrona)
# The next line fits the model for the self-reported personal
# network sizes and imputes the personal network sizes.
# It may take up to 60 seconds.
visibility <- impute.visibility(fauxmadrona)
# frequency of estimated personal visibility
table(visibility)
## End(Not run)
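The median, mode and mean options of type.impute can be read as different point summaries of each person's posterior distribution of visibility. The following base R sketch illustrates that idea on a small made-up matrix. It is not the package's code: the matrix post_vis, its orientation (rows are draws, columns are respondents) and the helpers stat_mode and summarise_visibility are all hypothetical, and the "distribution" option is left out because its behavior is not described here.

```r
# Hypothetical posterior sample of visibilities: rows are MCMC draws, columns are respondents.
post_vis <- rbind(c(3, 10, 7),
                  c(4, 12, 7),
                  c(4, 15, 9),
                  c(5, 12, 7),
                  c(9, 12, 8))

# Most frequent value; ties are resolved in favor of the smallest value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Collapse each respondent's draws to a single imputed visibility.
summarise_visibility <- function(draws, type.impute = c("median", "mode", "mean")) {
  type.impute <- match.arg(type.impute)
  f <- switch(type.impute, median = median, mode = stat_mode, mean = mean)
  apply(draws, 2, f)
}

vis_median <- summarise_visibility(post_vis, "median")
vis_mode <- summarise_visibility(post_vis, "mode")
vis_mean <- summarise_visibility(post_vis, "mean")
```

The median and mode stay on the integer scale of the draws in this toy example, whereas the mean generally does not, which is one reason a median summary is a natural default for count-valued visibilities.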
This function plots the posterior predictive p-values for the self-reported network sizes, computed from an estimate of the posterior distribution of the population size based on data collected by Respondent Driven Sampling. The approach approximates the RDS via the Sequential Sampling model of Gile (2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE). It uses the order of selection of the sample to provide information on the distribution of network sizes over the population members.
## S3 method for class 'pospreddeg' plot( x, main = "Posterior Predictive p-values for the self-reported network sizes", nclass = 20, hist = FALSE, ylim = c(0, 2), order.by.recruitment.time = FALSE, ... )
x |
an object of class |
main |
character; title for the plot |
nclass |
count; The number of classes for the histogram plot |
hist |
logical; If |
ylim |
two-vector; lower and upper limits of vertical/density axis. |
order.by.recruitment.time |
logical; If |
... |
further arguments passed to or from other methods. |
It computes the posterior predictive distribution for each reported network size and computes the percentile rank of the reported network size within that posterior. The percentile rank should be about 0.5 for a well specified model, but could be close to uniform if there is little information about the reported network size. The percentile ranks should not be extreme (e.g., close to zero or one) on a consistent basis as this indicates a misspecified model.
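The percentile rank described above can be sketched in base R. This is a minimal illustration under assumptions, not the package's implementation: the matrix pred (rows are posterior predictive draws, columns are respondents), the vector observed and the helper pp_percentile_rank are hypothetical, and the mid-rank convention for ties is a choice made here for discrete data.

```r
# Hypothetical inputs: pred[i, j] is the i-th posterior predictive draw of the
# network size of respondent j; observed[j] is the self-reported network size.
pp_percentile_rank <- function(pred, observed) {
  vapply(seq_along(observed), function(j) {
    # Mid-rank convention for discrete data: P(Y < y) + 0.5 * P(Y = y)
    mean(pred[, j] < observed[j]) + 0.5 * mean(pred[, j] == observed[j])
  }, numeric(1))
}

pred <- cbind(1:10, 1:10, 1:10)  # three respondents, ten draws each
observed <- c(5, 1, 30)          # typical, low, and far above the predictive draws
ranks <- pp_percentile_rank(pred, observed)
ranks
```

In this toy example the third rank equals one because the reported size exceeds every predictive draw. Ranks that sit near zero or one for many respondents are the pattern the text flags as a sign of misspecification.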
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
The model fitting function posteriorsize
,
plot
.
## Not run:
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=10, interval=1, samplesize=50)
summary(fit)
# Let's look at some MCMC diagnostics
plot(pospreddeg(fit))
## End(Not run)
This is the plot
method for class "sspse"
. Objects of
this class encapsulate the
estimate of the posterior distribution of the
population size based on data collected by Respondent Driven Sampling. The
approach approximates the RDS via the Sequential Sampling model of Gile
(2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE).
It uses the order of selection of the sample to provide information
on the distribution of network sizes over the population members.
## S3 method for class 'sspse' plot( x, xlim = NULL, support = 1000, HPD.level = 0.9, N = NULL, ylim = NULL, mcmc = FALSE, type = "all", main = "Posterior for population size", smooth = 4, include.tree = TRUE, cex.main = 1, log.degree = "", method = "bgk", ... )
x |
an object of class |
xlim |
the (optional) x limits (x1, x2) of the plot of the posterior of the population size. |
support |
the number of equally-spaced points to use for the support of the estimated posterior density function. |
HPD.level |
numeric; probability level of the highest probability density interval determined from the estimated posterior. |
N |
Optionally, an estimate of the population size to mark on the plots as a reference point. |
ylim |
the (optional) vertical limits (y1, y2) of the plot of the posterior of the population size. The vertical axis is on the probability density scale. |
mcmc |
logical; If TRUE, additionally create simple diagnostic plots for the MCMC sampled statistics produced from the fit. |
type |
character; This controls the types of plots produced. If
|
main |
an overall title for the posterior plot. |
smooth |
the (optional) smoothing parameter for the density estimate. |
include.tree |
logical; If |
cex.main |
the character expansion (magnification) of the overall title of the posterior plot. |
log.degree |
a character string which contains |
method |
character; The method to use for density estimation (default Gaussian Kernel; "bgk"). "Bayes" uses a Bayesian density estimator which has good properties. |
... |
further arguments passed to or from other methods. |
By default it produces a density plot of the posterior for population size and the prior for population size is overlaid. It also produces a density plot of the posterior for mean network size in the population, the posterior for standard deviation of the network size, and a density plot of the posterior mean network size distribution with sample histogram overlaid.
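The highest probability density interval controlled by HPD.level can be illustrated with a sample-based sketch in base R. This is an assumption-laden stand-in: the plot method determines the interval from the estimated posterior density, whereas the hypothetical helper hpd_interval below works directly on a vector of posterior draws and returns the shortest window that contains the requested fraction of them.

```r
# Shortest interval containing at least `level` of the draws (sample-based HPD sketch).
hpd_interval <- function(draws, level = 0.9) {
  x <- sort(draws)
  n <- length(x)
  k <- ceiling(level * n)               # number of draws the interval must cover
  starts <- seq_len(n - k + 1)          # every window of k consecutive sorted draws
  widths <- x[starts + k - 1] - x[starts]
  i <- which.min(widths)                # shortest window; the first one on ties
  c(lower = x[i], upper = x[i + k - 1])
}

# Made-up draws with one extreme value: the 90% interval excludes the outlier.
hpd <- hpd_interval(c(1:10, 1000), level = 0.9)
hpd
```

Unlike an equal-tailed interval, this construction follows the bulk of the draws, which suits the right-skewed posteriors that population size estimates often have.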
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
The model fitting function posteriorsize
,
plot
.
Function coef
will extract the matrix of coefficients with
standard errors, t-statistics and p-values.
## Not run:
data(fauxmadrona)
# Here interval=1 and samplesize=50 so that it will run faster. It should be much higher
# in a real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=10, interval=1, samplesize=50)
summary(fit)
# Let's look at some MCMC diagnostics
plot(fit, mcmc=TRUE)
## End(Not run)
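The roles of burnin, interval and samplesize can be illustrated with a generic random-walk Metropolis-Hastings sampler in base R. This is a toy sketch on a normal target, not the sampler used by posteriorsize: the function mh_sample, its step argument and the target density are hypothetical, and the assumption that the total number of proposals is burnin + interval * samplesize is the usual reading of these arguments rather than something stated in this manual.

```r
# Toy random-walk Metropolis-Hastings sampler with burn-in and thinning.
mh_sample <- function(log_target, init, samplesize, burnin, interval, step = 1) {
  total <- burnin + interval * samplesize  # assumed total number of proposals
  draws <- numeric(samplesize)
  current <- init
  current_lp <- log_target(current)
  kept <- 0
  for (i in seq_len(total)) {
    proposal <- current + rnorm(1, sd = step)  # symmetric proposal
    proposal_lp <- log_target(proposal)
    if (log(runif(1)) < proposal_lp - current_lp) {  # Metropolis acceptance rule
      current <- proposal
      current_lp <- proposal_lp
    }
    # Discard the first `burnin` proposals, then keep every `interval`-th state.
    if (i > burnin && (i - burnin) %% interval == 0) {
      kept <- kept + 1
      draws[kept] <- current
    }
  }
  draws
}

set.seed(1)
draws <- mh_sample(function(x) dnorm(x, mean = 3, log = TRUE),
                   init = 0, samplesize = 200, burnin = 500, interval = 10)
length(draws)
```

A larger interval reduces the autocorrelation between retained draws and a larger burnin reduces dependence on the starting value, which is why the small values used in the example above are only suitable for a quick demonstration.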
posteriorsize
computes the posterior distribution of the
population size based on data collected by Respondent Driven Sampling.
This function returns the warning message if it fails.
It enables packages that call posteriorsize
to use
a consistent error message.
posize_warning()
posize_warning
returns a character string with the warning message.
posteriorsize
This function extracts the posterior predictive p-values for the self-reported network sizes from an estimate of the posterior distribution of the population size based on data collected by Respondent Driven Sampling. The approach approximates the RDS via the Sequential Sampling model of Gile (2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE). It uses the order of selection of the sample to provide information on the distribution of network sizes over the population members.
pospreddeg(x, order.by.recruitment.time = FALSE)
x |
an object of class |
order.by.recruitment.time |
logical; If |
It computes the posterior predictive distribution for each reported network size and computes the percentile rank of the reported network size within that posterior. The percentile rank should be about 0.5 for a well specified model, but could be close to uniform if there is little information about the reported network size. The percentile ranks should not be extreme (e.g., close to zero or one) on a consistent basis as this indicates a misspecified model.
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
The model fitting function posteriorsize
,
plot
.
## Not run:
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=20, interval=1, samplesize=100)
summary(fit)
# Let's look at some MCMC diagnostics
pospreddeg(fit)
## End(Not run)
posteriorsize
computes the posterior distribution of the
population size based on data collected by Respondent Driven Sampling. The
approach approximates the RDS via the Sequential Sampling model of Gile
(2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE).
It uses the order of selection of the sample to provide information
on the distribution of network sizes over the population members.
posteriorsize( s, s2 = NULL, previous = NULL, median.prior.size = NULL, interval = 10, burnin = 5000, maxN = NULL, K = FALSE, samplesize = 1000, quartiles.prior.size = NULL, mean.prior.size = NULL, mode.prior.size = NULL, priorsizedistribution = c("beta", "flat", "nbinom", "pln", "supplied"), effective.prior.df = 1, sd.prior.size = NULL, mode.prior.sample.proportion = NULL, alpha = NULL, visibilitydistribution = c("cmp", "nbinom", "pln"), mean.prior.visibility = NULL, sd.prior.visibility = NULL, max.sd.prior.visibility = 4, df.mean.prior.visibility = 1, df.sd.prior.visibility = 3, beta_0.mean.prior = -3, beta_t.mean.prior = 0, beta_u.mean.prior = 0, beta_0.sd.prior = 10, beta_t.sd.prior = 10, beta_u.sd.prior = 10, mem.optimism.prior = NULL, df.mem.optimism.prior = 5, mem.scale.prior = 2, df.mem.scale.prior = 10, mem.overdispersion = 15, visibility = TRUE, type.impute = c("median", "distribution", "mode", "mean"), Np = 0, n = NULL, n2 = NULL, mu_proposal = 0.1, nu_proposal = 0.15, beta_0_proposal = 0.2, beta_t_proposal = 0.001, beta_u_proposal = 0.001, memmu_proposal = 0.1, memscale_proposal = 0.15, burnintheta = 500, burninbeta = 50, parallel = 1, parallel.type = "PSOCK", seed = NULL, maxbeta = 90, supplied = list(maxN = maxN), max.coupons = NULL, recruit.time = NULL, recruit.time2 = NULL, include.tree = TRUE, unit.scale = FALSE, optimism = TRUE, reflect.time = FALSE, equalize = TRUE, verbose = FALSE )
s |
either a vector of integers or an |
s2 |
either a vector of integers or an |
previous |
character; optionally, the name of the variable in |
median.prior.size |
scalar; A hyperparameter being the median of the prior distribution on the population size. |
interval |
count; the number of proposals between sampled statistics. |
burnin |
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number. |
maxN |
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution. |
K |
count; the maximum visibility for an individual. This is usually
calculated as |
samplesize |
count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm. The default is 1000. |
quartiles.prior.size |
vector of length 2; A pair of hyperparameters
being the lower and upper quartiles of the prior distribution on the
population size. For example, |
mean.prior.size |
scalar; A hyperparameter being the mean of the prior distribution on the population size. |
mode.prior.size |
scalar; A hyperparameter being the mode of the prior distribution on the population size. |
priorsizedistribution |
character; the type of parametric distribution
to use for the prior on population size. The options are "beta", "flat", "nbinom", "pln" and "supplied". |
effective.prior.df |
scalar; A hyperparameter being the effective number of samples worth of information represented in the prior distribution on the population size. By default this is 1, but it can be greater (or less!) to allow for different levels of uncertainty. |
sd.prior.size |
scalar; A hyperparameter being the standard deviation of the prior distribution on the population size. |
mode.prior.sample.proportion |
scalar; A hyperparameter being the mode
of the prior distribution on the sample proportion |
alpha |
scalar; A hyperparameter being the first parameter of the beta prior model for the sample proportion. By default this is NULL, meaning that 1 is chosen. It can be any value of at least 1 to allow for different levels of uncertainty. |
visibilitydistribution |
character; the parametric distribution to use for the
individual network sizes (i.e., degrees). The options are "cmp", "nbinom" and "pln". |
mean.prior.visibility |
scalar; A hyper parameter being the mean visibility for the prior distribution for a randomly chosen person. The prior has this mean. |
sd.prior.visibility |
scalar; A hyper parameter being the standard deviation of the visibility for a randomly chosen person. The prior has this standard deviation. |
max.sd.prior.visibility |
scalar; The maximum allowed value of |
df.mean.prior.visibility |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior. |
df.sd.prior.visibility |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation. |
beta_0.mean.prior |
scalar; A hyper parameter being the mean of the beta_0 parameter distribution in the model for the number of recruits. |
beta_t.mean.prior |
scalar; A hyper parameter being the mean of the beta_t parameter distribution in the model for the number of recruits. This corresponds to the time-to-recruit variable. |
beta_u.mean.prior |
scalar; A hyper parameter being the mean of the beta_u parameter distribution in the model for the number of recruits. This corresponds to the visibility variable. |
beta_0.sd.prior |
scalar; A hyper parameter being the standard deviation of the beta_0 parameter distribution in the model for the number of recruits. |
beta_t.sd.prior |
scalar; A hyper parameter being the standard deviation of the beta_t parameter distribution in the model for the number of recruits. This corresponds to the time-to-recruit variable. |
beta_u.sd.prior |
scalar; A hyper parameter being the standard deviation of the beta_u parameter distribution in the model for the number of recruits. This corresponds to the visibility variable. |
mem.optimism.prior |
scalar; A hyper parameter being the mean of the distribution of the optimism parameter. |
df.mem.optimism.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the optimism parameter. This gives the equivalent sample size that would contain the same amount of information inherent in the prior. |
mem.scale.prior |
scalar; A hyper parameter being the scale of the concentration of the baseline negative binomial measurement error model. |
df.mem.scale.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation of the dispersion parameter in the visibility model. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation. |
mem.overdispersion |
scalar; A parameter being the overdispersion of the negative binomial distribution that is the baseline for the measurement error model. |
visibility |
logical; Indicate if the measurement error model
is to be used, whereby latent visibilities are used in place of the reported
network sizes as the unit size variable. If |
type.impute |
The type of imputation to use for the summary visibilities
(returned in the component |
Np |
integer; The overall visibility distribution is a mixture of the
|
n |
integer; the number of people in the sample. This is usually computed from
|
n2 |
integer; If |
mu_proposal |
scalar; The standard deviation of the proposal distribution for the mean visibility. |
nu_proposal |
scalar; The standard deviation of the proposal distribution for the CMP scale parameter that determines the standard deviation of the visibility. |
beta_0_proposal |
scalar; The standard deviation of the proposal distribution for the beta_0 parameter of the recruit model. |
beta_t_proposal |
scalar; The standard deviation of the proposal distribution for the beta_t parameter of the recruit model. This corresponds to the time-to-recruit variable. |
beta_u_proposal |
scalar; The standard deviation of the proposal distribution for the beta_u parameter of the recruit model. This corresponds to the visibility variable. |
memmu_proposal |
scalar; The standard deviation of the proposal distribution for the log of the optimism parameter (that is, gamma). |
memscale_proposal |
scalar; The standard deviation of the proposal distribution for the log of the s.d. in the optimism model. |
burnintheta |
count; the number of proposals in the Metropolis-Hastings
sub-step for the visibility distribution parameters ( |
burninbeta |
count; the number of proposals in the Metropolis-Hastings
sub-step for the visibility distribution parameters ( |
parallel |
count; the number of parallel processes to run for the Monte-Carlo sample. This uses MPI or PSOCK. The default is 1, that is, parallel processing is not used. |
parallel.type |
The type of parallel processing to use. The options are "PSOCK" or "MPI". This requires the corresponding type to be installed. The default is "PSOCK". |
seed |
integer; random number integer seed. Defaults to |
maxbeta |
scalar; The maximum allowed value of the |
supplied |
list; If supplied, is a list with components |
max.coupons |
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). By default it is taken from the attribute or data; otherwise the maximum recorded number of coupons is used. |
recruit.time |
vector; An optional value for the date/time that the person was interviewed. It needs to resolve to a numeric vector whose length is the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL then the sequence number of the recruit in the data is used. If it is NA then the recruitment time is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person. |
recruit.time2 |
vector; An optional value for the date/time that the person in the second RDS survey was interviewed. It needs to resolve to a numeric vector whose length is the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL, the default, then the sequence number of the recruit in the data is used. If it is NA then the recruitment time is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person. |
include.tree |
logical; If |
unit.scale |
numeric; If not |
optimism |
logical; If |
reflect.time |
logical; If |
equalize |
logical; If |
verbose |
logical; if this is |
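The rules described for recruit.time (NULL, NA, a variable name, or a vector) can be sketched in base R as follows. The helper resolve_recruit_time and the toy data frame are hypothetical and not part of sspse; the sketch also skips the restriction to rows with non-missing network sizes described above.

```r
# Hypothetical helper (not an sspse function): mimics the documented
# NULL / NA / character-name / vector cases for recruit.time.
resolve_recruit_time <- function(recruit.time, data) {
  if (is.null(recruit.time)) {
    # NULL: fall back to the sequence number of the recruit in the data
    return(seq_len(nrow(data)))
  }
  if (length(recruit.time) == 1 && is.na(recruit.time)) {
    # NA: recruitment time is not used in the model
    return(NULL)
  }
  if (is.character(recruit.time) && length(recruit.time) == 1) {
    # a single character value is treated as a variable name in the data
    recruit.time <- data[[recruit.time]]
  }
  # dates and other time classes are reduced to their numeric representation
  as.numeric(recruit.time)
}

# Toy data with an assumed interview-date column
toy <- data.frame(
  id = 1:3,
  interview = as.Date(c("2020-01-05", "2020-01-07", "2020-01-12"))
)
rt_default <- resolve_recruit_time(NULL, toy)
rt_named   <- resolve_recruit_time("interview", toy)
rt_unused  <- resolve_recruit_time(NA, toy)
```

Dates become day counts here, so only differences between interview times carry meaning.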
posteriorsize
returns a list consisting of the
following elements:
pop |
vector; The final posterior draw for the
degrees of the population. The first |
K |
count; the maximum visibility for an individual. This is usually calculated as twice the maximum observed degree. |
n |
count; the sample size. |
samplesize |
count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm. The default is 1000. |
burnin |
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number. |
interval |
count; the number of proposals between sampled statistics. |
mu |
scalar; The
hyper parameter |
sigma |
scalar; The hyper parameter |
df.mean.prior.visibility |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior. |
df.sd.prior.visibility |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation. |
Np |
integer; The
overall visibility distribution is a mixture of the |
mu_proposal |
scalar; The standard deviation of the proposal distribution for the mean visibility. |
nu_proposal |
scalar; The standard deviation of the proposal distribution for the CMP scale parameter of the visibility distribution. |
N |
vector of length 5; summary statistics for the posterior population size.
|
maxN |
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution. |
sample |
matrix of dimension
|
vsample |
matrix of dimension |
lpriorm |
vector; the vector of (log) prior
probabilities on each value of |
burnintheta |
count; the number of
proposals in the Metropolis-Hastings sub-step for the visibility distribution
parameters ( |
verbose |
logical; if this is
|
predictive.visibility.count |
vector; a vector
of length the maximum visibility ( |
predictive.visibility |
vector; a vector of length the maximum visibility
( |
MAP |
vector of length 6
of MAP estimates corresponding to the output
|
mode.prior.sample.proportion |
scalar; A
hyperparameter being the mode of the prior distribution on the sample
proportion |
median.prior.size |
scalar; A hyperparameter being the median of the prior distribution on the population size. |
mode.prior.size |
scalar; A hyperparameter being the mode of the prior distribution on the population size. |
mean.prior.size |
scalar; A hyperparameter being the mean of the prior distribution on the population size. |
quartiles.prior.size |
vector of length 2; A pair of hyperparameters being the lower and upper quartiles of the prior distribution on the population size. |
visibilitydistribution |
character; the
parametric distribution to use for the individual network sizes (i.e.,
visibilities). The options are |
priorsizedistribution |
character; the type of parametric distribution
to use for the prior on population size. The options are |
The best way to specify the prior is via the
hyperparameter mode.prior.size
which specifies the mode of the prior
distribution on the population size. You can alternatively specify the
hyperparameter median.prior.size
which specifies the median of the
prior distribution on the population size, or mean.prior.sample.proportion
which specifies the mean of the prior distribution on the
proportion of the population size in the sample, or mode.prior.sample.proportion
which specifies the mode of the prior distribution on the
proportion of the population size in the sample. Finally, you can specify
quartiles.prior.size
as a vector of length 2 being the pair of lower
and upper quartiles of the prior distribution on the population size.
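To see how a prior on the sample proportion translates into a prior on the population size, the base-R sketch below discretises the prior on N implied by a Beta prior on p = n/N. It is only an illustration under assumed values (n, maxN, alpha and beta are hypothetical) and is not the computation performed by dsizeprior.

```r
# Illustration only: prior on N induced by p = n / N ~ Beta(alpha, beta).
n     <- 500      # hypothetical sample size
maxN  <- 20000    # hypothetical upper bound on the population size
alpha <- 1        # hypothetical first beta parameter
beta  <- 3        # hypothetical second beta parameter

N <- n:maxN
# Change of variables: density of p at n/N times the Jacobian |dp/dN| = n / N^2
w     <- dbeta(n / N, alpha, beta) * n / N^2
prior <- w / sum(w)              # normalise over the grid n, ..., maxN

# Summaries matching the hyperparameters discussed above
prior_mode      <- N[which.max(prior)]
cdf             <- cumsum(prior)
prior_quartiles <- N[c(which(cdf >= 0.25)[1],
                       which(cdf >= 0.50)[1],
                       which(cdf >= 0.75)[1])]
prior_mean      <- sum(N * prior)
```

With these values the unnormalised weight is proportional to p^2 (1 - p)^2, so the mode sits at p = 1/2, that is N = 2n. The long right tail pulls the median and mean above the mode, which is one reason the choice between mode, median and mean as the hyperparameter matters.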
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
network, statnet, degreenet
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=20, interval=1, samplesize=100)
summary(fit)
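The roles of burnin, interval and samplesize in the call above can be illustrated with a toy random-walk Metropolis-Hastings sampler in base R. This is a sketch on an assumed normal target, not the sampler used by posteriorsize; toy_mh and its target are hypothetical.

```r
# Toy random-walk Metropolis-Hastings sampler (not the sspse sampler).
toy_mh <- function(samplesize, burnin, interval, proposal_sd = 1) {
  # hypothetical target: a normal density, evaluated on the log scale
  log_target <- function(x) dnorm(x, mean = 3, sd = 2, log = TRUE)
  x     <- 0
  draws <- numeric(samplesize)
  kept  <- 0
  total <- burnin + samplesize * interval   # total number of proposals
  for (i in seq_len(total)) {
    proposal <- x + rnorm(1, sd = proposal_sd)
    # accept with probability min(1, target(proposal) / target(x))
    if (log(runif(1)) < log_target(proposal) - log_target(x)) {
      x <- proposal
    }
    # discard the first `burnin` proposals, then keep every `interval`-th state
    if (i > burnin && (i - burnin) %% interval == 0) {
      kept <- kept + 1
      draws[kept] <- x
    }
  }
  draws
}

set.seed(1)
draws_fast <- toy_mh(samplesize = 100, burnin = 20, interval = 1)
draws_thin <- toy_mh(samplesize = 100, burnin = 1000, interval = 10)
```

Both runs return 100 draws, but the second uses 2000 proposals instead of 120, so its draws are less autocorrelated and depend less on the starting value.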
This is the print
method for the summary
class method for class "sspse"
objects.
These objects encapsulate an estimate of the posterior distribution of
the population size based on data collected by Respondent Driven Sampling.
The approach approximates the RDS via the Sequential Sampling model of Gile
(2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE).
It uses the order of selection of the sample to provide information
on the distribution of network sizes over the population members.
## S3 method for class 'summary.sspse'
print(
  x,
  digits = max(3, getOption("digits") - 3),
  correlation = FALSE,
  covariance = FALSE,
  signif.stars = getOption("show.signif.stars"),
  eps.Pvalue = 1e-04,
  ...
)
x |
an object of class |
digits |
the number of significant digits to use when printing. |
correlation |
logical; if |
covariance |
logical; if |
signif.stars |
logical. If |
eps.Pvalue |
number; indicates the smallest p-value.
|
... |
further arguments passed to or from other methods. |
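As a small worked example of the digits default, the expression max(3, getOption("digits") - 3) can be evaluated directly in base R. The digits option is set explicitly here so the result does not depend on the session; the value being rounded is arbitrary.

```r
# Evaluate the default for `digits` under R's factory setting of 7 digits.
op <- options(digits = 7)                       # set explicitly, restored below
default_digits <- max(3, getOption("digits") - 3)
shown <- signif(1234.56789, default_digits)     # an arbitrary value, rounded
options(op)                                     # restore the previous option
```

The lower bound of 3 only takes effect when the digits option is below 6.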
print.summary.sspse
tries to be smart about formatting the
coefficients, standard errors, etc. and additionally gives
‘significance stars’ if signif.stars
is TRUE
.
Aliased coefficients are omitted in the returned object but restored by the
print
method.
Correlations are printed to two decimal places (or symbolically): to see the
actual correlations print summary(object)$correlation
directly.
The function summary.sspse
computes and returns a two row matrix of
summary statistics of the prior and estimated posterior distributions.
The rows correspond to the Prior
and the Posterior
, respectively.
The column names are Mean
, Median
, Mode
,
25%
, 75%
, and 90%
.
These correspond to the distributional mean, median, mode, lower quartile,
upper quartile and 90% quantile, respectively.
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
The model fitting function posteriorsize
,
summary
.
Function coef
will extract the matrix of coefficients with
standard errors, t-statistics and p-values.
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=20, interval=1, samplesize=100)
fit
This is the summary
method for class "sspse"
objects.
These objects encapsulate an estimate of the posterior distribution of
the population size based on data collected by Respondent Driven Sampling.
The approach approximates the RDS via the Sequential Sampling model of Gile
(2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE).
It uses the order of selection of the sample to provide information
on the distribution of network sizes over the population members.
## S3 method for class 'sspse'
summary(object, support = 1000, HPD.level = 0.95, method = "bgk", ...)
object |
an object of class |
support |
the number of equally-spaced points to use for the support of the estimated posterior density function. |
HPD.level |
numeric; probability level of the highest probability density interval determined from the estimated posterior. |
method |
character; The method to use for density estimation (default Gaussian Kernel; "bgk"). "Bayes" uses a Bayesian density estimator which has good properties. |
... |
further arguments passed to or from other methods. |
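The interval controlled by HPD.level can be approximated from posterior draws with the shortest-interval rule sketched below. This is an assumption-laden illustration for a unimodal posterior: hpd_interval and the simulated draws are hypothetical, and the package itself works from the estimated density on the support grid rather than from sorted draws.

```r
# Shortest interval containing at least `level` of the draws
# (a common HPD approximation for unimodal posteriors; not sspse code).
hpd_interval <- function(draws, level = 0.95) {
  x <- sort(draws)
  n <- length(x)
  m <- ceiling(level * n)                  # draws the interval must contain
  starts <- seq_len(n - m + 1)             # candidate left end points
  widths <- x[starts + m - 1] - x[starts]  # width of each candidate interval
  i <- which.min(widths)
  c(lower = x[i], upper = x[i + m - 1])
}

set.seed(7)
# hypothetical posterior draws of the population size
draws <- round(rlnorm(2000, meanlog = log(1200), sdlog = 0.3))
hpd   <- hpd_interval(draws, level = 0.95)
coverage <- mean(draws >= hpd[["lower"]] & draws <= hpd[["upper"]])
```

For a right-skewed posterior such as this one, the HPD interval sits further left than the equal-tailed interval with the same level.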
print.summary.sspse
tries to be smart about formatting the
coefficients, standard errors, etc. and additionally gives
‘significance stars’ if signif.stars
is TRUE
.
Aliased coefficients are omitted in the returned object but restored by the
print
method.
Correlations are printed to two decimal places (or symbolically): to see the
actual correlations print summary(object)$correlation
directly.
The function summary.sspse
computes and returns a two row matrix of
summary statistics of the prior and estimated posterior distributions. The rows correspond to the Prior
and the
Posterior
, respectively.
The column names are Mean
, Median
, Mode
, 25%
, 75%
, and 90%
.
These correspond to the distributional mean, median, mode, lower quartile, upper quartile and 90% quantile, respectively.
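The layout of the returned matrix can be mimicked in base R from two sets of draws. The sketch below uses simulated prior and posterior draws (hypothetical values) and a kernel-density mode; it shows the shape of the output only and is not the computation in summary.sspse.

```r
# Build a two-row summary with the statistics listed above (illustration only).
summarise_draws <- function(x) {
  d <- density(x)                          # kernel density, used for the mode
  c(Mean   = mean(x),
    Median = median(x),
    Mode   = d$x[which.max(d$y)],
    quantile(x, c(0.25, 0.75, 0.90)))      # named "25%", "75%", "90%"
}

set.seed(42)
prior_draws     <- round(rlnorm(5000, meanlog = log(1000), sdlog = 0.6))
posterior_draws <- round(rlnorm(5000, meanlog = log(1200), sdlog = 0.3))

tab <- rbind(Prior     = summarise_draws(prior_draws),
             Posterior = summarise_draws(posterior_draws))
round(tab)
```

Reading the two rows side by side shows how much the data have sharpened the prior: here the posterior quartiles are much closer together than the prior ones.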
The model fitting function posteriorsize
,
summary
.
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
                     burnin=20, interval=1, samplesize=100)
summary(fit)