Title: | Respondent-Driven Sampling |
---|---|
Description: | Provides functionality for carrying out estimation with data collected using Respondent-Driven Sampling. This includes Heckathorn's RDS-I and RDS-II estimators as well as Gile's Sequential Sampling estimator. The package is part of the "RDS Analyst" suite of packages for the analysis of respondent-driven sampling data. See Gile and Handcock (2010) <doi:10.1111/j.1467-9531.2010.01223.x>, Gile and Handcock (2015) <doi:10.1111/rssa.12091> and Gile, Beaudry, Handcock and Ott (2018) <doi:10.1146/annurev-statistics-031017-100704>. |
Authors: | Mark S. Handcock [aut, cre], Krista J. Gile [aut], Ian E. Fellows [aut], W. Whipple Neely [ctb] |
Maintainer: | Mark S. Handcock <[email protected]> |
License: | LGPL-2.1 |
Version: | 0.9-10 |
Built: | 2024-11-06 03:57:13 UTC |
Source: | https://github.com/cran/RDS |
indexing
## S3 method for class 'rds.data.frame' x[i, j, ..., drop, warn = TRUE]
x |
object |
i |
indices |
j |
indices |
... |
unused |
drop |
drop |
warn |
Warn if any new seeds are created |
Subsetting of RDS recruitment trees does not always yield a full RDS tree. In this case, subjects whose recruiter is no longer in the dataset are considered seeds. A warning is issued if the 'warn' parameter is TRUE.
dat <- data.frame(id=c(1,2,3,4,5), recruiter.id=c(2,-1,2,-1,4), network.size.variable=c(4,8,8,2,3))
r <- as.rds.data.frame(dat)
r[1:3,] # A valid pruning of the RDS tree.
r[c(1,5),warn=FALSE] # recruiter.id of last row set to -1 (i.e. a seed) to maintain validity of tree
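The re-rooting rule described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the helper prune_tree and the "new.seeds" attribute are hypothetical, and the sketch assumes seeds are marked with a recruiter.id of -1 as in the example data.

```r
# Illustrative base-R sketch only; not the RDS package's implementation.
# Assumes seeds are marked by a recruiter.id of -1, as in the example data.
dat <- data.frame(id = c(1, 2, 3, 4, 5),
                  recruiter.id = c(2, -1, 2, -1, 4),
                  network.size.variable = c(4, 8, 8, 2, 3))

# Hypothetical helper: keep `rows`, then turn orphaned subjects into seeds.
prune_tree <- function(df, rows, seed.rid = -1) {
  sub <- df[rows, , drop = FALSE]
  orphan <- sub$recruiter.id != seed.rid & !(sub$recruiter.id %in% sub$id)
  sub$recruiter.id[orphan] <- seed.rid
  attr(sub, "new.seeds") <- sub$id[orphan]
  sub
}

pruned <- prune_tree(dat, c(1, 5))
pruned$recruiter.id          # recruiter ids after re-rooting
attr(pruned, "new.seeds")    # ids whose recruiter was dropped
```

In this sketch every kept row whose recruiter was dropped is re-rooted, so a caller could inspect the "new.seeds" attribute to decide whether a warning is appropriate.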
indexing
## S3 replacement method for class 'rds.data.frame' x[i, j] <- value
x |
object |
i |
indices |
j |
indices |
value |
value |
Indexed assignment. If the result is not a valid rds.data.frame, an error is emitted.
converts to character with minimal loss of precision for numeric variables
as.char(x, ...)
x |
the value |
... |
passed to either format or as.character. |
This function converts a regular R data frame into an rds.data.frame. The greatest advantage of this is that it performs integrity checks and will fail if the recruitment information in the original data frame is incomplete.
as.rds.data.frame( df, id = if (is.null(attr(df, "id"))) "id" else attr(df, "id"), recruiter.id = if (is.null(attr(df, "recruiter.id"))) { "recruiter.id" } else attr(df, "recruiter.id"), network.size = if (is.null(attr(df, "network.size.variable"))) { "network.size.variable" } else attr(df, "network.size.variable"), population.size = if (all(is.na(get.population.size(df, FALSE)))) { NULL } else get.population.size(df, FALSE), max.coupons = if (is.null(attr(df, "max.coupons"))) { NULL } else attr(df, "max.coupons"), notes = if (is.null(attr(df, "notes"))) { NULL } else attr(df, "time"), time = if (is.null(attr(df, "time"))) { NULL } else attr(df, "time"), check.valid = TRUE )
df |
A data.frame representing an RDS sample. |
id |
The unique identifier. |
recruiter.id |
The unique identifier of the recruiter of this row. |
network.size |
The number of alters (i.e. possible recruitees). |
population.size |
The size of the population from which this RDS sample has been drawn. Either a single number, or a vector of length three indicating low, mid and high estimates. |
max.coupons |
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). |
notes |
Data set notes. |
time |
the name of the recruitment time variable. optional. |
check.valid |
If true, validity checks are performed to ensure that the data is well formed. |
An rds.data.frame object
dat <- data.frame(id=c(1,2,3,4,5), recruiter.id=c(2,-1,2,-1,4), network.size.variable=c(4,8,8,2,3))
as.rds.data.frame(dat)
Does various checks and throws errors if x is not a valid rds.data.frame
assert.valid.rds.data.frame(x, ...)
x |
an rds.data.frame |
... |
unused |
Throws an informative message if x is malformed.
Performs a bootstrap test of independence between two categorical variables
bootstrap.contingency.test( rds.data, row.var, col.var, number.of.bootstrap.samples = 1000, weight.type = c("HCG", "RDS-II", "Arithmetic Mean"), table.only = FALSE, verbose = TRUE, ... )
rds.data |
an rds.data.frame |
row.var |
the name of the first categorical variable |
col.var |
the name of the second categorical variable |
number.of.bootstrap.samples |
The number of simulated bootstrap populations |
weight.type |
The type of weighting to use for the contingency table. Only large sample methods are allowed. |
table.only |
only returns the weighted table, without bootstrap. |
verbose |
level of output |
... |
Additional parameters for compute.weights |
This function first estimates a Homophily Configuration Graph model for the underlying network under the assumption that the two variables are independent and that the population size is large. It then draws bootstrap RDS samples from this population distribution and calculates the chi.squared statistic on the weighted contingency table. Weights are calculated using the HCG estimator assuming a large population size.
data(faux)
bootstrap.contingency.test(rds.data=faux, row.var="X", col.var="Y", number.of.bootstrap.samples=50, verbose=FALSE)
Calculates incidence and bootstrap confidence intervals for immunoassay data collected with RDS
bootstrap.incidence( rds.data, recent.variable, hiv.variable, N = NULL, weight.type = c("Gile's SS", "RDS-I", "RDS-I (DS)", "RDS-II", "Arithmetic Mean", "HCG"), mean.duration = 200, frr = 0.01, post.infection.cutoff = 730, number.of.bootstrap.samples = 1000, se.mean.duration = 0, se.frr = 0, confidence.level = 0.95, verbose = TRUE, ... )
rds.data |
an rds.data.frame |
recent.variable |
The name of the variable indicating recent infection |
hiv.variable |
The name of the variable indicating HIV infection |
N |
Population size |
weight.type |
A string giving the type of estimator to use. The options are "Gile's SS", "RDS-I", "RDS-I (DS)", "RDS-II", "Arithmetic Mean" and "HCG". |
mean.duration |
Estimated mean duration of recent infection (MDRI) (days) |
frr |
Estimated false-recent rate (FRR) |
post.infection.cutoff |
Post-infection time cut-off T, separating "true-recent" from "false-recent" results (days) |
number.of.bootstrap.samples |
The number of bootstrap samples used to construct the interval. |
se.mean.duration |
The standard error of the mean.duration estimate |
se.frr |
The standard error of the false recency estimate |
confidence.level |
The level of confidence for the interval |
verbose |
verbosity control |
... |
additional arguments to compute.weights |
The recent.variable and hiv.variable arguments should be the names of logical variables. Otherwise they are converted to logical using as.numeric(x) > 0.5.
This function estimates incidence using RDS sampling weights. Confidence intervals are constructed using HCG bootstraps. See http://www.incidence-estimation.org/ for additional information on (non-RDS) incidence estimation.
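The conversion rule for non-logical inputs can be tried out in isolation. The helper to_logical below is hypothetical and only applies the documented expression literally; the input vectors are made up.

```r
# Sketch of the documented fallback rule: non-logical inputs are mapped to
# logical via as.numeric(x) > 0.5. The vectors below are made-up illustrations.
to_logical <- function(x) {           # hypothetical helper, not a package function
  if (is.logical(x)) x else as.numeric(x) > 0.5
}

to_logical(c(0, 1, 1, NA))            # 0/1 coding, NA is preserved
to_logical(c("0", "1", "1"))          # character digits
to_logical(factor(c("no", "yes")))    # factors convert via their integer codes
```

Applied literally, the rule maps every level of a factor to TRUE because the integer codes start at 1, so it seems safest to supply logical or 0/1 variables.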
data(faux)
faux$hiv <- faux$X == "blue"
faux$recent <- NA
faux$recent[faux$hiv] <- runif(sum(faux$hiv)) < .2
faux$recent[runif(nrow(faux)) > .5] <- NA
faux$hiv[is.na(faux$recent)][c(1,6,10,21)] <- NA
attr(faux,"time") <- "wave"
bootstrap.incidence(faux,"recent","hiv",weight.type="RDS-II", number.of.bootstrap.samples=100)
Bottleneck Plot
bottleneck.plot( rds.data, outcome.variable, est.func = RDS.II.estimates, as.factor = FALSE, n.eval.points = 25, ... )
rds.data |
An rds.data.frame. |
outcome.variable |
A character vector of outcome variables. |
est.func |
A function taking rds.data and outcome.variable as parameters and returning an rds.weighted.estimate object. |
as.factor |
Convert all outcome variables to factors |
n.eval.points |
number of evaluation points to calculate the estimates at |
... |
additional parameters for est.func. |
Krista J. Gile, Lisa G. Johnston, Matthew J. Salganik. Diagnostics for Respondent-driven Sampling. eprint arXiv:1209.6254, 2012.
data(fauxmadrona)
bottleneck.plot(fauxmadrona,"disease")
Compute estimates of the sampling weights of the respondent's observations based on various estimators
compute.weights( rds.data, weight.type = c("Gile's SS", "RDS-I", "RDS-I (DS)", "RDS-II", "Arithmetic Mean", "HCG"), N = NULL, subset = NULL, control = control.rds.estimates(), ... )
rds.data |
An rds.data.frame. |
weight.type |
A string giving the type of estimator to use. The options are "Gile's SS", "RDS-I", "RDS-I (DS)", "RDS-II", "Arithmetic Mean" and "HCG". |
N |
An estimate of the number of members of the population being
sampled. If |
subset |
An optional criterion to subset |
control |
A list of control parameters for algorithm tuning. Constructed using control.rds.estimates. |
... |
Additional parameters passed to the individual weighting algorithms. |
A vector of weights for each of the respondents. It is of the same size as the number of rows in rds.data.
rds.I.weights, gile.ss.weights, vh.weights
Utility method that overrides the standard ‘$’ list accessor to disable partial matching for ergm control.list objects
## S3 method for class 'control.list' object$name
object |
list-coercible object with elements to be searched |
name |
literal character name of list element to search for and return |
Executes getElement instead of $ so that element names must match exactly to be returned and partially matching names will not return the wrong object.
Returns the named list element exactly matching name, or NULL if no matching elements found
Pavel N. Krivitsky
see getElement
Auxiliary function as user interface for fine-tuning the RDS.bootstrap.intervals algorithm, which computes interval estimates via bootstrapping.
control.rds.estimates( confidence.level = 0.95, SS.infinity = 0.01, lowprevalence = c(8, 14), discrete.cutoff = 0.8, useC = TRUE, number.of.bootstrap.samples = NULL, hcg.reltol = sqrt(.Machine$double.eps), hcg.BS.reltol = 1e+05 * sqrt(.Machine$double.eps), hcg.max.optim = 500, seed = NULL )
confidence.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
SS.infinity |
The sample proportion, |
lowprevalence |
Standard confidence interval procedures can be inaccurate when the
outcome expected count is close to zero. This sets conditions where alternatives to the
standard are used for the |
discrete.cutoff |
The minimum proportion of the values of the outcome variable that need to be unique before the variable is judged to be continuous. |
useC |
Use a C-level implementation of Gile's bootstrap (rather than the R level). The implementations should be computationally equivalent (except for speed). |
number.of.bootstrap.samples |
The number of bootstrap samples to take
in estimating the uncertainty of the estimator. If |
hcg.reltol |
Relative convergence tolerance for the HCG estimator. The algorithm stops if
it is unable to reduce the log-likelihood by a factor of |
hcg.BS.reltol |
Relative convergence tolerance for the bootstrap of the HCG estimator. It has the same interpretation as hcg.reltol. |
hcg.max.optim |
The number of iterations on the likelihood optimization for the HCG estimator. |
seed |
Seed value (integer) for the random number generator. See
|
This function is only used within a call to the RDS.bootstrap.intervals function.
Some of the arguments are not yet fully implemented. It will evolve slowly to incorporate more arguments as the package develops.
Standard confidence interval procedures can be inaccurate when the outcome expected count is close to zero. In these cases the combined Agresti-Coull and the bootstrap-t interval of Mantalos and Zografos (2008) can be used. The lowprevalence argument is a two-element vector setting the conditions under which the approximation is used. The first element is the penalty term on the differential activity. If the observed number of the rare group minus the product of the first parameter and the differential activity is lower than the second parameter, the low prevalence approximation is used.
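The lowprevalence condition can be written out as a one-line check. The function use_low_prevalence and its inputs n.rare and diff.activity are hypothetical names chosen for this sketch; it only restates the documented rule and is not the package's internal code.

```r
# Sketch of the documented decision rule only; not the package's internal code.
# `n.rare` and `diff.activity` are made-up inputs for illustration.
use_low_prevalence <- function(n.rare, diff.activity, lowprevalence = c(8, 14)) {
  penalty   <- lowprevalence[1]   # first element: penalty on differential activity
  threshold <- lowprevalence[2]   # second element: cut-off for the penalised count
  (n.rare - penalty * diff.activity) < threshold
}

use_low_prevalence(n.rare = 20, diff.activity = 1.5)  # 20 - 12 = 8, below 14
use_low_prevalence(n.rare = 60, diff.activity = 1.5)  # 60 - 12 = 48, not below 14
```

With the default c(8, 14), a rare group of 20 with differential activity 1.5 would trigger the approximation, while a group of 60 would not.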
A list with arguments as components.
This function creates diagnostic convergence plots for RDS estimators.
convergence.plot( rds.data, outcome.variable, est.func = RDS.II.estimates, as.factor = FALSE, n.eval.points = 25, ... )
rds.data |
An rds.data.frame. |
outcome.variable |
A character vector of outcome variables. |
est.func |
A function taking rds.data and outcome.variable as parameters and returning an rds.weighted.estimate object. |
as.factor |
Convert all outcome variables to factors |
n.eval.points |
number of evaluation points to calculate the estimates at |
... |
additional parameters for est.func. |
Krista J. Gile, Lisa G. Johnston, Matthew J. Salganik. Diagnostics for Respondent-driven Sampling. eprint arXiv:1209.6254, 2012.
data(faux)
convergence.plot(faux,c("X","Y"))
Counts the number of recruiter->recruitee transitions between different levels of the grouping variable.
count.transitions(rds.data, group.variable)
rds.data |
An rds.data.frame |
group.variable |
The name of a categorical variable in rds.data |
data(faux)
count.transitions(faux,"X")
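What a recruiter->recruitee transition count amounts to can be illustrated with a small base-R sketch. The data are made up, seeds are assumed to carry a recruiter.id of -1, and the code is not the package's implementation.

```r
# Base-R sketch with made-up data; not the package's implementation.
# Seeds are assumed to have recruiter.id -1 and are skipped.
dat <- data.frame(id = 1:6,
                  recruiter.id = c(-1, 1, 1, 2, 3, 3),
                  X = c("blue", "blue", "red", "red", "red", "blue"))

recruited <- dat$recruiter.id != -1
from <- dat$X[match(dat$recruiter.id[recruited], dat$id)]  # recruiter's group
to   <- dat$X[recruited]                                   # recruitee's group
trans <- table(recruiter = from, recruitee = to)
trans
```

Each non-seed row contributes one cell count, so the table total equals the number of recruited subjects.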
Calculates estimates at each successive wave of the sampling process
cumulative.estimate( rds.data, outcome.variable, est.func = RDS.II.estimates, n.eval.points = 25, ... )
rds.data |
An rds.data.frame |
outcome.variable |
The outcome |
est.func |
A function taking rds.data and outcome.variable as parameters and returning an rds.weighted.estimate object |
n.eval.points |
number of evaluation points to calculate the estimates at |
... |
additional parameters for est.func |
Differential Activity between groups
differential.activity.estimates( rds.data, outcome.variable, weight.type = "Gile's SS", N = NULL, subset = NULL, ... )
rds.data |
An rds.data.frame object |
outcome.variable |
A character string of column names representing categorical variables. |
weight.type |
A string giving the type of estimator to use. The options
are |
N |
The population size. |
subset |
An expression defining a subset of rds.data. |
... |
Additional parameters passed to compute.weights. |
This function estimates the ratio of the average degree of one population group divided by the average degree of those in another population group.
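As a rough illustration of the quantity being estimated, the sketch below computes a ratio of weighted mean degrees. It assumes RDS-II style weights proportional to 1/degree, which is only one of the available weight types, and uses made-up data; it is not the package's code.

```r
# Sketch under an assumed RDS-II style weighting (weights proportional to
# 1/degree); the data are made up and this is not the package's code.
degree <- c(2, 4, 4, 8, 1, 2, 2, 4)
group  <- c("a", "a", "a", "a", "b", "b", "b", "b")
w <- 1 / degree

# Weighted mean degree within each group.
mean.degree <- sapply(split(seq_along(degree), group),
                      function(i) sum(w[i] * degree[i]) / sum(w[i]))

# Differential activity of group "a" relative to group "b".
da <- mean.degree[["a"]] / mean.degree[["b"]]
da
```

Under inverse-degree weights the weighted mean degree reduces to the harmonic mean of the observed degrees in each group.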
data(faux)
differential.activity.estimates(faux,"X",weight.type="RDS-II")
Convert the output of print.rds.interval.estimate from a character data.frame to a numeric matrix
export.rds.interval.estimate(x, proportion = TRUE)
x |
An object, typically the result of print.rds.interval.estimate. |
proportion |
logical. Should the outcome be treated as a proportion and converted to a percentage? |
This is a faux set used to demonstrate RDS functions and analysis. It is used in some simple examples and has categorical variables "X", "Y" and "Z".
An rds.data.frame object
Gile, Krista J., Handcock, Mark S., 2010 Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327.
data(faux)
RDS.I.estimates(rds.data=faux,outcome.variable='X')
This is a faux set used to illustrate how the estimators perform under different populations and RDS schemes.
An rds.data.frame
The population had N=1000 nodes. In this case, the sample size is 500 so that there is a relatively small sample fraction (50%). There is homophily on disease status (R=5) and there is differential activity by disease status whereby the infected nodes have mean degree about twice that of the uninfected (w=1.8).
In the sampling, the seeds are chosen randomly from the full population, so there is no dependency induced by seed selection.
Each sample member is given 2 uniquely identified coupons to distribute to other members of the target population in their acquaintance. Further each respondent distributes their coupons completely at random from among those they are connected to.
Here are the results for this data set and the sister fauxsycamore data set:
Name | City | Type | Mean | RDS I (SH) | RDS II (VH) | SS |
fauxsycamore | Oxford | seed dependency, 70% | 0.2408 | 0.1087 | 0.1372 | 0.1814 |
fauxmadrona | Seattle | no seed dependency, 50% | 0.2592 | 0.1592 | 0.1644 | 0.1941 |
Even with only a 50% sample, the VH is substantially biased, and the SS does much better.
The original network is included as fauxmadrona.network as a network object. The data set also includes the data.frame of the RDS data set as fauxmadrona.
Use data(package="RDS") to get a full list of datasets.
Gile, Krista J., Handcock, Mark S., 2010 Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327.
This is a faux set used to demonstrate RDS functions and analysis. The population had N=715 nodes. In this case, the sample size is 500 so that there is a relatively large sample fraction (70%). There is homophily on disease status (R=5) and there is differential activity by disease status whereby the infected nodes have mean degree about twice that of the uninfected (w=1.8).
An rds.data.frame plus the original network as a network object
In the sampling the seeds are chosen randomly from the infected population, so there is extreme dependency induced by seed selection.
Each sample member is given 2 uniquely identified coupons to distribute to other members of the target population in their acquaintance. Further each respondent distributes their coupons completely at random from among those they are connected to.
With 70% sample, the VH is substantially biased, so the SS (and presumably MA) do much better. We expect the MA to perform a bit better than the SS.
It is network 702 and its sample from YesYes on mosix. Look for "extract702.R".
The original network is included as fauxsycamore.network as a network object. The data set also includes the data.frame of the RDS data set as fauxsycamore.
Use data(package="RDS") to get a full list of datasets.
Gile, Krista J., Handcock, Mark S., 2010 Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327.
This is a faux set used to demonstrate RDS functions and analysis.
An rds.data.frame object
Gile, Krista J., Handcock, Mark S., 2010 Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327.
Get Horvitz-Thompson estimator assuming inclusion probability proportional to the inverse of network.var (i.e. degree).
get.h.hat( rds.data, group.variable, network.var = attr(rds.data, "network.size") )
rds.data |
An rds.data.frame |
group.variable |
The grouping variable. |
network.var |
The network.size variable. |
Get the subject id
get.id(x, check.type = TRUE)
x |
an rds.data.frame object |
check.type |
if true, x is required to be of type rds.data.frame |
returns the variable indicated by the 'id' attribute, coercing to a character vector
Returns the network size of each subject (i.e. their degree).
get.net.size(x, check.type = TRUE)
x |
the rds.data.frame |
check.type |
if true, x is required to be of type rds.data.frame |
Calculates the number of (direct) recruits for each respondent.
get.number.of.recruits(data)
data |
An rds.data.frame |
data(fauxmadrona)
nr <- get.number.of.recruits(fauxmadrona)
#frequency of number recruited by each id
barplot(table(nr))
Returns the population size associated with the data.
get.population.size(x, check.type = TRUE)
x |
the rds.data.frame |
check.type |
if true, x is required to be of type rds.data.frame |
Returns the recruitment time for each subject
get.recruitment.time( x, to.numeric = TRUE, wave.fallback = FALSE, check.type = TRUE )
x |
the rds.data.frame |
to.numeric |
if true, time will be converted into a numeric variable. |
wave.fallback |
if true, subjects' recruitment times are ordered by wave and then by data.frame index if no recruitment time variable is available. |
check.type |
if true, x is required to be of type rds.data.frame |
Get recruiter id
get.rid(x, check.type = TRUE)
x |
an rds.data.frame object |
check.type |
if true, x is required to be of type rds.data.frame |
returns the variable indicated by the 'recruiter.id' attribute, coercing to a character vector
Calculates the root seed id for each node of the recruitment tree.
get.seed.id(data)
data |
An rds.data.frame |
data(fauxmadrona)
seeds <- get.seed.id(fauxmadrona)
#number recruited by each seed
barplot(table(seeds))
Gets the recruiter id associated with the seeds
get.seed.rid(x, check.type = TRUE)
x |
an rds.data.frame object |
check.type |
if true, x is required to be of type rds.data.frame |
All seed nodes must have the same placeholder recruiter id.
Markov chain stationary distribution
get.stationary.distribution(mle)
mle |
The transition probabilities |
A vector of proportions representing the proportion in each group at the stationary distribution of the Markov chain.
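For intuition, the stationary distribution of a transition matrix can be computed directly in base R. The sketch assumes a row-stochastic matrix (rows are "from" groups and sum to 1); the matrix P and the helper stationary are made up for illustration and are not how the package necessarily computes it.

```r
# Sketch: stationary distribution of a row-stochastic transition matrix.
# The matrix is a made-up two-group example, not package output.
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE,
            dimnames = list(c("g1", "g2"), c("g1", "g2")))

stationary <- function(P) {
  e <- eigen(t(P))                                   # left eigenvectors of P
  v <- Re(e$vectors[, which.min(abs(e$values - 1))]) # eigenvalue closest to 1
  setNames(v / sum(v), rownames(P))                  # normalise to proportions
}

p.stat <- stationary(P)
p.stat          # proportion in each group at stationarity
p.stat %*% P    # one more transition step leaves it unchanged
```

The defining property is that the vector is unchanged by a further transition step, which the last line checks.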
Calculates the depth of the recruitment tree (i.e. the recruitment wave) at each node.
get.wave(data)
data |
An rds.data.frame |
data(fauxmadrona)
#number subjects in each wave
w <- get.wave(fauxmadrona)
#number recruited in each wave
barplot(table(w))
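The notions of wave and root seed can be made concrete by walking up the recruitment tree by hand. The helper trace_to_seed is hypothetical; the sketch assumes seeds carry a recruiter.id of -1, that every other recruiter.id appears in id, and that the tree has no cycles. It is not the package's implementation.

```r
# Base-R sketch, not the package's implementation. Assumes seeds carry a
# recruiter.id of -1 and every other recruiter.id appears in `id`.
dat <- data.frame(id = c(1, 2, 3, 4, 5),
                  recruiter.id = c(2, -1, 2, -1, 4))

# Hypothetical helper: walk up the tree from each node to its seed.
trace_to_seed <- function(df, seed.rid = -1) {
  wave <- integer(nrow(df))
  seed <- df$id
  for (k in seq_len(nrow(df))) {
    cur <- k
    while (df$recruiter.id[cur] != seed.rid) {
      cur <- match(df$recruiter.id[cur], df$id)  # move to the recruiter's row
      wave[k] <- wave[k] + 1L                    # one step further from the seed
    }
    seed[k] <- df$id[cur]
  }
  data.frame(id = df$id, wave = wave, seed = seed)
}

tree <- trace_to_seed(dat)
tree
```

Seeds get wave 0 and are their own root, while each recruit's wave is one more than its recruiter's.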
Weights using Gile's SS estimator
gile.ss.weights( degs, N, number.ss.samples.per.iteration = 500, number.ss.iterations = 5, hajek = TRUE, SS.infinity = 0.04, se = FALSE, ... )
degs |
subjects' degrees (i.e. network sizes). |
N |
Population size estimate. |
number.ss.samples.per.iteration |
The number of samples to use to estimate inclusion probabilities in a probability proportional to size without replacement design. |
number.ss.iterations |
number of iterations to use in Gile's SS algorithm. |
hajek |
Should the Hajek estimator be used? If false, the HT estimator is used. |
SS.infinity |
The sample proportion, |
se |
Should covariances be included. |
... |
unused |
RDS data.frame has recruitment time information
has.recruitment.time(x, check.type = TRUE)
x |
the rds.data.frame |
check.type |
if true, x is required to be of type rds.data.frame |
HCG parametric bootstrap replicate weights
hcg.replicate.weights( rds.data, outcome.variable, number.of.bootstrap.samples = 500, include.sample.weights = FALSE, N = NULL, small.fraction = FALSE )
rds.data |
An rds.data.frame |
outcome.variable |
The column name of the variable defining the groups for the homophily configuration graph |
number.of.bootstrap.samples |
The number of bootstrap replicate weights to be generated |
include.sample.weights |
If TRUE, the first column of the returned frame contains the HCG weights for the sample |
N |
The population size |
small.fraction |
If TRUE, the sample size is assumed to be small compared to the population size |
This function generates bootstrap replicate weights which may be used to analyze RDS data in other packages or software systems (e.g. the survey package with svrepdesign).
A data.frame of replicate weights. If include.sample.weights is TRUE, the first column contains the HCG weights for the observed sample.
## Not run:
data("fauxmadrona")
set.seed(1)
# Generate replicate weights
result <- hcg.replicate.weights(fauxmadrona, "disease", 50, TRUE)
# Analyze with survey package and compare to internal function
if(require(survey)){
  set.seed(1)
  design <- svrepdesign(fauxmadrona, type = "bootstrap", weights= result[[1]], repweights = result[-1])
  svymean(~disease, design) |> print()
  RDS.bootstrap.intervals(fauxmadrona, "disease", "HCG", "HCG", number.of.bootstrap.samples = 50) |> print()
}
## End(Not run)
homophily configuration graph weights
hcg.weights( rds.data, outcome.variable, N = NULL, small.fraction = FALSE, reltol = sqrt(.Machine$double.eps), max.optim = 500, theta.start = NULL, weights.include.seeds = TRUE, ... )
rds.data |
An rds.data.frame |
outcome.variable |
The variable used to base the weights on. |
N |
Population size |
small.fraction |
should a small sample fraction be assumed |
reltol |
Relative convergence tolerance for the HCG estimator. The algorithm stops if
it is unable to reduce the log-likelihood by a factor of |
max.optim |
The number of iterations on the likelihood optimization for the HCG estimator. |
theta.start |
The initial value of theta used in the likelihood optimization for the HCG estimator. If NULL, the default, it is the margin of the table of counts for the transitions. |
weights.include.seeds |
logical. Should the weights be computed including the influence of the seeds? |
... |
Unused |
data(fauxtime)
hcg.weights(fauxtime,"var1",N=3000)
fauxtime$NETWORK[c(1,100,40,82,77)] <- NA
This function computes an estimate of the population homophily and the recruitment homophily based on a categorical variable.
homophily.estimates( rds.data, outcome.variable, weight.type = NULL, uncertainty = NULL, recruitment = FALSE, N = NULL, to.group0.variable = NULL, to.group1.variable = NULL, number.ss.samples.per.iteration = NULL, confidence.level = 0.95 )
rds.data |
An rds.data.frame. |
outcome.variable |
A string giving the name of the variable in the |
weight.type |
A string giving the type of estimator to use. The options are |
uncertainty |
A string giving the type of uncertainty estimator to use. The options are |
recruitment |
A logical indicating if the homophily in the recruitment chains should be computed also. The default is FALSE. |
N |
An estimate of the number of members of the population being sampled. If |
to.group0.variable |
The number in the network of each survey respondent who have group variable value 0. Usually this is not available. The default is to not use this variable. |
to.group1.variable |
The number in the network of each survey respondent who have group variable value 1. Usually this is not available. The default is to not use this variable. |
number.ss.samples.per.iteration |
The number of samples to take in estimating the inclusion probabilities in each iteration of the sequential sampling algorithm. If |
confidence.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
If outcome.variable is binary then the homophily estimate of 0 versus 1 is returned, otherwise a vector of differential homophily estimates is returned.
The recruitment homophily is a homophily measure for the recruitment process. It addresses the question: Do respondents differentially recruit people like themselves? That is, the homophily on a variable in the recruitment chains. Take as an example infection status. In this case, it is the ratio of the number of recruits that have the same infection status as their recruiter to the number we would expect if there was no homophily on infection status. The difference with the Population Homophily (see below) is that this is in the recruitment chain rather than the population of social ties. For example, if the recruitment homophily on infection status is about 1, we see little effect of recruitment homophily on infection status (as the numbers of homophilous pairs are close to what we would expect by chance).
This is an estimate of the homophily of a given variable in the underlying networked population. For example, consider HIV status. The population homophily is the homophily in the HIV status of two people who are tied in the underlying population social network (a “couple”). Specifically, the population homophily is the ratio of the expected number of HIV discordant couples absent homophily to the expected number of HIV discordant couples with the homophily. Hence larger values of population homophily indicate more homophily on HIV status. For example, a value of 1 means the couples are random with respect to HIV status. A value of 2 means there are half as many HIV discordant couples as we would expect if there were no homophily in the population. This measure is meaningful across different levels of differential activity. As we do not see most of the population network, we estimate the population homophily from the RDS data. As an example, suppose the population homophily on HIV is 0.75, so there are about 33% more HIV discordant couples than expected due to chance (1/0.75 is roughly 1.33). So there is actually heterophily on HIV in the population. If the population homophily on sex is 1.1, there are roughly 10% more same-sex couples than expected due to chance. Hence there is modest homophily on sex.
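The population homophily ratio can be worked through on a tiny network where everything is observed. The 6-node network below is made up; the expression for the expected number of discordant couples absent homophily follows the "True homophily" line in the package example, and the sketch is not an estimator for RDS data.

```r
# Toy illustration with a made-up 6-node network; not package output.
dis <- c(1, 1, 1, 0, 0, 0)                      # 1 = infected, 0 = uninfected
edges <- rbind(c(1, 2), c(2, 3), c(1, 3),       # ties among the infected
               c(4, 5), c(5, 6), c(4, 6),       # ties among the uninfected
               c(3, 4))                         # the only discordant tie
a <- matrix(0, 6, 6)
a[edges] <- 1
a[edges[, 2:1]] <- 1                            # make the tie matrix symmetric

deg <- rowSums(a)
N <- length(dis)
p <- mean(dis)

# Expected discordant couples absent homophily (same expression as in the
# package example), and the number actually present in the network.
expected.discordant <- p * (1 - p) * mean(deg[dis == 0]) * mean(deg[dis == 1]) * N / mean(deg)
observed.discordant <- sum(a[dis == 1, dis == 0])

homophily <- expected.discordant / observed.discordant
homophily
```

Here random mixing would give 3.5 discordant couples on average while only one is present, so the ratio is well above 1 and indicates strong homophily on infection status.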
Mark S. Handcock with help from Krista J. Gile
Gile, Krista J., Handcock, Mark S., 2010, Respondent-driven Sampling: An Assessment of Current Methodology. Sociological Methodology 40, 285-327.
## Not run:
data(fauxmadrona)
names(fauxmadrona)
#
# True value:
#
if(require(network)){
  a=as.sociomatrix(fauxmadrona.network)
  deg <- apply(a,1,sum)
  dis <- fauxmadrona.network %v% "disease"
  deg1 <- apply(a[dis==1,],1,sum)
  deg0 <- apply(a[dis==0,],1,sum)
  # differential activity
  mean(deg1)/ mean(deg0)
  p=mean(dis)
  N=1000
  # True homophily
  p*(1-p)*mean(deg0)*mean(deg1)*N/(mean(deg)*sum(a[dis==1,dis==0]))
}
# HT based estimators using the to.group information
data(fauxmadrona)
homophily.estimates(fauxmadrona,outcome.variable="disease",
  to.group0.variable="tonondiseased", to.group1.variable="todiseased",
  N=1000)
# HT based estimators not using the to.group information
homophily.estimates(fauxmadrona,outcome.variable="disease",
  N=1000,weight.type="RDS-II")
## End(Not run)
Imputes missing degree values
impute.degree( rds.data, trait.variable = NULL, N = NULL, method = c("mean", "quantile"), quantile = 0.5, recruitment.lower.bound = TRUE, round.degree = TRUE )
rds.data |
an rds.data.frame |
trait.variable |
the name of the variable in rds.data to stratify the imputation by |
N |
population size |
method |
If mean, the weighted mean value is imputed, otherwise a quantile is used. |
quantile |
If method is "quantile", this is the quantile that is used. Defaults to median |
recruitment.lower.bound |
If TRUE, then for each individual, the degree is taken to be the maximum of the number of recruits plus one, and the reported degree |
round.degree |
Should degrees be integer rounded. |
This function imputes degree values using the weighted mean or quantile values of the non-missing degrees. Weights are calculated using Gile's SS if N is not NULL, or RDS-II if it is. If a trait variable is specified, means and quantiles are calculated within the levels of the trait variable.
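The weighted-mean imputation can be sketched in a few lines of base R. The example below uses hypothetical degrees and assumes RDS-II style weights that are inversely proportional to degree; it does not reproduce Gile's SS weights or the stratification by a trait variable, and it is not the package's own code.

```r
# Hypothetical self-reported degrees with two missing values
degree <- c(2, 4, NA, 4, NA, 8)
observed <- !is.na(degree)

# RDS-II style weights: inversely proportional to degree (assumption for this sketch)
w <- 1 / degree[observed]

# Weighted mean of the non-missing degrees
imputed_value <- sum(w * degree[observed]) / sum(w)

# Fill in the missing entries, rounding to an integer degree
imputed <- degree
imputed[!observed] <- round(imputed_value)
imputed
```

Because high-degree respondents are over-represented in an RDS sample, the inverse-degree weighting pulls the imputed value (about 3.6 before rounding) below the unweighted mean of 4.5.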
data(faux)
rds.data <- faux
rds.data$network.size[c(1,2,30,52,81,101,108,111)] <- NA
impute.degree(rds.data)
impute.degree(rds.data,trait.variable="X")
impute.degree(rds.data,trait.variable="X",method="quantile")
Estimates each person's personal visibility based on their self-reported degree and the number of their (direct) recruits. It uses the time the person was recruited as a factor in determining the number of recruits they produce.
impute.visibility( rds.data, max.coupons = NULL, type.impute = c("median", "distribution", "mode", "mean"), recruit.time = NULL, include.tree = FALSE, reflect.time = FALSE, parallel = 1, parallel.type = "PSOCK", interval = 10, burnin = 5000, mem.optimism.prior = NULL, df.mem.optimism.prior = 5, mem.scale.prior = 2, df.mem.scale.prior = 10, mem.overdispersion = 15, return.posterior.sample.visibilities = FALSE, verbose = FALSE )
rds.data |
An rds.data.frame |
max.coupons |
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). By default it is taken by the attribute or data, else the maximum recorded number of coupons. |
type.impute |
The type of imputation based on the conditional distribution.
It can be of type |
recruit.time |
vector; An optional value for the date/time that the person was interviewed. It needs to resolve as a numeric vector with number of elements equal to the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL then the sequence number of the recruit in the data is used. If it is NA then the recruitment time is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person. |
include.tree |
logical; If |
reflect.time |
logical; If |
parallel |
count; the number of parallel processes to run for the Monte-Carlo sample. This uses MPI or PSOCK. The default is 1, that is not to use parallel processing. |
parallel.type |
The type of parallel processing to use. The options are "PSOCK" or "MPI". This requires the corresponding type to be installed. The default is "PSOCK". |
interval |
count; the number of proposals between sampled statistics. |
burnin |
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number. |
mem.optimism.prior |
scalar; A hyper parameter being the mean of the distribution of the optimism parameter. |
df.mem.optimism.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the optimism parameter. This gives the equivalent sample size that would contain the same amount of information inherent in the prior. |
mem.scale.prior |
scalar; A hyper parameter being the scale of the concentration of baseline negative binomial measurement error model. |
df.mem.scale.prior |
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation of the dispersion parameter in the visibility model. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation. |
mem.overdispersion |
scalar; A parameter being the overdispersion of the negative binomial distribution that is the baseline for the measurement error model. |
return.posterior.sample.visibilities |
logical; If TRUE then return a
matrix of dimension |
verbose |
logical; if this is |
McLaughlin, Katherine R.; Johnston, Lisa G.; Jakupi, Xhevat; Gexha-Bunjaku, Dafina; Deva, Edona and Handcock, Mark S. (2023) Modeling the Visibility Distribution for Respondent-Driven Sampling with Application to Population Size Estimation, Annals of Applied Statistics, doi:10.1093/jrsssa/qnad031
## Not run:
data(fauxmadrona)
# The next line fits the model for the self-reported personal
# network sizes and imputes the personal network sizes
# It may take up to 60 seconds.
visibility <- impute.visibility(fauxmadrona)
# frequency of estimated personal visibility
table(visibility)
## End(Not run)
Estimates each person's personal visibility based on their self-reported degree and the number of their (direct) recruits. It uses the time the person was recruited as a factor in determining the number of recruits they produce.
impute.visibility_mle( rds.data, max.coupons = NULL, type.impute = c("distribution", "mode", "median", "mean"), recruit.time = NULL, include.tree = FALSE, unit.scale = NULL, unit.model = c("cmp", "nbinom"), optimism = FALSE, guess = NULL, reflect.time = TRUE, maxit = 100, K = NULL, verbose = TRUE )
rds.data |
An rds.data.frame |
max.coupons |
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). By default it is taken by the attribute or data, else the maximum recorded number of coupons. |
type.impute |
The type of imputation based on the conditional distribution.
It can be of type |
recruit.time |
vector; An optional value for the date/time that the person was interviewed. It needs to resolve as a numeric vector with number of elements equal to the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL then the sequence number of the recruit in the data is used. If it is NA then the recruitment time is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person. |
include.tree |
logical; If |
unit.scale |
numeric; If not |
unit.model |
The type of distribution for the unit sizes.
It can be of |
optimism |
logical; If |
guess |
vector; if not |
reflect.time |
logical; If |
maxit |
integer; The maximum number of iterations in the likelihood maximization. By default it is 100. |
K |
integer; The maximum degree. All self-reported degrees above this are recorded as being at least K. By default it is the 95th percentile of the self-reported network sizes. |
verbose |
logical; if this is |
McLaughlin, K.R., M.S. Handcock, and L.G. Johnston, 2015. Inference for the visibility distribution for respondent-driven sampling. In JSM Proceedings. Alexandria, VA: American Statistical Association. 2259-2267.
## Not run:
data(fauxmadrona)
# The next line fits the model for the self-reported personal
# network sizes and imputes the personal network sizes
# It may take up to 60 seconds.
visibility <- impute.visibility(fauxmadrona)
# frequency of estimated personal visibility
table(visibility)
## End(Not run)
Is an instance of rds.data.frame
is.rds.data.frame(x)
x |
An object to be tested. |
Is an instance of rds.interval.estimate
is.rds.interval.estimate(x)
x |
An object to be tested. |
Is an instance of rds.interval.estimate.list. This is a (typically time ordered) sequence of RDS estimates of a comparable quantity.
is.rds.interval.estimate.list(x)
x |
An object to be tested. |
This function takes a series of point estimates and their associated standard errors and computes the p-value for the test of a monotone decrease in the population prevalences (in sequence order). The p-value for a monotone increase is also reported. An optional plot of the estimates and the null distribution of the test statistics is provided.
More formally, let the $k$ population prevalences in sequence order be $p_1, \ldots, p_k$.
We test the null hypothesis:
$$H_0 : p_1 = \ldots = p_k$$
vs
$$H_1 : p_1 \ge p_2 \ge \ldots \ge p_k$$
with at least one inequality strict. The alternative hypothesis is for a monotone decreasing trend.
A likelihood ratio statistic for this test has been derived (Bartholomew 1959).
The null distribution of the likelihood ratio statistic is very complex but can be determined by a simple Monte Carlo process.
Alternatively, we can test the null hypothesis:
$$H_0 : p_1 \ge p_2 \ge \ldots \ge p_k$$
vs
$$H_1 : \overline{H_0}$$
The null distribution of the likelihood ratio statistic is very complex but can be determined by a simple Monte Carlo process.
In both cases we also test for:
$$H_1 : p_1 \le p_2 \le \ldots \le p_k$$
that is, a monotonically increasing trend. The function requires the isotone library.
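To make the Monte Carlo idea concrete, here is a minimal base R sketch of a likelihood ratio test of homogeneity against an ordered alternative, assuming independent normal estimates with known standard errors. The helper names (`pava_increasing`, `lrt_statistic`, `lrt_pvalue`) are hypothetical, the isotonic fit is a hand-written weighted pool-adjacent-violators routine rather than the isotone package, and the result is not guaranteed to match `LRT.trend.test` numerically.

```r
# Weighted pool-adjacent-violators fit for a non-decreasing sequence
pava_increasing <- function(y, w) {
  n <- length(y)
  block_value <- numeric(n)
  block_weight <- numeric(n)
  block_size <- integer(n)
  k <- 0L
  for (i in seq_len(n)) {
    k <- k + 1L
    block_value[k] <- y[i]
    block_weight[k] <- w[i]
    block_size[k] <- 1L
    # Merge adjacent blocks while the ordering is violated
    while (k > 1L && block_value[k - 1L] > block_value[k]) {
      total <- block_weight[k - 1L] + block_weight[k]
      block_value[k - 1L] <- (block_weight[k - 1L] * block_value[k - 1L] +
                              block_weight[k] * block_value[k]) / total
      block_weight[k - 1L] <- total
      block_size[k - 1L] <- block_size[k - 1L] + block_size[k]
      k <- k - 1L
    }
  }
  rep(block_value[seq_len(k)], times = block_size[seq_len(k)])
}

# Likelihood ratio statistic for equality versus a monotone trend
lrt_statistic <- function(x, sigma, decreasing = TRUE) {
  w <- 1 / sigma^2
  fit <- if (decreasing) -pava_increasing(-x, w) else pava_increasing(x, w)
  pooled <- sum(w * x) / sum(w)
  sum(w * (fit - pooled)^2)
}

# Monte Carlo p-value under the null of equal prevalences
lrt_pvalue <- function(x, sigma, decreasing = TRUE, draws = 2000, seed = 1) {
  set.seed(seed)
  observed <- lrt_statistic(x, sigma, decreasing)
  pooled <- sum(x / sigma^2) / sum(1 / sigma^2)
  null_stats <- replicate(draws, {
    x_sim <- rnorm(length(x), mean = pooled, sd = sigma)
    lrt_statistic(x_sim, sigma, decreasing)
  })
  mean(null_stats >= observed)
}

x <- c(0.16, 0.15, 0.3)
sigma <- c(0.04, 0.04, 0.1)
L_decreasing <- lrt_statistic(x, sigma, decreasing = TRUE)
L_increasing <- lrt_statistic(x, sigma, decreasing = FALSE)
p_increasing <- lrt_pvalue(x, sigma, decreasing = FALSE)
```

For these three estimates the decreasing fit collapses to the pooled mean, so the statistic for a decreasing trend is essentially zero, while the increasing fit keeps the third estimate separate and yields a positive statistic.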
LRT.trend.test( data, variables = colnames(data), null = "monotone", confidence.level = 0.95, number.of.bootstrap.samples = 5000, plot = NULL, seed = 1 )
data |
A two row matrix or data.frame of prevalence estimates and
their standard errors. The first row is the prevalence estimates and the
second are the standard errors. The columns are the comparison groups in the
order (e.g., time) in which they are to be assessed. The row names of |
variables |
A character vector of column names to select from |
null |
A character string indicating the null hypothesis to use. The value |
confidence.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
number.of.bootstrap.samples |
The number of Monte Carlo draws to determine the null distribution of the likelihood ratio statistic. |
plot |
A character vector of choices, a subset of |
seed |
The value of the random number seed. Preset by default to allow reproducibility. |
A list with components
pvalue.increasing
: The p-value for the test of a monotone increase in population prevalence.
pvalue.decreasing
: The p-value for the test of a monotone decrease in population prevalence.
L
: The value of the likelihood-ratio statistic.
x
: The passed vector of prevalence estimates in the order (e.g., time).
sigma
: The passed vector of standard error estimates corresponding to x.
Mark S. Handcock
Bartholomew, D. J. (1959). A test of homogeneity for ordered alternatives. Biometrika 46 36-48.
d <- t(data.frame(estimate=c(0.16,0.15,0.3), sigma=c(0.04,0.04,0.1)))
colnames(d) <- c("time_1","time_2","time_3")
LRT.trend.test(d,number.of.bootstrap.samples=1000)
This function takes a series of point estimates and their associated standard errors and computes the p-value for the test of a monotone decrease in the population prevalences (in sequence order). The p-value for a monotone increase is also reported.
More formally, let the $k$ population prevalences in sequence order be $p_1, \ldots, p_k$.
We test the null hypothesis:
$$H_0 : p_1 = \ldots = p_k$$
vs
$$H_1 : p_1 \ge p_2 \ge \ldots \ge p_k$$
with at least one inequality strict. A likelihood ratio statistic for this test has been derived (Bartholomew 1959).
The null distribution of the likelihood ratio statistic is very complex but can be determined by a simple Monte Carlo process.
We also test the null hypothesis:
$$H_0 : p_1 \ge p_2 \ge \ldots \ge p_k$$
vs
$$H_1 : \overline{H_0}$$
The null distribution of the likelihood ratio statistic is very complex but can be determined by a simple Monte Carlo process. The function requires the isotone library.
LRT.value.trend(x, sigma)
x |
A vector of prevalence estimates in the order (e.g., time). |
sigma |
A vector of standard error estimates corresponding to |
A list with components
pvalue.increasing
: The p-value for the test of a monotone increase in population prevalence.
pvalue.decreasing
: The p-value for the test of a monotone decrease in population prevalence.
L
: The value of the likelihood-ratio statistic.
x
: The passed vector of prevalence estimates in the order (e.g., time).
sigma
: The passed vector of standard error estimates corresponding to x.
Mark S. Handcock
Bartholomew, D. J. (1959). A test of homogeneity for ordered alternatives. Biometrika 46 36-48.
## Not run:
x <- c(0.16,0.15,0.3)
sigma <- c(0.04,0.04,0.1)
LRT.value.trend(x,sigma)
## End(Not run)
This function computes the model-assisted (MA) estimates for a categorical variable or numeric variable.
MA.estimates( rds.data, trait.variable, seed.selection = "degree", number.of.seeds = NULL, number.of.coupons = NULL, number.of.iterations = 3, N = NULL, M1 = 25, M2 = 20, seed = 1, initial.sampling.probabilities = NULL, MPLE.samplesize = 50000, SAN.maxit = 5, SAN.nsteps = 2^19, sim.interval = 10000, number.of.cross.ties = NULL, max.degree = NULL, parallel = 1, parallel.type = "PSOCK", full.output = FALSE, verbose = TRUE )
rds.data |
An |
trait.variable |
A string giving the name of the variable in the
|
seed.selection |
An estimate of the mechanism guiding the choice of seeds. The choices are
|
number.of.seeds |
The number of seeds chosen to initiate the sampling. |
number.of.coupons |
The number of coupons given to each respondent. |
number.of.iterations |
The number of iterations used at the core of the algorithm. |
N |
An estimate of the number of members of the population being
sampled. If |
M1 |
The number of networked populations generated at each iteration. |
M2 |
The number of (full) RDS samples generated for each networked population at each iteration. |
seed |
The random number seed used to initiate the computations. |
initial.sampling.probabilities |
Initialize sampling probabilities for the algorithm. If missing, they are taken as proportional to degree, and this is almost always the best starting values. |
MPLE.samplesize |
Number of samples to take in the computation of the maximum pseudolikelihood estimator (MPLE) of the working model parameter. The default is almost always sufficient. |
SAN.maxit |
A ceiling on the number of simulated annealing iterations. |
SAN.nsteps |
Number of MCMC proposals for all the annealing runs combined. |
sim.interval |
Number of MCMC steps between each of the M1 sampled networks per iteration. |
number.of.cross.ties |
The expected number of ties between those with
the trait and those without. If missing, it is computed based on the
respondent's reports of the number of ties they have to population members
who have the trait (i.e. |
max.degree |
Impose ceiling on degree size. |
parallel |
Number of processors to use in the computations. The default is 1, that is no parallel processing. |
parallel.type |
The type of cluster to start. e.g. 'PSOCK', 'MPI', etc. |
full.output |
More verbose output |
verbose |
Should verbose diagnostics be printed while the algorithm is running. |
If trait.variable
is numeric then the model-assisted estimate
of the mean is returned, otherwise a vector of proportion estimates is
returned. If full.output=TRUE
this leads to:
If full.output=FALSE
this leads to an object of class
rds.interval.estimate
which is a list with component
the numerical point estimate of proportion of the trait.variable.
a matrix with six columns and one row per category of trait.variable:
The HT estimate of the population mean.
Lower 95% confidence bound
Upper 95% confidence bound
The design effect of the RDS
standard error
count of the number of sample values with that value of the trait
an rds.data.frame
that indicates recruitment
patterns by a pair of attributes named “id” and “recruiter.id”.
an estimate of the number of members of the population being
sampled. If NULL
it is read as the pop.size.mid
attribute of
the rds.data
frame. If that is missing it defaults to 1000.
the number of networked populations generated at each iteration.
the number of (full) RDS samples generated for each networked population at each iteration.
the random number seed used to initiate the computations.
an estimate of the mechanism guiding the choice of seeds. The choices are
indicating that all the seeds had the trait;
meaning they were, as if, a simple random sample of individuals from the population;
indicating that the seeds are taken as those in the sample (and resampled for the population with that composition if necessary);
is proportional to the degree of the individual;
indicating that all the seeds had the trait and the probability of being a seed is proportional to the degree of the respondent.
The number of seeds chosen to initiate the sampling.
The number of coupons given to each respondent.
The number of iterations used at the core of the algorithm.
The name of the outcome variable
The type of weighting used (i.e. MA)
The type of weighting used (i.e. MA)
A list of other diagnostic output from the computations.
Output from the bootstrap procedure. A list with two
elements: var
is the bootstrap variance, and BSest
is the
vector of bootstrap estimates themselves.
estimate of the parameter of the ERGM for the network.
Krista J. Gile with help from Mark S. Handcock
Gile, Krista J. 2011 Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation, Journal of the American Statistical Association, 106, 135-146.
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
RDS.I.estimates
, RDS.I.estimates
## Not run:
data(faux)
MA.estimates(rds.data=faux,trait.variable='X')
## End(Not run)
Diagnostic plots for the RDS recruitment process
## S3 method for class 'rds.data.frame' plot( x, plot.type = c("Recruitment tree", "Network size by wave", "Recruits by wave", "Recruits per seed", "Recruits per subject"), stratify.by = NULL, ... )
x |
An rds.data.frame object. |
plot.type |
the type of diagnostic. |
stratify.by |
A factor used to color or stratify the plot elements. |
... |
Additional arguments for the underlying plot function if applicable. |
Several types of diagnostics are supported by the plot.type argument. 'Recruitment tree' displays a network plot of the RDS recruitment process. 'Network size by wave' monitors systematic changes in network size based on how far subjects are from the seed. 'Recruits by wave' displays counts of subjects based on how far they are from their seed. 'Recruits per seed' shows the total tree size for each seed. 'Recruits per subject' shows counts of how many subjects are recruited by each subject who is non-terminal.
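The quantities behind the 'Recruits by wave' and 'Recruits per subject' diagnostics can be derived from the id and recruiter.id columns alone. The base R sketch below uses the small five-person data frame from the indexing example and a hypothetical helper `wave_of`; it assumes seeds are coded with recruiter.id equal to -1 and is meant only to show what is being counted, not how the plot method is implemented.

```r
# Small recruitment data set: seeds have recruiter.id == -1
dat <- data.frame(id = c(1, 2, 3, 4, 5),
                  recruiter.id = c(2, -1, 2, -1, 4))

# Wave = number of recruitment steps from a subject back to their seed
wave_of <- function(id, recruiter.id, seed.id = -1) {
  wave <- rep(NA_integer_, length(id))
  wave[recruiter.id == seed.id] <- 0L
  repeat {
    # Wave of each subject's recruiter (NA for seeds and unresolved recruiters)
    parent_wave <- wave[match(recruiter.id, id)]
    todo <- is.na(wave) & !is.na(parent_wave)
    if (!any(todo)) break
    wave[todo] <- parent_wave[todo] + 1L
  }
  wave
}

wave <- wave_of(dat$id, dat$recruiter.id)

# Counts of subjects by wave ('Recruits by wave')
recruits_by_wave <- table(wave)

# Number of recruits for each subject who recruited anyone ('Recruits per subject')
recruits_per_subject <- table(dat$recruiter.id[dat$recruiter.id != -1])
```

In this data set subjects 2 and 4 are seeds (wave 0) and the other three subjects are in wave 1; subject 2 recruited two people and subject 4 recruited one.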
Either nothing (for the recruitment tree plot), or a ggplot2 object.
data(fauxmadrona)
## Not run:
plot(fauxmadrona)
## End(Not run)
plot(fauxmadrona, plot.type='Recruits by wave')
plot(fauxmadrona, plot.type='Recruits per seed')
plot(fauxmadrona, plot.type='Recruits per subject')
plot(fauxmadrona, plot.type='Recruits by wave', stratify.by='disease')
plot(fauxmadrona, plot.type='Recruits per seed', stratify.by='disease')
plot(fauxmadrona, plot.type='Recruits per subject', stratify.by='disease')
Prints a differential.activity.estimate object
## S3 method for class 'differential.activity.estimate' print(x, ...)
x |
a differential.activity.estimate object |
... |
unused |
Displays a pvalue.table
## S3 method for class 'pvalue.table' print(x, ...)
x |
a pvalue.table object |
... |
additional parameters passed to print.data.frame. |
Displays an rds.contin.bootstrap
## S3 method for class 'rds.contin.bootstrap' print(x, show.table = FALSE, ...)
x |
an rds.contin.bootstrap object |
show.table |
Display weighted contingency table |
... |
additional parameters passed to print.matrix. |
Displays an rds.data.frame
## S3 method for class 'rds.data.frame' print(x, ...)
x |
an rds.data.frame object |
... |
additional parameters passed to print.data.frame. |
Prints an rds.interval.estimate object
## S3 method for class 'rds.interval.estimate' print(x, as.percentage = NULL, ...)
x |
an |
as.percentage |
logical. Print the interval estimates as percentages (as distinct from proportions). The default, NULL, means that it will determine if the variable is discrete or continuous and only print them as percentages if they are discrete. |
... |
unused |
print.summary.svyglm.RDS is a version of print.summary.svyglm that reports odds-ratios in place of coefficients in the summary table. This only applies for the binomial family. Otherwise it is identical to print.summary.svyglm. The default in print.summary.svyglm is to display the log-odds-ratios, whereas this method displays the exponentiated form and a 95% confidence interval. The p-values are still displayed.
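The relationship between the log-odds coefficients and the odds-ratio table can be illustrated with an ordinary binomial `glm` as a stand-in for a survey-weighted fit. The data below are simulated, the interval is a Wald interval from `confint.default`, and the column labels are made up for this sketch; the actual print method may compute or label its interval differently.

```r
# Simulated binary outcome with one covariate (hypothetical data)
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Coefficients are on the log-odds scale
est <- coef(summary(fit))
ci <- confint.default(fit, level = 0.95)   # Wald interval on the log-odds scale

# Exponentiate to obtain odds ratios and their interval; p-values are unchanged
or_table <- cbind(
  "Odds ratio" = exp(est[, "Estimate"]),
  "Lower 95%"  = exp(ci[, 1]),
  "Upper 95%"  = exp(ci[, 2]),
  "p-value"    = est[, "Pr(>|z|)"]
)
or_table
```

Exponentiating is monotone, so the interval endpoints can be transformed directly and the tests of the coefficients carry over unchanged.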
## S3 method for class 'summary.svyglm.RDS' print( x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ... )
x |
an object of class |
digits |
the number of significant digits to use when printing. |
symbolic.cor |
logical. If |
signif.stars |
logical. If |
... |
further arguments passed to or from other methods. |
## For examples see example(svyglm)
This function computes an interval estimate for one or more categorical variables. It optionally uses attributes of the RDS data set to determine the type of estimator and type of uncertainty estimate to use.
RDS.bootstrap.intervals( rds.data, outcome.variable, weight.type = NULL, uncertainty = NULL, N = NULL, subset = NULL, confidence.level = 0.95, number.of.bootstrap.samples = NULL, fast = TRUE, useC = TRUE, ci.type = "t", control = control.rds.estimates(), to.factor = FALSE, cont.breaks = 3, ... )
rds.data |
An |
outcome.variable |
A string giving the name of the variable in the
|
weight.type |
A string giving the type of estimator to use. The options
are |
uncertainty |
A string giving the type of uncertainty estimator to use.
The options are |
N |
An estimate of the number of members of the population being
sampled. If |
subset |
An optional criterion to subset |
confidence.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
number.of.bootstrap.samples |
The number of bootstrap samples to take
in estimating the uncertainty of the estimator. If |
fast |
Use a fast bootstrap where the weights are reused from the estimator rather than being recomputed for each bootstrap sample. |
useC |
Use a C-level implementation of Gile's bootstrap (rather than the R level). The two implementations should be computationally equivalent (except for speed). |
ci.type |
Type of confidence interval to use, if possible. If "t", use lower and upper confidence interval values based on the standard deviation of the bootstrapped values and a t multiplier. If "pivotal", use lower and upper confidence interval values based on the basic bootstrap (also called the pivotal confidence interval). If "quantile", use lower and upper confidence interval values based on the quantiles of the bootstrap sample. If "proportion", use the "t" unless the estimated proportion is less than 0.15 or the bounds are outside [0,1]. In this case, try the "quantile" and constrain the bounds to be compatible with [0,1]. |
control |
A list of control parameters for algorithm
tuning. Constructed using |
to.factor |
force variable to be a factor |
cont.breaks |
For continuous variates, some bootstrap procedures require categorical data. In these cases, in order to construct each bootstrap replicate, the outcome variable is split into cont.breaks categories. |
... |
Additional arguments for RDS.*.estimates. |
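The three basic interval constructions named under ci.type ("t", "pivotal" and "quantile") can be compared on a set of bootstrap replicates. The sketch below uses simulated replicates and a hypothetical helper `boot_ci`; the degrees of freedom for the t multiplier are an assumption, and the code is not taken from the package.

```r
# Interval from a point estimate and bootstrap replicates (illustrative helper)
boot_ci <- function(est, boot, level = 0.95,
                    type = c("t", "pivotal", "quantile"),
                    df = length(boot) - 1) {   # df is an assumption of this sketch
  type <- match.arg(type)
  alpha <- 1 - level
  q <- unname(quantile(boot, c(alpha / 2, 1 - alpha / 2)))
  switch(type,
    # Estimate plus or minus a t multiplier times the bootstrap standard deviation
    t = est + c(-1, 1) * qt(1 - alpha / 2, df) * sd(boot),
    # Basic (pivotal) bootstrap: reflect the bootstrap quantiles about the estimate
    pivotal = 2 * est - rev(q),
    # Percentile interval: the bootstrap quantiles themselves
    quantile = q
  )
}

# Simulated bootstrap replicates around a hypothetical estimate of 0.2
set.seed(1)
est <- 0.2
boot <- rnorm(500, mean = est, sd = 0.03)

ci_t <- boot_ci(est, boot, type = "t")
ci_p <- boot_ci(est, boot, type = "pivotal")
ci_q <- boot_ci(est, boot, type = "quantile")
rbind(t = ci_t, pivotal = ci_p, quantile = ci_q)
```

When the bootstrap distribution is roughly symmetric about the estimate, as here, the three intervals are close; they diverge for skewed distributions, such as those of proportions near 0 or 1, which is the situation the "proportion" option is described as handling.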
An object of class rds.interval.estimate summarizing the inference. The confidence interval and standard error are based on the bootstrap procedure. In addition, the object has attribute bsresult which provides details of the bootstrap procedure. The contents of the bsresult attribute depend on the uncertainty used. If uncertainty=="Salganik" then bsresult is a vector of standard deviations of the bootstrap samples. If uncertainty=="Gile's SS" then bsresult is a list with components for the bootstrap point estimate, the bootstrap samples themselves and the standard deviations of the bootstrap samples. If uncertainty=="SRS" then bsresult is NULL.
Gile, Krista J. 2011 Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation, Journal of the American Statistical Association, 106, 135-146.
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
## Not run:
data(fauxmadrona)
RDS.bootstrap.intervals(rds.data=fauxmadrona,weight.type="RDS-II",
  uncertainty="Salganik",
  outcome.variable="disease",N=1000,number.of.bootstrap.samples=50)
data(fauxtime)
RDS.bootstrap.intervals(rds.data=fauxtime,weight.type="HCG",
  uncertainty="HCG",
  outcome.variable="var1",N=1000,number.of.bootstrap.samples=10)
## End(Not run)
Compares the rates of two variables against one another.
RDS.compare.proportions(first.interval, second.interval, M = 10000)
first.interval |
An |
second.interval |
An |
M |
The number of bootstrap resamplings to use |
This function performs a bootstrap test comparing the rates of two variables against one another.
## Not run:
data(faux)
int1 <- RDS.bootstrap.intervals(faux, outcome.variable=c("X"),
  weight.type="RDS-II", uncertainty="Salganik", N=1000,
  number.ss.samples.per.iteration=1000,
  confidence.level=0.95, number.of.bootstrap.samples=100)
int2 <- RDS.bootstrap.intervals(faux, outcome.variable=c("Y"),
  weight.type="RDS-II", uncertainty="Salganik", N=1000,
  number.ss.samples.per.iteration=1000,
  confidence.level=0.95, number.of.bootstrap.samples=100)
RDS.compare.proportions(int1,int2)
## End(Not run)
Compares the rates of two variables against one another.
RDS.compare.two.proportions( data, variables, confidence.level = 0.95, number.of.bootstrap.samples = 5000, plot = FALSE, seed = 1 )
data |
An object of class |
variables |
A character vector of column names to select from |
confidence.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
number.of.bootstrap.samples |
The number of Monte Carlo draws to determine the null distribution of the likelihood ratio statistic. |
plot |
Logical, if TRUE then a plot is produced of the null distribution of the likelihood ratio statistic, with the observed statistic plotted as a vertical dashed line. |
seed |
The value of the random number seed. Preset by default to allow reproducibility. |
An object of class pvalue.table containing the cross-tabulation of p-values for comparing the two classes.
This function computes the Homophily Configuration Graph type estimates for a categorical variable.
RDS.HCG.estimates( rds.data, outcome.variable, N = NULL, subset = NULL, small.fraction = FALSE, empir.lik = TRUE, to.factor = FALSE, cont.breaks = 3 )
rds.data |
An |
outcome.variable |
A string giving the name of the variable in the
|
N |
Population size to be used to calculate the empirical likelihood interval. If NULL, this value is taken to be the population.size.mid attribute of the data and if that is not set, no finite population correction is used. |
subset |
An optional criterion to subset |
small.fraction |
Should a small sample fraction be assumed |
empir.lik |
Should confidence intervals be estimated using empirical likelihood. |
to.factor |
force variable to be a factor |
cont.breaks |
If variable is numeric, how many discretization points should be used in the calculation of the weights. |
If empir.lik is true, an object of class rds.interval.estimate is returned. This is a list with components
estimate: The numerical point estimate of proportion of the trait.variable.
interval: A matrix with six columns and one row per category of trait.variable:
  point estimate: The HT estimate of the population mean.
  95% Lower Bound: Lower 95% confidence bound.
  95% Upper Bound: Upper 95% confidence bound.
  Design Effect: The design effect of the RDS.
  s.e.: Standard error.
  n: Count of the number of sample values with that value of the trait.
Otherwise, an object of class rds.HCG.estimate is returned.
Ian E. Fellows
RDS.I.estimates, RDS.II.estimates, RDS.SS.estimates
data(fauxtime)
RDS.HCG.estimates(rds.data=fauxtime,outcome.variable='var1')
This function computes the RDS-I type estimates for a categorical variable. It is also referred to as the Salganik-Heckathorn estimator.
RDS.I.estimates( rds.data, outcome.variable, N = NULL, subset = NULL, smoothed = FALSE, empir.lik = TRUE, to.factor = FALSE, cont.breaks = 3 )
rds.data |
An |
outcome.variable |
A string giving the name of the variable in the
|
N |
Population size to be used to calculate the empirical likelihood interval. If NULL, this value is taken to be the population.size.mid attribute of the data and if that is not set, no finite population correction is used. |
subset |
An optional criterion to subset |
smoothed |
Logical, if TRUE then the “data smoothed” version of RDS-I is used, where it is assumed that the observed Markov process is reversible. |
empir.lik |
Should confidence intervals be estimated using empirical likelihood. |
to.factor |
force variable to be a factor |
cont.breaks |
The number of categories used for the RDS-I adjustment when the variate is continuous. |
If empir.lik is true, an object of class rds.interval.estimate is returned. This is a list with components
estimate: The numerical point estimate of proportion of the trait.variable.
interval: A matrix with six columns and one row per category of trait.variable:
  point estimate: The HT estimate of the population mean.
  95% Lower Bound: Lower 95% confidence bound.
  95% Upper Bound: Upper 95% confidence bound.
  Design Effect: The design effect of the RDS.
  s.e.: Standard error.
  n: Count of the number of sample values with that value of the trait.
Otherwise, an object of class rds.I.estimate is returned.
Mark S. Handcock and W. Whipple Neely
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
Neely, W. W., 2009. Bayesian methods for data from respondent driven sampling. Dissertation in-progress, Department of Statistics, University of Wisconsin, Madison.
Salganik, M., Heckathorn, D. D., 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34, 193-239.
Volz, E., Heckathorn, D., 2008. Probability based estimation theory for Respondent Driven Sampling. The Journal of Official Statistics 24 (1), 79-97.
RDS.II.estimates, RDS.SS.estimates
data(faux)
RDS.I.estimates(rds.data=faux,outcome.variable='X')
RDS.I.estimates(rds.data=faux,outcome.variable='X',smoothed=TRUE)
RDS-I weights
rds.I.weights(rds.data, outcome.variable, N = NULL, smoothed = FALSE, ...)
rds.data |
An rds.data.frame |
outcome.variable |
The variable used to base the weights on. |
N |
Population size |
smoothed |
Should the data smoothed RDS-I weights be computed. |
... |
Unused |
This function computes the RDS-II estimates for a categorical variable or the RDS-II estimate for a numeric variable.
RDS.II.estimates( rds.data, outcome.variable, N = NULL, subset = NULL, empir.lik = TRUE, to.factor = FALSE )
rds.data |
An |
outcome.variable |
A string giving the name of the variable in the
|
N |
Population size to be used to calculate the empirical likelihood interval. If NULL, this value is taken to be the population.size.mid attribute of the data and if that is not set, no finite population correction is used. |
subset |
An optional criterion to subset |
empir.lik |
If true, and outcome.variable is numeric, standard errors based on empirical likelihood will be given. |
to.factor |
force variable to be a factor |
If outcome.variable is numeric then the RDS-II estimate of the mean is returned, otherwise a vector of proportion estimates is returned.
If empir.lik is true, an object of class rds.interval.estimate is returned. This is a list with components
estimate: The numerical point estimate of proportion of the trait.variable.
interval: A matrix with six columns and one row per category of trait.variable:
  point estimate: The HT estimate of the population mean.
  95% Lower Bound: Lower 95% confidence bound.
  95% Upper Bound: Upper 95% confidence bound.
  Design Effect: The design effect of the RDS.
  s.e.: Standard error.
  n: Count of the number of sample values with that value of the trait.
Otherwise, an object of class rds.II.estimate is returned.
Mark S. Handcock and W. Whipple Neely
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
Salganik, M., Heckathorn, D. D., 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34, 193-239.
Volz, E., Heckathorn, D., 2008. Probability based estimation theory for Respondent Driven Sampling. The Journal of Official Statistics 24 (1), 79-97.
RDS.I.estimates, RDS.SS.estimates
data(faux)
RDS.II.estimates(rds.data=faux,outcome.variable='X')
RDS.II.estimates(rds.data=faux,outcome.variable='X',subset= Y!="blue")
This function creates an object of class rds.interval.estimate.
rds.interval.estimate( estimate, outcome.variable, weight.type, uncertainty, weights, N = NULL, conf.level = 0.95, csubset = "" )
estimate |
The numerical point estimate of proportion of the
|
outcome.variable |
A string giving the name of the variable in the
|
weight.type |
A string giving the type of estimator to use. The options
are |
uncertainty |
A string giving the type of uncertainty estimator to use.
The options are |
weights |
A numerical vector of sampling weights for the sample, in order of the sample. They should be inversely proportional to the first-order inclusion probabilities, although this is not assessed or enforced. |
N |
An estimate of the number of members of the population being
sampled. If |
conf.level |
The confidence level for the confidence intervals. The default is 0.95 for 95%. |
csubset |
A character string representing text to add to the output label. Typically this will be the expression used to define the subset of the data used for the estimate. |
An object of class rds.interval.estimate is returned. This is a list with components
estimate: The numerical point estimate of proportion of the trait.variable.
interval: A matrix with six columns and one row per category of trait.variable:
  point estimate: The HT estimate of the population mean.
  95% Lower Bound: Lower 95% confidence bound.
  95% Upper Bound: Upper 95% confidence bound.
  Design Effect: The design effect of the RDS.
  s.e.: Standard error.
  n: Count of the number of sample values with that value of the trait.
Mark S. Handcock
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
Salganik, M., Heckathorn, D. D., 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34, 193-239.
Volz, E., Heckathorn, D., 2008. Probability based estimation theory for Respondent Driven Sampling. The Journal of Official Statistics 24 (1), 79-97.
data(faux)
RDS.I.estimates(rds.data=faux,outcome.variable='X',smoothed=TRUE)
This function computes the sequential sampling (SS) estimates for a categorical variable or numeric variable.
RDS.SS.estimates( rds.data, outcome.variable, N = NULL, subset = NULL, number.ss.samples.per.iteration = 500, number.ss.iterations = 5, control = control.rds.estimates(), hajek = TRUE, empir.lik = TRUE, to.factor = FALSE )
rds.data |
An |
outcome.variable |
A string giving the name of the variable in the
|
N |
An estimate of the number of members of the population being
sampled. If |
subset |
An optional criterion to subset |
number.ss.samples.per.iteration |
The number of samples to take in estimating the inclusion probabilities in each iteration of the sequential sampling algorithm. If |
number.ss.iterations |
The number of iterations of the sequential sampling algorithm. If that is missing it defaults to 5. |
control |
A list of control parameters for algorithm
tuning. Constructed using |
hajek |
Logical; if TRUE, use the Hajek-type estimator of Gile (2011); if FALSE, use the standard Horvitz-Thompson estimator. The default is TRUE. |
empir.lik |
If true, and outcome.variable is numeric, standard errors based on empirical likelihood will be given. |
to.factor |
force variable to be a factor |
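The hajek argument chooses between two standard inverse-probability-weighted estimators. As a rough base-R illustration (not the package's internal code), the Horvitz-Thompson form divides the weighted total by a known population size, while the Hajek form divides by the sum of the weights. The vectors `y` and `p` and the value of `N` below are invented for this sketch.

```r
# Toy data: a binary outcome and assumed (hypothetical) inclusion probabilities.
y <- c(1, 0, 1, 1, 0)
p <- c(0.10, 0.05, 0.20, 0.10, 0.05)
N <- 60                              # assumed population size

w <- 1 / p                           # inverse-probability weights

ht_mean    <- sum(w * y) / N         # Horvitz-Thompson: normalise by the known N
hajek_mean <- sum(w * y) / sum(w)    # Hajek: normalise by the estimated size sum(w)

print(c(HT = ht_mean, Hajek = hajek_mean))
```

The Hajek form is a ratio of two weighted sums, so for a binary outcome it always lies between 0 and 1, whereas the Horvitz-Thompson form can fall outside that range when the weights do not sum to N.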
If outcome.variable is numeric then the Gile SS estimate of the mean is returned, otherwise a vector of proportion estimates is returned.
If empir.lik is true, an object of class rds.interval.estimate is returned. This is a list with components
estimate: The numerical point estimate of proportion of the trait.variable.
interval: A matrix with six columns and one row per category of trait.variable:
  point estimate: The HT estimate of the population mean.
  95% Lower Bound: Lower 95% confidence bound.
  95% Upper Bound: Upper 95% confidence bound.
  Design Effect: The design effect of the RDS.
  s.e.: Standard error.
  n: Count of the number of sample values with that value of the trait.
Otherwise, an object of class rds.SS.estimate is returned.
Krista J. Gile with help from Mark S. Handcock
Gile, Krista J. 2011 Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation, Journal of the American Statistical Association, 106, 135-146.
Gile, Krista J., Handcock, Mark S., 2010. Respondent-driven Sampling: An Assessment of Current Methodology, Sociological Methodology, 40, 285-327. <doi:10.1111/j.1467-9531.2010.01223.x>
Gile, Krista J., Beaudry, Isabelle S. and Handcock, Mark S., 2018 Methods for Inference from Respondent-Driven Sampling Data, Annual Review of Statistics and Its Application <doi:10.1146/annurev-statistics-031017-100704>.
Gile, Krista J., Handcock, Mark S., 2015 Network Model-Assisted Inference from Respondent-Driven Sampling Data, Journal of the Royal Statistical Society, A. <doi:10.1111/rssa.12091>.
Salganik, M., Heckathorn, D. D., 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34, 193-239.
Volz, E., Heckathorn, D., 2008. Probability based estimation theory for Respondent Driven Sampling. The Journal of Official Statistics 24 (1), 79-97.
RDS.I.estimates, RDS.II.estimates
data(fauxmadrona)
RDS.SS.estimates(rds.data=fauxmadrona,outcome.variable="disease",N=1000)
Create RDS samples with given characteristics
rdssampleC( net, nnodes = network.size(net), nsamp0, fixinitial, nsamp, replace, coupons, select = NULL, bias = NULL, rds.samp = NULL, seed.distribution = NULL, attrall = FALSE, trait.variable = "disease", nsims = 1, seeds = NULL, prob.network.recall = 1, verbose = TRUE )
net |
the network object from which to draw a sample |
nnodes |
the number of nodes in the network [at least as default] |
nsamp0 |
the number of seeds to be drawn (i.e. the size of the 0th wave of sampling) |
fixinitial |
a variable that indicates the distribution from which to draw the initial seeds, if the seeds variable is NULL and the seed.distribution variable is NULL |
nsamp |
number of individuals in each RDS sample |
replace |
sampling with replacement |
coupons |
number of coupons |
select |
not used |
bias |
not used |
rds.samp |
not used |
seed.distribution |
a variable [what kind?] that indicates the distribution from which to draw the initial seeds |
attrall |
Whether all the information about the sample should be returned [??] |
trait.variable |
attribute of interest |
nsims |
number of RDS samples to draw |
seeds |
an array of seeds. Default is NULL, in which case the function draws the seeds from the nodes of the network. |
prob.network.recall |
simulates the probability that an individual will remember any particular link |
verbose |
Print verbose output |
A list with the following elements:
nsample: vector of indices of sampled nodes
wsample: vector of waves of each sampled node
degsample: vector of degrees of sampled nodes
attrsample: vector of attributes of sampled nodes
toattr: vector of numbers of referrals to nodes with the attribute
tonoattr: vector of numbers of referrals to nodes without the attribute
nominators: recruiter of each sampled node
rds.data.frame
This function imports RDSAT data files as rds.data.frame objects.
read.rdsat(file, delim = c("<auto>", "\t", " ", ","), N = NULL)
file |
the name of the file which the data are to be read from. If it does not contain an _absolute_ path, the file name is _relative_ to the current working directory, 'getwd()'. Tilde-expansion is performed where supported. As from R 2.10.0 this can be a compressed file (see 'file') |
delim |
The separator defining columns. <auto> will guess the delimiter based on the file. |
N |
The population size (Optional). |
fn <- paste0(path.package("RDS"),"/extdata/nyjazz.rdsat")
rd <- read.rdsat(fn)
plot(rd)
Import data saved using write.rdsobj
read.rdsobj(file)
file |
the name of the file which the data are to be read from. If it does not contain an _absolute_ path, the file name is _relative_ to the current working directory, 'getwd()'. Tilde-expansion is performed where supported. As from R 2.10.0 this can be a compressed file (see 'file') |
Plots the recruitment network using the Reingold Tilford algorithm.
reingold.tilford.plot( x, vertex.color = NULL, vertex.color.scale = hue_pal(), vertex.size = 2, vertex.size.range = c(1, 5), edge.arrow.size = 0, vertex.label.cex = 0.2, vertex.frame.color = NA, vertex.label = get.id(x), show.legend = TRUE, plot = TRUE, ... )
x |
An rds.data.frame |
vertex.color |
The name of the categorical variable in x to color the points with. |
vertex.color.scale |
The scale to create the color palette. |
vertex.size |
The size of the vertex points. either a number or the name of a column of x. |
vertex.size.range |
If vertex.size represents a variable, vertex.size.range is a vector of length 2 representing the minimum and maximum cex for the points. |
edge.arrow.size |
The size of the arrow from recruiter to recruitee. |
vertex.label.cex |
The size expansion factor for the vertex.labels. |
vertex.frame.color |
the color of the outside of the vertex.points. |
vertex.label |
The name of a variable to use as vertex labels. NA implies no labels. |
show.legend |
If true and either vertex.color or vertex.size represent variables, legends will be displayed at the bottom of the plot. |
plot |
Logical, if TRUE then a plot of the recruitment tree is produced. |
... |
Additional parameters passed to plot.igraph. |
A two-column vector of the positions of the nodes in the recruitment tree.
## Not run:
data(fauxmadrona)
data(faux)
reingold.tilford.plot(faux)
reingold.tilford.plot(fauxmadrona,vertex.color="disease")
## End(Not run)
Determines the recruiter.id from recruitment coupon information
rid.from.coupons( data, subject.coupon = NULL, coupon.variables, subject.id = NULL, seed.id = "seed" )
data |
a data.frame |
subject.coupon |
The variable representing the coupon returned by subject |
coupon.variables |
The variable representing the coupon ids given to the subject |
subject.id |
The variable representing the subject's id |
seed.id |
The recruiter.id to assign to seed subjects. |
fpath <- system.file("extdata", "nyjazz.csv", package="RDS")
dat <- read.csv(fpath)
dat$recruiter.id <- rid.from.coupons(dat, "own.coupon", paste0("coupon.",1:7), "id")
# create an rds.data.frame
rds <- as.rds.data.frame(dat, network.size="network.size")
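To make the matching rule concrete, here is a minimal base-R sketch of how a recruiter id could be derived from coupon columns. It is an illustration only, not the code used by rid.from.coupons: the helper `recruiter_from_coupons` and the toy data are hypothetical, and it assumes each coupon number is issued to at most one subject.

```r
# Hypothetical helper (not part of the RDS package): a subject's recruiter is
# the subject who was issued the coupon that this subject returned.
recruiter_from_coupons <- function(data, subject.coupon, coupon.variables,
                                   subject.id, seed.id = "seed") {
  own <- as.character(data[[subject.coupon]])
  ids <- as.character(data[[subject.id]])
  rid <- rep(seed.id, nrow(data))      # everyone starts out as a seed
  for (cv in coupon.variables) {
    issued <- as.character(data[[cv]])
    hit <- match(own, issued)          # row of the subject holding this coupon
    found <- !is.na(own) & !is.na(hit) # ignore subjects with no returned coupon
    rid[found] <- ids[hit[found]]
  }
  rid
}

# Toy data: "a" is a seed who recruits "b" and "c"; "b" recruits "d".
toy <- data.frame(
  id         = c("a", "b", "c", "d"),
  own.coupon = c(NA, "101", "102", "201"),
  coupon.1   = c("101", "201", NA, NA),
  coupon.2   = c("102", NA, NA, NA),
  stringsAsFactors = FALSE
)
toy$recruiter.id <- recruiter_from_coupons(
  toy, "own.coupon", c("coupon.1", "coupon.2"), "id"
)
print(toy)
```

Subjects whose returned coupon matches no issued coupon keep the seed.id value, mirroring the role of the seed.id argument described above.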
This function sets the class of the control list, with the default being the name of the calling function.
set.control.class( myname = as.character(RDS::ult(sys.calls(), 2)[[1L]]), control = get("control", pos = parent.frame()) )
myname |
Name of the class to set. |
control |
Control list. Defaults to the |
The control list with class set.
check.control.class, print.control.list
Displays an rds.data.frame
show.rds.data.frame(x, ...)
x |
an rds.data.frame object. |
... |
additional parameters passed to print.data.frame. |
RDS::summary.svyglm.RDS is a version of summary.svyglm that reports odds-ratios in place of coefficients in the summary table. This only applies for the binomial family. Otherwise it is identical to summary.svyglm.
The default in summary.svyglm is to display the log-odds-ratios, whereas this displays the exponentiated form and a 95% confidence interval. p-values are still displayed.
## S3 method for class 'svyglm.RDS' summary(object, correlation = FALSE, df.resid = NULL, odds = TRUE, ...)
object |
an object of class |
correlation |
logical; if |
df.resid |
Optional denominator degrees of freedom for Wald tests. |
odds |
logical; Should the coefficients be reported as odds (rather than log-odds)? |
... |
further arguments passed to or from other methods. |
svyglm fits a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.
There is no anova method for svyglm as the models are not fitted by maximum likelihood.
See the manual page on svyglm for detail of that function.
RDS::summary.svyglm returns an object of class "summary.svyglm.RDS", a list with components
call |
the component from |
family |
the component
from |
deviance |
the component from |
contrasts |
the component from |
df.residual |
the
component from |
null.deviance |
the component from
|
df.null |
the component from |
deviance.resid |
the deviance residuals: see
|
coefficients |
the matrix of coefficients, standard errors, z-values and p-values. Aliased coefficients are omitted. |
aliased |
named logical vector showing if the original coefficients are aliased. |
dispersion |
either the supplied argument or
the inferred/estimated dispersion if the latter is |
df |
a 3-vector of the rank of the model and the number of residual degrees of freedom, plus number of coefficients (including aliased ones). |
cov.unscaled |
the unscaled ( |
cov.scaled |
ditto,
scaled by |
correlation |
(only if |
symbolic.cor |
(only if |
odds |
Are the coefficients reported as odds (rather than log-odds)? |
## For examples see example(svyglm)
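As a rough illustration of what reporting odds rather than log-odds means, the sketch below exponentiates the coefficients of an ordinary binomial glm and builds a Wald-type 95% interval on the odds scale. It uses stats::glm on simulated data rather than svyglm, so no survey design or weights are involved, and the table layout is an assumption rather than the exact output of summary.svyglm.RDS.

```r
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

est <- coef(summary(fit))   # log-odds scale: Estimate, Std. Error, z value, Pr(>|z|)
z <- qnorm(0.975)           # normal quantile for a two-sided 95% interval

# Exponentiate the estimate and the interval end points; p-values are unchanged.
odds_table <- cbind(
  "Odds Ratio" = exp(est[, "Estimate"]),
  "95% Lower"  = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  "95% Upper"  = exp(est[, "Estimate"] + z * est[, "Std. Error"]),
  "Pr(>|z|)"   = est[, "Pr(>|z|)"]
)
print(odds_table)
```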
Calculates the MLE of the Markov transition matrix, i.e. the row proportions of the transition count matrix.
transition.counts.to.Markov.mle(transition.counts)
transition.counts |
a matrix or table of transition counts |
Deprecated. Use prop.table(transition.counts, 1) instead.
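A small base-R illustration of the prop.table(transition.counts, 1) call mentioned above. The 2 x 2 count matrix and its state labels are made up for the example.

```r
# Hypothetical transition counts: rows are the "from" state, columns the "to" state.
transition.counts <- matrix(c(8, 2,
                              3, 7),
                            nrow = 2, byrow = TRUE,
                            dimnames = list(from = c("A", "B"),
                                            to   = c("A", "B")))

# Dividing each row by its total gives the estimated transition probabilities.
P <- prop.table(transition.counts, 1)
print(P)
```

Each row of `P` sums to 1, as a transition matrix should.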
Extract or replace the *ult*imate (last) element of a vector or a list, or an element counting from the end.
ult(x, i = 1L)
x |
a vector or a list. |
i |
index from the end of the list to extract or replace (where 1 is the last element, 2 is the penultimate element, etc.). |
An element of 'x'.
x <- 1:5
(last <- ult(x))
(penultimate <- ult(x, 2)) # 2nd last.
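For readers without the package loaded, the extraction behaviour described above can be approximated in base R by indexing from the end. The helper `ult_sketch` is hypothetical and covers extraction only, not replacement.

```r
# Hypothetical stand-in for ult(): i = 1 is the last element, i = 2 the
# penultimate element, and so on.
ult_sketch <- function(x, i = 1L) x[[length(x) - i + 1L]]

x <- 1:5
last <- ult_sketch(x)
penultimate <- ult_sketch(x, 2)
print(c(last = last, penultimate = penultimate))
```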
Volz-Heckathorn (RDS-II) weights
vh.weights(degs, N = NULL)
degs |
The degrees (i.e. network sizes) of the sample units. |
N |
Population size |
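The idea behind Volz-Heckathorn weighting is that each sampled unit is weighted inversely to its reported degree. The sketch below shows that idea in base R with made-up data. The normalisation used here (weights summing to 1) is an assumption for illustration; the scaling that vh.weights itself applies, and how it uses N, is not documented in this section.

```r
# Toy sample: reported network sizes and a binary outcome (both invented).
degs <- c(2, 4, 4, 8, 1)
y    <- c(1, 0, 1, 0, 1)

# Inverse-degree weights, normalised to sum to 1 (assumed normalisation).
w <- (1 / degs) / sum(1 / degs)

# Weighted proportion of y == 1, in the spirit of the RDS-II estimator.
rds2_est <- sum(w * y)
print(rds2_est)
```

High-degree respondents are more likely to be recruited, so they are down-weighted relative to low-degree respondents.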
writes an rds.data.frame recruitment tree as a GraphViz file
write.graphviz(x, file)
x |
An rds.data.frame. |
file |
A character vector representing the file |
Writes out the RDS tree in NetDraw format
write.netdraw(x, file = NULL, by.seed = FALSE)
x |
An rds.data.frame. |
file |
a character vector representing a file. |
by.seed |
If true, separate files will be created for each seed. |
If by.seed is false, two files are created using 'file' as a base name. paste0(file,".DL") contains the edge information, and paste0(file,".vna") contains the nodal attributes.
Writes out the RDS tree in RDSAT format
write.rdsat(x, file = NULL)
x |
An rds.data.frame. |
file |
a character vector representing a file. |
Export an rds.data.frame to file
write.rdsobj(x, file)
x |
The rds.data.frame to export |
file |
The name of the file to create. |