| Title: | Nonparametric Probabilistic-Statistical Variate Analysis with Automated Markov-Chain Monte Carlo |
|---|---|
| Description: | Calculate posterior joint and conditional probabilities, probability distributions of population frequencies, and information-theoretic measures, by means of Bayesian nonparametric methods. Data imputation is automatic and done in a principled way. Markov-chain Monte Carlo calculations are automatically handled and do not require user supervision. Applications range from statistical estimation and probabilistic hypothesis testing to evidence-based inference and decision making, in a wide range of disciplines from astrophysics to medicine. For more details and examples see for instance Porta Mana et al. (2026) <doi:10.31219/osf.io/8nr56>, Dunson & Bhattacharya (2011) <doi:10.1093/acprof:oso/9780199694587.003.0005>, Lindley & Novick (1981) <doi:10.1214/aos/1176345331>, Bernardo & Smith (2000) <doi:10.1002/9780470316870>, Müller et al. (2015) <doi:10.1007/978-3-319-18968-0>. Requires the packages 'Nimble', 'parallel', 'extraDistr'. |
| Authors: | PierGianLuca Porta Mana [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-6070-0784>), Aurora Grefsrud [ctb] (ORCID: <https://orcid.org/0009-0005-9063-4131>), Håkon Mydland [ctb] (ORCID: <https://orcid.org/0009-0004-9558-8894>), Maksim Ohvrill [ctb], Simen Hesthamar Hauge [ctb] (ORCID: <https://orcid.org/0009-0005-6158-2846>) |
| Maintainer: | PierGianLuca Porta Mana <[email protected]> |
| License: | AGPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-07-02 21:08:31 UTC |
| Source: | https://github.com/pglpm/prova |
Plot function that modifies and expands the graphics package's graphics::matplot() function in several ways.
flexiplot( x, y, type = NULL, lty = c(1, 2, 4, 3, 6, 5), lwd = 2, pch = c(1, 2, 0, 5, 6, 3), col = palette(), xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add = FALSE, xdomain = NULL, ydomain = NULL, alpha.f = 1, xjitter = NULL, yjitter = NULL, grid = TRUE, cex.main = 1, ... )flexiplot( x, y, type = NULL, lty = c(1, 2, 4, 3, 6, 5), lwd = 2, pch = c(1, 2, 0, 5, 6, 3), col = palette(), xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add = FALSE, xdomain = NULL, ydomain = NULL, alpha.f = 1, xjitter = NULL, yjitter = NULL, grid = TRUE, cex.main = 1, ... )
x |
Numeric or character: vector of x-coordinates. If missing, a numeric vector |
y |
Numeric or character: vector of y coordinates. If missing, a numeric vector |
type, lty, lwd, pch, col, xlab, ylab, add, cex.main
|
see analogous arguments in |
xlim, ylim
|
|
xdomain, ydomain
|
Character or numeric or |
alpha.f |
Numeric, default 1: opacity of the colours, |
xjitter, yjitter
|
Logical or |
grid |
Logical: whether to plot a light grid. Default |
... |
Other parameters to be passed to |
This function is essentially a wrapper around graphics::matplot(), augmenting the latter with some additional features useful for plotting data and results handled by Prova. Some of the additional features provided by flexiplot are the following:
Either or both x and y arguments can be of class base::character. In this case, axes labels corresponding to the unique values are used (see arguments xdomain and ydomain). This makes it easier to plot nominal and ordinal variates.
A jitter can also be added to the generated points, via the xjitter and yjitter switches. This makes it easier to generate scatter plots of nominal and ordinal variates.
It is possible to specify only a lower or upper limit in the xlim and ylim arguments, letting the other limit to be found automatically. This can be useful in plotting probabilities, in cases where we want to specify the lower, 0 limit, but want the upper limit to simply be the the maximum probability.
Transparency of lines or markers can be specified through argument alpha.f.
The plotting style is different, and default argument type = 'l' (line plot) rather than type = 'p' (point plot).
See the package's vignettes for more examples.
NULL, invisibly; produces a plot, see graphics::matplot().
Pr() to calculate posterior probabilities and quantiles.
plot.probability() to directly plot posterior probabilities and quantiles contained in a probability object.
plotquantiles() to plot quantile ranges.
## Scatter plot of the 'island' vs 'species' nominal variates of the penguins dataset; ## note how jitter is automatically added: flexiplot(x = penguins[, 'species'], y = penguins[, 'island']) ## Scatter plot of the 'bill_len' vs 'species' variates of the penguins dataset: flexiplot(x = penguins[, 'species'], y = penguins[, 'bill_len']) ## We can add jitter to separate the nominal values: flexiplot(x = penguins[, 'species'], y = penguins[, 'bill_len'], xjitter = TRUE) ## Scatter plot of the 'bill_len' vs 'body_mass' variates; ## in this case we must specify the scatter-plot option `type = 'p'`: flexiplot(x = penguins[, 'body_mass'], y = penguins[, 'bill_len'], type = 'p') ## Calculate the values of a normal distribution in a restricted range x <- seq(from = -2, to = 2, length.out = 127) y <- dnorm(x, mean = 0, sd = 1) ## plot the distribution, with 0 as the lower plot range: flexiplot(x = x, y = y, ylim = c(0, NA))## Scatter plot of the 'island' vs 'species' nominal variates of the penguins dataset; ## note how jitter is automatically added: flexiplot(x = penguins[, 'species'], y = penguins[, 'island']) ## Scatter plot of the 'bill_len' vs 'species' variates of the penguins dataset: flexiplot(x = penguins[, 'species'], y = penguins[, 'bill_len']) ## We can add jitter to separate the nominal values: flexiplot(x = penguins[, 'species'], y = penguins[, 'bill_len'], xjitter = TRUE) ## Scatter plot of the 'bill_len' vs 'body_mass' variates; ## in this case we must specify the scatter-plot option `type = 'p'`: flexiplot(x = penguins[, 'body_mass'], y = penguins[, 'bill_len'], type = 'p') ## Calculate the values of a normal distribution in a restricted range x <- seq(from = -2, to = 2, length.out = 127) y <- dnorm(x, mean = 0, sd = 1) ## plot the distribution, with 0 as the lower plot range: flexiplot(x = x, y = y, ylim = c(0, NA))
The posterior probabilities calculated with the Pr() function, and outputted as a probability object, have an associated variability that comes from the finite size of the data sample. This variability can be interpreted in two ways:
How the probabilities would change, if we could collect a very large (infinite) amount of additional data, and how likely would such change be;
The relative frequency of a particular variate value in the full (sampled and unsampled) population is unknown; we can quantify our uncertainty about this relative frequency with a probability distribution.
The hist() method for a probability object is a utility to visualize this kind of variability, in the form of a distribution.
## S3 method for class 'probability' hist( x, subset = NULL, breaks = NULL, legend = "top", lty = c(1, 2, 4, 3, 6, 5), lwd = 2, col = palette(), alpha.f = 1, fill.alpha.f = 0.125, showmean = TRUE, xlab = NULL, ylab = NULL, xlim = NULL, ylim = c(0, NA), main = NULL, grid = TRUE, add = FALSE, ... )## S3 method for class 'probability' hist( x, subset = NULL, breaks = NULL, legend = "top", lty = c(1, 2, 4, 3, 6, 5), lwd = 2, col = palette(), alpha.f = 1, fill.alpha.f = 0.125, showmean = TRUE, xlab = NULL, ylab = NULL, xlim = NULL, ylim = c(0, NA), main = NULL, grid = TRUE, add = FALSE, ... )
x |
Object of class "probability", obtained with |
subset |
Named list or named vector: which variate values to display. For the variates corresponding to the names in this list, only the vector of values corresponding to that variate is displayed. |
breaks |
|
legend |
One of the values |
lty, lwd, col, alpha.f, xlab, ylab, xlim, ylim, main, grid, add
|
see analogous arguments in |
fill.alpha.f |
Numeric, default 0.125: opacity of the histogram filling. |
showmean |
Logical, default |
... |
Other parameters to be passed to |
Invisibly, an object of class "histogram".
Pr() to calculate posterior probabilities and quantiles.
plot.probability() to plot the posterior probabilities.
flexiplot() (on which hist.probability() is based) for more general plots.
plotquantiles() to plot quantile ranges.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## calculate the probability, and its variability, ## for the value 'Adelie' of the "species" variate probs <- Pr(Y = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1) probs$values ## show the variability of this probability; equivalently show ## the probability distribution for the relative frequency of ## 'Adelie' penguins in the full population hist(probs, legend = 'topright')## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## calculate the probability, and its variability, ## for the value 'Adelie' of the "species" variate probs <- Pr(Y = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1) probs$values ## show the variability of this probability; equivalently show ## the probability distribution for the relative frequency of ## 'Adelie' penguins in the full population hist(probs, legend = 'topright')
Compute the posterior joint probability distribution of the variates conditional on the given data, by means of Markov-chain Monte Carlo, using the package Nimble.
learn( data, metadata, auxdata = NULL, outputdir = NULL, nsamples = 3600, nchains = 8, nsamplesperchain = 450, parallel = TRUE, seed = NULL, cleanup = TRUE, appendinfo = TRUE, valueislearnt = TRUE, subsampledata = NULL, prior = missing(data) || is.null(data), startupMCiterations = 3600, minMCiterations = 0, maxMCiterations = +Inf, maxhours = +Inf, ncheckpoints = 12, maxrelMCSE = +Inf, minESS = 450, initES = 2, thinning = NULL, verbose = TRUE, plottraces = !cleanup, showKtraces = FALSE, showAlphatraces = FALSE, hyperparams = list(ncomponents = 64, minalpha = -4, maxalpha = 4, byalpha = 1, Rshapelo = 0.5, Rshapehi = 0.5, Rvarm1 = 3^2, Cshapelo = 0.5, Cshapehi = 0.5, Cvarm1 = 3^2, Dshapelo = 0.5, Dshapehi = 0.5, Dvarm1 = 3^2, Bshapelo = 1, Bshapehi = 1, Dthreshold = 1, tscalefactor = 4.266, Oprior = "Hadamard", Nprior = "Hadamard", avoidzeroW = NULL, initmethod = "datacentre", Qerror = pnorm(c(-1, 1))) )learn( data, metadata, auxdata = NULL, outputdir = NULL, nsamples = 3600, nchains = 8, nsamplesperchain = 450, parallel = TRUE, seed = NULL, cleanup = TRUE, appendinfo = TRUE, valueislearnt = TRUE, subsampledata = NULL, prior = missing(data) || is.null(data), startupMCiterations = 3600, minMCiterations = 0, maxMCiterations = +Inf, maxhours = +Inf, ncheckpoints = 12, maxrelMCSE = +Inf, minESS = 450, initES = 2, thinning = NULL, verbose = TRUE, plottraces = !cleanup, showKtraces = FALSE, showAlphatraces = FALSE, hyperparams = list(ncomponents = 64, minalpha = -4, maxalpha = 4, byalpha = 1, Rshapelo = 0.5, Rshapehi = 0.5, Rvarm1 = 3^2, Cshapelo = 0.5, Cshapehi = 0.5, Cvarm1 = 3^2, Dshapelo = 0.5, Dshapehi = 0.5, Dvarm1 = 3^2, Bshapelo = 1, Bshapehi = 1, Dthreshold = 1, tscalefactor = 4.266, Oprior = "Hadamard", Nprior = "Hadamard", avoidzeroW = NULL, initmethod = "datacentre", Qerror = pnorm(c(-1, 1))) )
data |
A dataset, given as a |
metadata |
metadata about the dataset's variates, given either as a data frame or as a file path to a CSV file. |
auxdata |
A larger dataset, given as a data frame or as a file path to a CSV file. Such a dataset would be too large to use in the Monte Carlo sampling, but can still be used to help estimate some hyperparameters. |
outputdir |
|
nsamples |
Integer, default 3600: number of desired, approximately independent Monte Carlo samples. If this argument is changed, the user is also required to explicitly give either |
nchains |
Integer, default 8: number of Monte Carlo chains. If this argument is changed, the user is also required to explicitly give either |
nsamplesperchain |
Integer, default 450: number of approximately independent Monte Carlo samples per chain. If this argument is changed, the user is also required to explicitly give either |
parallel |
Logical or positive integer or cluster object. |
seed |
Integer or |
cleanup |
Logical, default |
appendinfo |
Logical, default |
valueislearnt |
Logical or |
subsampledata |
Integer or |
prior |
Logical: Calculate the prior distribution? Default is |
startupMCiterations |
Integer, default 3600: number of initial Monte Carlo iterations. |
minMCiterations |
Integer, default 0: minimum number of Monte Carlo iterations to be doneby a chain. |
maxMCiterations |
Integer, default |
maxhours |
Numeric, default |
ncheckpoints |
Integer or |
maxrelMCSE |
Numeric positive, default |
minESS |
Numeric positive or |
initES |
Numeric positive, default 2: number of initial "burn-in" samples, separated by the Expected Sample Size, to be discarded. Note that the Monte Carlo chain typically starts in a high-probability region, so there is no reason to discard many initial samples. |
thinning |
Integer or |
verbose |
Logical, default |
plottraces |
Logical, default |
showKtraces |
Logical, default |
showAlphatraces |
Logical, default |
hyperparams |
List: hyperparameters of the hyperprior; see values in "Usage". |
This function takes as main inputs a set of data and metadata, and computes the full joint probability distribution for new data, including its variability. From this full joint distribution any other distributions of interest can subsequently be computed; see Pr() and related functions. This computation can also be interpreted as an estimation of the full joint frequency distribution of the variates in the whole population, beyond the sample data, together with its uncertainty. The computation allows for the use of datapoints with partially missing variables: imputation is automatically made. This imputation is principled, made according to the rules of probability theory.
The output is a "learnt" object, typically saved in a learnt.rds file, which is used in all subsequent probabilistic computations. Other information about the computation is provided in logs and plots, saved in a directory specified by the user.
See vignette('intro') for introductory examples.
The computation is "non-parametric": probability or frequency distributions are not assumed to be Gaussian or of any other specific shape; no "model" is assumed. The mathematical representation of the space of joint frequency distributions follows ideas of Dunson & Bhattacharya (2011); see technical manual for details.
The computation is done via Markov-chain Monte Carlo, using the package Nimble. "Convergence" of the Monte Carlo computation is automatically assessed with methods described in Vehtari & al. (2021) and Kwon & al. (2025); see technical manual for details. The default values for convergence require that all of the following three conditions be fulilled:
The computation's numerical error (Monte-Carlo Standard Error) for the posterior probability must be smaller than 4.7% of the standard deviation of the posterior's variability.
The computation's numerical error for the 0.055- and 0.945-percentiles of the posterior's variability should be smaller than 4.7% of the distance between them.
Typically this requirement leads to final results obtained with the Pr() function having at least two significant digits.
The learn() function can take hours or even days to perform its computations, depending on the size of the dataset, number of variates, and the (initially unknown) "shape" of the underlying probability distribution. For this reason it is typically called within an R script, executed via utils::Rscript. For example, a script 'myscript.R' could have the following structure:
library('prova')
learn(
data = 'filename_with_data.csv', # CSV file containing the dataset
metadata = 'filename_with_metadata.csv', # CSV file containing the metadata
outputdir = 'some_directory', # path to output directory
parallel = 8 # machine has more than 8 cores, so we use 8
## possibly other arguments to learn()
)
and then be called on a bash terminal with
$ Rscript myscript.R > learnoutput.log 2>&1 &
with such a call, the file 'learnoutput.log' will contain information about how the computation is proceeding and the estimated end time.
A "learnt" object, or name of directory containing such an object and other output files, or NULL, depending on argument valueislearnt.
learn() saves several files in a directory. By default this output directory is a temporary directory within the one used by base::tempdir(), but an alternative one can be chosen with the argument outputdir =. The output directory contain several diagnostic files for the Monte Carlo computation; in particular:
MCtraces.pdf: shows several trace plots of the Monte Carlo sampling; the correspondin data are in the file MCtraces.rds.
plotsamples_learnt.pdf, plotquantiles_learnt.pdf: show the marginal posterior distributions of each individual variate, together with their variability (as samples or quantiles).
log-1.log, log-2.log, ... one for each parallel core; report the progress of each parallel Monte Carlo computation and notes about it.
rng_seed.rds: the state of the pseudorandom seed (see base::Random) when learn() was called.
metadata.csv: a copy of the metadata.
It is recommended that you give an explicit argument outputdir = and save the directory with the files above for future reference. In particular, the MCtraces.pdf plot and MCtraces.rds data can be useful to report Monte Carlo convergence in any work of yours that used Prova.
For the mathematical representation of the frequency space:
Dunson, Bhattacharya (2011): Nonparametric Bayes regression and classification through mixtures of product kernels https://doi.org/10.1093/acprof:oso/9780199694587.003.0005.
Ishwaran, Zarepour (2002): Exact and approximate sum representations for the Dirichlet process https://doi.org/10.2307/3315951.
Porta Mana https://github.com/pglpm/prova/raw/main/development/manual/pglpm2024-bayes_nonparam.pdf.
About Bayesian inference under exchangeability ("population inference"):
Lindley, Novick (1981): The role of exchangeability in inference, https://doi.org/10.1214/aos/1176345331.
Bernardo, Smith (2000): Bayesian Theory. Wiley https://doi.org/10.1002/9780470316870.
Porta Mana https://github.com/pglpm/prova/raw/main/development/manual/pglpm2024-bayes_nonparam.pdf.
About nonparametrics:
Müller et al. (2015): Nonparametric Bayesian inference. IMS https://doi.org/10.1007/978-3-319-18968-0.
Hjort et al. (2010): Bayesian Nonparametrics. Cambridge University Press https://doi.org/10.1017/CBO9780511802478.
About Markov-chain Monte Carlo and "convergence":
de Valpine, Paciorek, Turek, & al. (2026): NIMBLE: MCMC, Particle Filtering, and Programmable Hierarchical Modeling https://doi.org/10.5281/zenodo.1211190, https://cran.r-project.org/package=nimble.
Kwon & al. (2025): MCMC stopping rules in latent variable modelling https://doi.org/10.1111/bmsp.12357.
Vehtari & al. (2021): Rank-normalization, folding, and localization: an improved R-hat for assessing convergence of MCMC https://doi.org/10.1214/20-BA1221.
Roy (2020): Convergence diagnostics for Markov chain Monte Carlo https://doi.org/10.1146/annurev-statistics-031219-041300.
Gilks & al. (1998): Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC https://doi.org/10.1201/b14835.
D. J. C. MacKay (2005): Information Theory, Inference, and Learning Algorithms. Cambridge University Press https://www.inference.org.uk/itila/book.html.
Porta Mana https://github.com/pglpm/prova/raw/main/development/manual/pglpm2024-bayes_nonparam.pdf.
metadatatemplate() to help writing metadata files.
Pr() to calculate probabilities, and qPr() to calculate quantiles, given the data processed by learn().
rPr() to generate datapoints similar to the data processed by learn().
mutualinfo() to calculate mutual information given the data processed by learn().
pread.csv() and pwrite.csv() to read and write CSV files in the format used by learn().
### WARNING: the following example, if run, might even take a minute or more. ## Create dataset with 3 points of variate 'V' for demonstration: dataset <- data.frame(V = rnorm(n = 3)) ## Create metadata file: metadata <- data.frame(name = 'V', type = 'continuous') ## Learn from the data: learnt <- learn( data = dataset, metadata = metadata, ## the following parameters are unrealistic ## only used to reduce computation time for this example nsamples = 10, nchains = 1, startupMCiterations = 10, maxMCiterations = 10, minESS = 0, initES = 0 ) ## Check structure of `learnt` object: str(learnt)### WARNING: the following example, if run, might even take a minute or more. ## Create dataset with 3 points of variate 'V' for demonstration: dataset <- data.frame(V = rnorm(n = 3)) ## Create metadata file: metadata <- data.frame(name = 'V', type = 'continuous') ## Learn from the data: learnt <- learn( data = dataset, metadata = metadata, ## the following parameters are unrealistic ## only used to reduce computation time for this example nsamples = 10, nchains = 1, startupMCiterations = 10, maxMCiterations = 10, minESS = 0, initES = 0 ) ## Check structure of `learnt` object: str(learnt)
learnt object produced by learn()An example learnt object obtained by means of the learn() function, using the datasets::penguins dataset and the metadata in metadataExample, according to the call
learn(data = penguins, metadata = metadataExample, nsamples = 225, nchains = 15)
It is a list that essentially contains posterior hyperparameters for drawing statistical inferences about the variates species and bill_len.
Note that the learn() function that produced learntExample was called with the option to create only a limited number (225) of Monte Carlo samples, in order to reduce its memory size. Thus the numerical error associated with the Monte Carlo approximation is relatively in inferences drawn from the posterior hyperparameters saved in learntExample. It is only meant to be used for illustration purposes of the package's capabilities.
learntExamplelearntExample
learntExampleA list containing results from Markov-chain Monte Carlo computation, including diagnostics and variate metadata.
No return value.
learn(), which produces this kind of object.
Pr(), qPr(), rPr(), mutualinfo(): functions that require this kind of object in order to calculate probabilities and quantiles, generate data points, and calculate mutual information.
A data frame containing the prior information about the variates species and bill_len of the datasets::penguins dataset.
metadataExamplemetadataExample
metadataExampleA data frame with 2 rows and 10 columns.
No return value.
metadatatemplate() which helps producing this kind of metadata files from a given dataset.
learn() which needs this kind of metadata files to "learn" from data.
Metadata and helper function to create a template metadata file or object.
metadatatemplate( data, file = NULL, includevrt = NULL, excludevrt = NULL, addsummary2metadata = FALSE, backupfiles = FALSE, verbose = TRUE )metadatatemplate( data, file = NULL, includevrt = NULL, excludevrt = NULL, addsummary2metadata = FALSE, backupfiles = FALSE, verbose = TRUE )
data |
A dataset, given as a data frame or as a file path to a csv file. |
file |
Character or |
includevrt |
Character or |
excludevrt |
Character or |
addsummary2metadata |
Logical: also output some diagnostic statistics
in the metadata? Default |
backupfiles |
Logical: rename previous metadata file if it exists?
Default |
verbose |
Logical: output heuristics for each variate? Default |
The learn() function needs metadata about the variates present in the data. Such metadata can be provided either as a csv file or as a base::data.frame(). The function buildmetadata creates a template metadata csv-file, or outputs a metadata data.frame, by trying to guess metadata information from the dataset.The guesses may be very incorrect (as already said, metadata is information not contained in the data, so no algorithm can exist that extracts it from the data). The user must modify and correct this template, using it as a starting point to prepare the correct metadata information.
A preliminary data frame containing the metadata, invisibly if file = NULL. If argument file is a character, a preliminary metadata file is also created with that name or path.
In order to correctly learn from a dataset, the learn() function needs information that is not contained in the data themeselves; that is, it needs metadata. Metadata are provided either as a csv file or as a base::data.frame().
A metadata file or data.frame must contain one row for each simple variate in the given inference problem, and the following fields (columns), even if some of them may be empty:
name, type, domainmin, domainmax, datastep, minincluded, maxincluded, V1, V2, (possibly additional V-fields, sequentially numbered)
The type field has three possible values: nominal, ordinal, continuous. The remaining fields that must be filled in depend on the type field. Here is a list of requirements:
nominal and ordinal: require either V1, V2, ... fields or domainmin, domainmax, datastep (all three) fields. No other fields are required.
continuous: requires domainmin, domainmax, datastep, minincluded, maxincluded.
Here are the meanings and possible values of the fields:
name: The name of the variate. This must be the same character string as it appears in the dataset (be careful about upper- and lower-case).
type: The data type of variate name. Possible values are nominal, ordinal, continuous.
A nominal (also called categorical) variate has a discrete, finite number of possible values which have no intrinsic ordering. Examples could be a variate related to colour, with values "red", "green", "blue", and so on; or a variate related to cat breeds, with values "Siamese", "Abyssinian", "Persian", and so on. The possible values of the variate must be given in the fields V1, V2, and so on. It is important to include values that are possible but are not present in the dataset. A variate having only two possible values (binary variate), for example "yes" and "no", can be specified as nominal.
An ordinal variate has a discrete, finite number of possible values which do have an intrinsic ordering. Examples could be a Likert-scaled variate for the results of a survey, with values "very dissatisfied", "dissatisfied", "satisfied", "very satisfied"; or a variate related to the levels of some quantities, with values "low", "medium", "high"; or a variate having a numeric scale with values from 1 to 10. Whether a variate is nominal or ordinal often depends on the context. The possible values of the variate but be given in either one (but not both) or two ways: (1) in the fields V1, V2, ..., as for nominal variates; (2) as the fields domainmin, domainmax, datastep. Option (2) only works with numeric, equally spaced values: it assumes that the first value is domainmin, the second is domainmin+datastep, the third is domainmin+2*datastep, and so on up to the last value, domainmax.
A continuous variate has a continuum of values with an intrinsic ordering. Examples could be a variate related to the width of an object; or to the age of a person; or one coordinate of an object in a particular reference system. A continuous variate requires specification of the fields domainmin, domainmax, datastep, minincluded, maxincluded. Some naturally continuous variates are often rounded to a given precision; for instance, the age of a person might be reported as rounded to the nearest year (25 years, 26 years, and so on); or the length of an object might be reported to the nearest centimetre (1 m, 1.01 m, 1.02 m, and so on). The minimum distance between such rounded values must be reported in the datastep field; this would be 1 in the age example and 0.01 in the length example above. See below for further explanation of why reporting such rounding is important.
domainmin: The minimum value that the variate (ordinal or continuous) can take on. Possible values are a real number or an empty value, which is then interpreted as -Inf (explicit values like -Inf, -inf, -infinity should also work). Some continuous variates, like age or distance or temperature, are naturally positive, and therefore have domainmin equal 0. But in other contexts the minimum value could be different. For instance, if a given inference problem only involves people of age 18 or more, then domainmin would be set to 18.
domainmax: The maximum value that the variate (ordinal or continuous) can take on. Possible values are a real number, or an empty value, which is then interpreted as +Inf (explicit values like Inf, inf, infinity should also work). As with domainmin, the maximum value depends on the context. An age-related variate could theoretically have domainmax equal to infinity (empty value in the metadata file); but if a given study categorizes some people as "90 years old or older", then domainmax should be set to 90.
datastep: The minimum distance between the values of a variate (ordinal or continuous). Possible values are a positive real number or an empty value, which is then interpreted as 0 (the explicit value 0 is also accepted). For a numeric ordinal variate, datastep is the step between consecutive values. For a continuous rounded variate, datastep is the minimum distance between different values that occurs because of rounding; see the examples given above. The function buildmetadata has some heuristics to determine whether the variate is rounded or not. See further details under the section Rounding below.
minincluded, maxincluded: Whether the minimum (domainmin) and maximum(domainmax) values of a continuous variate can really appear in the data or not. Possible values are true (or t or yes) or false (or f, no, or an empty field); upper- or lower-case is irrelevant. Here are some examples about the meaning of these fields. (a) A continuous unrounded variate such as temperature has 0 as a minimum possible value domainmin, but this value itself is physically impossible and can never appear in data; in this case minincluded is empty (or set to false or no). (b) A variate related to the unrounded length, in metres, of some objects may take on any positive real value; but suppose that all objects of length 5 or less are grouped together under the value 5. It is then possible for several datapoints to have value 5: one such datapoint could originally have the value 3.782341...; another the value 4.929673..., and so on. In this case domainmin is set to 5, and minincluded is set to true (or yes). Similarly for the maximum value of a variate and maxincluded. Note that if domainmin is minus-infinity (empty value in the metadata file), then minincluded is automatically empty (that is, false), and similarly for maxincluded if domainmax is infinity.
learn(), which generates the information necessary to calculate posterior probabilities, based on data and metadata.
## Create a preliminary data frame of metadata for the `penguins` dataset metadata <- metadatatemplate(data = datasets::penguins, file = NULL) ## Note how the preliminary data frame includes additional spots ## for values of nominal and ordinal variates ## which could be missing from the data print(metadata) ## Create a preliminary data frame of metadata for the `penguins` dataset, ## including only the 'species' and 'bill_len' variates: metadata2 <- metadatatemplate( data = datasets::penguins, file = NULL, includevrt = c('species', 'bill_len') ) print(metadata2) ## Create a preliminary data frame of metadata for the `penguins` dataset, ## excluding the 'year' variate: metadata3 <- metadatatemplate( data = datasets::penguins, file = NULL, excludevrt = 'year' ) print(metadata3) ## Generate 10 points for a continuous variate in (0, 1) dataset <- runif(10) ## `metadatatemplate` correctly guesses the variate minimum, ## but not the maximum (`NA` is equivalent to `+Inf`) metadata <- metadatatemplate(data = dataset, file = NULL) print(metadata)## Create a preliminary data frame of metadata for the `penguins` dataset metadata <- metadatatemplate(data = datasets::penguins, file = NULL) ## Note how the preliminary data frame includes additional spots ## for values of nominal and ordinal variates ## which could be missing from the data print(metadata) ## Create a preliminary data frame of metadata for the `penguins` dataset, ## including only the 'species' and 'bill_len' variates: metadata2 <- metadatatemplate( data = datasets::penguins, file = NULL, includevrt = c('species', 'bill_len') ) print(metadata2) ## Create a preliminary data frame of metadata for the `penguins` dataset, ## excluding the 'year' variate: metadata3 <- metadatatemplate( data = datasets::penguins, file = NULL, excludevrt = 'year' ) print(metadata3) ## Generate 10 points for a continuous variate in (0, 1) dataset <- runif(10) ## `metadatatemplate` correctly guesses the variate minimum, ## but not the maximum (`NA` is equivalent to `+Inf`) metadata <- metadatatemplate(data = dataset, file = NULL) print(metadata)
This function calculates various entropic information measures between two grops of joint variates: the mutual information, the conditional entropies, and the entropies.
mutualinfo( Y1names, Y2names = NULL, X = NULL, learnt, tails = NULL, n = NULL, unit = "Sh", parallel = TRUE, verbose = FALSE )mutualinfo( Y1names, Y2names = NULL, X = NULL, learnt, tails = NULL, n = NULL, unit = "Sh", parallel = TRUE, verbose = FALSE )
Y1names |
Character vector: first group of joint variates |
Y2names |
Character vector or |
X |
Matrix or data.frame or |
learnt |
Either a character with the name of a directory or full path for an 'learnt.rds' object, or such an object itself. |
tails |
Named vector or list, or |
n |
Integer or |
unit |
Either one of 'Sh' for shannon (default), 'Hart' for hartley, 'nat' for natural unit, or a positive real indicating the base of the logarithms to be used. |
parallel |
Logical or positive integer or cluster object. |
verbose |
Logical, default |
If and are two variates, each of which can be a joint variate such as , and a third, also possibly join, variate, then the mutual information between and , conditional on , is given by
an expression which can also be written in several other equivalent ways. It is an information-theoretic measure of association that is model-free, that is, does not depend on assumptions such as linearity, gaussianity, and similar. See vignette('mutualinfo') for discussion and example uses, and also the "References" section. If are jointly gaussian variates, then there is a mathematical correspondence between their mutual information and their Pearson correlation coefficient; see output MI.rGauss in the "Value" section.
The conditional entropy of with respect to , conditional on , is given by
The (differential) entropy of , conditional on , is given by
see "References" section for discussions about entropy and conditional entropy.
The function mutualinfo() calculates the quantities above for the joint variates specified in the arguments Y1names and Y2names, conditional on the values of the variates specified in the data frame X. If X is omitted or NULL, then the posterior probabilities etc. are used. Each variate in the argument X can be specified either as a point-value or as a left-open interval or as a right-open interval , through the argument tails.
The computation of these quantities is done via Monte Carlo integration, using the samples produced by the learn() function. The present function also output the numerical error associated with this computation.
A list consisting of the following elements:
MI, a vector of value and accuracy: the mutual information between (joint) variates Y1names and (joint) variates Y2names.
CondEn12, CondEn21, vectors of value and accuracy: the conditional entropy of the first variate given the second, and vice versa.
En1, En2, vectors of value and accuracy: the (differential) entropies of the first and second variates.
MI.rGauss, a vector of value and accuracy: the absolute value of the Pearson correlation coefficient of a multivariate Gaussian distribution having mutual information MI; the two are related by . It may provide a vague intuition for the MI value for people more familiar with Pearson's correlation, but should be taken with a grain of salt.
unit, Y1names, Y1names: same as the input arguments, included for the user's convenience.
Pr() to calculate probabilities and their variability.
learn(), which generates the learnt objects required by mutualinfo().
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## mutual information between variates 'species' and 'bill_len' MI <- mutualinfo(Y1names = 'species', Y2names = 'bill_len', learnt = learnt, parallel = 1) paste0(MI$MI, ' ', MI$unit, collapse = ' +/- ') ## Shannon entropy of variate 'species' paste0(MI$En1, ' ', MI$unit, collapse = ' +/- ') ## Shannon entropy of variate 'species', ## conditional on a bill length of 30 mm: entr <- mutualinfo( Y1names = 'species', X = data.frame(bill_len = 30), learnt = learnt, parallel = 1 ) paste0(entr$En1, ' ', entr$unit, collapse = ' +/- ') ## the entropy is now lower; indeed a penguin with a short bill length ## is most probably of the 'Adelie' species: probs <- Pr( Y = data.frame(species = c('Adelie', 'Gentoo', 'Chinstrap')), X = data.frame(bill_len = 30), learnt = learnt, parallel = 1 ) print(probs)## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## mutual information between variates 'species' and 'bill_len' MI <- mutualinfo(Y1names = 'species', Y2names = 'bill_len', learnt = learnt, parallel = 1) paste0(MI$MI, ' ', MI$unit, collapse = ' +/- ') ## Shannon entropy of variate 'species' paste0(MI$En1, ' ', MI$unit, collapse = ' +/- ') ## Shannon entropy of variate 'species', ## conditional on a bill length of 30 mm: entr <- mutualinfo( Y1names = 'species', X = data.frame(bill_len = 30), learnt = learnt, parallel = 1 ) paste0(entr$En1, ' ', entr$unit, collapse = ' +/- ') ## the entropy is now lower; indeed a penguin with a short bill length ## is most probably of the 'Adelie' species: probs <- Pr( Y = data.frame(species = c('Adelie', 'Gentoo', 'Chinstrap')), X = data.frame(bill_len = 30), learnt = learnt, parallel = 1 ) print(probs)
This base::plot() method is a utility to plot probabilities obtained with Pr(), as well as their variabilities. The probabilities are plotted either against Y, with one curve for each value of X, or vice versa.
## S3 method for class 'probability' plot( x, variability = NULL, subset = NULL, PvsY = NULL, legend = "top", lty = c(1, 2, 4, 3, 6, 5), lwd = 2, col = palette(), type = NULL, alpha.f = 1, var.alpha.f = NULL, xlab = NULL, ylab = NULL, main = NULL, ylim = c(0, NA), grid = TRUE, add = FALSE, ... )## S3 method for class 'probability' plot( x, variability = NULL, subset = NULL, PvsY = NULL, legend = "top", lty = c(1, 2, 4, 3, 6, 5), lwd = 2, col = palette(), type = NULL, alpha.f = 1, var.alpha.f = NULL, xlab = NULL, ylab = NULL, main = NULL, ylim = c(0, NA), grid = TRUE, add = FALSE, ... )
x |
Object of class "probability", obtained with |
variability |
One of the values |
subset |
Named list or named vector: which variate values to display. For the variates corresponding to the names in this list, only the vector of values corresponding to that variate is displayed. |
PvsY |
Logical or |
legend |
One of the values |
lty, lwd, col, type, xlab, ylab, main, ylim, grid, add
|
see analogous arguments in |
alpha.f |
Numeric, default 0.25: opacity of the colours, |
var.alpha.f |
Numeric: opacity of the quantile bands or of the samples, |
... |
Other parameters to be passed to |
NULL, invisibly; produces a plot, see graphics::matplot().
Pr() to calculate posterior probabilities and quantiles.
hist.probability() to plot the variability of the probabilities as a distribution.
flexiplot() (on which plot.probability() is based) for more general plots.
plotquantiles() to plot quantile ranges.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## create a grid of values for variate "bill length", ## based on the information in the dataset and metadata: values <- vrtgrid(vrt = 'bill_len', learnt = learnt) ## calculate the probabilities and quantiles probs <- Pr(Y = data.frame(bill_len = values), learnt = learnt, parallel = 1) ## plot the probabilities and quantiles plot(probs)## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## create a grid of values for variate "bill length", ## based on the information in the dataset and metadata: values <- vrtgrid(vrt = 'bill_len', learnt = learnt) ## calculate the probabilities and quantiles probs <- Pr(Y = data.frame(bill_len = values), learnt = learnt, parallel = 1) ## plot the probabilities and quantiles plot(probs)
Utility function to plot pairs of quantiles obtained with Pr().
plotquantiles( x, y, xdomain = NULL, alpha.f = 0.25, col = palette(), border = NA, type = "n", ... )plotquantiles( x, y, xdomain = NULL, alpha.f = 0.25, col = palette(), border = NA, type = "n", ... )
x |
Numeric or character: vector of x-coordinates. See |
y |
Numeric: a matrix having as many rows as |
xdomain |
Character or numeric or |
alpha.f |
Numeric, default 0.25: opacity of the quantile bands, |
col |
Fill colour of the quantile bands. Can be specified in any of the usual ways, see for instance |
border |
Fill colour of the quantile bands. Can be specified in any of the usual ways, see for instance |
type |
see analogous argument in |
... |
Other parameters to be passed to |
NULL, invisibly; produces a plot, see graphics::matplot().
Pr() to calculate posterior probabilities and quantiles.
plot.probability() to directly plot posterior probabilities and quantiles contained in a probability object.
flexiplot() for more general plots.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## create a grid of values for variate "bill length", ## based on the information in the dataset and metadata: values <- vrtgrid(vrt = 'bill_len', learnt = learnt) ## calculate the probabilities and quantiles probs <- Pr(Y = data.frame(bill_len = values), learnt = learnt, parallel = 1) ## plot the quantiles, setting lower plot range to zero plotquantiles(x = values, y = probs$quantiles[, 1, ], ylim = c(0, NA), xlab = 'bill length', ylab = 'probability') ## add a plot of the probabilities in thick dashed red flexiplot(x = values, y = probs$values, lwd = 5, lty = 2, col = 2, add = TRUE)## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## create a grid of values for variate "bill length", ## based on the information in the dataset and metadata: values <- vrtgrid(vrt = 'bill_len', learnt = learnt) ## calculate the probabilities and quantiles probs <- Pr(Y = data.frame(bill_len = values), learnt = learnt, parallel = 1) ## plot the quantiles, setting lower plot range to zero plotquantiles(x = values, y = probs$quantiles[, 1, ], ylim = c(0, NA), xlab = 'bill length', ylab = 'probability') ## add a plot of the probabilities in thick dashed red flexiplot(x = values, y = probs$values, lwd = 5, lty = 2, col = 2, add = TRUE)
This function calculates posterior probability densities, cumulative posterior probabilities, and mixtures thereof. It also outputs the variability of such probabilities if more training data were available, and the Monte Carlo Standard Error for the calculated posterior probabilities.
Pr( Y, X = NULL, learnt, tails = NULL, priorY = NULL, nsamples = "all", quantiles = c(0.055, 0.25, 0.75, 0.945), parallel = TRUE, sep = ",", solidus = "|", verbose = FALSE, keepYX = TRUE )Pr( Y, X = NULL, learnt, tails = NULL, priorY = NULL, nsamples = "all", quantiles = c(0.055, 0.25, 0.75, 0.945), parallel = TRUE, sep = ",", solidus = "|", verbose = FALSE, keepYX = TRUE )
Y |
Matrix or data.table: set of values of variates of which we want the joint probability of. One variate per column, one set of values per row. |
X |
Matrix or data.table or |
learnt |
Either a character with the name of a directory or full path for a 'learnt.rds' object, produced by the |
tails |
Named vector or list, or |
priorY |
Numeric vector with the same length as the rows of |
nsamples |
Integer or |
quantiles |
Numeric vector, between 0 and 1, or |
parallel |
Logical or positive integer or cluster object. |
sep |
character, default |
solidus |
character, default |
verbose |
Logical, default |
keepYX |
Logical, default |
This function calculates the posterior probability , where and are two (non overlapping) sets of joint variate values, inputted as data frame arguments Y and X. It is somewhat analogous to the d-variants and p-variantes of R distribution functions, such as stats::dnorm() and stats::pnorm(). If X is omitted or NULL, then the posterior probability is calculated.
For some variates in Y or X, tail values can also be prescribed, so that this function calculates mixed probabilities such as
Tail values are inputted via the 'tails' argument; see "Usage".
This function also outputs the variability of the posterior probabilities above, that is, probabilities such as that we could have if more learning data were provided, as well as a number of samples of the possible values of such probability. This variability can be outputted in two ways; the user can choose either, or both, or none:
As samples (default 3600 samples, depending on the 'nsamples' argument given to the learn() function) of the alternative values that the posterior probability could have.
As quantiles (default 5.5%, 25%, 75%, 94.5%) of the possible variability.
If several joint values are given for Y or X, the function will create a 2D grid of results for all possible combinations of the given Y and X values.
This function also allows for base-rate or other prior-probability corrections: If a prior (for instance, a base rate) for Y is given, the function will calculate the probability from and the prior, by means of Bayes's theorem.
Each variate in each argument Y, X can be specified either as a point-value or as a left-open interval or as a right-open interval , through the argument tails.
See vignette('intro') for example uses.
An object of class "probability", effectively a list consisting of the following elements:
values: a matrix with the probabilities , for all joint values of the -variates (rows) and all joint values of the -variates (columns).
quantiles (possibly NULL): an array with the variability quantiles (3rd dimension of the array) for such probabilities.
samples (possibly NULL): an array with the variability samples (3rd dimension of the array) for such probabilities.
values.MCaccuracy, quantiles.MCaccuracy: arrays with the numerical accuracies (roughly speaking a standard deviation) of the Monte Carlo calculations for the values and quantiles elements.
Y, X: copies of the Y and X arguments.
Lindley, Novick (1981): The role of exchangeability in inference, https://doi.org/10.1214/aos/1176345331.
Bernardo, Smith (2000): Bayesian Theory. Wiley https://doi.org/10.1002/9780470316870.
Jaynes (2003): Probability Theory: The Logic of Science. Cambridge University Press https://doi.org/10.1017/CBO9780511790423.
MacKay (2005): Information Theory, Inference, and Learning Algorithms. Cambridge University Press https://www.inference.org.uk/itila/book.html.
Porta Mana (2025): What's special about 89% credibility intervals? https://doi.org/10.5281/zenodo.17072199.
learn(), which generates the learnt objects required by Pr().
plot.probability() to plot probabilities and quantiles calculated by Pr().
hist.probability() to plot histograms of the probability distributions calculated by Pr().
print.probability() to print the main elements of the probabilities calculated by Pr().
qPr() to calculate quantiles for a specific variate, that is, the variate values having given probabilities.
rPr() to generate datapoints.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Calculate the probability that an unknown penguin from this population ## is of species 'Adelie' probs <- Pr( Y = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## the full-population frequency of 'Adelie' penguins is unknown; ## display the 5.5%- and 94.5%-probability values ## for such frequency probs$quantiles[, , c('5.5%', '94.5%')] ## we can also plot the probability distribution for this full-population frequency hist(probs, legend = 'topright') ## ## Example 2: ## Calculate the 3 probabilities that an unknown penguin from this population ## is of species 'Adelie', 'Chinstrap', 'Gentoo' probs <- Pr( Y = data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')), learnt = learnt, parallel = 1 ) ## display the 3 probability values probs$values ## the full-population frequencies of the three species are unknown; ## display the 5.5%- and 94.5%-probability values ## for such frequencies probs$quantiles[, , c('5.5%', '94.5%')] ## plot the probabilities and quantiles plot(probs) ## plot the probability distribution for the full-population frequency ## of each species hist(probs) ## ## Example 3: ## Calculate the probability that an unknown penguin is of species 'Adelie' ## GIVEN that its bill length is 43 mm probs <- Pr( Y = data.frame(species = 'Adelie'), X = data.frame(bill_len = 43), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## the full-subpopulation frequency of 'Adelie' penguins, ## among penguins having bill length of 43 mm, is unknown; ## display the 5.5%- and 94.5%-probability values ## for such conditional frequency probs$quantiles[, , c('5.5%', '94.5%')] ## ## Example 4: ## Calculate the probability that ## an unknown penguin is of species 'Adelie' AND its bill length is 43 mm probs <- Pr( Y = data.frame(species = 'Adelie', bill_len = 43), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population frequency of 'Adelie' penguins with 43 mm bills probs$quantiles[, , c('5.5%', '94.5%')] ## ## Example 5: ## Calculate the 3 x 2 probabilities for the 3 species ## GIVEN bill-lengths of 43 mm and 44 mm Y <- data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')) X <- data.frame(bill_len = c(43, 44)) probs <- Pr(Y = Y, X = X, learnt = learnt, parallel = 1) ## display the 3 x 2 probability values probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population joint frequencies probs$quantiles[, , c('5.5%', '94.5%')] ## plot the probabilities and quantiles plot(probs) ## ## Example 6: ## Calculate the 3 x 2 joint probabilities for the 3 species ## AND bill-lengths of 43 mm and 44 mm Y <- expand.grid( species = c('Adelie', 'Chinstrap', 'Gentoo'), bill_len = c(43, 44) ) probs <- Pr(Y = Y, learnt = learnt, parallel = 1) ## display the 6 joint-probability values probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population joint frequencies probs$quantiles[, , c('5.5%', '94.5%')]## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Calculate the probability that an unknown penguin from this population ## is of species 'Adelie' probs <- Pr( Y = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## the full-population frequency of 'Adelie' penguins is unknown; ## display the 5.5%- and 94.5%-probability values ## for such frequency probs$quantiles[, , c('5.5%', '94.5%')] ## we can also plot the probability distribution for this full-population frequency hist(probs, legend = 'topright') ## ## Example 2: ## Calculate the 3 probabilities that an unknown penguin from this population ## is of species 'Adelie', 'Chinstrap', 'Gentoo' probs <- Pr( Y = data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')), learnt = learnt, parallel = 1 ) ## display the 3 probability values probs$values ## the full-population frequencies of the three species are unknown; ## display the 5.5%- and 94.5%-probability values ## for such frequencies probs$quantiles[, , c('5.5%', '94.5%')] ## plot the probabilities and quantiles plot(probs) ## plot the probability distribution for the full-population frequency ## of each species hist(probs) ## ## Example 3: ## Calculate the probability that an unknown penguin is of species 'Adelie' ## GIVEN that its bill length is 43 mm probs <- Pr( Y = data.frame(species = 'Adelie'), X = data.frame(bill_len = 43), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## the full-subpopulation frequency of 'Adelie' penguins, ## among penguins having bill length of 43 mm, is unknown; ## display the 5.5%- and 94.5%-probability values ## for such conditional frequency probs$quantiles[, , c('5.5%', '94.5%')] ## ## Example 4: ## Calculate the probability that ## an unknown penguin is of species 'Adelie' AND its bill length is 43 mm probs <- Pr( Y = data.frame(species = 'Adelie', bill_len = 43), learnt = learnt, parallel = 1 ) ## display the probability value probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population frequency of 'Adelie' penguins with 43 mm bills probs$quantiles[, , c('5.5%', '94.5%')] ## ## Example 5: ## Calculate the 3 x 2 probabilities for the 3 species ## GIVEN bill-lengths of 43 mm and 44 mm Y <- data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')) X <- data.frame(bill_len = c(43, 44)) probs <- Pr(Y = Y, X = X, learnt = learnt, parallel = 1) ## display the 3 x 2 probability values probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population joint frequencies probs$quantiles[, , c('5.5%', '94.5%')] ## plot the probabilities and quantiles plot(probs) ## ## Example 6: ## Calculate the 3 x 2 joint probabilities for the 3 species ## AND bill-lengths of 43 mm and 44 mm Y <- expand.grid( species = c('Adelie', 'Chinstrap', 'Gentoo'), bill_len = c(43, 44) ) probs <- Pr(Y = Y, learnt = learnt, parallel = 1) ## display the 6 joint-probability values probs$values ## display the 5.5%- and 94.5%-probability values ## for the full-population joint frequencies probs$quantiles[, , c('5.5%', '94.5%')]
This base::print() method is a utility to display selected elements of a "probability" object obtained with Pr(); typically its posterior probabilies (element $values) and their variabilities (element $quantiles). If the Y or X variates are joint variates, this method also allow to display only selected values of them
## S3 method for class 'probability' print(x, elements = NULL, subset = NULL, digits = TRUE, ...)## S3 method for class 'probability' print(x, elements = NULL, subset = NULL, digits = TRUE, ...)
x |
Object of class "probability", obtained with |
elements |
character or integer vector, or |
subset |
Named list or named vector: which variate values to display. For the variates corresponding to the names in this list, only the vector of values corresponding to that variate is displayed. |
digits |
positive number or |
... |
Other parameters to be passed to |
Its x argument, invisibly; see base::print().
Pr() to calculate posterior probabilities and quantiles.
plot.probability() to plot probabilities and quantiles calculated by ‘Pr()’.
hist.probability() to plot the variability of the probabilities as a distribution.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## Calculate the 3 x 2 probabilities for the 3 species ## given bill-lengths of 43 mm and 44 mm Y <- data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')) X <- data.frame(bill_len = c(43, 44)) probs <- Pr(Y = Y, X = X, learnt = learnt, parallel = 1) ## display the values and variabilities of these probabilities print(probs) ## diplay 'values' only, and only for the species value 'Gentoo' print(probs, elements = 'values', subset = list(species = 'Gentoo'))## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## Calculate the 3 x 2 probabilities for the 3 species ## given bill-lengths of 43 mm and 44 mm Y <- data.frame(species = c('Adelie', 'Chinstrap', 'Gentoo')) X <- data.frame(bill_len = c(43, 44)) probs <- Pr(Y = Y, X = X, learnt = learnt, parallel = 1) ## display the values and variabilities of these probabilities print(probs) ## diplay 'values' only, and only for the species value 'Gentoo' print(probs, elements = 'values', subset = list(species = 'Gentoo'))
Utility functions to read and write CSV files in the format required by Prova
pwrite.csv(x, file, ...) pread.csv(file, ...)pwrite.csv(x, file, ...) pread.csv(file, ...)
x |
The object to be written. Preferably a matrix or data frame; if not, it is attempted to coerce |
file |
Either a character naming a file or a connection open for writing or reading. See |
... |
Other arguments to be passed to |
The functions learn() and metadatatemplate() accept CSV files formatted as follows:
Decimal values should be separated by a dot; no comma should be used to separate thousands etc. Example: 86342.75.
Character and names should be quoted in single or double quotes. Example: "female".
Values should be separated by commas, not by tabs or semicolons.
Missing values should be simply empty, not denoted by "NA", "missing", "-", or similar.
Preferably there should not be factors (see base::factor); use character names instead.
The utility functions pwrite.csv() and pread.csv() are wrappers to utils::write.csv() and utils::read.csv() that set appropriate default parameters according to the formatting rules above.
pread.csv returns a data frame containing a representation of the data in the file; see utils::read.csv(). pwrite.csv' returns NULL' invisibly.
metadatatemplate() to help writing metadata files.
learn(), which needs a metadata data-frame or CSV file.
## Save the 'penguins' dataset in a (temporary) file filename <- tempfile(fileext = '.csv') pwrite.csv(penguins, file = filename) ## check first few lines of the raw file writeLines(readLines(filename, n = 10))## Save the 'penguins' dataset in a (temporary) file filename <- tempfile(fileext = '.csv') pwrite.csv(penguins, file = filename) ## check first few lines of the raw file writeLines(readLines(filename, n = 10))
This function calculates the quantiles of posterior probabilities and posterior conditional probabilities. It also outputs the variability of such quantiles if more training data were available.
qPr( p = c(0.25, 0.5, 0.75), Yname, X = NULL, learnt, tails = NULL, priorY = NULL, nsamples = "all", quantiles = c(0.055, 0.5, 0.945), parallel = TRUE, sep = ",", solidus = "|", verbose = FALSE, keepYX = TRUE, tol = .Machine$double.eps * 10 )qPr( p = c(0.25, 0.5, 0.75), Yname, X = NULL, learnt, tails = NULL, priorY = NULL, nsamples = "all", quantiles = c(0.055, 0.5, 0.945), parallel = TRUE, sep = ",", solidus = "|", verbose = FALSE, keepYX = TRUE, tol = .Machine$double.eps * 10 )
p |
Numeric vector of probability levels. Default: |
Yname |
Character vector: name of variate whose quantiles will be computed. |
X |
Matrix or data.table or |
learnt |
Either a character with the name of a directory or full path for a 'learnt.rds' object, produced by the |
tails |
Named vector or list, or |
priorY |
Reserved for use in future versions of the package. |
nsamples |
Integer or |
quantiles |
Numeric vector, between 0 and 1, or |
parallel |
Logical or positive integer or cluster object. |
sep |
character, default |
solidus |
character, default |
verbose |
Logical, default |
keepYX |
Logical, default |
tol |
numeric positive: tolerance in the calculation of quantiles. Default: |
This function calculates the quantiles of or of or combinations thereof, at specified cumulative-probability levels. In other words, it calculates the values of having specified cumulative probabilities or conditional probabilities. It also calculates the variability of those quantiles if more learning data were provided. It is somewhat analogous to the q-variants of R distribution functions, such as stats::qnorm(). The variability can be expressed in the form of quantiles, samples, or both, as in the Pr() function. If several joint values are given for the probability levels and for X, the function creates a 2D grid of results for all possible combinations of the given probability levels and X values. Each variate in the argument X can be specified either as a point-value or as a left-open interval or as a right-open interval , through the argument tails.
A list of the following elements:
values: a matrix with the requested -quantiles p conditional on the requested -values in X, for all combinations of p (rows) and X (columns).
quantiles (possibly NULL): an array with the variability quantiles (3rd dimension of the array) for the quantiles of the value element.
samples (possibly NULL): an array with the variability samples (3rd dimension of the array) for such quantiles.
Y, X: copies of the Y and X arguments.
Porta Mana (2025): What's special about 89% credibility intervals? https://doi.org/10.5281/zenodo.17072199.
learn(), which generates the learnt objects required by qPr().
Pr() to calculate joint and conditional probabilities.
rPr() to generate datapoints.
### WARNING: the following examples, if run, might even take a minute or more. ## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Calculate the 5.5%-, 50%-, and 94.5%-quantiles for the variate "bill lengt", ## that is, the values of "bill length" having such cumulative probabilities quants <- qPr( Yname = 'bill_len', learnt = learnt, parallel = 1 ) ## display the quantile values quants$values ## verify these values using Pr(): probs <- Pr( Y = data.frame(bill_len = c(quants$values)), tails = list(bill_len = -1), learnt = learnt, parallel = 1 ) ## the cumulative probabilities are indeed 0.055, 0.5, 0.945 within numerical error: probs$values ## display the variability about the quantiles quants$quantiles ## ## Example 2: ## Calculate the 5.5%-, 50%-, and 94.5%-quantiles for the variate "bill lengt", ## for the subpopulation of species 'Adelie' quants <- qPr( Yname = 'bill_len', X = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1 ) ## display the quantile values quants$values ## verify these values using Pr(): probs <- Pr( Y = data.frame(bill_len = c(quants$values)), X = data.frame(species = 'Adelie'), tails = list(bill_len = -1), learnt = learnt, parallel = 1) ## the cumulative probabilities are indeed 0.055, 0.5, 0.945 within numerical error: probs$values### WARNING: the following examples, if run, might even take a minute or more. ## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Calculate the 5.5%-, 50%-, and 94.5%-quantiles for the variate "bill lengt", ## that is, the values of "bill length" having such cumulative probabilities quants <- qPr( Yname = 'bill_len', learnt = learnt, parallel = 1 ) ## display the quantile values quants$values ## verify these values using Pr(): probs <- Pr( Y = data.frame(bill_len = c(quants$values)), tails = list(bill_len = -1), learnt = learnt, parallel = 1 ) ## the cumulative probabilities are indeed 0.055, 0.5, 0.945 within numerical error: probs$values ## display the variability about the quantiles quants$quantiles ## ## Example 2: ## Calculate the 5.5%-, 50%-, and 94.5%-quantiles for the variate "bill lengt", ## for the subpopulation of species 'Adelie' quants <- qPr( Yname = 'bill_len', X = data.frame(species = 'Adelie'), learnt = learnt, parallel = 1 ) ## display the quantile values quants$values ## verify these values using Pr(): probs <- Pr( Y = data.frame(bill_len = c(quants$values)), X = data.frame(species = 'Adelie'), tails = list(bill_len = -1), learnt = learnt, parallel = 1) ## the cumulative probabilities are indeed 0.055, 0.5, 0.945 within numerical error: probs$values
This function generates datapoints of chosen joint variates, according to posterior probabilities and posterior conditional probabilities.
rPr( n, Ynames, X = NULL, learnt, tails = NULL, mcsamples = NULL, parallel = NULL )rPr( n, Ynames, X = NULL, learnt, tails = NULL, mcsamples = NULL, parallel = NULL )
n |
Positive integer: number of samples to draw. |
Ynames |
Character vector: names of variates to draw jointly |
X |
List or data.table or |
learnt |
Either a character with the name of a directory or full path for a 'learnt.rds' object, produced by the |
tails |
Named vector or list, or |
mcsamples |
Vector of integers, or |
parallel |
Not used: this function does not use parallelization. |
This function generates datapoints according to the posterior probability or or combinations thereof, for the variates specified in the argument Y, and conditional on the variate values specified in the argument X. It is somewhat analogous to the r-variants of R distribution functions, such as stats::rnorm(). If X is omitted or NULL, then the posterior probability is used. Each variate in the argument X can be specified either as a point-value or as a left-open interval or as a right-open interval , through the argument tails.
A data frame of joint draws of the variates Ynames from the posterior distribution, conditional on X. The row names of the data frame report the Monte Carlo sample (from learn()) used for that draw, and the total number of draws from that sample so far.
learn(), which generates the learnt objects required by qPr().
Pr() to calculate joint and conditional probabilities.
qPr() to calculate quantiles.
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Generate 10 values of the 'species' variate, ## according to the frequency distribution estimated from the data datapoints <- rPr( n = 10, Ynames = 'species', learnt = learnt ) c(datapoints) ## ## Example 2: ## Generate 5 joint values of the 'species' and 'bill_len' variates. datapoints <- rPr( n = 5, Ynames = c('species', 'bill_len'), learnt = learnt ) print(datapoints, row.names = FALSE) ## row names give MCMC information ## ## Example 3: ## Generate 5 values of the 'species' variate, ## for the subpopulation of penguins having bill length shorter than 40 mm datapoints <- rPr( n = 5, Ynames = 'species', X = data.frame(bill_len = 40), tails = list(bill_len = -1), learnt = learnt ) c(datapoints)## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## ## Example 1: ## Generate 10 values of the 'species' variate, ## according to the frequency distribution estimated from the data datapoints <- rPr( n = 10, Ynames = 'species', learnt = learnt ) c(datapoints) ## ## Example 2: ## Generate 5 joint values of the 'species' and 'bill_len' variates. datapoints <- rPr( n = 5, Ynames = c('species', 'bill_len'), learnt = learnt ) print(datapoints, row.names = FALSE) ## row names give MCMC information ## ## Example 3: ## Generate 5 values of the 'species' variate, ## for the subpopulation of penguins having bill length shorter than 40 mm datapoints <- rPr( n = 5, Ynames = 'species', X = data.frame(bill_len = 40), tails = list(bill_len = -1), learnt = learnt ) c(datapoints)
This function creates a set of values for a variate, based on the information from data and metadata stored in a learnt object, created by the learn() function. The set of values depends on the type of variate (nominal or continuous, rounded, and so on, see metadata). The range of values is chosen to include, and extend slightly beyond, the range observed in the data used in the learn() function. Variate domains are always respected.
vrtgrid(vrt, learnt, length.out = 129)vrtgrid(vrt, learnt, length.out = 129)
vrt |
Character: name of the variate, must match one of the names in the |
learnt |
Either a character with the name of a directory or full path for a 'learnt.rds' object, produced by the |
length.out |
Numeric, positive (default 129): number of values to be created; used only for continuous, non-rounded variates (see |
A numeric or character vector of values.
learn(), which generates the learnt objects required by vrtgrid().
Pr() to calculate probabilities and their variability.
plot.probability() to plot probabilities and quantiles calculated by Pr().
## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## set of values for the variate "species"; ## since this variate is of a nominal kind, all values are included valuesSpecies <- vrtgrid(vrt = 'species', learnt = learnt) print(valuesSpecies) ## create a set of values for the variate "bill length"; ## this variate is continuous and rounded, only realistic values are included valuesBill <- vrtgrid(vrt = 'bill_len', learnt = learnt) range(valuesBill) ## let's take a subset of these values, to speed up computation valuesBill <- valuesBill[seq(to = length(valuesBill), length.out = 65)] ## calculate the conditional probabilities for the 'bill_len' values above, ## given the values of 'species' probs <- Pr( Y = data.frame(bill_len = valuesBill), X = data.frame(species = valuesSpecies), learnt = learnt, parallel = 1 ) ## plot the conditional probability distributions, and their variability plot(probs)## Load the example `learnt` object calculated from the "penguins" dataset; ## variates: 'species' and 'bill_len' learnt <- learntExample ## set of values for the variate "species"; ## since this variate is of a nominal kind, all values are included valuesSpecies <- vrtgrid(vrt = 'species', learnt = learnt) print(valuesSpecies) ## create a set of values for the variate "bill length"; ## this variate is continuous and rounded, only realistic values are included valuesBill <- vrtgrid(vrt = 'bill_len', learnt = learnt) range(valuesBill) ## let's take a subset of these values, to speed up computation valuesBill <- valuesBill[seq(to = length(valuesBill), length.out = 65)] ## calculate the conditional probabilities for the 'bill_len' values above, ## given the values of 'species' probs <- Pr( Y = data.frame(bill_len = valuesBill), X = data.frame(species = valuesSpecies), learnt = learnt, parallel = 1 ) ## plot the conditional probability distributions, and their variability plot(probs)