A package to support Rscript files

R
{scribe}
Published

March 13, 2023

I’m excited to be finalizing release preparations for {scribe}. This package supports writing your own Rscript files and running them from a terminal.

We’ll start with a simple example. For most of these, I’ll be using the direct R interface. However, this package is best used with a shebang script.[1]

[1] I’m pretty sure this is pronounced like sha-bang because it’s a hash (#) and bang (!). But I used to think she-bang, which conjures The Stone Roses’ She Bangs the Drums, a pleasant ear-worm. I think it also works to just shout octothorpe!

Keep in mind that command_args() doesn’t need an explicit input, and when used with Rscript will automatically capture command line arguments.

library(scribe)
ca <- command_args(string = "-a 1 -b 0")
ca$add_argument("-a", default = 0L)
ca$add_argument("-b", default = 0L)
args <- ca$parse()
args$a + args$b
#> [1] 1
# set_input() takes one element per argument; a single string is not split
ca$set_input(c("-a", "10", "-b", "10"))
args <- ca$parse()
args$a + args$b
#> [1] 20
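
One detail worth a quick sketch: a shell hands Rscript its arguments already split into separate tokens, and the set_input() calls later in this post pass one element per token as well. The base R snippet below is not part of {scribe}; the whitespace split is my own assumption about what a shell does with simple input like this. It only shows the difference between a single string and a tokenized vector.

```r
# One string holding every argument: a character vector of length 1
one_string <- "-a 10 -b 10"

# Split on single spaces to mimic what a shell does for simple input
# (assumption: no quoting or repeated spaces to worry about)
tokens <- strsplit(one_string, " ", fixed = TRUE)[[1]]

length(one_string)
#> [1] 1
tokens
#> [1] "-a" "10" "-b" "10"
```

The tokenized form, c("-a", "10", "-b", "10"), is the shape I would expect set_input() to want.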

That’s a little easy, so maybe we can make something a bit more interesting.

First we’ll make ourselves a little modeling function. It isn’t meant to be complete; it just gives us a few options to play with.

my_model <- function(
    data = c("penguins", "mtcars", "sat.act"),
    y,
    x = NA,
    family = "gaussian",
    correlation = FALSE
  ) {
  # restrict data to one of the known dataset names
  data <- match.arg(data)

  data <- switch(
    data,
    penguins = palmerpenguins::penguins,
    mtcars = datasets::mtcars,
    # recode gender to 0/1 so it can serve as a binomial response
    sat.act = transform(
      psych::sat.act,
      gender = as.integer(gender == 1)
    )
  )

  # when no predictors are given, use every column except the response
  if (isTRUE(is.na(x))) {
    x <- setdiff(colnames(data), y)
  }

  # response first, so DF2formula() builds y ~ x1 + x2 + ...
  data <- data[, c(y, x)]
  form <- stats::DF2formula(data)
  mod <- stats::glm(form, data = data, family = family)
  summary(mod, correlation = correlation)
}
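
The column reordering near the end of my_model() does real work: stats::DF2formula() builds a formula from a data frame by treating the first column as the response and the remaining columns as predictors. Here is a small sketch with mtcars; the column choice is only for illustration.

```r
# Put the response first, then the predictors, as my_model() does
y <- "mpg"
x <- c("wt", "hp")
df <- datasets::mtcars[, c(y, x)]

# First column becomes the left-hand side, the rest the right-hand side
form <- stats::DF2formula(df)
all.vars(form)
#> [1] "mpg" "wt"  "hp"

# The same fallback my_model() uses when x is left as NA
x_default <- setdiff(colnames(datasets::mtcars), y)
length(x_default)
#> [1] 10
```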

With the function in place, we can set up the command arguments that will parse our string inputs.

# no input yet; we'll supply it later with ca$set_input()
ca <- command_args()
ca$add_description("run a quick model")
ca$add_argument(
  "data",
  default = "penguins",
  info = "a dataset to view"
)
ca$add_argument("y", info = "value to predict")
ca$add_argument("x", default = NA, info = "variables")
ca$add_argument(
  "--family",
  default = "gaussian",
  info = "error distribution, link function"
)
ca$add_argument(
  "--correlation",
  action = "flag",
  info = "when set, returns the correlation matrix"
)
ca$add_example("my-model.R penguins body_mass_g")
ca$add_example("my-model.R mtcars mpg --correlation")

A default --help argument is added to the scribeCommandArgs object. When --help is found in the command line arguments, the script prints the help information and then tries to exit without doing anything else.

options(scribe.interactive = TRUE)
ca$set_input("--help")
ca$parse()
#> {scribe} command_args
#> 
#> file : /home/jordan/github/quarto-cli/src/resources/rmd/rmd.R
#> 
#> DESCRIPTION
#>   run a quick model
#> 
#> USAGE
#>   rmd.R [--help | --version]
#>   rmd.R [data [ARG]] [y [ARG]] [x [ARG]] [--family [ARG]] [--correlation, --no-correlation] 
#> 
#> ARGUMENTS
#>   --help                          : prints this and quietly exits                   
#>   --version                       : prints the version of {scribe} and quietly exits
#>   data [ARG]                      : a dataset to view                               
#>   y [ARG]                         : value to predict                                
#>   x [ARG]                         : variables                                       
#>   --family [ARG]                  : error distribution, link function               
#>   --correlation, --no-correlation : when set, returns the correlation matrix        
#> 
#> EXAMPLES
#>   $ my-model.R penguins body_mass_g    
#>   $ my-model.R mtcars mpg --correlation

Let’s simulate a few examples:

my-model.R penguins body_mass_g
ca$set_input(c("penguins", "body_mass_g"))
do.call(my_model, ca$parse())
#> 
#> Call:
#> stats::glm(formula = form, family = family, data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -809.70  -180.87    -6.25   176.76   864.22  
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       84087.945  41912.019   2.006  0.04566 *  
#> speciesChinstrap   -282.539     88.790  -3.182  0.00160 ** 
#> speciesGentoo       890.958    144.563   6.163 2.12e-09 ***
#> islandDream         -21.180     58.390  -0.363  0.71704    
#> islandTorgersen     -58.777     60.852  -0.966  0.33482    
#> bill_length_mm       18.964      7.112   2.667  0.00805 ** 
#> bill_depth_mm        60.798     20.002   3.040  0.00256 ** 
#> flipper_length_mm    18.504      3.128   5.915 8.46e-09 ***
#> sexmale             378.977     48.074   7.883 4.95e-14 ***
#> year                -42.785     20.949  -2.042  0.04194 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 82096.03)
#> 
#>     Null deviance: 215259666  on 332  degrees of freedom
#> Residual deviance:  26517018  on 323  degrees of freedom
#>   (11 observations deleted due to missingness)
#> AIC: 4725
#> 
#> Number of Fisher Scoring iterations: 2
my-model.R mtcars mpg --correlation
ca$set_input(c("mtcars", "mpg", "--correlation"))
do.call(my_model, ca$parse())
#> 
#> Call:
#> stats::glm(formula = form, family = family, data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.4506  -1.6044  -0.1196   1.2193   4.6271  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) 12.30337   18.71788   0.657   0.5181  
#> cyl         -0.11144    1.04502  -0.107   0.9161  
#> disp         0.01334    0.01786   0.747   0.4635  
#> hp          -0.02148    0.02177  -0.987   0.3350  
#> drat         0.78711    1.63537   0.481   0.6353  
#> wt          -3.71530    1.89441  -1.961   0.0633 .
#> qsec         0.82104    0.73084   1.123   0.2739  
#> vs           0.31776    2.10451   0.151   0.8814  
#> am           2.52023    2.05665   1.225   0.2340  
#> gear         0.65541    1.49326   0.439   0.6652  
#> carb        -0.19942    0.82875  -0.241   0.8122  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 7.023544)
#> 
#>     Null deviance: 1126.05  on 31  degrees of freedom
#> Residual deviance:  147.49  on 21  degrees of freedom
#> AIC: 163.71
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> Correlation of Coefficients:
#>      (Intercept) cyl   disp  hp    drat  wt    qsec  vs    am    gear 
#> cyl  -0.67                                                            
#> disp -0.02       -0.27                                                
#> hp   -0.07       -0.18 -0.52                                          
#> drat -0.42        0.28 -0.12  0.09                                    
#> wt    0.09        0.11 -0.77  0.24  0.17                              
#> qsec -0.77        0.27  0.29  0.11  0.04 -0.51                        
#> vs    0.09        0.32  0.10 -0.27 -0.03  0.08 -0.37                  
#> am   -0.23        0.26  0.03 -0.05 -0.16  0.09  0.27  0.21            
#> gear -0.41        0.35 -0.08 -0.09 -0.07  0.18  0.08 -0.04 -0.31      
#> carb  0.12       -0.23  0.67 -0.53 -0.21 -0.70  0.27  0.09  0.06 -0.42
my-model.R sat.act gender --family binomial --correlation
ca$set_input(c("sat.act", "gender", "--family", "binomial", "--correlation"))
do.call(my_model, ca$parse())
#> 
#> Call:
#> stats::glm(formula = form, family = family, data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.6500  -0.9385  -0.7658   1.2356   2.0129  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.804944   0.587445  -3.073  0.00212 ** 
#> education   -0.220411   0.069023  -3.193  0.00141 ** 
#> age          0.024923   0.010339   2.411  0.01593 *  
#> ACT         -0.019895   0.022941  -0.867  0.38582    
#> SATV        -0.002496   0.001026  -2.434  0.01493 *  
#> SATQ         0.005462   0.001069   5.110 3.22e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 895.09  on 686  degrees of freedom
#> Residual deviance: 854.46  on 681  degrees of freedom
#>   (13 observations deleted due to missingness)
#> AIC: 866.46
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> Correlation of Coefficients:
#>           (Intercept) education age   ACT   SATV 
#> education -0.01                                  
#> age       -0.28       -0.55                      
#> ACT       -0.28       -0.09     -0.11            
#> SATV      -0.25        0.01      0.07 -0.30      
#> SATQ      -0.24       -0.01      0.06 -0.38 -0.46
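
The do.call() pattern in these examples works because the parsed arguments come back as a list whose names line up with the function’s parameters. Here is a base R sketch of the same idea, using a hand-written list in place of ca$parse() and a hypothetical describe_run() function in place of my_model():

```r
# Hypothetical stand-in for my_model(): just reports what it received
describe_run <- function(data, y, family = "gaussian", correlation = FALSE) {
  sprintf(
    "data=%s y=%s family=%s correlation=%s",
    data, y, family, correlation
  )
}

# Hand-written stand-in for the list that ca$parse() would return
args <- list(data = "mtcars", y = "mpg", correlation = TRUE)

# do.call() matches list names to parameter names; family keeps its default
result <- do.call(describe_run, args)
result
#> [1] "data=mtcars y=mpg family=gaussian correlation=TRUE"
```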

If I needed this for real work, it would probably make sense to read the data from a file path and then run something like:

my-model.R data/example.csv response
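
A minimal sketch of that idea, assuming a hypothetical load_data() helper that is not part of my_model() above. It treats its argument as a CSV path when a file by that name exists, and otherwise falls back to a built-in dataset name. Only mtcars is wired up here, to keep the example free of extra packages.

```r
# Hypothetical helper: resolve a dataset name or a CSV file path
load_data <- function(data) {
  if (file.exists(data)) {
    # Treat the argument as a path to a CSV file
    return(utils::read.csv(data))
  }

  # Otherwise fall back to a known dataset name
  switch(
    data,
    mtcars = datasets::mtcars,
    stop("unknown dataset or missing file: ", data, call. = FALSE)
  )
}

# Write a small CSV to a temporary file and read it back
path <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(response = c(1, 2, 3), predictor = c(2, 4, 7)),
  path,
  row.names = FALSE
)

from_file <- load_data(path)
colnames(from_file)
#> [1] "response"  "predictor"
```

Inside my_model(), a call like this could take the place of the switch() on data.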

For a more realistic example, I’ll use a trimmed-down version of a {pak} CLI utility I’ve been using a lot. I really like Python’s pip and wanted something similar for R. {pak} is fantastic and highly recommended.

So, to make our own little command line utility, we just need a shebang line and a few lines of {scribe} setup:

#!/usr/bin/env -S Rscript --vanilla

library(scribe)
ca <- command_args()
ca$add_argument("pkg", action = "dots", default = "local::.")
ca$add_argument("-d", "--dependencies", action = "list", default = TRUE)
args <- ca$parse()
do.call(pak::pak, args)

Now, I can install packages nicely in a terminal:

pak github::jmbarbone/mark -d
pak dplyr dbplyr dtplyr