A package to support Rscript files

R
{scribe}
Published March 13, 2023

I’m excited to be finalizing release preparations of {scribe}. This package supports writing your own Rscript files and executing them through a terminal.

We’ll start with a simple example. For most of these, I’ll be using the direct R interface. However, this package is best used with a shebang script [1].

[1] I’m pretty sure this is pronounced like sha-bang because it’s a hash (#) and bang (!). But I used to think she-bang, which conjures The Stone Roses’ She Bangs the Drums, a pleasant ear-worm. I think it also works to just shout octothorpe!

Keep in mind that command_args() doesn’t need an explicit input, and when used with Rscript will automatically capture command line arguments.

    library(scribe)
    ca <- command_args(string = "-a 1 -b 0")
    ca$add_argument("-a", default = 0L)
    ca$add_argument("-b", default = 0L)
    args <- ca$parse()
    args$a + args$b
    #> [1] 1
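    # Note: the input here is a single string. set_input() does not split it the
    # way command_args(string = ) does, so nothing is parsed and both defaults (0)
    # are used.
    ca$set_input(c("-a 10 -b 10"))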
    args <- ca$parse()
    #> Warning: Not all values parsed:
    #> -a 10 -b 10
    args$a + args$b
    #> [1] 0
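
To make the shebang idea concrete, here is a minimal sketch of the same addition written as a standalone script. The file name add.R is made up for illustration. The calls are the same ones used above, except that command_args() is given no input, so it should pick up whatever is passed on the command line.

```r
#!/usr/bin/env -S Rscript --vanilla
# add.R (hypothetical file name, used only for this sketch)

library(scribe)

# No explicit input: under Rscript, the command line arguments are captured
ca <- command_args()
ca$add_argument("-a", default = 0L)
ca$add_argument("-b", default = 0L)
args <- ca$parse()

# Print the sum explicitly rather than relying on auto-printing
cat(args$a + args$b, "\n")
```

After making the file executable, a call such as ./add.R -a 1 -b 2 should print the sum of the two values.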

    That’s a little easy, so maybe we can make something a bit more interesting.

    First we’ll write a small modeling function. It isn’t meant to be complete; it just gives us a few arguments to play with.

    my_model <- function(
        data = c("penguins", "mtcars", "sat.act"), 
        y, 
        x = NA, 
        family = "gaussian", 
        correlation = FALSE
      ) {
      data <- match.arg(data)
      
      data <- switch(
        data,
        penguins = palmerpenguins::penguins,
        mtcars = datasets::mtcars,
        sat.act = transform(
          psych::sat.act, 
          gender = as.integer(gender == 1)
        )
      )
      
      if (isTRUE(is.na(x))) {
        x <- setdiff(colnames(data), y)
      }
      
      data <- data[, c(y, x)]
      form <- stats::DF2formula(data)
      mod <- stats::glm(form, data = data, family = family)
      summary(mod, correlation = correlation)
    }

    Now that we have that, we can set up the command args to parse what our string inputs are.

    # we'll pass arguments after
    ca <- command_args()
    ca$add_description("run a quick model")
    ca$add_argument(
      "data",
      default = "penguins",
      info = "a dataset to view"
    )
    ca$add_argument("y", info = "value to predict")
    ca$add_argument("x", default = NA, info = "variables")
    ca$add_argument(
      "--family",
      default = "gaussian",
      info = "error distribution, link function"
    )
    ca$add_argument(
      "--correlation",
      action = "flag",
      info = "when set, returns the correlation matrix"
    )
    ca$add_example("my-model.R penguins body_mass_g")
    ca$add_example("my-model.R mtcars mpg --correlation")

    A --help argument is added to the scribeCommandArgs object by default. When --help is found in the command line arguments, the script will try to exit, returning only the help information.

    options(scribe.interactive = TRUE)
    ca$set_input("--help")
    ca$parse()
    #> {scribe} command_args
    #> 
    #> file : /home/jordan/github/quarto-cli/src/resources/rmd/rmd.R
    #> 
    #> DESCRIPTION
    #>   run a quick model
    #> 
    #> USAGE
    #>   rmd.R [--help | --version]
    #>   rmd.R [data [ARG]] [y [ARG]] [x [ARG]] [--family [ARG]] [--correlation, --no-correlation] 
    #> 
    #> ARGUMENTS
    #>   --help                          : prints this and quietly exits                   
    #>   --version                       : prints the version of {scribe} and quietly exits
    #>   data [ARG]                      : a dataset to view                               
    #>   y [ARG]                         : value to predict                                
    #>   x [ARG]                         : variables                                       
    #>   --family [ARG]                  : error distribution, link function               
    #>   --correlation, --no-correlation : when set, returns the correlation matrix        
    #> 
    #> EXAMPLES
    #>   $ my-model.R penguins body_mass_g    
    #>   $ my-model.R mtcars mpg --correlation

    Let’s simulate a few examples:

    my-model.R penguins body_mass_g

    ca$set_input(c("penguins", "body_mass_g"))
    do.call(my_model, ca$parse())
    #> 
    #> Call:
    #> stats::glm(formula = form, family = family, data = data)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -809.70  -180.87    -6.25   176.76   864.22  
    #> 
    #> Coefficients:
    #>                    Estimate Std. Error t value Pr(>|t|)    
    #> (Intercept)       84087.945  41912.019   2.006  0.04566 *  
    #> speciesChinstrap   -282.539     88.790  -3.182  0.00160 ** 
    #> speciesGentoo       890.958    144.563   6.163 2.12e-09 ***
    #> islandDream         -21.180     58.390  -0.363  0.71704    
    #> islandTorgersen     -58.777     60.852  -0.966  0.33482    
    #> bill_length_mm       18.964      7.112   2.667  0.00805 ** 
    #> bill_depth_mm        60.798     20.002   3.040  0.00256 ** 
    #> flipper_length_mm    18.504      3.128   5.915 8.46e-09 ***
    #> sexmale             378.977     48.074   7.883 4.95e-14 ***
    #> year                -42.785     20.949  -2.042  0.04194 *  
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for gaussian family taken to be 82096.03)
    #> 
    #>     Null deviance: 215259666  on 332  degrees of freedom
    #> Residual deviance:  26517018  on 323  degrees of freedom
    #>   (11 observations deleted due to missingness)
    #> AIC: 4725
    #> 
    #> Number of Fisher Scoring iterations: 2

    my-model.R mtcars mpg --correlation

    ca$set_input(c("mtcars", "mpg", "--correlation"))
    do.call(my_model, ca$parse())
    #> 
    #> Call:
    #> stats::glm(formula = form, family = family, data = data)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -3.4506  -1.6044  -0.1196   1.2193   4.6271  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)  
    #> (Intercept) 12.30337   18.71788   0.657   0.5181  
    #> cyl         -0.11144    1.04502  -0.107   0.9161  
    #> disp         0.01334    0.01786   0.747   0.4635  
    #> hp          -0.02148    0.02177  -0.987   0.3350  
    #> drat         0.78711    1.63537   0.481   0.6353  
    #> wt          -3.71530    1.89441  -1.961   0.0633 .
    #> qsec         0.82104    0.73084   1.123   0.2739  
    #> vs           0.31776    2.10451   0.151   0.8814  
    #> am           2.52023    2.05665   1.225   0.2340  
    #> gear         0.65541    1.49326   0.439   0.6652  
    #> carb        -0.19942    0.82875  -0.241   0.8122  
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for gaussian family taken to be 7.023544)
    #> 
    #>     Null deviance: 1126.05  on 31  degrees of freedom
    #> Residual deviance:  147.49  on 21  degrees of freedom
    #> AIC: 163.71
    #> 
    #> Number of Fisher Scoring iterations: 2
    #> 
    #> Correlation of Coefficients:
    #>      (Intercept) cyl   disp  hp    drat  wt    qsec  vs    am    gear 
    #> cyl  -0.67                                                            
    #> disp -0.02       -0.27                                                
    #> hp   -0.07       -0.18 -0.52                                          
    #> drat -0.42        0.28 -0.12  0.09                                    
    #> wt    0.09        0.11 -0.77  0.24  0.17                              
    #> qsec -0.77        0.27  0.29  0.11  0.04 -0.51                        
    #> vs    0.09        0.32  0.10 -0.27 -0.03  0.08 -0.37                  
    #> am   -0.23        0.26  0.03 -0.05 -0.16  0.09  0.27  0.21            
    #> gear -0.41        0.35 -0.08 -0.09 -0.07  0.18  0.08 -0.04 -0.31      
    #> carb  0.12       -0.23  0.67 -0.53 -0.21 -0.70  0.27  0.09  0.06 -0.42

    my-model.R sat.act gender --family binomial --correlation

    ca$set_input(c("sat.act", "gender", "--family", "binomial", "--correlation"))
    do.call(my_model, ca$parse())
    #> 
    #> Call:
    #> stats::glm(formula = form, family = family, data = data)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -1.6500  -0.9385  -0.7658   1.2356   2.0129  
    #> 
    #> Coefficients:
    #>              Estimate Std. Error z value Pr(>|z|)    
    #> (Intercept) -1.804944   0.587445  -3.073  0.00212 ** 
    #> education   -0.220411   0.069023  -3.193  0.00141 ** 
    #> age          0.024923   0.010339   2.411  0.01593 *  
    #> ACT         -0.019895   0.022941  -0.867  0.38582    
    #> SATV        -0.002496   0.001026  -2.434  0.01493 *  
    #> SATQ         0.005462   0.001069   5.110 3.22e-07 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 895.09  on 686  degrees of freedom
    #> Residual deviance: 854.46  on 681  degrees of freedom
    #>   (13 observations deleted due to missingness)
    #> AIC: 866.46
    #> 
    #> Number of Fisher Scoring iterations: 4
    #> 
    #> Correlation of Coefficients:
    #>           (Intercept) education age   ACT   SATV 
    #> education -0.01                                  
    #> age       -0.28       -0.55                      
    #> ACT       -0.28       -0.09     -0.11            
    #> SATV      -0.25        0.01      0.07 -0.30      
    #> SATQ      -0.24       -0.01      0.06 -0.38 -0.46
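
Each of the calls above hands the result of ca$parse() straight to do.call(). That works because the parsed values come back as a named list whose names line up with the arguments of my_model(). The sketch below shows only that mechanism in base R: parsed is a hand-written stand-in with an assumed shape (not actual {scribe} output), and describe() is a hypothetical toy function.

```r
# Hand-written stand-in for a parsed argument list (assumed shape,
# not produced by {scribe})
parsed <- list(
  data = "mtcars",
  y = "mpg",
  x = NA,
  family = "gaussian",
  correlation = TRUE
)

# Hypothetical toy function with the same argument names as my_model()
describe <- function(data, y, x = NA, family = "gaussian", correlation = FALSE) {
  sprintf("%s ~ . on %s (%s, correlation = %s)", y, data, family, correlation)
}

# do.call() matches list names to function arguments
result <- do.call(describe, parsed)
print(result)
#> [1] "mpg ~ . on mtcars (gaussian, correlation = TRUE)"
```

Because matching is by name, the order in which the arguments were declared on the command line does not need to match the order of the function's formals.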

    If I needed this, maybe it would make sense to be able to read the data from a file path, then execute something like:

    my-model.R data/example.csv response
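
One way that could work is to resolve the data argument as a file path first and fall back to the known dataset names otherwise. This is only a sketch in base R: load_data() is a hypothetical helper (not part of {scribe}), it assumes CSV input, and it keeps just one named dataset to stay short.

```r
# Hypothetical helper: treat `data` as a CSV path if the file exists,
# otherwise look it up among the known dataset names
load_data <- function(data) {
  if (file.exists(data)) {
    return(utils::read.csv(data))
  }

  switch(
    data,
    mtcars = datasets::mtcars,
    stop("unknown dataset or missing file: ", data, call. = FALSE)
  )
}

# Round trip through a temporary file to exercise the path branch
path <- tempfile(fileext = ".csv")
utils::write.csv(datasets::mtcars, path, row.names = FALSE)
df <- load_data(path)
dim(df)
#> [1] 32 11
```

Inside my_model(), a helper like this could replace the match.arg() and switch() steps, at the cost of losing the fixed set of choices.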

    For a more realistic example, I’ll use a trimmed-down version of a {pak} CLI utility I’ve been using a lot. I really like using Python’s pip and wanted something similar for R. {pak} is fantastic and highly recommended.

    So, to make our own little command line utility, we only need a few lines:

    #!/usr/bin/env -S Rscript --vanilla
    
    library(scribe)
    ca <- command_args()
    ca$add_argument("pkg", action = "dots", default = "local::.")
    ca$add_argument("-d", "--dependencies", action = "list", default = TRUE)
    args <- ca$parse()
    do.call(pak::pak, args)

    Now, I can install packages nicely in a terminal:

    pak github::jmbarbone/mark -d
    pak dplyr dbplyr dtplyr