13. rpy2

13.1. Basic

Import rpy2 and check its version.

[1]:

import rpy2

print(rpy2.__version__)

3.4.5

Get the rpy2 environment.

[2]:

import rpy2.situation

for row in rpy2.situation.iter_info():
    print(row)

rpy2 version:
3.4.5
Python version:
3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ]
Looking for R's HOME:
    Environment variable R_HOME: None
    Calling `R RHOME`: /Library/Frameworks/R.framework/Resources
    Environment variable R_LIBS_USER: None
R's additions to LD_LIBRARY_PATH:

R version:
    In the PATH: R version 4.1.1 (2021-08-10) -- "Kick Things"
    Loading R library from rpy2: OK
Additional directories to load R packages from:
None
C extension compilation:
  include:
  ['/Library/Frameworks/R.framework/Resources/include']
  libraries:
  ['pcre2-8', 'lzma', 'bz2', 'z', 'icucore', 'dl', 'm', 'iconv']
  library_dirs:
  ['/usr/local/lib', '/usr/local/lib']
  extra_compile_args:
  []
  extra_link_args:
  ['-F/Library/Frameworks/R.framework/..', '-framework', 'R']

Initializing an embedded R environment.

[3]:

import rpy2.robjects as robjects

Importing packages.

[4]:

from rpy2.robjects.packages import importr

base = importr('base')

Install packages.

[5]:

import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector

utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)

packages = [p for p in ('ggplot2', 'hexbin') if not rpackages.isinstalled(p)]

if len(packages) > 0:
    utils.install_packages(StrVector(packages))

13.2. r instance

The r field on rpy2.robjects may be used to communicate with R. In this example, we retrieve pi from R. Note how the results is a vector.

[6]:

robjects.r['pi']

[6]:

FloatVector with 1 elements.

3.141593

We can also treat r like a function passing in expressions.

[7]:

robjects.r('pi')

[7]:

FloatVector with 1 elements.

3.141593

Or how about a script?

[8]:

script = '''
3 + 4
'''

robjects.r(script)

[8]:

FloatVector with 1 elements.

7.000000

We can also create functions and then retrieve them.

[9]:

script = '''
addThem <- function(a, b) {
    a + b
}
'''

_ = robjects.r(script)

Retrieving the function is done through the globalenv field. Below, we use the function r_repr() to get the string representation of the function (which is exactly what we coded).

[10]:

robjects.globalenv['addThem'].r_repr()

[10]:

'function (a, b) \n{\n    a + b\n}'

The R function can be retrieved and invoked.

[11]:

add_them = robjects.globalenv['addThem']
add_them(1, 2)

[11]:

IntVector with 1 elements.

3

[12]:

robjects.r['sum'](robjects.IntVector([1,2,3]))

[12]:

IntVector with 1 elements.

6

[13]:

robjects.r['sum'](robjects.FloatVector([1,2,3]))

[13]:

FloatVector with 1 elements.

6.000000

[14]:

m = robjects.r['matrix'](robjects.FloatVector([1.1, 2.2, 3.3, 4.4, 5.5, 6.6]), nrow = 2)
print(m)

     [,1] [,2] [,3]
[1,]  1.1  3.3  5.5
[2,]  2.2  4.4  6.6

13.3. Dataframes

Let’s create a Pandas dataframe.

[15]:

import pandas as pd

pdf = pd.DataFrame({'x1': [8, 8, 2, 3], 'x2': [9, 8, 1, 1]})
pdf

[15]:

	x1	x2
0	8	9
1	8	8
2	2	1
3	3	1

Pandas -> R: We can convert the Pandas dataframe to a R dataframe.

[16]:

from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri
import rpy2.robjects as ro

with localconverter(ro.default_converter + pandas2ri.converter):
    rdf = ro.conversion.py2rpy(pdf)

rdf.r_repr()

[16]:

'structure(list(x1 = c(8L, 8L, 2L, 3L), x2 = c(9L, 8L, 1L, 1L)), class = "data.frame", row.names = c("0", \n"1", "2", "3"))'

R -> Pandas: And we can also convert an R dataframe to a Pandas dataframe.

[17]:

with localconverter(ro.default_converter + pandas2ri.converter):
    temp_df = ro.conversion.rpy2py(rdf)

temp_df

[17]:

	x1	x2
0	8	9
1	8	8
2	2	1
3	3	1

If we create a dataframe in R, we can also convert it to a Pandas one.

[18]:

script = '''
df <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE)
)
'''
rdf = robjects.r(script)

with localconverter(ro.default_converter + pandas2ri.converter):
    temp_df = ro.conversion.rpy2py(rdf)

temp_df

[18]:

	age	grade	name	male
1	18.0	A	Jane	0
2	16.0	B	Jack	1
3	15.0	C	Joe	1

The context manager can also auto-convert Pandas dataframe to a R one when calling functions.

[19]:

with localconverter(ro.default_converter + pandas2ri.converter):
    summary_df = base.summary(pdf)

print(summary_df)

       x1             x2
 Min.   :2.00   Min.   :1.00
 1st Qu.:2.75   1st Qu.:1.00
 Median :5.50   Median :4.50
 Mean   :5.25   Mean   :4.75
 3rd Qu.:8.00   3rd Qu.:8.25
 Max.   :8.00   Max.   :9.00

The summary result is stored as a StrMatrix, and we can kludge a solution to parse its elements into a Pandas dataframe.

[20]:

import numpy as np

pd.DataFrame(
    np.array([float(s[s.index(':')+1:].strip()) for s in summary_df]).reshape(summary_df.nrow, summary_df.ncol),
    columns=[n.strip() for n in summary_df.colnames]
)

[20]:

	x1	x2
0	2.00	2.75
1	5.50	5.25
2	8.00	8.00
3	1.00	1.00
4	4.50	4.75
5	8.25	9.00

13.4. Modeling

Results from modeling in R can also be captured. This example is a toy dataset where we want to apply linear regression. We might choose R over Python, since the former provides more information on the model.

[21]:

pdf = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [5, 4, 3, 2, 1],
    'y': [10, 20, 30, 40, 55]
})

with localconverter(ro.default_converter + pandas2ri.converter):
    rdf = ro.conversion.py2rpy(pdf)

[22]:

stats = importr('stats')
r = stats.lm('y ~ .', data=rdf)

[23]:

print(base.summary(r))


Call:
(function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
{
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "weights", "na.action",
        "offset"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    if (method == "model.frame")
        return(mf)
    else if (method != "qr")
        warning(gettextf("method = '%s' is not supported. Using 'qr'",
            method), domain = NA)
    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    if (!is.null(w) && !is.numeric(w))
        stop("'weights' must be a numeric vector")
    offset <- model.offset(mf)
    mlm <- is.matrix(y)
    ny <- if (mlm)
        nrow(y)
    else length(y)
    if (!is.null(offset)) {
        if (!mlm)
            offset <- as.vector(offset)
        if (NROW(offset) != ny)
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                NROW(offset), ny), domain = NA)
    }
    if (is.empty.model(mt)) {
        x <- NULL
        z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
            ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
            y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
            0) else ny)
        if (!is.null(offset)) {
            z$fitted.values <- offset
            z$residuals <- y - offset
        }
    }
    else {
        x <- model.matrix(mt, mf, contrasts)
        z <- if (is.null(w))
            lm.fit(x, y, offset = offset, singular.ok = singular.ok,
                ...)
        else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
            ...)
    }
    class(z) <- c(if (mlm) "mlm", "lm")
    z$na.action <- attr(mf, "na.action")
    z$offset <- offset
    z$contrasts <- attr(x, "contrasts")
    z$xlevels <- .getXlevels(mt, mf)
    z$call <- cl
    z$terms <- mt
    if (model)
        z$model <- mf
    if (ret.x)
        z$x <- x
    if (ret.y)
        z$y <- y
    if (!qr)
        z$qr <- NULL
    z
})(formula = "y ~ .", data = structure(list(x1 = 1:5, x2 = 5:1,
    y = c(10L, 20L, 30L, 40L, 55L)), class = "data.frame", row.names = c("0",
"1", "2", "3", "4")))

Residuals:
         0          1          2          3          4
 1.000e+00  2.276e-15 -1.000e+00 -2.000e+00  2.000e+00

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0000     1.9149  -1.044 0.373021
x1           11.0000     0.5774  19.053 0.000316 ***
x2                NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.826 on 3 degrees of freedom
Multiple R-squared:  0.9918,    Adjusted R-squared:  0.9891
F-statistic:   363 on 1 and 3 DF,  p-value: 0.0003157

[24]:

print(r.rclass)

<rpy2.rinterface_lib.sexp.StrSexpVector object at 0x7faf56a54e40> [RTYPES.STRSXP]

[25]:

print(r.names)

 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

[26]:

print(r.rx2('coefficients'))

(Intercept)          x1          x2
         -2          11          NA

[27]:

pd.Series([c for c in r.rx2('coefficients')], r.rx2('coefficients').names)

[27]:

(Intercept)    -2.0
x1             11.0
x2              NaN
dtype: float64