13. rpy2
13.1. Basic
Import rpy2 and check its version.
[1]:
import rpy2
print(rpy2.__version__)
3.4.5
Get the rpy2 environment.
[2]:
import rpy2.situation
for row in rpy2.situation.iter_info():
    print(row)
rpy2 version:
3.4.5
Python version:
3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ]
Looking for R's HOME:
Environment variable R_HOME: None
Calling `R RHOME`: /Library/Frameworks/R.framework/Resources
Environment variable R_LIBS_USER: None
R's additions to LD_LIBRARY_PATH:
R version:
In the PATH: R version 4.1.1 (2021-08-10) -- "Kick Things"
Loading R library from rpy2: OK
Additional directories to load R packages from:
None
C extension compilation:
include:
['/Library/Frameworks/R.framework/Resources/include']
libraries:
['pcre2-8', 'lzma', 'bz2', 'z', 'icucore', 'dl', 'm', 'iconv']
library_dirs:
['/usr/local/lib', '/usr/local/lib']
extra_compile_args:
[]
extra_link_args:
['-F/Library/Frameworks/R.framework/..', '-framework', 'R']
Initializing an embedded R environment.
[3]:
import rpy2.robjects as robjects
Importing packages.
[4]:
from rpy2.robjects.packages import importr
base = importr('base')
Install packages.
[5]:
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
packages = [p for p in ('ggplot2', 'hexbin') if not rpackages.isinstalled(p)]
if len(packages) > 0:
    utils.install_packages(StrVector(packages))
13.2. r instance
The r field on rpy2.robjects may be used to communicate with R. In this example, we retrieve pi from R. Note how the result is a vector.
[6]:
robjects.r['pi']
[6]:
3.141593 |
We can also treat r like a function and pass it R expressions as strings.
[7]:
robjects.r('pi')
[7]:
3.141593 |
Or how about a script?
[8]:
script = '''
3 + 4
'''
robjects.r(script)
[8]:
7.000000 |
We can also create functions and then retrieve them.
[9]:
script = '''
addThem <- function(a, b) {
    a + b
}
'''
_ = robjects.r(script)
Retrieving the function is done through the globalenv field. Below, we use the function r_repr() to get the string representation of the function, which is R's deparsed form of what we coded.
[10]:
robjects.globalenv['addThem'].r_repr()
[10]:
'function (a, b) \n{\n a + b\n}'
The R function can be retrieved and invoked.
[11]:
add_them = robjects.globalenv['addThem']
add_them(1, 2)
[11]:
3 |
[12]:
robjects.r['sum'](robjects.IntVector([1,2,3]))
[12]:
6 |
[13]:
robjects.r['sum'](robjects.FloatVector([1,2,3]))
[13]:
6.000000 |
[14]:
m = robjects.r['matrix'](robjects.FloatVector([1.1, 2.2, 3.3, 4.4, 5.5, 6.6]), nrow = 2)
print(m)
[,1] [,2] [,3]
[1,] 1.1 3.3 5.5
[2,] 2.2 4.4 6.6
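The printed matrix shows that R's matrix() fills values column by column (column-major order) by default. The following is a plain-Python sketch of that fill order with no rpy2 involved; the helper name fill_column_major is hypothetical and exists only for illustration, and unlike R it rejects inputs whose length is not a multiple of nrow instead of recycling values.

```python
def fill_column_major(values, nrow):
    """Arrange a flat list into rows the way R's matrix() does by default."""
    if len(values) % nrow != 0:
        raise ValueError("number of values must be a multiple of nrow")
    ncol = len(values) // nrow
    # Element (i, j) of the matrix is values[j * nrow + i].
    return [[values[j * nrow + i] for j in range(ncol)] for i in range(nrow)]


matrix = fill_column_major([1.1, 2.2, 3.3, 4.4, 5.5, 6.6], nrow=2)
for row in matrix:
    print(row)
```

The two printed rows correspond to rows [1,] and [2,] of the R output.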
13.3. Dataframes
Let’s create a Pandas dataframe.
[15]:
import pandas as pd
pdf = pd.DataFrame({'x1': [8, 8, 2, 3], 'x2': [9, 8, 1, 1]})
pdf
[15]:
| | x1 | x2 |
|---|---|---|
| 0 | 8 | 9 |
| 1 | 8 | 8 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
Pandas -> R: We can convert the Pandas dataframe to an R dataframe.
[16]:
from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri
import rpy2.robjects as ro
with localconverter(ro.default_converter + pandas2ri.converter):
    rdf = ro.conversion.py2rpy(pdf)
rdf.r_repr()
[16]:
'structure(list(x1 = c(8L, 8L, 2L, 3L), x2 = c(9L, 8L, 1L, 1L)), class = "data.frame", row.names = c("0", \n"1", "2", "3"))'
R -> Pandas: And we can also convert an R dataframe to a Pandas dataframe.
[17]:
with localconverter(ro.default_converter + pandas2ri.converter):
    temp_df = ro.conversion.rpy2py(rdf)
temp_df
[17]:
| | x1 | x2 |
|---|---|---|
| 0 | 8 | 9 |
| 1 | 8 | 8 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
If we create a dataframe in R, we can also convert it to a Pandas one.
[18]:
script = '''
df <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE)
)
'''
rdf = robjects.r(script)
with localconverter(ro.default_converter + pandas2ri.converter):
    temp_df = ro.conversion.rpy2py(rdf)
temp_df
[18]:
| | age | grade | name | male |
|---|---|---|---|---|
| 1 | 18.0 | A | Jane | 0 |
| 2 | 16.0 | B | Jack | 1 |
| 3 | 15.0 | C | Joe | 1 |
The context manager can also auto-convert a Pandas dataframe to an R one when calling functions.
[19]:
with localconverter(ro.default_converter + pandas2ri.converter):
    summary_df = base.summary(pdf)
print(summary_df)
x1 x2
Min. :2.00 Min. :1.00
1st Qu.:2.75 1st Qu.:1.00
Median :5.50 Median :4.50
Mean :5.25 Mean :4.75
3rd Qu.:8.00 3rd Qu.:8.25
Max. :8.00 Max. :9.00
The summary result is stored as a StrMatrix, and we can kludge a solution to parse its elements into a Pandas dataframe.
[20]:
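import numpy as np
# Iterating over the StrMatrix yields its cells in R's column-major order,
# so reshape with order='F' to keep each value in its own column.
values = [float(s[s.index(':')+1:].strip()) for s in summary_df]
pd.DataFrame(
    np.array(values).reshape(summary_df.nrow, summary_df.ncol, order='F'),
    columns=[n.strip() for n in summary_df.colnames]
)
[20]:
| | x1 | x2 |
|---|---|---|
| 0 | 2.00 | 1.00 |
| 1 | 2.75 | 1.00 |
| 2 | 5.50 | 4.50 |
| 3 | 5.25 | 4.75 |
| 4 | 8.00 | 8.25 |
| 5 | 8.00 | 9.00 |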
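Each cell of the summary table is a string made of a label, a colon, and a value, and iterating over the matrix yields the cells one column at a time. A minimal sketch of the parsing step in plain Python follows. The cell strings are retyped from the printed summary above (their exact padding is an assumption), and parse_cell is a hypothetical helper name. Unlike the kludge, it keeps the statistic label so that the labels can serve as row keys instead of 0 to 5.

```python
def parse_cell(cell):
    """Split a summary cell like 'Min.   :2.00  ' into (label, value)."""
    label, _, value = cell.partition(':')
    return label.strip(), float(value)


# Cells in column-major order, retyped from the printed summary (padding assumed).
cells = [
    'Min.   :2.00  ', '1st Qu.:2.75  ', 'Median :5.50  ',
    'Mean   :5.25  ', '3rd Qu.:8.00  ', 'Max.   :8.00  ',
    'Min.   :1.00  ', '1st Qu.:1.00  ', 'Median :4.50  ',
    'Mean   :4.75  ', '3rd Qu.:8.25  ', 'Max.   :9.00  ',
]
columns = ['x1', 'x2']
nrow = len(cells) // len(columns)

table = {}
for j, name in enumerate(columns):
    # Each column occupies a contiguous run of nrow cells.
    column_cells = cells[j * nrow:(j + 1) * nrow]
    table[name] = dict(parse_cell(c) for c in column_cells)

print(table['x1']['Median'], table['x2']['Max.'])
```

Passing the resulting dict of dicts to pd.DataFrame should give a frame with x1 and x2 as columns and the statistic labels as the index.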
13.4. Modeling
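Results from modeling in R can also be captured. This example applies linear regression to a toy dataset. One reason to fit the model in R rather than Python is that R's summary() of an lm fit reports standard errors, t values, p-values, and R-squared in a single call. Note that x2 = 6 - x1 in this dataset, so the predictors are perfectly collinear and R cannot estimate a separate coefficient for x2.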
[21]:
pdf = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [5, 4, 3, 2, 1],
    'y': [10, 20, 30, 40, 55]
})
with localconverter(ro.default_converter + pandas2ri.converter):
    rdf = ro.conversion.py2rpy(pdf)
[22]:
stats = importr('stats')
r = stats.lm('y ~ .', data=rdf)
[23]:
print(base.summary(r))
Call:
(function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if (method == "model.frame")
return(mf)
else if (method != "qr")
warning(gettextf("method = '%s' is not supported. Using 'qr'",
method), domain = NA)
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
w <- as.vector(model.weights(mf))
if (!is.null(w) && !is.numeric(w))
stop("'weights' must be a numeric vector")
offset <- model.offset(mf)
mlm <- is.matrix(y)
ny <- if (mlm)
nrow(y)
else length(y)
if (!is.null(offset)) {
if (!mlm)
offset <- as.vector(offset)
if (NROW(offset) != ny)
stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
NROW(offset), ny), domain = NA)
}
if (is.empty.model(mt)) {
x <- NULL
z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
0) else ny)
if (!is.null(offset)) {
z$fitted.values <- offset
z$residuals <- y - offset
}
}
else {
x <- model.matrix(mt, mf, contrasts)
z <- if (is.null(w))
lm.fit(x, y, offset = offset, singular.ok = singular.ok,
...)
else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
...)
}
class(z) <- c(if (mlm) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
if (model)
z$model <- mf
if (ret.x)
z$x <- x
if (ret.y)
z$y <- y
if (!qr)
z$qr <- NULL
z
})(formula = "y ~ .", data = structure(list(x1 = 1:5, x2 = 5:1,
y = c(10L, 20L, 30L, 40L, 55L)), class = "data.frame", row.names = c("0",
"1", "2", "3", "4")))
Residuals:
0 1 2 3 4
1.000e+00 2.276e-15 -1.000e+00 -2.000e+00 2.000e+00
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0000 1.9149 -1.044 0.373021
x1 11.0000 0.5774 19.053 0.000316 ***
x2 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.826 on 3 degrees of freedom
Multiple R-squared: 0.9918, Adjusted R-squared: 0.9891
F-statistic: 363 on 1 and 3 DF, p-value: 0.0003157
[24]:
print(r.rclass)
<rpy2.rinterface_lib.sexp.StrSexpVector object at 0x7faf56a54e40> [RTYPES.STRSXP]
[25]:
print(r.names)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
[26]:
print(r.rx2('coefficients'))
(Intercept) x1 x2
-2 11 NA
[27]:
pd.Series([c for c in r.rx2('coefficients')], r.rx2('coefficients').names)
[27]:
(Intercept) -2.0
x1 11.0
x2 NaN
dtype: float64
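Because x2 is an exact linear function of x1 in this toy dataset, the fit is effectively a simple regression of y on x1, which is why the x2 coefficient comes back as NA. As a sanity check, the reported intercept, slope, residuals, and residual standard error can be reproduced by hand with the closed-form least-squares formulas. This is a plain-Python sketch that does not use rpy2; it is not how R computes the fit internally (lm uses a QR decomposition).

```python
import math

x1 = [1, 2, 3, 4, 5]
x2 = [5, 4, 3, 2, 1]
y = [10, 20, 30, 40, 55]

# x2 is an exact linear function of x1, hence the singularity.
assert all(b == 6 - a for a, b in zip(x1, x2))

n = len(x1)
mean_x = sum(x1) / n
mean_y = sum(y) / n

# Closed-form simple linear regression of y on x1.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x1, y))
sxx = sum((a - mean_x) ** 2 for a in x1)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [b - (intercept + slope * a) for a, b in zip(x1, y)]
# Residual standard error with n - 2 degrees of freedom.
rse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(intercept, slope, round(rse, 3))
```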