IntegratedJM – an R package to Jointly Model Gene-Expression and Bioassay Data, Taking Care of the Fingerprint Feature Effect

Author: Rudradev Sengupta

Modern drug discovery processes involve multiple sources of high-dimensional data. This imposes the challenge of data integration. For example, in order to check for safety issues prior to advance and expensive phases of drug development cycles, while launching a new drug in the market, pharmaceutical companies need to analyse the chemical structure (fingerprint features) of the compounds included in the drug, the phenotypic activity (e.g. bioassay read-outs) data for targets of interest, depending on the disease of interest and transcriptomic (gene expression) data. Perualila-Tan et al.(2016) discussed a joint model for the transcriptomic and the phenotypic variables, conditioned on the chemical structure. This modelling approach can be used to uncover, for a given set of compounds, the association between gene expression and biological activity, taking into account the influence of the chemical structure of the compound on both the variables (Amaratunga et al.(2014); Perualila et al.(2016)). The model allows to detect genes that are associated with the bioactivity data facilitating the identification of potential genomic biomarkers for compounds efficacy. The R package IntegratedJM was created to implement the model and is explained in details in the book chapter by Perualila et al.(2016). Here, the methodology and the R package are explained briefly together with few examples.

thumbnail image: IntegratedJM - an R package to Jointly Model Gene-Expression and Bioassay Data, Taking Care of the Fingerprint Feature Effect

In short, the model and the parameters of interest for the kth fingerprint feature, jth gene and ith compound can be described as follows:

This joint model can be fitted and the parameters can be estimated using the fitJM() function of the IntegratedJM R package, e.g.

jmRes <- fitJM(genematrix/expressionset, activityvector, fpvector, methodMultTest = ‘fdr’)

The joint model helps us to understand the type of association between the activity and the gene expression data. Figure 1 displays two different kinds of correlations between these two variables. In the first case, (a) the correlation between the response and the gene expression is because of the grouping variable i.e. given the group these two variables are not correlated. The right panel in Figure 1 gives an example of the other scenario where the two variables are correlated irrespective of the grouping variable i.e. in this case, we expect to have high adjusted association between the two variables, even after adjusting for the grouping factor.

Figure 1 (a)

Figure 1(b)

Figure 1: Graphical illustration of adjusted and unadjusted association: (a) Given the groups, the data are not correlated. (b) Given the groups, the variables are correlated

plot1gene() can be used to create plots (Figure 2) similar to Figure 1. The upper panel in Figure 2 displays the observed data together with the unadjusted correlation between the variables and the lower panel displays the data, after adjusting for the treatment, alongside the adjusted association. This kind of visualization makes it easier to understand how the gene expression is correlated with the fingerprint feature when both the variables are adjusted for a grouping factor. This function can also be useful when one is interested in some particular gene.

Figure 2: An example created with plot1-gene() for Gene 3589 and Fingerprint 97.

After the model is fitted one might be interested to find top k genes according to different set of criterions using the topkGenes() from the package, as described below:

topkGenes(jointModelResult = jmRes, subset_type = “Effect”, ranking = “Pearson”, k = 5, sigLevel = 0.05)

topkGenes(jointModelResult = jmRes, subset_type = “Effect and Correlation”, ranking “CovEffect1”, k = 5, sigLevel = 0.05)

The argument “Effect” implies that a subset of differentially expressed genes will be selected whereas the argument “Effect and Correlation” implies selecting a subset of differentially expressed and correlated genes. The ranking argument specifies how to rank the genes within the selected subset.

Other available functions from the package:

  • plotAsso() – plots the unadjusted association vs the adjusted association for all the genes.
  • plotEff() – plots the fingerprint effect on gene expression vs the adjusted association for all the genes.
  • volcano() – produces the volcano plot for logratios vs corresponding p-values.
  • multiplot() – can be used to plot multiple plots, created using plot1gene(), to compare among genes.


1) Amaratunga D, Cabrera J and Shkedy Z (2014, Chapter 8): Exploration and Analysis of DNA Microarray and Other High Dimensional Data; Wiley Series in Probability and Statistics.
2) Perualila N. J, Shkedy Z, Sengupta R, Bigirumurame T, Bijnens L, Talloen W, Verbist B, Göhlmann H.W.H, Kasim A and QSTAR Consortium (2016, Chapter 16): Applied Surrogate Endpoint Evaluation Methods with Sas and R. In, edited by Alonso A, Bigirumurame T, Burzykowski T, Buyse M, Molenberghs G, Muchene L, Perualila N.J, Shkedy Z and Van der Elst W; CRC Press.
3) Perualila-Tan N. J, Kasim A, Talloen W, Verbist B, Göhlmann H.W.H, QSTAR Consortium and Shkedy Z (2016): A joint modeling approach for uncovering associations between gene expression, bioactivity and chemical structure in early drug discovery to guide lead selection and genomic biomarker development; Statistical Applications in Genetics and Molecular Biology

Copyright: Image appears courtesy of ClipArt