This tutorial is generated from a `Jupyter `__ notebook that can be found `here `__. Using non-Python operations =========================== If your simulator or other operations are implemented in a programming language other than Python, you can still use ELFI. This notebook briefly demonstrates how to do this in three common scenarios: - External executable (written e.g. in C++ or a shell script) - R function - MATLAB function Let’s begin by importing some libraries that we will be using: .. code:: ipython3 import os import numpy as np import matplotlib import matplotlib.pyplot as plt import scipy.io as sio import scipy.stats as ss import elfi import elfi.examples.bdm import elfi.examples.ma2 %matplotlib inline .. note:: To run some parts of this notebook you need to either compile the simulator, have R or MATLAB installed and install their respective wrapper libraries. External executables -------------------- ELFI supports using external simulators and other operations that can be called from the command-line. ELFI provides some tools to easily incorporate such operations to ELFI models. This functionality is introduced in this tutorial. We demonstrate here how to wrap executables as ELFI nodes. We will first use ``elfi.tools.external_operation`` tool to wrap executables as a Python callables (function). Let’s first investigate how it works with a simple shell ``echo`` command: .. code:: ipython3 # Make an external command. {0} {1} are positional arguments and {seed} a keyword argument `seed`. command = 'echo {0} {1} {seed}' echo_sim = elfi.tools.external_operation(command) # Test that `echo_sim` can now be called as a regular python function echo_sim(3, 1, seed=123) .. parsed-literal:: array([ 3., 1., 123.]) The placeholders for arguments in the command string are just Python’s ```format strings`` `__. Currently ``echo_sim`` only accepts scalar arguments. In order to work in ELFI, ``echo_sim`` needs to be vectorized so that we can pass to it a vector of arguments. ELFI provides a handy tool for this as well: .. code:: ipython3 # Vectorize it with elfi tools echo_sim_vec = elfi.tools.vectorize(echo_sim) # Make a simple model m = elfi.ElfiModel(name='echo') elfi.Prior('uniform', .005, 2, model=m, name='alpha') elfi.Simulator(echo_sim_vec, m['alpha'], 0, name='echo') # Test to generate 3 simulations from it m['echo'].generate(3) .. parsed-literal:: array([[ 1.93678222e+00, 0.00000000e+00, 7.43529055e+08], [ 9.43846120e-01, 0.00000000e+00, 7.43529055e+08], [ 2.67626618e-01, 0.00000000e+00, 7.43529055e+08]]) So above, the first column draws from our uniform prior for :math:`\alpha`, the second column has constant zeros, and the last one lists the seeds provided to the command by ELFI. Complex external operations :math:`-` case BDM ---------------------------------------------- To provide a more realistic example of external operations, we will consider the Birth-Death-Mutation (BDM) model used in `Lintusaari at al 2016 `__ *[1]*. Birth-Death-Mutation process ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We will consider here the Birth-Death-Mutation process simulator introduced in *Tanaka et al 2006 [2]* for the spread of Tuberculosis. The simulator outputs a count vector where each of its elements represents a “mutation” of the disease and the count describes how many are currently infected by that mutation. There are three rates and the population size: - :math:`\alpha` - (birth rate) the rate at which any infectious host transmits the disease. - :math:`\delta` - (death rate) the rate at which any existing infectious hosts either recovers or dies. - :math:`\tau` - (mutation rate) the rate at which any infectious host develops a new unseen mutation of the disease within themselves. - :math:`N` - (population size) the size of the simulated infectious population It is assumed that the susceptible population is infinite, the hosts carry only one mutation of the disease and transmit that mutation onward. A more accurate description of the model can be found from the original paper or e.g. `Lintusaari at al 2016 `__ *[1]*. .. For documentation .. image:: http://research.cs.aalto.fi/pml/software/elfi/docs/0.5/images/bdm.png :width: 400 px :alt: BDM model illustration from Lintusaari et al. 2016 :align: center This simulator cannot be implemented effectively with vectorized operations so we have implemented it with C++ that handles loops efficiently. We will now reproduce Figure 6(a) in `Lintusaari at al 2016 `__ *[2]* with ELFI. Let’s start by defining some constants: .. code:: ipython3 # Fixed model parameters delta = 0 tau = 0.198 N = 20 # The zeros are to make the observed population vector have length N y_obs = np.array([6, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype='int16') Let’s build the beginning of a new model for the birth rate :math:`\alpha` as the only unknown .. code:: ipython3 m = elfi.ElfiModel(name='bdm') elfi.Prior('uniform', .005, 2, model=m, name='alpha') .. parsed-literal:: Prior(name='alpha', 'uniform') .. code:: ipython3 # Get the BDM source directory sources_path = elfi.examples.bdm.get_sources_path() # Compile (unix-like systems) !make -C $sources_path # Move the executable in to the working directory !mv $sources_path/bdm . .. parsed-literal:: g++ bdm.cpp --std=c++0x -O -Wall -o bdm .. note:: The source code for the BDM simulator comes with ELFI. You can get the directory with `elfi.examples.bdm.get_source_directory()`. Under unix-like systems it can be compiled with just typing `make` to console in the source directory. For windows systems, you need to have some C++ compiler available to compile it. .. code:: ipython3 # Test the executable (assuming we have the executable `bdm` in the working directory) sim = elfi.tools.external_operation('./bdm {0} {1} {2} {3} --seed {seed} --mode 1') sim(1, delta, tau, N, seed=123) .. parsed-literal:: array([ 19., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) The BDM simulator is actually already internally vectorized if you provide it an input file with parameters on the rows. This is more efficient than looping in Python (``elfi.tools.vectorize``), because one simulation takes very little time and we wish to generate tens of thousands of simulations. We will also here redirect the output to a file and then read the file into a numpy array. This is just one possibility among the many to implement this. The most efficient would be to write a native Python module with C++ but it’s beyond the scope of this article. So let’s work through files which is a fairly common situation especially with existing software. .. code:: ipython3 # Assuming we have the executable `bdm` in the working directory command = './bdm {filename} --seed {seed} --mode 1 > {output_filename}' # Function to prepare the inputs for the simulator. We will create filenames and write an input file. def prepare_inputs(*inputs, **kwinputs): alpha, delta, tau, N = inputs meta = kwinputs['meta'] # Organize the parameters to an array. The broadcasting works nicely with constant arguments here. param_array = np.row_stack(np.broadcast(alpha, delta, tau, N)) # Prepare a unique filename for parallel settings filename = '{model_name}_{batch_index}_{submission_index}.txt'.format(**meta) np.savetxt(filename, param_array, fmt='%.4f %.4f %.4f %d') # Add the filenames to kwinputs kwinputs['filename'] = filename kwinputs['output_filename'] = filename[:-4] + '_out.txt' # Return new inputs that the command will receive return inputs, kwinputs # Function to process the result of the simulation def process_result(completed_process, *inputs, **kwinputs): output_filename = kwinputs['output_filename'] # Read the simulations from the file. simulations = np.loadtxt(output_filename, dtype='int16') # Clean up the files after reading the data in os.remove(kwinputs['filename']) os.remove(output_filename) # This will be passed to ELFI as the result of the command return simulations # Create the python function (do not read stdout since we will work through files) bdm = elfi.tools.external_operation(command, prepare_inputs=prepare_inputs, process_result=process_result, stdout=False) Now let’s replace the echo simulator with this. To create unique but informative filenames, we ask ELFI to provide the operation some meta information. That will be available under the ``meta`` keyword (see the ``prepare_inputs`` function above): .. code:: ipython3 # Create the simulator bdm_node = elfi.Simulator(bdm, m['alpha'], delta, tau, N, observed=y_obs, name='sim') # Ask ELFI to provide the meta dict bdm_node.uses_meta = True # Draw the model elfi.draw(m) .. image:: https://raw.githubusercontent.com/elfi-dev/notebooks/dev/figures/external_files/external_21_0.svg .. code:: ipython3 # Test it data = bdm_node.generate(3) print(data) .. parsed-literal:: [[13 1 4 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [19 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [14 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] Completing the BDM model ~~~~~~~~~~~~~~~~~~~~~~~~ We are now ready to finish up the BDM model. To reproduce Figure 6(a) in `Lintusaari at al 2016 `__ *[2]*, let’s add different summaries and discrepancies to the model and run the inference for each of them: .. code:: ipython3 def T1(clusters): clusters = np.atleast_2d(clusters) return np.sum(clusters > 0, 1)/np.sum(clusters, 1) def T2(clusters, n=20): clusters = np.atleast_2d(clusters) return 1 - np.sum((clusters/n)**2, axis=1) # Add the different distances to the model elfi.Summary(T1, bdm_node, name='T1') elfi.Distance('minkowski', m['T1'], p=1, name='d_T1') elfi.Summary(T2, bdm_node, name='T2') elfi.Distance('minkowski', m['T2'], p=1, name='d_T2') elfi.Distance('minkowski', m['sim'], p=1, name='d_sim') .. parsed-literal:: Distance(name='d_sim') .. code:: ipython3 elfi.draw(m) .. image:: https://raw.githubusercontent.com/elfi-dev/notebooks/dev/figures/external_files/external_25_0.svg .. code:: ipython3 # Save parameter and simulation results in memory to speed up the later inference pool = elfi.OutputPool(['alpha', 'sim']) # Fix a seed seed = 20170511 rej = elfi.Rejection(m, 'd_T1', batch_size=10000, pool=pool, seed=seed) %time T1_res = rej.sample(5000, n_sim=int(1e5)) rej = elfi.Rejection(m, 'd_T2', batch_size=10000, pool=pool, seed=seed) %time T2_res = rej.sample(5000, n_sim=int(1e5)) rej = elfi.Rejection(m, 'd_sim', batch_size=10000, pool=pool, seed=seed) %time sim_res = rej.sample(5000, n_sim=int(1e5)) .. parsed-literal:: CPU times: user 3.11 s, sys: 143 ms, total: 3.26 s Wall time: 5.56 s CPU times: user 29.9 ms, sys: 1.45 ms, total: 31.3 ms Wall time: 31.2 ms CPU times: user 33.8 ms, sys: 500 µs, total: 34.3 ms Wall time: 34 ms .. code:: ipython3 # Load a precomputed posterior based on an analytic solution (see Lintusaari et al 2016) matdata = sio.loadmat('./resources/bdm.mat') x = matdata['likgrid'].reshape(-1) posterior_at_x = matdata['post'].reshape(-1) # Plot the reference plt.figure() plt.plot(x, posterior_at_x, c='k') # Plot the different curves for res, d_node, c in ([sim_res, 'd_sim', 'b'], [T1_res, 'd_T1', 'g'], [T2_res, 'd_T2', 'r']): alphas = res.outputs['alpha'] dists = res.outputs[d_node] # Use gaussian kde to make the curves look nice. Note that this tends to benefit the algorithm 1 # a lot as it ususally has only a very few accepted samples with 100000 simulations kde = ss.gaussian_kde(alphas[dists<=0]) plt.plot(x, kde(x), c=c) plt.legend(['reference', 'algorithm 1', 'algorithm 2, T1\n(eps=0)', 'algorithm 2, T2\n(eps=0)']) plt.xlim([-.2, 1.2]); print('Results after 100000 simulations. Compare to figure 6(a) in Lintusaari et al. 2016.') .. parsed-literal:: Results after 100000 simulations. Compare to figure 6(a) in Lintusaari et al. 2016. .. image:: https://raw.githubusercontent.com/elfi-dev/notebooks/dev/figures/external_files/external_27_1.png Interfacing with R ------------------ It is possible to run R scripts in command line for example with `Rscript `__. However, in Python it may be more convenient to use `rpy2 `__, which allows convenient access to the functionality of R from within Python. You can install it with ``pip install rpy2``. Here we demonstrate how to calculate the summary statistics used in the ELFI tutorial (autocovariances) using R’s ``acf`` function for the MA2 model. .. code:: ipython3 import rpy2.robjects as robj from rpy2.robjects import numpy2ri as np2ri # Converts numpy arrays automatically np2ri.activate() .. Note:: See this issue_ if you get a `undefined symbol: PC` error in the import after installing rpy2 and you are using Anaconda. .. _issue: https://github.com/ContinuumIO/anaconda-issues/issues/152 Let’s create a Python function that wraps the R commands (please see the documentation of `rpy2 `__ for details): .. code:: ipython3 robj.r(''' # create a function `f` f <- function(x, lag=1) { ac = acf(x, plot=FALSE, type="covariance", lag.max=lag, demean=FALSE) ac[['acf']][lag+1] } ''') f = robj.globalenv['f'] def autocovR(x, lag=1): x = np.atleast_2d(x) apply = robj.r['apply'] ans = apply(x, 1, f, lag=lag) return np.atleast_1d(ans) .. code:: ipython3 # Test it autocovR(np.array([[1,2,3,4], [4,5,6,7]]), 1) .. parsed-literal:: array([ 5., 23.]) Load a ready made MA2 model: .. code:: ipython3 ma2 = elfi.examples.ma2.get_model(seed_obs=4) elfi.draw(ma2) .. image:: https://raw.githubusercontent.com/elfi-dev/notebooks/dev/figures/external_files/external_36_0.svg Replace the summaries S1 and S2 with our R autocovariance function. .. code:: ipython3 # Replace with R autocov S1 = elfi.Summary(autocovR, ma2['MA2'], 1) S2 = elfi.Summary(autocovR, ma2['MA2'], 2) ma2['S1'].become(S1) ma2['S2'].become(S2) # Run the inference rej = elfi.Rejection(ma2, 'd', batch_size=1000, seed=seed) rej.sample(100) .. parsed-literal:: Method: Rejection Number of samples: 100 Number of simulations: 10000 Threshold: 0.111 Sample means: t1: 0.599, t2: 0.177 Interfacing with MATLAB ----------------------- There are a number of options for running MATLAB (or Octave) scripts from within Python. Here, evaluating the distance is demonstrated with a MATLAB function using the official `MATLAB Python cd API `__. (Tested with MATLAB 2016b.) .. code:: ipython3 import matlab.engine A MATLAB session needs to be started (and stopped) separately: .. code:: ipython3 eng = matlab.engine.start_matlab() # takes a while... Similarly as with R, we have to write a piece of code to interface between MATLAB and Python: .. code:: ipython3 def euclidean_M(x, y): # MATLAB array initialized with Python's list ddM = matlab.double((x-y).tolist()) # euclidean distance dM = eng.sqrt(eng.sum(eng.power(ddM, 2.0), 2)) # Convert back to numpy array d = np.atleast_1d(dM).reshape(-1) return d .. code:: ipython3 # Test it euclidean_M(np.array([[1,2,3], [6,7,8], [2,2,3]]), np.array([2,2,2])) .. parsed-literal:: array([ 1.41421356, 8.77496439, 1. ]) Load a ready made MA2 model: .. code:: ipython3 ma2M = elfi.examples.ma2.get_model(seed_obs=4) elfi.draw(ma2M) .. image:: https://raw.githubusercontent.com/elfi-dev/notebooks/dev/figures/external_files/external_48_0.svg Replace the summaries S1 and S2 with our R autocovariance function. .. code:: ipython3 # Replace with Matlab distance implementation d = elfi.Distance(euclidean_M, ma2M['S1'], ma2M['S2']) ma2M['d'].become(d) # Run the inference rej = elfi.Rejection(ma2M, 'd', batch_size=1000, seed=seed) rej.sample(100) .. parsed-literal:: Method: Rejection Number of samples: 100 Number of simulations: 10000 Threshold: 0.113 Sample means: t1: 0.602, t2: 0.178 Finally, don’t forget to quit the MATLAB session: .. code:: ipython3 eng.quit() Verdict ------- We showed here a few examples of how to incorporate non Python operations to ELFI models. There are multiple other ways to achieve the same results and even make the wrapping more efficient. Wrapping often introduces some overhead to the evaluation of the generative model. In many cases however this is not an issue since the operations are usually expensive by themselves making the added overhead insignificant. References ~~~~~~~~~~ - [1] Jarno Lintusaari, Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, Jukka Corander; Fundamentals and Recent Developments in Approximate Bayesian Computation. Syst Biol 2017; 66 (1): e66-e82. doi: 10.1093/sysbio/syw077 - [2] Tanaka, Mark M., et al. “Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data.” Genetics 173.3 (2006): 1511-1520.