My PhD research was concerned with the development of statistical methodology for the integration of multiple ‘omic datasets (e.g. genomic, transcriptomic, proteomic, etc.) in personalised medicine.
My goal was to tackle some of the challenges presented by the identification of relevant patient subgroups (e.g. patients that might be expected to respond similarly to treatments) on the basis of those datasets.
First, when combining different types of ‘omic datasets, it is crucial to take into account the different nature of each dataset. For this reason, I developed integrative clustering methods that explicitly weight the contribution of each dataset to the final clustering according to the amount of information it contains, and that can combine datasets of different types (e.g. continuous, categorical, etc.). These methods build on the idea that the output of classical statistical techniques, such as model-based Bayesian clustering, can be combined with kernel methods from the machine learning literature to find a meaningful global clustering that summarises all the available information.
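To make the idea concrete, here is a minimal Python sketch of the general recipe, not the implementation used in the papers listed below. It assumes that each dataset has already been clustered several times (for example, by drawing samples from a Bayesian mixture model), turns each set of clusterings into a co-clustering matrix that can be treated as a kernel, combines the kernels with weights, and extracts one global clustering. All function names are hypothetical, the weights are fixed by hand here whereas the methods described above learn them from the data, and a simple kernel-PCA embedding followed by k-means stands in for the final clustering step.

```python
import numpy as np

def coclustering_matrix(label_runs):
    """Fraction of clusterings in which each pair of items shares a cluster."""
    labels = np.asarray(label_runs)                    # shape (n_runs, n_items)
    same = labels[:, :, None] == labels[:, None, :]    # shape (n_runs, n, n)
    return same.mean(axis=0)

def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices (weights are normalised to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

def kernel_embedding(K, n_components):
    """Kernel-PCA coordinates taken from the centred kernel matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def kmeans(X, n_clusters, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centres = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Toy example: six patients, two datasets (made-up cluster labels).
runs_a = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]  # informative
runs_b = [[0, 1, 0, 1, 0, 1], [0, 0, 0, 1, 1, 1]]                      # noisier

kernels = [coclustering_matrix(runs_a), coclustering_matrix(runs_b)]
K = combine_kernels(kernels, weights=[0.8, 0.2])  # hand-picked weights (assumption)
labels = kmeans(kernel_embedding(K, n_components=2), n_clusters=2)
```

Because a co-clustering matrix is an average of block indicator matrices, it is positive semidefinite, which is what allows it to be used as a kernel regardless of whether the underlying data were continuous or categorical.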
Second, because ‘omic datasets comprise measurements taken on a very large number of variables, many different patient subgroups can usually be identified, depending on which variables are included in the analysis. For this reason, I also worked on integrating genetic information with data on specific patient outcomes, to ensure that the subgroups identified are truly relevant. To do so, I generalised the method above to the supervised case. A variational inference algorithm for outcome-guided model-based Bayesian clustering could be implemented as an alternative.
On a more applied note, I participated in a study on cardiovascular disease. My role in the project was to analyse data collected at the Cambridge Blood Donor Centre with the statistical methods mentioned above, to define a personalised cardiovascular disease risk score.
References:
- Cabassi, A., Kirk, P. D. W., 2020. Multiple kernel learning for integrative consensus clustering of genomic datasets. Bioinformatics, btaa593. doi:10.1093/bioinformatics/btaa593.
- Seyres, D., Cabassi, A., …, Frontini, M., 2020. Extreme phenotypes define epigenetic and metabolic signatures in cardiometabolic diseases. bioRxiv preprint, bioRxiv:2020.03.06.961805.
- Cabassi, A., Seyres, D., Frontini, M., Kirk, P. D. W., 2020. Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome. arXiv preprint, arXiv:2008.00235.
- Cabassi, A., Richardson, S., Kirk, P. D. W. Kernel learning approaches for summarising and combining posterior similarity matrices. arXiv preprint, arXiv:2009.12852.
Previous research
High-performance, large-scale regression
During my internship at The Alan Turing Institute, I explored different methods and libraries to perform high-performance, large-scale regression on a supercomputer, with particular focus on Apache Spark and TensorFlow. The internship was funded by Cray Inc and carried out in close collaboration with the Cray EMEA Research Lab. You can find more details about our findings on the blog and the official webpage of the project.
Permutation tests for functional and network data
- Cabassi A., Casa A., Fontana M., Russo M., Farcomeni A., 2018. Three Testing Perspectives on Connectome Data. In: Canale A., Durante D., Paci L., Scarpa B. (eds) Studies in Neural Data Science. Start Up Research 2017. Springer Proceedings in Mathematics & Statistics, vol 257. doi:10.1007/978-3-030-00039-4_3.
- Cabassi, A., Pigoli, D., Secchi, P., Carter, P. A., 2017. Permutation tests for the equality of covariance operators of functional data with applications to evolutionary biology. Electron. J. Statist. 11, no. 2, 3815–3840. doi:10.1214/17-EJS1347.
Macroscopic traffic flow models
- Cabassi, A., Goatin, P., 2013. Validation of traffic flow models on processed GPS data. Research report RR-8382, INRIA. hal-00876311.
