This piece was submitted for the 2019 Max Perutz Science Writing Award.
The award is a competition for MRC students to write about their science for the public.
It asks students to communicate their PhD research in a way that a wider audience, including non-scientists, can understand, which helps us build our communication skills.
How can we do great science if we can’t find useful information? In the UK there are many repositories that have captured vast collections of patient information. These repositories are called Biobanks. Each Biobank holds confidential data on the health and wellbeing of thousands of patients, including years of health records and doctors’ letters, and, importantly, links those records to banked biosamples. Yet a lack of standardised clinical annotation across these Biobanks leaves gaps in our knowledge. These gaps make millions of biosamples difficult to find, which limits how useful they can be.
I am creating a smart assistant to bridge these gaps. This smart assistant will use artificial intelligence techniques to learn from the data it is given. My work aims to unify the representation of biomedical information in Biobanks, so that queries return results that would previously have been missed, which means more data for researchers to work with.
How will I develop a system that captures more results?
Ontologies allow us to represent a domain of interest: compressing everything we know about a particular ‘thing’ into something more easily understandable. I like to imagine ontologies as ‘webs of data’. These silky webs show how parts of a ‘thing’ are connected, and what rules constrain the web. For example, a human anatomy ontology could link a finger to a hand, and a hand to an arm. Using this web, you can logically infer that a finger is linked to an arm. These rules are simple, but we can keep weaving webs like this to build up a big picture of an entire domain of interest, covering the symptoms of disease, methods for diagnosis, and the effects of medical interventions. To help researchers capture more results, my assistant will weave a web that unifies the terminology the different Biobanks use to describe ‘things’. This means that we will be able to find patient data for whatever we are looking for in a Biobank, even if the diseases or symptoms are recorded under different names or codes, or are misspelt. You could think of my work as a search engine for researchers who want to find patient data to test their hypotheses.
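For readers who like to see the idea in code, the finger, hand and arm example can be sketched in a few lines of Python. This is only an illustration of the kind of inference an ontology allows, not the assistant itself; the `PART_OF` table and the `is_part_of` function are invented for this sketch, and it assumes that "linked to" means "is part of".

```python
# Hypothetical "part of" links, taken from the anatomy example in the text.
PART_OF = {"finger": "hand", "hand": "arm"}


def is_part_of(part, whole, links=PART_OF):
    """Follow "part of" links upward from `part` until `whole` is reached
    or the chain runs out."""
    seen = set()  # guards against accidental loops in the links
    current = part
    while current in links and current not in seen:
        seen.add(current)
        current = links[current]
        if current == whole:
            return True
    return False


print(is_part_of("finger", "arm"))  # True: finger -> hand -> arm
print(is_part_of("arm", "finger"))  # False: the links only point upward
```

Nobody wrote down that a finger is part of an arm; the answer falls out of following two simpler links, which is the sense in which the web lets us infer new connections.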
I have just finished building my first ontology. I wove this web myself from scratch, starting with a single document that describes inflammation of the eye using only terse medical terminology. But to be useful, this terminology needs to be linked to the words that patients and researchers use every day, via synonyms (e.g. “head” is a synonym of “skull”). I built up these synonyms by extracting “natural language” from an online forum whose users discuss their symptoms and ongoing treatments with each other. My goal is to answer the question: how much more data can we link by improving the mapping of medical terminology to natural language?
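As a rough sketch of why synonyms matter for searching, the Python below maps everyday wording and a misspelling onto one preferred term before comparing. Everything here is made up for illustration: the `SYNONYMS` table, the `RECORDS` list and the function names are assumptions, and only the "skull" and "head" pair comes from the example above.

```python
# Illustrative synonym table: each everyday phrase maps to one preferred term.
# Only the "skull"/"head" pair comes from the text; the rest is invented.
SYNONYMS = {
    "skull": "head",
    "sore eye": "eye inflammation",
    "eye inflamation": "eye inflammation",  # a misspelling
}

# Invented records standing in for annotations held by different sources.
RECORDS = [
    {"id": 1, "label": "Eye inflammation"},
    {"id": 2, "label": "sore eye"},
    {"id": 3, "label": "eye inflamation"},
    {"id": 4, "label": "skull"},
]


def normalise(term):
    """Lower-case the term and swap it for its preferred form, if one is known."""
    cleaned = term.strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


def search(records, query):
    """Return the records whose label means the same thing as the query."""
    target = normalise(query)
    return [r for r in records if normalise(r["label"]) == target]


matches = search(RECORDS, "eye inflammation")
print([r["id"] for r in matches])  # [1, 2, 3]
```

A search that compared the text exactly would find only record 1. With the synonym table it finds three, which is the kind of gain the question above is asking about.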
If applying my first ontology to a different source of data proves useful for identifying patient information, we can use this idea of weaving webs with natural language synonyms from any number of publicly available datasets to capture more about how diseases and treatments affect real people.
My big plan is a smart assistant weaving webs across all the Biobanks, enabling us to link and capture more information by making use of years of underutilised textual data found in patient records, which is typically difficult to work with due to a lack of standardisation. My assistant will weave links between the terminology used across the different Biobanks, bridging the gaps in annotations so researchers don’t have to. With better results from our Biobanks, researchers will be able to do better science.
Organised and consistent data is the key to research. As someone who loves organisation and science, I am excited to be the spider at the heart of these webs, which I hope will one day form a foundation for smarter science in biomedical research. These Biobanks are an amazing resource and I believe my work will help us to make the most of them.
Spiders weave webs to capture their prey. I weave webs to capture more information.