SemBT Biomedical Question Answering tool: User guide

Question processing

In the interface shown in Figure 1, the user enters the question in the Query field, presses the Search button, and gets a list of semantic relations as answers.

Currently, the user question (query) cannot be entered in natural language form. Instead, the question is specified as a template (subject, relation, object) whose parts refer to the corresponding components of the stored semantic relations. At least one component must be specified, but it is possible to specify two or even all three, depending on the question. The question is forwarded to Lucene, which means that the full Lucene query syntax is allowed.

Basic examples

A question containing only one of the arguments, e.g. Alzheimer's disease, is very general and asks for any relation between Alzheimer's disease and some other biomedical concept. Such a question produces as an answer a set of semantic relations that can be used for a quick overview of a concept.
A more realistic question might ask "What are the treatments for Alzheimer's disease?" and can be specified in our tool in the simplest form as TREATS Alzheimer's disease as shown in Figure 2, in which some of the answers are: Donepezil-TREATS-Alzheimer's disease and Galantamine-TREATS-Alzheimer's disease.
The question "What does donepezil treat?" can be asked in our tool as donepezil TREATS and the answer will contain, in addition to Donepezil-TREATS-Alzheimer’s disease, other relations, including Donepezil-TREATS-Schizophrenia.
A very specific question such as "Has donepezil been used for Down syndrome?" could be asked as donepezil TREATS down syndrome, in which all three components are specified; the returned relation Donepezil-TREATS-Down syndrome confirms that it has indeed been used for this disorder.

When specifying a question, the case does not matter: Donepezil TREATS Down syndrome is equivalent to donepezil treats down syndrome.
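The template form used in the examples above can be sketched in a few lines of Python. This is only an illustration of how such a query string is composed, not part of the tool; the function name build_template_query and its parameters are hypothetical.

```python
from typing import Optional


def build_template_query(subject: Optional[str] = None,
                         relation: Optional[str] = None,
                         obj: Optional[str] = None) -> str:
    """Compose a (subject, relation, object) template into one query string.

    Hypothetical helper: at least one component must be given, and the
    components are simply joined with spaces, as in the examples above.
    """
    parts = [p.strip() for p in (subject, relation, obj) if p and p.strip()]
    if not parts:
        raise ValueError("specify at least one of subject, relation, object")
    return " ".join(parts)


# "What are the treatments for Alzheimer's disease?"
print(build_template_query(relation="TREATS", obj="Alzheimer's disease"))
# "What does donepezil treat?"
print(build_template_query(subject="donepezil", relation="TREATS"))
```

The resulting strings are what would be typed into the Query field; case is left untouched because the tool treats search terms case-insensitively.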

Semantic types and semantic relations

The semantic type of the subject and/or object can also be used in questions. For example, "Which pharmacologic substances cause which diseases or syndromes?" can be asked as phsu causes dsyn. Here phsu is the abbreviation of the semantic type "Pharmacologic Substance" and dsyn is the abbreviation of the semantic type "Disease or Syndrome". When specifying a question in the current version of our tool, semantic types must be abbreviated; full names are not accepted.

A list of the semantic types and their corresponding abbreviations is shown in Table 1. The semantic relation names that can be used in questions are shown in Table 2.

Table 1: Semantic types and abbreviations. For full table see SemBT Semantic Types.
Abbreviation	Semantic Type
dsyn	Disease or Syndrome
aapp	Amino Acid, Peptide, or Protein
podg	Patient or Disabled Group
bpoc	Body Part, Organ, or Organ Component
gngm	Gene or Genome
topp	Therapeutic or Preventive Procedure
neop	Neoplastic Process
cell	Cell
mamm	Mammal
phsu	Pharmacologic Substance
orch	Organic Chemical
bacs	Biologically Active Substance
fndg	Finding
patf	Pathologic Function
popg	Population Group
aggp	Age Group
tisu	Tissue
celc	Cell Component
humn	Human
sosy	Sign or Symptom
diap	Diagnostic Procedure
orgf	Organism Function
inpo	Injury or Poisoning
celf	Cell Function
mobd	Mental or Behavioral Dysfunction

Table 2: Subset of most common semantic relations. For full table see SemBT Relation Types.
Predicate
PROCESS_OF
LOCATION_OF
PART_OF
TREATS
ISA
COEXISTS_WITH
AFFECTS
INTERACTS_WITH
USES
ASSOCIATED_WITH
CAUSES
ADMINISTERED_TO
STIMULATES
INHIBITS
AUGMENTS
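
Because only abbreviations are accepted in queries, a small lookup can translate full semantic type names before a query is typed. The sketch below copies a few rows of Table 1; the dictionary names and the function to_abbreviation are hypothetical and are not provided by the tool.

```python
# A few rows copied from Table 1 (abbreviation -> semantic type).
SEMTYPE_NAMES = {
    "dsyn": "Disease or Syndrome",
    "aapp": "Amino Acid, Peptide, or Protein",
    "gngm": "Gene or Genome",
    "phsu": "Pharmacologic Substance",
}

# Reverse lookup keyed by lower-cased full name.
ABBREVIATIONS = {name.lower(): abbr for abbr, name in SEMTYPE_NAMES.items()}


def to_abbreviation(term: str) -> str:
    """Return the abbreviation for a full name, or the abbreviation itself."""
    key = term.strip().lower()
    if key in SEMTYPE_NAMES:
        return key
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    raise KeyError(f"unknown semantic type: {term!r}")


# "Which pharmacologic substances cause which diseases or syndromes?"
subject_type = to_abbreviation("Pharmacologic Substance")
object_type = to_abbreviation("Disease or Syndrome")
print(f"{subject_type} causes {object_type}")  # phsu causes dsyn
```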

Explicit reference to subject, object, and relations

In the examples shown above we referred to the subject, relation and/or object only implicitly. A query such as donepezil treats down syndrome searches all the words in all the fields of the relations. Most of the time, such a query will be satisfactory; however, it is possible to construct more precise queries by referring explicitly to particular search fields. Subject-related search fields are:

sub_name meaning subject name
sub_semtype meaning subject semantic type abbreviation.

Object-related search fields are:

obj_name meaning object name
obj_semtype meaning object semantic type abbreviation.

If we do not want to distinguish between the subject and the object, we can use arg_name, meaning the name of the subject or the object, and arg_semtype, meaning the semantic type abbreviation of the subject or the object. Finally, there is one field related to the semantic relation: relation, meaning the name of the relation. The query above with explicit search fields would look like sub_name:donepezil relation:treats obj_name:down syndrome.
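
The explicit field syntax can be illustrated with a short Python sketch that checks field names against the seven search fields listed above and writes each term as field:value. The helper fielded_query is hypothetical; it only assembles the string and does not contact the tool.

```python
# The seven search fields named in this section.
SEARCH_FIELDS = (
    "sub_name", "sub_semtype",
    "obj_name", "obj_semtype",
    "arg_name", "arg_semtype",
    "relation",
)


def fielded_query(**terms: str) -> str:
    """Join field:value pairs with spaces, in the order they were given."""
    if not terms:
        raise ValueError("specify at least one search field")
    unknown = [name for name in terms if name not in SEARCH_FIELDS]
    if unknown:
        raise ValueError(f"unknown search field(s): {', '.join(unknown)}")
    return " ".join(f"{name}:{value}" for name, value in terms.items())


print(fielded_query(sub_name="donepezil", relation="treats",
                    obj_name="down syndrome"))
# sub_name:donepezil relation:treats obj_name:down syndrome
```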

Logical Expressions

Another implicit aspect of the queries shown so far is the logical operator connecting the question terms. If no explicit logical operator is present, AND is assumed. For example, the last query above is equivalent to sub_name:donepezil AND relation:treats AND obj_name:down syndrome. When constructing questions, it is possible to use the standard Boolean operators AND, OR and NOT and to group the search terms with parentheses.

In contrast to the search terms themselves, the logical operators must be capitalized to be properly understood by the tool. The question "What are the genes or proteins known to be etiologically related to Alzheimer?" can be specified with explicit Boolean operators as sub_semtype:(aapp OR gngm) AND relation:(CAUSES OR PREDISPOSES OR ASSOCIATED_WITH) AND obj_name:Alzheimer, where aapp stands for "Amino Acid, Peptide, or Protein" and gngm for "Gene or Genome". An example of the NOT operator might be the question "What has been used to treat Alzheimer that is not a pharmacologic substance?", which could be minimally specified as NOT phsu treats Alzheimer and in full form as NOT sub_semtype:phsu AND relation:treats AND obj_name:Alzheimer. As practical advice, we recommend that users of our QA tool first try specifying questions without explicit field references and Boolean operators.
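
As an illustration of how the Boolean queries above are put together, the following Python sketch assembles them from small pieces. The helpers any_of, all_of and negate are hypothetical names introduced here; they only build strings, with the operators written in upper case as the tool requires.

```python
from typing import Iterable


def any_of(field: str, values: Iterable[str]) -> str:
    """field:(a OR b OR c), or field:a when there is a single value."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    if len(values) == 1:
        return f"{field}:{values[0]}"
    return f"{field}:({' OR '.join(values)})"


def all_of(*clauses: str) -> str:
    """Join clauses with an explicit, capitalized AND."""
    return " AND ".join(clauses)


def negate(clause: str) -> str:
    """Prefix a clause with a capitalized NOT."""
    return f"NOT {clause}"


# Genes or proteins etiologically related to Alzheimer
print(all_of(
    any_of("sub_semtype", ["aapp", "gngm"]),
    any_of("relation", ["CAUSES", "PREDISPOSES", "ASSOCIATED_WITH"]),
    any_of("obj_name", ["Alzheimer"]),
))

# Treatments for Alzheimer that are not pharmacologic substances
print(all_of(
    negate(any_of("sub_semtype", ["phsu"])),
    any_of("relation", ["treats"]),
    any_of("obj_name", ["Alzheimer"]),
))
```

The two printed strings match the full-form queries given in the text.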

Automatic argument expansion

Automatic argument expansion is another useful feature of our QA tool. If requested, it expands the question arguments with semantically narrower concepts (hyponyms). For example, if we issue the query arg_name:antipsychotic treats, we will get only relations in which antipsychotic agents appear. However, if we use argument expansion by selecting from the Expand set of options before the query is submitted, the semantic relation ISA (meaning "is a") is used behind the scenes to search for narrower concepts, and the original query is expanded with them. The results will then also contain particular antipsychotic agents, such as clozapine, olanzapine, risperidone, haloperidol and so on.

As another example, we can deal with a whole class of disorders in a question such as "What are the most common treatments for neurodegenerative disorders?" This question can be answered by using expansion in the query treats arg_name:neurodegenerative. Here, "neurodegenerative" is expanded with the particular neurodegenerative disorders, such as Alzheimer’s disease, Parkinson disease and so on. A similar question might be "What are the most common treatments for various neoplasms?" Here again we require expansion and use the query treats arg_name:neoplasms.

Currently, there are some limitations in the argument expansion facility:

Explicit field reference must be used (e.g., arg_name, sub_name or obj_name).
If there are many narrower concepts, only the first one hundred are used.
Only a single word can be used to specify the concepts to be expanded (that's why we used "antipsychotic" and "neurodegenerative" above).

The last limitation means that when using expansion, the single word entered (e.g. "antipsychotic") is used to search for all the concepts containing that word (e.g. "antipsychotic agents", "atypical antipsychotic", "Antipsychotic Medications", ...), and, finally, all the concepts found are expanded. Therefore, although a single word is entered, it is possible to expand on multiple word concepts. These limitations are due to technical issues faced when parsing and modifying the original query, and we plan to remove them in the future.
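
The expansion procedure described above can be sketched in Python under simplifying assumptions. The tool's internal implementation is not documented here, so everything in this block is hypothetical: the function expand_argument, the toy concept list, and the toy ISA pairs. The sketch also assumes that only direct (one-level) narrower concepts are collected.

```python
from typing import List, Sequence, Tuple

MAX_NARROWER = 100  # only the first one hundred narrower concepts are used


def expand_argument(word: str,
                    concepts: Sequence[str],
                    isa_pairs: Sequence[Tuple[str, str]],
                    limit: int = MAX_NARROWER) -> List[str]:
    """Return the concepts matching `word` plus their narrower concepts.

    Step 1: find every concept whose name contains the single word.
    Step 2: follow (narrower, ISA, broader) pairs from the matched concepts.
    Step 3: keep at most `limit` narrower concepts.
    """
    word_lc = word.lower()
    matched = [c for c in concepts if word_lc in c.lower().split()]
    matched_set = set(matched)

    narrower: List[str] = []
    for child, parent in isa_pairs:
        if parent in matched_set and child not in matched_set \
                and child not in narrower:
            narrower.append(child)

    return matched + narrower[:limit]


# Toy data for illustration only (not taken from the tool's database).
toy_concepts = ["Antipsychotic Agents", "Atypical antipsychotic",
                "Clozapine", "Olanzapine", "Donepezil"]
toy_isa = [("Clozapine", "Antipsychotic Agents"),
           ("Olanzapine", "Atypical antipsychotic"),
           ("Donepezil", "Cholinesterase Inhibitors")]

print(expand_argument("antipsychotic", toy_concepts, toy_isa))
```

With the toy data, the single word "antipsychotic" first matches two multi-word concepts, and both are then expanded, which mirrors the behaviour described for the single-word limitation.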

Faceting

When the user question is not specific enough at the beginning or when a more exploratory approach is taken, faceting is another promising avenue to explore. In our tool, faceting is turned on with the Filter option and is used for two purposes:

To show the top-N subjects, relations and objects of a query.
To use these for further query refinement or result filtering.

Faceting results are shown in the left column of the user interface (Figure 3).

In our faceting approach, top-N means, in the case of subjects, the top N subjects by the number of relations in which they appear. In other words, a concept that appears as a subject most often in the semantic relations that are the answers to the original query will be shown at the top of the subject facet. The same method applies to the relation and object facets. For example, if the user wants to do some exploratory research on neoplasms, enters the query arg_name:neoplasms, and also uses argument expansion, the most common neoplasms are automatically included in the question. This is a very general question that results in several hundred thousand semantic relations. The user can now browse the facets in the left column and investigate the subjects, relations and objects appearing in the highest number of relations. Suppose the user selects the PREDISPOSES relation in the relation facet, because that is the aspect the user wants to investigate further. The original query is automatically refined with the selected relation to become arg_name:neoplasms AND relation:PREDISPOSES (Figure 3). The results of the query now show which concepts are known to predispose which particular neoplasms. The facets in the left column can be interpreted as:

The concepts in the subject facet are those that predispose the largest number of neoplasms.
The concepts in the object facet are the neoplasms with the largest number of known factors that predispose them.
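
The top-N counting behind the facets can be illustrated with a short Python sketch. It is a simplified stand-in for the tool's faceting, not its actual implementation: the functions facets and refine are hypothetical, and the relation triples are toy data chosen only to make the counts easy to follow.

```python
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def facets(rels: List[Triple], top_n: int = 10) -> Dict[str, list]:
    """Count how often each subject, relation and object occurs."""
    return {
        "subject": Counter(s for s, _, _ in rels).most_common(top_n),
        "relation": Counter(r for _, r, _ in rels).most_common(top_n),
        "object": Counter(o for _, _, o in rels).most_common(top_n),
    }


def refine(rels: List[Triple], relation: str) -> List[Triple]:
    """Keep only the triples with the relation selected in the facet."""
    return [t for t in rels if t[1].upper() == relation.upper()]


# Toy answer set for illustration only.
answers = [
    ("Smoking", "PREDISPOSES", "Lung Neoplasms"),
    ("Smoking", "PREDISPOSES", "Bladder Neoplasms"),
    ("Obesity", "PREDISPOSES", "Breast Neoplasms"),
    ("Asbestos", "PREDISPOSES", "Lung Neoplasms"),
    ("Cisplatin", "TREATS", "Lung Neoplasms"),
]

refined = refine(answers, "PREDISPOSES")
print(facets(refined, top_n=3))
```

After refinement, the first entry of the subject facet is the concept that predisposes the most neoplasms in the toy set, and the first entry of the object facet is the neoplasm with the most predisposing factors.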

Answer processing and presentation

In the question processing phase, the question entered by the user is interpreted according to the user-selected options and then executed. Answers are presented in a top-down fashion: semantic relations first, then, on demand, semantic relation instances, and finally MEDLINE citations. Figure 4 shows the list of semantic relations, which are presented first. In addition to the subject, relation and object fields, the table also contains a Frequency field, which gives the number of instances of each relation. The relations in the answer list are sorted in descending order of relation instance frequency; in other words, the most frequent relation is at the top of the list.

The frequency field is a hyperlink and if followed, a new browser window shows the relation instances and a list of sentences from which each relation was extracted. In the sentences, whenever possible, the subject, relation and object are highlighted in different colors to make it easier to identify the relation and its context. Figure 5 shows the list of highlighted sentences for the semantic relation Donepezil-TREATS-Alzheimer’s disease. The highlighted sentences are listed in ascending order of argument-predicate distance, which is measured as the number of noun phrases between the arguments (subject and object) and the word indicating the semantic relation (the predicate).

Figure 5: List of highlighted sentences for the particular semantic relation.
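
The two orderings used in answer presentation (relations by descending frequency, sentences by ascending argument-predicate distance) can be sketched in Python as follows. The classes, function names, placeholder sentences and distance values are all hypothetical; the sketch only shows the sort keys, not how the tool computes distances.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    sentence: str  # placeholder text, not a real MEDLINE sentence
    distance: int  # assumed: noun phrases between arguments and predicate


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str
    instances: List[Instance] = field(default_factory=list)

    @property
    def frequency(self) -> int:
        """Number of instances, as shown in the Frequency field."""
        return len(self.instances)


def rank_relations(relations: List[Relation]) -> List[Relation]:
    """Most frequent relation first."""
    return sorted(relations, key=lambda r: r.frequency, reverse=True)


def rank_instances(relation: Relation) -> List[Instance]:
    """Smallest argument-predicate distance first."""
    return sorted(relation.instances, key=lambda i: i.distance)


# Placeholder data for illustration only.
rare = Relation("Galantamine", "TREATS", "Alzheimer's disease",
                [Instance("sentence A", 2)])
common = Relation("Donepezil", "TREATS", "Alzheimer's disease",
                  [Instance("sentence B", 3), Instance("sentence C", 0),
                   Instance("sentence D", 1)])

for rel in rank_relations([rare, common]):
    print(rel.subject, rel.predicate, rel.obj, rel.frequency)
print([i.sentence for i in rank_instances(common)])
```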

It is important to note that the highlighted terms are not always the same as the official names used for the subject, relation or object. For example, in some sentences the abbreviation AD appears, but SemRep correctly recognizes this as Alzheimer's disease. Also, the words "in" and "for" are used several times in the text to indicate treatment, which is quite frequent in medical text. Such differences between the highlighted text and the official names are even more common when gene symbols are mentioned. Many genes have more than one symbol to denote them, and different genes often share the same symbol. To make things even more difficult, some gene symbols can also have another, often more common, meaning. For example, CT and MR are gene symbols, but more often mean Computed Tomography and Magnetic Resonance, respectively. This problem is known as gene symbol ambiguity, and SemRep attempts to address it as described in the Background section of the paper.

At the end of each highlighted sentence, the PMID of the MEDLINE citation in which the sentence appears is shown as a link that, when followed, opens the MEDLINE citation so that the context of the sentence can be seen.