In the interface shown in Figure 1, the user enters the question in the
Query field then presses the
Search button and gets a list of semantic relations as answers.
Currently, the user question (query) cannot be entered in natural language form. Instead, the question is specified as a template (subject, relation, object), whose components refer to the corresponding components of the stored semantic relations. At least one component must be specified, but it is possible to specify two or even all three, depending on the question. The question is forwarded to Lucene, which means that the full Lucene query syntax is allowed.
A question containing only one of the arguments, e.g.
Alzheimer's disease is very general and means any relation between Alzheimer's disease and some other biomedical concept. Such a question produces a set of semantic relations that can be used for a quick overview of a concept.
A more realistic question might ask "What are the treatments for Alzheimer's disease?" and can be specified in our tool in the simplest form as
TREATS Alzheimer's disease as shown in Figure 2, in which some of the answers are:
Donepezil-TREATS-Alzheimer's disease and
The question "What does donepezil treat?" can be asked in our tool as
donepezil TREATS and the answer will contain, in addition to
Donepezil-TREATS-Alzheimer’s disease, other relations, including
A very specific question such as "Has donepezil been used for Down syndrome?" could be asked as
donepezil TREATS down syndrome in which all three components are specified; the returned relation
Donepezil-TREATS-Down syndrome confirms that indeed it has been used for this disorder.
When specifying a question, the case does not matter:
Donepezil TREATS Down syndrome is equivalent to
donepezil treats down syndrome.
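The template form of a question can be sketched in a few lines of Python. The helper names `build_template_query` and `same_question` are hypothetical and not part of the tool; they only illustrate that at least one of the three components must be given and that the case of the search terms does not matter.

```python
def build_template_query(subject="", relation="", obj=""):
    """Join the non-empty template components into one query string."""
    parts = [p.strip() for p in (subject, relation, obj) if p and p.strip()]
    if not parts:
        # The text requires at least one of the three components.
        raise ValueError("specify at least one of subject, relation, object")
    return " ".join(parts)


def same_question(a, b):
    """Compare two simple queries ignoring the case of the search terms."""
    return a.lower() == b.lower()


print(build_template_query(relation="TREATS", obj="Alzheimer's disease"))
print(same_question("Donepezil TREATS Down syndrome",
                    "donepezil treats down syndrome"))
```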
The semantic type of the subject and/or object can also be used in questions. For example, "Which pharmacologic substances cause which diseases or syndromes?" can be asked as
phsu causes dsyn. Here
phsu is the abbreviation of the semantic type "Pharmacologic Substance" and
dsyn is the abbreviation of the semantic type "Disease or Syndrome". When specifying a question in the current version of our tool, semantic types must be abbreviated; full names are not accepted.
A list of the semantic types and their corresponding abbreviations is shown in Table 1. The semantic relation names that can be used in questions are shown in Table 2.
| Abbreviation | Semantic type |
|---|---|
| dsyn | Disease or Syndrome |
| aapp | Amino Acid, Peptide, or Protein |
| podg | Patient or Disabled Group |
| bpoc | Body Part, Organ, or Organ Component |
| gngm | Gene or Genome |
| topp | Therapeutic or Preventive Procedure |
| bacs | Biologically Active Substance |
| sosy | Sign or Symptom |
| inpo | Injury or Poisoning |
| mobd | Mental or Behavioral Dysfunction |
In the examples shown above we did not refer to the subject, relation and/or object explicitly, but rather implicitly. A query such as
donepezil treats down syndrome searches all the words in all the fields of the relations. Most of the time, such a query will be satisfactory; however, it is possible to construct more precise queries by referring explicitly to particular search fields. Subject related search fields are:
sub_name meaning subject name
sub_semtype meaning subject semantic type abbreviation.
Object related search fields are:
obj_name meaning object name
obj_semtype meaning object semantic type abbreviation.
If we do not want to distinguish between the subject and the object, we can use:
arg_name meaning the name of the subject or the object, and
arg_semtype meaning the semantic type abbreviation of the subject or the object. Finally, there is one field related to the semantic relation:
relation meaning the name of the relation. The query above with explicit search fields would look like
sub_name:donepezil relation:treats obj_name:down syndrome.
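A query with explicit search fields could be assembled as in the following minimal sketch. The function names are hypothetical; only the field names come from the text. One assumption is labeled in the code: in standard Lucene query syntax a field prefix applies only to the term that directly follows it, so the sketch wraps multi-word values in parentheses. Whether the tool itself requires this is not stated here.

```python
# Field names as listed in the text.
FIELDS = {"sub_name", "sub_semtype", "obj_name", "obj_semtype",
          "arg_name", "arg_semtype", "relation"}


def field_term(field, value):
    """Return one 'field:value' clause for a known search field."""
    if field not in FIELDS:
        raise ValueError(f"unknown search field: {field}")
    value = value.strip()
    if " " in value:
        # Assumption: group multi-word values so that every word stays
        # bound to the field (standard Lucene behavior).
        value = f"({value})"
    return f"{field}:{value}"


def explicit_query(**fields):
    """Join the clauses with spaces; the operator between them is implicit."""
    return " ".join(field_term(f, v) for f, v in fields.items())


print(explicit_query(sub_name="donepezil", relation="treats",
                     obj_name="down syndrome"))
```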
Another implicit aspect of the queries shown so far is the logical connection or operator between the question terms. If there is no explicit logical operator present then
AND is assumed. For example, the last query above really means and is equivalent to
sub_name:donepezil AND relation:treats AND obj_name:down syndrome. When constructing questions, it is possible to use the standard Boolean operators
AND, OR and NOT, and to group the search terms with parentheses.
In contrast to the search terms themselves, the logical operators must be capitalized to be properly understood by the tool. The question "What are the genes or proteins known to be etiologically related to Alzheimer?" can be specified with explicit Boolean operators as
sub_semtype:(aapp OR gngm) AND relation:(CAUSES OR PREDISPOSES OR ASSOCIATED_WITH) AND obj_name:Alzheimer where
aapp stands for "Amino Acid, Peptide, or Protein" and
gngm for "Gene or Genome". An example of the
NOT operator might be the question "What has been used to treat Alzheimer that is not a pharmacological substance?" which could be minimally specified as
NOT phsu treats Alzheimer and in full form as
NOT sub_semtype:phsu AND relation:treats AND obj_name:Alzheimer. As practical advice, we recommend that users of our QA tool first try specifying questions without explicit field reference and Boolean operators.
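The Boolean queries above can be composed programmatically. The following is a small sketch with hypothetical helper names (`any_of`, `all_of`, `negate`); it simply produces query strings with capitalized operators and parenthesized groups, as the text requires.

```python
def any_of(field, values):
    """Build 'field:(a OR b OR ...)' for alternative values of one field."""
    return f"{field}:({' OR '.join(values)})"


def all_of(*clauses):
    """Connect clauses with an explicit, capitalized AND."""
    return " AND ".join(clauses)


def negate(clause):
    """Prefix a clause with the capitalized NOT operator."""
    return f"NOT {clause}"


# "What are the genes or proteins known to be etiologically related to Alzheimer?"
etiology = all_of(
    any_of("sub_semtype", ["aapp", "gngm"]),
    any_of("relation", ["CAUSES", "PREDISPOSES", "ASSOCIATED_WITH"]),
    "obj_name:Alzheimer",
)

# "What has been used to treat Alzheimer that is not a pharmacological substance?"
non_drug = all_of(negate("sub_semtype:phsu"), "relation:treats",
                  "obj_name:Alzheimer")

print(etiology)
print(non_drug)
```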
Automatic argument expansion is another useful feature of our QA tool. If requested, it expands the question arguments with semantically narrower concepts (hyponyms). For example, if we issue the query
arg_name:antipsychotic treats we will get only relations in which antipsychotic agents appear. However, if we use argument expansion by selecting from the
Expand set of options before the query is submitted, the semantic relation
ISA (meaning "is a") is used behind the scenes to search for narrower concepts, and the original query is expanded with them. The results will then also contain particular antipsychotic agents, such as clozapine, olanzapine, risperidone, haloperidol and so on. As another example, we can deal with a whole class of disorders in a question such as "What are the most common treatments for neurodegenerative disorders?" This question can be answered by using expansion in the query
treats arg_name:neurodegenerative. Here, "neurodegenerative" is expanded with the particular neurodegenerative disorders, such as Alzheimer’s disease, Parkinson disease and so on. A similar question might be "What are the most common treatments for various neoplasms?" Here again we require expansion and use the query
Currently, there are some limitations in the argument expansion facility:
The last limitation means that when using expansion, the single word entered (e.g. "antipsychotic") is used to search for all the concepts containing that word (e.g. "antipsychotic agents", "atypical antipsychotic", "Antipsychotic Medications", ...), and, finally, all the concepts found are expanded. Therefore, although a single word is entered, it is possible to expand on multiple word concepts. These limitations are due to technical issues faced when parsing and modifying the original query, and we plan to remove them in the future.
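The expansion behavior described above can be illustrated with a toy sketch. The `ISA_PAIRS` list is a tiny made-up sample in the form (narrower concept, broader concept), not data from the tool, and the function name `expand` is hypothetical. The sketch also assumes a single level of ISA expansion; the text does not say whether the tool follows ISA links further.

```python
# Toy (narrower, broader) ISA pairs, for illustration only.
ISA_PAIRS = [
    ("clozapine", "antipsychotic agents"),
    ("olanzapine", "antipsychotic agents"),
    ("risperidone", "atypical antipsychotic"),
    ("donepezil", "cholinesterase inhibitors"),
]


def expand(word, isa_pairs):
    """Expand a single word with the concepts it matches and their hyponyms."""
    word = word.lower()
    # Step 1: all broader concepts whose name contains the entered word.
    matched = {broad for _, broad in isa_pairs if word in broad.lower().split()}
    # Step 2: all concepts that are linked to a matched concept by ISA.
    narrower = {narrow for narrow, broad in isa_pairs if broad in matched}
    return sorted(matched | narrower)


print(expand("antipsychotic", ISA_PAIRS))
```

Although a single word is entered, the result contains multi-word concepts together with their narrower concepts, which matches the behavior described for the last limitation.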
When the user question is not specific enough at the beginning or when a more exploratory approach is taken, faceting is another promising avenue to explore. In our tool, faceting is turned on with the
Filter option and is used for two purposes:
Faceting results are shown in the left column of the user interface (Figure 3).
In our faceting approach, top-N means, in the case of subjects, the N subjects that appear in the largest number of relations. In other words, the concept that appears as a subject most often in the semantic relations that answer the original query is shown at the top of the subject facet. The same method applies to the relation and object facets. For example, if the user wants to do some exploratory research on neoplasms and enters the query
arg_name:neoplasms and also uses argument expansion, the most common neoplasms are automatically included in the question. This is a very general question that results in several hundred thousand semantic relations. Now the user can browse the facets in the left column and investigate the subjects, relations and objects appearing in the highest number of relations. In the example, the
PREDISPOSES relation is selected in the relation facet, because that is the aspect the user wants to investigate further. The original query is automatically refined with the selected relation to become
arg_name:neoplasms AND relation:PREDISPOSES (Figure 3). Now the results of the query show which concepts are known to predispose which particular neoplasms. The facets in the left column can be interpreted as:
The concepts in the subject facet are those that predispose the largest number of neoplasms.
The concepts in the object facet are the neoplasms with the largest number of known factors that predispose them.
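The top-N counting behind the facets can be sketched as follows. The relation rows use placeholder names and the function name `facets` is hypothetical; the sketch only shows how each facet ranks values by the number of answer relations in which they appear.

```python
from collections import Counter

# Placeholder answer relations (subject, relation, object), for illustration.
answers = [
    ("factor A", "PREDISPOSES", "neoplasm X"),
    ("factor A", "PREDISPOSES", "neoplasm Y"),
    ("factor B", "PREDISPOSES", "neoplasm X"),
    ("factor C", "TREATS", "neoplasm Z"),
]


def facets(relations, n=3):
    """Return the top-n subjects, relation names and objects by relation count."""
    subjects = Counter(s for s, _, _ in relations)
    names = Counter(r for _, r, _ in relations)
    objects = Counter(o for _, _, o in relations)
    return {
        "subject": subjects.most_common(n),
        "relation": names.most_common(n),
        "object": objects.most_common(n),
    }


result = facets(answers)
print(result["subject"][0])   # subject appearing in the most relations
print(result["object"][0])    # object appearing in the most relations
```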
In the question processing phase, the question entered by the user is interpreted according to the user-selected options and then executed. Answers are presented in a top-down fashion: semantic relations first, then, on demand, semantic relation instances, and finally MEDLINE citations. Figure 4 shows the list of semantic relations, which are presented first. In addition to the subject, relation and object fields, the table also contains a
Frequency field, which gives the number of instances of each relation listed in the table. The relations in the answer list are sorted in descending order of frequency; in other words, the most frequent relation is at the top of the list.
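The frequency ordering can be sketched by counting relation instances and sorting the counts in descending order. The instance list below is made up for illustration (the counts are not real data), and `rank_relations` is a hypothetical name.

```python
from collections import Counter

# Made-up relation instances: one tuple per extracted sentence-level instance.
instances = [
    ("Donepezil", "TREATS", "Alzheimer's disease"),
    ("Donepezil", "TREATS", "Down syndrome"),
    ("Donepezil", "TREATS", "Alzheimer's disease"),
    ("Donepezil", "TREATS", "Alzheimer's disease"),
]


def rank_relations(relation_instances):
    """Return (subject, relation, object, frequency) rows, most frequent first."""
    counts = Counter(relation_instances)
    return [(s, r, o, n) for (s, r, o), n in counts.most_common()]


for row in rank_relations(instances):
    print(row)
```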
The frequency field is a hyperlink; when it is followed, a new browser window shows the relation instances and a list of sentences from which each relation was extracted. In the sentences, whenever possible, the subject, relation and object are highlighted in different colors to make it easier to identify the relation and its context. Figure 5 shows the list of highlighted sentences for the semantic relation
Donepezil-TREATS-Alzheimer’s disease. The highlighted sentences are listed in ascending order of argument-predicate distance, which is measured as the number of noun phrases between the arguments (subject and object) and the word indicating the semantic relation (the predicate).
It is important to note that the highlighted terms are not always the same as the official names used for the subject, relation or object. For example, in some sentences the abbreviation AD appears, but SemRep correctly recognizes this as Alzheimer's disease. Also, the words "in" and "for" are used several times in the text to indicate treatment, which is quite frequent in medical text. Such differences are even more common when gene symbols are mentioned in the text. Many genes have more than one symbol, and different genes often share the same symbol. To make things even more difficult, some gene symbols also have another, often more common, meaning. For example, CT and MR are gene symbols, but more often mean Computed Tomography and Magnetic Resonance, respectively. This problem is known as gene symbol ambiguity, and SemRep attempts to address it as described in the Background section of the paper. At the end of each highlighted sentence, the PMID of the MEDLINE citation in which the sentence appears is shown as a link that, when followed, opens the MEDLINE citation so that the context of the sentence can be seen.