SELECT lql.p1, lql.verb, lql.p2, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene' ^ [layer="mesh" && tree_number BELOW 'D12.776'] AS p1 $ ]
      $ ]
      [layer='pos' && tag_name="verb" &&
        (content~"activate%" || content~"inhibit%" || content~"bind%") ] AS verb
      [layer='shallow_parse' && tag_name="NP"
        [layer="mesh" && tree_number BELOW "D12.776" ^ [layer='gene'] $ ] AS p2
      $ ]
    ]
    SELECT p1.content AS p1, verb.content AS verb, p2.content AS p2
  END_LQL
) AS lql
GROUP BY lql.p1, lql.verb, lql.p2
ORDER BY cnt DESC
The verb should be a form of activate, inhibit or bind, e.g., inhibit, inhibits, binding, activated, etc. (Of course, this simple way of handling morphological variants can lead to false positives, e.g., inhibition or inhibitors. Some of these will be filtered out by a mismatch in part-of-speech.)
The % symbol here stands for zero or more characters, as in SQL. It is interpreted as a wildcard within the scope of the ~ operator. The conditional statements are connected with boolean operators such as || and &&. Double quotes indicate a case-insensitive match, so inhibit, Inhibits and INHIBITED will all be matched. For case-sensitive comparisons we use single quotes; in some cases the two can be used interchangeably, as the example query shows. We use double quotes for tag_name="verb" because it is a macro, which expands to tag_name="VB%", i.e., to VB, VBZ, VBD, etc. Note that the verb comes from the POS layer, while the elements before and after it are shallow parse NPs. This is an example of ordering between elements from different layers. Finally, { ALLOW GAPS } allows for intervening words between the verb and the proteins.
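The matching rules described above can be illustrated outside LQL. The following Python sketch is not the LQL implementation; it assumes only what the text states (% matches zero or more characters, double quotes mean case-insensitive, and tag_name="verb" expands to tag_name="VB%"). The names like_to_regex, matches and is_candidate_verb are hypothetical helpers introduced for illustration.

```python
import re

def like_to_regex(pattern, case_insensitive):
    """Translate an SQL-style pattern, where '%' means zero or more
    characters, into a compiled regex. Only '%' is handled here."""
    parts = [re.escape(p) for p in pattern.split("%")]
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile(".*".join(parts), flags)

def matches(pattern, text, case_insensitive=True):
    """True if the whole text matches the pattern.
    case_insensitive=True mimics double quotes, False mimics single quotes."""
    return like_to_regex(pattern, case_insensitive).fullmatch(text) is not None

# The three content patterns from the example query.
VERB_PATTERNS = ["activate%", "inhibit%", "bind%"]

def is_candidate_verb(content, pos_tag):
    """Mimic the verb condition: the content matches one of the patterns
    AND the POS tag matches the expanded macro "VB%"."""
    content_ok = any(matches(p, content) for p in VERB_PATTERNS)
    tag_ok = matches("VB%", pos_tag)
    return content_ok and tag_ok

if __name__ == "__main__":
    for word, tag in [("Inhibits", "VBZ"), ("INHIBITED", "VBD"), ("inhibitors", "NNS")]:
        print(word, tag, is_candidate_verb(word, tag))
```

In this sketch, "inhibitors" satisfies the content pattern "inhibit%" but is rejected because its tag (assumed here to be NNS) does not match "VB%", which is the part-of-speech filtering mentioned above.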
The query returns the contents of the two MeSH layers and of the POS layer, but we could also have selected the NP, the gene or the sentence layers. The actual LQL query is enclosed between BEGIN_LQL and END_LQL, and additional SQL functions are allowed over the LQL selection.
See the automatically generated SQL query for the LQL statement above. See the results of query execution.
The results are not very good, as many proteins are predicted to interact with themselves. This is because in a long sentence the proteins tend to be mentioned multiple times, and we did not put any constraints on how far away they can be from the verb. In addition, only 227 results were returned, which is due mainly to the redundant requirement that entities identified as genes in LocusLink must also be listed as proteins in MeSH, which contains a much smaller set of proteins.
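To inspect such a result set offline, one could mirror the outer GROUP BY / COUNT(*) / ORDER BY step and measure how many triples pair a protein with itself. The sketch below is a diagnostic only, not part of LQL or of the method described here. It assumes the rows have been exported as (p1, verb, p2) tuples, and that comparing names after trimming and lowercasing is acceptable; the rows shown are made-up placeholders, not actual query output.

```python
from collections import Counter

def count_triples(rows):
    """Group identical (p1, verb, p2) rows and return (p1, verb, p2, cnt)
    tuples sorted by descending count, like GROUP BY ... ORDER BY cnt DESC."""
    counts = Counter(rows)
    return [(p1, verb, p2, cnt) for (p1, verb, p2), cnt in counts.most_common()]

def self_pair_fraction(rows):
    """Fraction of rows whose two protein names are equal
    (assumption: names are compared after strip() and lower())."""
    rows = list(rows)
    if not rows:
        return 0.0
    same = sum(1 for p1, _, p2 in rows if p1.strip().lower() == p2.strip().lower())
    return same / len(rows)

if __name__ == "__main__":
    # Placeholder rows for illustration only.
    rows = [
        ("geneA", "inhibits", "geneB"),
        ("geneA", "binds", "geneA"),
        ("geneA", "inhibits", "geneB"),
        ("GeneB", "activated", "geneB"),
    ]
    print(count_triples(rows))
    print(self_pair_fraction(rows))
```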
We can improve the accuracy of the extracted triples by disallowing gaps. This requires the proteins and the verb to follow each other immediately, and it lowers the recall. To compensate, we can also remove the two MeSH layers from the query, which express a redundant requirement anyway (we have already limited the two NPs to be entities from the gene/protein layer). This produces 91 triples.
SELECT lql.p1, lql.verb, lql.p2, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence'
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] AS p1
      $ ]
      [layer='pos' && tag_name="verb" &&
        (content~"activate%" || content~"inhibit%" || content~"bind%") ] AS verb
      [layer='shallow_parse' && tag_name="NP"
        [layer='gene'] AS p2
      $ ]
    ]
    SELECT p1.content AS p1, verb.content AS verb, p2.content AS p2
  END_LQL
) AS lql
GROUP BY lql.p1, lql.verb, lql.p2
ORDER BY cnt DESC
See the automatically generated SQL query for the LQL statement above.
See the query execution results.
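The trade-off between allowing and disallowing gaps can be illustrated with a toy model. The sketch below is not how LQL evaluates queries. It assumes a sentence is given as an ordered list of (label, text) elements, with hypothetical labels GENE_NP, VERB and OTHER, and that with gaps allowed every ordered combination of gene NP, verb and gene NP counts as a match.

```python
from itertools import combinations

def extract(elements, allow_gaps):
    """Return (p1, verb, p2) triples from an ordered list of (label, text).
    With allow_gaps=False the three elements must be directly adjacent."""
    triples = []
    for i, j, k in combinations(range(len(elements)), 3):
        if not allow_gaps and not (j == i + 1 and k == j + 1):
            continue  # skip non-adjacent candidates when gaps are disallowed
        (l1, t1), (l2, t2), (l3, t3) = elements[i], elements[j], elements[k]
        if (l1, l2, l3) == ("GENE_NP", "VERB", "GENE_NP"):
            triples.append((t1, t2, t3))
    return triples

if __name__ == "__main__":
    # Made-up sentence in which the first protein is mentioned twice.
    long_sentence = [
        ("GENE_NP", "geneA"), ("VERB", "binds"), ("OTHER", "to"),
        ("GENE_NP", "geneB"), ("OTHER", "and"), ("GENE_NP", "geneA"),
    ]
    print(extract(long_sentence, allow_gaps=True))
    print(extract(long_sentence, allow_gaps=False))
```

In this toy sentence, allowing gaps yields both the intended pair and a self-pair caused by the repeated mention, while disallowing gaps yields nothing because a word intervenes between the verb and the second protein. This mirrors the gain in accuracy and the loss in recall described above.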