This Web page offers the retrieval and display of bio-molecular interactions that have been extracted from the scientific literature via text mining.
The focus lies on interactions that involve genes, gene products, families, complexes and groups of genes. For the purpose of this service, no difference is made between genes and their products. Biomedical Natural Language Processing (BioNLP) algorithms are employed to scan the text of documents from PubMed and the PubMed Central open access subset for mentions of interactions between these types of entities. Recognized interactions include gene expression, phosphorylation, positive and negative regulation and physical binding events. For example, the sentence
Fibroblast growth factor-1 (FGF-1) enhances IL-2 production and nuclear translocation of NF-kappaB in FGF receptor-bearing Jurkat T cells.
We have examined changes in HS1 phosphorylation[...]
Both binary and unary events are included in the GePI search index. In result displays that require two interaction partners, unary events are either filtered out - e.g. for Sankey charts - or shown without second interaction partner information - e.g. in the result table.
At the heart of a GePI search lies the specification of one or two lists of genes, proteins or families, complexes and groups of them. Recognized names or IDs include
Given a gene name or database ID, GePI makes a best effort to recognize the search term and resolve it to one of the above databases. In some cases, IDs are ambiguous across multiple databases. For example, NCBI Gene as well as HGNC Groups use numbers as IDs. If an input ID is not recognized as expected, database prefixes can be used for disambiguation, e.g. gene:2475 refers to the human mTOR NCBI Gene entry. Prefixes are separated from the specific database ID via a colon (:). This is even possible if the ID already includes a prefix of its own like HGNC or GO. The following database prefixes are used:
database | prefix | example |
---|---|---|
NCBI Gene | gene | gene:2475 |
Ensembl | ens | ens:ENSG00000198793 |
HGNC Gene | hgnc | hgnc:HGNC:3942 |
UniProtKB | up | up:P42345 , up:MTOR_HUMAN |
HGNC Gene Group | hgncg | hgncg:1900 |
FamPlex | fplx | fplx:AMPK |
Gene Ontology | go | go:GO:0016301 |
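The prefix convention can be illustrated with a small sketch. It is not GePI's actual implementation. The function name split_database_prefix is hypothetical, and the sketch assumes that GePI prefixes are matched case-sensitively in lower case, so that an ID's own upper-case prefix such as HGNC or GO is not mistaken for a database prefix.

```python
# Prefixes taken from the table above.
KNOWN_PREFIXES = {"gene", "ens", "hgnc", "up", "hgncg", "fplx", "go"}


def split_database_prefix(term):
    """Split an input term into (prefix, identifier).

    Only the part before the FIRST colon is checked, so IDs that carry
    their own prefix (e.g. HGNC:3942) stay intact after splitting.
    Returns (None, term) if no known database prefix leads the term.
    Assumption: prefixes are lower case and matched case-sensitively.
    """
    head, sep, rest = term.partition(":")
    if sep and rest and head in KNOWN_PREFIXES:
        return head, rest
    return None, term


print(split_database_prefix("gene:2475"))
print(split_database_prefix("hgnc:HGNC:3942"))
print(split_database_prefix("mtor"))
```

Splitting only at the first colon is what allows an input like hgnc:HGNC:3942 to keep the ID's own HGNC prefix.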
Regardless of the specific source database, GePI will map the input to all of its orthologs according to the NCBI Gene gene_orthologs.gz file. Thus, interaction retrieval will be performed across all species with that gene.
Species can be restricted using the NCBI Taxonomy filters described further below.
+ is used for AND, the pipe symbol | signifies a Boolean OR and quotes "..." denote a phrase expression that groups words together, e.g. "MAPK pathway" | "MAPK signaling" searches for MAPK pathway OR MAPK signaling, where the text is required to contain the exact phrases. The query syntax is that of the Elasticsearch Simple Query String Query with all operators enabled.
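Filter queries of this form can also be assembled programmatically, for instance before passing them to a script. The following is a minimal sketch. The helper names phrase, any_of and all_of are hypothetical and not part of GePI.

```python
def phrase(text):
    """Wrap text in double quotes so it is matched as an exact phrase."""
    # Embedded quotes are dropped to keep the phrase expression well-formed.
    return '"' + text.replace('"', "") + '"'


def any_of(*phrases):
    """Combine phrases with the OR operator (pipe symbol)."""
    return " | ".join(phrase(p) for p in phrases)


def all_of(*terms):
    """Combine already formatted terms with the AND operator (plus sign)."""
    return " + ".join(terms)


query = any_of("MAPK pathway", "MAPK signaling")
print(query)  # "MAPK pathway" | "MAPK signaling"
```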
When both sentence and paragraph filters are specified, the way to combine both filters can be selected using the button between the input fields. "AND" means that an interaction must match both filters to be included in the result, "OR" includes all interactions that match either filter.
The section heading search aims at interactions mentioned within sections whose headings match the specified query. All interactions in the database store the names of the sections or subsections they are contained in. As an example, consider the following section structure.
Abstract
[...]
Results
[...]
The defence collagens C1q and MBL interact with BMP-1 and mTLL-1
[...]
BMP-1/C1q co-localization in tissue inflammation
To test the possibility of an interaction of C1q with
BMP-1 in physiological conditions, we investigated the expression of these two proteins in human skin sections by immunofluorescence.
Result filters. A GePI search can be filtered for species via NCBI Taxonomy identifiers, interaction types, and factuality level. These filters cannot constitute searches of their own but exist solely to restrict the search defined by the methods described above.
For each filter type presented below, an example is given. After clicking an example link, which opens a new browser tab, you may pull out the input panel by the handle at the left of the page to see how the filter is used in the input form.
To test the possibility of an interaction of C1q with
BMP-1 in physiological conditions, we investigated the expression of these two proteins in human skin sections by immunofluorescence.
factuality level name | description | example expression |
---|---|---|
negation | assertive statement about an interaction not taking place | not, absent, fail |
low | indication of disbelief that an interaction is taking place | unlikely |
investigation | intention or possibility for examination of an interaction | assess, explore |
moderate | existing hints to an interaction but no affirmation | may, possibly |
high | prior knowledge, e.g. experiments, that points to the existence of an interaction | imply, indicate that |
assertion | assertive statement about an interaction, absence of words that restrict the factuality | — |
Upon successful submission of a query, the input panel disappears to the left of the screen. A handle bar remains that can be clicked on to slide it back into view.
Subsequently, the result dashboard is displayed. It offers basic statistics about the search query itself and the resulting interactions, a number of data visualizations and a table that contains the individual interactions with their textual reference.
The gene and protein interaction partners are grouped into orthologous clusters as far as the underlying resources contain such cluster information. Additionally, items with the same NCBI Gene symbol are collapsed into a single group since they would obtain the same label anyway. Thus, gene names appear only once in charts and aggregated statistics instead of once for each species in the result data. If genes were merged incorrectly because they share a name, it is recommended to run separate GePI queries, each specifying only one of the distinct but equally named genes.
Almost all information on the dashboard refers to A-List or B-List interaction partners, even if one or both of the lists are actually empty. In case of a pure full-text search where List A and List B are empty, the A and B order refers to the textual order in which the interaction partners appear at the specific place in the searched literature. If List A or both lists are specified, all interactions are re-ordered before the statistics and visualizations are calculated, such that the interaction partner that is part of the A-List appears first. In open search scenarios - when only List A is non-empty - the B-List items represent the 'other' retrieved interaction partners that were not part of the query.
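The re-ordering step can be pictured with the following sketch. It is a simplification under stated assumptions: the function name order_partners is hypothetical, and partners are compared by plain gene symbol, whereas the Web application works on the database entries the inputs were mapped to.

```python
def order_partners(pair, a_list):
    """Return the interaction partners so that an A-List member comes first.

    pair   -- (first, second) partners in textual order
    a_list -- set of symbols given in List A (may be empty)
    The textual order is kept when List A is empty or when the first
    partner already belongs to List A.
    """
    first, second = pair
    if second in a_list and first not in a_list:
        return (second, first)
    return pair


print(order_partners(("JUN", "MTOR"), {"MTOR"}))
print(order_partners(("JUN", "MTOR"), set()))
```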
In the following, the different dashboard elements are explained.
The Statistics dashboard element offers insight into the number of results, the most frequent interaction partners and how the input query elements were recognized by the Web application. The search input tables are shown for the non-empty input lists. Thus, zero, one or two tables may be displayed for a given query.
The search input tables offer insight into the database entry an input was mapped to and report inputs that could not be recognized at all. In case of unrecognized items, refer to Input Specification for information about accepted names and IDs.
The Pie Chart dashboard element offers a quick overview of the most frequent interaction partners. Two tabs allow switching between the display of A and B interaction partners, and an input field specifies the number N of most frequent interaction partners to show. There is always an additional pie slice that carries the 'others' label. That slice accumulates the frequency of all the interaction partners beyond the top N. In this way, the Pie Chart slice proportions correspond to the actual proportions in the data.
For open searches or large input lists, the Pie Chart may become consumed by the 'others' slice. In this case, a look at the Bar Chart is recommended. Since a Bar Chart does not necessarily display a part-whole relationship, the 'others' bar may be switched off there to leave room for the top N interaction partners.
Note that the smallest pie slices may not receive a callout with the name of the respective interaction partner because the labels would collide. Hover the mouse cursor over such slices to see a tooltip that provides the name.
Similar to the Pie Chart, the Bar Chart displays the most frequent interaction partners, either from List A or List B. It also allows switching between the lists and specifying the number of interaction partners to display in one chart.
The special 'others' bar can be displayed or hidden through the drop-down menu next to the Bar Chart. This bar accumulates the frequencies of all the interaction partners that are not contained in the top N most frequent interaction partners that are currently shown. For open searches or large closed searches, this bar may dominate the whole plot, in which case it is recommended to hide it.
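The relationship between the top N slices and the 'others' slice can be sketched as follows. This is an illustration only. The function name top_n_with_others and the sample gene symbols are made up for the example and do not reflect GePI's internal code.

```python
from collections import Counter


def top_n_with_others(partners, n):
    """Return the n most frequent partners plus an accumulated 'others' entry.

    partners -- iterable of interaction partner symbols, one per interaction
    The 'others' entry is always appended, mirroring the Pie Chart, so that
    all frequencies together sum up to the total number of interactions.
    """
    counts = Counter(partners)
    top = counts.most_common(n)
    shown = sum(count for _, count in top)
    top.append(("others", sum(counts.values()) - shown))
    return top


sample = ["MTOR"] * 5 + ["JUN"] * 3 + ["AKT1"] * 2 + ["TP53"]
print(top_n_with_others(sample, 2))
```

Dropping the last entry corresponds to switching off the 'others' bar in the Bar Chart, at the price of losing the part-whole relationship.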
The bars show tooltips upon hovering the mouse cursor over them, showing the gene name and frequency.
The Sankey Charts show the distribution of interactions instead of interaction partners. All interactions with the same arguments in the same A-B order are accumulated to obtain the frequency of an interaction in the result set. Sankey Charts leverage the frequency information to vary the thickness of an edge between two gene symbol nodes as well as the size of the nodes. The more frequently an interaction appears in the result, the thicker its edge is shown. Gene symbols that participate very frequently in interactions are correspondingly larger. For each gene, the Sankey Charts show the proportion of its interactions with other genes in the result set.
When a Sankey Chart is enlarged using the respective button in its header, additional control items will be shown to change the space between the nodes and whether or not to display the 'other' nodes. These nodes represent the accumulated frequencies of the interaction partners that are not explicitly shown in the chart. For open searches or large closed searches, these nodes can get overly large. It is recommended to deactivate the nodes in such cases but to keep in mind that this truncates interactions from the Chart and skews the proportional aspect of the visualization.
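The accumulation of interactions for the Sankey Charts can be sketched in a few lines. The function names and sample data are hypothetical. The node weight, computed here as the sum of the frequencies of a node's edges, is an assumption about how node sizes could be derived and is not taken from the GePI source code.

```python
from collections import Counter


def interaction_frequencies(interactions):
    """Count interactions with the same arguments in the same A-B order."""
    return Counter((a, b) for a, b in interactions)


def node_weights(edge_counts):
    """Sum the edge frequencies per node, separately for the A and B side."""
    a_weights, b_weights = Counter(), Counter()
    for (a, b), count in edge_counts.items():
        a_weights[a] += count
        b_weights[b] += count
    return a_weights, b_weights


interactions = [("MTOR", "JUN"), ("MTOR", "JUN"), ("JUN", "MTOR"), ("MTOR", "AKT1")]
edges = interaction_frequencies(interactions)
a_side, b_side = node_weights(edges)
print(edges.most_common())
print(a_side, b_side)
```

Note that ("MTOR", "JUN") and ("JUN", "MTOR") are counted separately because the A-B order is part of the key.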
The Most frequent Interactions Sankey Chart aggregates all result interactions as described above and displays the most frequent ones.
The Common Interaction Partners Sankey Chart also aggregates over all result interactions like the Most frequent Interactions Sankey Chart but orders the interactions differently to determine the top N items for display. The aim is to elevate those genes that connect two other genes through interactions in high frequency. The resulting view features indirect connections between two genes through such a third gene they both are often described to interact with in the literature.
The table displays the interactions as they were extracted from the literature, providing the source document ID and the specific text portion that contains the interaction description. The table can be browsed using the paging buttons at the bottom.
It shows the interaction partner symbols and IDs as recognized during automated literature processing via natural language processing (NLP) techniques, their actual text string in the document, their factuality rating and the sentence from the document that contains the description of the interaction. If a full-text filter was set on paragraph level, a portion of the paragraph containing the filter query is also given.
The result table offers the download of the complete interaction result set in Excel format. The Download button in the header bar of the result dashboard element starts the assembly of the data. For large result sets, this may take several minutes.
The final Excel file will contain every interaction item with gene IDs, document IDs, factuality ratings and textual references. Additionally, basic statistics about interaction and interaction partner frequencies are included in separate sheets.
GePI offers a Web API that allows programmatic access. All elements of the input form can be expressed through the API. As a result, either the Excel sheet (see above) or a tab-separated file that flatly lists all retrieved interactions can be obtained.
The API currently works through HTTP GET requests (POST is not yet supported) and is realized through URL request parameters, i.e. the GePI Web address followed by /api/v1/interactions, a single question mark (?) and parameter-value pairs. The parameter and value of a pair are separated by an equals sign (=), and successive pairs are separated by ampersand characters (&).
For example:
https://gepi.coling.uni-jena.de/api/v1/interactions?alist=mtor&blist=jun
A comprehensive list of parameters and their possible values is given in the table below.
NOTE: For parameters that accept multiple values - alist and blist, for example - the values are separated via commas (,).
name | description | examples |
---|---|---|
alist | Items of list A as described in Input Specification. | mtor,s6k |
blist | Items of list B as described in Input Specification. | mtor,s6k |
taxids | Organism IDs from the NCBI Taxonomy. | 9606 |
taxidsa | Organism IDs from the NCBI Taxonomy. | 9606 |
taxidsb | Organism IDs from the NCBI Taxonomy. | 9606 |
eventtypes | A list of items in {Regulation, Positive_regulation, Negative_regulation, Binding, Localization, Phosphorylation}. Omitting this parameter leads to no restrictions in event types. | Regulation,Phosphorylation |
factuality | A number between 0 and 6 (inclusive). Represents the minimum factuality rating of returned events where 0 means a negated event statement and 6 means an assertive statement. | 3 |
filterconnector | One of {AND, OR}. Specifies how sentencefilter and paragraphfilter (see below) are combined, i.e. whether both filters must match or a single matching filter suffices for an event to be returned by the search. | AND |
sentencefilter | Sentence-level filter query. See Full-text Search. | obesity |
paragraphfilter | Paragraph- or abstract-level filter query. See Full-text Search. | obesity |
sectionnamefilter | Query to match any title/heading in a document. Is matched against document title, section headings, caption headings etc. | results |
includeunary | Whether to include events like phosphorylation of BRCA2. False by default. | true |
docid | Can be a single document ID from PubMed or PMC. Note that PMC document IDs are prefixed with PMC. | PMC4502726 |
limit | The maximum number of events to return. Defaults to unlimited. | 100 |
format | The download file format. One of excel, tsv or web, where the last option returns the GePI result page, which is not useful for download. | tsv |
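A request URL can be assembled from these parameters with a few lines of Python. The following is a minimal sketch that only builds the URL and does not send the request. The helper name build_gepi_url is hypothetical, and keeping commas unescaped is a readability choice of this example, not a documented requirement of the API.

```python
from urllib.parse import urlencode

# API address as given in the example URL above.
BASE_URL = "https://gepi.coling.uni-jena.de/api/v1/interactions"


def build_gepi_url(**params):
    """Build a GET URL from parameter names of the table above.

    List or tuple values are joined with commas, as required for
    multi-valued parameters such as alist and blist.
    Parameters set to None are omitted.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        query[name] = value
    # safe="," keeps the list separator readable instead of encoding it as %2C.
    return BASE_URL + "?" + urlencode(query, safe=",")


url = build_gepi_url(
    alist=["mtor", "s6k"],
    taxids=[9606],
    eventtypes=["Regulation", "Phosphorylation"],
    factuality=3,
    format="tsv",
    limit=100,
)
print(url)
```

The resulting URL can be opened with any HTTP client. Setting limit is advisable while experimenting, since large result sets may take several minutes to assemble.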