Purpose of this Web Site

This Web page offers the retrieval and display of bio-molecular interactions that have been extracted from the scientific literature via text mining.

The focus lies on interactions that involve genes, gene products, families, complexes and groups of genes. For the purpose of this service, no difference is made between genes and their products. Biomedical Natural Language Processing (BioNLP) algorithms are employed to scan the text of documents from PubMed and the PubMed Central open access subset for mentions of interactions between these types of entities. Recognized interactions include gene expression, phosphorylation, positive and negative regulation and physical binding events. For example, the sentence

Fibroblast growth factor-1 (FGF-1) enhances IL-2 production and nuclear translocation of NF-kappaB in FGF receptor-bearing Jurkat T cells.

contains the gene/protein mentions Fibroblast growth factor-1, FGF-1 and IL-2, the protein complex NF-kappaB and the FGF receptor family. Between some of those, several events describe the up-regulation of IL-2, indicated by the words enhances and production. FGF-1 is a short form of Fibroblast growth factor-1. In such cases, the interaction with the short form is omitted from the database to avoid duplication.

For this service, the gene/protein names are recognized and molecular events are extracted from the document text. Interactions that include multiple events are simplified to a single GePI database item that records the entities and the interaction/event types: Fibroblast growth factor-1 and IL-2, (up-)regulation and gene expression. This is an example of a binary interaction between two entities.

Events can also refer to a single entity and thus be unary. Frequent examples are phosphorylation events where the cause of the activation is either not mentioned or is not an entity in the scope of GePI. Consider the example

We have examined changes in HS1 phosphorylation[...]

where the phosphorylation of an entity is discussed without the mention of another entity.

Both binary and unary events are included in the GePI search index. In result displays that require two interaction partners, unary events are either filtered out - e.g. for Sankey charts - or shown without second interaction partner information - e.g. in the result table.

Input Specification

At the heart of a GePI search lies the specification of one or two lists of genes, proteins or families, complexes and groups of them. Recognized names or IDs include

The first list of identifiers is called List A or A-List and the second one is called List B or B-List throughout this Web page.

Given a gene name or database ID, GePI makes best efforts to recognize the search term and resolve it to one of the above databases. In some cases, IDs are ambiguous across multiple databases. For example, NCBI Gene as well as HGNC Groups use numbers as IDs. If an input ID is not recognized as expected, database prefixes can be used for disambiguation, e.g. gene:2475 refers to the human mTOR NCBI Gene entry. Prefixes are separated from the specific database ID by a colon (:). This is even possible if the ID already includes a prefix of its own, like HGNC or GO. The following database prefixes are used:

| database | prefix | example |
|---|---|---|
| NCBI Gene | gene | gene:2475 |
| Ensembl | ens | ens:ENSG00000198793 |
| HGNC Gene | hgnc | hgnc:HGNC:3942 |
| UniProtKB | up | up:P42345, up:MTOR_HUMAN |
| HGNC Gene Group | hgncg | hgncg:1900 |
| FamPlex | fplx | fplx:AMPK |
| Gene Ontology | go | go:GO:0016301 |
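
The prefix convention can be illustrated with a small Python sketch. It is not GePI's actual implementation: the function name split_prefixed_id is hypothetical, and the choice to match prefixes case-sensitively is an assumption made so that an ID's own prefix (HGNC, GO) is not mistaken for a database prefix.

```python
# Database prefixes as listed in the table above.
# The parsing logic is an illustrative assumption, not GePI's code.
KNOWN_PREFIXES = {
    "gene": "NCBI Gene",
    "ens": "Ensembl",
    "hgnc": "HGNC Gene",
    "up": "UniProtKB",
    "hgncg": "HGNC Gene Group",
    "fplx": "FamPlex",
    "go": "Gene Ontology",
}


def split_prefixed_id(term):
    """Split 'prefix:id' at the first colon if the prefix is a known one.

    Returns (database_name, id), or (None, term) if no known prefix is found.
    Matching is case-sensitive (assumption), so 'HGNC:3942' stays intact.
    """
    prefix, sep, rest = term.partition(":")
    if sep and rest and prefix in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[prefix], rest
    return None, term


if __name__ == "__main__":
    for term in ["gene:2475", "hgnc:HGNC:3942", "go:GO:0016301", "mtor"]:
        print(term, "->", split_prefixed_id(term))
```

Splitting only at the first colon keeps IDs that carry their own prefix, such as HGNC:3942, in one piece.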

Regardless of the specific source database, GePI will map the input to all of its orthologs according to the NCBI Gene gene_orthologs.gz file. Thus, interaction retrieval will be performed across all species with that gene. Species can be restricted using the NCBI Taxonomy filters described further below.

Search types

Open Search
When List A is specified but List B is not, this is called an open search because the set of possible interaction partners is left open. The result will contain interactions that have an item of List A as one interaction partner and any other interaction partner. No events will be returned where both interaction partners belong to List A. If this should be allowed, a closed search can be performed where List A and List B are set to the same list of identifiers.

Closed Search
When List A and List B are provided, this is called a closed search because a closed set of interaction partners has been defined. Result interactions will always contain one interaction partner from List A and one from List B. Entries are allowed to exist in both lists. No events will be returned that contain the same gene as both interaction partners.
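
The open and closed search semantics described above can be summarized in a short Python sketch. It is only an illustration under simplifying assumptions: the function name matches_search is hypothetical, interaction partners are compared as plain symbols, and the ortholog mapping that GePI performs is ignored.

```python
def matches_search(partner1, partner2, list_a, list_b=None):
    """Decide whether a binary interaction fits an open or closed search.

    Simplified sketch: partners and list items are plain strings, and
    ortholog expansion is not modelled.
    """
    a = set(list_a)
    if not list_b:
        # Open search: exactly one partner belongs to List A.
        return (partner1 in a) != (partner2 in a)
    b = set(list_b)
    if partner1 == partner2:
        # The same gene on both sides is never returned.
        return False
    # Closed search: one partner from List A and the other from List B.
    return (partner1 in a and partner2 in b) or (partner2 in a and partner1 in b)


if __name__ == "__main__":
    genes = ["MTOR", "RPS6KB1"]
    print(matches_search("MTOR", "JUN", genes))             # open search
    print(matches_search("MTOR", "RPS6KB1", genes))         # open, both in A
    print(matches_search("MTOR", "RPS6KB1", genes, genes))  # closed, A == B
```

The last call shows the workaround mentioned for open searches: setting List A and List B to the same identifiers admits interactions between two List A members.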

Full-text Search
It is possible to omit the specification of List A and List B completely if at least one of the three full-text filter fields is specified. The result will contain the interactions that appear in the sentences or paragraphs that match the respective full-text query. Basic boolean operators are supported: + is used for AND, the pipe symbol | signifies a boolean OR, and quotes "..." denote a phrase expression that groups words together. For example, "MAPK pathway" | "MAPK signaling" searches for MAPK pathway OR MAPK signaling, where the text is required to contain one of the exact phrases. The query syntax is that of the Elasticsearch Simple Query String Query with all operators enabled.

When both sentence and paragraph filters are specified, the way to combine both filters can be selected using the button between the input fields. "AND" means that an interaction must match both filters to be included in the result; "OR" includes all interactions that match either filter.

The section heading search aims at interactions mentioned within sections whose headings match the specified query. All interactions in the database store the names of the sections or subsections they are contained in. As an example, consider the following section structure.
Interaction of Complement Defence Collagens C1q and Mannose-Binding Lectin with BMP-1/Tolloid-like Proteinases

Abstract

[...]

Results

[...]

The defence collagens C1q and MBL interact with BMP-1 and mTLL-1

[...]

BMP-1/C1q co-localization in tissue inflammation

To test the possibility of an interaction of C1q with BMP-1 in physiological conditions, we investigated the expression of these two proteins in human skin sections by immunofluorescence.

The interaction between C1q and BMP-1 is mentioned in the Results subsection named BMP-1/C1q co-localization in tissue inflammation of PMC5717261. The database entry for this interaction stores all headings and titles that include the interaction - or each other - up to the document title:
  • BMP-1/C1q co-localization in tissue inflammation (subsection heading)
  • Results (section heading)
  • Interaction of Complement Defence Collagens C1q and Mannose-Binding Lectin with BMP-1/Tolloid-like Proteinases (article title)
Of note, the subheading The defence collagens C1q and MBL interact with BMP-1 and mTLL-1 is not included in the list because it is not part of the section chain from the interaction to the article title. It exists on the same section sub-level as BMP-1/C1q co-localization in tissue inflammation, the subsection that includes the interaction.
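
The heading chain stored with an interaction can be pictured as a walk from the innermost section up to the article title. The following Python sketch illustrates this with the example above; the Section class and the heading_chain function are hypothetical names and do not reflect how GePI represents documents internally.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical document node: a heading, its sentences and subsections."""
    heading: str
    sentences: list = field(default_factory=list)
    subsections: list = field(default_factory=list)


def heading_chain(section, sentence):
    """Return the headings from the section holding `sentence` up to the root.

    Returns None if the sentence does not occur below `section`.
    """
    if sentence in section.sentences:
        return [section.heading]
    for sub in section.subsections:
        chain = heading_chain(sub, sentence)
        if chain is not None:
            return chain + [section.heading]
    return None


sentence = "To test the possibility of an interaction of C1q with BMP-1 [...]"
article = Section(
    "Interaction of Complement Defence Collagens C1q and Mannose-Binding "
    "Lectin with BMP-1/Tolloid-like Proteinases",
    subsections=[
        Section("Abstract"),
        Section("Results", subsections=[
            Section("The defence collagens C1q and MBL interact with BMP-1 and mTLL-1"),
            Section("BMP-1/C1q co-localization in tissue inflammation",
                    sentences=[sentence]),
        ]),
    ],
)

if __name__ == "__main__":
    for heading in heading_chain(article, sentence):
        print(heading)
```

Sibling subsections never enter the chain because the recursion only returns headings on the path that leads to the sentence.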

Filter types

Result filters. A GePI search can be filtered for species via NCBI Taxonomy identifiers, interaction types, and factuality level. These filters cannot constitute searches of their own but exist solely to restrict the search defined by the methods described above.

For each filter type presented below, an example is given. After clicking an example link, which opens a new browser tab, you may pull out the input panel by the handle at the left of the page to see how the filter is used with the input form.

The filter examples consist of two links each, one for a reference search without the filter and one search with the enabled filter for comparison. Please be patient and wait for each search to complete before you hit another link. The web application has a single session state and would return the same - the latest - result for both searches if different links are hit too quickly one after another.

Taxonomy filter
A comma-separated list of NCBI Taxonomy IDs may be provided to filter on the interaction partners. NCBI Gene entries are always assigned exactly one taxonomy ID. The GePI text processing pipeline also assigns species to protein families based on the species discussed in the respective document. If no species is mentioned at all in a document, which commonly happens in PubMed abstracts, the human taxonomy ID 9606 is assigned because human is the most frequently discussed organism in PubMed. The species are stored as taxonomy IDs in the interaction database and can be used for species filtering.

GePI offers taxonomy filters for 1) List A or B, 2) only List A and 3) only List B:
  • Case 1): an interaction is retrieved if either of its arguments matches any of the specified species.
  • Case 2): the genes of List A are restricted to the given organisms.
  • Case 3a): there are items on List B. Those items are restricted to the given organisms, analogous to case 2).
  • Case 3b): List B is empty but an organism filter for List B is specified. List B then implicitly consists of all genes, groups, families etc. that belong to the given taxonomy IDs.
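
The three filter cases can be sketched as a predicate over the taxonomy IDs of the two interaction partners. This is a simplified illustration rather than GePI's implementation: the function and argument names are hypothetical, and combining several active filters with a logical AND is an assumption.

```python
def passes_taxonomy_filter(tax_a, tax_b, taxids=(), taxids_a=(), taxids_b=()):
    """Check an interaction's partner species against the taxonomy filters.

    tax_a / tax_b: taxonomy IDs of the A-side and B-side partner.
    taxids:   case 1), either partner must match.
    taxids_a: case 2), the A-side partner must match.
    taxids_b: case 3), the B-side partner must match.
    Empty filters impose no restriction; active filters are AND-combined
    (assumption).
    """
    if taxids and tax_a not in taxids and tax_b not in taxids:
        return False
    if taxids_a and tax_a not in taxids_a:
        return False
    if taxids_b and tax_b not in taxids_b:
        return False
    return True


if __name__ == "__main__":
    # Mouse gene (10090) on the A side, human gene (9606) on the B side.
    print(passes_taxonomy_filter(10090, 9606, taxids={9606}))
    print(passes_taxonomy_filter(10090, 9606, taxids_a={9606}))
    print(passes_taxonomy_filter(10090, 9606, taxids_b={9606}))
```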

Interaction types filter
By default, interactions of all types are included into a search result. This selection can be reduced to focus on specific interaction types, e.g. binding events regarding the input gene lists.

Inclusion of single gene events
The interaction extraction algorithm frequently finds molecular event descriptions that refer to a single gene without further interaction partners as described in Purpose of this Web site. These unary events are excluded from the search results by default. They will be included if the button is activated.

Factuality level filter
Reports of molecular interactions may include words or phrases that restrict the factuality of an interaction description, meaning that the described interaction might or might not have been observed. Consider the following example.

To test the possibility of an interaction of C1q with BMP-1 in physiological conditions, we investigated the expression of these two proteins in human skin sections by immunofluorescence.

The word possibility expresses at this specific position in text that the interaction between C1q and BMP-1 is not reported as a fact but stands to be investigated. To control the degree of factuality of returned interactions, GePI offers six filter levels.
| factuality level name | description | example expression |
|---|---|---|
| negation | assertive statement about an interaction not taking place | not, absent, fail |
| low | indication of disbelief that an interaction is taking place | unlikely |
| investigation | intention or possibility for examination of an interaction | assess, explore |
| moderate | existing hints to an interaction but no affirmation | may, possibly |
| high | prior knowledge, e.g. experiments, that points to the existence of an interaction | imply, indicate that |
| assertion | assertive statement about an interaction, absence of words that restrict the factuality | |

Selecting a factuality level restricts the result to interactions that were assigned the selected or a higher - more assertive - factuality level. Thus, selecting moderate will allow interactions with moderate, high and assertion factuality levels to be included in the result. The default selection is negation, which does not impose any factuality restriction.
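
The "selected level or higher" rule amounts to slicing an ordered list of levels. The sketch below uses the order from the table above; the function name allowed_levels is hypothetical.

```python
# Factuality levels in ascending order of assertiveness, as in the table above.
FACTUALITY_LEVELS = [
    "negation", "low", "investigation", "moderate", "high", "assertion",
]


def allowed_levels(selected):
    """Return the selected level and all more assertive levels."""
    start = FACTUALITY_LEVELS.index(selected)
    return FACTUALITY_LEVELS[start:]


if __name__ == "__main__":
    print(allowed_levels("moderate"))
    print(allowed_levels("negation"))
```

Selecting negation returns the full list, which is why the default imposes no restriction.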

Full-text filters
The full-text fields can be used to start a search on their own or they can serve as a filter for the results of specified A-List and B-List genes. Then, only interactions will be returned that adhere to the gene list specification and also match the provided full-text filters. See full-text search for information about the exact full-text query mechanisms.

Output Dashboard

Upon successful submission of a query, the input panel disappears to the left of the screen. A handle bar remains that can be clicked on to slide the panel back into view.

Subsequently, the result dashboard is displayed. It offers basic statistics about the search query itself and the resulting interactions, a number of data visualizations and a table that contains the individual interactions with their textual reference.

The gene and protein interaction partners are grouped into orthologous clusters as far as the underlying resources contain such cluster information. Additionally, items with the same NCBI Gene symbol are collapsed into a single group since they would obtain the same label anyway. Thus, gene names appear only once in charts and aggregated statistics instead of once for each species in the result data. In case of invalid name mergings, multiple separate GePI queries are recommended that specify only one of the separate but equal-named genes, respectively.

Almost all information on the dashboard refers to A-List or B-List interaction partners, even if one or both of the lists are actually empty. In case of a pure full-text search where List A and List B are empty, the A and B order refers to the textual order in which the interaction partners appear at a specific place in the searched literature. If List A or both lists are specified, all interactions are re-ordered such that the interaction partner that is part of the A-List appears first; this happens before the statistics and visualizations are calculated. In open search scenarios - when only List A is non-empty - the B-List items represent the 'other' retrieved interaction partners that were not part of the query.

In the following, the different dashboard elements are explained.

Statistics Element

The Statistics dashboard element offers insight into the number of results, the most frequent interaction partners and how the input query elements were recognized by the Web application. The search input tables are shown for the non-empty input lists. Thus, zero, one or two tables may be displayed for a given query.

The search input tables offer insight into the database entry an input was mapped to and report inputs that could not be recognized at all. In case of unrecognized items, refer to Input Specification for information about accepted names and IDs.

Pie Chart

The Pie Chart dashboard element offers a quick overview of the most frequent interaction partners. Two tabs allow switching between the display of A or B interaction partners, and an input field specifies the top N most frequent interaction partners to show. There is always an additional pie slice that carries the 'others' label. That slice accumulates the frequency of all the interaction partners beyond the top N. In this way, the Pie Chart slice proportions correspond to the actual proportions in the data.

For open searches or large input lists, the Pie Chart may become consumed by the 'others' slice. In this case, a look at the Bar Chart is recommended. Since a Bar Chart does not necessarily display a part-whole relationship, the 'others' bar may be switched off there to leave room for the top N interaction partners.

Note that the smallest pie slices may not receive a callout with the name of the respective interaction partner because the labels would collide. Hover the mouse cursor over such slices to see a tooltip that provides the name.
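
The top N plus 'others' aggregation can be sketched in a few lines of Python. This is only an illustration of the idea; the function name and the input format (one list entry per occurrence of an interaction partner) are assumptions.

```python
from collections import Counter


def top_n_with_others(partners, n):
    """Count partner occurrences and fold everything beyond the top n
    into a single 'others' entry, so that all counts still sum to the total."""
    counts = Counter(partners)
    top = counts.most_common(n)
    others = sum(counts.values()) - sum(count for _, count in top)
    return top + [("others", others)]


if __name__ == "__main__":
    partners = ["JUN", "JUN", "FOS", "AKT1", "JUN", "FOS"]
    for name, count in top_n_with_others(partners, 2):
        print(name, count)
```

Because the 'others' entry keeps the remaining frequency mass, each slice's share of the whole is preserved.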

Bar Chart

Similar to the Pie Chart, the Bar Chart displays the most frequent interaction partners, either from List A or List B. It also allows switching between lists and specifying the number of interaction partners to display in one chart.

The Bar Chart allows displaying or hiding the special 'others' bar through the drop-down menu next to it. This bar accumulates the frequencies of all the interaction partners that are not among the top N most frequent interaction partners currently shown. For open searches or large closed searches, this bar may dominate the whole plot, in which case it is recommended to hide it.

The bars show tooltips upon hovering the mouse cursor over them, showing the gene name and frequency.

Sankey Charts

The Sankey Charts show the distribution of interactions rather than of interaction partners. All interactions with the same arguments in the same A-B order are accumulated to obtain the frequency of an interaction in the result set. Sankey Charts leverage the frequency information to vary the thickness of an edge between two gene symbol nodes as well as the size of the nodes. The more frequently an interaction appears in the result, the thicker its edge is shown. Gene symbols that participate very frequently in interactions are correspondingly larger. For each gene, the Sankey Charts show the proportion of its interactions with other genes in the result set.
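
The accumulation step behind the Sankey Charts can be illustrated as counting ordered partner pairs. The sketch below is an assumption-laden illustration, not GePI's code: the function name is hypothetical, and the swap that puts the List A member first is a simplified version of the re-ordering described in the dashboard overview.

```python
from collections import Counter


def aggregate_interactions(interactions, list_a=()):
    """Count interactions per ordered (A, B) pair.

    interactions: iterable of (partner1, partner2) tuples in textual order.
    list_a: optional query items; if given, a pair is swapped so that the
            partner belonging to List A comes first.
    """
    a = set(list_a)
    counts = Counter()
    for first, second in interactions:
        if a and first not in a and second in a:
            first, second = second, first
        counts[(first, second)] += 1
    return counts


if __name__ == "__main__":
    found = [("MTOR", "JUN"), ("JUN", "MTOR"), ("MTOR", "FOS")]
    for pair, frequency in aggregate_interactions(found, ["MTOR"]).items():
        print(pair, frequency)
```

The resulting frequencies would correspond to the edge thickness in the chart; without a List A, the two textual orders of the same gene pair remain separate entries.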

When a Sankey Chart is enlarged using the respective button in its header, additional control items will be shown to change the space between the nodes and whether or not to display the 'other' nodes. These nodes represent the accumulated frequencies of the interaction partners that are not explicitly shown in the chart. For open searches or large closed searches, these nodes can get overly large. It is recommended to deactivate the nodes in such cases but to keep in mind that this truncates interactions from the Chart and skews the proportional aspect of the visualization.

Most frequent Interactions Sankey Chart

The Most frequent Interactions Sankey Chart aggregates all result interactions as described above and displays the most frequent ones.

Common Interaction Partners Sankey Chart

The Common Interaction Partners Sankey Chart also aggregates over all result interactions like the Most frequent Interactions Sankey Chart but orders the interactions differently to determine the top N items for display. The aim is to elevate those genes that connect two other genes through interactions in high frequency. The resulting view features indirect connections between two genes through such a third gene they both are often described to interact with in the literature.

Result Table

The table displays the interactions as they were extracted from the literature, providing the source document ID and the specific text portion that contains the interaction description. The table can be browsed using the paging buttons at the bottom.

It shows the interaction partner symbols and IDs as recognized during automated literature processing via natural language processing (NLP) techniques, their actual text string in the document, their factuality rating and the sentence from the document that contains the description of the interaction. If a full-text filter was set on paragraph level, a portion of the paragraph containing the filter query is also given.

Result Download

The result table offers the download of the complete interaction result set in Excel format. The Download button in the header bar of the result dashboard element starts the assembly of the data. For large result sets, this may take several minutes.

The final Excel file will contain every interaction item with gene IDs, document IDs, factuality ratings and textual references. Additionally, basic statistics about interaction and interaction partner frequencies are included in separate sheets.

API

GePI offers a Web API that allows programmatic access. All elements of the input form can be expressed through the API. As a result, the Excel sheet (see above) or a tab-separated file flatly listing all retrieved interactions can be obtained.

The API currently works through HTTP GET requests (POST is not yet supported) and is realized through URL request parameters, i.e. the GePI Web address followed by /api/v1/interactions, a single question mark (?) and parameter-value pairs. Parameter and value are separated by an equals sign (=), and consecutive pairs are separated by an ampersand (&). For example:

https://gepi.coling.uni-jena.de/api/v1/interactions?alist=mtor&blist=jun

A comprehensive list of parameters and their possible values is given in the table below.
NOTE:

  • If a parameter is multi-valued - like alist and blist, for example - multiple values are separated by commas (,).
  • Whitespace and special characters must be URL-encoded.

| name | description | examples |
|---|---|---|
| alist | Items of list A as described in Input Specification. | mtor,s6k |
| blist | Items of list B as described in Input Specification. | mtor,s6k |
| taxids | Organism IDs from the NCBI Taxonomy. | 9606 |
| taxidsa | Organism IDs from the NCBI Taxonomy. | 9606 |
| taxidsb | Organism IDs from the NCBI Taxonomy. | 9606 |
| eventtypes | A list of items in {Regulation, Positive_regulation, Negative_regulation, Binding, Localization, Phosphorylation}. Omitting this parameter leads to no restrictions in event types. | Regulation,Phosphorylation |
| factuality | A number between 0 and 6 (inclusive). Represents the minimum factuality rating of returned events where 0 means a negated event statement and 6 means an assertive statement. | 3 |
| filterconnector | One of {AND, OR}. Specifies how sentencefilter and paragraphfilter (see below) are combined, i.e. whether only one filter or both filters must match for an event to be returned by the search. | AND |
| sentencefilter | Sentence-level filter query. See Full-text Search. | obesity |
| paragraphfilter | Paragraph- or abstract-level filter query. See Full-text Search. | obesity |
| sectionnamefilter | Query to match any title/heading in a document. Is matched against document title, section headings, caption headings etc. | results |
| includeunary | Whether to include unary events like phosphorylation of BRCA2. False by default. | true |
| docid | A single document ID from PubMed or PMC. Note that PMC document IDs are prefixed with PMC. | PMC4502726 |
| limit | The maximum number of events to return. Defaults to unlimited. | 100 |
| format | The download file format. One of excel, tsv or web, where the last option returns the GePI result page, which is not useful for download. | tsv |
The API parameters correspond to the input form elements on the GePI query input page. Refer to the tooltips there for more information on each parameter.
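
A request URL following the rules above can be assembled with the Python standard library. The sketch below is a minimal illustration, not an official client: the helper name build_gepi_url is hypothetical, and it only builds the URL without sending a request. Multi-valued parameters are joined with commas, and all other characters are URL-encoded.

```python
from urllib.parse import urlencode

# API endpoint as given in the example URL above.
BASE_URL = "https://gepi.coling.uni-jena.de/api/v1/interactions"


def build_gepi_url(**params):
    """Build a GET URL from parameter names and values (hypothetical helper).

    Lists and tuples are joined with commas, booleans are written as
    'true'/'false', and values are URL-encoded with commas kept literal.
    """
    query = {}
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        query[name] = value
    return BASE_URL + "?" + urlencode(query, safe=",")


if __name__ == "__main__":
    print(build_gepi_url(alist=["mtor", "s6k"], blist="jun", format="tsv"))
    print(build_gepi_url(alist="mtor", sentencefilter='"MAPK pathway"'))
```

The resulting URL can be passed to any HTTP client that issues GET requests; for large result sets, the server may need several minutes to assemble the download.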