Frequently Asked Questions

How do I cite GePI?

Please cite the following paper in your work:

Erik Faessler, Udo Hahn, Sascha Schäuble, GePI: large-scale text mining, customized retrieval and flexible filtering of gene/protein interactions, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W237–W242, https://doi.org/10.1093/nar/gkad445

Which browsers and operating systems were tested with GePI?

The table below lists combinations of operating systems and browsers that have been tested with GePI.

OS	Version	Chrome	Firefox	Microsoft Edge	Safari
Linux	Mint 20.3, Ubuntu 22.04	108.0.5359.124	108.0.1	-	-
Mac OS	11.6.8, 11.7.1	108.0.5359.124	108.0.1	-	15.4, 16.1
Windows	10, 11	108.0.5359.125	108.0.1	108.0.1462.54	-

Which NLP tools are used in GePI to find molecular interactions in the literature?

Basic linguistic processing: JCoRe
Gene recognition: Adapted JULIE Lab version of GNormPlus
Interaction extraction: BioSem trained on BioNLP Shared Task 2011 data
Factuality assessment: JCoRe components for hedge word detection and assignment to interaction descriptions

Can I do the NLP processing on my own?

All the required components and code are freely available, see below. However, PubMed and the PubMed Central open access subset are large text repositories, so processing them in a reasonable time requires significant computational resources. As a rule of thumb, 50-60 CPU cores and machines with at least 64 GB of memory are recommended. Depending on the quality of the hardware, lower numbers might still be adequate: the GePI pipeline itself has rather modest hardware requirements, and no GPUs are needed.

Which technology is used for GePI?

NLP pipeline: JCoRe UIMA components (see above)
Neo4j 4.4 with the JULIE Lab Neo4j Concept Server Plugin
ElasticSearch 7.18 with the Preanalyzed Mapper Plugin
Tapestry 5.8

How exactly are gene names mapped to gene IDs in a GePI query?

We normalize the input names and match them against the NCBI Gene symbols in our database. Normalization lower-cases the name and removes punctuation and whitespace so that, for example, il2 and il-2 are both mapped to IL2. Gene name matching often finds multiple matches in our database, even though we use the NCBI gene_orthologs file to create single representatives for orthologous genes. One reason is that the file does not (yet) cover all species: since genes that exist in several species often carry the same name, a single input name can match several genes. Another reason is that the normalization itself can cause multiple symbols to match. For this reason, the symbol mapping table in the statistics element of the result dashboard shows the most frequent target name for an input gene name. Nevertheless, GePI searches for all matched elements. If this leads to unwanted results, we recommend using canonical gene IDs in the query.
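The normalization and the resulting one-to-many matches can be illustrated with a minimal Python sketch. It is not GePI's actual code: the function names, the tiny symbol table and the placeholder gene IDs are assumptions made for illustration only.

```python
import string
from collections import defaultdict


def normalize_gene_name(name):
    """Lower-case the name and drop punctuation and whitespace."""
    return "".join(
        ch for ch in name.lower()
        if ch not in string.punctuation and not ch.isspace()
    )


def build_index(symbols):
    """Group gene IDs by the normalized form of their symbol."""
    index = defaultdict(list)
    for symbol, gene_id in symbols:
        index[normalize_gene_name(symbol)].append(gene_id)
    return index


# Hypothetical symbol table; the IDs are placeholders, not real NCBI Gene IDs.
SYMBOLS = [
    ("IL2", "placeholder-id-human"),
    ("Il2", "placeholder-id-mouse"),
    ("TP53", "placeholder-id-tp53"),
]
INDEX = build_index(SYMBOLS)


def lookup(name):
    """Return all gene IDs whose normalized symbol equals the normalized input."""
    return INDEX.get(normalize_gene_name(name), [])


print(lookup("il-2"))  # both IL2 entries match after normalization
```

In this sketch, il2, il-2 and IL 2 all normalize to the same key, and the lookup returns one ID per species that is not merged into a single representative. Querying by gene ID avoids this ambiguity.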

Why do I receive interactions with genes that are not in my search query?

Queries for families, complexes and gene groups also match their parts or members. For example, a query for AKT also retrieves interactions that include AKT1. This is most obvious when searching for GO terms: the query result consists of genes annotated with the GO term in the query, even if the query does not contain a single gene name or ID.
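The effect of member expansion can be sketched in a few lines of Python. This is an illustrative assumption about how such an expansion could work, not GePI's implementation; the MEMBERS mapping and the function name are hypothetical.

```python
def expand_concept(concept, members):
    """Return the concept together with all of its direct and indirect members."""
    expanded = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited; also guards against cycles
        expanded.add(current)
        stack.extend(members.get(current, []))
    return expanded


# Hypothetical hierarchy: a family name maps to its member genes.
MEMBERS = {"AKT": ["AKT1", "AKT2", "AKT3"]}

print(sorted(expand_concept("AKT", MEMBERS)))
```

A query for the family AKT is thereby treated as a query for AKT, AKT1, AKT2 and AKT3, whereas a query for AKT1 is not expanded any further.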

Why do full-text query or filter results highlight words that do not seem to match a query term?

For full-text queries, GePI expands abbreviations. Consider the abbreviation OS for oscillatory shear stress, used for example in PMC10835076. It is introduced at the beginning of the document and then used throughout the text. Without expansion, the query stress would match only the first occurrence of the term, where the long form is given. To allow matches in all the other places, too, we internally expand abbreviations, which is why an abbreviation such as OS can be highlighted for the query stress.
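The following Python sketch illustrates the idea. It assumes that the abbreviation and its long form are already known and makes no claim about how GePI detects or indexes abbreviations; all names and the sample text are hypothetical.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Pair each surface token with the terms it should be searchable by.

    A known abbreviation is additionally searchable by the words of its long form.
    """
    expanded = []
    for token in re.findall(r"\w+", text):
        terms = [token.lower()]
        long_form = abbreviations.get(token)
        if long_form:
            terms.extend(long_form.lower().split())
        expanded.append((token, terms))
    return expanded


def matching_tokens(query, expanded):
    """Return the surface tokens that would be highlighted for the query."""
    q = query.lower()
    return [surface for surface, terms in expanded if q in terms]


# Hypothetical sample text, not taken from PMC10835076.
text = "Oscillatory shear stress (OS) alters signalling. OS also induces inflammation."
expanded = expand_abbreviations(text, {"OS": "oscillatory shear stress"})
print(matching_tokens("stress", expanded))  # ['stress', 'OS', 'OS']
```

The query stress matches the literal word once and both occurrences of OS, so the highlighted text contains OS although it does not literally contain the query term.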


Is GePI open source?

It is! Please find the complete source code and documentation at https://github.com/JULIELab/gepi.