How to cite GePI?
Please cite the following paper in your work:
Erik Faessler, Udo Hahn, Sascha Schäuble, GePI: large-scale text mining, customized retrieval and flexible filtering of gene/protein interactions, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W237–W242, https://doi.org/10.1093/nar/gkad445
-
Which browsers and operating systems were tested with GePI?
-
The table below lists combinations of operating systems and browsers that have been tested with GePI.
OS | Version | Chrome | Firefox | Microsoft Edge | Safari |
Linux | Mint 20.3, Ubuntu 22.04 | 108.0.5359.124 | 108.0.1 | - | - |
Mac OS | 11.6.8, 11.7.1 | 108.0.5359.124 | 108.0.1 | - | 15.4, 16.1 |
Windows | 10, 11 | 108.0.5359.125 | 108.0.1 | 108.0.1462.54 | - |
-
Which NLP tools are used in GePI to find molecular interactions in the literature?
-
Can I do the NLP processing on my own?
-
All the required components and code are freely available, see below.
However, PubMed and PubMed Central
(open access subset) are large text repositories, and processing them in reasonable time requires significant
computational resources. As a rule of thumb, 50-60 CPU cores
and machines with at least 64 GB of memory are recommended. Depending on the quality of the hardware,
lower numbers might still be adequate. Beyond that, the GePI pipeline has rather modest
hardware requirements. No GPUs are needed.
-
Which technology is used for GePI?
-
How exactly are gene names mapped to gene IDs in a GePI query?
-
We match the input names after a normalization step to the NCBI Gene symbols in our database.
The normalization step includes lower-casing of the name and the removal of punctuation and white spaces so that, for example, il2 and il-2 are both mapped to IL2.
Gene name matching often finds multiple matches in our database, even though we use the NCBI gene_orthologs file to create single representatives for orthologous genes. The file does not (yet) include all species, and since genes that exist in several species often carry the same name, a single input name can match several entries. It is also possible that the normalization causes multiple symbols to match.
For this reason, the symbol mapping table in the statistics element of the result dashboard shows the most frequent target name for an input gene name. Still, all found elements will be searched for in GePI. If this leads to unwanted results, it is recommended to use canonical gene IDs in the query.
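The normalization and lookup described above can be sketched in Python as follows. This is a minimal illustration and not GePI's actual code: the function names, the tiny symbol table and the placeholder IDs are all assumptions made for the example, and only ASCII punctuation is handled.

```python
import string
from collections import defaultdict

# Characters removed during normalization (ASCII punctuation and whitespace only).
_DROP = set(string.punctuation) | set(string.whitespace)


def normalize(name: str) -> str:
    """Lower-case the name and remove punctuation and white space."""
    return "".join(ch for ch in name.lower() if ch not in _DROP)


# Hypothetical symbol table: (symbol, placeholder ID). Not real database content.
SYMBOLS = [
    ("IL2", "placeholder-id-1"),
    ("Il2", "placeholder-id-2"),  # same name in another species
    ("TP53", "placeholder-id-3"),
]


def build_index(symbols):
    """Group all database symbols by their normalized form."""
    index = defaultdict(list)
    for symbol, gene_id in symbols:
        index[normalize(symbol)].append((symbol, gene_id))
    return index


def lookup(name, index):
    """Return every (symbol, ID) pair whose normalized symbol equals the input's."""
    return index.get(normalize(name), [])


INDEX = build_index(SYMBOLS)
matches = lookup("il-2", INDEX)
```

In this sketch, `il2`, `il-2` and `IL 2` all normalize to the same key, and the lookup returns two entries. That is the multiple-match situation in which querying by gene ID avoids the ambiguity.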
-
Why do I receive interactions with genes that are not in my search query?
-
Queries for families, complexes and gene groups also match their parts or members. For example, a query for AKT will retrieve interactions including AKT1. This is most obvious when searching for GO terms where the query result consists of genes annotated with the GO term contained in the query - even if the query does not contain a single gene name or ID.
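The effect of such an expansion can be illustrated with a small Python sketch. The group table and the function name are hypothetical and only serve the example; how GePI stores and resolves families, complexes and GO annotations internally is not shown here.

```python
# Hypothetical group table for illustration; not GePI's actual data model.
GROUP_MEMBERS = {
    "AKT": ["AKT1", "AKT2", "AKT3"],
}


def expand_query(terms, groups=GROUP_MEMBERS):
    """Return the query terms plus the members of any group they name.

    Order is preserved and duplicates are dropped.
    """
    expanded = []
    for term in terms:
        for candidate in [term, *groups.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query(["AKT"])
```

Under these assumptions, a query for `AKT` is effectively searched as `AKT`, `AKT1`, `AKT2` and `AKT3`, which is why results can contain genes that were never typed into the query.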
-
Why do full-text query or filter results highlight words that do not seem to match a query term?
-
For full-text queries, GePI expands abbreviations. Consider the abbreviation OS for oscillatory shear stress, used for example in PMC10835076. It is introduced at the beginning of the document and then used throughout the text. Without expansion, the query stress would only match the first occurrence of the term, where the long form is given. To allow matches in all the other places, too, we internally expand abbreviations to their long forms. A highlighted abbreviation such as OS can therefore be a hit for a word of its long form.
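The following Python sketch illustrates why expansion makes later occurrences findable. It assumes the abbreviation has already been detected and is passed in as a mapping. The function name and the sample sentence are invented for illustration, and the sketch rewrites the text directly, which is not necessarily how GePI implements the expansion.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace standalone short forms with their long forms.

    A short form inside parentheses, e.g. the defining "(OS)", is left
    untouched because its long form already precedes it.
    """
    for short, long_form in abbreviations.items():
        # Match the short form only when it is not part of a longer word
        # and not directly enclosed in parentheses.
        pattern = re.compile(r"(?<![\w(])" + re.escape(short) + r"(?![\w)])")
        text = pattern.sub(long_form, text)
    return text


# Invented sample text, not a quotation from any article.
original = (
    "Oscillatory shear stress (OS) alters endothelial cells. "
    "OS also induces inflammation."
)
expanded = expand_abbreviations(original, {"OS": "oscillatory shear stress"})

hits_before = original.count("stress")
hits_after = expanded.count("stress")
```

In the original sample the word `stress` occurs only where the long form is spelled out. After expansion the second sentence contains it as well, so a query for stress can match there, and the interface would highlight the abbreviation at that position.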
-
Is GePI open source?
It is! Please find the complete source code and documentation at https://github.com/JULIELab/gepi.