About AllergenOnline
The Food Allergy Research and Resource Program (FARRP) AllergenOnline.org database was updated to version 22 on May 25, 2023. Version 22 contains a comprehensive list of 2290 protein (amino acid) sequence entries that are categorized into 952 taxonomic-protein groups of unique proven or putative allergens (food, airway, venom/salivary and contact) from 412 species. Note: The Gal g 6 NCBI entry is a multimeric egg yolk protein of 1912 amino acids, but only the C-terminal 284 amino acids are the allergen, based on PMID 20509661. Some of the allergenic wheat gliadins or glutenins may also cause celiac disease (see Celiac Database); however, they are listed on the allergen site if there is evidence of IgE binding.
The annual update process includes collecting new sequences designated as "allerg*" in reference files from the NCBI protein database (compiled from GenBank, RefSeq and TPA databases as well as protein sequences from SwissProt, PIR, PRF and PDB databases). In a few instances, sequences are taken directly from a peer-reviewed publication because they have not been entered into the NCBI or other available databases. Duplicate and inappropriate sequences are removed by a process described below. The sequences are categorized by taxonomic group (genus/species) and protein sequence identity (close homology). The new draft dataset is compared with sequences contained in the previous version of AllergenOnline.org and integrated into existing groups if appropriate, or classified into new groups. Peer-reviewed publications are identified from PubMed and other resources, then collected and reviewed for evidence of allergenicity of the source organism and the specific protein. Additional information is gathered from the Allergen Nomenclature Committee website of WHO/IUIS (World Health Organization/International Union of Immunological Societies) and occasionally Wikipedia. This information is reviewed for each group of sequences as described below to classify the entries as likely "allergenic" (either absolute proof, including challenge testing, or putative, with specific IgE binding using sera from individuals with allergies to the source organism), or as having "insufficient proof" of allergy due to a lack of convincing evidence of allergenicity. During the review process, an attempt is made to identify new publications demonstrating proof of allergy for groups of potential allergens that were designated as having insufficient proof of allergenicity in previous versions. A consensus decision by the whole peer review panel is normally reached for each group regarding the designation as an "allergen" or having "insufficient proof". However, in a few instances a majority decision is taken.
The criteria used to reach a decision to include or exclude each sequence or allergen group are described below.
Recent Open Access Publications
Defeating Late Blight disease of potato Open Access 480 483 March 2021
2021_ African and Asian agriculture
Genetic Engineering--future food security
Whole proteomes from Genomes vs AllergenOnline 2021
The protein sequence search strategy used for the current update is described in the publication describing
construction and curation of the AllergenOnline database (Goodman et al., 2016). Each year the NCBI Protein database
is searched using a keyword limit of "allerg*", with date limits from the last download (for version 20) to the
current download (15 August, 2022) for possible entry in version 22. This year a few thousand new sequences were considered, but that number was pared down to approximately 100 that appeared to represent potential allergens. PubMed and Web of Science were searched to remove irrelevant entries. The sequences are grouped into taxonomic protein groups using FASTA comparisons. In addition, new entries in the WHO/IUIS database that have published references were included in our review.
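As a rough illustration of the keyword and date-limited search described above, the sketch below builds a query URL for the public NCBI E-utilities ESearch service. It is not the script used by the curators. The function name `build_allergen_query`, the choice of `datetype=pdat` and the start date in the usage line are assumptions made for illustration; only the keyword "allerg*" and the end date of 15 August 2022 come from the text. No network request is made.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_allergen_query(min_date: str, max_date: str, retmax: int = 100) -> str:
    """Return an ESearch URL for protein records matching 'allerg*'.

    Dates use the YYYY/MM/DD format accepted by E-utilities.
    This helper is hypothetical and only assembles the URL.
    """
    params = {
        "db": "protein",      # NCBI Protein database
        "term": "allerg*",    # keyword limit named in the text
        "datetype": "pdat",   # assumption: restrict by publication date
        "mindate": min_date,
        "maxdate": max_date,
        "retmax": retmax,     # number of identifiers to return per request
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The start date is a placeholder; the date of the previous download is not given.
url = build_allergen_query("2021/01/01", "2022/08/15")
print(url)
```

The resulting identifier list would still need the manual steps described here: removal of irrelevant entries and grouping by FASTA comparison.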
The approximately 100 sequences and associated publications were reviewed by Dr. Goodman. If there were publications showing data on serum IgE binding or allergy, the committee of six reviewers was involved in entering the protein as either a putative allergen or a proven allergen based on scientific data. Otherwise the entry was set aside as not having sufficient proof of IgE binding or of biological activity (basophil activation or skin prick tests). Information is recorded from the initial entry through the reviewers' comments.
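The triage outcome described above, together with the three classes defined later on this page, can be summarized as a small decision rule. The sketch below is only a simplification: the `Evidence` record and `classify` function are hypothetical names, and the actual designation is a consensus or majority decision by the review panel, not an automated rule.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of the published evidence for one protein group."""
    # Specific IgE binding using sera from donors allergic to the source.
    specific_ige_binding: bool
    # Basophil activation, histamine release, skin test or challenge reactivity.
    biological_activity: bool


def classify(evidence: Evidence) -> str:
    """Map an evidence summary onto the three classes used by the database."""
    if evidence.specific_ige_binding and evidence.biological_activity:
        return "allergen"
    if evidence.specific_ige_binding:
        # IgE binding is shown but a component such as biological activity is missing.
        return "putative allergen"
    return "insufficient evidence"


print(classify(Evidence(specific_ige_binding=True, biological_activity=False)))
```

Only entries classified as "allergen" or "putative allergen" would appear in the sequence-searchable list.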
Compilation of a list of sequences for review by the entire expert panel includes screening to remove sequences that
are included only based on being "similar to" an allergen, or homologous. Many peptides are from a taxonomic organism
that is associated with allergy (e.g. Aspergillus sp., Alternaria sp.). In order to reduce the list to a manageable
size without excluding likely true allergens, we exclude sequences from genome model organisms (e.g. Drosophila
melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, etc.). However, proteins from allergenic species that
are also genetic models (rice, mice and corn) are included in the initial list if there is information suggesting
allergenicity, but are excluded if no published references related to allergenicity are found. Proteins that are
obviously merely associated with an allergic response (e.g. cytokines, chemokines, immunoglobulins and transcription
factors) are also excluded. Sequences are then screened and grouped based on sequence identity and taxonomic identity
to those already in the AllergenOnline.org database (e.g. allergens, putative allergens or sequences with
insufficient evidence to demonstrate allergy, see below) from previous versions. Relevant publications are collected
for the review panel, using references from the NCBI sequence entry as well as separate searches of the PubMed
database, based on keyword searches of the taxa, common name and sequence authors. The information for each allergen
group is triaged to gather more specific information and reviewed by the expert panel in a three stage process as
described below.
Goal: To update and curate the list of sequences included in the AllergenOnline.org (FARRP) database on an
annual basis to include only protein sequences that are supported by evidence demonstrating that the protein is a
proven allergen or that there is substantial proof of allergy to the source of the protein as well as immunoglobulin
E (IgE) binding to the specific protein using sera from individuals with allergies to the source. Nearly identical
sequences in the same taxonomic / protein group are included if it is clear there are variants of the protein that
might contribute to allergy.
Rationale: The AllergenOnline database is intended for use as a tool for evaluating the safety of proteins
included in foods through processing or genetic modification. The Codex Alimentarius Guidelines (2003 and 2009) established a process for evaluating potential allergenicity based on evidence that the protein is likely to cause allergic reactions in consumers. A key component in the evaluation process is comparison of candidate products
(proteins) with those of known allergens using a bioinformatics approach such as FASTA or BLASTP local alignment tools to identify proteins that would require further testing by serum IgE binding and/or clinical testing to evaluate safety. It is therefore important to have scientific evidence that the database entries are allergens or
probable (putative) allergens in order to maximize the reliability of bioinformatics searches.
Peer Review Process: In 2005, FARRP brought together a panel of seven food allergy experts to define
criteria for inclusion in future versions of the database. A protocol was developed for including sequences for
consideration, for classifying sequences into groups (allergen, putative allergen, insufficient evidence to classify as an allergen or putative allergen), collecting publications for review, providing information to reviewers and finally for voting to accept sequences as allergens or putative allergens. In general we have included sequences in
the taxonomic allergen group that are at least 67% identical to the protein that is the subject of peer-reviewed
published study supporting IgE binding to the protein, using sera from clinically defined subjects allergic to the
source. The identity limit was initially suggested by the IUIS Allergen Nomenclature Subcommittee as a limit for
defining isoallergen groups. Information regarding the individual proteins should demonstrate the protein is actually
expressed in the source material that causes allergic reactions.
Criteria for three classes of assignment were agreed to: Allergen is a protein that has been demonstrated
to specifically bind IgE using sera from individuals with clear allergies to the source of the gene/protein and
further that the protein causes basophil activation or histamine release, skin test reactivity or challenge test
reactivity using subjects allergic to the source. Putative allergen is a protein that has met most of the
criteria of an allergen, but has a missing component, usually biological activity (basophil activation or in vivo
reactivity), a less well-defined clinical population or a lack of data demonstrating that the specific protein was used in
reported testing. Both Allergens and Putative Allergens are retained in the list of sequence searchable protein
entries in AllergenOnline.org. The third category, those with Insufficient Evidence of Allergenicity
(Unproven), are not included in the sequence searchable protein list because they were judged to be lacking
critical evidence of specific IgE binding, the serum donors were not demonstrated to be allergic to the source and
there was no allergic biological activity demonstrated for the protein. The proteins categorized as "Insufficient
evidence" are maintained in a list for future annual reviews as new candidate "allergens" are identified from NCBI
and the published literature. If new evidence supports reclassification in the opinion of the reviewers, they would
be included in future versions of the database. In rare instances after 2007 individual sequence entries in the
database that were previously included in the searchable allergen list have been removed after more detailed reviews
have failed to identify published evidence that the protein is expressed in allergenic material, or have shown that the original
review misinterpreted the data in the available publications.
The amount and quality of published objective data supporting the classification of various proteins as allergens
varies remarkably. For many food, airway or contact allergens there is unquestionable objective data of the identity, characterization and purity of the protein and clear evidence that human subjects with relevant allergic histories and symptoms were tested to demonstrate reactions upon challenge, or at least clear evidence of specific IgE binding.
However, there are also a number of proteins labeled as allergens in the literature or in the NCBI sequence database
(or in UniProt) for which there is not sufficient objective data characterizing the protein used in testing, or data
to demonstrate human reactivity or specific IgE binding. Our peer review process is designed to review the collective
literature for individual proteins and classify the individual allergen groups based on our stated criteria.
The review process includes triage and an initial evaluation summary by Dr. Goodman at FARRP. Often additional
references are identified and added for further review. Then each sequence group is assigned to two other reviewers
from the expert panel. The detailed review comments from all three reviewers are compiled and presented to the entire
group of seven experts for a final round of reviews. Comments and votes are recorded in the database files as an
archive file. Later changes in status and reasons for changes are also included in the archive. A list of relevant
references that were included in the review process is included in the public view of each version of the database.
At the end of the review period a search is made of the WHO/IUIS Allergen Nomenclature website (www.allergen.org) and
new entries that are not in AllergenOnline are reviewed for published evidence and these are added to the review
list. The WHO/IUIS entries are often identified prior to publication. Therefore the entries are reviewed again each
year.
Before release of a new version to the public, the sequences, GI numbers (now accession numbers, since NCBI has stopped issuing GI
numbers), taxonomy of the source and reference lists are compiled and checked. The public website shows relevant information for each sequence.
Baumert, Joe, PhD, FARRP, University of Nebraska, USA
Joerg Kleine-Tebbe, MD, Allergy & Asthma Center Westend, Berlin, Germany (2018-2019)
Financial support for this database has been provided by grants from corporate subscribers and by FARRP (the Food Allergy Research and Resource Program, Department of Food Science & Technology at the University of Nebraska-Lincoln) and faculty. The majority of the scientific information for the Allergen database and now for the Celiac database is collected and evaluated by Rick Goodman, then verified by the review team. The database is updated annually. The database construction was performed by John Wise, who now consults to help maintain it.
This website includes a sequence comparison routine, FASTA (Pearson and Lipman, 1988)
which may be used to compare a protein sequence (the query sequence) to entries in the allergen database.
This version of the FASTA search interface utilizes the FASTA3 (Pearson, 2000) algorithm.
The purpose of the comparison routine is to evaluate whether the query protein sequence
is identical to, or homologous with, known or putative allergens in the database. Alignments with high identity scores may indicate a potential for allergenic cross-reactions.
However, there is not sufficient scientific data to establish a simple scoring boundary
(E-score or percent identity), beyond which cross-reactivity is certain, or below which cross-reactivity is not
possible. Based on historical data, cross-reactivity is not likely for proteins with less
than 50% identity over the entire protein sequence, and is fairly common above 70% identity (Aalberse, 2000).
Through experience we find that sequences of two proteins having published evidence of
cross-reactivity will align in AllergenOnline.org with a relatively high percent identity (>50% over nearly
full-length) and have an E score (statistical expectation score) smaller than 1e-7 (0.0000001). Thus if a query
protein matches a sequence in AllergenOnline.org with higher identity and smaller E scores, the protein should be
considered as likely to be cross-reactive in the absence of extensive testing (IgE binding and possibly clinical
challenges). Proteins sharing lower identity matches by FASTA alignment and having higher E scores are not likely to
share IgE binding. Experimental studies would be needed to confirm that proteins sharing identities lower than 50%
and having E scores larger than 1e-4 share IgE binding and clinical reactivity. Evaluation of literature regarding
the matched allergen would help to identify appropriately allergic study subjects.
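The rules of thumb above for full-length alignments can be expressed as a small helper. This is a sketch under stated assumptions, not part of the AllergenOnline software: the function name and the three labels are invented for illustration, and the cutoffs (50% identity, E of 1e-7 and 1e-4) are taken from the text. As noted above, no scoring boundary is definitive, so the result is only a prompt for further evaluation.

```python
def interpret_full_length(identity_pct: float, e_value: float) -> str:
    """Rough reading of a full-length FASTA hit against a known allergen.

    identity_pct: percent identity over nearly the full length of the protein.
    e_value: statistical expectation score reported by FASTA.
    """
    if identity_pct > 50.0 and e_value < 1e-7:
        # Typical of protein pairs with published evidence of cross-reactivity.
        return "likely cross-reactive"
    if identity_pct < 50.0 and e_value > 1e-4:
        # Shared IgE binding is not expected; experiments would be needed to show it.
        return "unlikely to share IgE binding"
    # Everything in between needs case-by-case evaluation.
    return "indeterminate"


print(interpret_full_length(72.0, 1e-30))
print(interpret_full_length(30.0, 0.5))
```

Hits in the middle category are the ones where the literature on the matched allergen and serum testing are most informative.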
In addition to the full-length FASTA search, we have added an option to automatically
scan each possible 80 amino acid segment (1-80, 2-81, 3-82, etc.) of the entered search protein against the
AllergenOnline database, looking for matches of at least 35% identity. The 35% identity
for 80 amino acid segments was suggested in a scientific advisory to regulators for evaluating proteins in
genetically modified crops (see FAO/WHO 2001, and Codex 2003). This short segment
matching routine evaluating segments of 80 amino acids appears to be quite conservative, and precautionary as
discussed in Goodman et al. (2005) and Goodman and Hefle (2005). However, the 80 amino
acid segment search appears to be far more likely to be informative than a search for shorter identical segments of 6
or 8 contiguous amino acids as originally recommended by Metcalfe et al. (1996) or the FAO/WHO 2001 approach, based
on evaluations by Hileman et al. (2002) and Silvanovich et al. (2006). See also the
summary report from the bioinformatics workshop on evaluating potential allergenicity (Goodman, 2006).
In the past, AllergenOnline.org has employed an E()-value (E-score) threshold of 100 as a statistical cutoff limit in the 80-mer
search in identifying alignments with >35% identity matches that should be evaluated further. However, we have
determined that the very large E score allows alignments with multiple gaps and leads to alignments in some cases
that do not make sense when compared to full-length alignments. Reexamination of publications by Pearson in 2004 and
earlier publications clearly supports the use of the default E = 10 as a limit for FASTA; in exceptional cases with
specialized, small databases or sequences, the limit could be set lower (e.g. E = 0.01). We have therefore modified
the search parameters to evaluate only alignments with E scores of 10 or less, as of the release of AllergenOnline.org
version 15 (12 January, 2015). It is important to keep in mind that the default E()-value is simply a starting
threshold used to allow alignments to be observed and then investigated using 35% identity and 80 amino acid overlap
as the criteria. In cases when the alignment identified matches of >35% identity in the sliding 80mer search,
additional bioinformatics comparisons may be useful to evaluate likely biological significance, or specific serum
testing may prove useful if appropriate specifically allergic serum donors can be identified to evaluate the
potential cross-reactivity suggested by the match.
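The sliding 80-mer procedure described above can be sketched in two parts: generating every 80 amino acid segment of the query, and filtering the resulting alignments. The code below does not run FASTA; the `Hit` record stands in for one alignment reported by a FASTA search, and all names are hypothetical. The thresholds (35% identity, 80 amino acid overlap, E of 10 or less) come from the text; treating the identity cutoff as inclusive and requiring the full 80-residue overlap are assumptions.

```python
from typing import Iterator, NamedTuple

WINDOW = 80          # segment length in amino acids
MIN_IDENTITY = 35.0  # percent identity threshold
MAX_E_VALUE = 10.0   # default FASTA E()-value limit adopted in version 15


def sliding_windows(seq: str, size: int = WINDOW) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end, segment) using 1-based inclusive positions: 1-80, 2-81, ..."""
    for i in range(len(seq) - size + 1):
        yield i + 1, i + size, seq[i:i + size]


class Hit(NamedTuple):
    """Hypothetical summary of one alignment between a segment and an allergen."""
    allergen_id: str
    identity_pct: float
    overlap: int
    e_value: float


def needs_follow_up(hit: Hit) -> bool:
    """True when an alignment meets the identity, overlap and E-value criteria."""
    return (
        hit.e_value <= MAX_E_VALUE
        and hit.overlap >= WINDOW
        and hit.identity_pct >= MIN_IDENTITY
    )


# An 82-residue toy sequence yields three segments: 1-80, 2-81 and 3-82.
toy = "A" * 82
print([(start, end) for start, end, _ in sliding_windows(toy)])
```

A query shorter than 80 residues yields no segments in this sketch, so such sequences would have to be handled by the full-length search.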
Although the CODEX mentions using short segment (6 or 8 amino acid) sequence matches, it also indicates that
searches must be based on scientific proof. As we have searched for, and been unable to find examples where an
isolated identity match of 6 or 8 amino acids was found between cross-reactive proteins unless there was at least a
35% identity match over 80 amino acids, we previously did not include that search routine on our database (Goodman et
al. 2008). However, since some countries still require an eight amino acid identity search, despite the lack of
evidence demonstrating a positive predictive value, we now provide that as an option.
Additional or alternative bioinformatics tools and databases may also be
useful for the evaluation of potential allergens (see also LINKS):
Removal of "False" Entries
Peer Review Process for Categorizing Sequence Groups: Proof of Allergenicity
Peer Review Panel
Bohle, Barbara, PhD, Division of Immunopathology, Medical University of Vienna, Austria
Ebisawa, Motohiro, MD, Pediatric Allergy, National Sagamihara Hospital, Japan
Fatima Ferreira, PhD, University of Salzburg, Austria
Goodman, Rick, PhD, FARRP, University of Nebraska, USA
Taylor, Steve, PhD, FARRP, University of Nebraska, USA
van Ree, Ronald, PhD, University of Amsterdam, The Netherlands
Former Members:
Sampson, Hugh, MD, Pediatric Allergy, Mount Sinai Medical Center, New York, USA (2005-2015)
Vieths, Stefan, PhD, Paul-Ehrlich-Institut, Germany (2005-2012)
Hefle, Sue, PhD, FARRP, University of Nebraska, USA (2005-2006)
Financial Support
Sponsors (2004 - 2015):
Subscribers (2016 - 2017):
Subscribers 2018 - 2021:
Allergen Database Search Routines
Full-length FASTA
However, in the future, higher identity matches (greater than 35% identity) and smaller E-scores are likely to be adopted as criteria, since the reference "Whole proteomes from Genomes vs AllergenOnline" by Abdelmoteleb et al. (2021) shows that evolutionary homologues make the Codex allergenicity guideline of 2003 obsolete.
Sliding 80mer FASTA
8mer Identity Match
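The eight amino acid identity search offered as an option amounts to looking for any stretch of eight contiguous, identical residues shared by the query and an allergen. A minimal sketch of that idea follows. It is not the site's implementation; the function name `shared_kmers` and the two short sequences are made up for illustration.

```python
def shared_kmers(query: str, allergen: str, k: int = 8) -> set[str]:
    """Return every k-residue segment that occurs exactly in both sequences."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return {
        query[i:i + k]
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    }


# Invented sequences that share the single 8-mer TAYIAKQR.
query_seq = "MKTAYIAKQRQISFVK"
allergen_seq = "GGTAYIAKQRPP"
print(shared_kmers(query_seq, allergen_seq))
```

As discussed above, an isolated 8-mer match has not been shown to predict cross-reactivity, so a match found this way is best read alongside the full-length and 80-mer results.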
References
AllergenOnline database:
For bioinformatic analysis:
For protein sequence (structure) and allergenicity: