mgfp is a flex/bison-based C++ MGF parser library.
It includes the library code as well as the following set of MGF processing tools:
The Steen & Steen Lab provides the library under the terms of a BSD license for use in academic and non-academic environments.
If you make use of mgfp in your own projects, please cite the following article:
If you use ms2preproc in your data analysis pipeline, please cite
Binary packages for Microsoft Windows, Linux (64 bit, built on Ubuntu 10.4) and MacOS X (Snow Leopard) are available for download from
Building mgfp from source is straightforward. However, it requires a working CMake build system (available from http://cmake.org/) and CMake >= 2.6.
With cmake available, the build process is
tar xvzf mgfp-xxxxxxx.tar.gz mkdir mgfp-build cd mgfp-build cd build cmake ../mgfp-xxxxxxx make make test make install
Optionally, if you want to build your own precompiled packages, you can add
make package
To use the parser, one must first create a parser driver instance:
mgf::MgfFile mgfFile; mgf::Driver driver(mgfFile);
Then set the verbosity flage (defaulting to off/false)
driver.trace_parsing = true; driver.trace_scanning = true;
and parse the input. The input is a stream (like std::cin in the example here or std::fstream).
bool result = driver.parse_stream(std::cin);
One should always check if the parsing was successful and only continue if so.
if (!result) { std::cerr << std::endl << "Error parsing data stream (use -v for details)." << std::endl; return -1; }
If the parsing was successful, the contents of the MGF file are available in terms of an MgfFile object; it is possible to iterate over the MS/MS spectra and to read/modify/otherwise process the contents. The example here attempts to extract TMT reporter ion intensities from centroid mode MS/MS spectra:
typedef mgf::MgfFile::iterator MFI; for (MFI i = mgfFile.begin(); i != mgfFile.end(); ++i) { // sort the MS/MS spectrum by m/z std::sort(i->begin(), i->end(), mgf::LessThanMz()); typedef mgf::MgfSpectrum::iterator MSI; // extract TMT reporter ion intensities std::tr1::array<double, 6> obsTmtAbundances; for (size_t n = 0; n < 6; ++n) { MSI closestIt = findClosestMz(i->begin(), i->end(), tmtMasses[n]); // check if the closest centroid is close enough if (std::abs(closestIt->first - tmtMasses[n]) < 0.5) { obsTmtAbundances[n] = closestIt->second; } else { obsTmtAbundances[n] = 0.0; } tmts.push_back(obsTmtAbundances); } }
Coding examples are in the applications/ subdirectory
The following is the current MGF Grammar, extracted from Parser.ypp.
0 $accept: start "end of file"
1 ion: "double" "double" "end of line"
2 | "integer" "double" "end of line"
3 | "double" "integer" "end of line"
4 | "integer" "integer" "end of line"
5 ions: ions ion
6 | ion
7 charge: "integer" '+'
8 | "integer" '-'
9 charges: '(' charges ')'
10 | charges ',' charge
11 | charges "and keyword" charge
12 | charge
13 csintegerlist: csintegerlist ',' "integer"
14 | "integer"
15 blocks: /* empty */
16 | blocks block
17 block: "begin_ions keyword" "end of line" localparams ions "end_ions keyword" "end of line"
18 | "begin_ions keyword" "end of line" localparams "end_ions keyword" "end of line"
19 globalparams: /* empty */
20 | globalparams globalparam
21 globalparam: "enzyme keyword" '=' "string" "end of line"
22 | "search title keyword" '=' "string" "end of line"
23 | "database keyword" '=' "string" "end of line"
24 | "MS/MS datafile format keyword" '=' "string" "end of line"
25 | "MS/MS ion series keyword" '=' "string" "end of line"
26 | "variable modifications keyword" '=' "string" "end of line"
27 | "units for ITOL keyword" '=' "string" "end of line"
28 | "mass type (mono or avg) keyword" '=' "string" "end of line"
29 | "fixed modifications keyword" '=' "string" "end of line"
30 | "quantitation method keyword" '=' "string" "end of line"
31 | "maximum hits keyword" '=' "string" "end of line"
32 | "type of report keyword" '=' "string" "end of line"
33 | "type of search keyword" '=' "string" "end of line"
34 | "taxonomy keyword" '=' "string" "end of line"
35 | "tolerance units keyword" '=' "string" "end of line"
36 | "user keyword" '=' "string" "end of line"
37 | "user email keyword" '=' "string" "end of line"
38 | "username keyword" '=' "string" "end of line"
39 | "perform decoy search keyword" '=' "integer" "end of line"
40 | "error tolerance keyword" '=' "integer" "end of line"
41 | "partials keyword" '=' "integer" "end of line"
42 | "fragment ion tolerance keyword" '=' "double" "end of line"
43 | "fragment ion tolerance keyword" '=' "integer" "end of line"
44 | "misassigned 13C keyword" '=' "double" "end of line"
45 | "precursor m/z keyword" '=' "double" "end of line"
46 | "precursor m/z keyword" '=' "integer" "end of line"
47 | "protein mass (kDa) keyword" '=' "double" "end of line"
48 | "protein mass (kDa) keyword" '=' "integer" "end of line"
49 | "peptide mass tolerance keyword" '=' "double" "end of line"
50 | "peptide mass tolerance keyword" '=' "integer" "end of line"
51 | "charge set keyword" '=' charges "end of line"
52 | "NA translation keyword" '=' csintegerlist "end of line"
53 | "comment" "end of line"
54 localparams: /* empty */
55 | localparams localparam
56 localparam: "title keyword and full title string" "end of line"
57 | "amino acid composition keyword" '=' "string" "end of line"
58 | "MS/MS ion series keyword" '=' "string" "end of line"
59 | "variable modifications keyword" '=' "string" "end of line"
60 | "retention time or range keyword" '=' "double" "end of line"
61 | "retention time or range keyword" '=' "integer" "end of line"
62 | "retention time or range keyword" '=' "double" '-' "double" "end of line"
63 | "retention time or range keyword" '=' "double" '-' "integer" "end of line"
64 | "retention time or range keyword" '=' "integer" '-' "double" "end of line"
65 | "retention time or range keyword" '=' "integer" '-' "integer" "end of line"
66 | "scan number of range keyword" '=' "integer" "end of line"
67 | "scan number of range keyword" '=' "integer" '-' "integer" "end of line"
68 | "tolerance units keyword" '=' "string" "end of line"
69 | "amino acid sequence keyword" '=' "string" "end of line"
70 | "sequence tag keyword" '=' "string" "end of line"
71 | "error tolerant sequence keyword" '=' "string" "end of line"
72 | "peptide mass tolerance keyword" '=' "double" "end of line"
73 | "peptide mass tolerance keyword" '=' "integer" "end of line"
74 | "charge set keyword" '=' charges "end of line"
75 | "precursor mass keyword" '=' "double" "end of line"
76 | "precursor mass keyword" '=' "double" "double" "end of line"
77 | "precursor mass keyword" '=' "double" "integer" "end of line"
78 | "precursor mass keyword" '=' "integer" "end of line"
79 | "precursor mass keyword" '=' "integer" "double" "end of line"
80 | "precursor mass keyword" '=' "integer" "integer" "end of line"
81 | "comment" "end of line"
82 contents: globalparams blocks "end of file"
83 start: contents
1.6.1