mgfp: Mascot Generic Format (MGF) Parser

Introduction

mgfp is a flex/bison-based C++ MGF parser library.
It includes the library code as well as the following set of MGF processing tools:

The Steen & Steen Lab provides the library under the terms of a BSD license for use in academic and non-academic environments.

Citation

If you make use of mgfp in your own projects, please cite the following article:

If you use ms2preproc in your data analysis pipeline, please cite

Installation

Obtaining the Software

Binary packages for Microsoft Windows, Linux (64 bit, built on Ubuntu 10.4) and MacOS X (Snow Leopard) are available for download from

Building from Source

Building mgfp from source is straightforward. However, it requires a working CMake build system (available from http://cmake.org/) and CMake >= 2.6.

With cmake available, the build process is

 tar xvzf mgfp-xxxxxxx.tar.gz
 mkdir mgfp-build
 cd mgfp-build
 cd build
 cmake ../mgfp-xxxxxxx
 make
 make test
 make install

Optionally, if you want to build your own precompiled packages, you can add

  make package

Usage

To use the parser, one must first create a parser driver instance:

    mgf::MgfFile mgfFile;
    mgf::Driver driver(mgfFile);

Then set the verbosity flage (defaulting to off/false)

    driver.trace_parsing = true;
    driver.trace_scanning = true;

and parse the input. The input is a stream (like std::cin in the example here or std::fstream).

    bool result = driver.parse_stream(std::cin);

One should always check if the parsing was successful and only continue if so.

    if (!result) {
        std::cerr << std::endl
          << "Error parsing data stream (use -v for details)." << std::endl;
        return -1;
    }

If the parsing was successful, the contents of the MGF file are available in terms of an MgfFile object; it is possible to iterate over the MS/MS spectra and to read/modify/otherwise process the contents. The example here attempts to extract TMT reporter ion intensities from centroid mode MS/MS spectra:

    typedef mgf::MgfFile::iterator MFI;
    for (MFI i = mgfFile.begin(); i != mgfFile.end(); ++i) {
        // sort the MS/MS spectrum by m/z
        std::sort(i->begin(), i->end(), mgf::LessThanMz());
        typedef mgf::MgfSpectrum::iterator MSI;
        // extract TMT reporter ion intensities
        std::tr1::array<double, 6> obsTmtAbundances;
        for (size_t n = 0; n < 6; ++n) {
            MSI closestIt = findClosestMz(i->begin(), i->end(), tmtMasses[n]);
            // check if the closest centroid is close enough
            if (std::abs(closestIt->first - tmtMasses[n]) < 0.5) {
                obsTmtAbundances[n] = closestIt->second;
            } else {
                obsTmtAbundances[n] = 0.0;
            }
            tmts.push_back(obsTmtAbundances);
        }
    }

Examples

Coding examples are in the applications/ subdirectory

Appendix

Known Issues

The MGF Grammar

The following is the current MGF Grammar, extracted from Parser.ypp.

    0 $accept: start "end of file"

    1 ion: "double" "double" "end of line"
    2    | "integer" "double" "end of line"
    3    | "double" "integer" "end of line"
    4    | "integer" "integer" "end of line"

    5 ions: ions ion
    6     | ion

    7 charge: "integer" '+'
    8       | "integer" '-'

    9 charges: '(' charges ')'
   10        | charges ',' charge
   11        | charges "and keyword" charge
   12        | charge

   13 csintegerlist: csintegerlist ',' "integer"
   14              | "integer"

   15 blocks: /* empty */
   16       | blocks block

   17 block: "begin_ions keyword" "end of line" localparams ions "end_ions keyword" "end of line"
   18      | "begin_ions keyword" "end of line" localparams "end_ions keyword" "end of line"

   19 globalparams: /* empty */
   20             | globalparams globalparam

   21 globalparam: "enzyme keyword" '=' "string" "end of line"
   22            | "search title keyword" '=' "string" "end of line"
   23            | "database keyword" '=' "string" "end of line"
   24            | "MS/MS datafile format keyword" '=' "string" "end of line"
   25            | "MS/MS ion series keyword" '=' "string" "end of line"
   26            | "variable modifications keyword" '=' "string" "end of line"
   27            | "units for ITOL keyword" '=' "string" "end of line"
   28            | "mass type (mono or avg) keyword" '=' "string" "end of line"
   29            | "fixed modifications keyword" '=' "string" "end of line"
   30            | "quantitation method keyword" '=' "string" "end of line"
   31            | "maximum hits keyword" '=' "string" "end of line"
   32            | "type of report keyword" '=' "string" "end of line"
   33            | "type of search keyword" '=' "string" "end of line"
   34            | "taxonomy keyword" '=' "string" "end of line"
   35            | "tolerance units keyword" '=' "string" "end of line"
   36            | "user keyword" '=' "string" "end of line"
   37            | "user email keyword" '=' "string" "end of line"
   38            | "username keyword" '=' "string" "end of line"
   39            | "perform decoy search keyword" '=' "integer" "end of line"
   40            | "error tolerance keyword" '=' "integer" "end of line"
   41            | "partials keyword" '=' "integer" "end of line"
   42            | "fragment ion tolerance keyword" '=' "double" "end of line"
   43            | "fragment ion tolerance keyword" '=' "integer" "end of line"
   44            | "misassigned 13C keyword" '=' "double" "end of line"
   45            | "precursor m/z keyword" '=' "double" "end of line"
   46            | "precursor m/z keyword" '=' "integer" "end of line"
   47            | "protein mass (kDa) keyword" '=' "double" "end of line"
   48            | "protein mass (kDa) keyword" '=' "integer" "end of line"
   49            | "peptide mass tolerance keyword" '=' "double" "end of line"
   50            | "peptide mass tolerance keyword" '=' "integer" "end of line"
   51            | "charge set keyword" '=' charges "end of line"
   52            | "NA translation keyword" '=' csintegerlist "end of line"
   53            | "comment" "end of line"

   54 localparams: /* empty */
   55            | localparams localparam

   56 localparam: "title keyword and full title string" "end of line"
   57           | "amino acid composition keyword" '=' "string" "end of line"
   58           | "MS/MS ion series keyword" '=' "string" "end of line"
   59           | "variable modifications keyword" '=' "string" "end of line"
   60           | "retention time or range keyword" '=' "double" "end of line"
   61           | "retention time or range keyword" '=' "integer" "end of line"
   62           | "retention time or range keyword" '=' "double" '-' "double" "end of line"
   63           | "retention time or range keyword" '=' "double" '-' "integer" "end of line"
   64           | "retention time or range keyword" '=' "integer" '-' "double" "end of line"
   65           | "retention time or range keyword" '=' "integer" '-' "integer" "end of line"
   66           | "scan number of range keyword" '=' "integer" "end of line"
   67           | "scan number of range keyword" '=' "integer" '-' "integer" "end of line"
   68           | "tolerance units keyword" '=' "string" "end of line"
   69           | "amino acid sequence keyword" '=' "string" "end of line"
   70           | "sequence tag keyword" '=' "string" "end of line"
   71           | "error tolerant sequence keyword" '=' "string" "end of line"
   72           | "peptide mass tolerance keyword" '=' "double" "end of line"
   73           | "peptide mass tolerance keyword" '=' "integer" "end of line"
   74           | "charge set keyword" '=' charges "end of line"
   75           | "precursor mass keyword" '=' "double" "end of line"
   76           | "precursor mass keyword" '=' "double" "double" "end of line"
   77           | "precursor mass keyword" '=' "double" "integer" "end of line"
   78           | "precursor mass keyword" '=' "integer" "end of line"
   79           | "precursor mass keyword" '=' "integer" "double" "end of line"
   80           | "precursor mass keyword" '=' "integer" "integer" "end of line"
   81           | "comment" "end of line"

   82 contents: globalparams blocks "end of file"

   83 start: contents

 All Classes Namespaces Functions Variables Typedefs Friends

Generated on Sun Oct 10 10:20:34 2010 for mgfp by  doxygen 1.6.1