Lutefisk Operators Manual

LutefiskXP v1.0.4 Operators Manual

Documentation updated Aug. 26, 2005

Copyright © 2005 Rich Johnson
All rights reserved worldwide

Lutefisk Homepage: lutefiskxp.sourceforge.net

Table of Contents

Overview
Lutefisk files
Compilation notes
Contact information
Version history

Overview

Lutefisk is a program for the de novo interpretation of peptide CID spectra. Although its interface is rudimentary, it can be compiled for virtually any operating system that has a C compiler. Source code and instructions are provided for MacOS, Win32, OSF, Solaris, Irix, and Linux.

Lutefisk can be used in conjunction with homology-based database search programs (e.g., OpenSea, MS-BLAST, or CIDentify) as a supplement to standard MS/MS database search programs. We use it to find modified peptides or sequence variants, and to help validate the results obtained from database search programs (e.g., Mascot, Sequest). We also find it useful for flagging high-quality data that the search programs fail to identify, which can happen for a variety of reasons (for example, searching a human protein sequence database with an MS/MS spectrum of a mycobacterial-derived peptide).

Lutefisk Files

To run Lutefisk, you need to have four files within the same directory or folder:

CID data file (data files can be specified with a full or partial pathname)
Lutefisk.details
Lutefisk.params
Lutefisk.residues

One additional file is optional:

Database.sequence

Once these files (containing the appropriate information as described below) and the Lutefisk application are gathered together in one folder, you start the application. Once execution begins, Lutefisk proceeds with minimal user intervention.

On a Mac, a single dialog box appears where you can specify a variety of command-line arguments (type -h to see help); in most cases, you will click the "Ok" button and use all of the default values. On Windows, there is no such dialog box; however, command-line arguments can be supplied by starting Lutefisk from the Command Prompt program included with the Windows operating system.

A simulated teletype interface appears and indicates the various stages of processing that have been achieved. When it is finished, the teletype interface provides an initial list of sequences ranked in order of the "intensity score", followed by a more refined and shorter list of sequences. This short list of sequences is also placed in a file with the default name identical to the CID data file name with ".lut" appended. The header to this file contains the information found in the parameter file "Lutefisk.params".

Note regarding use of Lutefisk with CIDentify: The output file can be read directly by the modified FASTA program called CIDentify. If you don't like Lutefisk, you can still use CIDentify without using Lutefisk. This can easily be done by editing a Lutefisk output file so that it contains your own sequences (determined by hand or another sequencing program). Alternatively, you can obtain the CIDentify source code and modify the data input format to suit yourself. It is also worth pointing out that the Lutefisk output file can be edited in order to eliminate any sequences that you somehow know are incorrect.

CID data files

Lutefisk can read CID data files in four different formats:

ASCII files created by the Finnigan TSQ program called "List". These are created by starting the "List" program within "ICIS Executive" and opening the data file of interest within "List". Under the "File" menu, go to "Print...". A dialog box appears in which you select "ASCII" as the saved format and, under "Text Displays", select "Multiple Pages". Provide a file name and select the "Save to File" button (don't select "Print" or else you will have reams of scratch paper).
ASCII files created by the Finnigan LCQ File Converter program. From the destination box select "text" as the format, select the LCQ .raw files to convert, click on the arrow button, and then click on the convert button.
Tab-delimited ASCII text files. The first column contains m/z values, followed by a tab, and the second column contains unitless relative intensities (an example is shown on the web site).

For any of these first three formats, the data can contain profile data with multiple data points per mass unit, or it can be centroided (or peak top) m/z values.
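As an illustration of the tab-delimited layout described above, here is a minimal Python sketch of a reader. It is not part of Lutefisk; the function name read_tab_spectrum and the sample values are made up for this example, and it assumes exactly the two-column "m/z, tab, intensity" layout with no header line.

```python
import io


def read_tab_spectrum(lines):
    """Parse 'm/z<TAB>intensity' lines into a list of (mz, intensity) tuples."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        peaks.append((float(fields[0]), float(fields[1])))
    return peaks


# Made-up values, for illustration only.
example = io.StringIO("175.1\t12.0\n262.2\t48.5\n\n375.2\t100.0\n")
peaks = read_tab_spectrum(example)
print(peaks)
```

The same function accepts an open file object, so a file in this layout can be checked before it is handed to Lutefisk.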

In addition, Lutefisk can read the Sequest ".dta" files; for ".dta" files "C" should be selected for the parameter "Profile or Centroid" found in the Lutefisk.params file (see below). This is our favorite data file format, since our Micromass Qtof and Thermofinnigan ion traps all have software that converts raw LC/MS/MS data to dta files.
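For readers unfamiliar with the ".dta" layout, the following Python sketch shows how the precursor information can be recovered from such a file. It assumes the usual Sequest convention, in which the first line holds the singly protonated precursor mass (M+H)+ followed by the charge state, and each later line holds an m/z value and an intensity. The function name parse_dta and the sample numbers are hypothetical, and subtracting one proton mass to obtain a neutral peptide mass is an assumption made for illustration, not a description of Lutefisk's internal code.

```python
PROTON_MASS = 1.007276  # mass of a proton, in u


def parse_dta(text):
    """Split .dta text into precursor information and a peak list."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    mh_plus = float(rows[0][0])   # first line: (M+H)+ mass ...
    charge = int(rows[0][1])      # ... and precursor charge state
    peaks = [(float(r[0]), float(r[1])) for r in rows[1:]]
    return {
        "peptide_mw": mh_plus - PROTON_MASS,  # assumed neutral-mass conversion
        "charge": charge,
        "peaks": peaks,
    }


# Made-up spectrum, for illustration only.
sample = "1045.5642 2\n175.119 120.0\n262.151 340.5\n"
info = parse_dta(sample)
print(info["charge"], len(info["peaks"]))
```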

Lutefisk.details

The Lutefisk.details file contains the so-called "ion probabilities" for each type of ion. Here is an example. Each column in the file contains the "ion probabilities" for a different fragmentation pattern (see the description of "fragmentation patterns" below). Currently only two fragmentation patterns have been coded: low-energy CID of tryptic peptides on triple quadrupole (or Qtof) instruments, and on ion traps. These ion probabilities are listed in the second and third columns. The first column is not used (oddly enough).

Lutefisk.residues

The Lutefisk.residues file contains the single-letter code, monoisotopic mass, average mass, and nominal mass for each amino acid. The default Lutefisk.residues file is shown here. To add an additional residue to the list, replace the 0's in one of the rows with the corresponding monoisotopic, average, and nominal masses. Up to five additional non-traditional residues can be entered here, and will be given the single-letter code J, O, U, X, or Z. Also, if you don't like my masses for the usual amino acids, you should feel free to change them here.
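To show how an extra residue code enters a mass calculation, here is a small Python sketch. It does not read or reproduce the Lutefisk.residues file format; the dictionary below is a hypothetical stand-in holding a few standard monoisotopic residue masses, with the spare code J assigned (as an example only) to carbamidomethylated cysteine. It assumes the masses are residue masses, so one water is added to obtain the neutral peptide mass.

```python
WATER_MONO = 18.01056  # monoisotopic mass of H2O, in u

# A few standard monoisotopic residue masses (amino acid minus water).
MONO_RESIDUE = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "K": 128.09496,
    # Hypothetical user-defined residue: carbamidomethyl-Cys stored under "J".
    "J": 160.03065,
}


def peptide_mono_mass(sequence, table=MONO_RESIDUE):
    """Neutral monoisotopic peptide mass: sum of residue masses plus water."""
    return sum(table[residue] for residue in sequence) + WATER_MONO


mass = peptide_mono_mass("GAJK")
print(round(mass, 4))
```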

Database.sequence

The Database.sequence file is a text file containing a sequence or a list of sequences that might have been derived from a sequence database search. An example of such a file is shown here. Although this seems like one is giving Lutefisk the answers up front, in fact, Lutefisk will do its usual de novo sequencing regardless of the Database.sequence list. In the final steps, where it determines scores for the candidate sequences, Lutefisk tosses in these database-derived sequences along with the de novo sequence candidates to determine if the database sequences are as good as or better than the de novo sequences. If so, then this constitutes evidence that the database derived sequences might actually be correct.

Lutefisk.params

The Lutefisk.params file is where most of the user-selected variables are altered. Once appropriate parameters have been chosen for a given set of data, one usually needs to change only three parameters -- "CID Filename", "Peptide MW", and "Charge-state" (of the precursor ion). If the CID file is in the "dta" format, the latter two parameters can be read automatically from the file header, and invoking the program as 'lutefisk <dta_file>' will override all three parameters with those in the specified data file(s).

Certain parameters need to be changed to accommodate data obtained from different instruments. The mass tolerances should match the anticipated errors for each type of instrument. The tolerance parameter "Final Fragment Err" should be set to zero unless the data was obtained from a Qtof, in which case it should be 0.02 - 0.05 (I currently use 0.04). The "Peak Width" parameter should be set to 1 for unit-resolved data (ion trap) and 0.75 for higher-resolution data obtained from a Qtof. Triple quads are often run in a low-resolution mode to enhance sensitivity, so the peak widths might be 2-3 u. For less than unit-resolved spectra (triple quad, say), set the "Transition Mass" to 1800. This is the mass above which average (rather than monoisotopic) masses are used in the calculations; for unit-resolved or better data, set it high (5000) so that average masses are never used. Since fragmentation patterns are slightly different for triple quads, ion traps, and Qtofs, set the parameter "Fragmentation Pattern" accordingly. Finally, the parameter "Auto Tag" should be set to N for ion trap data and Y for Qtof or triple quad data. Here are the parameter files I use for data obtained by LC/MS/MS using an ion trap, and LC/MS/MS using a Qtof.

Here is a more complete description of the Lutefisk.params file parameters.
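The instrument-specific guidance above can be summarized as a lookup table. The Python sketch below only restates the values given in the text; the dictionary and function names are hypothetical, the keys are not guaranteed to match the exact spelling in Lutefisk.params, and "Fragmentation Pattern" is left out because its allowed values are not listed here. The triple quad peak width is entered as 2.0, the low end of the 2-3 u range mentioned above.

```python
# Recommended starting values per instrument, taken from the text above.
RECOMMENDED = {
    "ion trap": {
        "Final Fragment Err": 0.0,
        "Peak Width": 1.0,
        "Transition Mass": 5000,
        "Auto Tag": "N",
    },
    "qtof": {
        "Final Fragment Err": 0.04,  # suggested range is 0.02 - 0.05
        "Peak Width": 0.75,
        "Transition Mass": 5000,
        "Auto Tag": "Y",
    },
    "triple quad": {
        "Final Fragment Err": 0.0,
        "Peak Width": 2.0,           # low-resolution mode: roughly 2-3 u
        "Transition Mass": 1800,
        "Auto Tag": "Y",
    },
}


def uses_average_mass(mass, transition_mass):
    """Average masses apply only above the transition mass."""
    return mass > transition_mass


settings = RECOMMENDED["triple quad"]
print(uses_average_mass(2100.0, settings["Transition Mass"]))
```

With the ion trap or Qtof settings, the same call returns False for any peptide below 5000 u, which is the intent of setting the transition mass high.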

Output files (.lut)

The header repeats the information contained in the params file, and also lists several scores that need some explanation. The candidate sequences are ranked according to Pr(C), which is the estimated probability of being correct. I find that values over 0.5 are worth submitting to a homology-based sequence database search program, and anything over 0.8 is particularly worthy of serious consideration. Pr(C) is calculated from an empirically derived second-order polynomial fit to a weighted average of the four remaining scores (Pevzscr, Quality, Intscr, and X-corr). Pevzscr is an adaptation of the ideas presented by Dancik et al. (J. of Comput. Biol. (1999) Vol. 6, 327); it is a score that penalizes the absence of expected ions and accounts for the possibility of random matches. Quality is the percentage of the peptide mass that can be accounted for by a contiguous ion series. Intscr is the percentage of the fragment ion intensity that can be accounted for as b, y, internal fragment, etc., ions. X-corr is the cross-correlation score normalized by its auto-correlation score.

If "Mass Scrambles for Statistics" in the params file was used, then the bottom of the output file contains a summary of the statistical analysis. The first column, "1st ranked", lists the un-normalized scores for the top-ranked sequence. The column "St Deviations" shows how many standard deviations the top-ranked sequence's scores lie from the average wrong scores. The column "Average Wrong" lists these wrong-score averages, and the column "Correct/Wrong" shows the ratio of the top correct score to the wrong score. Currently, I don’t recommend using this, so just give a zero value for the parameter “Mass Scrambles for Statistics” (Lutefisk.params file).
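To make the four summary columns concrete, here is a Python sketch of how such values could be computed for one score type. It is an illustration under stated assumptions, not Lutefisk's code: the function name is hypothetical, the sample standard deviation is assumed, "Correct/Wrong" is assumed to be the top score divided by the average wrong score, and the numbers are made up.

```python
from statistics import mean, stdev


def scramble_summary(top_score, wrong_scores):
    """Summarize one score type against scores from scrambled (wrong) sequences."""
    avg_wrong = mean(wrong_scores)
    sd_wrong = stdev(wrong_scores)  # assumption: sample standard deviation
    return {
        "1st ranked": top_score,
        "St Deviations": (top_score - avg_wrong) / sd_wrong,
        "Average Wrong": avg_wrong,
        "Correct/Wrong": top_score / avg_wrong,  # assumed definition
    }


# Made-up scores, for illustration only.
summary = scramble_summary(0.9, [0.2, 0.3, 0.4])
for column, value in summary.items():
    print(f"{column}: {value:.2f}")
```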

Compilation Notes

See the 'README' file for compilation information.

Unix/Linux:

After untarring the archive, copy the makefile for your system to "Makefile". Use the "make lutefisk" command.

Win32:

Current Metrowerks Projects are included in the "Win32" folder. If you have an older compiler you will need to create a new "C Console App" project and add the source files as specified in the '0_ReadMe' file.

Macintosh:

Current Metrowerks Projects are included in the "Macintosh" folder (provided as a self-extracting archive to maintain fidelity). If you have an older compiler you will need to create a new "Std C Console PPC" project and add the source files as specified in the '0_ReadMe' file. Set the preferred heap size to 16M, the minimum heap size to 8M, and the stack size to 512k.

Contact & License Information

Lutefisk is software for de novo sequencing of peptides from tandem mass spectra.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

Contact:

Richard S Johnson
4650 Forest Ave SE
Mercer Island, WA 98040

jsrichar -at - alum.mit.edu

Version History

· When Q is in the second position, then a c1 ion is present. It is now scored, and used to distinguish the two amino acids if they are not already.

· Limits placed on the number of subsequences: the limit drops by half if the program has been processing for over 30 seconds, and by another half if processing exceeds a minute. This speeds up long runs.

· Limits placed on the deconvolution of the C-terminal unsequenced chunk of mass (in Haggis). If the overall processing has gone on for over 45 sec then it drops the mass to 600.

· For LCQ data, b/y pairs are found and labeled as "golden boys" that cannot be gotten rid of as easily. These could also be considered "favored sons", although we can all hope that the favored son "W" is eliminated in November (post-election news: sadly the favored son is still in DC).

· Fixed the ranking so that the db derived sequences don't foul up the rank numbering.

· Fixed a bug that, in certain cases, produced negative values for summed residue masses displayed within brackets.

· Changed the LCQ b/y pair procedures (both in GetCID and main) such that pairs were either both singly charged, or one was singly-charged and the other doubly-charged (but only if the precursor is > 2 charge state).

· Changed Haggis a bit so that the minimum number of edges required to be considered a sequence varied with the number of Lutefisk sequences already obtained as well as peptide molecular weight. Minimum stayed at 4, but could be higher.

· Changed Haggis so that if the array capacity was exceeded, it dumped the results obtained with the Lutefisk sequences and then continued to run (rather than exit(1), which is a waste).

· Changed the way the ion intensity is altered in the final scoring. The high intensity ions are reduced, but the low intensity ones are not increased.

· The Haggis sequencing was modified so that the two unsequenced masses at either end could be matched to randomly derived sequences. The random sequence that fits the most b and y ions is saved and replaces the chunk of mass.

· Added two output variables to the params file. Now can specify the number of sequences and their Pr(c) limit.

· Modified final scoring slightly, such that "quality" has less of an influence.