LutefiskXP v1.0.4 Operators Manual
Lutefisk is a program for the de novo interpretation of peptide CID spectra. Although its interface is rudimentary, it can be compiled for virtually any operating system with a C compiler. Source code and instructions are provided for MacOS, Win32, OSF, Solaris, Irix, and Linux.
Lutefisk can be used in conjunction with homology-based database search programs (e.g., OpenSea, MS-BLAST, or CIDentify) as a supplement to standard MS/MS database search programs. We use it to find modified peptides or sequence variants, and to help validate the results obtained from database search programs (e.g., Mascot, Sequest). We also find it useful for flagging high-quality data that the search programs fail to identify, which can happen for a variety of reasons (for example, searching a human protein sequence database with an MS/MS spectrum of a mycobacterium-derived peptide).
To run Lutefisk, you need four files in the same directory or folder: the CID data file, Lutefisk.params, Lutefisk.details, and Lutefisk.residues.
One additional file is optional: Database.sequence.
Once these files (containing the appropriate information as described below) and the Lutefisk application are gathered together in one folder, you start the application. Once execution begins, Lutefisk proceeds with minimal user intervention.
On a Mac, a single dialog box appears where you can specify a variety of
command-line arguments (type -h to see the help); in most cases, you
will simply click the "Ok" button and accept all of the default values. On
Windows, there is no such dialog box; however, command-line arguments can be
supplied by starting Lutefisk from the Command Prompt program that comes with
the Windows operating system.
A simulated teletype interface appears and indicates the various stages of processing that have been achieved. When it is finished, the teletype interface provides an initial list of sequences ranked in order of the "intensity score", followed by a more refined and shorter list of sequences. This short list of sequences is also placed in a file with the default name identical to the CID data file name with ".lut" appended. The header to this file contains the information found in the parameter file "Lutefisk.params".
Note regarding use of Lutefisk with CIDentify: The output file can be read directly by the modified FASTA program called CIDentify. If you don't like Lutefisk, you can still use CIDentify without using Lutefisk. This can easily be done by editing a Lutefisk output file so that it contains your own sequences (determined by hand or another sequencing program). Alternatively, you can obtain the CIDentify source code and modify the data input format to suit yourself. It is also worth pointing out that the Lutefisk output file can be edited in order to eliminate any sequences that you somehow know are incorrect.
Lutefisk can read CID data files in four different formats:
For any of these first three formats, the data can be profile data with multiple data points per mass unit, or centroided (peak-top) m/z values.
The Lutefisk.details file contains the so-called "ion probabilities" for each type of ion. Here is an example. Each column in the file contains the "ion probabilities" for a different fragmentation pattern (see the description of "fragmentation patterns" below). Currently only two fragmentation patterns have been coded: low energy CID of tryptic peptides on triple quadrupole (or Qtof) instruments, and low energy CID of tryptic peptides on ion traps. Their ion probabilities are listed in the second and third columns. The first column is not used (oddly enough).
The Lutefisk.residues file contains the single-letter code, monoisotopic mass, average mass, and nominal mass for each amino acid. The default Lutefisk.residues file is shown here. To add a residue to the list, replace the 0's in one of the rows with the corresponding monoisotopic, average, and nominal masses. Up to five additional non-traditional residues can be entered here; they will be given the single-letter codes J, O, U, X, or Z. Also, if you don't like my masses for the usual amino acids, you should feel free to change them here.
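To make the role of the residue masses concrete, here is a minimal Python sketch of how a peptide mass follows from a residue table. It is not Lutefisk's code, and the table is not copied from Lutefisk.residues: the masses are standard monoisotopic values for a handful of amino acids, and the use of "J" for phosphoserine is only a hypothetical example of filling in one of the spare rows.

```python
# Standard monoisotopic residue masses (u) for a few amino acids.
# These are textbook values, not read from Lutefisk.residues.
MONO = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "K": 128.09496,
}
WATER = 18.01056  # monoisotopic mass of H2O


def peptide_mass(sequence, table):
    """Sum the residue masses and add one water for the free termini."""
    return sum(table[aa] for aa in sequence) + WATER


# Hypothetical custom residue: use the spare code "J" for phosphoserine,
# i.e. serine plus HPO3 (79.96633 u).
custom = dict(MONO)
custom["J"] = MONO["S"] + 79.96633

print(round(peptide_mass("GASK", MONO), 4))
print(round(peptide_mass("GAJK", custom), 4))
```

The same lookup works for average or nominal masses by swapping in a different table.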
The Database.sequence file is a text file containing a sequence or a list of sequences that might have been derived from a sequence database search. An example of such a file is shown here. Although this may look like giving Lutefisk the answers up front, Lutefisk does its usual de novo sequencing regardless of the Database.sequence list. In the final steps, where it determines scores for the candidate sequences, Lutefisk tosses in these database-derived sequences along with the de novo sequence candidates to determine whether the database sequences are as good as or better than the de novo sequences. If so, this constitutes evidence that the database-derived sequences might actually be correct.
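The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lutefisk's implementation: the function name is hypothetical, and `score` is a stand-in callable, since the real scoring is not reproduced here.

```python
def rank_with_database(de_novo, database, score):
    """Score both pools together and flag well-supported database sequences.

    `score` is a placeholder callable (sequence -> number); Lutefisk's
    actual scoring functions are not reproduced in this sketch.
    """
    best_de_novo = max(score(s) for s in de_novo)
    # Rank the de novo candidates and the database sequences in one list.
    ranked = sorted(set(de_novo) | set(database), key=score, reverse=True)
    # A database sequence is "supported" if it scores as well as or
    # better than the best de novo candidate.
    supported = [s for s in database if score(s) >= best_de_novo]
    return ranked, supported


# Toy scores, made up purely for illustration.
toy_scores = {"PEPTIDEK": 0.9, "PEPTLDEK": 0.7, "PEPSIDEK": 0.4}
ranked, supported = rank_with_database(
    ["PEPTLDEK", "PEPSIDEK"], ["PEPTIDEK"], toy_scores.get
)
print(ranked)
print(supported)
```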
The Lutefisk.params file is where most of the user-selected variables are set. Once appropriate parameters have been chosen for a given type of data, one usually needs to change only three parameters: "CID Filename", "Peptide MW", and "Charge-state" (of the precursor ion). If the CID file is in the "dta" format, the latter two parameters can be read automatically from the file header, and invoking the program as 'lutefisk <dta_file>' overrides all three parameters with those from the specified data file(s).
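For readers unfamiliar with the "dta" format, the following Python sketch shows what reading the header amounts to. It relies on the usual dta layout, where the first line holds the singly protonated precursor mass (M+H)+ and the charge state, followed by m/z and intensity pairs. The function name is hypothetical, and subtracting a proton to get a neutral "Peptide MW" is an assumption about what the parameter expects, not a statement of what Lutefisk does internally.

```python
PROTON = 1.00728  # mass of a proton in u


def read_dta_header(text):
    """Return (neutral peptide MW, charge) from the first line of dta text.

    Assumes the first line is "<(M+H)+ mass> <charge>".
    """
    first_line = text.strip().splitlines()[0]
    mh_plus, charge = first_line.split()[:2]
    return float(mh_plus) - PROTON, int(charge)


# Made-up example spectrum: header line, then m/z and intensity pairs.
example = "1479.80 2\n175.12 120.0\n262.15 85.5\n"
mw, z = read_dta_header(example)
print(f"Peptide MW = {mw:.2f}, Charge-state = {z}")
```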
Certain parameters need to be changed to accommodate data obtained from different instruments. The mass tolerances should match the anticipated errors for each type of instrument. The tolerance parameter "Final Fragment Err" should be set to zero unless the data was obtained from a Qtof, in which case it should be 0.02 - 0.05 (I currently use 0.04). The "Peak Width" parameter should be set to 1 for unit-resolved data (ion trap), and 0.75 for higher-resolution data obtained from a Qtof. Triple quads are often run in a low-resolution mode to enhance sensitivity, so the peak widths might be 2-3 u. For less than unit-resolved spectra (triple quad, say), set the "Transition Mass" to 1800. This is the mass above which average (rather than monoisotopic) masses are used in the calculations; for unit-resolved or better data, set this high (5000) so that average masses are never used. Since fragmentation patterns are slightly different for triple quads, ion traps, and Qtofs, set the parameter "Fragmentation Pattern" accordingly. Finally, the parameter "Auto Tag" should be set to N for ion trap data, and Y for Qtof or triple quad data. Here are the parameter files I use for data obtained by LC/MS/MS using an ion trap, and LC/MS/MS using a Qtof.
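The instrument-dependent settings above can be collected into a small lookup, sketched here in Python. This is a summary aid, not part of Lutefisk: the instrument keys and function names are hypothetical, the triple quad peak width of 2.0 is simply one value picked from the 2-3 u range, and "Fragmentation Pattern" is left out because its allowed values are not given in this section.

```python
def suggested_params(instrument):
    """Return the instrument-dependent settings described in the text.

    The keys "ion_trap", "qtof", and "triple_quad" are hypothetical labels.
    """
    presets = {
        "ion_trap": {"Final Fragment Err": 0.0, "Peak Width": 1.0,
                     "Transition Mass": 5000, "Auto Tag": "N"},
        "qtof": {"Final Fragment Err": 0.04, "Peak Width": 0.75,
                 "Transition Mass": 5000, "Auto Tag": "Y"},
        # Peak width 2.0 is an assumed pick from the 2-3 u range.
        "triple_quad": {"Final Fragment Err": 0.0, "Peak Width": 2.0,
                        "Transition Mass": 1800, "Auto Tag": "Y"},
    }
    return dict(presets[instrument])


def uses_average_mass(mass, transition_mass):
    """Average masses apply only above the transition mass."""
    return mass > transition_mass


for name in ("ion_trap", "qtof", "triple_quad"):
    print(name, suggested_params(name))
print(uses_average_mass(2000, 1800), uses_average_mass(2000, 5000))
```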
Here is a more complete description of the Lutefisk.params file parameters.
The header repeats the information contained in the params file, and also lists several scores that need some explanation. The candidate sequences are ranked according to Pr(C), which is the estimated probability of being correct. I find that values over 0.5 are worth submitting to a homology-based sequence database search program, and anything over 0.8 is particularly worthy of serious consideration. Pr(C) is calculated from an empirically derived second-order polynomial fit to a weighted average of the four remaining scores (Pevzscr, Quality, Intscr, and X-corr). Pevzscr is an adaptation of the ideas presented by Dancik et al. (J. Comput. Biol. (1999) Vol. 6, 327); it is a score that penalizes the absence of expected ions and accounts for the possibility of random matches. Quality is the percentage of the peptide mass that can be accounted for by a contiguous ion series. Intscr is the percentage of the fragment ion intensity that can be accounted for as b, y, internal fragment, and other ions. X-corr is the cross-correlation score normalized by its auto-correlation score.
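The shape of the Pr(C) calculation, and the way I use the 0.5 and 0.8 cutoffs, can be sketched in Python as follows. The weights and polynomial coefficients used by Lutefisk are not given in this manual, so the function takes them as arguments and the example passes obvious placeholders (equal weights, identity polynomial). Clamping the result to the range 0 to 1 is also an assumption of this sketch.

```python
def estimate_pr_c(scores, weights, coeffs):
    """Weighted average of the four scores, then a second-order polynomial.

    `weights` and `coeffs` (a, b, c) must be supplied by the caller;
    Lutefisk's empirically derived values are not reproduced here.
    """
    total = sum(weights[name] for name in scores)
    avg = sum(scores[name] * weights[name] for name in scores) / total
    a, b, c = coeffs
    value = a * avg * avg + b * avg + c
    return min(1.0, max(0.0, value))  # clamping is an assumption


def triage(pr_c):
    """Apply the rule-of-thumb cutoffs from the text."""
    if pr_c > 0.8:
        return "serious consideration"
    if pr_c > 0.5:
        return "submit to homology search"
    return "low confidence"


# Placeholder inputs, made up for illustration only.
scores = {"Pevzscr": 0.9, "Quality": 0.8, "Intscr": 0.7, "X-corr": 0.6}
equal = {name: 1.0 for name in scores}
print(triage(estimate_pr_c(scores, equal, (0.0, 1.0, 0.0))))
```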
If "Mass Scrambles for Statistics" in the params file was used, the bottom of the output file contains a summary of the statistical analysis. The first column, "1st ranked", lists the un-normalized scores for the top-ranked sequence. The column "St Deviations" shows how many standard deviations the top-ranked sequence's scores lie from the average wrong scores. The column "Average Wrong" lists these wrong-score averages, and the column "Correct/Wrong" shows the ratio of the top-ranked score to the average wrong score. Currently, I don't recommend using this, so just give a zero value for the parameter "Mass Scrambles for Statistics" (Lutefisk.params file).
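For anyone who wants to interpret this summary anyway, the four columns relate to each other roughly as in the Python sketch below. It is a reading of the column descriptions, not Lutefisk's code: the function name is hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def scramble_summary(top_score, wrong_scores):
    """Relate the four summary columns for one score type.

    `wrong_scores` are the scores obtained from the mass-scrambled runs.
    The sample standard deviation is used here as an assumption.
    """
    avg_wrong = mean(wrong_scores)
    sd_wrong = stdev(wrong_scores)
    return {
        "1st ranked": top_score,
        "St Deviations": (top_score - avg_wrong) / sd_wrong,
        "Average Wrong": avg_wrong,
        "Correct/Wrong": top_score / avg_wrong,
    }


# Made-up numbers for illustration only.
print(scramble_summary(10.0, [2.0, 4.0, 6.0]))
```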
See the 'README' file for compilation information.
After untarring the archive, copy the makefile for your system to "Makefile". Use the "make lutefisk" command.
Current Metrowerks Projects are included in the "Win32" folder. If you have an older compiler you will need to create a new "C Console App" project and add the source files as specified in the '0_ReadMe' file.
Current Metrowerks Projects are included in the "Macintosh" folder (provided as a self-extracting archive to maintain fidelity). If you have an older compiler you will need to create a new "Std C Console PPC" project and add the source files as specified in the '0_ReadMe' file. Set the preferred heap size to 16M, the minimum heap size to 8M, and the stack size to 512k.
Contact & License Information
Lutefisk is software for de novo sequencing of peptides from tandem mass spectra.
Copyright © 1995-2005 Richard S. Johnson
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
Richard S Johnson
4650 Forest Ave SE
Mercer Island, WA 98040
jsrichar -at - alum.mit.edu
version 1.0.4 – External release
When Q is in the second position, then a c1 ion is present. It is now scored, and used to distinguish the two amino acids if they are not already.
on the number of subsequences (drops by half) if the program has been processing for over 30 seconds. It drops by another half if the processing is over a minute. This will speed long ones up.
on the deconvolution of the C-terminal unsequenced chunk of mass (in Haggis). If the overall processing has gone on for over 45 sec then it drops the mass to 600.
For LCQ data, b/y pairs are found and labeled as "golden boys" that cannot be gotten rid of as easily. These could also be considered "favored sons", although we can all hope that the favored son "W" is eliminated in November (post-election news: sadly the favored son is still in DC).
ranking so that the db derived sequences don't foul up the rank numbering.
Fixed a bug that, in certain cases, produced negative values for summed residue masses displayed within brackets.
Changed the LCQ b/y pair procedures (both in GetCID and main) such that pairs were either both singly charged, or one was singly-charged and the other doubly-charged (but only if the precursor is > 2 charge state).
a bit so that the minimum number of edges required to be considered a sequence varied with the number of Lutefisk sequences already obtained as well as peptide molecular weight. Minimum stayed at 4, but could be higher.
so that if the array capacity was exceeded, it dumped the results obtained with the Lutefisk sequences and then continued to run (rather than exit(1), which is
Changed the way the ion intensity is altered in the final scoring. The high intensity ions are reduced, but the low intensity ones are not increased.
version 1.0 – External release
sequencing was modified so that two ion series could be combined.
sequencing was modified so that the two unsequenced masses at either end could be matched to randomly
sequences. The random sequence that fits with the most b and y ions is saved and replaces the chunk of mass.
Added two output variables to the params file. Now can specify the number of sequences and their Pr(c) limit.
scoring slightly, such that "quality" has less of an influence.
Lutefisk1900 version 1.3.8 - Internal release only
Lutefisk1900 version 1.3.7 - Internal release only
Lutefisk1900 version 1.3.5 - Internal release only
Lutefisk1900 version 1.3.4 - Internal release only
Lutefisk1900 version 1.3.2 - Released 1/28/02
Lutefisk1900 version 1.2.9 - Internal release only
Version 1900 1.2.8 - Internal release
Version 1900 1.2.7 - Internal release
Version 1900 1.2.6 - Released 10/15/00
Version 1900 1.2.5 - Internal release
Version 1900 1.2.4 - Released 4/6/00
Version 2.0.5 - Released 10/16/98
Version 1.3 - Released 11/17/97
Version 1.1 - Released 6/22/97
Version 1.0 - Released 5/30/97