Welcome to LipidAnalyst, a user-friendly web
application for lipidomics data analysis and visualization.
This tutorial will guide you through the key steps:
Purpose:
Provide the raw input for analysis. The app requires a lipidomics
dataset, metadata and optionally internal standards to guide processing.
Lipids in the data sheet will be reordered alphabetically.
Options:
- Lipidomics file: must be in wide format.
- Metadata file: must contain grouping variables
(e.g. control vs treatment). Optional: cell counts, protein
concentration…
- Internal standard file (optional): improves
quantification by correcting for technical variation.
Upload instructions:
- Files need to be in CSV/TSV/XLS/XLSX format. If the uploaded file is
an Excel workbook, only the first sheet will be processed.
- Missing values in the sheet may appear in different forms, such as NA,
NaN, N/A, an empty string, or just a blank cell.
- Ensure sample IDs match between the lipidomics data and the metadata/internal standard files.
- Use the provided example dataset if you are unsure about the format.
- In the preview, lipids should appear in the first column (one lipid per row). If samples appear in the first column instead, change the orientation setting when uploading the file.
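As a rough sketch of these upload rules, reading a wide-format sheet might look like the following (the function name, `NA_FORMS` list, and defaults are our illustration, not LipidAnalyst's actual code):

```python
import pandas as pd

# Hypothetical reader mirroring the upload rules above: CSV/TSV/XLS/XLSX,
# first Excel sheet only, several missing-value spellings, first column = lipids.
NA_FORMS = ["NA", "NaN", "N/A", ""]

def read_lipidomics(path: str) -> pd.DataFrame:
    if path.lower().endswith((".xls", ".xlsx")):
        # Only the first sheet of a workbook is processed
        df = pd.read_excel(path, sheet_name=0, na_values=NA_FORMS)
    else:
        sep = "\t" if path.lower().endswith(".tsv") else ","
        df = pd.read_csv(path, sep=sep, na_values=NA_FORMS)
    return df.set_index(df.columns[0])  # lipid names as row index
```

The separator switch covers TSV, and `sheet_name=0` enforces the first-sheet-only rule for Excel workbooks.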
Lipid Naming guidelines:
LipidAnalyst supports parsing common or trivial names, such as “oleic acid” or “palmitoyl-PC”; it converts them automatically by matching against the LIPID MAPS Structure Database (LMSD).
The shorthand nomenclature format is preferred: <class> <chain>:<unsaturation> or <class> <chain>:<unsaturation>/<chain>:<unsaturation> (e.g., FFA 18:1, PC 16:0/18:1, TAG 54:4).
Parentheses are optional — both FFA 18:1 and FFA(18:1) are accepted.
Special prefixes like O- or P- are supported (e.g., PC
O-16:0/18:1).
Complex annotations after ; or + will be ignored.
If only a single acyl chain is annotated for TG (triacylglycerol) lipids, TGs are parsed by their total carbon count and total unsaturation rather than individual sn-positions. If chain-specific annotation is present (e.g., TAG54:4-FA20:3), LipidAnalyst will still correctly extract the total composition (TAG 54:4).
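The naming rules above can be sketched as a small parser (the regex, function name, and returned fields are our illustration, not LipidAnalyst's implementation):

```python
import re

# Matches shorthand names: class, optional "(", optional O-/P- prefix,
# then one or more chain:unsaturation pairs separated by "/".
PATTERN = re.compile(
    r"(?P<cls>[A-Za-z]+)\s*\(?\s*(?P<prefix>[OP]-)?(?P<chains>\d+:\d+(?:/\d+:\d+)*)"
)

def parse_lipid(name: str):
    # Annotations after ';' or '+' are ignored, per the guideline above.
    core = re.split(r"[;+]", name)[0].strip()
    m = PATTERN.match(core)
    if m is None:
        return None  # trivial names (e.g. "oleic acid") go to the LMSD lookup instead
    chains = [tuple(map(int, c.split(":"))) for c in m.group("chains").split("/")]
    total_c = sum(c for c, _ in chains)
    total_u = sum(u for _, u in chains)
    return {"class": m.group("cls"), "prefix": m.group("prefix"),
            "chains": chains, "total": f"{total_c}:{total_u}"}

print(parse_lipid("TAG54:4-FA20:3")["total"])  # → "54:4", per the TG rule above
```

Note how `FFA 18:1` and `FFA(18:1)` parse identically, and how trailing chain-specific suffixes on TG names are simply ignored.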
| Lipid | Sample1 | Sample2 | Sample3 | Sample4 |
|---|---|---|---|---|
| PC(16:0/18:1) | 12345 | 15678 | 13456 | 14321 |
| PE(18:0/20:4) | 9876 | 10234 | 11012 | 10098 |
| TAG(54:3) | 5634 | 5890 | 6234 | 5901 |
Note:
- Select whether lipids are on rows or columns so the software can identify them.
| Sample | Group | Tissue |
|---|---|---|
| Sample1 | CKD | Plasma |
| Sample2 | CKD | Plasma |
| Sample3 | Sham | Plasma |
| Sample4 | Sham | Plasma |
Note:
- Sample names must be in the first column of the metadata, and sample IDs must match between the lipidomics data and the metadata/internal standard files.
- The metadata must contain a column that can be defined as the grouping variable.
- Other information such as tissue, cell count, and protein concentration can be included as additional columns; these columns can also be used to normalize the data.
| Lipid | Sample1 | Sample2 | Sample3 | Sample4 |
|---|---|---|---|---|
| PC(17:0)-d+ | 1234 | 1567 | 1345 | 1432 |
| PE(17:0)-d+ | 987 | 1023 | 1101 | 1009 |
| TAG(17:0)-d | 563 | 589 | 623 | 590 |
Note:
- You can select whether lipids are on rows or columns so the software can identify them.
- If there are multiple internal standards for one lipid class, the app will show a warning and let you select one for any subsequent internal standard normalization.
Purpose:
Remove lipids with unreliable measurements to improve downstream
analysis.
Options:
- Filter by low quality (e.g., missing-value fraction above a threshold)
- Filter by low abundance (below detection in most samples)
- Filter by low variance (features that do not change across groups)
💡 Tips:
- Features with 100% missing values are removed automatically.
- Default thresholds are suggested but can be customized.
- Filtering can be skipped if you prefer to keep all lipids.
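A minimal sketch of these filters, assuming lipids as rows and samples as columns (the function and threshold defaults are illustrative, not the app's):

```python
import numpy as np
import pandas as pd

def filter_lipids(df: pd.DataFrame, max_missing: float = 0.5,
                  min_abundance: float = 0.0) -> pd.DataFrame:
    """df: lipids as rows, samples as columns."""
    missing = df.isna().mean(axis=1)
    keep = missing < 1.0                     # 100%-missing features always removed
    keep &= missing <= max_missing           # low-quality filter
    keep &= df.median(axis=1, skipna=True) > min_abundance  # low-abundance filter
    return df[keep]
```

A low-variance filter would follow the same pattern, e.g. keeping rows whose variance across samples exceeds a threshold.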
Missing values are common in lipidomics datasets and can arise from
several sources, including:
- Technical limitations of the mass spectrometer, such as detection
thresholds or signal suppression,
- Variability introduced during sample extraction, handling, or
instrument runs,
- True biological absence of a lipid in certain samples or
conditions.
We categorize missingness into two major types:
Group-level missingness:
A lipid is almost completely missing within one experimental group
(e.g., all disease group samples have NA). This often suggests true
biological absence or very low abundance.
In these cases, a method such as Limit of Detection (LoD) imputation is generally more appropriate.
General missingness:
Values are sporadically missing across samples but not confined to a
single group. This pattern typically reflects technical noise or
stochastic signal dropout.
K-Nearest Neighbors (KNN) imputation (sample-wise) is recommended, as it leverages similarity among samples to estimate reasonable values.
Purpose:
Replace missing data points to allow statistical analysis.
Options:
- KNN (k-nearest neighbors), feature-wise or sample-wise: predicts missing values from similar features/samples.
- LoD 1/5 minimum value: replaces missing values with 1/5 of the feature-wise minimum value.
- Mean/median substitution: simple, but may reduce variability.
💡 Tips:
- Use the missing values heatmap to determine the reason for missingness, such as random missingness, below-detection-limit values, or systematic missingness.
- KNN is suggested for large datasets.
- Use the 1/5 minimum value for targeted lipidomics where missing means below detection.
- Check the skip box if there are no missing values.
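The LoD 1/5-minimum option can be sketched in a few lines (illustrative only; lipids as rows, samples as columns):

```python
import pandas as pd

def impute_lod(df: pd.DataFrame) -> pd.DataFrame:
    """Replace each lipid's missing values with 1/5 of its observed minimum."""
    fill = df.min(axis=1, skipna=True) / 5.0   # feature-wise minimum / 5
    # Transpose so lipids are columns, fill per-lipid, transpose back.
    # Lipids with no observed values at all remain NaN.
    return df.T.fillna(fill).T
```

KNN imputation could be done analogously with, e.g., scikit-learn's `KNNImputer` (transposing first for the sample-wise variant).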
Purpose: Deal with duplicated lipids or lipids with different adducts. Combine them by sum, mean, median, max, or min.
Options:
- Combine all duplicated lipids based on clean names in the parse table: merges all duplicates using the same method.
- Lipid class specific merging: choose one lipid class and select the corresponding combining method.
💡 Tip:
- Press the Add criteria button to add new rules for merging lipids.
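Conceptually, merging duplicates is a group-by over the cleaned names (a sketch; the function and defaults are ours, not the app's):

```python
import pandas as pd

def merge_duplicates(df: pd.DataFrame, clean_name: pd.Series,
                     method: str = "sum") -> pd.DataFrame:
    """df: lipids as rows, samples as columns; clean_name aligned to rows.
    method: one of "sum", "mean", "median", "max", "min"."""
    return df.groupby(clean_name).agg(method)
```

Class-specific merging would apply different `method` values to different subsets of rows before concatenating the results.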
Purpose: Ensure lipid names are correctly interpreted for class-based analysis. Lipid parsing identifies the lipid class and the sn-1 to sn-4 chains from the lipid name. If a lipid name does not explicitly show the structure (such as palmitic acid or oleic acid), the app automatically searches for the structure in the LIPID MAPS® Structure Database (LMSD).
💡 Tips:
The parsing table is editable. Double-click a cell to make changes. If
any lipid information is incorrect, you may revise it directly in the
table.
Purpose: Review the data preview before normalization through Lipid composition pie plot, Lipid class boxplot, PCA plot, and sample boxplot.
đź’ˇ Tip: Hover on the data for more information. The plots are interactive.
Purpose:
Remove technical bias so differences reflect biology.
Options:
- Internal standard normalization (Internal standards
must be available)
💡 Tips:
- Internal standards correct for run-to-run variation.
- If no class-specific standard is available, use another internal standard that shares similar chemical characteristics.
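The idea is simply a per-class division (a sketch; the function name and the class-to-standard mapping argument are our assumptions):

```python
import pandas as pd

def is_normalize(df: pd.DataFrame, standards: pd.DataFrame,
                 lipid_class: pd.Series, class_to_std: dict) -> pd.DataFrame:
    """Divide each lipid by its class's internal standard, sample by sample.
    df / standards: rows = lipids / standards, columns = samples."""
    std_for_lipid = lipid_class.map(class_to_std)        # standard name per lipid
    denom = standards.loc[std_for_lipid].to_numpy()      # (n_lipids, n_samples)
    return pd.DataFrame(df.to_numpy() / denom, index=df.index, columns=df.columns)
```

When a class has several standards, `class_to_std` holds the one you selected after the app's warning.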
Purpose: Adjust for sample amount differences (e.g. cell count, protein concentration).
Options:
- User-defined constant value, e.g. dilution factor, weight, volume
- Metadata variable, e.g. cell count or protein concentration, taken from the uploaded metadata file.
Purpose: Make data comparable across samples and lipids.
Options:
- Normalized by sum (samplewise) – Divides each feature
by the total abundance in the sample.
- Normalized by median (samplewise) – Divides each
feature by the sample median.
- Normalized by mean (samplewise) – Divides each
feature by the sample mean.
- Lipid class sum normalization – Normalizes within
each lipid class based on the total abundance of that class. If there is
only one lipid within the lipid class, the original value of the lipid
would be kept.
- Lipid class median normalization – Uses the median of
each lipid class for normalization.
- Lipid class mean normalization – Uses the mean of
each lipid class for normalization.
- Quantile normalization – Aligns the distribution of
all samples to make them comparable.
💡 Tip: Lipid class–based normalization is particularly useful when focusing on relative changes within specific lipid categories.
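To make the options above concrete, here is a sketch of sample-wise sum normalization and lipid-class sum normalization, including the singleton-class rule (illustrative code, not the app's):

```python
import pandas as pd

def normalize_by_sum(df: pd.DataFrame) -> pd.DataFrame:
    """Divide each value by the total abundance of its sample (column)."""
    return df / df.sum(axis=0, skipna=True)

def class_sum_normalize(df: pd.DataFrame, lipid_class: pd.Series) -> pd.DataFrame:
    """Divide each lipid by the total abundance of its class within each sample."""
    out = df / df.groupby(lipid_class).transform("sum")
    # A class containing only one lipid keeps its original values, per the rule above.
    single = lipid_class.map(lipid_class.value_counts()) == 1
    out.loc[single] = df.loc[single]
    return out
```

The median/mean variants replace `"sum"` with `"median"`/`"mean"` in the same pattern.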
After normalization, data transformation helps stabilize variance and reduce skewness.
Available data transformation methods:
After normalization, scaling helps prepare the data for downstream analyses such as PCA or clustering.
Available scaling methods:
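The app lists the available methods directly. Purely as an illustration, two choices common in lipidomics (our assumption, not necessarily the app's list) are a log transformation and unit-variance autoscaling:

```python
import numpy as np

def log2_transform(x: np.ndarray) -> np.ndarray:
    # +1 offset guards against log(0); stabilizes variance, reduces skew.
    return np.log2(x + 1.0)

def autoscale(x: np.ndarray) -> np.ndarray:
    # Center each lipid (row) to mean 0 and scale to unit variance;
    # constant rows are left at 0 rather than dividing by zero.
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / np.where(sd == 0, 1.0, sd)
```

Autoscaling in particular puts all lipids on a comparable footing before PCA or clustering.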
Purpose:
Summarize and communicate findings effectively.
LipidAnalyst offers a variety of interactive visualization options:
Global Distribution Boxplot - View overall lipid abundance patterns.
Boxplot for all samples
Boxplot for lipid classes
Boxplot for all lipids (only recommended when the number of lipids is less than 80)
Principal Component Analysis (PCA) – Explore global variation and sample clustering.
3D or 2D options available.
Scaled or unscaled data available.
Differential Mean Lipid Heatmap – Visualize overall lipid expression patterns.
Class Level Lipid Comparison - Compare lipid class abundance across groups. Total double bond and total carbon number can be controlled for refined analysis.
Individual Lipid Comparison - Examine specific lipid changes.
Volcano plots – Identify significantly changed lipids by fold change and p value.
💡 Tips:
- You can view the Differential Mean Lipid Heatmap to generate hypotheses, then use class-level or individual-lipid comparisons to examine differences between groups.
- P-value statistics are available in the class-level and individual-level lipid comparisons.
- Export figures as publication-ready images using the camera icon at the upper right corner of the plot.
Purpose:
Identify lipids that differ significantly between groups.
Options:
- Lipidomics Mean Calculator – calculates the mean abundance for each group.
- t-test (2-group comparison) – Welch’s t-test (unequal variances allowed) and Student’s t-test (equal variances assumed).
- ANOVA (multiple groups)
- Correlation – reveals relationships between lipids; a correlation heatmap is available.
- DSPC network – DSPC (Debiased Sparse Partial Correlation) is a statistical framework designed to infer sparse molecular networks by estimating partial correlations via a debiased graphical lasso.
💡 Tips:
- Always check assumptions (normality, equal variance).
- Adjust for multiple testing (e.g. FDR).
- You can review PCA plots restricted to the significant features.
- We suggest importing the DSPC table into Metscape3 (with Cytoscape) for further refinement of the network.
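As a sketch of what the 2-group workflow computes, here is Welch's t-test per lipid followed by a Benjamini–Hochberg FDR adjustment on synthetic data (sizes and effect are made up; this is illustrative, not the app's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.0, 1.0, size=(50, 6))   # 50 lipids x 6 samples
treated = rng.normal(10.8, 1.0, size=(50, 6))

# Welch's t-test per lipid (unequal variances allowed)
t, p = stats.ttest_ind(control, treated, axis=1, equal_var=False)

def bh_fdr(p: np.ndarray) -> np.ndarray:
    # Benjamini-Hochberg: q_(i) = min over j >= i of p_(j) * m / j, capped at 1
    order = np.argsort(p)
    m = len(p)
    q_sorted = p[order] * m / (np.arange(m) + 1)
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
    q = np.empty_like(q_sorted)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

q = bh_fdr(p)   # adjusted p-values, one per lipid
```

Student's t-test is the same call with `equal_var=True`.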
PLS-DA – Partial Least Squares Discriminant Analysis (PLS-DA) is a supervised multivariate statistical method that identifies features that best discriminate between predefined groups. This technique is particularly useful for biomarker discovery, as it highlights the most relevant features contributing to group separation while handling the complex, high-dimensional datasets typical of lipidomics studies.
In the model settings, you can set the number of permutations and the number of cross-validations.
In the model output summary, you can check model performance by looking at the R2Y and Q2 values. R2Y indicates how well the model fits the data, while Q2 reflects the model’s predictive ability. Generally, a good PLS-DA model should have high R2Y and Q2 values, with Q2 less than R2Y to avoid overfitting.
Parameters interpretation:
| Metric | Description | What It Evaluates | Interpretation Guideline |
|---|---|---|---|
| R2X(cum) | Cumulative proportion of variance in X explained by the model | How well predictors are summarized | Higher values indicate better representation of X-space |
| R2Y(cum) | Cumulative proportion of variance in Y explained by the model | Goodness of fit to class labels | >0.5 generally indicates moderate-to-good fit |
| Q2(cum) | Cross-validated predictive ability | Model generalizability | >0.3 acceptable, >0.5 good |
| RMSEE | Root Mean Square Error of Estimation (training error) | Average fitting error | Lower values indicate better fit |
| pre | Number of predictive components | Model complexity | Small numbers reduce overfitting |
| ort | Number of orthogonal components | X-variation unrelated to Y | Helps separate noise from predictive signal |
| pR2Y | Permutation-test p-value for R2Y | Significance of model fit | p < 0.05 indicates non-random fit |
| pQ2 | Permutation-test p-value for Q2 | Significance of predictive ability | p < 0.05 indicates non-random prediction |
Permutation plots are used to assess the statistical significance of the PLS-DA model by comparing the original model’s performance metrics (R2Y and Q2) against those obtained from models built on permuted class labels. A significant model will show that the original R2Y and Q2 values are substantially higher than those from the permuted models, indicating that the observed separation is not due to random chance.
VIP plots (Variable Importance in Projection) highlight the features that contribute most to the model’s ability to discriminate between groups. Features with VIP scores greater than 1 are generally considered important for group separation and may be potential biomarkers.
OPLS-DA Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) is a supervised multivariate statistical method used to identify differences between predefined groups in complex datasets. OPLS-DA explicitly separates the predictive variation (related to group separation) from the orthogonal variation (unrelated to group separation), enhancing the interpretability of the model. This method is particularly useful in lipidomics for identifying potential biomarkers and understanding underlying biological differences between conditions.
Score plot:
The score plot visualizes samples in a reduced latent variable space,
showing how well the model separates predefined groups based on
systematic variation in the predictor matrix.
Outlier plot:
The outlier plot (e.g., Hotelling’s T² vs. DModX) identifies samples
that fall outside the model’s confidence limits, indicating observations
with extreme leverage or poor model fit that may unduly influence the
OPLS-DA model.
Random Forest – Random Forest is an ensemble machine learning technique that builds multiple decision trees using bootstrap sampling and random feature selection.
Parameters interpretation:
| Metric | Description | What It Evaluates | Interpretation Guideline |
|---|---|---|---|
| Accuracy | Proportion of correctly classified samples | Overall classification performance | Closer to 1 indicates better performance; can be misleading with imbalanced classes |
| Kappa | Agreement between predicted and true labels adjusted for chance | Model reliability beyond random guessing | >0.6 substantial agreement, >0.8 strong agreement |
| Accuracy (Lower / Upper CI) | Confidence interval for accuracy | Statistical uncertainty of accuracy estimate | Narrow interval indicates stable performance |
| Accuracy Null | Accuracy expected by random chance | Baseline model comparison | Model accuracy should exceed this value |
| Accuracy P-Value | Significance test comparing model accuracy to null | Whether model performs better than chance | <0.05 indicates statistically significant improvement over null |
| McNemar’s P-Value | Test of symmetry in classification errors | Whether error types are balanced | <0.05 suggests systematic prediction bias |
| Sensitivity (Recall) | True positive rate | Ability to detect positive class | Higher values indicate better detection of positives |
| Specificity | True negative rate | Ability to detect negative class | Higher values indicate better detection of negatives |
| Precision | Proportion of predicted positives that are true positives | Reliability of positive predictions | Higher values reduce false positives |
| F1 Score | Harmonic mean of precision and recall | Balance between sensitivity and precision | Useful for imbalanced datasets |
| OOB Error | Out-of-bag misclassification rate | Internal cross-validated error estimate | Lower values indicate better generalization |
| Mean Decrease Accuracy | Drop in accuracy when a variable is permuted | Feature importance (predictive impact) | Larger decrease = more important feature |
| Mean Decrease Gini | Total decrease in node impurity contributed by a feature | Feature importance (splitting power) | Higher values indicate stronger discriminatory power |
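As a toy illustration of the OOB and importance metrics above (synthetic data; the parameter choices are ours, not the app's defaults):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))          # 40 samples x 20 lipids
y = np.repeat([0, 1], 20)
X[y == 1, 0] += 3.0                    # make "lipid 0" strongly discriminative

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
oob_error = 1.0 - rf.oob_score_             # out-of-bag misclassification rate
gini_importance = rf.feature_importances_   # Mean Decrease Gini (normalized)
```

Mean Decrease Accuracy would be estimated separately, e.g. with `sklearn.inspection.permutation_importance`.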
Purpose:
Export processed data, statistical results, and visualizations.
In this tutorial, you learned how to upload and format lipidomics data, filter and impute missing values, merge duplicates, normalize and transform the data, explore it with interactive visualizations, run statistical and multivariate analyses, and export the results.
LipidAnalyst simplifies lipidomics data analysis into an intuitive, code-free workflow — enabling you to focus on biological insights rather than technical complexity.