1 Introduction

Welcome to LipidAnalyst, a user-friendly web application for lipidomics data analysis and visualization.
This tutorial will guide you through the key steps:

Uploading lipidomics data, metadata, and internal standards
Preprocessing and normalization
Visualization
Statistical testing

2 Access

Open the app at: https://lily-apps.shinyapps.io/LipidExplorer/
The app works best in Google Chrome or Microsoft Edge.

3 Workflow

3.1 Upload Data

Purpose:
Provide the raw input for analysis. The app requires a lipidomics dataset and metadata, and optionally internal standards to guide processing. Lipids in the data sheet are reordered alphabetically.

Options:
- Lipidomics file: must be in wide format.
- Metadata file: must contain grouping variables (e.g. control vs treatment). Optional: cell counts, protein concentration…

- Internal standard file (optional): improves quantification by correcting for technical variation.

Upload instructions:
- Files need to be in CSV/TSV/XLS/XLSX format. If the uploaded file is an Excel workbook, only the first sheet will be processed.
- Missing values in the sheet may appear in different forms, such as NA, NaN, N/A, an empty string, or just a blank cell.
- Ensure sample IDs match between lipidomics data and metadata/Internal standard.
- Use provided example dataset if you are unsure about format.
- In the preview, lipids should appear on the columns. If samples appear on the columns instead, change the orientation setting when you upload the file.

Lipid Naming guidelines:

LipidAnalyst supports parsing common or trivial names, such as “oleic acid” or “palmitoyl-PC”, and converts them automatically by matching the name against the LIPID MAPS Structure Database (LMSD).
The shorthand nomenclature format is preferred: <class> <chain>:<unsaturation> or <class> <chain>:<unsaturation>/<chain>:<unsaturation> (e.g., FFA 18:1, PC 16:0/18:1, TAG 54:4).
Parentheses are optional — both FFA 18:1 and FFA(18:1) are accepted.
Special prefixes like O- or P- are supported (e.g., PC O-16:0/18:1).
Complex annotations after ; or + will be ignored.
If only one acyl chain is specified for a TG (triacylglycerol) lipid (e.g., TAG54:4-FA20:3), the lipid is parsed by its total carbon number and total unsaturation, not by individual sn-positions. In that case LipidAnalyst still extracts the total composition correctly (TAG 54:4).
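
The shorthand rules above can be illustrated with a short Python sketch. This is not LipidAnalyst's actual parser: the function name parse_shorthand and the regular expressions are assumptions made for this example, and trivial names (which need an LMSD lookup) are not handled.

```python
import re

# Hypothetical helper for illustration only; not LipidAnalyst's parser.
# One chain: optional O-/P- prefix, carbon count, number of double bonds.
CHAIN = re.compile(r"^(?P<prefix>[OP]-)?(?P<carbons>\d+):(?P<unsat>\d+)$")


def parse_shorthand(name):
    """Split a shorthand lipid name into (class, [(prefix, carbons, unsaturation), ...])."""
    # Annotations after ';' or '+' are ignored.
    core = re.split(r"[;+]", name, maxsplit=1)[0].strip()
    # A trailing fatty-acid annotation such as '-FA20:3' is dropped,
    # so only the total composition is kept.
    core = re.sub(r"-FA\d+:\d+$", "", core)
    # Class = leading letters; parentheses around the chains are optional.
    match = re.match(r"^([A-Za-z]+)\s*\(?([^()]*)\)?$", core)
    if match is None:
        raise ValueError(f"cannot parse lipid name: {name!r}")
    lipid_class, chain_text = match.group(1), match.group(2).strip()
    chains = []
    for part in chain_text.split("/"):
        chain = CHAIN.match(part.strip())
        if chain is None:
            raise ValueError(f"cannot parse chain {part!r} in {name!r}")
        chains.append(
            (chain.group("prefix") or "", int(chain.group("carbons")), int(chain.group("unsat")))
        )
    return lipid_class, chains
```

For example, parse_shorthand("PC O-16:0/18:1") yields the class PC with an O-16:0 chain and an 18:1 chain, while parse_shorthand("TAG54:4-FA20:3") keeps only the total composition 54:4.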

3.1.1 Example lipidomics data format

Lipid	Sample1	Sample2	Sample3	Sample4
PC(16:0/18:1)	12345	15678	13456	14321
PE(18:0/20:4)	9876	10234	11012	10098
TAG(54:3)	5634	5890	6234	5901

Note:
- Select "lipid on the row" or "lipid on the column" so that the software can identify the lipids.
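
To make the two orientations concrete, here is a minimal Python sketch that reads the example table above and transposes it when lipids are on the rows. The function name read_wide and its parameters are hypothetical; the app's own import code may differ. The tokens treated as missing follow the upload instructions in section 3.1.

```python
import csv
import io

# Tokens treated as missing values (taken from the upload instructions).
MISSING = {"", "NA", "NaN", "N/A"}


def read_wide(text, lipids_on_rows=True, delimiter="\t"):
    """Return {sample: {lipid: value or None}} from delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if lipids_on_rows:
        # Transpose so that each row is a sample and each column a lipid.
        rows = [list(column) for column in zip(*rows)]
    lipids = rows[0][1:]
    table = {}
    for row in rows[1:]:
        values = [None if cell.strip() in MISSING else float(cell) for cell in row[1:]]
        table[row[0]] = dict(zip(lipids, values))
    return table


# The example lipidomics table from this section (lipids on rows).
EXAMPLE = (
    "Lipid\tSample1\tSample2\tSample3\tSample4\n"
    "PC(16:0/18:1)\t12345\t15678\t13456\t14321\n"
    "PE(18:0/20:4)\t9876\t10234\t11012\t10098\n"
    "TAG(54:3)\t5634\t5890\t6234\t5901\n"
)
table = read_wide(EXAMPLE, lipids_on_rows=True)
```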

3.1.2 Example metadata format

Sample	Group	Tissue
Sample1	CKD	Plasma
Sample2	CKD	Plasma
Sample3	Sham	Plasma
Sample4	Sham	Plasma

Note:
- The metadata must list the sample names in the first column. Ensure that the sample IDs match between the lipidomics data and the metadata/internal standard files.
- The metadata must contain a column that can be defined as the grouping variable.
- Other information such as tissue, cell count, and protein concentration can be included as additional columns. You may use these additional columns to normalize the data as well.
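
Before uploading, sample IDs can be compared with a few lines of Python. This is only a sketch of the check described above; the function name check_sample_ids is hypothetical and not part of the app.

```python
def check_sample_ids(data_ids, metadata_ids):
    """Report sample IDs that appear in one file but not in the other."""
    data_set, meta_set = set(data_ids), set(metadata_ids)
    return {
        "missing_in_metadata": sorted(data_set - meta_set),
        "missing_in_data": sorted(meta_set - data_set),
    }


# Example: 'Sample4' is missing from the metadata, 'Sample5' from the data.
report = check_sample_ids(
    ["Sample1", "Sample2", "Sample3", "Sample4"],
    ["Sample1", "Sample2", "Sample3", "Sample5"],
)
```

Both lists are empty when the files match.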

3.1.3 Example Internal Standard format

Lipid	Sample1	Sample2	Sample3	Sample4
PC(17:0)-d+	1234	1567	1345	1432
PE(17:0)-d+	987	1023	1101	1009
TAG(17:0)-d	563	589	623	590

Note:
- You can select "lipid on the row" or "lipid on the column" so that the software can identify the lipids.
- If there are multiple internal standards for one lipid class, the app shows a warning and lets you select the one to use for internal standard normalization later.

3.2 Preprocessing

3.2.1 Data Filtering

Purpose:
Remove lipids with unreliable measurements to improve downstream analysis.

Options:
- Filter by low-quality (e.g. missing values > threshold)
- Filter by low-abundance (below detection in most samples)
- Filter by low-variance (features that do not change across groups)

💡 Tips:
- Features with 100% missing values will be removed automatically.
- Default thresholds are suggested, but can be customized.
- You can skip filtering if you want to keep lipids that would otherwise be removed.
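
As a rough illustration of the missing-value filter, the sketch below keeps a lipid only if its fraction of missing values is at or below a threshold. The function name and the 0.5 default are assumptions for this example, not the app's actual defaults.

```python
def filter_by_missing(table, max_missing_fraction=0.5):
    """Keep lipids whose missing fraction is <= the threshold.

    table: dict mapping lipid name -> list of values (None = missing).
    """
    kept = {}
    for lipid, values in table.items():
        fraction = sum(value is None for value in values) / len(values)
        # Features that are 100% missing are always removed.
        if fraction < 1.0 and fraction <= max_missing_fraction:
            kept[lipid] = values
    return kept


example = {
    "PC 34:1": [1.0, 2.0, 3.0, 4.0],      # 0% missing, kept
    "PE 38:4": [1.0, None, None, None],    # 75% missing, removed
    "TAG 54:3": [None, None, None, None],  # 100% missing, always removed
}
filtered = filter_by_missing(example)
```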

3.2.2 Missing Value Imputation

Missing values are common in lipidomics datasets and can arise from several sources, including:
- Technical limitations of the mass spectrometer, such as detection thresholds or signal suppression,
- Variability introduced during sample extraction, handling, or instrument runs,
- True biological absence of a lipid in certain samples or conditions.

We categorize missingness into two major types:

Group-level missingness:
A lipid is almost completely missing within one experimental group (e.g., all disease group samples have NA). This often suggests true biological absence or very low abundance.
In these cases, methods such as Limit of Detection (LoD) imputation are generally more appropriate.
General missingness:
Values are sporadically missing across samples but not confined to a single group. This pattern typically reflects technical noise or stochastic signal dropout.
K-Nearest Neighbors (KNN) imputation (sample-wise) is recommended, as it leverages similarity among samples to estimate reasonable values.

Purpose:
Replace missing data points to allow statistical analysis.

Options:
- KNN (k-nearest neighbors) featurewise/samplewise: predicts missing values from similar features/samples.
- LoD 1/5 minimum value: replaces missing with 1/5 of featurewise minimum value.
- Mean/median substitution: simple but may reduce variability.

💡 Tips:
- Use the missing values heatmap to determine the reason for missingness, such as random missingness, values below the detection limit, or systematic missingness.
- KNN is suggested for large datasets.
- Use 1/5 of the minimum value for targeted lipidomics, where missing means below detection.
- Tick the skip check box if there are no missing values.
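
The LoD option can be sketched as follows: each missing value of a lipid is replaced by one fifth of the smallest observed value for that lipid. The function name impute_lod and the divisor parameter are hypothetical names chosen for this illustration.

```python
def impute_lod(values, divisor=5):
    """Replace missing values (None) with min(observed) / divisor for one lipid."""
    observed = [value for value in values if value is not None]
    if not observed:
        raise ValueError("feature has no observed values to impute from")
    fill = min(observed) / divisor
    return [fill if value is None else value for value in values]


# The smallest observed value is 10.0, so missing entries become 2.0.
imputed = impute_lod([10.0, None, 20.0, None])
```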

3.2.3 Combine and Data Integration

Purpose: Deal with duplicated lipids or lipids with different adducts. Combine them by sum, mean, median, max, or min.

Options:
- Combine all duplicated lipids based on clean names in the parse table: merges all duplicated lipids using the same method.
- Lipid class specific merging: choose one lipid class and select the corresponding combining method.

💡 Tip:

- Press the add criteria button to add new rules for merging lipids.
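
A minimal sketch of combining duplicated lipids, assuming rows have already been assigned clean names. The function name combine_duplicates and the data layout are assumptions for this example; the five methods are the ones listed above.

```python
from statistics import mean, median

# The combining methods named in this section.
COMBINERS = {"sum": sum, "mean": mean, "median": median, "max": max, "min": min}


def combine_duplicates(rows, method="sum"):
    """Merge rows that share a clean name, sample by sample.

    rows: list of (clean_name, [value per sample]) tuples.
    """
    combine = COMBINERS[method]
    grouped = {}
    for name, values in rows:
        grouped.setdefault(name, []).append(values)
    # zip(*group) walks through the samples; each column holds the duplicates.
    return {
        name: [combine(column) for column in zip(*group)]
        for name, group in grouped.items()
    }


rows = [
    ("PC 34:1", [1.0, 2.0]),  # e.g. one adduct
    ("PC 34:1", [3.0, 4.0]),  # e.g. another adduct of the same lipid
    ("PE 38:4", [5.0, 6.0]),
]
summed = combine_duplicates(rows, method="sum")
```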

3.2.4 Lipid Parsing

Purpose: Ensure lipid names are correctly interpreted for class-based analysis. Lipid parsing identifies the lipid class and the Sn1 to Sn4 chains from the lipid name. If a lipid name does not explicitly show the lipid structure (such as palmitic acid or oleic acid), LipidAnalyst automatically searches for the structure in the LIPID MAPS® Structure Database (LMSD).

💡 Tips:
The parsing table is editable. Double-click a cell to make changes. If any lipid information is incorrect, you may revise it directly in the table.

3.2.5 Data Preview

Purpose: Review the data before normalization using the lipid composition pie plot, lipid class boxplot, PCA plot, and sample boxplot.

💡 Tip: Hover on the data for more information. The plots are interactive.

3.3 Normalization

3.3.1 Quantification by Internal Standard

Purpose:
Remove technical bias so differences reflect biology.

Options:
- Internal standard normalization (Internal standards must be available)

💡 Tips:
- Internal standards correct for run-to-run variation.
- If no internal standard is available for a lipid class, use an internal standard from another class with similar chemical characteristics.
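
The tutorial does not spell out the formula, so the following is only a sketch of one common form of internal standard normalization: each lipid intensity is divided by the intensity of its class's internal standard in the same sample, optionally multiplied by the known amount of standard added. The function name and the standard_amount parameter are assumptions, not confirmed app behavior.

```python
def normalize_by_internal_standard(intensities, standard, standard_amount=1.0):
    """Scale one lipid by its internal standard, sample by sample.

    intensities: lipid intensities, one per sample.
    standard: internal standard intensities for the same samples, same order.
    standard_amount: assumed known amount of standard spiked into each sample.
    """
    if len(intensities) != len(standard):
        raise ValueError("lipid and internal standard must cover the same samples")
    return [x / s * standard_amount for x, s in zip(intensities, standard)]


# A run-to-run intensity drift that affects lipid and standard equally cancels out.
ratios = normalize_by_internal_standard([100.0, 200.0], [10.0, 20.0])
```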

3.3.2 Quantification by User Defined Factors

Purpose: Adjust for sample amount differences (e.g. cell count, protein concentration).

Options:
- User-defined constant value (e.g. dilution factor, weight, volume)
- Metadata variable from the uploaded metadata file (e.g. cell count, protein concentration)

3.3.3 Normalization and Data Scaling

3.3.3.1 Normalization

Purpose: Make data comparable across samples and lipids.

Options:
- Normalized by sum (samplewise) – Divides each feature by the total abundance in the sample.
- Normalized by median (samplewise) – Divides each feature by the sample median.
- Normalized by mean (samplewise) – Divides each feature by the sample mean.
- Lipid class sum normalization – Normalizes within each lipid class based on the total abundance of that class. If a lipid class contains only one lipid, the original value of that lipid is kept.
- Lipid class median normalization – Uses the median of each lipid class for normalization.
- Lipid class mean normalization – Uses the mean of each lipid class for normalization.
- Quantile normalization – Aligns the distribution of all samples to make them comparable.

💡 Tip: Lipid class–based normalization is particularly useful when focusing on relative changes within specific lipid categories.
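
The difference between samplewise sum normalization and lipid class sum normalization can be sketched for a single sample as follows. Function names and the data layout are hypothetical; the single-lipid rule follows the description above.

```python
def sum_normalize(sample):
    """Divide every lipid by the total abundance of the sample."""
    total = sum(sample.values())
    return {lipid: value / total for lipid, value in sample.items()}


def class_sum_normalize(sample, lipid_class):
    """Divide every lipid by the total abundance of its own lipid class.

    sample: dict lipid -> abundance for one sample.
    lipid_class: dict lipid -> class name.
    """
    totals, members = {}, {}
    for lipid, value in sample.items():
        cls = lipid_class[lipid]
        totals[cls] = totals.get(cls, 0.0) + value
        members[cls] = members.get(cls, 0) + 1
    normalized = {}
    for lipid, value in sample.items():
        cls = lipid_class[lipid]
        # A class with a single lipid keeps its original value.
        normalized[lipid] = value if members[cls] == 1 else value / totals[cls]
    return normalized


sample = {"PC 34:1": 30.0, "PC 36:2": 10.0, "TAG 54:3": 5.0}
classes = {"PC 34:1": "PC", "PC 36:2": "PC", "TAG 54:3": "TAG"}
by_class = class_sum_normalize(sample, classes)
by_sum = sum_normalize(sample)
```

After class sum normalization, the values within a multi-lipid class are fractions that add up to 1.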

3.3.3.2 Data Transformation

After normalization, data transformation helps stabilize variance and reduce skewness.

Available data transformation methods:

Logit transformation – Suggested only after lipid class sum normalization, because it requires values strictly between 0 and 1; values outside that range produce NA.
Log transformation – Reduces skewness and stabilizes variance. Log base is selectable.
Cube root transformation – Reduces the impact of extreme values.
Square root transformation – A milder alternative to log transformation.
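
For reference, the four transformations can be written as small Python functions. This is an illustrative sketch, not the app's code; returning None for out-of-range logit input stands in for the NA values mentioned above, and the default log base of 2 is an assumption.

```python
import math


def logit(p):
    """log(p / (1 - p)); defined only for 0 < p < 1, otherwise None (NA)."""
    if not 0.0 < p < 1.0:
        return None
    return math.log(p / (1.0 - p))


def log_transform(x, base=2.0):
    """Logarithm with a selectable base; requires x > 0."""
    return math.log(x, base)


def cube_root(x):
    """Real cube root, also valid for negative input."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def square_root(x):
    """Square root; requires x >= 0."""
    return math.sqrt(x)
```

A proportion of 0.5 maps to a logit of 0, while a raw intensity such as 12345 is outside (0, 1) and has no logit, which is why the logit is suggested only after lipid class sum normalization.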

3.3.3.3 Scaling

After normalization, scaling helps prepare the data for downstream analyses such as PCA or clustering.

Available scaling methods:

Mean centering – Centers variables around zero.
Auto scaling (Z-score) – Mean-centers and scales by standard deviation.
Pareto scaling – Mean-centers and scales by the square root of the standard deviation.
Range scaling – Scales data to a fixed range (e.g., 0–1).
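
The scaling methods differ only in the divisor applied to each lipid. The sketch below assumes the sample standard deviation (n - 1 denominator) and interprets range scaling as min-max scaling to 0–1; both are assumptions, and the app may use different conventions.

```python
import math
from statistics import mean, stdev


def scale(values, method="auto"):
    """Scale the values of one lipid across samples."""
    center = mean(values)
    if method == "center":   # mean centering
        return [v - center for v in values]
    if method == "auto":     # z-score
        sd = stdev(values)
        return [(v - center) / sd for v in values]
    if method == "pareto":   # divide by the square root of the standard deviation
        root_sd = math.sqrt(stdev(values))
        return [(v - center) / root_sd for v in values]
    if method == "range":    # min-max scaling to the interval 0-1
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]
    raise ValueError(f"unknown scaling method: {method!r}")


values = [2.0, 4.0, 6.0]  # mean 4, sample standard deviation 2
```

Pareto scaling shrinks large-variance lipids less aggressively than auto scaling, so it keeps more of the original data structure.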

3.4 Visualization

Purpose:
Summarize and communicate findings effectively.

3.4.1 Visualization Tool

LipidAnalyst offers a variety of interactive visualization options:

Global Distribution Boxplot - View overall lipid abundance patterns.
- Boxplot for all samples
- Boxplot for lipid classes
- Boxplot for all lipids (recommended only when there are fewer than 80 lipids)
Principal Component Analysis (PCA) – Explore global variation and sample clustering.
- 3D or 2D options available.
- Scaled or unscaled data available.
Differential Mean Lipid Heatmap – Visualize overall lipid expression patterns.
Class Level Lipid Comparison - Compare lipid class abundance across groups. Total double bond and total carbon number can be controlled for refined analysis.
Individual Lipid Comparison - Examine specific lipid changes.
Volcano plots – Identify significantly changed lipids by fold change and p value.

💡 Tips:
- You can view the Differential Mean Lipid Heatmap to generate hypotheses, and then use the class level or individual lipid comparison to examine differences between groups.
- P-value statistics are available in the class level and individual lipid comparisons.
- Export figures as publication-ready images using the camera icon at the upper right corner of the plot.
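
A volcano plot places each lipid at (log2 fold change, -log10 p value). The sketch below computes these two coordinates for one lipid; it assumes positive, non-log-transformed abundances and a p value obtained elsewhere, and the function name volcano_point is hypothetical.

```python
import math
from statistics import mean


def volcano_point(group_a, group_b, p_value):
    """Return (log2 fold change of B over A, -log10 p) for one lipid."""
    log2_fold_change = math.log2(mean(group_b) / mean(group_a))
    return log2_fold_change, -math.log10(p_value)


# A lipid that is four times higher in group B, with p = 0.01.
x, y = volcano_point([1.0, 1.0], [4.0, 4.0], 0.01)
```

Lipids far to the left or right and high on the plot are those with both a large fold change and a small p value.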

3.5 Statistical Analysis

Purpose:
Identify lipids that differ significantly between groups.

Options:
- Lipidomics Mean Calculator: calculates the mean abundance for each group.
- t-test (2-group comparison): Welch’s t-test (unequal variances allowed) or Student’s t-test (equal variances assumed).
- ANOVA (multiple groups)
- Correlation: reveals relationships between lipids. A correlation heatmap is available.
- DSPC network: DSPC (Debiased Sparse Partial Correlation) is a statistical framework designed to infer sparse molecular networks by estimating partial correlations with a debiased (de-sparsified) graphical lasso.

💡 Tips:
- Always check assumptions (normality, equal variance).
- Adjust for multiple testing (e.g. FDR).
- You can review the PCA plots using only the significant features.
- We suggest importing the DSPC table into MetScape 3 (with Cytoscape) for further refinement of the network.
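
Two of the ideas above, Welch's t-test and FDR adjustment, can be sketched in plain Python. The code computes the Welch t statistic with its Welch–Satterthwaite degrees of freedom, and Benjamini-Hochberg adjusted p values; converting t to a p value needs a t-distribution and is left out. Function names are hypothetical, and whether the app uses Benjamini-Hochberg specifically is an assumption.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two groups."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

With hundreds of lipids tested at once, the adjusted values are the ones to compare against the significance threshold.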

PLS-DA:
PLS-DA (Partial Least Squares Discriminant Analysis) is a supervised multivariate statistical method that identifies features that best discriminate between predefined groups. This technique is particularly useful for biomarker discovery, as it highlights the most relevant features contributing to group separation while handling complex, high-dimensional datasets typical in lipidomics studies.

In the model setting, you can set the number of permutations and number of cross validations.
In the model output summary, you can check the model performance by looking at the R2Y and Q2 values. R2Y indicates how well the model fits the data, while Q2 reflects the model’s predictive ability. Generally, a good PLS-DA model has high R2Y and Q2 values. Q2 is normally lower than R2Y, and a large gap between the two suggests overfitting.

Parameters interpretation:

Metric	Description	What It Evaluates	Interpretation Guideline
R2X(cum)	Cumulative proportion of variance in X explained by the model	How well predictors are summarized	Higher values indicate better representation of X-space
R2Y(cum)	Cumulative proportion of variance in Y explained by the model	Goodness of fit to class labels	>0.5 generally indicates moderate-to-good fit
Q2(cum)	Cross-validated predictive ability	Model generalizability	>0.3 acceptable, >0.5 good
RMSEE	Root Mean Square Error of Estimation (training error)	Average fitting error	Lower values indicate better fit
pre	Number of predictive components	Model complexity	Small numbers reduce overfitting
ort	Number of orthogonal components	X-variation unrelated to Y	Helps separate noise from predictive signal
pR2Y	Permutation-test p-value for R2Y	Significance of model fit	p < 0.05 indicates non-random fit
pQ2	Permutation-test p-value for Q2	Significance of predictive ability	p < 0.05 indicates non-random prediction
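
For orientation, R2Y and Q2 are commonly defined as below. These are the standard textbook definitions, given here as an aid to reading the table; the symbols are introduced for this note, and the app's exact computation (for example its cross-validation scheme) is not specified in this tutorial.

```latex
% y_i: observed response (class label coded numerically), \bar{y}: its mean
% \hat{y}_i: fitted value from the model trained on all samples
% \hat{y}_{(i)}: prediction for sample i from a model trained without sample i's fold
R^2Y = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_{(i)})^2}{\sum_i (y_i - \bar{y})^2}
```

Because Q2 uses predictions for held-out samples, it is typically lower than R2Y and can even be negative for a model with no predictive ability.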

Permutation plots are used to assess the statistical significance of the PLS-DA model by comparing the original model’s performance metrics (R2Y and Q2) against those obtained from models built on permuted class labels. A significant model will show that the original R2Y and Q2 values are substantially higher than those from the permuted models, indicating that the observed separation is not due to random chance.
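
The permutation p-value behind pR2Y and pQ2 can be sketched as the share of permuted models that perform at least as well as the original one. The version below adds 1 to numerator and denominator, a common convention that avoids a p-value of exactly zero; this convention and the function name are assumptions, and the app may compute the value differently.

```python
def permutation_p_value(observed, permuted):
    """Fraction of permuted metrics that reach or exceed the observed metric."""
    exceed = sum(1 for value in permuted if value >= observed)
    return (exceed + 1) / (len(permuted) + 1)


# Observed Q2 of 0.8; one of four permuted models does at least as well.
p = permutation_p_value(0.8, [0.1, 0.2, 0.9, 0.3])
```

With this convention the smallest reachable p-value is 1 / (number of permutations + 1), so more permutations are needed to resolve small p-values.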
VIP plots (Variable Importance in Projection) highlight the features that contribute most to the model’s ability to discriminate between groups. Features with VIP scores greater than 1 are generally considered important for group separation and may be potential biomarkers.

OPLS-DA:
Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) is a supervised multivariate statistical method used to identify differences between predefined groups in complex datasets. OPLS-DA explicitly separates the predictive variation (related to group separation) from the orthogonal variation (unrelated to group separation), enhancing the interpretability of the model. This method is particularly useful in lipidomics for identifying potential biomarkers and understanding underlying biological differences between conditions.
- Score plot:
  The score plot visualizes samples in a reduced latent variable space, showing how well the model separates predefined groups based on systematic variation in the predictor matrix.
- Outlier plot:
  The outlier plot (e.g., Hotelling’s T² vs. DModX) identifies samples that fall outside the model’s confidence limits, indicating observations with extreme leverage or poor model fit that may unduly influence the OPLS-DA model.

Random Forest:
Random Forest is an ensemble machine learning technique that builds multiple decision trees using bootstrap sampling and random feature selection.