Chemistry  |  Biochemistry  |  Medicine


Tim Stadler, 2004 | Riehen, BS


Big data has become an important aspect of many research areas. Researchers in fields like biology or economics often face the challenge of finding a weak signal (“needle”) in a vast and often noisy dataset (“haystack”), and the question of how best to do so. One approach is subsampling, which reduces the data to a smaller but still representative subset while keeping or even enriching the signal of interest. In my work I focused on single-cell transcriptomics datasets and tried to find the best subsampling method for identifying rare cell types (the “needle”). I compared the performance and efficiency of three methods (Geosketch, SCSampler and Maxdissim) using measures such as the Hausdorff distance and the relative entropy. My results show that SCSampler created the best subsets on both simulated and real datasets. However, since my results indicate a strong dependency on the studied dataset, further research will be needed to improve our understanding of this topic.


Which subsampling method for single-cell transcriptomics data best reduces frequent cell types while preserving rare cell types in unlabeled data?


I performed all analyses in the R programming language using R packages. I used both experimental data and simulated data created with the “splatter” package, and applied three subsampling methods (Geosketch, SCSampler and Maxdissim) to create subsamples for the comparison. I chose relative entropy (Kullback-Leibler divergence) to describe the cell type composition of the subsamples, and the Hausdorff distance to describe how well the subsamples represent the variability in the complete data. Furthermore, I visualized the subsampled data using dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
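The two measures can be illustrated with a minimal sketch. The original analysis was done in R; the Python below is for illustration only and is not the code used in the project. The function names, the toy data, and the choice of a uniform reference distribution for the relative entropy are all assumptions, since the abstract does not state which reference distribution was used.

```python
import math
from collections import Counter


def type_proportions(labels, categories):
    """Fraction of cells belonging to each category, in the order given."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) in nats for two aligned discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P puts mass where Q has none
            total += pi * math.log(pi / qi)
    return total


def hausdorff_to_subset(full, subset):
    """Hausdorff distance between a dataset and a subset of it.

    Because every subset point is also in the full data, only one direction
    matters: the largest distance from any point to its nearest subset point.
    """
    return max(min(math.dist(x, s) for s in subset) for x in full)


# Toy example (hypothetical data): 8 frequent cells and 2 rare cells.
categories = ["frequent", "rare"]
full_labels = ["frequent"] * 8 + ["rare"] * 2
sub_labels = ["frequent", "frequent", "rare", "rare"]

# Assumption: a uniform composition is used as the reference distribution.
uniform = [0.5, 0.5]
kl_full = kl_divergence(type_proportions(full_labels, categories), uniform)
kl_sub = kl_divergence(type_proportions(sub_labels, categories), uniform)

# Toy coordinates: the subset misses the point at (1, 0) by a distance of 1.
full_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
sub_points = [(0.0, 0.0), (4.0, 0.0)]
h = hausdorff_to_subset(full_points, sub_points)
```

Under the uniform-reference assumption, a lower relative entropy means a more balanced cell type composition, and a lower Hausdorff distance means that no region of the complete data is far from the subsample.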


As expected, all three tested subsampling methods performed consistently better than random (uniform) subsampling. The analyses of relative entropy and Hausdorff distance furthermore indicate that SCSampler performed best, on both real and simulated data. In terms of computational efficiency, Maxdissim was the slowest method and did not scale well to larger datasets and subsample sizes.


The Hausdorff distance results need to be interpreted with care, because two of the methods (SCSampler and Geosketch) directly or indirectly optimize the Hausdorff distance. Since these two methods also perform well in the analysis based on relative entropy, my results are still meaningful and indicate that the methods could be useful in practice.
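To show why a method that targets the Hausdorff distance will naturally score well on it, here is a sketch of greedy farthest-point selection, a generic textbook heuristic. It is not the implementation of SCSampler, Geosketch, or Maxdissim. The function name, parameters, and toy data are hypothetical, and Python is used purely for illustration.

```python
import math


def farthest_point_sample(points, k, start=0):
    """Greedy farthest-first selection of k point indices.

    Each step adds the point that is currently farthest from the selection,
    which is exactly the point that defines the Hausdorff distance at that step.
    Returns the chosen indices and the resulting Hausdorff distance.
    """
    k = min(k, len(points))
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return chosen, max(nearest)


# Toy data (hypothetical): three tightly packed points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
chosen, hausdorff = farthest_point_sample(points, k=2)
```

In this toy case the distant point is selected in the second step, so the isolated "rare" observation is kept while the densely packed ones are thinned out. This is why an independent measure such as relative entropy is useful as a cross-check.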


I hope that my work will encourage the use of subsampling as an approach to reduce large datasets while preserving weak signals. I think that the direct comparison of these methods in the context of single-cell transcriptomics data is new and helpful because it compares methods from different research groups without giving more weight to one particular method. To better understand how subsampling performs on different single-cell datasets, and whether it generalizes to data from sources beyond biology, further investigations will be needed.



Expert's Appraisal

Reto Gerber

In his project, Tim Stadler takes a close look at several methods for subsampling single-cell transcriptomics data. With datasets growing ever larger, the topic is relevant and the subject of current research. The work stands out for its clear research question and its scientific approach. The experimental part, consisting of a simulation study and two case studies, is written in the R programming language, which Tim Stadler learned for this project.






Wirtschaftsgymnasium / Wirtschaftsmittelschule, Basel
Teacher: Martin Bumann