Given the heterogeneity in the clinical behavior of cancer patients with identical histopathological diagnosis, the search for unrecognized molecular subtypes, subtype-specific markers and the evaluation of their clinical-biological relevance are a necessity. This task is benefiting today from the high-throughput genomic technologies and free access to the datasets generated by the international genomic projects and the repositories of information. Machine learning strategies have proven to be useful in the identification of hidden trends in large datasets, contributing to the understanding of the molecular mechanisms and subtyping of cancer. However, the translation of new molecular subclasses and biomarkers into clinical settings requires their analytic validation and clinical trials to determine their clinical utility. Here, we provide an overview of the workflow to identify and confirm cancer subtypes, summarize a variety of methodological principles, and highlight representative studies. The generation of public big data on the most common malignancies is turning the molecular pathology into a database-driven discipline.