Data Sources - PSDI What We Provide
Data Sources
PSDI provides access to many databases and repositories of physical sciences data. PSDI and our partners also provide a range of other data sources of interest to the physical science community that can be accessed through the links on this page.
You can narrow down the list of data sources displayed on this page by selecting from the filters on the left. Click on a data source of interest to find out more about it, access its resource landing page, and, where it belongs to a Resource Theme, access related resources in the same theme.
Cambridge Structural Database (CSD)
The Cambridge Structural Database (CSD), maintained by the Cambridge Crystallographic Data Centre (CCDC), is the world's largest curated collection of experimentally determined crystal structures, containing more than 1.3 million accurate 3D structures derived from X-ray, neutron, and electron diffraction analyses. It is widely used across research and industry, including pharmaceuticals, agrochemicals, and fine chemicals, to support structure prediction, materials design, and data-driven discovery. Within PSDI, access to the CSD is available to authenticated users from the UK academic community. Users can search the CSD through the Find a Substance option in the PSDI Cross Data Search service or via the dedicated webCSD interface. CSD software tools are also available through PSDI's Remote Desktop Access. Use the Visit or resource landing page link to search the CSD via PSDI Cross Data Search, and see Linked Resources for links to webCSD and CSD software. Find out more using the links in the Further Information section.
Chemical Availability Search (ChASe)
The Chemical Availability Search (ChASe) data source enables users to search and compare commercially available chemicals using the PSDI Cross Data Search service. Enter product or vendor keywords, CAS number, InChI or SMILES to search pricing and supplier information for over 250,000 unique chemicals from UK suppliers. Search results include product name, supplier name, quantity, purity, prices, CAS number, and catalogue number with a link to the supplier's website.
Check the supplier's website for the most up-to-date product information.
Propersea (Property Prediction)
Propersea contains calculated predictions for a wide range of molecular and physicochemical properties of small molecules, such as melting and boiling points, density, solubility, and polarizability. It employs a range of methods, including RDKit, semi-empirical quantum methods, Bayesian regression trees, and transformer neural networks. Propersea also contains predicted IUPAC names generated using a machine learning model. Propersea can be searched through the PSDI Cross Data Search service by entering an InChI or SMILES string or by drawing a molecule. Results include predicted values, confidence intervals, and reliability scores for each prediction.
AI Ready Datasets
AI-ready datasets give physical sciences researchers data they can trust and reuse. These datasets are hosted as a PSDI Community Data Collection and are designed to support AI and machine learning workflows. They range from general-purpose collections that researchers can shape for their own models to task-specific datasets that already include annotations and predefined training, validation, and test splits. Every record includes one or more data files accompanied by a Croissant-format metadata description (https://mlcommons.org/working-groups/data/croissant/), ensuring that structure, provenance, and context are captured in a machine-readable way.
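As a rough illustration of what a machine-readable description makes possible, the sketch below reads a Croissant description with Python's standard library and lists the files it declares. The embedded JSON is a made-up, heavily trimmed example and not a real PSDI record. The keys shown (`distribution`, `contentUrl`, `encodingFormat`) follow the Croissant vocabulary as we understand it, and the function name `list_files` is purely illustrative.

```python
import json

# Hypothetical, heavily trimmed Croissant description (JSON-LD).
# Real records carry an @context and many more properties.
EXAMPLE_CROISSANT = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "distribution": [
    {"@type": "cr:FileObject", "name": "train.csv",
     "contentUrl": "data/train.csv", "encodingFormat": "text/csv"},
    {"@type": "cr:FileObject", "name": "test.csv",
     "contentUrl": "data/test.csv", "encodingFormat": "text/csv"}
  ]
}
"""


def list_files(croissant_text):
    """Return (name, contentUrl, encodingFormat) for each declared file object."""
    metadata = json.loads(croissant_text)
    files = []
    for item in metadata.get("distribution", []):
        # Only plain file objects are listed; other distribution types are skipped.
        if item.get("@type") == "cr:FileObject":
            files.append(
                (item.get("name"), item.get("contentUrl"), item.get("encodingFormat"))
            )
    return files


if __name__ == "__main__":
    for name, url, fmt in list_files(EXAMPLE_CROISSANT):
        print(f"{name}: {url} ({fmt})")
```

For real records, a dedicated Croissant library may be a better fit than hand-parsing the JSON, since it can also validate the description and load the records.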
BioSimDB (Biomolecular Simulations Database)
Topology, trajectory, and AiiDA data provenance files from molecular dynamics simulations of biomolecules are stored in a repository specifically designed for their preservation and accessibility.
Catalysis Data Infrastructure (CDI) Data Objects
The Catalysis Data Infrastructure (CDI) Data Objects resource provides access to the UK Catalysis Hub (UKCH) published data objects that support research outputs. Through the CDI Portal and the PSDI Cross Data Search, users can search, explore, and reuse these data resources.
In the Catalysis Data Infrastructure (CDI) knowledge graph, data objects are linked with researchers, institutions, publications, and research themes, fostering data discovery, highlighting collaborations, and supporting analysis of data-driven research. CDI Data Objects can be extended to include domain-specific data resources such as chemical entities, materials, or analytical methods.
Catalysis Data Infrastructure (CDI) Publications
The Catalysis Data Infrastructure (CDI) Publications resource provides access to the UK Catalysis Hub (UKCH) Publications, including peer-reviewed articles, books, and theses. CDI Publications offer a dedicated entry point into the CDI knowledge graph through the CDI Portal and the PSDI Cross Data Search to explore and analyse research publications.
Within CDI, publications are linked to researchers (authors), institutions, research themes, and associated data objects. The linking enhances publication discovery, displays research relationships, and supports insight into research activity. CDI Publications can be expanded to include domain-specific indexes about chemicals, materials, and analysis methods.
Collaborative Computational Project for NMR Crystallography (CCP-NC) Magres Database
The Collaborative Computational Project for NMR Crystallography (CCP-NC) database contains calculated solid-state NMR data from DFT codes in MAGRES format.
Critical Micelle Concentration (CMC) Data Collection
This dataset is derived from the National Institute of Standards and Technology (NIST) publication, "Critical Micelle Concentrations of Aqueous Surfactant Systems" (http://doi.org/10.6028/NBS.NSRDS.36). The data was extracted using the AI-driven Data Revival service (http://www.data-revival.com/), then manually enhanced by adding SMILES strings, and finally reprocessed by Data Revival to be fully converted into a machine-readable format for improved accessibility and interoperability.
Crystallography Open Database
Open-access collection of crystal structures of organic, inorganic, metal-organic compounds and minerals, excluding biopolymers
Data to Knowledge Community Data Collections
Data-to-Knowledge Collections provide curated datasets specifically designed for use in machine learning or generated through machine learning processes. An example is the Machine Learning Interatomic Potentials (MLIPs) data collection, which includes MLIP XYZ files used for training, the trained MLIP models, and, where available, additional metadata such as AiiDA provenance information. By making these datasets accessible, the collections allow researchers without the resources to generate such data themselves to use them for machine learning. Additionally, the MLIP models can be directly applied in modeling tasks, enabling broader exploration and advancements in research.
Example Galaxy XAFS RO-Crate Workflows
The Reproducible XAFS Analyses Zenodo Community provides the published results of X-ray absorption fine structure (XAFS) analyses reproduced using X-ray absorption spectroscopy (XAS) Galaxy tools developed in collaboration with PSDI, and use cases and data from the UK Catalysis Hub. The results shared in the Reproducible XAFS Analyses Zenodo Community are in the form of FAIR Data Objects in RO-Crate format and provide examples relevant to the catalysis domain.
Physical Chemistry Properties Data Collection
This collection of datasets focuses on solubility-related information, including boiling points, Henry's law constants, LogS, melting points, and mole fractions. The Physical Chemistry Properties Data Collection provides carefully curated and standardized data on key chemical properties, empowering researchers to effectively model, analyze, and drive innovation in the field of physical chemistry. This dataset collection is available for download from a GitHub repository as zipped CSV files.
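Once an archive has been downloaded, a zipped CSV file can be read without unpacking it to disk. The following is a minimal sketch using only the Python standard library. The file name `melting_points.csv`, the column names, and the values are placeholders chosen for illustration, not the actual layout of the collection.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member_name):
    """Read one CSV member of a zip archive into a list of dict rows."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so the csv module receives text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_demo_zip():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical file name and columns; the values are placeholders.
        archive.writestr("melting_points.csv", "smiles,value\nCCO,1.0\nO,2.0\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    rows = read_csv_from_zip(make_demo_zip(), "melting_points.csv")
    print(rows)
```

`read_csv_from_zip` also accepts a path to a downloaded archive in place of the in-memory buffer. All values are returned as strings, so numeric columns need converting before use.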
Physical Chemistry Properties Data Sets (PChProp)
These datasets focus on solubility-related information, including boiling points, Henry's law constants, LogS, melting points, and mole fractions. The Physical Chemistry Properties Data Collection provides carefully curated and standardized data on key chemical properties, empowering researchers to effectively model, analyze, and drive innovation in the field of physical chemistry. These datasets are available via PSDI Cross Data Search (https://data-search.psdi.ac.uk/find-a-substance/molecule?db=psdi.pchprop) and OPTIMADE (https://www.optimade.org/) API endpoints (https://pchprop-optimade.psdi.ac.uk/).
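OPTIMADE endpoints are queried over HTTP with a `filter` parameter. The sketch below only builds such a query URL and makes no network request. It assumes that the PChProp endpoint serves the standard `structures` entry type under the `/v1/` path prefix, which is the usual OPTIMADE layout but is not confirmed here. The function name and the example filter are illustrative.

```python
from urllib.parse import urlencode, urljoin

# Base URL taken from the description above.
BASE_URL = "https://pchprop-optimade.psdi.ac.uk/"


def build_structures_query(base_url, optimade_filter, page_limit=10):
    """Return a URL that lists 'structures' entries matching an OPTIMADE filter."""
    # Assumption: the endpoint exposes the versioned path 'v1/structures'.
    endpoint = urljoin(base_url, "v1/structures")
    query = urlencode({"filter": optimade_filter, "page_limit": page_limit})
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    # Example filter in the OPTIMADE filter language: entries containing carbon.
    print(build_structures_query(BASE_URL, 'elements HAS "C"', page_limit=5))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. The response is JSON, with the matching entries under its `data` key.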
Project M: Calcium Carbonate Diffraction Datasets With Fit and Parameter Data
Project M (https://www.diamond.ac.uk/ProjectM/) was a citizen science initiative run with UK secondary schools that generated an open collection of more than 650 calcium carbonate diffraction datasets, crystallised under a wide range of additives and conditions, with all associated fits and parameters included. These records are hosted as a PSDI Community Data Collection and contain the raw X-ray powder diffraction data from Diamond Light Source, accompanied by the essential experimental metadata.
SimpNMR_DB (a Curated Database of SimpNMR Inputs)
SimpNMR_DB is a shared collection of ab initio calculations of hyperfine interactions that are used as inputs for SimpNMR. SimpNMR is an open-source Python package for analysing paramagnetic NMR data that links Density Functional Theory (DFT) calculations with NMR spectra, helping to assign peaks and extract the magnetic susceptibility tensor from paramagnetic shifts. This ab initio data is useful to the community computing magnetic properties of molecules, because such properties are difficult to predict accurately. The database provides access to computational results that are often time-consuming to generate, giving researchers a consistent reference point for comparison, method testing, and reproducibility. As the database grows, it will also support benchmarking studies and the development of machine learning models. We therefore encourage researchers to share their computational files in SimpNMR_DB to benefit the wider community.
Note that this repository is currently under rapid and active development.
TEM-ParticlesDB: Next-Generation Transmission Electron Microscopy (TEM) images
TEM-ParticlesDB is an open repository of Transmission Electron Microscopy (TEM) nanoparticle images and associated segmentation data that have been automatically segmented using machine learning to identify particle size, shape and morphology across a range of substrates. Each entry includes the original experimental TEM images, the synthetic training data used to develop the segmentation models, and the resulting segmented particle outlines. The synthetic data is generated using the TEMPOS (Transmission Electron Microscopy Particle Outline Segmentation) software and used to train segmentation models via transfer learning, before being applied to the experimental images.
2DMatpedia
2DMatpedia, an open computational database of two-dimensional materials from top-down and bottom-up approaches
AFLOW
Description of provider: Automatic FLOW (AFLOW) database for computational materials science. Description of dataset: The AFLOW OPTIMADE endpoint
Alexandria: alexandria-pbe
Description of provider: A collection of databases from the group of Prof Miguel A. L. Marques at Ruhr University Bochum. Description of dataset: A dataset of 2.5m+ stable and metastable materials calculated with the PBE functional
Alexandria: alexandria-pbesol
Description of provider: A collection of databases from the group of Prof Miguel A. L. Marques at Ruhr University Bochum. Description of dataset: A new dataset of 415k stable and metastable materials calculated with the PBEsol and SCAN functionals
Chemotion Repository
The Chemotion Repository is an open repository of well-structured and machine-readable chemical data created by the Karlsruhe Institute of Technology. The focus of the repository is synthetic and analytical chemistry data. It includes data pertaining to reactions, molecules and physical samples. Data in the repository is uploaded by the community, and is subject to peer review.
Computational Materials Repository (CMR)
CMR is a collection of materials repositories from different projects, such as C2DB, QPOD and many more.
Material Properties Open Database (MPOD)
Description of provider: The Material Properties Open Database (MPOD) is a web-based, open access repository of quantitative information about the physical properties of crystalline materials. MPOD is oriented to design engineers, scientists, science teachers and students. Properties are generally treated as tensor magnitudes. In MPOD the compact matrix notation is applied. To bring an intuitive view of tensor properties, so-called longitudinal properties surfaces are displayed. 3D printing of properties surfaces is implemented via creation of STL files. A dictionary of properties definitions is included. Eventually, comments are added. Syntax and notation in MPOD files are oriented towards matching IUCr standards and so try to comply with CIF format. Description of dataset: MPOD is a web-based, open access repository of quantitative information about the physical properties of crystalline materials.
Material-Property-Descriptor Database
Material-Property-Descriptor Database (MPDD) of atomic structures. Optimized for the high-throughput deployment of material featurizers and ML models. Maintained by Phases Research Lab (phaseslab.org) at The Pennsylvania State University