torchdrug.datasets#

Knowledge Graph Datasets#

FB15k#

class FB15k(path, verbose=1)[source]#

Subset of Freebase knowledge base for knowledge graph reasoning.

Statistics:
  • #Entity: 14,951

  • #Relation: 1,345

  • #Triplet: 592,213

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level
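
The three statistics above can be derived from a plain list of triplets. The sketch below is only an illustration in pure Python: the `(head, tail, relation)` tuple order, the `triplet_statistics` helper, and the toy data are assumptions made for this example, not part of the dataset class.

```python
def triplet_statistics(triplets):
    """Return (#entity, #relation, #triplet) for (head, tail, relation) tuples."""
    entities = set()
    relations = set()
    for head, tail, relation in triplets:
        # Both ends of a triplet count as entities.
        entities.add(head)
        entities.add(tail)
        relations.add(relation)
    return len(entities), len(relations), len(triplets)


# Hypothetical toy knowledge graph, used only to exercise the helper.
toy_triplets = [
    ("paris", "france", "capital_of"),
    ("berlin", "germany", "capital_of"),
    ("france", "europe", "located_in"),
]

print(triplet_statistics(toy_triplets))  # (5, 2, 3)
```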

FB15k237#

class FB15k237(path, verbose=1)[source]#

A filtered version of the FB15k dataset without trivial cases.

Statistics:
  • #Entity: 14,541

  • #Relation: 237

  • #Triplet: 310,116

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level
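
One common reading of "trivial cases" is a pair of relations that are near inverses of each other, so that a test triplet can be answered by flipping a training triplet. The sketch below shows how such pairs could be detected. It is an assumption-laden illustration, not the procedure used to build this dataset: the `find_inverse_pairs` helper, the `(head, tail, relation)` tuple order, and the `threshold` value are all hypothetical.

```python
from collections import defaultdict


def find_inverse_pairs(triplets, threshold=0.8):
    """Return relation pairs (r1, r2) where r2 mostly mirrors r1 with head and tail swapped."""
    pairs_by_relation = defaultdict(set)
    for head, tail, relation in triplets:
        pairs_by_relation[relation].add((head, tail))

    inverse_pairs = []
    relations = sorted(pairs_by_relation)
    for i, r1 in enumerate(relations):
        for r2 in relations[i + 1:]:
            # Swap head and tail of r2 and measure the overlap with r1.
            reversed_r2 = {(t, h) for h, t in pairs_by_relation[r2]}
            overlap = len(pairs_by_relation[r1] & reversed_r2)
            smaller = min(len(pairs_by_relation[r1]), len(pairs_by_relation[r2]))
            if overlap / smaller >= threshold:
                inverse_pairs.append((r1, r2))
    return inverse_pairs


# Hypothetical toy graph: parent_of and child_of are exact inverses.
toy = [
    ("a", "b", "parent_of"), ("b", "a", "child_of"),
    ("c", "d", "parent_of"), ("d", "c", "child_of"),
    ("a", "c", "sibling_of"),
]

print(find_inverse_pairs(toy))  # [('child_of', 'parent_of')]
```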

WN18#

class WN18(path, verbose=1)[source]#

WordNet knowledge base.

Statistics:
  • #Entity: 40,943

  • #Relation: 18

  • #Triplet: 151,442

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

WN18RR#

class WN18RR(path, verbose=1)[source]#

A filtered version of the WN18 dataset without trivial cases.

Statistics:
  • #Entity: 40,943

  • #Relation: 11

  • #Triplet: 93,003

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

Hetionet#

class Hetionet(path, verbose=1)[source]#

Hetionet for knowledge graph reasoning.

Statistics:
  • #Entity: 45,158

  • #Relation: 24

  • #Triplet: 2,025,177

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

Molecule Property Prediction Datasets#

BACE#

class BACE(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Binary binding results for a set of inhibitors of human \(\beta\)-secretase 1 (BACE-1).

Statistics:
  • #Molecule: 1,513

  • #Classification task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.
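
The trade-off described for `lazy` can be pictured with a small stand-alone sketch. This is not TorchDrug's implementation: `ToyDataset` and `count_letters` are hypothetical names, and counting letters in a SMILES string merely stands in for the real molecule processing step.

```python
class ToyDataset:
    """Dataset that processes items at construction (eager) or on access (lazy)."""

    def __init__(self, raw_items, process, lazy=False):
        self.raw_items = list(raw_items)
        self.process = process
        self.lazy = lazy
        # Eager mode pays the processing cost and memory up front.
        self.processed = None if lazy else [process(x) for x in self.raw_items]

    def __len__(self):
        return len(self.raw_items)

    def __getitem__(self, index):
        if self.lazy:
            # Lazy mode processes each item when it is requested.
            return self.process(self.raw_items[index])
        return self.processed[index]


calls = []  # records every processing call so the difference is visible


def count_letters(smiles):
    """Stand-in for molecule processing: count alphabetic characters."""
    calls.append(smiles)
    return sum(ch.isalpha() for ch in smiles)


eager = ToyDataset(["CCO", "c1ccccc1"], count_letters)
calls_after_eager = len(calls)   # both items processed at construction

lazy = ToyDataset(["CCO", "c1ccccc1"], count_letters, lazy=True)
calls_after_lazy = len(calls)    # nothing new processed yet

first = lazy[0]                  # processing happens here
calls_after_access = len(calls)
```

In this sketch the lazy dataset is created without any processing, while every access repeats the work, which mirrors the faster loading and slower iteration described above.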

BBBP#

class BBBP(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Binary labels of blood-brain barrier penetration.

Statistics:
  • #Molecule: 2,039

  • #Classification task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

CEP#

class CEP(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Photovoltaic efficiency estimated by the Harvard Clean Energy Project.

Statistics:
  • #Molecule: 20,000

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

ChEMBLFiltered#

class ChEMBLFiltered(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Statistics:
  • #Molecule: 430,710

  • #Regression task: 1,310

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

ClinTox#

class ClinTox(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.

Statistics:
  • #Molecule: 1,478

  • #Classification task: 2

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

Delaney#

class Delaney(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Log-scale water solubility of molecules.

Statistics:
  • #Molecule: 1,128

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

FreeSolv#

class FreeSolv(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Experimental and calculated hydration free energy of small molecules in water.

Statistics:
  • #Molecule: 642

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

HIV#

class HIV(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Experimentally measured abilities to inhibit HIV replication.

Statistics:
  • #Molecule: 41,127

  • #Classification task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

Lipophilicity#

class Lipophilicity(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Experimental results of octanol/water distribution coefficient (logD at pH 7.4).

Statistics:
  • #Molecule: 4,200

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

MUV#

class MUV(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Subset of PubChem BioAssay selected by applying a refined nearest neighbor analysis.

Statistics:
  • #Molecule: 93,087

  • #Classification task: 17

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

Malaria#

class Malaria(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Half-maximal effective concentration (EC50) against a parasite that causes malaria.

Statistics:
  • #Molecule: 10,000

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

OPV#

class OPV(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Quantum mechanical calculations on organic photovoltaic candidate molecules.

Statistics:
  • #Molecule: 94,576

  • #Regression task: 8

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

QM8#

class QM8(path, node_position=False, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Electronic spectra and excited state energy of small molecules.

Statistics:
  • #Molecule: 21,786

  • #Regression task: 12

Parameters
  • path (str) – path to store the dataset

  • node_position (bool, optional) – whether to load node positions. This will add node_position as a node attribute to each sample.

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

QM9#

class QM9(path, node_position=False, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Geometric, energetic, electronic and thermodynamic properties of DFT-modeled small molecules.

Statistics:
  • #Molecule: 133,885

  • #Regression task: 12

Parameters
  • path (str) – path to store the dataset

  • node_position (bool, optional) – whether to load node positions. This will add node_position as a node attribute to each sample.

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

SIDER#

class SIDER(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Marketed drugs and adverse drug reactions (ADR) dataset, grouped into 27 system organ classes.

Statistics:
  • #Molecule: 1,427

  • #Classification task: 27

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

Tox21#

class Tox21(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways.

Statistics:
  • #Molecule: 7,831

  • #Classification task: 12

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

ToxCast#

class ToxCast(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Toxicology data based on in vitro high-throughput screening.

Statistics:
  • #Molecule: 8,575

  • #Classification task: 617

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

ZINC250k#

class ZINC250k(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Subset of ZINC compound database for virtual screening.

Statistics:
  • #Molecule: 498,910

  • #Regression task: 2

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

ZINC2m#

class ZINC2m(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

ZINC compound database for virtual screening. This dataset doesn’t contain any label information.

Statistics:
  • #Molecule: 2,000,000

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

MOSES#

class MOSES(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Subset of ZINC database for molecule generation. This dataset doesn’t contain any label information.

Statistics:
  • #Molecule: 1,936,963

Parameters
  • path (str) – path for the CSV dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

PCQM4M#

class PCQM4M(path, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Quantum chemistry dataset of molecules, originally curated under the PubChemQC project.

Statistics:
  • #Molecule: 3,803,453

  • #Regression task: 1

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

Retrosynthesis Datasets#

USPTO50k#

class USPTO50k(path, as_synthon=False, verbose=1, transform=None, lazy=False, node_feature='default', edge_feature='default', graph_feature=None, with_hydrogen=False, kekulize=False)[source]#

Chemical reactions extracted from USPTO patents.

Statistics:
  • #Reaction: 50,017

  • #Reaction class: 10

Parameters
  • path (str) – path to store the dataset

  • as_synthon (bool, optional) – whether to decompose (reactant, product) pairs into (reactant, synthon) pairs

  • verbose (int, optional) – output verbose level

  • transform (Callable, optional) – data transformation function

  • lazy (bool, optional) – if lazy mode is used, the molecules are processed in the dataloader. This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.

  • node_feature (str or list of str, optional) – node features to extract

  • edge_feature (str or list of str, optional) – edge features to extract

  • graph_feature (str or list of str, optional) – graph features to extract

  • with_hydrogen (bool, optional) – store hydrogens in the molecule graph. By default, hydrogens are dropped

  • kekulize (bool, optional) – convert aromatic bonds to single/double bonds. Note this only affects the relation in edge_list. For bond_type, aromatic bonds are always stored explicitly. By default, aromatic bonds are stored.

property reaction_types#

All reaction types.

Citation Network Datasets#

Cora#

class Cora(path, verbose=1)[source]#

A citation network of scientific publications with binary word features.

Statistics:
  • #Node: 2,708

  • #Edge: 5,429

  • #Class: 7

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

CiteSeer#

class CiteSeer(path, verbose=1)[source]#

A citation network of scientific publications with binary word features.

Statistics:
  • #Node: 3,327

  • #Edge: 8,059

  • #Class: 6

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level

PubMed#

class PubMed(path, verbose=1)[source]#

A citation network of scientific publications with TF-IDF word features.

Statistics:
  • #Node: 19,717

  • #Edge: 44,338

  • #Class: 3

Parameters
  • path (str) – path to store the dataset

  • verbose (int, optional) – output verbose level
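
The citation datasets above differ in their node features: Cora and CiteSeer use binary word features, while PubMed uses TF-IDF word features. The sketch below contrasts the two on a toy corpus. It is only an illustration: the helper names, the whitespace tokenization, and the particular TF-IDF variant (term frequency times `log(N / df)`) are assumptions, and the actual datasets ship precomputed features.

```python
import math


def binary_features(docs, vocab):
    """1 if the word occurs in the document, else 0."""
    return [[1 if w in doc.split() else 0 for w in vocab] for doc in docs]


def tfidf_features(docs, vocab):
    """Term frequency times log(N / document frequency), one of several TF-IDF variants."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    doc_freq = {w: sum(w in toks for toks in tokenized) for w in vocab}
    rows = []
    for toks in tokenized:
        row = []
        for w in vocab:
            tf = toks.count(w) / len(toks)
            idf = math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows


# Hypothetical toy corpus and vocabulary.
docs = ["graph neural network", "graph graph kernel"]
vocab = ["graph", "kernel", "network"]

binary = binary_features(docs, vocab)
tfidf = tfidf_features(docs, vocab)
print(binary)  # [[1, 0, 1], [1, 1, 0]]
```

Binary features only record presence, so the repeated word in the second document leaves its row unchanged. Under this TF-IDF variant, a word that appears in every document gets weight zero, while rarer words get positive weights.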