

Download preprocessed TGP data

Data set description:
Data for the four studies are zip-compressed and available for download through the links below. Each zip file contains four files in CSV (comma-separated values) format: the FARMS-summarized gene expression values per gene (exprs_*.csv), the informative/non-informative (I/NI) call per gene (ini_*.csv), the sample names (sampleNames_*.csv) and the gene names (geneNames_*.csv). Each sample corresponds to one drug measurement. In the gene expression matrix, the columns are genes and the rows are samples. The I/NI call is a filter criterion that allows the detection of information-carrying genes (e.g., genes with an I/NI call below 0.5; a smaller I/NI call means more information). Replicate measurements were collapsed to one measurement per gene.
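The I/NI filter described above can be sketched in a few lines of Python. This is only an illustration: it assumes header-less CSV files, one I/NI value and one gene name per line, and it uses small in-memory stand-ins instead of the real exprs_*.csv, ini_*.csv and geneNames_*.csv files. The gene names and all helper function names are hypothetical; check the layout of the downloaded files before reusing it.

```python
import csv
import io

def read_matrix(handle):
    """Read a header-less numeric CSV into a list of rows (samples x genes)."""
    return [[float(v) for v in row] for row in csv.reader(handle) if row]

def read_column(handle, convert=str):
    """Read a single-column CSV, converting each entry with `convert`."""
    return [convert(row[0]) for row in csv.reader(handle) if row]

def informative_indices(ini_calls, threshold=0.5):
    """Indices of genes whose I/NI call lies below the threshold."""
    return [i for i, call in enumerate(ini_calls) if call < threshold]

def select_columns(matrix, keep):
    """Keep only the gene columns listed in `keep`."""
    return [[row[i] for i in keep] for row in matrix]

# In-memory stand-ins for the files in one zip archive (assumed layout).
exprs_csv = io.StringIO("0.1,1.2,-0.4\n0.0,0.9,0.3\n")   # 2 samples x 3 genes
ini_csv = io.StringIO("0.9\n0.2\n0.45\n")                # one call per gene
genes_csv = io.StringIO("geneA\ngeneB\ngeneC\n")         # hypothetical names

exprs = read_matrix(exprs_csv)
ini_calls = read_column(ini_csv, float)
gene_names = read_column(genes_csv)

keep = informative_indices(ini_calls, threshold=0.5)
filtered = select_columns(exprs, keep)
kept_names = [gene_names[i] for i in keep]
```

With real data, the StringIO objects would be replaced by files opened from the extracted archive; the 0.5 threshold follows the example given in the description and can be adjusted.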

TGP drug info and pathological findings (CSV, EXCEL format)

Study – rat in vivo single (CSV format)

Study – rat in vivo repeated (CSV format)

Study – rat in vitro single (CSV format)

Study – human in vitro (CSV format)

Download rat in vitro study (LIBSVM format)

Data set description:
The example classification data sets below were built using the drug information from “Drug Information.csv” and the expression data from the rat in vivo single study (CSV format). The example data sets contain the gene expression values (FARMS preprocessed) and, as labels, the drug-induced liver injury (DILI) classes (“-1”, “+1”). The data are stored in LIBSVM format for different time points (2h, 8h, and 24h) and dose levels (low, middle, and high). These binary classification data sets are ready to be analysed using the LIBSVM package. Samples (drugs) of no DILI concern were labeled “-1” and those of most DILI concern “+1”. For more details regarding the categorization of DILI see here.
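For readers unfamiliar with the LIBSVM text format, the following minimal sketch shows how one labeled sample maps to a line of the form `<label> <index>:<value> ...` with 1-based feature indices, and how such a line can be read back. The function names and the example values are illustrative assumptions and are not taken from the provided data sets.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' (1-based indices)."""
    parts = ["+1" if label > 0 else "-1"]
    parts += [f"{i}:{v!r}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)

def parse_libsvm_line(line):
    """Parse one LIBSVM line into (label, {index: value})."""
    tokens = line.split()
    label = int(tokens[0])          # int() accepts both "+1" and "-1"
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# Hypothetical sample: a drug of most DILI concern with three expression values.
line = to_libsvm_line(+1, [0.25, -1.5, 2.0])
label, features = parse_libsvm_line(line)
```

LIBSVM files are usually sparse, so zero-valued features may be omitted from a line; the parser above handles that case because it stores only the indices that are present.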

Example data sets

Description data preprocessing

The Japanese Toxicogenomics Project (TGP) includes gene expression data, toxicological information and pathological data for 131 compounds that were screened for toxicity in vitro and in vivo in rat, and in vitro in human.

Upper panel: The y-axis shows the log expression values of the fatty acid-binding protein 1 (Fabp1) estimated by FARMS after quantile normalization, while the grouped compounds are shown on the x-axis. The time points are encoded by orange, green and blue for 2h, 8h and 24h, respectively. The plot shows strong cell-culture effects, within the three time points and compounds, which could not be removed by the quantile normalization.
Lower panel: Same as upper panel but batch corrected. The correction with the matched control within cell-culture clearly reduces the cell-culture effects, while compound-induced expression changes are preserved.

The standard microarray preprocessing procedure consists of normalization, summarization and filtering. However, the standard preprocessing pipeline cannot be applied to these data sets, as the initial quality control of the microarray data revealed severe effects between the cell-cultures (see upper panel). To remove these effects, the probe-level data of the microarrays were first quantile normalized. Second, a compound batch correction was made by calculating probe intensity ratios, using the corresponding control measurement for the cell-culture (vehicle only, without compound) as reference. For the next preprocessing step, summarization, probe sets corresponding to genes were defined using alternative CDFs (Version 15.1.0, ENTREZG) from Brainarray [2], and FARMS [1] was applied to summarize the intensity ratios at probe set level to obtain expression values per gene. For the last preprocessing step, gene filtering, the FARMS-based informative/non-informative (I/NI) call [3] was applied to identify all non-informative probe sets.
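The first two steps above (quantile normalization followed by ratios against the matched control) can be illustrated with a small sketch. It is not the pipeline used to produce the data: it works on toy probe intensities, breaks ties by position, uses plain (not log) ratios, and leaves out the FARMS summarization and I/NI filtering entirely. All function names and values are assumptions made for illustration.

```python
def quantile_normalize(arrays):
    """Give equal-length arrays a common distribution.

    Each value is replaced by the mean, across arrays, of the values that
    hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

def intensity_ratios(treated, control):
    """Probe-wise ratio of a treated array to its matched vehicle control."""
    return [t / c for t, c in zip(treated, control)]

# Toy probe intensities: one treated array and its matched control.
treated_raw = [5.0, 2.0, 3.0]
control_raw = [4.0, 1.0, 6.0]

treated_norm, control_norm = quantile_normalize([treated_raw, control_raw])
ratios = intensity_ratios(treated_norm, control_norm)
```

In this toy example both arrays end up with the same set of values (1.5, 3.5, 5.5) after normalization, and the ratios express each probe relative to the control from the same cell-culture.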

 References:

  1. Hochreiter S, Clevert DA, and Obermayer K (2006). A new summarization method for Affymetrix probe level data, Bioinformatics, 22(8):943-949
  2. Dai M, Wang P, Boyd AD, et al. (2005). Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data, Nucleic Acids Res., 33(20):e175
  3. Talloen W, Clevert DA, Hochreiter S, et al. (2007). I/NI-calls for the exclusion of non-informative genes: a highly effective feature filtering tool for microarray data, Bioinformatics, 23(21):2897-2902