# Medical Signal Detection

Statistical data mining is gaining importance for both the pharmaceutical industry and health authorities as a way to assess the benefit-risk profiles of postmarket drugs. The practice is generally known as pharmacovigilance, and the process of seeking valuable information in this context is called "medical signal detection." It involves high-dimensional data: the covariates are more than ten thousand marketed drugs, and the responses are a similarly large number of medical terms describing adverse events. Even though the databases contain millions of reports, detecting previously undiscovered associations is a formidable challenge, which experts often compare to "finding needles in a haystack."

Statistics vs. Data Mining. (lecture1.pdf) Here we begin with a review of classical problems in observational studies, setting the stage for the promising development of data mining.

Medical Signal Detection and Bayesian Methodology. (lecture2.pdf) One of the early successes in statistical data mining is the application of empirical Bayes to disproportionality analysis over drug-event combinations. The Gamma-Poisson shrinker (GPS) provides an elegant solution to the excess signal levels that arise for rare events.
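To make the shrinkage idea concrete, here is a minimal R sketch. It is not the GPS fitting code from the lecture: it assumes a single gamma prior with hand-picked hyperparameters `alpha` and `beta` (the published GPS uses a mixture of two gammas fitted by empirical Bayes), and the count table and the function name `shrink_rr` are invented for illustration.

```r
# Toy drug-by-event count table (invented numbers, illustration only).
counts <- matrix(c(20, 1,  4,
                   30, 0,  9,
                    5, 2, 50),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(drug = c("A", "B", "C"),
                                 event = c("e1", "e2", "e3")))

# Expected counts under independence of drug and event.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Raw relative reporting ratio: unstable when the expected count is small.
rr_raw <- counts / expected

# Gamma-Poisson shrinkage with an assumed Gamma(alpha, beta) prior on the
# true ratio lambda: N ~ Poisson(lambda * E) gives the posterior
# Gamma(alpha + N, beta + E), whose mean is pulled toward alpha / beta.
shrink_rr <- function(n, e, alpha = 2, beta = 2) (n + alpha) / (e + beta)
rr_shrunk <- shrink_rr(counts, expected)

round(rr_raw, 2)
round(rr_shrunk, 2)
```

With `alpha = beta`, every shrunk ratio lies between the raw ratio and 1, and the pull is strongest in cells whose expected count is small.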

Data set and R code. A zip file (AERS.zip) contains the results of the disproportionality analysis for a total of 1,090 drugs and 1,072 medical terms, each reported in at least 50 individual incidents from January 2004 to March 2005.

Markov Chain Monte Carlo Methods. (lecture3.pdf) Besides Bayesian methodology, we explore Markov chain Monte Carlo (MCMC) methods, which have become indispensable for wide-ranging applications in statistics and data mining.

R code for Gibbs sampler on Potts model. MCMC methods are illustrated by various implementation strategies for the Potts model (potts.r).
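For readers without potts.r at hand, the following is a minimal sketch of a single-site Gibbs sampler for a Potts model. It is not the course implementation: the function name `potts_gibbs`, the grid size, the number of states `q`, the inverse temperature `beta`, and the free boundary condition are all assumptions chosen for brevity.

```r
# Gibbs sampler for a q-state Potts model on an n x n grid with free
# boundaries. All names and defaults here are illustrative assumptions.
potts_gibbs <- function(n = 16, q = 3, beta = 1.0, sweeps = 50) {
  x <- matrix(sample.int(q, n * n, replace = TRUE), n, n)
  for (s in seq_len(sweeps)) {
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        # Collect the states of the (up to four) nearest neighbours.
        nb <- c(if (i > 1) x[i - 1, j], if (i < n) x[i + 1, j],
                if (j > 1) x[i, j - 1], if (j < n) x[i, j + 1])
        # Full conditional: P(x_ij = k | rest) is proportional to
        # exp(beta * number of neighbours currently in state k).
        agree <- tabulate(nb, nbins = q)
        x[i, j] <- sample.int(q, 1, prob = exp(beta * agree))
      }
    }
  }
  x
}

set.seed(1)
state <- potts_gibbs()
table(state)   # how many sites ended up in each state
```

Calling `image(state)` on the result shows the clusters that form as `beta` increases.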

Bayesian Approach for Statistical Inference. (lecture4.pdf) We review the frequentist and Bayesian approaches to statistics, and compare Bayesian estimates with the maximum likelihood estimate.

R demonstration. Download bernoulli.r. The function `bernoulli()` generates data and compares the estimates from the two methods. See how they differ in a particular outcome, and repeat the experiment with the same size. Then increase the size, and observe how similar the two estimates become.

> source("bernoulli.r")
> bernoulli()
> bernoulli(size=20)
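The comparison can also be sketched without the course file. The function below is a hypothetical stand-in, not the actual `bernoulli()`: it assumes a true success probability `p = 0.3` and a Beta(`a`, `b`) prior with `a = b = 1`, under which the Bayesian estimate is the posterior mean.

```r
# Hypothetical stand-in for the comparison described above; the actual
# bernoulli.r may use a different prior, true parameter, or output.
compare_estimates <- function(size = 10, p = 0.3, a = 1, b = 1) {
  x <- rbinom(size, 1, p)                # Bernoulli(p) sample
  s <- sum(x)
  c(mle   = s / size,                    # maximum likelihood estimate
    bayes = (s + a) / (size + a + b))    # posterior mean under Beta(a, b)
}

set.seed(2011)
compare_estimates(size = 10)
compare_estimates(size = 1000)
```

Under these assumptions the two estimates differ by at most 1 / (size + 2), which is why they become nearly indistinguishable as the size grows.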

Stochastic Algorithms and MCMC. (lecture5.pdf) We introduce the Metropolis algorithm as a method for searching for the global maximum of an objective function, and investigate its properties as an implementation of MCMC.

R demonstration. Download nm.r and search.r to your own machine. Use the normal mixture density function on the interval [0, 1] as the objective function, and see whether the algorithm reaches the global maximum.

> source("nm.r")
> source("search.r")
> search(ff=nm)
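A Metropolis-style search of this kind can be sketched as follows. This is an illustration under stated assumptions, not the contents of nm.r or search.r: the mixture `f`, the function `metropolis_search`, the step size `delta`, and the sharpening exponent `theta` are all hypothetical, and the search interval is assumed to be [0, 1].

```r
# Hypothetical objective: a two-component normal mixture on [0, 1].
# The course file nm.r may define a different mixture.
f <- function(x) 0.4 * dnorm(x, 0.25, 0.05) + 0.6 * dnorm(x, 0.75, 0.1)

# Metropolis search: propose a uniform step and accept it with probability
# min(1, (f(new) / f(old))^theta). Uphill moves are always accepted;
# downhill moves are accepted occasionally, which lets the walk leave a
# local maximum.
metropolis_search <- function(f, steps = 5000, delta = 0.3, theta = 2) {
  x <- runif(1)
  best <- x
  for (t in seq_len(steps)) {
    y <- x + runif(1, -delta, delta)
    if (y >= 0 && y <= 1 && runif(1) < (f(y) / f(x))^theta) x <- y
    if (f(x) > f(best)) best <- x     # remember the best point visited
  }
  best
}

set.seed(1)
best <- metropolis_search(f)
c(x = best, value = f(best))
```

A larger `theta` concentrates the walk near the maxima but also makes it more likely to stay trapped in a local mode, so it is worth rerunning the search with several values.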

Then use the Metropolis-Hastings algorithm (metro.r) to generate a sample from the density of interest on [0, 1], and compare its histogram with the density curve.

> source("metro.r")
> sample = bwalk.metro(ff=nm, run.time=100, delta=0.8, theta=20, sample.size=500)
> hist(sample, freq=FALSE, breaks=seq(0,1,by=0.05), col="red")
> x = seq(0,1,by=0.05)
> lines(x, nm(x))
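The sampling step can be illustrated with a plain random-walk Metropolis sampler. The sketch below is not `bwalk.metro()` and makes no claim about what its `run.time` or `theta` arguments do: the function `rw_metropolis`, the density `target`, and the restriction to [0, 1] are assumptions made for the example.

```r
# Hypothetical target density on [0, 1]; it does not need to be normalised.
target <- function(x) 0.4 * dnorm(x, 0.25, 0.05) + 0.6 * dnorm(x, 0.75, 0.1)

# Random-walk Metropolis sampler: the uniform proposal is symmetric, so the
# Hastings ratio reduces to target(y) / target(x).
rw_metropolis <- function(target, n = 5000, delta = 0.3, x0 = 0.5) {
  out <- numeric(n)
  x <- x0
  for (t in seq_len(n)) {
    y <- x + runif(1, -delta, delta)
    # Proposals outside [0, 1] are treated as zero density and rejected.
    if (y >= 0 && y <= 1 && runif(1) < target(y) / target(x)) x <- y
    out[t] <- x                       # a rejected move repeats the old state
  }
  out
}

set.seed(1)
draws <- rw_metropolis(target)
summary(draws)
```

Plotting `hist(draws, freq=FALSE)` and overlaying `target` gives the same kind of visual check as the demonstration above.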

Association Measures for Medical Signal Detection. (lecture6.pdf) We investigate various association measures for disproportionality analysis in medical signal detection, and discuss their properties and limitations.

R demonstration. First install the package `iplots` into R:

> install.packages("iplots")

Unzip iaers.zip, which contains the results (`am.RData`) for three different association measures (log odds ratio, log GPS, logit coefficient) for each DEC (drug-event combination). The calculation has been done for a total of 222 drugs and 410 medical terms, each reported in at least 500 individual incidents from January 2004 to March 2005.

> source("iaers.R")
> iaers()
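As a reminder of what the simplest of the three measures computes, here is a small sketch of the log odds ratio for a single drug-event combination. The function name `log_odds_ratio`, the counts, and the 0.5 continuity correction are assumptions for illustration; the values stored in `am.RData` may have been computed differently.

```r
# Log odds ratio for one drug-event combination from a 2 x 2 report table.
# The counts are invented; cc = 0.5 is added to each cell (an assumed
# continuity correction) so that empty cells do not give infinite values.
log_odds_ratio <- function(n11, n10, n01, n00, cc = 0.5) {
  log((n11 + cc) * (n00 + cc) / ((n10 + cc) * (n01 + cc)))
}

# n11: reports with drug and event,  n10: drug without event,
# n01: event without drug,           n00: neither.
log_odds_ratio(n11 = 40, n10 = 460, n01 = 960, n00 = 98540)
```

A positive value means the event is reported disproportionately often with the drug; a value near zero means no association in the reports.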

Logit Models and MCMC. (lecture7.pdf) In the calculation of logit parameters (i.e., coefficients for a logistic regression model) we can obtain their Bayesian estimates by running a Gibbs sampler.

R demonstration. Unzip mcmc.zip, which contains the complete data set (`FORM.csv`) for a total of 222 drugs (`DRUGno.csv`) and 410 medical terms (`REACno.csv`), each reported in at least 500 individual incidents from January 2004 to March 2005.

Parameter values (marked "o" in red) indicate the current state of the MCMC, and they change as the Gibbs sampler updates them. They are compared with the maximum likelihood estimates (marked "x" in black) of the parameters.

> source("mcmc.R")
> mcmc()
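The comparison between MCMC draws and the MLE can be tried on simulated data. The sketch below is not mcmc.R and swaps in a different technique: the full conditionals of a logit model have no standard form, so each coordinate is updated with a random-walk Metropolis step (Metropolis-within-Gibbs) instead of an exact Gibbs draw. The data, the N(0, 10^2) prior, and the proposal scale are all assumptions.

```r
# Simulated data (invented): one binary covariate, true coefficients (-1, 1.5).
set.seed(1)
n <- 500
x <- cbind(1, rbinom(n, 1, 0.4))
y <- rbinom(n, 1, plogis(drop(x %*% c(-1, 1.5))))

# Log posterior with an assumed N(0, 10^2) prior on each coefficient.
log_post <- function(b) {
  eta <- drop(x %*% b)
  sum(y * eta - log1p(exp(eta))) + sum(dnorm(b, 0, 10, log = TRUE))
}

# Metropolis-within-Gibbs: update one coordinate at a time with a
# random-walk proposal, accepting by the ratio of posterior densities.
b <- c(0, 0)
draws <- matrix(NA_real_, 2000, 2)
for (t in seq_len(nrow(draws))) {
  for (j in seq_along(b)) {
    prop <- b
    prop[j] <- b[j] + rnorm(1, 0, 0.3)
    if (log(runif(1)) < log_post(prop) - log_post(b)) b <- prop
  }
  draws[t, ] <- b
}

post_mean <- colMeans(draws[-(1:500), ])            # drop 500 burn-in draws
mle <- coef(glm(y ~ x[, 2], family = binomial))     # MLE for comparison
rbind(post_mean, mle)
```

With a prior this weak and 500 observations, the posterior means should sit close to the MLE, mirroring the "o" and "x" marks in the demonstration.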

This presentation was developed for a series of lectures at Tokyo Institute of Technology in 2011.

© TTU Mathematics