Disparities in Participation in Cancer Clinical Trials

Emily Alagha - RDM 102

This project analyzes disparities in clinical trial representation for Black/African American participants. It compares participation rates in U.S. cancer clinical trials with current incidence rates by race.

Two data sources were used for this analysis:

  • Clinical trial data - participant data by race for clinical trials of every cancer drug approved by the U.S. Food and Drug Administration between January 2015 and June 2018.
  • Cancer incidence data - data on incidence rates by race for 25 types of cancer, collected from the National Cancer Institute's SEER database.

The clinical trials dataset was compiled and cleaned by ProPublica. Although their analysis focuses on cancer drug trials, the dataset includes demographic data for drugs developed for other indications, such as HIV, cystic fibrosis, and malaria prevention. The excerpt below is copied from the Readme file ProPublica published with the cleaned dataset.

ProPublica dataset description:

This dataset contains the demographic breakdowns of participants in clinical trials for FDA-approved drugs between January 2015 and June 2018. The FDA has been providing demographic reports for each approved drug since January 2015. While the FDA provides summary reports by year, sometimes in PDF format only, this dataset was compiled to include all available data across years in an easily usable format.

The columns of the dataset include: brand name; drug indication; percentage of women in the clinical trials; percentage of participants by race: white, black or African American, Asian, and other; percentage of participants of Hispanic ethnicity; percentage of participants who are age 65 and older; and year.

The "Other" race category was used as a catch-all for any of these categories: American Indian/Alaska Native (AI/AN), Native Hawaiian or Other Pacific Islander (NH/OPI), mixed race, multiple races, Unknown, Unreported, and Other. While the FDA also provides these demographic breakdowns by drug, which contains more detailed information, raw numbers for patients, and occasionally disaggregated "Other" categories, we did not include this information here. For individual drugs, the disaggregated "Other" categories are not consistent.

For drugs approved in 2015 and 2016, percentages for the "Other" category were provided in FDA summary reports. For 2017 drugs, we calculated this percentage by subtracting the other categories from 100 percent. For 2018 drugs, we manually compiled these percentages from the reports for each individual drug.

Limitations:
Categories for "Other" are not consistent across clinical trials, which may limit the reliability of this variable.

Snapshots included clinical trials run in the United States and internationally, but did not begin until 2017 to report what percentage of trials were conducted in the U.S. Though Asians appear to be well-represented in most trials, many of these trials were likely based outside of the United States. Analysis of 2017 data shows that, for drugs with at least 70 percent of trials conducted within the U.S., Asians make up only 1.7 percent of participants. Furthermore, the “Asian” category does not say if participants are of East Asian, South Asian, Southeast Asian, or Pacific Islander descent.

Reports did not include a Hispanic ethnicity category until 2017, and do not distinguish between white and non-white Hispanics, or between Hispanics of European or Latin American descent.

The cancer incidence dataset was extracted by ProPublica staff from the National Cancer Institute's SEER database.

The SEER age-adjusted incidence rate for a cancer type is the number of new cases of that cancer per 100,000 people, weighted by the age distribution of the U.S. standard population.
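
As a rough illustration of how a rate like this is built, the sketch below applies direct age standardization to made-up numbers. The function name, the three age groups, the case counts, the populations, and the weights are all hypothetical and are not taken from SEER; the real calculation uses finer age groups and the published standard population weights.

```python
def age_adjusted_rate(cases, population, std_weights):
    """Directly standardized rate per 100,000 people.

    cases, population: counts per age group for the group being studied.
    std_weights: share of the standard population in each age group.
    """
    if not (len(cases) == len(population) == len(std_weights)):
        raise ValueError("all inputs need one value per age group")
    total_weight = sum(std_weights)
    rate = 0.0
    for c, p, w in zip(cases, population, std_weights):
        age_specific = c / p * 100_000          # crude rate within the age group
        rate += age_specific * (w / total_weight)  # weight by standard population share
    return rate


# Hypothetical example with three age groups (young, middle, older)
cases = [10, 50, 200]
population = [1_000_000, 500_000, 250_000]
std_weights = [0.5, 0.3, 0.2]

adjusted = age_adjusted_rate(cases, population, std_weights)
print(round(adjusted, 1))
```

Because every racial group is weighted by the same standard age distribution, differences in the resulting rates are not driven by one group simply being older or younger than another.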

Finally, SEER groups “Asian or Pacific Islander” into one category and does not provide disaggregated data for patients of East Asian, South Asian, Southeast Asian, or Pacific Islander descent.

Dataset Exploration

Descriptive statistics

Mean and median participation in all trials

          Women   White   Black   Age 65+
Mean      48%     76%     9%      26%
Median    44%     80%     4%      19%
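
The gap between the mean and the median for Black participation (9% versus 4%) suggests that a few trials with unusually high participation pull the mean upward. The sketch below shows that effect with Python's standard `statistics` module. The five percentages are hypothetical values chosen only to reproduce a mean of 9 and a median of 4; they are not rows from the FDA dataset.

```python
import statistics

# Hypothetical Black participation percentages for five trials.
# One high-participation trial (34%) skews the mean upward.
black_particip = [1, 2, 4, 4, 34]

mean_pct = statistics.mean(black_particip)
median_pct = statistics.median(black_particip)

print(f"mean: {mean_pct}%, median: {median_pct}%")
```

When the two statistics diverge like this, the median is usually the better description of a typical trial.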

Discussion

The data show overrepresentation of white participants in clinical trials and underrepresentation of Black or African American participants, as well as of patients older than 65. There are some outliers that may be related to drug indication. For example, the drug Addyi targets hypoactive sexual desire disorder in women, so 100% of the participants were women. The data will need to be cross-referenced with incidence rates to get a better understanding of how indication may be affecting participant diversity. Although several groups are underrepresented, different factors are likely driving the lower participation in each case. Low representation of Black or African American participants may be due to problems with the recruitment and retention process, while underrepresentation of older participants may be related to trial exclusion criteria regarding comorbidities. Additional data are needed to explore these barriers to participation.

Research Question

Does racial representation in cancer treatment clinical trials mirror racial distribution of the burden of the indicated disease as measured by rates of new cases per 100,000 people?

Dataset Cleaning

I used OpenRefine to clean both datasets. To replicate the analysis from ProPublica, I followed these steps:

  • created a subset of the data that included only cancer-related trial data
  • removed all columns related to age and year
  • removed the column with data for Hispanic participants due to incomplete data availability for this category
  • changed the data type for percentage of participants by race from text to numeric, to enable easier analysis
  • removed the percentage character as well as the "<" character (where the value was "<1%", the field is now assumed to be 1%)
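
The last two steps were done in OpenRefine, but the same rule can be sketched in plain Python. This is only an illustration, assuming raw values look like "76%" or "<1%"; `parse_percent` is a hypothetical helper and not part of the original workflow.

```python
def parse_percent(raw):
    """Convert a text percentage such as '76%' or '<1%' to a float.

    '<1%' is treated as 1.0, matching the assumption in the cleaning steps.
    Blank values return None so that missing data is not mistaken for 0.
    """
    text = raw.strip().replace("%", "").replace("<", "")
    return float(text) if text else None


print(parse_percent("76%"))
print(parse_percent("<1%"))
print(parse_percent(""))
```

Treating "<1%" as 1% slightly overstates participation for the smallest groups, which is worth keeping in mind when reading the charts.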

All cleaning steps for the FDA data can be replicated by running this text in OpenRefine. I also renamed the columns in the cancer incidence dataset and changed the data types to numeric where appropriate to facilitate analysis. The cleaning steps for the cancer incidence data can be replicated in OpenRefine using this JSON script.

Dataset Visualization

For 31 cancer drugs, white participants were the most represented group in almost every trial.

Infographic comparing overall cancer incidence rates and overrepresentation of white participants in cancer trials

In [7]:
#load packages
import numpy as np 
import pandas as pd
import csv
import matplotlib.pyplot as plt
%matplotlib inline
In [18]:
#load data
cancertrials = pd.read_csv('/home/project4/Notebooks/emily.couvillon/clinicalTrialsCancer.csv', index_col = 'brandName')
cancerincidence = pd.read_csv('/home/project4/Notebooks/emily.couvillon/cancerIncidenceRatesPer100000.csv', index_col = 'cancerType')

Black Americans Face the Highest Risk of Multiple Myeloma but are Underrepresented in Trials Treating the Cancer

New Cases of Multiple Myeloma per 100,000 people

In [10]:
colors = ['#DFE0DF', '#C66104', '#FF878A', '#DCA684']
races = ('White','Black','Asian','Other')
ax = cancerincidence.iloc[19:20].plot.barh(color = colors, legend = True, figsize=(20,10))
#clarify legend labels
ax.legend(labels=races)

# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Set labels
ax.set_xlabel("New cases per 100,000", labelpad=20, weight='bold', size=14)
ax.set_yticklabels(' ')
ax.set_ylabel('', labelpad=20, weight='bold', size=12)
#add title
ax.set_title('Multiple Myeloma Incidence', weight='bold', size=16)

#Add bar annotations for clarity
ax.annotate('4',xy=(1,1),xytext=(4,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('6.3',xy=(.5,.5),xytext=(6.3,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('5.9',xy=(.5,.5),xytext=(5.9,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('13.8',xy=(.5,.5),xytext=(13.8,-.08),
            annotation_clip=False, fontsize = 16)
ax.annotate('Asian',xy=(1,1),xytext=(-.75,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('White',xy=(.5,.5),xytext=(-.75,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('Other',xy=(.5,.5),xytext=(-.75,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('Black',xy=(.5,.5),xytext=(-.75,-.08),
            annotation_clip=False, fontsize = 16)
Out[10]:
Text(-0.75,-0.08,'Black')

Representation in Clinical Trials for Multiple Myeloma

In [12]:
colors2=['#00C7F7','#086F7E','#725998','#EEE8A9']
races = ('White','Black','Asian','Other')
#bar graph for representation in multiple myeloma trials
#column order must match the legend labels in races (White, Black, Asian, Other)
ax = cancertrials.loc[['DARZALEX','EMPLICITI','FARYDAK','NINLARO'],['white_particip', 'blackOrAA_particip', 'asian_particip', 'other_particip']].plot.barh(color = colors2, legend = True, figsize=(20,10))
#clarify legend labels
ax.legend(labels=races)
# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Set labels
ax.set_xlabel("Percent of participants", labelpad=20, weight='bold', size=14)
ax.set_ylabel('Brand Name', labelpad=20, weight='bold', size=14)
#add title
ax.set_title('Trial Participation for Multiple Myeloma Treatments', weight='bold', size=16)
Out[12]:
Text(0.5,1,'Trial Participation for Multiple Myeloma Treatments')

Black Americans Face the Highest Risk of ALK-Positive Non-Small Cell Lung Cancer but are Underrepresented in Trials Treating the Cancer

New Cases of NSCLC per 100,000 people

In [13]:
#set colors
colors = ['#DFE0DF', '#C66104', '#FF878A', '#DCA684']
#list legend labels
races = ('White','Black','Asian','Other')
#create bar chart
ax=cancerincidence.iloc[1:2].plot.barh(color = colors, legend = True, figsize=(20,10))
#clarify legend labels
ax.legend(labels=races)

# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Set labels
ax.set_xlabel("New cases per 100,000", labelpad=20, weight='bold', size=14)
ax.set_yticklabels(' ')
ax.set_ylabel('', labelpad=20, weight='bold', size=12)
ax.set_title('ALK-Positive Non-Small Cell Lung Cancer Incidence', weight='bold', size=16)

#add annotations for clarity
ax.annotate('35.2',xy=(1,1),xytext=(35.2,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('54.2',xy=(.5,.5),xytext=(54.2,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('36.3',xy=(.5,.5),xytext=(36.3,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('62.5',xy=(.5,.5),xytext=(62.5,-.08),
            annotation_clip=False, fontsize = 16)
ax.annotate('Asian',xy=(1,1),xytext=(-3,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('White',xy=(.5,.5),xytext=(-3,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('Other',xy=(.5,.5),xytext=(-3,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('Black',xy=(.5,.5),xytext=(-3,-.08),
            annotation_clip=False, fontsize = 16)
Out[13]:
Text(-3,-0.08,'Black')

Representation in Clinical Trials for NSCLC

In [97]:
#bar graph for representation in NSCLC trials
ax=cancertrials.loc[['ALUNBRIG','ALECENSA'],['white_particip', 'blackOrAA_particip','asian_particip', 'other_particip']].plot.barh(color = colors2, legend = True,figsize=(20,10))

#clarify legend labels
ax.legend(labels=races)

# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)    
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Set labels
ax.set_xlabel("Percent of participants", labelpad=20, weight='bold', size=14)
ax.set_ylabel('Brand Name', labelpad=20, weight='bold', size=14)
ax.set_title('Trial Participation for ALK-Positive Non-Small Cell Lung Cancer Treatments', weight='bold', size=16)
Out[97]:
Text(0.5,1,'Trial Participation for ALK-Positive Non-Small Cell Lung Cancer Treatments')

Black Americans Face the Highest Risk of Prostate Cancer but are Underrepresented in Trials Treating the Cancer

New Cases of Prostate Cancer per 100,000 people

In [15]:
#set colors
colors = ['#DFE0DF', '#C66104', '#FF878A', '#DCA684']
#list legend labels
races = ('White','Black','Asian','Other')
#create bar chart
ax=cancerincidence.iloc[21:22].plot.barh(color = colors, legend = True, figsize=(20,10))

#clarify legend labels
ax.legend(labels=races)
# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Set labels
ax.set_xlabel("New cases per 100,000", labelpad=20, weight='bold', size=14)
ax.set_yticklabels(' ')
ax.set_ylabel('', labelpad=20, weight='bold', size=12)
ax.set_title('Prostate Cancer Incidence', weight='bold', size=16)

#Add bar annotations for clarity
ax.annotate('59.1',xy=(1,1),xytext=(59.1,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('105.7',xy=(.5,.5),xytext=(105.7,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('54.8',xy=(.5,.5),xytext=(54.8,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('178.3',xy=(.5,.5),xytext=(178.3,-.08),
            annotation_clip=False, fontsize = 16)
ax.annotate('Asian',xy=(1,1),xytext=(-8,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('White',xy=(.5,.5),xytext=(-8,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('Other',xy=(.5,.5),xytext=(-8,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('Black',xy=(.5,.5),xytext=(-8,-.08),
            annotation_clip=False, fontsize = 16)
Out[15]:
Text(-8,-0.08,'Black')

Representation in Clinical Trials for Prostate Cancer

In [16]:
### Representation in Clinical Trials for prostate cancer
ax = cancertrials.loc[['ERLEADA'],['white_particip', 'blackOrAA_particip','asian_particip', 'other_particip']].plot.barh(color = colors2, legend = True, figsize=(20,10))

#clarify legend labels
ax.legend(labels=races)

# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Set labels
ax.set_xlabel("Percent of participants", labelpad=20, weight='bold', size=14)
ax.set_ylabel('Brand Name', labelpad=20, weight='bold', size=14)
ax.set_title('Trial Participation for Prostate Cancer Treatment', weight='bold', size=16)
Out[16]:
Text(0.5,1,'Trial Participation for Prostate Cancer Treatment')

Black Americans Face the Highest Risk of Colorectal Cancer but are Underrepresented in Trials Treating the Cancer

New Cases of Colorectal Cancer per 100,000 people

In [92]:
#set colors
colors = ['#DFE0DF', '#C66104', '#FF878A', '#DCA684']
#list legend labels
races = ('White','Black','Asian','Other')

#create chart
ax=cancerincidence.iloc[5:6].plot.barh(color = colors, legend = True, figsize=(20,10))
#clarify legend labels
ax.legend(labels=races)

# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Set labels
ax.set_xlabel("New cases per 100,000", labelpad=20, weight='bold', size=14)
ax.set_yticklabels(' ')
ax.set_ylabel('', labelpad=20, weight='bold', size=14)
ax.set_title('Colorectal Cancer Incidence', weight='bold', size=16)

#Add bar annotations for clarity
ax.annotate('33.7',xy=(1,1),xytext=(33.7,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('39.2',xy=(.5,.5),xytext=(39.2,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('42.2',xy=(.5,.5),xytext=(42.2,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('48.7',xy=(.5,.5),xytext=(48.7,-.08),
            annotation_clip=False, fontsize = 16)
ax.annotate('Asian',xy=(1,1),xytext=(-2,.04),
            annotation_clip=False, fontsize = 16)
ax.annotate('White',xy=(.5,.5),xytext=(-2,-.21),
            annotation_clip=False, fontsize = 16)
ax.annotate('Other',xy=(.5,.5),xytext=(-2,.169),
            annotation_clip=False, fontsize = 16)
ax.annotate('Black',xy=(.5,.5),xytext=(-2,-.08),
            annotation_clip=False, fontsize = 16)
Out[92]:
Text(-2,-0.08,'Black')

Representation in Clinical Trials for Colorectal Cancer

In [17]:
### Create chart for representation in clinical trials for colorectal cancer
ax = cancertrials.loc[['LONSURF'],['white_particip', 'blackOrAA_particip','asian_particip', 'other_particip']].plot.barh(color = colors2, legend = True, figsize=(20,10))

#clarify legend labels
ax.legend(labels=races)
# Despine
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# Set labels
ax.set_xlabel("Percent of participants", labelpad=20, weight='bold', size=14)
ax.set_ylabel('Brand Name', labelpad=20, weight='bold', size=14)
ax.set_title('Trial Participation for Colorectal Cancer Treatment', weight='bold', size=16)
Out[17]:
Text(0.5,1,'Trial Participation for Colorectal Cancer Treatment')
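
One way to summarize the four incidence charts is the ratio of the Black rate to the white rate for each cancer. The sketch below uses the rates annotated on the incidence charts in this notebook (new cases per 100,000); the dictionary and function names are illustrative and do not come from the original analysis.

```python
# Rates per 100,000 as annotated on the incidence charts above: (white, black)
incidence = {
    "Multiple myeloma": (6.3, 13.8),
    "Non-small cell lung cancer": (54.2, 62.5),
    "Prostate cancer": (105.7, 178.3),
    "Colorectal cancer": (39.2, 48.7),
}


def black_to_white_ratio(rates):
    """Return the Black incidence rate divided by the white incidence rate."""
    white, black = rates
    return black / white


ratios = {cancer: black_to_white_ratio(r) for cancer, r in incidence.items()}

for cancer, ratio in ratios.items():
    print(f"{cancer}: {ratio:.2f}")
```

A ratio above 1 means the incidence rate is higher among Black Americans than among white Americans, which is the case for all four cancers charted here.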

Conclusions

  • Black patients are underrepresented in clinical trials for cancer treatments.
  • Further exploration of the FDA dataset could help determine if underrepresentation extends to conditions beyond cancer.
  • Future research should explore potential correlations between underrepresentation and adverse events associated with these treatments.
  • More inclusive trial recruitment strategies are needed.

Reflections

  • Learning about data story types helped me identify the components that make the ProPublica story effective data journalism. They effectively used the "drill-down" technique to analyze a subset of the FDA data and highlight disparities in representation. They also used personal patient stories to humanize the dataset and generate reader investment in what the findings mean for individuals.
  • While the bar chart visualizations would have been faster for me to complete in Excel, using Python helped me to better understand the utility of Jupyter notebooks. The ability to document all the code and keep all files in one place is great for reproducibility and collaboration. I plan to incorporate Jupyter notebooks into future instruction efforts on research reproducibility.