Exploring the Challenge Data

This example shows how to load and explore the challenge data with biolearn.

Loading up the data for the competition

from biolearn.data_library import DataLibrary

challenge_data = DataLibrary().get("BoAChallengeData").load()

The challenge data includes methylation data, with one row per probe and one column per sample

            GSM7866999  GSM7867305  GSM7867077  ...  GSM7867431  GSM7867126  GSM7867000
cg00000029    0.720084    0.453491    0.535564  ...    0.487361    0.337215    0.062426
cg00000109    0.938044    0.925149    0.949446  ...    0.938523    0.940187    0.942090
cg00000155    0.957093    0.959030    0.948919  ...    0.966912    0.968026    0.970274
cg00000158    0.965191    0.957272    0.967278  ...    0.971039    0.969745    0.969660
cg00000165    0.144061    0.122099    0.167346  ...    0.156199    0.133038    0.641788
...                ...         ...         ...  ...         ...         ...         ...
rs9292570     0.523412    0.980810    0.986609  ...    0.525876    0.537722    0.538440
rs9363764     0.960815    0.038638    0.451884  ...    0.978343    0.971838    0.973633
rs951295      0.971034    0.558011    0.553397  ...    0.485287    0.526686    0.499980
rs966367      0.455920    0.963994    0.453094  ...    0.455700    0.029342    0.026907
rs9839873     0.599735    0.595571    0.630871  ...    0.576566    0.579195    0.973815

930659 rows × 500 columns
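The index of the methylation table mixes two kinds of probe IDs: the `cg`-prefixed CpG probes and the `rs`-prefixed rows at the bottom, which on Illumina arrays are SNP control probes rather than methylation sites. Below is a minimal sketch of counting the two kinds by prefix, using only the ten probe IDs visible in the output above. The helper name `count_probe_prefixes` is hypothetical and not part of biolearn.

```python
from collections import Counter

# Probe IDs copied from the truncated output above; the real index has 930659 entries.
probe_ids = [
    "cg00000029", "cg00000109", "cg00000155", "cg00000158", "cg00000165",
    "rs9292570", "rs9363764", "rs951295", "rs966367", "rs9839873",
]


def count_probe_prefixes(ids):
    """Count probe IDs by their two-letter prefix (for example 'cg' or 'rs')."""
    return Counter(probe_id[:2] for probe_id in ids)


prefix_counts = count_probe_prefixes(probe_ids)
print(dict(prefix_counts))
```

On the full table, the same function could be applied to the index of the methylation DataFrame, assuming it is available as a pandas DataFrame as the output above suggests.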



The challenge data also includes proteomic data, with one row per sample and one column per protein target

AlamarTargetID     t10319     t10413     t10466  ...      t8358      t8367      t8399
GSM7867173      13.387778  16.674015  12.890963  ...  12.734120  11.062422  11.599597
GSM7867127      12.948052  15.992939  13.556167  ...  15.238780  12.291734  14.642498
GSM7867083      13.256551  15.547946  13.528756  ...  11.673970  12.604654  14.590723
GSM7867170      13.385639  15.562234  13.437340  ...  12.901178  12.185173  15.815356
GSM7867410      13.030234  15.186056  13.350693  ...  13.228718  15.403618  14.379472
...                   ...        ...        ...  ...        ...        ...        ...
GSM7867364      13.668576  14.857433  13.039565  ...  16.680131  13.210160  14.044917
GSM7867179      14.941948  14.366243  13.965880  ...  12.575219  14.748597  15.610305
GSM7867095      12.062449  14.827118  14.683530  ...  13.103683  11.803189  12.073775
GSM7867115      13.390582  15.872686  13.667331  ...  13.949114  14.313347  14.184558
GSM7867385      13.587867  15.584386  13.483932  ...  13.794795  13.830717  12.040475

503 rows × 374 columns



You can learn more about what each protein identifier refers to in our reference file

import pandas as pd
from biolearn.util import get_data_file

reference = pd.read_csv(get_data_file("reference/alamar_reference.csv"))
reference

     AlamarTargetID  UniProtID  Target   ProteinName
0    t10034          O75888     TNFSF13  TNF superfamily member 13
1    t10319          O95760     IL33     Interleukin 33
2    t10412          P01258     CALCA    Calcitonin [Cleaved into: Calcitonin; Katacalcin
3    t10417          P01903     HLA-DRA  Major histocompatibility complex, class II, DR...
4    t10433          P02745     C1QA     Complement C1q A chain
...  ...             ...        ...      ...
317  t8244           P35247     SFTPD    Pulmonary surfactant-associated protein D
318  t8246           P22303     ACHE     Acetylcholinesterase
319  t8254           Q8N474     SFRP1    Secreted frizzled-related protein 1
320  t8355           P0DJI8     SAA1     Serum amyloid A-1 protein
321  t9441           O15240     VGF      VGF nerve growth factor inducible

322 rows × 4 columns
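The reference table makes it possible to translate the `t…` column names of the proteomic data into protein names. The sketch below builds such a lookup with the standard library from three rows copied from the output above. It assumes a comma-separated layout with the four column names shown, and `build_target_lookup` is a hypothetical helper, not a biolearn function.

```python
import csv
import io

# Three rows copied from the reference output above; the real file has 322 rows.
# The comma-separated layout is an assumption for illustration.
REFERENCE_CSV = """AlamarTargetID,UniProtID,Target,ProteinName
t10034,O75888,TNFSF13,TNF superfamily member 13
t10319,O95760,IL33,Interleukin 33
t10433,P02745,C1QA,Complement C1q A chain
"""


def build_target_lookup(csv_text):
    """Map each AlamarTargetID to the full reference row it belongs to."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["AlamarTargetID"]: row for row in reader}


lookup = build_target_lookup(REFERENCE_CSV)

# t10319 is the first protein column shown in the proteomic table.
entry = lookup["t10319"]
print(entry["Target"], "-", entry["ProteinName"])
```

With the pandas `reference` DataFrame loaded above, `reference.set_index("AlamarTargetID")` would give an equivalent lookup over all rows.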



Some samples appear in both the methylation and the proteomic data while others appear in only one, but the metadata for all samples is combined into a single table

              age     ethnicity  race1  ...  sex  subject_id  tissue
GSM7866964  31.35  Non Hispanic  White  ...    1        BoA1   blood
GSM7866965  79.45  Non Hispanic  Asian  ...    0        BoA2   blood
GSM7866966  60.42  Non Hispanic  Asian  ...    1        BoA3   blood
GSM7866967  59.24      HISPANIC  White  ...    0        BoA4   blood
GSM7866968  22.41  Non Hispanic  White  ...    0        BoA5   blood
...           ...           ...    ...  ...  ...         ...     ...
P499        80.98           NaN    NaN  ...    0         NaN     NaN
P500        89.00           NaN    NaN  ...    0         NaN     NaN
P501        80.92           NaN    NaN  ...    1         NaN     NaN
P502        69.80           NaN    NaN  ...    0         NaN     NaN
P503        68.18           NaN    NaN  ...    0         NaN     NaN

651 rows × 7 columns
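To see which samples have both data types, the sample IDs of the two tables can be compared as sets. The sketch below uses hypothetical toy IDs rather than real sample names, and `split_by_overlap` is an illustrative helper, not part of biolearn.

```python
def split_by_overlap(methylation_ids, proteomic_ids):
    """Return (in both, methylation only, proteomic only) as sorted lists."""
    meth, prot = set(methylation_ids), set(proteomic_ids)
    return sorted(meth & prot), sorted(meth - prot), sorted(prot - meth)


# Hypothetical toy IDs, not real sample names from the challenge data.
methylation_ids = ["S1", "S2", "S3", "S4"]
proteomic_ids = ["S3", "S4", "S5"]

both, meth_only, prot_only = split_by_overlap(methylation_ids, proteomic_ids)
print("both:", both)
print("methylation only:", meth_only)
print("proteomic only:", prot_only)
```

In the outputs above, sample IDs are the columns of the methylation table and the index of the proteomic table, so those would be the two inputs on the real data.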



You can easily run several models on the challenge data

from biolearn.mortality import run_predictions

prediction_dict = {
    "Horvathv1": "Predicted",
    "Hannum": "Predicted"
}

predictions = run_predictions(challenge_data, prediction_dict)
predictions

            Horvathv1     Hannum
GSM7866999  21.822423  17.303025
GSM7867305  56.266752  52.803410
GSM7867077  39.660930  36.211078
GSM7867095  29.806215  32.821235
GSM7867457  51.285108  49.308239
...               ...        ...
GSM7867136  37.539954  32.494127
GSM7867452  74.975617  68.639608
GSM7867431  33.083565  33.906176
GSM7867126  91.428559  74.870807
GSM7867000  75.005273  77.630528

500 rows × 2 columns



We can then compare the output from the two models

import matplotlib.pyplot as plt
import seaborn as sns

# 'predictions' has one column per model: 'Horvathv1' and 'Hannum'

# Create a scatter plot with a regression line
plt.figure(figsize=(8, 6))
sns.regplot(x='Horvathv1', y='Hannum', data=predictions, ci=None)

plt.title('Scatter Plot with Regression Line')
plt.xlabel('Horvathv1')
plt.ylabel('Hannum')
plt.grid(True)
plt.show()

Figure: Scatter Plot with Regression Line (x axis: Horvathv1, y axis: Hannum)
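Agreement between the two models can also be summarised numerically. The sketch below computes the Pearson correlation and the mean absolute difference using only the ten prediction rows visible in the truncated output above, so the values describe that small subset rather than all 500 samples. The helpers `pearson` and `mean_abs_diff` are written here for illustration and are not biolearn functions.

```python
from math import sqrt
from statistics import mean

# The ten rows visible in the truncated predictions output (the full table has 500).
horvath = [21.822423, 56.266752, 39.660930, 29.806215, 51.285108,
           37.539954, 74.975617, 33.083565, 91.428559, 75.005273]
hannum = [17.303025, 52.803410, 36.211078, 32.821235, 49.308239,
          32.494127, 68.639608, 33.906176, 74.870807, 77.630528]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_abs_diff(xs, ys):
    """Average absolute gap between paired predictions."""
    return mean([abs(x - y) for x, y in zip(xs, ys)])


r = pearson(horvath, hannum)
mad = mean_abs_diff(horvath, hannum)
print(f"Pearson r: {r:.3f}")
print(f"Mean absolute difference: {mad:.2f}")
```

On the full `predictions` DataFrame, pandas' `predictions.corr()` computes the same Pearson correlation across all samples.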

Total running time of the script: (0 minutes 29.987 seconds)

Estimated memory usage: 15204 MB
