Note

This page is reference documentation. It explains only the class signature, not how to use the class. Please refer to the user guide for the big picture.

biolearn.data_library.DataLibrary

class biolearn.data_library.DataLibrary(library_file=None, cache=None)

Manages a collection of data sources for biomarkers research.

The DataLibrary class is responsible for loading, storing, and retrieving data sources. Data sources are defined in a library file, and new sources can be added at runtime. Currently, DNA methylation data from GEO is supported.

__init__(library_file=None, cache=None)

Initializes the DataLibrary instance with an optional library file and cache mechanism.

Parameters:
  • library_file (str, optional) – The path to the library file. If None, the default biolearn library file is loaded.

  • cache (object, optional) – An object that adheres to the caching interface used in the caching module. If None, the default cache is used. This cache is used by all data sources returned from the library.

load_sources(library_file)

Loads data sources from the given library file and appends them to the current set of data sources.

Parameters:

library_file (str) – The file path of the library file to load data sources from.

get(source_id)

Retrieves a data source by its identifier.

Parameters:

source_id (str) – The identifier of the data source to retrieve.

Returns:

The data source with the given identifier if found, otherwise None.
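Because get returns None rather than raising when the identifier is unknown, callers may want to check the result before using it. The following is a minimal sketch of that pattern, not part of biolearn: StubLibrary and require_source are hypothetical names introduced here, and the stub only imitates the documented get(source_id) contract so the example runs without the real library.

```python
class StubLibrary:
    """Hypothetical stand-in exposing only the documented get(source_id) method."""

    def __init__(self, sources):
        self._sources = dict(sources)

    def get(self, source_id):
        # Imitates the documented contract: the source if found, otherwise None.
        return self._sources.get(source_id)


def require_source(library, source_id):
    """Return the data source for source_id, or raise KeyError if get() returns None."""
    source = library.get(source_id)
    if source is None:
        raise KeyError(f"No data source with id {source_id!r}")
    return source


# Placeholder identifier and value, used only for illustration.
library = StubLibrary({"example_id": "example source"})
found = require_source(library, "example_id")
```

The helper relies only on get, so under that assumption it should work the same way with a real DataLibrary instance.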

lookup_sources(organism=None, format=None)

Looks up data sources based on the specified organism and/or format.

Parameters:
  • organism (str, optional) – The organism to filter the data sources by.

  • format (str, optional) – The format to filter the data sources by.

Returns:

A list of data sources that match the specified organism and format criteria.
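The optional arguments suggest that a criterion left as None is simply not applied. The sketch below illustrates that reading with stand-in objects. It is an assumption about the filtering semantics, not the library's implementation: StubSource, its field names, the lookup function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StubSource:
    """Hypothetical stand-in for a data source; the field names are assumptions."""
    id: str
    organism: str
    format: str


def lookup(sources, organism=None, format=None):
    # A criterion left as None is treated as "do not filter on this field".
    return [
        s for s in sources
        if (organism is None or s.organism == organism)
        and (format is None or s.format == format)
    ]


# Placeholder values, chosen only to exercise the filter.
sources = [
    StubSource("a", "human", "Illumina"),
    StubSource("b", "mouse", "Illumina"),
    StubSource("c", "human", "other"),
]
human_ids = [s.id for s in lookup(sources, organism="human")]
both_ids = [s.id for s in lookup(sources, organism="human", format="Illumina")]
all_ids = [s.id for s in lookup(sources)]
```

With both arguments omitted, this sketch returns every source, which matches the "and/or" wording of the method description.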

search(**criteria)

Search and preview metadata across all available datasets without loading them.

This method allows you to explore what datasets are available and their metadata characteristics before deciding which ones to load. It’s particularly useful for discovering datasets that match specific criteria like sex, age, or other metadata fields.

Parameters:

criteria (keyword arguments) –

Keyword arguments for filtering datasets. Common filters include:

  • sex (str): Filter by sex (“male”, “female”, “unknown”)

  • min_age (float): Minimum age threshold

  • max_age (float): Maximum age threshold

Returns:

A DataFrame with columns including ‘series_id’ and available metadata fields for each matching dataset.

Return type:

pandas.DataFrame

Examples

>>> from biolearn.data_library import DataLibrary
>>> # Find all datasets with female subjects
>>> library = DataLibrary()
>>> female_datasets = library.search(sex="female")
>>> # Find datasets with elderly subjects (70+ years)
>>> elderly_datasets = library.search(min_age=70)
>>> # Find male datasets with subjects over 50
>>> male_elderly = library.search(sex="male", min_age=50)
>>> # View available metadata fields
>>> all_datasets = library.search()
>>> print(all_datasets.columns.tolist())

Notes

Sex encoding follows the DNA Methylation Array Data Standard:

  • 0 = female

  • 1 = male

  • NaN = unknown/missing
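When working with numerically encoded sex values, it can help to translate them back to labels. The following is a small sketch of such a mapping, based only on the encoding listed above. The decode_sex helper is hypothetical and not part of biolearn, and treating None or any other unexpected value as "unknown" is an assumption made here for simplicity.

```python
import math


def decode_sex(value):
    """Translate the encoding 0 = female, 1 = male, NaN = unknown into a label."""
    # NaN never compares equal to anything, so it needs an explicit check.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "unknown"
    if value == 0:
        return "female"
    if value == 1:
        return "male"
    # Assumption: any value outside the documented encoding is reported as unknown.
    return "unknown"


labels = [decode_sex(v) for v in (0, 1, float("nan"))]
```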

Examples using biolearn.data_library.DataLibrary

Training an ElasticNet model
