feat: use __doc__ as dataset description #3021

Draft
dhairya-pandya wants to merge 2 commits into pgmpy:dev from dhairya-pandya:reference-dataset-description

Conversation

@dhairya-pandya
Contributor

The following checklist is mandatory.

Your PR will be closed if you remove the checklist or do not answer the questions to a satisfactory level. Use of LLMs is strictly forbidden for any part of this checklist (including for improving language), and will result in a ban if we find any use of LLMs.

Your checklist for this pull request

  • Have you followed all the steps from our Contributing Guide?
  • Does the PR fully address the linked issue and is within its defined scope? If you are still working on the PR, mark it as draft.
  • Are all the GitHub Actions checks passing? If not, mark your PR as draft while you fix it.

Please answer the following questions:

  • Did you use an LLM for any assistance with this PR? Please describe in detail (around a paragraph) how and what you used it for?
    [Please Answer Here]

  • What steps have you taken to verify that the changes correctly address the issue? And what edge cases have you considered? Other than running tests, what else have you verified?
    [Please Answer Here]

  • Has the LLM added try-except blocks? They will need to be removed; any error handling must be explicit.
    [Please Answer Here]

  • Have you used LLM for generating tests? They need to be compressed into a smaller number of tests without reducing coverage.
    [Please Answer Here]

Issue number(s) that this pull request fixes

List of changes to the codebase in this pull request

  • Instead of trying to "parse" references, we now take the class's raw docstring (__doc__) and assign it directly to a new description field on the Dataset object, as discussed in the PR [ENH]: Add get_reference() for programmatic access to dataset/model citations #2684, inside the load_datase
      return Dataset(
          name=name,
          data=target_cls.load_dataframe(),
          expert_knowledge=target_cls.load_expert_knowledge(),
          ground_truth=target_cls.load_ground_truth(),
          description=target_cls.__doc__,
          tags=target_cls.get_class_tags(),
      )
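
A minimal, self-contained sketch of the change described above, written as an illustration rather than the actual pgmpy code. The `Dataset` stand-in is reduced to a few fields, and `ToyDataset` and `load_dataset_sketch` are hypothetical names introduced only for this example; the point is that the class docstring is passed through unchanged as `description`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Dataset:
    # Simplified stand-in for the real Dataset object (assumption):
    # only the fields needed to show the description hand-off.
    name: str
    data: Any
    description: Optional[str] = None
    tags: dict = field(default_factory=dict)


class ToyDataset:
    """Toy dataset used only for this illustration.

    References
    ----------
    Placeholder citation text.
    """

    @classmethod
    def load_dataframe(cls):
        # A list of rows stands in for a DataFrame to keep the sketch dependency-free.
        return [[0, 1], [1, 0]]

    @classmethod
    def get_class_tags(cls):
        return {"n_variables": 2}


def load_dataset_sketch(target_cls, name):
    # Hypothetical loader: the raw docstring becomes the description, no parsing.
    return Dataset(
        name=name,
        data=target_cls.load_dataframe(),
        description=target_cls.__doc__,
        tags=target_cls.get_class_tags(),
    )


ds = load_dataset_sketch(ToyDataset, "toy")
print(ds.description)
```

Because the docstring is passed through as-is, any "References" section written in the class docstring is available to users without a separate citation parser.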
    

Copilot AI review requested due to automatic review settings March 17, 2026 17:13
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a human-readable dataset description sourced directly from the dataset class docstring (__doc__) and surfaces it via the Dataset object.

Changes:

  • Add description field to Dataset and include it in Dataset.__str__.
  • Populate Dataset.description in load_dataset() from target_cls.__doc__ and update docs example.
  • Refactor load_model() lookup via a helper and align the “available models” error message + tests.
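
The lookup-helper refactor mentioned in the last bullet could look roughly like the sketch below. The PR page does not show the body of `_find_model_class()`, so everything here is an assumption: the registry dict, the placeholder model classes, the function name `find_model_class`, and the exact wording of the error message are all hypothetical.

```python
class AsiaModel:
    """Placeholder model class for this sketch."""


class CancerModel:
    """Placeholder model class for this sketch."""


# Hypothetical registry mapping model names to classes (assumption).
_REGISTRY = {"asia": AsiaModel, "cancer": CancerModel}


def find_model_class(name):
    # Look up the class by name; fail explicitly with the list of valid names.
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Model '{name}' not found. Available models: {available}")
    return cls


print(find_model_class("asia").__name__)
```

Keeping the lookup in one helper means the "available models" message is built in a single place, which is what lets the test assert on one expected string.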

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File | Description
pgmpy/datasets/_base.py | Introduces Dataset.description, prints it, and assigns it from __doc__ in load_dataset().
pgmpy/example_models/_base.py | Adds _find_model_class() helper and updates load_model() error handling/message.
pgmpy/tests/test_datasets/test_datasets.py | Minor formatting-only change (blank line).
pgmpy/tests/test_example_models/test_example_models.py | Updates expected error message string for load_model().
pgmpy/utils/utils.py | Formatting-only change (blank line).

Comment on lines 138 to 143
    def test_invalid_tag():
        with pytest.raises(ValueError, match="Unrecognized filter argument"):
            list_datasets(is_paraterized=True)  # typo

        with pytest.raises(ValueError, match="Unrecognized filter argument"):
            list_datasets(num_samples=100)  # wrong key name entirely

    data: pd.DataFrame
    expert_knowledge: Optional[ExpertKnowledge] = None
    ground_truth: Optional[DAG] = None
    description: str | None = None
Comment on lines 32 to 37
      def __str__(self) -> str:
          return (
              f"Dataset(name={self.name}, \n data=DataFrame of size: {self.data.shape}, \n "
-             f"expert_knowledge={self.expert_knowledge}, \n ground_truth={self.ground_truth}, \n tags={self.tags})"
+             f"expert_knowledge={self.expert_knowledge}, \n ground_truth={self.ground_truth}, \n "
+             f"description={self.description}, \n tags={self.tags})"
          )
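
Two edge cases follow from typing the field as `str | None` and printing it in `__str__`: a class without a docstring has `__doc__` set to `None`, and on Python versions before 3.13 a raw docstring keeps its source indentation. The sketch below is not part of this PR, which assigns `__doc__` unchanged; it only illustrates how the standard-library `inspect.cleandoc` could normalize the text if that were wanted. `WithDoc`, `WithoutDoc`, and `describe` are hypothetical names.

```python
import inspect


class WithDoc:
    """Summary line.

    More detail on an indented line.
    """


class WithoutDoc:
    pass


def describe(cls):
    # Hypothetical helper: keep None for classes without a docstring,
    # otherwise strip the common indentation and surrounding blank lines.
    doc = cls.__doc__
    return None if doc is None else inspect.cleandoc(doc)


print(repr(describe(WithDoc)))
print(describe(WithoutDoc))
```

Whether to store the raw or the cleaned docstring is a design choice; the raw form matches what the PR describes, while the cleaned form prints more predictably across Python versions.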
@dhairya-pandya force-pushed the reference-dataset-description branch from b2ecbbc to 2d652fe March 17, 2026 19:04
@ankurankan
Member

Marking this draft till it is ready for review.

@ankurankan marked this pull request as draft March 17, 2026 19:52
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.64%. Comparing base (43e76b0) to head (2d652fe).
⚠️ Report is 2 commits behind head on dev.

Additional details and impacted files

@@           Coverage Diff           @@
##              dev    #3021   +/-   ##
=======================================
  Coverage   95.64%   95.64%           
=======================================
  Files         504      504           
  Lines       29111    29117    +6     
=======================================
+ Hits        27844    27850    +6     
  Misses       1267     1267           
Files with missing lines | Coverage Δ
pgmpy/datasets/_base.py | 94.28% <100.00%> (+0.03%) ⬆️
pgmpy/example_models/_base.py | 95.58% <100.00%> (+0.35%) ⬆️
...y/tests/test_example_models/test_example_models.py | 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

Development

Successfully merging this pull request may close these issues.

[ENH] Figure out a better way to allow uses to access references for example datasets and example models

3 participants