Repodata patching
Oh no, your favorite package, nanoqc, pulls in bokeh as a dependency, but bokeh
recently had a major version change that breaks backward compatibility.
That wouldn’t be a problem, except nanoqc doesn’t restrict the range of bokeh
versions that can be installed. Thus you, and everyone else in the world, will
constantly get non-functional installs whenever creating new environments with
nanoqc. Sure, you can get around this with conda create -n nanoqc nanoqc bokeh=2.4.3,
but that still leaves things broken for everyone else. Another
option would be to update the nanoqc recipe in bioconda-recipes, and this
should also be done if the latest version is affected, but it would not
fix the existing, earlier nanoqc packages that also do not
restrict bokeh.
Wouldn’t it be nice if you could just fix the nanoqc package, in place, such that it magically contains a restriction on the version of bokeh that it depends on? It turns out you CAN do this with repodata patching.
What is repodata?
Repodata is a JSON file that contains a variety of information for each package
in Bioconda. There is one for each architecture (linux-64, linux-aarch64, osx-64, osx-arm64, noarch, etc.)
and they’re hosted in the bioconda channel. For example, the noarch repodata.json file is
available here.
Let’s take a look at what sorts of things are stored in this file for a
single package:
{
  "info": {
    "subdir": "noarch"
  },
  "packages": {
    ... many packages ..
    "nanoqc-0.9.4-py_0.tar.bz2": {
      "build": "py_0",
      "build_number": 0,
      "depends": [
        "biopython",
        "bokeh",
        "numpy",
        "python >=3"
      ],
      "license": "MIT License",
      "license_family": "MIT",
      "md5": "ebe00c0c2390252a67f5f6419dd4db80",
      "name": "nanoqc",
      "noarch": "python",
      "sha256": "32d12881d92396cf85274268baf82d400ae353edf370096c0ddf559993807ce8",
      "size": 9867,
      "subdir": "noarch",
      "timestamp": 1592396392980,
      "version": "0.9.4"
    },
  },
  "packages.conda": {},
  "removed": [],
  "repodata_version": 1
}
This is what’s used by conda and mamba for determining what packages exist
and their dependencies. Note that the exact fields have changed over time, with
older packages lacking things like the timestamp value.
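To see which records a patch would need to touch, it can help to scan a repodata file for a given package name. The following is a minimal sketch, assuming the JSON is shaped like the excerpt above. The helper name depends_for is made up for this example, and the embedded text is a trimmed stand-in for a real download.

```python
import json

# A tiny stand-in for a downloaded repodata.json, shaped like the excerpt above.
REPODATA_TEXT = """
{
  "info": {"subdir": "noarch"},
  "packages": {
    "nanoqc-0.9.4-py_0.tar.bz2": {
      "name": "nanoqc",
      "version": "0.9.4",
      "depends": ["biopython", "bokeh", "numpy", "python >=3"],
      "timestamp": 1592396392980
    }
  },
  "packages.conda": {}
}
"""


def depends_for(repodata, name):
    """Map each filename of package `name` to its list of dependencies."""
    found = {}
    # Records can live under "packages" (.tar.bz2) or "packages.conda" (.conda).
    for section in ("packages", "packages.conda"):
        for filename, record in repodata.get(section, {}).items():
            if record.get("name") == name:
                found[filename] = record.get("depends", [])
    return found


repodata = json.loads(REPODATA_TEXT)
for filename, deps in depends_for(repodata, "nanoqc").items():
    print(filename, deps)
```

Matching on the name field rather than on the filename prefix avoids picking up unrelated packages whose names merely begin with the same letters.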
In order to add/remove/update a dependency (like restricting the bokeh
version), we need to modify the depends list in each package that should be
modified. We can’t directly modify these files; instead, we need to patch them
with the bioconda-repodata-patches package.
The bioconda-repodata-patches package
The bioconda-repodata-patches package itself contains a set of patches for
each architecture:
.
├── linux-64
│   └── patch_instructions.json
├── noarch
│   └── patch_instructions.json
└── osx-64
    └── patch_instructions.json
For an individual package, the JSON file will eventually contain the updated dependencies.
For the nanoqc example above, that will eventually look like:
"nanoqc-0.9.4-py_0.tar.bz2": {
  "depends": [
    "biopython",
    "bokeh >=2.4,<3",
    "numpy",
    "python >=3"
  ]
},
Thankfully, you do not need to manually update each package and, in fact, if you
do, your changes will almost certainly be lost over time. Instead, within the
bioconda-repodata-patches recipe, edit and run the gen_patch_json.py script
as described below.
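To make the effect of a patch concrete, the sketch below overlays per-package patch entries onto repodata records. It is a simplification for illustration and not the code that conda's tooling actually runs. It assumes patch_instructions.json keeps its entries under a packages key, and the function name apply_patch is hypothetical.

```python
import copy


def apply_patch(repodata, instructions):
    """Return a copy of `repodata` with patched fields overriding the originals."""
    patched = copy.deepcopy(repodata)
    for filename, changes in instructions.get("packages", {}).items():
        record = patched["packages"].get(filename)
        if record is None:
            continue  # the patch mentions a file this repodata does not contain
        record.update(changes)
    return patched


repodata = {
    "packages": {
        "nanoqc-0.9.4-py_0.tar.bz2": {
            "name": "nanoqc",
            "depends": ["biopython", "bokeh", "numpy", "python >=3"],
        }
    }
}
instructions = {
    "packages": {
        "nanoqc-0.9.4-py_0.tar.bz2": {
            "depends": ["biopython", "bokeh >=2.4,<3", "numpy", "python >=3"]
        }
    }
}

patched = apply_patch(repodata, instructions)
print(patched["packages"]["nanoqc-0.9.4-py_0.tar.bz2"]["depends"])
```

Only the fields named in the patch are replaced, so the name, checksums and other metadata of the record stay as they were.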
Modifying gen_patch_json.py
We will be working in the bioconda-repodata-patches recipe.
The gen_patch_json.py file is borrowed from conda-forge and has one function
that should typically be modified: _gen_new_index. Within this function, each
record in repodata.json is iterated over and changes that should be made to it
are returned to a comparison function. So if you make changes within this
function, they’ll end up in the appropriate patch_instructions.json file.
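The comparison step can be pictured as a per-record diff: any field whose value differs between the original record and the edited one ends up in the patch. The following is a rough sketch of that idea, not the actual code in gen_patch_json.py, and diff_records is a hypothetical name.

```python
def diff_records(original, edited):
    """Collect, per filename, the fields whose values differ in `edited`."""
    instructions = {}
    for filename, new_record in edited.items():
        old_record = original.get(filename, {})
        changes = {
            key: value
            for key, value in new_record.items()
            if key not in old_record or old_record[key] != value
        }
        if changes:  # unchanged records produce no patch entry
            instructions[filename] = changes
    return instructions


original = {
    "nanoqc-0.9.4-py_0.tar.bz2": {"name": "nanoqc", "depends": ["bokeh", "numpy"]},
    "other-1.0-py_0.tar.bz2": {"name": "other", "depends": ["numpy"]},
}
edited = {
    "nanoqc-0.9.4-py_0.tar.bz2": {"name": "nanoqc", "depends": ["bokeh >=2.4,<3", "numpy"]},
    "other-1.0-py_0.tar.bz2": {"name": "other", "depends": ["numpy"]},
}

print(diff_records(original, edited))
```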
Let’s use the nanoqc example to see how we can do this.
We first need to come up with a strategy for finding and updating the nanoqc
packages. A process like that might look like the following:
1. Find any package whose name starts with nanoqc.
2. See if it has bokeh listed as a dependency.
3. Change that dependency to bokeh >=2.4,<3.
One thing we should think about is what will happen if a new version of nanoqc
comes out that IS compatible with new versions of bokeh. We certainly
don’t want to continue adding this version constraint to new releases. To avoid
this, we can use the timestamp, so we only update packages that currently
exist. The code for this might look like the following:
# Nanoqc requires bokeh >=2.4,<3
if record_name.startswith('nanoqc') and has_dep(record, "bokeh") and record.get('timestamp', 0) < 1592397000000:
    for i, dep in enumerate(deps):
        if dep.startswith('bokeh'):
            deps[i] = 'bokeh >=2.4,<3'
            break
So, we’re only modifying packages that start with nanoqc, have bokeh as a
dependency and are sufficiently old.
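For experimenting outside gen_patch_json.py, the same rule can be written as a self-contained sketch. Several pieces here are assumptions: has_dep is a simplified stand-in written for this example (the real helper in the script may behave differently), patch_nanoqc is a hypothetical wrapper, and the second record is invented to show the timestamp guard. Unlike the snippet above, this version compares the dependency name exactly rather than by prefix.

```python
from datetime import datetime, timezone

# Cutoff used in the snippet above, in milliseconds since the epoch.
CUTOFF_MS = 1592397000000


def has_dep(record, name):
    """Simplified stand-in: True if `name` is among the record's dependencies."""
    return any(dep.split(" ")[0] == name for dep in record.get("depends", []))


def patch_nanoqc(record_name, record):
    """Pin bokeh in place for nanoqc records older than the cutoff."""
    deps = record.get("depends", [])
    if (
        record_name.startswith("nanoqc")
        and has_dep(record, "bokeh")
        and record.get("timestamp", 0) < CUTOFF_MS
    ):
        for i, dep in enumerate(deps):
            # Compare the package name exactly so a name that merely
            # starts with "bokeh" is left alone.
            if dep.split(" ")[0] == "bokeh":
                deps[i] = "bokeh >=2.4,<3"
                break


old = {"depends": ["biopython", "bokeh", "numpy", "python >=3"],
       "timestamp": 1592396392980}
new = {"depends": ["bokeh", "python >=3"],
       "timestamp": CUTOFF_MS + 1}  # invented record, newer than the cutoff

patch_nanoqc("nanoqc-0.9.4-py_0.tar.bz2", old)
patch_nanoqc("nanoqc-9.9.9-py_0.tar.bz2", new)

print(old["depends"])  # bokeh is now pinned
print(new["depends"])  # left untouched
# Show the cutoff as a human-readable UTC time.
print(datetime.fromtimestamp(CUTOFF_MS / 1000, tz=timezone.utc).isoformat())
```

Because the timestamp lookup defaults to 0, records that lack a timestamp are treated as old and get the constraint too.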
After making this change, we then need to run gen_patch_json.py to actually
generate the new patch files. This is why the patch files themselves should not
be modified manually: the changes would be overwritten the next time this script
is run.
Confirming the patch is correct
Now that the patches have been made, it’s good to check that they actually
contain the right changes before proceeding. To do this, we can use the
show_diff.py script. In the example above, this would produce:
noarch::nanoqc-0.9.1-py_0.tar.bz2
-    "bokeh",
+    "bokeh >=2.4,<3",
noarch::nanoqc-0.9.2-py_0.tar.bz2
-    "bokeh",
+    "bokeh >=2.4,<3",
noarch::nanoqc-0.9.4-py_0.tar.bz2
-    "bokeh",
+    "bokeh >=2.4,<3",
linux-64::nanoqc-0.6.0-py35_0.tar.bz2
-    "bokeh",
+    "bokeh >=2.4,<3",
linux-64::nanoqc-0.6.0-py36_0.tar.bz2
-    "bokeh",
+    "bokeh >=2.4,<3",
... and many more ...
Note that you must have conda-bld in your path for this to work.
As long as all of the packages that should be updated are listed there, then these changes are ready for committing and pushing. Don’t be surprised if additional packages are also updated. It’s not unusual for bioconductor package repodata to get updated over time, for example.
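One way to double-check that nothing was missed is to look for nanoqc records that would still carry an unconstrained bokeh after patching. The sketch below assumes the repodata and the generated patch have both been loaded into dicts with a packages key. The helper unpatched_nanoqc is hypothetical and not part of show_diff.py, and the sample records are trimmed for brevity.

```python
def unpatched_nanoqc(repodata, instructions):
    """List nanoqc files that would still depend on an unconstrained bokeh."""
    patched = instructions.get("packages", {})
    missed = []
    for filename, record in repodata.get("packages", {}).items():
        if record.get("name") != "nanoqc":
            continue
        # A patched "depends" list takes precedence over the original one.
        deps = patched.get(filename, {}).get("depends", record.get("depends", []))
        if "bokeh" in deps:  # exact match means a bare, unpinned dependency
            missed.append(filename)
    return sorted(missed)


repodata = {
    "packages": {
        "nanoqc-0.9.2-py_0.tar.bz2": {"name": "nanoqc", "depends": ["bokeh", "numpy"]},
        "nanoqc-0.9.4-py_0.tar.bz2": {"name": "nanoqc", "depends": ["bokeh", "numpy"]},
    }
}
instructions = {
    "packages": {
        "nanoqc-0.9.4-py_0.tar.bz2": {"depends": ["bokeh >=2.4,<3", "numpy"]}
    }
}

print(unpatched_nanoqc(repodata, instructions))
```

Releases newer than the timestamp cutoff are expected to appear in this list, since they are deliberately left unconstrained.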
Please ping the core team in gitter when proposing changes to this package!
