Strict YAML deserialization with marshmallow

The problem

  • I want to read a not-so-simple config from a .yaml file.
  • I have the config structure described as dataclasses.
  • I want all type checks to be performed, and an exception to be raised if the data is invalid.

So basically I want something like this:

from typing import Any, Type


def strict_load_yaml(yaml: str, loaded_type: Type[Any]):
    """
    Here is some magic
    """
    pass

And then use it like this:

@dataclass
class MyConfig:
    """
    Here is object tree
    """
    pass

try:
    # Open for reading: mode "w" would truncate the config file.
    with open("config.yaml") as config_file:
        config = strict_load_yaml(config_file.read(), MyConfig)
except Exception:
    logging.exception("Config is invalid")

Config classes

Here is my config.py file with example dataclasses:

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


@dataclass
class BattleStationConfig:
    @dataclass
    class Processor:
        core_count: int
        manufacturer: str

    processor: Processor
    memory_gb: int
    led_color: Optional[Color] = None

Solution that didn’t work

This is a very common pattern, right? It must be very easy: just import the usual YAML library and the problem is solved?

So I imported PyYAML and called its load function:

from pprint import pprint

from yaml import load, SafeLoader


yaml = """
processor:
  core_count: 8
  manufacturer: Intel
memory_gb: 8
led_color: red
"""

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

and I got:

{'led_color': 'red',
 'memory_gb': 8,
 'processor': {'core_count': 8, 'manufacturer': 'Intel'}}

The YAML loaded just fine, but the result is a dict. No problem, I can unpack it into the constructor as keyword arguments:

parsed_config = BattleStationConfig(**loaded)
pprint(parsed_config)

and the result is:

BattleStationConfig(processor={'core_count': 8, 'manufacturer': 'Intel'}, memory_gb=8, led_color='red')

Wow! Easy! But… wait. Is the processor field a dict? Damn it.

Python doesn't perform type checking in the constructor and doesn't convert the nested dict into a Processor instance. Well, it's time to go to Stack Overflow.
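To make the point concrete, here is a minimal sketch using only the standard library. The classes are trimmed-down copies of the ones above (Processor is moved to the top level for brevity), and the "wrong" values are made up for illustration. It shows that the generated constructor stores whatever it is given, so annotations alone never reject bad data.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    core_count: int
    manufacturer: str


@dataclass
class BattleStationConfig:
    processor: Processor
    memory_gb: int


# Annotations are not enforced at runtime: every value below has the
# "wrong" type, yet the generated __init__ accepts all of them as-is.
config = BattleStationConfig(
    processor={"core_count": "eight", "manufacturer": 42},
    memory_gb="lots",
)

print(type(config.processor).__name__)  # dict
print(type(config.memory_gb).__name__)  # str
```

So any strict loader has to walk the type annotations itself, or delegate that to a library that does.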

Solution that requires YAML tags and almost works

I read Stack Overflow answers and the PyYAML documentation and found out that you can mark up your YAML document with tags for types. Your classes must be descendants of YAMLObject, so my config_with_tag.py looks like this:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

from yaml import YAMLObject, SafeLoader


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


@dataclass
class BattleStationConfig(YAMLObject):
    yaml_tag = "!BattleStationConfig"
    yaml_loader = SafeLoader

    @dataclass
    class Processor(YAMLObject):
        yaml_tag = "!Processor"
        yaml_loader = SafeLoader

        core_count: int
        manufacturer: str

    processor: Processor
    memory_gb: int
    led_color: Optional[Color] = None

And the loading code:

from pprint import pprint

from yaml import load, SafeLoader

from config_with_tag import BattleStationConfig


yaml = """
--- !BattleStationConfig
processor: !Processor
  core_count: 8
  manufacturer: Intel
memory_gb: 8
led_color: red
"""

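# Importing the class is what registers its tags with SafeLoader;
# this line only references the import so it does not look unused.
a = BattleStationConfig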

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

And what do I get?

BattleStationConfig(processor=BattleStationConfig.Processor(core_count=8, manufacturer='Intel'), memory_gb=8, led_color='red')

Good. But my YAML is now full of tags and has lost its readability. And Color is still a string. So I can just add YAMLObject to its parent classes, right? No.

class Color(Enum, YAMLObject):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

This leads to:

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

I didn't find a quick way to resolve it. And I didn't want to add tags to my YAML, so I decided to keep looking for a solution.
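For anyone wondering where that error comes from: Enum and YAMLObject each bring their own metaclass, and Python refuses to create a class whose bases have unrelated metaclasses. The sketch below reproduces the situation with the standard library only. RegistryMeta and TaggedObject are hypothetical stand-ins for PyYAML's metaclass and YAMLObject, not the real classes.

```python
from enum import Enum


class RegistryMeta(type):
    """Hypothetical stand-in for the metaclass behind yaml.YAMLObject."""


class TaggedObject(metaclass=RegistryMeta):
    """Hypothetical stand-in for yaml.YAMLObject."""


conflict_message = None
try:
    # Enum's metaclass and RegistryMeta are unrelated (neither is a
    # subclass of the other), so Python cannot pick one metaclass
    # for the new class and raises before the body is even executed.
    class Color(Enum, TaggedObject):
        RED = "red"
except TypeError as error:
    conflict_message = str(error)

print(conflict_message)
```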

Solution with marshmallow

I found a recommendation to use marshmallow to parse a dict produced from a JSON object. I decided that case was the same as mine, only with JSON instead of YAML. So I tried the class_schema generator from marshmallow_dataclass to build a schema for my dataclass:

from pprint import pprint

from yaml import load, SafeLoader
from marshmallow_dataclass import class_schema

from config import BattleStationConfig


yaml = """
processor:
  core_count: 8
  manufacturer: Intel
memory_gb: 8
led_color: red
"""

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

BattleStationConfigSchema = class_schema(BattleStationConfig)

result = BattleStationConfigSchema().load(loaded)
pprint(result)

And I get:

marshmallow.exceptions.ValidationError: {'led_color': ['Invalid enum member red']}

So marshmallow wants the enum name, not the value. I can change my YAML to:

processor:
  core_count: 8
  manufacturer: Intel
memory_gb: 8
led_color: RED

And I get my ideally deserialized object:

BattleStationConfig(processor=BattleStationConfig.Processor(core_count=8, manufacturer='Intel'), memory_gb=8, led_color=<Color.RED: 'red'>)
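The name/value distinction is plain Enum behaviour, independent of marshmallow. As a small standard-library illustration (the variable names are mine), a member can be looked up either by its name with square brackets or by its value with a call:

```python
from enum import Enum


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


by_name = Color["RED"]   # lookup by member name
by_value = Color("red")  # lookup by member value

# Both lookups return the same member object.
print(by_name is by_value)  # True

# The value is not a valid name, so a name lookup with "red" fails.
name_lookup_failed = False
try:
    Color["red"]
except KeyError:
    name_lookup_failed = True

print(name_lookup_failed)  # True
```

The validation error above suggests that the schema does a name-based lookup by default, which is why "red" is rejected while "RED" is accepted.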

But I felt there had to be a way to use my original YAML. So I explored the marshmallow documentation and found the following lines:

Setting by_value=True. This will cause both dumping and loading to use the value of the enum.

It turns out you can pass this setting through the metadata dictionary of the field function from dataclasses, like this:

from dataclasses import dataclass, field


@dataclass
class BattleStationConfig:
    led_color: Optional[Color] = field(default=None, metadata={"by_value": True})

And I get the object parsed from my original YAML.

Magic function

After all that, I can assemble my magic function:

def strict_load_yaml(yaml: str, loaded_type: Type[Any]):
    schema = class_schema(loaded_type)
    return schema().load(load(yaml, Loader=SafeLoader))

This function may require additional setup in the dataclasses, but it solves my problem and doesn't require tags in the YAML.

Some words about ForwardRef

If you define your dataclasses with forward references (a string with the class name), marshmallow can get confused and fail to parse your classes.

For example, this configuration

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, ForwardRef


@dataclass
class BattleStationConfig:
    processor: ForwardRef("Processor")
    memory_gb: int
    led_color: Optional["Color"] = field(default=None, metadata={"by_value": True})

    @dataclass
    class Processor:
        core_count: int
        manufacturer: str


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

leads to

marshmallow.exceptions.RegistryError: Class with name 'Processor' was not found. You may need to import the class.

And if we move the Processor class up, above the fields that use it, marshmallow loses Color with the same error. So keep your classes free of forward references if possible.
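A rough way to see what a schema generator has to work with is to inspect the dataclass fields directly. The sketch below uses only the standard library and a trimmed copy of the configuration above, with a plain string annotation in place of ForwardRef. My reading of the RegistryError, and it is only an assumption, is that marshmallow has just this name to go on and cannot find a class registered under it.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


@dataclass
class BattleStationConfig:
    processor: "Processor"  # forward reference written as a string
    memory_gb: int
    led_color: Optional["Color"] = None

    @dataclass
    class Processor:
        core_count: int
        manufacturer: str


class Color(Enum):
    RED = "red"


# dataclasses stores annotations untouched, so the forward reference
# is still just the string "Processor", not the class itself.
field_types = {f.name: f.type for f in fields(BattleStationConfig)}

print(repr(field_types["processor"]))  # 'Processor'
print(field_types["memory_gb"] is int)  # True
```

Whoever consumes these fields has to turn the name back into a class on its own, and a nested class is not reachable by its bare name at module level.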

Code

All the code is available in the GitHub repository.
