lib/protocols/polymeric.ex

defprotocol Bio.Polymeric do
  @moduledoc """
  Define Polymeric interface of a sequence type.

  The `Bio.Polymeric` protocol allows us to define implementations
  of a `kmers/2` function. This is part of the approach to translating different
  polymers according to the nature of actual biological or chemical processes.

  The idea is that defining how a sequence is sub-divided into k-mers for
  enumeration is something that must occur for specific conversions. However,
  it's also something that you would not necessarily want to have to do every
  single time you applied the conversion.

  Essentially, each structural definition of a sequence will have some
  meaningful way of splitting it into a Kmer enumeration. This is used in all
  forms of computation, largely though, in conversions. For example, DNA -> RNA
  conversions require element-wise (k=1) conversion functions. Whereas, RNA ->
  Amino Acid requires codon-wise (k=3).

  In order to preserve the standard interface defined by `Bio.Polymer`
  and `Bio.Polymer.convert/3`, we define this as a protocol.

  For a valid return, the consideration should be:
  1. The enumerable returned (`Enum.t()`) should contain the information
  required to perform a conversion. Examples can be found in the
  `Bio.Sequence.DnaStrand` and `Bio.Sequence.DnaDoubleStrand` modules. There,
  you'll see that for a simple sequence, it makes sense to simple iterate the
  grouped chunks. Whereas the double stranded sequence returns a list of tuples
  of chunks.
  2. The `map()` should contain relevant data for the re-capitulation of a
  struct. So if you're converting a `DnaStrand`, you should consider passing
  back out the `label` field. This allows the conversion function to attach it
  to the newly constructed type.

  The error mode for various sequences will vary, but generally the idea of
  mismatching the sequence length to the `k` value will hold. For the build in
  `Bio.Sequence.DnaStrand`, this is merely the even division. For the
  `Bio.Sequence.DnaDoubleStrand` it's more complicated. That type assumes that
  you want to see pairs of aggregated values (top/bottom), but they may be
  offset. So you can't just look at if the values are empty.

  Instead, it looks to see if there can be complete aggregates, even if they're
  paired with empty space.

  Keep these considerations in mind implementing your own `Polymeric` types.

  In addition to the enumeration of the elements, this also makes sense as the
  location for defining validity. That is, there are two further methods
  `valid?/2` and `validate/2`.

  These make the assumption that a relevant `alphabet` is defined for the
  polymer. For example, [IUPAC DNA
  Codes](https://en.wikipedia.org/wiki/Nucleic_acid_notation).

  Your implementation of `valid?/2` and `validate/2` should prefer the alphabet
  given to them. This will be respected by the `Bio.Polymer.valid?/2` and
  `Bio.Polymer.validate/2` function. Essentially, when used, these will always
  prefer the given value, but will default back to the value attached to the
  type if it is defined.

  # Example
      iex>alias Bio.Sequence.Alphabets.Dna, as: Alpha
      ...>Bio.Sequence.DnaStrand.new("atgcnn", alphabet: Alpha.common())
      ...>|> Bio.Polymer.valid?()
      false

      iex>alias Bio.Sequence.Alphabets.Dna, as: Alpha
      ...>Bio.Sequence.DnaStrand.new("atgcnn", alphabet: Alpha.common())
      ...>|> Bio.Polymer.valid?(Alpha.with_n())
      true

  > #### Note {: .neutral}
  > In case neither is defined, the `validate/2` function will return an error
  > tuple, where the `valid?` will simply return false.

  The `validate/2` function behaves similarly, but it should return a new struct
  with the `valid?` key set.

  # Example
      iex>alias Bio.Sequence.Alphabets.Dna, as: Alpha
      ...>Bio.Sequence.DnaStrand.new("atgcnn", alphabet: Alpha.common())
      ...>|> Bio.Polymer.validate()
      {
        :error,
        [{:mismatch_alpha, "n", 4}, {:mismatch_alpha, "n", 5}]
      }

      iex>alias Bio.Sequence.Alphabets.Dna, as: Alpha
      ...>Bio.Sequence.DnaStrand.new("atgcnn", alphabet: Alpha.common())
      ...>|> Bio.Polymer.validate(Alpha.with_n())
      {
        :ok,
        %Bio.Sequence.DnaStrand{
          sequence: "atgcnn",
          length: 6,
          alphabet: "ACGTNacgtn",
          valid?: true
        }
      }


  > #### Note {: .neutral}
  > The applied alphabet is the one that is returned in the struct. This ensures
  > that you are correctly tracking what a type is valid _for_. So be careful
  > about assumptions.
  """

  @doc """
  Split a polymer into chunks of `k` size
  """
  @spec kmers(struct(), integer()) :: {:ok, Enum.t(), map()} | {:error, :seq_len_mismatch}
  def kmers(given, k)

  @doc """
  Determine if the content of a polymer matches an alphabet
  """
  @spec valid?(struct(), String.t()) :: true | false
  def valid?(given, alphabet)

  @doc """
  Validate if the content of a polymer matches an alphabet, returning an updated
  struct.

  Depends on the struct implementing both an `alphabet` and `valid?` keys.
  """
  @spec validate(struct(), String.t() | nil) ::
          {:ok, struct()}
          | {:error, {atom(), String.t(), integer()}}
          | {:error, [{atom(), String.t(), integer()}]}
  def validate(given, alphabet \\ nil)
end