lib/explorer/datasets.ex

defmodule Explorer.Datasets do
  alias Explorer.DataFrame

  @datasets_dir Path.join(File.cwd!(), "datasets")

  @doc """
  CO2 emissions from fossil fuels since 2010, by country

  ## Citation

    Boden, T.A., G. Marland, and R.J. Andres. 2013. Global, Regional, and National Fossil-Fuel CO2
    Emissions. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S.
    Department of Energy, Oak Ridge, Tenn., U.S.A. doi 10.3334/CDIAC/00001_V2013
  """
  def fossil_fuels, do: read_dataset!("fossil_fuels")

  @doc """
  Wine Dataset - The data is the result of a chemical analysis of wines grown in the same
  region in Italy but derived from three different cultivars. The analysis determined the quantities
  of 13 constituents found in each of the three types of wines.

  Downloaded and modified from: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

  ## Citation

    Original Owners:
    Forina, M. et al, PARVUS -
    An Extendible Package for Data Exploration, Classification and Correlation.
    Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno,
    16147 Genoa, Italy.

    Wine. (1991). UCI Machine Learning Repository.
  """
  def wine, do: read_dataset!("wine")

  @doc """
  Iris Dataset - This classic dataset was collected by Edgar Anderson in 1936 and made famous by R. A. Fisher's 1936 paper.
  It consists of several measurements of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).

  Downloaded and modified from: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

  ## Citation

    Original Owners:
    R. A. Fisher (1936)
    The use of multiple measurements in taxonomic problems.
    Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x

    Iris. (1936). UCI Machine Learning Repository.
  """
  def iris, do: read_dataset!("iris")

  defp read_dataset!(name) do
    key = {:explorer_datasets, name}

    # Persistent term is used as a cache, in order to avoid
    # several calls to the filesystem. This is mostly useful
    # to speed up reads in tests.
    case :persistent_term.get(key, nil) do
      nil ->
        @datasets_dir
        |> Path.join("#{name}.csv")
        |> DataFrame.from_csv!()
        |> tap(&:persistent_term.put(key, &1))

      %DataFrame{} = df ->
        df
    end
  end
end