khisto: Khiops Histograms Tool
==============================
How to use
----------
Usage: khisto [VALUES] [HISTOGRAM]
Compute a histogram from the data in the VALUES file.
The resulting histogram is written to the HISTOGRAM file, with the lower bound, upper bound,
length, frequency, probability and density of each bin.
Available options are:
  -e  output a series of histograms by increasing accuracy for exploratory analysis purposes
  -j  produce all outputs in a single JSON file
  -h  display this help and exit
  -v  display version information and exit
The output histogram is as accurate and interpretable as possible. With the -e option, all
internally computed histograms are output by increasing accuracy. Each histogram of the series has
an index in its file name suffix (e.g. ".1"), and an additional file with the suffix ".series" is
produced, containing indicators per histogram. The -j option can be combined with the -e option to
get all outputs in one file.
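
The six quantities reported per bin are related to each other. The Python sketch below is only an
illustration, not khisto's own code. It assumes the usual histogram definitions: length is the
upper bound minus the lower bound, probability is the bin frequency divided by the total frequency,
and density is the probability divided by the length. The function name bin_statistics and the
sample bins are hypothetical.

```python
def bin_statistics(bounds, frequencies):
    """Return one dict per bin with the six quantities listed in this README.

    bounds: increasing bin boundaries, one more value than there are bins.
    frequencies: number of instances in each bin.
    Assumption: usual histogram definitions, as stated in the text above.
    """
    if len(bounds) != len(frequencies) + 1:
        raise ValueError("expected exactly one more bound than frequencies")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("the histogram must contain at least one instance")
    rows = []
    for lower, upper, frequency in zip(bounds[:-1], bounds[1:], frequencies):
        length = upper - lower
        probability = frequency / total
        rows.append({
            "lower": lower,
            "upper": upper,
            "length": length,
            "frequency": frequency,
            "probability": probability,
            "density": probability / length,
        })
    return rows


if __name__ == "__main__":
    # Three hypothetical bins: [0, 1], [1, 3] and [3, 4].
    for row in bin_statistics([0.0, 1.0, 3.0, 4.0], [2, 6, 2]):
        print(row)
```

With these definitions, the products of density and length sum to 1 over all bins.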
The indicators produced for the series of histograms are detailed in (Boulle, 2023):
- File name: histogram file name,
- Granularity: size (by power of two) of the elementary bins on which the intervals are built,
- Interval number: number of intervals of the histogram,
- Peak interval number: a "peak" is an interval whose density is greater than that of its preceding
  and following intervals,
- Spike interval number: a "spike" is a peak containing a single value,
- Empty interval number: an empty interval contains no instances,
- Level: indicator between 0 and 1, which evaluates the quality of the density estimate,
- Information rate: normalized level, between 0 and 1, with 1 for the most accurate and
interpretable histogram,
- Truncation epsilon: difference between the two closest consecutive values (e.g. 1 for integer data),
  used in the "truncation heuristic",
- Removed singularity number: number of intervals removed by the "singularity removal
  heuristic", where a "singularity" is a spike whose preceding and following intervals are empty,
- Raw: indicates that the histogram is not "interpretable", as it was obtained before the
  "truncation heuristic" and the "singularity removal heuristic" were applied.
Using the JSON format, the "truncation epsilon" and "removed singularity number" indicators are
given only once, since they are the same along the histogram series. The "raw" indicator, which
concerns at most the last histograms of the series, can be deduced from the "histogramNumber" and
"interpretableHistogramNumber" fields in the JSON format.
Examples
--------
Basic use:
khisto gaussian.txt gaussian_histogram.csv
Using the `-e` option:
khisto -e gaussian.txt gaussian_histogram.csv
In addition to the resulting histogram gaussian_histogram.csv, the following files are output:
gaussian_histogram.series.csv: synthetic indicators per histogram produced in the series
gaussian_histogram.1.csv: first histogram of the series
gaussian_histogram.2.csv: second histogram of the series
...
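
The naming scheme above can be reproduced in a few lines of Python, for instance to collect the
files of a series after a run. This is a sketch based only on the file names listed above. The
function name series_file_names is hypothetical, and the number of histograms in the series is not
known before running khisto, so it is passed in as a parameter.

```python
from pathlib import Path


def series_file_names(histogram_file, histogram_number):
    """List the extra file names expected next to histogram_file with the -e option.

    Assumption: the ".series" and ".<index>" parts are inserted before the
    file extension, as in the gaussian_histogram example of this README.
    """
    path = Path(histogram_file)
    stem, suffix = path.stem, path.suffix
    names = [path.with_name(f"{stem}.series{suffix}")]
    names += [
        path.with_name(f"{stem}.{index}{suffix}")
        for index in range(1, histogram_number + 1)
    ]
    return [str(name) for name in names]


if __name__ == "__main__":
    for name in series_file_names("gaussian_histogram.csv", 2):
        print(name)
```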
Using the `-j` option:
khisto -j gaussian.txt gaussian_histogram.json
khisto -j -e gaussian.txt gaussian_histogram.json
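
The same command lines can be assembled from a Python script. The sketch below uses only the -e and
-j options documented above. The helper name build_khisto_command is hypothetical, and the script
assumes that khisto is on the PATH and returns a non-zero exit code on failure, which this README
does not state.

```python
import shutil
import subprocess


def build_khisto_command(values_file, histogram_file, exploratory=False,
                         json_output=False, executable="khisto"):
    """Build the argument list for a khisto run (options first, then the two files)."""
    command = [executable]
    if json_output:
        command.append("-j")
    if exploratory:
        command.append("-e")
    command += [values_file, histogram_file]
    return command


if __name__ == "__main__":
    command = build_khisto_command(
        "gaussian.txt", "gaussian_histogram.json", exploratory=True, json_output=True
    )
    if shutil.which(command[0]) is None:
        # Executable not installed on this machine: only show the command.
        print("khisto not found on PATH; would run:", " ".join(command))
    else:
        # Assumption: a failed run is signalled by a non-zero exit code.
        subprocess.run(command, check=True)
```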
More examples are available in the samples directory of this package.
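
As a further illustration, some of the series indicators described under "How to use" can be
recomputed from a histogram. The Python sketch below follows the definitions given in this README:
a peak has a density greater than both neighbouring intervals, an empty interval contains no
instances, and the raw histograms are the last ones of the series. It is not khisto's own code, and
all function names are hypothetical. Two points are assumptions: the first and last intervals, which
have a single neighbour, are not counted as peaks, and the raw histograms are exactly those whose
index exceeds the "interpretableHistogramNumber" value.

```python
def count_peaks(densities):
    """Count intervals whose density is greater than that of both neighbours.

    Assumption: the first and last intervals are never counted as peaks.
    """
    return sum(
        1
        for i in range(1, len(densities) - 1)
        if densities[i] > densities[i - 1] and densities[i] > densities[i + 1]
    )


def count_empty(frequencies):
    """Count intervals that contain no instances."""
    return sum(1 for frequency in frequencies if frequency == 0)


def raw_flags(histogram_number, interpretable_histogram_number):
    """Return one boolean per histogram of the series, True meaning "raw".

    Assumption: interpretable histograms come first, raw ones last.
    """
    if not 0 <= interpretable_histogram_number <= histogram_number:
        raise ValueError("inconsistent histogram numbers")
    return [
        index > interpretable_histogram_number
        for index in range(1, histogram_number + 1)
    ]


if __name__ == "__main__":
    # Hypothetical densities and frequencies of a six-interval histogram.
    print(count_peaks([0.1, 0.5, 0.2, 0.0, 0.3, 0.1]))
    print(count_empty([10, 50, 20, 0, 30, 10]))
    print(raw_flags(5, 4))
```

Spikes and singularities are not covered here, since they depend on the number of distinct values
per interval, which is not one of the per-bin quantities listed in this README.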
Package Contents
----------------
- khisto.exe: executable for Windows
- khisto: executable for Linux (Ubuntu 20.04 and higher)
  grant execute permission before use (e.g. sudo chmod +x khisto)
- README.txt: this file
- LICENSE.txt: license file
- samples: directory with code and dataset samples
  - build_histogram.py: example Python script to build histograms and plot them
  - histogram_exploratory_analysis.py: example Python script to perform exploratory analysis based
    on histograms
  - data: datasets
    - gaussian.txt: values from a Gaussian distribution
    - levy.txt: values from a Levy distribution
    - adult_age.txt: values of the variable age of the dataset Adult
      https://archive.ics.uci.edu/ml/datasets/adult
    - moon_crater_radius.txt: values of the radius variable of Moon Crater Database v1 Salamuniccar
      https://astrogeology.usgs.gov/search/map/Moon/Research/Craters/GoranSalamuniccar_MoonCraters
  - histograms: histograms built using the Python script build_histogram.py
  - exploratory_analysis: exploratory analysis results built using the Python script
    histogram_exploratory_analysis.py
  - json_format: histogram and exploratory analysis results for the gaussian dataset, built using
    the executable with the -j option
Technical Limits
----------------
For human readability reasons, input values are restricted to a value domain with a 10-digit
mantissa and an exponent between -100 and 100 (magnitudes from 10^-100 to 10^100). The total number
of input values cannot exceed 2^31, that is about two billion.
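
These limits can be checked on a data set before running the tool. The Python sketch below is one
possible reading of them, not khisto's own validation code: it treats zero as valid, takes the
magnitude bounds as inclusive, and considers that a value fits when it survives a round trip through
10 significant digits. Whether khisto rejects or rounds values outside this domain is not stated
here. The constant and function names are hypothetical.

```python
MAX_VALUE_NUMBER = 2**31      # about two billion input values
MANTISSA_DIGITS = 10          # significant decimal digits
MIN_MAGNITUDE = 1e-100
MAX_MAGNITUDE = 1e100


def fits_value_domain(value):
    """Return True if value fits a 10-digit mantissa and the magnitude range.

    Assumptions: zero is accepted and the magnitude bounds are inclusive.
    """
    if value == 0:
        return True
    if not (MIN_MAGNITUDE <= abs(value) <= MAX_MAGNITUDE):
        return False  # also rejects infinities and NaN
    # Format with 10 significant digits and compare with the original value.
    return float(f"{value:.{MANTISSA_DIGITS - 1}e}") == value


if __name__ == "__main__":
    for value in (0.1, 1234567890.0, 12345678901.0, 1e200):
        print(value, fits_value_domain(value))
```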
Reference
---------
Marc Boulle, Floating-point histograms for exploratory analysis of large scale real-world data sets,
submitted for publication, 2023