/************************************************************************/
/*                                                                      */
/*    vspline - a set of generic tools for creation and evaluation      */
/*              of uniform b-splines                                    */
/*                                                                      */
/*            Copyright 2015 - 2017 by Kay F. Jahnke                    */
/*                                                                      */
/*    The git repository for this software is at                        */
/*                                                                      */
/*    https://bitbucket.org/kfj/vspline                                 */
/*                                                                      */
/*    Please direct questions, bug reports, and contributions to        */
/*                                                                      */
/*    kfjahnke+vspline@gmail.com                                        */
/*                                                                      */
/*    Permission is hereby granted, free of charge, to any person       */
/*    obtaining a copy of this software and associated documentation    */
/*    files (the "Software"), to deal in the Software without           */
/*    restriction, including without limitation the rights to use,      */
/*    copy, modify, merge, publish, distribute, sublicense, and/or      */
/*    sell copies of the Software, and to permit persons to whom the    */
/*    Software is furnished to do so, subject to the following          */
/*    conditions:                                                       */
/*                                                                      */
/*    The above copyright notice and this permission notice shall be    */
/*    included in all copies or substantial portions of the             */
/*    Software.                                                         */
/*                                                                      */
/*    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND    */
/*    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES   */
/*    OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND          */
/*    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT       */
/*    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,      */
/*    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING      */
/*    FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR     */
/*    OTHER DEALINGS IN THE SOFTWARE.                                   */
/*                                                                      */
/************************************************************************/

// This header doesn't contain any code, only the text for the main page of the documentation.

/*! \mainpage

 \section intro_sec Introduction

 vspline is a header-only generic C++ library for the creation and use of uniform B-splines. It aims to be as comprehensive as feasible while still producing code which performs well enough to be used in production.
 
 vspline was developed on a Linux system using clang++ and g++. It has not been tested much with other systems or compilers, and as of this writing I am aware that the code probably isn't fully portable. The code uses the C++11 standard.
 
 Note: in November 2017, with help from Bernd Gaucke, vspline's companion program pv, which uses vspline heavily, was successfully compiled with 'Visual Studio Platform toolset V141'. While no further tests have been done, I hope that I can soon extend the list of supported platforms.
 
 vspline's main focus is bulk data processing. It was developed to be used for image processing software. In image processing, oftentimes large amounts of pixels need to be submitted to identical operations, suggesting a functional approach. vspline offers functional programming elements to implement such programs.
 
 vspline relies heavily on two other libraries:
 
 - <a href="http://ukoethe.github.io/vigra/">VIGRA</a>, mainly for handling of multidimensional arrays and general signal processing
 
 - <a href="https://github.com/VcDevel/Vc">Vc</a>, for the use of the CPU's vector units
 
 I find VIGRA indispensable; omitting it from vspline is not really an option. Use of Vc is optional, though, and has to be activated by defining 'USE_VC'. This should be done by passing -DUSE_VC to the compiler; defining USE_VC only for parts of a project may or may not work. Please note that vspline uses Vc's 1.3 branch, not the master branch. 1.3 is what you are likely to find in your distro's package repositories; if you check out Vc from github, make sure you pick the 1.3 branch.
 
 I have made an attempt to generalize the code so that it can handle

 - arbitrary real data types and their aggregates
 
 - a reasonable selection of boundary conditions
 
 - prefiltering with implicit and explicit extrapolation schemes
 
 - arbitrary spline orders
 
 - arbitrary dimensions of the spline
 
 - multithreaded execution
 
 - use of the CPU's vector units where possible

On the evaluation side I provide

 - evaluation of the spline at point locations in the defined range
 
 - evaluation of the spline's derivatives

 - mapping of arbitrary coordinates into the defined range
 
 - evaluation of nD arrays of coordinates (remap function)
 
 - generalized 'transform' and 'apply' functions

On top you get a unary functor type and some functional constructs
to go with it.

 \section install_sec Installation
 
 vspline is header-only, so it's sufficient to place the headers where your code can access them. VIGRA and Vc are supposed to be installed in a location where they can be found so that includes along the lines of #include <vigra/...> succeed.

 \section compile_sec Compilation
 
 While your distro's packages may be sufficient to get vspline up and running, you may need newer versions of VIGRA and Vc. At the time of this writing the latest versions commonly available were Vc 1.3.0 and VIGRA 1.11.0; I compiled Vc and VIGRA from source, using up-to-date pulls from their respective repositories. Vc 0.x.x will not work with vspline.
 
 Update: Ubuntu 17.04 has VIGRA and Vc packages which are sufficiently up-to-date.
 
 To compile software using vspline, I use clang++:
 
~~~~~~~~~~~~~~
 clang++ -D USE_VC -pthread -O3 -march=native --std=c++11 your_code.cc -lVc -lvigraimpex
~~~~~~~~~~~~~~
 
 The -lvigraimpex can be omitted if vigraimpex (VIGRA's image import/export library) is not used. Linking libVc.a in statically is a good option; on my system the resulting code is faster.
 
 Please note that an executable using Vc which was produced on your system may not work on a machine with another CPU. It's best to compile on the intended target. Alternatively, the target architecture can be passed explicitly to the compiler (-march...). 'Not work' in this context means that the program may well crash due to an illegal instruction or wrong alignment.
 
 If you can't use Vc, the code can be made to compile without Vc by omitting -D USE_VC and other flags relevant for Vc:
 
~~~~~~~~~~~~~~
 clang++ -pthread -O3 --std=c++11 your_code.cc -lvigraimpex
~~~~~~~~~~~~~~
 
 If you don't want to use clang++, g++ will also work.
 
 All access to Vc in the code is inside #ifdef USE_VC .... #endif statements, so not defining USE_VC will effectively prevent its use.
 
 \section license_sec License

 vspline is free software, licensed under this license:
 
~~~~~~~~~~~~
    vspline - a set of generic tools for creation and evaluation
              of uniform b-splines

            Copyright 2015 - 2017 by Kay F. Jahnke

    Permission is hereby granted, free of charge, to any person
    obtaining a copy of this software and associated documentation
    files (the "Software"), to deal in the Software without
    restriction, including without limitation the rights to use,
    copy, modify, merge, publish, distribute, sublicense, and/or
    sell copies of the Software, and to permit persons to whom the
    Software is furnished to do so, subject to the following
    conditions:

    The above copyright notice and this permission notice shall be
    included in all copies or substantial portions of the
    Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND
    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
    OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
    FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
    OTHER DEALINGS IN THE SOFTWARE.
~~~~~~~~~~~~

 \section quickstart_sec Quickstart
 
 vspline uses vigra to handle data. There are two vigra data types which are used throughout. vigra::MultiArrayView is used for multidimensional arrays. It's a thin wrapper around the three parameters needed to describe arbitrary n-dimensional arrays of data in memory: a pointer to some 'base' location coinciding with the coordinate origin, a shape and a stride. If your code does not use vigra MultiArrayViews, it's easy to create them for the data you have: vigra offers a constructor for MultiArrayViews taking these three parameters.

 The other vigra data type used throughout vspline is vigra::TinyVector, a small fixed-size container type used to hold things like multidimensional coordinates or pixels. This type is also just a wrapper around a small 1D C array. It has zero overhead and contains nothing else, but offers lots of functionality like arithmetic operations.

 I recommend looking into vigra's documentation to get an idea about these data types, even if you only wrap your extant data in them to interface with vspline. vspline follows vigra's default axis ordering scheme: the fastest-varying index is first, so coordinates are (x,y,z...). Coordinates, strides and shapes are given relative to the MultiArrayView's value_type.
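
 To make the 'pointer, shape and stride' idea concrete, here is a minimal sketch using only the standard library. It is not vigra's class: the names view2d, wrap_dense and every_other_column are made up for illustration, and the sketch is limited to 2D float data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2D array view (NOT vigra's class):
// a base pointer, a shape and a stride, all counted in elements.
struct view2d
{
  float * base ;             // address of element (0,0)
  std::ptrdiff_t shape[2] ;  // extent along x and y
  std::ptrdiff_t stride[2] ; // element steps along x and y

  // x is the fastest-varying index, so coordinates read (x,y)
  float & operator() ( std::ptrdiff_t x , std::ptrdiff_t y ) const
  {
    assert ( x >= 0 && x < shape[0] && y >= 0 && y < shape[1] ) ;
    return base [ x * stride[0] + y * stride[1] ] ;
  }
} ;

// wrap a dense buffer of w * h floats, rows stored one after another
inline view2d wrap_dense ( float * data , std::ptrdiff_t w , std::ptrdiff_t h )
{
  return view2d { data , { w , h } , { 1 , w } } ;
}

// view every second column of 'v' without copying: halve the x extent
// (rounding up) and double the x stride
inline view2d every_other_column ( const view2d & v )
{
  return view2d { v.base ,
                  { ( v.shape[0] + 1 ) / 2 , v.shape[1] } ,
                  { 2 * v.stride[0] , v.stride[1] } } ;
}
```

 The second helper shows why the stride is part of the description: a subsampled view refers to the same memory, and only the three parameters change.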
 
 If you stick with the high-level code, using class bspline or the transform functions, most of the parametrization is easy. Here are a few quick examples of what you can do. This is really just to give you a first idea - there is example code covering most features of vspline where things are covered in more detail, with plenty of comments. The code in this text is also there, see quickstart.cc.
 
 Let's suppose you have data in a 2D vigra MultiArray 'a'. vspline can handle real data like float and double, and also their 'aggregates', meaning data types like pixels or vigra's TinyVector. But for now, let's assume you have plain float data. Creating the bspline object is easy:
 
~~~~~~~~~~~~~~
#include <vspline/vspline.h>

// given a vigra::MultiArray of data (initialization omitted)
vigra::MultiArray < 2 , float > a ( 10 , 20 ) ;

// let's initialize the whole array with 42
a = 42 ;

// fix the type of the corresponding b-spline
typedef vspline::bspline < float , 2 > spline_type ;
 
// create bspline object 'bspl' fitting the shape of your data
spline_type bspl ( a.shape() ) ;
 
// copy the source data into the bspline object's 'core' area
bspl.core = a ;
 
// run prefilter() to convert original data to b-spline coefficients
bspl.prefilter() ;
~~~~~~~~~~~~~~
 
 The memory needed to hold the coefficients is allocated when the bspline object is constructed.
 
 Obviously many things have been done by default here: The default spline degree was used - it's 3, for a cubic spline. Also, boundary treatment mode 'MIRROR' was used per default. Further default parameters cause the spline to be 'braced' so that it can be evaluated with vspline's evaluation routines, Vc (if compiled in) was used for prefiltering, and the process is automatically partitioned and run in parallel by a thread pool. The only mandatory template arguments are the value type and the number of dimensions, which have to be known at compile time.
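
 As an illustration of what a mirroring boundary treatment means for indices outside the data, here is a small sketch. It assumes that 'mirror' means reflection on the first and last sample, so that the values at -1 and 1 coincide; the helper mirror_index is hypothetical and not part of vspline.

```cpp
#include <cassert>

// Hypothetical helper: fold an arbitrary integer index into [ 0 , n - 1 ]
// by reflecting on the first and last sample, so that the samples at -1
// and 1 coincide, as do the samples at n and n - 2.
inline long mirror_index ( long i , long n )
{
  assert ( n > 0 ) ;
  if ( n == 1 )
    return 0 ;                        // a single sample mirrors onto itself
  const long period = 2 * ( n - 1 ) ; // the mirrored signal repeats with this period
  i %= period ;                       // the remainder may be negative in C++
  if ( i < 0 )
    i += period ;
  return ( i < n ) ? i : period - i ;
}
```

 With n = 5, the indices -1, 5 and 8 fold to 1, 3 and 0 respectively.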
 
 While the sequence of operations in the first example looks a bit verbose (why not create the bspline object by a call like bspl(a) ?), in 'real' code you would use bspl.core straight away as the space to contain your data - you might get the data from a file or by some other process, or do something like this, where the bspline object provides the array and you interface it via a view to its 'core':
   
~~~~~~~~~~~~~~
vspline::bspline < double , 1 > bsp ( 10001 , degree , vspline::MIRROR ) ;
 
auto v1 = bsp.core ; // get a view to the bspline's 'core'
 
for ( auto & r : v1 ) r = ... ; // assign some values
 
bsp.prefilter() ; // perform the prefiltering
~~~~~~~~~~~~~~
 
 This is a common idiom, because it reflects a common mode of operation where you don't need the original, unfiltered data any more after creating the spline, so the prefiltering is done in-place overwriting the original data. If you do need the original data later, you can also use a third idiom:
 
~~~~~~~~~~~~~~
// an owning array is needed here: a view cannot be made from a shape alone
vigra::MultiArray < 3 , double > my_data ( vigra::Shape3 ( 5 , 6 , 7 ) ) ;
 
vspline::bspline < double , 3 > bsp ( my_data.shape() ) ;
 
bsp.prefilter ( my_data ) ;
~~~~~~~~~~~~~~
 
 Here, the bspline object is first created with the appropriate 'core' size, and prefilter() is called with an array matching the bspline's core. This results in my_data being read into the bspline object during the first pass of the prefiltering process.
 
 There are more ways of setting up a bspline object, please refer to class bspline's constructor. Of course you are also free to directly use vspline's lower-level routines to create a set of coefficients. The lowest level of filtering routine is simply a forward-backward recursive filter with a set of arbitrary poles. This code is in filter.h.
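
 To give an idea of what a forward-backward recursive filter looks like, here is a minimal 1D sketch for the cubic case, which has a single pole at sqrt(3) - 2. It uses only the standard library and assumes mirror boundaries. It follows the well-known recursive scheme for cubic b-spline interpolation and is not a copy of the code in filter.h, which handles arbitrary poles and nD data and may differ in details such as initialization. The names prefilter_cubic_mirror and knot_value are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place prefilter for a 1D cubic b-spline with mirror boundaries.
inline void prefilter_cubic_mirror ( std::vector < double > & c )
{
  const std::size_t n = c.size() ;
  assert ( n >= 2 ) ;
  const double z = std::sqrt ( 3.0 ) - 2.0 ;            // the single pole
  const double gain = ( 1.0 - z ) * ( 1.0 - 1.0 / z ) ; // equals 6

  for ( double & v : c )
    v *= gain ;

  // exact start value of the causal filter for a mirrored signal
  double zn = z ;
  double z2n = std::pow ( z , double ( n - 1 ) ) ;
  double sum = c [ 0 ] + z2n * c [ n - 1 ] ;
  z2n *= z2n / z ;
  for ( std::size_t k = 1 ; k + 1 < n ; k++ )
  {
    sum += ( zn + z2n ) * c [ k ] ;
    zn *= z ;
    z2n /= z ;
  }
  c [ 0 ] = sum / ( 1.0 - zn * zn ) ;

  // forward (causal) pass
  for ( std::size_t k = 1 ; k < n ; k++ )
    c [ k ] += z * c [ k - 1 ] ;

  // start value of the anticausal filter, then the backward pass
  c [ n - 1 ] = ( z / ( z * z - 1.0 ) ) * ( z * c [ n - 2 ] + c [ n - 1 ] ) ;
  for ( std::size_t k = n - 1 ; k-- > 0 ; )
    c [ k ] = z * ( c [ k + 1 ] - c [ k ] ) ;
}

// value of the spline at knot i: weights 1/6, 4/6, 1/6 on neighbouring
// coefficients, with mirrored indices at both ends
inline double knot_value ( const std::vector < double > & c , std::size_t i )
{
  const std::size_t n = c.size() ;
  const double left = c [ i == 0 ? 1 : i - 1 ] ;
  const double right = c [ i + 1 == n ? n - 2 : i + 1 ] ;
  return ( left + 4.0 * c [ i ] + right ) / 6.0 ;
}
```

 Calling knot_value on the filtered coefficients should give back the original samples up to rounding, which is a handy check that the filter and the boundary treatment fit together.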
 
 Next you may want to evaluate the spline from the first example at some pair of coordinates x, y. Evaluation of the spline can be done using vspline's 'evaluator' objects. Using the highest level of access, these objects are set up with a bspline object and, after being set up, provide methods to evaluate the spline at given coordinates. Technically, evaluator objects are functors which don't have mutable state (all state is created at creation time and constant afterwards), so they are thread-safe and 'pure' in a functional-programming sense. The evaluation is done by calling the evaluator's eval() member function, which takes its first argument (the coordinate) as a const reference and writes the result to its second argument, which is a reference to a variable capable of holding the result.

~~~~~~~~~~~~~~
// for a 2D spline, we want 2D coordinates
 
typedef vigra::TinyVector < float , 2 > coordinate_type ;
 
// get the appropriate evaluator type
 
typedef vspline::evaluator < coordinate_type , float > eval_type ;
 
// create the evaluator
 
eval_type ev ( bspl ) ;

// create variables for input and output

coordinate_type coordinate ( 3 , 4 ) ;
float result ;

// use the evaluator to evaluate the spline at ( 3 , 4 )
// storing the result in 'result'

ev.eval ( coordinate , result ) ;
~~~~~~~~~~~~~~

 Again, some things have happened by default. The evaluator was constructed from a bspline object, making sure that the evaluator is compatible.
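
 To give an idea of what happens inside such an evaluation, here is a minimal sketch for the 1D cubic case. It is not vspline::evaluator: the class cubic_evaluator_1d is made up for illustration, uses only the standard library and assumes mirror boundaries on the coefficients. It mimics the eval() convention, with the input taken by const reference and the result written to the second argument, and it has no mutable state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 1D cubic b-spline evaluator (not vspline::evaluator).
class cubic_evaluator_1d
{
  const std::vector < double > coeffs ; // fixed at construction

  // fetch a coefficient, folding the index by mirroring on the end samples
  double at ( long i ) const
  {
    const long n = long ( coeffs.size() ) ;
    if ( n == 1 ) return coeffs [ 0 ] ;
    const long period = 2 * ( n - 1 ) ;
    i %= period ;
    if ( i < 0 ) i += period ;
    return coeffs [ std::size_t ( i < n ? i : period - i ) ] ;
  }

public:
  explicit cubic_evaluator_1d ( std::vector < double > c )
  : coeffs ( std::move ( c ) )
  {
    assert ( ! coeffs.empty() ) ;
  }

  // input by const reference, result written to the second argument
  void eval ( const double & x , double & result ) const
  {
    const double fl = std::floor ( x ) ;
    const long i = long ( fl ) ;
    const double t = x - fl ; // fractional part, in [ 0 , 1 )
    // weights of the four cubic basis functions covering x
    const double w0 = ( 1.0 - t ) * ( 1.0 - t ) * ( 1.0 - t ) / 6.0 ;
    const double w1 = ( 3.0 * t * t * t - 6.0 * t * t + 4.0 ) / 6.0 ;
    const double w2 = ( -3.0 * t * t * t + 3.0 * t * t + 3.0 * t + 1.0 ) / 6.0 ;
    const double w3 = t * t * t / 6.0 ;
    result = w0 * at ( i - 1 ) + w1 * at ( i )
           + w2 * at ( i + 1 ) + w3 * at ( i + 2 ) ;
  }
} ;
```

 The four weights always sum to one, so constant coefficients yield a constant spline, and coefficients which rise linearly yield a linear spline away from the boundaries.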
 
 You may ask why an evaluator doesn't provide operator(). This has technical reasons - if you're interested in the details, please refer to the documentation for vspline::unary_functor. If you need function call syntax for a vspline::unary_functor, vspline offers vspline::callable:
 
~~~~~~~~~~~~~
// wrap the evaluator in a vspline::callable
auto f = vspline::callable ( ev ) ;
 
// the resulting object can be called as a function
float r = f ( coordinate ) ;

assert ( r == result ) ;
~~~~~~~~~~~~~
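
 The idea behind such a wrapper can be sketched in a few lines. The adapter below is not vspline::callable; call_adapter and times_two are hypothetical names, and the sketch assumes nothing more than a functor with an eval ( in , out ) member function.

```cpp
// Hypothetical functor following the eval ( in , out ) convention
struct times_two
{
  void eval ( const float & in , float & out ) const
  {
    out = 2.0f * in ;
  }
} ;

// Hypothetical adapter (not vspline::callable): adds operator() to any
// functor offering eval ( const In & , Out & ) const
template < typename F , typename In , typename Out >
struct call_adapter
{
  F inner ;

  Out operator() ( const In & in ) const
  {
    Out out ;
    inner.eval ( in , out ) ;
    return out ;
  }
} ;
```

 With these definitions, call_adapter < times_two , float , float > can be used wherever function call syntax is expected.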
 
 What about the remap function? The little introduction demonstrated how you can evaluate the spline at a single location. Most of the time, though, you'll require evaluation at many coordinates. This is what remap does. Instead of a single coordinate, you pass a whole vigra::MultiArrayView full of coordinates to it - and another MultiArrayView of the same dimension and shape to accept the results of evaluating the spline at every coordinate in the first array. Here's a simple example, using the same array 'a' as above:

~~~~~~~~~~~~
// create a 1D array containing (2D) coordinates into 'a'
vigra::MultiArray < 1 , coordinate_type > coordinate_array ( 3 ) ;

// we initialize the coordinate array by hand...
coordinate_array = coordinate ;

// create an array to accommodate the result of the remap operation
vigra::MultiArray < 1 , float > target_array ( 3 ) ;
 
// perform the remap
vspline::remap ( a , coordinate_array , target_array ) ;

auto ic = coordinate_array.begin() ;
for ( auto k : target_array )
  assert ( k == f ( *(ic++) ) ) ;
~~~~~~~~~~~~

 This is an 'ad-hoc' remap, passing source data as an array. You can also set up a bspline object and perform a 'transform' using an evaluator for this bspline object, with the same effect:
 
~~~~~~~~~~~~
// instead of the remap, we can use transform, passing the evaluator for
// the b-spline over 'a' instead of 'a' itself. the result is the same.
vspline::transform ( ev , coordinate_array , target_array ) ; 
~~~~~~~~~~~~

 This routine has wider scope: while in this example, ev is a b-spline evaluator, ev's type can be any functor capable of yielding a value of the type held in 'target_array' for a value held in 'coordinate_array'. Here, you'd typically use an object derived from class vspline::unary_functor, and vspline::evaluator is in fact derived from this base class. A unary_functor's input and output can be any data type suitable for processing with vspline (elementary types and their uniform aggregates), you're not limited to things which can be thought of as 'coordinates' etc.
 
 This generalization of remap is named 'transform' and is similar to vigra's point operator code, but uses vspline's automatic multithreading and vectorization to make it very efficient. There's a variation of it where the 'coordinate array' and the 'target array' are the same, effectively performing an in-place transformation, which is useful for things like coordinate transformations or colour space manipulations. This variation is called vspline::apply.
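
 Stripped of multithreading and vectorization, the two routines boil down to simple loops. The following sketch is a scalar, single-threaded stand-in working on std::vector; transform_sketch, apply_sketch and to_radians are hypothetical names and not part of vspline.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for transform: feed every input value to the
// functor and store the result at the same position in the output
template < typename F , typename In , typename Out >
void transform_sketch ( const F & f ,
                        const std::vector < In > & input ,
                        std::vector < Out > & output )
{
  assert ( input.size() == output.size() ) ;
  for ( std::size_t i = 0 ; i < input.size() ; i++ )
    f.eval ( input [ i ] , output [ i ] ) ;
}

// Hypothetical stand-in for apply: input and output are the same array
template < typename F , typename T >
void apply_sketch ( const F & f , std::vector < T > & data )
{
  for ( T & v : data )
  {
    T result ;
    f.eval ( v , result ) ; // read the old value ...
    v = result ;            // ... and overwrite it with the new one
  }
}

// example functor: converts degrees to radians
struct to_radians
{
  void eval ( const double & deg , double & rad ) const
  {
    rad = deg * 3.14159265358979323846 / 180.0 ;
  }
} ;
```

 Note that the functor in this example has nothing to do with coordinates or splines: any functor with a matching eval() will do.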

 There is one more variation of transform(). This overload doesn't take a 'coordinate array', but instead feeds the unary_functor with discrete coordinates of the target location that is being filled in.
 It's probably easiest to understand this variant if you start out thinking of feeding the previous transform() with an array which contains discrete indices. In 2D, this array would contain
 
 (0,0) , (1,0) , (2,0) ...
 (0,1) , (1,1) , (2,1) ...
 ...
 
 So why would you set up such an array, if it merely contains the coordinates of every cell? You might as well create these values on-the-fly and omit the coordinate array. This is precisely what the second variant of transform does:
 
~~~~~~~~~~~~~
// create a 2D array for the index-based transform operation
vigra::MultiArray < 2 , float > target_array_2d ( 3 , 4 ) ;

// use transform to evaluate the spline for the coordinates of
// all values in this array
vspline::transform ( ev , target_array_2d ) ;

// verify
for ( int x = 0 ; x < 3 ; x ++ )
{
  for ( int y = 0 ; y < 4 ; y++ )
  {
    // explicit casts avoid a narrowing conversion from int to float
    coordinate_type c { float ( x ) , float ( y ) } ;
    assert ( target_array_2d [ c ] == f ( c ) ) ;
  }
}
~~~~~~~~~~~~~
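
 The principle of the index-based variant can be sketched without vspline. The code below is a scalar, single-threaded illustration on a dense buffer; coord2, index_transform_sketch and plane are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 2D coordinate, x first (fastest-varying index first)
struct coord2
{
  float x ;
  float y ;
} ;

// Hypothetical sketch of an index-based transform: 'target' is a dense
// buffer of width * height values with x varying fastest
template < typename F >
void index_transform_sketch ( const F & f ,
                              std::vector < float > & target ,
                              std::size_t width ,
                              std::size_t height )
{
  assert ( target.size() == width * height ) ;
  for ( std::size_t y = 0 ; y < height ; y++ )
  {
    for ( std::size_t x = 0 ; x < width ; x++ )
    {
      // the coordinate is generated on the fly from the cell's index
      const coord2 c { float ( x ) , float ( y ) } ;
      f.eval ( c , target [ y * width + x ] ) ;
    }
  }
}

// example functor: a plane, f ( x , y ) = x + 10 * y
struct plane
{
  void eval ( const coord2 & c , float & out ) const
  {
    out = c.x + 10.0f * c.y ;
  }
} ;
```

 No coordinate array is ever stored; the only memory touched is the target itself.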

 If you use this variant of transform directly with a vspline::evaluator, it will reproduce your original data - within arithmetic precision of the evaluation:
 
~~~~~~~~~~~~~
vigra::MultiArray < 2 , float > b ( a.shape() ) ;
vspline::transform ( ev , b ) ;

auto ia = a.begin() ;
for ( auto r : b )
  assert ( vigra::closeAtTolerance ( *(ia++) , r , .00001 ) ) ;
~~~~~~~~~~~~~
 
 Class vspline::unary_functor is coded to make it easy to implement functors for things like image processing pipelines. For more complex operations, you'd code a functor representing your processing pipeline - often by delegating to 'inner' objects also derived from vspline::unary_functor - and finally use transform() to bulk-process your data with this functor. This is about as efficient as it gets, since the data are only accessed once, and vspline's transform code does the tedious work of multithreading, deinterleaving and interleaving for you, while you are left to concentrate on the interesting bit, writing the processing pipeline code. vspline::unary_functors are reasonably straightforward to set up; for prototyping you can get away without writing vectorized code (by using broadcasting, see vspline::grok), and you'll see that writing vectorized code with Vc isn't too hard either - if your code doesn't need conditionals, you can often even get away with using the same code for vectorized and unvectorized operation. Please refer to the examples. vspline offers some functional programming constructs for functor combination, like feeding one functor's output as input to the next (vspline::chain) or translating coordinates to a different range (vspline::domain).
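
 The principle behind feeding one functor's output to the next can be sketched as follows. This is not vspline::chain: chain_sketch and the two example stages are hypothetical, and the sketch only covers the scalar case.

```cpp
// Hypothetical sketch of functor chaining (not vspline::chain itself):
// the output of 'first' becomes the input of 'second'
template < typename F1 , typename F2 ,
           typename In , typename Mid , typename Out >
struct chain_sketch
{
  F1 first ;
  F2 second ;

  void eval ( const In & in , Out & out ) const
  {
    Mid intermediate ;
    first.eval ( in , intermediate ) ;   // stage one
    second.eval ( intermediate , out ) ; // stage two
  }
} ;

// two example stages following the eval ( in , out ) convention
struct scale_by_two
{
  void eval ( const float & in , float & out ) const { out = 2.0f * in ; }
} ;

struct add_one
{
  void eval ( const float & in , float & out ) const { out = in + 1.0f ; }
} ;
```

 Since the combined object offers eval() itself, it can be chained again or handed to a bulk-processing routine like any other functor.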
 
 And that's about it - vspline aims to provide all possible variants of b-splines, code to create and evaluate them and to do so for arbitrarily shaped and strided nD arrays of data. If you dig deeper into the code base, you'll find that you can stray off the default path, but there should rarely be any need not to use the high-level object 'bspline' or the transform functions.
 
 While one might argue that the remap/transform routines I present shouldn't be lumped together with the 'proper' b-spline code, I feel that only by tightly coupling them with the b-spline code I can make them really fast. And only by processing several values at once (by multithreading and vectorization) the hardware can be exploited fully.
 
\section speed_sec Speed

 While performance will vary from system to system and between different compiles, I'll quote some measurements from my own system. I include benchmarking code (roundtrip.cc in the examples folder). Here are some measurements done with "roundtrip", working on a full HD (1920*1080) RGB image, using single precision floats internally - the figures are averages of 32 runs:

~~~~~~~~~~~~~~~~~~~~~
testing bc code MIRROR spline degree 3
avg 32 x prefilter:............................ 13.093750 ms
avg 32 x transform from unsplit coordinates:... 59.218750 ms
avg 32 x remap with internal spline:........... 75.125000 ms
avg 32 x transform from indices ............... 57.781250 ms

testing bc code MIRROR spline degree 3 using Vc
avg 32 x prefilter:............................ 9.562500 ms
avg 32 x transform from unsplit coordinates:... 22.406250 ms
avg 32 x remap with internal spline:........... 35.687500 ms
avg 32 x transform from indices ............... 21.656250 ms
~~~~~~~~~~~~~~~~~~~~~

As can be seen from these test results, using Vc on my system speeds evaluation up a good deal. When it comes to prefiltering, a lot of time is spent buffering data to make them available for fast vector processing. The time spent on actual calculations is much less. Therefore prefiltering for higher-degree splines doesn't take much more time (when using Vc):

~~~~~~~~~~~~~~~~~~~~~
testing bc code MIRROR spline degree 5 using Vc
avg 32 x prefilter:........................ 10.687500 ms

testing bc code MIRROR spline degree 7 using Vc
avg 32 x prefilter:........................ 13.656250 ms
~~~~~~~~~~~~~~~~~~~~~

Using double precision arithmetic, vectorization doesn't help as much, and prefiltering is actually slower on my system when using Vc. Doing a complete roundtrip run on your system should give you an idea about which mode of operation best suits your needs.
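
If you want to take averaged timings of your own code in the same spirit, a small helper along the following lines may do. This is a sketch using std::chrono and is not taken from roundtrip.cc; the name average_ms is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run 'work' 'runs' times and return the mean
// wall-clock time per run in milliseconds
template < typename Work >
double average_ms ( Work work , int runs = 32 )
{
  assert ( runs > 0 ) ;
  const auto start = std::chrono::steady_clock::now() ;
  for ( int i = 0 ; i < runs ; i++ )
    work() ;
  const auto stop = std::chrono::steady_clock::now() ;
  const std::chrono::duration < double , std::milli > total = stop - start ;
  return total.count() / runs ;
}
```

The work to be timed is passed as a callable without arguments, for example a lambda wrapping a single call.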

\section design_sec Design
 
 You can probably do everything vspline does with other software - there are several freely available implementations of b-spline interpolation and remap/transform routines. What I wanted to create was an implementation which was as general as possible and at the same time as fast as possible, and, on top of that, comprehensive.

 These demands are not easy to satisfy at the same time, but I feel that my design comes close. While generality is achieved by generic programming, speed needs exploitation of hardware features, and merely relying on the compiler is not enough. The largest speedup I saw was from multithreading the code. This may seem like a trivial observation, but my design is influenced by it: in order to efficiently multithread, the problem has to be partitioned so that it can be processed by independent threads. You can see the partitioning both in prefiltering and later in the transform routines; in fact, both even share code to do so.
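
 The partitioning idea can be sketched with the standard library alone. The code below splits an index range into contiguous chunks and processes each chunk in its own thread. It is a simplification under stated assumptions: it starts one thread per chunk rather than using a thread pool, and the names partition and run_partitioned are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// split [ 0 , n ) into at most 'parts' contiguous, non-overlapping ranges
inline std::vector < std::pair < std::size_t , std::size_t > >
partition ( std::size_t n , std::size_t parts )
{
  assert ( parts > 0 ) ;
  std::vector < std::pair < std::size_t , std::size_t > > ranges ;
  const std::size_t chunk = ( n + parts - 1 ) / parts ; // round up
  for ( std::size_t begin = 0 ; begin < n ; begin += chunk )
    ranges.emplace_back ( begin , std::min ( begin + chunk , n ) ) ;
  return ranges ;
}

// run f ( begin , end ) on every range, one thread per range
template < typename F >
void run_partitioned ( std::size_t n , std::size_t parts , F f )
{
  std::vector < std::thread > workers ;
  for ( const auto & r : partition ( n , parts ) )
    workers.emplace_back ( f , r.first , r.second ) ;
  for ( auto & w : workers )
    w.join() ;
}
```

 Because the ranges don't overlap, the threads never write to the same element and no locking is needed while they run.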
 
 Another speedup method is data-parallel processing. This is often thought to be the domain of GPUs, but modern CPUs also offer it in the form of vector units. I chose to implement data-parallel processing on the CPU, as it offers tight integration with unvectorized CPU code. It's almost familiar terrain, and the way from writing conventional CPU code to vector unit code is not too far when using tools like Vc, which abstract the hardware away. Using horizontal vectorization does require some rethinking, though - mainly a conceptual shift from an AoS to an SoA approach. vspline doesn't use vertical vectorization at all, so the code may look odd to someone looking for vector representations of, say, pixels: instead of finding SIMD vectors with three elements, there are structures of three SIMD vectors of vsize elements.
 
 To use vectorized evaluation efficiently, incoming data have to be presented to the evaluation code in vectorized form, but usually they will come from interleaved memory. Keeping the data in interleaved memory is even desirable, because it preserves locality, and usually processing accesses all parts of a value (i.e. all three channels of an RGB value) at once. After the evaluation is complete, data have to be stored again to interleaved memory. The deinterleaving and interleaving operations take time and the best strategy is to load once from interleaved memory, perform all necessary operations on deinterleaved, vectorized data and finally store the result back to interleaved memory. The sequence of operations performed on the vectorized data constitutes a processing pipeline, and some data access code will feed the pipeline and dispose of its result. vspline's unary_functor class is designed to occupy the niche of pipeline code, while remap, apply and transform provide the feed-and-dispose code. So with the framework of these routines, setting up vectorized processing pipelines becomes easy, since all the boilerplate code is there already, and only the point operations/operations on single vectors need to be provided by deriving from unary_functor.
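
 The load-process-store pattern can be illustrated with plain arrays standing in for SIMD vectors. This sketch does not use Vc and is not vspline's code: the lane count of 8 is an assumption, the names rgb_soa, deinterleave, interleave, swap_red_blue and process are hypothetical, and whether the loops end up in vector registers is left to the compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t vsize = 8 ; // assumed lane count, hardware-dependent in practice

// one 'vector' per channel: SoA layout for vsize RGB pixels
struct rgb_soa
{
  std::array < float , vsize > r , g , b ;
} ;

// load vsize interleaved pixels (r g b r g b ...) into SoA form
inline rgb_soa deinterleave ( const float * p )
{
  rgb_soa v ;
  for ( std::size_t i = 0 ; i < vsize ; i++ )
  {
    v.r [ i ] = p [ 3 * i ] ;
    v.g [ i ] = p [ 3 * i + 1 ] ;
    v.b [ i ] = p [ 3 * i + 2 ] ;
  }
  return v ;
}

// store SoA data back to interleaved memory
inline void interleave ( const rgb_soa & v , float * p )
{
  for ( std::size_t i = 0 ; i < vsize ; i++ )
  {
    p [ 3 * i ] = v.r [ i ] ;
    p [ 3 * i + 1 ] = v.g [ i ] ;
    p [ 3 * i + 2 ] = v.b [ i ] ;
  }
}

// example pipeline stage working on whole channels: swap red and blue
inline rgb_soa swap_red_blue ( const rgb_soa & in )
{
  rgb_soa out ;
  out.r = in.b ;
  out.g = in.g ;
  out.b = in.r ;
  return out ;
}

// load once, process in SoA form, store once
inline void process ( std::vector < float > & pixels )
{
  assert ( pixels.size() % ( 3 * vsize ) == 0 ) ;
  for ( std::size_t i = 0 ; i < pixels.size() ; i += 3 * vsize )
    interleave ( swap_red_blue ( deinterleave ( & pixels [ i ] ) ) ,
                 & pixels [ i ] ) ;
}
```

 A longer pipeline would insert more stages between deinterleave and interleave without touching memory in between, which is the point of the pattern.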

 Using all these techniques together makes vspline fast. The target I was roughly aiming at was to achieve frame rates of ca. 50 fps in RGB and full HD, producing the images via transform from a precalculated warp array. On my system, I have almost reached that goal - my transform times are around 25 msec (for a cubic spline), and with memory access etc. I come up to frame rates over half of what I was aiming at. My main testing ground is pv, my panorama viewer. Here I can often take the spline degree up to two (a quadratic spline) and still have smooth animation in full resolution. Note that class evaluator has specializations for degree-1 and degree-0 splines (aka linear and nearest-neighbour interpolation) which use optimizations making the specialized evaluator even faster than the general-purpose code.
 
 Even without using vectorization, the code is certainly fast enough for casual use and may suffice for some production scenarios. This way, vigra becomes the only dependency, and the same binary will work on a wide range of hardware.
 
 \section literature_sec Literature
 
 There is a large amount of literature on b-splines available online. Here's a pick:
 
 http://bigwww.epfl.ch/thevenaz/interpolation/
 
 http://soliton.ae.gatech.edu/people/jcraig/classes/ae4375/notes/b-splines-04.pdf
 
 http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/B-spline/bspline-basis.html
 
 http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/B-spline/bspline-ex-1.html
*/
