Microsoft.ML.DataView
Akin to FindIndexSorted, except stores the found index in the output
index parameter, and returns whether that index is a valid index
pointing to a value equal to the input parameter value.
Assumes input is sorted and finds value using BinarySearch.
If value is not found, returns the logical index of 'value' in the sorted list, i.e., the index of the first element greater than value.
In case of duplicates it returns the index of the first one.
It guarantees that items before the returned index are < value, while those at and after the returned index are >= value.
Assumes input is sorted and finds value using BinarySearch.
If value is not found, returns the logical index of 'value' in the sorted list, i.e., the index of the first element greater than value.
In case of duplicates it returns the index of the first one.
It guarantees that items before the returned index are < value, while those at and after the returned index are >= value.
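The contract described above (first index whose element is >= value, plus a flag saying whether that element equals value) is what is commonly called a lower-bound binary search. A minimal Python sketch of those semantics follows; the function name try_find_index_sorted is hypothetical and this is an illustration, not the library's implementation.

```python
from bisect import bisect_left


def try_find_index_sorted(values, value):
    """Return (found, index) for a sorted list.

    index is the first position whose element is >= value, so every
    element before it is < value. found is True only when the element
    at index equals value.
    """
    index = bisect_left(values, value)
    found = index < len(values) and values[index] == value
    return found, index


data = [1, 3, 3, 3, 7, 9]
print(try_find_index_sorted(data, 3))   # duplicate value: first occurrence
print(try_find_index_sorted(data, 4))   # missing value: insertion point
print(try_find_index_sorted(data, 10))  # larger than everything: len(data)
```

With duplicates the returned index points at the first occurrence, and for a missing value it is the position at which the value could be inserted while keeping the list sorted.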
A structure serving as the identifier of a row of .
For datasets with millions of records, those IDs need to be unique, hence the need for such a large structure to hold the values.
Those IDs are derived from the IDs of the previous components of the pipeline, and dividing the structure in two, high order and low order bits,
reduces the chances of collisions even further.
The low order bits. Corresponds to H1 in the Murmur algorithms.
The high order bits. Corresponds to H2 in the Murmur algorithms.
Initializes a new instance of
The low order ulong.
The high order ulong.
An operation that treats the value as an unmixed Murmur3 128-bit hash state,
and returns the hash state that would result if we hashed an additional 16 bytes
that were all zeros, except for the last bit which is one.
An operation that treats the value as an unmixed Murmur3 128-bit hash state,
and returns the hash state that would result if we hashed an additional 16 bytes
that were all zeros.
An operation that treats the value as an unmixed Murmur3 128-bit hash state,
and returns the hash state that would result if we took ,
scrambled it using , then hashed the result of that.
This is the abstract base class for all types in the type system.
Those that wish to extend the type system should derive from one of
the more specific abstract classes or .
Constructor for extension types, which must be either or .
The raw for this . Note that this is the raw representation type
and not the complete information content of the .
Code should not assume that a uniquely identifies a .
For example, most practical instances of ML.NET's KeyType and will have a
of , but both are very different in the types of information conveyed in that number.
Return if is equivalent to and otherwise.
Another to be compared with .
The abstract base class for all non-primitive types.
This class stands in contrast to . As that class is defined
to encapsulate cases where instances of the representation type can be freely copied without concerns
about ownership, mutability, or disposal, this is defined for those types where these factors become concerns.
To take the most conspicuous example, is a structure type,
which through the buffer sharing mechanisms of its representation type,
does not have assignment as sufficient to create an independent copy.
The abstract base class for all primitive types. Values of these types can be freely copied
without concern for ownership, mutation, or disposing.
The standard text type. This has representation type of with type parameter .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
The standard number type. This class is not directly instantiable. All allowed instances of this
type are singletons, and are accessible as static properties on this class.
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The singleton instance of the with representation type of .
The type. This has representation type of .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
The standard boolean type. This has representation type of .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
The standard date time type. This has representation type of .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
The standard date time offset type. This has representation type of .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
The standard timespan type. This has representation type of .
Note this can have only one possible value, accessible by the singleton static property .
The singleton instance of this type.
should be used to decorate class properties and fields, if that class's instances will be loaded as ML.NET .
The function will be called to register a for a with its s.
Whenever a value typed to the registered and its s, that value's type (i.e., a )
in would be the associated .
A function implicitly invoked by ML.NET when processing a custom type. It binds a DataViewType to a custom type plus its attributes.
Return if is equivalent to and otherwise.
Another to be compared with .
Type representing categorical or enumerated values, most commonly used for
the values of labels in multiclass classification models.
The underlying .NET type is one of the unsigned integer types. The default is
, but it can also be ,
, or .
Despite keys being numerical types, the information is not inherently numeric,
so typically, arithmetic is not meaningful.
Missing values are mapped to 0.
The first non-missing value of the set is always 1.
The other values range up to the value of .
For example, if you have a key value with a of 3, then
the value 0 corresponds to missing key values,
each of the values 1, 2, and 3 is a valid value,
and no other values are used.
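The encoding described above can be illustrated with a small Python sketch. It assumes a hypothetical list of labels that the key values enumerate over; decode_key is an illustrative name, not part of the library.

```python
def decode_key(key, labels):
    """Map a key value to its label.

    0 denotes a missing value, 1..len(labels) index into the label set,
    and anything else is outside the legal range for this key type.
    """
    count = len(labels)
    if not 0 <= key <= count:
        raise ValueError(f"key {key} is outside the range 0..{count}")
    return None if key == 0 else labels[key - 1]


labels = ["cat", "dog", "bird"]  # a key type with a count of 3
print([decode_key(k, labels) for k in (0, 1, 2, 3)])
```

Note the off-by-one shift: because 0 is reserved for missing, the key value 1 refers to the first element of the enumerated set.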
Initializes a new instance of the class.
The underlying representation type. Should be one of , ,
(the most common choice), or .
The cardinality of the underlying set. This must not exceed the associated maximum value of the
representation type. For example, if is , then this must not
exceed .
Initializes a new instance of the class. This differs from the hypothetically more general
constructor by taking an for
, to more naturally facilitate the most common case that the key value is being used
as an enumeration over an array or list of some form.
The underlying representation type. Should be one of , ,
(the most common choice), or .
The cardinality of the underlying set. This must not exceed the associated maximum value of the
representation type. For example, if is , then this must not
exceed .
Returns true iff the given type is valid for a . The valid ones are
, , , and , that is, the unsigned
integer types.
is the cardinality of the .
The typical legal values for data of this type are the missing value of 0 and the non-missing
values ranging from 1 through , inclusive, being the enumeration into whatever
set the key values are enumerated over.
Determine if this object is equal to another instance.
Checks if the other item is the type of , if the
is the same, and if the is the same.
The other object to compare against.
if both objects are equal, otherwise .
Determine if a instance is equal to another instance.
Checks if any object is the type of , if the
is the same, and if the is the same.
The other object to compare against.
if both objects are equal, otherwise .
Retrieves the hash code.
An integer representing the hash code.
The string representation of the .
A formatted string.
A buffer that supports both dense and sparse representations. This is the representation type for all
instances. The explicitly defined values of this vector are exposed through
and, if not dense, .
This structure is by itself immutable, but to enable buffer editing including re-use of the internal buffers,
a mutable variant can be accessed through .
Throughout the code, we make the assumption that a sparse is logically equivalent to
a dense with the default value for filling in the default values.
The type of the vector. There are no compile-time restrictions on what this could be, but
this code and practically all code that uses makes the assumption that an assignment of
a value is sufficient to make a completely independent copy of it. So, for example, this means that a buffer of
buffers is not possible. But, things like , , and , are totally fine.
The internal re-usable array of values.
The internal re-usable array of indices.
The number of items explicitly represented. This equals when the representation
is dense and less than when sparse.
The logical length of the buffer.
Note that if this vector , then this will be the same as the
as returned from , since all values are explicitly represented in a dense representation. If
this is a sparse representation, then that will be somewhat shorter, as this
field contains the number of both explicit and implicit entries.
The explicitly represented values. When this , the
of the returned value will equal , and otherwise will have length less than
.
The indices. For a dense representation, this array is not used, and will return the default "empty" span.
For a sparse representation it is parallel to that returned from and specifies the
logical indices for the corresponding values, in increasing order, between 0 inclusive and
exclusive, corresponding to all explicitly defined values. All values at unspecified
indices should be treated as being implicitly defined with the default value of .
To give one example, if returns [3, 5] and () produces [98, 76],
this stands for a vector with non-zero values 98 and 76 respectively at the 4th and 6th
coordinates, and zeros at all other indices. (Zero, because that is the default value for all .NET numeric
types.)
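The example above can be worked through in a few lines of Python. This is only a sketch of the sparse-to-dense equivalence; the logical length of 8 is an assumption chosen for illustration, and densify is a hypothetical helper name.

```python
def densify(length, indices, values, default=0):
    """Expand a sparse (indices, values) pair into a dense list.

    Positions not listed in indices are implicitly the default value.
    """
    dense = [default] * length
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Indices [3, 5] with values [98, 76]: the 4th and 6th coordinates are set.
print(densify(8, [3, 5], [98, 76]))
```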
Gets a value indicating whether every logical element is explicitly represented in the buffer.
Construct a dense representation. The array is often unspecified, but if
specified it should be considered a buffer to be held on to, to be possibly used.
The logical length of the resulting instance.
The values to be used. This must be at least as long as . If
is 0, it is legal for this to be . The constructed buffer
takes ownership of this array.
The internal indices buffer. Because this constructor is for dense representations
this will not be immediately useful, but it does provide a buffer to be potentially reused to avoid
allocation. This is mostly non-null in situations where you want to produce a dense
, but you happen to have an indices array "left over" that you don't want to
needlessly lose.
The resulting structure takes ownership of the passed in arrays, so they should not be used for
other purposes in the future.
Construct a possibly sparse vector representation.
The length of the constructed buffer.
The count of explicit entries. This must be between 0 and , both
inclusive. If it equals the result is a dense vector, and if less this will be a
sparse vector.
The values to be used. This must be at least as long as . If
is 0, it is legal for this to be .
The indices to be used. If we are constructing a dense representation, or
is 0, this can be . Otherwise, this must be at least as long
as .
The resulting structure takes ownership of the passed in arrays, so they should not be used for
other purposes in the future.
Copy from this buffer to the given destination, forcing a dense representation.
The destination buffer. After the copy, this will have
of .
Copy from this buffer to the given destination.
The destination buffer. After the copy, this will have
of .
Copy a range of values from this buffer to the given destination.
The destination buffer. After the copy, this will have
of .
The minimum inclusive index to start copying from this vector.
The logical number of values to copy from this vector into .
Copy from this buffer to the given destination span. This "densifies."
The destination buffer. This must have at least .
Copy from this buffer to the given destination span, starting at the specified index. This "densifies."
The destination buffer. This must be at least
plus .
The starting index of at which to start copying.
The value to fill in for the implicit sparse entries. This is a potential exception to
the general expectation of sparse that the implicit sparse entries have the default value
of .
Copy from a section of a source array to the given destination.
Returns the joint list of all index/value pairs.
If all pairs, even those implicit values of a sparse representation,
will be returned, with the implicit values having the default value, as is appropriate. If left
then only explicitly defined values are returned.
The index/value pairs.
Returns an enumerable with items, representing the values.
Gets the item stored in this structure. In the case of a dense vector this is a simple lookup.
In the case of a sparse vector, it will try to find the entry with that index, and set
to that stored value, or if no such value was found, assign it the default value.
In the case where is , this will take constant time since it is a
direct lookup. For sparse vectors, however, because it must perform a bisection search on the indices to
find the appropriate value, that takes logarithmic time with respect to the number of explicitly represented
items, which is to say, the of the return value of .
For that reason, for a single completely isolated lookup, since constructing as
does is not a free operation, it may be more efficient to use this method. However,
if one is doing a more involved computation involving many operations, it may be faster to utilize
and, if appropriate, directly.
The index, which must be a non-negative number less than .
The value stored at that index, or if this is a sparse vector where this is an implicit
entry, the default value for .
A variant of that returns the value instead of passing it
back using a reference parameter.
The index, which must be a non-negative number less than .
The value stored at that index, or if this is a sparse vector where this is an implicit
entry, the default value for .
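The lookup cost described above (constant time when dense, a bisection search over the indices when sparse) can be sketched in Python as follows. The function name and the convention of passing indices=None for a dense buffer are assumptions made for this illustration only.

```python
from bisect import bisect_left


def get_item_or_default(length, values, indices, index, default=0):
    """Look up a logical index in a dense or sparse vector.

    Dense (indices is None): a direct O(1) lookup.
    Sparse: an O(log n) bisection over the sorted indices, where n is
    the number of explicitly represented items.
    """
    if not 0 <= index < length:
        raise IndexError(index)
    if indices is None:
        return values[index]
    pos = bisect_left(indices, index)
    if pos < len(indices) and indices[pos] == index:
        return values[pos]
    return default  # implicit entry of the sparse representation


print(get_item_or_default(8, [98, 76], [3, 5], 5))  # explicit entry
print(get_item_or_default(8, [98, 76], [3, 5], 4))  # implicit entry
```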
Returns an enumerator that iterates through the values in VBuffer.
A helper method that gives us an iterable over the items given the fields from a .
Note that we have this in a separate utility class, rather than in its more natural location of
itself, due to a bug in the C++/CLI compiler. (DevDiv 1097919:
[C++/CLI] Nested generic types are not correctly imported from metadata). So, if we want to use
in C++/CLI projects, we cannot have a generic struct with a nested class
that has the outer struct type as a field.
Various methods for creating instances.
Creates a with the same shape
(length and density) as the .
The destination buffer. Note that the resulting is assumed to take ownership
of this passed in object, and so whatever was passed in as this parameter should not be used again, since its
underlying buffers are being potentially reused.
Creates a using
's values and indices buffers.
The destination buffer. Note that the resulting is assumed to take ownership
of this passed in object, and so whatever was passed in as this parameter should not be used again, since its
underlying buffers are being potentially reused.
The logical length of the new buffer being edited.
The optional number of physical values to be represented in the buffer.
The buffer will be dense if is omitted.
The optional number of maximum physical values to represent in the buffer.
The buffer won't grow beyond this maximum size.
True means that the old buffer values and indices are preserved, if possible (Array.Resize is called).
False means that a new array will be allocated, if necessary.
True means to ensure the Indices buffer is available, even if the buffer will be dense.
An object capable of editing a by filling out
(and if the buffer is not dense).
The structure by itself is immutable. However, the purpose of
is to enable buffer re-use, so we can edit them through this structure, as created through
or
.
The mutable span of values.
The mutable span of indices.
Gets a value indicating whether a new array was allocated.
Gets a value indicating whether a new array was allocated.
Commits the edits and creates a new using the current and .
Note that this structure and its properties should not be used once this is called.
The newly created .
Commits the edits and creates a new using
the current Values and Indices, while allowing truncation of the length
of and, if sparse, .
Like , this structure and its properties should not be used once this is called.
The new number of physical values to be represented in the created buffer.
The newly created .
This method allows modifying the length of the explicitly defined values.
This is useful in sparse situations where the
was created with a larger physical value count than was needed
because the final value count was not known at creation time.
The standard vector type. The representation type of this is ,
where the type parameter is in .
The dimensions. This will always have at least one item. All values will be non-negative.
As with , a zero value indicates that the vector type is considered to have
unknown length along that dimension.
In the case where this is a multi-dimensional type, that is, a situation where
has length greater than one, since itself is a single dimensional structure,
we must clarify what we mean. The indices represent a "flattened" view of the coordinates implicit in the
dimensions. We consider that the last dimension is the most "minor" index. In the case where
has length 2, this is commonly referred to as row-major order. So, if you hypothetically had
dimensions of { 5, 2 }, then the values would be all of length 10,
and the flattened indices 0, 1, 2, 3, 4, ... would correspond to "coordinates" of
(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), ..., respectively.
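The flattening convention described above, with the last dimension as the most minor index, can be checked with a short Python sketch. Both helper names are hypothetical and exist only for this illustration.

```python
def flat_to_coords(flat_index, dims):
    """Convert a flattened index to coordinates, last dimension minor."""
    coords = []
    for size in reversed(dims):
        coords.append(flat_index % size)
        flat_index //= size
    return tuple(reversed(coords))


def coords_to_flat(coords, dims):
    """Convert coordinates back to the flattened index."""
    flat = 0
    for coord, size in zip(coords, dims):
        flat = flat * size + coord
    return flat


dims = (5, 2)
print([flat_to_coords(i, dims) for i in range(5)])
```

For two dimensions this is the familiar row-major layout: the flattened index of (row, column) is row * 2 + column when the dimensions are { 5, 2 }.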
Constructs a new single-dimensional vector type.
The type of the items contained in the vector.
The size of the single dimension.
Constructs a potentially multi-dimensional vector type.
The type of the items contained in the vector.
The dimensions. Note that, like , must be non-empty, with all
non-negative values. Also, because is the product of , the result of
multiplying all these values together must not overflow .
Constructs a potentially multi-dimensional vector type.
The type of the items contained in the vector.
The dimensions. Note that, like , must be non-empty, with all
non-negative values. Also, because is the product of , the result of
multiplying all these values together must not overflow .
Whether this is a vector type with known size.
Equivalent to > 0.
The type of the items stored as values in vectors of this type.
The size of the vector. A value of zero means it is a vector whose size is unknown.
A vector whose size is known should correspond to values that always have the same ,
whereas one whose size is unknown may have values whose varies from record to record.
Note that this is always the product of the elements in .
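As a small worked illustration of the size rule above, the following Python sketch computes the size as the product of the dimensions and treats a result of zero as "unknown size". The helper names are hypothetical.

```python
from math import prod


def vector_size(dims):
    """Size of a vector type: the product of its dimensions.

    dims must be non-empty with non-negative entries; a zero entry marks
    an unknown length along that dimension, which makes the size zero.
    """
    if not dims or any(d < 0 for d in dims):
        raise ValueError("dimensions must be non-empty and non-negative")
    return prod(dims)


def is_known_size(dims):
    """A vector type has known size exactly when its size is positive."""
    return vector_size(dims) > 0


print(vector_size([5, 2]), is_known_size([5, 2]))
print(vector_size([3, 0]), is_known_size([3, 0]))
```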
Represents the schema of an or an .
The schema is a collection of .
Number of columns in the schema.
Get the column by name. Throws an exception if such column does not exist.
Note that if multiple columns exist with the same name, the one with the largest index is returned.
The other columns are considered 'hidden', and only accessible by their index.
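The name-resolution rule above can be sketched in Python by modelling a schema as a plain list of column names. This is an assumption made for illustration; find_column and is_hidden are hypothetical helpers, not the schema's API.

```python
def find_column(names, name):
    """Return the index of the column with this name.

    When several columns share a name, the one with the largest index wins.
    """
    for index in range(len(names) - 1, -1, -1):
        if names[index] == name:
            return index
    raise KeyError(name)


def is_hidden(names, index):
    """A column is hidden when a later column has the same name."""
    return names[index] in names[index + 1:]


names = ["Label", "Features", "Label"]
print(find_column(names, "Label"))
print([is_hidden(names, i) for i in range(len(names))])
```

Here the first "Label" column is shadowed by the later one, so it can only be reached through its index.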
Get the column by index.
Get the column by name, or null if the column is not present.
This class describes one column in the particular schema.
The name of the column.
The column's index in the schema.
Whether this column is hidden (accessible only by index).
The type of the column.
The annotations of the column.
This class represents the schema of one column of a data view, without an attachment to a particular .
The name of the column.
The type of the column.
The annotations associated with the column.
Creates an instance of a .
Create an instance of from an existing schema's column.
The schema annotations of one .
Annotation getter delegates. Useful to construct annotations out of other annotations.
The schema of the annotations row. It is different from the schema that the column belongs to.
Create an annotations row by supplying the schema columns and the getter delegates for all the values.
Note: The array is owned by this instance.
Get a getter delegate for one value of the annotations row.
Get the value of an annotation, by annotation kind (aka column name).
Class containing operations to build an .
Add some columns from into our new annotations, by applying
to all the names.
The annotations row to take values from.
The predicate describing which annotation columns to keep.
Add one annotation column, strongly-typed version.
The type of the value.
The annotation name.
The annotation type.
The getter delegate.
Annotations of the input column. Note that annotations on an annotation column are somewhat rare
except for certain types (for example, slot names for a vector, key values for something of key type).
Add one annotation column, weakly-typed version.
The annotation name.
The annotation type.
The getter delegate that provides the value. Note that the type of the getter is still checked
inside this method.
Annotations of the input column. Note that annotations on an annotation column are somewhat rare
except for certain types (for example, slot names for a vector, key values for something of key type).
Add one annotation column for a primitive value type.
The annotation name.
The annotation type.
The value of the annotation.
Annotations of the input column. Note that annotations on an annotation column are somewhat rare
except for certain types (for example, slot names for a vector, key values for something of key type).
Returns a row that contains the current contents of this .
Class containing operations to build a .
Create a new instance of .
Add one column to the schema being built.
The column name.
The column type.
The column annotations.
Add multiple existing columns to the schema being built.
Columns to add.
Add multiple existing columns to the schema being built.
Columns to add.
Returns a that contains the current contents of this .
This constructor should only be called by .
The input columns. The constructed instance takes ownership of the array.
The input and output of Query Operators (Transforms). This is the fundamental data pipeline
type, comparable to for LINQ.
Whether this IDataView supports shuffling of rows, to any degree.
Returns the number of rows if known. Returning null means that the row count is unknown but
it might return a non-null value on a subsequent call. This indicates that the transform does
not YET know the number of rows, but may in the future. Its implementation's computational
complexity should be O(1).
Most implementations will return the same answer every time. Some, like a cache, might
return null until the cache is fully populated.
Get a row cursor. The indicate the active columns that are needed
to iterate over. If set to an empty no column is requested. The schema of the returned
cursor will be the same as the schema of the IDataView, but getting a getter for inactive columns will throw.
The active columns needed. If passed an empty no column is requested.
An instance of to seed randomizing the access for a shuffled cursor.
This constructs a set of parallel batch cursors. The value is a recommended limit on
cardinality. If is non-positive, this indicates that the caller has no recommendation,
and the implementation should have some default behavior to cover this case. Note that this is strictly a
recommendation: it is entirely possible that an implementation can return a different number of cursors.
The cursors should return the same data as returned through
, except partitioned: no two cursors should return the
"same" row as would have been returned through the regular serial cursor, but all rows should be returned by
exactly one of the cursors returned from this cursor. The cursors can have their values reconciled
downstream through the use of the property.
The typical usage pattern is that a set of cursors is requested, each of them is then given to a set of
working threads that consume from them independently while, ultimately, the results are finally collated in
the end by exploiting the ordering of the property described above. More typical
scenarios will be content with pulling from the single serial cursor of
.
The active columns needed. If passed an empty no column is requested.
The suggested degree of parallelism.
An instance of to seed randomizing the access.
Gets an instance of Schema.
Delegate type to get a value. This can be used for efficient access to data in a
or .
A logical row of data. May be a row of an or a stand-alone row.
This is incremented when the underlying contents changes, giving clients a way to detect change. It should be
-1 when the object is in a state where values cannot be fetched. In particular, for an ,
this will be before if ever called for the first time, or after the first time
is called and returns .
Note that this position is not position within the underlying data, but position of this cursor only. If
one, for example, opened a set of parallel streaming cursors, or a shuffled cursor, each such cursor's first
valid entry would always have position 0.
This provides a means for reconciling multiple rows that have been produced generally from
. When getting a set, there is a need,
while allowing parallel processing to proceed, to ensure that the original order is
recoverable. Note, whether or not a user cares about that original order in one's specific application is
another story altogether (most callers of this as a practical matter do not, otherwise they would not call
it), but at least in principle it should be possible to reconstruct the original order one would get from an
identically configured . So: for any cursor
implementation, batch numbers should be non-decreasing. Furthermore, any given batch number should only
appear in one of the cursors as returned by
. In this way, order is determined by
batch number. An operation that reconciles these cursors to produce a consistent single cursoring, could do
so by drawing from the single cursor, among all cursors in the set, that has the smallest batch number
available.
Note that there is no suggestion that the batches for a particular entry will be consistent from cursoring
to cursoring, except for the consistency in resulting in the same overall ordering. The same entry could
have different batch numbers from one cursoring to another. There is also no requirement that any given
batch number must appear, at all. It is merely a mechanism for recovering ordering from a possibly arbitrary
partitioning of the data. It also follows from this, of course, that considering the batch to be a property
of the data is completely invalid.
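The reconciliation strategy described above, always drawing from the cursor with the smallest available batch number, can be sketched in Python. Each cursor is modelled here as a list of (batch, row) pairs with non-decreasing batch numbers and no batch number shared between cursors; that representation and the name reconcile are assumptions for illustration.

```python
def reconcile(cursors):
    """Merge partitioned cursors back into the serial order.

    At every step the next row is taken from the cursor whose current
    entry carries the smallest batch number.
    """
    iterators = [iter(cursor) for cursor in cursors]
    heads = [next(it, None) for it in iterators]
    merged = []
    while any(head is not None for head in heads):
        live = (i for i, head in enumerate(heads) if head is not None)
        k = min(live, key=lambda i: heads[i][0])
        merged.append(heads[k][1])
        heads[k] = next(iterators[k], None)
    return merged


cursor_a = [(0, "r0"), (0, "r1"), (2, "r4"), (2, "r5")]
cursor_b = [(1, "r2"), (1, "r3"), (3, "r6")]
print(reconcile([cursor_a, cursor_b]))
```

Because batch numbers only order the rows within one cursoring, the same sketch run over a different partitioning of the same data would see different batch numbers yet produce the same merged order.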
A getter for a 128-bit ID value. It is common for objects to serve multiple
instances to iterate over what is supposed to be the same data, for example, in a
a cursor set will produce the same data as a serial cursor, just partitioned, and a shuffled cursor will
produce the same data as a serial cursor or any other shuffled cursor, only shuffled. The ID exists for
applications that need to reconcile which entry is actually which. Ideally this ID should be unique, but for
practical reasons, it suffices if collisions are simply extremely improbable.
Note that this ID, while it must be consistent for multiple streams according to the semantics above, is not
considered part of the data per se. So, to take the example of a data view specifically, a single data view
must render consistent IDs across all cursorings, but there is no suggestion at all that if the "same" data
were presented in a different data view (as by, say, being transformed, cached, saved, or whatever), that
the IDs between the two different data views would have any discernible relationship.
Returns whether the given column is active in this row.
Returns a value getter delegate to fetch the value of the given , from the row.
This throws if the column is not active in this row, or if the type
differs from this column's type.
is the column's content type.
is the output column whose getter should be returned.
Gets a , which provides name and type information for variables
(i.e., columns in ML.NET's type system) stored in this row.
Implementation of dispose. Calls with .
The disposable method for the disposable pattern. This default implementation does nothing.
Whether this was called from .
Subclasses that implement should call this method with
, but implementing finalizers should be
avoided if at all possible.
Class used to cursor through rows of an .
Note that this is also an . The is
incremented by . Prior to the first call to , or after
returns , is -1.
Otherwise, when returns , >= 0.
Advance to the next row. When the cursor is first created, this method should be called to
move to the first row. Returns if there are no more rows.
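The position semantics described above (-1 before the first advance and after the cursor is exhausted, non-negative while on a valid row) can be modelled with a small Python class. ListCursor, move_next, and position are hypothetical names for this sketch, which iterates over an in-memory list rather than a data view.

```python
class ListCursor:
    """A minimal cursor over a list of rows."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._done = False
        self.position = -1  # no row is available until move_next succeeds

    def move_next(self):
        """Advance to the next row; return False once the rows run out."""
        if self._done:
            return False
        if self.position + 1 < len(self._rows):
            self.position += 1
            return True
        self._done = True
        self.position = -1  # values can no longer be fetched
        return False

    def current(self):
        """Return the current row; only valid while position >= 0."""
        if self.position < 0:
            raise RuntimeError("cursor is not positioned on a row")
        return self._rows[self.position]


cursor = ListCursor(["a", "b"])
while cursor.move_next():
    print(cursor.position, cursor.current())
print(cursor.position)
```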
The debugger proxy for .
The debugger proxy for .