mirror of https://github.com/GOSTSec/ccminer
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
662 lines
27 KiB
662 lines
27 KiB
####################################### |
|
# Thrust v1.7.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as |
|
well as several new algorithms and performance improvements. With this new |
|
interface, users may directly control how algorithms execute as well as details |
|
such as the allocation of temporary storage. Key/value versions of thrust::merge |
|
and the set operation algorithms have been added, as well stencil versions of |
|
partitioning algorithms. thrust::tabulate has been introduced to tabulate the |
|
values of functions taking integers. For 32b types, new CUDA merge and set |
|
operations provide 2-15x faster performance while a new CUDA comparison sort |
|
provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation |
|
provides 80% faster performance. |
|
|
|
Breaking API Changes |
|
Dispatch |
|
Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead |
|
of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch. |
|
See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples. |
|
|
|
thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized. |
|
|
|
Iterators |
|
iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated. |
|
iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor). |
|
iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade). |
|
iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access). |
|
All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible. |
|
Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type. |
|
|
|
Other |
|
normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution). |
|
Placeholder expressions may no longer include the comma operator. |
|
|
|
New Features |
|
Execution Policies |
|
Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. |
|
For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution. |
|
The following execution policies are supported in this version: |
|
|
|
thrust::host |
|
thrust::device |
|
thrust::cpp::par |
|
thrust::cuda::par |
|
thrust::omp::par |
|
thrust::tbb::par |
|
|
|
Algorithms |
|
free |
|
get_temporary_buffer |
|
malloc |
|
merge_by_key |
|
partition with stencil |
|
partition_copy with stencil |
|
return_temporary_buffer |
|
set_difference_by_key |
|
set_intersection_by_key |
|
set_symmetric_difference_by_key |
|
set_union_by_key |
|
stable_partition with stencil |
|
stable_partition_copy with stencil |
|
tabulate |
|
|
|
New Examples |
|
uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector. |
|
|
|
Other Enhancements |
|
Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter. |
|
Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device. |
|
THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend. |
|
CUDA merge performance is 2-15x faster. |
|
CUDA comparison sort performance is 1.3-4x faster. |
|
CUDA set operation performance is 1.5-15x faster. |
|
TBB reduce_by_key performance is 80% faster. |
|
Several algorithms have been parallelized with TBB. |
|
Support for user allocators in vectors has been improved. |
|
The sparse_vector example is now implemented with merge_by_key instead of sort_by_key. |
|
Warnings have been eliminated in various contexts. |
|
Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts. |
|
Documentation about algorithm requirements have been improved. |
|
Simplified the minimal_custom_backend example. |
|
Simplified the cuda/custom_temporary_allocation example. |
|
Simplified the cuda/fallback_allocator example. |
|
|
|
Bug Fixes |
|
#248 fix broken counting_iterator<float> behavior with OpenMP |
|
#231, #209 fix set operation failures with CUDA |
|
#187 fix incorrect occupancy calculation with CUDA |
|
#153 fix broken multigpu behavior with CUDA |
|
#142 eliminate warning produced by thrust::random::taus88 and MSVC 2010 |
|
#208 correctly initialize elements in temporary storage when necessary |
|
#16 fix compilation error when sorting bool with CUDA |
|
#10 fix ambiguous overloads of reinterpret_tag |
|
|
|
Known Issues |
|
g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation. |
|
|
|
Acknowledgments |
|
Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA. |
|
Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA. |
|
Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm. |
|
|
|
####################################### |
|
# Thrust v1.6.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.6.0 provides an interface for customization and extension and a new |
|
backend system based on the Threading Building Blocks library. With this |
|
new interface, programmers may customize the behavior of specific algorithms |
|
as well as control the allocation of temporary storage or invent entirely new |
|
backends. These enhancements also allow multiple different backend systems |
|
such as CUDA and OpenMP to coexist within a single program. Support for TBB |
|
allows Thrust programs to integrate more naturally into applications which |
|
may already employ the TBB task scheduler. |
|
|
|
Breaking API Changes |
|
The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h> |
|
thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator |
|
The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM |
|
The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA |
|
The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP |
|
thrust::host_space_tag has been renamed thrust::host_system_tag |
|
thrust::device_space_tag has been renamed thrust::device_system_tag |
|
thrust::any_space_tag has been renamed thrust::any_system_tag |
|
thrust::iterator_space has been renamed thrust::iterator_system |
|
|
|
|
|
New Features |
|
Backend Systems |
|
Threading Building Blocks (TBB) is now supported |
|
Functions |
|
for_each_n |
|
raw_reference_cast |
|
Types |
|
pointer |
|
reference |
|
|
|
New Examples |
|
cuda/custom_temporary_allocation |
|
cuda/fallback_allocator |
|
device_ptr |
|
expand |
|
minimal_custom_backend |
|
raw_reference_cast |
|
set_operations |
|
|
|
Other Enhancements |
|
thrust::for_each now returns the end of the input range similar to most other algorithms |
|
thrust::pair and thrust::tuple have swap functionality |
|
all CUDA algorithms now support large data types |
|
iterators may be dereferenced in user __device__ or __global__ functions |
|
the safe use of different backend systems is now possible within a single binary |
|
|
|
Bug Fixes |
|
#469 min_element and max_element algorithms no longer require a const comparison operator |
|
|
|
Known Issues |
|
cudafe++.exe may crash when parsing TBB headers on Windows. |
|
|
|
####################################### |
|
# Thrust v1.5.3 # |
|
####################################### |
|
|
|
Summary |
|
Small bug fixes |
|
|
|
Bug Fixes |
|
Avoid warnings about potential race due to __shared__ non-POD variable |
|
|
|
####################################### |
|
# Thrust v1.5.2 # |
|
####################################### |
|
|
|
Summary |
|
Small bug fixes |
|
|
|
Bug Fixes |
|
Fixed warning about C-style initialization of structures |
|
|
|
####################################### |
|
# Thrust v1.5.1 # |
|
####################################### |
|
|
|
Summary |
|
Small bug fixes |
|
|
|
Bug Fixes |
|
Sorting data referenced by permutation_iterators on CUDA produces invalid results |
|
|
|
####################################### |
|
# Thrust v1.5.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.5.0 provides introduces new programmer productivity and performance |
|
enhancements. New functionality for creating anonymous "lambda" functions has |
|
been added. A faster host sort provides 2-10x faster performance for sorting |
|
arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides |
|
2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting |
|
arithmetic types with the OpenMP backend the combined performance improvement |
|
is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x |
|
(8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster |
|
performance. |
|
|
|
Breaking API Changes |
|
device_ptr<void> no longer unsafely converts to device_ptr<T> without an |
|
explicit cast. Use the expression |
|
device_pointer_cast(static_cast<int*>(void_ptr.get())) |
|
to convert, for example, device_ptr<void> to device_ptr<int>. |
|
|
|
New Features |
|
Functions |
|
stencil-less transform_if |
|
|
|
Types |
|
lambda placeholders |
|
|
|
New Examples |
|
lambda |
|
|
|
Other Enhancements |
|
host sort is 2-10x faster for arithmetic types |
|
OMP sort provides speedup over host sort |
|
reduce_by_key is 2-3x faster |
|
reduce_by_key no longer requires O(N) temporary storage |
|
CUDA scan algorithms are 10-40% faster |
|
host_vector and device_vector are now documented |
|
out-of-memory exceptions now provide detailed information from CUDART |
|
improved histogram example |
|
device_reference now has a specialized swap |
|
reduce_by_key and scan algorithms are compatible with discard_iterator |
|
|
|
Removed Functionality |
|
|
|
Bug Fixes |
|
#44 allow host_vector to compile when value_type uses __align__ |
|
#198 allow adjacent_difference to permit safe in-situ operation |
|
#303 make thrust thread-safe |
|
#313 avoid race conditions in device_vector::insert |
|
#314 avoid unintended adl invocation when dispatching copy |
|
#365 fix merge and set operation failures |
|
|
|
Known Issues |
|
None |
|
|
|
Acknowledgments |
|
Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived. |
|
Thanks to Jean-Francois Bastien for suggesting a fix for issue 303. |
|
|
|
####################################### |
|
# Thrust v1.4.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature |
|
and performance improvements. New set theoretic algorithms operating on |
|
sorted sequences have been added. Additionally, a new fancy iterator |
|
allows discarding redundant or otherwise unnecessary output from |
|
algorithms, conserving memory storage and bandwidth. |
|
|
|
Breaking API Changes |
|
Eliminations |
|
thrust/is_sorted.h |
|
thrust/utility.h |
|
thrust/set_intersection.h |
|
thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein |
|
thrust::deprecated::copy_when |
|
thrust::deprecated::absolute_value |
|
|
|
New Features |
|
Functions |
|
copy_n |
|
merge |
|
set_difference |
|
set_symmetric_difference |
|
set_union |
|
|
|
Types |
|
discard_iterator |
|
|
|
Device support |
|
Compute Capability 2.1 GPUs |
|
|
|
New Examples |
|
run_length_decoding |
|
|
|
Other Enhancements |
|
Compilation warnings are substantially reduced in various contexts. |
|
The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key, |
|
and thrust::stable_sort_by_key are substantially reduced. |
|
A fast sort implementation is used when sorting primitive types with thrust::greater. |
|
The performance of thrust::set_intersection is improved. |
|
The performance of thrust::fill is improved on SM 1.x devices. |
|
A code example is now provided in each algorithm's documentation. |
|
thrust::reverse now operates in-place |
|
|
|
Removed Functionality |
|
thrust::deprecated::copy_when |
|
thrust::deprecated::absolute_value |
|
thrust::experimental::cuda::ogl_interop_allocator |
|
thrust::gather and thrust::scatter from host to device and vice versa are no longer supported. |
|
Operations which modify the elements of a thrust::device_vector are no longer |
|
available from source code compiled without nvcc when the device backend is CUDA. |
|
Instead, use the idiom from the cpp_interop example. |
|
|
|
Bug Fixes |
|
#212 set_intersection works correctly for large input sizes. |
|
#275 counting_iterator and constant_iterator work correctly with OpenMP as the |
|
backend when compiling with optimization |
|
#256 min and max correctly return their first argument as a tie-breaker |
|
#248 NDEBUG is interpreted correctly |
|
|
|
Known Issues |
|
nvcc may generate code containing warnings when compiling some Thrust algorithms. |
|
When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue |
|
benign pointer advisories. |
|
When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly. |
|
thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, |
|
and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator. |
|
|
|
Acknowledgments |
|
Thanks to David Tarjan for improving the performance of set_intersection. |
|
Thanks to Duane Merrill for continued help with sort. |
|
Thanks to Nathan Whitehead for help with CUDA Toolkit integration. |
|
|
|
####################################### |
|
# Thrust v1.3.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature |
|
and performance enhancements. |
|
|
|
Performance of the sort and sort_by_key algorithms is improved by as much |
|
as 3x in certain situations. The performance of stream compaction algorithms, |
|
such as copy_if, is improved by as much as 2x. Reduction performance is |
|
also improved, particularly for small input sizes. |
|
|
|
CUDA errors are now converted to runtime exceptions using the system_error |
|
interface. Combined with a debug mode, also new in v1.3, runtime errors |
|
can be located with greater precision. |
|
|
|
Lastly, a few header files have been consolidated or renamed for clarity. |
|
See the deprecations section below for additional details. |
|
|
|
|
|
Breaking API Changes |
|
Promotions |
|
thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface |
|
thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface |
|
thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface |
|
thrust::next::gather has been renamed thrust::gather |
|
thrust::next::gather_if has been renamed thrust::gather_if |
|
thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy |
|
Deprecations |
|
thrust::copy_when has been renamed thrust::deprecated::copy_when |
|
thrust::absolute_value has been renamed thrust::deprecated::absolute_value |
|
The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead |
|
The header thrust/utility.h is now deprecated; use thrust/swap.h instead |
|
The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead |
|
Eliminations |
|
thrust::deprecated::gather |
|
thrust::deprecated::gather_if |
|
thrust/experimental/arch.h and the functions therein |
|
thrust/sorting/merge_sort.h |
|
thrust/sorting/radix_sort.h |
|
|
|
New Features |
|
Functions |
|
exclusive_scan_by_key |
|
find |
|
find_if |
|
find_if_not |
|
inclusive_scan_by_key |
|
is_partitioned |
|
is_sorted_until |
|
mismatch |
|
partition_point |
|
reverse |
|
reverse_copy |
|
stable_partition_copy |
|
|
|
Types |
|
system_error and related types |
|
experimental::cuda::ogl_interop_allocator |
|
bit_and, bit_or, and bit_xor |
|
|
|
Device support |
|
gf104-based GPUs |
|
|
|
New Examples |
|
opengl_interop.cu |
|
repeated_range.cu |
|
simple_moving_average.cu |
|
sparse_vector.cu |
|
strided_range.cu |
|
|
|
Other Enhancements |
|
Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types |
|
Performance of thrust::copy_if is substantially improved |
|
Performance of thrust::reduce and related reductions is improved |
|
THRUST_DEBUG mode added |
|
Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error |
|
The number of compiler warnings generated by Thrust has been substantially reduced |
|
Comparison sort now works correctly for input sizes > 32M |
|
min & max usage no longer collides with <windows.h> definitions |
|
Compiling against the OpenMP backend no longer requires nvcc |
|
Performance of device_vector initialized in .cpp files is substantially improved in common cases |
|
Performance of thrust::sort_by_key on the host is substantially improved |
|
|
|
Removed Functionality |
|
nvcc 2.3 is no longer supported |
|
|
|
Bug Fixes |
|
Debug device code now compiles correctly |
|
thrust::uninitialized_copy and thrust::unintialized_fill now dispatch constructors on the device rather than the host |
|
|
|
Known Issues |
|
#212 set_intersection is known to fail for large input sizes |
|
partition_point is known to fail for 64b types with nvcc 3.2 |
|
|
|
Acknowledgments |
|
Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation |
|
Thanks to Erich Elsen for contributing an implementation of find_if |
|
Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc |
|
Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports |
|
Thanks to Cliff Woolley for help with testing |
|
|
|
####################################### |
|
# Thrust v1.2.1 # |
|
####################################### |
|
|
|
Summary |
|
Small fixes for compatibility with CUDA 3.1 |
|
|
|
Known Issues |
|
inclusive_scan & exclusive_scan may fail with very large types |
|
the Microsoft compiler may fail to compile code using both sort and binary search algorithms |
|
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device |
|
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads |
|
default_random_engine::discard is not accelerated with nvcc 2.3 |
|
nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48. |
|
|
|
####################################### |
|
# Thrust v1.2.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.2 introduces support for compilation to multicore CPUs |
|
and the Ocelot virtual machine, and several new facilities for |
|
pseudo-random number generation. New algorithms such as set |
|
intersection and segmented reduction have also been added. Lastly, |
|
improvements to the robustness of the CUDA backend ensure |
|
correctness across a broad set of (uncommon) use cases. |
|
|
|
Breaking API Changes |
|
thrust::gather's interface was incorrect and has been removed. |
|
The old interface is deprecated but will be preserved for Thrust |
|
version 1.2 at thrust::deprecated::gather & |
|
thrust::deprecated::gather_if. The new interface is provided at |
|
thrust::next::gather & thrust::next::gather_if. The new interface |
|
will be promoted to thrust:: in Thrust version 1.3. For more details, |
|
please refer to this thread: |
|
http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd |
|
|
|
The thrust::sorting namespace has been deprecated in favor of the |
|
top-level sorting functions, such as thrust::sort() and |
|
thrust::sort_by_key(). |
|
|
|
New Features |
|
Functions |
|
reduce_by_key |
|
set_intersection |
|
tie |
|
unique_copy |
|
unique_by_key |
|
unique_copy_by_key |
|
|
|
Types |
|
Random Number Generation |
|
discard_block_engine |
|
default_random_engine |
|
linear_congruential_engine |
|
linear_feedback_shift_engine |
|
minstd_rand |
|
minstd_rand0 |
|
normal_distribution (experimental) |
|
ranlux24 |
|
ranlux48 |
|
ranlux24_base |
|
ranlux48_base |
|
subtract_with_carry_engine |
|
taus88 |
|
uniform_int_distribution |
|
uniform_real_distribution |
|
xor_combine_engine |
|
Functionals |
|
project1st |
|
project2nd |
|
|
|
Fancy Iterators |
|
permutation_iterator |
|
reverse_iterator |
|
|
|
Device support |
|
Add support for multicore CPUs via OpenMP |
|
Add support for Fermi-class GPUs |
|
Add support for Ocelot virtual machine |
|
|
|
New Examples |
|
cpp_integration |
|
histogram |
|
mode |
|
monte_carlo |
|
monte_carlo_disjoint_sequences |
|
padded_grid_reduction |
|
permutation_iterator |
|
row_sum |
|
run_length_encoding |
|
segmented_scan |
|
stream_compaction |
|
summary_statistics |
|
transform_iterator |
|
word_count |
|
|
|
Other Enhancements |
|
vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit |
|
integer sorting performance is improved when max is large but (max - min) is small and when min is negative |
|
performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types |
|
support for nvcc 3.0 |
|
|
|
Removed Functionality |
|
removed support for equal between host & device sequences |
|
removed support for gather() and scatter() between host & device sequences |
|
|
|
Bug Fixes |
|
# 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time |
|
# 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms |
|
# 46 gather & scatter handle any space iterators correctly |
|
# 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs |
|
# 52 avoid collisions with common user macros such as BLOCK_SIZE |
|
# 62 provide better documentation for device_reference |
|
# 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode |
|
# 102 eliminated a race condition in device_vector::erase |
|
various compilation warnings eliminated |
|
|
|
Known Issues |
|
inclusive_scan & exclusive_scan may fail with very large types |
|
the Microsoft compiler may fail to compile code using both sort and binary search algorithms |
|
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device |
|
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads |
|
default_random_engine::discard is not accelerated with nvcc 2.3 |
|
|
|
Acknowledgments |
|
Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection |
|
Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot |
|
Thanks to Tom Bradley for contributing an implementation of normal_distribution |
|
Thanks to Joseph Rhoads for contributing the example summary_statistics |
|
|
|
####################################### |
|
# Thrust v1.1.1 # |
|
####################################### |
|
|
|
Summary |
|
Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard. |
|
|
|
####################################### |
|
# Thrust v1.1.0 # |
|
####################################### |
|
|
|
Summary |
|
Thrust v1.1 introduces fancy iterators, binary search functions, and |
|
several specialized reduction functions. Experimental support for |
|
segmented scan has also been added. |
|
|
|
Breaking API Changes |
|
counting_iterator has been moved into the thrust namespace (previously thrust::experimental) |
|
|
|
New Features |
|
Functions |
|
copy_if |
|
lower_bound |
|
upper_bound |
|
vectorized lower_bound |
|
vectorized upper_bound |
|
equal_range |
|
binary_search |
|
vectorized binary_search |
|
all_of |
|
any_of |
|
none_of |
|
minmax_element |
|
advance |
|
inclusive_segmented_scan (experimental) |
|
exclusive_segmented_scan (experimental) |
|
|
|
Types |
|
pair |
|
tuple |
|
device_malloc_allocator |
|
|
|
Fancy Iterators |
|
constant_iterator |
|
counting_iterator |
|
transform_iterator |
|
zip_iterator |
|
|
|
New Examples |
|
computing the maximum absolute difference between vectors |
|
computing the bounding box of a two-dimensional point set |
|
sorting multiple arrays together (lexicographical sorting) |
|
constructing a summed area table |
|
using zip_iterator to mimic an array of structs |
|
using constant_iterator to increment array values |
|
|
|
Other Enhancements |
|
added pinned memory allocator (experimental) |
|
added more methods to host_vector & device_vector (issue #4) |
|
added variant of remove_if with a stencil argument (issue #29) |
|
scan and reduce use cudaFuncGetAttributes to determine grid size |
|
exceptions are reported when temporary device arrays cannot be allocated |
|
|
|
Bug Fixes |
|
#5 make vector work for larger data types |
|
#9 stable_partition_copy doesn't respect OutputIterator concept semantics |
|
#10 scans should return OutputIterator |
|
#16 make algorithms work for larger data types |
|
#27 dispatch radix_sort even when comp=less<T> is explicitly provided |
|
|
|
Known Issues |
|
Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1 |
|
uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device. |
|
inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver |
|
|
|
|
|
####################################### |
|
# Thrust v1.0.0 # |
|
####################################### |
|
|
|
Breaking API changes |
|
Rename top level namespace komrade to thrust. |
|
Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface. |
|
Rename range() to sequence() to avoid collision with Boost.Range. |
|
Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if(). |
|
|
|
New Features |
|
Add C++0x style cbegin() & cend() methods to host_vector & device_vector. |
|
Add transform_if function. |
|
Add stencil versions of replace_if() & replace_copy_if(). |
|
Allow counting_iterator to work with for_each(). |
|
Allow types with constructors in comparison sort & reduce. |
|
|
|
Other Enhancements |
|
merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device. |
|
|
|
Bug fixes |
|
Workaround an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6) |
|
Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7) |
|
|
|
|