#######################################
#            Thrust v1.7.0            #
#######################################

Summary
    Thrust 1.7.0 introduces a new interface for controlling algorithm execution as
    well as several new algorithms and performance improvements. With this new
    interface, users may directly control how algorithms execute as well as details
    such as the allocation of temporary storage. Key/value versions of thrust::merge
    and the set operation algorithms have been added, as well as stencil versions of
    partitioning algorithms. thrust::tabulate has been introduced to tabulate the
    values of functions taking integers. For 32-bit types, new CUDA merge and set
    operations provide 2-15x faster performance while a new CUDA comparison sort
    provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation
    provides 80% faster performance.

Breaking API Changes
    Dispatch
        Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead
        of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently fail to be found during dispatch.
        See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.

        thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.

    Iterators
        iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
        iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
        iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
        iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
        All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
        Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.

    Other
        normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
        Placeholder expressions may no longer include the comma operator.

New Features
    Execution Policies
        Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.
        For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
        The following execution policies are supported in this version:

            thrust::host
            thrust::device
            thrust::cpp::par
            thrust::cuda::par
            thrust::omp::par
            thrust::tbb::par

    Algorithms
        free
        get_temporary_buffer
        malloc
        merge_by_key
        partition with stencil
        partition_copy with stencil
        return_temporary_buffer
        set_difference_by_key
        set_intersection_by_key
        set_symmetric_difference_by_key
        set_union_by_key
        stable_partition with stencil
        stable_partition_copy with stencil
        tabulate

New Examples
    uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.

Other Enhancements
    Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
    Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
    THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
    CUDA merge performance is 2-15x faster.
    CUDA comparison sort performance is 1.3-4x faster.
    CUDA set operation performance is 1.5-15x faster.
    TBB reduce_by_key performance is 80% faster.
    Several algorithms have been parallelized with TBB.
    Support for user allocators in vectors has been improved.
    The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
    Warnings have been eliminated in various contexts.
    Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
    Documentation about algorithm requirements has been improved.
    Simplified the minimal_custom_backend example.
    Simplified the cuda/custom_temporary_allocation example.
    Simplified the cuda/fallback_allocator example.

Bug Fixes
    #248 fix broken counting_iterator<float> behavior with OpenMP
    #231, #209 fix set operation failures with CUDA
    #187 fix incorrect occupancy calculation with CUDA
    #153 fix broken multigpu behavior with CUDA
    #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
    #208 correctly initialize elements in temporary storage when necessary
    #16 fix compilation error when sorting bool with CUDA
    #10 fix ambiguous overloads of reinterpret_tag

Known Issues
    g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly, causing infinite recursion in examples such as cuda/custom_temporary_allocation.

Acknowledgments
    Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
    Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
    Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.

#######################################
#            Thrust v1.6.0            #
#######################################

Summary
    Thrust v1.6.0 provides an interface for customization and extension and a new
    backend system based on the Threading Building Blocks library. With this
    new interface, programmers may customize the behavior of specific algorithms
    as well as control the allocation of temporary storage or invent entirely new
    backends. These enhancements also allow multiple different backend systems
    such as CUDA and OpenMP to coexist within a single program. Support for TBB
    allows Thrust programs to integrate more naturally into applications which
    may already employ the TBB task scheduler.

Breaking API Changes
    The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
    thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
    The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
    The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
    The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
    thrust::host_space_tag has been renamed thrust::host_system_tag
    thrust::device_space_tag has been renamed thrust::device_system_tag
    thrust::any_space_tag has been renamed thrust::any_system_tag
    thrust::iterator_space has been renamed thrust::iterator_system

New Features
    Backend Systems
        Threading Building Blocks (TBB) is now supported
    Functions
        for_each_n
        raw_reference_cast
    Types
        pointer
        reference

New Examples
    cuda/custom_temporary_allocation
    cuda/fallback_allocator
    device_ptr
    expand
    minimal_custom_backend
    raw_reference_cast
    set_operations

Other Enhancements
    thrust::for_each now returns the end of the input range similar to most other algorithms
    thrust::pair and thrust::tuple have swap functionality
    all CUDA algorithms now support large data types
    iterators may be dereferenced in user __device__ or __global__ functions
    the safe use of different backend systems is now possible within a single binary

Bug Fixes
    #469 min_element and max_element algorithms no longer require a const comparison operator

Known Issues
    cudafe++.exe may crash when parsing TBB headers on Windows.

#######################################
#            Thrust v1.5.3            #
#######################################

Summary
    Small bug fixes

Bug Fixes
    Avoid warnings about potential race due to __shared__ non-POD variable

#######################################
#            Thrust v1.5.2            #
#######################################

Summary
    Small bug fixes

Bug Fixes
    Fixed warning about C-style initialization of structures

#######################################
#            Thrust v1.5.1            #
#######################################

Summary
    Small bug fixes

Bug Fixes
    Sorting data referenced by permutation_iterators on CUDA produces invalid results

#######################################
#            Thrust v1.5.0            #
#######################################

Summary
    Thrust v1.5.0 introduces new programmer productivity and performance
    enhancements. New functionality for creating anonymous "lambda" functions has
    been added. A faster host sort provides 2-10x faster performance for sorting
    arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
    2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
    arithmetic types with the OpenMP backend the combined performance improvement
    is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
    (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
    performance.

Breaking API Changes
    device_ptr<void> no longer unsafely converts to device_ptr<T> without an
    explicit cast. Use the expression
        device_pointer_cast(static_cast<int*>(void_ptr.get()))
    to convert, for example, device_ptr<void> to device_ptr<int>.

New Features
    Functions
        stencil-less transform_if

    Types
        lambda placeholders

New Examples
    lambda

Other Enhancements
    host sort is 2-10x faster for arithmetic types
    OMP sort provides speedup over host sort
    reduce_by_key is 2-3x faster
    reduce_by_key no longer requires O(N) temporary storage
    CUDA scan algorithms are 10-40% faster
    host_vector and device_vector are now documented
    out-of-memory exceptions now provide detailed information from CUDART
    improved histogram example
    device_reference now has a specialized swap
    reduce_by_key and scan algorithms are compatible with discard_iterator

Removed Functionality

Bug Fixes
    #44 allow host_vector to compile when value_type uses __align__
    #198 allow adjacent_difference to permit safe in-situ operation
    #303 make thrust thread-safe
    #313 avoid race conditions in device_vector::insert
    #314 avoid unintended adl invocation when dispatching copy
    #365 fix merge and set operation failures

Known Issues
    None

Acknowledgments
    Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
    Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.

#######################################
#            Thrust v1.4.0            #
#######################################

Summary
    Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
    and performance improvements. New set theoretic algorithms operating on
    sorted sequences have been added. Additionally, a new fancy iterator
    allows discarding redundant or otherwise unnecessary output from
    algorithms, conserving memory storage and bandwidth.

Breaking API Changes
    Eliminations
        thrust/is_sorted.h
        thrust/utility.h
        thrust/set_intersection.h
        thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
        thrust::deprecated::copy_when
        thrust::deprecated::absolute_value

New Features
    Functions
        copy_n
        merge
        set_difference
        set_symmetric_difference
        set_union

    Types
        discard_iterator

    Device support
        Compute Capability 2.1 GPUs

New Examples
    run_length_decoding

Other Enhancements
    Compilation warnings are substantially reduced in various contexts.
    The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key, and thrust::stable_sort_by_key is substantially reduced.
    A fast sort implementation is used when sorting primitive types with thrust::greater.
    The performance of thrust::set_intersection is improved.
    The performance of thrust::fill is improved on SM 1.x devices.
    A code example is now provided in each algorithm's documentation.
    thrust::reverse now operates in-place

Removed Functionality
    thrust::deprecated::copy_when
    thrust::deprecated::absolute_value
    thrust::experimental::cuda::ogl_interop_allocator
    thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
    Operations which modify the elements of a thrust::device_vector are no longer available from source code compiled without nvcc when the device backend is CUDA. Instead, use the idiom from the cpp_interop example.

Bug Fixes
    #212 set_intersection works correctly for large input sizes.
    #275 counting_iterator and constant_iterator work correctly with OpenMP as the backend when compiling with optimization
    #256 min and max correctly return their first argument as a tie-breaker
    #248 NDEBUG is interpreted correctly

Known Issues
    nvcc may generate code containing warnings when compiling some Thrust algorithms.
    When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue benign pointer advisories.
    When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
    thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.

Acknowledgments
    Thanks to David Tarjan for improving the performance of set_intersection.
    Thanks to Duane Merrill for continued help with sort.
    Thanks to Nathan Whitehead for help with CUDA Toolkit integration.

#######################################
#            Thrust v1.3.0            #
#######################################

Summary
    Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
    and performance enhancements.

    Performance of the sort and sort_by_key algorithms is improved by as much
    as 3x in certain situations. The performance of stream compaction algorithms,
    such as copy_if, is improved by as much as 2x. Reduction performance is
    also improved, particularly for small input sizes.

    CUDA errors are now converted to runtime exceptions using the system_error
    interface. Combined with a debug mode, also new in v1.3, runtime errors
    can be located with greater precision.

    Lastly, a few header files have been consolidated or renamed for clarity.
    See the deprecations section below for additional details.

Breaking API Changes
    Promotions
        thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
        thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
        thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
        thrust::next::gather has been renamed thrust::gather
        thrust::next::gather_if has been renamed thrust::gather_if
        thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
    Deprecations
        thrust::copy_when has been renamed thrust::deprecated::copy_when
        thrust::absolute_value has been renamed thrust::deprecated::absolute_value
        The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
        The header thrust/utility.h is now deprecated; use thrust/swap.h instead
        The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
    Eliminations
        thrust::deprecated::gather
        thrust::deprecated::gather_if
        thrust/experimental/arch.h and the functions therein
        thrust/sorting/merge_sort.h
        thrust/sorting/radix_sort.h

New Features
    Functions
        exclusive_scan_by_key
        find
        find_if
        find_if_not
        inclusive_scan_by_key
        is_partitioned
        is_sorted_until
        mismatch
        partition_point
        reverse
        reverse_copy
        stable_partition_copy

    Types
        system_error and related types
        experimental::cuda::ogl_interop_allocator
        bit_and, bit_or, and bit_xor

    Device support
        gf104-based GPUs

New Examples
    opengl_interop.cu
    repeated_range.cu
    simple_moving_average.cu
    sparse_vector.cu
    strided_range.cu

Other Enhancements
    Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
    Performance of thrust::copy_if is substantially improved
    Performance of thrust::reduce and related reductions is improved
    THRUST_DEBUG mode added
    Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
    The number of compiler warnings generated by Thrust has been substantially reduced
    Comparison sort now works correctly for input sizes > 32M
    min & max usage no longer collides with <windows.h> definitions
    Compiling against the OpenMP backend no longer requires nvcc
    Performance of device_vector initialized in .cpp files is substantially improved in common cases
    Performance of thrust::sort_by_key on the host is substantially improved

Removed Functionality
    nvcc 2.3 is no longer supported

Bug Fixes
    Debug device code now compiles correctly
    thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host

Known Issues
    #212 set_intersection is known to fail for large input sizes
    partition_point is known to fail for 64b types with nvcc 3.2

Acknowledgments
    Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
    Thanks to Erich Elsen for contributing an implementation of find_if
    Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
    Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
    Thanks to Cliff Woolley for help with testing

#######################################
#            Thrust v1.2.1            #
#######################################

Summary
    Small fixes for compatibility with CUDA 3.1

Known Issues
    inclusive_scan & exclusive_scan may fail with very large types
    the Microsoft compiler may fail to compile code using both sort and binary search algorithms
    uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
    #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
    default_random_engine::discard is not accelerated with nvcc 2.3
    nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48.

#######################################
#            Thrust v1.2.0            #
#######################################

Summary
    Thrust v1.2 introduces support for compilation to multicore CPUs
    and the Ocelot virtual machine, and several new facilities for
    pseudo-random number generation. New algorithms such as set
    intersection and segmented reduction have also been added. Lastly,
    improvements to the robustness of the CUDA backend ensure
    correctness across a broad set of (uncommon) use cases.

Breaking API Changes
    thrust::gather's interface was incorrect and has been removed.
    The old interface is deprecated but will be preserved for Thrust
    version 1.2 at thrust::deprecated::gather &
    thrust::deprecated::gather_if. The new interface is provided at
    thrust::next::gather & thrust::next::gather_if. The new interface
    will be promoted to thrust:: in Thrust version 1.3. For more details,
    please refer to this thread:
    http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd

    The thrust::sorting namespace has been deprecated in favor of the
    top-level sorting functions, such as thrust::sort() and
    thrust::sort_by_key().

New Features
    Functions
        reduce_by_key
        set_intersection
        tie
        unique_copy
        unique_by_key
        unique_copy_by_key

    Types
        Random Number Generation
            discard_block_engine
            default_random_engine
            linear_congruential_engine
            linear_feedback_shift_engine
            minstd_rand
            minstd_rand0
            normal_distribution (experimental)
            ranlux24
            ranlux48
            ranlux24_base
            ranlux48_base
            subtract_with_carry_engine
            taus88
            uniform_int_distribution
            uniform_real_distribution
            xor_combine_engine
        Functionals
            project1st
            project2nd

    Fancy Iterators
        permutation_iterator
        reverse_iterator

    Device support
        Add support for multicore CPUs via OpenMP
        Add support for Fermi-class GPUs
        Add support for Ocelot virtual machine

New Examples
    cpp_integration
    histogram
    mode
    monte_carlo
    monte_carlo_disjoint_sequences
    padded_grid_reduction
    permutation_iterator
    row_sum
    run_length_encoding
    segmented_scan
    stream_compaction
    summary_statistics
    transform_iterator
    word_count

Other Enhancements
    vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
    integer sorting performance is improved when max is large but (max - min) is small and when min is negative
    performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
    support for nvcc 3.0

Removed Functionality
    removed support for equal between host & device sequences
    removed support for gather() and scatter() between host & device sequences

Bug Fixes
    #8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
    #42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
    #46 gather & scatter handle any space iterators correctly
    #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
    #52 avoid collisions with common user macros such as BLOCK_SIZE
    #62 provide better documentation for device_reference
    #68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
    #102 eliminated a race condition in device_vector::erase
    various compilation warnings eliminated

Known Issues
    inclusive_scan & exclusive_scan may fail with very large types
    the Microsoft compiler may fail to compile code using both sort and binary search algorithms
    uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
    #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
    default_random_engine::discard is not accelerated with nvcc 2.3

Acknowledgments
    Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
    Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
    Thanks to Tom Bradley for contributing an implementation of normal_distribution
    Thanks to Joseph Rhoads for contributing the example summary_statistics

#######################################
#            Thrust v1.1.1            #
#######################################

Summary
    Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.

#######################################
#            Thrust v1.1.0            #
#######################################

Summary
    Thrust v1.1 introduces fancy iterators, binary search functions, and
    several specialized reduction functions. Experimental support for
    segmented scan has also been added.

Breaking API Changes
    counting_iterator has been moved into the thrust namespace (previously thrust::experimental)

New Features
    Functions
        copy_if
        lower_bound
        upper_bound
        vectorized lower_bound
        vectorized upper_bound
        equal_range
        binary_search
        vectorized binary_search
        all_of
        any_of
        none_of
        minmax_element
        advance
        inclusive_segmented_scan (experimental)
        exclusive_segmented_scan (experimental)

    Types
        pair
        tuple
        device_malloc_allocator

    Fancy Iterators
        constant_iterator
        counting_iterator
        transform_iterator
        zip_iterator

New Examples
    computing the maximum absolute difference between vectors
    computing the bounding box of a two-dimensional point set
    sorting multiple arrays together (lexicographical sorting)
    constructing a summed area table
    using zip_iterator to mimic an array of structs
    using constant_iterator to increment array values

Other Enhancements
    added pinned memory allocator (experimental)
    added more methods to host_vector & device_vector (issue #4)
    added variant of remove_if with a stencil argument (issue #29)
    scan and reduce use cudaFuncGetAttributes to determine grid size
    exceptions are reported when temporary device arrays cannot be allocated

Bug Fixes
    #5 make vector work for larger data types
    #9 stable_partition_copy doesn't respect OutputIterator concept semantics
    #10 scans should return OutputIterator
    #16 make algorithms work for larger data types
    #27 dispatch radix_sort even when comp=less<T> is explicitly provided

Known Issues
    Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1
    uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device.
    inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver

#######################################
#            Thrust v1.0.0            #
#######################################

Breaking API Changes
    Rename top level namespace komrade to thrust.
    Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface.
    Rename range() to sequence() to avoid collision with Boost.Range.
    Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if().

New Features
    Add C++0x style cbegin() & cend() methods to host_vector & device_vector.
    Add transform_if function.
    Add stencil versions of replace_if() & replace_copy_if().
    Allow counting_iterator to work with for_each().
    Allow types with constructors in comparison sort & reduce.

Other Enhancements
    merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device.

Bug Fixes
    Work around an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6)
    Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)