2014-03-18 22:17:40 +01:00
#######################################
# Thrust v1.7.0 #
#######################################
Summary
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as
well as several new algorithms and performance improvements. With this new
interface, users may directly control how algorithms execute as well as details
such as the allocation of temporary storage. Key/value versions of thrust::merge
and the set operation algorithms have been added, as well as stencil versions of
partitioning algorithms. thrust::tabulate has been introduced to tabulate the
values of functions taking integers. For 32b types, new CUDA merge and set
operations provide 2-15x faster performance while a new CUDA comparison sort
provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation
provides 80% faster performance.
Breaking API Changes
Dispatch
Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead
of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.
See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.
thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.
Iterators
iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
Other
normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
Placeholder expressions may no longer include the comma operator.
New Features
Execution Policies
Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.
For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
The following execution policies are supported in this version:
thrust::host
thrust::device
thrust::cpp::par
thrust::cuda::par
thrust::omp::par
thrust::tbb::par
Algorithms
free
get_temporary_buffer
malloc
merge_by_key
partition with stencil
partition_copy with stencil
return_temporary_buffer
set_difference_by_key
set_intersection_by_key
set_symmetric_difference_by_key
set_union_by_key
stable_partition with stencil
stable_partition_copy with stencil
tabulate
New Examples
uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.
Other Enhancements
Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
CUDA merge performance is 2-15x faster.
CUDA comparison sort performance is 1.3-4x faster.
CUDA set operation performance is 1.5-15x faster.
TBB reduce_by_key performance is 80% faster.
Several algorithms have been parallelized with TBB.
Support for user allocators in vectors has been improved.
The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
Warnings have been eliminated in various contexts.
Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
Documentation about algorithm requirements has been improved.
Simplified the minimal_custom_backend example.
Simplified the cuda/custom_temporary_allocation example.
Simplified the cuda/fallback_allocator example.
Bug Fixes
#248 fix broken counting_iterator<float> behavior with OpenMP
#231, #209 fix set operation failures with CUDA
#187 fix incorrect occupancy calculation with CUDA
#153 fix broken multigpu behavior with CUDA
#142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
#208 correctly initialize elements in temporary storage when necessary
#16 fix compilation error when sorting bool with CUDA
#10 fix ambiguous overloads of reinterpret_tag
Known Issues
g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation.
Acknowledgments
Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.
#######################################
# Thrust v1.6.0 #
#######################################
Summary
Thrust v1.6.0 provides an interface for customization and extension and a new
backend system based on the Threading Building Blocks library. With this
new interface, programmers may customize the behavior of specific algorithms
as well as control the allocation of temporary storage or invent entirely new
backends. These enhancements also allow multiple different backend systems
such as CUDA and OpenMP to coexist within a single program. Support for TBB
allows Thrust programs to integrate more naturally into applications which
may already employ the TBB task scheduler.
Breaking API Changes
The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
thrust::host_space_tag has been renamed thrust::host_system_tag
thrust::device_space_tag has been renamed thrust::device_system_tag
thrust::any_space_tag has been renamed thrust::any_system_tag
thrust::iterator_space has been renamed thrust::iterator_system
New Features
Backend Systems
Threading Building Blocks (TBB) is now supported
Functions
for_each_n
raw_reference_cast
Types
pointer
reference
New Examples
cuda/custom_temporary_allocation
cuda/fallback_allocator
device_ptr
expand
minimal_custom_backend
raw_reference_cast
set_operations
Other Enhancements
thrust::for_each now returns the end of the input range similar to most other algorithms
thrust::pair and thrust::tuple have swap functionality
all CUDA algorithms now support large data types
iterators may be dereferenced in user __device__ or __global__ functions
the safe use of different backend systems is now possible within a single binary
Bug Fixes
#469 min_element and max_element algorithms no longer require a const comparison operator
Known Issues
cudafe++.exe may crash when parsing TBB headers on Windows.
#######################################
# Thrust v1.5.3 #
#######################################
Summary
Small bug fixes
Bug Fixes
Avoid warnings about potential race due to __shared__ non-POD variable
#######################################
# Thrust v1.5.2 #
#######################################
Summary
Small bug fixes
Bug Fixes
Fixed warning about C-style initialization of structures
#######################################
# Thrust v1.5.1 #
#######################################
Summary
Small bug fixes
Bug Fixes
Fixed invalid results when sorting data referenced by permutation_iterators on CUDA
#######################################
# Thrust v1.5.0 #
#######################################
Summary
Thrust v1.5.0 introduces new programmer productivity and performance
enhancements. New functionality for creating anonymous "lambda" functions has
been added. A faster host sort provides 2-10x faster performance for sorting
arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
arithmetic types with the OpenMP backend the combined performance improvement
is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
(8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
performance.
Breaking API Changes
device_ptr<void> no longer unsafely converts to device_ptr<T> without an
explicit cast. Use the expression
device_pointer_cast(static_cast<int*>(void_ptr.get()))
to convert, for example, device_ptr<void> to device_ptr<int>.
New Features
Functions
stencil-less transform_if
Types
lambda placeholders
New Examples
lambda
Other Enhancements
host sort is 2-10x faster for arithmetic types
OMP sort provides speedup over host sort
reduce_by_key is 2-3x faster
reduce_by_key no longer requires O(N) temporary storage
CUDA scan algorithms are 10-40% faster
host_vector and device_vector are now documented
out-of-memory exceptions now provide detailed information from CUDART
improved histogram example
device_reference now has a specialized swap
reduce_by_key and scan algorithms are compatible with discard_iterator
Removed Functionality
Bug Fixes
#44 allow host_vector to compile when value_type uses __align__
#198 allow adjacent_difference to permit safe in-situ operation
#303 make thrust thread-safe
#313 avoid race conditions in device_vector::insert
#314 avoid unintended adl invocation when dispatching copy
#365 fix merge and set operation failures
Known Issues
None
Acknowledgments
Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.
#######################################
# Thrust v1.4.0 #
#######################################
Summary
Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
and performance improvements. New set theoretic algorithms operating on
sorted sequences have been added. Additionally, a new fancy iterator
allows discarding redundant or otherwise unnecessary output from
algorithms, conserving memory storage and bandwidth.
Breaking API Changes
Eliminations
thrust/is_sorted.h
thrust/utility.h
thrust/set_intersection.h
thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
New Features
Functions
copy_n
merge
set_difference
set_symmetric_difference
set_union
Types
discard_iterator
Device support
Compute Capability 2.1 GPUs
New Examples
run_length_decoding
Other Enhancements
Compilation warnings are substantially reduced in various contexts.
The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key,
and thrust::stable_sort_by_key is substantially reduced.
A fast sort implementation is used when sorting primitive types with thrust::greater.
The performance of thrust::set_intersection is improved.
The performance of thrust::fill is improved on SM 1.x devices.
A code example is now provided in each algorithm's documentation.
thrust::reverse now operates in-place
Removed Functionality
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
thrust::experimental::cuda::ogl_interop_allocator
thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
Operations which modify the elements of a thrust::device_vector are no longer
available from source code compiled without nvcc when the device backend is CUDA.
Instead, use the idiom from the cpp_interop example.
Bug Fixes
#212 set_intersection works correctly for large input sizes.
#275 counting_iterator and constant_iterator work correctly with OpenMP as the
backend when compiling with optimization
#256 min and max correctly return their first argument as a tie-breaker
#248 NDEBUG is interpreted correctly
Known Issues
nvcc may generate code containing warnings when compiling some Thrust algorithms.
When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue
benign pointer advisories.
When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key,
and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.
Acknowledgments
Thanks to David Tarjan for improving the performance of set_intersection.
Thanks to Duane Merrill for continued help with sort.
Thanks to Nathan Whitehead for help with CUDA Toolkit integration.
#######################################
# Thrust v1.3.0 #
#######################################
Summary
Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
and performance enhancements.
Performance of the sort and sort_by_key algorithms is improved by as much
as 3x in certain situations. The performance of stream compaction algorithms,
such as copy_if, is improved by as much as 2x. Reduction performance is
also improved, particularly for small input sizes.
CUDA errors are now converted to runtime exceptions using the system_error
interface. Combined with a debug mode, also new in v1.3, runtime errors
can be located with greater precision.
Lastly, a few header files have been consolidated or renamed for clarity.
See the deprecations section below for additional details.
Breaking API Changes
Promotions
thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
thrust::next::gather has been renamed thrust::gather
thrust::next::gather_if has been renamed thrust::gather_if
thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
Deprecations
thrust::copy_when has been renamed thrust::deprecated::copy_when
thrust::absolute_value has been renamed thrust::deprecated::absolute_value
The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
The header thrust/utility.h is now deprecated; use thrust/swap.h instead
The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
Eliminations
thrust::deprecated::gather
thrust::deprecated::gather_if
thrust/experimental/arch.h and the functions therein
thrust/sorting/merge_sort.h
thrust/sorting/radix_sort.h
New Features
Functions
exclusive_scan_by_key
find
find_if
find_if_not
inclusive_scan_by_key
is_partitioned
is_sorted_until
mismatch
partition_point
reverse
reverse_copy
stable_partition_copy
Types
system_error and related types
experimental::cuda::ogl_interop_allocator
bit_and, bit_or, and bit_xor
Device support
gf104-based GPUs
New Examples
opengl_interop.cu
repeated_range.cu
simple_moving_average.cu
sparse_vector.cu
strided_range.cu
Other Enhancements
Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
Performance of thrust::copy_if is substantially improved
Performance of thrust::reduce and related reductions is improved
THRUST_DEBUG mode added
Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
The number of compiler warnings generated by Thrust has been substantially reduced
Comparison sort now works correctly for input sizes > 32M
min & max usage no longer collides with <windows.h> definitions
Compiling against the OpenMP backend no longer requires nvcc
Performance of device_vector initialized in .cpp files is substantially improved in common cases
Performance of thrust::sort_by_key on the host is substantially improved
Removed Functionality
nvcc 2.3 is no longer supported
Bug Fixes
Debug device code now compiles correctly
thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host
Known Issues
#212 set_intersection is known to fail for large input sizes
partition_point is known to fail for 64b types with nvcc 3.2
Acknowledgments
Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
Thanks to Erich Elsen for contributing an implementation of find_if
Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
Thanks to Cliff Woolley for help with testing
#######################################
# Thrust v1.2.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 3.1
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48.
#######################################
# Thrust v1.2.0 #
#######################################
Summary
Thrust v1.2 introduces support for compilation to multicore CPUs
and the Ocelot virtual machine, and several new facilities for
pseudo-random number generation. New algorithms such as set
intersection and segmented reduction have also been added. Lastly,
improvements to the robustness of the CUDA backend ensure
correctness across a broad set of (uncommon) use cases.
Breaking API Changes
thrust::gather's interface was incorrect and has been removed from the top-level thrust namespace.
The old interface is deprecated but will be preserved for Thrust
version 1.2 at thrust::deprecated::gather &
thrust::deprecated::gather_if. The new interface is provided at
thrust::next::gather & thrust::next::gather_if. The new interface
will be promoted to thrust:: in Thrust version 1.3. For more details,
please refer to this thread:
http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd
The thrust::sorting namespace has been deprecated in favor of the
top-level sorting functions, such as thrust::sort() and
thrust::sort_by_key().
New Features
Functions
reduce_by_key
set_intersection
tie
unique_copy
unique_by_key
unique_copy_by_key
Types
Random Number Generation
discard_block_engine
default_random_engine
linear_congruential_engine
linear_feedback_shift_engine
minstd_rand
minstd_rand0
normal_distribution (experimental)
ranlux24
ranlux48
ranlux24_base
ranlux48_base
subtract_with_carry_engine
taus88
uniform_int_distribution
uniform_real_distribution
xor_combine_engine
Functionals
project1st
project2nd
Fancy Iterators
permutation_iterator
reverse_iterator
Device support
Add support for multicore CPUs via OpenMP
Add support for Fermi-class GPUs
Add support for Ocelot virtual machine
New Examples
cpp_integration
histogram
mode
monte_carlo
monte_carlo_disjoint_sequences
padded_grid_reduction
permutation_iterator
row_sum
run_length_encoding
segmented_scan
stream_compaction
summary_statistics
transform_iterator
word_count
Other Enhancements
vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
integer sorting performance is improved when max is large but (max - min) is small and when min is negative
performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
support for nvcc 3.0
Removed Functionality
removed support for equal between host & device sequences
removed support for gather() and scatter() between host & device sequences
Bug Fixes
# 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
# 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
# 46 gather & scatter handle any space iterators correctly
# 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
# 52 avoid collisions with common user macros such as BLOCK_SIZE
# 62 provide better documentation for device_reference
# 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
# 102 eliminated a race condition in device_vector::erase
various compilation warnings eliminated
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
Acknowledgments
Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
Thanks to Tom Bradley for contributing an implementation of normal_distribution
Thanks to Joseph Rhoads for contributing the example summary_statistics
#######################################
# Thrust v1.1.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.
#######################################
# Thrust v1.1.0 #
#######################################
Summary
Thrust v1.1 introduces fancy iterators, binary search functions, and
several specialized reduction functions. Experimental support for
segmented scan has also been added.
Breaking API Changes
counting_iterator has been moved into the thrust namespace (previously thrust::experimental)
New Features
Functions
copy_if
lower_bound
upper_bound
vectorized lower_bound
vectorized upper_bound
equal_range
binary_search
vectorized binary_search
all_of
any_of
none_of
minmax_element
advance
inclusive_segmented_scan (experimental)
exclusive_segmented_scan (experimental)
Types
pair
tuple
device_malloc_allocator
Fancy Iterators
constant_iterator
counting_iterator
transform_iterator
zip_iterator
New Examples
computing the maximum absolute difference between vectors
computing the bounding box of a two-dimensional point set
sorting multiple arrays together (lexicographical sorting)
constructing a summed area table
using zip_iterator to mimic an array of structs
using constant_iterator to increment array values
Other Enhancements
added pinned memory allocator (experimental)
added more methods to host_vector & device_vector (issue #4)
added variant of remove_if with a stencil argument (issue #29)
scan and reduce use cudaFuncGetAttributes to determine grid size
exceptions are reported when temporary device arrays cannot be allocated
Bug Fixes
#5 make vector work for larger data types
#9 stable_partition_copy doesn't respect OutputIterator concept semantics
#10 scans should return OutputIterator
#16 make algorithms work for larger data types
#27 dispatch radix_sort even when comp=less<T> is explicitly provided
Known Issues
Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1
uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device.
inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver
#######################################
# Thrust v1.0.0 #
#######################################
Breaking API changes
Rename top level namespace komrade to thrust.
Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface.
Rename range() to sequence() to avoid collision with Boost.Range.
Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if().
New Features
Add C++0x style cbegin() & cend() methods to host_vector & device_vector.
Add transform_if function.
Add stencil versions of replace_if() & replace_copy_if().
Allow counting_iterator to work with for_each().
Allow types with constructors in comparison sort & reduce.
Other Enhancements
merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device.
Bug fixes
Workaround an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6)
Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)