####################################### # Thrust v1.7.0 # ####################################### Summary Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well stencil versions of partitioning algorithms. thrust::tabulate has been introduced to tabulate the values of functions taking integers. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation provides 80% faster performance. Breaking API Changes Dispatch Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch. See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples. thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized. Iterators iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated. iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor). iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade). iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access). All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible. Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type. Other normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution). Placeholder expressions may no longer include the comma operator. New Features Execution Policies Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution. The following execution policies are supported in this version: thrust::host thrust::device thrust::cpp::par thrust::cuda::par thrust::omp::par thrust::tbb::par Algorithms free get_temporary_buffer malloc merge_by_key partition with stencil partition_copy with stencil return_temporary_buffer set_difference_by_key set_intersection_by_key set_symmetric_difference_by_key set_union_by_key stable_partition with stencil stable_partition_copy with stencil tabulate New Examples uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector. Other Enhancements Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter. Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device. THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend. CUDA merge performance is 2-15x faster. CUDA comparison sort performance is 1.3-4x faster. CUDA set operation performance is 1.5-15x faster. TBB reduce_by_key performance is 80% faster. Several algorithms have been parallelized with TBB. Support for user allocators in vectors has been improved. The sparse_vector example is now implemented with merge_by_key instead of sort_by_key. Warnings have been eliminated in various contexts. Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts. Documentation about algorithm requirements have been improved. Simplified the minimal_custom_backend example. Simplified the cuda/custom_temporary_allocation example. Simplified the cuda/fallback_allocator example. Bug Fixes #248 fix broken counting_iterator<float> behavior with OpenMP #231, #209 fix set operation failures with CUDA #187 fix incorrect occupancy calculation with CUDA #153 fix broken multigpu behavior with CUDA #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010 #208 correctly initialize elements in temporary storage when necessary #16 fix compilation error when sorting bool with CUDA #10 fix ambiguous overloads of reinterpret_tag Known Issues g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation. Acknowledgments Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA. Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA. Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm. ####################################### # Thrust v1.6.0 # ####################################### Summary Thrust v1.6.0 provides an interface for customization and extension and a new backend system based on the Threading Building Blocks library. With this new interface, programmers may customize the behavior of specific algorithms as well as control the allocation of temporary storage or invent entirely new backends. These enhancements also allow multiple different backend systems such as CUDA and OpenMP to coexist within a single program. Support for TBB allows Thrust programs to integrate more naturally into applications which may already employ the TBB task scheduler. Breaking API Changes The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h> thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP thrust::host_space_tag has been renamed thrust::host_system_tag thrust::device_space_tag has been renamed thrust::device_system_tag thrust::any_space_tag has been renamed thrust::any_system_tag thrust::iterator_space has been renamed thrust::iterator_system New Features Backend Systems Threading Building Blocks (TBB) is now supported Functions for_each_n raw_reference_cast Types pointer reference New Examples cuda/custom_temporary_allocation cuda/fallback_allocator device_ptr expand minimal_custom_backend raw_reference_cast set_operations Other Enhancements thrust::for_each now returns the end of the input range similar to most other algorithms thrust::pair and thrust::tuple have swap functionality all CUDA algorithms now support large data types iterators may be dereferenced in user __device__ or __global__ functions the safe use of different backend systems is now possible within a single binary Bug Fixes #469 min_element and max_element algorithms no longer require a const comparison operator Known Issues cudafe++.exe may crash when parsing TBB headers on Windows. ####################################### # Thrust v1.5.3 # ####################################### Summary Small bug fixes Bug Fixes Avoid warnings about potential race due to __shared__ non-POD variable ####################################### # Thrust v1.5.2 # ####################################### Summary Small bug fixes Bug Fixes Fixed warning about C-style initialization of structures ####################################### # Thrust v1.5.1 # ####################################### Summary Small bug fixes Bug Fixes Sorting data referenced by permutation_iterators on CUDA produces invalid results ####################################### # Thrust v1.5.0 # ####################################### Summary Thrust v1.5.0 provides introduces new programmer productivity and performance enhancements. New functionality for creating anonymous "lambda" functions has been added. A faster host sort provides 2-10x faster performance for sorting arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides 2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting arithmetic types with the OpenMP backend the combined performance improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster performance. Breaking API Changes device_ptr<void> no longer unsafely converts to device_ptr<T> without an explicit cast. Use the expression device_pointer_cast(static_cast<int*>(void_ptr.get())) to convert, for example, device_ptr<void> to device_ptr<int>. New Features Functions stencil-less transform_if Types lambda placeholders New Examples lambda Other Enhancements host sort is 2-10x faster for arithmetic types OMP sort provides speedup over host sort reduce_by_key is 2-3x faster reduce_by_key no longer requires O(N) temporary storage CUDA scan algorithms are 10-40% faster host_vector and device_vector are now documented out-of-memory exceptions now provide detailed information from CUDART improved histogram example device_reference now has a specialized swap reduce_by_key and scan algorithms are compatible with discard_iterator Removed Functionality Bug Fixes #44 allow host_vector to compile when value_type uses __align__ #198 allow adjacent_difference to permit safe in-situ operation #303 make thrust thread-safe #313 avoid race conditions in device_vector::insert #314 avoid unintended adl invocation when dispatching copy #365 fix merge and set operation failures Known Issues None Acknowledgments Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived. Thanks to Jean-Francois Bastien for suggesting a fix for issue 303. ####################################### # Thrust v1.4.0 # ####################################### Summary Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature and performance improvements. New set theoretic algorithms operating on sorted sequences have been added. Additionally, a new fancy iterator allows discarding redundant or otherwise unnecessary output from algorithms, conserving memory storage and bandwidth. Breaking API Changes Eliminations thrust/is_sorted.h thrust/utility.h thrust/set_intersection.h thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein thrust::deprecated::copy_when thrust::deprecated::absolute_value New Features Functions copy_n merge set_difference set_symmetric_difference set_union Types discard_iterator Device support Compute Capability 2.1 GPUs New Examples run_length_decoding Other Enhancements Compilation warnings are substantially reduced in various contexts. The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key, and thrust::stable_sort_by_key are substantially reduced. A fast sort implementation is used when sorting primitive types with thrust::greater. The performance of thrust::set_intersection is improved. The performance of thrust::fill is improved on SM 1.x devices. A code example is now provided in each algorithm's documentation. thrust::reverse now operates in-place Removed Functionality thrust::deprecated::copy_when thrust::deprecated::absolute_value thrust::experimental::cuda::ogl_interop_allocator thrust::gather and thrust::scatter from host to device and vice versa are no longer supported. Operations which modify the elements of a thrust::device_vector are no longer available from source code compiled without nvcc when the device backend is CUDA. Instead, use the idiom from the cpp_interop example. Bug Fixes #212 set_intersection works correctly for large input sizes. #275 counting_iterator and constant_iterator work correctly with OpenMP as the backend when compiling with optimization #256 min and max correctly return their first argument as a tie-breaker #248 NDEBUG is interpreted correctly Known Issues nvcc may generate code containing warnings when compiling some Thrust algorithms. When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue benign pointer advisories. When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly. thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator. Acknowledgments Thanks to David Tarjan for improving the performance of set_intersection. Thanks to Duane Merrill for continued help with sort. Thanks to Nathan Whitehead for help with CUDA Toolkit integration. ####################################### # Thrust v1.3.0 # ####################################### Summary Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature and performance enhancements. Performance of the sort and sort_by_key algorithms is improved by as much as 3x in certain situations. The performance of stream compaction algorithms, such as copy_if, is improved by as much as 2x. Reduction performance is also improved, particularly for small input sizes. CUDA errors are now converted to runtime exceptions using the system_error interface. Combined with a debug mode, also new in v1.3, runtime errors can be located with greater precision. Lastly, a few header files have been consolidated or renamed for clarity. See the deprecations section below for additional details. Breaking API Changes Promotions thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface thrust::next::gather has been renamed thrust::gather thrust::next::gather_if has been renamed thrust::gather_if thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy Deprecations thrust::copy_when has been renamed thrust::deprecated::copy_when thrust::absolute_value has been renamed thrust::deprecated::absolute_value The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead The header thrust/utility.h is now deprecated; use thrust/swap.h instead The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead Eliminations thrust::deprecated::gather thrust::deprecated::gather_if thrust/experimental/arch.h and the functions therein thrust/sorting/merge_sort.h thrust/sorting/radix_sort.h New Features Functions exclusive_scan_by_key find find_if find_if_not inclusive_scan_by_key is_partitioned is_sorted_until mismatch partition_point reverse reverse_copy stable_partition_copy Types system_error and related types experimental::cuda::ogl_interop_allocator bit_and, bit_or, and bit_xor Device support gf104-based GPUs New Examples opengl_interop.cu repeated_range.cu simple_moving_average.cu sparse_vector.cu strided_range.cu Other Enhancements Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types Performance of thrust::copy_if is substantially improved Performance of thrust::reduce and related reductions is improved THRUST_DEBUG mode added Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error The number of compiler warnings generated by Thrust has been substantially reduced Comparison sort now works correctly for input sizes > 32M min & max usage no longer collides with <windows.h> definitions Compiling against the OpenMP backend no longer requires nvcc Performance of device_vector initialized in .cpp files is substantially improved in common cases Performance of thrust::sort_by_key on the host is substantially improved Removed Functionality nvcc 2.3 is no longer supported Bug Fixes Debug device code now compiles correctly thrust::uninitialized_copy and thrust::unintialized_fill now dispatch constructors on the device rather than the host Known Issues #212 set_intersection is known to fail for large input sizes partition_point is known to fail for 64b types with nvcc 3.2 Acknowledgments Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation Thanks to Erich Elsen for contributing an implementation of find_if Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports Thanks to Cliff Woolley for help with testing ####################################### # Thrust v1.2.1 # ####################################### Summary Small fixes for compatibility with CUDA 3.1 Known Issues inclusive_scan & exclusive_scan may fail with very large types the Microsoft compiler may fail to compile code using both sort and binary search algorithms uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads default_random_engine::discard is not accelerated with nvcc 2.3 nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48. ####################################### # Thrust v1.2.0 # ####################################### Summary Thrust v1.2 introduces support for compilation to multicore CPUs and the Ocelot virtual machine, and several new facilities for pseudo-random number generation. New algorithms such as set intersection and segmented reduction have also been added. Lastly, improvements to the robustness of the CUDA backend ensure correctness across a broad set of (uncommon) use cases. Breaking API Changes thrust::gather's interface was incorrect and has been removed. The old interface is deprecated but will be preserved for Thrust version 1.2 at thrust::deprecated::gather & thrust::deprecated::gather_if. The new interface is provided at thrust::next::gather & thrust::next::gather_if. The new interface will be promoted to thrust:: in Thrust version 1.3. For more details, please refer to this thread: http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd The thrust::sorting namespace has been deprecated in favor of the top-level sorting functions, such as thrust::sort() and thrust::sort_by_key(). New Features Functions reduce_by_key set_intersection tie unique_copy unique_by_key unique_copy_by_key Types Random Number Generation discard_block_engine default_random_engine linear_congruential_engine linear_feedback_shift_engine minstd_rand minstd_rand0 normal_distribution (experimental) ranlux24 ranlux48 ranlux24_base ranlux48_base subtract_with_carry_engine taus88 uniform_int_distribution uniform_real_distribution xor_combine_engine Functionals project1st project2nd Fancy Iterators permutation_iterator reverse_iterator Device support Add support for multicore CPUs via OpenMP Add support for Fermi-class GPUs Add support for Ocelot virtual machine New Examples cpp_integration histogram mode monte_carlo monte_carlo_disjoint_sequences padded_grid_reduction permutation_iterator row_sum run_length_encoding segmented_scan stream_compaction summary_statistics transform_iterator word_count Other Enhancements vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit integer sorting performance is improved when max is large but (max - min) is small and when min is negative performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types support for nvcc 3.0 Removed Functionality removed support for equal between host & device sequences removed support for gather() and scatter() between host & device sequences Bug Fixes # 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time # 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms # 46 gather & scatter handle any space iterators correctly # 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs # 52 avoid collisions with common user macros such as BLOCK_SIZE # 62 provide better documentation for device_reference # 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode # 102 eliminated a race condition in device_vector::erase various compilation warnings eliminated Known Issues inclusive_scan & exclusive_scan may fail with very large types the Microsoft compiler may fail to compile code using both sort and binary search algorithms uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads default_random_engine::discard is not accelerated with nvcc 2.3 Acknowledgments Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot Thanks to Tom Bradley for contributing an implementation of normal_distribution Thanks to Joseph Rhoads for contributing the example summary_statistics ####################################### # Thrust v1.1.1 # ####################################### Summary Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard. ####################################### # Thrust v1.1.0 # ####################################### Summary Thrust v1.1 introduces fancy iterators, binary search functions, and several specialized reduction functions. Experimental support for segmented scan has also been added. Breaking API Changes counting_iterator has been moved into the thrust namespace (previously thrust::experimental) New Features Functions copy_if lower_bound upper_bound vectorized lower_bound vectorized upper_bound equal_range binary_search vectorized binary_search all_of any_of none_of minmax_element advance inclusive_segmented_scan (experimental) exclusive_segmented_scan (experimental) Types pair tuple device_malloc_allocator Fancy Iterators constant_iterator counting_iterator transform_iterator zip_iterator New Examples computing the maximum absolute difference between vectors computing the bounding box of a two-dimensional point set sorting multiple arrays together (lexicographical sorting) constructing a summed area table using zip_iterator to mimic an array of structs using constant_iterator to increment array values Other Enhancements added pinned memory allocator (experimental) added more methods to host_vector & device_vector (issue #4) added variant of remove_if with a stencil argument (issue #29) scan and reduce use cudaFuncGetAttributes to determine grid size exceptions are reported when temporary device arrays cannot be allocated Bug Fixes #5 make vector work for larger data types #9 stable_partition_copy doesn't respect OutputIterator concept semantics #10 scans should return OutputIterator #16 make algorithms work for larger data types #27 dispatch radix_sort even when comp=less<T> is explicitly provided Known Issues Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1 uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device. inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver ####################################### # Thrust v1.0.0 # ####################################### Breaking API changes Rename top level namespace komrade to thrust. Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface. Rename range() to sequence() to avoid collision with Boost.Range. Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if(). New Features Add C++0x style cbegin() & cend() methods to host_vector & device_vector. Add transform_if function. Add stencil versions of replace_if() & replace_copy_if(). Allow counting_iterator to work with for_each(). Allow types with constructors in comparison sort & reduce. Other Enhancements merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device. Bug fixes Workaround an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6) Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)