Intrinsics Guide

Technologies
MMX
SSE
SSE2
SSE3
SSSE3
SSE4.1
SSE4.2
AVX
AVX2
FMA
AVX-512
KNC
SVML
Other

Categories
Application-Targeted
Arithmetic
Bit Manipulation
Cast
Compare
Convert
Cryptography
Elementary Math Functions
General Support
Logical
Miscellaneous
Move
OS-Targeted
Probability/Statistics
Random
Set
Shift
Special Math Functions
Store
String Compare
Swizzle
Trigonometry

Legal Statement
The Intel Intrinsics Guide is an interactive reference tool for Intel intrinsic instructions, which are C-style functions that provide access to many Intel instructions, including Intel® SSE, AVX, AVX-512, and more, without the need to write assembly code.

vp4dpwssd
__m512i _mm512_4dpwssd_epi32 (__m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)

# Synopsis

__m512i _mm512_4dpwssd_epi32 (__m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
#include <immintrin.h>
Instruction: vp4dpwssd zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4VNNIW

# Description

Compute 4 sequential operand source-block dot-products of two signed 16-bit element operands with 32-bit element accumulation, and store the results in dst.

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := src[i+31:i]
    FOR m := 0 to 3
        lim_base := m*32
        tl := b[lim_base+15:lim_base]
        tu := b[lim_base+31:lim_base+16]
        lword := a{m}[i+15:i] * tl
        uword := a{m}[i+31:i+16] * tu
        dst[i+31:i] := dst[i+31:i] + lword + uword
    ENDFOR
ENDFOR
dst[MAX:512] := 0
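AVX512_4VNNIW is available only on Knights Mill parts, so the per-lane semantics are easiest to check against a plain-C model. The sketch below models one 32-bit destination lane; `dpwssd4_lane` is a hypothetical helper name, not part of any Intel header.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of vp4dpwssd: accumulate four
 * word-pair dot products (one per source register a0..a3) into src.
 * a[m] holds the two 16-bit words of register a{m} for this lane;
 * b holds the eight 16-bit words of the m128 memory operand. */
static int32_t dpwssd4_lane(int32_t src, const int16_t a[4][2],
                            const int16_t b[8])
{
    int32_t acc = src;
    for (int m = 0; m < 4; m++) {
        acc += (int32_t)a[m][0] * b[2 * m]      /* lword */
             + (int32_t)a[m][1] * b[2 * m + 1]; /* uword */
    }
    return acc;
}
```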
vp4dpwssd
__m512i _mm512_mask_4dpwssd_epi32 (__m512i src, __mmask16 k, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)

# Synopsis

__m512i _mm512_mask_4dpwssd_epi32 (__m512i src, __mmask16 k, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
#include <immintrin.h>
Instruction: vp4dpwssd zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4VNNIW

# Description

Compute 4 sequential operand source-block dot-products of two signed 16-bit element operands with 32-bit element accumulation with mask, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF mask[j]
        dst[i+31:i] := src[i+31:i]
        FOR m := 0 to 3
            lim_base := m*32
            tl := b[lim_base+15:lim_base]
            tu := b[lim_base+31:lim_base+16]
            lword := a{m}[i+15:i] * tl
            uword := a{m}[i+31:i+16] * tu
            dst[i+31:i] := dst[i+31:i] + lword + uword
        ENDFOR
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0
vp4dpwssd
__m512i _mm512_maskz_4dpwssd_epi32 (__mmask16 k, __m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
vp4dpwssds
__m512i _mm512_4dpwssds_epi32 (__m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)

# Synopsis

__m512i _mm512_4dpwssds_epi32 (__m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
#include <immintrin.h>
Instruction: vp4dpwssds zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4VNNIW

# Description

Compute 4 sequential operand source-block dot-products of two signed 16-bit element operands with 32-bit element accumulation and signed saturation, and store the results in dst.

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := src[i+31:i]
    FOR m := 0 to 3
        lim_base := m*32
        tl := b[lim_base+15:lim_base]
        tu := b[lim_base+31:lim_base+16]
        lword := a{m}[i+15:i] * tl
        uword := a{m}[i+31:i+16] * tu
        dst[i+31:i] := SIGNED_DWORD_SATURATE(dst[i+31:i] + lword + uword)
    ENDFOR
ENDFOR
dst[MAX:512] := 0
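The only difference from vp4dpwssd is the SIGNED_DWORD_SATURATE step applied to each accumulation. A minimal C model of that primitive (hypothetical helper name):

```c
#include <stdint.h>

/* SIGNED_DWORD_SATURATE as used by the vp4dpwssds pseudocode:
 * clamp a wide intermediate sum to the signed 32-bit range. */
static int32_t signed_dword_saturate(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}
```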
vp4dpwssds
__m512i _mm512_mask_4dpwssds_epi32 (__m512i src, __mmask16 k, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)

# Synopsis

__m512i _mm512_mask_4dpwssds_epi32 (__m512i src, __mmask16 k, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
#include <immintrin.h>
Instruction: vp4dpwssds zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4VNNIW

# Description

Compute 4 sequential operand source-block dot-products of two signed 16-bit element operands with 32-bit element accumulation and signed saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF mask[j]
        dst[i+31:i] := src[i+31:i]
        FOR m := 0 to 3
            lim_base := m*32
            tl := b[lim_base+15:lim_base]
            tu := b[lim_base+31:lim_base+16]
            lword := a{m}[i+15:i] * tl
            uword := a{m}[i+31:i+16] * tu
            dst[i+31:i] := SIGNED_DWORD_SATURATE(dst[i+31:i] + lword + uword)
        ENDFOR
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0
vp4dpwssds
__m512i _mm512_maskz_4dpwssds_epi32 (__mmask16 k, __m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)

# Synopsis

__m512i _mm512_maskz_4dpwssds_epi32 (__mmask16 k, __m512i src, __m512i a0, __m512i a1, __m512i a2, __m512i a3, __m128i * b)
#include <immintrin.h>
Instruction: vp4dpwssds zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4VNNIW

# Description

Compute 4 sequential operand source-block dot-products of two signed 16-bit element operands with 32-bit element accumulation and signed saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF mask[j]
        dst[i+31:i] := src[i+31:i]
        FOR m := 0 to 3
            lim_base := m*32
            tl := b[lim_base+15:lim_base]
            tu := b[lim_base+31:lim_base+16]
            lword := a{m}[i+15:i] * tl
            uword := a{m}[i+31:i+16] * tu
            dst[i+31:i] := SIGNED_DWORD_SATURATE(dst[i+31:i] + lword + uword)
        ENDFOR
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_4fmadd_ps (__m512 a, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_4fmadd_ps (__m512 a, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating with the corresponding elements in a. Store the results in dst.

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] + b{m}[i+31:i] * c[n+31:n])
    ENDFOR
ENDFOR
dst[MAX:512] := 0
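A scalar model of one v4fmaddps lane (`fmadd4_lane` is a hypothetical helper; the hardware chains fused multiply-adds, whereas this C sketch rounds after every multiply and add):

```c
/* One 32-bit lane of v4fmaddps: chain four multiply-accumulates from
 * b0..b3 and the four floats of the m128 operand c into the lane of a. */
static float fmadd4_lane(float a, const float b[4], const float c[4])
{
    for (int m = 0; m < 4; m++)
        a += b[m] * c[m];
    return a;
}
```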
__m512 _mm512_mask_4fmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_mask_4fmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating with the corresponding elements in a. Store the results in dst using writemask k (elements are copied from a when the corresponding mask bit is not set).

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        IF mask[j]
            dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] + b{m}[i+31:i] * c[n+31:n])
        FI
    ENDFOR
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_maskz_4fmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_maskz_4fmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating with the corresponding elements in a. Store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        IF mask[j]
            dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] + b{m}[i+31:i] * c[n+31:n])
        ELSE
            dst[i+31:i] := 0
        FI
    ENDFOR
ENDFOR
dst[MAX:512] := 0
__m128 _mm_4fmadd_ss (__m128 a, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_4fmadd_ss (__m128 a, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating with the lower element in a. Store the result in the lower element of dst.

# Operation

dst := a
FOR j := 0 to 3
    i := j*32
    dst[31:0] := RoundFPControl_MXCSR(dst[31:0] + b{j}[31:0] * c[i+31:i])
ENDFOR
dst[MAX:32] := 0
__m128 _mm_mask_4fmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_mask_4fmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating with the lower element in a. Store the result in the lower element of dst using writemask k (the element is copied from a when mask bit 0 is not set).

# Operation

dst := a
IF k[0]
    FOR j := 0 to 3
        i := j*32
        dst[31:0] := RoundFPControl_MXCSR(dst[31:0] + b{j}[31:0] * c[i+31:i])
    ENDFOR
FI
dst[MAX:32] := 0
__m128 _mm_maskz_4fmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_maskz_4fmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating with the lower element in a. Store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set).

# Operation

dst := a
IF k[0]
    FOR j := 0 to 3
        i := j*32
        dst[31:0] := RoundFPControl_MXCSR(dst[31:0] + b{j}[31:0] * c[i+31:i])
    ENDFOR
ELSE
    dst[31:0] := 0
FI
dst[MAX:32] := 0
__m512 _mm512_4fnmadd_ps (__m512 a, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_4fnmadd_ps (__m512 a, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating the negated intermediate result with the corresponding elements in a. Store the results in dst.

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] - b{m}[i+31:i] * c[n+31:n])
    ENDFOR
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_mask_4fnmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_mask_4fnmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating the negated intermediate result with the corresponding elements in a. Store the results in dst using writemask k (elements are copied from a when the corresponding mask bit is not set).

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        IF mask[j]
            dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] - b{m}[i+31:i] * c[n+31:n])
        FI
    ENDFOR
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_maskz_4fnmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)

# Synopsis

__m512 _mm512_maskz_4fnmadd_ps (__m512 a, __mmask16 k, __m512 b0, __m512 b1, __m512 b2, __m512 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddps zmm {k}, zmm+3, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply packed single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by the 4 corresponding packed elements in c, accumulating the negated intermediate result with the corresponding elements in a. Store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

dst := a
FOR m := 0 to 3
    FOR j := 0 to 15
        i := j*32
        n := m*32
        IF mask[j]
            dst[i+31:i] := RoundFPControl_MXCSR(dst[i+31:i] - b{m}[i+31:i] * c[n+31:n])
        ELSE
            dst[i+31:i] := 0
        FI
    ENDFOR
ENDFOR
dst[MAX:512] := 0
__m128 _mm_4fnmadd_ss (__m128 a, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_4fnmadd_ss (__m128 a, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating the negated intermediate result with the lower element in a. Store the result in the lower element of dst.

# Operation

dst := a
FOR j := 0 to 3
    i := j*32
    dst[31:0] := RoundFPControl_MXCSR(dst[31:0] - b{j}[31:0] * c[i+31:i])
ENDFOR
dst[MAX:32] := 0
__m128 _mm_mask_4fnmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_mask_4fnmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating the negated intermediate result with the lower element in a. Store the result in the lower element of dst using writemask k (the element is copied from a when mask bit 0 is not set).

# Operation

dst := a
IF k[0]
    FOR j := 0 to 3
        i := j*32
        dst[31:0] := RoundFPControl_MXCSR(dst[31:0] - b{j}[31:0] * c[i+31:i])
    ENDFOR
FI
dst[MAX:32] := 0
__m128 _mm_maskz_4fnmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)

# Synopsis

__m128 _mm_maskz_4fnmadd_ss (__m128 a, __mmask8 k, __m128 b0, __m128 b1, __m128 b2, __m128 b3, __m128 * c)
#include <immintrin.h>
Instruction: v4fnmaddss xmm {k}, xmm, m128
CPUID Flags: AVX512_4FMAPS

# Description

Multiply the lower single-precision (32-bit) floating-point elements specified in 4 consecutive operands b0 through b3 by corresponding element in c, accumulating the negated intermediate result with the lower element in a. Store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set).

# Operation

dst := a
IF k[0]
    FOR j := 0 to 3
        i := j*32
        dst[31:0] := RoundFPControl_MXCSR(dst[31:0] - b{j}[31:0] * c[i+31:i])
    ENDFOR
ELSE
    dst[31:0] := 0
FI
dst[MAX:32] := 0
pabsw
__m128i _mm_abs_epi16 (__m128i a)

# Synopsis

__m128i _mm_abs_epi16 (__m128i a)
#include <tmmintrin.h>
Instruction: pabsw xmm, xmm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 7
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake      | 1       | 0.5
Haswell      | 1       | 0.5
Ivy Bridge   | 1       | 0.5
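A scalar model of one pabsw element (`pabsw_elem` is a hypothetical helper name). Note the wrap-around corner case: the absolute value of -32768 does not fit in a signed 16-bit result, so the instruction leaves that element as 0x8000.

```c
#include <stdint.h>

/* One element of pabsw: absolute value with 16-bit wrap-around,
 * so ABS(INT16_MIN) stays 0x8000. */
static uint16_t pabsw_elem(int16_t x)
{
    uint16_t u = (uint16_t)x;
    return (x < 0) ? (uint16_t)(0u - u) : u;
}
```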
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:128] := 0
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
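The writemask and zeromask variants above differ only in what happens to a masked-off element, a pattern that repeats across this entire guide. A scalar C sketch of the two policies for one 16-bit element (hypothetical helper names):

```c
#include <stdint.h>

static int16_t abs16(int16_t x) { return (int16_t)(x < 0 ? -x : x); }

/* Writemask policy: a masked-off element is copied from src. */
static int16_t mask_abs_elem(int16_t src, uint16_t k, int j, int16_t a)
{
    return ((k >> j) & 1) ? abs16(a) : src;
}

/* Zeromask policy: a masked-off element is zeroed. */
static int16_t maskz_abs_elem(uint16_t k, int j, int16_t a)
{
    return ((k >> j) & 1) ? abs16(a) : 0;
}
```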
vpabsw
__m256i _mm256_abs_epi16 (__m256i a)

# Synopsis

__m256i _mm256_abs_epi16 (__m256i a)
#include <immintrin.h>
Instruction: vpabsw ymm, ymm
CPUID Flags: AVX2

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:256] := 0
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
vpabsw
__m512i _mm512_abs_epi16 (__m512i a)

# Synopsis

__m512i _mm512_abs_epi16 (__m512i a)
#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 31
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR
dst[MAX:512] := 0
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:512] := 0
vpabsw

# Synopsis

#include <immintrin.h>
Instruction: vpabsw
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*16
    IF k[j]
        dst[i+15:i] := ABS(a[i+15:i])
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
pabsd
__m128i _mm_abs_epi32 (__m128i a)

# Synopsis

__m128i _mm_abs_epi32 (__m128i a)
#include <tmmintrin.h>
Instruction: pabsd xmm, xmm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake      | 1       | 0.5
Haswell      | 1       | 0.5
Ivy Bridge   | 1       | 0.5
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:128] := 0
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
vpabsd
__m256i _mm256_abs_epi32 (__m256i a)

# Synopsis

__m256i _mm256_abs_epi32 (__m256i a)
#include <immintrin.h>
Instruction: vpabsd ymm, ymm
CPUID Flags: AVX2

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:256] := 0
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:256] := 0
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
vpabsd
__m512i _mm512_abs_epi32 (__m512i a)

# Synopsis

__m512i _mm512_abs_epi32 (__m512i a)
#include <immintrin.h>
Instruction: vpabsd zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR
dst[MAX:512] := 0
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0
vpabsd

# Synopsis

#include <immintrin.h>
Instruction: vpabsd zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := ABS(a[i+31:i])
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
vpabsq
__m128i _mm_abs_epi64 (__m128i a)

# Synopsis

__m128i _mm_abs_epi64 (__m128i a)
#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 1
    i := j*64
    dst[i+63:i] := ABS(a[i+63:i])
ENDFOR
dst[MAX:128] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:128] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
vpabsq
__m256i _mm256_abs_epi64 (__m256i a)

# Synopsis

__m256i _mm256_abs_epi64 (__m256i a)
#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 3
    i := j*64
    dst[i+63:i] := ABS(a[i+63:i])
ENDFOR
dst[MAX:256] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:256] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq
CPUID Flags: AVX512VL + AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
vpabsq
__m512i _mm512_abs_epi64 (__m512i a)

# Synopsis

__m512i _mm512_abs_epi64 (__m512i a)
#include <immintrin.h>
Instruction: vpabsq zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 7
    i := j*64
    dst[i+63:i] := ABS(a[i+63:i])
ENDFOR
dst[MAX:512] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
vpabsq

# Synopsis

#include <immintrin.h>
Instruction: vpabsq zmm {k}, zmm
CPUID Flags: AVX512F

# Description

Compute the absolute value of packed 64-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(a[i+63:i])
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
pabsb
__m128i _mm_abs_epi8 (__m128i a)

# Synopsis

__m128i _mm_abs_epi8 (__m128i a)
#include <tmmintrin.h>
Instruction: pabsb xmm, xmm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 15
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake      | 1       | 0.5
Haswell      | 1       | 0.5
Ivy Bridge   | 1       | 0.5
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:128] := 0
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
vpabsb
__m256i _mm256_abs_epi8 (__m256i a)

# Synopsis

__m256i _mm256_abs_epi8 (__m256i a)
#include <immintrin.h>
Instruction: vpabsb ymm, ymm
CPUID Flags: AVX2

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:256] := 0
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512VL + AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
vpabsb
__m512i _mm512_abs_epi8 (__m512i a)

# Synopsis

__m512i _mm512_abs_epi8 (__m512i a)
#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 63
    i := j*8
    dst[i+7:i] := ABS(a[i+7:i])
ENDFOR
dst[MAX:512] := 0
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:512] := 0
vpabsb

# Synopsis

#include <immintrin.h>
Instruction: vpabsb
CPUID Flags: AVX512BW

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63
    i := j*8
    IF k[j]
        dst[i+7:i] := ABS(a[i+7:i])
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
vpandq
__m512d _mm512_abs_pd (__m512d v2)

# Synopsis

__m512d _mm512_abs_pd (__m512d v2)
#include <immintrin.h>
Instruction: vpandq zmm {k}, zmm, m512
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Finds the absolute value of each packed double-precision (64-bit) floating-point element in v2, storing the results in dst.

# Operation

FOR j := 0 to 7
    i := j*64
    dst[i+63:i] := ABS(v2[i+63:i])
ENDFOR
dst[MAX:512] := 0
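The instruction listed is vpandq because taking the absolute value of an IEEE-754 double amounts to clearing its sign bit with a bitwise AND. A scalar C model of one element (`abs_pd_elem` is a hypothetical helper name):

```c
#include <stdint.h>
#include <string.h>

/* One element of _mm512_abs_pd: AND the 64-bit encoding with
 * 0x7FFFFFFFFFFFFFFF to clear the sign bit, which is what vpandq
 * does across the whole vector. */
static double abs_pd_elem(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```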
vpandq

# Synopsis

#include <immintrin.h>
Instruction: vpandq zmm {k}, zmm, m512
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Finds the absolute value of each packed double-precision (64-bit) floating-point element in v2, storing the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := ABS(v2[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
pabsw
__m64 _mm_abs_pi16 (__m64 a)

# Synopsis

__m64 _mm_abs_pi16 (__m64 a)
#include <tmmintrin.h>
Instruction: pabsw mm, mm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 16-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 3
    i := j*16
    dst[i+15:i] := ABS(a[i+15:i])
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake      | 1       | 0.5
Haswell      | 1       | 0.5
Ivy Bridge   | 1       | 0.5
pabsd
__m64 _mm_abs_pi32 (__m64 a)

# Synopsis

__m64 _mm_abs_pi32 (__m64 a)
#include <tmmintrin.h>
Instruction: pabsd mm, mm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 32-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 1
    i := j*32
    dst[i+31:i] := ABS(a[i+31:i])
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake      | 1       | 0.5
Haswell      | 1       | 0.5
Ivy Bridge   | 1       | 0.5
pabsb
__m64 _mm_abs_pi8 (__m64 a)

# Synopsis

__m64 _mm_abs_pi8 (__m64 a)
#include <tmmintrin.h>
Instruction: pabsb mm, mm
CPUID Flags: SSSE3

# Description

Compute the absolute value of packed 8-bit integers in a, and store the unsigned results in dst.

# Operation

FOR j := 0 to 7 i := j*8 dst[i+7:i] := ABS(a[i+7:i]) ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.5              |
| Haswell      | 1       | 0.5              |
| Ivy Bridge   | 1       | 0.5              |
vpandd
__m512 _mm512_abs_ps (__m512 v2)

# Synopsis

__m512 _mm512_abs_ps (__m512 v2)
#include <immintrin.h>
Instruction: vpandd zmm {k}, zmm, m512
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Finds the absolute value of each packed single-precision (32-bit) floating-point element in v2, storing the results in dst.

# Operation

FOR j := 0 to 15 i := j*32 dst[i+31:i] := ABS(v2[i+31:i]) ENDFOR dst[MAX:512] := 0
vpandd

# Synopsis

#include <immintrin.h>
Instruction: vpandd zmm {k}, zmm, m512
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Finds the absolute value of each packed single-precision (32-bit) floating-point element in v2, storing the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := ABS(v2[i+31:i]) ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:512] := 0
...
__m128d _mm_acos_pd (__m128d a)

# Synopsis

__m128d _mm_acos_pd (__m128d a)
#include <immintrin.h>
CPUID Flags: SSE

# Description

Compute the inverse cosine of packed double-precision (64-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ACOS(a[i+63:i]) ENDFOR dst[MAX:128] := 0
...
__m256d _mm256_acos_pd (__m256d a)

# Synopsis

__m256d _mm256_acos_pd (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Compute the inverse cosine of packed double-precision (64-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ACOS(a[i+63:i]) ENDFOR dst[MAX:256] := 0
...
__m512d _mm512_acos_pd (__m512d a)

# Synopsis

__m512d _mm512_acos_pd (__m512d a)
#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse cosine of packed double-precision (64-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*64 dst[i+63:i] := ACOS(a[i+63:i]) ENDFOR dst[MAX:512] := 0
...

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse cosine of packed double-precision (64-bit) floating-point elements in a expressed in radians, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := ACOS(a[i+63:i]) ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:512] := 0
...
__m128 _mm_acos_ps (__m128 a)

# Synopsis

__m128 _mm_acos_ps (__m128 a)
#include <immintrin.h>
CPUID Flags: SSE

# Description

Compute the inverse cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ACOS(a[i+31:i]) ENDFOR dst[MAX:128] := 0
...
__m256 _mm256_acos_ps (__m256 a)

# Synopsis

__m256 _mm256_acos_ps (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Compute the inverse cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ACOS(a[i+31:i]) ENDFOR dst[MAX:256] := 0
...
__m512 _mm512_acos_ps (__m512 a)

# Synopsis

__m512 _mm512_acos_ps (__m512 a)
#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*32 dst[i+31:i] := ACOS(a[i+31:i]) ENDFOR dst[MAX:512] := 0
...

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := ACOS(a[i+31:i]) ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:512] := 0
...
__m128d _mm_acosh_pd (__m128d a)

# Synopsis

__m128d _mm_acosh_pd (__m128d a)
#include <immintrin.h>
CPUID Flags: SSE

# Description

Compute the inverse hyperbolic cosine of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := ACOSH(a[i+63:i]) ENDFOR dst[MAX:128] := 0
...
__m256d _mm256_acosh_pd (__m256d a)

# Synopsis

__m256d _mm256_acosh_pd (__m256d a)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Compute the inverse hyperbolic cosine of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := ACOSH(a[i+63:i]) ENDFOR dst[MAX:256] := 0
...
__m512d _mm512_acosh_pd (__m512d a)

# Synopsis

__m512d _mm512_acosh_pd (__m512d a)
#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse hyperbolic cosine of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*64 dst[i+63:i] := ACOSH(a[i+63:i]) ENDFOR dst[MAX:512] := 0
...

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse hyperbolic cosine of packed double-precision (64-bit) floating-point elements in a, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := ACOSH(a[i+63:i]) ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:512] := 0
...
__m128 _mm_acosh_ps (__m128 a)

# Synopsis

__m128 _mm_acosh_ps (__m128 a)
#include <immintrin.h>
CPUID Flags: SSE

# Description

Compute the inverse hyperbolic cosine of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := ACOSH(a[i+31:i]) ENDFOR dst[MAX:128] := 0
...
__m256 _mm256_acosh_ps (__m256 a)

# Synopsis

__m256 _mm256_acosh_ps (__m256 a)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Compute the inverse hyperbolic cosine of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := ACOSH(a[i+31:i]) ENDFOR dst[MAX:256] := 0
...
__m512 _mm512_acosh_ps (__m512 a)

# Synopsis

__m512 _mm512_acosh_ps (__m512 a)
#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse hyperbolic cosine of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*32 dst[i+31:i] := ACOSH(a[i+31:i]) ENDFOR dst[MAX:512] := 0
...

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F

# Description

Compute the inverse hyperbolic cosine of packed single-precision (32-bit) floating-point elements in a, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := ACOSH(a[i+31:i]) ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpadcd zmm {k}, k, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition of packed 32-bit integers in v2 and v3 and the corresponding bit in k2, storing the result of the addition in dst and the result of the carry in k2_res.

# Operation

FOR j := 0 to 15 i := j*32 k2_res[j] := Carry(v2[i+31:i] + v3[i+31:i] + k2[j]) dst[i+31:i] := v2[i+31:i] + v3[i+31:i] + k2[j] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpadcd zmm {k}, k, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition of packed 32-bit integers in v2 and v3 and the corresponding bit in k2, storing the result of the addition in dst and the result of the carry in k2_res using writemask k1 (elements are copied from v2 when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k1[j] k2_res[j] := Carry(v2[i+31:i] + v3[i+31:i] + k2[j]) dst[i+31:i] := v2[i+31:i] + v3[i+31:i] + k2[j] ELSE dst[i+31:i] := v2[i+31:i] FI ENDFOR dst[MAX:512] := 0
__m128i _mm_add_epi16 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_add_epi16 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 16-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*16 dst[i+15:i] := a[i+15:i] + b[i+15:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |
| Ivy Bridge   | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := src[i+15:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256i _mm256_add_epi16 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_add_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 16-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*16 dst[i+15:i] := a[i+15:i] + b[i+15:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := src[i+15:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512i _mm512_add_epi16 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_add_epi16 (__m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 31 i := j*16 dst[i+15:i] := a[i+15:i] + b[i+15:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := src[i+15:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31 i := j*16 IF k[j] dst[i+15:i] := a[i+15:i] + b[i+15:i] ELSE dst[i+15:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m128i _mm_add_epi32 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_add_epi32 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 32-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |
| Ivy Bridge   | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 32-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 32-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256i _mm256_add_epi32 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_add_epi32 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 32-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 32-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 32-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512i _mm512_add_epi32 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_add_epi32 (__m512i a, __m512i b)
#include <immintrin.h>
Instruction: vpaddd zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed 32-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpaddd zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed 32-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpaddd zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed 32-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m128i _mm_add_epi64 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_add_epi64 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 64-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |
| Ivy Bridge   | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256i _mm256_add_epi64 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_add_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 64-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512i _mm512_add_epi64 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_add_epi64 (__m512i a, __m512i b)
#include <immintrin.h>
Instruction: vpaddq zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpaddq zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vpaddq zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed 64-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m128i _mm_add_epi8 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_add_epi8 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 8-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |
| Ivy Bridge   | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := src[i+7:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256i _mm256_add_epi8 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 8-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 31 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 1       | 0.33             |
| Haswell      | 1       | 0.5              |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := src[i+7:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512i _mm512_add_epi8 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_add_epi8 (__m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 63 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := src[i+7:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63 i := j*8 IF k[j] dst[i+7:i] := a[i+7:i] + b[i+7:i] ELSE dst[i+7:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m128d _mm_add_pd (__m128d a, __m128d b)

# Synopsis

__m128d _mm_add_pd (__m128d a, __m128d b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 1 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 1 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256d _mm256_add_pd (__m256d a, __m256d b)

# Synopsis

__m256d _mm256_add_pd (__m256d a, __m256d b)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512d _mm512_add_pd (__m512d a, __m512d b)

# Synopsis

__m512d _mm512_add_pd (__m512d a, __m512d b)
#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*64 dst[i+63:i] := a[i+63:i] + b[i+63:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := src[i+63:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*64 IF k[j] dst[i+63:i] := a[i+63:i] + b[i+63:i] ELSE dst[i+63:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m64 _mm_add_pi16 (__m64 a, __m64 b)

# Synopsis

__m64 _mm_add_pi16 (__m64 a, __m64 b)
#include <mmintrin.h>
CPUID Flags: MMX

# Description

Add packed 16-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*16 dst[i+15:i] := a[i+15:i] + b[i+15:i] ENDFOR
__m64 _mm_add_pi32 (__m64 a, __m64 b)

# Synopsis

__m64 _mm_add_pi32 (__m64 a, __m64 b)
#include <mmintrin.h>
CPUID Flags: MMX

# Description

Add packed 32-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 1 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR
__m64 _mm_add_pi8 (__m64 a, __m64 b)

# Synopsis

__m64 _mm_add_pi8 (__m64 a, __m64 b)
#include <mmintrin.h>
CPUID Flags: MMX

# Description

Add packed 8-bit integers in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*8 dst[i+7:i] := a[i+7:i] + b[i+7:i] ENDFOR
__m128 _mm_add_ps (__m128 a, __m128 b)

# Synopsis

__m128 _mm_add_ps (__m128 a, __m128 b)
#include <xmmintrin.h>
CPUID Flags: SSE

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 3 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:128] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 3 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:128] := 0
__m256 _mm256_add_ps (__m256 a, __m256 b)

# Synopsis

__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
CPUID Flags: AVX

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 7 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:256] := 0

# Performance

| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:256] := 0

# Synopsis

#include <immintrin.h>
CPUID Flags: AVX512F + AVX512VL

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:256] := 0
__m512 _mm512_add_ps (__m512 a, __m512 b)

# Synopsis

__m512 _mm512_add_ps (__m512 a, __m512 b)
#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

# Operation

FOR j := 0 to 15 i := j*32 dst[i+31:i] := a[i+31:i] + b[i+31:i] ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR dst[MAX:512] := 0

# Synopsis

#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm
CPUID Flags: AVX512F

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15 i := j*32 IF k[j] dst[i+31:i] := a[i+31:i] + b[i+31:i] ELSE dst[i+31:i] := 0 FI ENDFOR dst[MAX:512] := 0
__m512d _mm512_add_round_pd (__m512d a, __m512d b, int rounding)

# Synopsis

__m512d _mm512_add_round_pd (__m512d a, __m512d b, int rounding)
#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                        // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 7
    i := j*64
    dst[i+63:i] := a[i+63:i] + b[i+63:i]
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512d _mm512_mask_add_round_pd (__m512d src, __mmask8 k, __m512d a, __m512d b, int rounding)
#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := a[i+63:i] + b[i+63:i]
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512d _mm512_maskz_add_round_pd (__mmask8 k, __m512d a, __m512d b, int rounding)
#include <immintrin.h>
Instruction: vaddpd zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F

# Description

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := a[i+63:i] + b[i+63:i]
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_add_round_ps (__m512 a, __m512 b, int rounding)

# Synopsis

__m512 _mm512_add_round_ps (__m512 a, __m512 b, int rounding)
#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512 _mm512_mask_add_round_ps (__m512 src, __mmask16 k, __m512 a, __m512 b, int rounding)
#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F for AVX-512, KNCNI for KNC

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := a[i+31:i] + b[i+31:i]
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512 _mm512_maskz_add_round_ps (__mmask16 k, __m512 a, __m512 b, int rounding)
#include <immintrin.h>
Instruction: vaddps zmm {k}, zmm, zmm {er}
CPUID Flags: AVX512F

# Description

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := a[i+31:i] + b[i+31:i]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
__m128d _mm_add_round_sd (__m128d a, __m128d b, int rounding)

# Synopsis

__m128d _mm_add_round_sd (__m128d a, __m128d b, int rounding)
#include <immintrin.h>
Instruction: vaddsd xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

dst[63:0] := a[63:0] + b[63:0]
dst[127:64] := a[127:64]
dst[MAX:128] := 0

# Synopsis

__m128d _mm_mask_add_round_sd (__m128d src, __mmask8 k, __m128d a, __m128d b, int rounding)
#include <immintrin.h>
Instruction: vaddsd xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst using writemask k (the element is copied from src when mask bit 0 is not set), and copy the upper element from a to the upper element of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

IF k[0]
    dst[63:0] := a[63:0] + b[63:0]
ELSE
    dst[63:0] := src[63:0]
FI
dst[127:64] := a[127:64]
dst[MAX:128] := 0

# Synopsis

__m128d _mm_maskz_add_round_sd (__mmask8 k, __m128d a, __m128d b, int rounding)
#include <immintrin.h>
Instruction: vaddsd xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set), and copy the upper element from a to the upper element of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

IF k[0]
    dst[63:0] := a[63:0] + b[63:0]
ELSE
    dst[63:0] := 0
FI
dst[127:64] := a[127:64]
dst[MAX:128] := 0
__m128 _mm_add_round_ss (__m128 a, __m128 b, int rounding)

# Synopsis

__m128 _mm_add_round_ss (__m128 a, __m128 b, int rounding)
#include <immintrin.h>
Instruction: vaddss xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

dst[31:0] := a[31:0] + b[31:0]
dst[127:32] := a[127:32]
dst[MAX:128] := 0

# Synopsis

__m128 _mm_mask_add_round_ss (__m128 src, __mmask8 k, __m128 a, __m128 b, int rounding)
#include <immintrin.h>
Instruction: vaddss xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst using writemask k (the element is copied from src when mask bit 0 is not set), and copy the upper 3 packed elements from a to the upper elements of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

IF k[0]
    dst[31:0] := a[31:0] + b[31:0]
ELSE
    dst[31:0] := src[31:0]
FI
dst[127:32] := a[127:32]
dst[MAX:128] := 0

# Synopsis

__m128 _mm_maskz_add_round_ss (__mmask8 k, __m128 a, __m128 b, int rounding)
#include <immintrin.h>
Instruction: vaddss xmm {k}, xmm, xmm {er}
CPUID Flags: AVX512F

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set), and copy the upper 3 packed elements from a to the upper elements of dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

IF k[0]
    dst[31:0] := a[31:0] + b[31:0]
ELSE
    dst[31:0] := 0
FI
dst[127:32] := a[127:32]
dst[MAX:128] := 0
__m128d _mm_add_sd (__m128d a, __m128d b)

# Synopsis

__m128d _mm_add_sd (__m128d a, __m128d b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst, and copy the upper element from a to the upper element of dst.

# Operation

dst[63:0] := a[63:0] + b[63:0]
dst[127:64] := a[127:64]

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 4 | 0.5
Haswell | 3 | 1
Ivy Bridge | 3 | 1

# Synopsis

__m128d _mm_mask_add_sd (__m128d src, __mmask8 k, __m128d a, __m128d b)
#include <immintrin.h>
Instruction: vaddsd xmm {k}, xmm, xmm
CPUID Flags: AVX512F

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst using writemask k (the element is copied from src when mask bit 0 is not set), and copy the upper element from a to the upper element of dst.

# Operation

IF k[0]
    dst[63:0] := a[63:0] + b[63:0]
ELSE
    dst[63:0] := src[63:0]
FI
dst[127:64] := a[127:64]
dst[MAX:128] := 0

# Synopsis

__m128d _mm_maskz_add_sd (__mmask8 k, __m128d a, __m128d b)
#include <immintrin.h>
Instruction: vaddsd xmm {k}, xmm, xmm
CPUID Flags: AVX512F

# Description

Add the lower double-precision (64-bit) floating-point element in a and b, store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set), and copy the upper element from a to the upper element of dst.

# Operation

IF k[0]
    dst[63:0] := a[63:0] + b[63:0]
ELSE
    dst[63:0] := 0
FI
dst[127:64] := a[127:64]
dst[MAX:128] := 0
__m64 _mm_add_si64 (__m64 a, __m64 b)

# Synopsis

__m64 _mm_add_si64 (__m64 a, __m64 b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add 64-bit integers a and b, and store the result in dst.

# Operation

dst[63:0] := a[63:0] + b[63:0]

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 0.33
Haswell | 1 | 0.5
Ivy Bridge | 1 | 0.5
__m128 _mm_add_ss (__m128 a, __m128 b)

# Synopsis

__m128 _mm_add_ss (__m128 a, __m128 b)
#include <xmmintrin.h>
CPUID Flags: SSE

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst.

# Operation

dst[31:0] := a[31:0] + b[31:0]
dst[127:32] := a[127:32]

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 4 | 0.5
Haswell | 3 | 1
Ivy Bridge | 3 | 1

# Synopsis

__m128 _mm_mask_add_ss (__m128 src, __mmask8 k, __m128 a, __m128 b)
#include <immintrin.h>
Instruction: vaddss xmm {k}, xmm, xmm
CPUID Flags: AVX512F

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst using writemask k (the element is copied from src when mask bit 0 is not set), and copy the upper 3 packed elements from a to the upper elements of dst.

# Operation

IF k[0]
    dst[31:0] := a[31:0] + b[31:0]
ELSE
    dst[31:0] := src[31:0]
FI
dst[127:32] := a[127:32]
dst[MAX:128] := 0

# Synopsis

__m128 _mm_maskz_add_ss (__mmask8 k, __m128 a, __m128 b)
#include <immintrin.h>
Instruction: vaddss xmm {k}, xmm, xmm
CPUID Flags: AVX512F

# Description

Add the lower single-precision (32-bit) floating-point element in a and b, store the result in the lower element of dst using zeromask k (the element is zeroed out when mask bit 0 is not set), and copy the upper 3 packed elements from a to the upper elements of dst.

# Operation

IF k[0]
    dst[31:0] := a[31:0] + b[31:0]
ELSE
    dst[31:0] := 0
FI
dst[127:32] := a[127:32]
dst[MAX:128] := 0
unsigned char _addcarry_u32 (unsigned char c_in, unsigned int a, unsigned int b, unsigned int * out)

# Synopsis

unsigned char _addcarry_u32 (unsigned char c_in, unsigned int a, unsigned int b, unsigned int * out)
#include <immintrin.h>

# Description

Add unsigned 32-bit integers a and b with unsigned 8-bit carry-in c_in (carry flag), and store the unsigned 32-bit result in out, and the carry-out in dst (carry or overflow flag).

# Operation

out[31:0] := a[31:0] + b[31:0] + c_in
dst := carry_out
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 * out)

# Synopsis

unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 * out)
#include <immintrin.h>

# Description

Add unsigned 64-bit integers a and b with unsigned 8-bit carry-in c_in (carry flag), and store the unsigned 64-bit result in out, and the carry-out in dst (carry or overflow flag).

# Operation

out[63:0] := a[63:0] + b[63:0] + c_in
dst := carry_out
unsigned char _addcarryx_u32 (unsigned char c_in, unsigned int a, unsigned int b, unsigned int * out)

# Synopsis

unsigned char _addcarryx_u32 (unsigned char c_in, unsigned int a, unsigned int b, unsigned int * out)
#include <immintrin.h>

# Description

Add unsigned 32-bit integers a and b with unsigned 8-bit carry-in c_in (carry or overflow flag), and store the unsigned 32-bit result in out, and the carry-out in dst (carry or overflow flag).

# Operation

out[31:0] := a[31:0] + b[31:0] + c_in
dst := carry_out

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 1
unsigned char _addcarryx_u64 (unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 * out)

# Synopsis

unsigned char _addcarryx_u64 (unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 * out)
#include <immintrin.h>

# Description

Add unsigned 64-bit integers a and b with unsigned 8-bit carry-in c_in (carry or overflow flag), and store the unsigned 64-bit result in out, and the carry-out in dst (carry or overflow flag).

# Operation

out[63:0] := a[63:0] + b[63:0] + c_in
dst := carry_out

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 1
__m512d _mm512_addn_pd (__m512d v2, __m512d v3)

# Synopsis

__m512d _mm512_addn_pd (__m512d v2, __m512d v3)
#include <immintrin.h>
Instruction: vaddnpd zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed double-precision (64-bit) floating-point elements in v2 and v3 and negates their sum, storing the results in dst.

# Operation

FOR j := 0 to 7
    i := j*64
    dst[i+63:i] := -(v2[i+63:i] + v3[i+63:i])
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512d _mm512_mask_addn_pd (__m512d src, __mmask8 k, __m512d v2, __m512d v3)
#include <immintrin.h>
Instruction: vaddnpd zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed double-precision (64-bit) floating-point elements in v2 and v3 and negates their sum, storing the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := -(v2[i+63:i] + v3[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_addn_ps (__m512 v2, __m512 v3)

# Synopsis

__m512 _mm512_addn_ps (__m512 v2, __m512 v3)
#include <immintrin.h>
Instruction: vaddnps zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed single-precision (32-bit) floating-point elements in v2 and v3 and negates their sum, storing the results in dst.

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := -(v2[i+31:i] + v3[i+31:i])
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512 _mm512_mask_addn_ps (__m512 src, __mmask16 k, __m512 v2, __m512 v3)
#include <immintrin.h>
Instruction: vaddnps zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed single-precision (32-bit) floating-point elements in v2 and v3 and negates their sum, storing the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := -(v2[i+31:i] + v3[i+31:i])
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0
__m512d _mm512_addn_round_pd (__m512d v2, __m512d v3, int rounding)

# Synopsis

__m512d _mm512_addn_round_pd (__m512d v2, __m512d v3, int rounding)
#include <immintrin.h>
Instruction: vaddnpd zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed double-precision (64-bit) floating-point elements in v2 and v3 and negates the sum, storing the result in dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 7
    i := j*64
    dst[i+63:i] := -(v2[i+63:i] + v3[i+63:i])
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512d _mm512_mask_addn_round_pd (__m512d src, __mmask8 k, __m512d v2, __m512d v3, int rounding)
#include <immintrin.h>
Instruction: vaddnpd zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed double-precision (64-bit) floating-point elements in v2 and v3 and negates the sum, storing the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := -(v2[i+63:i] + v3[i+63:i])
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
__m512 _mm512_addn_round_ps (__m512 v2, __m512 v3, int rounding)

# Synopsis

__m512 _mm512_addn_round_ps (__m512 v2, __m512 v3, int rounding)
#include <immintrin.h>
Instruction: vaddnps zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed single-precision (32-bit) floating-point elements in v2 and v3 and negates the sum, storing the result in dst.
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := -(v2[i+31:i] + v3[i+31:i])
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512 _mm512_mask_addn_round_ps (__m512 src, __mmask16 k, __m512 v2, __m512 v3, int rounding)
#include <immintrin.h>
Instruction: vaddnps zmm {k}, zmm, zmm
CPUID Flags: KNCNI

# Description

Performs element-by-element addition between packed single-precision (32-bit) floating-point elements in v2 and v3 and negates the sum, storing the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC)     // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC)     // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC)        // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION                       // use MXCSR.RC; see _MM_SET_ROUNDING_MODE

# Operation

FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := -(v2[i+31:i] + v3[i+31:i])
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
dst[MAX:512] := 0
__m128i _mm_adds_epi16 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_adds_epi16 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 7
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 0.5
Haswell | 1 | 0.5
Ivy Bridge | 1 | 0.5

# Synopsis

__m128i _mm_mask_adds_epi16 (__m128i src, __mmask8 k, __m128i a, __m128i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:128] := 0

# Synopsis

__m128i _mm_maskz_adds_epi16 (__mmask8 k, __m128i a, __m128i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 15
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:256] := 0

# Synopsis

__m256i _mm256_mask_adds_epi16 (__m256i src, __mmask16 k, __m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:256] := 0

# Synopsis

__m256i _mm256_maskz_adds_epi16 (__mmask16 k, __m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
__m512i _mm512_adds_epi16 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_adds_epi16 (__m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 31
    i := j*16
    dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512i _mm512_mask_adds_epi16 (__m512i src, __mmask32 k, __m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512i _mm512_maskz_adds_epi16 (__mmask32 k, __m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 16-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_Int16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
__m128i _mm_adds_epi8 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_adds_epi8 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 15
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 0.5
Haswell | 1 | 0.5
Ivy Bridge | 1 | 0.5

# Synopsis

__m128i _mm_mask_adds_epi8 (__m128i src, __mmask16 k, __m128i a, __m128i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:128] := 0

# Synopsis

__m128i _mm_maskz_adds_epi8 (__mmask16 k, __m128i a, __m128i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 15
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:128] := 0
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

# Synopsis

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX2

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 31
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:256] := 0

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 0.5
Haswell | 1 | 0.5

# Synopsis

__m256i _mm256_mask_adds_epi8 (__m256i src, __mmask32 k, __m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:256] := 0

# Synopsis

__m256i _mm256_maskz_adds_epi8 (__mmask32 k, __m256i a, __m256i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 31
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
__m512i _mm512_adds_epi8 (__m512i a, __m512i b)

# Synopsis

__m512i _mm512_adds_epi8 (__m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 63
    i := j*8
    dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512i _mm512_mask_adds_epi8 (__m512i src, __mmask64 k, __m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := src[i+7:i]
    FI
ENDFOR
dst[MAX:512] := 0

# Synopsis

__m512i _mm512_maskz_adds_epi8 (__mmask64 k, __m512i a, __m512i b)
#include <immintrin.h>
CPUID Flags: AVX512BW

# Description

Add packed 8-bit integers in a and b using saturation, and store the results in dst using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 63
    i := j*8
    IF k[j]
        dst[i+7:i] := Saturate_To_Int8( a[i+7:i] + b[i+7:i] )
    ELSE
        dst[i+7:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
__m128i _mm_adds_epu16 (__m128i a, __m128i b)

# Synopsis

__m128i _mm_adds_epu16 (__m128i a, __m128i b)
#include <emmintrin.h>
CPUID Flags: SSE2

# Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

# Operation

FOR j := 0 to 7
    i := j*16
    dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] )
ENDFOR

# Performance

Architecture | Latency | Throughput (CPI)
Skylake | 1 | 0.5
Haswell | 1 | 0.5
Ivy Bridge | 1 | 0.5

# Synopsis

__m128i _mm_mask_adds_epu16 (__m128i src, __mmask8 k, __m128i a, __m128i b)
#include <immintrin.h>
CPUID Flags: AVX512VL + AVX512BW

# Description

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

# Operation

FOR j := 0 to 7
    i := j*16
    IF k[j]
        dst[i+15:i] := Saturate_To_UnsignedInt16( a[i+15:i] + b[i+15:i] )
    ELSE
        dst[i+15:i] := src[i+15:i]
    FI
ENDFOR
dst[MAX:128] := 0
Data Version: 3.4.4 - Release Notes
Data Updated: 04/17/2019
