float16

A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.

Features

Full IEEE 754-2008 compliance for 16-bit floating-point arithmetic
Complete special value support: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
Multiple rounding modes: nearest-even, toward zero, toward ±∞, nearest-away
Flexible conversion modes: IEEE standard, strict error handling, fast approximations
High-performance operations with optional fast math optimizations
Comprehensive test suite with extensive edge case coverage
Zero dependencies - pure Go implementation

Installation

go get github.com/zerfoo/float16

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float16"
)

func main() {
    // Create float16 values
    a := float16.FromFloat32(3.14159)
    b := float16.FromFloat64(2.71828)
    
    // Basic arithmetic
    sum := a.Add(b)
    product := a.Mul(b)
    
    // Convert back to other types
    fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
    fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
    
    // Work with special values
    inf := float16.Inf(1)  // positive infinity
    nan := float16.NaN()   // quiet NaN
    zero := float16.Zero() // positive zero
    
    fmt.Printf("Infinity: %v\n", inf)
    fmt.Printf("NaN: %v\n", nan)
    fmt.Printf("Zero: %v\n", zero)
}

Core Types and Constants

Float16 Type

The Float16 type represents a 16-bit IEEE 754 half-precision floating-point value:

type Float16 uint16

Special Values

const (
    PositiveZero     Float16 = 0x0000 // +0.0
    NegativeZero     Float16 = 0x8000 // -0.0
    PositiveInfinity Float16 = 0x7C00 // +∞
    NegativeInfinity Float16 = 0xFC00 // -∞
    MaxValue         Float16 = 0x7BFF // ~65504
    MinValue         Float16 = 0xFBFF // ~-65504
)

Conversion Functions

From Other Types

// From float32/float64
f16 := float16.FromFloat32(3.14159)
f16 := float16.FromFloat64(2.71828)

// From bit representation
f16 := float16.FromBits(0x4200) // 3.0

// From string
f16, err := float16.ParseFloat("3.14159", 32)

To Other Types

f32 := f16.ToFloat32()
f64 := f16.ToFloat64()
bits := f16.Bits()
str := f16.String()

Arithmetic Operations

a := float16.FromFloat32(5.0)
b := float16.FromFloat32(3.0)

// Basic arithmetic
sum := a.Add(b)        // 8.0
diff := a.Sub(b)       // 2.0
product := a.Mul(b)    // 15.0
quotient := a.Div(b)   // 1.666...

// Mathematical functions
sqrt := a.Sqrt()       // √5
abs := a.Abs()         // |a|
neg := a.Neg()         // -a

Rounding Modes

Configure rounding behavior for conversions:

import "github.com/zerfoo/float16"

// Set global rounding mode
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundTowardZero
float16.Configure(config)

// Available rounding modes:
// - RoundNearestEven (default)
// - RoundTowardZero
// - RoundTowardPositive  
// - RoundTowardNegative
// - RoundNearestAway

Conversion Modes

Control conversion behavior and error handling:

config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

// Available modes:
// - ModeIEEE: Standard IEEE 754 behavior
// - ModeStrict: Returns errors for overflow/underflow
// - ModeFast: Optimized for performance

Special Value Handling

f := float16.FromFloat32(math.Inf(1))

// Check value types
if f.IsInf(0) {
    fmt.Println("Value is infinity")
}
if f.IsNaN() {
    fmt.Println("Value is NaN")
}
if f.IsFinite() {
    fmt.Println("Value is finite")
}
if f.IsNormal() {
    fmt.Println("Value is normalized")
}
if f.IsSubnormal() {
    fmt.Println("Value is subnormal")
}

// IEEE 754 classification
class := f.Class()
switch class {
case float16.ClassPositiveInfinity:
    fmt.Println("Positive infinity")
case float16.ClassQuietNaN:
    fmt.Println("Quiet NaN")
// ... other classes
}

Performance Features

Fast Math Operations

// Enable fast math for better performance (may sacrifice precision)
config := float16.GetConfig()
config.EnableFastMath = true
float16.Configure(config)

// Use fast operations
result := float16.FastAdd(a, b)
result := float16.FastMul(a, b)

Vectorized Operations

// Vectorized operations (optimized for SIMD when available)
a := []float16.Float16{...}
b := []float16.Float16{...}

sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)

Error Handling

// Strict mode returns errors for exceptional conditions
config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

f16, err := float16.FromFloat32WithMode(1e10, float16.ModeStrict)
if err != nil {
    if float16Err, ok := err.(*float16.Float16Error); ok {
        switch float16Err.Code {
        case float16.ErrOverflow:
            fmt.Println("Value too large for float16")
        case float16.ErrUnderflow:
            fmt.Println("Value too small for float16")
        }
    }
}

Utilities

Statistics for Slices

values := []float16.Float16{
    float16.FromFloat32(1.0),
    float16.FromFloat32(2.0),
    float16.FromFloat32(3.0),
}

stats := float16.ComputeSliceStats(values)
fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)

Debugging and Monitoring

// Get memory usage
usage := float16.GetMemoryUsage()
fmt.Printf("Memory usage: %d bytes\n", usage)

// Get debug information
debug := float16.DebugInfo()
fmt.Printf("Debug info: %+v\n", debug)

Benchmarking

The package includes built-in benchmarking utilities:

ops := float16.GetBenchmarkOperations()
for name, op := range ops {
    // Benchmark operation
    fmt.Printf("Benchmarking %s\n", name)
}

Range and Precision

Float16 has the following characteristics:

Range: ±6.55×10⁴ (approximately ±65,504)
Precision: ~3-4 decimal digits
Smallest positive normal: ~6.10×10⁻⁵
Smallest positive subnormal: ~5.96×10⁻⁸
Machine epsilon: ~9.77×10⁻⁴

Use Cases

Float16 is ideal for:

Machine Learning: Reduced memory usage and faster training
Graphics Programming: Color values, texture coordinates
Scientific Computing: Large datasets where precision can be traded for memory
Embedded Systems: Memory-constrained environments
Data Compression: Storing floating-point data more efficiently

Performance Considerations

Conversions between float16 and float32/float64 have computational overhead
Native float16 arithmetic is generally faster than conversion-based approaches
Enable fast math mode for performance-critical applications where precision can be sacrificed
Use vectorized operations for bulk processing

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
arithmetic.go		arithmetic.go
arithmetic_mode_test.go		arithmetic_mode_test.go
arithmetic_test.go		arithmetic_test.go
bitpattern_test.go		bitpattern_test.go
convert.go		convert.go
float16.go		float16.go
float16_extra_test.go		float16_extra_test.go
float16_test.go		float16_test.go
float32_test.go		float32_test.go
go.mod		go.mod
math.go		math.go
math_test.go		math_test.go
mul_test.go		mul_test.go
multest4_test.go		multest4_test.go
multest_test.go		multest_test.go
slice_test.go		slice_test.go
types.go		types.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

float16

Features

Installation

Quick Start

Core Types and Constants

Float16 Type

Special Values

Conversion Functions

From Other Types

To Other Types

Arithmetic Operations

Rounding Modes

Conversion Modes

Special Value Handling

Performance Features

Fast Math Operations

Vectorized Operations

Error Handling

Utilities

Statistics for Slices

Debugging and Monitoring

Benchmarking

Range and Precision

Use Cases

Performance Considerations

Contributing

License

References

About

Uh oh!

Releases 1

Packages

Languages

License

zerfoo/float16

Folders and files

Latest commit

History

Repository files navigation

float16

Features

Installation

Quick Start

Core Types and Constants

Float16 Type

Special Values

Conversion Functions

From Other Types

To Other Types

Arithmetic Operations

Rounding Modes

Conversion Modes

Special Value Handling

Performance Features

Fast Math Operations

Vectorized Operations

Error Handling

Utilities

Statistics for Slices

Debugging and Monitoring

Benchmarking

Range and Precision

Use Cases

Performance Considerations

Contributing

License

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages