from numpy import genfromtxt, linalg, array, append, hstack, vstack

#Euclidean distance function
def euclidean(v1, v2):
    dist = linalg.norm(v1 - v2)
    return dist

#get the .csv files and eliminate heading and unused columns from test
BMUs = genfromtxt('BMU3.csv', delimiter=',')
data = genfromtxt('test.csv', delimiter=',')
data = data[1:, :-2]

i = 0
for obj in data:
    D = 0
    for BMU in BMUs:
        Dist = append(euclidean(obj, BMU[: -2]), BMU[-2:])
    D = hstack(Dist)

Map = vstack(D)

#iteration counter
i += 1
if not i % 1000:
    print (i, ' of ', len(data))

print (Map)

What I would like to do is:

  1. Take an object from data
  2. Calculate its distance from a BMU (euclidean(obj, BMU[: -2]))
  3. Append the last two items of the BMU array to the distance
  4. Create a 2d matrix that contains, for one data object, the distances plus the last two items of every BMU (D = hstack(Dist))
  5. Create an array of those matrices with length equal to the number of objects in data (Map = vstack(D))

The problem here, or at least what I think is the problem, is that hstack and vstack expect a tuple of arrays as input, not a single array. I'm trying to use them the way I use list.append() for lists. I'm a beginner and have no idea how to do it differently.

Any help would be awesome, thank you in advance :)

2 Answers

First a usage note:

Instead of:

from numpy import genfromtxt, linalg, array, append, hstack, vstack

use

import numpy as np
....
data = np.genfromtxt(....)
....
     np.hstack...

Secondly, stay away from np.append. It is too easy to misuse. Use np.concatenate so you get the full flavor of what it is doing.
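
A small sketch of one way np.append gets misused: without an axis argument it flattens both inputs, while np.concatenate makes you state the axis. The arrays here are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # made-up (2, 2) array
row = np.array([[5, 6]])         # made-up (1, 2) row to add

# No axis given: both inputs are flattened before joining.
flat = np.append(a, row)

# Explicit axis: the 2-D shape is preserved and a row is added.
stacked = np.concatenate((a, row), axis=0)

print(flat.shape)     # (6,)
print(stacked.shape)  # (3, 2)
```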

List append is better for incremental work:

alist = []
for ....
    alist.append(....)
arr = np.array(alist)
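
Applied to the steps listed in the question, the list-append pattern might look like the sketch below. The small random arrays are stand-ins I made up (BMUs with 3 feature columns plus 2 trailing columns, data with 3 columns); the real files are not available here.

```python
import numpy as np

def euclidean(v1, v2):
    return np.linalg.norm(v1 - v2)

# Made-up stand-ins for the CSV contents.
rng = np.random.default_rng(0)
BMUs = rng.random((4, 5))   # 4 BMUs: 3 features + 2 trailing columns
data = rng.random((5, 3))   # 5 data objects: 3 features

rows_per_object = []
for obj in data:
    rows = []
    for bmu in BMUs:
        # One row: the distance followed by the last two items of the BMU.
        rows.append(np.concatenate(([euclidean(obj, bmu[:-2])], bmu[-2:])))
    rows_per_object.append(rows)

# Build the array once, after the loops have finished.
Map = np.array(rows_per_object)
print(Map.shape)  # (5, 4, 3)
```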

==================

Without sample arrays (or at least shapes) I'm guessing. But (n,2) arrays sound reasonable. Taking the distance of each pair of 'points' from each other, I can collect the values in a nested list comprehension:

In [121]: data = np.arange(6).reshape(3,2)
In [122]: [[euclidean(d,b) for b in data] for d in data]
Out[122]: 
[[0.0, 2.8284271247461903, 5.6568542494923806],
 [2.8284271247461903, 0.0, 2.8284271247461903],
 [5.6568542494923806, 2.8284271247461903, 0.0]]

and make that an array:

In [123]: np.array([[euclidean(d,b) for b in data] for d in data])
Out[123]: 
array([[ 0.        ,  2.82842712,  5.65685425],
       [ 2.82842712,  0.        ,  2.82842712],
       [ 5.65685425,  2.82842712,  0.        ]])

The equivalent with nested loops:

alist = []
for d in data:
    sublist=[]
    for b in data:
        sublist.append(euclidean(d,b))
    alist.append(sublist)
arr = np.array(alist)

There are ways of doing this without loops, but let's make sure the basic Python looping approach works first.

===============

If I want the difference (along the last axis) between every element (row) in data and every element in bmu (or here data), I can use array broadcasting. The result is a (3,3,2) array:

In [130]: data[None,:,:]-data[:,None,:]
Out[130]: 
array([[[ 0,  0],
        [ 2,  2],
        [ 4,  4]],

       [[-2, -2],
        [ 0,  0],
        [ 2,  2]],

       [[-4, -4],
        [-2, -2],
        [ 0,  0]]])

np.linalg.norm can handle higher-dimensional arrays and takes an axis parameter:

In [132]: np.linalg.norm(data[None,:,:]-data[:,None,:],axis=-1)
Out[132]: 
array([[ 0.        ,  2.82842712,  5.65685425],
       [ 2.82842712,  0.        ,  2.82842712],
       [ 5.65685425,  2.82842712,  0.        ]])
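
The same broadcasting idea can be sketched for two different arrays, with the trailing BMU columns attached to each distance. The shapes below are made up and much smaller than the real ones; the layout (features first, two extra columns last) is taken from the question's BMU[:-2] and BMU[-2:] slicing.

```python
import numpy as np

rng = np.random.default_rng(0)
BMUs = rng.random((4, 5))   # made-up: 3 features + 2 trailing columns
data = rng.random((6, 3))   # made-up: 3 features per object

# (6, 1, 3) - (1, 4, 3) broadcasts to (6, 4, 3); the norm over the
# last axis leaves a (6, 4) matrix of distances.
dist = np.linalg.norm(data[:, None, :] - BMUs[None, :, :-2], axis=-1)

# Repeat the trailing BMU columns for every data object: (6, 4, 2).
extra = np.broadcast_to(BMUs[None, :, -2:], (len(data), len(BMUs), 2))

# Distance first, then the two trailing items: (6, 4, 3).
Map = np.concatenate((dist[:, :, None], extra), axis=-1)
print(Map.shape)  # (6, 4, 3)
```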

4 Comments

Thank you very much, will be waiting for your advice :)
What's the shape (and dtype) for BMU and data? It's easier to replicate and test your code with samples. Otherwise I have to guess and make up sample arrays (like data=np.arange(24).reshape(12,2)).
BMUs.shape is (243, 7); data.shape is (19219, 5)
Type: they are both numpy arrays

Thanks to your help, I managed to implement the pseudocode. Here is the final program:

import numpy as np


def euclidean(v1, v2):
    dist = np.linalg.norm(v1 - v2)
    return dist


def makeKNN(dataSet, BMUSet, k, fileOut, test=False):
    # take input files
    BMUs = np.genfromtxt(BMUSet, delimiter=',')
    data = np.genfromtxt(dataSet, delimiter=',')

    final = data[1:, :]
    if not test:
        data = data[1:, :]
    else:
        data = data[1:, :-2]

    # Calculate all the distances between data and BMUs, then reorder the BMUs by distance

    dist = np.array([[euclidean(d, b[:-2]) for b in BMUs] for d in data])
    BMU_K = np.array([BMUs[np.argsort(d)] for d in dist])

    # mean over the closest k BMUs
    Z = np.array([[np.sum(b[:k].T[5]) / k] for b in BMU_K])

    # error propagation
    Z_err = np.array([[np.sqrt(np.sum(np.power(b[:k].T[5], 2)))] for b in BMU_K])

    # Adding z estimates and errors to the data
    final = np.concatenate((final, Z, Z_err), axis=1)

    # print output file
    np.savetxt(fileOut, final, delimiter=',')
    print('So long, and thanks for all the fish')
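
As a possible simplification, the sort-and-average step could be done without per-row list comprehensions. This is only a sketch on made-up inputs: dist stands in for the distance matrix, and z for the BMU column that the program reads with .T[5].

```python
import numpy as np

rng = np.random.default_rng(0)
dist = rng.random((6, 4))   # made-up distances: 6 objects x 4 BMUs
z = rng.random(4)           # made-up value per BMU
k = 2

# Indices of the k closest BMUs for each object, shape (6, k).
nearest = np.argsort(dist, axis=1)[:, :k]

# Mean of the selected values per object, kept as a column: (6, 1).
Z = z[nearest].mean(axis=1, keepdims=True)
print(Z.shape)  # (6, 1)
```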

Thank you very much and I hope that this code will help someone else in the future :)
