Haskell: Nearest Neighbour classification algorithm

The following code is from a university assignment of mine to write a classification algorithm (using nearest neighbour) to classify whether or not a given feature set (each feature is the frequency of words in an email) is spam or not.

We are given a CSV file (the training data) with the frequencies, and an integer (1 or 0) at the end of each row indicating whether or not it is spam. So in this form:

X1,X2,...,Xn,SPAM

The test data is also in this form (including the SPAM column, so we can verify the accuracy).

My question is: how can I make this code more idiomatic (I'm a Haskell newbie), and what speedups can I make? For example, is there a way to write getMostCommon without having to groupBy and then run a maximumBy on the result?

import Text.CSV
import Data.List

data SpamType = Spam | NotSpam
    deriving (Show, Eq, Ord)

type FeatureSet = [Float]

toSpam :: Int -> SpamType
toSpam 0 = NotSpam
toSpam 1 = Spam
toSpam _ = NotSpam  -- anything else is treated as not spam

parseClassifiedRecord :: Record -> (FeatureSet, SpamType)
parseClassifiedRecord x = (init converted, toSpam (truncate (last converted)))
    where
        converted = map (\val -> read val :: Float) x

-- returns the euclidean distance between two feature sets
difference :: FeatureSet -> FeatureSet -> Float
difference first second = sqrt (sum (zipWith (\x y -> (x - y)^2) first second))

-- finds the SpamType of the k nearest 'nodes' in the training set
findKNearest :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> [SpamType]
findKNearest trainingSet toMatch k =
    take k (map snd (sortBy (\x y -> compare (fst x) (fst y))
        [(difference (fst x) toMatch, snd x) | x <- trainingSet]))

-- returns the item which occurs most often in the list
-- (the list is sorted first, because groupBy only groups adjacent equal elements)
getMostCommon :: (Ord a) => [a] -> a
getMostCommon list = head (maximumBy (\x y -> compare (length x) (length y)) (groupBy (\x y -> x == y) (sort list)))

-- given a feature set, returns an ordered (i.e. same order as input)
-- list of whether or not feature is spam or not spam
-- looks at the closest k neighbours
classify :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> SpamType
classify trainingSet toClassify k = getMostCommon (findKNearest trainingSet toClassify k)

-- gives a value for the accuracy of expected vs actual for spam classification
-- i.e. num classified correctly / num total
accuracy :: [(SpamType, SpamType)] -> Float
accuracy classifications = fromIntegral correct / fromIntegral total
    where
        correct = length (filter (\(expected, actual) -> expected == actual) classifications)
        total = length classifications

main = do 
    packed <- parseCSVFromFile "spam-train.csv"
    packedtest <- parseCSVFromFile "spam-test.csv"
    let (Right trainingSet) = packed
    let (Right testSet) = packedtest
    let classifiedTrainingSet = map parseClassifiedRecord (tail (init trainingSet))
    let unclassified = map parseClassifiedRecord (tail (init testSet))
    let classified = map (\x -> (classify classifiedTrainingSet (fst x) 1, snd x)) unclassified
    putStrLn (show (accuracy classified))
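One direction I considered for getMostCommon is to count occurrences in a single pass with a Map instead of grouping a sorted list. A rough standalone sketch (not part of the program above; it needs an Ord instance on the element type, since Map keys must be ordered):

```haskell
import qualified Data.Map.Strict as Map
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Build a frequency map in one pass, then pick the key with the largest count.
mostCommon :: Ord a => [a] -> a
mostCommon xs =
    fst (maximumBy (comparing snd) (Map.toList counts))
  where
    counts = Map.fromListWith (+) [(x, 1 :: Int) | x <- xs]

main :: IO ()
main = print (mostCommon "abracadabra")  -- 'a' is the most frequent character
```

Like the groupBy version, this is partial on an empty list; the counting pass is O(n log n) in the number of elements but avoids building the intermediate groups.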