RCTree cannot handle when the data consists of only one unique value #88

kongwilson · 2021-04-27T00:21:54Z

I ran into issues when a subset of my sample data points only contain ONE unique value. How should we handle such an exception?

The error message basically suggests a NaN value for probability (caused by division by zero). I tried to turn this into a uniform distribution, but it caused subsequent issue after a cut the right side contains no values. I think this violates the principle of the RRCF algo. Do we have better way of resolving such cases?

File "<ipython-input-2-b3a957a401e5>", line 139, in <listcomp>
    rrcf.RCTree(x[ix], index_labels=ix) for ix in ixs]
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 106, in __init__
    self._mktree(X, S, N, I, parent=self)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 177, in _mktree
    S1, S2, branch = self._cut(X, S, parent=parent, side=side)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 159, in _cut
    q = self.rng.choice(self.ndim, p=l)
  File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN

The text was updated successfully, but these errors were encountered:

kongwilson · 2021-04-27T01:09:44Z

I have a walkaround for this scenario. If the dataset contains only one unique value, the tree can be build by creating an empty tree and insert 'new' data point into it iteratively.

kongwilson · 2021-05-07T01:56:11Z

The idea of the walkaround is if you reach a single value, instead of creating a tree with all the observations, you insert a point to a empty tree. An example code snippet would be like:

forest = []
while len(forest) < num_trees:
	# Select random subsets of points uniformly from point set
	ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
	# Add sampled trees to forest
	trees = []
	for ix in ixs:
		if len(np.unique(x[ix])) > 1:
			tree = rrcf.RCTree(x[ix], index_labels=ix)
		else:  # If a subset contains only one value, RCTree will bug
			tree = rrcf.RCTree()
			for point, label in zip(x[ix], ix):
				tree.insert_point(point=point, index=label)
		trees.append(tree)
	forest.extend(trees)

kongwilson · 2021-05-07T02:01:26Z

Hey buddy, I have posted the code for the walkaround. Feel free to check it out. Cheers W

…

在 2021年5月5日，10:31，titaii2 ***@***.***> 写道： I have a walkaround for this scenario. If the dataset contains only one unique value, the tree can be build by creating an empty tree and insert 'new' data point into it iteratively. Hello :) Can you post your solution code here ? Because I have the same problem when my data is populated by X1 at 98% and X2 at 2% which makes the sample (created with "ixs") only populated by X1 which cause the same error "probabilities contain NaN" while doing "l /= l.sum()" — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RCTree cannot handle when the data consists of only one unique value #88

RCTree cannot handle when the data consists of only one unique value #88

kongwilson commented Apr 27, 2021 •

edited

Loading

kongwilson commented Apr 27, 2021

kongwilson commented May 7, 2021 •

edited

Loading

kongwilson commented May 7, 2021 via email

RCTree cannot handle when the data consists of only one unique value #88

RCTree cannot handle when the data consists of only one unique value #88

Comments

kongwilson commented Apr 27, 2021 • edited Loading

kongwilson commented Apr 27, 2021

kongwilson commented May 7, 2021 • edited Loading

kongwilson commented May 7, 2021 via email

kongwilson commented Apr 27, 2021 •

edited

Loading

kongwilson commented May 7, 2021 •

edited

Loading