Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCTree cannot handle when the data consists of only one unique value #88

Open
kongwilson opened this issue Apr 27, 2021 · 3 comments
Open

Comments

@kongwilson
Copy link

kongwilson commented Apr 27, 2021

I ran into issues when a subset of my sample data points only contain ONE unique value. How should we handle such an exception?

The error message basically suggests a NaN value for probability (caused by division by zero). I tried to turn this into a uniform distribution, but it caused subsequent issue after a cut the right side contains no values. I think this violates the principle of the RRCF algo. Do we have better way of resolving such cases?

File "<ipython-input-2-b3a957a401e5>", line 139, in <listcomp>
    rrcf.RCTree(x[ix], index_labels=ix) for ix in ixs]
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 106, in __init__
    self._mktree(X, S, N, I, parent=self)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 177, in _mktree
    S1, S2, branch = self._cut(X, S, parent=parent, side=side)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rrcf-0.4.3-py3.8.egg\rrcf\rrcf.py", line 159, in _cut
    q = self.rng.choice(self.ndim, p=l)
  File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
@kongwilson
Copy link
Author

I have a walkaround for this scenario. If the dataset contains only one unique value, the tree can be build by creating an empty tree and insert 'new' data point into it iteratively.

@kongwilson
Copy link
Author

kongwilson commented May 7, 2021

The idea of the walkaround is if you reach a single value, instead of creating a tree with all the observations, you insert a point to a empty tree. An example code snippet would be like:

forest = []
while len(forest) < num_trees:
	# Select random subsets of points uniformly from point set
	ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
	# Add sampled trees to forest
	trees = []
	for ix in ixs:
		if len(np.unique(x[ix])) > 1:
			tree = rrcf.RCTree(x[ix], index_labels=ix)
		else:  # If a subset contains only one value, RCTree will bug
			tree = rrcf.RCTree()
			for point, label in zip(x[ix], ix):
				tree.insert_point(point=point, index=label)
		trees.append(tree)
	forest.extend(trees)

@kongwilson
Copy link
Author

kongwilson commented May 7, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant